Estimating marginal likelihoods from the posterior draws through a geometric identity

Abstract This article develops a new estimator of the marginal likelihood that requires only a sample of the posterior distribution as the input from the analyst. This sample may come from any sampling scheme, such as Gibbs sampling or Metropolis–Hastings sampling. The presented approach can be implemented generically in almost any application of Bayesian modeling and significantly decreases the computational burden associated with marginal likelihood estimation compared to existing techniques. The functionality of this method is demonstrated in the context of probit and logit regressions, on two mixture-of-normals models, and on a high-dimensional random intercept probit model. Simulation results show that the simple approach presented here achieves excellent stability in low-dimensional models and clearly outperforms existing methods when the number of coefficients in the model increases.


Introduction
Bayesian model selection relies on the posterior probabilities of the H candidate models M_1, . . . , M_H conditional on the data [3,19]. Two major strategies exist for estimating the posterior probabilities p(M_h|y) for each of the h = 1, . . . , H candidate models. In strategy (a), the model indicators are considered part of the parameter space and are jointly drawn during MCMC sampling together with all other unobserved quantities [17]. The drawback of this strategy is the necessity of including all candidate models in the estimation simultaneously. If the analyst then wishes to consider an additional model that was not part of the initial candidate set, another simultaneous estimation needs to be carried out. This exacts a significant cost on the analyst in terms of computing time and effort. Strategy (b) estimates the marginal likelihood for each model, which allows for easy calculation of the posterior probabilities independently of the estimation of the other candidate models [19,27]. Despite this appealing characteristic, calculating the marginal likelihood is a non-trivial integration problem, and as such it is still associated with significant effort on the part of the analyst and potential imprecision in the case of high-dimensional or multi-level models. Comparative studies of existing estimation techniques for the marginal likelihood only provide clear evidence of precision for candidate models of lower dimensions, while Bayesian analysis frequently requires more complex models [2,12,14]. Regardless of which of these two methodological approaches is used to identify the model with the highest posterior probability, reporting the marginal likelihood of this model supports the comparability and reproducibility of the modelling results and thus is a desirable practice.
This article presents a technique for estimating the marginal likelihood which alleviates both of the drawbacks that hinder existing approaches. The method presented herein requires only a sample of the posterior distribution as an input and is thus implementable as a generic function, allowing for its use in a variety of applications. Additionally, the approach shows significantly less sensitivity to an increase in the number of model coefficients compared to existing approaches.
The marginal likelihood of a model M is the integral of the likelihood over the prior,

p(y|M) = ∫ p(y|θ, M) p(θ|M) dθ. (1.1)

Writing p*(θ|y) = p(y|θ)p(θ) for the non-normalized posterior, any bounded K-dimensional set A with support of the posterior distribution satisfies the geometric identity

p(y|M) = (∫_A 1 dθ) / (∫_A p(θ|y)/p*(θ|y) dθ). (1.3)

We refer to the integral in the denominator of (1.3) as the non-normalized posterior integral over A and abbreviate it by κ_A. Integrating 1 over a K-dimensional bounded set has the geometric interpretation of a generalized volume, or hypervolume, and we refer to the integral in the numerator of (1.3) as the volume of A. From (1.3) we see that the original integration problem in (1.1) can be split into two separate problems, namely the volume of A and the non-normalized posterior integral over A. Splitting the estimation of the marginal likelihood into these two sub-problems reduces the computational effort significantly and leads to very stable estimates, in particular in the case of high-dimensional models. The simpler computation results from the relative ease of implementing techniques to estimate a geometric volume as opposed to techniques for estimating a high-dimensional integral. Additionally, the posterior distribution can be used as an importance density for estimating κ_A, where all drawbacks of earlier approaches exploiting the posterior distribution as importance density are overcome by limiting the region of integration to A, as defined subsequently.
The remainder of this article is organized as follows: Section 2 details the proposed estimation method. Section 3 applies the developed method to several types of models: probit and logit regression, mixtures of normals, and the random intercept probit model. For each application, a comparison to existing methods is presented, namely Chib's method [5], importance sampling, and bridge sampling [21]. Section 4 summarizes and concludes.

The approach
In this section we present a new estimator for the marginal likelihood by separately estimating the volume of A and the corresponding non-normalized posterior integral as defined in Section 1. We also present a method for choosing A in such a way that the quotient of these estimators yields a stable estimate of the marginal likelihood.
First, we turn to the non-normalized posterior integral. A common technique for numerical integration is importance sampling, and since the posterior density appears in the numerator of the integrand of the non-normalized posterior integral, this suggests the posterior as the importance density. Since draws from the posterior distribution are usually available as a natural output of Bayesian analysis, the importance sampling estimator of the posterior integral, κ̂_A, is available almost ad hoc once A has been defined:

κ̂_A = (1/L) ∑_{l=1}^{L} I(θ^(l) ∈ A) / p*(θ^(l)|y), (2.1)

where θ^(l) with l = 1, . . . , L refers to the posterior draws after the burn-in. Earlier attempts at estimating the marginal likelihood by importance sampling have likewise exploited the posterior distribution as importance density, most prominent among them the harmonic mean estimator p̂_HM [22]. In this approach, after some reordering, one integrates both sides of Bayes' theorem over the whole prior domain,

∫ p(θ|y)/p(y|θ) dθ = ∫ p(θ)/p(y) dθ,

and since integrating the prior over its own domain yields 1, the harmonic mean estimator is given by

p̂_HM = [ (1/L) ∑_{l=1}^{L} 1/p(y|θ^(l)) ]^{−1}. (2.2)

It is well known that p̂_HM suffers from pseudo-bias due to the use of the unrestricted prior domain as the region of integration [20], and from significant instability [12]. To be valid, the importance density must have support across the entire region of integration. The posterior distribution, however, generally does not have such wide support in practice and is thus an improper choice as importance density in this respect. The instability stems from the dominating effect that only a small number of draws from the MCMC sampler has on the sum in equation (2.2). Some of the L posterior draws usually result in low values of p(y|θ^(l)), and since the likelihood enters the estimator reciprocally, these small values dominate p̂_HM.
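Because the reciprocal values 1/p*(θ^(l)|y) can span many orders of magnitude, the sum in (2.1) is best evaluated in log space. The following minimal sketch (Python/NumPy; the function and argument names are our illustration, not taken from the paper) assumes the log non-normalized posterior values at the draws and a boolean membership vector for A are available:

```python
import numpy as np
from scipy.special import logsumexp

def log_kappa_hat(log_pstar, in_A):
    """Log of the importance sampling estimator (2.1) of the
    non-normalized posterior integral over A.

    log_pstar : (L,) array of log p*(theta^(l)|y) at the posterior draws
    in_A      : (L,) boolean array, True where theta^(l) lies in A
    """
    L = log_pstar.shape[0]
    # Reciprocal values 1/p*(theta|y) enter the sum, so work with -log p*;
    # restricting the sum to A keeps the smallest p* values out of it.
    return logsumexp(-log_pstar[in_A]) - np.log(L)
```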
However, the shortcomings of the harmonic mean estimator guide the way to a definition of A that allows a stable estimation of the marginal likelihood from identity (1.3). Firstly, to ensure that the posterior distribution is a proper choice for the importance density, as required in our approach, the region of integration A must lie entirely within the support of the posterior distribution. Secondly, to avoid instability, the region of integration A may only contain points for which the sum in (2.1) is stable independently of any specific run of the MCMC sampler. To address both of these requirements, and at the same time allow for a simple estimation of the volume and the non-normalized posterior integral, we define A as the intersection of two sets A_1 and A_2.
Set A_1 is defined by a threshold value ρ: only those points θ lie in A_1 whose non-normalized posterior p*(θ|y) exceeds this threshold, such that

A_1 = {θ : p*(θ|y) > ρ}.

Considering the series p*^(1,...,L) = (p*(θ^(1)|y), . . . , p*(θ^(L)|y)), a natural way of determining the threshold ρ is to ensure that the lowest values of p*^(1,...,L) do not destabilize the sum in (2.1), by setting ρ to a quantile of p*^(1,...,L). We define ρ as the median of p*^(1,...,L), because of its robustness even in models of high dimension, and because the posterior distribution has sufficient support to serve as importance density wherever the non-normalized posterior exceeds its median. Thereby an almost perfectly stable estimation of (2.1) is ensured, since all values of p*^(1,...,L) that might destabilize the sum in (2.1) are excluded from the estimation of κ̂_A; a graphical demonstration is given below, after the derivation of the final set A.
However, using A_1 as the set A in (1.3) does not allow easy estimation of its volume and would constitute an integration problem with burdens comparable to those of the original formulation of the marginal likelihood in (1.1). To facilitate easy estimation of the volume of A, a second set A_2 is defined in such a way that the volume of the intersection of A_1 and A_2 can be estimated by standard statistical techniques only.
As long as A_1 and A_2 have the same dimension K and a non-empty intersection A, the volume V_A of this intersection can be written as V_A = πV_{A_2}, where π refers to the proportion of points lying in A_2 that also lie in A_1, and V_{A_2} is the volume of A_2. Hence, for an easy estimation of V_A we define A_2 in such a way that its volume V_{A_2} can be calculated analytically and that drawing uniformly from within it is efficient and feasible. Then π can simply be estimated by drawing K-dimensional vectors θ^(r) uniformly from A_2, such that

π̂ = (1/R) ∑_{r=1}^{R} I(θ^(r) ∈ A_1),

where I(⋅) refers to the indicator function and R is the number of random draws from within A_2. Notice that all of the above is valid for unimodal as well as multimodal distributions. However, henceforth we derive the algorithm for estimating the marginal likelihood for unimodal distributions; a minor change allowing the estimation of p(M|y) for multimodal distributions is presented in Section 3.2 and applied to a mixture of normal distributions.
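As a small illustration of this hit-or-miss step (a sketch, not the paper's code; sample_uniform_A2 and in_A1 are hypothetical helpers standing for the uniform sampler over A_2 and the test p*(θ|y) > ρ):

```python
import numpy as np

def volume_A_hat(sample_uniform_A2, in_A1, vol_A2, R=100_000):
    """Hit-or-miss estimate of V_A = pi * V_{A_2}.

    sample_uniform_A2 : returns an (R, K) array of uniform draws from A_2
    in_A1             : vectorized test whether p*(theta|y) > rho
    vol_A2            : analytically available volume of A_2
    """
    draws = sample_uniform_A2(R)
    pi_hat = in_A1(draws).mean()   # share of A_2 draws that also lie in A_1
    return pi_hat * vol_A2
```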
Even though other definitions of A_2 are possible, we choose a K-dimensional ellipsoid as A_2. This choice yields outstanding efficiency of the resulting estimator, and to dispel doubts that the choice of an ellipsoid might limit the applicability of the approach, the applications in Section 3 focus on posterior distributions with non-elliptical contours. The set of points θ lying in A_2 is thus defined by

A_2 = {θ : (θ − θ*)′ C^{−1} (θ − θ*) ≤ 1}, (2.3)

where C is a positive definite matrix of dimension K × K with its eigenvectors defining the principal axes of the ellipsoid, and θ* is a point with support of the posterior distribution; we define θ* as the posterior mode to ensure substantial overlap between A_2 and A_1. The last step in defining A_2 is thus choosing C. Considering the matrix of draws T = (θ^(1), . . . , θ^(L))′ and its covariance matrix D = cov(T), we define

C = αD,

where α is a scalar with domain ℝ+ employed as a tuning parameter in the presented approach. In all applications in Section 3, α was set in such a way that the resulting intersection of A_1 and A_2 contains about 49 % of the L posterior draws θ^(l). We include in the Appendix a trivial binary search method for finding the value of α; a sketch is also given below. The choice of setting α to contain 49 % of the posterior draws can be understood intuitively: since ρ is defined as the median of p*^(1,...,L), set A_1 contains exactly (for an even L) 50 % of the posterior draws, and since A is the intersection of A_1 and A_2, it cannot contain more than these 50 %. While the majority of points θ^(l) for which p*(θ^(l)|y) > ρ have a comparably low distance to the posterior mode θ*, in high-dimensional models with very non-elliptical contours some points far away from the mode can still fulfill p*(θ^(l)|y) > ρ. If α is chosen to include all 50 % of the posterior draws fulfilling this condition, the resulting ellipsoid from (2.3) might have much mass in its outer layers where only few points exceed the threshold ρ, with potentially negative effects on the efficiency of estimating π̂. Setting α to include the 49 % of draws with the lowest distance to the mode increases the efficiency of the estimator π̂, while no evidence for a negative impact on the stability of κ̂_A is found. Figure 1 compares the values of the non-normalized posterior p*(θ^(l)|y) for all l = 1, . . . , L draws of the posterior distribution to those values of p*(θ^(l)|y) for which θ^(l) ∈ A, for a low- and a high-dimensional model. The almost completely flat lower boundary of the non-normalized posterior values under A allows for an extraordinarily stable estimation of the posterior integral in (2.1), independent of the dimension of the respective model.
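A sketch of such a binary search (Python/NumPy; our illustration, not the appendix routine itself) exploits the fact that, with C = αD, a draw lies in A_2 exactly when its squared Mahalanobis distance to θ* under D does not exceed α:

```python
import numpy as np

def find_alpha(theta, theta_star, D_inv, in_A1, target=0.49, tol=1e-4):
    """Binary search for alpha in C = alpha * D such that the intersection
    of A_1 and the ellipsoid A_2 contains about `target` of the L draws.

    theta : (L, K) posterior draws; in_A1 : (L,) boolean, p*(theta|y) > rho
    """
    diff = theta - theta_star
    # squared Mahalanobis distances; a draw lies in A_2 iff dist2 <= alpha
    dist2 = np.einsum('li,ij,lj->l', diff, D_inv, diff)
    lo, hi = 0.0, dist2.max()
    while hi - lo > tol * hi:
        alpha = 0.5 * (lo + hi)
        share = np.mean((dist2 <= alpha) & in_A1)
        lo, hi = (alpha, hi) if share < target else (lo, alpha)
    return 0.5 * (lo + hi)
```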
Algorithm I for the estimation of the marginal likelihood is thus given by: (1) Run an MCMC sampler to obtain L posterior draws θ^(l) after the burn-in, calculate the series of non-normalized posterior density values p*^(1,...,L), and set ρ to its median, θ* to the posterior mode, and D to the covariance matrix of the posterior draws. (2) Find α such that the intersection of A_1 and A_2 contains about 49 % of the posterior draws, and set C = αD. (3) Draw R vectors θ^(r) uniformly from within A_2, count the number r̄ of draws for which p*(θ^(r)|y) > ρ, and set π̂ = r̄/R. (4) Estimate the volume of A as V̂_A = π̂ V_{A_2}, and obtain the estimator of the non-normalized posterior integral κ̂_A from (2.1). (5) Calculate the final estimator of the marginal likelihood as p̂_A = V̂_A / κ̂_A. An algorithm for efficiently drawing uniformly from within a hyperellipsoid is found in the Appendix and requires minimal computing time even for large values of R. Volume V_{A_2} is calculated rather than estimated, and the Appendix shows a recursive algorithm returning log(V_{A_2}) with minimal computing time even for very high K. Thus the calculation of π̂ and V_{A_2}, and consequently V̂_A, is achieved with low computational effort and high precision.
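Putting the pieces together, the following end-to-end sketch (Python/NumPy) strings steps (1)–(5) together. It assumes the find_alpha sketch above and the appendix routines sample_uniform_ellipsoid and log_volume_ellipsoid (sketched in the Appendix); approximating the posterior mode θ* by the highest-density draw is our simplification, not part of the paper's algorithm:

```python
import numpy as np
from scipy.special import logsumexp

def log_p_A(theta, log_pstar_fn, R=100_000, seed=0):
    """Sketch of Algorithm I: log p_A = log V_A - log kappa_A.

    theta        : (L, K) array of posterior draws after burn-in
    log_pstar_fn : maps an (n, K) array to the n values log p*(theta|y)
    """
    rng = np.random.default_rng(seed)
    L, K = theta.shape
    log_pstar = log_pstar_fn(theta)
    rho = np.median(log_pstar)                        # step (1): threshold
    theta_star = theta[np.argmax(log_pstar)]          # approximate mode
    D = np.cov(theta, rowvar=False)
    in_A1 = log_pstar > rho
    alpha = find_alpha(theta, theta_star, np.linalg.inv(D), in_A1)  # step (2)
    C = alpha * D
    unif = sample_uniform_ellipsoid(theta_star, C, R, rng)          # step (3)
    # R should be large enough that at least r-bar draws hit A_1 (Section 3)
    pi_hat = np.mean(log_pstar_fn(unif) > rho)
    log_V_A = np.log(pi_hat) + log_volume_ellipsoid(C)              # step (4)
    diff = theta - theta_star                         # membership in A_2
    m2 = np.einsum('li,ij,lj->l', diff, np.linalg.inv(C), diff)
    in_A = in_A1 & (m2 <= 1.0)
    log_kappa = logsumexp(-log_pstar[in_A]) - np.log(L)             # eq. (2.1)
    return log_V_A - log_kappa                                      # step (5)
```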

Applications
In this section the developed approach is applied to several important classes of models. The estimator is compared to those existing methods that have consistently shown excellent performance within the respective class of models in earlier comparative studies. These are the approach of Chib [5], along with importance sampling and bridge sampling [21]. Chib's method exploits the basic marginal likelihood identity

p(y) = p(y|θ*) p(θ*) / p(θ*|y),

where θ* is a high density point, for which the posterior mode is used throughout this section. An estimate of the marginal likelihood thus requires an estimate of the posterior density at θ*, and we follow the estimation technique for the posterior ordinate p̂(θ*|y) as outlined in [5] throughout this section. Chib's method does not require any input other than draws from the Gibbs sampler; however, for each variable block within the Gibbs sampler one needs to re-run a reduced sampler. In particular for hierarchical and multi-level models, this incurs a certain burden on the analyst. Furthermore, Chib's estimator as outlined above is applicable in a Gibbs setting only, while a generalization of this approach also allows estimation of the marginal likelihood based on output from the Metropolis–Hastings algorithm [6]. Importance sampling and bridge sampling both require as an input an importance density q(θ) which approximates the posterior distribution. The degree to which this approximation successfully mimics the posterior distribution has a substantial impact on the efficiency of the estimators. For the importance sampling estimator, m = 1, . . . , M draws θ^(m) are obtained from q(θ), and an estimator is then given by averaging the non-normalized posterior values weighted by the importance density ordinates, such that

p̂_IS = (1/M) ∑_{m=1}^{M} p*(θ^(m)|y) / q(θ^(m)).

If q(θ) does not have fatter tails than the posterior density, the importance sampling estimator can become unstable, resulting in high standard errors. The bridge sampling technique somewhat alleviates this problem by weakening the demands on the importance density with respect to its tail behavior [12]. The bridge sampling estimator of [21] is the limit of the recurrence

p̂_BS^(t) = [ (1/M_1) ∑_{m_1=1}^{M_1} p*(θ̃^(m_1)|y) / (M_1 q(θ̃^(m_1)) + M_2 p*(θ̃^(m_1)|y)/p̂_BS^(t−1)) ] / [ (1/M_2) ∑_{m_2=1}^{M_2} q(θ^(m_2)) / (M_1 q(θ^(m_2)) + M_2 p*(θ^(m_2)|y)/p̂_BS^(t−1)) ],

where θ̃^(m_1) denotes the M_1 draws from the importance density and θ^(m_2) denotes the M_2 selected posterior draws. The problem of finding an adequate importance density is not trivial; in particular for high-dimensional densities, tuning the importance density can become very time consuming. An approach to define the importance density q(θ) based on output from the Gibbs sampler is found in [8], where θ is split into D blocks (θ_1, . . . , θ_D) and, based on the conditional distributions of these parameter blocks, a mixture importance density is constructed as a byproduct of the Gibbs sampler:

q(θ) = (1/S) ∑_{s=1}^{S} ∏_{d=1}^{D} p(θ_d | θ^{(l_s)}_{j>d}, y), (3.1)

where ∏_d p(θ_d | θ^{(l_s)}_{j>d}, y) refers to the product of conditional densities from which θ was drawn in the s-th iteration of the Gibbs sampler, and l_1, . . . , l_S refer to a series of S Gibbs iterations chosen by the analyst; see [9,11] for details. Then q(θ) converges to the posterior density as S → ∞. Throughout this section an importance density as defined in (3.1) is used, unless stated otherwise.
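As an illustration, the bridge sampling recurrence can be implemented in a few lines (Python/NumPy; a sketch working with raw density values for brevity — in practice one would evaluate everything on the log scale):

```python
import numpy as np

def bridge_sampling(pstar_q, q_q, pstar_post, q_post, iters=100):
    """Iterative bridge sampling estimator as sketched in the text.

    pstar_q, q_q       : p*(.|y) and q(.) at the M_1 draws from q
    pstar_post, q_post : p*(.|y) and q(.) at the M_2 posterior draws
    """
    M1, M2 = len(pstar_q), len(pstar_post)
    p_hat = np.mean(pstar_q / q_q)  # importance sampling start value
    for _ in range(iters):
        num = np.mean(pstar_q / (M1 * q_q + M2 * pstar_q / p_hat))
        den = np.mean(q_post / (M1 * q_post + M2 * pstar_post / p_hat))
        p_hat = num / den           # fixed point is the bridge estimate
    return p_hat
```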
Besides comparing point estimates, the expected relative mean-squared error (RE²) gives evidence on the stability of the estimators, and respective estimation techniques for RE² are summarized in [12]. However, recent research investigated the accuracy of these estimation techniques [2] and calculated numerical standard errors of the outlined estimators over 500 independent runs of the MCMC sampler. That analysis shows a significant risk of over- or underestimating the standard errors under the given techniques when these require as an input the specific lag at which the autocorrelation of the MCMC chain tapers off (which is the case for all estimators considered here except the importance sampling estimator p̂_IS). Since this paper aims at a pure comparison of the presented estimator with the existing ones, we abstract from this discussion and estimate RE² without relying on a measure of the autocorrelation, namely by the bootstrap method, even though this increases the computational effort significantly (see e.g. [18] for details on the bootstrap method). For each bootstrap replication of p̂_A, we draw with replacement L out of the L posterior draws and carry out the estimation of p̂_A anew. The same applies to bootstrap replications of p̂_CH when the model is estimated by a one-block Gibbs sampler. For models requiring multi-block samplers, only the high density point θ* and the conditional density of the first variable block can be calculated from the bootstrap sample, while for each of the remaining variable blocks the respective reduced Gibbs sampler has to be carried out anew to obtain the ordinate of the particular conditional density. Similarly, M out of the M importance draws are drawn to obtain a bootstrap replication of p̂_IS, and these importance draws are combined with M out of the M selected posterior draws for p̂_BS. Reported standard errors are calculated from 200 bootstrap replications each.
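A minimal sketch of such a bootstrap replication loop (Python/NumPy; estimate_fn stands for any of the estimators above applied to a resampled set of draws):

```python
import numpy as np

def bootstrap_se(estimate_fn, theta, B=200, seed=0):
    """Bootstrap standard error of a log marginal likelihood estimator.

    estimate_fn : maps an (L, K) array of posterior draws to an estimate
    theta       : (L, K) array of posterior draws
    """
    rng = np.random.default_rng(seed)
    L = theta.shape[0]
    # draw with replacement L out of the L posterior draws, re-estimate
    reps = [estimate_fn(theta[rng.integers(0, L, size=L)]) for _ in range(B)]
    return np.std(reps, ddof=1)
```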
Defining the quantities L, M, S and R to yield comparable computational time for the calculation of the different estimators, and thus allow for a fair comparison, would require tuning these quantities in each application. Instead, we set these quantities in such a way that p̂_A and p̂_CH show about the same computation time, while even under low values of S importance sampling and bridge sampling require significantly more time. In all subsequent applications we obtain L = 100,000 draws from the Gibbs sampler as input for p̂_A and p̂_CH. Additionally, we set R = 100,000 to ensure precise estimation of π̂, considering that its standard error is √(π̂(1 − π̂)/R). However, to ensure precise estimation even for high-dimensional distributions with very non-elliptical contours and potentially small values of π̂, we define as a generic rule the lower bound of draws for which p*(θ^(r)|y) > ρ to be r̄ = 1,000. M and S are set in the respective applications.

Probit and logit
The data most frequently used for comparing estimation techniques of the marginal likelihood are the prostatic nodal involvement data, see e.g. [5,7,14]. These data contain N = 53 patients with cancer of the prostate, and the binary response indicates whether the cancer has spread to the surrounding lymph nodes. Potential explanatory variables are the patient's age at the time of diagnosis (x_1), the level of serum acid phosphatase (x_2), the binary result of an X-ray examination (x_3), the size of the tumor (x_4; 0 for small and 1 for large), and the pathological grade of the tumor (x_5; 0 if less serious and 1 if more severe). All of the above cited articles agree on the addition of an intercept for modeling the binary outcome.
Each of the aforementioned three instances in the literature where these data are used for marginal likelihood estimation assumes a different model specification and prior setting. We use a probit link [5,7] as well as a logit link function [14], given by

probit: P(y_i = 1 | β) = Φ(x_i β),
logit: P(y_i = 1 | β) = exp(x_i β) / (1 + exp(x_i β)),

where x_i is a row vector containing the explanatory variables of respondent i, the column vector β holds the corresponding coefficients, and Φ(⋅) refers to the cumulative distribution function of the standard normal distribution. We adopt the prior setting of [7] for the probit model specification: a normal prior with mean 0 and standard deviation 10 for the intercept, and normal priors with mean 0.75 and standard deviation 5 for the other explanatory variables. To achieve comparability between the probit and logit models, in the logit specification all parameters of the normal priors are scaled by 1.6 [14].
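For concreteness, a sketch of the log non-normalized posterior of the probit specification, which is the only model-specific input p̂_A requires (Python/SciPy; the function name and the stacking of the prior vectors are our illustration):

```python
import numpy as np
from scipy.stats import norm

def log_pstar_probit(beta, X, y, prior_mean, prior_sd):
    """log p*(beta|y) = log-likelihood + log-prior for the probit model
    P(y_i = 1 | beta) = Phi(x_i beta), with independent normal priors.
    X is assumed to include a leading column of ones for the intercept.
    """
    eta = X @ beta
    # Phi(eta) for y = 1 and Phi(-eta) = 1 - Phi(eta) for y = 0
    loglik = norm.logcdf(np.where(y == 1, eta, -eta)).sum()
    logprior = norm.logpdf(beta, prior_mean, prior_sd).sum()
    return loglik + logprior

# Prior setting used for the probit models in the text:
# N(0, 10^2) for the intercept, N(0.75, 5^2) for the five covariates.
prior_mean = np.array([0.0] + [0.75] * 5)
prior_sd = np.array([10.0] + [5.0] * 5)
```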
The probit models are estimated as outlined in [1], while the logit models are estimated by auxiliary mixture sampling as explained in [13] and [15].
Since the presented approach exploits the posterior distribution as importance density in a similar way as the harmonic mean estimator does, we also report p̂_HM in this section, to demonstrate the substantial impact of the proposed restriction of the region of integration to the set A. We set the number of draws from the importance density to M = 20,000 for p̂_IS, and combine these draws with M = 20,000 randomly selected posterior draws for the calculation of p̂_BS. The mixture importance density is constructed from S = 100 components as outlined in (3.1).
Based on three independent runs of the MCMC sampler for each model, Table 1 displays the outcomes of the different estimators of the marginal likelihood on the logarithmic scale. Firstly, for models of relatively low dimension, i.e. M_1–M_7 and M_10–M_16, we see that p̂_IS, p̂_BS, and p̂_CH have the smallest standard errors, while the accuracy of p̂_A clearly allows for its application without increasing the risk of false model choice to a relevant degree. Secondly, we see that the advantage of the existing methods vanishes when the dimension of the investigated model increases. In particular, for M_9 and M_18 we see that p̂_A outperforms the other estimators except the bridge sampling estimator p̂_BS. Thus the estimator p̂_A is more robust to an increase in the complexity of the model than importance sampling and Chib's method in this example. This statement is also supported by comparing the standard errors for the model of lowest dimension, M_1, and the model of highest dimension, M_18: they increase by a factor of 5 or more for all other estimators, while the standard errors of p̂_A do not respond to the increase of the dimensionality within each group of models sharing the same link function.

Table 1: Nodal involvement data; logarithms of the different estimators, ln(p̂_HM), ln(p̂_IS), ln(p̂_BS), ln(p̂_CH), and ln(p̂_A), by candidate model and link function.

Mixture of normals
A general formulation of a mixture distribution is given by

p(y_i|θ) = ∑_{s=1}^{S} η_s p(y_i|θ_s),

where η_s denotes the weight of component s. Denoting the indicator holding the assignment of observation y_i to a mixture component by I_i, where this indicator takes labels s = 1, . . . , S, the model likelihood and the prior of the series I^N = (I_1, . . . , I_N) are invariant to re-labelling. When the prior of the parameters θ_s is invariant too, the posterior distribution of a mixture model is multimodal with respect to the S! permutations of the indicator labels. This is a challenge for marginal likelihood estimation of mixture models [11]. Parameter estimation of mixture models through MCMC sampling frequently suffers from the non-identifiability of the indicator labels, which can result in undesired label switching; several techniques have been proposed to address the label switching problem [10,26]. We first consider a posterior sample θ^{L,1} = (θ^(1), . . . , θ^(L)) in which all draws belong to one permutation of the indicator labels. Since the marginal likelihood is the normalizing constant of the whole unconstrained posterior distribution, not only of the subspace belonging to one permutation, calculating p̂_A from θ^{L,1} alone would result in an incomplete estimate of the marginal likelihood. Thus we define a second sample of the posterior distribution, θ^{L,S!} = (θ^(1), . . . , θ^(L)), in which the draws represent a balanced sample from all of the S! permutations of the labels. In this section the random permutation sampler of [10] is applied to obtain θ^{L,S!}, while other approaches exploring all modes of the posterior distribution could be used as well (see e.g. [4]).
A derivative of Algorithm I allowing the estimation of the marginal likelihood for multimodal mixture models is then found by first estimating ρ, θ*, D, A, and thus π̂ and V̂_A, from θ^{L,1}. Secondly, based on the thereby defined A, the integral κ̂_A is estimated from θ^{L,S!}. Thus p̂_A can be implemented for mixture models as generically as for unimodal distributions; however, two runs of the MCMC sampler are required. Nevertheless, since the construction of the importance density from (3.1) likewise requires output from a Gibbs sampler run exploring all modes of the posterior, while a sample θ^{L,1} is usually needed for parameter estimation anyway, the requirement of two posterior samples does not mean an additional burden compared to importance density based techniques in practice. A sketch of this variant is given below.
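A minimal sketch of this two-sample variant (Python; build_A_and_volume is a hypothetical helper bundling steps (1)–(4) of Algorithm I applied to θ^{L,1}, and log_pstar_fn evaluates log p*(θ|y) as before):

```python
import numpy as np
from scipy.special import logsumexp

# theta_L1: draws from one label permutation; theta_LSfact: balanced sample
# over all S! permutations. build_A_and_volume returns log V_A and a
# membership test for A, both derived from theta_L1 only.
log_V_A, in_A_fn = build_A_and_volume(theta_L1, log_pstar_fn)

# kappa_A in (2.1) is then estimated from the permutation-balanced sample.
log_pstar_bal = log_pstar_fn(theta_LSfact)
in_A = in_A_fn(theta_LSfact)
log_kappa_A = logsumexp(-log_pstar_bal[in_A]) - np.log(len(theta_LSfact))

log_p_A = log_V_A - log_kappa_A   # final estimate, as in (1.3)
```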
Here we demonstrate the functioning of the proposed estimator by applying it to the simulation example used in [10] and [16]. Two examples of a mixture of two normal distributions are defined. The first example is defined in such a way that the two states are easily separated by the Gibbs sampler, setting η_1 = η_2 = 0.5, μ_1 = 1, μ_2 = 2, σ²_1 = 0.01, and σ²_2 = 0.05. The second example uses the same coefficients as the first, but with σ²_1 = 0.1 and σ²_2 = 0.15, resulting in significant overlap of the distributions. We simulate 1,000 observations from each of the two mixture distributions and prepare the estimation by defining natural conjugate priors for μ_s and σ²_s. An inverse gamma distribution is selected as prior for σ²_s, with both the scale parameter and the shape parameter equal to 3, imposing a vague prior considering the magnitudes of the true parameters. The prior of μ_s is then a normal distribution with mean 0 and variance 100σ²_s. The vector (η_1, η_2) a priori follows a Dirichlet distribution with parameters a_{η_1} = a_{η_2} = 5. The Gibbs sampler as outlined in [23] is then applied to obtain the posterior sample θ^{L,1}, combined with the random permutation step to obtain θ^{L,S!}.
The approach presented here is compared to importance sampling and bridge sampling, where the importance density in (3.1) is constructed from S = 100 components and M is again set to 20,000. As discussed in the literature (see e.g. [12]), Chib's method does not perform as well when applied to multimodal models as it does for unimodal models, and we omit reporting the outcomes of this estimator here. The left plot in Figure 2 displays a scatter plot of σ²_1 and σ²_2 from all l = 1, . . . , L draws of the series θ^{L,S!} of the MCMC sampler (black, in the background), overlaid by those draws for which θ^(l) ∈ A (in the front). We see that the posterior distributions of σ²_1 and σ²_2 under the two permutations overlap to a significant extent and that the set A is centered around one of the two permutations of the component labels. For the parameter specification with well-separated components, importance sampling and bridge sampling show very low standard errors and outperform the estimator presented in this article. However, in absolute terms the standard errors of p̂_A are still very small and would usually have no practical impact on model choice. For the parameter specification with overlap of the distributions of σ²_1 and σ²_2, bridge sampling again has the smallest standard errors, while their magnitudes have increased noticeably. Additionally, we see that the standard errors of importance sampling and the method presented here are on the same scale.
We interpret the results of this simulation study as evidence that p̂_A provides unbiased and stable estimates when applied to multimodal models. Considering the significant challenges associated with marginal likelihood estimation for such mixture models, and the limited applicability of Chib's estimator to them, the presented approach can contribute significantly to making the task of model selection less cumbersome for this important class of models.

Random intercept probit
In this subsection the sensitivity of the discussed estimators of the marginal likelihood to an increase of the dimensionality of the candidate models is explored. This paper provides for the first time a comparison of the discussed estimation techniques for a high-dimensional unit-level model, and discloses the shortcomings of existing approaches. As one instance of a comparative study exploring the existing techniques with respect to a unit-level model, Frühwirth-Schnatter and Wagner [14] fit a random intercept logit model, with up to K = 25 coefficients, to agricultural data and compare the estimates from Chib's method, as well as from importance and bridge sampling under the mixture importance density from (3.1). In this section we increase the number of model dimensions in five applications up to K = 142 to demonstrate the extraordinary stability of the presented estimator in comparison to the existing approaches.
The applications of this section exploit data on reading proficiency in US schools. The data stem from the "Program for International Student Assessment" (PISA); here, data from the year 2009 are used as provided in [25]. Reading proficiency is measured in terms of a continuous score; however, since limited dependent variable models have proven to be the bigger challenge with respect to marginal likelihood estimation, the score is censored in this example. Consequently, the observed dependent variable y is defined according to whether a student exceeded the median score of all students (1) or fell behind it (0). As explanatory variables we use a student's grade (x_1; a number between 9 and 11), gender (x_2; 1 for female, 0 for male), whether a student's family has lived in the USA for at least three generations (x_3; 1 for yes, 0 for no), the index of economic, social and cultural status (x_4; a continuous value centered at 0), and an index of reading enjoyment (x_5; a continuous value centered at 0). Skrondal and Rabe-Hesketh [24] estimate an older version of the PISA data and add a random intercept for each school to capture overdispersion. In this application we use a probit link function, such that the model is given by

P(y_it = 1 | β, γ_t) = Φ(x_it β + γ_t), γ_t ∼ N(μ, σ²),

where t = 1, . . . , T refers to the school of student i, with his or her characteristics held by the row vector x_it, and γ_t is the random intercept of school t. We define a normal prior for β such that β ∼ N(0, 100 × I), where I refers to the identity matrix with five columns. We specify a natural conjugate prior for μ and σ². The variance of the random intercept is estimated to be 0.28 in [24], and we specify an inverted Gamma distribution as its prior, σ² ∼ G^{−1}(2, 0.5), having a median of around 1.2 and thus being vague enough for this application. The prior of μ is then a normal distribution with mean 0 and variance 100σ². The Gibbs sampler is outlined in the Appendix. This model is estimated for five different data sets. The first data set contains all 21 public schools in the PISA sample from the Northeast of the US, the second contains the 28 schools of the region West, the third the 36 schools from the Midwest, and the fourth the 50 schools from the remaining region, South. For the fifth data set we combine these regions and estimate reading proficiency for all 135 public schools in the sample. Considering that the model estimates an individual intercept for each school, the total number of coefficients K is the number of schools in the respective data set plus 7.
The estimation techniques explored in this paper are the method presented in this article, Chib's method, and importance and bridge sampling. As discussed earlier, the mixture importance distribution from equation (3.1) approaches the posterior distribution for S → ∞, and S = 100 performed well for the low-dimensional models of the preceding sections. However, in a hierarchical model each of the S components ∏_d p(θ_d | θ^{(l_s)}_{j>d}, y) of the mixture importance density becomes a very narrow, almost spike-shaped distribution conditional on a draw of the latent variables Z = (z_11, . . . , z_NT). As a result, the final importance density has hardly any mass in between its S spikes and clearly fails the requirement of having fatter tails than its target distribution, with negative impact on the stability of p̂_IS.
The hedgehog-like shape of the importance density may also induce problems for bridge sampling. When constructing the importance density from (3.1) in an unsupervised manner, it is likely that some of the posterior draws θ^(1), . . . , θ^(M_2) intended for use during bridge sampling have originally been drawn from one of the S components of the mixture importance density. If a draw θ^(m_2) has been drawn from one of the S components of the importance density (i.e. from the center region of one of the spikes), then the magnitude of q(θ^(m_2)) is exceedingly high compared to an importance density constructed from components from which none of the posterior draws have been drawn. As a result, the sum in the denominator of the bridge sampling estimator,

∑_{m_2=1}^{M_2} q(θ^(m_2)) / (M_1 q(θ^(m_2)) + M_2 p*(θ^(m_2)|y)/p̂_BS^(t−1)),

is dominated by these high values of q(θ^(m_2)), with negative effects on the stability of the estimator. To avoid this source of instability, we split the posterior sample and construct the importance density from the first half of the L posterior draws, θ^(1), . . . , θ^(L/2), and select the M_2 posterior draws for use during bridge sampling from the second half, θ^(L/2+1), . . . , θ^(L).
Following the preceding discussion of possible drawbacks of the mixture importance density when applied to unit-level models, we compare it to another approach for the unsupervised construction of an importance density. As for (3.1), θ is split into D blocks (θ_1, . . . , θ_D), and the marginal distribution of each block, q(θ_d), is estimated from the posterior draws. The product of the estimated marginal densities is then used as importance density (see e.g. [7]):

q(θ) = ∏_{d=1}^{D} q(θ_d). (3.3)

Here, the marginal distribution of μ is estimated by a normal distribution with its mean being the mean of the posterior draws (μ^(1), . . . , μ^(L)) and its variance the variance of the posterior draws. The marginal distribution of σ² is approximated by a log-normal distribution with its mean set to the mean of the log posterior draws (log(σ^{2,(1)}), . . . , log(σ^{2,(L)})) and its variance to the variance of this series. Likewise, the q(θ_d) of β and of the random intercepts γ_1, . . . , γ_T are approximated by normal distributions with their means and variances estimated from the respective posterior draws; a sketch of this construction is given below. As this construction of the importance density decreases its ability to draw from high-density regions of the posterior distribution compared to the mixture importance density approach, we set M = 1,000,000. For the mixture importance density, we set S = 1,000 and M = 50,000. Even under these demanding settings for both importance densities, importance sampling and bridge sampling require long computational runtimes, while the calculation of p̂_A is much faster.

Table 3 displays the results of the comparative study. While the magnitudes of the estimates cannot be compared between the different values of K, as these relate to different data sets, the standard errors allow conclusions about the sensitivity of the respective estimators to an increase of the number of coefficients of the underlying model. Firstly, we see that constructing the importance density as in (3.3) can be ruled out for this model, and we skip reporting its estimates beyond K = 28. Secondly, we see from the high standard errors that, even for this relatively low value of K, importance sampling applying the mixture importance density produces less reliable estimates than the remaining three techniques, and this reliability worsens with increasing K. However, up to K = 57 all three remaining techniques provide estimates with relatively low standard errors, while p̂_A is by far the most stable for all values of K. Thirdly, in practice one would usually wish to analyze all public schools in the US within the same model, which is marked in Table 3 by K = 142. For this model all existing techniques fail to produce stable estimates, while in contrast p̂_A provides outstandingly stable estimates with low standard errors, showing only very limited sensitivity to the increase in the number of model coefficients from K = 28 to K = 142.

Table 3: PISA data; logarithms of the different estimators for the five data sets. Importance sampling and bridge sampling using an importance density constructed as a product of estimated marginal densities as in (3.3) are referenced by p̂_IS^1 and p̂_BS^1, while p̂_IS^2 and p̂_BS^2 use a mixture importance density constructed as in (3.1); p̂_CH refers to Chib's method, and the estimator proposed in this paper is referenced as p̂_A. Relevant standard errors in parentheses; results from three independent MCMC runs per data set are reported.
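As an illustration of construction (3.3), the following sketch fits the marginals from the posterior draws (Python/SciPy; the function name and argument layout are our illustration, with a log-normal for σ² and normals for the remaining blocks):

```python
import numpy as np
from scipy.stats import norm, lognorm

def make_log_q(mu_d, sig2_d, beta_d, gamma_d):
    """Importance density (3.3) built as a product of estimated marginals.

    mu_d : (L,) draws of mu; sig2_d : (L,) draws of sigma^2;
    beta_d : (L, 5) draws of beta; gamma_d : (L, T) random intercept draws.
    Returns a function evaluating log q(theta).
    """
    # log-normal for sigma^2, fitted to the log posterior draws
    m_log, s_log = np.log(sig2_d).mean(), np.log(sig2_d).std(ddof=1)
    # normals for mu, beta and gamma_1..gamma_T, fitted per coordinate
    norm_blocks = [(d.mean(axis=0), d.std(axis=0, ddof=1))
                   for d in (mu_d[:, None], beta_d, gamma_d)]

    def log_q(mu, sig2, beta, gamma):
        out = lognorm(s=s_log, scale=np.exp(m_log)).logpdf(sig2)
        for (m, s), val in zip(norm_blocks,
                               (np.atleast_1d(mu), beta, gamma)):
            out += norm.logpdf(val, m, s).sum()
        return out

    return log_q
```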
The outstanding stability of p̂_A is a result of directly addressing the magnitude of the non-normalized posterior as the criterion to qualify for A. For importance density based methods, the importance density has to approximate the posterior distribution in all K dimensions, while p̂_A only relies on the univariate aggregate p*(θ|y). This is a significant advantage with respect to the stability of the estimates. The right plot in Figure 2 compares all posterior draws of the coefficients μ and σ² of the data set with K = 28 to those draws lying in A, and demonstrates that the posterior distribution has very non-elliptical contours, which obviously do not hamper the stability of p̂_A.

Conclusion and discussion
This article develops a new approach to marginal likelihood estimation and compares it to existing estimation techniques. The comparison reveals two major advantages of the presented approach. Firstly, it requires only a sample of posterior draws as an input from the analyst, and these draws can come from any sampling scheme, such as Gibbs sampling or Metropolis–Hastings sampling. Furthermore, and in contrast to some widely used earlier approaches, only draws from one run of the MCMC sampler are required, except for multi-modal posterior distributions, where a second run is required. Hence, the presented approach can be implemented generically for many important classes of models. This property substantially decreases the burden for the analyst associated with computing the marginal likelihood.
The second advantage of the presented approach is its improved stability when the number of model coefficients increases, as illustrated in Section 3.3. Herein, the high level of stability, in contrast to the existing approaches, was demonstrated up to a 142-dimensional random intercept probit model. While this result can encourage analysts to report the marginal likelihood as part of Bayesian analyses more frequently, no general rule can be derived as to the number of dimensions for which the method will provide stable estimates. Therefore, assessments of the relationship between the number of dimensions, the type of the analyzed model, and the precision of the resulting estimates are suggested for future research.
Comparable uncertainty remains regarding the best choice of the tuning parameters of the presented approach, namely ρ and α. However, despite the very different model types on which the presented approach was tested in Section 3, all applications utilized the same values for these parameters. The high stability of the estimates in these applications suggests that the precision of the method may not be sensitive to the choice of the tuning parameters. Nevertheless, future research should provide a better understanding of how, and to what extent, an optimized choice of these parameters could further improve the precision of the method. As a final remark, and as mentioned in Section 2, we note that in principle the choice of the geometric object for A_2, which was chosen to be a K-dimensional ellipsoid in this article, is a tuning parameter as well. The availability of methods for drawing uniformly from within an ellipsoid, and for providing the log of its volume even when K is high, makes it a convenient and efficient choice. Nevertheless, it cannot be ruled out that a different definition of A_2 could further improve the properties of the presented approach.
A Subroutines for Algorithm I

A.1 Log of volume of K-dimensional ellipsoid

Algorithm 1 returns the log of the volume of a K-dimensional ellipsoid.

Algorithm 1. Log of the volume of a K-dimensional ellipsoid.
1: procedure LogVolumeEllipsoid(C)
2: e^(1), . . . , e^(K) = eigenvalues of C
3: v^(0) = log(1)
4: v^(1) = log(2)
#Calculates the volume of i-dimensional spheres of unit radius
5: for i = 2 to K do
6: v^(i) = log(2π/i) + v^(i−2)
7: end for
#Scales the volume of the K-dimensional unit sphere with respect to the main axes
8: log(V_{A_2}) = v^(K) + ∑_{k=1}^{K} log(√e^(k))
9: end procedure
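For reference, a direct transcription of Algorithm 1 as a sketch in Python/NumPy (our code):

```python
import numpy as np

def log_volume_ellipsoid(C):
    """Log volume of {theta : (theta - theta*)' C^{-1} (theta - theta*) <= 1},
    following Algorithm 1: the recursive log volume of the K-dimensional unit
    sphere, scaled by the semi-axes sqrt(e_k), with e_k the eigenvalues of C.
    """
    eigvals = np.linalg.eigvalsh(C)      # C is positive definite
    K = len(eigvals)
    v = [0.0, np.log(2.0)]               # log V_0 = log 1, log V_1 = log 2
    for i in range(2, K + 1):
        v.append(np.log(2.0 * np.pi / i) + v[i - 2])
    return v[K] + 0.5 * np.sum(np.log(eigvals))
```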

A.2 Drawing uniformly from within K-dimensional ellipsoid

Algorithm 2 returns one point θ^(r) uniformly drawn from within a K-dimensional ellipsoid.

Algorithm 2. Drawing uniformly from within a K-dimensional ellipsoid.
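A standard construction for this task, given here as a sketch (Python/NumPy; our code, not necessarily the paper's exact routine): a uniform direction is obtained from a normalized Gaussian vector, a radius u^{1/K} with u ~ U(0,1) makes the point uniform in the unit ball, and an affine map through a Cholesky factor of C carries it into the ellipsoid.

```python
import numpy as np

def sample_uniform_ellipsoid(theta_star, C, R, rng=None):
    """Draw R points uniformly from
    {theta : (theta - theta*)' C^{-1} (theta - theta*) <= 1}.
    """
    rng = np.random.default_rng(rng)
    K = len(theta_star)
    z = rng.standard_normal((R, K))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform on the sphere
    radii = rng.random(R) ** (1.0 / K)              # uniform radius in the ball
    ball = z * radii[:, None]
    Lc = np.linalg.cholesky(C)                      # C = Lc Lc'
    return theta_star + ball @ Lc.T
```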