Dynamic Shrinkage Priors for Large Time-Varying Parameter Regressions Using Scalable Markov Chain Monte Carlo Methods

Abstract Time-varying parameter (TVP) regression models can involve a huge number of coefficients. Careful prior elicitation is required to yield sensible posterior and predictive inferences. In addition, the computational demands of Markov Chain Monte Carlo (MCMC) methods mean their use is limited to the case where the number of predictors is not too large. In light of these two concerns, this paper proposes a new dynamic shrinkage prior which reflects the empirical regularity that TVPs are typically sparse (i.e. time variation may occur only episodically and only for some of the coefficients). A scalable MCMC algorithm is developed which is capable of handling very high dimensional TVP regressions or TVP Vector Autoregressions. In an exercise using artificial data we demonstrate the accuracy and computational efficiency of our methods. In an application involving the term structure of interest rates in the eurozone, we find our dynamic shrinkage prior to effectively pick out small amounts of parameter change and our methods to forecast well.


Introduction
The increasing availability of large data sets in economics has led to interest in regressions involving large numbers of explanatory variables.Given the evidence of instability and parameter change in many macroeconomic variables, there is also an interest in time-varying parameter (TVP) regression models and multi-equation extensions such as time-varying parameter Vector Autoregressions (TVP-VARs).This combination of large numbers of explanatory variables with TVPs can lead to regressions with a huge number of parameters.But such regressions are often sparse, in the sense that most of these parameters are zero.In this context, Bayesian methods have proved particularly useful since Bayesian priors can be used to find and impose this sparsity, leading to more accurate inferences and forecasts.A range of priors have been suggested for high-dimensional regression models (see, among many others, Ishwaran and Rao, 2005;Park and Casella, 2008;Griffin and Brown, 2010;Carvalho et al., 2010;Bhattacharya et al., 2015).There is also a growing literature which extends these methods to the TVP case.Examples include Belmonte et al. (2014), Kalli and Griffin (2014), Eisenstat et al. (2016), Kowal et al. (2019), Petrova (2019), Kalli and Griffin (2019), Knaus et al. (2021), Chan et al. (2020), Hauzenberger et al. (2022) and Fischer et al. (2023).
Most of these papers assume particular forms of parameter change (e.g., it is common to assume parameters evolve according to random walks) and use computationallydemanding Markov Chain Monte Carlo (MCMC) methods.The former aspect can be problematic (e.g., if parameter change is rare and abrupt, then a model which assumes all parameters evolve gradually according to random walks is inappropriate).The latter aspect means these methods are not scalable (i.e., MCMC-based methods cannot handle models with huge numbers of coefficients).
The contributions of the present paper relate to issues of prior elicitation and computation in TVP regressions.With regards to prior elicitation, we develop novel dynamic shrinkage priors for TVP regressions.These modify recent approaches to dynamic shrinkage priors in papers such as Kowal et al. (2019).We work with the static representation of the TVP regression model which breaks the coefficients into two groups.One group contains constant coefficients (we call these α).The other, which we call β, are TVPs.
In the static representation, the dimension of β can be enormous.Our dynamic globallocal shrinkage priors are carefully designed to push unimportant elements in β to zero in a time-varying fashion.This is done using a global shrinkage parameter that varies over time as well as local shrinkage parameters.The global shrinkage parameter has an interpretation similar to a dynamic factor model with a single factor.This single factor can be used to find periods of time-variation in coefficients and periods when they are constant.Since the assumption of a common volatility factor hampers the use of standard stochastic volatility MCMC algorithms based on a mixture of Gaussians approximation (Kim et al., 1998), we propose a simple approximation that works particularly well in high dimensional settings.
With regards to computation, we develop a scalable MCMC algorithm.This algorithm is suitable for cases where the posterior for β, conditional on the other parameters in the model, is Gaussian.This occurs for a wide range of global-local shrinkage priors including the dynamic shrinkage priors used in this paper.In this case, the exact MCMC algorithm of Bhattacharya et al. (2016) is the state of the art.1However, even it is too computationally slow to handle the huge number of regressors that appear in the static representation of the TVP regression model.Recently, Johndrow et al. (2017) has proposed an approximate algorithm based on this exact algorithm which is computationally much more efficient in sparse models and, thus, is scalable.
In our paper, it is precisely this scalable MCMC algorithm which forms the basis of the algorithm we use.It involves a thresholding step (described below) which we implement in a different manner than Johndrow et al. (2017).In particular, as opposed to fixing the threshold to a small number, we set it adaptively.Since this would typically imply a number of thresholds that match the dimension of β, we use a method called Signal Adaptive Variable Selection (SAVS), see Ray and Bhattacharya (2018), to determine the thresholds in a novel way.SAVS has the advantage of being computationally fast and easy to implement.Recent papers use SAVS for determining variable relevance (Hahn and Carvalho, 2015), portfolio applications (Puelz et al., 2020) or improving macroeconomic forecasts (Huber et al., 2021).We solely use SAVS to identify which variables can be safely set to zero in order to construct an approximate posterior distribution for the TVPs.Thus, the use of SAVS in the context of the algorithm of Johndrow et al. (2017) provides two-fold benefits: computational improvements and more flexibility due to its adaptive nature.
We investigate the use of our methods in artificial and real data.The artificial data exercise demonstrates that our scalable algorithm is a good approximation to exact MCMC and that its computational benefits are substantial.Our application to the eurozone yield curve shows how our methods can effectively pick out small amounts of occasional parameter change in some parameters.Furthermore, allowing for such change in the coefficients improves forecasts.
The remainder of the paper is organized as follows.The second section defines the TVP regression and TVP-VAR models used in this paper.The third section discusses MCMC methods for the regression coefficients and introduces our computationally-efficient approximate method.Section 4 develops different dynamic shrinkage priors and discusses Bayesian estimation.This section also describes a novel method for drawing the volatilities in the context of a multivariate stochastic volatility process with a common factor.
Sections 5 and 6 present our artificial data exercise and our empirical application, respectively.Section 7 summarizes and concludes.

Static Representation of a TVP Regression 2.1 A TVP regression
The static representation of a TVP regression model involving a T -dimensional dependent variable, y, and a T × K-dimensional matrix of predictors, X is: where α is a K-dimensional vector of time-invariant coefficients, β t is a K × 1 vector of time-varying coefficients and L = diag(σ 1 , . . ., σ T ) with σ t denoting time-varying error volatilities.The TVP part of this model arises through the W β term.W is a T × k(= T K) matrix given by: with x t denoting a K-dimensional sub-vector of X. Equation 1 is simply a regression which leads to the terminology static representation.But it is a regression with an enormous number of explanatory variables.
Note that (2) implies that the TVPs are mean zero and uncorrelated over time.However, extensions to other forms can be trivially done through a re-definition of W .For instance, if we are interested in random walk-type behavior in the TVPs, we can set This specification implies that β can be interpreted as the changes in the parameters and multiplication with W yields the cumulative sum over β.In our empirical exercise, we consider both of these specifications for W and refer to the former as the flexible (FLEX) and the latter as the random walk (RW) specification.
The existing literature using Bayesian shrinkage techniques typically uses MCMC methods.Exact MCMC sampling, however, quickly becomes computationally cumbersome since k is extremely large even for moderate values of T and K.
Various solutions to this have been proposed in the literature.The standard solution is simply not to work with the static representation, but instead make some parametric assumption about how the TVPs evolve (e.g., assume they follow random walks or a Markov switching process).Unless K is extremely large, exact MCMC methods are feasible.However, with macroeconomic data it is common to find strong evidence of changes in the conditional variance of a series, but much less evidence in favor of change in the conditional mean of a series, (see, e.g., Clark, 2011).When K is large, it is plausible to assume that only some of the predictors have time-varying coefficients and, even for these, coefficient change may only rarely happen.In this paper, we develop another approach which should work particularly well when β is extremely sparse.This is the scalable MCMC method, based on posterior perturbations, of Johndrow et al. (2017).

Extension to a TVP-VAR
Before discussing the scalable MCMC algorithm, we note that methods developed for the TVP regression can also be used for the TVP-VAR if it is written in equation-by-equation form (see, for instance, Carriero et al., 2019;Huber et al., 2021).In particular, we can use the following structural representation of the TVP-VAR: with y t being an M -dimensional vector of endogenous variables, c t denoting an Mdimensional vector of intercepts, A pt , for p = 1, . . ., P , denoting an M × M -dimensional time-varying coefficient matrix that may be stacked in a matrix A t = (A 1t , . . ., A P t ).
Furthermore, t is an M -dimensional vector of errors and Σ t = diag (σ 2 1t , . . ., σ 2 M t ) refers to its diagonal time-varying covariance matrix.Finally, A 0t defines contemporaneous relationships between the elements of y t and is lower-triangular with zeros on the diagonal.
The i th (i = 2, . . ., M ) equation of y t can be written as a standard TVP regression model: Here, x it is a K i (= M P +i)-dimensional vector of covariates with x it = (1, {y jt } i−1 j=1 , y t−1 , . . ., y t−P ) , ) denotes a K i -dimensional vector of time-varying coefficients, with c it referring to the i th element in c t , a ij,0t denoting the (i, j) th element of A 0t and A i•,t referring to the i th row of A t .For i = 1, x 1t = (1, y t−1 , . . ., y t−p ) and ) .Thus, the TVP-VAR can be written as a set of M independent TVP regressions which can be estimated separately using the MCMC methods described in the following section.An additional computational advantage arises in that the M equations can be estimated in parallel using multiple CPUs.
Depending on the particular choice of W , this model nests a variety of commonly used specifications in the literature.For instance, if W implies a random walk behavior of the latent states we arrive at a TVP-VAR closely related to the one proposed in Primiceri (2005).As we will show below, the main difference is that we have a more flexible state equation by allowing for heteroskedasticity in the shocks to the states through dynamic shrinkage priors.Another model that is closely related to ours is the one proposed in Cogley et al. (2010).This model assumes that the variances of the state innovations evolve according to independent stochastic volatility models.
3 Scalable MCMC Algorithm for a Large TVP Model (2017) is perfectly suited for.Thus, we use this algorithm for β.Every model used in the empirical application also includes stochastic volatility.
In the following section, we develop an MCMC algorithm to produce draws of L. Since there is nothing new in our MCMC algorithm for α and our algorithm for drawing L is discussed later, in this section we will proceed conditionally on them and work with the transformed regression involving dependent variable ỹ = L −1 (y − Xα) and explanatory variables W = L −1 W .The appendix provides full details of our MCMC algorithm.In this section, we will also assume that the prior on β is (conditional on other parameters) Gaussian with mean zero and a diagonal prior covariance matrix D 0 = diag(d 1 , . . ., d k ).
Many different global-local shrinkage priors have this general form and, in the following section, we will suggest several different choices likely to be well-suited to TVP regressions.
The exact MCMC algorithm of Bhattacharya et al. (2016) for drawing β proceeds as follows: 1 Bhattacharya et al. (2016) show that this algorithm is fast compared to existing approaches which involve taking the Cholesky factorization of the posterior covariance matrix.However, it can still be slow when k is very large.The computational bottleneck lies in the calculation of Γ = W D 0 W which has computational complexity of order O(T 2 k).In macroeconomic or financial applications involving hundreds of observations, T 2 k = T 3 K can be enormous.Johndrow et al. (2017) and Johndrow et al. (2020) propose an approximation to the algorithm of Bhattacharya et al. (2016) which, in sparse contexts, will be much faster and, thus, scalable to huge dimensions.The basic idea of the algorithm is to approximate the high-dimensional matrix Γ by dropping irrelevant columns of W so as to speed up computation.To be precise, Steps 4 and 5 of the algorithm are replaced with Here, WS denotes a T × s-dimensional sub-matrix of W that consists of columns defined by a set S and D 0,S is constructed by taking the diagonal elements of D 0 also defined by S.
Let S = {j : δ j = 1} denote an index set with δ j being the j th element of a k-dimensional selection vector δ with elements δ j = 1 with probability p j and δ j = 0 with probability (1 − p j ).Johndrow et al. (2017) approximates δ j by setting δj = 0 if d j ∈ (0, ξ] for ξ being a small threshold.Computational complexity is reduced from O(T 2 k) to O(T 2 s), where s = k j=1 δ j is the cardinality of the set S or equivalently the number of non-zero parameters in β.
Step 5* yields a draw from the approximate posterior p(β|•) with the • notation indicating that we condition on the data and the remaining parameters in the model.
The algorithm requires a choice of a threshold for constructing δ.Johndrow et al.
(2017) suggest simple thresholding rules that seem to work well in their work with artificial data (e.g., recommendations include setting the threshold to 0.01 when explanatory variables are largely uncorrelated, but 10 −4 when they are more highly correlated).However, choosing the threshold might be problematic for real data applications and can require a significant amount of tuning in practice.Instead we propose to choose the thresholds in a different way using SAVS.
To explain what SAVS is and how we use it in practice, note first that papers such as Hahn and Carvalho (2015) recommend separating out shrinkage (i.e., use of a Bayesian prior to shrink coefficients towards zero) and sparsification (i.e., setting the coefficents on de-selected variables to be precisely zero so as to remove them from the model) into different steps.First, MCMC output from a standard model (e.g., a regression with global-local shrinkage prior) is produced.Secondly, this MCMC output is then sparsified by choosing a sparse coefficient vector that minimizes the distance between the predictive distribution of the shrunk model and the predictive density of a model based on this sparse coefficient vector plus an additional penalty term for non-zero coefficients.This assumption is critically based on assuming normally distributed shocks.The optimal solution, β, is then a sparse vector which can be used to construct δ.
The advantages of this shrink-then-sparsify approach are discussed in Hahn and Carvalho (2015) and, in the context of TVP regressions, in Huber et al. (2021).One important advantage is that estimation error is removed for the sparsified coefficients.When using global shrinkage priors in high dimensional contexts with huge numbers of parameters, small amounts of estimation error can build up and have a deleterious impact on forecasts.
By sparsifying, estimation error in the small coefficients is eliminated, thus improving fore-casts.This paper differs from the aforementioned papers by using SAVS to approximate the indicators δ which is then used in our approximate MCMC algorithm.
The SAVS algorithm, developed in Ray and Bhattacharya (2018), is a fast method for solving the optimization problem outlined above, making it feasible to sparsify each draw from the posterior of β.In the present context, our contention is that a strategy which uses SAVS to shrink-then-sparsify our coefficients can be used to provide a sensible estimate of δ that does not lead to a deterioration in forecast accuracy.Using SAVS, we first produce a sparsified draws β.2For each draw β = ( β1 , . . ., βk ) , we then set Each draw of δj is used in the construction of Γ in the MCMC algorithm of Johndrow et al. (2017) described above.We will refer to this algorithm as being approximate to distinguish it from the exact algorithm of Ray and Bhattacharya (2018).
4 Bayesian Estimation and Inference

Dynamic global-local shrinkage priors
For the time-invariant coefficients, α, we use a horseshoe shrinkage prior (Carvalho et al., 2010).Since the properties of this prior are familiar and posterior simulation methods for this prior are standard, we do not discuss it further here.See the appendix for additional details.
The important contribution of the present paper lies in the development of a dynamic extension of the horseshoe prior for β.We modify methods outlined in Kowal et al.
(2019) to design a prior which reflects our beliefs about what kinds of parameter change are commonly found in macroeconomic applications.In particular, we want to allow for a high degree of sparsity in the TVPs.That is, we want a prior that allows for the possibility that parameter change is rare and may occur for only some coefficients in the regression.
There may be periods of instability when parameters change and times of stability when they do not.A dynamic global-local shrinkage prior which has these properties is: where β t = (β 1t , . . ., β Kt ) denotes the coefficients at time t, τ denotes a global shrinkage parameter that pushes all elements in β towards zero, λ t is a time-specific shrinkage factor that pushes all elements in β t towards zero and φ jt is a coefficient and time-specific shrinkage term that follows a half-Cauchy distribution.
Thus, the prior covariance matrix of β t is given by: which implies that λ t acts as a common factor that aims to detect periods characterized by substantive amounts of time variation.
The main innovation of this paper lies in our treatment of this common factor.Before we discuss the precise specifications for λ t , it is worth summarizing the key innovation of this prior.As opposed to the dynamic horseshoe of Kowal et al. (2019), we only introduce persistence in the common shrinkage factor λ t .The key point to note here is that, as opposed to assuming a dynamic law of motion for the coefficient-specific prior scaling parameters, we borrow strength from the cross-sectional dimension and by doing this we substantially reduce the computational burden necessary.
For the global shrinkage parameter we consider four different laws of motion.The first and second of these involve setting g t = log(τ λ t ) and assuming it follows an AR(1) process: with µ = log τ .We consider two possible distributions for ν t .In the first of these it follows a four parameter Z-distribution, Z(1/2, 1/2, 0, 0), leading to a variant of the dynamic horseshoe prior proposed in Kowal et al. (2019) (henceforth labeled dHS svol-Z).
The second of these follows a Gaussian distribution, leading to a standard stochastic volatility model for this prior variance (labeled dHS svol-N).This model resembles the one stipulated in Cogley et al. (2010) but with a single dynamic volatility process.Both of these processes imply a gradual evolution of g t and thus a smooth transition from times of rapid parameter change to times of less parameter change.
The third and fourth specifications allow for more abrupt change between times of stability and times of instability.They assume that λ t is a regime switching process with: We also include a fifth specification by setting λ t = 1 for all t.We refer to this setup as the static horseshoe prior (abbreviated as sHS).For these last three specifications (i.e., the ones that do not assume λ t to evolve according to an AR(1) process), we use a half-Cauchy prior on √ τ ∼ C + (0, 1).

Markov Chain Monte Carlo (MCMC) algorithm
For all of these models, Bayesian estimation and prediction can be done using MCMC methods.In this sub-section we mainly focus on how to sample λ t under the assumption that it evolves according to an AR(1) process.For this step we propose a simple and accurate approximation that renders the corresponding hierarchical model linear and conditionally Gaussian.We only briefly discuss the remaining steps since most of them are standard in the literature.
For the time varying regression coefficients, the scalable algorithms (with or without sparsification) of the preceding section, based on Johndrow et al. (2017), can be used.
The only modification is that we construct D 0 as follows: with λ t depending on the specific law of motion adopted.Most of the prior hyperparameters introduced in this section have posterior conditionals of standard forms.These are given in the appendix.
Sampling λ t for the specifications that assume it to be binary is also straightforward and can be carried out using standard algorithms.To sample from the posterior of λ t under the assumption that it evolves according to an AR(1) process, the algorithm proposed in Jacquier et al. (1995) can be used.However, since this algorithm simulates the λ t 's one at a time mixing is often an issue.A second option would be to view the prior (after squaring each element of β t and taking logs) as the observation equation of a dynamic factor model.This strategy, however, would be computationally challenging for moderate to large values of K.As a solution, we propose a new algorithm that is straightforward to implement and, if K is large, has good properties.
Let βt be a K-dimensional vector of normalized TVPs with typical element βjt = β jt /(φ jt τ 1/2 ).Using (5) and squaring yields: with Notice that ν t follows a χ 2 distribution with K degrees of freedom, denoted by χ 2 K .This implies that sampling algorithms that rely on the Gaussian mixture approximation proposed in Kim et al. (1998) cannot be used.
Instead we approximate the χ 2 K using a well-known limit theorem that implies, as K → ∞, This approximation works if K is large.In our case, K is often large.For instance, in the largest TVP-VAR model we consider, K is around 100.Since we estimate the TVP-VAR one equation at a time, values of this order of magnitude hold in each equation and the approximation is likely to be good.But if one were to do full system estimation of the TVP-VAR, there are on the order of M K VAR coefficients at each point in time and the approximation would be even better.
Substituting the Gaussian approximation into (7) and taking logs yields: Finally, under the assumption that ( √ 2Kq t + K) > 0 and by using a Taylor series expansion,3 we approximate log vt with a N (log(K) − 1/K, 2/K) to render (8) conditionally Gaussian.This implies that any of the standard algorithms proposed in the literature on Gaussian linear state space models can be used.In this paper, we simulate log λ t using the precision sampler outlined, for example, in Chan andJeliazkov (2009) andMcCausland et al. (2011).
The accuracy of this approximation for different values of K is illustrated in Figure 1.
From this figure it is clearly visible that, if K is greater than 5, our approximation works extremely well.In these cases, there is hardly any difference visible between the log χ 2 K and the single-component Gaussian distribution.For K = 1 (the most extreme case) and K = 5, some differences arise which mainly relate to the left tail of the distribution.
However, already for K = 5 these differences are so small that we do not expect them to have any serious consequences on our estimates of λ t , even for small values of K.

Illustration Using Artificial Data
In this section we illustrate the merits of our approach using synthetic data.Notes: This figure illustrates the approximation error resulting from approximating the error distribution (which is log χ 2 K ) with a single-component Gaussian with mean log(K) − 1/K and variance 2/K.For different values of K, the blue shaded areas show the exact error distribution, while the red shaded areas indicate the approximate error distribution.

How does our algorithm compare to exact MCMC?
We start by showing that using our approximate (sparsified) algorithm yields estimates that are close to the exact ones in terms of precision.This is achieved by considering five different data generating processes (DGPs).These are all based on Equation ( 1 The different DGPs assume that the states evolve as follows: • dense gradual: • dense mixed: • medium-dense gradual: • sparse abrupt: β t ∼ N (β t−1 , I K ) with P rob(d t = 1) = 0.02, • no TVPs: β t = 0 K×1 for all t.
The remaining parameters are set as follows: X j ∼ N (0, I T ) for j = 1, . . ., K. Based on these, we use the true path of the parameters β t to obtain a realization of y t .In all simulation experiments and for all models considered we simulate 2, 500 draws from the joint posterior of the parameters and latent states and discard the first 500 draws as burn-in.
We investigate the accuracy of our scalable approximate MCMC methods relative to the exact MCMC algorithm of Bhattacharya et al. (2016) (i.e., it is the version of our algorithm which imposes δ j = 1 for all j).Table 1 shows the ratio of mean absolute errors (MAEs), computed using the posterior mean of {β t } T t=1 and the true parameters, for the approximate relative to the exact approach for the five priors averaged over the five DGPs.
With one exception, MAE ratios are essentially one indicating that the approximate and exact algorithms are producing almost identical results.The one exception is for the DGP which does not have any TVPs.For this case, the approximate algorithm is substantially better than the exact one.This is because our approximate algorithm uses SAVS which (correctly for this DGP) can set the TVPs to be precisely zero.In this case, draws from the posterior will coincide with draws from the prior that induce heavy shrinkage.
Hence, compared to the exact model, the likelihoood does not influence the prior and more shrinkage can be achieved.
Thus, Table 1 shows that, where there is substantial time variation in parameters, the approximation inherent in our scalable MCMC algorithm is an excellent one, yielding results that are virtually identical to the slower exact algorithm.The table also shows the usefulness of SAVS in cases of very sparse DGPs.Notes: Numbers are averages based on 20 replications from each of the DGPs.
5.2 How big are the computational gains of our algorithm?
Our second artificial data experiment is designed to investigate the computational gains of our algorithm relative to exact MCMC for various choices of K, T , degrees of sparsity and data configurations.Since we are only interested in computation time we just generate one artificial data set for each of two different ways of specifying W .The random numbers refered to below are drawn from the standard Gaussian distribution.
For K = 1, ..., 400 and T ∈ {100, 200} we randomly draw a y and an X.The W is drawn in two ways which correspond to the flexible and random walk specifications of equations ( 2) and (3), respectively.
In terms of sparsity, we consider four scenarios based on how we choose WS : • 100% dense: WS = W .This is the exact algorithm.
Figure 2 depicts the computational advantages of our approximate MCMC algorithm relative to the exact algorithm of Bhattacharya et al. (2016).It shows the time necessary to obtain a draw of β.It can be seen that when the TVPs are highly correlated over time as with the random walk specification, then our scalable algorithm has substantial computational advantages relative to the exact algorithm particularly for large K and in sparse data sets.When the TVPs are uncorrelated the computational advantages of our approach relative to the exact algorithm are smaller, but still appreciable.4Notes: The figure shows the estimation time in seconds required to obtain a draw for K time-varying coefficients for different degrees of overall sparsity (i.e., 1%, 10%, 50%, and 100% dense).The dots refer to the empirical run times for which we fit a nonlinear trend (indicated by the solid lines).The red colored dots and red solid lines indicate run times of the exact algorithm (100% dense, with s = k).
6 Empirical Application using Eurozone Yield Data

Data overview and specification issues
We illustrate our methods using a monthly data set of 30 government bond yields in the euro area (EA).As opposed to forecasting standard US macroeconomic time series such as output, inflation and unemployment rates, forecasting EA government bond yields is challenging due to, at least, three reasons.The first is that the researcher has to decide on the segment of yield curve she is interested in or use techniques that allow for analyzing the full term structure of government bond yields.Following the latter approach leads to overfitting issues whereas the former approach might suffer from omitted variable bias.
The second challenge is that these time series are often subject to outliers as well as sharp shifts in the conditional variance.The final reason is that the time series we consider are we are coding using sparse algorithms.In the flexible specification for W , the underlying matrices are block-diagonal and thus exact sampling is already quite fast.
rather short and in such circumstances TVP-VARs risk overfitting if the estimates of the TVPs are not regularized sufficiently.We expect that the techniques proposed in this paper are capable of handling both issues well.
We use monthly yield curve data obtained from Eurostat.This dataset includes the yield to maturity of a (hypothetical) zero coupon bond on AAA-rated government bonds of eurozone countries for 30 different maturities.These maturities range from one-year to 30-years and span the period from 2005:01 to 2019:12.
If we wish to model all 30 yields jointly we have to estimate a TVP-VAR with M = 30 equations, a challenging statistical and computational task which we will take on in the next sub-section.Since the parameter space of such a model is vast and difficult to interpret, in this sub-section where we present some in-sample results, we will use a small-scale example.This model is based on the Nelson-Siegel three factor model (see, e.g., Nelson and Siegel, 1987;Diebold et al., 2006) and assumes that the yield on a security with maturity t, labeled r t (t), features a factor structure: Here, L t , S t and C t refer to the level, slope and curvature factor, respectively, while η t (t) denotes maturity-specific measurement errors which are independent across maturities and feature variance σ 2 η (t).ζ denotes a parameter that controls the shape of the factor loadings.Following Diebold et al. (2006), we set ζ = 0.7308 (12 × 0.0609).Since the loading of the level factor is one for all maturities and does not feature a discount factor, it defines the behavior at the long end of the yield curve.Moreover, the slope factor mainly shapes the short end of the yield curve and the curvature factor defines the middle part of the curve.The latent yield curve factors are obtained by running OLS on a tby-t basis.These estimates are then consequently used as our endogenous variables by setting y t = (L t , S t , C t ) and estimating the TVP-VAR defined in (4).We use the flexible specification for W in (2) and the approximate algorithm to estimate the model.In addition, we set the lag length to two.After obtaining forecasts for y t , we use (9) to map the factors back to the observed yields.It is worth noting that (9) constitutes an observation equation which links the observed yields to the latent Nelson-Siegel factors.
To compute predictive densities, we also take the corresponding measurement errors into account by estimating the measurement error variance independently for each observed series.

In-sample results
To provide some information on the amount of time variation, Figures 3 and 4  The main impression provided by Figure 3 and 4 is that there is little evidence of strong time-variation in the parameters when using this data set.However, there does seem to be some in the sense that there are many variables and time periods where the PIPs are appreciably above zero.That is, even though the figures contain a lot of white (PIPs essentially zero) and just a handful of deep reds (PIPs above one half), there is a great deal of pink of various shades (e.g., PIPs 20%-30%).This is consistent with time-variation being small, episodic and only occurring in some coefficients.

Forecast exercise
The dataset covers the entire yield curve and includes yields from one-year to thirty-year bonds in one-year steps.We choose {1y, 3y, 5y, 7y, 10y, 15y, 30y} maturities as our target variables that we wish to forecast and consider one-month and one-quarter ahead as forecast horizons.We use a range of competing models that differ in terms of how they model time-variation in coefficients and the number of endogenous variables they have.
All models feature stochastic volatility in the measurement errors and have two lags.We also offer comparison between the two MCMC algorithms: exact and approximate.
In terms of VAR dimension, we have large TVP-VARs and VARs with all 30 maturities (M = 30) as well as the three factor Nelson-Siegel model described in the previous subsection (M = 3).
In terms of time variation specified through the likelihood function (i.e., through the definition of W ), we consider the flexible (FLEX) and random walk (RW) specifications defined in ( 2) and (3).In terms of time variation specified through the prior, we consider the five global-local shrinkage priors (four dynamic and one static) given in Sub-section 4.1.
In addition, we consider as a competitor the conventional TVP-VAR setup of Primiceri (2005).We estimate the TVP-VAR only for the Nelson-Siegel model since the original prior overfits in higher dimensions.5 We also have VAR models where coefficients are constant over time.For these we do two versions, one with a Minnesota prior (MIN) and the other a horseshoe prior (HS).
These models are estimated by setting β = 0 and then using the sampling steps for α detailed in the appendix.For the Minnesota prior, we use a non-conjugate version that allows for asymmetric shrinkage patterns and integrate out the corresponding hyperparameters within MCMC.
To evaluate one-month and one-quarter ahead forecasts, we use a recursive prediction design and split the sample into an initial estimation period that ranges from 2005:01 to 2008:12 and a forecast evaluation period from 2009:01 to 2019:12.We use Root Mean Squared Forecast Errors (RMSEs) as the measure of performance for our point forecasts and Continuous Ranked Probability Scores (CRPSs, Gneiting and Raftery, 2007) as the measure of performance of our density forecasts.Both are presented in ratio form relative to the benchmark model which is the large VAR with Minnesota prior.Values less than one indicate an approach is beating the benchmark.
We present our forecasting results in two tables.Table 2 shows the one-month ahead forecast performance of the different models while Table B.1 in the appendix shows the one-quarter ahead forecasting results.Our focus on one-step ahead forecasts is predicated by the fact that the density forecast measures based on proper scoring rules (such as CRPSs) can be viewed as a training sample marginal likelihood and thus enables model comparison (see Gneiting and Raftery, 2007).
Overall, the evidence in Table 2 (and Table B.1) is mixed, with no single approach being dominant.In principle, one robust pattern is that models with TVPs tend to produce more accurate forecasts than the large VAR with stochastic volatility benchmark.
These gains range from being rather small (particularly at the short-end of the yield curve) to appreciable (when the focus is on the long-end of the yield curve).This is consistent with recent findings in Fischer et al. (2023) who document that flexible models work well for this particular dataset when longer maturities are considered.
If we compare results for the large TVP-VARs to results for the smaller TVP-VARs based on the Nelson-Siegel factors reveals that both specifications produce forecasts of similar quality.When forecasting one-month ahead and focusing on the CRPS as a measure of forecast performance, the best average forecast performance is produced by one of the large TVP-VARs.But when we focus on point forecasting performance, one of the NS models emerges as the best forecasting model.This finding indicates that using more information in an unrestricted manner seems to exert benign effect on higher order moments of the predictive density whereas for the first moment the effect is negligible (or even negative).Interestingly, this finding only holds for one-month ahead predictive densities.When we focus on one-quarter ahead forecasts (see Table B.1 in the appendix), this result is reversed with CRPSs indicating one of the NS models is forecasting best and RMSEs indicating one of the large TVP-VARs is forecasting best.
The comparison of the different choices for W also yields a mixed pattern of results.
At the short end of the yield curve the RW specification tends to forecast better, but at the longer end the FLEX specification does better.It is interesting to note, however, that the good performance for RW occurs with a large TVP-VAR whereas for the FLEX specification it occurs for a Nelson-Siegel version of the model.
In terms of which of our dynamic horseshoe priors forecasts best, it does seem to be the priors which assume λ t to exhibit rapid change between values forecast better than the gradual change of the stochastic volatility specifications.That is, the Markov switching or mixture versions of the prior, dHS MS and dHS Mix, tend to forecast better than dHS svol-Z or dHS svol-N.Although there are several exceptions to this pattern.
At this point it is also worth highlighting that the original Primiceri (2005) model is outperformed by our shrinkage specifications in all segments of the yield curve.This suggests that using proper shrinkage priors on the state innovation variances and allowing for dynamic shrinkage pays off.
Thus, overall (and with several exceptions) we have a story where, in this data set, time variation in the regression coefficients is present and there are gains to be made from capturing them.This can be seen by noting that the constant parameter VARs with stochastic volatility are never the best performing specifications across the different maturities and also for both time horizons we consider.As we have shown in the previous sub-section, this time variation is episodic (rather than gradually evolving) and only occurs occasionally and for some of the coefficients.However, ignoring this time variation and using constant parameter models leads to a deterioration in forecasts in almost all situations.
In terms of computation, our scalable algorithm does seem to work well.If we compare results from the exact MCMC algorithm to our approximate (non-sparsified) algorithm, it can be seen that using the computationally-faster approximation is not leading to a deterioration in forecast performance.In fact, there are some cases where the approximate forecasts are better than their exact counterparts.Notes: This table displays the one-step ahead forecast performance for non-sparsified models.We focus on seven maturities (1y, 3y, 5y, 7y, 10y, 15y, and 30y) as our target variables and use a hold-out period from 2009:01 to 2019:12.Point forecast performance is measured by relative root mean square errors (RMSEs), while density forecast performance (shown in parentheses) by relative continuous ranked probability scores (CRPSs).We consider two different models in terms of the dimension of the (TVP-)VARs: a large model including all 30 maturities (M = 30) and a small model specified as a three factor Nelson-Siegel model (M = 3).For the main TVP-VARs, we consider a flexible and a RW specification of W , each with five different global-local shrinkage priors (four dynamic and one static).These TVP-VARs are estimated with two different algorithms: our proposed approximate approach and an exact algorithm.In addition, we consider the conventional TVP-VAR setup of Primiceri (2005) for the Nelson-Siegel model and a set of VARs with constant coefficients.For the VARs with constant parameters, we adopt either a Minnesota or a horseshoe (HS) shrinkage prior.As overall benchmark model we choose a large VAR with constant parameters and a Minnesota prior.The red shaded rows correspond to the actual RMSE and CRPS values of this benchmark model, while the grey shaded rows correspond to models for which we use our approximate (but non-sparsified) MCMC algorithm.The best performing specification is in bold.j (j = 1, . . ., K).Using these, the posterior of ψ j follows an inverted Gamma distribution: where α j denotes the j th element of α.The posterior of j is also inverse Gamma distributed For the global shrinkage parameter, we introduce yet another auxiliary quantity α .This enables us to derive a conditional posterior for τ α which is also inverse Gamma distributed: and the posterior of α being given by: The local shrinkage parameters φ jt can be simulated conditionally on τ and {λ t } T t=1 similarly to the ψ j 's.Specificially, the posterior distribution of φ 2 jt follows an inverse Gamma: with ϑ jt denoting yet another scaling parameter that follows a inverse Gamma posterior distribution: If we do not assume λ t to evolve according to an AR(1) process, we sample the global shrinkage parameter τ similar to τ α .The conditional posterior of τ also follows an inverse Gamma: with the posterior of the auxiliary variable given by:

A.4 Sampling the Dynamic Shrinkage Parameters
As stated in Sub-section 4.1, the full history of λ t in the case that it follows a mixture or Markov switching specification can be easily obtained through standard techniques.More precisely, if d t in (6) follows a Markov switching model, we adopt the algorithm discussed in, e.g., Kim and Nelson (1999b,a).The posterior of the transition probabilities is Beta distributed: whereby T ij denotes the number of times a transition from state i to j has been observed in the full history of d t .
In the case of the mixture model, the posterior distribution of d t follows a Bernoulli distribution for each t: with p t given by: .
and the posterior of p follows a Beta distribution p| Finally, in the case that λ t evolves according to an AR(1) process with Gaussian shocks, we use precisely the same algorithm as Kastner and Frühwirth-Schnatter (2014) for simulating µ and ρ.In the case that we use Z-distributed shocks, the algorithm proposed in Kowal et al.
(2019) is adopted.This implies that we use Polya-Gamma (PG) auxiliary random variables to approximate the Z-distribution using a scale-mixture of Gaussians.Essentially, the main implication is that conditional on the T PG random variates, the parameters of the state evolution equation can be estimated similarly to the Gaussian case after normalizing everything by rendering the AR(1) conditionally homoscedastic.For more details, see Kowal et al. (2019).
B Higher-order forecast performance  Notes: This table displays the three-step ahead forecast performance for non-sparsified models.We focus on seven maturities (1y, 3y, 5y, 7y, 10y, 15y, and 30y) as our target variables and use a hold-out period from 2009:01 to 2019:12.Point forecast performance is measured by relative root mean square errors (RMSEs), while density forecast performance (shown in parentheses) by relative continuous ranked probability scores (CRPSs).We consider two different models in terms of the dimension of the (TVP-)VARs: a large model including all 30 maturities (M = 30) and a small model specified as a three factor Nelson-Siegel model (M = 3).For the main TVP-VARs, we consider a flexible and a RW specification of W , each with five different global-local shrinkage priors (four dynamic and one static).These TVP-VARs are estimated with two different algorithms: our proposed approximate approach and an exact algorithm.In addition, we consider the conventional TVP-VAR setup of Primiceri (2005) for the Nelson-Siegel model and a set of VARs with constant coefficients.For the VARs with constant parameters, we adopt either a Minnesota or a horseshoe (HS) shrinkage prior.As overall benchmark model we choose a large VAR with constant parameters and a Minnesota prior.The red shaded rows correspond to the actual RMSE and CRPS values of this benchmark model, while the grey shaded rows correspond to models for which we use our approximate (but non-sparsified) MCMC algorithm.The best performing specification is in bold.
Here, d t denotes an indicator that either follows a Markov switching model (labeled dHS MS) or a mixture specification (labeled dHS Mix) and κ 0 , κ 1 denote prior variances with the property that κ 1 κ 0 .For the Markov switching model, we assume that d t is driven by a (2×2)-dimensional transition probability matrix P with transition probabilities from state i to j denoted by p ij (with p ii ∼ B(a i,M S , b i,M S ), for i = 0, 1, following a Beta distribution a priori).The mixture model assumes that p(d t = 1) = p, with p ∼ B(a M ix , b M ix ).In the empirical application we specify κ 1 = 100/K, κ 0 = 0.01/K, a M ix = a 1,M S = b 0,M S = 3 and b M ix = a 0,M S = b 1,M S = 30.

Figure 1 :
Figure 1: Approximation error of a single-component Gaussian used to approximate a log χ 2 K distribution.
) but make different assumptions about the density and nature of parameter change.Dense DGPs are characterized by having time-variation in a large number of parameters (with sparse DGPs being the opposite of dense).The nature of parameter change can be gradual (e.g., characterized by constant evolution of the parameters) or abrupt.For each of the five DGPs, we simulate a time series of length T = 250 and with K = 50.

Figure 2 :
Figure 2: Time necessary to obtain a draw for K time-varying coefficients.
depicts heatmaps of the posterior inclusion probability (PIPs) for a Nelson-Siegel model with panels a) to d) referring to the four different dynamic priors for λ t .These PIPs are the posterior means of the elements of δ.
Results for our four different dynamic horseshoe priors are slightly different indicating the dynamic prior choice can have an impact on results.A clear pattern emerges only for the dynamic horseshoe prior with a mixture specification.It is finding that small amounts of time-variation occur only for the coefficients on the curvature factor.If the mixture part of the prior is replaced by a Markov switching specification, we tend to find short-lived periods where a small amount of time-variation occurs for all of the coefficients in an equation.But, interestingly, dHS MS finds that different equations have time-variation occuring at different periods of time.Evidence for TVPs is the least when we use stochastic volatility specifications in the dynamic horseshoe priors.For these priors, tiny amounts of time variation (i.e., tiny PIPs) are spread much more widely throughout the sample and across variables.

Figure 3 :
Figure 3: Heatmaps of posterior inclusion probability (PIPs) for time-variation in structural TVP-VAR coefficients with a gradually changing common shrinkage factor.a) dHS svol-Z Grey shaded areas indicate coefficients which do not appear in the model due to the lower triangularity of A 0t .

Figure 4 :
Figure 4: Heatmaps of posterior inclusion probability (PIPs) for time-variation in structural TVP-VAR coefficients with a regime-switching common shrinkage factor.a) dHS Mix

L
Grey shaded areas indicate coefficients which do not appear in the model due to the lower triangularity of A 0t .

Table 1 :
Mean absolute errors of the TVPs relative to exact estimation.

Table 2 :
One-month ahead forecast performance for EA central government bond yields at different maturities using non-sparsified models.

Table B .
1: One-quarter ahead forecast performance for EA central government bond yields at different maturities using non-sparsified models.

Table B
Nelson-Siegel TVP-VAR with the flexible specification for W