Small Sample Adjustment for Hypotheses Testing on Cointegrating Vectors

Abstract: Johansen’s (2000, “A Bartlett Correction Factor for Tests on the Cointegrating Relations,” Econometric Theory 16: 740–78) Bartlett correction factor for the LR test of linear restrictions on cointegrating vectors is derived under the i.i.d. Gaussian assumption for the innovation terms. However, the distribution of most data relating to financial variables is fat-tailed and often skewed; there is therefore a need to examine small sample inference procedures that require weaker assumptions on the innovation term. This paper suggests that using the non-parametric bootstrap to approximate a Bartlett-type correction provides a statistic that does not require specification of the innovation distribution and can be used by applied econometricians to perform a small sample inference procedure that is less computationally demanding than its analytical counterpart. The procedure involves calculating a number of bootstrap values of the LR test statistic and estimating the expected value of the test statistic by the average value of the bootstrapped LR statistic. Simulation results suggest that the inference procedure has good finite sample properties and is less dependent on the parameter space of the data generating process.


Introduction
The procedure for estimating and testing cointegrating relationships described in Johansen (2006) is available in virtually all econometric software packages and is widely used in applied research. Briefly, this method involves maximizing the Gaussian likelihood function and analyzing the eigenvalues and eigenvectors found using the reduced rank regression method. Once the number of cointegrating vectors has been determined, hypotheses on the structural economic relationships underlying the long-run model can be tested using the likelihood ratio (LR) test.
Although the LR test of linear restrictions on cointegrating vectors has the correct size asymptotically, many studies report that the χ² approximation to the finite sample distribution of the LR test can be seriously inaccurate; see, for example, Haug (2002) or Gredenhoff and Jacobson (2001). Broadly speaking, the problem is a lack of coherence between the test statistic and its reference distribution. One way of addressing this problem is to correct the test statistic so that its finite sample distribution is closer to the asymptotic distribution. In this respect, early attempts at correcting the test statistic were made in Podivinsky (1992) and Psaradakis (1994), where small sample corrections based on degrees of freedom were suggested. More recently, Johansen (2002) proposed a Bartlett-type correction factor for the LR statistic and analytically derived the asymptotic expansions needed to calculate the expectation of the test statistic. Multiplying the unadjusted statistic by a factor derived from an asymptotic expansion of the expectation of the test brings the resulting adjusted statistic closer to the χ² distribution, thus reducing the size distortion problem. Simulation results presented by Johansen (2000) suggest that applying this type of correction to the LR test statistic dramatically reduces the finite sample size distortion problem. However, the Bartlett correction factor is predicated on the assumption of Gaussian innovations. When the innovations are non-normal, the correction factor needs to be modified to account for the skewness and kurtosis of the innovations. One way of avoiding such calculations is to use a numerical approximation in place of the analytical Bartlett correction. The first paper to suggest calculating such an approximation in a time series context was Canepa and Godfrey (2007).
In their article the authors proposed computing the Bartlett adjustment for a quasi-LR test using the non-parametric bootstrap as a simple method for generating a non-normality robust small sample inference procedure in the context of ARMA models. The Bartlett corrected quasi-LR test suggested in Canepa and Godfrey (2007) can be used in ARMA models as a misspecification test or, alternatively, as a model specification procedure. The former case involves testing the adequacy of an ARMA model by an overfitting diagnostic check, whereas in the latter case the inference procedure can be used to test the validity of simplifications of an ARMA model. In the multivariate context, Canepa (2016) proposed using the bootstrap to approximate the finite sample expectation of the LR test in place of the analytical Bartlett correction for cointegrated VAR models (see also Hunter, Burke and Canepa 2017). The author also proved the consistency of the suggested test under the assumption that the innovations are independent and identically distributed (i.i.d.). The results in Canepa (2016) are promising: the performance of the bootstrap Bartlett test was found to be less dependent on the values of the parameters of the data generating process and better able to cope with violations of the Gaussian assumption on the innovations than the Johansen (2000) Bartlett corrected LR test. A possible shortcoming of the inference procedure suggested in Canepa (2016) is that the test statistic relies on the i.i.d. assumption. Evidence of violation of this strong assumption appears in many macroeconomic series, such as aggregate consumption and income, in interest rate data, and in nominal and real price variables; see Sensier and van Dijk (2004). Similarly, it is a well-known stylized fact that GARCH-type models fit stock market returns well (see Boswijk et al. 2016; Engle and Rangel 2008; Harvey et al. 2016; among others).
Against this background, in this paper we build on Canepa (2016) and investigate whether the bootstrap Bartlett corrected LR test can be used to reduce the size distortion problem in situations where an analytical solution is difficult or does not work well. If such an application were to be successful, it would have significant practical implications, for several reasons. The bootstrap Bartlett corrected LR test does not rely on the Gaussian assumption for the innovations, and this feature may be appealing to the applied worker. Moreover, simulation results indicate that the analytical correction factor is useful for some parameter values but does not work well for others. As Johansen points out, "the influence of the parameters is crucial [. . . ] There are parameters points close to the boundary where the order of integration or the number of cointegrating relations change, and where the correction does not work well" (cf. Johansen 2000, p. 741). Simulation results in Cavaliere et al. (2015) confirmed that the performance of the test heavily depends on the parameter space (see also Andrews and Guggenberger 2009; Canepa 2006; Cavaliere et al. 2020; Elliott et al. 2015; Lu 2016; McCloskey 2017).
We believe that the dependency on the parameter values may be reduced by computing the Bartlett adjustment using the non-parametric bootstrap. Because the bootstrap method replaces the unknown cumulative distribution function of the LR test statistic with the empirical distribution function of the bootstrap distribution of the same test, the resulting inference procedure may be less sensitive to the values of the parameters of the data generating process (DGP) than a test based on asymptotic critical values. Computing the bootstrap Bartlett correction factor is relatively straightforward. Roughly speaking, the procedure involves calculating a number of bootstrap values of the LR test statistic and estimating the expected value of the test statistic by the average value of the bootstrapped LR statistic. The bootstrap Bartlett method was first proposed in Rocke (1989), where hypothesis testing in seemingly unrelated regression models was considered (see also Jacobson and Larsson 1999). Rocke's simulation results showed that the Bartlett adjustment for the LR test determined using the non-parametric bootstrap was considerably more accurate than the Bartlett adjustment from the second-order asymptotic method of Rothenberg (1984). In the time series context, the idea of using the bootstrap to tackle the problem of finite sample biased estimation is also considered in Engsted and Pedersen (2014), where a bootstrap based bias-correction method is proposed as a simple approach to bias-adjustment in VAR models. The authors show that the bootstrap bias-correction approach yields a large reduction in bias compared to the OLS estimate, with performance similar to that of the analytical bias formula.
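Stripped of estimation details, the bootstrap Bartlett adjustment described above amounts to rescaling the observed LR statistic by the ratio of q (the number of restrictions) to the bootstrap mean of the statistic. A minimal sketch follows; the function and variable names are ours, not the paper's:

```python
import numpy as np

def bootstrap_bartlett_adjust(lr_obs, lr_boot, q):
    """Bartlett-adjust an observed LR statistic with q restrictions.

    The mean of the bootstrap replicates lr_boot estimates E(LR) under
    the null; scaling by q / mean(lr_boot) recentres the statistic on q,
    the mean of its chi-square(q) reference distribution.
    """
    lr_boot = np.asarray(lr_boot, dtype=float)
    return q * lr_obs / lr_boot.mean()

# Toy check: if the replicates are already chi-square(q) distributed,
# their mean is close to q and the adjustment is close to neutral.
rng = np.random.default_rng(0)
q = 2
boot = rng.chisquare(q, size=10000)
adjusted = bootstrap_bartlett_adjust(6.0, boot, q)
```

The adjusted statistic is then compared with the usual χ²(q) critical values.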
The contribution of this paper to the literature is twofold. First, it provides a "feasible" small sample correction factor that can easily be used to calculate the LR test for linear restrictions on the cointegrating space in situations where the analytical Bartlett correction factor would fail. Second, the consistency of the suggested procedure is considered and it is shown that the bootstrap Bartlett statistic converges weakly in probability to the correct asymptotic distribution. Establishing the conditions that ensure asymptotic refinements will be the subject of future research. However, in his seminal article, Beran (1988) concluded that for asymptotically pivotal statistics (i.e. statistics whose limiting distribution does not depend on unknown nuisance parameters), the analytical Bartlett adjustment produces an error in rejection probability of order O(T^{−3/2}). Approximating the finite sample expectation of the LR test using the bootstrap involves substituting a √T-consistent estimate of the statistic, hence the resulting inference procedure should have accuracy of the same order.
The performance of the suggested procedure and of the analytical Bartlett correction under non-normality of the innovations is evaluated by Monte Carlo simulation. Innovation structures typically found in financial data are considered, such as fat tailed and conditionally heteroskedastic (i.e. ARCH and GARCH) innovations. Performance is assessed in terms of the size and power of the inference procedures under consideration. The performance of the suggested procedure is compared with the bootstrap p-value test. Unlike the Bartlett (1937) idea, which replaces the original statistic with a corrected statistic that is closer to the reference distribution, the bootstrap p-value test makes the reference distribution closer to the finite sample distribution. In the literature it has been shown that in many cases the bootstrap can be considered a numerical approximation to the analytical calculation of a one-term Edgeworth expansion (see Hall 1992), where the critical values of the limit distribution are replaced with transformations of critical values obtained from Edgeworth expansions of the distribution function.
This section closes with a brief presentation of the Bartlett correction. The next section introduces the LR test for linear restrictions on the cointegrating space, the Bartlett correction of Johansen (2000), and the two bootstrap inference procedures. In Section 3, the design of the Monte Carlo experiment is explained, and in Section 4, the simulation results are reported. In Section 5, some robustness checks are undertaken. An empirical application is considered in Section 6, and Section 7 contains some concluding remarks.

The Bartlett Correction
The Bartlett correction is based on a simple idea but can be very effective in reducing the finite sample size distortion problem of LR tests. This method takes the form of a correction to the mean of the LR statistic for a given parameter point under the null hypothesis. In regular cases, the asymptotic distribution of the LR statistic is given by Λ = −2 log(LR) ∼ χ²(q), where q is the dimension of the constraints, and the asymptotic mean of the LR statistic ought to be approximately equal to q. The Bartlett correction is intended to make the mean exactly equal to q by replacing Λ with Λ_B = qΛ∕E(Λ) and then referring the resulting statistic to a χ²(q). Typically, given the complicated form of the LR test, it is difficult to find an exact expression for E(Λ), and one can instead find an approximation of the form E(Λ) ≈ q(1 + c(θ)∕T). The resulting quantity Λ_B then has a distribution closer to the limit distribution. In the independent and identically distributed (i.i.d.) setup, the Bartlett correction has been widely studied in the literature since the pioneering work of Lawley (1956), who showed that the finite-sample distribution of a Bartlett-corrected likelihood ratio test statistic is closer to the χ²-distribution than that of the original LR statistic; see Cribari-Neto and Cordeiro (1996) for a review. However, the Bartlett correction is relatively less explored for dependent data. In the time series context, Giersbergen (2009) derived a Bartlett correction factor for testing hypotheses about the autoregressive parameter in the stable AR(1) model with trend and intercept, and showed that Bartlett corrections are useful in controlling the size of the likelihood ratio statistic in small samples (see also Omtzigt 2003).
Similarly, Taniguchi (1991) calculated the corrections for the AR(1) and MA(1) models without nuisance parameters, whereas Van Garderen (1999) used differential geometry and suggested how to compute the Bartlett correction factor geometrically for the case of no nuisance parameter. In the ARMA (autoregressive moving average) context, Lagos and Morettin (2004) derived a Bartlett correction for the likelihood ratio statistic used to test hypotheses about the parameters of a Gaussian stationary and invertible model. Ravishanker et al. (1990) considered the differential geometry of autoregressive fractionally integrated moving average processes and used the properties of Toeplitz forms associated with the spectral density functions of these long memory processes to compute the geometric quantities. The authors investigated the role of these geometric quantities in the Bartlett correction to the likelihood ratio test statistic for the fractional difference parameter. Similarly, Cui (2006, 2007) proved that the empirical likelihood with moment restrictions is Bartlett correctable even in the presence of a nuisance parameter.
Recently, Bartlett-type corrections in unstable autoregressive models have also attracted much attention. For example, Chan and Liu (2010) proved that the Bartlett correction can be used for Gaussian short-memory time series. Chan et al. (2014) extended Chan and Liu (2010) to Gaussian long-memory time series. Similarly, Chen et al. (2016) examined the Bartlett correction factor for the frequency domain empirical likelihood of linear time series models and showed that the factor can be used for non-Gaussian short-memory time series; see also Chan et al. (2014) (for earlier studies see, for example, Bravo 1999; Larsson 1998; Nielsen 1997).

Model and Tests
Consider the p-dimensional VAR model in error correction form

ΔY_t = α(β′Y_{t−1} + ρ′d_t) + Γ_1 ΔY_{t−1} + ⋯ + Γ_{k−1} ΔY_{t−k+1} + ΦD_t + ε_t.   (2)

The matrices of coefficients have the following dimensions: α and β are (p × r); Φ is (p × p_D); ρ is (p_d × r); and Γ_1, …, Γ_{k−1} are (p × p). Also, d_t (p_d × 1) and D_t (p_D × 1) are deterministic terms in (2). Once the cointegrating rank has been established, linear restrictions on the cointegrating space can be tested. We focus on the hypothesis ℋ_0: β = Hφ, where H (p × s) (for r ≤ s ≤ p) is a known matrix specifying that the same restrictions are imposed on all r cointegrating vectors, s is the number of unrestricted parameters, and φ is an (s × r) matrix; see Johansen (1996) for a discussion of tests of other hypotheses. The LR test statistic for ℋ_0 can be obtained from the concentrated likelihood function and is given by

Λ = T ∑_{i=1}^{r} log[(1 − λ̂_i)∕(1 − λ̃_i)],   (3)

where λ̂_i and λ̃_i are the usual eigenvalues implied by the maximum likelihood estimation of the restricted and unrestricted models, respectively. For the null hypothesis ℋ_0: β = Hφ, an approximation to the order T^{−1} for the Bartlett adjustment is given by

E(Λ) ≈ q(1 + c(θ)∕T),   (4)

and the constant c(θ) is given in Johansen (2000). Thus, Λ_B = (1 + c(θ)∕T)^{−1} Λ is the Bartlett corrected LR statistic.
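Given the eigenvalues of the restricted and unrestricted reduced rank regressions, the statistic in (3) is a short computation. A sketch (the function and argument names are ours):

```python
import numpy as np

def lr_restriction_stat(T, eig_restricted, eig_unrestricted, r):
    """LR test of beta = H*phi from the r largest eigenvalues of the
    restricted and unrestricted reduced rank regressions:
        Lambda = T * sum_{i<=r} log[(1 - lam_restricted_i) /
                                    (1 - lam_unrestricted_i)]
    Restricting the model can only lower the eigenvalues, so the
    statistic is non-negative."""
    lam_r = np.sort(np.asarray(eig_restricted, dtype=float))[::-1][:r]
    lam_u = np.sort(np.asarray(eig_unrestricted, dtype=float))[::-1][:r]
    return T * np.sum(np.log((1.0 - lam_r) / (1.0 - lam_u)))

# Illustrative eigenvalues (not from the paper): T = 100, r = 1.
lam_stat = lr_restriction_stat(100, [0.25, 0.10], [0.30, 0.10], 1)
```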
The likelihood ratio test in (3) and the correction in (4) are derived under the assumption that the innovations are ε_t ∼ N(0, Ω). However, the Gaussian hypothesis is often too restrictive for the type of data used in economic applications. The fact that the distribution of most data relating to financial variables, for example (but certainly not exclusively), is fat tailed and often skewed has been extensively documented in the finance literature. Although, under weak conditions, relaxing the Gaussian hypothesis does not affect the asymptotic distribution of Λ, one may expect the finite sample error in rejection probability to be larger. Moreover, when innovations are non-Gaussian, the second terms of the asymptotic expansions of the mean and the variance of Λ depend on the skewness and kurtosis of their distribution. This means that in order to use the analytical Bartlett correction factor, it is necessary to estimate the skewness and kurtosis of the true distribution and modify the Bartlett adjustment accordingly. Rather than undertaking these tedious calculations, it is proposed below that the non-parametric bootstrap be used to approximate the finite sample expectation of Λ. By using the empirical distribution function in place of some specific parametric distribution, the non-parametric bootstrap method does not require that a choice of error distribution be made; this feature is desirable with many types of data. The proposed inference procedure involves undertaking a simulation study using the constrained parameter estimates obtained by solving the eigenvalue problem, conditional on the initial values Y_0 and ΔY_0, as the true values. Given these estimates and any required starting values, bootstrap data can be generated recursively after resampling residuals. From each generated sample, one obtains a bootstrap value of the LR statistic, say Λ*_j, whose average estimates the mean of Λ under the null hypothesis.
An alternative procedure is a straightforward application of the bootstrap p-value approach, where the significance level assigned to Λ is the fraction of the Λ * j greater than Λ. (Note that the subscript " * " is used to indicate the bootstrap analog throughout the paper.)

Bootstrap Algorithms
Bootstrap methods rely on simulations to approximate the finite-sample distribution of the test statistic under consideration. In order to achieve accurate inference, the bootstrap DGP used for drawing bootstrap samples has to mimic the features of the underlying DGP. In this section, we describe the two bootstrap algorithms used to calculate the bootstrap Bartlett corrected test (Λ*_B) and the bootstrap p-value test (Λ*). The former is suitable for the model in (2) when the innovations are i.i.d., whereas the latter is used when innovations are independent but not identically distributed.

Algorithm 1
The steps used to implement the bootstrap algorithm for calculating the bootstrap Bartlett corrected LR test can be summarized as follows:
Step (1): estimate the model in (2) and compute Λ and the estimated restricted residuals ε̂_t.
Step (2): resample the residuals from (ε̂_1, …, ε̂_T) independently with replacement to obtain a bootstrap sample (ε*_1, …, ε*_T). Generate the bootstrap sample using the estimated restricted model given in (6.2).
Step (3): compute Λ*_j using the data of Step (2) and repeat B times.
Step (4): estimate E(Λ) by the average of the bootstrap values, Λ̄* = B^{−1} ∑_{j=1}^{B} Λ*_j, and compute the bootstrap Bartlett corrected statistic Λ*_B = qΛ∕Λ̄*.
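Steps (1)-(3) can be sketched for the simplest case of a VECM with no lagged differences or deterministic terms; the simplification and all names below are ours, not the paper's:

```python
import numpy as np

def iid_bootstrap_sample(y0, alpha, beta, resid, rng):
    """Generate one bootstrap sample from an estimated restricted model
    of the form Delta Y_t = alpha * beta' * Y_{t-1} + e*_t (Step 2),
    with e*_t drawn i.i.d. with replacement from the restricted
    residuals."""
    T, p = resid.shape
    e_star = resid[rng.integers(0, T, size=T)]   # resample rows
    pi = alpha @ beta.T                          # long-run impact matrix
    y_star = np.empty((T + 1, p))
    y_star[0] = y0
    for t in range(T):                           # recursive generation
        y_star[t + 1] = y_star[t] + pi @ y_star[t] + e_star[t]
    return y_star

# Degenerate but deterministic illustration: with alpha = 0 the
# error-correction term vanishes and Y* is a pure random walk driven
# by the resampled residuals (all equal to one here).
rng = np.random.default_rng(1)
y_star = iid_bootstrap_sample(
    np.zeros(2), np.zeros((2, 1)), np.array([[1.0], [-1.0]]), np.ones((5, 2)), rng
)
```

Step (3) then recomputes the LR statistic on each generated sample; averaging the B replicates gives the denominator of the bootstrap Bartlett factor in Step (4).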
As far as the test Λ* is concerned, when innovations are independent and identically distributed with common variance, it is possible to obtain accurate inference by simply resampling the residuals of the estimated restricted model in (2), without the need to make a particular parametric assumption about the distribution of the innovations. Swensen (2006) considers a recursive bootstrap algorithm for testing the rank of Π = αβ′ in (2) and shows that, under a variety of regularity conditions, the non-parametric bootstrap based test is consistent in the sense that the bootstrap statistic converges weakly in probability to the correct asymptotic distribution. This involves repeating Steps (1)-(3) and then following Step (5) below.
Step (5): compute the bootstrap p-value function of the observed value Λ by calculating

P̂*(Λ) = B^{−1} ∑_{j=1}^{B} I(Λ*_j > Λ),

where I(⋅) is an indicator function that equals one if the inequality is satisfied and zero otherwise. The bootstrap p-value test, Λ*, is carried out by comparing P̂*(Λ) with the desired critical level and rejecting the null hypothesis if P̂*(Λ) falls at or below it. Note that the resampling and testing in Algorithm 1 are done once the cointegrating rank has been established. Therefore, for a given cointegrating rank all unit roots have been eliminated.
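Step (5) in code (a minimal sketch; the names are ours):

```python
import numpy as np

def bootstrap_pvalue(lr_obs, lr_boot):
    """P*(Lambda) = B^{-1} * sum_j 1(Lambda*_j > Lambda): the fraction
    of bootstrap replicates exceeding the observed statistic."""
    return float(np.mean(np.asarray(lr_boot, dtype=float) > lr_obs))

def reject(lr_obs, lr_boot, level=0.05):
    """Reject the null when the bootstrap p-value falls at or below
    the chosen significance level."""
    return bootstrap_pvalue(lr_obs, lr_boot) <= level

pv = bootstrap_pvalue(2.0, [1.0, 3.0, 5.0, 1.5])   # 2 of 4 replicates exceed 2.0
```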

Algorithm 2
When the innovations show conditional heteroskedasticity, simply resampling from the residuals fails to mimic essential features of the DGP that originally generated the data. A suitable modification of the residual based bootstrap procedure is the wild bootstrap, which is designed to accommodate the possibility of independent but not identically distributed innovations. The wild bootstrap method was developed by Liu (1988) based on a suggestion in Wu (1986). Regarding time series, Gonçalves and Kilian (2003) proposed a recursive-design implementation of the wild bootstrap for the autoregression model with conditionally heteroskedastic innovations. For cointegrated VAR models, noteworthy are the papers by Cavaliere, Rahbek and Taylor (2010a, 2010b). The wild bootstrap DGP is given by the estimated restricted model with innovations ε*_t = ε̂_t Z_t, where Z_t is specified as a two-point distribution, so that the Z_t terms are mutually independent drawings from a distribution which is independent of the original data and has the properties E(Z_t) = 0 and E(Z_t²) = 1. Given the bootstrap data, the associated value of the test statistic Λ*_j can be calculated; repeat B times and follow Step (4) to calculate Λ*_B and Step (5) to calculate Λ*.
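The wild bootstrap draw can be sketched as follows. We use the Rademacher law (P(Z = ±1) = 1/2), which satisfies E(Z) = 0 and E(Z²) = 1, as an assumed stand-in for the paper's two-point distribution:

```python
import numpy as np

def wild_bootstrap_residuals(resid, rng):
    """e*_t = e_hat_t * Z_t with Z_t i.i.d. Rademacher draws.
    Multiplying each residual vector by a random sign preserves its
    magnitude, so the conditional heteroskedasticity pattern of the
    original residuals carries over to the bootstrap sample."""
    z = rng.choice([-1.0, 1.0], size=resid.shape[0])
    return resid * z[:, None]

rng = np.random.default_rng(2)
resid = np.arange(1.0, 7.0).reshape(3, 2)   # toy (T=3, p=2) residuals
resid_star = wild_bootstrap_residuals(resid, rng)
```

The bootstrap sample is then generated recursively from the estimated restricted model exactly as in Algorithm 1, with ε*_t in place of the i.i.d. resampled residuals.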
Using the fact that θ = f(α, β, Γ_i, Ω) is consistently estimated in the presence of conditionally heteroskedastic innovations, we show below that Λ* and Λ*_B converge weakly in probability to the first order asymptotic null distribution of Λ.
Remark 2.1. The procedure outlined in Algorithm 2 is suitable when the innovations are serially uncorrelated. Many alternative procedures could be used for generating the bootstrap DGP, such as the block bootstrap. Establishing which bootstrap scheme is best for calculating the Bartlett correction factor under different assumptions on the innovation process is outside the scope of this paper. In this work, the wild bootstrap was preferred to the block bootstrap for the following reasons. First, the wild bootstrap method is easier to implement than the block bootstrap, as it does not involve the problem of determining the block length. Second, under Assumption 1 below, the innovations form an uncorrelated martingale difference sequence, and using the block bootstrap when innovations are uncorrelated may result in a loss of efficiency. Finally, the consistency of the wild bootstrap in the present context can be proved using available tools for independent random variables. However, when innovations admit serial correlation, Algorithm 2 fails to replicate the correlation structure of the residuals, and the procedure is no longer valid.
Remark 2.2. Note that the bootstrap Bartlett correction could easily be extended to other inference procedures, such as the Wald test, and may find good applications in structural VAR models where inference on the slope parameters is needed. However, recent theoretical results in Bruggemann, Jentsch and Trenkler (2016) show that, for VAR statistics that depend on both the slope and the variance parameters (e.g. structural impulse response functions), Algorithm 2 would fail in the presence of conditional heteroskedasticity because the bootstrap algorithm does not correctly replicate the relevant fourth-moment structure of the error terms (see also Jentsch and Lunsford 2019). In contrast, the residual-based moving block bootstrap results in asymptotically valid inference. Investigating the usefulness of the suggested Bartlett test in cases where the block bootstrap is needed will be the subject of future research.

Some Asymptotic Results
We now consider the Λ*_B statistic and show that the distribution of the bootstrap based inference procedure coincides with the corresponding asymptotic counterpart. In this paper, we focus on the pseudo-data generated by Algorithm 2, since the consistency of the bootstrap procedure proposed in Algorithm 1 is derived in Canepa (2016). Our approach builds on the theoretical results in Cavaliere, Rahbek and Taylor (2010a). The authors demonstrate that the limiting null distributions of the rank statistics remain valid in the less restrictive case of globally stationary, conditionally heteroskedastic shocks satisfying certain moment conditions. Cavaliere et al. (2010a) also show that the pseudo maximum likelihood estimator of the error correction model, which assumes Gaussian i.i.d. disturbances, remains consistent under these weaker conditions. Building on these results, below we show that the functional central limit theorem for the stochastic process built from the sequence of partial sums of the bootstrap analog holds, and that the expected value of the test is consistently estimated.
In what follows, →_{w,p} denotes weak convergence in probability as defined by Gine and Zinn (1990), P* denotes the bootstrap probability, and E* the expectation under P*. Moreover, for any square matrix A, |A| is used to indicate the determinant of A, and the matrix A_⊥ satisfies A′_⊥A = 0 (where (A, A_⊥) is a full rank matrix). For any vector a, ‖a‖ denotes the Euclidean norm, ‖a‖ = (a′a)^{1∕2}. In order to establish the validity of the wild bootstrap, we need to impose some conditions on the innovations. More precisely, we make the following assumption: Assumption 1.
(i) Define the characteristic polynomial A(z) = (1 − z)I_p − αβ′z − ∑_{i=1}^{k−1} Γ_i(1 − z)z^i. Assume that the roots of |A(z)| = 0 are located outside the complex unit circle or at 1. Also assume that the matrices α and β have full rank r and that α′_⊥Γβ_⊥ has full rank, where Γ = I_p − ∑_{i=1}^{k−1} Γ_i. Assumption 1 replaces the usual Gaussian assumption on the innovations {ε_t} by the less restrictive martingale difference sequence assumption. The innovations are not correlated; however, ARCH and GARCH effects are now allowed in model (2) by Assumption 1-(ii). Finally, condition (iii) requires the 4 + δ moments of the innovations to be uniformly finite.
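For the lag-one case (k = 1) the root condition in (i) reduces to an eigenvalue check: z solves |(1 − z)I − αβ′z| = 0 exactly when 1/z is an eigenvalue of I + αβ′. A sketch of the check, with illustrative parameter values that are ours rather than the paper's:

```python
import numpy as np

def i1_root_check(alpha, beta):
    """Return the sorted eigenvalue moduli of I + alpha*beta'.
    Assumption 1(i) in the k = 1 case requires each modulus to lie
    inside the unit circle or equal exactly 1."""
    p = alpha.shape[0]
    ev = np.linalg.eigvals(np.eye(p) + alpha @ beta.T)
    return np.sort(np.abs(ev))

# One cointegrating relation between two I(1) series: one eigenvalue
# at 1 (the common trend) and one inside the unit circle.
mods = i1_root_check(np.array([[-0.5], [0.0]]), np.array([[1.0], [-1.0]]))
```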
Under Assumption 1, Theorem 1 in Rahbek, Hansen and Dennis (2002) implies that the process Y_t has the representation

Y_t = C ∑_{i=1}^{t} ε_i + ∑_{i=0}^{∞} C_i ε_{t−i} + A_0,   (6)

where C = β_⊥(α′_⊥Γβ_⊥)^{−1}α′_⊥, the coefficients C_i decrease exponentially, and A_0 is a term that depends only on the initial values, with β′A_0 = 0. Moreover, Theorem 2.1 in Hansen (1992) implies the weak convergence of stochastic integrals of the form

T^{−1} ∑_{t=1}^{T} (∑_{i=1}^{t−1} ε_i) ε′_t →_w ∫_0^1 B(dB)′,

where B = Ω^{1∕2}W is a p-dimensional Brownian motion with variance Ω and W a p-dimensional standard Brownian motion. Rahbek et al. (2002) use this result to derive the asymptotic distribution of the pseudo likelihood ratio test for cointegrating rank. They show that, under the assumption that innovations form a stationary and ergodic vector martingale difference sequence, the limit distributions of the rank tests are invariant to heteroskedasticity (see also Seo 2006). In the paper by Cavaliere, Rahbek and Taylor (2010a), it is shown that the limiting null distributions of the rank tests remain valid in the less restrictive case of global stationarity. Turning now to the statistics constructed from the pseudo-data generated by Algorithm 2, the representation in (6) is still valid for each bootstrap replication. However, the remainder term, A_0, depends on the realization and needs careful consideration in the bootstrap context. Theorem 1 extends the validity of Lemma 1 in Swensen (2006) (derived under the assumption of i.i.d. innovations) to the case where innovations form an uncorrelated martingale difference sequence with finite fourth moments. In the following, we set the deterministic terms, and the initial values of Y, to zero without loss of generality. As before, an asterisk (*) denotes the bootstrap analog.
Theorem 1. Let the conditions of Assumption 1 hold. Then, under the null hypothesis ℋ_0: β = Hφ, Λ* converges weakly in probability to the correct asymptotic null distribution of Λ.
Proof of Theorem 1. By Lemma A.4 in Cavaliere et al. (2010a), under Assumption 1, the generated pseudo observations have the representation

Y*_t = Ĉ ∑_{i=1}^{t} ε*_i + R*_t,   (7)

where Ĉ = β̂_⊥(α̂′_⊥Γ̂β̂_⊥)^{−1}α̂′_⊥. Using the results in (7), we can describe the asymptotic properties of the product moment matrices generated using the pseudo-observations, which are the basic ingredients of the test statistics. Following the standard notation, we define R*_0t and R*_1t as the residuals obtained by regressing Z*_0t = ΔY*_t and Z*_1t = Y*_{t−1}, respectively, on the remaining regressors. Moreover, F(u) := β′_⊥CB(u), and [Tu] is the integer part of uT.
The proof of (8)-(10) mimics the proof of Lemma A.7 in Cavaliere et al. (2010a). Similarly, Lemma A.5 in the same paper implies that the functional central limit theorem for the stochastic process built from the sequence of partial sums corresponding to the bootstrap resamples holds. Considering now (11) and (12), as the remainder R*_t in (7) vanishes, Lemma 10 in Rahbek et al. (2002) holds, and the continuous mapping theorem gives the required limits. When linear restrictions are imposed on the parameters, β̂ = Hφ̂, a submodel is defined, and the corresponding terms in the space spanned by the linear transformation converge to zero. Therefore, using (14)-(16), the asymptotic distribution of Λ* can be found by mimicking Theorem 13.9 in Johansen (1996). □
Proof of Corollary 1. Under Assumption 1, (11) and (12) imply weak convergence of the partial sums of stochastic integrals. Moreover, from (8)-(10) we have that S*_{ij} → Σ_{ij} in probability, and the estimators of the parameters are consistent. Under the conditions of Theorem 1, this implies that E*(Λ*) → E(Λ) in probability as T → ∞. □

The Monte Carlo Design
To what extent do deviations from the Gaussian assumption in model (2) affect the finite sample performance of the analytical Bartlett correction? In addition, can the non-parametric bootstrap based Bartlett adjustment introduced above deliver accurate small sample inference when the Gaussian assumption on the innovations is relaxed? Questions of this nature are best settled by simulation studies. We now describe the Monte Carlo study that addresses these issues.
The DGP adopted is given in (17), with B = σ²I. The null hypothesis, that the cointegrating vectors are proportional to (1, −1), is tested against the alternative ℋ_1: β unrestricted. For ease of interpretation, the DGP in (17) is also given in VECM form. From (17), it is easy to see that under the null hypothesis the variables Y_1t and Y_2t enter the cointegrating relationship with coefficients proportional to (1, −1). This restriction matches the hypothesis of proportional comovements of the two random variables. The DGP in (17) is similar to that used in Gonzalo (1994). Among others, Gonzalo considers a simple two-dimensional VAR in which cointegration holds between the I(1) series Y_1t and Y_2t in (17). The DGP used in Gonzalo allows for a high degree of control over the many parameters affecting the size distortion of Λ, such as the speed of adjustment, the correlation between the innovations, and the volatility parameter. A possible shortcoming, however, is that bivariate cointegrated VARs are rarely encountered in empirical applications. The DGP in (17) maintains high control over the experimental design while also having greater practical relevance. The experimental parameter space is T ∈ (50, 100, 250, 500), the adjustment parameter ∈ (0.2, 0.5, 0.8, 1), the correlation parameter ∈ (−0.5, 0, 0.5), and the volatility parameter = 1. In addition, combinations of these parameters with alternative distributions of ε_t are considered.
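Since the exact parameterisation of (17) is not reproduced here, the following sketch simulates a stylised bivariate system in the spirit of Gonzalo (1994); the names a (speed of adjustment), theta (innovation correlation) and sigma (volatility) are illustrative stand-ins, not the paper's notation:

```python
import numpy as np

def simulate_cointegrated_pair(T, a, theta, sigma, rng):
    """Y_2t is a random walk; Y_1t error-corrects towards it:
        Delta Y_1t = -a (Y_1,t-1 - Y_2,t-1) + e_1t
        Delta Y_2t = e_2t
    with corr(e_1t, e_2t) = theta and sd(e_2t) = sigma, so the spread
    Y_1t - Y_2t is stationary (cointegrating vector (1, -1))."""
    cov = np.array([[1.0, theta * sigma], [theta * sigma, sigma ** 2]])
    e = rng.multivariate_normal(np.zeros(2), cov, size=T)
    y = np.zeros((T + 1, 2))
    for t in range(T):
        y[t + 1, 1] = y[t, 1] + e[t, 1]
        y[t + 1, 0] = y[t, 0] - a * (y[t, 0] - y[t, 1]) + e[t, 0]
    return y[1:]

rng = np.random.default_rng(3)
y = simulate_cointegrated_pair(2000, 0.8, 0.0, 1.0, rng)
spread = y[:, 0] - y[:, 1]
# The spread follows an AR(1) with coefficient 1 - a = 0.2.
ar1 = (spread[:-1] @ spread[1:]) / (spread[:-1] @ spread[:-1])
```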
Although non-normality is not a feature confined to financial data, it is the financial literature that has extensively documented substantial departures from the assumption of Gaussian innovations. For example, it is well established that the unconditional distributions of returns from financial market variables such as equity prices and interest rates are characterized by non-normality. Equity returns tend to be negatively skewed, whereas the patterns of skewness for bond market yields are more varied. Non-normality of the marginal distributions of returns does not necessarily imply the non-normality of the conditional distributions, but many empirical studies suggest that for financial data the Gaussian distribution is highly counterfactual. Given the widespread use of Johansen's procedure in financial applications, it seems appropriate to consider innovation distributions that better describe the behavior of financial markets.
To illustrate the problem of non-normality in financial market variables, we use the behavior of exchange rates. It is well known that exchange rate changes do not follow a Gaussian distribution. Potentially important sources of non-zero skewness and excess kurtosis are recurrent periods of a volatile and then quiet currency markets. To mimic the jump-like behavior caused by the volatility clustering of exchange rates, several researchers have allowed the innovations to be drawn from fat tailed distributions. Among others, Tucker and Pond (1988) provide evidence on the descriptive validity of the mixture of normal distribution as a statistical model for currency markets. Hull and White (1998) give indications on the choice of parameters of the mixture of normals that match the higher-order moments of exchange rate changes for a number of major trading currencies. Building on these studies, we allow the innovations to be drawn from these distributions. 1 The DGP in (5) has

M2
) and l,t ∼ N(0, 1) with 1 1 + 2 2 = 0, 1 , 2 ≥ 0, and 1 + 2 = 1. Volatility clustering is introduced in (5) by 1 that causes occasional "jumps" in the innovation process of the cointegrated VAR(1). When 1 = 2 = 0 the zero skewness assumption about s,t is preserved, being the means of the normal distributions mixed at zero. In this case, excess kurtosis has been introduced in (17)  . Consistent with these considerations, the following five distributions of s,t have been investigated    Table 1 summarizes the descriptive statistics for D 1 , D 2 , D 3 and D 4 . Note that the skewness coefficient, (Skew), is computed as the third theoretical sample moment standardized by three halves power of the variance, whereas the kurtosis coefficient, (Kurt), is the fourth theoretical sample moment divided by the square of the variance. For a normal distribution, Skew should be zero and Kurt should be equal to three. 2 As it emerges from Table 1, the innovations generated using mixture of normals cover a broad range of fat tailed and skewed distributions. Innovations generated under D 1 have mildly fat tails but are not skewed, whereas D 2 , D 3, and D 4 are fat tailed and skewed distributions.
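Innovations of this kind are straightforward to generate. The sketch below is illustrative Python, not the paper's code; the parameter names (p1, mu1, sigma1, ...) are our own labels, and the values used in the example are arbitrary rather than those defining D_1 to D_4:

```python
import numpy as np

def mixture_innovations(T, p1, mu1, sigma1, mu2, sigma2, seed=0):
    """Draw i.i.d. innovations from a two-component normal mixture:
    with probability p1 from N(mu1, sigma1^2), otherwise from
    N(mu2, sigma2^2).  The zero-mean constraint p1*mu1 + p2*mu2 = 0
    must hold; mu1 = mu2 = 0 gives a symmetric (zero-skew) but
    fat-tailed distribution whenever sigma1 != sigma2."""
    p2 = 1.0 - p1
    assert abs(p1 * mu1 + p2 * mu2) < 1e-12, "mixture must have mean zero"
    rng = np.random.default_rng(seed)
    comp = rng.random(T) < p1                      # component indicator
    eps = np.where(comp,
                   rng.normal(mu1, sigma1, T),
                   rng.normal(mu2, sigma2, T))
    return eps

# symmetric mixture: occasional draws from a high-variance component
eps = mixture_innovations(100_000, p1=0.1, mu1=0.0, sigma1=3.0,
                          mu2=0.0, sigma2=1.0)
kurt = np.mean(eps**4) / np.var(eps)**2   # > 3 indicates fat tails
```

Setting non-zero (offsetting) component means instead introduces skewness, which is how fat-tailed and skewed distributions of the D_2 to D_4 type can be produced.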
Though the mixture of normals introduces fat tails, it preserves the i.i.d. structure of the innovations. Among others, Bollerslev (1987) suggests that ARCH and GARCH models better fit exchange rate data measured over short time intervals (i.e. daily or weekly). Accordingly, simulations with conditionally heteroskedastic innovations have been carried out with ε_{i,t} = h_t^{1/2} ν_{i,t}, where ν_{i,t} ~ N(0, 1) (for i = 1, 2) and h_t denotes the conditional variance. Two specifications of the variance scheme are used: an ARCH(1) process, with the intercept set to 1, and a GARCH(1, 1) process, with the intercept set to 0.1. As for the choice of the other parameters, the following values have been selected: D_5 takes ε_{s,t} as in (7), while the parameter values in D_7 and D_8 are those estimated for the exchange rate markets in Bollerslev (1987).
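The ARCH(1) and GARCH(1, 1) variance schemes can be sketched as follows. This is an illustrative Python implementation, not the paper's GAUSS code; the parameter names omega, a1, b1 are generic labels, and the values in the example are arbitrary rather than those defining D_5 to D_8:

```python
import numpy as np

def garch_innovations(T, omega, a1, b1=0.0, seed=0):
    """Conditionally heteroskedastic innovations
        eps_t = z_t * sqrt(h_t),   z_t ~ i.i.d. N(0, 1),
        h_t   = omega + a1 * eps_{t-1}^2 + b1 * h_{t-1}.
    b1 = 0 gives an ARCH(1) scheme; b1 > 0 a GARCH(1, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(T)
    eps = np.zeros(T)
    h = np.zeros(T)
    h[0] = omega / max(1.0 - a1 - b1, 1e-6)   # start at unconditional variance
    eps[0] = z[0] * np.sqrt(h[0])
    for t in range(1, T):
        h[t] = omega + a1 * eps[t - 1]**2 + b1 * h[t - 1]
        eps[t] = z[t] * np.sqrt(h[t])
    return eps

arch = garch_innovations(50_000, omega=1.0, a1=0.5)            # ARCH(1)
garch = garch_innovations(50_000, omega=0.1, a1=0.1, b1=0.8)   # GARCH(1, 1)
```

Both schemes are white noise with zero mean but exhibit volatility clustering, so unconditionally the innovations are fat-tailed even though z_t is Gaussian.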
Estimates of the rejection probabilities have been obtained using pseudorandom numbers with programs written in GAUSS. 3 The Monte Carlo experiment was based on N = 10,000 replications for Λ and Λ_B, and on N = 1000 replications for Λ*_B and Λ*. All bootstrap distributions have been generated by resampling and then calculating the test statistic 800 times. The random number generator was restarted for each T value with the initial seed set equal to zero. The VAR(1) model was fitted with an unrestricted constant. Moreover, note that in the Johansen procedure, the maximum likelihood estimator of β in Eq. (2) is calculated as the set of eigenvectors corresponding to the s largest eigenvalues of S_{k0} S_{00}^{-1} S_{0k} with respect to S_{kk}, where S_{00}, S_{kk}, and S_{0k} are the moment matrices formed from the residuals of the regressions of Δy_t and y_{t−k}, respectively, on the lagged differences Δy_{t−j}. In this paper, in place of the conventional algorithm for cointegration analysis (i.e. the algorithm for maximum likelihood estimation that uses the second moment matrices), all simulation results reported have been obtained using an algorithm based on the QR decomposition (see Doornik and O'Brien 2002), which yields simulation results that are more numerically stable.
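The eigenproblem at the heart of the Johansen procedure can be illustrated as follows. This Python sketch uses the conventional moment-matrix formulation (solved via Cholesky whitening of S_{kk} for numerical symmetry); it is not the QR-based algorithm of Doornik and O'Brien (2002) actually used in the paper, and it assumes the inputs have already been partialled out with respect to the lagged differences:

```python
import numpy as np

def johansen_eigen(dy, ylag):
    """Solve the reduced-rank eigenproblem
        | lam * S_kk - S_k0 S_00^{-1} S_0k | = 0
    from the moment matrices of dy (residuals from Delta y_t) and
    ylag (residuals from y_{t-k}).  Returns eigenvalues in
    descending order and the corresponding eigenvectors, whose
    leading columns estimate the cointegrating vectors."""
    T = dy.shape[0]
    S00 = dy.T @ dy / T
    Skk = ylag.T @ ylag / T
    S0k = dy.T @ ylag / T
    A = S0k.T @ np.linalg.solve(S00, S0k)        # S_k0 S_00^{-1} S_0k
    # generalized symmetric eigenproblem via Cholesky whitening of S_kk
    L = np.linalg.cholesky(Skk)
    Linv = np.linalg.inv(L)
    lam, W = np.linalg.eigh(Linv @ A @ Linv.T)   # standard symmetric problem
    V = Linv.T @ W                               # back-transform eigenvectors
    order = np.argsort(lam)[::-1]                # largest eigenvalues first
    return lam[order], V[:, order]

# illustrative data: two random walks, their differences and lagged levels
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal((201, 2)), axis=0)
dy, ylag = np.diff(y, axis=0), y[:-1]
lam, beta_hat = johansen_eigen(dy, ylag)
```

The eigenvalues are squared canonical correlations, so they lie in [0, 1]; the whitening step avoids explicitly inverting S_{kk} inside a non-symmetric product, which is part of the numerical-stability motivation behind the QR-based variant.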

The Monte Carlo Results
Tables 2-5 report the simulation results on the performance of Λ, Λ_B, Λ*_B, and Λ*. The finite sample significance levels are estimated for a nominal level of 5%, and all estimates are given as percentages. In Table 2, the normal distribution serves as a benchmark, whereas Table 3 shows the results for innovations drawn from a mixture of two normal distributions. Table 3 also contains results on the sensitivity of the error in rejection probability to variations in key parameters of the DGP. Finally, Table 4 reports the rejection frequencies for the case of ARCH and GARCH innovations.
Before looking at other specifics of the simulation results, it is worth considering the benchmark case in which ε_t ~ N(0, 1) (note that in this case the distributions of ε_{s,t} and ε_{l,t} in (17) coincide). As far as Λ is concerned, Table 2 mainly confirms previous findings that inference based on first-order asymptotic critical values is markedly inaccurate, with excessively high rejection frequencies. Correcting Λ using the analytical Bartlett factor improves the behavior of the test statistic. However, Table 2 indicates that the performance of Λ_B is highly dependent on the autoregressive coefficient of the error correction mechanism. When this coefficient is large (i.e. the speed of adjustment to the cointegrated equilibrium is low), the correction does not work well. Using the bootstrap to approximate the Bartlett adjustment factor produces estimated levels that are less sensitive to the value of this parameter. The performance of the p-value bootstrap test is also less dependent on the value of the speed of adjustment parameter. Looking at the simulation results in Table 2, it appears that when T = 100 and the innovation correlation is non-zero, Λ* and Λ*_B work well for speed-of-adjustment values up to 0.8, whereas the empirical levels of Λ_B are within the 95% confidence interval only for values up to 0.5, say. When the autoregressive coefficient equals 1, the process Y_t is a pure I(1) process that does not cointegrate. In this case, we do not expect the resampling schemes presented in Section 2 to work, since a root of the characteristic polynomial of the model in (2) lies on the unit circle and the process is not stationary. (Table note: the estimated rejection probabilities of Λ*_B and Λ* have been calculated using Algorithm 1 in Section 2. For Λ and Λ_B, the number of replications is N = 10,000; for Λ*_B and Λ*, N = 1000 and B = 800. A 95% confidence interval around the nominal level of 5% is given by (3.6, 6.4). The Bartlett corrections are given in parentheses. The asymptotic distribution is χ²(1).)
The size distortion of Λ*_B and Λ* is still quite moderate, but there is no reason to believe that the test statistics would have adequate power. (Note that for the near unit-root model the bootstrap becomes inconsistent, just as in the exact unit root case.) Coming to the correlation between the innovations, the estimated sizes reported in columns 3-6 show that the error in rejection probability increases as the correlation approaches zero. However, no matter the value of this parameter, bootstrap-based inference outperforms Λ_B. (Table note: the estimated rejection probabilities of Λ*_B and Λ* have been calculated using Algorithm 2 in Section 2, with the DGP correlation set to 0. For Λ and Λ_B, the number of replications is N = 10,000; for Λ*_B and Λ*, N = 1000 and B = 800. The asymptotic size of the tests is 5%. The Bartlett corrections are given in parentheses.)
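The bootstrap approximation to the Bartlett factor that underlies Λ*_B can be sketched as follows. This illustrates only the rescaling step: the finite-sample mean of the LR statistic is estimated by the average of B bootstrap replications computed under the null, and the statistic is rescaled so its mean matches that of the χ² reference distribution. Generating the bootstrap replications (the expensive part) is omitted, and the toy numbers below are arbitrary:

```python
import numpy as np

def bootstrap_bartlett(lr_stat, lr_boot, df):
    """Bootstrap Bartlett correction: rescale the LR statistic so that
    its estimated finite-sample mean matches the chi-square mean df,
        LR*_B = lr_stat * df / mean(lr_boot),
    where lr_boot holds B bootstrap replications of the LR statistic
    computed on residual resamples generated under the null."""
    factor = np.mean(lr_boot) / df      # estimated Bartlett factor
    return lr_stat / factor

# toy illustration: an oversized LR statistic whose bootstrap
# replications have mean roughly 1.4 times the chi-square mean
rng = np.random.default_rng(1)
df = 1
lr_boot = 1.4 * rng.chisquare(df, size=800)
adjusted = bootstrap_bartlett(3.9, lr_boot, df)   # pulled below 3.9
```

Because the bootstrap replications are drawn by resampling the estimated residuals, the correction factor automatically reflects whatever skewness and kurtosis the innovations exhibit, which is why no analytical adjustment for non-normality is needed.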
Turning to the question of how well the bootstrap Bartlett correction performs when the innovations are fat-tailed, Table 3 suggests that the answer depends, in a complicated way, on the speed of adjustment, the innovation correlation, T, and the distribution of ε_{s,t}. Looking first at the estimated levels of Λ over the range D_1, ..., D_4, a comparison with the excess kurtosis and skewness coefficients in Table 1 reveals that, in general, the error in the rejection probability of the test increases with |Kurt| and |Skew|, with the highest size distortion for D_3 and D_4. Furthermore, comparing the estimated sizes in the top and bottom panels of Table 3, it appears that the effect of non-Gaussian innovations on the estimated level of the test is, once again, highly dependent on the parameter values of the DGP: it is pronounced when the speed of adjustment is slow and relatively mild when the latter is fast (i.e. when the speed-of-adjustment parameter equals 0.2). Bewley and Orden (1994) report that Johansen's estimator produces outliers when the speed of adjustment is slow, while Phillips (1994) provides a theoretical analysis showing that the finite sample distribution of the estimator is leptokurtic. The simulations in Bewley and Orden and the theoretical results in Phillips explain why Λ behaves so poorly when the slow speed of adjustment (0.8) is combined with the non-Gaussian distributions in Table 3: excess kurtosis in the innovations magnifies the effect of the slow speed of adjustment, increasing the mismatch between the finite sample and the asymptotic reference distribution of the test statistic by moving the distribution to the left. In this situation, Λ_B can only be partially successful, because the second-order terms of the asymptotic expansion of the mean of Λ depend on the skewness and kurtosis of its distribution, and the conditions under which this dependence vanishes have not yet been established. In contrast, when using Λ*_B, the Gaussian distribution is replaced with the empirical density function of the innovations.
This strongly mitigates the effects of skewness and kurtosis on the finite sample mean of the test and makes the finite sample distribution of Λ * B closer to the asymptotic distribution.
The final set of simulation experiments relates to the ARCH and GARCH innovations. Table 4 presents the empirical sizes for the inference procedures under consideration for different values of the ARCH and GARCH parameters. As in the other cases, the error in rejection probability of Λ and Λ_B depends heavily on the distribution of ε_{s,t}. In contrast, Λ*_B and Λ* behave quite well, leaving open the possibility of extending the bootstrap algorithm presented for the Bartlett correction in Section 2 to other cases in which the ordinary residual-based bootstrap procedure would fail.
To wrap up the discussion: in Tables 2-4, Λ is greatly oversized in most instances. The error in rejection probability of the test statistic depends crucially on the parameter values of the DGP, and violations of the Gaussian assumption worsen the finite sample performance of the test. Λ_B offers improvements over the uncorrected statistic, but its behavior mimics that of Λ and it is therefore not entirely reliable. In contrast, the two bootstrap procedures are less sensitive to the parameter values of the DGP and appear to be relatively robust to both non-Gaussian and conditionally heteroskedastic innovations.

Results under the Alternative Hypothesis
It is well known that the Bartlett correction factor is designed to bring the actual size of asymptotic tests close to their respective nominal size, but it may lead to a loss in power. Accordingly, the power properties of the proposed procedure are considered in this section.
For the experiments evaluating the power of the tests, data were generated under the alternative hypothesis, with g ∈ (1.2, 1.4, 1.6, 1.8, 2), the speed of adjustment set to 0.5, the innovation correlation set to 0.5, and the volatility parameter set to 1 in (17). The results of this set of experiments are reported in Table 5. Once again, the case of ε_t i.i.d. N(0, 1) (i.e. the case with ε_{s,t} = ε_{l,t} in (17)) serves as a benchmark; then ε_{s,t} ~ D_1 and ε_{s,t} ~ D_5 are considered. Experiments using the other innovation distributions considered in Tables 3 and 4 produced similar power properties, and those results are omitted in the interest of brevity. Note that the simulation results were obtained using Algorithm 1 for ε_t ~ N(0, 1) and ε_{s,t} ~ D_1, whereas Algorithm 2 was used for ε_{s,t} ~ D_5. The simulation results show that the sample size and the distance between the null and the alternative hypothesis play an important role in determining the power of the test statistics under consideration. 4 Considering the asymptotic test first, it appears that the power of Λ is badly affected by the choice of the distribution of the innovations: the test is relatively well behaved when the innovations are fat-tailed but i.i.d., whereas its performance deteriorates when ARCH innovations are introduced in the DGP. Turning to the comparison of power among the different procedures, overall it is found that in small samples (i.e. T = 50) correcting the test statistic for size shifts the estimated power function down. There is evidence that Λ*_B and Λ* share similar power properties, with neither test uniformly outperforming the other.
The results for the sensitivity of the inference procedures to the parameters of the DGP are not reported in detail here, but the simulation experiments showed that a slow adjustment to the equilibrium worsens the rejection frequencies for Λ*_B, Λ*, and Λ_B. On the other hand, changing the correlation between the innovations does not have an important impact on the power estimates. (Table note: the estimated rejection probabilities of Λ*_B and Λ* have been calculated using Algorithms 1 and 2 in Section 2 with N = 1000 and B = 800, and a DGP with speed of adjustment 0.5 and correlation 0.5.)

Robustness Check
The previous simulation results revealed several insights into the performance of the bootstrap Bartlett corrected test. A possible shortcoming is that only DGPs with one cointegrating vector and a limited number of lags were considered. Increasing the number of nuisance parameters in the DGP is expected to increase the finite sample size distortion of the asymptotic LR test (see for example Podivinsky 1992).
To investigate the effect of the nuisance parameters on the finite sample performance of the inference procedures, a further set of simulation experiments was undertaken. The simulation design involved: (i) considering the effect of increasing the number of lags in the DGP; and (ii) increasing the cointegrating rank from r = 1 to r = 2. To complete the picture, the effect of combining richer dynamics with a higher cointegrating rank was also considered.
In particular, point (i) was addressed by undertaking simulation experiments using the DGP in Eq. (17) with the speed of adjustment and the correlation both set to 0.5 and N(0, 1) innovations. To investigate the effect of the nuisance parameters, the number of lags was progressively increased from k = 1 to k = 3. To address point (ii), a second DGP with r = 2 was considered. The second DGP, labelled DGP2, is a four-dimensional VAR in which the cointegrating coefficients are normalized as β11 = β22 = 1 and the adjustment coefficients are α11 = 0.4, α21 = 1.7, α31 = 0.1, α22 = 0.3, α32 = 0.6, with the innovations as in DGP1. In this case the null hypothesis of interest restricts the cointegrating vectors to contain an identity matrix; once again, under the alternative hypothesis β is unrestricted. Table 6 reports the simulation results for the empirical sizes of Λ, Λ_B, Λ*_B, and Λ*. For ease of interpretation, the empirical sizes in Table 2 for k = 1 are also reported. (Table note: the estimated rejection probabilities of Λ*_B and Λ* have been calculated using Algorithm 1 in Section 2. For Λ and Λ_B, the number of replications is N = 10,000; for Λ*_B and Λ*, N = 1000 and B = 800. A 95% confidence interval around the nominal level of 5% is given by (3.6, 6.4).)
As far as the simulation results are concerned, the first thing to note in Table 6 is that inference based on first-order asymptotic critical values is, once again, markedly inaccurate, with excessively high rejection rates. Increasing the number of lags, k, dramatically increases the deviation from the nominal levels. By contrast, allowing more cointegrating vectors, r, in the system slightly reduces the size distortion. Turning to the empirical sizes for Λ_B and Λ*_B, we can see that they are much closer to the nominal sizes than those based on first-order asymptotic critical values. Overall, the results in Table 6 show that correcting the LR test statistic is worthwhile, since all the empirical sizes reported for the corrected tests are closer to the nominal 5% level than those of the unadjusted test statistic. However, introducing many nuisance parameters in the model affects the size accuracy of all test statistics under consideration.
Note that in Table 6 only results with normal innovations are reported. In line with the results in Table 3, departures from normality of the innovations increased the size distortion of the Λ and Λ_B procedures, whereas Λ* and Λ*_B presented empirical sizes closer to the nominal size (results not reported but available upon request).

An Empirical Application
As an illustration, the bootstrap Bartlett procedure discussed in Section 2 has been applied to investigate the purchasing power parity (PPP) relationship. According to economic theory, once converted to a common currency, national price levels should be equal; in other words,

P_t = E_t + P*_t,

where P is the log of the domestic price level, P* is the log of the foreign price level, and E denotes the log of the spot exchange rate (the home-currency price of a unit of foreign currency). Departures from the PPP relationship at time t can therefore be defined as

PPP_t = (P_t − P*_t) − E_t.    (20)

Equation (20) implies that if the PPP mechanism is functioning, one should observe a tendency of the two markets to adjust toward the long-run equilibrium level of exchange rates, meaning that PPP_t should be a stationary stochastic process. However, using conventional unit root tests, a number of studies examining the empirical validity of the PPP relationship for the period of floating exchange rates have failed to reject the null hypothesis of non-stationarity for PPP_t, leading to what Obstfeld and Rogoff (2001) define as the "PPP puzzle" (see also Rogoff 1996). Progress in econometrics over the past 30 years has added hundreds of papers to the PPP literature, and the failure to find empirical evidence for the PPP relationship has made the PPP puzzle one of the six major puzzles in international economics (see for example Chen and Engel 2005; Falahati 2019; Ford and Horioka 2017; Frydman et al. 2009; Horioka and Ford 2017; Imbs et al. 2005; Johansen and Juselius 1992; Razzak 2018).
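For concreteness, the PPP deviation defined in (20) is a simple transformation of the three log series. A minimal sketch (variable names are illustrative, and the toy numbers are made up for the check):

```python
import numpy as np

def ppp_deviation(p_home, p_foreign, e):
    """Deviation from PPP in logs, as in (20):
        PPP_t = (P_t - P*_t) - E_t.
    Inputs are log price indices and the log spot exchange rate;
    under PPP the returned series should be stationary."""
    p_home, p_foreign, e = map(np.asarray, (p_home, p_foreign, e))
    return (p_home - p_foreign) - e

# toy check: if E_t moves one-for-one with the price differential,
# the PPP deviation is identically zero
p = np.array([4.60, 4.62, 4.65])       # log domestic price level
p_star = np.array([4.50, 4.51, 4.53])  # log foreign price level
e = p - p_star                         # log exchange rate obeying PPP
dev = ppp_deviation(p, p_star, e)
```

Testing PPP then amounts to testing whether this deviation series is stationary, which is the restriction on the cointegrating vector examined below.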
Reviewing the existing empirical work on PPP, a large consensus emerges on two facts. First, consensus estimates suggest that the marginal distributions of prices and exchange rates exhibit excess kurtosis and non-zero skewness, such that a Gaussian conditional distribution for the innovations is typically counterfactual (see for example Tucker and Pond 1988; Fujihara and Park 1990; Engel and West 2005). Second, there is fairly persuasive evidence that it takes a long time before PPP returns to its steady-state value, meaning that the speed of adjustment toward the PPP equilibrium is very slow (see for example Alves et al. 2001; Costa and Crato 2001; Masih and Masih 2004). Because Λ*_B is less sensitive to the parameter values of the DGP (the empirical levels of Λ*_B reported in Table 2 showed much less variation over the grid of parameters considered in the Monte Carlo experiment) and better able to cope with deviations from the Gaussian assumption, this test statistic may be appropriate when using Johansen's procedure for testing PPP hypotheses.
As an application, we test whether or not the PPP relationship holds for the real exchange rate between the US dollar and the currencies of a number of countries (economic regions), using quarterly data from 2000:Q1 to 2020:Q1. The countries included in the sample are Canada, the UK, China, Australia, and the EU area. 5 For the h-th country (h = 1, …, 5), let E_{h,t} be the nominal dollar exchange rate, P_{h,t} the domestic consumer price index, and P_{USA,t} the US consumer price index. As for the estimation results, preliminary analysis of the unrestricted VAR(2) models ruled out serial correlation; however, evidence of non-normality and heteroskedasticity of the ARCH type was detected for the series (P_{h,t} − P_{USA,t}) in all models under consideration. Under these circumstances, the test Λ_B no longer constitutes a valid inference procedure, and for this reason it was discarded from the analysis.
In Table 7, the empirical p-values for Λ, Λ*_B, and Λ* are reported. The null hypothesis under consideration is that (P_{h,t} − P_{USA,t}) − E_{h,t} is stationary or, equivalently, that the vector (1, −1)′ ∈ sp(β). This can be formulated as a linear restriction on β, tested against the alternative H1: β unrestricted. The p-values for Λ*_B and Λ* in the third and fourth columns were obtained using Algorithm 2 in Section 2 with B = 5000, while the p-values for Λ were calculated from the asymptotic χ²(1) reference distribution.
From Table 7 it appears that in finite samples the asymptotic LR inference procedure does not work well: it rejects the null hypothesis in favor of the alternative in three cases (the UK, Australia, and the EU area), whereas the corrected LR statistic confirms that the PPP relationship holds between the US dollar and the other currencies for all countries (economic regions). Λ* also seems to work well, thus providing an alternative to the corrected LR test. Johansen's (2000) Bartlett corrected LR test relies on Gaussian innovations, but in empirical applications there is limited information on the distributional form of the innovations. There is therefore a need for procedures that do not rest on the Gaussian assumption (or on any other specific distribution).

Concluding Remarks
This paper considers a non-parametric bootstrap Bartlett LR test and finds that the bootstrap Bartlett correction serves two purposes at once. First, it is able to control the size distortion generated by a slow speed of adjustment to the cointegrated equilibrium, as well as by other crucial parameters of the data generating process. Second, it is robust to violations of the Gaussian assumption: no matter which distribution of the innovations is considered (mixture of normals, ARCH, or GARCH), there is little evidence that the size of the bootstrap Bartlett statistic depends in any important way on the form of the innovations. Together, these results constitute an important improvement over the analytical Bartlett correction, particularly in light of the fact that in empirical applications the true underlying data generating process is not known.