Testing for Breaks in Cointegrated Panels - with an Application to the Feldstein-Horioka Puzzle

Stability tests for cointegrating coefficients are known to have very low power with small to medium sample sizes. In this paper we propose to solve this problem by extending the tests to dependent cointegrated panels through the stationary bootstrap. Simulation evidence shows that the proposed panel tests improve considerably on asymptotic tests applied to individual series. As an empirical illustration we examine investment and saving for a panel of European countries over the 1960-2002 period. While the individual stability tests, contrary to expectations and graphical evidence, in almost all cases do not reject the null of stability, the bootstrap panel tests lead to the more plausible conclusion that the long-run relationship between these two variables is likely to have undergone a break.


Introduction
The analysis of cointegration in non-stationary panels has recently been expanding rapidly in two main directions. The first, urged by the nature of the data actually used in empirical applications, is the effort to generalise the tests to the case of dependent units, either by modelling the dependence (inter alia, Gengenbach, Palm and Urbain, 2006) or by reproducing it through the bootstrap (Fachin, 2007; Westerlund and Edgerton, 2006). The second direction follows steps already taken by the cointegration literature in the early 1990s, tackling the issues of testing (i) cointegration allowing for breaks and (ii) the stability of a cointegrating relationship. In this stream of the literature, the first problem seems to have received more attention (e.g., Banerjee and Carrion-i-Silvestre, 2004; Gutierrez, 2005; Westerlund, 2006) than the second (to the best of our knowledge, only Kao, 2001, 2005, for trend regressions, and Kao and Chiang, 2000, for homogeneous panel regressions). This is somewhat surprising, as stability tests with unknown break points may have very low power even with medium sample sizes. For instance, the rejection rates under H1 simulated by [...] for T = 100 and medium speed of adjustment are only marginally higher than Type I errors, and actually lower than the significance level. Cointegration stability tests are thus natural candidates for panel extensions, which may hopefully grant power gains large enough to make them empirically useful. A second surprising aspect of the current debate is that so far the developments in the treatment of dependence across units seem to have been largely ignored in the "panel with breaks" literature¹. The tests proposed should thus be regarded essentially as a first step in the construction of empirically relevant procedures, very much like the first generation panel cointegration tests.
In contrast, in this paper we tackle the dependence issue from the outset, proposing a panel generalisation of Hansen's (1992) stability tests based on the stationary bootstrap which is completely robust to cross-section dependence, and may thus be helpful for actual empirical work. A fitting empirical illustration is the so-called Feldstein-Horioka (1980) puzzle, i.e. the widespread evidence supporting the existence of a long-run link between the investment (I) and savings (S) to GDP (Y) ratios in advanced economies which should be characterised by high capital mobility. The issue has been examined in a non-stationary panel set-up by, among others, Banerjee and Carrion-i-Silvestre (2004) and Di Iorio and Fachin (2007), who both report findings that are on the whole rather favourable to the cointegration-with-break hypothesis. Indeed, as remarked e.g. by Frankel (1992), breaks are to be expected in view of the worldwide shift towards financial liberalisation of the late 1980s. Thus, it is of some interest to test whether breaks actually took place.

¹ Notable exceptions include the panel cointegration tests with breaks by Banerjee and Carrion-i-Silvestre (2004, 2006) and Westerlund (2006), which however leave many questions open. Westerlund applies simple resampling to data which, provided cointegration holds, are weakly dependent, while Banerjee and Carrion-i-Silvestre's (2004) procedure implies fitting an AR model to an MA process with a unit root under no cointegration (the same remark applies to Westerlund and Edgerton, 2006). Finally, the Banerjee and Carrion-i-Silvestre (2006) test appears to have very good properties, but since it is based on Bai and Ng's (2004) PANIC procedure it unfortunately requires rather large sample sizes (the smallest ones reported in Banerjee and Carrion-i-Silvestre's simulations are T = 50, N = 40).
We shall now (section 2) introduce the set-up and outline the testing procedure, then present the design and results of a Monte Carlo experiment (section 3) and the empirical illustration (section 4). Some conclusions and suggestions for future research are finally discussed (section 5).
2 Testing parameter stability in cointegrated panels

Set-up
Consider a (k + 1)-dimensional I(1) random variable Z observed over N units and T time periods (respectively indexed by i and t), naturally partitioned [...] Then, as long as no long-run relationships among the X's exist, we can estimate the N cointegrating vectors (say, $\beta_i = [\beta_{i1}\ \beta_{i2} \ldots \beta_{ik}]$) by applying some single-equation method (e.g. FM-OLS) separately to each of the N time series. Hansen (1992) proposed three tests for the hypothesis that the β's are stable over time when no a priori information on the location of the possible breaks $t^b_i$ is available: (i) the maximum of the Chow tests computed at all possible break points (SupF); (ii) their mean (MeanF); (iii) a Lagrange-Multiplier test of the hypothesis that the coefficients follow a martingale process of zero variance ($L_c$). The panel extension along the lines of Pedroni's (1999) group mean test is in principle trivial, as it simply involves taking the mean (or some robust statistic such as the median or an α-trimmed mean) of the statistics computed for the individual units. As in the case of panel cointegration tests, the bootstrap is a natural candidate for solving the problem of inference under the general set-up of dependent units. To this end, we need to design a resampling scheme delivering pseudodata which obey the null hypothesis of coefficient stability and reproduce both the autocorrelation and cross-correlation properties of the data. Denoting by $S_i$ the stability statistic of interest for unit i, we propose to estimate the p-value of the group stability statistic S by the following algorithm:

1. Obtain estimates $\hat\beta^0_i$ of the cointegrating vectors under H0: coefficient stability;
2. Compute the individual stability statistics $\hat S_i$ and estimate break locations [...]
3. [...]
4. Estimate models allowing for breaks at the periods $\tilde t^b_i$ and store the residuals $\hat e_t = [\hat e_{1t} \ldots \hat e_{Nt}]$; the choice of the $\tilde t^b_i$'s, a key point of the procedure, is discussed in some detail in Remark (i) below;
5. Since cointegration holds, in resampling the T × N matrix $\hat E = [\hat e_1 \ldots \hat e_T]'$ we only need to allow for short-run autocorrelation. Hence, we can apply the stationary bootstrap (Politis and Romano, 1994) and obtain a matrix of pseudo-residuals $E^* = [e^*_1 \ldots e^*_T]'$ reproducing both the short-run correlation over time and the cross-units correlation of the estimated residuals;
6. [...]
7. Compute the group stability statistic $S^*$ for the pseudo-data set $[Y^*_{it}\ X'_{it}]'$, i = 1, ..., N, t = 1, ..., T;

Three remarks are in order:

(i) As mentioned above, estimation of break points is a key point of the procedure. An apparently appealing choice is $\hat t^b_i = \arg\max(\mathrm{Sup}\hat F_i)$, so that break location is allowed to vary across units. In fact, this is a good choice when there is a break in the data (for instance, in the simulation reported in Fig. 1 the mean estimation error is 0.73 and the median error 1), but much less so when H0: no break holds. In these circumstances, in small time samples the break is often placed towards either end of the sample (see Fig. 2), causing overfitting and spuriously small estimated residuals. As a consequence, the bootstrap pseudodata tend to exhibit spuriously high signal/noise ratios, and the bootstrap stability tests tend to be severely oversized. Superior results are obtained when the restriction of a common break located at the median of the individual estimates of break periods is imposed (i.e., $\tilde t^b_i = \mathrm{median}(\hat t^b_1, \ldots, \hat t^b_N)$ for all i).

(ii) The hypothesis of partial stability (involving only some of the coefficients) is easily handled by modifying accordingly the equations estimated in step (4) and the stability statistics adopted.

(iii) Although exploratory simulations showed the results to be quite robust to the choice of block length, in principle this is a critical point of the algorithm. Here, for computational convenience, we applied a simple rule-of-thumb, fixing it at T/10. In future work we plan to implement Politis and White's (2003) algorithm.
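The resampling and p-value steps of the algorithm can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the rule-of-thumb T/10 is interpreted here as the mean block length of the stationary bootstrap (an assumption), and the residual matrix is a random placeholder. The key design point is that entire rows (time periods) of the T × N residual matrix are resampled, so that the cross-unit correlation of the residuals is carried over to the pseudo-residuals.

```python
import numpy as np

def stationary_bootstrap_indices(T, mean_block, rng):
    """Time indices for one stationary-bootstrap draw (blocks of geometric length)."""
    p = 1.0 / mean_block                      # probability of starting a new block
    idx = np.empty(T, dtype=int)
    idx[0] = rng.integers(T)
    for t in range(1, T):
        if rng.random() < p:
            idx[t] = rng.integers(T)          # new block starts at a random date
        else:
            idx[t] = (idx[t - 1] + 1) % T     # continue the block, wrapping at T
    return idx

def resample_residuals(E, mean_block, rng):
    """Resample whole rows of the T x N matrix, keeping cross-unit correlation."""
    idx = stationary_bootstrap_indices(E.shape[0], mean_block, rng)
    return E[idx, :]

def group_statistic(unit_stats, kind="mean"):
    """Group statistic: mean (default) or median of the individual statistics."""
    s = np.asarray(unit_stats, dtype=float)
    return float(np.mean(s)) if kind == "mean" else float(np.median(s))

def bootstrap_p_value(observed, boot_stats):
    """Share of bootstrap group statistics at least as large as the observed one."""
    return float(np.mean(np.asarray(boot_stats, dtype=float) >= observed))

rng = np.random.default_rng(0)
E_hat = rng.standard_normal((50, 5))          # placeholder for estimated residuals
E_star = resample_residuals(E_hat, mean_block=50 / 10, rng=rng)
```

In an actual application `E_hat` would hold the residuals stored in step 4, and `bootstrap_p_value` would be fed the group statistics computed on each pseudo-data set.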

Design
The simulation experiment is based on the design adopted by Fachin (2007), essentially a generalisation of the classical Engle and Granger (1987) Data Generation Process (DGP) to the case of dependent panels (a similar design is also employed by e.g. Kao, 1999). Considering for the sake of simplicity the bivariate case $Z_{it} = [Y_{it}\ X_{it}]'$, the DGP can be summarised as follows. Following Pesaran (2006), short-run dependence is induced by defining the shocks driving Y and X ($u^j$, j = x, y) as the sum of an idiosyncratic component ($\epsilon^j$, j = x, y) and a single stationary common factor ($f^j_t$, j = x, y); long-run dependence is caused by an explanatory variable common across units. Letting $t^b$ be the period in which the break takes place, we then have: [...] where i = 1, ..., N, t = 1, ..., T; when a = 0 the right-hand side variable $x_{it}$ is weakly exogenous for the long-run parameter β. Since a full-information, rather than single-equation, approach should be used when weak exogeneity does not hold, we fix a = 0 in all experiments with no loss of generality.
Although at first sight the DGP equation for X does not appear to be subject to breaks, substituting for $y_{it}$ and rearranging yields: [...] which makes clear that in fact both DGP equations are breaking. Both errors $u^j$, j = x, y, are assumed to be the linear combination of a common component, $f^j \sim N(0, 1)$, j = x, y, and an idiosyncratic one, $\epsilon^j$, j = x, y: [...] The coefficients $\gamma^j_i$, j = x, y, are the factor loadings and determine the strength of the short-run cross-correlation across units; here $\gamma^j_i \sim$ Uniform(−1, 6) ∀ i, j, so that the cross-correlation is substantial (about 0.65). The structure of the idiosyncratic component is: [...] with $\sigma^2_{ij} \sim$ Uniform(0.5, 1.5), j = x, y, so as to allow for some heterogeneity across units.
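The way a single common factor induces short-run cross-correlation can be illustrated with a minimal Python sketch. It is not the DGP of the experiment: the idiosyncratic components are taken here to be serially independent (an assumption made only for illustration), the function name is hypothetical, and the loading bounds are left as arguments.

```python
import numpy as np

def simulate_factor_errors(T, N, loading_low, loading_high, rng):
    """Errors u_it = gamma_i * f_t + eps_it with one common N(0,1) factor."""
    f = rng.standard_normal(T)                          # common factor f_t
    gamma = rng.uniform(loading_low, loading_high, N)   # unit-specific loadings
    sigma2 = rng.uniform(0.5, 1.5, N)                   # heterogeneous variances
    # Assumption: idiosyncratic shocks are i.i.d. over time (illustration only).
    eps = rng.standard_normal((T, N)) * np.sqrt(sigma2)
    return f[:, None] * gamma[None, :] + eps

# Loading bounds as printed in the text; T and N are illustrative.
u = simulate_factor_errors(T=50, N=20, loading_low=-1.0, loading_high=6.0,
                           rng=np.random.default_rng(0))
cross_corr = np.corrcoef(u, rowvar=False)               # N x N correlation matrix
```

The larger the loadings relative to the idiosyncratic variances, the stronger the correlation between the errors of any two units.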
The DGP (??)- (7) is obviously quite complex. Rather than aiming at the unfeasible task of a complete design 2 we will define as a base case an empirically relevant set-up and then explore a few interesting variations. Considering that the simple bivariate DGP often used in simulation experiments is clearly unrealistic, but in single-equation cointegration modelling the number of explanatory variables is usually limited, we generally set k = 2 in both the DGP and estimated model. With no loss of generality we set both constant and slopes to 3 before the break (the same value chosen by Banerjee and Carrion-i-Silvestre, 2004, for the slope); after the break all coefficients are halved.
Finally, a = 0, so that the X variables are exogenous.
Since [...] report a tendency to over-rejection of the asymptotic test in models with 3 or 4 explanatory variables, we also ran a separate experiment with k = 4. Finally, a key point is that, given the rather short time series analysed in most experiments, we fixed the trimming coefficient at 25% in order to ensure computational stability. The cases considered are six altogether.
1. Base case: T = 50, N from 5 to 40; in the power simulations break date Uniform over units in [0.5T ± 3] = [22,28]. Since recursive stability tests assume rather large sample sizes we chose to fix the time sample in all experiments except the following one to 50. This is admittedly a rather large sample in terms of annual data, but pretty small if a quarterly frequency is assumed. It may thus be considered relevant for actual empirical applications (note that it is much smaller than those typically considered in simulation studies on stability tests, where generally T ≥ 100).
2. Large T : T = 100, N = 3, 5; in the power simulations break date Uniform over units in [0.5T ± 3]. Since the aim of this experiment is checking the time-asymptotic behaviour of the tests, for computational convenience only very small cross-section sample sizes are examined.
3. Late break: T = 50, N from 5 to 40; break date Uniform over units in [0.75T ± 3], that is [35,41]. Since 25% of the sample is trimmed at each end, the estimation sample is [13,38]: the break can thus fall very close to or even after the end of the estimation sample, a very demanding set-up.
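The arithmetic behind these intervals can be checked with a short Python sketch. The function names are hypothetical, and the rounding conventions (rounding the number of trimmed observations down and the window bounds up) are assumptions chosen because they reproduce the intervals quoted in the text.

```python
import math

def trimmed_sample(T, trim):
    """First and last candidate break dates after trimming each end of 1..T."""
    cut = math.floor(trim * T)        # assumption: trimmed observations rounded down
    return cut + 1, T - cut

def break_window(T, fraction, half_width=3):
    """Interval [fraction*T - half_width, fraction*T + half_width], rounded up."""
    centre = fraction * T
    return math.ceil(centre - half_width), math.ceil(centre + half_width)

first, last = trimmed_sample(50, 0.25)        # estimation sample
base_window = break_window(50, 0.5)           # Base case
late_window = break_window(50, 0.75)          # Late break
late_break_outside = late_window[1] > last    # can the break fall after the sample?
```

With T = 50 the upper end of the Late break window lies beyond the last date of the estimation sample, which is what makes this set-up so demanding.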
The bootstrap algorithm described above is based on residuals of cointegrating regressions estimated for all units with a break at the median of the individual estimated break points, which is intuitively acceptable if we assume all units to be affected by breaks stemming from a common cause. However, even assuming each unit to be affected by at most one break over the period of interest, two rather different set-ups may arise: (i) the break periods may be widely dispersed over units, for instance because they stem from different causes, each one relevant to only some units; (ii) some of the units may not be affected by a break at all. The two following cases are designed to investigate these two scenarios in turn:
4. Twin breaks: as Base case, but in half of the units the break date is Uniform in [0.3T ± 3], and in the other half in [0.6T ± 3].
5. Partial break: T = 50, N from 10 to 40, break date Uniform in [0.5T ± 3] over 0.7N units (the first seven in each block of ten), no break in the remaining units. This case deserves some discussion. The key question here is the following: what is the null hypothesis of the panel stability test (say, $H^P_0$)? Let $H^i_0$ be that of the i-th individual test; then, one possibility is to take $H^P_0 = \bigcap_{i=1}^{N} H^i_0$, so that the panel null hypothesis is "stability in all units". However, this appears far too restrictive, especially in view of small sample applications where outliers may have a heavy influence on individual cases. Following Pedroni's (2004) view of the meaning of panel cointegration tests, we prefer the panel null $H^P_0$: "stability in a large number of units". In other terms, the aim of the test is to assess whether in the units examined the cointegrating relationship is mostly, but not necessarily always, stable. As in the set-up of this experiment the answer is negative ($H_0$ holds only in 30% of the units) we would like to have high rejection rates. Note that since this view of the test clearly requires fairly large cross-section sample sizes we set N ≥ 10.

To evaluate the improvements (in terms of both power gains and reduction in size bias) which could be expected by moving from a standard time series to a panel set-up, we also computed the average rejection rates of the asymptotic tests based on Hansen's (1992) asymptotic critical values computed for all individual units involved in each experiment³. Note that the comparison between the average performance of the asymptotic test on individual series and that of the panel tests with a smaller number of units (e.g., 5, 10 and 20 in the base case or 3 in the "Large T" case) should be taken as merely suggestive of a pattern, as the units involved are not the same.
Finally, after some experimentation with different options we decided to fix the number of Monte Carlo replications at 500 and that of bootstrap redrawings at 1000. Higher numbers of either would have delivered a small increase in the precision of the results not worth the large increase of the cost and time scale of the experiment (which, because of the recursive nature of the statistics evaluated, is computationally very demanding).
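As a rough guide to the precision implied by the number of replications, the Monte Carlo standard error of an estimated rejection rate can be computed under the standard binomial approximation, assuming independent replications. The sketch below is only illustrative and the function name is hypothetical.

```python
import math

def mc_standard_error(p, replications):
    """Binomial standard error of a rejection rate estimated from independent replications."""
    return math.sqrt(p * (1.0 - p) / replications)

for R in (500, 2000):
    # Standard errors at a 5% and at a 50% true rejection rate.
    print(R, round(mc_standard_error(0.05, R), 4), round(mc_standard_error(0.5, R), 4))
```

With 500 replications the standard error is roughly one percentage point for a true rejection rate of 5% and roughly two percentage points for a rate of 50%; halving it would require four times as many replications.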

Results
The results are reported in Tables 1A-6B below. In the Base case (T = 50, N from 5 to 40) the Type I errors (Table 1A) of the bootstrap panel tests show some positive size bias for N = 5 but converge fairly closely to nominal significance levels as N increases. The asymptotic tests on individual series deliver variable performances: the $L_c$ test is slightly oversized, while both the MeanF and the SupF appear to be conservative (the latter more than the former). The power gains offered by the panel tests are remarkable. Consistently with a priori expectations, the asymptotic tests have negligible power, while that of the panel tests is generally acceptable and definitely good for α = 10% and N ≥ 10 (e.g., 92% for N = 40, with Type I error 11%; Table 1B). Hence, using the panel tests grants considerable improvements with respect to aggregate tests in terms of both reduction of size bias and increase in power. In fact, with this time sample a panel approach seems to be the only viable option. In comparative terms, we find the Type I errors to be very similar for all three tests, while the SupF test appears to be marginally less powerful than the $L_c$ and MeanF. The results of the mean and median panel tests also appear very similar. Since these findings hold approximately in all the cases examined, the following comments are mostly expressed in general terms, with no reference to the specific tests.
Allowing for the different speed of adjustment of the DGPs employed, the "Large T" results (Tables 2A-2B) for the asymptotic tests are fully consistent with [...]: as we can see, the size bias is still noticeable, and power very poor. On the other hand, the Type I errors of the bootstrap panel tests essentially converge to nominal significance levels, and their power approaches 100% even with extremely small N. Hence, even with a rather large time sample a panel approach seems preferable.
When T = 50 and the breaks fall around 3/4 of the way through the time sample (Table 3), power falls dramatically, rarely reaching 50% for the mean test; the performance of the median test, although not brilliant, appears somewhat more robust. Since the upper extreme of the break interval (t = 41) falls after the end of the actual estimation sample (t = 38), these findings are not surprising, and make clear the great care necessary in using recursive stability tests.
The two experiments designed to check the robustness of the bootstrap procedure with respect to the nature of the breaks deliver comforting results. When the breaks come from two distributions, centred at the opposite ends of the sample (but not so close to them as in the previous case), the power loss caused by the misspecification of the cointegrating equation used to estimate the residuals to be bootstrapped is very small (Table 4). On the other hand, when 70% of the units are affected by the break it is interesting to see (Table 5) that the rejection rates seem to fall approximately in the same proportion (e.g., for N = 40 and α = 10% from 92.2% to 66.8%), so that if H0 does not hold in the majority of the units it is likely to be rejected by the panel test as well. Somewhat contrary to our expectations, in this set-up the mean and median tests deliver very similar results.
In a larger model with four explanatory variables (Tables 6A-B) we notice that the performance of the asymptotic tests is even worse than in the Base case. The Type I errors of the panel tests appear similar to the base case with only two variables, but unfortunately their power is somewhat smaller, possibly because the coefficients are estimated less precisely.
The overall conclusions to be drawn are now rather clear: consistently with [...], our experiments suggest that with a small or moderately large sample size (T ≤ 100) Hansen's (1992) asymptotic test has power ranging from very low to close to zero. A fairly general solution to this serious empirical shortcoming seems to be provided by a panel approach based on the bootstrap: in our experiments the Type I errors turned out to be generally close to nominal sizes, converging rather rapidly to nominal levels over both T and N, and power ranged from acceptable to good with α = 10% when the break is located around the middle of the sample. Although test power does not appear to be much affected by a wide dispersion of the breaks across units, and is (correctly) roughly proportional to the fraction of breaking units, it is important to keep in mind that it can be disappointing if the breaks fall towards the end of the sample (which is not surprising, since with a small time sample the marginal information becomes very small).

Empirical illustration: the Feldstein-Horioka Puzzle
As discussed in the Introduction, the apparent existence of a long-run link between investment and savings in advanced economies, where high capital mobility may allow the current account to be unbalanced for long periods, is one of the major empirical puzzles of contemporary macroeconomics (six altogether according to Obstfeld and Rogoff, 2000). Banerjee and Carrion-i-Silvestre (2004) and Di Iorio and Fachin (2007) investigated the issue on a data set including 14 European economies (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Portugal, Spain, Sweden, UK) over the period 1960-2002 using panel cointegration tests allowing for a single break in the cointegrating coefficients (either in the level only or in both the level and the slope, the latter being referred to in the literature as the "retention ratio"). Here we shall examine a subset of this panel, as in Finland and Portugal the Savings/GDP ratio was found to be stationary in our previous work. From the plots reported in Fig. 3A-B the existence of a long-run relationship with coefficient shifts appears plausible. Indeed, although Gregory-Hansen cointegration tests with breaks on individual economies generally fail to reject the null of no cointegration, their panel bootstrap versions do, suggesting that the failure to reject is merely due to low power. Hence, in the panel as a whole investment and saving do appear to cointegrate if breaks are allowed. The tests developed in this paper may help answer the next question, namely whether a break actually took place.
Recalling that the choice of the trimming coefficient may considerably affect the results, we computed all tests with both 25% and 12.5% trimming, always obtaining very similar results. Examining the individual statistics (Table 7; to save space we report only the results for 12.5% trimming) we find extremely strong evidence of instability in Belgium, while most of the remaining statistics are not significant. The failure of the asymptotic tests to reject the hypothesis of stability for the individual countries is puzzling in view of the graphical evidence, and the natural suspicion is that it may be merely due to the extremely low power to be expected from the tests with such a small sample size. In fact, moving to the panel tests we can see (Table 8) that the means of all statistics suggest strong rejection of the null hypothesis of stability, with p-values smaller than 5% (actually zero for the MeanF and SupF statistics). Since this outcome may be due to the strong evidence for instability in Belgium, it is important to look also at the medians. Here the evidence for rejection is weaker, with p-values between 10% and 15% for the $L_c$ and MeanF. However, recalling (cf. Table 1B) that with a panel of 12 units power must be expected to be rather low, such p-values should nevertheless be regarded as small enough to warrant rejection. We can thus appreciate how applying the panel procedure does grant a power gain with respect to the individual tests, allowing us to reach the more plausible conclusion that in this group of countries investment and savings do seem to be linked by a long-run relationship, but this is likely to have changed over time at least once. The next natural step is to estimate models allowing for coefficient breaks at the estimated breakpoints $\hat t^b_i = \arg\max(\mathrm{Sup}\hat F_i)$. Given the small time sample available these estimates should clearly be taken with great care.
This is especially true when the break falls near the extremes of the sample, although for the sake of robustness break estimates under 25% trimming have been used (hence, the break estimates are constrained to fall in the interval 1971-1992). The results (reported in Table 9) are indeed of some interest. In seven countries (Austria, Belgium, Germany, France, Ireland and Sweden, thus including two of the largest continental European economies), the retention ratio falls significantly after the break, consistently with the expectations of a progressive weakening of the long-run link between investment and savings in the advanced economies (Frankel, 1992). In the case of the United Kingdom the results are peculiar, as the retention ratio is negative before 1977 and turns positive afterwards. However, neither estimate is significant, suggesting that in this case there may not be an actual causal link of any relevance running from domestic savings to investment. This hypothesis is consistent with Kejriwal (2007), who, using quarterly data over 1957:1-2006:1, found no evidence of cointegration for this country. Finally, in the four remaining cases (Italy, Spain, Greece, Denmark), contrary to expectations, the retention ratio seems to increase. However, two remarks are in order: first, the associated coefficient is never significant (nor are the individual stability statistics, with the exception of Greece); second, in two cases (Italy and Spain) the estimated break points fall at the extremes of the interval in which they are constrained to lie (respectively, 1970 and 1991). From Fig. 2 we know that this is typical of cases when no break actually took place. Unfortunately, with the available sample size no reliable conclusions for individual cases can be reached, so it is impossible to shed more light on the issue. Clearly, the great care invoked above is fully warranted.

Conclusions
Our overall conclusion is that the proposed panel stability tests may grant considerable advantages. With time sample sizes rather common in macroeconomic datasets (e.g., 50 observations) the asymptotic tests appear to be essentially of no use, while the proposed panel bootstrap tests have Type I errors close to nominal sizes and acceptable power. An empirical illustration on the Feldstein-Horioka puzzle for a panel of 12 economies over the period 1960-2002 shows how the bootstrap panel stability tests lead to a more plausible conclusion (cointegration with at least one break) than the asymptotic tests applied to each individual country (which, with a few exceptions, do not reject stability). Among the points on our research agenda we can mention generalising our procedures to tests of the hypothesis of breaks limited to only some of the variables, implementing some block-length selection algorithm, and exploring the use of the Bewley (1979) transform.