Volatility filtering in estimation of kurtosis (and variance)

Abstract The kurtosis of the distribution of financial returns characterized by high volatility persistence and thick tails is notoriously difficult to estimate precisely. We propose a simple but effective procedure of estimating the kurtosis coefficient (and variance) based on volatility filtering that uses a simple GARCH model. In addition to an estimate, the proposed algorithm issues a signal of whether the kurtosis (or variance) is finite or infinite. We also show how to construct confidence intervals around the proposed estimates. Simulations indicate that the proposed estimates are much less median biased than the usual method-of-moments estimates, their confidence intervals having much more precise coverage probabilities. The procedure alsoworks well when the underlying volatility process is not the one the filtering technique is based on. We illustrate how the algorithm works using several actual series of returns.


Introduction
Every empirical paper reports summary statistics for the data used in the study. Many textbooks do so as well for educational purposes. Among this output one can always see the sample variance and often sample kurtosis (raw or more frequently in the form of a kurtosis coe cient). When the data are nancial returns, these statistics, especially the sample kurtosis, are highly unreliable (e.g., [3], [28]). Indeed, when the data are highly persistent and thick-tailed, as are typical nancial returns, the sample moments typically underestimate the true moments thus giving a misleading picture. The fact that higher-order moments of nancial returns are notoriously hard to precisely estimate has even prompted some researchers to look for alternative measures of those data features they are meant to capture ( [28]).
Technically, even though the mathematical expectation of (non-centered) sample moments matches the population moments exactly, their distributions are highly skewed so that the major probability mass is concentrated away from the true parameter. The following experiment gives an illustration. We generate a series following simple zero mean ARCH(1) process ( [10]) with news impact parameter α, unit variance and standard normal standardized innovations. We set the sample size to which is larger than typical samples. Table 1 shows median biases of the sample kurtosis and variance expressed in percentage points, i.e. medians − θ θ × % across 1,000,000 simulations, whereθ is sample kurtosis or sample variance, and θ is the corresponding theoretical value (equaling ( − α )/( − α ) or , respectively). One can see that when α is relatively small, so is the median bias, but as α approaches the boundary of existence of the population kurtosis and variance ( / √ and 1, respectively), the median bias becomes enormous. Needless to say, often nancial returns operate in a close proximity of these boundaries.  It follows, in particular, that sample kurtoses documented in applied literature are very likely to be underestimating their true values. For example, in [5] the sample estimates for the kurtosis coe cient (left side of Table 1, p.544) are way below the values implied by the estimated GARCH-t model parameters.¹ The empirical nance literature documents stylized facts that real-world nancial returns exhibit heavy tails, sometimes to the extent that fourth or even second moments may be in nite (see, among others, [31] and [7]). For example, when power laws of return distributions are estimated, the tail index estimates often fall into the interval ( , ) implying that no nite kurtosis exists ( [31], [17]) or may even turn out smaller than 2 implying that no nite variance exists ( [32], [24]).² While the population moments may not exist, the sample analogs are always nite,³ and only their relatively large value may give a hint that the moments may not in fact exist. The gures for the sample kurtosis as 99.7 and 119.2 for some daily stock return indices documented in [14] (Table 1.1) probably re ect the fact that the population kurtosis is in nite. The kurtosis gures like 63 or 77 for exchange rate futures documented in [7] are also suspect, and, according to the experimental evidence above, the actual kurtosis may be seriously larger, even if it is nite.
Given the degree of persistence, the problem of biased sample moments is much more severe for the fourth moment (i.e. kurtosis) than for the second moment (i.e. variance). However, even for the latter it is practically relevant. The simulation evidence in [1] shows that for a GARCH model the 'variance targeted' 1 For example, for the British pound (left side of rst line in Table 1 on p.544) the reported sample estimate is . while the GARCH-t parameters (right side of rst line in Table 1 on p.544, and rst line in Table 2 on p.545) imply the unconditional kurtosis coe cient of (see formula (3) ahead) ≈ . .
2 See the detailed discussion of heavy-tailed power law distributions, their estimation and application to nancial data in [25], and of their connection with GARCH models in [37]. 3 See, for example, [9] on the limiting behavior of sample variances and autocovariances in such cases. estimate of the variance (i.e. sample variance) is (up to) twice as median biased as its estimate obtained within the GARCH framework.⁴ In this article we propose a method that in a way is 'opposite' to variance targeting, which is also (and primarily) applicable to computing the kurtosis coe cient. In a nutshell, the proposed volatility ltering (VF) technique implies passing the return series through a GARCH lter using gaussian quasi-maximum likelihood (QML) estimation, computing kurtosis and variance of the ltered series, and restoring the implied kurtosis and variance of the original series by 'reverting' the estimated lter. The resulting estimates are going to be more precise because computation of moments for highly persistent series (that are subject to big estimation biases) is thus avoided. The trick is reminiscent of prewhitening the errors in asymptotic variance estimation in time series (see [2]). Note that the underlying volatility process need not be the same GARCH as long as the ltering signi cantly reduces the degree of return persistence. The ltering can be based on the simple GARCH(1,1) process that often happens to be suitable for description of the conditional variance of nancial returns, and, if the out-of-sample performance can be regarded as a signal of a 'true' model, GARCH(1,1) is often statistically su cient for that purpose ( [20]).
We verify how the VF technique works in simulations relative to the method of moments (MM). The VF estimates are much less median biased than the MM estimates, not only when the true underlying process is GARCH of the same order, but also when it is nonlinear GARCH from another class or when its order is higher. For practical purposes, ltering using orders (1,1) appears to be su cient indeed. When the true kurtosis is in nite, the algorithm issues such signal often enough, up to almost always for su ciently big samples.
We also show how to construct asymptotic con dence intervals (CI) for MM and VF estimators and verify their actual coverage probabilities. The MM asymptotic CI are poorly motivated because of problems with existence of higher order (to be more exact, 8th) moments of the return series. The asymptotic CI for VF in contrast require more relaxed moment conditions (existence of 8th moments as well, but of the innovation series instead of the return series) and hence are much better motivated. Even though they involve computations of certain derivatives, they are still easy to construct numerically. We also utilize the robust CI by [26] based on group t statistics. We compare coverage rates in simulations and nd out severe distortions in the case of MM and much smaller distortions in the case of VF.
Finally, we illustrate discrepancies among di erent estimates and their standard errors using several actual series of various nature -stock index returns, individual stock returns, exchange rate returns, -and frequency -monthly, weekly, daily. These discrepancies turn out to be quite large for estimates of the kurtosis coe cient, while a comparison of standard errors con rms inadequacy of those for MM.
The article is organized as follows. Section 2 describes the technique. Section 3 contains simulation evidence on the performance of the VF method in comparison with the method of moments. Section 4 analyzes construction of con dence intervals, while Section 5 compares their coverage properties. Section 6 outlines an extension to the multivariate case. Section 7 shows an illustration with real data. Section 8 concludes. The Appendix contains auxiliary technical details and proofs of propositions.

Point estimation: How does it work?
For observable zero mean strictly stationary observable series of returns r t , the method-of-moment (MM) estimates for the second moment (i.e. variance) σ = Er t 4 The 'variance targeting' detour for GARCH models ( [12], [16]) helps reduce the dimensionality of the problem, which is useful in a multivariate context. The method boils down to estimating the variance from the sample at the rst step, subsequently plugging this estimate into the GARCH equation. and the fourth moment normalized by the squared variance (i.e. kurtosis coe cient) Now we show how estimates of the variance and kurtosis coe cient can be constructed using the GARCH(1,1) and GARCH(2,2) volatility ltering. Consider the GARCH( , ) model ( [4]) with possibly non-normal standardized innovations: where ε t ∼ i.i.d. D( , ) with nite fourth moment (kurtosis) ν = Eε t , and volatility (conditional variance) follows For further use, denote θ = (ω, α, β) . The variance of r t can be expressed as (see [4]) provided that it is nite, which happens when The kurtosis coe cient of r t can be expressed as (see [21]) provided that is nite, which happens when The algorithm of obtaining the volatility ltered (VF) estimates goes as follows: Step 1. Run⁵ GARCH(1,1) QML on r t . Obtain GARCH parametersω,β andα and standardized errorsε t . Compute an estimates of kurtosis of standardized innovations aŝ Step 2. Ifβ +α ≥ , setσ VF = ∞. Otherwise, computê Step 3. Ifβ + αβ +α ν ≥ , setκ VF = ∞. Otherwise,⁶ computê κ VF =ν − (β +α) − (β + αβ +α ν ) . 5 For raw returns, a conditional mean (e.g., a constant, or AR(1) as appropriate) should be added. 6 Alternatively, one could also setκ VF = ∞ before checking for niteness of kurtosis ifσ VF = ∞ is issued at step 2, because the kurtosis is in nite whenever variance is in nite. However, we stick to the algorithm in the text becauseσ VF = ∞ is sometimes issued spuriously even when both variance and kurtosis are nite.

Point estimation: How well does it work?
To verify the performance of the VF technique and compare it with that of MM we run simulations with runs of arti cial data of length T = , and , and read o 25%, 50% and 75% quantiles of distributions of estimates of κ and σ from both procedures. We concentrate on quantiles because of likely problems with existence of moments. Most of the time we focus on the kurtosis coe cient κ, because it is for the kurtosis that the problem under consideration is most acute; still, we look at the results for the variance σ as well.⁸ The parameter ω is set to unity unless otherwise noted. Table 2a shows the results for kurtosis estimates when the standardized innovations ε t are standard normal implying ν = , for several combinations of variance equation parameters. The rst two combinations are empirically plausible and imply a moderately big kurtosis coe cient between 5 and 6. The third combination implies reduction to a simpler ARCH(1) true process. The fourth combination implies very weak volatility persistence and almost mesokurtic tails.

Quantiles ofκ
One can see that in the case of strongly persistent GARCH process (the rst two panels) the VF estimates are much less biased than the MM estimates. When the sample size is large, the VF estimates are nearly median unbiased, while for the MM estimator the [25%,75%] interquartile range does not even contain the true value of kurtosis! For the less persistent ARCH process (the third panel) the VF estimates are quite precise even for the smallest sample size, while the true parameter is barely contained in the MM [25%,75%] interquartile range for the largest sample size. When the volatility persistence is very weak (the fourth panel), both procedures yield very similar results. Table 2b shows the results for the former two parameter combinations when the standardized innovations ε t are (standardized) Student with η = degrees of freedom implying ν = . This fattens the tails signi cantly resulting in a kurtosis coe cient larger than 10.
A larger kurtosis makes both estimators more variable, hence wider interquartile ranges. The VF estimator is especially prone to this, especially the right tail. However, in all cases its [25%,75%] interquartile range covers the true value of kurtosis, which does not fall far from the median, except for the smallest sample size. The MM estimates, in contrast, are very biased to the left, and their [25%,75%] interquartile ranges never cover the true value, even for the largest sample size. Table 2c shows the results with standard normal standardized innovations for the rst of previous parameter combinations with normal innovations but in ated news impact parameter α so that the kurtosis is either in nite or huge. The table additionally reports percentages of cases when the VF procedure classi es the kurtosis to be in nite.

Quantiles ofκ
In these cases the MM estimation gives ridiculously small estimates of kurtosis, the downward bias is huge. The VF estimates are also median biased downward, but to a much lesser degree. When the actual kurtosis in in nite, so is the median VF estimate for all sample sizes; in fact, the whole [25%,75%] interquartile range is in nite when the sample size is big enough, and the score in classi cation is large for the medium sample size and almost perfect for the largest sample size. When the actual kurtosis in nite but very large, it takes quite a big sample to reach the true value of kurtosis in median terms, and the proportion of in nite classi cations, although moderately large, does not exceed 50%.

. Kurtosis estimation under volatility misspeci cation
The volatility process followed by real data may not necessarily be GARCH(1,1). Now we verify how well the GARCH(1,1) volatility ltering works when the actual process is in fact more complex. We consider two cases: when the actual process is linear GARCH but of higher orders, (2,2), and when the actual process is some nonlinear GARCH, but of the same orders, (1,1). Table 3a shows the results for kurtosis estimates when the underlying process is GARCH(2,2) with standardized normal innovations. In the second parameter combination, the weights placed on second lags are much larger than in the rst parameter combination, otherwise the sum of α's and sum of β's are kept the same. The true kurtosis coe cients are also similar. One can see that GARCH(1,1) ltering when the true process is of higher order still works very well, with the largest sample size yielding nearly median unbiased VF estimates, while the MM estimates are still far from the true parameter both in median and interquartile terms.  Table 3b shows the results for kurtosis estimates when the underlying process is one of three nonlinear GARCH(1,1) models with standardized normal innovations, and a stochastic volatility (SV) model. The GARCH models are GJR ( [18]), TGARCH ( [38]), and EGARCH ( [33]), and the SV model is SARV ( [36]).

Quantiles ofκ
The formulas for kurtosis coe cients for these models can be found in the Appendix A.1. The GJR equation where I {rt− < } is indicator function. In contrast to the linear GARCH, it incorporates the leverage e ect, i.e. asymmetry with respect to the sign of the past innovation. We set the news impact coe cients pretty di erent in size for the innovations of di erent sign. Although on average the news impact is the same as in the rst GARCH(1,1) process in Table 2a, the kurtosis coe cient is larger because of this asymmetry. The TGARCH equation is formulated for the conditional standard deviation: We incorporate a distinguished leverage e ect like that for the GJR model. The (restricted) EGARCH equation is formulated for logs of conditional variances: The SARV equation is also formulated for logs but with an unobservable volatility shock in place of the observable driving process: where As expected, the MM estimates are severely biases downward, and in the three nonlinear GARCH cases the VF estimator still fares better in median terms. For the SV process, GARCH volatility ltering has a harder time whitening the squared returns, due to the unobservable component being the driving process of volatility. Interestingly, for the TGARCH process the sign of the bias for the VF estimator is positive rather than negative, and so is it for the SARV process. This may be a consequence of a combination of high asymmetry and additional highly nonlinear transformation of the conditional variance in the TGARCH case and unobservable driver of volatility in the SARV case. These are the cases when the VF estimator does not attain the goal for large sample sizes; for smaller ones the positive and negative biases partially cancel each other.

. Kurtosis estimation with higher order GARCH ltering
The evidence presented in the previous subsection poses a question: if the actual volatility dynamics is nonlinear relative to GARCH, or its order is higher, would increasing the orders of the GARCH process used in ltering help better capture the deviations from the actual process and make a positive impact on the precision of VF estimates? Table 4 contains the results when instead the GARCH(1,2) and GARCH(2,2) lters are employed, and the actual processes are those from Tables 3a and 3b. Here we set the sample size to T = . Interestingly and unexpectedly, for nonlinear/SV deviations from GARCH the use of higher orders does not bring improvements and in fact sometimes the performance of VF estimates deteriorates. In particular, in the GJR and TGARCH cases the median bias is larger when a higher order GARCH lter is used in place of GARCH(1,1). In the EGARCH and SV cases there is some improvement, but it is not large. When the true process has orders (2,2) but only (1,1) (or (1,2), for that matter) are assumed, the performance of the VF technique changes very little, if at all. This is true even when the coe cients on second lags are bigger than those on rst lags in the GARCH(2,2) speci cation. This phenomenon can be explained as follows. First, even though the GARCH(2,2) model better accounts for the true persistence structure, most of it can be well captured by the GARCH(1,1) structure. Indeed, the GARCH literature suggests that the GARCH(1,1) model is able to provide volatility forecasts that are no worse than those provided by more sophisticated models ( [20]). Secondly, an addition of one or two more param-eters is a sharp increase in model complexity, and it seems that estimation thereof reduces the precision of ingredients of the kurtosis estimate to a greater degree than the deviation from the true persistence structure.

. Estimation of variance
Finally, we compare MM and VF variance estimates. Table 5 shows the results with two processes with high persistence. The rst parameter combination implies IGARCH structure and hence in nite variance. In the second parameter combination, the news impact parameter is a bit smaller resulting in a covariance stationary process implying big but nite variance.
When the actual variance is nite, the VF [25%,75%] interquartile range is wider than that of the MM estimator. However, it is centered much closer to the true parameter, and the discrepancy diminishes quickly with the sample size. The percentages of incorrect classi cation diminishes very fast with the sample size. When the actual variance is in nite, typical VF estimates increase very quickly with the sample size, with the percentage of correct classi cation also growing, although more slowly.

Con dence intervals: How does it work?
One may want to surround the point estimates, if they turn out to be nite, by con dence intervals (CI). We consider two ways of their construction: one is based on the Delta method supported by the conventional large sample theory; the other is correlation and heterogeneity robust inference by [26] based on nite sample inference for group t statistics.

. Asymptotic con dence intervals
Consider rst the inference based on conventional large sample theory. In a nutshell, it amounts to applying the Delta method to time series asymptotics for basic estimators obtained from the sample -the second and fourth sample moments for MM, and the GARCH parameter estimates and fourth sample moments of standardized returns for VF.
For the MM estimator, we have Proposition 1. Suppose the return series r t is stationary and ergodic with E r t < ∞. The MM estimator (σ MM ,κ MM ) is consistent and asymptotically normal with the asymptotic variance is a matrix of analytical rst derivatives of the mapping from the vector of sample second and fourth moments to the vector (σ , κ) , and is the long-run variance of the vector of sample second and fourth moments.
The result of Proposition 1 follows by the Delta method applied to the transformation from the second and fourth moments of returns to the moments guring in the de nitions of σ and κ. Because the former exhibit serial correlation, HAC variance estimation is needed.
The asymptotic variance V MM can be consistently estimated bŷ is a consistent estimator of the matrix G MM , andŴ MM is heteroskedasticity and autocorrelation consistent (HAC, e.g., [34]) estimator of W MM . To construct a VF con dence interval one needs the joint asymptotics of the QML estimatorθ = (ω,α,β) of the GARCH parameters and sample fourth moments of standardized returnsv . Denote vr = E ε r t for r = , . We have Proposition 2. Suppose the return series r t follows a strong GARCH(1,1) process (1) with β + αβ + α ν < , v < ∞, and θ lying in the interior of a compact parameter set. The VF estimator (σ VF ,κ VF ) is consistent and asymptotically normal with the asymptotic variance where G VF is the matrix of rst derivatives of the mapping from the vector of GARCH parameters and fourth moments of standardized returns to the vector (σ , κ) embedded in the VF algorithm, and W VF is a covariance matrix of GARCH parameters and fourth moment of standardized returns The result of Proposition 2 follows by the Delta method applied to the transformation from the GARCH parameters and fourth moments of standardized returns to the moments guring in the expressions (2) and (3) or (4) and (5). Because the score function and standardized returns are martingale di erences, HAC variance estimation is not needed.
The matrix G VF is analytically rather convoluted, but it can be easily computed using numerical methods once the mapping is programmed, and estimated by plugging in the available estimates of components. The matrix W VF can be computed using the asymptotics in the Proposition that extends the asymptotics for GARCH coe cients in [15]. It can be easily estimated during the numerical minimization of the average quasilikelihood. Then, an asymptotic variance estimate for the VF estimator (σ VF ,κ VF ) iŝ whereĜ VF consistently estimates G VF , andŴ VF consistently estimates W VF . Note that the asymptotics for the VF estimator is well motivated when v < ∞. This moment condition is much weaker for persistent data than the condition E r t < ∞ needed for asymptotics of the MM estimator to hold. The MM asymptotic CI are poorly motivated as these computations presume existence of as high as eighth moments of returns. In the situations under consideration such moments are unlikely to exist, and even when they do, their method-of-moment estimates yield very poor approximations (the e ects described in the Introduction for second and fourth moments are, of course, further exacerbated for eight moments). This factor, together with biasedness of the estimates, is expected to make the MM con dence intervals rather spurious. In contrast, the normal asymptotics holds for the GARCH parameters without a requirement of nite eighth moments; in fact, asymptotic normality may not fail even when the unconditional kurtosis is in nite provided that the conditional kurtosis is nite (e.g., [29], [15]). Also note that MM inference uses HAC variance estimates, which often have poor nite sample properties, while VF inference does not. The decisive factor, however, in di erent precision of CI is expected to be the di erent biasedness properties of the two point estimators.

. Robust con dence intervals
Now consider the correlation and heterogeneity robust inference by [26] (IM henceforth). This way of constructing con dence intervals avoids tedious, especially for VF, estimation of derivatives and asymptotic variances; on the other hand, it requires repeated estimation of the GARCH model.⁹ Suppose we divide the sample into a xed number q of equal non-overlapping groups of size τ, where T = τq, and compute estimates (σ j ,κ j ) , j = , ..., q, on all groups. Suppose that ≤ q ≤ and ϕ ≤ . , where ϕ is the CI level or test size. Then, we have Proposition 3. Suppose that (σ j ,κ j ) q j= are consistent, asymptotically normal and asymptotically independent as τ → ∞, with asymptotic variances uniformly bounded from below. Then, the t statistics over the groups, These 'robust' t tests are nite sample in the sense that a nite sample Student's distribution is used for a sample of xed size q, and asymptotic in the sense that an asymptotically in nite sample is needed to ensure uncorrelatedness of these q 'observations'. By inverting the IM tests, one can construct IM con dence intervals for σ and κ simply asσ ∓ cv ϕ sσ / √ q andκ ∓ cv ϕ sκ / √ q, where cv ϕ is the − ϕ -quantile of the Student's t distribution with q − degrees of freedom. As mentioned above, the IM con dence intervals require no tedious estimation of asymptotic variance ingredients, but require estimation of the GARCH model multiple times (which is, however, not a big cost). The latter fact in the current context imposes a pressure on the number of groups from above: each group has to be long enough for the GARCH model to be estimable and these estimates to be reliable. There is yet another peculiarity of IM con dence intervals: because they are centered around an average over groups rather than around the whole sample point estimate, the point estimate may be severely shifted away from the CI center or may not belong to the CI at all. Such outcomes may leave an empirical researcher who is used to symmetric con dence intervals a bit dissatis ed. Table 6: Coverage probabilities of 95% asymptotic con dence intervals from MM and GARCH(1,1) VF procedures. Table 6 contains results of simulation experiments with several DGPs used before. Two are correctly speci ed GARCH(1,1) models with basic parameter values except the tail thickness of standardized returns -one has normal tails implying ν = , the other has Student's tails implying ν = . Two other DGPs are misspeci ed -one follows GARCH (2,2), and the other follows GJR(1,1). The true variance is σ = in all DGPs, the true value of kurtosis varies across DGPs. We report actual coverage probabilities of nominally 95% asymptotic con dence intervals from both estimators, MM and VF, and two inference tools, ∆M (for 'Delta method') and IM. In the case of asymptotic CI, we use the HAC variance estimator by [24] for the MM, and the two-sided di erence scheme for obtaining partial derivatives with the step h = − times the parameter value, for VF.¹⁰ In the case of IM inference, we use q = when T = , q = when T = , and q = when T = ; these choices are mostly driven by the need to allocate a su cient number of observations into each group for the volatility lter to be computationally feasible, and to set q to be slowly increasing with T.

Con dence intervals: How well does it work?
First consider the asymptotic inference based on the Delta method. One can notice immediately that the MM con dence intervals have a dramatic undercoverage problem, especially those for the kurtosis coe cient. Even though the MM coverage probabilities increase with the sample size, even with T = the actual coverage is far from nominal: for variance, the maximal coverage is about 75%, for kurtosis -mere 55% (and often much smaller). The VF con dence intervals are undercovered to a much lesser degree, uniformly. Still pretty serious undercoverage for relatively small samples quickly diminishes when the sample gets larger, and with T = the actual coverage for variance gets almost nominal, while for kurtosis it rises to relatively tolerable levels.
The VF coverage probabilities for variance do not vary or vary very little with DGP and do not depend much whether it is correctly or misspeci ed. Comparing the VF coverage probabilities in two top panels and two bottom panels one can conclude that the coverage properties do not vary much with whether the true DGP is misspeci ed or not in the VF procedure. It is the value of kurtosis that a ects coverage: the higher the kurtosis coe cient, the higher are coverage distortions. Highest size distortions for kurtosis appear for the second DGP where the kurtosis is highest because the tails of standardized returns are thickest. Indeed, the problems of estimation of fourth moments partially carry over to standardized returns even though they are not persistent. All remaining VF coverage distortions evidently originate from two sources: one is a bias in estimates of GARCH coe cients, and the other is uncertainty in estimation of a fourth moment of standardized returns.
Now consider the inference based on the IM method by. For both estimators and across all simulation designs, the IM con dence intervals have much better coverage properties that those based on the Delta method. For the largest sample size and estimation of variance, the VF con dence intervals based both on ∆M and on IM provide ideal coverage, though the coverage distortions for kurtosis kick in even for the largest sample size when the Delta method is used. For medium and small samples, the IM method clearly dominates in terms of coverage, be it for variance or for kurtosis. Having such good coverage is accompanied though with the CI being non-symmetric around the possibly biased VF point estimates and not even including them from time to time.

Extension to multivariate case
The extension to a multivariate context is straightforward, albeit technically tedious. Here we write out the VF algorithm when series of returns r t is -dimensional, > , and the returns are conditionally spherically distributed (cf. [19]).
The distribution D of standardized innovations is assumed to be spherical with fourth cross-moments c = Eε i,t ε j,t = Eε i,t for all i, j = , ..., , i ≠ j.
Denote n = ( + )/ . The vectorized variance of r t can be expressed as (see [11]) provided that it is nite, which happens when all eigenvalues of B + A lie inside the unit circle. From [19], the kurtoses and co-kurtoses are elements of the following vector: and L , D , D + , C are certain matrix manipulation matrices (see [19]), provided that this quantity is nite, which happens when all eigenvalues of Z lie inside the unit circle. The collection of coe cients of kurtoses and co-kurtoses can be extracted from K and Σ: The algorithm of obtaining VF estimates then goes as follows: Step 1.Run¹¹ unrestricted or restricted multivariate GARCH(1,1) QML on r t . Obtain MGARCH parametersΩ,B andÂ and standardized innovationsε t . Compute an estimate of fourth cross-moments as, for example, Compute estimateẐ of Z from estimatesΩ,B,Â andĉ.
Step 3.If the largest eigenvalue ofẐ is larger than unity in modulus, setK VF = ∞. Otherwise, computeK VF and extract relevant values ofκ iiVF andκ ijVF for i, j = , ..., , i ≠ j. If the whole covariance/co-kurtosis matrix is classi ed as in nite, it is possible that some variances/kurtoses are still nite. If they are of interest, one may reduce the dimensionality of the return vector and repeat the procedure for a subset of returns.

Empirical illustration
In this section, we demonstrate the VF technique on typical nancial return series of di erent frequencies (daily, weekly, monthly) and di erent nature (stock index, individual stock, exchange rate). We compare the estimates of variance σ and kurtosis coe cient κ from the MM procedure and VF procedures using GARCH(1,1) and GARCH (2,2) in addition to a constant conditional mean. Table 7a contains the results.
While the variance point estimates are in agreement across the methods, the point estimates of the kurtosis coe cient are not. While the former tend to be a bit higher, although not always, under volatility ltering, the latter are often two to three times larger (when nite) with ltering than without it. In some cases the VF method signals on non-existence of the fourth moment. This tendency is, of course, more clear for series that are likely to have thicker tails (possibly due to higher volatility persistence) -high frequency index and stock returns, while for exchange rates the di erence is not so high.
Most dramatic are di erences in standard errors provided by the two methods. Those for the VF variance estimates are typically 2-4 times larger than those for MM estimates. The latter standard errors provide too optimistic picture of actual estimation uncertainly. This tendency is exacerbated for kurtosis estimates. Even for moderately kurtic data (like monthly IBM returns and monthly dollar/yen exchange rate) the VF standard errors exceed those provided by MM by a fewfold. For more kurtic returns the ratio may be greater than ten or even twenty. For very kurtic data with a probably nite kurtosis (like monthly S&P500 returns), the di erence may be a few dozenfold. In cases of likely non-existent kurtosis, the MM shows moderate gures for the kurtosis coe cient, with inadequately small standard errors. Often, the MM estimate plus two MM standard deviations falls way short of the VF point estimates for the kurtosis coe cient.
Interestingly, while the gures across di erent orders of GARCH ltering are pretty similar, there are discrepancies for (most persistent and most thick-tailed) index returns. For the weekly S&P500 returns, the GARCH(1,1) ltering detects in nite kurtosis while the GARCH(2,2) ltering still regards it nite but very large (almost three times larger than the MM estimate computed from the raw data), and the huge VF standard errors re ect high uncertainty embedded in the estimate.  Note: Asymptotic standard errors are below point estimates.
. [ . , . ] . [ . , . ] . [  We also consider the bivariate case for daily and monthly exchange rate pairs. We compare the estimates of covariance matrix Σ and relevant elements of the co-kurtosis coe cient matrix K from the MM procedure and VF procedure that uses BEKK(1,1) ( [11]) in addition to a constant conditional mean vector. Here, because the use of the Delta method is even more cumbersome with the multivariate model, we construct robust IM 95% level con dence intervals instead of asymptotic ones. Table 7b contains the results. The MM variance and kurtosis estimates are equal or very similar to those from the univariate MM procedure, and the covariance and co-kurtosis values are pretty moderate. While the estimates of the former from the multivariate volatility ltering procedure tend to go up as expected, the estimates of the co-measures also increase in absolute value, especially the co-kurtosis coe cients (up to almost twice). The 95% con dence intervals are quite narrow for the daily data, but are little informative for the monthly data, with the con dence intervals for positive parameters often covering large negative areas.

Concluding remarks
The volatility ltering method is an e ective way to compute second and especially fourth moments of persistent series by avoiding a necessity to apply the method of moments to a persistent series. As our simulations show, the VF estimates are much less median biased compared to the MM estimates, and the con dence intervals have actual coverage properties much closer to nominal. This holds not only when the GARCH(1,1) process used in ltering coincides with the true process driving the persistence, but also for a wide range of other GARCH-type processes with di erent persistence structures.
The remaining median biases and coverage distortions of the VF kurtosis estimator in a correctly speci ed persistence model are mainly due to the biasedness (even in mean terms) of GARCH coe cients which is intrinsic even in a correctly speci ed GARCH model (see, e.g., [1]). While these fade o as the sample size grows, this happens pretty slowly. One may increase e ciency of GARCH and fourth moment estimates by using QML based on heavy-tailed distributions ( [13]), or by applying bias correction to GARCH parameters -analytical ( [30]) or via bootstrap ( [8]). However, these re nements are likely to complicate the algorithm seriously threatening its simplicity.
One bigger development of the method may constitute an attempt to tie the form (e.g., the lag order) of the process used to lter the volatility to some computable persistence characteristics. Another possible extension is changing the volatility process to one that allows analytic computation of all moments of the series, not only of the fourth and second.

A. Kurtosis and variance formulas for non-GARCH processes
The following formulas for the kurtosis coe cients can be found in [22], [23] and [36].

A. Proofs of Propositions
Proof of Proposition 1: Straightforward, using the Ergodic Theorem, Central Limit Theorem for general dependent observations of a stationary series, and the Delta method.
Proof of Proposition 2: The (minus) (twice) normal quasi-likelihood for observation t is t = r t σ t + log σ t .
Following [15], formula (2.10), denote where t is the true likelihood for observation t. Following the proof of Theorem 2.2 in [15], the joint asymptotics for average ∂ t /∂θ and average ε t is using the representation (4.12) from [15], where Further, Finally, it follows that the asymptotic variance of VF estimator can be computed via the Delta method: where G VF is the matrix of rst derivatives of the mapping from the vector of parameters and average fourth moments of standardized returns to the vector (σ , κ) .