Abstract
A growing literature on inference in difference-in-differences (DiD) designs has been pessimistic about obtaining hypothesis tests of the correct size, particularly with few groups. We provide Monte Carlo evidence for four points: (i) it is possible to obtain tests of the correct size even with few groups, and in many settings very straightforward methods will achieve this; (ii) the main problem in DiD designs with grouped errors is instead low power to detect real effects; (iii) feasible GLS estimation combined with robust inference can increase power considerably whilst maintaining correct test size – again, even with few groups; and (iv) using OLS with robust inference can lead to a perverse relationship between power and panel length.
1 Introduction
In labor economics and other empirical, policy-oriented fields, difference-in-differences (DiD) designs are an extremely common way of estimating the effects of policies or programs (henceforth “treatment effects”). A recent literature has highlighted that failure to appropriately quantify the uncertainty surrounding DiD estimates can lead to dramatically misleading inference (e.g. Bertrand, Duflo, and Mullainathan 2004; Cameron and Miller 2015). In particular, researchers will tend to reject true null hypotheses with a probability that is far higher than the nominal size of the hypothesis test. The literature has suggested that obtaining tests that are close to the correct size requires non-standard techniques, and that it may not be possible with a small number of groups (Bertrand, Duflo, and Mullainathan 2004; Cameron, Gelbach, and Miller 2008; Angrist and Pischke 2009).
In this paper we report evidence from Monte Carlo simulations that emphasises a different conclusion. We make four main points. First, our simulations demonstrate that in many typical DiD settings tests of the correct size can be obtained with very straightforward methods that are trivial to implement with standard statistical software (in fact, STATA’s cluster-robust inference implements these methods by default); and in settings where this works less well, a bootstrap-based approach highlighted by other authors (e.g. Cameron, Gelbach, and Miller 2008; Webb 2013) provides a reliable alternative. All this is true even with few groups. Second, these techniques have very low power to detect real treatment effects. Thus the real challenge for inference with DiD designs is power rather than test size. Third, our simulations show that substantial gains in power can be achieved using feasible GLS. Moreover, our simulation results suggest that the combination of feasible GLS and cluster-robust inference can control test size, even if the parametric assumptions about the error process implicit in feasible GLS estimation are violated, and even with few groups. Fourth, using OLS with robust inference can lead to a perverse relationship between power and panel length. This is another reason to favour feasible GLS. In summary, we highlight the need for applied researchers using DiD designs to pay careful attention not just to consistency and test size, but also to the efficiency of their estimators, and we recommend the use of feasible GLS combined with cluster-robust inference as a solution to this problem.
DiD designs often use microdata but estimate the effects of a treatment which varies only at a group level at any point in time (e.g. variation in policy across US states). A consequence is that within-group correlation of errors can substantially increase the true level of uncertainty surrounding the treatment effect (e.g. Angrist and Pischke 2009; Donald and Lang 2007; Moulton 1990; Wooldridge 2003; Cameron and Miller 2015). Furthermore, treatment status is typically highly serially correlated. In fact, in the most common case on which we focus in this paper, treatment is an “absorbing state”: once a group is treated, it remains treated in all subsequent periods. This means that serially correlated error terms are likely to have a large impact on the true level of precision with which treatment effects are estimated. In a well-cited Monte Carlo study using US earnings data, Bertrand, Duflo, and Mullainathan (2004) show that accounting only for grouped errors at the state-time level whilst ignoring serial correlation led to a 44% probability of rejecting a true null hypothesis using a nominal 5% level test. So, for example, when evaluating a labor market policy implemented in certain regions from a particular point in time onwards, a researcher should worry both that people in the same region at the same time are affected by common labor market shocks (unrelated to the policy) and that these regional shocks are serially correlated.
A simple approach to deal with both cross-sectional and serial correlation in within-group errors would be to use the formula for a cluster-robust variance matrix due to Liang and Zeger (1986). This is consistent and Wald statistics which use it are asymptotically normal, but the asymptotics apply as the number of clusters tends to infinity. By clustering at the group level rather than the group-time level to account for serial correlation, one is often left with few clusters. The finite sample (i.e. few-clusters) performance of this approach – an empirical question – then becomes crucial, and the literature to date has come to pessimistic conclusions about it. Bertrand, Duflo, and Mullainathan (2004) and Cameron, Gelbach, and Miller (2008) use US earnings data and generate placebo state-level treatments before estimating their “effects.” Forming t-statistics using cluster-robust standard errors (CRSEs), they obtain 9% and 11% rejection rates using nominal 5% level tests with samples from 10 and 6 US states respectively.^{[1]} This is a considerable improvement over using OLS standard errors, when rejection rates are more than 40%. But it is still approximately double the nominal test size.
The crucial finding of Bertrand et al. and others – that inference can go badly wrong in DiD unless one is very careful – is confirmed once again in our experiments. But our simulations also show that a modification to the standard cluster-robust inference procedure described above can dramatically improve test size with few clusters. One can apply a scaling factor to the OLS residuals that are plugged into the CRSE formula, and use critical values from a t distribution with degrees of freedom equal to the number of groups minus one, rather than a standard normal. This is straightforward and in fact, if one uses a cluster-robust variance matrix in STATA by specifying the “vce(cluster clustvar)” option, the confidence intervals and p-values returned are based upon this procedure by default.^{[2]} When this is done, our simulations show that true test size is within about one percentage point of nominal test size with 50, 20, 10 or 6 groups. This result also holds under a wide range of data generating processes. The key situation in which the method is unreliable is when there is a large imbalance between the numbers of treatment and control groups.
Various alternative techniques for achieving correct test size have been proposed and/or tested (Bertrand, Duflo, and Mullainathan 2004; Donald and Lang 2007; Cameron, Gelbach, and Miller 2008; Bester, Conley, and Hansen 2011). Of these, only a wild cluster bootstrap-t procedure has been shown to produce tests of approximately the right size in the typical DiD setup considered in this paper (see Section 2) when the number of groups is as small as six (Cameron, Gelbach, and Miller 2008). Like using CRSEs, this is theoretically robust to heteroscedasticity and arbitrary patterns of error correlation within clusters, and to variation in error processes across clusters. It has also been shown to be quite robust to large imbalances between the numbers of treatment and control groups (MacKinnon and Webb 2017), and hence provides an important alternative to the simpler method described above in such situations. But it is less trivial to implement and computationally more intensive.
Our second point is that, while it is generally not difficult to obtain the correct size, power to detect real effects is a serious concern. When we use the methods above to implement correctly sized hypothesis tests, our simulations suggest that DiD designs can have very low power. This problem is very severe with few groups. For example, with a large 30-year panel of US earnings data from 6 states, a policy implemented by half of the states that raised earnings by 5% would be detected with only 17% probability (using a test of size 0.05). The policy would have to increase earnings by 16% if the null of no policy effect is to be rejected with 80% probability.
More positively, our experiments also demonstrate that substantial gains in power can be achieved using feasible GLS. In particular, with a moderate time series dimension of at least about 10 time periods, one will often be able to increase power by modeling the serial correlation of unobservables inherent in typical DiD designs. A bias correction for feasible GLS due to Hansen (2007) reduces, but does not eliminate, test size distortion, particularly with small numbers of groups. However, our simulations demonstrate that test size can be controlled in a way that is robust to having small numbers of groups, and to violations of the parametric assumptions about the error process implicit in feasible GLS estimation, by using the straightforward cluster-robust inference technique described above. This is why we recommend the use of the combination of FGLS (with or without the Hansen correction) and cluster-robust techniques in DiD applications. Furthermore, our simulations also show that OLS estimation is susceptible to a perverse (negative) relationship between power and panel length, and we explain the econometric reason for this. This does not happen with feasible GLS.
The paper proceeds as follows. Section 2 describes the standard econometric setup in DiD designs that we consider, and discusses possible solutions to the inference problems that can arise in this setting. Section 3 details the Monte Carlo design we use to test different inference methods. Section 4 presents and discusses the results of our Monte Carlo simulations. To assist applied researchers who wish to follow the feasible GLS strategy our experiments support, we provide a STATA ado file that implements this, including the Hansen (2007) bias correction. This is described in Section 5. Finally, Section 6 summarizes and concludes.
2 Approaches to Inference in a Difference-in-Differences Design
We consider the standard linear DiD model
where α_{g} and δ_{t} capture group (state) and time (year) fixed effects, β is the treatment effect of interest for a treatment which varies at the group-time level only, w_{igt} are individual-level control variables, and υ_{igt} is the unobserved individual-level earnings shock.
Our interest lies in the performance of different methods for performing inference about β, both in terms of type 1 and type 2 error (i.e. test size and power to detect real effects). Hence we assume that the OLS DiD estimator based on equation 1 is unbiased, i.e.
The problem we seek to address is that the υ_{igt} may not be iid within groups. Some of the variation in υ_{igt} may occur at the group-time level, i.e.
Given the setup described, the computation of
As we explain fully in the next section, we first aggregate the data to the state-time level in this way and ignore any estimation error (i.e. we proceed as though
2.1 Standard Cluster-Robust Inference
Our starting point is the standard OLS estimator of the standard error of
We then look at several ways of performing inference based on variants of Liang and Zeger’s (1986) cluster-robust standard error (CRSE) estimator. Their formula for a cluster-robust variance matrix is
$$\hat{V} = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' u_g u_g' X_g \right) (X'X)^{-1} \qquad (3)$$
where X is the regressor matrix, X_{g} is the regressor matrix for group g, and u_{g} is the vector of regression residuals for group g. This estimator is consistent, and Wald statistics based on it are asymptotically normal, as G → ∞. But it is biased, and the bias can be substantial when G is small. Intuitively, model overfitting means that residuals will tend to be smaller in magnitude and less correlated within clusters than the true errors, meaning that CRSEs calculated using equation 3 will tend to be biased downwards. Any small-G bias is larger when the distribution of regressors is skewed: in the DiD context considered here, this is when there is an imbalance between the numbers of treatment and control groups (see MacKinnon and Webb 2017).
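As an illustration only (this is not the code used for the simulations reported in this paper), the unscaled Liang and Zeger estimator can be sketched for the DiD coefficient in a balanced group-by-time panel. The sketch assumes aggregated data with no further covariates, so that the group and time fixed effects can be partialled out by two-way demeaning and the sandwich formula collapses to a scalar for β. The function names `demean_two_way` and `did_ols_crse` are hypothetical.

```python
def demean_two_way(panel):
    """Remove group and time means from a balanced G x T panel (list of lists)."""
    G, T = len(panel), len(panel[0])
    grand = sum(map(sum, panel)) / (G * T)
    g_mean = [sum(row) / T for row in panel]
    t_mean = [sum(panel[g][t] for g in range(G)) / G for t in range(T)]
    return [[panel[g][t] - g_mean[g] - t_mean[t] + grand for t in range(T)]
            for g in range(G)]


def did_ols_crse(y, d):
    """OLS DiD estimate and unscaled cluster-robust SE, clustering on groups.

    y, d: balanced G x T panels of the outcome and the treatment dummy.
    Assumption: no covariates other than group and time fixed effects.
    """
    G, T = len(y), len(y[0])
    y_dm, d_dm = demean_two_way(y), demean_two_way(d)
    sxx = sum(d_dm[g][t] ** 2 for g in range(G) for t in range(T))
    beta = sum(d_dm[g][t] * y_dm[g][t] for g in range(G) for t in range(T)) / sxx
    resid = [[y_dm[g][t] - beta * d_dm[g][t] for t in range(T)] for g in range(G)]
    # Middle term of the sandwich: one score per group, squared and summed,
    # which allows arbitrary correlation of residuals within a group.
    meat = sum(sum(d_dm[g][t] * resid[g][t] for t in range(T)) ** 2
               for g in range(G))
    se = meat ** 0.5 / sxx
    return beta, se, resid
```

With few groups the standard error returned here is the downward-biased, unscaled version discussed in the text; any small-sample scaling would be applied on top of it.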
2.2 Bias Corrections for Cluster-Robust Inference with Few Clusters
A typical way of attempting to reduce small-G bias (or, under special circumstances, to eliminate it) is effectively to scale up the residuals before plugging them into equation 3. The default in STATA is to scale by
For CRSEs formed using unscaled and
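The inference step can be illustrated with a short sketch. It assumes a residual scaling factor of sqrt(G/(G−1)), which scales the standard error by the same factor; this value is an assumption made for the example, and STATA's documented adjustment also includes an (N−1)/(N−K) term that is omitted here. The critical values are copied from standard t tables for the group counts considered in this paper, and the function name `small_g_test` is hypothetical.

```python
import math

# Two-sided 5% critical values of the t distribution with G - 1 degrees of
# freedom (from standard tables), keyed by the number of groups G.
T_CRIT_975 = {6: 2.571, 10: 2.262, 20: 2.093, 50: 2.010}
Z_CRIT_975 = 1.960  # standard normal benchmark


def small_g_test(beta, se_unscaled, G):
    """Test H0: beta = 0 at the 5% level with a scaled CRSE and t(G-1)."""
    # Assumed scaling: residuals times sqrt(G/(G-1)) implies SE times the same.
    se_scaled = se_unscaled * math.sqrt(G / (G - 1))
    t_stat = beta / se_scaled
    return {
        "t": t_stat,
        "reject_t": abs(t_stat) > T_CRIT_975[G],
        # For comparison: unscaled CRSE with standard normal critical values.
        "reject_normal_unscaled": abs(beta / se_unscaled) > Z_CRIT_975,
    }
```

For example, with G = 6, an estimate of 0.05 and an unscaled standard error of 0.02, the unscaled normal-based test rejects (|t| = 2.5 > 1.96) whereas the scaled t-based test does not (|t| ≈ 2.28 < 2.571).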
2.3 Bootstraps
With few groups, an alternative to relying on asymptotic results (such as normality of the t-statistic) or on small sample corrections is to recover the distribution of the test statistic empirically via a bootstrap. Cameron, Gelbach, and Miller (2008) found the wild cluster bootstrap-t procedure to be the best (in terms of test size) of a large number of inference techniques in settings with few groups. In their implementation of the bootstrap (which we follow), they resampled clusters of residuals obtained from regressions which impose the null hypothesis, and scaled the resampled residuals by a constant drawn from a 2-point distribution: 1 and −1, each with probability 0.5. This outperformed other bootstrap-based approaches, as well as inference based upon t-statistics formed with CRSEs – but that paper did not consider the
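A minimal sketch of a wild cluster bootstrap-t test with the null imposed is given below, for the special case of a balanced group-by-time panel with only group and time fixed effects. It is an illustration under those assumptions rather than a reproduction of any published implementation; the function names, the default of 999 replications and the symmetric p-value are choices made for the example.

```python
import random


def demean_two_way(p):
    """Remove group and time means from a balanced G x T panel."""
    G, T = len(p), len(p[0])
    grand = sum(map(sum, p)) / (G * T)
    gm = [sum(r) / T for r in p]
    tm = [sum(p[g][t] for g in range(G)) / G for t in range(T)]
    return [[p[g][t] - gm[g] - tm[t] + grand for t in range(T)] for g in range(G)]


def t_stat(y_dm, d_dm):
    """t-statistic for beta = 0 using an unscaled CRSE (clusters = rows)."""
    sxx = sum(x * x for row in d_dm for x in row)
    beta = sum(x * y for rx, ry in zip(d_dm, y_dm) for x, y in zip(rx, ry)) / sxx
    meat = sum(sum(x * (y - beta * x) for x, y in zip(rx, ry)) ** 2
               for rx, ry in zip(d_dm, y_dm))
    return beta / (meat ** 0.5 / sxx)


def wild_cluster_bootstrap_p(y, d, reps=999, seed=0):
    """Return the observed t-statistic and a symmetric bootstrap p-value."""
    rng = random.Random(seed)
    d_dm = demean_two_way(d)
    # Restricted model (beta = 0) contains only fixed effects, so its
    # residuals are simply the two-way demeaned outcome.
    u_r = demean_two_way(y)
    t_obs = t_stat(u_r, d_dm)
    exceed = 0
    for _ in range(reps):
        # One draw from {-1, +1} per group flips that group's whole residual vector.
        w = [rng.choice((-1.0, 1.0)) for _ in u_r]
        y_star = [[w_g * u for u in row] for w_g, row in zip(w, u_r)]
        # Restricted fitted values lie in the span of the fixed effects,
        # so they drop out once the bootstrap outcome is demeaned again.
        t_b = t_stat(demean_two_way(y_star), d_dm)
        exceed += abs(t_b) >= abs(t_obs)
    return t_obs, exceed / reps
```

The null is rejected at the 5% level when the returned p-value is below 0.05. Because the bootstrap t-statistic is recomputed in every replication, the cost grows linearly in `reps`, which is the sense in which the procedure is more computationally intensive than a single CRSE calculation.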
2.4 Modeling the Error Process Using GLS
The final approach to dealing with the serial correlation in the group-time shocks is to use feasible Generalized Least Squares (GLS): this effectively exploits knowledge of this feature of the data to increase efficiency. A natural way to proceed is to assume an AR(k) process for the group-time shocks. FGLS can then be implemented by estimating equation 2 using OLS, as before; estimating the k lag parameters using the OLS regression residuals; using those estimates to apply the standard GLS linear transformations to the variables entering equation 2; and estimating the analog of equation 2 on the transformed variables via OLS.
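The second of these steps, estimating the lag parameters from OLS residuals, can be sketched for k = 2 as a pooled regression of each residual on its two lags. The sketch assumes a balanced G x T array of residuals, a common AR(2) process across groups and no intercept; the function name `estimate_ar2` is hypothetical, and no bias correction is applied.

```python
def estimate_ar2(resid):
    """Pooled OLS of u_gt on (u_g,t-1, u_g,t-2); resid is a G x T list of lists."""
    s11 = s22 = s12 = s1y = s2y = 0.0
    for row in resid:
        for t in range(2, len(row)):
            lag1, lag2, u = row[t - 1], row[t - 2], row[t]
            s11 += lag1 * lag1
            s22 += lag2 * lag2
            s12 += lag1 * lag2
            s1y += lag1 * u
            s2y += lag2 * u
    # Solve the 2 x 2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    rho1 = (s22 * s1y - s12 * s2y) / det
    rho2 = (s11 * s2y - s12 * s1y) / det
    return rho1, rho2
```

The estimates returned by this simple regression are the uncorrected ones; a bias-corrected variant would adjust `rho1` and `rho2` before they are used in the GLS transformation.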
Two issues arise. First, estimates of the parameters obtained by regressing OLS residuals on k lags are inconsistent with T fixed, due to the presence of fixed group effects (Nickell 1981; Solon 1984). Hansen (2007) derives a bias correction which is consistent as G → ∞, and develops the asymptotic properties of a FGLS estimator which uses it. But this correction may not work well with small G. Second, one may be worried about misspecification of the error process.
However, neither of these issues affects the consistency of the FGLS estimator. And it is likely that FGLS would still be more efficient than OLS: a weighting matrix based on an incorrect parametrisation of the serial correlation process will often still be closer to the optimal GLS weighting matrix than the identity matrix used by standard OLS.
On the other hand, test size would generally be compromised, because the ordinary formula for the FGLS standard error depends upon the weighting matrix. But robust inference may offer a way to control test size. As noted more generally by Wooldridge (2006), the combination of FGLS estimation and robust inference is used relatively little in practice, but will often be a sensible way of realizing efficiency gains without compromising test size.^{[10]} One simply plugs the FGLS residuals, rather than OLS residuals, into the formula for a cluster-robust variance matrix.
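The combination of FGLS and cluster-robust inference can be sketched as follows for given AR(2) parameters. The sketch makes several simplifying assumptions that are not taken from this paper: the panel is balanced with only group and time fixed effects, the GLS transformation is a quasi-difference that drops the first two periods of each group (rather than a transformation that retains them), and the standard error is scaled by sqrt(G/(G−1)). The function names are hypothetical.

```python
import math


def demean_two_way(p):
    """Remove group and time means from a balanced G x T panel."""
    G, T = len(p), len(p[0])
    grand = sum(map(sum, p)) / (G * T)
    gm = [sum(r) / T for r in p]
    tm = [sum(p[g][t] for g in range(G)) / G for t in range(T)]
    return [[p[g][t] - gm[g] - tm[t] + grand for t in range(T)] for g in range(G)]


def quasi_difference(p, rho1, rho2):
    """AR(2) quasi-differencing within each group; drops the first two periods."""
    return [[row[t] - rho1 * row[t - 1] - rho2 * row[t - 2]
             for t in range(2, len(row))] for row in p]


def fgls_did_crse(y, d, rho1, rho2):
    """FGLS DiD estimate with a cluster-robust SE built from FGLS residuals."""
    G = len(y)
    # Quasi-differenced fixed effects span the same space as ordinary fixed
    # effects on the shortened panel (provided rho1 + rho2 != 1), so two-way
    # demeaning of the transformed data partials them out.
    y_s = demean_two_way(quasi_difference(y, rho1, rho2))
    d_s = demean_two_way(quasi_difference(d, rho1, rho2))
    sxx = sum(x * x for row in d_s for x in row)
    beta = sum(x * v for rx, ry in zip(d_s, y_s) for x, v in zip(rx, ry)) / sxx
    # Plug the FGLS residuals into the cluster-robust formula.
    meat = sum(sum(x * (v - beta * x) for x, v in zip(rx, ry)) ** 2
               for rx, ry in zip(d_s, y_s))
    se = math.sqrt(G / (G - 1) * meat) / sxx
    return beta, se
```

If the assumed AR(2) process is wrong, `beta` remains a weighted least squares estimate with a suboptimal weighting, while the standard error does not rely on the weighting being correct.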
Hansen (2007) considers combining CRSEs with his FGLS procedure, using bias-corrected estimates of the parameters underlying a group-time error process assumed to be AR, for the case where G = 50. The concern with fewer groups would be that test size cannot be made robust to misspecification of the error process in this way, because cluster-robust inference relies on having many clusters. This leads us back to the question discussed above and explored in our simulations: how well can bias corrections for cluster-robust inference with few clusters do in controlling test size?
3 Experimental Design
We follow Bertrand, Duflo, and Mullainathan (2004), Cameron, Gelbach, and Miller (2008), and Hansen (2007) in using data on women aged 25 to 50 in their fourth interview month in the Merged Outgoing Rotation Group of the Current Population Survey. Our data include all 50 US states and the period 1979 to 2008 inclusive (i.e. G = 50 and T = 30).^{[11]} We focus on log(earnings) as the dependent variable.
Our control variables are a quartic in age. As in the aforementioned papers, we first aggregate the data to the state-time level in the way just described and ignore any estimation error from this procedure (i.e. we proceed as though
In our first set of Monte Carlo simulations, we repeatedly resample states with replacement from the CPS data and randomly choose half of the states to be “treated.”^{[14]} In treating exactly half of the states, we follow the main approach in Bertrand, Duflo, and Mullainathan (2004) and Cameron, Gelbach, and Miller (2008). This is the most favorable possible choice in terms of the resulting precision of treatment effect estimates, as it maximizes between-group variation in treatment status. For all treated states in each Monte Carlo replication, the placebo treatment is applied in the same randomly chosen year and in all subsequent years.^{[15]} We estimate the “effect” of this placebo treatment by estimating equation 2. We initially use OLS, and later feasible GLS, for estimation. Our interest lies in the performance of different methods for performing inference about β, both in terms of type 1 and type 2 error (i.e. test size and power to detect real effects). To examine the effects of having differing numbers of groups, we run variants where we resample 50, 20, 10 and 6 states. We also vary the fraction of groups that are treated, to explore robustness to unbalanced designs.
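One replication of the placebo design can be sketched as below. This is a hedged illustration: the function name, its arguments and the default set of candidate start years (any year after the first, so that at least one pre-treatment period exists) are assumptions made for the example, not details taken from this paper.

```python
import random


def draw_placebo_design(states, n_groups, years, rng, candidate_start_years=None):
    """Draw one placebo design: resampled states, treated positions, start year."""
    # Resample states with replacement, so the same state can appear twice.
    sampled = [rng.choice(states) for _ in range(n_groups)]
    # Treat positions in the resample rather than state names, so that exactly
    # half of the sampled groups are treated even when a state is drawn twice.
    treated = set(rng.sample(range(n_groups), n_groups // 2))
    # A single start year is shared by all treated groups (assumed default range).
    start = rng.choice(candidate_start_years or years[1:])
    # Absorbing treatment: once switched on, it stays on in later years.
    d = [[1 if (g in treated and yr >= start) else 0 for yr in years]
         for g in range(n_groups)]
    return sampled, treated, start, d
```

Passing `n_groups` equal to 50, 20, 10 or 6 gives the variants with differing numbers of groups; an unbalanced design would replace `n_groups // 2` with the desired number of treated groups.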
We first report how often the null hypothesis of no treatment effect is rejected using tests of nominal size 0.05 when using different inference methods.
We then look at power by reporting how often the null of no effect is rejected when there are real treatment effects of various sizes. We do this only for techniques that we have shown to work well in controlling test size. Nevertheless, to ensure that we are comparing the power of hypothesis tests which have exactly the same size, we adopt the procedure suggested by Davidson and MacKinnon (1998): the nominal significance level used to determine whether to reject the null hypothesis is that which gives a test of true size 0.05.
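The size adjustment can be illustrated with a short sketch. Choosing the nominal level that gives a true size of 0.05 is treated here as equivalent to using the empirical 95th percentile of |t| under the null as the critical value; that equivalence, the function name and the quantile convention are assumptions of the example.

```python
import math


def size_adjusted_power(t_null, t_alt, size=0.05):
    """Power of a two-sided test whose critical value yields the desired true size.

    t_null: t-statistics simulated under the null hypothesis.
    t_alt:  t-statistics simulated under a given true treatment effect.
    """
    abs_null = sorted(abs(t) for t in t_null)
    # Empirical (1 - size) quantile of |t| under the null.
    k = min(len(abs_null) - 1, int(math.ceil((1 - size) * len(abs_null))) - 1)
    crit = abs_null[k]
    power = sum(abs(t) > crit for t in t_alt) / len(t_alt)
    return crit, power
```

Comparing two inference methods on this basis means that any difference in rejection rates under the alternative reflects power rather than a difference in true size.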
We also compute minimum detectable effects (MDEs) as first defined in Bloom (1995): the smallest effects that would lead to a rejection of the null hypothesis (of no effect) with given probabilities. To do this, we use the same Monte Carlo procedures as described above to simulate the distribution of the t-statistic under the null hypothesis.^{[16]} For power of x%, the MDE depends only on the (100 − x)th centile of this distribution, the critical values from the t_{G−1} distribution, and the standard error (see later). We therefore recover the entire relationship between power and MDEs. We do this for DiD designs with varying numbers of groups.
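A hedged sketch of this calculation follows. It assumes that a true effect β shifts the null t-statistic by β divided by the standard error, that the standard error can be treated as fixed, and that only upper-tail rejections matter for a positive effect; under those assumptions the MDE is the standard error times the gap between the critical value and the relevant lower centile of the null distribution. The function name and the centile convention are hypothetical.

```python
def minimum_detectable_effect(t_null, crit, se, power=0.80):
    """Smallest positive effect rejected with the given probability.

    t_null: simulated t-statistics under the null hypothesis.
    crit:   upper critical value (e.g. from the t distribution with G - 1 df).
    se:     standard error of the treatment effect estimate (treated as fixed).
    """
    ordered = sorted(t_null)
    # (100 - x)th centile of the null distribution for power of x%.
    q = ordered[round((1 - power) * (len(ordered) - 1))]
    # Reject when t_null + effect / se > crit, which happens with probability
    # `power` when effect / se = crit - q.
    return se * (crit - q)
```

Evaluating the function over a grid of `power` values traces out the whole relationship between power and MDEs from a single set of null simulations.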
We then explore whether power can be improved by using FGLS rather than OLS estimation. We rerun the Monte Carlo simulations, this time implementing FGLS assuming an AR(2) process for the group-time shocks, homogeneous across states. We estimate the 2 AR parameters in two ways. First, we simply regress the residuals from OLS estimation of equation 2 on two lags. With fixed T and fixed group effects, this produces inconsistent estimates of the AR parameters. Second, we apply to these estimates the bias correction derived by Hansen (2007). This correction is consistent as G goes to infinity. We label these “FGLS” and “BC-FGLS” respectively. In both cases, we explore what happens when the estimator is used with and without cluster-robust inference.
Finally, we compare the performance of OLS and FGLS/BC-FGLS in two further settings. The first is when the error process assumed in FGLS estimation is misspecified, which we examine by simulating state-time shocks according to an AR(2) process that is heterogeneous across states and according to a (homogeneous) MA(1) process. The second is when the available panel is shorter, which we examine by varying T between 10 and 30.
4 Results
4.1 Rejection Rates when the Null is True
Table 1 contains results from our first Monte Carlo simulations, using the CPS log(earnings) data. It shows the rate at which the null of no effect is rejected when generating placebo treatments, estimating equation 2 by OLS, and using different methods to conduct inference about β. All hypothesis tests are of nominal size 0.05. Hence, rejection rates that deviate significantly from 0.05 indicate incorrect test size. We use 5000 replications. Simulation standard errors are shown in parentheses. The standard error for an estimated rejection rate
G = 50  G = 20  G = 10  G = 6  

Assume iid  0.366  0.394  0.396  0.398 
(0.007)  (0.007)  (0.007)  (0.007)  
CRSE, N(0,1) critical values  0.058  0.078  0.127  0.228 
(0.003)  (0.004)  (0.005)  (0.006)  

0.046  0.039  0.039  0.053 
(0.003)  (0.003)  (0.003)  (0.003)  
Wild cluster bootstrapt  0.038  0.060  0.044  0.060 
(0.001)  (0.002)  (0.003)  (0.003) 
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample groups (US states) from CPS-MORG data, having imposed the sampling restrictions described in the text. The treatment parameter has a true coefficient of zero. G is the number of sampled states. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. The different inference methods used are discussed in the text.
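As a quick check on the precision of such rejection rates, the sketch below assumes the usual binomial formula for the standard error of a proportion estimated from independent replications. The function name is hypothetical, and the formula is an assumption of the example rather than a statement of how every standard error in the table was computed.

```python
import math


def simulation_se(rejection_rate, replications):
    """Binomial standard error of a rejection rate from independent replications."""
    return math.sqrt(rejection_rate * (1 - rejection_rate) / replications)
```

With 5000 replications this gives about 0.007 for a rejection rate of 0.366 and about 0.003 for a rejection rate of 0.058, in line with the corresponding entries in parentheses for the first two rows.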
The first row of Table 1 shows the rejection rates obtained assuming iid errors, i.e. by simply forming a t-statistic using the OLS standard error and comparing to standard normal critical values. Rejection rates are between about 37% and 40%, more than seven times the nominal test size. This essentially replicates the result in Bertrand, Duflo, and Mullainathan (2004).
Forming CRSEs using unscaled OLS residuals and comparing the resulting t-statistic to standard normal critical values results in rejection rates that are too high, particularly with small G. Using t_{G−1} rather than the standard normal as the reference distribution is enough to achieve approximately the correct test size when G ≥ 20, but not with 6 or 10 groups.
The
The final row of Table 1 shows rejection rates obtained using the wild cluster bootstrap-t procedure as in Cameron, Gelbach, and Miller (2008).^{[17]} We essentially replicate previous findings: it also performs well relative to most tested alternatives.
In summary, the simulations reported in Table 1 suggest that tests with very little size distortion can be obtained using very straightforward methods, even with very few groups. In particular, this is achieved by computing a t-statistic with CRSEs that use residuals scaled by
The accuracy of robust variance matrix estimators in finite samples depends on the skewness of the distribution of regressors as well as sample size [as illustrated in Monte Carlo work by Imbens and Kolesár (2012)]. With cluster-robust standard errors in the DiD context considered here, a researcher should pay attention not only to the number of groups but also to whether treatment status is skewed, i.e. whether the number of treated groups is similar to the number of controls.
With unbalanced designs, inference based on cluster-robust standard errors should produce tests of correct size less reliably than when the numbers of treatment and control groups are equal (as in the Monte Carlo experiments up to now). This has been illustrated recently by MacKinnon and Webb (2017). Scaling the residuals that are plugged into the standard CRSE formula by
Table 2 and Table 3 repeat the simulations presented in Table 1 but with varying degrees of imbalance, for the cases with 10 and 50 groups respectively. G1 denotes the number of treated groups. The first columns repeat the results for the balanced designs shown in Table 1, and subsequent columns reduce the number of treated groups. When G = 10, the simple method that works well under a wide range of balanced designs – using CRSEs with
G1 = 5  G1 = 4  G1 = 3  G1 = 2  


0.039  0.046  0.071  0.137 
(0.003)  (0.003)  (0.004)  (0.005)  
Wild cluster bootstrapt  0.044  0.052  0.053  0.024 
(0.003)  (0.003)  (0.003)  (0.002) 
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample groups (US states) from CPS-MORG data, having imposed the sampling restrictions described in the text. The treatment parameter has a true coefficient of zero. G1 denotes the number of treated groups. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. The different inference methods used are discussed in the text.
G1 = 25  G1 = 15  G1 = 10  G1 = 5  


0.046  0.044  0.057  0.109 
(0.003)  (0.003)  (0.003)  (0.004)  
Wild cluster bootstrapt  0.038  0.045  0.039  0.052 
(0.001)  (0.001)  (0.001)  (0.002) 
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample groups (US states) from CPS-MORG data, having imposed the sampling restrictions described in the text. The treatment parameter has a true coefficient of zero. G1 denotes the number of treated groups. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. The different inference methods used are discussed in the text.
In summary these simulations confirm that, if the imbalance between treatment and control groups is large enough, using CRSEs with
4.1.1 Robustness
We undertook a number of further experiments to explore the robustness of these findings. First, we considered the case where a binary employment indicator is the dependent variable (in a linear probability model). Together with log(earnings), this covers the two most common outcomes of interest in DiD studies, according to a survey of the applied literature in Bertrand, Duflo, and Mullainathan (2004). The results for employment are very similar to those reported above for earnings, and the full results are available in an earlier working paper (Brewer, Crossley, and Joyce 2013).
Second, to ensure that our findings are not specific to the CPS data generating process, we have also conducted our Monte Carlo experiments using simulated data (returning to a balanced design). The earnings generating process still conformed with equation 2, but we simulated the state-time shocks ourselves. In doing so we varied their degree of serial correlation and non-normality. In particular, we assumed that the state-time shocks for each state evolve according to the AR(1) process
where ω_{gt} is iid across groups and time and is drawn from a t distribution with d degrees of freedom. To control the degree of non-normality in the white noise, we varied d between 4 (very high non-normality) and 120 (at which point the t distribution is essentially standard normal). To control the degree of serial correlation, we varied ρ. We also examined a scenario in which the data generating process is heterogeneous, by drawing ρ separately for each state from a uniform distribution between 0 and 1. The scaling applied to ω_{gt} ensures that, when ρ = 0.4, the variance of ε_{gt} is equal to 0.004 – approximately the empirical variance of the residuals in the CPS data. This means that the degree of serial correlation is allowed to affect the stationary variance of ε_{gt}, but the distribution of the white noise (i.e. the choice of d) is not. We generated the initial condition (ε_{g1}) such that its variance matched the stationary variance of state-time shocks in other time periods.
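The error process just described can be sketched as follows. The scale factor is derived from the stated target: a t variable with d degrees of freedom has variance d/(d−2), and an AR(1) process with ρ = 0.4 has stationary variance equal to the innovation variance divided by 1 − 0.4², so the innovation variance is set to 0.004 × (1 − 0.16) for every ρ and d. Drawing the initial condition from a normal distribution is an assumption of the example (the text only pins down its variance), and the function name is hypothetical.

```python
import math
import random


def simulate_ar1_t_shocks(T, rho, d, rng, target_var=0.004, rho_ref=0.4):
    """Simulate one group's AR(1) shocks with scaled t(d) innovations.

    Requires d > 2 and |rho| < 1.
    """
    # Innovation variance chosen so that Var(eps) = target_var when rho = rho_ref.
    innov_var = target_var * (1 - rho_ref ** 2)
    # Var(t_d) = d / (d - 2), so rescale the raw t draws accordingly.
    scale = math.sqrt(innov_var * (d - 2) / d)

    def draw_t():
        # t_d = Z / sqrt(chi2_d / d), with chi2_d drawn as Gamma(d/2, scale 2).
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(d / 2.0, 2.0)
        return z / math.sqrt(chi2 / d)

    # Initial condition with the stationary variance for this rho
    # (normality of the initial draw is an assumption of this sketch).
    stationary_sd = math.sqrt(innov_var / (1 - rho ** 2))
    eps = [rng.gauss(0.0, stationary_sd)]
    for _ in range(1, T):
        eps.append(rho * eps[-1] + scale * draw_t())
    return eps
```

Because the innovation variance is held fixed, raising `rho` above 0.4 increases the stationary variance of the simulated shocks, while changing `d` alters only the shape of the innovation distribution.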
In each Monte Carlo replication, we first resampled states with replacement from the CPS data and randomly chose treated states and the year in which the placebo treatment begins, just as before. We then regressed Y_{gt} on state and year fixed effects only. For each state-time combination, we simulated the outcome variable by summing the relevant (estimated) state effect, the relevant (estimated) year effect, and the random state-year shock generated as above. We then estimated the DiD model using the simulated outcome and conducted the hypothesis test on β.
Table 4 and Table 5 report rejection rates for various combinations of ρ and d, when the number of groups is 50 and 10 respectively. They show that our main finding is robust to a very wide range of error processes. Rejection rates remain within about a percentage point of the nominal test size under all of the tested combinations of degrees of serial correlation, non-normality in the white noise, and number of groups.
ρ = 0  ρ = 0.2  ρ = 0.4  ρ = 0.6  ρ = 0.8  ρ varies with g  

d = 4  0.046  0.048  0.045  0.048  0.046  0.047 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002)  
d = 20  0.049  0.045  0.047  0.046  0.046  0.048 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002)  
d = 60  0.044  0.047  0.049  0.045  0.045  0.043 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002)  
d = 120  0.045  0.047  0.044  0.044  0.045  0.046 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002) 
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 10,000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample groups (US states) from CPS-MORG data, having imposed the sampling restrictions described in the text. The treatment parameter has a true coefficient of zero. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. Simulated log-earnings are generated by effectively replacing the empirical regression residuals with a simulated error term generated according to an AR(1) process. Each cell in the table represents a different AR(1) process. ρ denotes the AR(1) parameter. In the final column the AR(1) parameter is drawn separately for each group, from a uniform distribution between 0 and 1. d denotes the degrees of freedom of the scaled t distribution from which the white noise is drawn (hence it controls the degree of non-normality). See text for full details.
ρ = 0  ρ = 0.2  ρ = 0.4  ρ = 0.6  ρ = 0.8  ρ varies with g  

d = 4  0.049  0.053  0.049  0.053  0.047  0.050 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002)  
d = 20  0.055  0.052  0.050  0.053  0.050  0.048 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002)  
d = 60  0.055  0.053  0.051  0.055  0.052  0.050 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002)  
d = 120  0.055  0.048  0.050  0.051  0.054  0.052 
(0.002)  (0.002)  (0.002)  (0.002)  (0.002)  (0.002) 
Notes as for Table 4.
4.2 Power to Detect Real Effects
Our findings in the previous section indicate that controlling test size need not be a major concern in DiD designs. However, we now show that power to detect real treatment effects with tests of correct size can be extremely low.
Rejection rates in Table 6 indicate power to detect treatment effects on earnings of approximately 2% (precisely, 0.02 log points), 5%, 10% and 15%. This is based upon the same Monte Carlo replications as Table 1, except we transform the dependent variable: for example, to look at power to detect a 5% effect we add
Effect size and inference method    G = 50         G = 20         G = 10         G = 6
β = 0.02:                           0.378 (0.007)  0.162 (0.005)  0.095 (0.004)  0.072 (0.004)
β = 0.02: wild cluster bootstrap-t  0.405 (0.004)  0.131 (0.003)  0.095 (0.004)  0.078 (0.004)
β = 0.05:                           0.954 (0.003)  0.625 (0.007)  0.338 (0.007)  0.178 (0.005)
β = 0.05: wild cluster bootstrap-t  0.954 (0.002)  0.510 (0.005)  0.309 (0.006)  0.176 (0.005)
β = 0.10:                           1.000 (.)      0.963 (0.003)  0.782 (0.006)  0.482 (0.007)
β = 0.10: wild cluster bootstrap-t  1.000 (.)      0.902 (0.003)  0.737 (0.006)  0.439 (0.007)
β = 0.15:                           1.000 (.)      0.999 (0.000)  0.933 (0.004)  0.776 (0.006)
β = 0.15: wild cluster bootstrap-t  1.000 (.)      0.992 (0.001)  0.906 (0.004)  0.707 (0.006)
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample states from CPS-MORG data, having imposed the sampling restrictions described in the text. β is the true value of the treatment parameter. G is the number of sampled states. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. The inference methods used are discussed in the text. We adjust for test size when making power comparisons using the procedure outlined by Davidson and MacKinnon (1998). See text for details.
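As a rough illustration of what a size adjustment involves, the sketch below computes power using a critical value taken from the simulated null distribution, so that the test has approximately the nominal size by construction. This is one common way of size-adjusting and is only assumed to be in the spirit of Davidson and MacKinnon (1998); the function name and the toy t-statistics are hypothetical, not output from the paper's simulations.

```python
import math

def size_adjusted_power(t_null, t_alt, alpha=0.05):
    """Share of |t| under the alternative exceeding an empirical critical value.

    The critical value is the (1 - alpha) empirical quantile of |t| under the
    null, so the test has approximately size alpha by construction.
    """
    abs_null = sorted(abs(t) for t in t_null)
    k = min(len(abs_null) - 1, math.ceil((1 - alpha) * len(abs_null)) - 1)
    crit = abs_null[k]
    power = sum(abs(t) > crit for t in t_alt) / len(t_alt)
    return power, crit

# Hypothetical t-statistics standing in for Monte Carlo output
t_null = [i / 10 for i in range(-50, 51) if i != 0]  # 100 values in [-5, 5]
t_alt = [t + 3.0 for t in t_null]                    # same draws shifted by a true effect

size_check, crit = size_adjusted_power(t_null, t_null)
power_alt, _ = size_adjusted_power(t_null, t_alt)
print(crit, size_check, power_alt)
```

Applying the function to the null draws themselves is a quick check that the implied size does not exceed the nominal level.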
We focus on the two methods that we have shown above to produce approximately correctly sized hypothesis tests even when the number of groups is small: the
The results indicate that power is a serious issue in these designs. A 2% effect would be detected with a probability of less than 1 in 4, even with data from all 50 US states. To detect a 5% effect with a probability of about 80% – a conventional benchmark for power – one would need data on all 50 US states. Power declines much further as G falls. With 6 states, a researcher would have less than a 50−50 chance of detecting even a 10% effect, a 17% chance of detecting a 5% effect, and a 7% chance of detecting a 2% effect (power barely greater than the size of the test). In other words, it is unlikely that one would detect effects of a typically realistic magnitude using a correctly sized test, and highly unlikely when the number of groups is small.
A comparison of the two inference methods suggests that their power is similar for all combinations of number of groups and size of treatment effect considered in these simulation experiments. If anything, the simpler
Figure 1 documents power more comprehensively by showing the minimum effects that would be detected (i.e. that would lead to a rejection of the null of no effect) with given probabilities – a way of assessing statistical power first outlined by Bloom (1995). We vary power between 1% and 99% and compute the minimum detectable effects (MDEs) in each case. We continue just with the hypothesis test that uses CRSEs with
Figure 1:
For a given level of power, x, the MDE is
where
We proceed with the same Monte Carlo design underlying the results in Table 1. Monte Carlo replications provide us with an estimate of the distribution of the t-statistic under the null. They also provide repeated estimates of the
Figure 1 plots MDEs against power when the number of groups is 50, 20, 10 and 6. With earnings data on the entire US population (50 states), one would need a treatment effect of about 3.5% to have even a 50−50 chance of detecting it. With a sample from 6 US states – by no means an extreme example in the applied DiD literature – the MDE on earnings is about 16% for 80% power and 11% for 50% power. We have also calculated analogous results using a binary employment indicator as the dependent variable. This leads to similar conclusions. For 80% power, the MDE on the employment rate with data from all 50 states is about 2 percentage points, rising to 6.5 percentage points with 6 states.^{[19]}
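To make the logic of a minimum detectable effect concrete, the sketch below uses a Bloom-style calculation. It assumes, since the exact expression is not recoverable from the text above, that the MDE for power x equals the standard error multiplied by the gap between the upper critical value and the (1 − x) quantile of the t-statistic under the null, and it ignores rejections in the lower tail. The function names are hypothetical, and a standard normal grid stands in for the simulated null t-statistics.

```python
from statistics import NormalDist

def empirical_quantile(values, q):
    """Return the q-th empirical quantile by sorting (no interpolation)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[idx]

def mde(se, crit_upper, t_null, power):
    """Smallest positive effect rejected in the upper tail with probability `power`.

    Assumes beta_hat / se = beta / se + t0, where t0 follows the null
    distribution described by the simulated draws in `t_null`.
    """
    return se * (crit_upper - empirical_quantile(t_null, 1.0 - power))

# Stand-in for simulated null t-statistics: an evenly spaced standard normal grid
n = 10_000
t_null = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
crit = NormalDist().inv_cdf(0.975)          # two-sided 5% critical value

multiplier_80 = mde(1.0, crit, t_null, 0.80)
multiplier_50 = mde(1.0, crit, t_null, 0.50)
print(round(multiplier_80, 2), round(multiplier_50, 2))
```

With a normal null distribution this gives the familiar multiplier of about 2.8 standard errors for 80% power. With few groups the null distribution of the t-statistic has fatter tails, so the multiplier would be larger.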
4.3 Increasing Power with Feasible GLS
The previous subsection argued that lack of power is a key problem in typical DiD designs. This suggests that there may be large gains from efforts to improve the efficiency of estimation. The serial correlation problem inherent in a typical DiD study also suggests one way to go about this: exploit knowledge of this feature of the data using feasible GLS. Here we implement the FGLS estimation procedure suggested by Hansen (2007), with and without the straightforward robust inference procedure that our simulations have shown to produce correctly sized tests even with few groups (see Section 2 for details). Our simulation experiments indicate that, in combination with our recommended robust inference technique, FGLS can improve power considerably whilst maintaining correctly sized tests, in a way that is robust to misspecification (or misestimation) of the error process, even with few groups.^{[20]}
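The idea of combining FGLS with cluster-robust inference can be sketched in a few lines. The example below is a simplified illustration rather than the procedure used in the paper: it assumes a balanced panel, a common AR(1) error process (the paper works with AR(2)), no bias correction of the AR estimate, and a cluster-robust variance with a G/(G − 1) scaling. All function names and the simulated data are hypothetical.

```python
import numpy as np

def demean_two_way(a):
    """Sweep out group (row) and period (column) means from a balanced panel."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def did_ols(y, d):
    """Two-way fixed-effects DiD estimate with a cluster-robust standard error.

    y, d: arrays of shape (G, T); each row is one group (cluster).
    """
    y_w, d_w = demean_two_way(y), demean_two_way(d)
    beta = (d_w * y_w).sum() / (d_w ** 2).sum()
    resid = y_w - beta * d_w
    n_groups = y.shape[0]
    scores = (d_w * resid).sum(axis=1)  # one score per cluster
    var = n_groups / (n_groups - 1) * (scores ** 2).sum() / (d_w ** 2).sum() ** 2
    return beta, np.sqrt(var), resid

def did_fgls_ar1(y, d):
    """FGLS under an assumed common AR(1) error process (illustrative only)."""
    _, _, resid = did_ols(y, d)
    # Pooled AR(1) coefficient from the OLS residuals (no bias correction)
    rho = (resid[:, 1:] * resid[:, :-1]).sum() / (resid[:, :-1] ** 2).sum()
    # Quasi-difference within each group, dropping the first period
    y_star = y[:, 1:] - rho * y[:, :-1]
    d_star = d[:, 1:] - rho * d[:, :-1]
    beta, se, _ = did_ols(y_star, d_star)
    return beta, se, rho

# Hypothetical simulated panel: 50 groups, 30 periods, AR(1) errors, true effect 0.05
rng = np.random.default_rng(0)
G, T, rho_true, effect = 50, 30, 0.8, 0.05
eps = rng.normal(scale=0.05, size=(G, T))
u = np.empty((G, T))
u[:, 0] = eps[:, 0] / np.sqrt(1 - rho_true ** 2)  # stationary starting value
for t in range(1, T):
    u[:, t] = rho_true * u[:, t - 1] + eps[:, t]
d = np.zeros((G, T))
d[: G // 2, T // 2:] = 1.0  # half the groups treated in the second half of the panel
y = rng.normal(size=(G, 1)) + rng.normal(size=(1, T)) + effect * d + u

b_ols, se_ols, _ = did_ols(y, d)
b_fgls, se_fgls, rho_hat = did_fgls_ar1(y, d)
print(f"OLS:  {b_ols:.3f} (SE {se_ols:.3f})")
print(f"FGLS: {b_fgls:.3f} (SE {se_fgls:.3f}), rho_hat = {rho_hat:.2f}")
```

Because the cluster-robust variance is computed on the transformed data, inference does not rely on the AR(1) assumption being correct; the assumption only affects efficiency.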
The first, third and fifth columns of Table 7 show rejection rates under a true null hypothesis (no treatment effect). The first row reiterates the good size properties of OLS estimation combined with CRSEs that use
Estimation and inference  G = 50, effect = 0  G = 50, effect = 0.05  G = 20, effect = 0  G = 20, effect = 0.05  G = 6, effect = 0  G = 6, effect = 0.05
OLS,                      0.046 (0.003)       0.954 (0.003)          0.039 (0.003)       0.625 (0.007)          0.053 (0.003)      0.178 (0.005)
FGLS                      0.060 (0.003)       0.996 (0.001)          0.092 (0.004)       0.800 (0.006)          0.114 (0.004)      0.292 (0.006)
FGLS,                     0.046 (0.003)       0.994 (0.001)          0.044 (0.003)       0.799 (0.006)          0.058 (0.003)      0.272 (0.006)
BC-FGLS                   0.042 (0.003)       0.995 (0.001)          0.064 (0.003)       0.798 (0.006)          0.089 (0.004)      0.291 (0.006)
BC-FGLS,                  0.045 (0.003)       0.991 (0.001)          0.051 (0.003)       0.784 (0.006)          0.063 (0.003)      0.267 (0.006)
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample states from CPS-MORG data, having imposed the sampling restrictions described in the text. G is the number of sampled states. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. The inference methods used are discussed in the text. We adjust for test size when making power comparisons using the procedure outlined by Davidson and MacKinnon (1998). See text for details.
The second, fourth and sixth columns of Table 7 turn attention to power, showing rejection rates when there is a 5% treatment effect on earnings. Again, the first row reiterates the earlier finding (see Table 6) that OLS estimation combined with a correctly sized test provides low power, particularly with few groups. As Hansen (2007) showed in the case where G = 50, the FGLS procedures deliver substantial improvements in power. Combined with robust inference which delivers the correct test size, BC-FGLS detects the treatment effect with 96% probability, whereas OLS detects it with 80% probability. Table 7 shows that FGLS also delivers very substantial proportionate power gains relative to OLS with smaller G: with G = 6, power is 18% using OLS and 29% using FGLS. Using Hansen’s bias correction delivers a little more power than “ordinary” FGLS.
Figure 2 illustrates the power gains more comprehensively, plotting MDEs against power and comparing the results for OLS and BC-FGLS estimation with varying numbers of groups (always combined with cluster-robust inference, so that test size is correct). The power gains from BC-FGLS are substantial.
Figure 2:
4.3.1 Robustness
As we did for the analysis of test size in Section 4.1, we have repeated this analysis for the case where the outcome of interest is a binary employment status indicator. The results, which are available in Brewer, Crossley, and Joyce (2013), confirm that the findings reported here hold in that case: FGLS delivers substantial power gains over OLS, and this can be done whilst controlling test size, even with few groups.
We now further explore the robustness of these results to different data settings. Of particular interest are cases where the parametric assumptions about the serial correlation process inherent in the FGLS procedure are incorrect. In such cases, can test size still be reliably controlled (even with few groups), and what power gains (if any) can FGLS offer?
We continue to use FGLS estimation based on the assumption of an AR(2) process for the state-time shocks which is the same for all states. We explore its properties under two forms of misspecification of the error process. First, we simulate an AR(2) process which is heterogeneous across states. Second, we simulate an MA(1) process with parameter 0.5.^{[21]} In each case, we effectively replace the empirical log-earnings residuals in the CPS (from a regression of state-year earnings on state and year fixed effects) with our simulated error terms, as in the robustness checks reported above.
Columns one and three of Table 8 show the results of these simulations under the null hypothesis of no treatment effect using tests of nominal size 0.05, for the case where G = 10. The results show that the previous results on test size hold under these alternative processes: without robust inference, Hansen’s bias correction for the AR parameter estimates brings true test size closer to the nominal size when using FGLS estimation (although there is still some over-rejection); but test size can be controlled reliably using our suggested robust inference technique, whether estimation is carried out using OLS, FGLS or BC-FGLS.^{[22]} Columns two and four of Table 8 report power to detect a treatment effect of 0.05 log-points on earnings, again for the case where G = 10.^{[23]} The second column shows that, using correctly sized hypothesis tests (i.e. those that use our suggested robust inference technique), FGLS estimation does offer substantial power gains over OLS even when based upon the incorrect assumption that the AR(2) error process is homogeneous across states.^{[24]} The fourth column shows that, where the true error process is MA(1) rather than AR(2), there is no power gain from FGLS. Intuitively this makes sense: where the parametric assumptions about the serial correlation in the error terms are a very poor approximation to the true process, FGLS does not offer efficiency gains; where the assumptions are a better approximation, FGLS does offer efficiency gains over the OLS estimator, which does not exploit any knowledge of the nature of the error process.
Estimation and inference  Heterogeneous AR(2), effect = 0  Heterogeneous AR(2), effect = 0.05  MA(1), effect = 0  MA(1), effect = 0.05
OLS,                      0.056 (0.003)                    0.520 (0.007)                       0.056 (0.003)      0.619 (0.007)
FGLS                      0.106 (0.004)                    0.782 (0.006)                       0.088 (0.004)      0.713 (0.006)
FGLS,                     0.065 (0.003)                    0.712 (0.006)                       0.061 (0.003)      0.606 (0.007)
BC-FGLS                   0.069 (0.004)                    0.807 (0.006)                       0.072 (0.004)      0.710 (0.006)
BC-FGLS,                  0.063 (0.003)                    0.727 (0.006)                       0.059 (0.003)      0.603 (0.007)
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample states from CPS-MORG data, having imposed the sampling restrictions described in the text. Data from 1979 to 2008 inclusive are sampled (i.e. T = 30). Regressions are run on aggregated state-year data. The underlying data effectively replace empirical CPS regression residuals with a simulated error term, generated according to an AR(2) process which varies across groups and an MA(1) process respectively. The inference methods used are discussed in the text. We adjust for test size when making power comparisons using the procedure outlined by Davidson and MacKinnon (1998). See text for details.
Finally, we show how these results vary with the number of time periods available. Table 9 repeats the simulations from Table 8 for the cases where T = 20 and T = 10. This shows that test size can still be controlled using our recommended robust inference approach with few time periods, whether estimation is based on OLS or FGLS. It also shows, however, that the power gains from FGLS diminish with decreasing T, and with T = 10 the power of OLS and the power of FGLS (when combined with inference which provides a correctly sized test) are essentially the same. An interesting feature of Table 9 is that with OLS and robust inference, power actually declines with T.^{[25]} Indeed, with these data, the growing power advantage of FGLS with increasing T comes about because the power of OLS declines and not because the power of FGLS improves. Further inspection of the underlying simulations suggests that there is a genuine increase in the precision of OLS estimation as T falls: the estimated policy effects become more tightly distributed around their true value.
Estimation and inference  T = 30, effect = 0  T = 30, effect = 0.05  T = 20, effect = 0  T = 20, effect = 0.05  T = 10, effect = 0  T = 10, effect = 0.05
OLS,                      0.039 (0.003)       0.338 (0.007)          0.034 (0.003)       0.380 (0.007)          0.037 (0.003)       0.447 (0.007)
FGLS                      0.101 (0.004)       0.449 (0.007)          0.107 (0.004)       0.436 (0.007)          0.091 (0.004)       0.415 (0.007)
FGLS,                     0.044 (0.003)       0.484 (0.007)          0.039 (0.003)       0.478 (0.007)          0.040 (0.003)       0.479 (0.007)
BC-FGLS                   0.078 (0.004)       0.453 (0.007)          0.078 (0.004)       0.431 (0.007)          0.078 (0.004)       0.413 (0.007)
BC-FGLS,                  0.049 (0.003)       0.479 (0.007)          0.043 (0.003)       0.481 (0.007)          0.038 (0.003)       0.486 (0.007)
The table shows the proportion of the time that the null hypothesis of no treatment effect was rejected in 5000 Monte Carlo simulations. Simulation standard errors are reported in parentheses. The simulations resample states from CPS-MORG data, having imposed the sampling restrictions described in the text. T is the number of (consecutive) time periods (years). The first year of data is chosen from a uniform distribution between 1979 and (2009 − T) in each Monte Carlo simulation. Regressions are run on aggregated state-year data. The inference methods used are discussed in the text. We adjust for test size when making power comparisons using the procedure outlined by Davidson and MacKinnon (1998). See text for details.
The reason why this is possible in this context is as follows.^{[26]} DiD regressions effectively estimate the difference in mean outcomes between post-treatment periods and pre-treatment periods (and then compare these differences across treatment and control groups). The variance of this difference is decreasing in the covariance between the error terms pre- and post-treatment (intuitively, if error terms pre- and post-treatment covaried perfectly then they would not add any noise to the difference between pre- and post-treatment outcomes because they would cancel out). If serial correlation between observations decays with time, then the error term from an additional time period pre- (post-) treatment will covary less strongly with error terms in the post- (pre-) treatment period than the error terms already present. Hence the “covariance effect” acts to increase the variance of the DiD estimate of the treatment effect when another time period is added to the data. On the other hand, of course, the variance is also increasing in the variance of the average error terms both pre- and post-treatment, and adding more time periods will reduce this variance. But this reduction will be small if serial correlation is high, and hence it can be dominated by the covariance effect.
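The covariance effect can be illustrated with a small calculation. The sketch below is a stylised example and is not drawn from the paper's simulations: it assumes a single group, a stationary AR(1) error with unit variance, and equal numbers of pre- and post-treatment periods. The function name is hypothetical.

```python
def var_post_minus_pre(T, rho):
    """Variance of (mean of last T/2 errors) minus (mean of first T/2 errors).

    Errors follow a stationary AR(1) process with unit variance, so the
    covariance between periods s and t is rho ** |s - t|.
    """
    m = T // 2
    w = [-1.0 / m] * m + [1.0 / m] * m  # weights defining the post-minus-pre difference
    return sum(w[s] * w[t] * rho ** abs(s - t) for s in range(T) for t in range(T))

for rho in (0.0, 0.9):
    row = ", ".join(f"T={T}: {var_post_minus_pre(T, rho):.3f}" for T in (2, 4, 10))
    print(f"rho = {rho}: {row}")
```

With no serial correlation the variance falls as periods are added, as expected. With ρ = 0.9 it rises as T goes from 2 to 4 to 10, because the added periods covary less strongly with the other side of the treatment date than the periods already included.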
An additional lesson, then, is that researchers who use OLS to perform DiD estimation should be very cautious about blindly using as many time periods as possible. With serially correlated error terms, this can result in a reduction in power. This issue does not arise with FGLS, which effectively transforms the data in a way that removes the serial correlation from the error terms. This seems to us another reason to favour a FGLS approach.
In summary, our simulation experiments suggest that as long as the number of time periods is not too small (approximately 10 or less), FGLS combined with robust inference is likely to offer substantial power gains; and even if it does not – because the error process is badly misspecified – test size can still be reliably controlled, even with few groups. We therefore recommend that this approach be used more routinely by applied researchers doing DiD estimation.
5 STATA Code for FGLS
The replication files provided with this paper include the STATA ado file Hansen.ado, which implements the FGLS estimators used in this paper, with or without the iterative bias-correction procedure set out in Hansen (2007). It also allows a user to combine GLS with CRSEs as illustrated above. This implementation assumes an AR error process homogeneous across groups. It is designed to look and feel like the “regress” command in STATA in terms of syntax and outputs, but with options to control the FGLS estimation. Full details are at the top of the file. This package can be used by applied researchers to implement the empirical strategies we recommend.
6 Summary and Conclusion
This paper contributes to a growing literature on inference in difference-in-differences designs with grouped errors. The literature has emphasized difficulties in obtaining correctly sized hypothesis tests, particularly with few groups, but our results suggest this is not the key challenge.
Using Monte Carlo evidence, we have made four points. First, it is possible to obtain tests of the correct size, even with few groups, and in many settings this is possible using methods that are very straightforward to implement. Second, the main problem in difference-in-differences designs with grouped errors is instead low power to detect real effects. Third, feasible GLS estimation combined with robust inference methods can increase power considerably whilst maintaining correct test size − again, even with few groups. These findings have proven robust to a wide range of data generating processes. Fourth, we show that with OLS and serially correlated errors, power can actually fall as panel length grows. This seems to us another reason to prefer an FGLS approach.
We therefore recommend that applied researchers adopt FGLS estimation combined with robust inference methods in practice. We also suggest that future research could usefully focus on improving power, as well as on getting test size correct, when using difference-in-differences designs.
Acknowledgement
We would like to thank a large number of colleagues and seminar and conference participants for helpful comments. All remaining errors are our own. This work was supported by the ESRC through Programme Evaluation for Policy Analysis, a former Node of the National Centre for Research Methods, through the Research Centre on Micro-Social Change (MiSoC) at the University of Essex (ES/L009153/1), and through the Centre for the Microeconomic Analysis of Public Policy (CPP) at the IFS (RES-544-28-5001).
References
Angrist, J., and J. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press.
Barrios, T., R. Diamond, G. W. Imbens, and M. Kolesár. 2012. “Clustering, Spatial Correlations, and Randomization Inference.” Journal of the American Statistical Association 107 (498): 578–591. doi:10.1080/01621459.2012.682524
Bell, R. M., and D. F. McCaffrey. 2002. “Bias Reduction in Standard Errors for Linear Regression with Multistage Samples.” Survey Methodology 28 (2): 169–179.
Bertrand, M., E. Duflo, and S. Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” Quarterly Journal of Economics 119: 249–275. doi:10.1162/003355304772839588
Bester, A. C., T. G. Conley, and C. B. Hansen. 2011. “Inference with Dependent Data Using Cluster Covariance Estimators.” Journal of Econometrics 165 (2): 137–151. doi:10.1016/j.jeconom.2011.01.007
Bloom, H. S. 1995. “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.” Evaluation Review 19 (5): 547–556. doi:10.1177/0193841X9501900504
Brewer, M., T. F. Crossley, and R. Joyce. 2013. “Inference with Differences-in-Differences Revisited.” IZA DP No. 7742.
Cameron, A. C., J. G. Gelbach, and D. L. Miller. 2008. “Bootstrap-Based Improvements for Inference with Clustered Errors.” The Review of Economics and Statistics 90 (3): 414–427. doi:10.1162/rest.90.3.414
Cameron, A. C., and D. L. Miller. 2015. “A Practitioner’s Guide to Cluster-Robust Inference.” Journal of Human Resources 50 (2): 317–372. doi:10.3368/jhr.50.2.317
Conley, T., and C. Taber. 2011. “Inference with ‘Difference in differences’ with a Small Number of Policy Changes.” The Review of Economics and Statistics 93 (1): 113–125. doi:10.1162/REST_a_00049
Davidson, R., and J. G. MacKinnon. 1998. “Graphical Methods for Investigating the Size and Power of Test Statistics.” The Manchester School 66: 1–26. doi:10.1111/1467-9957.00086
Donald, S. G., and K. Lang. 2007. “Inference with Difference-in-Differences and Other Panel Data.” The Review of Economics and Statistics 89 (2): 221–233. doi:10.1162/rest.89.2.221
Hansen, C. 2007. “Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects.” Journal of Econometrics 140 (2): 670–694. doi:10.1016/j.jeconom.2006.07.011
Imbens, G., and M. Kolesár. 2012. “Robust Standard Errors in Small Samples: Some Practical Advice.” National Bureau of Economic Research Working Paper 18478.
Liang, K.-Y., and S. L. Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73: 13–22. doi:10.1093/biomet/73.1.13
MacKinnon, J., and M. Webb. 2017. “Wild Bootstrap Inference for Wildly Different Cluster Sizes.” Journal of Applied Econometrics 32 (2): 233–254. doi:10.1002/jae.2508
Moulton, B. R. 1990. “An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units.” The Review of Economics and Statistics 72: 334–338. doi:10.2307/2109724
Nickell, S. 1981. “Biases in Dynamic Models with Fixed Effects.” Econometrica 49 (6): 1417–1426. doi:10.2307/1911408
Romano, J., and M. Wolf. 2015. “Resurrecting Weighted Least Squares (August 2015).” University of Zurich, Department of Economics, Working Paper No. 172.
Solon, G. 1984. “Estimating Autocorrelations in Fixed Effects Models.” National Bureau of Economic Research Technical Working Paper 32.
Webb, M. 2013. “Reworking Wild Bootstrap Based Inference for Clustered Errors.” Queen’s Economics Department Working Paper No. 1315.
Wooldridge, J. M. 2003. “Cluster-Sample Methods in Applied Econometrics.” American Economic Review 93: 133–138. doi:10.1257/000282803321946930
Wooldridge, J. M. 2006. “Cluster-Sample Methods in Applied Econometrics: An Extended Analysis.” Mimeograph, Michigan State University Department of Economics.
Supplemental Material
The online version of this article offers supplementary material (DOI: https://doi.org/10.1515/jem-2017-0005).
©2018 Thomas F. Crossley et al., published by De Gruyter
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.