In labor economics and other empirical, policy-oriented fields, difference-in-differences (DiD) designs are an extremely common way of estimating the effects of policies or programs (henceforth “treatment effects”). A recent literature has highlighted that failure to appropriately quantify the uncertainty surrounding DiD estimates can lead to dramatically misleading inference (e.g. Bertrand, Duflo, and Mullainathan 2004; Cameron and Miller 2015). In particular, researchers will tend to reject true null hypotheses with a probability that is far higher than the nominal size of the hypothesis test. The literature has suggested that obtaining tests that are close to the correct size requires non-standard techniques, and that it may not be possible with a small number of groups (Bertrand, Duflo, and Mullainathan 2004; Cameron, Gelbach, and Miller 2008; Angrist and Pischke 2009).

In this paper we report evidence from Monte Carlo simulations that emphasises a different conclusion. We make four main points. First, our simulations demonstrate that in many typical DiD settings tests of the correct size can be obtained with very straightforward methods that are trivial to implement with standard statistical software (in fact, STATA’s cluster-robust inference implements these methods by default); and in settings where this works less well, a bootstrap-based approach highlighted by other authors (e.g. Cameron, Gelbach, and Miller 2008; Webb 2013) provides a reliable alternative. All this is true even with few groups. Second, these techniques have very low power to detect real treatment effects. Thus the real challenge for inference with DiD designs is power rather than test size. Third, our simulations show that substantial gains in power can be achieved using feasible GLS. Moreover, our simulation results suggest that the combination of feasible GLS and cluster-robust inference can control test size, even if the parametric assumptions about the error process implicit in feasible GLS estimation are violated, and even with few groups. Fourth, using OLS with robust inference can lead to a perverse relationship between power and panel length. This is another reason to favour feasible GLS. In summary we highlight the need for applied researchers using DiD designs to pay careful attention not just to consistency and test size, but also to the efficiency of their estimators, and we recommend the use of feasible GLS combined with cluster-robust inference as a solution to this problem.

DiD designs often use micro-data but estimate the effects of a treatment which varies only at a group level at any point in time (e.g. variation in policy across US states). A consequence is that within-group correlation of errors can substantially increase the true level of uncertainty surrounding the treatment effect (e.g. Angrist and Pischke 2009; Donald and Lang 2007; Moulton 1990; Wooldridge 2003; Cameron and Miller 2015). Furthermore, treatment status is typically highly serially correlated. In fact, in the most common case on which we focus in this paper, treatment is an “absorbing state”: once a group is treated, it remains treated in all subsequent periods. This means that serially correlated error terms are likely to have a large impact on the true level of precision with which treatment effects are estimated. In a well-cited Monte Carlo study using US earnings data, Bertrand, Duflo, and Mullainathan (2004) show that accounting only for grouped errors at the state-time level whilst ignoring serial correlation led to a 44% probability of rejecting a true null hypothesis using a nominal 5% level test. So, for example, when evaluating a labor market policy implemented in certain regions from a particular point in time onwards, a researcher should worry both that people in the same region at the same time are affected by common labor market shocks (unrelated to the policy) and that these regional shocks are serially correlated.

A simple approach to deal with both cross-sectional and serial correlation in within-group errors would be to use the formula for a cluster-robust variance matrix due to Liang and Zeger (1986). This is consistent and Wald statistics which use it are asymptotically normal, but the asymptotics apply as the number of clusters tends to infinity. By clustering at the group level rather than the group-time level to account for serial correlation, one is often left with few clusters. The finite sample (i.e. few-clusters) performance of this approach – an empirical question – then becomes crucial, and the literature to date has come to pessimistic conclusions about it. Bertrand, Duflo, and Mullainathan (2004) and Cameron, Gelbach, and Miller (2008) use US earnings data and generate placebo state-level treatments before estimating their “effects.” Forming t-statistics using cluster-robust standard errors (CRSEs), they obtain 9% and 11% rejection rates using nominal 5% level tests with samples from 10 and 6 US states respectively.^{1} This is a considerable improvement over using OLS standard errors, when rejection rates are more than 40%. But it is still approximately double the nominal test size.

The crucial finding of Bertrand et al. and others – that inference can go badly wrong in DiD unless one is very careful – is confirmed once again in our experiments. But our simulations also show that a modification to the standard cluster-robust inference procedure described above can dramatically improve test size with few clusters. One can apply a scaling factor to the OLS residuals that are plugged into the CRSE formula, and use critical values from a t distribution with degrees of freedom equal to the number of groups minus one, rather than a standard normal. This is straightforward and in fact, if one uses a cluster-robust variance matrix in STATA by specifying the “vce(cluster *clustvar*)” option, the confidence intervals and p-values returned are based upon this procedure by default.^{2} When this is done, our simulations show that true test size is within about one percentage point of nominal test size with 50, 20, 10 or 6 groups. This result also holds under a wide range of data generating processes. The key situation in which the method is unreliable is when there is a large imbalance between the numbers of treatment and control groups.
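To make the procedure concrete, the following is a minimal sketch in Python with NumPy rather than the authors' own code. It assumes the scaling factor G/(G-1) × (N-1)/(N-K) applied to the variance matrix (equivalent to scaling the residuals by its square root), which is the correction Stata applies after `regress` with `vce(cluster ...)`; the exact factor used in the paper may differ. The function names, the simulated placebo panel, and the AR(1) error process are all illustrative assumptions. The critical values are standard table entries for t(G-1) at the group counts mentioned above.

```python
import numpy as np

# Two-sided 5% critical values of t(G-1) for G = 6, 10, 20, 50 groups
# (standard table values); they replace the normal value of 1.960.
T_CRIT_975 = {6: 2.571, 10: 2.262, 20: 2.093, 50: 2.010}


def cluster_robust_ols(y, X, groups):
    """OLS with a Liang-Zeger cluster-robust variance and a few-clusters scaling."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ y)
    resid = y - X @ beta
    labels = np.unique(groups)
    n_groups = len(labels)
    meat = np.zeros((k, k))
    for g in labels:
        rows = groups == g
        score = X[rows].T @ resid[rows]  # summed scores for cluster g
        meat += np.outer(score, score)
    # Assumed finite-sample scaling (the Stata convention), not taken from the paper.
    scale = n_groups / (n_groups - 1) * (n - 1) / (n - k)
    cov = scale * xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov)), n_groups


def placebo_panel(n_groups=10, n_periods=20, rho=0.8, seed=0):
    """Hypothetical group-by-period panel with AR(1) errors and no true effect."""
    rng = np.random.default_rng(seed)
    g = np.repeat(np.arange(n_groups), n_periods)
    t = np.tile(np.arange(n_periods), n_groups)
    u = np.zeros((n_groups, n_periods))
    u[:, 0] = rng.normal(size=n_groups)
    for s in range(1, n_periods):
        u[:, s] = rho * u[:, s - 1] + rng.normal(size=n_groups)
    y = u.ravel()  # row order matches g and t: group by group, period by period
    # Placebo treatment: first half of the groups, second half of the periods.
    treated = (g < n_groups // 2) & (t >= n_periods // 2)
    X = np.column_stack([
        np.ones(len(y)),
        (g[:, None] == np.arange(1, n_groups)).astype(float),   # group effects
        (t[:, None] == np.arange(1, n_periods)).astype(float),  # period effects
        treated.astype(float),                                  # DiD regressor
    ])
    return y, X, g


y, X, g = placebo_panel()
beta, se, n_groups = cluster_robust_ols(y, X, g)
t_stat = beta[-1] / se[-1]
reject = abs(t_stat) > T_CRIT_975[n_groups]  # compare with t(G-1), not N(0,1)
```

Repeating the last four lines over many seeds and averaging `reject` would give a simulated rejection rate to compare with the nominal 5% level, in the spirit of the placebo experiments described above.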

Various alternative techniques for achieving correct test size have been proposed and/or tested (Bertrand, Duflo, and Mullainathan 2004; Donald and Lang 2007; Cameron, Gelbach, and Miller 2008; Bester, Conley, and Hansen 2011). Of these, only a wild cluster bootstrap-t procedure has been shown to produce tests of approximately the right size in the typical DiD setup considered in this paper (see Section 2) when the number of groups is as small as six (Cameron, Gelbach, and Miller 2008). Like using CRSEs, this is theoretically robust to heteroscedasticity and arbitrary patterns of error correlation within clusters, and to variation in error processes across clusters. It has also been shown to be quite robust to large imbalances between the numbers of treatment and control groups (MacKinnon and Webb 2017), and hence provides an important alternative to the simpler method described above in such situations. But it is less trivial to implement and computationally more intensive.
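The wild cluster bootstrap-t can be sketched as follows. This is an illustrative Python/NumPy version under stated assumptions, not the implementation used in the cited papers: it imposes the null hypothesis when generating bootstrap samples and uses Rademacher (±1) weights drawn once per cluster. The function names, the number of replications, and the simulated six-group panel are hypothetical.

```python
import numpy as np


def ols_cluster_t(y, X, groups, col):
    """t-statistic for coefficient `col` using cluster-robust standard errors."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ y)
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]
        meat += np.outer(score, score)
    cov = xtx_inv @ meat @ xtx_inv
    return beta[col] / np.sqrt(cov[col, col])


def wild_cluster_bootstrap_p(y, X, groups, col, n_boot=399, seed=0):
    """Bootstrap p-value for H0: coefficient `col` equals zero."""
    rng = np.random.default_rng(seed)
    t_obs = ols_cluster_t(y, X, groups, col)
    # Restricted fit: drop the tested regressor so that H0 holds by construction.
    X_r = np.delete(X, col, axis=1)
    fitted_r = X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]
    resid_r = y - fitted_r
    labels, cluster_idx = np.unique(groups, return_inverse=True)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        # One Rademacher draw per cluster keeps within-cluster correlation intact.
        signs = rng.choice([-1.0, 1.0], size=len(labels))
        y_star = fitted_r + signs[cluster_idx] * resid_r
        t_boot[b] = ols_cluster_t(y_star, X, groups, col)
    return t_obs, float(np.mean(np.abs(t_boot) >= abs(t_obs)))


# Hypothetical placebo panel: 6 groups, 20 periods, AR(1) errors, no true effect.
rng = np.random.default_rng(1)
G, T = 6, 20
g = np.repeat(np.arange(G), T)
t = np.tile(np.arange(T), G)
u = np.zeros((G, T))
u[:, 0] = rng.normal(size=G)
for s in range(1, T):
    u[:, s] = 0.8 * u[:, s - 1] + rng.normal(size=G)
y = u.ravel()
X = np.column_stack([
    np.ones(G * T),
    (g[:, None] == np.arange(1, G)).astype(float),
    (t[:, None] == np.arange(1, T)).astype(float),
    ((g < G // 2) & (t >= T // 2)).astype(float),
])
t_obs, p_value = wild_cluster_bootstrap_p(y, X, g, col=X.shape[1] - 1)
```

With only six clusters, Rademacher weights generate just 2^6 = 64 distinct sign patterns, so the bootstrap distribution is coarse. The loop over replications, each with its own regression, is what makes the method more computationally intensive than the t(G-1) approach.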

Our second point is that, while it is generally not difficult to obtain the correct size, power to detect real effects is a serious concern. When we use the methods above to implement correctly sized hypothesis tests, our simulations suggest that DiD designs can have very low power. This problem is very severe with few groups. For example, with a large 30-year panel of US earnings data from 6 states, a policy implemented by half of the states that raised earnings by 5% would be detected with only 17% probability (using a test of size 0.05). The policy would have to increase earnings by 16% if the null of no policy effect is to be rejected with 80% probability.

More positively, our experiments also demonstrate that substantial gains in power can be achieved using feasible GLS. In particular, with a moderate time series dimension of at least about 10 time periods, one will often be able to increase power by modeling the serial correlation of unobservables inherent in typical DiD designs. A bias-correction for feasible GLS due to Hansen (2007) reduces, but does not eliminate, test size distortion, particularly with small numbers of groups. However, our simulations demonstrate that test size can be controlled in a way that is robust to having small numbers of groups, and to violations of the parametric assumptions about the error process implicit in feasible GLS estimation, by using the straightforward cluster-robust inference technique described above. This is why we recommend the use of the combination of FGLS (with or without the Hansen correction) and cluster-robust techniques in DiD applications. Furthermore, our simulations also show that OLS estimation is susceptible to a perverse (negative) relationship between power and panel length, and we explain the econometric reason for this. This does not happen with feasible GLS.
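The combination of feasible GLS and cluster-robust inference might look like the sketch below. It is illustrative only and rests on several assumptions not taken from the text: the errors are modeled as AR(1), the autocorrelation parameter is estimated from pooled OLS residuals without the Hansen (2007) bias correction, the Prais-Winsten transformation is applied within each group, and the panel is balanced with rows sorted by group and then period. All function names and the simulated data are hypothetical.

```python
import numpy as np


def cluster_se(y, X, groups):
    """OLS coefficients, cluster-robust standard errors (G/(G-1) scaling), residuals."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ (X.T @ y)
    resid = y - X @ beta
    labels = np.unique(groups)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in labels:
        score = X[groups == g].T @ resid[groups == g]
        meat += np.outer(score, score)
    cov = len(labels) / (len(labels) - 1) * xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov)), resid


def prais_winsten(Z, rho, n_groups, n_periods):
    """AR(1) GLS transform within each group (rows sorted by group, then period)."""
    Z3 = np.asarray(Z, dtype=float).reshape(n_groups, n_periods, -1)
    out = np.empty_like(Z3)
    out[:, 0] = np.sqrt(1.0 - rho ** 2) * Z3[:, 0]  # rescale first period
    out[:, 1:] = Z3[:, 1:] - rho * Z3[:, :-1]       # quasi-difference the rest
    return out.reshape(n_groups * n_periods, -1)


def fgls_ar1(y, X, groups, n_groups, n_periods):
    """Feasible GLS under an assumed AR(1) error, with cluster-robust inference."""
    _, _, resid = cluster_se(y, X, groups)
    e = resid.reshape(n_groups, n_periods)
    # Pooled estimate of rho from OLS residuals (no bias correction applied).
    rho = np.sum(e[:, 1:] * e[:, :-1]) / np.sum(e[:, :-1] ** 2)
    rho = float(np.clip(rho, -0.99, 0.99))
    y_t = prais_winsten(y, rho, n_groups, n_periods).ravel()
    X_t = prais_winsten(X, rho, n_groups, n_periods)
    # Clustered standard errors guard against a misspecified error process.
    beta, se, _ = cluster_se(y_t, X_t, groups)
    return beta, se, rho


# Hypothetical placebo panel: 10 groups, 30 periods, AR(1) errors.
rng = np.random.default_rng(2)
G, T = 10, 30
g = np.repeat(np.arange(G), T)
t = np.tile(np.arange(T), G)
u = np.zeros((G, T))
u[:, 0] = rng.normal(size=G)
for s in range(1, T):
    u[:, s] = 0.8 * u[:, s - 1] + rng.normal(size=G)
y = u.ravel()
X = np.column_stack([
    np.ones(G * T),
    (g[:, None] == np.arange(1, G)).astype(float),
    (t[:, None] == np.arange(1, T)).astype(float),
    ((g < G // 2) & (t >= T // 2)).astype(float),
])
beta_ols, se_ols, _ = cluster_se(y, X, g)
beta_fgls, se_fgls, rho_hat = fgls_ar1(y, X, g, G, T)
```

Comparing `se_fgls[-1]` with `se_ols[-1]` across many simulated panels would indicate how much precision the GLS step buys. The Prais-Winsten form is used here rather than dropping the first period because it keeps the transformed period effects and intercept from becoming collinear.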

The paper proceeds as follows. Section 2 describes the standard econometric setup in DiD designs that we consider, and discusses possible solutions to the inference problems that can arise in this setting. Section 3 details the Monte Carlo design we use to test different inference methods. Section 4 presents and discusses the results of our Monte Carlo simulations. To assist applied researchers who wish to follow the feasible GLS strategy our experiments support, we provide a STATA ado file that implements this, including the Hansen (2007) bias correction. This is described in Section 5. Finally, Section 6 summarizes and concludes.
