Fixed effects are a common means to “control for” unobservable differences among observations that share observable characteristics; examples include age, year, or location in cross-sectional studies and individual or firm effects in panel data. While fixed effects permit different mean outcomes across groups, the treatment effect estimates are typically constrained to be the same; in more colloquial terms, the intercepts of the conditional expectation functions may differ, but not the slopes.

Our main contribution is considering the empirical importance of heterogeneity in these slopes (*i.e.*, treatment effects) across fixed effects groups. In particular, we compare treatment effect estimates using a fixed effects estimator (FE) to the average treatment effect (ATE) in replications of eight influential papers from the *American Economic Review* published between 2004 and 2009.^{1} We first consider a randomized experiment as a case study in Section 1 and, in Section 3, we show generally that heterogeneous treatment effects are common and that the FE and ATE are often statistically and economically different. In all but one paper, there is at least one statistically significant source of treatment effect heterogeneity. In five papers, this heterogeneity leads the ATE to differ statistically from the FE estimate at the 5% level (seven of eight differ at the 10% level). Five of these differences are economically significant, which we define as an absolute difference exceeding 10%. Based upon this result, we conclude that methods that consistently estimate the ATE offer more interpretable results than standard FE models.

In Section 2, we provide a formal framework to establish the theoretical bias of the FE estimator in the presence of heterogeneous treatment effects. We derive the probability limit of the FE under heterogeneous treatment effects and interpret it as a weighted average of group-specific effects. We propose two alternative estimators that consistently estimate the ATE under group-specific heterogeneity and derive the joint asymptotic distribution of these estimators with the FE.

One approach to incorporate heterogeneous marginal effects into a regression framework is the correlated random coefficients model (CRC). Our paper explores the empirical relevance of CRC models by considering a simplified version: a fixed effects regression that includes group-specific marginal effects. This specification corresponds to the following data-generating process:

$${y}_{i}={x}_{i}{\beta}_{g(i)}+{\mathbf{z}}_{i}^{\mathrm{\prime}}\gamma +{\epsilon}_{i},$$(1)

where *y*_{i} is the outcome for observation *i* among *N*, *x*_{i} is treatment or another variable of interest, and **z**_{i} contains control variables, including group-specific fixed effects. The treatment effects are group-specific for each of the *g* = 1, …, *G* groups, where group membership is known for each observation. Lastly, *ϵ*_{i} is mean 0 with variance-covariance matrix Ω. Our analysis of this model can be viewed as a special case of the results in Chernozhukov et al. (2013).
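As a concrete illustration, the data-generating process in Equation 1 can be simulated directly. The sketch below (all group sizes, slopes, and variances are hypothetical choices, not values from the paper) draws group-specific treatment effects and computes the ATE as the frequency-weighted average of the group slopes:

```python
import numpy as np

# Hypothetical simulation of Equation 1: each observation i belongs to a
# group g(i) with its own treatment slope beta_g.
rng = np.random.default_rng(0)
N, G = 6000, 3
beta_g = np.array([1.0, 2.0, 4.0])   # group-specific treatment effects (made up)
gamma = 0.5                          # common coefficient on the control

g = rng.integers(0, G, size=N)       # group membership, known for each i
x = rng.normal(0.0, 1.0 + g, size=N) # treatment variance differs by group
z = rng.normal(size=N)               # one control variable
eps = rng.normal(size=N)
y = x * beta_g[g] + z * gamma + eps

# The ATE is the frequency-weighted average of the group slopes.
pi_g = np.bincount(g, minlength=G) / N
ate = pi_g @ beta_g
```

The groups deliberately receive different treatment variances here; that asymmetry is what drives a wedge between the FE estimate and the ATE in the analysis that follows.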

There is a long tradition in the econometrics literature considering average partial effects (see, *e.g.*, Blundell and Powell 2003; Chamberlain 1980; 1982; 1984; 1992; Chernozhukov et al. 2013; Graham and Powell 2012; Wooldridge 1997; 2005).^{2}

#### Average treatment effect (ATE)

*The average treatment effect (ATE) for Equation 1 is defined as*

$${\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}\equiv \sum _{g}{\pi}_{g}{\beta}_{g},$$

*where π*_{g} is the population frequency of group g.

An established result is that fixed effects regressions average the group-specific slopes proportional to both the sample frequency of the group and the conditional variance of treatment, an average that generally does not coincide with the average treatment effect.^{3} Though this theoretical result is well established, there has been little guidance for the applied researcher regarding the empirical importance of the difference. We find that the difference can be large.

**Comparison to the literature.** Our approach is similar to the CRC model of Chamberlain (1982) (see also Chamberlain (1984, 1992)). The primary differences between our setting and that of the CRC are that (i) we focus on cross-sectional data, whereas the CRC is based on panel data; and (ii) we employ fixed, rather than random, effects. Because of the general similarities, our approach is related to the large literature analyzing non-separable correlated heterogeneity in panel data contexts. Closest to our derivation, Wooldridge (2005) shows conditions under which the FE provides consistent estimates of the average partial effect. Our analysis builds upon this derivation for the case of fixed coefficients and offers a different interpretation of the necessary conditions for this result. Graham and Powell (2012) study the identification and estimation of average partial effects under “irregularity” conditions where the information bound may be singular and Arellano and Bonhomme (2012) study the identification and estimation of distributions of coefficients in CRC models.

Another important example is Chernozhukov et al. (2013), who study average and quantile treatment effects and derive results that nest our approach. In particular, while we focus on cross-sectional settings, our models are relevant for panel models with discrete regressors, as in Chernozhukov et al. (2013). Ghanem (2017) studies testable implications of the assumptions made in these non-separable panel data models. Finally, Imai and Kim (2016) study the linear fixed effect model from a matching perspective, reformulate our result from this perspective, and study dynamic extensions. While these papers provide a strong theoretical reason to believe that FE does not provide sample-weighted estimates, we illustrate the empirical importance of this distinction using a broad array of microeconometric questions.

In the presence of heterogeneous treatment effects, the FE gives a weighted average of these effects. The weights depend not only on the frequencies of the groups, but also upon the sample variances of treatment within the groups. Angrist and Krueger (1999) compare the results from regression and matching estimators to demonstrate that the effects of a dichotomous treatment are averaged using different weights under each procedure. Many empirical studies, including several of those that we replicate in this paper, run separate regressions by group out of concern about treatment effect heterogeneity. Less common are the more parsimonious interacted-model or weighted-regression approaches that we propose, which assume that there is no heterogeneity in the coefficients on the other predictors. A related approach is the random growth model, which uses individual-specific time trends to control for differing growth rates (see, *e.g.*, Heckman and Hotz 1989; Papke 1994; Friedberg 1998); there, however, the heterogeneity controls for omitted variables rather than modeling the treatment effect of interest itself. Solon, Haider, and Wooldridge (2015) note that the FE may be biased in the presence of heterogeneous treatment effects and that weighted least squares can be used to recover the average partial effect. We build upon their discussion by deriving the necessary weights and providing applications that illustrate empirically the importance of the difference between weighted and FE estimates.

Even if an experiment ensures that treatment is independent of all other covariates, the FE may not be a consistent estimator of the ATE. Among our *AER* replications, one experiment illustrates this point: Karlan and Zinman (2008). In this paper, the authors randomize the interest rate offered for a microloan across a population of South Africans and estimate the credit elasticity. One set of fixed effects that the authors use is the “pre-approved risk category” of the borrower (low, medium, or high). To offer interest rates commensurate with prevailing market rates, the authors charge higher rates to higher-risk individuals. As we will show, however, differing means of treatment do not drive the difference between the FE and ATE estimates; differing variances do. The authors not only offer higher rates to riskier borrowers, but also offer a greater range of rates to this group; as a result, the variance of treatment differs across the groups. Thus, the FE estimate will not equal the ATE if the responsiveness to interest rates varies across risk groups.

The FE weights are given in column 3 of Table . These are the relative variances of treatment by group multiplied by the sample frequency of that group (see Proposition 1). Using these weights and the group effect estimated using an interacted model (given in column 2 of Table ), we calculate the FE estimate in the bottom row of the table in the “FE weight” column. Compare the weights from the FE model to the sample frequencies used to calculate the ATE. Note that high risk individuals are over-weighted in the FE model due to their relatively high variance in treatment and the low and medium risk individuals are under-weighted.

We find that high-risk borrowers are much less responsive to the interest rate than low-risk borrowers. Because high-risk individuals are over-weighted and have a smaller (in absolute value) treatment effect, the FE estimate underestimates the sample-weighted responsiveness of individuals to the interest rate by over 60%.

## 2 Estimating the Average Treatment Effect

In this section, we first derive the bias of the FE estimator under treatment effect heterogeneity. Based upon this result, we provide two alternative estimators that eliminate this bias. We also discuss testing procedures related to our proposed estimators.

## 2.1 Bias of the Fixed Effects Estimator

One way to parameterize the treatment effect heterogeneity in Equation 1 is by interacting the fixed effects with treatment; call this vector **a**_{i}.^{4} Then, the data-generating process can be rewritten as:

$${y}_{i}={\mathbf{a}}_{i}^{\mathrm{\prime}}\beta +{\mathbf{z}}_{i}^{\mathrm{\prime}}\gamma +{\epsilon}_{i},$$(2)

where *β* is now a vector of coefficients. Further define the *N* × 1 column vectors **Y**, **X**, and *ϵ* stacked across the *N* observations and the matrices **A** and **Z** stacked likewise. Define $\mathbf{M}={\mathbf{I}}_{N}-\mathbf{Z}{\left({\mathbf{Z}}^{\mathrm{\prime}}\mathbf{Z}\right)}^{-1}{\mathbf{Z}}^{\mathrm{\prime}}$ as the annihilator matrix for **Z**; $\stackrel{~}{\mathbf{Y}}$, $\stackrel{~}{\mathbf{X}}$, and $\stackrel{~}{\mathbf{A}}$ are the corresponding annihilated versions, and $\stackrel{~}{x}$_{i} denotes the *i*th element of $\stackrel{~}{\mathbf{X}}$.
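The role of **M** can be checked with a small numeric sketch (simulated data, hypothetical coefficients): by the Frisch–Waugh–Lovell theorem, the slope from regressing the annihilated outcome on the annihilated treatment equals the coefficient on treatment in the full regression that includes **Z** directly.

```python
import numpy as np

# Sketch of the annihilator (residual-maker) matrix M = I - Z(Z'Z)^{-1}Z'.
rng = np.random.default_rng(1)
N = 200
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # controls incl. intercept
X = rng.normal(size=N) + Z[:, 1]                            # treatment, correlated with Z
Y = 2.0 * X + Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=N)

M = np.eye(N) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
X_t, Y_t = M @ X, M @ Y                                     # annihilated ("tilde") variables

# Partialled-out slope vs. slope from the long regression on [X, Z].
b_fe = (X_t @ Y_t) / (X_t @ X_t)
b_long = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0][0]
```

The two slopes agree to machine precision, which is why the paper can work throughout with the tilde variables.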

As a baseline case, consider an OLS model with fixed effects that does not account for treatment effect heterogeneity, which we call the *fixed effects estimator*.

#### Fixed effects estimator (FE)

*Define the standard fixed effect estimator (FE) as:*

$${\hat{b}}^{\mathrm{F}\mathrm{E}}={\left({\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\stackrel{~}{\mathbf{X}}\right)}^{-1}{\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\stackrel{~}{\mathbf{Y}}.$$

In general, the FE is a biased and inconsistent estimator of the ATE.

#### Bias and inconsistency of FE

*Under the usual assumptions for Equation 1 (see Online Appendix A), the expected value of the FE is:*

$$\mathbb{E}\left[{\hat{b}}^{\mathrm{F}\mathrm{E}}|\mathbf{X},\mathbf{Z},\mathbf{A}\right]={\left[\sum _{i}{\stackrel{~}{x}}_{i}^{2}\right]}^{-1}\sum _{i}{\stackrel{~}{x}}_{i}{\stackrel{~}{\mathbf{a}}}_{i}^{\mathrm{\prime}}\beta ={\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}+\sum _{g}\frac{{N}_{g}}{N}{\beta}_{g}\left[\frac{\hat{\text{Var}}\left({\stackrel{~}{x}}_{i}\mid g(i)=g\right)}{\hat{\text{Var}}\left({\stackrel{~}{x}}_{i}\right)}-1\right]+{o}_{p}(1),$$

*where* $\hat{\text{Var}}(\cdot )$ *is the sample variance and N*_{g} is the number of observations in group g. Further, the FE converges in probability to:

$${\hat{b}}^{\mathrm{F}\mathrm{E}}\underset{n\to \mathrm{\infty}}{\overset{p}{\to}}{\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}+\sum _{g}{\pi}_{g}{\beta}_{g}\left[\frac{\text{Var}\left({\stackrel{~}{x}}_{i}\mid g(i)=g\right)}{\text{Var}\left({\stackrel{~}{x}}_{i}\right)}-1\right].$$

*Hence, if the variance of x*_{i} conditional on **z**_{i} varies across groups and treatment effects also vary across groups, then the FE is a biased and inconsistent estimator for the ATE.

Proposition 1 reveals that, while the FE is an average of the group-specific effects, the weights generally do not coincide with sample frequencies. Instead, FE upweights groups with high variance in treatment conditional upon other covariates and downweights groups with low variance in treatment. This is an efficient approach if the treatment effect is the same for all groups, but leads to biased and inconsistent estimates of the ATE when the treatment effect varies across groups.
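Proposition 1's weighting formula can be verified numerically. In the sketch below (arbitrary group slopes and treatment spreads; the only controls are the group fixed effects themselves, so annihilation reduces to within-group demeaning), the FE slope matches the group slopes averaged with weights proportional to $N_g \cdot \hat{\text{Var}}(\tilde{x}_i \mid g)$ and departs from the frequency-weighted ATE:

```python
import numpy as np

# Numerical check of Proposition 1 with made-up groups.
rng = np.random.default_rng(2)
N, G = 30000, 3
beta_g = np.array([1.0, 2.0, 3.0])
sd_g = np.array([0.5, 1.0, 2.0])          # treatment spread differs by group

g = rng.integers(0, G, size=N)
x = rng.normal(0.0, sd_g[g])
y = x * beta_g[g] + rng.normal(size=N) * 0.1

def demean(v):
    # Within-group demeaning = group fixed effects.
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

x_t, y_t = demean(x), demean(y)
b_fe = (x_t @ y_t) / (x_t @ x_t)          # standard FE estimate

N_g = np.bincount(g)
var_g = np.array([x_t[g == k].var() for k in range(G)])
weights = N_g * var_g / (N_g * var_g).sum()
b_weighted = weights @ beta_g             # Proposition 1's weighted average
ate = (N_g / N) @ beta_g                  # frequency-weighted ATE
```

With these numbers the high-variance group dominates the FE weights, so `b_fe` sits well above `ate`.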

An example where FE would give unbiased results is a regression using data from a perfectly randomized experiment where treatment has the same variance across groups. Such perfection is likely unattainable in observational or experimental settings, however. Indeed, in Section 1, we replicated a randomized experiment from Karlan and Zinman (2008) as a case study. In that experiment, treatment is randomized within different fixed effects groups, but the variances of treatment are not the same across groups. There, we found that the ATE differs from the FE estimate by over 60%.

## 2.2 Alternative Estimators

We offer two alternative estimators for the ATE that, unlike the FE, are unbiased and consistent. For the first estimator, Equation 2 hints that an interacted model could be used to estimate the treatment effect for each group; the resulting group-specific estimates are averaged to provide the ATE. This is the *interaction-weighted estimator*.

#### Interaction-weighted estimator (IWE)

*The interaction-weighted estimator is found by estimating β from Equation 2 using an interacted model, then using these estimates to calculate the ATE. Thus, the IWE is given by:*

$${\hat{b}}^{\mathrm{I}\mathrm{W}\mathrm{E}}=\hat{\mathbf{f}}{\left({\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\stackrel{~}{\mathbf{A}}\right)}^{-1}{\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\stackrel{~}{\mathbf{Y}},$$

*where*^{5}

$$\hat{\mathbf{f}}=\frac{1}{N}\left[\begin{array}{cccc}N& {N}_{1}& \cdots & {N}_{G-1}\end{array}\right].$$
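A minimal implementation of the IWE (simulated data with hypothetical slopes; here the interacted design uses one treatment column per group rather than the base-plus-contrasts form) interacts treatment with group dummies and then averages the estimated group slopes by sample frequency:

```python
import numpy as np

# Sketch of the interaction-weighted estimator (IWE).
rng = np.random.default_rng(3)
N, G = 20000, 3
beta_g = np.array([1.0, 2.0, 3.0])
g = rng.integers(0, G, size=N)
x = rng.normal(0.0, 1.0 + g)              # unequal treatment variances by group
y = x * beta_g[g] + rng.normal(size=N) * 0.1

# Design: group-specific treatment columns plus group dummies (fixed effects).
D = np.eye(G)[g]                          # N x G dummy matrix
A = D * x[:, None]                        # group-specific treatment columns
beta_hat = np.linalg.lstsq(np.column_stack([A, D]), y, rcond=None)[0][:G]

# Frequency-weight the group slopes to estimate the ATE.
pi_hat = np.bincount(g) / N
b_iwe = pi_hat @ beta_hat
```

Because `beta_hat` recovers each group slope separately, the heterogeneity itself is also available for inspection, as the remarks below note.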

Proposition 1 shows that, while FE provides a weighted average of the treatment effects, these weights do not equal sample frequencies. The *regression-weighted estimator* re-weights each observation to undo the FE weighting and applies the frequency weighting of the ATE. A potential advantage of this approach is that it does not require estimating each group’s treatment effect.

#### Regression-weighted estimator (RWE)

*The regression-weighted estimator re-weights each observation according to*

$${\hat{w}}_{i}={\left[\hat{\text{Var}}\left({\stackrel{~}{x}}_{j}\mid g(j)=g(i)\right)\right]}^{-1/2};$$(3)

*that is, inversely proportional to the standard deviation of the conditional treatment values within its group. Let $\hat{\mathbf{W}}$ be a diagonal matrix of these values squared. Then, the RWE is given by:*

$${\hat{b}}^{\mathrm{R}\mathrm{W}\mathrm{E}}={\left({\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\hat{\mathbf{W}}\stackrel{~}{\mathbf{X}}\right)}^{-1}{\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\hat{\mathbf{W}}\stackrel{~}{\mathbf{Y}}.$$

To calculate the RWE, first estimate the annihilator matrix **M**, then compute the weights according to Equation 3, and finally perform weighted least squares on the annihilated data. Note that the RWE can be re-written as:

$${\hat{b}}^{\mathrm{R}\mathrm{W}\mathrm{E}}={\left(\sum _{i}\frac{{\stackrel{~}{x}}_{i}^{2}}{\hat{\text{Var}}\left({\stackrel{~}{x}}_{i}\mid g(j)=g(i)\right)}\right)}^{-1}\sum _{i}\frac{{\stackrel{~}{x}}_{i}{\stackrel{~}{y}}_{i}}{\hat{\text{Var}}\left({\stackrel{~}{x}}_{j}\mid g(j)=g(i)\right)}\phantom{\rule{0ex}{0ex}}=\frac{1}{N}\sum _{g}{N}_{g}\frac{\hat{\text{Cov}}\left({\stackrel{~}{x}}_{i},{\stackrel{~}{y}}_{i}\mid g(i)=g\right)}{\hat{\text{Var}}\left({\stackrel{~}{x}}_{i}\mid g(i)=g\right)}.$$
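The RWE and its re-written form can be sketched as follows (simulated data, hypothetical slopes; again the only controls are the fixed effects, so annihilation is within-group demeaning). The weighted-least-squares form and the frequency-weighted average of within-group slopes agree up to floating-point error:

```python
import numpy as np

# Sketch of the regression-weighted estimator (RWE), Equation 3.
rng = np.random.default_rng(4)
N, G = 20000, 3
beta_g = np.array([1.0, 2.0, 3.0])
g = rng.integers(0, G, size=N)
x = rng.normal(0.0, 1.0 + g)
y = x * beta_g[g] + rng.normal(size=N) * 0.1

def demean(v):
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

x_t, y_t = demean(x), demean(y)
var_g = np.array([x_t[g == k].var() for k in range(G)])
w2 = 1.0 / var_g[g]                       # squared weights from Equation 3

# Weighted least squares on the annihilated data.
b_rwe = ((w2 * x_t) @ y_t) / ((w2 * x_t) @ x_t)

# Equivalent form: frequency-weighted average of within-group OLS slopes.
slopes = np.array([(x_t[g == k] @ y_t[g == k]) / (x_t[g == k] @ x_t[g == k])
                   for k in range(G)])
b_freq = (np.bincount(g) / N) @ slopes
```

Note that the RWE reaches the same frequency-weighted target as the IWE without ever reporting the group-level slopes.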

The IWE and RWE can be compared to the FE. First, it should be noted that, unlike the FE, both the IWE and the RWE are unbiased estimators of the ATE (see Online Appendix A). Furthermore, they are consistent, which we illustrate by deriving the joint asymptotic distribution of the three estimators.^{6} To do so, we first define $\hat{\mathrm{\Omega}}$ to be the variance-covariance matrix of *ϵ*, which may be defined following standard heteroskedastic- or cluster-robust approaches.

#### Asymptotic distribution of the estimators

*Under standard assumptions for the data-generating process given by Equation 1 (see Online Appendix A and, e.g., Wooldridge (2001)), the asymptotic distribution of the estimators is*

$$\sqrt{N}\left[\begin{array}{c}{\hat{b}}^{\mathrm{F}\mathrm{E}}-{\beta}^{\mathrm{F}\mathrm{E}}\\ {\hat{b}}^{\mathrm{I}\mathrm{W}\mathrm{E}}-{\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}\\ {\hat{b}}^{\mathrm{R}\mathrm{W}\mathrm{E}}-{\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}\end{array}\right]\stackrel{d}{\u27f6}N\left(\mathbf{0},\left[\begin{array}{ccc}{\mathrm{\Sigma}}_{FE}& {\mathrm{\Sigma}}_{12}& {\mathrm{\Sigma}}_{13}\\ {\mathrm{\Sigma}}_{12}^{\mathrm{\prime}}& {\mathrm{\Sigma}}_{IWE}& {\mathrm{\Sigma}}_{23}\\ {\mathrm{\Sigma}}_{13}^{\mathrm{\prime}}& {\mathrm{\Sigma}}_{23}^{\mathrm{\prime}}& {\mathrm{\Sigma}}_{RWE}\end{array}\right]\right),$$

*where*

$$\begin{array}{rlll}{\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}& =\mathbb{E}\left[{\stackrel{~}{x}}_{i}^{2}\right]& {\mathbf{V}}_{\stackrel{~}{\mathbf{A}}}& =\mathbb{E}\left[{\stackrel{~}{\mathbf{a}}}_{i}^{\mathrm{\prime}}{\stackrel{~}{\mathbf{a}}}_{i}\right]\\ {\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{W}& =\mathbb{E}\left[{w}_{i}^{2}{\stackrel{~}{x}}_{i}^{2}\right]=1& \mathbf{f}& =\left[1\phantom{\rule{thickmathspace}{0ex}}{\pi}_{1}\phantom{\rule{thickmathspace}{0ex}}\dots \phantom{\rule{thickmathspace}{0ex}}{\pi}_{G-1}\right]\\ {\mathrm{\Sigma}}_{FE}& ={\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{-1}\left[\text{plim}\frac{{\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\hat{\mathrm{\Omega}}\stackrel{~}{\mathbf{X}}}{N}\right]{\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{-1}& {\mathrm{\Sigma}}_{12}& ={\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{-1}\left[\text{plim}\frac{{\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\hat{\mathrm{\Omega}}\stackrel{~}{\mathbf{A}}}{N}\right]{\mathbf{V}}_{\stackrel{~}{\mathbf{A}}}^{-1}{\mathbf{f}}^{\mathrm{\prime}}\\ {\mathrm{\Sigma}}_{IWE}& =\mathbf{f}\text{\hspace{1em}}{\mathbf{V}}_{\stackrel{~}{\mathbf{A}}}^{-1}\left[\text{plim}\frac{{\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\hat{\mathrm{\Omega}}\stackrel{~}{\mathbf{A}}}{N}\right]{\mathbf{V}}_{\stackrel{~}{\mathbf{A}}}{\mathbf{f}}^{\mathrm{\prime}}& {\mathrm{\Sigma}}_{13}& ={\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{-1}\left[\text{plim}\frac{{\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\hat{\mathrm{\Omega}}\mathbf{W}\stackrel{~}{\mathbf{X}}}{N}\right]{\left[{\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{W}\right]}^{-1}\\ {\mathrm{\Sigma}}_{RWE}& ={\left[{\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{W}\right]}^{-1}\left[\text{plim}\frac{{\stackrel{~}{\mathbf{X}}}^{\mathrm{\prime}}\mathbf{W}\hat{\mathrm{\Omega}}\mathbf{W}\stackrel{~}{\mathbf{X}}}{N}\right]{\left[{\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{W}\right]}^{-1}& {\mathrm{\Sigma}}_{23}& =\mathbf{f}{\mathbf{V}}_{\stackrel{~}{\mathbf{A}}}^{-1}\left[\text{plim}\frac{{\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\hat{\mathrm{\Omega}}\mathbf{W}\stackrel{~}{\mathbf{X}}}{N}\right]{\left[{\mathbf{V}}_{\stackrel{~}{\mathbf{X}}}^{W}\right]}^{-1}.\end{array}$$

**Remarks.**

Identification is achieved if the FE model is identified and $\text{Var}({\stackrel{~}{x}}_{i}\mid g(i)=g)>0\phantom{\rule{thickmathspace}{0ex}}\mathrm{\forall}\text{\hspace{1em}}g$, that is, if there is variation in treatment (either in level or assignment status) within each group.

The IWE estimates the treatment effect for each group, allowing the researcher to examine the various treatment effects, which may themselves be of interest. The RWE does not estimate the group-level effects, which is an advantage if the sample size is relatively small. The effective sample size is often small when clustered standard errors are employed, and the RWE may be more successful in this situation. This is particularly true if the level of heterogeneity and the level of clustering are the same or collinear.^{7}

In the presence of heterogeneous treatment effects, the IWE may reduce standard errors by modeling the effects directly. The IWE may also be more robust to model misspecification.

We only consider heterogeneity in *β* and assume constant *γ* coefficients across groups. Under this assumption, the IWE estimator is a more parsimonious version of a fully saturated model estimated separately for each group. The econometrician must decide whether this assumption is acceptable for his or her particular application.

When the IWE is estimated, a standard Wald test can be used to test for the presence of heterogeneous treatment effects. When the IWE and its associated interactions are not estimated, a score test based on the FE can be used instead (see the next subsection).

Given the asymptotic result in Proposition 2, it is straightforward to perform a test of equality between either estimate of the ATE and the FE estimate.

These results can be confirmed using a Monte Carlo simulation; see Online Appendix B.

## 2.3 Testing for Heterogeneous Treatment Effects

Armed with two estimators of the ATE, we next consider testing. First, we derive tests for the presence of heterogeneous treatment effects using both Wald and score tests. Then, we offer a specification test for equality between the ATE and the FE. These tests are implemented by Stata commands and an `R` package available from the authors, as discussed in Online Appendix C.

## 2.3.1 Wald Test for Modeled Heterogeneity

If the IWE is estimated following Equation 2, then testing for the presence of heterogeneous treatment effects is straightforward. Standard or robust methods can be used to test for the joint significance of the interaction terms.

#### Wald test for modeled heterogeneity

*The Wald test statistic for heterogeneous treatment effects, where $\hat{\beta}$ denotes the coefficient vector estimated from the interacted model in Equation 2, is calculated according to*

$${T}_{W}={\left(\mathbf{p}\hat{\beta}\right)}^{\mathrm{\prime}}{\left(\mathbf{p}{\mathbf{V}}^{\mathrm{I}\mathrm{N}\mathrm{T}}{\mathbf{p}}^{\mathrm{\prime}}\right)}^{-1}\mathbf{p}\hat{\beta},$$

*where*

$${\mathbf{V}}^{\mathrm{I}\mathrm{N}\mathrm{T}}={\left({\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\stackrel{~}{\mathbf{A}}\right)}^{-1}{\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\hat{\mathrm{\Omega}}\stackrel{~}{\mathbf{A}}{\left({\stackrel{~}{\mathbf{A}}}^{\mathrm{\prime}}\stackrel{~}{\mathbf{A}}\right)}^{-1}$$

*and the $(G-1)\times G$ matrix*

$$\mathbf{p}=\left[\begin{array}{cc}\mathbf{0}& {\mathbf{I}}_{G-1}\end{array}\right].$$

*Asymptotically, this test statistic has a ${\chi}_{G-1}^{2}$ distribution under the null hypothesis.*
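Under a simplified parameterization, one column of the interacted design carries the base slope and the remaining G − 1 columns carry interaction contrasts, so the null of homogeneity is that the contrasts are jointly zero. The sketch below (simulated data; a homoskedastic variance estimate stands in for the paper's $\hat{\mathrm{\Omega}}$-based robust form) implements the Wald test in that parameterization:

```python
import numpy as np

# Wald test for modeled heterogeneity, base-plus-contrasts parameterization.
rng = np.random.default_rng(5)
N, G = 5000, 3
beta_g = np.array([1.0, 2.0, 3.0])        # heterogeneous slopes: H0 is false
g = rng.integers(0, G, size=N)
x = rng.normal(size=N)
y = x * beta_g[g] + rng.normal(size=N)

D = np.eye(G)[g]
# Columns: base slope, G-1 interaction contrasts, G group dummies.
R = np.column_stack([x, D[:, 1:] * x[:, None], D])
coef, *_ = np.linalg.lstsq(R, y, rcond=None)
resid = y - R @ coef
V = np.linalg.inv(R.T @ R) * (resid @ resid) / (N - R.shape[1])  # homoskedastic

p = np.zeros((G - 1, R.shape[1]))
p[:, 1:G] = np.eye(G - 1)                 # select the interaction contrasts
theta = p @ coef
T_w = theta @ np.linalg.solve(p @ V @ p.T, theta)
# Under H0, T_w ~ chi2(G-1); the 5% critical value for 2 df is about 5.99.
reject = T_w > 5.99
```

With slopes this far apart the statistic is far above the cutoff, so the test rejects homogeneity.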

## 2.3.2 Score Test for Unmodeled Heterogeneity

If the RWE is estimated, the researcher may not be interested in or able to estimate the treatment effects by group. Nonetheless, the presence of heterogeneous treatment effects of the form modeled by the IWE can be tested.

This procedure begins by obtaining the FE-model residual *e*_{i} for each observation.^{8} The score is calculated according to

$$\mathbf{s}\left({y}_{i};{\mathbf{a}}_{i},{\mathbf{z}}_{i},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)={e}_{i}\left[\begin{array}{c}{\mathbf{z}}_{i}\\ {\mathbf{a}}_{i}\end{array}\right].$$

#### Score test for unmodeled heterogeneity

*A score test statistic for the presence of heterogeneous treatment effects has the form*^{9}

$${T}_{S}=N{\left(\frac{1}{N}\sum _{i=1}^{N}\mathbf{s}\left({y}_{i};{\mathbf{a}}_{i},{\mathbf{z}}_{i},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)\right)}^{\mathrm{\prime}}{\mathbf{S}}_{0}^{-1}{\mathbf{C}}^{\mathrm{\prime}}{\left(\mathbf{C}{\mathbf{S}}_{0}^{-1}{\mathbf{C}}^{\mathrm{\prime}}\right)}^{-1}\mathbf{C}{\mathbf{S}}_{0}^{-1}\left(\frac{1}{N}\sum _{i=1}^{N}\mathbf{s}\left({y}_{i};{\mathbf{a}}_{i},{\mathbf{z}}_{i},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)\right),$$

*where*

$${\mathbf{S}}_{0}=\frac{1}{N}\sum _{i=1}^{N}\mathbf{s}\left({y}_{i};{\mathbf{a}}_{i},{\mathbf{z}}_{i},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)\mathbf{s}{\left({y}_{i};{\mathbf{a}}_{i},{\mathbf{z}}_{i},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)}^{\mathrm{\prime}}$$

*and*

$$\mathbf{C}=\left[\begin{array}{cc}{\mathbf{0}}_{(G-1)\times (K+1)}& {\mathbf{I}}_{G-1}\end{array}\right]$$

*(see, e.g., Wooldridge 2001). If clustering is desired, with C clusters and N*_{c} observations in cluster c, then instead we have

$${\mathbf{S}}_{0}=\frac{1}{C}\sum _{c=1}^{C}\sum _{j=1}^{{N}_{c}}\sum _{i=1}^{{N}_{c}}\mathbf{s}\left({y}_{i};{\mathbf{a}}_{i},{\mathbf{z}}_{i},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)\mathbf{s}{\left({y}_{j};{\mathbf{a}}_{j},{\mathbf{z}}_{j},{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)}^{\mathrm{\prime}},$$

*Like the Wald test above, this test statistic has an asymptotic ${\chi}_{G-1}^{2}$ distribution under the null hypothesis.*^{10}
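The score test fits only the FE model and asks whether the scores with respect to the omitted interaction terms are jointly zero. A minimal sketch (simulated data, unclustered **S**_{0}, and with the G − 1 interaction contrasts playing the role of the untested columns of **a**_{i}):

```python
import numpy as np

# Score test for unmodeled heterogeneity: no interacted model is estimated.
rng = np.random.default_rng(7)
N, G = 5000, 3
beta_g = np.array([1.0, 2.0, 3.0])        # heterogeneous: H0 should be rejected
g = rng.integers(0, G, size=N)
x = rng.normal(size=N)
y = x * beta_g[g] + rng.normal(size=N)

D = np.eye(G)[g]                          # group dummies (the fixed effects)
Z = np.column_stack([x, D])               # FE model: common slope + fixed effects
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
e = y - Z @ coef                          # FE residuals

A_int = D[:, 1:] * x[:, None]             # G-1 omitted interaction columns
S = e[:, None] * np.column_stack([Z, A_int])   # score contributions s_i
s_bar = S.mean(axis=0)
S0 = (S.T @ S) / N
K = Z.shape[1]
C = np.hstack([np.zeros((G - 1, K)), np.eye(G - 1)])  # selects interaction scores

# T_S = N * s_bar' S0^{-1} C' (C S0^{-1} C')^{-1} C S0^{-1} s_bar
u = np.linalg.solve(S0, s_bar)
middle = np.linalg.solve(C @ np.linalg.solve(S0, C.T), C @ u)
T_s = N * (u @ C.T @ middle)
reject = T_s > 5.99                        # 5% critical value of chi2(G-1) = chi2(2)
```

Because the FE residuals are orthogonal to the estimated columns, only the interaction scores carry information, which is exactly what **C** isolates.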

## 2.3.3 Test for Equality Between the ATE and FE Estimates

Even if heterogeneous treatment effects are present, the ATE and FE may be equal or at least statistically indistinguishable. In this subsection, we derive a test that is able to distinguish between the two estimates. The same approach can be applied for either estimator of the ATE (*i.e.*, RWE or IWE) and we refer to the chosen estimator as $\hat{b}$^{ATE}.

#### Specification test of the differences between the FE and ATE estimates

*The test of the following null hypothesis*

$$\begin{array}{rl}{H}_{0}& :{\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}-{\beta}^{\mathrm{F}\mathrm{E}}=0\\ {H}_{a}& :{\beta}^{\mathrm{A}\mathrm{T}\mathrm{E}}-{\beta}^{\mathrm{F}\mathrm{E}}\ne 0\end{array}$$

*can be conducted using a Hausman-style test. Note that the Wald test statistic*

$${T}_{E}=\frac{{\left({\hat{b}}^{\mathrm{A}\mathrm{T}\mathrm{E}}-{\hat{b}}^{\mathrm{F}\mathrm{E}}\right)}^{2}}{\text{Var}\left[{\hat{b}}^{\mathrm{A}\mathrm{T}\mathrm{E}}-{\hat{b}}^{\mathrm{F}\mathrm{E}}\right]}$$

*has an asymptotic χ*^{2}(1)* distribution under H*_{0}*. The variance term is easily computed using the joint asymptotic distribution given in Proposition 2.*
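Proposition 2 delivers the variance of the difference analytically; as a rough stand-in, the sketch below (simulated data) studentizes the FE–ATE difference with a nonparametric bootstrap, a simplification not used in the paper, and applies the χ²(1) cutoff:

```python
import numpy as np

# Hausman-style equality test between the FE and an ATE estimator,
# with a bootstrap variance standing in for Proposition 2's formula.
rng = np.random.default_rng(6)
N, G = 4000, 3
beta_g = np.array([1.0, 2.0, 3.0])
g0 = rng.integers(0, G, size=N)
x0 = rng.normal(0.0, 1.0 + g0)            # unequal variances -> FE biased for ATE
y0 = x0 * beta_g[g0] + rng.normal(size=N)

def fe_and_ate(y, x, g):
    # Within-group demeaning, FE slope, and frequency-weighted group slopes.
    n_g = np.bincount(g, minlength=G)
    xt = x - (np.bincount(g, weights=x, minlength=G) / n_g)[g]
    yt = y - (np.bincount(g, weights=y, minlength=G) / n_g)[g]
    b_fe = (xt @ yt) / (xt @ xt)
    slopes = np.array([(xt[g == k] @ yt[g == k]) / (xt[g == k] @ xt[g == k])
                       for k in range(G)])
    return b_fe, (n_g / len(y)) @ slopes

b_fe, b_ate = fe_and_ate(y0, x0, g0)
diffs = []
for _ in range(200):                      # bootstrap the difference
    idx = rng.integers(0, N, size=N)
    f, a = fe_and_ate(y0[idx], x0[idx], g0[idx])
    diffs.append(a - f)

T_e = (b_ate - b_fe) ** 2 / np.var(diffs)
reject = T_e > 3.84                        # 5% critical value of chi2(1)
```

In this design the FE over-weights the high-variance group, so the test detects the FE–ATE gap easily.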

## 3 Comparing FE and ATE Estimates: An *AER* Investigation

To consider the empirical relevance of the distinction between the FE and ATE estimators, we turn to highly cited papers published in the *American Economic Review* between 2004 and 2009. The papers that we choose are well known in their respective fields and rightfully serve as prime examples of respected empirical work. We find the eight most-cited papers that use fixed effects in an OLS model as part of their primary specification and meet additional requirements that serve to limit our scope to papers in applied microeconomics with a clear effect of interest. These papers are listed in Table along with the outcomes, effects of interest, fixed effects considered, and models replicated as identified by the table and column number of appearance in the original paper. A complete description of the process that we follow to identify these papers can be found in Online Appendix D.1.

Table 2: Papers from the *AER* used in the meta-analysis.

To consider whether the difference between the FE and ATE estimators is empirically important, we test for heterogeneous treatment effects and for a difference between the FE and ATE estimates.^{11} Our results are summarized in Table . For each paper, we list the groups that we consider as potential dimensions of treatment effect heterogeneity along with a test for the presence of heterogeneity, a specification test comparing the ATE and FE estimates, and the percent difference in the two estimates. In the final column, we indicate whether the author considers treatment effect heterogeneity among the groups. These statistics all use the RWE and we compute standard errors following the level of clustering used by the original author.^{12} The results for the IWE are generally very similar, as we would expect, and these results are included in the detailed tables of Online Appendix D.3.

Table 3: *AER* replication results.

Column (3) shows that all but one paper has at least one set of fixed effects groups that exhibit treatment effect heterogeneity. This heterogeneity translates into significant differences between the ATE and FE estimates for five papers at the 5% level and seven papers at the 10% level, as seen in Column (4). Defining a difference to be “economically significant” if it exceeds 10%, Column (5) shows that five papers have economically significant differences between the ATE and FE estimates. The average of the largest deviation for each paper that we consider is 21%. As a comparison, Graham and Powell (2012) find a 25% difference between their CRC and FE estimates.

The weighting scheme employed by FE yields a more efficient estimator in the absence of heterogeneous treatment effects, suggesting that FE may be preferable when heterogeneity is relatively unimportant. As we have shown, however, the FE is generally an inconsistent estimator of the ATE, which presents a bias-variance trade-off. Figure 1 plots, for each paper, the largest absolute difference between the FE and RWE estimates against the percent difference in the standard errors of the two estimators.^{13} The ATE estimator exhibits standard errors that are less than ten percent larger than those of the FE in six of eight cases.^{14} Overall, the results indicate that the bias-variance trade-off is generally weak unless the differences between the estimates are large. And when the difference between the estimates is large (*i.e.*, the bias is high), the ATE should be preferred for policy and interpretability reasons.

Figure 1: The relationship between the difference in the estimates and the change in variance among the *AER* replications. Notes: The figure is based on the full results presented in Online Appendix D.3. It plots estimates from the RWE and corresponding standard errors at the level of clustering used by the original authors, where applicable.

## 4 Conclusion

We show that, in the presence of heterogeneous treatment effects, OLS with group fixed effects generally offers a biased estimator of the average treatment effect, a result that has relevance for a variety of fields, including labor, development, health, public finance, and corporate finance. Based on this evidence, we suggest that researchers explore the impact that heterogeneous treatment effects may have on their estimates by considering interaction-weighted or regression-weighted estimators or by analyzing the group-specific weights implied by OLS with fixed effects. We believe that reporting average treatment effects will make estimates more interpretable for individual papers and, perhaps more importantly, across academic studies without increasing the variance of the estimates.

The methods employed in this paper, however, are subject to three notable limitations. First, when clustered standard errors are used, small-sample issues may arise when the number of groups grows close to the number of clusters. When this situation arises, researchers must choose between estimating conservative standard errors and providing a treatment effect that is representative of the whole sample. The optimal solution is inherently application specific.

Second, our discussion has been limited to the case of OLS and we have ignored issues of endogeneity. In cases where the treatment of interest can be assumed to be “as-good-as-random,” as in the cases of a randomized or natural experiment, regression discontinuity, or difference-in-differences identification strategies, our methods may be applied directly. When instrumental variables are used, however, our methods will be complicated by the weights inherent in local average treatment effect estimation (Abadie 2002; Kling 2001); in particular, see Wooldridge (1997) for an analysis of CRC models in the context of instrumental variables estimation.

Finally, our focus in this paper has been to analyze heterogeneity in treatment effects across observable groups. Heterogeneity may also arise along unobservable margins (see, *e.g.*, Bitler, Gelbach, and Hoynes 2014).

## Acknowledgement

We are grateful for comments from Michael Anderson, Alan Auerbach, Rodney Andrews, Joshua Angrist, Marianne Bitler, Henning Bohn, Moshe Buchinsky, Federico Bugni, Colin Cameron, Carlos Dobkin, Shakeeb Khan, Maximilian Kasy, Patrick Kline, Yolanda Kodrzycki, Trevon Logan, Fernando Lozano, Matt Masten, Arnaud Maurel, Doug Miller, Juan Carlos Montoy, Enrico Moretti, Ron Oaxaca, Steve Raphael, Adam Rosen, Jesse Shapiro, Jasjeet Sekhon, Todd Sorensen, Doug Steigerwald, Rocio Titiunik, and Philippe Wingender and for the comments and suggestions of seminar participants at UC Berkeley, the 2008 AEA Pipeline Conference at UCSB, and the 2009 All UC Labor Conference. Any remaining errors are the fault of the authors.

## References

Abadie, Alberto. 2002. “Bootstrap Tests for Distributional Treatment Effects in Instrumental Variable Models.” *Journal of the American Statistical Association* 97 (457): 284–292.

Angrist, Joshua D. and Alan B. Krueger. 1999. Empirical Strategies in Labor Economics. In *Handbook of Labor Economics*, ed. Orley Ashenfelter and David Card. Vol. 3. Amsterdam: Elsevier.

Angrist, Joshua D. and Jörn-Steffen Pischke. 2009. *Mostly Harmless Econometrics*. Princeton, NJ: Princeton University Press.

Arellano, Manuel and Stéphane Bonhomme. 2012. “Identifying Distributional Characteristics in Random Coefficients Panel Data Models.” *The Review of Economic Studies* 79 (3): 987–1020.

Banerjee, Abhijit and Lakshmi Iyer. 2005. “History, Institutions, and Economic Performance: The Legacy of Colonial Land Tenure Systems in India.” *American Economic Review* 95 (4): 1190–1213.

Bedard, Kelly and Olivier Deschênes. 2006. “The Long-Term Impact of Military Service on Health: Evidence from World War II and Korean War Veterans.” *American Economic Review* 96 (1): 176–194.

Bitler, Marianne P., Jonah B. Gelbach and Hilary W. Hoynes. 2014. Can Variation in Subgroups’ Average Treatment Effects Explain Treatment Effect Heterogeneity? Evidence from a Social Experiment. Working Paper 20142, National Bureau of Economic Research.

Blundell, R. W. and James L. Powell. 2003. Endogeneity in Nonparametric and Semiparametric Regression Models. In *Advances in Economics and Econometrics: Theory and Applications*, ed. M. Dewatripont, L. P. Hansen and S. J. Turnovsky. Vol. II. Cambridge: Cambridge University Press.

Cameron, A. Colin and Pravin K. Trivedi. 2005. *Microeconometrics*. Cambridge: Cambridge University Press.

Card, David, Carlos Dobkin and Nicole Maestas. 2008. “The Impact of Nearly Universal Insurance Coverage on Health Care Utilization: Evidence from Medicare.” *American Economic Review* 98 (5): 2242–2258.

Chamberlain, Gary. 1980. “Analysis of Covariance with Qualitative Data.” *Review of Economic Studies* 47: 225–238.

Chamberlain, Gary. 1982. “Multivariate Regression Models for Panel Data.” *Journal of Econometrics* 18: 5–46.

Chamberlain, Gary. 1984. Panel Data. In *Handbook of Econometrics*, 1247–1318. Vol. 2. Amsterdam: Elsevier.

Chamberlain, Gary. 1992. “Efficiency Bounds for Semiparametric Regression.” *Econometrica* 60 (3): 567–596.

Chernozhukov, Victor, Iván Fernández-Val, Jinyong Hahn and Whitney Newey. 2013. “Average and Quantile Effects in Nonseparable Panel Models.” *Econometrica* 81 (2): 535–580.

Friedberg, Leora. 1998. “Did Unilateral Divorce Raise Divorce Rates? Evidence from Panel Data.” *American Economic Review* 88 (3): 608–627.

Gentzkow, Matthew and Jesse Shapiro. 2013. “Measuring the Sensitivity of Parameter Estimates to Sample Statistics.” Working paper, University of Chicago.

Ghanem, Dalia. 2017. “Testing Identifying Assumptions in Nonseparable Panel Data Models.” *Journal of Econometrics* 197 (2): 202–217.

Graham, Bryan S. and James L. Powell. 2012. “Identification and Estimation of Average Partial Effects in ‘Irregular’ Correlated Random Coefficient Panel Data Models.” *Econometrica* 80 (5): 2105–2152.

Griffith, Rachel, Rupert Harrison and John Van Reenen. 2006. “How Special Is the Special Relationship? Using the Impact of U.S. R&D Spillovers on U.K. Firms as a Test of Technology Sourcing.” *American Economic Review* 96 (5): 1859–1875.

Heckman, James J. and V. Joseph Hotz. 1989. “Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training.” *Journal of the American Statistical Association* 84 (408): 862–874.

Imai, Kosuke and In Song Kim. 2016. “When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” Working paper.

Karlan, Dean S. and Jonathan Zinman. 2008. “Credit Elasticities in Less-Developed Economies: Implications for Microfinance.” *American Economic Review* 98 (3): 1040–1068.

Kline, Patrick and Andres Santos. 2012. “A Score Based Approach to Wild Bootstrap Inference.” *Journal of Econometric Methods* 1 (1): 23–41.

Kling, Jeffrey R. 2001. “Interpreting Instrumental Variables Estimates of the Returns to Schooling.” *Journal of Business & Economic Statistics* 19 (3): 358–364.

Lochner, Lance and Enrico Moretti. 2004. “The Effect of Education on Crime: Evidence from Prison Inmates, Arrests, and Self-Reports.” *American Economic Review* 94 (1): 155–189.

Meghir, Costas and Marten Palme. 2005. “Educational Reform, Ability, and Family Background.” *American Economic Review* 95 (1): 414–424.

Murphy, Kevin M. and Robert H. Topel. 1985. “Estimation and Inference in Two-Step Econometric Models.” *Journal of Business & Economic Statistics* 3 (4): 370–379.

Oreopoulos, Philip. 2006. “Estimating Average and Local Average Treatment Effects of Education When Compulsory Schooling Laws Really Matter.” *American Economic Review* 96 (1): 152–175.

Oster, Emily. 2014. Unobservable Selection and Coefficient Stability: Theory and Validation. Working paper, University of Chicago.

Papke, Leslie E. 1994. “Tax Policy and Urban Development: Evidence from the Indiana Enterprise Zone Program.” *Journal of Public Economics* 54: 37–49.

Pérez-González, Francisco. 2006. “Inherited Control and Firm Performance.” *American Economic Review* 96 (5): 1559–1588.

Solon, Gary, Steven J. Haider and Jeffrey M. Wooldridge. 2015. “What Are We Weighting For?” *Journal of Human Resources* 50 (2): 301–316.

Wooldridge, Jeffrey M. 1997. “On Two Stage Least Squares Estimation of the Average Treatment Effect in a Random Coefficient Model.” *Economics Letters* 56: 129–133.

Wooldridge, Jeffrey M. 2001. *Econometric Analysis of Cross Section and Panel Data*. Cambridge, MA: MIT Press.

Wooldridge, Jeffrey M. 2005. “Fixed-Effects and Related Estimators for Correlated Random-Coefficient and Treatment-Effect Panel Data Models.” *Review of Economics and Statistics* 87 (2): 385–390.

## Footnotes

1. See Murphy and Topel (1985), Gentzkow and Shapiro (2013), and Oster (2014) for other examples of papers that replicate published studies to elucidate a methodological point. We only analyze the data that the authors openly provide on the EconLit website. Though some of these papers include both OLS and instrumental variables approaches, we consider the implications of heterogeneous treatment effects for the OLS specifications only to focus on the weighting scheme applied by this common procedure.

2. We assume that the sample is representative of the population of interest for the ATE; specifically, *N*_{g}/*N* → *π*_{g}.

3. See, *e.g.*, Angrist and Krueger (1999), Wooldridge (2005), and Angrist and Pischke (2009).

4. Consider **a**_{i} having first *x*_{i}, followed by *x*_{i} interacted with *G* − 1 fixed effects.

5. These weights are designed to align with the definition of **a**_{i}; see footnote 4.

6. The fixed effects that we consider denote group membership, and the sizes of these groups grow with the overall sample size – *i.e.*, *N*_{g} → ∞ ∀ *g* ∈ {1, …, *G*}, with *G* fixed. This is the opposite of the typical configuration in panel data problems.

7. The RWE estimator is identified in this situation because its model form is the same as that of the FE model, which is identified, and the clustered variance-covariance matrix is well-defined; observations are simply weighted differentially based on covariates rather than on features of the error structure.
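To make this reweighting concrete, here is a minimal sketch in the same spirit (our own illustration with hypothetical numbers, not the paper's implementation): weighting each observation by the inverse of its group's within-group treatment variance undoes the variance weighting of the within estimator, so each group contributes in proportion to its sample share:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (hypothetical numbers): two groups with different
# treatment rates and heterogeneous treatment effects.
sizes, p_treat, beta = [3000, 1000], [0.5, 0.1], [1.0, 3.0]
g = np.concatenate([np.full(n, k) for k, n in enumerate(sizes)])
d = np.concatenate([rng.binomial(1, p, n) for p, n in zip(p_treat, sizes)])
y = np.array(beta)[g] * d + 0.5 * g + rng.normal(0, 1, g.size)

# The FE (within) transformation: demean d and y within each group
gm = lambda v: np.array([v[g == k].mean() for k in range(2)])[g]
d_til, y_til = d - gm(d), y - gm(y)

# Unweighted within-OLS: implicitly weights groups by treatment variance
b_fe = (d_til @ y_til) / (d_til @ d_til)

# Reweighted within-OLS: observation weights inverse to the group's
# within-group treatment variance, so groups count by sample share
omega = 1.0 / np.array([d[g == k].var() for k in range(2)])[g]
b_rw = ((omega * d_til) @ y_til) / ((omega * d_til) @ d_til)

print(f"FE: {b_fe:.3f}  reweighted: {b_rw:.3f}")
```

Under these weights, the reweighted estimate equals the sample-share-weighted average of the group-specific slopes, i.e., the sample analogue of the ATE.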

8. $\mathbf{e}=\mathbf{M}\mathbf{Y}-\mathbf{M}\mathbf{X}{\hat{b}}^{\mathrm{F}\mathrm{E}}$.

9. This form assumes that the information matrix equality holds, which is true under standard regularity conditions and correct specification under the null (see Cameron and Trivedi 2005).

10. This test may outperform the Wald test when a clustered variance-covariance matrix is used (Kline and Santos 2012).

11. We develop a Stata command and an `R` package to perform these analyses; see Online Appendix C. We have posted these resources online for researchers interested in implementing these tests.

12. In Online Appendix D.3, we provide both the clustered and non-clustered heteroskedasticity-robust results. If the fixed effects groups are collinear with the clustering term, we are not able to cluster the IWE estimator. This is the case for the coastal interaction in Banerjee and Iyer (2005) and in the models of Oreopoulos (2006). Because the RWE estimator does not require estimating the interactions, clustering is possible in these cases. We choose to present the RWE results in the table for this reason.

13. If the difference in the standard errors is positive, the RWE has a larger standard error.

14. It is perhaps not surprising that the standard errors for Karlan and Zinman (2008) increase substantially, given the large change in the estimate (over 60% for the RWE). But the *t*-statistics are similar: −4.00 using the FE and −3.94 using the RWE.

## About the article

**Published Online**: 2018-02-03
