Peter M. Aronow

# Abstract

Recent approaches in causal inference have proposed estimating average causal effects that are local to some subpopulation, often for reasons of efficiency. These inferential targets are sometimes data-adaptive, in that they are dependent on the empirical distribution of the data. In this short note, we show that if researchers are willing to adapt the inferential target on the basis of efficiency, then extraordinary gains in precision can potentially be obtained. Specifically, when causal effects are heterogeneous, any asymptotically normal and root-$n$ consistent estimator of the population average causal effect is superefficient for a data-adaptive local average causal effect.

## 1 Introduction

When causal effects are heterogeneous, inferences depend on the population for which causal effects are estimated. Although population average causal effects have traditionally been the inferential targets, recent results have focused on estimating average causal effects that are local to some subpopulation for reasons of efficiency. These approaches include trimming observations based on the distribution of the propensity score [1], using regression adjustment to estimate reweighted causal effects [2, 3, 4], or implementing calipers for propensity-score matching [5, 6]. In some cases, the target parameter is dependent on the empirical distribution of the data, including cases where the researcher is explicitly conducting inference on, e.g., the average treatment effect among the treated conditional on the observed covariate distribution [7], or other causal sample functionals [8, 9], without revision to the estimator being used.

These approaches privilege efficiency in estimation over targeting population average causal effects, and often allow for the target to be defined on the basis of the observed data. We provide an example of how these approaches, taken to their extreme, can provide extraordinary gains in statistical certainty. We consider the case of a data-adaptive target parameter [10] that is allowed to vary with the data depending on which subpopulation’s local average causal effect is best estimated. When treatment effects are heterogeneous, adaptively changing the target parameter on the basis of efficiency yields an unusual result: if the population average causal effect can be consistently estimated with a root-$n$ consistent and asymptotically normal estimator $\hat{\theta}$, then the same estimator $\hat{\theta}$ is always superefficient (i.e., faster than root-$n$ consistent) for a data-adaptive local average causal effect. Furthermore, with an additional regularity condition on mean square convergence, we show that the mean square error of $\hat{\theta}$ for a data-adaptive local average causal effect is of $o(n^{-1})$.

## 2 Results

Consider a full data probability distribution $G$ with an associated causal effect distribution $\tau$ with finite expectation $E_G[\tau]$, where $E_G[\cdot]$ denotes the expectation over the distribution $G$. Further denote the support of the distribution of $\tau$ as $\mathrm{Supp}_G[\tau]$. We impose a regularity condition on $\tau$ establishing non-degeneracy of $\tau$.

Assumption 1:

(Effect heterogeneity). $\min\left( \sup \mathrm{Supp}_G[\tau] - E_G[\tau],\; E_G[\tau] - \inf \mathrm{Supp}_G[\tau] \right) = c > 0$.

Assumption 1 is equivalent to assuming that causal effects are not constant across observations in the distribution $G$; i.e., causal effects are heterogeneous.

We do not observe the full data probability distribution $G$, but we observe an empirical distribution $F_n$. Suppose that, using $F_n$, we have a root-$n$ consistent and asymptotically normal estimator $\hat{\theta}$ of the average causal effect $E_G[\tau]$.

Definition 1:

An estimator $\hat{\theta}$ is root-$n$ consistent and asymptotically normal for $\theta_0$ if $\sqrt{n}(\hat{\theta} - \theta_0) = N(0, \sigma^2) + o_p(1)$, for some $0 < \sigma^2 < \infty$.

We now define the target parameter, $\theta_{F_n}$.

Definition 2:

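Let the target parameter

$$
\theta_{F_n} =
\begin{cases}
\hat{\theta} & : |\hat{\theta} - E_G[\tau]| \le c \\
E_G[\tau] + c & : \hat{\theta} - E_G[\tau] > c \\
E_G[\tau] - c & : \hat{\theta} - E_G[\tau] < -c,
\end{cases}
$$

where, as in Assumption 1, $c = \min\left( \sup \mathrm{Supp}_G[\tau] - E_G[\tau],\; E_G[\tau] - \inf \mathrm{Supp}_G[\tau] \right)$.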

The target parameter adapts naturally to the closest value in an interval surrounding $E_G[\tau]$, where the width of the interval is defined by the support of $\tau$. We formalize how each $\theta_{F_n}$ is a local average treatment effect.
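Definition 2 amounts to clipping the estimate to the interval $[E_G[\tau] - c, E_G[\tau] + c]$. The following is a minimal sketch of that rule, not part of the original note. The function names and the numerical values are hypothetical, and in practice neither $E_G[\tau]$ nor $c$ is known to the researcher, so the code only illustrates the definition.

```python
def heterogeneity_constant(supp_inf: float, supp_sup: float, ate: float) -> float:
    """c from Assumption 1: distance from the mean effect to the nearer support bound."""
    c = min(supp_sup - ate, ate - supp_inf)
    if c <= 0:
        raise ValueError("Assumption 1 requires c > 0 (heterogeneous effects)")
    return c


def target_parameter(theta_hat: float, ate: float, c: float) -> float:
    """Definition 2: theta_hat if it lies within c of the ATE, else the nearer endpoint."""
    return min(max(theta_hat, ate - c), ate + c)


# Hypothetical example: effects supported on [0, 4] with mean 1, so c = 1.
c = heterogeneity_constant(supp_inf=0.0, supp_sup=4.0, ate=1.0)
for theta_hat in (1.3, 2.5, -0.7):
    # 1.3 is kept; 2.5 is pulled down to 2.0; -0.7 is pulled up to 0.0
    print(theta_hat, "->", target_parameter(theta_hat, ate=1.0, c=c))
```

Whenever the estimate lands inside the interval, the target equals the estimate exactly, which is what drives the later results.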

Proposition 1:

There exists a nonnegative weighting associated with each empirical distribution $F_n$, $w_{F_n}$, such that across all $F_n$, $\theta_{F_n} = \dfrac{E_G[w_{F_n} \tau]}{E_G[w_{F_n}]}$.

A proof of Proposition 1 follows directly from the fact that a weighted mean can attain any value in the interval defined by the infimum and supremum of its distribution’s support. Proposition 1 asserts that across all realizations, the target parameter $\theta_{F_n}$ corresponds to an average causal effect for at least one subpopulation. (There may in fact be infinitely many subpopulations to which $\theta_{F_n}$ corresponds.) The composition of the subpopulation(s) associated with each $\theta_{F_n}$ is not directly knowable by the researcher and may vary across realizations of the data.
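To make the weighting argument concrete, here is a small sketch for the simplest heterogeneous case. It assumes a hypothetical two-point effect distribution, $\tau \in \{\text{lo}, \text{hi}\}$ with $\Pr(\tau = \text{hi}) = p_{\text{hi}}$, which is not taken from the note. For any target value in $[\text{lo}, \text{hi}]$, the weights $w_{\text{lo}} = (\text{hi} - t)/p_{\text{lo}}$ and $w_{\text{hi}} = (t - \text{lo})/p_{\text{hi}}$ are nonnegative and reproduce the target $t$ as a weighted mean.

```python
def two_point_weights(lo: float, hi: float, p_hi: float, target: float):
    """Nonnegative weights (w_lo, w_hi) whose weighted mean effect equals target."""
    if not (0.0 < p_hi < 1.0):
        raise ValueError("p_hi must lie strictly between 0 and 1")
    if not (lo <= target <= hi):
        raise ValueError("target must lie within [lo, hi]")
    p_lo = 1.0 - p_hi
    return (hi - target) / p_lo, (target - lo) / p_hi


def weighted_mean_effect(lo: float, hi: float, p_hi: float,
                         w_lo: float, w_hi: float) -> float:
    """E_G[w * tau] / E_G[w] for the two-point effect distribution."""
    p_lo = 1.0 - p_hi
    numerator = p_lo * w_lo * lo + p_hi * w_hi * hi
    denominator = p_lo * w_lo + p_hi * w_hi
    return numerator / denominator


# Hypothetical example: tau is 0 with probability 0.75 and 4 with probability 0.25,
# so the mean effect is 1 and c = 1. Recover a target of 2.0 as a weighted mean.
w_lo, w_hi = two_point_weights(lo=0.0, hi=4.0, p_hi=0.25, target=2.0)
print(w_lo, w_hi, weighted_mean_effect(0.0, 4.0, 0.25, w_lo, w_hi))
```

The weights are not unique (any positive rescaling gives the same weighted mean), which is consistent with the remark that many subpopulations may correspond to the same $\theta_{F_n}$.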

However, mirroring results on other data-adaptive parameters under random sampling, including the sample average causal effect, the target parameter $\theta_{F_n}$ will converge to the average causal effect $E_G[\tau]$ at root-$n$ rate. Proposition 2 proves that the data-adaptive local average causal effect is asymptotically equivalent to the average causal effect, and establishes its rate of convergence.

Proposition 2:

Suppose that $\hat{\theta}$ is a root-$n$ consistent and asymptotically normal estimator of $E_G[\tau]$. Then $\sqrt{n}(\theta_{F_n} - E_G[\tau]) = O_p(1)$.

A proof of Proposition 2 follows by noting that $\sqrt{n}(\hat{\theta} - E_G[\tau]) = O_p(1)$ and that across every realization, $|\theta_{F_n} - E_G[\tau]| \le |\hat{\theta} - E_G[\tau]|$.

We now turn to our primary result, proving the superefficiency of $\hat{\theta}$ in estimating $\theta_{F_n}$.

Proposition 3:

Suppose that Assumption 1 holds and that $\hat{\theta}$ is a root-$n$ consistent and asymptotically normal estimator of $E_G[\tau]$. Then $\sqrt{n}(\hat{\theta} - \theta_{F_n}) = o_p(1)$.

Proof:

Decompose $\hat{\theta}$ into $\tilde{\theta} = N(E_G[\tau], \sigma^2/n)$ and $u = o_p(n^{-1/2})$, so that $\hat{\theta} = \tilde{\theta} + u$. Since $(\tilde{\theta} - \theta_{F_n})$ is $o_p(a_n)$ for any positive sequence $(a_n)$, the rate of convergence of $\hat{\theta}$ is at worst governed by the bound ensured by $u$’s $o_p(n^{-1/2})$ convergence. To prove the claim, note that for any positive $\varepsilon$, $\Pr\left( |\tilde{\theta} - \theta_{F_n}| / a_n \ge \varepsilon \right) \le \Pr\left( \tilde{\theta} - \theta_{F_n} \ne 0 \right) = 2\Phi(-c\sqrt{n}/\sigma)$, where $\Phi(\cdot)$ denotes the standard Normal CDF. Since $\lim_{n \to \infty} 2\Phi(-c\sqrt{n}/\sigma) = 0$, $(\tilde{\theta} - \theta_{F_n})$ is $o_p(a_n)$. Thus $\hat{\theta} - \theta_{F_n} = o_p(a_n) + o_p(n^{-1/2}) = o_p(n^{-1/2})$, yielding the result. □
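The key quantity in the proof is the probability $2\Phi(-c\sqrt{n}/\sigma)$ that the normal component falls outside the interval of half-width $c$. The sketch below evaluates it and checks it against a small simulation. It assumes the idealized case $u = 0$, so that $\hat{\theta}$ is exactly $N(E_G[\tau], \sigma^2/n)$. The values of $c$, $\sigma$, and $n$, and the function names, are hypothetical choices for illustration only.

```python
import math
import random


def miss_probability(n: int, c: float, sigma: float) -> float:
    """2 * Phi(-c * sqrt(n) / sigma), written via the complementary error function."""
    return math.erfc(c * math.sqrt(n) / (sigma * math.sqrt(2.0)))


def simulated_miss_rate(n: int, c: float, sigma: float, ate: float = 0.0,
                        draws: int = 20_000, seed: int = 0) -> float:
    """Share of draws theta_hat ~ N(ate, sigma^2 / n) that differ from the target."""
    rng = random.Random(seed)
    sd = sigma / math.sqrt(n)
    misses = 0
    for _ in range(draws):
        theta_hat = rng.gauss(ate, sd)
        target = min(max(theta_hat, ate - c), ate + c)  # Definition 2
        misses += theta_hat != target
    return misses / draws


# Hypothetical values: c = 1, sigma = 2. The miss probability decays faster than
# any polynomial rate in n, which is why any positive scaling sequence works.
for n in (4, 16, 64, 256):
    print(n, miss_probability(n, c=1.0, sigma=2.0),
          simulated_miss_rate(n, c=1.0, sigma=2.0))
```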

In short, Proposition 3 demonstrates that the probability that $\hat{\theta}$ falls inside the support of the effect distribution converges to one quickly; conditional on this event, estimation error is zero (as the target parameter takes on the same value as the estimator with probability one). To illustrate this result, we can consider a case where an interval defined by the support of the effect distribution encompasses the sampling distribution of the estimator.

Corollary 1:

Suppose that $c \ge \max\left( \sup \mathrm{Supp}[\hat{\theta}] - E_G[\tau],\; E_G[\tau] - \inf \mathrm{Supp}[\hat{\theta}] \right)$. Then $\Pr(\hat{\theta} = \theta_{F_n}) = 1$.

A proof of Corollary 1 follows by noting that $\Pr(|\hat{\theta} - E_G[\tau]| > c) = 0$, and applying Definition 2. In other words, if the support of the estimator being used lies entirely within the interval $[E_G[\tau] - c, E_G[\tau] + c]$, then estimation error is always zero. This condition necessarily holds if $\mathrm{Supp}_G[\tau] = \mathbb{R}$: in that case, the value that any estimator $\hat{\theta}$ takes must coincide with a local average causal effect. But note that Corollary 1 would not hold if $\mathrm{Supp}_G[\tau] = \mathbb{R}^{+}$ and $\Pr(\hat{\theta} < 0) > 0$.

Our results can be generalized to stronger claims straightforwardly. When a regularity condition is imposed on the rate of convergence of $\hat{\theta}$ to normality, a stronger result can be obtained about the rate of mean square convergence.

Proposition 4:

Suppose that Assumption 1 holds and $\hat{\theta}$ obeys $\sqrt{n}(\hat{\theta} - E_G[\tau]) = N(0, \sigma^2) + \varepsilon$, where $E[\varepsilon^2] = o(n^{-1/2})$. Then $E[(\hat{\theta} - \theta_{F_n})^2] = o(n^{-1})$.

Proof:

We will show that the mean square error of $(\tilde{\theta} - \theta_{F_n})$ converges to zero sufficiently quickly, implying that the rate of convergence of $\hat{\theta}$ is at worst governed by the mean square error bound ensured by $\varepsilon$’s convergence rate. To obtain the rate of convergence of the mean square error of $\tilde{\theta}$, we integrate over its squared deviation from the target parameter. Within $c$ of $E_G[\tau]$, the squared deviation is zero, so we need only integrate over the squared deviation over the tails of the normal distribution. To ease calculations, we obtain an upper bound by integrating over the squared deviation from $E_G[\tau]$, rather than from $\theta_{F_n}$:

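$$
E_G\left[ (\tilde{\theta} - \theta_{F_n})^2 \right] \le 2 \int_{c}^{\infty} x^2 \, \frac{\sqrt{n}}{\sigma \sqrt{2\pi}} \, e^{-\frac{x^2 n}{2\sigma^2}} \, dx = \frac{c \, \sigma \sqrt{2/\pi} \; e^{-\frac{c^2 n}{2\sigma^2}}}{\sqrt{n}} + \frac{2 \sigma^2 \, \Phi\left( -c\sqrt{n}/\sigma \right)}{n} = o(n^{-1}).
$$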

Since $E[(\tilde{\theta} - \theta_{F_n})^2] = o(n^{-1})$ and $n^{-1/2} E[\varepsilon^2] = o(n^{-1})$, the Cauchy-Schwarz inequality ensures that $E[(\hat{\theta} - \theta_{F_n})^2] = o(n^{-1}) + o(n^{-1}) = o(n^{-1})$. □
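As a numerical sanity check on the tail bound used in this proof, the sketch below compares the closed-form expression with a direct trapezoidal integration of $2\int_c^\infty x^2 f(x)\,dx$, where $f$ is the $N(0, \sigma^2/n)$ density, and then shows that $n$ times the bound shrinks toward zero. The parameter values ($c = 0.5$, $\sigma = 1$), the truncation point of the integral, and the function names are assumptions made for illustration.

```python
import math


def tail_bound_closed_form(n: int, c: float, sigma: float) -> float:
    """c*sigma*sqrt(2/pi)*exp(-c^2 n / (2 sigma^2))/sqrt(n) + 2 sigma^2 Phi(-c sqrt(n)/sigma)/n."""
    z = c * math.sqrt(n) / sigma
    phi_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # Phi(-z)
    first = c * sigma * math.sqrt(2.0 / math.pi) * math.exp(-z * z / 2.0) / math.sqrt(n)
    second = 2.0 * sigma ** 2 * phi_tail / n
    return first + second


def tail_bound_numeric(n: int, c: float, sigma: float, steps: int = 100_000) -> float:
    """Trapezoidal approximation of 2 * integral_c^inf x^2 f(x) dx, f = N(0, sigma^2/n) density."""
    s = sigma / math.sqrt(n)
    upper = c + 12.0 * s  # truncation point; the mass beyond it is negligible
    h = (upper - c) / steps

    def integrand(x: float) -> float:
        return x * x * math.exp(-x * x / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))

    total = 0.5 * (integrand(c) + integrand(upper))
    for i in range(1, steps):
        total += integrand(c + i * h)
    return 2.0 * total * h


for n in (10, 100, 1000):
    bound = tail_bound_closed_form(n, c=0.5, sigma=1.0)
    # n * bound tending to zero is the o(1/n) claim for the normal component
    print(n, bound, n * bound)
```

Because the bound decays exponentially in $n$, the $o(n^{-1})$ rate is far from tight for the normal component; the binding term in Proposition 4 is the one involving $\varepsilon$.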

## 3 Discussion

Our results highlight the additional certainty obtained by data-adaptively choosing the population for which average causal effects are measured on the basis of efficiency. It is well known that efficiency gains may be obtained through data-adaptive inference. But the extent to which the researcher can benefit from such practice has been understated. Under treatment effect heterogeneity, a precondition for locality to be a concern, all root-$n$ consistent and asymptotically normal estimators of the average treatment effect are superefficient for a local average treatment effect.

There is of course a cost to this superefficiency: the target parameter is likely not of intrinsic interest. This issue is not unique to our setting, and other methods that change the inferential target based on efficiency concerns may be subject to this critique. As Crump et al. ([1], p. 188) note, “external validity may be lost by changing the focus to average treatment effects for a subset of the original sample.” This is exacerbated in our setting by the researcher’s lack of knowledge about the characteristics of the subpopulation under study. Our result represents an extreme case of privileging efficiency over targeting population average causal effects. However, our results provide insight into a potential pathology of data-adaptivity based purely on efficiency concerns: the gains in statistical certainty may be essentially unbounded without further restrictions. We hope that future work in the domain of efficiency theory for data-adaptive parameters will consider classes of restrictions that would exclude the case considered here.

# Acknowledgement

The author thanks Don Green, Cyrus Samii, Jas Sekhon, Mark van der Laan, and two anonymous reviewers for helpful comments. The author expresses particular gratitude to Jas Sekhon for suggesting a parsimonious proof strategy for Proposition 3 and to an anonymous reviewer for inspiring Corollary 1. All remaining errors are the author’s responsibility.

### References

1. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika 2009.

2. Humphreys M. Bounds on least squares estimates of causal effects in the presence of heterogeneous assignment probabilities. Columbia University, 2009 Manuscript.

3. Angrist JD, Pischke JS. Mostly harmless econometrics: An empiricist’s companion. Princeton, NJ: Princeton University Press, 2009.

4. Aronow PM, Samii C. Does regression produce representative estimates of causal effects? Am J Pol Sci 2016;60(1):250–267.

5. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat 2011;10(2):150–161.

6. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985;39(1):33–38.

7. Abadie A, Imbens G. Simple and bias-corrected matching estimators for average treatment effects. NBER technical working paper no. 283 2002.

8. Aronow PM, Green DP, Lee DK. Sharp bounds on the variance in randomized experiments. Ann Stat 2014;42(3):850–871.

9. Balzer LB, Petersen ML, van der Laan MJ. Targeted estimation and inference for the sample average treatment effect. Berkeley, CA: Bepress, 2015.

10. van der Laan MJ, Hubbard AE, Pajouh SK. Statistical inference for data adaptive target parameters. Princeton, NJ: Bepress, 2013.

Published Online: 2016-11-11
Published in Print: 2016-9-1