Published by De Gruyter March 14, 2020

Efficient Nonparametric Causal Inference with Missing Exposure Information

Edward H. Kennedy

Abstract

Missing exposure information is a very common feature of many observational studies. Here we study identifiability and efficient estimation of causal effects on vector outcomes, in such cases where treatment is unconfounded but partially missing. We consider a missing at random setting where missingness in treatment can depend not only on complex covariates, but also on post-treatment outcomes. We give a new identifying expression for average treatment effects in this setting, along with the efficient influence function for this parameter in a nonparametric model, which yields a nonparametric efficiency bound. We use this latter result to construct nonparametric estimators that are less sensitive to the curse of dimensionality than usual, e. g. by having faster rates of convergence than the complex nuisance estimators they rely on. Further we show that these estimators can be root-n consistent and asymptotically normal under weak nonparametric conditions, even when constructed using flexible machine learning. Finally we apply these results to the problem of causal inference with a partially missing instrumental variable.

1 Introduction

It is very common for there to be missing data in observational studies where causal effects are of interest. In this paper we consider studies where there is substantial missingness in an exposure variable. This is a very common feature of observational studies, and examples abound in the literature. For example, Zhang et al. [1] described the Consortium on Safe Labor observational study, where the goal was to estimate effects of mothers’ body mass index on infant birthweight. There, covariate and outcome information was essentially always available, but body mass index data was only available for about half of the mothers. Shortreed and Forbes [2] used the Framingham Heart Study data to assess effects of physical activity on cardiovascular disease and mortality, but up to 30 % of subjects were missing physical activity information. Similarly, Ahn et al. [3] used the Molecular Epidemiology of Colorectal Cancer study to estimate effects of physical activity on colorectal cancer stage, but 20 % of subjects were missing physical activity data. Shardell and Hicks [4] described an analysis of the Baltimore Hip Studies involving older adults with hip fractures, where the goal was to assess effects of perceived mobility recovery on survival. However, this self-reported mobility measure was unavailable for 27 % of subjects. Molinari [5] and Mebane Jr and Poast [6] give numerous other examples of studies with missing exposure information, particularly in survey settings, e. g. from the National Longitudinal Survey of Youth, and the Health and Retirement Study. This is certainly a prevalent problem.

In fact the problem is even more widespread, since in instrumental variable studies one can view the instrument as a type of exposure (e. g. for the purpose of estimating intention-to-treat-style effects, as well as other instrumental variable estimands that require estimating instrument effects on both treatment and outcome). And it is similarly common for instrument values to be missing. For example, in a Mendelian randomization context Burgess et al. [7] used genetic variants as instrumental variables to study the effect of C-reactive protein on fibrinogen and coronary heart disease. However, data on these variants was missing for up to 10 % of subjects, due to difficulty in interpreting output from genotyping platforms. Mogstad and Wiswall [8] and Chaudhuri and Guilkey [9] give further examples of missing instruments from economics.

Although missing exposures and instruments are a prevalent problem, the proposed methods for dealing with this issue have relied on potentially restrictive modeling assumptions, and have been somewhat ad hoc in not considering optimal efficiency. For example Williamson et al. [10] and Zhang et al. [1] propose interesting semiparametric estimators, but they rely on parametric models for nuisance functions, and do not consider the question of efficiency in either nonparametric or semiparametric models. Zhang et al. [1] also consider only binary outcomes. Chaudhuri and Guilkey [9] discuss (semiparametric) efficiency theory, but only for a finite-dimensional parameter in a population moment condition depending on a known function. This means their results apply to classical linear models, but not to the fully nonparametric setting pursued here. Kennedy and Small [11] consider nonparametric efficiency theory in missing instrumental variable problems, but only in simpler settings with one-sided noncompliance and no covariates.

Thus we fill these gaps by giving a new identifying expression for average treatment effects of multivalued discrete exposures in the presence of complex confounding and missing exposure values, deriving the efficient influence function and corresponding nonparametric efficiency bounds, and constructing nonparametric estimators that can be $\sqrt{n}$-consistent and asymptotically normal, even if nuisance functions are estimated at slower rates via nonparametric machine learning tools. Finally we apply these general results to the problem of causal inference with a partially missing instrumental variable. Throughout we make use of a missing at random assumption used by previous authors, allowing exposure missingness to depend on post-exposure outcome information.

2 Missing exposures

In this section we consider the general problem of identification and efficient estimation of average treatment effects, when exposure values are missing at random, allowing the missingness mechanism to depend on both covariates and post-treatment outcome information.

2.1 Setup

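Suppose we observe an iid sample $(O_1, \dots, O_n) \sim P$ where

$$O = (\mathbf{X}, R, RZ, \mathbf{Y})$$

for $\mathbf{X} \in \mathbb{R}^d$ denoting covariate information, $Z \in \{z_1, \dots, z_k\}$ a discrete treatment or exposure, $R \in \{0, 1\}$ an indicator for whether $Z$ is observed or not, and $\mathbf{Y} = (Y_1, \dots, Y_p)^T \in \mathbb{R}^p$ a vector of $p$ outcomes of interest. In general we use script characters to denote the support of a random variable, e. g. $\mathbf{X} \in \mathcal{X} \subseteq \mathbb{R}^d$. For notational simplicity we further define the nuisance functions

$$\begin{aligned}
\mu(\mathbf{y} \mid \mathbf{x}) &= P(\mathbf{Y} \leq \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \\
\pi(\mathbf{x}, \mathbf{y}) &= P(R = 1 \mid \mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \\
\lambda_z(\mathbf{x}, \mathbf{y}) &= P(Z = z \mid \mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}, R = 1).
\end{aligned}$$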

Note that µ is the cumulative distribution function of the outcome given covariates, π can be viewed as the missingness propensity score, and λ the regression on covariates and outcomes of treatment among those for whom it is measured. We further define

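$$\begin{aligned}
\beta_z(\mathbf{x}) &= E\{\mathbf{Y} \lambda_z(\mathbf{X}, \mathbf{Y}) \mid \mathbf{X} = \mathbf{x}\} = \int_{\mathcal{Y}} \mathbf{y} \, \lambda_z(\mathbf{x}, \mathbf{y}) \, d\mu(\mathbf{y} \mid \mathbf{x}) \\
\gamma_z(\mathbf{x}) &= E\{\lambda_z(\mathbf{X}, \mathbf{Y}) \mid \mathbf{X} = \mathbf{x}\} = \int_{\mathcal{Y}} \lambda_z(\mathbf{x}, \mathbf{y}) \, d\mu(\mathbf{y} \mid \mathbf{x}).
\end{aligned}$$

The quantity $\beta_z(\mathbf{x}) = \{\beta_{z1}(\mathbf{x}), \dots, \beta_{zp}(\mathbf{x})\}^T$ is a vector of the same dimension as $\mathbf{Y}$. We will see shortly that, under missing at random assumptions, $\gamma_z$ equals the propensity score, while $\beta_z$ equals the product of the propensity score and outcome regression.

Our goal is to estimate the mean $\psi_z = E(\mathbf{Y}^z) = \{E(Y_1^z), \dots, E(Y_p^z)\}^T \in \mathbb{R}^p$ of the outcomes that would have been observed under treatment level $z \in \mathcal{Z}$. It is well-known that this equals

$$\psi_z = \int_{\mathcal{X}} E(\mathbf{Y} \mid \mathbf{X} = \mathbf{x}, Z = z) \, dP(\mathbf{x}) \tag{1}$$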

under the following standard causal assumptions:

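$$\begin{aligned}
&\text{(Consistency.)} && \mathbf{Y} = \mathbf{Y}^Z && (2) \\
&\text{($Z$-Positivity.)} && P\{\epsilon < P(Z = z \mid \mathbf{X}) < 1 - \epsilon\} = 1 \ \text{ for all } z \in \mathcal{Z} && (3) \\
&\text{($Z$-Exchangeability.)} && Z \perp\!\!\!\perp \mathbf{Y}^z \mid \mathbf{X} \ \text{ for all } z \in \mathcal{Z} && (4)
\end{aligned}$$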

These assumptions have been discussed extensively in the literature [12, 13], so we refer the reader elsewhere for more details.

Crucially, when treatment $Z$ is missing for some subjects, expression (1) is not identified even under (2)–(4), since $Z$ is not observed unless $R = 1$. We consider identification under missing at random conditions used for example by Chaudhuri and Guilkey [9], Williamson et al. [10] and Zhang et al. [1], which are:

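$$\begin{aligned}
&\text{($R$-Exchangeability.)} && R \perp\!\!\!\perp Z \mid \mathbf{X}, \mathbf{Y} && (5) \\
&\text{($R$-Positivity.)} && P\{\epsilon < P(R = 1 \mid \mathbf{X}, \mathbf{Y}) < 1 - \epsilon\} = 1 && (6)
\end{aligned}$$

Note that the missing at random condition (5) allows missingness in treatment $Z$ to depend on the post-treatment outcome $\mathbf{Y}$; this will be important if the outcome captures some information about the missingness mechanism beyond the covariates. An alternative missing-at-random assumption would be $R \perp\!\!\!\perp (Z, \mathbf{Y}) \mid \mathbf{X}$; note however that this implies our $R$-exchangeability assumption, as well as the further testable implication that $R \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{X}$. Therefore our assumption is strictly weaker.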
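The implication noted above can be checked in one line. The following is a short worked step, written in terms of the missingness probability:

```latex
% Suppose R \perp (Z, \mathbf{Y}) \mid \mathbf{X}. Then
\begin{align*}
P(R = 1 \mid \mathbf{X}, \mathbf{Y}, Z)
  &= P(R = 1 \mid \mathbf{X})              && \text{(joint independence)} \\
  &= P(R = 1 \mid \mathbf{X}, \mathbf{Y})  && \text{(its consequence } R \perp \mathbf{Y} \mid \mathbf{X}\text{)}
\end{align*}
% so the missingness probability does not depend on Z given (X, Y), which is (5).
```

The converse fails in general: under (5) alone, $P(R = 1 \mid \mathbf{X}, \mathbf{Y})$ may vary with $\mathbf{Y}$.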

Figure 1 uses directed acyclic graphs to illustrate two different data generating processes that satisfy exchangeability conditions (4) and (5). The first represents a process where missingness occurs prior to the outcome (e. g. subjects miss a visit when they would have contributed treatment information); the second represents a process where missingness occurs after the outcome (e. g. survey non-response or data corruption after measurement).

Figure 1: Two directed acyclic graphs for which the required exchangeability assumptions (4) and (5) hold, where $(U_1, U_2)$ are unmeasured and $Z$ is only observed when $R = 1$. In graph (a) missingness can be viewed as occurring prior to the outcome, while in (b) it can be viewed as occurring after. The variable $U_2$ can be represented as the potential outcome $\mathbf{Y}^0$.

2.2 Identification & efficiency theory

Our first result gives a new identifying expression for ψz under the causal and missing at random assumptions above. This essentially follows from the important facts that, under (5)–(6), the propensity score is given by

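$$\gamma_z(\mathbf{x}) = P(Z = z \mid \mathbf{X} = \mathbf{x})$$

(note this means $Z$-positivity (3) is equivalent to $\gamma_z$ being bounded away from zero and one), and that the outcome regression satisfies

$$\beta_z(\mathbf{x}) = \gamma_z(\mathbf{x}) \, E(\mathbf{Y} \mid \mathbf{X} = \mathbf{x}, Z = z).$$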
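Both facts follow from the definitions of $\gamma_z$ and $\beta_z$ together with (5), with (6) ensuring that conditioning on $R = 1$ is well defined. A sketch of the intermediate steps:

```latex
\begin{align*}
\gamma_z(\mathbf{x})
  &= E\{ P(Z = z \mid \mathbf{X}, \mathbf{Y}, R = 1) \mid \mathbf{X} = \mathbf{x} \} \\
  &= E\{ P(Z = z \mid \mathbf{X}, \mathbf{Y}) \mid \mathbf{X} = \mathbf{x} \}
     && \text{by (5)} \\
  &= P(Z = z \mid \mathbf{X} = \mathbf{x})
     && \text{by iterated expectation,} \\[4pt]
\beta_z(\mathbf{x})
  &= E\{ \mathbf{Y} \, P(Z = z \mid \mathbf{X}, \mathbf{Y}) \mid \mathbf{X} = \mathbf{x} \}
     && \text{by (5)} \\
  &= E\{ \mathbf{Y} \, \mathbb{1}(Z = z) \mid \mathbf{X} = \mathbf{x} \}
     && \text{by iterated expectation} \\
  &= P(Z = z \mid \mathbf{X} = \mathbf{x}) \, E(\mathbf{Y} \mid \mathbf{X} = \mathbf{x}, Z = z).
\end{align*}
```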

Proposition 1.

Under the causal assumptions (2)–(4) and the missing at random assumptions (5)–(6), it follows that

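$$\psi_z = E(\mathbf{Y}^z) = E\left\{ \frac{\beta_z(\mathbf{X})}{\gamma_z(\mathbf{X})} \right\}.$$

Proof.

We have

$$\begin{aligned}
E(\mathbf{Y}^z) &= \int_{\mathcal{X}} E(\mathbf{Y} \mid \mathbf{X} = \mathbf{x}, Z = z) \, dP(\mathbf{x}) \\
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} \mathbf{y} \, \frac{P(Z = z \mid \mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})}{P(Z = z \mid \mathbf{X} = \mathbf{x})} \, d\mu(\mathbf{y} \mid \mathbf{x}) \, dP(\mathbf{x}) \\
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} \mathbf{y} \, \frac{\lambda_z(\mathbf{x}, \mathbf{y})}{\int_{\mathcal{Y}} \lambda_z(\mathbf{x}, \mathbf{y}) \, d\mu(\mathbf{y} \mid \mathbf{x})} \, d\mu(\mathbf{y} \mid \mathbf{x}) \, dP(\mathbf{x}) \\
&= E\left\{ \frac{\beta_z(\mathbf{X})}{\gamma_z(\mathbf{X})} \right\}
\end{aligned}$$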

where the first equality follows by the causal assumptions (2)–(4), the second by Bayes’ rule, the third by the missing at random assumptions (5)–(6) and iterated expectation, and the fourth by definition.   □
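As a numerical sanity check of Proposition 1, the following sketch builds a small discrete model and confirms that the observed-data expression recovers expression (1), whereas a naive complete-case adjustment does not. The binary variables, the probability tables, and the function names are hypothetical choices made for illustration only.

```python
import numpy as np

# Hypothetical toy model with binary X, Z and scalar binary Y (illustrative values only).
p_x = np.array([0.4, 0.6])                    # P(X = x)
p_z_x = np.array([[0.7, 0.3], [0.2, 0.8]])    # P(Z = z | X = x), indexed [x, z]
p_y1_xz = np.array([[0.1, 0.5], [0.3, 0.9]])  # P(Y = 1 | X = x, Z = z), indexed [x, z]
pi_xy = np.array([[0.9, 0.5], [0.6, 0.8]])    # P(R = 1 | X = x, Y = y), indexed [x, y]
y_vals = np.array([0.0, 1.0])

# Joint law of (X, Z, Y, R). R depends on (X, Y) only, so assumption (5) holds.
p_y_xz = np.stack([1 - p_y1_xz, p_y1_xz], axis=-1)       # [x, z, y]
p_xzy = p_x[:, None, None] * p_z_x[:, :, None] * p_y_xz  # [x, z, y]
p_r_xy = np.stack([1 - pi_xy, pi_xy], axis=-1)           # [x, y, r]
joint = p_xzy[:, :, :, None] * p_r_xy[:, None, :, :]     # [x, z, y, r]


def identified_mean(joint, z):
    """E{beta_z(X) / gamma_z(X)} using only observed-data quantities."""
    marg_x = joint.sum(axis=(1, 2, 3))              # P(X = x)
    mu = joint.sum(axis=(1, 3)) / marg_x[:, None]   # P(Y = y | X = x)
    observed = joint[:, :, :, 1]                    # P(x, z, y, R = 1)
    lam = observed[:, z, :] / observed.sum(axis=1)  # lambda_z(x, y)
    beta = (y_vals * lam * mu).sum(axis=1)          # beta_z(x)
    gamma = (lam * mu).sum(axis=1)                  # gamma_z(x)
    return float((marg_x * beta / gamma).sum())


def complete_case_mean(joint, z):
    """Naive adjustment that conditions on R = 1, shown for contrast."""
    marg_x = joint.sum(axis=(1, 2, 3))
    observed = joint[:, z, :, 1]                    # P(x, Z = z, y, R = 1)
    cond_mean = (y_vals * observed).sum(axis=1) / observed.sum(axis=1)
    return float((marg_x * cond_mean).sum())


truth = float((p_x * p_y1_xz[:, 1]).sum())  # expression (1) for z = 1
print(truth, identified_mean(joint, 1), complete_case_mean(joint, 1))
```

In this toy model the first two printed values agree up to floating-point error, while the complete-case value differs because missingness depends on the outcome.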

Interestingly, although the complete-data functional (1) does not depend on the observational treatment process, its identified version under the missing at random assumptions does. Intuitively, this occurs because when $Z$ is missing, one cannot simply condition on $(\mathbf{X}, Z)$ anymore, and instead the distribution of $Z$ given $\mathbf{X}$ needs to be constructed by marginalizing over that of $Z$ given $(\mathbf{X}, \mathbf{Y})$ among those with $Z$ observed.

The next result gives a crucial von Mises-type expansion for the parameter from Proposition 1, which lays the foundation for the efficiency bound and estimation results to come. This result can be viewed as giving a distributional Taylor expansion for the functional ψz.

Lemma 1.

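The functional $\psi_z = \psi_z(P)$ from Proposition 1 admits the expansion

$$\psi_z(\bar{P}) - \psi_z(P) = \int_{\mathcal{O}} \{\phi_z(O; \bar{P}) - \psi_z(\bar{P})\} \, (d\bar{P} - dP) + R_2(\bar{P}, P)$$

where

$$\phi_z(O; P) = \frac{\mathbf{Y} - \beta_z(\mathbf{X})/\gamma_z(\mathbf{X})}{\gamma_z(\mathbf{X})} \left[ \frac{R\{\mathbb{1}(Z = z) - \lambda_z(\mathbf{X}, \mathbf{Y})\}}{\pi(\mathbf{X}, \mathbf{Y})} + \lambda_z(\mathbf{X}, \mathbf{Y}) \right] + \frac{\beta_z(\mathbf{X})}{\gamma_z(\mathbf{X})}$$

and, with $(\bar\pi, \bar\lambda_z, \bar\beta_z, \bar\gamma_z)$ denoting the nuisance functions under $\bar{P}$,

$$R_2(\bar{P}, P) = E_P\left( \frac{\mathbf{Y} - \bar\beta_z(\mathbf{X})/\bar\gamma_z(\mathbf{X})}{\bar\gamma_z(\mathbf{X})} \, \frac{\pi(\mathbf{X}, \mathbf{Y}) - \bar\pi(\mathbf{X}, \mathbf{Y})}{\bar\pi(\mathbf{X}, \mathbf{Y})} \, \{\lambda_z(\mathbf{X}, \mathbf{Y}) - \bar\lambda_z(\mathbf{X}, \mathbf{Y})\} + \left[ \frac{\beta_z(\mathbf{X}) - \bar\beta_z(\mathbf{X})}{\gamma_z(\mathbf{X})} + \frac{\bar\beta_z(\mathbf{X})}{\bar\gamma_z(\mathbf{X})} \, \frac{\bar\gamma_z(\mathbf{X}) - \gamma_z(\mathbf{X})}{\gamma_z(\mathbf{X})} \right] \frac{\gamma_z(\mathbf{X}) - \bar\gamma_z(\mathbf{X})}{\bar\gamma_z(\mathbf{X})} \right).$$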

Proof.

Here we drop the z argument throughout to ease notation. Note we can write

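$$\begin{aligned}
\phi(O; P) &= \left[ \frac{R\{\mathbb{1}(Z = z) - \lambda(\mathbf{X}, \mathbf{Y})\}}{\pi(\mathbf{X}, \mathbf{Y})} + \lambda(\mathbf{X}, \mathbf{Y}) \right] \frac{\mathbf{Y} - \beta(\mathbf{X})/\gamma(\mathbf{X})}{\gamma(\mathbf{X})} + \frac{\beta(\mathbf{X})}{\gamma(\mathbf{X})} \\
&= \frac{1}{\gamma(\mathbf{X})} \left[ \frac{R\mathbf{Y}}{\pi(\mathbf{X}, \mathbf{Y})} \{\mathbb{1}(Z = z) - \lambda(\mathbf{X}, \mathbf{Y})\} + \{\mathbf{Y} \lambda(\mathbf{X}, \mathbf{Y}) - \beta(\mathbf{X})\} \right] \\
&\quad - \frac{\beta(\mathbf{X})}{\gamma(\mathbf{X})^2} \left[ \frac{R\{\mathbb{1}(Z = z) - \lambda(\mathbf{X}, \mathbf{Y})\}}{\pi(\mathbf{X}, \mathbf{Y})} + \{\lambda(\mathbf{X}, \mathbf{Y}) - \gamma(\mathbf{X})\} \right] + \frac{\beta(\mathbf{X})}{\gamma(\mathbf{X})}.
\end{aligned}$$

Therefore, letting $E = E_P$, writing $\mathbb{1}_z = \mathbb{1}(Z = z)$, and dropping $(\mathbf{X}, \mathbf{Y})$ arguments from $(\pi, \lambda, \beta, \gamma)$ to further ease notation, we have

$$\begin{aligned}
\psi(\bar{P}) - \psi(P) &+ \int_{\mathcal{O}} \{\phi(O; \bar{P}) - \psi(\bar{P})\} \, dP \\
&= E\left[ \frac{1}{\bar\gamma} \left\{ \frac{R\mathbf{Y}}{\bar\pi} (\mathbb{1}_z - \bar\lambda) + (\mathbf{Y}\bar\lambda - \bar\beta) \right\} - \frac{\bar\beta}{\bar\gamma^2} \left\{ \frac{R(\mathbb{1}_z - \bar\lambda)}{\bar\pi} + (\bar\lambda - \bar\gamma) \right\} + \frac{\bar\beta}{\bar\gamma} - \frac{\beta}{\gamma} \right] \\
&= E\left[ \frac{\mathbf{Y} - \bar\beta/\bar\gamma}{\bar\gamma} \, \frac{\pi - \bar\pi}{\bar\pi} \, (\lambda - \bar\lambda) + \frac{\beta - \bar\beta}{\bar\gamma} - \frac{\bar\beta}{\bar\gamma} \, \frac{\gamma - \bar\gamma}{\bar\gamma} + \frac{\bar\beta}{\bar\gamma} - \frac{\beta}{\gamma} \right] \\
&= E\left[ \frac{\mathbf{Y} - \bar\beta/\bar\gamma}{\bar\gamma} \, \frac{\pi - \bar\pi}{\bar\pi} \, (\lambda - \bar\lambda) + \left( \frac{\beta}{\gamma} - \frac{\bar\beta}{\bar\gamma} \right) \frac{\gamma - \bar\gamma}{\bar\gamma} \right] = R_2(\bar{P}, P)
\end{aligned}$$

where the second equality follows from rearranging and iterated expectation (together with $E(\mathbf{Y}\lambda \mid \mathbf{X}) = \beta$ and $E(\lambda \mid \mathbf{X}) = \gamma$), and the third since

$$\frac{\beta - \bar\beta}{\bar\gamma} - \frac{\bar\beta}{\bar\gamma} \, \frac{\gamma - \bar\gamma}{\bar\gamma} + \frac{\bar\beta}{\bar\gamma} - \frac{\beta}{\gamma} = \beta \left( \frac{1}{\bar\gamma} - \frac{1}{\gamma} \right) - \frac{\bar\beta}{\bar\gamma} \, \frac{\gamma - \bar\gamma}{\bar\gamma} = \left( \frac{\beta}{\gamma} - \frac{\bar\beta}{\bar\gamma} \right) \frac{\gamma - \bar\gamma}{\bar\gamma}.$$

Note the above implies $E\{\phi(O; P) - \psi(P)\} = 0$, which is also straightforward to see using iterated expectation.   □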
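The expansion can also be checked numerically. The sketch below, on a hypothetical binary toy model with arbitrary perturbed nuisance values standing in for $\bar{P}$, computes $E_P\{\phi_z(O; \bar{P})\} - \psi_z(P)$ by brute-force enumeration and compares it with the remainder, written in the form $E[\{(\mathbf{Y} - \bar\beta/\bar\gamma)/\bar\gamma\}\{(\pi - \bar\pi)/\bar\pi\}(\lambda - \bar\lambda) + (\beta/\gamma - \bar\beta/\bar\gamma)(\gamma - \bar\gamma)/\bar\gamma]$. All numbers and function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy model (illustrative values): binary X, Z, Y and target level z = 1.
p_x = np.array([0.4, 0.6])
p_z_x = np.array([[0.7, 0.3], [0.2, 0.8]])    # P(Z = z | X = x), [x, z]
p_y1_xz = np.array([[0.1, 0.5], [0.3, 0.9]])  # P(Y = 1 | X = x, Z = z), [x, z]
pi = np.array([[0.9, 0.5], [0.6, 0.8]])       # pi(x, y), [x, y]
y_vals = np.array([0.0, 1.0])
z = 1

# True nuisance functions implied by the model.
p_y_xz = np.stack([1 - p_y1_xz, p_y1_xz], axis=-1)  # [x, z, y]
p_zy_x = p_z_x[:, :, None] * p_y_xz                 # P(Z = z, Y = y | X = x)
mu = p_zy_x.sum(axis=1)                             # P(Y = y | X = x)
lam = p_zy_x[:, z, :] / mu                          # lambda_z(x, y)
beta = (y_vals * lam * mu).sum(axis=1)              # beta_z(x)
gamma = (lam * mu).sum(axis=1)                      # gamma_z(x)
psi = float((p_x * beta / gamma).sum())


def mean_phi(pi_b, lam_b, beta_b, gamma_b):
    """E_P{phi_z(O; P_bar)}, by enumerating every value of (x, z, y, r)."""
    total = 0.0
    for x in range(2):
        for zz in range(2):
            for y in range(2):
                for r in range(2):
                    prob = p_x[x] * p_zy_x[x, zz, y] * (pi[x, y] if r else 1 - pi[x, y])
                    ratio = beta_b[x] / gamma_b[x]
                    weight = r * ((zz == z) - lam_b[x, y]) / pi_b[x, y] + lam_b[x, y]
                    total += prob * ((y_vals[y] - ratio) / gamma_b[x] * weight + ratio)
    return float(total)


def remainder(pi_b, lam_b, beta_b, gamma_b):
    """Second-order remainder R_2(P_bar, P), computed as an exact sum."""
    resid = (y_vals[None, :] - (beta_b / gamma_b)[:, None]) / gamma_b[:, None]
    first = (resid * (pi - pi_b) / pi_b * (lam - lam_b) * mu).sum(axis=1)
    second = (beta / gamma - beta_b / gamma_b) * (gamma - gamma_b) / gamma_b
    return float((p_x * (first + second)).sum())


# Arbitrary perturbed nuisance values standing in for P_bar.
pi_b, lam_b = 0.8 * pi + 0.1, 0.9 * lam + 0.05
beta_b, gamma_b = beta + 0.02, gamma - 0.03

print(mean_phi(pi, lam, beta, gamma) - psi)  # mean-zero property at the truth
print(mean_phi(pi_b, lam_b, beta_b, gamma_b) - psi, remainder(pi_b, lam_b, beta_b, gamma_b))
```

The first printed value should be zero up to floating-point error, and the two values on the second line should coincide, reflecting that the bias of the uncentered influence function is exactly the second-order remainder.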

Lemma 1 has several important consequences. First, it suggests how one could correct the first-order bias of a plug-in estimator $\psi_z(\hat{P})$, by estimating the first term in the expansion and subtracting it off. This is one way to view what semiparametric estimators (particularly of the “one-step” variety) based on influence functions are doing, and in fact the estimator presented in the next subsection does precisely this. Second, it essentially immediately yields the efficient influence function for $\psi_z$. The next theorem states this result; after it we describe why the efficient influence function is useful here.

Theorem 1.

Under a nonparametric model satisfying positivity conditions (3) and (6), the efficient influence function for $\psi_z$ is given by $\phi_z(O; P) - \psi_z$ as defined in Lemma 1.

Proof.

Recall from Bickel et al. [14] and van der Vaart [15] that the nonparametric efficiency bound for a functional ψ is given by the supremum of Cramer-Rao lower bounds for that functional across smooth parametric submodels. The efficient influence function is the mean-zero function whose variance equals the efficiency bound, and is given by the unique φ that is a valid submodel score (or limit of such scores) satisfying pathwise differentiability, i. e.

$$\frac{d}{d\epsilon} \psi(P_\epsilon) \Big|_{\epsilon = 0} = \int_{\mathcal{O}} \varphi(O; P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) \Big|_{\epsilon = 0} dP \tag{7}$$

for $P_\epsilon$ any smooth parametric submodel.

To see that $\phi - \psi$ is the efficient influence function for $\psi$, first note that the expansion in Lemma 1 implies

$$\psi_z(P_\epsilon) - \psi_z(P) = \int_{\mathcal{O}} \{\phi_z(O; P) - \psi_z(P)\} \, dP_\epsilon - R_2(P, P_\epsilon)$$

so differentiating with respect to $\epsilon$ yields

$$\begin{aligned}
\frac{d}{d\epsilon} \psi_z(P_\epsilon) &= \int_{\mathcal{O}} \{\phi_z(O; P) - \psi_z(P)\} \, \frac{d}{d\epsilon} dP_\epsilon - \frac{d}{d\epsilon} R_2(P, P_\epsilon) \\
&= \int_{\mathcal{O}} \{\phi_z(O; P) - \psi_z(P)\} \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) dP_\epsilon - \frac{d}{d\epsilon} R_2(P, P_\epsilon).
\end{aligned}$$

The property (7) follows after evaluating at $\epsilon = 0$, since

$$\frac{d}{d\epsilon} R_2(P, P_\epsilon) \Big|_{\epsilon = 0} = 0$$

by virtue of the fact that $R_2(P, P_\epsilon)$ consists of only second-order products of errors between $P_\epsilon$ and $P$. Thus applying the product rule yields a sum of two terms, each of which is a product of a derivative term (which may not be zero at $\epsilon = 0$) and an error term involving differences of components of $P_\epsilon$ and $P$ (which will be zero at $\epsilon = 0$). Since our model is nonparametric, the tangent space is the entire Hilbert space of mean-zero finite-variance functions; hence there is only one influence function satisfying (7) and it is the efficient one [14, 15, 16]. Therefore $\phi - \psi$ must be the efficient influence function.

An equivalent way to derive this result, as suggested by an anonymous reviewer in a previous version of this manuscript, is to use results from Robins et al. [17]. Specifically, as in Theorem 7.2 of Tsiatis [16], one can take the full-data efficient influence function for $\psi_z$, inverse-probability-weight it for those with $R = 1$, and subtract off its projection onto the tangent space. This yields the same efficient influence function.   □

The efficient influence function is important since its variance $\operatorname{cov}\{\phi_z(O; P) - \psi_z\}$ gives an efficiency bound for estimation of $\psi_z$, providing a benchmark for efficient estimation. More precisely, following Bickel et al. [14], van der Vaart [15], and Tsiatis [16], this variance provides a local asymptotic minimax lower bound in the sense of Hájek and Le Cam, and tells us that the asymptotic variance of any regular asymptotically linear estimator can be no smaller (in that the difference in covariance matrices must be non-negative definite). Insofar as the bias-correction suggested earlier directly involves the efficient influence function, this object is also crucial for constructing estimators that have second-order bias and so can be $\sqrt{n}$-consistent and asymptotically normal even when the nuisance functions are estimated flexibly at slower rates of convergence. This feature will be detailed in the next subsection.

2.3 Estimation & inference

Here we present an estimator based on the functional expansion from Lemma 1, which is asymptotically efficient under weak nonparametric conditions.

To ease notation let $\phi_z = \phi_z(O; P)$ and $\hat\phi_z = \phi_z(O; \hat{P})$ denote the true and estimated versions of the uncentered efficient influence function for $\psi_z$. The estimator we study here is given by

$$\hat\psi_z = \mathbb{P}_n(\hat\phi_z)$$

where we use $\mathbb{P}_n(f) = \frac{1}{n} \sum_{i=1}^n f(O_i)$ to denote sample averages. Therefore the estimator $\hat\psi_z$ is simply the sample average of the estimated (uncentered) influence function values; equivalently we can write it as a bias-corrected version of the plug-in $\psi_z(\hat{P})$, namely

$$\hat\psi_z = \psi_z(\hat{P}) + \mathbb{P}_n(\hat\varphi_z)$$

where $\hat\varphi_z = \hat\phi_z - \psi_z(\hat{P})$ is the estimated efficient influence function.
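Given nuisance predictions for each subject, the estimator is a direct transcription of the formula for $\phi_z$ in Lemma 1 followed by a sample average. The sketch below assumes the predictions are already available as arrays; the function names and argument layout are hypothetical, and how the nuisance functions are fitted is left open.

```python
import numpy as np


def uncentered_if(y, r, z_obs, z_level, pi_hat, lam_hat, beta_hat, gamma_hat):
    """Estimated uncentered influence function values, one row per subject.

    y: (n, p) outcomes; r: (n,) 0/1 indicators; z_obs: (n,) exposure, with any
    placeholder (e.g. NaN) where r == 0; pi_hat, lam_hat, gamma_hat: (n,)
    nuisance predictions for level z_level; beta_hat: (n, p).
    """
    y = np.asarray(y, dtype=float)
    ratio = beta_hat / gamma_hat[:, None]
    # R * 1(Z = z): the placeholder for an unobserved Z never contributes.
    ind = np.where(r == 1, z_obs == z_level, False).astype(float)
    weight = r * (ind - lam_hat) / pi_hat + lam_hat
    return (y - ratio) / gamma_hat[:, None] * weight[:, None] + ratio


def one_step_estimate(phi_hat):
    """psi_hat_z = P_n(phi_hat_z): the sample average of the rows."""
    return phi_hat.mean(axis=0)


# Two made-up subjects, the second with missing exposure.
phi_hat = uncentered_if(
    y=np.array([[1.0], [0.0]]),
    r=np.array([1, 0]),
    z_obs=np.array([1.0, np.nan]),
    z_level=1,
    pi_hat=np.array([0.5, 0.5]),
    lam_hat=np.array([0.5, 0.5]),
    beta_hat=np.array([[0.25], [0.25]]),
    gamma_hat=np.array([0.5, 0.5]),
)
print(phi_hat.ravel(), one_step_estimate(phi_hat))  # [2. 0.] [1.]
```

Subjects with $R = 0$ still contribute through the $\lambda_z$ and $\beta_z / \gamma_z$ terms, so no observations are discarded.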

For simplicity, in the following results we assume the nuisance estimates $\hat{P}$ are constructed from a separate independent sample. In practice, one can split the sample, use part for fitting $\hat{P}$ and the other for constructing $\hat\phi_z$, and then swap so as to attain full efficiency based on the entire sample size $n$ rather than a fraction, e. g. $n/2$. This is the idea behind the sample-splitting methods used in other functional estimation problems [18, 19, 20]. Alternatively, if the same observations are used both for estimating $\hat{P}$ and constructing $\hat\phi_z$, one generally needs to rely on empirical process conditions to obtain the kinds of results we present here.
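The split-and-swap scheme can be sketched as follows on simulated data. Everything here is an illustrative assumption: the binary data-generating model, the simple cell-frequency nuisance estimators (which stand in for whatever flexible learner one would actually use), and the function names.

```python
import numpy as np

# Simulated data from a hypothetical binary model (illustrative values only).
rng = np.random.default_rng(0)
n = 50000
x = rng.binomial(1, 0.6, size=n)
z = rng.binomial(1, np.array([0.3, 0.8])[x])
y = rng.binomial(1, np.array([[0.1, 0.5], [0.3, 0.9]])[x, z])
r = rng.binomial(1, np.array([[0.9, 0.5], [0.6, 0.8]])[x, y])
z_level = 1  # true E(Y^1) = 0.4 * 0.5 + 0.6 * 0.9 = 0.74


def fit_nuisances(x, y, r, z, z_level):
    """Cell-frequency estimates of pi, lambda_z, beta_z, gamma_z for binary (x, y)."""
    pi, lam, mu = np.zeros((2, 2)), np.zeros((2, 2)), np.zeros((2, 2))
    for a in range(2):
        for b in range(2):
            cell = (x == a) & (y == b)
            pi[a, b] = r[cell].mean()
            lam[a, b] = (z[cell & (r == 1)] == z_level).mean()  # uses Z only when R = 1
            mu[a, b] = cell.sum() / (x == a).sum()
    y_vals = np.array([0.0, 1.0])
    return pi, lam, (y_vals * lam * mu).sum(axis=1), (lam * mu).sum(axis=1)


def phi_values(x, y, r, z, z_level, nuisances):
    """Uncentered influence function values at the fitted nuisances."""
    pi, lam, beta, gamma = nuisances
    ind = ((r == 1) & (z == z_level)).astype(float)
    weight = r * (ind - lam[x, y]) / pi[x, y] + lam[x, y]
    ratio = beta[x] / gamma[x]
    return (y - ratio) / gamma[x] * weight + ratio


def cross_fit_phi(x, y, r, z, z_level, n_folds=2, seed=1):
    """Fit nuisances on the other folds, evaluate on the held-out fold, then swap."""
    folds = np.random.default_rng(seed).integers(0, n_folds, size=len(x))
    phi = np.empty(len(x))
    for k in range(n_folds):
        train, held_out = folds != k, folds == k
        nuisances = fit_nuisances(x[train], y[train], r[train], z[train], z_level)
        phi[held_out] = phi_values(x[held_out], y[held_out], r[held_out],
                                   z[held_out], z_level, nuisances)
    return phi


phi_hat = cross_fit_phi(x, y, r, z, z_level)
psi_hat = phi_hat.mean()
print(psi_hat)
```

Averaging the held-out influence function values over all folds uses every observation once for evaluation, which is what allows efficiency relative to the full sample size $n$.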

The next theorem gives the asymptotic properties of the estimator $\hat\psi_z$, and conditions under which it is $\sqrt{n}$-consistent and converging to a normal distribution with asymptotic variance equal to the nonparametric efficiency bound. In what follows, we let $\|f\|^2 = P(f^2) = \int_{\mathcal{O}} f(o)^2 \, dP(o)$ denote the squared $L_2(P)$ norm.

Theorem 2.

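Assume $\|\hat\phi_z - \phi_z\| = o_P(1)$ and $P(\epsilon < \hat\pi < 1 - \epsilon) = P(\epsilon < \hat\gamma_z < 1 - \epsilon) = 1$. Then

$$\hat\psi_z - \psi_z = O_P\left( \frac{1}{\sqrt{n}} + \|\hat\pi - \pi\| \, \|\hat\lambda_z - \lambda_z\| + \left( \|\hat\beta_z - \beta_z\| + \|\hat\gamma_z - \gamma_z\| \right) \|\hat\gamma_z - \gamma_z\| \right),$$

and if $\|\hat\pi - \pi\| \, \|\hat\lambda_z - \lambda_z\| + \left( \|\hat\beta_z - \beta_z\| + \|\hat\gamma_z - \gamma_z\| \right) \|\hat\gamma_z - \gamma_z\| = o_P(1/\sqrt{n})$, we have

$$\sqrt{n} (\hat\psi_z - \psi_z) \rightsquigarrow N(0, \operatorname{cov}(\phi_z)).$$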

Proof.

Dropping z subscripts to ease notation, we can write

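$$\hat\psi - \psi = (\mathbb{P}_n - P)(\hat\phi - \phi) + (\mathbb{P}_n - P)\phi + P(\hat\phi - \phi). \tag{8}$$

For the first term in (8) above, Lemma 2 in the Appendix (reproduced from Kennedy et al. [21]) implies that

$$(\mathbb{P}_n - P)(\hat\phi - \phi) = O_P\left( \frac{\|\hat\phi - \phi\|}{\sqrt{n}} \right) = o_P(1/\sqrt{n})$$

where the last equality follows since $\|\hat\phi - \phi\| = o_P(1)$ by assumption. The second term in (8) is a centered sample average, and so is $O_P(1/\sqrt{n})$ and asymptotically normal by the central limit theorem. For the third term, the expansion from Lemma 1 implies

$$P(\hat\phi - \phi) = \int_{\mathcal{O}} \{\phi(O; \hat{P}) - \psi(P)\} \, dP = R_2(\hat{P}, P) \lesssim \|\hat\pi - \pi\| \, \|\hat\lambda - \lambda\| + \left( \|\hat\beta - \beta\| + \|\hat\gamma - \gamma\| \right) \|\hat\gamma - \gamma\|$$

where the last line uses Cauchy-Schwarz and the fact that $(\hat\gamma, \hat\pi, \gamma)$ are all bounded away from zero. This yields the result.   □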

Importantly, Theorem 2 shows that $\hat\psi_z$ attains faster rates than its nuisance estimators, and can be asymptotically efficient under weak nonparametric conditions. Specifically, as long as the influence function is consistently estimated in $L_2$ norm, the convergence rate of $\hat\psi_z$ is second-order in the nuisance estimation error. Under standard $n^{-1/4}$-type rate conditions, the estimator is $\sqrt{n}$-consistent, asymptotically normal, and efficient. These rates can plausibly be attained under nonparametric smoothness, sparsity, or other structural conditions (e. g. additive modeling or bounded variation assumptions). For example, if all $d$-dimensional nuisance functions are assumed to lie in a Hölder class with smoothness index $s$ (i. e. partial derivatives up to order $s$ exist and are Lipschitz), then the rate condition of Theorem 2 would be satisfied when $s > d/2$, i. e. when the smoothness index exceeds half the dimension. Alternatively, if the functions are $s$-sparse then one would need $s = o(\sqrt{n})$ up to log factors, as in Farrell [22]. Asymptotically valid 95 % confidence intervals can then be constructed via a simple Wald form, $\hat\psi_z \pm 1.96 \sqrt{\operatorname{diag}\{\operatorname{cov}(\phi_z)\}/n}$, with the unknown covariance replaced in practice by an estimate such as the sample covariance of the estimated influence function values. The next result points out the double robustness of $\hat\psi_z$.
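The Wald construction can be sketched as follows, assuming the estimated influence function values are stored in an array with one row per subject. Using the sample covariance of those values as the variance estimate is an assumption of this sketch, as is the function name.

```python
import numpy as np


def wald_interval(phi_hat, crit=1.96):
    """Pointwise Wald intervals psi_hat +/- crit * sqrt(diag(cov) / n).

    phi_hat: (n, p) array (or (n,) for a scalar outcome) of estimated
    uncentered influence function values.
    Returns (estimate, lower, upper), each of shape (p,).
    """
    phi_hat = np.asarray(phi_hat, dtype=float)
    if phi_hat.ndim == 1:
        phi_hat = phi_hat[:, None]
    n = phi_hat.shape[0]
    estimate = phi_hat.mean(axis=0)
    # Sample covariance of the influence function values, as a stand-in for cov(phi_z).
    cov = np.atleast_2d(np.cov(phi_hat, rowvar=False))
    half_width = crit * np.sqrt(np.diag(cov) / n)
    return estimate, estimate - half_width, estimate + half_width


# Made-up values for two outcomes, purely to show the call.
est, lo, hi = wald_interval(np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]))
print(est, lo, hi)
```

The intervals are pointwise for each component of $\psi_z$; joint coverage across the $p$ outcomes would need a further adjustment.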

Corollary 1.

Under the conditions of Theorem 2, the estimator $\hat\psi_z$ is consistent if either

  1. $\|\hat\gamma_z - \gamma_z\| = o_P(1)$ and $\|\hat\pi - \pi\| = o_P(1)$, or

  2. $\|\hat\gamma_z - \gamma_z\| = o_P(1)$ and $\|\hat\lambda_z - \lambda_z\| = o_P(1)$.

Corollary 1 shows that $\hat\psi_z$ is doubly robust [23, 24], since it is consistent if either $\hat\pi$ or $\hat\lambda_z$ is (and $\hat\gamma_z$ is). Note however that our formulation requires the propensity score $\gamma_z$ to be estimated consistently. This contrasts with the semiparametric approach of Zhang et al. [1], who construct an estimator that is consistent as long as two of three nuisance functions are estimated consistently. However, Zhang et al. [1] work under a different factorization of the likelihood, and impose parametric models on the partially observed propensity score and outcome regression functions. It is unclear whether our remainder can be written in a triply robust form, though we conjecture that results of Zhang et al. [1] would not hold in the fully nonparametric setting considered here. This and a more general study of triple robustness could be an important avenue for future work.

3 Application to missing instruments

Here we apply the theory from the previous section to identify and efficiently estimate the local average treatment effect in instrumental variable studies with missing instrument values.

It is quite common for some instrument values to be missing in instrumental variable studies [8, 9, 11]. This setup fits in the proposed framework from the previous section as follows. We have $O = (\mathbf{X}, R, RZ, \mathbf{Y})$ where $Z \in \{0, 1\}$ is now an instrument, and $\mathbf{Y} = (A, Y)$ for $A \in \{0, 1\}$ a binary treatment and $Y \in \mathbb{R}$ an outcome of interest. Here the outcome $Y \in \mathbb{R}$ is a scalar, but a multivariate outcome presents no additional complications. Note also our slight abuse of notation in using bold $\mathbf{Y} = (A, Y)$ to denote a vector that contains the scalar outcome $Y$. Then we can write

$$\beta_z(\mathbf{x}) = \{\beta_{za}(\mathbf{x}), \beta_{zy}(\mathbf{x})\}^T$$

where $\beta_{zt}(\mathbf{x}) = E\{T \lambda_z(\mathbf{X}, A, Y) \mid \mathbf{X} = \mathbf{x}\}$ for $T \in \{A, Y\}$ with corresponding index $t \in \{a, y\}$.

In addition to the causal assumptions (2)–(4) and missing at random assumptions (5)–(6) from before, we further make the instrumental variable assumptions:

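$$\begin{aligned}
&\text{(Exclusion.)} && Y^{za} = Y^a && (9) \\
&\text{(Relevance.)} && P(A^{z=1} > A^{z=0}) \geq \epsilon > 0 && (10) \\
&\text{(Monotonicity.)} && P(A^{z=1} \geq A^{z=0}) = 1 && (11)
\end{aligned}$$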

Our first result identifies the local average treatment effect under the assumptions above.

Proposition 2.

Under the causal assumptions (2)–(4), the missing at random assumptions (5)–(6), and the instrumental variable assumptions (9)–(11), it follows that

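$$\theta = E(Y^{a=1} - Y^{a=0} \mid A^{z=1} > A^{z=0}) = \frac{E\{\beta_{1y}(\mathbf{X})/\gamma_1(\mathbf{X}) - \beta_{0y}(\mathbf{X})/\gamma_0(\mathbf{X})\}}{E\{\beta_{1a}(\mathbf{X})/\gamma_1(\mathbf{X}) - \beta_{0a}(\mathbf{X})/\gamma_0(\mathbf{X})\}}.$$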

Proof.

It is well known [25, 26] that assumptions (9)–(11) imply

$$\theta = \frac{E(Y^{z=1} - Y^{z=0})}{E(A^{z=1} - A^{z=0})}$$

so the result follows from Proposition 1, after taking $\mathbf{Y} = (A, Y)$ and $\mathcal{Z} = \{0, 1\}$.   □

Although we focus on the local average treatment effect, the same observed data functional can represent other treatment effects under varying assumptions (e. g. the effect on the would-be-treated under a no-effect-modification assumption as discussed for example by Hernán and Robins [27]). Thus our results equally apply to these other settings.

Now we go on to use the theory from the previous section to construct an efficient estimator of the local average treatment effect $\theta$. As with $\beta_z(\mathbf{x})$, we can decompose the efficient influence function $\phi_z$ from the previous section as

$$\phi_z(O) = \{\phi_{za}(O), \phi_{zy}(O)\}^T$$

for the two outcomes $(A, Y) \in \mathcal{Y}$. As before we write $\phi_z = \phi_z(O; P)$ and $\hat\phi_z = \phi_z(O; \hat{P})$ to ease notation, and suppose $\hat{P}$ is constructed from an independent sample. The proposed estimator is given by

$$\hat\theta = \frac{\mathbb{P}_n(\hat\phi_{1y} - \hat\phi_{0y})}{\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a})}.$$

This simply takes the ratio of the corresponding estimators for the effects of Z on A and Y, respectively.
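In code the ratio is a one-liner once the four sets of estimated influence function values are available. The sketch below assumes they are given as arrays of equal length; the function name and the numerical guard on the denominator (a crude stand-in for the requirement that the instrument effect on treatment be bounded away from zero) are assumptions.

```python
import numpy as np


def late_estimate(phi_1y, phi_0y, phi_1a, phi_0a):
    """Ratio of the estimated effects of Z on Y and of Z on A."""
    numerator = np.mean(np.asarray(phi_1y) - np.asarray(phi_0y))
    denominator = np.mean(np.asarray(phi_1a) - np.asarray(phi_0a))
    if abs(denominator) < 1e-8:
        raise ValueError("estimated instrument effect on treatment is too close to zero")
    return float(numerator / denominator)


# Made-up influence function values, purely to show the call.
theta_hat = late_estimate([0.9, 0.5], [0.3, 0.5], [1.0, 0.8], [0.2, 0.4])
print(theta_hat)  # numerator 0.3, denominator 0.6, so approximately 0.5
```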

The next result describes the asymptotic properties of the estimator $\hat\theta$, and gives conditions under which it is $\sqrt{n}$-consistent and asymptotically normal, akin to the earlier Theorem 2 for a general $\hat\psi_z$.

Theorem 3.

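Assume $\|\hat\phi_{zt} - \phi_{zt}\| = o_P(1)$ for $z \in \{0, 1\}$ and $t \in \{a, y\}$, and $P(\epsilon < \hat\pi < 1 - \epsilon) = P(\epsilon < \hat\gamma_z < 1 - \epsilon) = P\{\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a}) > \epsilon\} = 1$. Define

$$S_{n,z} = \|\hat\pi - \pi\| \, \|\hat\lambda_z - \lambda_z\| + \left( \max_{t \in \{a, y\}} \|\hat\beta_{zt} - \beta_{zt}\| + \|\hat\gamma_z - \gamma_z\| \right) \|\hat\gamma_z - \gamma_z\|.$$

Then

$$\hat\theta - \theta = O_P\left( \frac{1}{\sqrt{n}} + S_{n,0} + S_{n,1} \right),$$

and if $S_{n,0} + S_{n,1} = o_P(1/\sqrt{n})$, we have

$$\sqrt{n} (\hat\theta - \theta) \rightsquigarrow N\left( 0, \operatorname{var}\left\{ \frac{(\phi_{1y} - \phi_{0y}) - \theta(\phi_{1a} - \phi_{0a})}{P(\phi_{1a} - \phi_{0a})} \right\} \right).$$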

Proof.

Note that we have

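$$\begin{aligned}
\hat\theta - \theta &= \frac{\mathbb{P}_n(\hat\phi_{1y} - \hat\phi_{0y})}{\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a})} - \frac{P(\phi_{1y} - \phi_{0y})}{P(\phi_{1a} - \phi_{0a})} \\
&= \frac{1}{\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a})} \Big[ \{\mathbb{P}_n(\hat\phi_{1y} - \hat\phi_{0y}) - P(\phi_{1y} - \phi_{0y})\} - \theta \{\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a}) - P(\phi_{1a} - \phi_{0a})\} \Big] \\
&= \mathbb{P}_n\left\{ \frac{(\phi_{1y} - \phi_{0y}) - \theta(\phi_{1a} - \phi_{0a})}{P(\phi_{1a} - \phi_{0a})} \right\} + o_P(1/\sqrt{n}) \\
&\quad + O_P\left( \|\hat\pi - \pi\| \max_z \|\hat\lambda_z - \lambda_z\| + \max_{z,t} \left( \|\hat\beta_{zt} - \beta_{zt}\| + \|\hat\gamma_z - \gamma_z\| \right) \|\hat\gamma_z - \gamma_z\| \right)
\end{aligned}$$

where the third line follows since

$$\mathbb{P}_n(\hat\phi_t) - P(\phi_t) = (\mathbb{P}_n - P)(\hat\phi_t - \phi_t) + (\mathbb{P}_n - P)\phi_t + P(\hat\phi_t - \phi_t)$$

with the first term $o_P(1/\sqrt{n})$ by Lemma 2 and the third term bounded by the remainder from Theorem 2, and since $\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a})$ is bounded away from zero with

$$\begin{aligned}
\mathbb{P}_n(\hat\phi_{1a} - \hat\phi_{0a}) - P(\phi_{1a} - \phi_{0a}) &= (\mathbb{P}_n - P)(\hat\phi_{1a} - \hat\phi_{0a}) + P\{(\hat\phi_{1a} - \hat\phi_{0a}) - (\phi_{1a} - \phi_{0a})\} \\
&= O_P\left( 1/\sqrt{n} + \max_z \|\hat\phi_{za} - \phi_{za}\| \right) = o_P(1)
\end{aligned}$$

where the second equality follows from Lemma 2 and the central limit theorem.   □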

As before, the estimator $\hat\theta$ has a fast convergence rate that is second-order involving products of nuisance errors, so that under for example $n^{-1/4}$-type rates the estimator will be $\sqrt{n}$-consistent, asymptotically normal, and efficient. It is also doubly robust, as pointed out in the next corollary.
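The limiting variance in Theorem 3 suggests a plug-in standard error for $\hat\theta$. The sketch below assumes the differences $\hat\phi_{1y} - \hat\phi_{0y}$ and $\hat\phi_{1a} - \hat\phi_{0a}$ are available as arrays; replacing $\theta$ and $P(\phi_{1a} - \phi_{0a})$ by their sample analogs is an assumption of the sketch, and the function name is hypothetical.

```python
import numpy as np


def late_wald_interval(d_y, d_a, crit=1.96):
    """Wald interval for theta_hat based on the influence function in Theorem 3.

    d_y, d_a: (n,) arrays holding phi_hat_1y - phi_hat_0y and phi_hat_1a - phi_hat_0a.
    """
    d_y, d_a = np.asarray(d_y, dtype=float), np.asarray(d_a, dtype=float)
    n = d_y.shape[0]
    theta_hat = d_y.mean() / d_a.mean()
    # Plug-in version of {(phi_1y - phi_0y) - theta (phi_1a - phi_0a)} / P(phi_1a - phi_0a).
    influence = (d_y - theta_hat * d_a) / d_a.mean()
    std_err = influence.std(ddof=1) / np.sqrt(n)
    return theta_hat, theta_hat - crit * std_err, theta_hat + crit * std_err


# Made-up differences, purely to show the call.
theta_hat, lower, upper = late_wald_interval([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(theta_hat, lower, upper)
```

Because the plug-in influence values average to zero by construction, their sample standard deviation directly estimates the square root of the asymptotic variance.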

Corollary 2.

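Under the conditions of Theorem 3, the estimator $\hat\theta$ is consistent if either

  1. $\|\hat\gamma_z - \gamma_z\| = o_P(1)$ for $z \in \{0, 1\}$ and $\|\hat\pi - \pi\| = o_P(1)$, or

  2. $\|\hat\gamma_z - \gamma_z\| = o_P(1)$ and $\|\hat\lambda_z - \lambda_z\| = o_P(1)$, for $z \in \{0, 1\}$.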

To summarize, the above results extend the work of Mogstad and Wiswall [8], Chaudhuri and Guilkey [9], and Kennedy and Small [11], by providing an efficient nonparametric estimator of the instrumental variable estimand when some instrument values are missing, allowing adjustment for complex confounding via flexible data-adaptive estimators of the nuisance functions.

4 Discussion

In this paper we filled a gap in the literature by considering nonparametric identification, efficiency theory, and estimation of average treatment effects in the presence of complex confounding and missing exposure values, where the exposure missingness can depend not only on the covariates but also on the outcome information. We derived the efficient influence function for the average treatment effect and corresponding nonparametric efficiency bounds, and constructed nonparametric estimators that can attain these efficiency bounds under weak rate conditions on the nuisance estimators. This allows one to incorporate modern flexible regression and machine learning tools. We also applied our general results to the problem of causal inference with a partially missing instrumental variable, yielding a new estimator and efficiency bound in this problem as well.

There are several important avenues for future work. First, it will be useful to study finite-sample properties of the estimators proposed here, in comparison to the more parametric estimators proposed in earlier work. Relatedly, it would be useful to construct an efficient plug-in estimator using targeted maximum likelihood [28, 29], which would respect bounds on the parameter space, e. g. when $\mathbf{Y}$ is bounded. Second, we restricted study to possibly multi-valued but discrete point treatments; it would be of interest to extend to treatments that are continuous [30, 31] or time-varying [32, 33]. This would also be useful for continuous instrumental variable problems [34] with instrument missingness. Further, identification, efficiency theory, and estimation are all more complicated in settings where there is simultaneous missingness in covariates, treatment, and outcome [35]; however, this also occurs often in practice and deserves deeper investigation. Lastly, we assumed exchangeability in the sense of the missing indicator $R$ being conditionally independent of the underlying exposure $Z$ given both covariates $\mathbf{X}$ and outcome $\mathbf{Y}$; it would be of interest to consider the case where we only assume $R \perp\!\!\!\perp Z \mid \mathbf{X}$. In that case, however, average treatment effects are no longer point identified, and so one would need to consider bounds and/or sensitivity analysis.

Acknowledgements

The author thanks Matteo Bonvini, Dylan Small, Mike Daniels, and Joe Hogan for helpful discussions, as well as an anonymous reviewer of a previous version of the manuscript.

A Appendix

The following lemma from [21] is useful in proving Theorem 2.

Lemma 2.

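Let $\hat{f}(o)$ be a function estimated from a sample $O^N = (O_{n+1}, \dots, O_N)$, and let $\mathbb{P}_n$ denote the empirical measure over $(O_1, \dots, O_n)$, which is independent of $O^N$. Then

$$(\mathbb{P}_n - P)(\hat{f} - f) = O_P\left( \frac{\|\hat{f} - f\|}{\sqrt{n}} \right).$$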

Proof.

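First note that, conditional on $O^N$, the term in question has mean zero since

$$E\{\mathbb{P}_n(\hat{f} - f) \mid O^N\} = E(\hat{f} - f \mid O^N) = P(\hat{f} - f).$$

The conditional variance is

$$\operatorname{var}\{(\mathbb{P}_n - P)(\hat{f} - f) \mid O^N\} = \operatorname{var}\{\mathbb{P}_n(\hat{f} - f) \mid O^N\} = \frac{1}{n} \operatorname{var}(\hat{f} - f \mid O^N) \leq \|\hat{f} - f\|^2 / n.$$

Therefore using Chebyshev’s inequality we have

$$P\left\{ \frac{|(\mathbb{P}_n - P)(\hat{f} - f)|}{\|\hat{f} - f\| / \sqrt{n}} \geq t \right\} = E\left[ P\left\{ \frac{|(\mathbb{P}_n - P)(\hat{f} - f)|}{\|\hat{f} - f\| / \sqrt{n}} \geq t \;\Big|\; O^N \right\} \right] \leq \frac{1}{t^2}.$$

Thus for any $\epsilon > 0$ we can pick $t = 1/\sqrt{\epsilon}$ so that the probability above is no more than $\epsilon$, which yields the result.   □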

References

[1] Zhang Z, Liu W, Zhang B, Tang L, Zhang J. Causal inference with missing exposure information: methods and applications to an obstetric study. Stat Meth Med Res. 2016;25:2053–66.

[2] Shortreed SM, Forbes AB. Missing data in the exposure of interest and marginal structural models: a simulation study based on the Framingham Heart Study. Stat Med. 2010;29:431–43.

[3] Ahn J, Mukherjee B, Gruber SB, Sinha S. Missing exposure data in stereotype regression model: application to matched case–control study with disease subclassification. Biometrics. 2011;67:546–58.

[4] Shardell M, Hicks GE. Statistical analysis with missing exposure data measured by proxy respondents: a misclassification problem within a missing-data problem. Stat Med. 2014;33:4437–452.

[5] Molinari F. Missing treatments. J Bus Econ Stat. 2010;28:82–95.

[6] Mebane Jr WR, Poast P. Causal inference without ignorability: identification with nonrandom assignment and missing treatment data. Political Anal. 2013;21:233–51.

[7] Burgess S, Seaman S, Lawlor DA, Casas JP, Thompson SG. Missing data methods in Mendelian randomization studies with multiple instruments. Am J Epidemiol. 2011;174:1069–76.

[8] Mogstad M, Wiswall M. Instrumental variables estimation with partially missing instruments. Econ Lett. 2012;114:186–9.

[9] Chaudhuri S, Guilkey DK. GMM with multiple missing variables. J Appl Econometrics. 2016;31:678–706.

[10] Williamson E, Forbes A, Wolfe R. Doubly robust estimators of causal exposure effects with missing data in the outcome, exposure or a confounder. Stat Med. 2012;31:4382–400.

[11] Kennedy EH, Small DS. Paradoxes in instrumental variable studies with missing data and one-sided noncompliance. J French Stat Soc. 2017.

[12] Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat. 2004;86:4–29.

[13] van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.

[14] Bickel PJ, Klaassen CA, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press, 1993.

[15] van der Vaart AW. Semiparametric statistics. In: Lectures on probability theory and statistics. Berlin Heidelberg: Springer, 2002:331–457.

[16] Tsiatis AA. Semiparametric theory and missing data. New York: Springer, 2006.

[17] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–66.

[18] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060, 2016.

[19] Robins JM, Li L, Tchetgen Tchetgen EJ, van der Vaart AW. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, 2008:335–421.

[20] Zheng W, van der Laan MJ. Asymptotic theory for cross-validated targeted maximum likelihood estimation. UC Berkeley Division Biostat Working Paper Ser. 2010;273:1–58.

[21] Kennedy EH, Balakrishnan S, G’Sell M. Sharp instruments for classifying compliers and generalizing causal effects. The Ann Stat. 2019.

[22] Farrell MH. Robust inference on average treatment effects with possibly more covariates than observations. J Econometrics. 2015;189:1–23.

[23] Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proc Am Stat Assoc. 2000;1999:6–10.

[24] Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc. 1999;94:1096–120.

[25] Abadie A. Semiparametric instrumental variable estimation of treatment response models. J Econometrics. 2003;113:231–63.

[26] Imbens GW, Angrist JD. Identification and estimation of local average treatment effects. Econometrica. 1994;62:467–75.

[27] Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist’s dream? Epidemiology. 2006;17:360–72.

[28] van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.

[29] van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. UC Berkeley Division of Biostatistics Working Paper Series, 2006:212.

[30] Díaz I, van der Laan MJ. Population intervention causal effects based on stochastic interventions. Biometrics. 2012;68:541–9.

[31] Kennedy EH, Ma Z, McHugh MD, Small DS. Nonparametric methods for doubly robust estimation of continuous treatment effects. J R Stat Soc: Ser B. 2017;79:1229–45.

[32] Kennedy EH. Nonparametric causal effects based on incremental propensity score interventions. J Am Stat Assoc. 2019;114:645–56.

[33] Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–60.

[34] Kennedy EH, Lorch S, Small DS. Robust causal inference with continuous instruments using the local instrumental variable curve. J R Stat Soc: Ser B. 2019;81:121–43.

[35] Sun B, Tchetgen Tchetgen EJ. On inverse probability weighting for nonmonotone missing at random data. J Am Stat Assoc. 2018;113:369–79.

Received: 2019-08-05
Revised: 2020-01-28
Accepted: 2020-02-17
Published Online: 2020-03-14

© 2020 Walter de Gruyter GmbH, Berlin/Boston