
The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.


Online ISSN: 1557-4679

Targeted Estimation of Nuisance Parameters to Obtain Valid Statistical Inference

Mark J. van der Laan
Published Online: 2014-02-11 | DOI: https://doi.org/10.1515/ijb-2012-0038

Abstract

In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. 
As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.

Keywords: asymptotic linearity; cross-validation; efficient influence curve; influence curve; targeted minimum loss based estimation

1 Introduction and overview

This introduction provides an atlas for the contents of this article. It starts with formulating the role of estimation of nuisance parameters to obtain asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target this estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtain such a targeted estimator of the nuisance parameter is described. Subsequently, we present our concrete example to which we will apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems. Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we provide an organization of the remainder of the article.

1.1 The role of nuisance parameter estimation

Suppose we observe $n$ independent and identically distributed copies of a random variable $O$ with probability distribution $P_0$. In addition, assume that it is known that $P_0$ is an element of a statistical model $\mathcal{M}$ and that we want to estimate $\psi_0=\Psi(P_0)$ for a given target parameter mapping $\Psi:\mathcal{M}\rightarrow\mathbb{R}$. In order to guarantee that $P_0\in\mathcal{M}$ one is forced to only incorporate real knowledge, and, as a consequence, such models $\mathcal{M}$ are always very large and, in particular, are infinite dimensional. We assume that the target parameter mapping is path-wise differentiable and let $D^*(P)$ denote the canonical gradient of the path-wise derivative of $\Psi$ at $P\in\mathcal{M}$ [1]. An estimator $\psi_n=\hat{\Psi}(P_n)$ is a functional $\hat{\Psi}$ applied to the empirical distribution $P_n$ of $O_1,\ldots,O_n$ and can thus be represented as a mapping $\hat{\Psi}:\mathcal{M}_{NP}\rightarrow\mathbb{R}$ from the non-parametric statistical model $\mathcal{M}_{NP}$ into the real line. An estimator $\hat{\Psi}$ is efficient if and only if it is asymptotically linear with influence curve $D^*(P_0)$:
$$\psi_n-\psi_0=\frac{1}{n}\sum_{i=1}^{n}D^*(P_0)(O_i)+o_P(1/\sqrt{n}).$$

The empirical mean of the influence curve $D^*(P_0)$ represents the first-order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals (i.e. $\hat{\Psi}$) of the empirical distribution [24].

Suppose that $\Psi(P)$ only depends on $P$ through a parameter $Q(P)$ and that the canonical gradient depends on $P$ only through $Q(P)$ and a nuisance parameter $g(P)$. The construction of an efficient estimator requires the construction of estimators $Q_n$ and $g_n$ of these nuisance parameters $Q_0$ and $g_0$, respectively. Targeted minimum loss-based estimation (TMLE) represents a method for construction of (e.g. efficient) asymptotically linear substitution estimators $\Psi(Q_n^*)$, where $Q_n^*$ is a targeted update of $Q_n$ that relies on the estimator $g_n$ [5–7]. The targeting of $Q_n$ is achieved by specifying a parametric submodel $\{Q_n(\epsilon):\epsilon\}\subset\{Q(P):P\in\mathcal{M}\}$ through the initial estimator $Q_n$ and a loss function $O\rightarrow L(Q)(O)$ for $Q_0=\arg\min_Q P_0L(Q)$, where $P_0L(Q)\equiv\int L(Q)(o)\,dP_0(o)$, so that the generalized score $\frac{d}{d\epsilon}L(Q_n(\epsilon))\big|_{\epsilon=0}$ spans a desired user-supplied estimating function $D(Q_n,g_n)$. In addition, one may decide to target $g_n$ by specifying a parametric submodel $\{g_n(\epsilon_1):\epsilon_1\}\subset\{g(P):P\in\mathcal{M}\}$ and loss function $O\rightarrow L(g)(O)$ for $g_0=\arg\min_g P_0L(g)$, so that the generalized score $\frac{d}{d\epsilon_1}L(g_n(\epsilon_1))\big|_{\epsilon_1=0}$ spans another desired estimating function $D_1(g_n,\eta_n)$ for some estimator $\eta_n$ of nuisance parameter $\eta$. The parameter $\epsilon$ is fitted with MLE $\epsilon_n=\arg\min_{\epsilon}P_nL(Q_n(\epsilon))$, providing the first-step update $Q_n^1=Q_n(\epsilon_n)$, and similarly $\epsilon_{1,n}=\arg\min_{\epsilon_1}P_nL(g_n(\epsilon_1))$. This updating process that mapped a current fit $(Q_n,g_n)$ into an update $(Q_n^1,g_n^1)$ is iterated till convergence, at which point the TMLE $(Q_n^*,g_n^*)$ solves $P_nD(Q_n^*,g_n^*)=0$, i.e. the empirical mean of the estimating function equals zero at the final TMLE $(Q_n^*,g_n^*)$. If one also targeted $g_n$, then it also solves $P_nD_1(g_n^*,\eta_n)=0$. The submodel through $Q_n$ will depend on $g_n$, while the submodel through $g_n$ will depend on another nuisance parameter $\eta_n$. By setting $D(Q,g)$ equal to the efficient influence curve $D^*(Q,g)$, the resulting TMLE solves the efficient influence curve estimating equation $P_nD^*(Q_n^*,g_n^*)=0$ and thereby will be asymptotically efficient when $(Q_n^*,g_n^*)$ is consistent for $(Q_0,g_0)$ under appropriate regularity conditions, where the targeting of $g_n$ is not needed.

The latter is shown as follows. By the property of the canonical gradient (in fact, any gradient) we have $\Psi(Q_n^*)-\Psi(Q_0)=-P_0D^*(Q_n^*,g_n^*)+R_n(Q_n^*,Q_0,g_n^*,g_0)$, where $R_n$ involves integrals of second-order products of the differences $(Q_n^*-Q_0)$ and $(g_n^*-g_0)$. Combined with $P_nD^*(Q_n^*,g_n^*)=0$, this implies the following identity:
$$\Psi(Q_n^*)-\Psi(Q_0)=(P_n-P_0)D^*(Q_n^*,g_n^*)+R_n(Q_n^*,Q_0,g_n^*,g_0).$$

The first term is an empirical process term that, under empirical process conditions (mentioned below), equals $(P_n-P_0)D^*(Q,g)$, where $(Q,g)$ denotes the limit of $(Q_n^*,g_n^*)$, plus an $o_P(1/\sqrt{n})$-term. This then yields
$$\Psi(Q_n^*)-\Psi(Q_0)=(P_n-P_0)D^*(Q,g)+R_n(Q_n^*,Q_0,g_n^*,g_0)+o_P(1/\sqrt{n}).$$

To obtain the desired asymptotic linearity of $\Psi(Q_n^*)$ one needs $R_n=o_P(1/\sqrt{n})$, which in general requires at a minimum that both nuisance parameters are consistently estimated: $Q=Q_0$ and $g=g_0$. However, in many problems of interest, $R_n$ only involves a cross-product of the differences $Q_n^*-Q_0$ and $g_n^*-g_0$, so that $R_n$ converges to zero if either $Q_n^*$ is consistent or $g_n^*$ is consistent: i.e. $Q=Q_0$ or $g=g_0$. In this latter case, the TMLE is so-called double robust. Either way, the consistency of the TMLE now relies on at least one of the nuisance parameter estimators being consistent, thereby requiring the use of non-parametric adaptive estimation such as super-learning [8–10] for at least one of the nuisance parameters. If only one of the nuisance parameter estimators is consistent, and we are in the double robust scenario, then it follows that the bias is of the same order as the bias of the consistent nuisance parameter estimator. However, if that nuisance parameter estimator is not based on a correctly specified parametric model, but instead is a data-adaptive estimator, then this bias will converge to zero at a rate slower than $1/\sqrt{n}$: i.e. $\sqrt{n}R_n$ converges to infinity as $n\rightarrow\infty$. In that case, the estimator of the target parameter may be overly biased and thereby will not be asymptotically linear.

1.2 Targeting the fit of the nuisance parameter: general approach

In this article, we demonstrate that if $Q\neq Q_0$, then it is essential that the consistent nuisance parameter estimator $g_n$ be targeted toward the estimand so that the bias for the estimand becomes second order: that is, in our new TMLEs relying on consistent estimation of $g_0$ presented in this article one simultaneously updates $g_n$ into a $g_n^*$ so that certain smooth functionals of $g_n^*$, derived from the study of $R_n$, are asymptotically linear under appropriate conditions. Even if both estimators $Q_n$ and $g_n$ are consistent, but $Q_n$ converges at a slower rate than $g_n$, this targeting of the nuisance parameter estimator may still remove finite sample bias for the estimand. In addition, we also present such a TMLE when only relying on one of the nuisance parameters being consistently estimated, without knowing which one: i.e. either $Q=Q_0$ or $g=g_0$. The same argument applies to other double robust estimators, such as estimating equation based estimators and inverse probability of treatment weighted (IPTW) estimators [11–16]. In fact, we demonstrate such a targeted IPTW-estimator in our next section.

The current article concerns the construction of such targeted IPTW and TMLE that are asymptotically linear under regularity conditions, even when only one of the nuisance parameters is consistent and the estimators of the nuisance parameters are highly data adaptive. In order to be concrete in this article, we will focus on a particular example. In such an example we can concretely present the second-order term $R_n$ mentioned above and thereby develop the concrete form of the TMLE.

The same approach for construction of such TMLE can be carried out in much greater generality, but that is beyond the scope of this article. Nonetheless, it is helpful for the reader to know that the general approach is the following (considering the case that $g=g_0$, but $Q$ can be misspecified): (1) approximate $R_n(Q_n^*,Q_0,g_n^*,g_0)=\Phi_{0,n}(g_n^*)-\Phi_{0,n}(g_0)+R_{1,n}$ for some mapping $\Phi_{0,n}$ that depends on $P_0$ (e.g. through $Q_0$) and the data (e.g. $Q_n^*,g_n^*$), and where $R_{1,n}$ is a second-order term so that it is reasonable to assume $R_{1,n}=o_P(1/\sqrt{n})$; (2) approximate $\Phi_{0,n}(g_n^*)-\Phi_{0,n}(g_0)=\Phi_n(g_n^*)-\Phi_n(g_0)+R_{2,n}$, where $R_{2,n}$ is a second-order term and $\Phi_n$ is now a known (only based on data) mapping approximating $\Phi_{0,n}$; (3) construct $g_n^*$ so that it is a TMLE of the target parameter $\Phi_n(g_0)$, thereby allowing an expansion $\Phi_n(g_n^*)-\Phi_n(g_0)=(P_n-P_0)D_{1,n}(P_0)+R_{3,n}$ with $D_{1,n}(P_0)$ being the efficient influence curve of $\Phi_n(g_0)$. That is, in step 3, $g_n$ is iteratively updated to solve $P_nD_{1,n}(g_n^*,\eta_n)=0$ with $D_{1,n}(P_0)$ depending on $P_0$ through $g_0$ and a nuisance parameter $\eta_0$, so that $\Phi_n(g_n^*)$ is an asymptotically linear estimator of $\Phi_n(g_0)$ under regularity conditions. After these three steps, we have that $R_n(Q_n^*,Q_0,g_n^*,g_0)=(P_n-P_0)D_{1,n}(P_0)+R_{1,n}+R_{2,n}+R_{3,n}$, where $R_{1,n}+R_{2,n}+R_{3,n}=o_P(1/\sqrt{n})$, and these steps provide us with the parameter $\Phi_n(g_0)$ that needs to be targeted by $g_n^*$, thereby telling us how to target $g_n$ in the TMLE of $\psi_0$. In addition, we can then conclude that this TMLE is asymptotically linear with known influence curve $D^*(Q,g_0)+D_1(P_0)$, where $D_1(P_0)$ represents the limit of the efficient influence curve $D_{1,n}(P_0)$ of $\Phi_n(g_0)$:
$$\Psi(Q_n^*)-\Psi(Q_0)=(P_n-P_0)\{D^*(Q,g_0)+D_1(P_0)\}+o_P(1/\sqrt{n}).$$

1.3 Concrete example covered in this article

Let us now formulate the concrete example covered in this article. Let $O=(W,A,Y)\sim P_0$, with $W$ baseline covariates, $A$ a binary treatment, and $Y$ a final outcome. Let $\mathcal{M}$ be a model that makes at most some assumptions about the conditional distribution of $A$, given $W$, but leaves the marginal distribution of $W$ and the conditional distribution of $Y$, given $A,W$, unspecified. Let $\Psi:\mathcal{M}\rightarrow\mathbb{R}$ be defined as $\Psi(P)=E_PE_P(Y\mid A=1,W)$, the so-called treatment specific mean controlling for the baseline covariates. The canonical gradient, also called the efficient influence curve, of $\Psi$ at $P$ is given by
$$D^*(P)(O)=\frac{A}{g(1\mid W)}\left(Y-\bar{Q}(1,W)\right)+\bar{Q}(1,W)-\Psi(P),$$
where $g(1\mid W)=P(A=1\mid W)$ is the propensity score and $\bar{Q}(a,W)=E_P(Y\mid A=a,W)$ is the outcome regression [13]. Let $Q=(Q_W,\bar{Q})$, where $Q_W$ is the marginal distribution of $W$, and note that $\Psi(P)$ only depends on $P$ through $Q=Q(P)$. For convenience, we will denote the target parameter with $\Psi(Q)$ in order to not have to introduce additional notation. A targeted minimum loss-based estimator (TMLE) is a plug-in estimator $\Psi(Q_n^*)$, where $Q_n^*$ is an update of an initial estimator $Q_n$ that relies on an estimator $g_n$ of $g_0$, and it has the property that it solves $P_nD^*(Q_n^*,g_n)=0$, where we used the notation $Pf=\int f(o)\,dP(o)$.

For this particular example, such TMLE are presented in Scharfstein et al. [17]; van der Laan and Rubin [7]; Bembom et al. [18–21]; Rosenblum and van der Laan [22]; Sekhon et al. [23]; van der Laan and Rose [6, 24]. Since $P_0D^*(Q,g)=\psi_0-\Psi(Q)+P_0(\bar{Q}_0-\bar{Q})(\bar{g}_0-\bar{g})/\bar{g}$ [25, 26], where we use the notation $\bar{g}(W)=g(1\mid W)$ and $\bar{Q}(W)=\bar{Q}(1,W)$, and $P_nD^*(Q_n^*,g_n)=0$, we obtain the identity:
$$\Psi(Q_n^*)-\psi_0=(P_n-P_0)D^*(Q_n^*,g_n)+P_0(\bar{Q}_0-\bar{Q}_n^*)(\bar{g}_0-\bar{g}_n)/\bar{g}_n.\qquad(1)$$
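The identity for $P_0D^*(Q,g)$ used above can be checked numerically. The following is a minimal sketch, assuming a toy distribution with a three-valued covariate $W$; all numbers and helper names (`mean_under`, `p0_eic`, `remainder`) are hypothetical and chosen only for illustration. It evaluates both sides exactly and shows that the remainder vanishes as soon as either $\bar{g}=\bar{g}_0$ or $\bar{Q}=\bar{Q}_0$.

```python
# Toy distribution P0 on a three-valued covariate W; all numbers are made up.
p_w = {0: 0.2, 1: 0.5, 2: 0.3}    # P0(W = w)
gbar0 = {0: 0.3, 1: 0.5, 2: 0.8}  # true propensity score P0(A = 1 | W = w)
qbar0 = {0: 0.1, 1: 0.4, 2: 0.7}  # true outcome regression E0(Y | A = 1, W = w)

# Arbitrary, deliberately wrong candidates Q = (Q_W, Qbar) and g.
q_w = {0: 0.3, 1: 0.3, 2: 0.4}
qbar = {0: 0.2, 1: 0.2, 2: 0.5}
gbar = {0: 0.4, 1: 0.6, 2: 0.6}


def mean_under(weights, f):
    """Expectation of f(W) when W has the given probability mass function."""
    return sum(weights[w] * f(w) for w in weights)


psi0 = mean_under(p_w, lambda w: qbar0[w])   # true treatment specific mean
psi_q = mean_under(q_w, lambda w: qbar[w])   # Psi(Q) for the candidate Q


def p0_eic(qbar_c, gbar_c, psi):
    """P0 D*(Q, g). Conditioning on W turns A/gbar (Y - Qbar) into
    gbar0/gbar (Qbar0 - Qbar)."""
    return mean_under(
        p_w,
        lambda w: gbar0[w] / gbar_c[w] * (qbar0[w] - qbar_c[w]) + qbar_c[w] - psi,
    )


def remainder(qbar_c, gbar_c):
    """Cross-product term P0 (Qbar0 - Qbar)(gbar0 - gbar)/gbar."""
    return mean_under(
        p_w, lambda w: (qbar0[w] - qbar_c[w]) * (gbar0[w] - gbar_c[w]) / gbar_c[w]
    )


lhs = p0_eic(qbar, gbar, psi_q)
rhs = psi0 - psi_q + remainder(qbar, gbar)
identity_holds = abs(lhs - rhs) < 1e-12

# Double robustness: the remainder is zero if either nuisance parameter is correct.
remainder_g_correct = remainder(qbar, gbar0)
remainder_q_correct = remainder(qbar0, gbar)
```

Note that the identity does not require the candidate marginal distribution `q_w` to equal the truth, which is why the sketch deliberately uses a wrong one.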

The first term equals $(P_n-P_0)D^*(Q,g)+o_P(1/\sqrt{n})$ if $D^*(Q_n^*,g_n)$ falls in a $P_0$-Donsker class with probability tending to 1, and $P_0\{D^*(Q_n^*,g_n)-D^*(Q,g)\}^2\rightarrow0$ in probability as $n\rightarrow\infty$ [4, 27]. If $\bar{Q}_n^*$ and $\bar{g}_n$ are consistent for the true $\bar{Q}_0$ and $\bar{g}_0$, respectively, then the second term is a second-order term. If one now assumes that this second-order term is $o_P(1/\sqrt{n})$, it has been proven that the TMLE is asymptotically efficient. This provides the general basis for proving asymptotic efficiency of TMLE when both $Q_0$ and $g_0$ are consistently estimated.

However, if only one of these nuisance parameter estimators is consistent, then the second term is still a first-order term, and it remains to establish that it is also asymptotically linear with a second-order remainder. For sake of discussion, suppose that $\bar{Q}_n^*$ converges to a wrong $\bar{Q}$ while $\bar{g}_n$ is consistent. In that case, this remainder behaves in first order as $-P_0(\bar{Q}_0-\bar{Q})(\bar{g}_n-\bar{g}_0)/\bar{g}_0$. To establish that such a term is asymptotically linear requires that $\bar{g}_n$ solves a particular estimating equation: that is, $\bar{g}_n$ needs to be a TMLE itself targeting the required smooth functional of $g_0$. This is naturally achieved within the TMLE framework by specifying a submodel through $g_n$ and loss function with the appropriate generalized score, so that a TMLE update step involves both updating $Q_n$ and $g_n$, and the iterative TMLE algorithm now results in a final TMLE $(Q_n^*,g_n^*)$, not only solving $P_nD^*(Q_n^*,g_n^*)=0$ but also these additional equations that allow us to establish asymptotic linearity of the desired smooth functional of $g_n^*$: see general description of TMLE above.

In this article, we present TMLE that targets $g_n$ in a manner that allows us to prove the desired asymptotic linearity of the second term in the right-hand side of eq. (1) when either $\bar{g}_n$ or $\bar{Q}_n$ is consistent, under conditions that require specified second-order terms to be $o_P(1/\sqrt{n})$. The latter type of regularity conditions are typical for the construction of asymptotically linear estimators and are therefore considered appropriate for the sake of this article. Though it is of interest to study cases in which these second-order terms cannot be assumed to be $o_P(1/\sqrt{n})$, this is beyond the scope of this article.

1.4 Relation to current literature on targeted nuisance parameter estimators

The construction of TMLE that utilizes targeting of the nuisance parameter $g_n$ has been carried out in earlier papers. For example, in van der Laan and Rubin [7], we target $g_n$ to obtain a TMLE that, beyond being double robust and locally efficient, also equals the IPTW-estimator. In Gruber and van der Laan [29] we target $g_n$ to guarantee that the TMLE, beyond being double robust and locally efficient, also outperforms a user-supplied estimator, based on the original idea of Rotnitzky et al. [28]. The distinction of the current article from these previous articles is that $g_n$ is now targeted to guarantee that the TMLE remains asymptotically linear when $Q_n$ is misspecified. This task of targeting $g_n$ appears to be one step more complicated than in these previous articles, since the smooth functionals of $g_n$ that need to be targeted are themselves indexed by parameters of the true data distribution $P_0$, and thus unknown. As mentioned above, our strategy is to approximate these unknown smooth functionals by an estimated smooth functional and develop the targeted estimator $g_n^*$ that targets this estimated parameter of $g_0$.

The TMLEs presented in this article are always iterative and thereby rely on convergence of the iterative updating algorithm. Since the empirical risk cannot increase at an updating step (each update minimizes the empirical risk over a submodel that contains the current fit), such convergence is typically guaranteed by the existence of the MLE at each updating step (e.g. an MLE of a coefficient in a logistic regression). Either way, in this article, we assume this convergence to hold. Since the assumptions of our theorems require $g_n^*(1\mid W)$ to be bounded away from zero, we demonstrate how this property can be achieved by using submodels for updating $g_n$ that guarantee this property. Detailed simulations will appear in a future article.

1.5 Organization

The organization of this paper is as follows. In Section 2, we introduce a targeted IPTW-estimator that relies on an adaptive consistent estimator of $g_0$, and we establish its asymptotic linearity with known influence curve, allowing for the construction of asymptotically valid confidence intervals based on this adaptive IPTW-estimator. In the remainder of the article, we focus on construction of TMLE involving the targeting of $g_n$ to establish the asymptotic linearity of the resulting TMLE under appropriate conditions. In Section 3, we introduce a novel TMLE that assumes that the targeted adaptive estimator $g_n^*$ is consistent for $g_0$, and we establish its asymptotic linearity. In Section 4, we introduce a novel TMLE that only assumes that either the targeted $\bar{Q}_n^*$ or the targeted $\bar{g}_n^*$ is consistent, and we establish its asymptotic linearity with known influence curve. This TMLE needs to protect the asymptotic linearity under misspecification of either $g_n$ or $\bar{Q}_n$, and, as a consequence, relies on targeting of $g_n$ (in order to preserve asymptotic linearity when $\bar{Q}_n$ is inconsistent), but also extra targeting of $\bar{Q}_n$ (in order to preserve asymptotic linearity when $\bar{Q}_n$ is consistent, but $g_n$ is inconsistent). The explicit form of the influence curve of this TMLE allows us to construct asymptotic confidence intervals. Since this result allows statistical inference in the statistical model that only assumes that one of the estimators is consistent, we refer to this as “double robust statistical inference”. Even though double robust estimators have been extensively presented in the current literature, double robust statistical inference in these large semi-parametric models has been a difficult topic: typically, the non-parametric bootstrap has been suggested, but there is no theory supporting that the non-parametric bootstrap is a valid method when the estimators rely on data-adaptive estimation.

In Section 5, we extend the TMLE of Section 3 (that relies on $g_n$ being consistent for $g_0$) to the case that $g_n$ converges to a possibly misspecified $g$ but one that suffices for consistent estimation of $\psi_0$ in the sense that $\Psi(Q_n^*)$ will be consistent. We present a corresponding asymptotic linearity theorem for this TMLE that is able to utilize the so-called collaborative double robustness of the efficient influence curve, which states that $\Psi(Q)=\psi_0$ if $P_0D^*(Q,g)=0$ and $g\in\mathcal{G}(Q,P_0)$ for a set $\mathcal{G}(Q,P_0)$ (including $g_0$). In order to construct a collaborative estimator $g_n$ that aims to converge to an element in $\mathcal{G}(Q_n^*,P_0)$ in collaboration with $Q_n^*$, we use the framework of collaborative targeted minimum loss-based estimator (C-TMLE) [20, 29–35]. Our asymptotic linearity theorem can now be applied to this C-TMLE. Again, even though C-TMLEs have been presented in the current literature, statistical inference based on the C-TMLEs has been another challenging topic, and Section 5 provides us with a C-TMLE with known influence curve. We conclude this article with a discussion. The proofs of the theorems are presented in the Appendix.

1.6 Notation

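In the following sections, we will use the following notation. We have $O=(W,A,Y)\sim P_0\in\mathcal{M}$, where $\mathcal{M}$ is a statistical model that makes only assumptions on the conditional distribution of $A$, given $W$. Let $g_0(a\mid W)=P_0(A=a\mid W)$, and $\bar{g}_0(W)=P_0(A=1\mid W)$. The target parameter is $\Psi:\mathcal{M}\rightarrow\mathbb{R}$ defined by $\Psi(P_0)=E_{Q_{W,0}}\bar{Q}_0(1,W)$, where $\bar{Q}_0(1,W)=E_{P_0}(Y\mid A=1,W)$, which will also be denoted with $\bar{Q}_0(W)$, and $Q_{W,0}$ is the distribution of $W$ under $P_0$. We also use the notation $\Psi(Q)$, where $Q=(Q_W,\bar{Q})$. In addition, $D^*(Q,g)$ denotes the efficient influence curve of $\Psi$ at $(Q,g)$. We also use the following notation:
$$\begin{aligned}
H_A(\bar{Q}^r,\bar{g})&=\bar{Q}^r/\bar{g}\\
H_0^r&=\bar{Q}_0^r/\bar{g}_0\\
D_A(\bar{Q}^r,\bar{g})(A,W)&=H_A(\bar{Q}^r,\bar{g})(W)\,(A-\bar{g}(W))\\
H_Y(\bar{g})(A,W)&=A/\bar{g}(W)\\
\bar{Q}_0^r(\bar{Q},\bar{g})&=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})\\
\bar{Q}_0^r&=\bar{Q}_0^r(\bar{Q},\bar{g}_0)\\
\bar{g}_0^r(\bar{g},\bar{Q})&=E_0(A\mid\bar{g},\bar{Q})\\
\bar{Q}_0^r(\bar{g})&=E_0(Y\mid A=1,\bar{g})\quad\text{(only IPTW, Section 2)}\\
\bar{Q}_0^r&=\bar{Q}_0^r(\bar{g}_0)\quad\text{(only IPTW, Section 2)}\\
\|f\|_0&=\left(P_0f^2\right)^{0.5}.
\end{aligned}$$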

2 Statistical inference for IPTW-estimator when using super-learning to fit treatment mechanism

We first describe an IPTW-estimator that uses super-learning to fit the treatment mechanism $g_0$. Subsequently, we present this IPTW-estimator but now using an update of the super-learning fit of $g_0$, and we present a theorem establishing the asymptotic linearity of this targeted IPTW-estimator under appropriate conditions. Finally, we discuss how this targeted IPTW-estimator compares with an IPTW-estimator that relies on a parametric model to fit the treatment mechanism.

2.1 An IPTW-estimator using super-learning to fit the treatment mechanism

We consider a simple IPTW-estimator $\hat{\Psi}(P_n)=P_nD(\hat{g}(P_n))$, where $D(g)(O)=YA/\bar{g}(W)$, and $\hat{g}:\mathcal{M}_{NP}\rightarrow\mathcal{G}$ is an adaptive estimator of $g_0$ based on the log-likelihood loss function $L(g)(O)\equiv-\log g(A\mid W)$. For a general presentation of an IPTW-estimator, we refer to Robins and Rotnitzky [11], van der Laan and Robins [13], and Hernan et al. [36]. We wish to establish conditions under which reliable statistical inference based on this estimator of $\psi_0$ can be obtained. One might wish to estimate $g_0$ with ensemble learning, and, in particular, super-learning in which cross-validation [37] is used to determine the best weighted combination of a library of candidate estimators: van der Laan and Dudoit [8]; van der Laan et al. [9, 38, 39]; van der Vaart et al. [10]; Dudoit and van der Laan [40]; Polley et al. [41]; Polley and van der Laan [42]; van der Laan and Petersen [43]. The super-learner is a general template for construction of an adaptive estimator based on a library of candidate estimators, a loss function whose expectation is minimized over the parameter space by the true parameter value, and a parametric family that defines “weighted” combinations of the estimators in the library. We will start with presenting a succinct description of a particular super-learner. Consider a library of estimators $\hat{g}_j:\mathcal{M}_{NP}\rightarrow\mathcal{G}$, $j=1,\ldots,J$, and a family of weighted (on logistic scale) combinations of these estimators
$$\mathrm{Logit}\,\hat{g}_\alpha(1\mid W)=\sum_{j=1}^{J}\alpha_j\,\mathrm{Logit}\,\hat{g}_j(1\mid W),$$
indexed by vectors $\alpha$ for which $\alpha_j\in[0,1]$ and $\sum_j\alpha_j=1$. Consider a random sample split $B_n\in\{0,1\}^n$ into a training sample $\{i:B_n(i)=0\}$ of size $n(1-p)$ and validation sample $\{i:B_n(i)=1\}$ of size $np$, and let $P_{n,B_n}^1$ and $P_{n,B_n}^0$ denote the empirical distribution of the validation sample and training sample, respectively. Define
$$\alpha_n=\arg\min_\alpha E_{B_n}P_{n,B_n}^1L\left(\hat{g}_\alpha(P_{n,B_n}^0)\right)=\arg\min_\alpha E_{B_n}\frac{1}{np}\sum_{i:B_n(i)=1}L\left(\hat{g}_\alpha(P_{n,B_n}^0)\right)(O_i),$$

as the choice of estimator that minimizes cross-validated risk. The super-learner of $g_0$ is defined as the estimator $\hat{g}(P_n)=\hat{g}_{\alpha_n}(P_n)$.
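The super-learner just described can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the cited work: it assumes a binary covariate, a hypothetical library of two candidate estimators (`fit_marginal`, `fit_stratified`), V-fold cross-validation as one particular choice of the sample split $B_n$, and a coarse grid search over $\alpha$ in place of a numerical optimizer.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_marginal(data):
    """Candidate 1 (hypothetical): ignores W, smoothed proportion of A = 1."""
    p = (sum(a for _, a in data) + 0.5) / (len(data) + 1.0)
    return lambda w: p


def fit_stratified(data):
    """Candidate 2 (hypothetical): smoothed proportion of A = 1 per level of binary W."""
    est = {}
    for level in (0, 1):
        rows = [a for w, a in data if w == level]
        est[level] = (sum(rows) + 0.5) / (len(rows) + 1.0)
    return lambda w: est[w]


def combine(fits, alpha):
    """Weighted combination of the candidate fits on the logit scale."""
    return lambda w: expit(sum(a_j * logit(f(w)) for a_j, f in zip(alpha, fits)))


def neg_loglik(g, data):
    """Empirical mean of the loss L(g)(O) = -log g(A | W)."""
    return -sum(math.log(g(w) if a == 1 else 1.0 - g(w)) for w, a in data) / len(data)


def super_learner(data, learners, n_folds=5, grid=11):
    """Pick alpha on a grid by V-fold cross-validated risk, then refit on all data."""
    folds = [data[k::n_folds] for k in range(n_folds)]
    alphas = [(j / (grid - 1), 1.0 - j / (grid - 1)) for j in range(grid)]
    cv_risk = [0.0] * len(alphas)
    for k in range(n_folds):
        train = [row for j, fold in enumerate(folds) if j != k for row in fold]
        fits = [learn(train) for learn in learners]       # fitted on training sample
        for idx, alpha in enumerate(alphas):              # evaluated on validation sample
            cv_risk[idx] += neg_loglik(combine(fits, alpha), folds[k]) / n_folds
    alpha_n = alphas[min(range(len(alphas)), key=cv_risk.__getitem__)]
    return combine([learn(data) for learn in learners], alpha_n), alpha_n


# Simulated data (made up): treatment probability 0.2 if W = 0 and 0.8 if W = 1.
random.seed(0)
data = []
for _ in range(1000):
    w = random.randint(0, 1)
    a = 1 if random.random() < (0.8 if w == 1 else 0.2) else 0
    data.append((w, a))

g_hat, alpha_n = super_learner(data, [fit_marginal, fit_stratified])
```

With two candidates the weight vector is one-dimensional, so a grid suffices; for a larger library one would replace the grid by a constrained optimizer over the simplex.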

2.2 Asymptotic linearity of a targeted data-adaptive IPTW-estimator

The next theorem presents an IPTW-estimator that uses a targeted fit $g_n^*$ of $g_0$, involving the updating of an initial estimator $g_n$, and conditions under which this IPTW-estimator of $\psi_0$ is asymptotically linear. For example, $g_n$ could be defined as a super-learner of the type presented above. In spite of the fact that such an IPTW-estimator uses a very data-adaptive and hard to understand estimator $g_n^*$, this theorem shows that its influence curve is known and can be well estimated.

Theorem 1 We consider a targeted IPTW-estimator $\hat{\Psi}(P_n)=P_nD(g_n^*)$, where $D(g)(O)=YA/g(A\mid W)$, and $g_n^*$ is an update of an initial estimator $g_n$ of $g_0\in\mathcal{G}$ defined below.

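Definition of targeted estimator $g_n^*$: Let $\bar{Q}_n^r$ be obtained by non-parametric estimation of the regression function $E_{P_0}(Y\mid A=1,\bar{g}_n(W))$ treating $\bar{g}_n$ as a fixed covariate (i.e. function of $W$). This yields an estimator $H_n^r\equiv\bar{Q}_n^r/\bar{g}_n$ of $H_0^r=\bar{Q}_0^r/\bar{g}_0$, where $\bar{Q}_0^r=E_{P_0}(Y\mid A=1,\bar{g}_0)$. Consider the submodel $\mathrm{Logit}\,\bar{g}_n(\epsilon)=\mathrm{Logit}\,\bar{g}_n+\epsilon H_n^r$, and fit $\epsilon$ with the MLE $\epsilon_n=\arg\max_\epsilon P_n\log g_n(\epsilon)$.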

We define $g_n^*=g_n(\epsilon_n)$ as the corresponding targeted update of $g_n$. This TMLE $g_n^*$ satisfies $P_nD_A(\bar{Q}_n^r,\bar{g}_n^*)=0$.

Empirical process condition: Assume that $D(g_n^*)$, $D_A(\bar{Q}_n^r,\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1.

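Negligibility of second-order terms: Define $\bar{Q}_{0,n}^r\equiv E_{P_0}(Y\mid A=1,\bar{g}_0(W),\bar{g}_n^*(W))$. Assume $\bar{g}_n^*>\delta>0$ with probability tending to 1 and assume
$$\begin{aligned}
\|\bar{Q}_n^r-\bar{Q}_0^r\|_0&=o_P(1)\\
\|\bar{g}_n^*-\bar{g}_0\|_0^2&=o_P(1/\sqrt{n})\\
\|\bar{Q}_n^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}_0\|_0&=o_P(1/\sqrt{n})\\
\|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}_0\|_0&=o_P(1/\sqrt{n}).
\end{aligned}$$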

Then,
$$\hat{\Psi}(P_n)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where
$$IC(P_0)(O)=YA/g_0(A\mid W)-\psi_0-H_0^r(W)\,(A-\bar{g}_0(W)).$$
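The targeting step in the theorem is a one-dimensional logistic regression with offset $\mathrm{Logit}\,\bar{g}_n$ and covariate $H_n^r$. The sketch below illustrates it under loudly stated assumptions: the data are simulated, the initial fit `init_g` is made up, and `regress_on_score` is a crude hypothetical stand-in for the non-parametric regression of $Y$ on $\bar{g}_n(W)$ among $A=1$ (it simply averages within distinct values of $\bar{g}_n$). The check at the end is the score equation of the one-step MLE, $\frac{1}{n}\sum_iH_n^r(W_i)(A_i-\bar{g}_n^*(W_i))=0$, with $H_n^r$ built from the initial fit.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def regress_on_score(y, a, gbar):
    """Hypothetical stand-in for Qbar_n^r: mean of Y among A = 1 within each
    distinct (rounded) value of the fitted propensity score."""
    groups = {}
    for yi, ai, gi in zip(y, a, gbar):
        if ai == 1:
            groups.setdefault(round(gi, 6), []).append(yi)
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return [means.get(round(gi, 6), 0.0) for gi in gbar]


def target_propensity(a, gbar, h, tol=1e-10, max_iter=50):
    """Fit eps in Logit gbar(eps) = Logit gbar + eps * h by Newton-Raphson on the
    Bernoulli log-likelihood; returns (eps_n, updated propensity scores)."""
    offset = [logit(g) for g in gbar]
    eps = 0.0
    for _ in range(max_iter):
        g_eps = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (ai - gi) for hi, ai, gi in zip(h, a, g_eps))
        info = sum(hi * hi * gi * (1.0 - gi) for hi, gi in zip(h, g_eps))
        if abs(score) / len(a) < tol or info == 0.0:
            break
        eps += score / info
    return eps, [expit(o + eps * hi) for o, hi in zip(offset, h)]


# Simulated data; every number below is made up for illustration.
random.seed(1)
n = 2000
true_g = {0: 0.3, 1: 0.5, 2: 0.8}
w = [random.choice((0, 1, 2)) for _ in range(n)]
a = [1 if random.random() < true_g[wi] else 0 for wi in w]
y = [1 if random.random() < 0.2 + 0.2 * wi else 0 for wi in w]

init_g = {0: 0.4, 1: 0.4, 2: 0.7}              # imperfect initial fit g_n
gbar_n = [init_g[wi] for wi in w]
qbar_r = regress_on_score(y, a, gbar_n)        # stand-in for Qbar_n^r
h_r = [q / g for q, g in zip(qbar_r, gbar_n)]  # H_n^r = Qbar_n^r / gbar_n

eps_n, gbar_star = target_propensity(a, gbar_n, h_r)
mean_score = sum(hi * (ai - gi) for hi, ai, gi in zip(h_r, a, gbar_star)) / n
```

Because the fluctuation acts on the logit scale, the updated fit `gbar_star` stays strictly between 0 and 1 for any finite `eps_n`.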

So under the conditions of this theorem, we can construct an asymptotic 0.95-confidence interval $\psi_n\pm1.96\,\sigma_n/\sqrt{n}$ based on this targeted IPTW-estimator $\psi_n=\hat{\Psi}(P_n)$, where
$$\sigma_n^2=P_nIC_n^2=\frac{1}{n}\sum_{i=1}^{n}IC_n(O_i)^2,$$

and $IC_n(O)=YA/\bar{g}_n^*(W)-\psi_n-H_n^r(W)\,(A-\bar{g}_n^*(W))$ is the plug-in estimator of the influence curve $IC(P_0)$ obtained by plugging in $g_n^*$ or $g_n$ for $g_0$ and $\bar{Q}_n^r$ for $\bar{Q}_0^r$.
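As a small worked illustration of this interval, the sketch below computes $\psi_n$, the plug-in influence curve values, $\sigma_n$ and the interval from given inputs. The function name `iptw_confidence_interval` and the four-observation data set are hypothetical; the fitted propensity scores and the values of $H_n^r$ are simply taken as given.

```python
import math


def iptw_confidence_interval(y, a, gbar, h_r, z=1.96):
    """Targeted IPTW point estimate and Wald-type interval based on the
    plug-in influence curve IC_n(O) = Y A / gbar(W) - psi_n - H(W) (A - gbar(W)).

    y, a  : outcomes and binary treatments
    gbar  : fitted propensity scores gbar(W_i)
    h_r   : fitted values H_n^r(W_i)
    """
    n = len(y)
    psi_n = sum(yi * ai / gi for yi, ai, gi in zip(y, a, gbar)) / n
    ic = [
        yi * ai / gi - psi_n - hi * (ai - gi)
        for yi, ai, gi, hi in zip(y, a, gbar, h_r)
    ]
    sigma2_n = sum(v * v for v in ic) / n          # sigma_n^2 = P_n IC_n^2
    half_width = z * math.sqrt(sigma2_n / n)
    return psi_n, (psi_n - half_width, psi_n + half_width)


# Tiny made-up data set, small enough to verify by hand.
y = [1, 0, 1, 1]
a = [1, 1, 0, 1]
gbar = [0.5, 0.5, 0.5, 0.5]
h_r = [0.0, 0.0, 0.0, 0.0]

psi_n, (lower, upper) = iptw_confidence_interval(y, a, gbar, h_r)
```

Here the weighted outcomes are 2, 0, 0, 2, so $\psi_n=1$, every influence curve value is $\pm1$, $\sigma_n^2=1$, and the half-width is $1.96/\sqrt{4}=0.98$.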

Regarding the displayed second-order term conditions, we note that these are satisfied if $\bar{g}_n^*-\bar{g}_0$ converges to zero w.r.t. the $L^2(P_0)$-norm at rate $o_P(n^{-1/4})$, $\bar{g}_n^*>\delta>0$ for some $\delta>0$ with probability tending to 1 as $n\rightarrow\infty$, and the product of the rates at which $\bar{g}_n^*$ converges to $\bar{g}_0$ and $(\bar{Q}_n^r,\bar{Q}_{0,n}^r)$ converges to $\bar{Q}_0^r$ is $o_P(1/\sqrt{n})$.

Regarding the empirical process condition, we note that an example of a Donsker class is the class of multivariate real-valued functions with uniform sectional variation norm bounded by a universal constant [44]. It is important to note that if each estimator in the library falls in such a class, then also the convex combinations fall in that same class [4]. So this Donsker condition will hold if it holds for each of the candidate estimators in the library of the super-learner.

2.3 Comparison of targeted data-adaptive IPTW and an IPTW using parametric model

Consider an IPTW-estimator using a MLE $g_{n,1}$ according to a parametric model for $g_0$, and let us contrast this IPTW-estimator with an IPTW-estimator defined in the above theorem based on an initial super-learner $g_n$ that includes $g_{n,1}$ as an element of the library of estimators. Let us first consider the case that the parametric model is correctly specified. In that case $g_{n,1}$ converges to $g_0$ at a parametric rate $1/\sqrt{n}$. From the oracle inequality for cross-validation [8, 10, 38], it follows that $g_n$ also converges at the rate $1/\sqrt{n}$ to $g_0$, possibly up to a $\log n$-factor in case the number of algorithms in the library is of the order $n^q$ for some fixed $q$. As a consequence, all the consistency and second-order term conditions for the IPTW-estimator using a targeted $g_n^*$ based on $g_n$ hold. If one uses estimators in the library of algorithms that have a uniform sectional variation norm smaller than an $M<\infty$ with probability tending to 1, then also a weighted average of these estimators will have uniform sectional variation norm smaller than $M<\infty$ with probability tending to 1. Thus, in that case we will also have that $D(g_n^*)$, $D_A(\bar{Q}_n^r,\bar{g}_n^*)$ fall in a $P_0$-Donsker class. Examples of estimators that control the uniform sectional variation norm are any parametric model with fewer than $K$ main terms that themselves have a bounded uniform sectional variation norm, but also penalized least-squares estimators (e.g. Lasso) using basis functions with bounded uniform sectional variation norm, and one could map any estimator into this space of functions with universally bounded uniform sectional variation norm through a smoothing operation. Thus, under this restriction on the library, the IPTW-estimator using the super-learner is asymptotically linear with influence curve $IC(P_0)(O)$ as stated in the theorem. We note that $IC(P_0)$ is the efficient influence curve for the target parameter $E_{P_0}E_{P_0}(Y\mid A=1,\bar{g}_0(W))$ if the observed data were $(\bar{g}_0(W),A,Y)$ instead of $O=(W,A,Y)$.

The parametric IPTW-estimator is asymptotically linear with influence curve $O\rightarrow YA/g_0(A\mid W)-\psi_0-\Pi(YA/\bar{g}_0(W)\mid T_g)$, where $T_g$ is the tangent space of the parametric model for $g_0$, and $\Pi(f\mid T_g)$ denotes the projection of $f$ onto $T_g$ in the Hilbert space $L_0^2(P_0)$ [13]. This IPTW-estimator could be less or more efficient than the IPTW-estimator using the targeted super-learner depending on the actual tangent space of the parametric model.

For example, if the parametric model happens to have a score equal to $O\rightarrow\bar{Q}_0(W)\,(A/\bar{g}_0(W)-1)$, then the parametric IPTW-estimator would be asymptotically efficient. Of course, a standard parametric model is not tailored to correspond with such optimal scores, but this shows that we cannot claim superiority of one versus the other in the case that the parametric model for $g_0$ is correctly specified.

If, on the other hand, the parametric model is misspecified, then the IPTW-estimator using $g_{n,1}$ is inconsistent. However, the super-learner $g_n$ will be consistent if the library contains a non-parametric adaptive estimator, and will perform asymptotically as well as the oracle selector among all the weighted combinations of the algorithms in the library. To conclude, the IPTW-estimator using super-learning to estimate $g_0$ will be as good as the IPTW-estimator using a correctly specified parametric model (included in the library of the super-learner), but will remain consistent and asymptotically linear in a much larger model than the parametric IPTW-estimator relying on the true $g_0$ being an element of the parametric model.

3 Statistical inference for TMLE when using super-learning to consistently fit treatment mechanism

In the next subsection, we present a TMLE that targets the fit of the treatment mechanism, analogous to the targeted IPTW-estimator presented above. In addition, this subsection presents a formal asymptotic linearity theorem demonstrating that, under reasonable conditions, this TMLE will be asymptotically linear even when $\bar{Q}_n$ is inconsistent. We conclude this section with a subsection showing how the iterative updating of the treatment mechanism can be carried out in such a way that the final fit of the treatment mechanism is still bounded away from zero, as required to obtain a stable estimator.

3.1 Asymptotic linearity of a TMLE using a targeted estimator of the treatment mechanism

The following theorem presents a novel TMLE and corresponding asymptotic linearity with specified influence curve, where we rely on consistent estimation of $g_0$. The TMLE still uses the same updating step for the estimator of $\bar{Q}_0$ as the regular TMLE [7], but uses a novel updating step for the estimator of $g_0$, analogous to the updating step of the IPTW-estimator in the previous section. We remind the reader of the importance of using the logistic fluctuations as working-submodels for $\bar{Q}_0$ in the definition of the TMLE, guaranteeing that the TMLE update stays within the bounded parameter space (see, e.g. Gruber and van der Laan [19]).
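To make the role of the logistic fluctuation concrete, the following sketch carries out a single update of an initial fit of $\bar{Q}_0(1,W)$ along the submodel $\mathrm{Logit}\,\bar{Q}(\epsilon)=\mathrm{Logit}\,\bar{Q}+\epsilon A/\bar{g}(W)$ with the quasi-log-likelihood loss, for an outcome scaled to $[0,1]$. It is an illustration under assumptions rather than the algorithm of the theorem: the data are simulated, the propensity score is taken as known, the initial fit is a deliberately poor constant, and the function name `fluctuate_outcome_regression` is hypothetical.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fluctuate_outcome_regression(y, a, gbar, qbar1, tol=1e-10, max_iter=50):
    """One logistic-fluctuation update of an initial fit qbar1[i] of Qbar(1, W_i).

    Minimizes the empirical quasi-log-likelihood loss over eps by Newton-Raphson.
    Only observations with A = 1 contribute, because the covariate A / gbar(W)
    vanishes when A = 0. Returns (eps_n, updated values of Qbar(1, W_i)).
    """
    offset = [logit(q) for q in qbar1]
    h = [1.0 / g for g in gbar]           # covariate A / gbar(W) evaluated at A = 1
    eps = 0.0
    for _ in range(max_iter):
        q_eps = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(ai * hi * (yi - qi) for ai, hi, yi, qi in zip(a, h, y, q_eps))
        info = sum(ai * hi * hi * qi * (1.0 - qi) for ai, hi, qi in zip(a, h, q_eps))
        if abs(score) / len(y) < tol or info == 0.0:
            break
        eps += score / info
    return eps, [expit(o + eps * hi) for o, hi in zip(offset, h)]


# Simulated data; all numbers are made up for illustration.
random.seed(2)
n = 1000
w = [random.random() for _ in range(n)]
gbar = [0.2 + 0.6 * wi for wi in w]                 # propensity score, taken as known
a = [1 if random.random() < gi else 0 for gi in gbar]
y = [min(1.0, max(0.0, 0.2 + 0.6 * wi + random.gauss(0.0, 0.2))) for wi in w]

qbar_init = [0.5] * n                               # deliberately poor initial fit
eps_n, qbar_star = fluctuate_outcome_regression(y, a, gbar, qbar_init)

# Score equation solved by the update, and the resulting plug-in estimate.
mean_score = sum(ai / gi * (yi - qi) for ai, gi, yi, qi in zip(a, gbar, y, qbar_star)) / n
psi_plug_in = sum(qbar_star) / n
```

A linear fluctuation $\bar{Q}+\epsilon A/\bar{g}$ would solve the same kind of score equation but could return fitted values outside $[0,1]$ when $\bar{g}$ is small; the logistic form rules this out by construction.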

Theorem 2

Iterative targeted MLE of ψ0:

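Definitions: Given $\bar{Q},\bar{g}$, let $\bar{Q}_n^r(\bar{Q},\bar{g})$ be a consistent estimator of the regression $\bar{Q}_0^r(\bar{Q},\bar{g})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$ of $(Y-\bar{Q})$ on $\bar{g}(W)$ and $A=1$. Let $(g_n,\bar{Q}_n)$ be an initial estimator of $(g_0,\bar{Q}_0)$.

Initialization: Let $g_n^0=g_n$, $\bar{Q}_n^0=\bar{Q}_n$, and $\bar{Q}_n^{r,0}=\bar{Q}_n^r(\bar{Q}_n^0,\bar{g}_n^0)$. Let $k=0$.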

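Updating step for $g_n^k$: Consider the submodel $\operatorname{Logit}\bar{g}_n^k(\epsilon)=\operatorname{Logit}\bar{g}_n^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)$, and fit $\epsilon$ with the MLE $\epsilon_n=\arg\max_\epsilon P_n\log g_n^k(\epsilon)$.

We define $g_n^{k+1}=g_n^k(\epsilon_n)$ as the corresponding update of $g_n^k$. This $g_n^{k+1}$ satisfies $$\frac{1}{n}\sum_{i=1}^n H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)(W_i)\,\bigl(A_i-\bar{g}_n^{k+1}(W_i)\bigr)=0.$$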

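Updating step for $\bar{Q}_n^k$: Let $L(\bar{Q})(O)\equiv-\{Y\log\bar{Q}(A,W)+(1-Y)\log(1-\bar{Q}(A,W))\}$ be the quasi-log-likelihood loss function for $\bar{Q}_0=E_0(Y\mid A=1,W)$ (allowing that $Y$ is continuous in $[0,1]$). Consider the submodel $\operatorname{Logit}\bar{Q}_n^k(\epsilon)=\operatorname{Logit}\bar{Q}_n^k+\epsilon H_Y(g_n^k)$, and let $\epsilon_n=\arg\min_\epsilon P_nL(\bar{Q}_n^k(\epsilon))$. Define $\bar{Q}_n^{k+1}=\bar{Q}_n^k(\epsilon_n)$ as the resulting update. Define $\bar{Q}_n^{r,k+1}=\bar{Q}_n^r(\bar{Q}_n^{k+1},\bar{g}_n^{k+1})$.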

Iterating till convergence: Now, set $k\leftarrow k+1$, and iterate this updating process mapping a $(g_n^k,\bar{Q}_n^k,\bar{Q}_n^{r,k})$ into $(g_n^{k+1},\bar{Q}_n^{k+1},\bar{Q}_n^{r,k+1})$ till convergence or till large enough $K$ so that the estimating equations (2) below are solved up till an $o_P(1/\sqrt{n})$-term. Denote the limit of this iterative procedure with $(g_n^*,\bar{Q}_n^*,\bar{Q}_n^{r*})$.

Plug-in estimator: Let $Q_n^*=(Q_{W,n},\bar{Q}_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Estimating equations solved by TMLE: This TMLE $(Q_n^*,g_n^*,\bar{Q}_n^{r*})$ solves $$P_nD^*(Q_n^*,g_n^*)=0,\qquad P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*)=0.\tag{2}$$

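Empirical process condition: Assume that $D^*(Q_n^*,g_n^*)$, $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n\to\infty$.

Negligibility of second-order terms: Define $$\begin{aligned}\bar{Q}_{0,n}^r(W)&\equiv E_{P_0}\bigl(Y-\bar{Q}(1,W)\mid A=1,\bar{g}_n^*(W),\bar{g}_0(W)\bigr),\\ \bar{Q}_0^r(W)&\equiv E_{P_0}\bigl(Y-\bar{Q}(1,W)\mid A=1,\bar{g}_0(W)\bigr),\\ H_{0,n}^r&=\bar{Q}_{0,n}^r/\bar{g}_n^*,\\ H_0^r&=\bar{Q}_0^r/\bar{g}_0,\end{aligned}$$

where $\bar{g}_n^*(W)$ is treated as a fixed covariate (i.e. function of $W$) in the conditional expectation $\bar{Q}_{0,n}^r$. Assume that there exists a $\delta>0$, so that $\bar{g}_n^*>\delta>0$ with probability tending to 1, and $$\begin{aligned}\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1),\\ \|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1),\\ \|\bar{g}_n^*-\bar{g}_0\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}_0\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{g}_n^*-\bar{g}_0\|_0^2&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}_0\|_0&=o_P(1/\sqrt{n}).\end{aligned}$$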

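Then, $$\Psi(Q_n^*)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where $IC(P_0)=D^*(Q,g_0)-D_A(\bar{Q}_0^r,\bar{g}_0)$.

Thus, under the assumptions of this theorem, an asymptotic 0.95-confidence interval is given by $\psi_n^*\pm1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2=P_nIC_n^2$, and $IC_n=D^*(Q_n^*,g_n^*)-D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$.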
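To make the updating step for the treatment mechanism concrete, the following is a minimal Python sketch of a single logistic fluctuation along a clever covariate, fitted by one-dimensional Newton iterations. It assumes the clever covariate has the form $\bar{Q}^r/\bar{g}$ (the form of $H_n^r$ used in the Appendix). The simulated data, the initial fit `g_init`, and the stand-in `Q_r` for the residual regression are hypothetical choices made only for illustration; in practice the residual regression would come from a non-parametric fit such as a super-learner.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def target_g(A, g_bar, H, n_iter=50, tol=1e-12):
    """One fluctuation of g_bar along the clever covariate H.

    Fits eps in  Logit g(eps) = Logit g_bar + eps * H  by maximum likelihood,
    using Newton steps on the (concave) Bernoulli log-likelihood.
    """
    offset = logit(g_bar)
    eps = 0.0
    for _ in range(n_iter):
        p = expit(offset + eps * H)
        score = np.mean(H * (A - p))            # derivative of the log-likelihood
        info = np.mean(H * H * p * (1.0 - p))   # minus the second derivative
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps, expit(offset + eps * H)

# Hypothetical simulated data (illustration only).
rng = np.random.default_rng(0)
n = 2000
W = rng.uniform(-1.0, 1.0, n)
A = rng.binomial(1, expit(0.8 * W))

g_init = expit(0.3 * W + 0.2)   # deliberately imperfect initial fit of g_0
Q_r = 0.5 + 0.3 * W             # hypothetical stand-in for the residual regression
H = Q_r / g_init                # assumed clever covariate Q^r / g

eps_n, g_new = target_g(A, g_init, H)
score_after = np.mean(H * (A - g_new))
print(eps_n, score_after)
```

After the update, the empirical mean of the clever covariate times $(A-\bar{g}_n^{k+1})$ is zero up to numerical precision, which is the equation stated in the updating step for $g_n^k$.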

3.2 Using a δ-specific submodel for targeting g that guarantees the positivity condition

The following is an application of the constrained logistic regression approach of the type presented in Gruber and van der Laan [19] for the purpose of estimation of $\bar{g}_0$ respecting the constraint that $\bar{g}_0>\delta>0$ for a known $\delta>0$. Recall that $A\in\{0,1\}$. Suppose that it is known that $\bar{g}_0(W)\in(\delta,1]$ for some $\delta>0$, a condition the asymptotic linearity of our proposed estimators relies upon. Define $A_\delta\equiv\frac{A-\delta}{1-\delta}$. We have $\bar{g}_0(W)=\delta+(1-\delta)\bar{g}_{0,\delta}$, where $\bar{g}_{0,\delta}=E_0(A_\delta\mid W)$ is a regression that is known to take values in $[0,1]$. Let $g_{\delta,n}^0$ be an initial estimator of the true conditional distribution $g_{\delta,0}$ of $A_\delta$, given $W$, which implies an estimator $\bar{g}_n^0=\delta+(1-\delta)\bar{g}_{\delta,n}^0$ of $\bar{g}_0$. Let $k=0$. Consider the following submodel for the conditional distribution of $A_\delta$, given $W$, through a given estimator $g_{\delta,n}^k$: $$\operatorname{Logit}\bar{g}_{\delta,n}^k(\epsilon)=\operatorname{Logit}\bar{g}_{\delta,n}^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_{\delta,n}^k).$$

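The MLE is simply obtained with logistic regression of $A_\delta$ on $W$ (see, e.g. Gruber and van der Laan [19]) based on the quasi-log-likelihood loss function: $$\epsilon_n=\arg\min_\epsilon P_nL(\bar{g}_{\delta,n}^k(\epsilon)),$$

where $$L(\bar{g}_\delta)(O)=-\{A_\delta\log\bar{g}_\delta(W)+(1-A_\delta)\log(1-\bar{g}_\delta(W))\}$$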

is the quasi-log-likelihood loss. The update $\bar{g}_{\delta,n}^{k+1}=\bar{g}_{\delta,n}^k(\epsilon_n)$ implies an update $\bar{g}_n^{k+1}=\delta+(1-\delta)\bar{g}_{\delta,n}^{k+1}$ of $\bar{g}_n^k=\delta+(1-\delta)\bar{g}_{\delta,n}^k$, and, by construction, $\bar{g}_n^{k+1}>\delta>0$. The above submodel $\bar{g}_n^k(\epsilon)=\delta+(1-\delta)\bar{g}_{\delta,n}^k(\epsilon)$ and corresponding loss function $L(\bar{g})=L(\bar{g}_\delta)$ generate the same score equation as the submodel and loss function used in Theorem 2. Therefore, the TMLE algorithm presented in Theorem 2 but now using this $\delta$-specific logistic regression model solves the same estimating equations, so that the same Theorem 2 immediately applies. However, using this submodel we have now guaranteed that $\bar{g}_n^k>\delta>0$ for all $k$ in the iterative TMLE algorithm, and thereby that $\bar{g}_n^*>\delta>0$.
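The rescaling argument of this subsection can be sketched in a few lines of Python. The sketch below fluctuates the rescaled fit on the logit scale and maps it back, so the updated propensity score cannot drop to $\delta$ or below. The simulated data, the value `delta = 0.1`, the initial fit, and the stand-in `Q_r` for the residual regression are all hypothetical, and the clever covariate is assumed to be `Q_r / g_init`.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def target_g_bounded(A, g_bar, H, delta, n_iter=100):
    """Fluctuate g_bar on the delta-rescaled scale so the update stays above delta."""
    A_delta = (A - delta) / (1.0 - delta)       # rescaled treatment indicator
    g_delta = (g_bar - delta) / (1.0 - delta)   # rescaled fit, must lie in (0, 1)
    offset = logit(g_delta)
    eps = 0.0
    for _ in range(n_iter):                     # Newton steps on the quasi-log-likelihood loss
        p = expit(offset + eps * H)
        score = np.mean(H * (A_delta - p))
        info = np.mean(H * H * p * (1.0 - p))
        eps += score / info
    # Map the fluctuated rescaled fit back to the original scale.
    return delta + (1.0 - delta) * expit(offset + eps * H)

# Hypothetical simulated data (illustration only).
rng = np.random.default_rng(1)
n = 5000
delta = 0.1
W = rng.uniform(-2.0, 2.0, n)
g0 = delta + (1.0 - delta) * expit(2.0 * W)          # true propensity, bounded below by delta
A = rng.binomial(1, g0)

g_init = delta + (1.0 - delta) * expit(1.5 * W)      # imperfect initial fit, also above delta
Q_r = 0.5 + 0.1 * W                                  # hypothetical stand-in for the residual regression
H = Q_r / g_init                                     # assumed clever covariate

g_new = target_g_bounded(A, g_init, H, delta)
score_original_scale = np.mean(H * (A - g_new))
print(g_new.min(), score_original_scale)
```

Because $A-\bar{g}=(1-\delta)(A_\delta-\bar{g}_\delta)$, solving the score equation on the rescaled scale also solves it on the original scale, which is why the last printed quantity is numerically zero while the minimum of the updated fit stays above `delta`.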

4 Double robust statistical inference for TMLE when using super-learning to fit outcome regression and treatment mechanism

In this section, our aim is to present a TMLE that is asymptotically linear with known influence curve if either g0 or Q0 is consistently estimated, but we do not need to know which one. Again, this requires a novel way of targeting the estimators gn,Qˉn in order to arrange that the relevant smooth functionals of these nuisance parameter estimators are indeed asymptotically linear under appropriate second-order term conditions. In this case, we also need to augment the submodel for the estimator of Qˉ0 with another clever covariate: that is, our estimator of Qˉ0 needs to be double targeted, once for solving the efficient influence curve equation, but also for achieving asymptotic linearity in the case that the estimator of g0 is misspecified.

Theorem 3

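Definitions: For any given $\bar{g},\bar{Q}$, let $\bar{g}_n^r(\bar{g},\bar{Q})$ and $\bar{Q}_n^r(\bar{g},\bar{Q})$ be consistent estimators of $\bar{g}_0^r(\bar{g},\bar{Q})=E_{P_0}(A\mid\bar{Q},\bar{g})$ and $\bar{Q}_0^r(\bar{g},\bar{Q})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$, respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let $\bar{Q}_n^{r*}=\bar{Q}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ and $\bar{g}_n^{r*}=\bar{g}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ denote these estimators applied to the TMLEs $(\bar{g}_n^*,\bar{Q}_n^*)$ defined below.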

Iterative targeted MLE of ψ0:

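Initialization: Let $(g_n,\bar{Q}_n)$ be an initial estimator of $(g_0,\bar{Q}_0)$. Let $g_n^0=g_n$, $\bar{Q}_n^0=\bar{Q}_n$ and let $k=0$. Let $\bar{g}_n^{r,k}=\bar{g}_n^r(\bar{g}_n^k,\bar{Q}_n^k)$ be obtained by non-parametrically regressing $A$ on $\bar{Q}_n^k,\bar{g}_n^k$. Let $\bar{Q}_n^{r,k}=\bar{Q}_n^r(\bar{g}_n^k,\bar{Q}_n^k)$ be obtained by non-parametrically regressing $Y-\bar{Q}_n^k$ on $A=1,\bar{g}_n^k$.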

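Updating step: Consider the submodel $\operatorname{Logit}\bar{g}_n^k(\epsilon)=\operatorname{Logit}\bar{g}_n^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)$, and fit $\epsilon$ with the MLE $\epsilon_{A,n}=\arg\max_\epsilon P_n\log g_n^k(\epsilon)$.

Define the submodel $\operatorname{Logit}\bar{Q}_n^k(\epsilon)=\operatorname{Logit}\bar{Q}_n^k+\epsilon_1H_Y(\bar{g}_n^k)+\epsilon_2H_Y^1(\bar{g}_n^{r,k},\bar{g}_n^k)$, where $$H_Y^1(\bar{g}^r,\bar{g})\equiv\frac{A}{\bar{g}^r}\,\frac{\bar{g}^r-\bar{g}}{\bar{g}}.$$

Let $\epsilon_{Y,n}=\arg\min_\epsilon P_nL(\bar{Q}_n^k(\epsilon))$ be the MLE, where $L(\bar{Q})$ is the quasi-log-likelihood loss.

We define $g_n^{k+1}=g_n^k(\epsilon_{A,n})$ as the corresponding targeted update of $g_n^k$, and $\bar{Q}_n^{k+1}=\bar{Q}_n^k(\epsilon_{Y,n})$ as the corresponding update of $\bar{Q}_n^k$. Let $\bar{g}_n^{r,k+1}=\bar{g}_n^r(\bar{g}_n^{k+1},\bar{Q}_n^{k+1})$ and $\bar{Q}_n^{r,k+1}=\bar{Q}_n^r(\bar{g}_n^{k+1},\bar{Q}_n^{k+1})$.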

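Iterate till convergence: Now, set $k\leftarrow k+1$, and iterate this updating process mapping a $(g_n^k,\bar{Q}_n^k,\bar{g}_n^{r,k},\bar{Q}_n^{r,k})$ into $(g_n^{k+1},\bar{Q}_n^{k+1},\bar{g}_n^{r,k+1},\bar{Q}_n^{r,k+1})$ till convergence or till large enough $K$ so that the following three estimating equations are solved up till an $o_P(1/\sqrt{n})$-term: $$\begin{aligned}P_nD^*(Q_n^K,g_n^K)&=o_P(1/\sqrt{n}),\\ P_nD_A(\bar{Q}_n^{r,K},\bar{g}_n^K)&=o_P(1/\sqrt{n}),\\ P_nD_Y(\bar{Q}_n^K,\bar{g}_n^{r,K},\bar{g}_n^K)&=o_P(1/\sqrt{n}),\end{aligned}$$

where $D_Y(\bar{Q},\bar{g}_0^r,\bar{g})=H_Y^1(\bar{g}_0^r,\bar{g})(Y-\bar{Q})$.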

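Final substitution estimator: Denote the limits of this iterative procedure with $\bar{Q}_n^{r*},\bar{g}_n^{r*},g_n^*,\bar{Q}_n^*$. Let $Q_n^*=(Q_{W,n},\bar{Q}_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Equations solved by TMLE: $$\begin{aligned}o_P(1/\sqrt{n})&=P_nD^*(Q_n^*,g_n^*),\\ o_P(1/\sqrt{n})&=P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*),\\ o_P(1/\sqrt{n})&=P_nD_Y(\bar{Q}_n^*,\bar{g}_n^{r*},\bar{g}_n^*).\end{aligned}$$

Empirical process condition: Assume that $D^*(Q_n^*,g_n^*)$, $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$, $D_Y(\bar{Q}_n^*,\bar{g}_n^{r*},\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n\to\infty$.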

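Negligibility of second-order terms: Define $\bar{Q}_{0,n}^r=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)$ and $\bar{g}_{0,n}^r=E_{P_0}(A\mid\bar{g},\bar{Q},\bar{Q}_n^*)$. Assume that there exists a $\delta>0$ so that $\bar{g}_n^*>\delta>0$ with probability tending to 1, that $\bar{g}_n^*,\bar{Q}_n^*$ are consistent for $\bar{g},\bar{Q}$ w.r.t. the $\|\cdot\|_0$-norm, where either $\bar{g}=\bar{g}_0$ or $\bar{Q}=\bar{Q}_0$, and assume that the following consistency and second-order conditions hold: $$\begin{aligned}\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1),\\ \|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1),\\ \|\bar{g}_n^{r*}-\bar{g}_0^r\|_0&=o_P(1),\\ \|\bar{g}_n^*-\bar{g}\|_0^2&=o_P(1/\sqrt{n}),\\ \|\bar{g}_n^*-\bar{g}\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{g}_n^{r*}-\bar{g}_0^r\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}).\end{aligned}$$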

Then, $$\Psi(Q_n^*)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where $IC(P_0)=D^*(Q,g)-D_A(\bar{Q}_0^r,\bar{g})-D_Y(\bar{Q},\bar{g}_0^r,\bar{g})$.

Note that consistent estimation of the influence curve $IC(P_0)$ relies on consistency of $\bar{g}_n^{r*},\bar{Q}_n^{r*}$ as estimators of $\bar{g}_0^r,\bar{Q}_0^r$, and estimators $\bar{Q}_n^*,\bar{g}_n^*$ converging to a $\bar{Q},\bar{g}$ for which either $\bar{Q}=\bar{Q}_0$ or $\bar{g}=\bar{g}_0$. These estimators imply an estimated influence curve $IC_n$. An asymptotic 0.95-confidence interval is given by $\psi_n^*\pm1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2=P_nIC_n^2$.

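If $\bar{g}=\bar{g}_0$, then $E_{P_0}(A\mid\bar{g},\bar{Q})=\bar{g}$, and therefore $D_Y(\bar{Q},\bar{g}_0^r,\bar{g})=0$ for all $\bar{Q}$. If $\bar{Q}=\bar{Q}_0$, then it follows that $\bar{Q}_0^r=0$, and thus that $D_A(\bar{Q}_0^r,\bar{g})=0$ for all $\bar{g}$. In particular, if both $\bar{g}=\bar{g}_0$ and $\bar{Q}=\bar{Q}_0$, then $IC(P_0)=D^*(Q_0,g_0)$. We also note that if $\bar{g}\neq\bar{g}_0$, but $\bar{g}$ is a true conditional distribution of $A$, given some function $W^r$ of $W$ for which $\bar{Q}(W)$ is only a function of $W^r$, then it follows that $E_{P_0}(A\mid\bar{g},\bar{Q})=\bar{g}$ and thus $D_Y=0$.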

As shown in the final remark of the Appendix, the condition of Theorem 3 that either $g=g_0$ or $\bar{Q}=\bar{Q}_0$ can be weakened to $(\bar{g},\bar{Q})$ having to satisfy $P_0(\bar{Q}-\bar{Q}_0)(\bar{g}-\bar{g}_0)/\bar{g}=0$, allowing for the analysis of collaborative double robust TMLE, as discussed in the next section. However, as shown in the next section, if one arranges in the TMLE algorithm that $\bar{g}_n^*=\bar{g}_n^{r*}$ (i.e. $\bar{g}_n^*$ already non-parametrically adjusts for $\bar{Q}_n^*$), then there is no need for the extra targeting in $\bar{Q}_n^k$, and the influence curve will be $D^*(Q,g)-D_A(\bar{Q}_0^r,\bar{g})$.
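The double targeting of the outcome regression in Theorem 3 amounts to a logistic fluctuation with two clever covariates. The Python sketch below fits the two-dimensional fluctuation parameter by Newton iterations on the quasi-log-likelihood loss and then checks that both empirical score equations are solved. The simulated data, the constant (misspecified) propensity fit `g_bar`, the initial outcome fit `Q_init`, and the stand-in `g_r` for the non-parametric regression of $A$ on the current fits are hypothetical choices for illustration only.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fluctuate_Q(Y, Q_bar, H_obs, n_iter=100):
    """Fit eps in  Logit Q(eps) = Logit Q_bar + H_obs @ eps  by minimizing
    the quasi-log-likelihood loss with Newton steps (Y may be continuous in [0, 1])."""
    n = len(Y)
    offset = logit(Q_bar)
    eps = np.zeros(H_obs.shape[1])
    for _ in range(n_iter):
        p = expit(offset + H_obs @ eps)
        score = H_obs.T @ (Y - p) / n
        info = (H_obs * (p * (1.0 - p))[:, None]).T @ H_obs / n
        eps = eps + np.linalg.solve(info, score)
    return eps

# Hypothetical simulated data (illustration only).
rng = np.random.default_rng(2)
n = 3000
W = rng.uniform(-1.0, 1.0, n)
A = rng.binomial(1, expit(0.8 * W))
Y = expit(W + 0.5 * A + rng.normal(0.0, 0.5, n))   # outcome in (0, 1)

Q_init = expit(0.5 * W)        # imperfect initial fit of E(Y | A = 1, W)
g_bar = np.full(n, 0.55)       # deliberately misspecified propensity fit
g_r = expit(0.8 * W)           # hypothetical stand-in for the regression of A on the current fits

# Clever covariates evaluated at the observed A: H_Y and H_Y^1.
H_Y = A / g_bar
H_Y1 = (A / g_r) * (g_r - g_bar) / g_bar
H_obs = np.column_stack([H_Y, H_Y1])

eps = fluctuate_Q(Y, Q_init, H_obs)

# Updated fit of E(Y | A = 1, W): covariates evaluated at A = 1 for every subject.
H_at_1 = np.column_stack([1.0 / g_bar, (1.0 / g_r) * (g_r - g_bar) / g_bar])
Q_new = expit(logit(Q_init) + H_at_1 @ eps)

scores = H_obs.T @ (Y - Q_new) / n   # both entries should be numerically zero
print(eps, scores)
```

The first score corresponds to the efficient influence curve equation and the second to the extra equation $P_nD_Y=0$; the second covariate vanishes wherever `g_r` equals `g_bar`, in line with the remark that $D_Y=0$ when $\bar{g}=\bar{g}_0$.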

5 Collaborative double robust inference for C-TMLE when using super-learning to fit outcome regression and reduced treatment mechanism

We first review the theoretical underpinning for collaborative estimation of nuisance parameters, in this case, the outcome regression and treatment mechanism. Subsequently, we explain that the desired collaborative estimation can be achieved by applying the previously established template for construction of a C-TMLE to a TMLE that solves certain estimating equations when given an initial estimator of $(Q_0,g_0)$. This C-TMLE template involves (1) creating a sequence of TMLEs $((g_{n,k},Q_{n,k}):k=1,\ldots,K)$ constructed in such a manner that the empirical risk of both $g_{n,k}$ and $Q_{n,k}$ is decreasing in $k$, and (2) using cross-validation to select the $k$ for which $Q_{n,k}$ is the best fit of $Q_0$. Subsequently, we present this TMLE that maps an initial estimator of $(Q_0,g_0)$ into targeted estimators solving the desired estimating equations and establish its asymptotic linearity under appropriate conditions, including that the initial estimator of $(Q_0,g_0)$ is collaboratively consistent. Finally, we present a concrete C-TMLE algorithm that uses this TMLE algorithm as its basis, so that our theorem can be applied to this C-TMLE: a C-TMLE is still a TMLE, but it is a TMLE based on a data adaptively selected initial estimator that is collaboratively consistent, so that we can apply the same theorem to this C-TMLE.

5.1 Motivation and theoretical underpinning of collaborative double robust estimation of nuisance parameters

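We note that $$P_0D^*(Q,g)=P_0\left\{\frac{A}{\bar{g}}(\bar{Q}_0-\bar{Q})+\bar{Q}\right\}-\Psi(Q).$$ If $Q_W=Q_{W,0}$, this reduces to $$P_0D^*(Q,g)=P_0\frac{A}{\bar{g}}(\bar{Q}_0-\bar{Q})=\Psi(Q_0)-\Psi(Q)+P_0\frac{A-\bar{g}}{\bar{g}}(\bar{Q}_0-\bar{Q}).$$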

Let $\mathcal{G}$ be the class of all possible distributions of $A$, given $W$, and let $g_0\in\mathcal{G}$ be the true conditional distribution of $A$ given $W$. We define the set $$\mathcal{G}(P_0,\bar{Q})\equiv\left\{g\in\mathcal{G}:0=P_0(A-\bar{g})\frac{\bar{Q}_0-\bar{Q}}{\bar{g}}\right\}.$$ For any $g\in\mathcal{G}(P_0,\bar{Q})$, we have $P_0D^*(Q,g)=\Psi(Q_0)-\Psi(Q)$. Suppose we have an estimator $(Q_n^*,g_n)$ satisfying $P_nD^*(Q_n^*,g_n)=0$ and converging to a $(Q,g)$ so that $g\in\mathcal{G}(P_0,\bar{Q})$. Then it follows that $P_0D^*(Q,g)=0$ and $P_0D^*(Q,g)=\Psi(Q_0)-\Psi(Q)$, thereby establishing that $\Psi(Q_n^*)$ is a consistent estimator of $\Psi(Q_0)$. Let us state this crucial result as a lemma.

Lemma 1 (van der Laan and Gruber [33]) If $P_0(A-\bar{g})(\bar{Q}_0-\bar{Q})/\bar{g}=0$, and $P_0D^*(Q,g)=0$, then $\Psi(Q)=\psi_0$. More generally, $$P_0D^*(Q,g)=\Psi(Q_0)-\Psi(Q)+P_0(A-\bar{g})(\bar{Q}_0-\bar{Q})/\bar{g}.$$

We note that $\mathcal{G}(P_0,\bar{Q})$ contains the true conditional distributions $g_0^r$ of $A$, given $W^r$, for which $(\bar{Q}-\bar{Q}_0)/\bar{g}_0^r$ is a function of $W^r$, i.e. for which $\bar{Q}-\bar{Q}_0$ only depends on $W$ through $W^r$. We refer to such distributions as reduced treatment mechanisms. However, it contains many more conditional distributions since any conditional distribution $g$ for which $(A-\bar{g}(W))$ is orthogonal to $(\bar{Q}_0-\bar{Q})/\bar{g}$ in $L_0^2(P_0)$ is an element of $\mathcal{G}(P_0,\bar{Q})$. We refer to van der Laan and Gruber [33] and Gruber and van der Laan [29] for the introduction and general notion of collaborative double robustness.
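The identity of Lemma 1 and the role of reduced treatment mechanisms can be checked exactly on a small discrete example. The Python sketch below takes $W=(W_1,W_2)$ on four support points with hypothetical probabilities and regressions, assumes $Q_W=Q_{W,0}$, and uses a misspecified $\bar{Q}$ whose error depends on $W$ only through $W_1$. It then compares an arbitrary wrong $\bar{g}$ with the reduced mechanism $E_0(A\mid W_1)$. All numbers are made up for illustration.

```python
import numpy as np

# Support points of W = (W1, W2); only W1 is needed below.
w1 = np.array([0, 0, 1, 1])
p_w = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical P0(W = w)
g0 = np.array([0.3, 0.6, 0.5, 0.8])    # hypothetical true P0(A = 1 | W)
Q0 = np.array([0.2, 0.4, 0.5, 0.9])    # hypothetical true E0(Y | A = 1, W)

def psi(Q):
    """Treatment specific mean implied by Q, with the true marginal of W."""
    return float(np.sum(p_w * Q))

def P0_D_star(Q, g):
    """P0 D*(Q, g) = P0 {A / g (Q0 - Q) + Q} - Psi(Q), using E0(A | W) = g0."""
    return float(np.sum(p_w * (g0 / g * (Q0 - Q) + Q)) - psi(Q))

def cross_term(Q, g):
    """P0 (A - g)(Q0 - Q) / g."""
    return float(np.sum(p_w * (g0 - g) / g * (Q0 - Q)))

# Misspecified Q whose error Q - Q0 depends on W only through W1.
Q = Q0 + np.where(w1 == 1, 0.05, -0.10)

# An arbitrary wrong g, and the reduced treatment mechanism E0(A | W1).
g_wrong = np.full(4, 0.5)
g_reduced = np.array(
    [np.sum(p_w[w1 == v] * g0[w1 == v]) / np.sum(p_w[w1 == v]) for v in w1]
)

bias = psi(Q0) - psi(Q)
print(P0_D_star(Q, g_wrong), bias + cross_term(Q, g_wrong))   # equal: general identity
print(cross_term(Q, g_reduced))                                # zero: g_reduced is in G(P0, Q)
print(P0_D_star(Q, g_reduced), bias)                           # equal: P0 D* isolates the bias
```

For the reduced mechanism the cross term vanishes even though it differs from the true $\bar{g}_0$, so an estimator solving $P_0D^*(Q,g)=0$ at such a limit would have $\Psi(Q)=\psi_0$.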

5.2 C-TMLE

The general C-TMLE introduced in van der Laan and Gruber [33] provides a template for construction of a TMLE $(g_n^*,\bar{Q}_n^*)$ satisfying $P_nD^*(Q_n^*,g_n^*)=0$ and converging to a $(g,\bar{Q})$ with $g\in\mathcal{G}(P_0,\bar{Q})$ so that $P_0D^*(Q,g)=0$ and thereby $\Psi(Q)-\Psi(Q_0)=0$. Thus C-TMLE provides a template for construction of targeted MLEs that exploit the collaborative double robustness of TMLEs in the sense that a TMLE will be consistent as long as $(Q_n^*,g_n^*)$ converges to a $(Q,g)$ for which $g\in\mathcal{G}(P_0,\bar{Q})$. The goal is not to estimate the true treatment mechanism, but instead to construct a $g_n^*$ that converges to a conditional distribution given a reduction $W^r$ of $W$ that is an element of $\mathcal{G}(P_0,\bar{Q})$. We could state that, just as the propensity score provides a sufficient dimension reduction for the outcome regression, so does, given $\bar{Q}$, $(\bar{Q}-\bar{Q}_0)$ provide a sufficient dimension reduction for the propensity score regression in the TMLE. The current literature appears to agree that propensity score estimators are best evaluated with respect to their effect on estimation of the causal effect of interest, not by metrics such as likelihoods or classification rates [45–48], and the above-stated general collaborative double robustness provides a formal foundation for such claims.

The general C-TMLE has been implemented and applied to point treatment and longitudinal data [20, 29–33, 35]. A C-TMLE algorithm relies on a TMLE algorithm that maps an initial $(\bar{Q}_n,g_n)$ into a TMLE $(\bar{Q}_n^*,g_n^*)$ and uses this algorithm in combination with a targeted variable selection algorithm for generating candidate models for the propensity score to generate a sequence of candidate TMLEs $(g_n^{k*},\bar{Q}_n^{k*})$, increasingly non-parametric in $k$, and finally uses cross-validation to select the best TMLE among these candidate estimators of $\bar{Q}_0$.

5.3 A TMLE that allows for collaborative double robust inference

Our next theorem presents a TMLE algorithm and a corresponding influence curve under the assumption that the propensity score correctly adjusts for the possibly misspecified $\bar{Q}$ and $\bar{Q}_0-\bar{Q}=E_0(Y-\bar{Q}(W)\mid A=1,W)$. The presented TMLE algorithm already arranges that this TMLE indeed non-parametrically adjusts for $\bar{Q}$. In the next subsection, we will present an actual C-TMLE algorithm that generates a TMLE for which the propensity score is targeted to adjust for $\bar{Q}-\bar{Q}_0$, so that this theorem can be applied.

Theorem 4

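Definitions: For any given $\bar{g},\bar{Q}$, let $\bar{g}_n^r(\bar{g},\bar{Q})$ and $\bar{Q}_n^r(\bar{g},\bar{Q})$ be consistent estimators of $\bar{g}_0^r(\bar{g},\bar{Q})=E_{P_0}(A\mid\bar{g},\bar{Q})$ and $\bar{Q}_0^r(\bar{g},\bar{Q})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$, respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let $\bar{Q}_n^{r*}=\bar{Q}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ and $\bar{g}_n^{r*}=\bar{g}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ denote these estimators applied to the TMLE $(\bar{g}_n^*,\bar{Q}_n^*)$ defined below.

“Score” equations the TMLE should solve: Below, we describe an iterative TMLE algorithm that results in estimators $\bar{g}_n^{r*},\bar{Q}_n^{r*}$, $g_n^*$, $\bar{Q}_n^*$ that solve the following equations: $$0=P_nD^*(Q_n^*,g_n^*),\qquad 0=P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*).\tag{3}$$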

Iterative targeted MLE of ψ0:

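Initialization: Let $\bar{Q}_n$ and $g_n$ (e.g. aiming to adjust for $\bar{Q}_n-\bar{Q}_0$) be initial estimators.

Let $\bar{Q}_n^0=\bar{Q}_n$, $\bar{g}_n^0=\bar{g}_n^r(\bar{g}_n,\bar{Q}_n^0)$, and $\bar{Q}_n^{r,0}=\bar{Q}_n^r(\bar{g}_n^0,\bar{Q}_n^0)$.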

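Updating step: Consider the submodel $\operatorname{Logit}\bar{g}_n^k(\epsilon)=\operatorname{Logit}\bar{g}_n^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)$, and fit $\epsilon$ with the MLE $\epsilon_{A,n}=\arg\max_\epsilon P_n\log g_n^k(\epsilon)$.

Define the submodel $\operatorname{Logit}\bar{Q}_n^k(\epsilon)=\operatorname{Logit}\bar{Q}_n^k+\epsilon H_Y(g_n^k)$ and let $L(\bar{Q})$ be the quasi-log-likelihood loss function for $\bar{Q}_0$. Let $\epsilon_{Y,n}=\arg\min_\epsilon P_nL(\bar{Q}_n^k(\epsilon))$ be the MLE. Let $\bar{Q}_n^{k+1}=\bar{Q}_n^k(\epsilon_{Y,n})$, $\bar{g}_n^{k+1}=\bar{g}_n^r(\bar{g}_n^k(\epsilon_{A,n}),\bar{Q}_n^{k+1})$, and $\bar{Q}_n^{r,k+1}=\bar{Q}_n^r(\bar{g}_n^{k+1},\bar{Q}_n^{k+1})$.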

Iterating till convergence: Now, set $k\leftarrow k+1$ and iterate this updating process mapping a $(g_n^k,\bar{Q}_n^k,\bar{Q}_n^{r,k})$ into $(g_n^{k+1},\bar{Q}_n^{k+1},\bar{Q}_n^{r,k+1})$ till convergence or till large enough $K$ so that the following estimating equations are solved up till an $o_P(1/\sqrt{n})$-term: $$\begin{aligned}o_P(1/\sqrt{n})&=P_nD^*(Q_n^K,g_n^K),\\ o_P(1/\sqrt{n})&=P_nD_A(\bar{Q}_n^{r,K},\bar{g}_n^K).\end{aligned}$$

Final substitution estimator: Denote these limits (in $k$) of this iterative procedure with $g_n^*$, $\bar{Q}_n^*$, $\bar{Q}_n^{r*}$. Let $Q_n^*=(Q_{W,n},\bar{Q}_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Assumption on limits $\bar{g},\bar{Q}$ of $\bar{g}_n^*,\bar{Q}_n^*$: Assume that $(\bar{g}_n^*,\bar{Q}_n^*)$ is consistent for $(\bar{g},\bar{Q})$ w.r.t. the $\|\cdot\|_0$-norm, where $\bar{g}(W)=E_{P_0}(A\mid W^r)$ for some function $W^r(W)$ of $W$ for which $\bar{Q}$ only depends on $W$ through $W^r$, and assume that $P_0\frac{\bar{Q}-\bar{Q}_0}{\bar{g}}(A-\bar{g})=0$, where the latter holds, in particular, if $\bar{Q}-\bar{Q}_0$ only depends on $W$ through $W^r$ (e.g. $\bar{g}_n^*$ involves non-parametric adjustment by $\bar{Q},\bar{Q}_0$). As a consequence, we have $\bar{g}=\bar{g}_0^r$.

Empirical process condition: Assume that $D^*(Q_n^*,g_n^*)$, $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n\to\infty$.

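Negligibility of second-order terms: Define $\bar{Q}_{0,n}^r\equiv E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)$.

Assume that the following conditions hold for each of the following possible definitions of $\bar{g}_{0,n}^r$: $E_{P_0}(A\mid\bar{g},\bar{Q},\bar{Q}_n^*)$, $E_{P_0}(A\mid\bar{g},\bar{g}_n^*)$, $E_{P_0}(A\mid\bar{g},\bar{Q}_n^{r*},\bar{g}_n^*)$. Note that $\bar{g}_0^r=E_0(A\mid\bar{g},\bar{Q})=E_0(A\mid\bar{g})=\bar{g}$ is the limit of each of these choices for $\bar{g}_{0,n}^r$.

We assume that $\bar{g},\bar{g}_n^*$ are bounded from below by a $\delta>0$ with probability tending to one, and $$\begin{aligned}\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1),\\ \|\bar{g}_n^*-\bar{g}\|_0^2&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_n^*-\bar{Q}\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\ \|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1/\sqrt{n}).\end{aligned}$$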

Then, $$\Psi(Q_n^*)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where $IC(P_0)=D^*(Q,g)-D_A(\bar{Q}_0^r,\bar{g})$.

Thus, consistency of this TMLE relies upon the consistency of $\bar{Q}_n^{r*}$ as an estimator of $\bar{Q}_0^r$, and on the estimator $(\bar{Q}_n^*,\bar{g}_n^*)$ converging to a $(\bar{Q},\bar{g})$ for which $\bar{g}$ equals a true conditional mean of $A$, given $W^r$, and both $\bar{Q}_0-\bar{Q}$ and $\bar{Q}$ only depend on $W$ through $W^r$. Since $\bar{Q}_{0,n}^r-\bar{Q}_0^r$ depends on how well $\bar{g}_n^*$ approximates $\bar{g}$, $\bar{Q}_n^{r*}-\bar{Q}_0^r$ depends on how well $(\bar{Q}_n^*,\bar{g}_n^*)$ approximates $(\bar{Q},\bar{g})$, beyond the behavior of the non-parametric regression defining $\bar{Q}_n^r$. In addition, $\bar{g}_{0,n}^r-\bar{g}_0^r$ depends on either how well $\bar{g}_n^*$ approximates $\bar{g}$ or how well $\bar{Q}_n^*$ approximates $\bar{Q}$. As a consequence, it follows that each of the second-order terms displayed in the theorem is second order in the approximation errors $\bar{g}_n^*-\bar{g}$ and $\bar{Q}_n^*-\bar{Q}$.

It is also interesting to note that the algebraic form of the influence curve of this TMLE is identical to the influence curve of the TMLE of Theorem 2 that relied on gˉn being consistent for gˉ0.

5.4 A C-TMLE algorithm

The TMLE algorithm presented in Theorem 4 maps an initial estimator $(Q_n^0,g_n^0)$ into an updated estimator $(Q_n^*,g_n^*)$ that solves the two estimating equations (3), allowing for statistical inference with known influence curve if the initial estimator $(Q_n^0,g_n^0)$ is collaboratively consistent (i.e. the limits of $(Q_n^*,g_n^*)$ satisfy the condition in the theorem). The updating algorithm results in a $g_n^*$ that non-parametrically adjusts for $\bar{Q}_n^*$ itself, and thus for its limit $\bar{Q}$ in the limit. The condition on the limit $g$ was that it should non-parametrically adjust not only for $\bar{Q}$ but also for $\bar{Q}-\bar{Q}_0$. If the initial estimator $g_n^0$ already adjusted for an approximation of $\bar{Q}_n^0-\bar{Q}_0$, for example because $(g_n^0,Q_n^0)$ is already a C-TMLE, then this condition might hold approximately. Nonetheless, we want to present a C-TMLE algorithm that simultaneously fits $g$ in response to $\bar{Q}-\bar{Q}_0$, but also carries out the non-parametric adjustment by $\bar{Q}$. The latter is normally not part of the C-TMLE algorithm, but we want to enforce this in order to be able to apply Theorem 4 and thereby obtain a known influence curve. We achieve this goal in this subsection by applying the C-TMLE algorithm as presented by van der Laan and Gruber [49] to the particular TMLE algorithm presented in Theorem 4.

First, we compute a set of $K$ univariate covariates $W_1,\ldots,W_K$, i.e. functions of $W$, which we will refer to as main terms, even though a term could be an interaction term or a super-learning fit of the regression of $A$ on a subset of the components of $W$. Let $\Omega=\{W_1,\ldots,W_K\}$ be the full collection of main terms. In the previous subsection, we defined an algorithm that maps an initial $(Q,g)$ into a TMLE $(Q^*,g^*)$. Let $O\to L(Q)(O)$ be the loss function for $Q_0$.

The general template of a C-TMLE algorithm is the following: given a TMLE algorithm that maps any initial $(Q,g)$ into a TMLE $(Q^*,g^*)$, the C-TMLE algorithm generates a sequence of increasing sets $S^k\subset\Omega$ of $k$ main terms, where each set $S^k$ has an associated estimator $g^k$ of $g_0$, and simultaneously it generates a corresponding sequence of $Q^k$, $k=1,\ldots,K$, where both $g^k$ and $Q^k$ are increasingly non-parametric in $k$. Here increasingly non-parametric means that the empirical mean of the loss function of the fit is decreasing in $k$. This sequence $(g^k,Q^k)$ maps into a corresponding sequence of TMLEs $(g^{k*},Q^{k*})$ using the TMLE algorithm presented in Theorem 4. In this variable selection algorithm, the choice of the next main term to add, mapping $S^k$ into $S^{k+1}$, is based on how much the TMLE using the $g$-fit implied by $S^{k+1}$, using $Q^k$ as initial estimator, improves the fit of the corresponding TMLE $Q^{k*}$ for $Q_0$. Cross-validation is used to select $k$ among these candidate TMLEs $Q^{k*}$, $k=1,\ldots,K$, where the last TMLE $Q^{K*}$ uses the most aggressive bias reduction by being based on the most non-parametric estimator $g^K$ implied by $\Omega$.

In order to present a precise C-TMLE algorithm we will first introduce some notation. For a given subset of main terms $S\subset\Omega$, let $S^c$ be its complement within $\Omega$. In the C-TMLE algorithm, we use a forward selection algorithm that augments a given set $S^k$ into a next set $S^{k+1}$ obtained by adding the best main term among all main terms in the complement $S^{k,c}$ of $S^k$. Each choice $S$ corresponds with an estimator of $g_0$. In other words, the algorithm iteratively updates a current estimate $g^k$ into a new estimate $g^{k+1}$, but the criterion for $g$ does not measure how well $g$ fits $g_0$; it measures how well the TMLE of $Q_0$ that uses this $g$ (and as initial estimator $Q^k$) fits $Q_0$.

Given a set $S^k$, an initial $g^{k-1},Q^{k-1}$, we define a corresponding $g^k$ obtained by MLE-fitting of $\beta$ in the logistic regression working model $$\operatorname{Logit}\bar{g}^k=\operatorname{Logit}\bar{g}_0^r(\bar{g}^{k-1},\bar{Q}^{k-1})+\sum_{j\in S^k}\beta_jW_j,$$

where we remind the reader of the definition $\bar{g}_0^r(\bar{g},\bar{Q})=E_0(A\mid\bar{Q}(W),\bar{g}(W))$. Thus, this estimator $g^k$ involves non-parametric adjustment by $\bar{g}^{k-1},\bar{Q}^{k-1}$, augmented with a linear regression component implied by $S^k$. This function mapping $S^k,g^{k-1},Q^{k-1}$ into a fit $g^k$ will be denoted with $g(S^k,g^{k-1},Q^{k-1})$. This also allows us to define a mapping from $(Q^k,S^k,Q^{k-1},g^{k-1})$ into a TMLE $(Q^{k*},g^{k*})$ defined by the TMLE algorithm of Theorem 4 applied to initial $Q^k$ and $g^k=g(S^k,g^{k-1},Q^{k-1})$. We will denote this mapping into $Q^{k*}$ with $\mathrm{TMLE}(Q^k,S^k,Q^{k-1},g^{k-1})$.

The C-TMLE algorithm defined below generates a sequence $(Q^k,S^k)$ and thereby corresponding TMLEs $(Q^{k*},g^{k*})$, $k=0,\ldots,K$, where $Q^k$ represents an initial estimate, $S^k$ a subset of main terms that defines $g^k$, and $Q^{k*},g^{k*}$ the corresponding TMLE that starts with $(Q^k,g^k)$. These TMLEs $Q^{k*}$ represent subsequent updates of the initial estimator $Q^0$. The corresponding main term set $S^k$ that defines $g^k$ in this $k$-specific TMLE increases in $k$, one unit at a time: $S^0$ is empty, $|S^{k+1}|=|S^k|+1$, $S^K=\Omega$. The C-TMLE uses cross-validation to select $k$, and thereby to select the TMLE $Q^{k*}$ that yields the best fit of $Q_0$ among the $K+1$ $k$-specific TMLEs $(Q^{k*}:k=0,\ldots,K)$ that are increasingly aggressive in their bias-reduction effort. This C-TMLE algorithm is defined as follows and uses the same format as presented in Wang et al. [35]:

Initiate algorithm: Set initial TMLE. Let $k=0$, and let $Q^k=Q^0$, $g^{start}$ be initial estimates of $Q_0$, $g_0$, and let $S^0$ be the empty set. Let $g^k=g(S^0,Q^0,g^{start})$. This defines an initial TMLE $Q^{0*}=\mathrm{TMLE}(Q^0,S^0,Q^0,g^0)$.

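Determine next TMLE. Determine the next best main term to add: $$S^{k+1,cand}=\arg\min_{\{S^k\cup W_j:\,W_j\in S^{k,c}\}}P_nL\bigl(\mathrm{TMLE}(Q^k,S^k\cup W_j,Q^{k-1},g^{k-1})\bigr).$$

If $$P_nL\bigl(\mathrm{TMLE}(Q^k,S^{k+1,cand},Q^{k-1},g^{k-1})\bigr)\le P_nL(Q^{k*}),$$

then $(S^{k+1}=S^{k+1,cand},Q^{k+1}=Q^k)$, else $Q^{k+1}=Q^{k*}$, and $$S^{k+1}=\arg\min_{\{S^k\cup W_j:\,W_j\in S^{k,c}\}}P_nL\bigl(\mathrm{TMLE}(Q^{k*},S^k\cup W_j,Q^{k-1},g^{k-1})\bigr).$$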

[In words: If the next best main term added to the fit of $E_{P_0}(A\mid W)$ yields a TMLE of $E_{P_0}(Y\mid A,W)$ that improves upon the previous TMLE $Q^{k*}$, then we accept this best main term, and we have our next $(Q^{k+1},S^{k+1})$ and corresponding TMLE $Q^{k+1*},g^{k+1*}$ (which still uses the same initial estimate of $Q_0$ as $Q^{k*}$ uses). Otherwise, reject this best main term, update the initial estimate in the candidate TMLEs to the previous TMLE $Q^{k*}$ of $E_{P_0}(Y\mid A,W)$, and determine the best main term to add again. This best main term will now always result in an improved fit of the corresponding TMLE of $Q_0$, so that we now have our next TMLE $Q^{k+1*},g^{k+1*}$ (which now uses a different initial estimate than $Q^{k*}$ used).]

Iterate. Run this from $k=1$ to $K$ at which point $S^K=\Omega$. This yields a sequence $(Q^k,g^k)$ and corresponding TMLE $(Q^{k*},g^{k*})$, $k=0,\ldots,K$.

This sequence of candidate TMLEs $Q^{k*}$ of $Q_0$ has the following property: the estimates $g^k$ are increasingly non-parametric in $k$ and $P_nL(Q^{k*})$ is decreasing in $k$, $k=0,\ldots,K$. It remains to select $k$. For that purpose we use $V$-fold cross-validation. That is, for each of the $V$ splits of the sample in a training and validation sample, we apply the above algorithm for generating a sequence of candidate estimates $(Q^{k*}:k)$ to a training sample, and we evaluate the empirical mean of the loss function at the resulting $Q^{k*}$ over the validation sample, for each $k=0,\ldots,K$. For each $k$ we take the average over the $V$ splits of the $k$-specific performance measure over the validation sample, which is called the cross-validated risk of the $k$-specific TMLE. We select the $k$ that has the best cross-validated risk, which we denote with $k_n$. Our final C-TMLE of $Q_0$ is now defined as $Q_n^*=Q^{k_n*}$, and the TMLE of $\psi_0$ is defined as $\psi_n^*=\Psi(Q_n^*)$.
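The cross-validation selector for $k$ described above has a simple generic shape, sketched here in Python. The sketch is not a C-TMLE: as a hypothetical stand-in for the sequence of candidate TMLEs it uses polynomial fits of increasing degree, and it uses squared-error loss instead of the quasi-log-likelihood loss. Only the selection logic (build the candidate path on each training sample, average the validation risks per $k$, pick the minimizer) mirrors the text.

```python
import numpy as np

def cv_select(x, y, build_candidates, V=5, seed=0):
    """V-fold cross-validation over a path of candidate fits indexed by k.

    build_candidates(x_train, y_train) must return a list of prediction
    functions, one per k, always in the same order.
    """
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % V   # balanced fold labels
    fold_risks = []
    for v in range(V):
        train, valid = folds != v, folds == v
        candidates = build_candidates(x[train], y[train])
        fold_risks.append(
            [np.mean((y[valid] - f(x[valid])) ** 2) for f in candidates]
        )
    cv_risk = np.mean(fold_risks, axis=0)    # cross-validated risk for each k
    return int(np.argmin(cv_risk)), cv_risk

def polynomial_path(x, y, K=4):
    """Hypothetical candidate path: polynomial fits of degree k = 0, ..., K."""
    fits = []
    for k in range(K + 1):
        coef = np.polyfit(x, y, k)
        fits.append(lambda z, c=coef: np.polyval(c, z))
    return fits

# Hypothetical simulated data (illustration only).
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 400)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=400)

k_n, cv_risk = cv_select(x, y, polynomial_path)
print(k_n, cv_risk)
```

In an actual C-TMLE, `build_candidates` would rerun the whole forward selection and targeting procedure on each training sample, so that the selection of main terms is itself cross-validated.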

Fast version of above C-TMLE: We could carry out the above C-TMLE algorithm with the TMLE that maps an initial $(Q,g)$ into $(Q^*,g^*)$ replaced by the first step of the TMLE, which maps $(Q,g)$ into $(Q^1,g^1)$. In that manner, the selection of the sets $S^k$ is based on the bias reduction achieved in a first step of the TMLE algorithm, and most bias reduction occurs in the first step. After having selected the final one-step TMLE $Q^{k_n,1}$ and corresponding $g^{k_n}$, one should still carry out the full TMLE algorithm so that the final $Q_n^*=Q^{k_n*},g^{k_n*}$ is a real TMLE solving the estimating equations of Theorem 4.

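Statistical inference for C-TMLE: Let $\bar{Q}_n^{r*}=\bar{Q}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ be the final estimator of $\bar{Q}_0^r=\bar{Q}_0^r(\bar{g},\bar{Q})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$, a by-product of the TMLE algorithm. An estimate of the influence curve of $\psi_n^*$ is given by $$IC_n=D^*(Q_n^*,\bar{g}_n^*)-D_A(\bar{Q}_n^{r*},\bar{g}_n^*).$$

The asymptotic variance of $\sqrt{n}(\psi_n^*-\psi_0)$ can thus be estimated with $\sigma_n^2=\frac{1}{n}\sum_{i=1}^nIC_n(O_i)^2$. An asymptotically valid 0.95-confidence interval for $\psi_0$ is given by $\psi_n^*\pm1.96\,\sigma_n/\sqrt{n}$.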
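The influence-curve based interval is inexpensive to compute once the fitted values are available. The Python sketch below assembles the estimated influence curve from fitted values and forms the Wald-type interval. It assumes the forms $D^*(Q,g)=\frac{A}{\bar{g}}(Y-\bar{Q})+\bar{Q}-\Psi(Q)$ and $D_A(\bar{Q}^r,\bar{g})=\frac{\bar{Q}^r}{\bar{g}}(A-\bar{g})$, which are inferred from the expressions in Section 5.1 and the Appendix; the function names and the tiny numeric inputs are hypothetical.

```python
import numpy as np

def estimated_ic(Y, A, Q_bar, g_bar, Q_r, psi):
    """IC_n = D*(Q, g) - D_A(Q^r, g), evaluated at each observation.

    Q_bar, g_bar, Q_r are fitted values at the observed W; psi is the plug-in estimate.
    """
    d_star = A / g_bar * (Y - Q_bar) + Q_bar - psi
    d_a = Q_r / g_bar * (A - g_bar)
    return d_star - d_a

def wald_interval(psi, ic, z=1.96):
    """psi +/- z * sigma_n / sqrt(n), with sigma_n^2 the empirical mean of IC_n^2."""
    ic = np.asarray(ic, dtype=float)
    sigma_n = np.sqrt(np.mean(ic ** 2))
    half_width = z * sigma_n / np.sqrt(ic.size)
    return psi - half_width, psi + half_width

# Tiny hypothetical inputs, only to show the calling convention.
Y = np.array([1.0, 0.0])
A = np.array([1.0, 0.0])
Q_bar = np.array([0.5, 0.5])
g_bar = np.array([0.5, 0.5])
Q_r = np.array([0.2, 0.2])

ic_n = estimated_ic(Y, A, Q_bar, g_bar, Q_r, psi=0.5)
lower, upper = wald_interval(0.5, ic_n)
print(ic_n, lower, upper)
```

No resampling is involved: the interval reuses quantities that the targeting steps have already produced.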

6 Discussion

Targeted minimum loss-based estimation allows us to construct plug-in estimators $\Psi(Q_n^*)$ of a path-wise differentiable parameter $\Psi(Q_0)$ utilizing the state of the art in ensemble learning such as super-learning, while guaranteeing that the estimator $Q_n^*$ and an estimator $g_n^*$ of the nuisance parameter the TMLE utilizes in its targeting step solve a set of user-supplied estimating equations, that is, empirical means of estimating functions. These estimating functions can be selected so that the resulting TMLE of $\psi_0$ has certain statistical properties such as being efficient, or guaranteed to be more efficient than a given user-supplied estimator [28, 29], and so on. However, most importantly, these estimating equations are necessary to make the TMLE asymptotically linear, i.e. to make the TMLE unbiased enough so that the first-order linear expansion can be used for statistical inference. For example, by selecting the estimating functions to be equal to the canonical gradient of $\Psi:\mathcal{M}\to\mathbb{R}$ one arranges that $\Psi(Q_n^*)$ is asymptotically efficient under conditions that assume consistency of $Q_n^*$ and $g_n$.

However, we noted that this level of targeting is insufficient if one only relies on consistency of gn, even when that suffices for consistency of Ψ(Qn). Under such weaker assumptions, additional targeting is necessary so that a specific smooth functional of gn is asymptotically linear, which requires that an unknown smooth function of gn is itself a TMLE. The joint targeting of Qn and gn is achieved by a TMLE that also solves the extra equations making this smooth function of gn asymptotically linear, allowing one to establish asymptotic linearity of Ψ(Qn) under milder conditions that assume that the second-order terms are negligible relative to the first-order linear approximation.

In this article we also pushed this additional level of targeting to a new level by demonstrating how it allows for double robust statistical inference, and that even if we estimate the nuisance parameter in a complicated manner that is based on a criterion that cares about how it helps the estimator to fit ψ0, as used by the C-TMLE, we can still determine a set of additional estimating equations that need to be targeted by the TMLE in order to establish asymptotic linearity and thereby valid statistical inference based on the central limit theorem. This allows us now to use the sophisticated but often necessary C-TMLE while still preserving valid statistical inference under regularity conditions.

It remains to evaluate the practical benefit of the modifications of IPTW, TMLE, and C-TMLE as presented in this article for both estimation and assessment of uncertainty. We plan to address this in future research.

Even though we focussed in this article on a particular concrete estimation problem, TMLE is a general tool and our TMLE and theorems can be generalized to general statistical models and path-wise differentiable statistical target parameters.

We note that this targeting of nuisance parameter estimators in the TMLE is not only necessary to get a known influence curve but also necessary to make the TMLE asymptotically linear. So it does not suffice to simply run a bootstrap as an alternative to influence curve based inference, since the bootstrap can only work if the estimator is asymptotically linear so that it has a limit distribution. In addition, the established asymptotic linearity with known influence curve has the important by-product that one now obtains statistical inference with no extra computational cost. This is particularly important in these large semi-parametric models that require the utilization of aggressive machine learning methods in order to cover the model-space, making the estimators by necessity very computer intensive, so that a (disputable) bootstrap method might simply be too computationally expensive.

Acknowledgments

This research was supported by an NIH grant R01 AI074345-06. The author is grateful for the excellent, helpful, and insightful comments of the reviewers.

Appendix

Proof of Theorem 1

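To start with we note: $$\begin{aligned}P_nD(g_n)-P_0D(g_0)&=(P_n-P_0)D(g_0)+P_n(D(g_n)-D(g_0))\\&=(P_n-P_0)(D(g_0)-\psi_0)+P_0(D(g_n)-D(g_0))+(P_n-P_0)(D(g_n)-D(g_0)).\end{aligned}$$

The first term of this decomposition yields the first component $D(g_0)-\psi_0$ of the influence curve. Since $g_n$ falls in a Donsker class, the rightmost term is $o_P(1/\sqrt{n})$ if $P_0(D(g_n)-D(g_0))^2\to0$ in probability. So it remains to analyze the term $P_0(D(g_n)-D(g_0))$. We now note $$\begin{aligned}P_0\{D(g_n)-D(g_0)\}&=P_0YA\left(\frac{1}{g_n}-\frac{1}{g_0}\right)=P_0YA\frac{g_0-g_n}{g_ng_0}\\&=P_0YA\frac{g_0-g_n}{g_0^2}+P_0YA\frac{(g_0-g_n)^2}{g_0^2g_n}.\end{aligned}$$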

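By our assumptions, the last term $$P_0YA\frac{(g_0-g_n)^2}{g_0^2g_n}=P_0\bar{Q}_0\frac{(\bar{g}_n-\bar{g}_0)^2}{\bar{g}_0\bar{g}_n}=o_P(1/\sqrt{n}).$$

So it remains to study: $$P_0YA\frac{g_0-g_n}{g_0^2}=P_0\bar{Q}_0\frac{\bar{g}_0-\bar{g}_n}{\bar{g}_0}.$$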

Note that this equals $-\{\Psi_1(g_n)-\Psi_1(g_0)\}$, where $\Psi_1(g)=P_0\frac{\bar{Q}_0}{\bar{g}_0}\bar{g}$ is an unknown smooth parameter of $g$. Our strategy is to first approximate this parameter by an easier (still unknown) parameter $\Psi_1^r(g)=P_0\frac{\bar{Q}_0^r}{\bar{g}_0}\bar{g}$, resulting in a second-order term: $\Psi_1(g_n)-\Psi_1(g_0)=\Psi_1^r(g_n)-\Psi_1^r(g_0)+o_P(1/\sqrt{n})$. This is carried out in the next lemma. The efficient influence curve of a target parameter $\Phi:\bar{g}\to P_0H\bar{g}$ (which treats $P_0$ as known) at $g_0$ is given by $H(A-\bar{g}_0)$. Thus, one would like to construct $\bar{g}_n$ so that it solves the empirical mean of $H_0^r(A-\bar{g}_n)$ for $H_0^r=\bar{Q}_0^r/\bar{g}_0$, so that $\bar{g}_n$ targets the parameter $\Psi_1^r(g_0)$. However, $H_0^r$ is unknown. Therefore, instead $\bar{g}_n$ is constructed to solve the empirical mean of an estimate $H_n^r(A-\bar{g}_n)$ of the efficient influence curve $H_0^r(A-\bar{g}_n)$, and we will show that this indeed suffices to establish the asymptotic linearity of $\Psi_1^r(\bar{g}_n)$.

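Lemma 2 Define $\Psi_1(g)=P_0\frac{\bar{Q}_0}{\bar{g}_0}\bar{g}$, $\Psi_1^r(g)=P_0\frac{\bar{Q}_0^r}{\bar{g}_0}\bar{g}$, $\bar{Q}_{0,n}^r\equiv E_{P_0}(Y\mid A=1,\bar{g}_0(W),\bar{g}_n(W))$, and $\bar{Q}_0^r=E_{P_0}(Y\mid A=1,\bar{g}_0(W))$, where $\bar{g}_n(W)$ is treated as a fixed function of $W$ when calculating the conditional expectation. Assume $$R_{1,n}\equiv P_0(\bar{Q}_{0,n}^r-\bar{Q}_0^r)(\bar{g}_n-\bar{g}_0)/\bar{g}_0=o_P(1/\sqrt{n}).$$

Then, $$\Psi_1(g_n)-\Psi_1(g_0)=\Psi_1^r(\bar{g}_n)-\Psi_1^r(\bar{g}_0)+R_{1,n}.$$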

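Proof of Lemma 2: Note that $$\begin{aligned}\Psi_1(g_n)-\Psi_1(g_0)&=P_0YA\frac{g_n-g_0}{g_0^2}=P_0\bar{Q}_{0,n}^rA\frac{g_n-g_0}{g_0^2}=P_0\bar{Q}_{0,n}^r\frac{\bar{g}_n-\bar{g}_0}{\bar{g}_0}\\&=P_0\bar{Q}_0^r\frac{\bar{g}_n-\bar{g}_0}{\bar{g}_0}+P_0(\bar{Q}_{0,n}^r-\bar{Q}_0^r)\frac{\bar{g}_n-\bar{g}_0}{\bar{g}_0}.\end{aligned}$$

Since we assumed $R_{1,n}=o_P(1/\sqrt{n})$, it remains to prove that $\Psi_1^r(g_n)-\Psi_1^r(g_0)=P_0\bar{Q}_0^r(\bar{g}_n-\bar{g}_0)/\bar{g}_0$ is asymptotically linear. Recall that $H_0^r=\bar{Q}_0^r/\bar{g}_0$, and $H_n^r=\bar{Q}_n^r/\bar{g}_n$, where $\bar{Q}_n^r$ is obtained by regressing $Y$ on the initial estimator $\bar{g}_n(W)$ and $A=1$.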

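The next step of the proof is the following series of equalities:
$$\begin{aligned}
P_0 D_A(\bar{Q}_n^r, \bar{g}_n) &= P_0\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g}_n)\} + P_0 D_A(\bar{Q}_0^r, \bar{g}_n) \\
&= \int (H_n^r - H_0^r)(W)\,(A - \bar{g}_n(W))\, dP_0(W,A) + P_0 D_A(\bar{Q}_0^r, \bar{g}_n) \\
&= \int (H_n^r - H_0^r)(W)\,(\bar{g}_0 - \bar{g}_n)(W)\, dP_0(W) + P_0 D_A(\bar{Q}_0^r, \bar{g}_n) \\
&= R_{2,n} + P_0 D_A(\bar{Q}_0^r, \bar{g}_n),
\end{aligned}$$

where, by assumption, $R_{2,n} = o_P(1/\sqrt{n})$. We now note that
$$P_0 D_A(\bar{Q}_0^r, \bar{g}_n) = \int H_0^r(W)(A - \bar{g}_n(W))\, dP_0(A,W) = \int H_0^r(W)\bar{g}_0(W)\, dP_0(W) - \int H_0^r(W)\bar{g}_n(W)\, dP_0(W) = \Psi_1^r(\bar{g}_0) - \Psi_1^r(\bar{g}_n).$$

Thus, we have
$$-P_0 D_A(\bar{Q}_n^r, \bar{g}_n) = \Psi_1^r(\bar{g}_n) - \Psi_1^r(\bar{g}_0) - R_{2,n},$$

from which we deduce, by Lemma 2 and $P_n D_A(\bar{Q}_n^r, \bar{g}_n) = 0$, that
$$\begin{aligned}
\Psi_1(\bar{g}_n) - \Psi_1(\bar{g}_0) &= -P_0 D_A(\bar{Q}_n^r, \bar{g}_n) + R_{1,n} + R_{2,n} \\
&= (P_n - P_0) D_A(\bar{Q}_n^r, \bar{g}_n) + R_{1,n} + R_{2,n} \\
&= (P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}_0) + R_{1,n} + R_{2,n} + R_{3,n},
\end{aligned}$$

where we defined
$$R_{3,n} = (P_n - P_0)\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g}_0)\}.$$

By our assumptions, $R_{3,n} = o_P(1/\sqrt{n})$, so that it follows that $\Psi_1(g_n) - \Psi_1(g_0) = (P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}_0) + o_P(1/\sqrt{n})$. □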

Proof of Theorem 2

One easily checks that
$$\Psi(Q_n) - \Psi(Q_0) = -P_0 D(Q_n, g_0) = -P_0 D(Q_n, g_n) + P_0\{D(Q_n, g_n) - D(Q_n, g_0)\} = (P_n - P_0) D(Q_n, g_n) + P_0\{D(Q_n, g_n) - D(Q_n, g_0)\},$$

because $P_n D(Q_n, g_n) = 0$ by eq. (2). If $D(Q_n, g_n)$ falls in a $P_0$-Donsker class and $P_0\{D(Q_n, g_n) - D(Q, g_0)\}^2 = o_P(1)$ for some possibly misspecified limit $Q$ of $Q_n$, then the first term on the right-hand side equals $(P_n - P_0) D(Q, g_0) + o_P(1/\sqrt{n})$, giving us the first component $D(Q, g_0)$ of the influence curve of $\Psi(Q_n)$. The second term can be written as $A + B$ with
$$A = P_0\{D(Q_n, g_n) - D(Q_n, g_0) - (D(Q, g_n) - D(Q, g_0))\}, \qquad B = P_0\{D(Q, g_n) - D(Q, g_0)\}.$$

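The first term $A$ equals
$$-P_0 (H_Y(g_n) - H_Y(g_0))(\bar{Q}_n - \bar{Q}),$$

where $H_Y(g)(A,W) = A/\bar{g}(W)$. By our assumptions, this term is $o_P(1/\sqrt{n})$. Thus, it suffices to establish asymptotic linearity of $\Psi_1(g_n) = P_0 D(Q, g_n)$ as an estimator of $\Psi_1(g_0) = P_0 D(Q, g_0)$. We have
$$\Psi_1(g_n) - \Psi_1(g_0) = -P_0 (Y - \bar{Q})\,\frac{A}{\bar{g}_n \bar{g}_0}(\bar{g}_n - \bar{g}_0) = -P_0 \bar{Q}_{0,n}^r\,\frac{A}{\bar{g}_n \bar{g}_0}(\bar{g}_n - \bar{g}_0) = -P_0 \bar{Q}_{0,n}^r\,\frac{1}{\bar{g}_n}(\bar{g}_n - \bar{g}_0),$$

where $\bar{Q}_{0,n}^r$ appeared by writing the expectation w.r.t. $P_0$ as an expectation of the conditional expectation, given $A, \bar{g}_n(W), \bar{g}_0(W)$. Let $H_{0,n}^r = \bar{Q}_{0,n}^r/\bar{g}_n$ and recall $H_0^r = \bar{Q}_0^r/\bar{g}_0$, where $\bar{Q}_0^r = E_{P_0}(Y - \bar{Q}(W) \mid A=1, \bar{g}_0(W))$. The last term can be written as
$$-P_0 H_0^r(\bar{g}_n - \bar{g}_0) - P_0 (H_{0,n}^r - H_0^r)(\bar{g}_n - \bar{g}_0).$$

By our assumptions, the second term above is $o_P(1/\sqrt{n})$. Thus, in order to establish asymptotic linearity of $\Psi_1(g_n)$, it suffices to establish asymptotic linearity of $\Psi_1^r(\bar{g}_n) = P_0 H_0^r \bar{g}_n$ as an estimator of $\Psi_1^r(\bar{g}_0) = P_0 H_0^r \bar{g}_0$, where $P_0$ and $H_0^r$ are treated as known.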

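The estimator $\bar{g}_n$ was constructed to target $P_0 H_n^r \bar{g}_0$ instead, where we recall that $H_n^r = \bar{Q}_n^r/\bar{g}_n$. That is, our targeted estimator $g_n$ solves the efficient influence curve equation $P_n D_A(\bar{Q}_n^r, \bar{g}_n) = 0$ for the parameter $P_0 H_n^r \bar{g}_0$ of $\bar{g}_0$. We now note that
$$P_0 D_A(\bar{Q}_0^r, \bar{g}_n) = \int H_0^r(W)(A - \bar{g}_n(W))\, dP_0(A,W) = \int H_0^r(W)\bar{g}_0(W)\, dP_0(W) - \int H_0^r(W)\bar{g}_n(W)\, dP_0(W) = \Psi_1^r(\bar{g}_0) - \Psi_1^r(\bar{g}_n).$$

We have
$$\begin{aligned}
P_0 D_A(\bar{Q}_n^r, \bar{g}_n) &= P_0\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g}_n)\} + P_0 D_A(\bar{Q}_0^r, \bar{g}_n) \\
&= \int (H_n^r - H_0^r)(W)\,(A - \bar{g}_n(W))\, dP_0(W,A) + P_0 D_A(\bar{Q}_0^r, \bar{g}_n) \\
&= \int (H_n^r - H_0^r)(W)\,(\bar{g}_0 - \bar{g}_n)(W)\, dP_0(W) + P_0 D_A(\bar{Q}_0^r, \bar{g}_n) \\
&\equiv R_{2,n} + P_0 D_A(\bar{Q}_0^r, \bar{g}_n),
\end{aligned}$$

where $R_{2,n} = o_P(1/\sqrt{n})$, by assumption. Combining the last two equations yields:
$$\Psi_1^r(\bar{g}_n) - \Psi_1^r(\bar{g}_0) = -P_0 D_A(\bar{Q}_n^r, \bar{g}_n) + R_{2,n} = (P_n - P_0) D_A(\bar{Q}_n^r, \bar{g}_n) + R_{2,n} = (P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}_0) + R_{2,n} + R_{3,n},$$

where we defined
$$R_{3,n} = (P_n - P_0)\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g}_0)\}.$$

We have that $R_{3,n} = o_P(1/\sqrt{n})$ if $D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g}_0)$ falls in a $P_0$-Donsker class with probability tending to 1, and $P_0\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g}_0)\}^2 \to 0$ in probability as $n \to \infty$. Thus, we have proven that $\Psi_1^r(g_n) - \Psi_1^r(g_0) = (P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}_0) + o_P(1/\sqrt{n})$. Since $\Psi_1(g_n) - \Psi_1(g_0)$ equals minus this difference up to an $o_P(1/\sqrt{n})$ term, it follows that $\Psi_1(g_n) - \Psi_1(g_0) = -(P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}_0) + o_P(1/\sqrt{n})$.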

Proof of Theorem 3

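As outlined in Section 1, we have
$$\begin{aligned}
\Psi(Q_n) - \Psi(Q_0) &= -P_0 D(Q_n, g_n) + P_0 (\bar{Q}_0 - \bar{Q}_n)\,\frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_n} \\
&= (P_n - P_0) D(Q_n, g_n) + P_0 (\bar{Q}_0 - \bar{Q}_n)\,\frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_n} \\
&= (P_n - P_0) D(Q, g) + P_0 (\bar{Q}_n - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}_0}{\bar{g}_n} + o_P(1/\sqrt{n}),
\end{aligned}$$

if $D(Q_n, g_n)$ falls in a Donsker class with probability tending to 1, and $P_0\{D(Q_n, g_n) - D(Q, g)\}^2 \to 0$ in probability as $n \to \infty$. The first term on the right-hand side gives us the first component $D(Q, g)$ of the influence curve of $\psi_n$.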
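The exact identity behind this expansion can be checked numerically on a toy distribution. The Python sketch below assumes the usual form $D(Q,g) = \frac{A}{\bar{g}}(Y - \bar{Q}) + \bar{Q} - \Psi(Q)$ and a covariate $W$ with four support points; all numerical values and variable names are hypothetical.

```python
# W takes four values; every nuisance function is a list indexed by the value of W.
p_w = [0.1, 0.2, 0.3, 0.4]        # true marginal distribution of W
g0 = [0.30, 0.50, 0.60, 0.80]     # true propensity score P0(A=1 | W)
q0 = [0.20, 0.40, 0.50, 0.90]     # true outcome regression E0(Y | A=1, W)
gn = [0.35, 0.45, 0.70, 0.75]     # hypothetical estimate of g0
qn = [0.25, 0.30, 0.55, 0.80]     # hypothetical estimate of q0
pn_w = [0.15, 0.15, 0.35, 0.35]   # hypothetical estimate of the marginal of W


def p0(values):
    """Expectation under P0 of a function of W given as a list."""
    return sum(p * v for p, v in zip(p_w, values))


psi_0 = p0(q0)
psi_n = sum(p * q for p, q in zip(pn_w, qn))  # plug-in value Psi(Q_n)

# E0[D(Qn, gn) | W] = (g0/gn) * (q0 - qn) + qn - psi_n, then average over W.
p0_d = p0([
    g0_w / gn_w * (q0_w - qn_w) + qn_w - psi_n
    for g0_w, gn_w, q0_w, qn_w in zip(g0, gn, q0, qn)
])

# Second-order remainder P0 (q0 - qn)(g0 - gn) / gn.
second_order = p0([
    (q0_w - qn_w) * (g0_w - gn_w) / gn_w
    for g0_w, gn_w, q0_w, qn_w in zip(g0, gn, q0, qn)
])

lhs = psi_n - psi_0
rhs = -p0_d + second_order
print(lhs, rhs)
```

The two printed numbers agree up to rounding for any choice of the estimates, which is why the remainder is called second order: it is a product of the two estimation errors.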

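It suffices to analyze the second term. Initially, we note that
$$P_0 (\bar{Q}_n - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}_0}{\bar{g}_n} = P_0 (\bar{Q}_n - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}_0}{\bar{g}} + R_{1,n},$$

where
$$R_{1,n} = -P_0 (\bar{Q}_n - \bar{Q}_0)(\bar{g}_n - \bar{g}_0)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}\,\bar{g}_n}.$$

By assumption, $R_{1,n} = o_P(1/\sqrt{n})$.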

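Now, we note
$$\begin{aligned}
P_0 (\bar{Q}_n - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}_0}{\bar{g}} &= P_0 (\bar{Q}_n - \bar{Q} + \bar{Q} - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g} + \bar{g} - \bar{g}_0}{\bar{g}} \\
&= P_0 (\bar{Q}_n - \bar{Q})\,\frac{\bar{g}_n - \bar{g}}{\bar{g}} + P_0 (\bar{Q}_n - \bar{Q})\,\frac{\bar{g} - \bar{g}_0}{\bar{g}} + P_0 (\bar{Q} - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}} + P_0 (\bar{Q} - \bar{Q}_0)\,\frac{\bar{g} - \bar{g}_0}{\bar{g}}.
\end{aligned}$$

By our assumptions, the first term $R_{2,n} = P_0 (\bar{Q}_n - \bar{Q})\frac{\bar{g}_n - \bar{g}}{\bar{g}}$ satisfies $R_{2,n} = o_P(1/\sqrt{n})$. In addition, the last term equals zero by assumption: $\bar{Q} = \bar{Q}_0$ or $\bar{g} = \bar{g}_0$.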

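So it suffices to analyze the second and third terms of this last expression. In order to represent the second and third terms we define
$$\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n) = P_0 \bar{Q}_n\,\frac{\bar{g} - \bar{g}_0}{\bar{g}}, \qquad \Psi_{1,\bar{g},\bar{Q},\bar{Q}_0}(\bar{g}_n) = P_0\,\frac{\bar{Q} - \bar{Q}_0}{\bar{g}}\,\bar{g}_n.$$

The sum of the second and third terms can now be represented as:
$$I(\bar{Q} = \bar{Q}_0)\{\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n) - \Psi_{2,\bar{g},\bar{g}_0}(\bar{Q})\} + I(\bar{g} = \bar{g}_0)\{\Psi_{1,\bar{g},\bar{Q},\bar{Q}_0}(\bar{g}_n) - \Psi_{1,\bar{g},\bar{Q},\bar{Q}_0}(\bar{g})\}.$$

For notational convenience, we will suppress the dependence of these mappings on the unknown quantities, and thus use $\Psi_1, \Psi_2$.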
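The four-term split, and the way consistency of either nuisance estimator removes some of the terms, can be illustrated with a small Python sketch on a three-point covariate distribution. All values and names below are hypothetical and chosen only for illustration.

```python
p_w = [0.25, 0.25, 0.5]   # marginal distribution of W
q0 = [0.2, 0.5, 0.8]      # true outcome regression
g0 = [0.3, 0.5, 0.7]      # true propensity score
qn = [0.25, 0.45, 0.85]   # hypothetical estimate of q0
gn = [0.35, 0.55, 0.65]   # hypothetical estimate of g0


def p0(values):
    return sum(p * v for p, v in zip(p_w, values))


def total(qn, q0, gn, g0, g):
    """P0 (Qn - Q0)(gn - g0) / g."""
    return p0([(a - b) * (c - d) / e for a, b, c, d, e in zip(qn, q0, gn, g0, g)])


def four_terms(qn, q, q0, gn, g, g0):
    """The four terms obtained by inserting the limits q and g."""
    rows = list(zip(qn, q, q0, gn, g, g0))
    t1 = p0([(a - b) * (d - e) / e for a, b, c, d, e, f in rows])
    t2 = p0([(a - b) * (e - f) / e for a, b, c, d, e, f in rows])
    t3 = p0([(b - c) * (d - e) / e for a, b, c, d, e, f in rows])
    t4 = p0([(b - c) * (e - f) / e for a, b, c, d, e, f in rows])
    return [t1, t2, t3, t4]


# Case 1: g_n has the correct limit (g = g0), Q_n has a misspecified limit.
q_lim = [0.3, 0.4, 0.9]
terms_g = four_terms(qn, q_lim, q0, gn, g0, g0)

# Case 2: Q_n has the correct limit (Q = Q0), g_n has a misspecified limit.
g_lim = [0.4, 0.4, 0.6]
terms_q = four_terms(qn, q0, q0, gn, g_lim, g0)

print(terms_g, terms_q)
```

In case 1 the second and fourth terms vanish, in case 2 the third and fourth terms vanish, matching the indicator representation above.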

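Analysis of $\Psi_1(\bar{g}_n)$ if $\bar{g} = \bar{g}_0$: Recalling the definition of $\bar{Q}_{0,n}^r$, we have
$$\begin{aligned}
\Psi_1(\bar{g}_n) - \Psi_1(\bar{g}) &= P_0\,\frac{\bar{Q} - \bar{Q}_0}{\bar{g}_0}(\bar{g}_n - \bar{g}_0) = -P_0 (Y - \bar{Q})\,\frac{A}{\bar{g}_0^2}(\bar{g}_n - \bar{g}_0) \\
&= -P_0\,\frac{E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g}_0, \bar{g}_n)}{\bar{g}_0}(\bar{g}_n - \bar{g}_0) = -P_0\,\frac{\bar{Q}_{0,n}^r}{\bar{g}_0}(\bar{g}_n - \bar{g}_0) \\
&= -P_0\,\frac{\bar{Q}_{0,n}^r - \bar{Q}_0^r}{\bar{g}_0}(\bar{g}_n - \bar{g}_0) - P_0\,\frac{\bar{Q}_0^r}{\bar{g}_0}(\bar{g}_n - \bar{g}_0).
\end{aligned}$$

By our assumptions,
$$R_{3,n} \equiv P_0\,\frac{\bar{Q}_{0,n}^r - \bar{Q}_0^r}{\bar{g}_0}(\bar{g}_n - \bar{g}_0) = o_P(1/\sqrt{n}),$$

so that it remains to analyze $P_0 H_0^r(\bar{g}_n - \bar{g}_0)$, where $H_0^r = \bar{Q}_0^r/\bar{g}_0$. Let $H_n^r = \bar{Q}_n^r/\bar{g}_n$, and recall that by construction $P_n D_A(\bar{Q}_n^r, \bar{g}_n) = P_n H_n^r(A - \bar{g}_n) = 0$. We then proceed as follows:
$$P_0 H_0^r(\bar{g}_n - \bar{g}_0) = P_0 H_n^r(\bar{g}_n - \bar{g}_0) + P_0 (H_0^r - H_n^r)(\bar{g}_n - \bar{g}_0) \equiv P_0 H_n^r(\bar{g}_n - \bar{g}_0) - R_{4,n},$$

where, by our assumptions, $R_{4,n} = P_0 (H_n^r - H_0^r)(\bar{g}_n - \bar{g}_0) = o_P(1/\sqrt{n})$.

In addition,
$$P_0 H_n^r(\bar{g}_n - \bar{g}_0) = -P_0 H_n^r(A - \bar{g}_n) = (P_n - P_0) H_n^r(A - \bar{g}_n) = (P_n - P_0) H_0^r(A - \bar{g}_0) + R_{5,n},$$

where $R_{5,n} = o_P(1/\sqrt{n})$ if $P_0\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g})\}^2 = o_P(1)$ and $D_A(\bar{Q}_n^r, \bar{g}_n)$ falls in a Donsker class with probability tending to 1. This proves that, if $\bar{g} = \bar{g}_0$, then $\Psi_1(\bar{g}_n) - \Psi_1(\bar{g}_0) = -(P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}) + o_P(1/\sqrt{n})$.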

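Analysis of $\Psi_2(\bar{Q}_n)$ if $\bar{Q} = \bar{Q}_0$: Recall the definitions of $H_Y(\bar{g}^r, \bar{g})$, $\bar{g}_{0,n}^r = E_{P_0}(A \mid \bar{g}, \bar{Q}_n, \bar{Q})$, $\bar{g}_n^r$ (an estimator of $\bar{g}_0^r = E_{P_0}(A \mid \bar{g}, \bar{Q})$), and that, by construction, $P_n H_Y(\bar{g}_n^r, \bar{g}_n)(Y - \bar{Q}_n) = 0$. We have
$$\begin{aligned}
\Psi_2(\bar{Q}_n) - \Psi_2(\bar{Q}_0) &= P_0\,\frac{\bar{g} - \bar{g}_0}{\bar{g}}(\bar{Q}_n - \bar{Q}_0) = -P_0\,\frac{A - \bar{g}}{\bar{g}}(\bar{Q}_n - \bar{Q}_0) = -P_0\,\frac{\bar{g}_{0,n}^r - \bar{g}}{\bar{g}}(\bar{Q}_n - \bar{Q}_0) \\
&= P_0\,\frac{A}{\bar{g}_{0,n}^r}\,\frac{\bar{g}_{0,n}^r - \bar{g}}{\bar{g}}(Y - \bar{Q}_n) = P_0 H_Y(\bar{g}_{0,n}^r, \bar{g})(Y - \bar{Q}_n) \\
&= P_0 H_Y(\bar{g}_n^r, \bar{g}_n)(Y - \bar{Q}_n) + P_0\{H_Y(\bar{g}_{0,n}^r, \bar{g})(\bar{Q}_0 - \bar{Q}_n) - H_Y(\bar{g}_n^r, \bar{g}_n)(\bar{Q}_0 - \bar{Q}_n)\}.
\end{aligned}$$

Here we used that $\bar{g}_0$ is a conditional expectation of $A$, allowing us to first replace $\bar{g}_0$ by $A$ and then retake the conditional expectation, but now only conditioning on what is needed to fix all other terms within the expectation w.r.t. $P_0$. As a result of this trick, we were able to replace the hard to estimate $\bar{g}_0$, which conditions on all of $W$, by the easier $\bar{g}_{0,n}^r$. Similarly, we used this to replace $\bar{Q}_0$ by $Y$. The last term is a second-order term involving the products of differences $(\bar{Q}_n - \bar{Q}_0)(\bar{g}_n - \bar{g})$ and $(\bar{Q}_n - \bar{Q}_0)(\bar{g}_{0,n}^r - \bar{g}_n^r)$. By our assumptions, this last term is $o_P(1/\sqrt{n})$. We now proceed as follows:
$$P_0 H_Y(\bar{g}_n^r, \bar{g}_n)(Y - \bar{Q}_n) = -(P_n - P_0) H_Y(\bar{g}_n^r, \bar{g}_n)(Y - \bar{Q}_n) = -(P_n - P_0) H_Y(\bar{g}_0^r, \bar{g})(Y - \bar{Q}) + o_P(1/\sqrt{n}),$$

where we assumed that $H_Y(\bar{g}_n^r, \bar{g}_n)(Y - \bar{Q}_n)$ falls in a Donsker class with probability tending to 1, and
$$P_0\{H_Y(\bar{g}_n^r, \bar{g}_n)(Y - \bar{Q}_n) - H_Y(\bar{g}_0^r, \bar{g})(Y - \bar{Q})\}^2 \to 0$$

in probability. This proves $\Psi_2(\bar{Q}_n) - \Psi_2(\bar{Q}_0) = -(P_n - P_0) D_Y(\bar{Q}, \bar{g}_0^r, \bar{g}) + o_P(1/\sqrt{n})$. □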
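The conditioning step used in this analysis, replacing the true propensity score by its conditional mean given only the quantities that appear elsewhere in the integrand, can be checked on a toy example. The Python sketch below uses a six-point covariate distribution in which the limit of the propensity fit and both outcome regressions are constant within two groups of covariate values while the true propensity score varies within groups; every number and name is a hypothetical choice.

```python
from collections import defaultdict

p_w = [0.10, 0.20, 0.15, 0.25, 0.20, 0.10]   # marginal distribution of W
g0 = [0.30, 0.50, 0.40, 0.70, 0.60, 0.80]    # true propensity, varies within groups
g = [0.40, 0.40, 0.40, 0.60, 0.60, 0.60]     # (misspecified) limit of the propensity fit
q0 = [0.20, 0.20, 0.20, 0.70, 0.70, 0.70]    # true outcome regression, here Q = Q0
qn = [0.25, 0.25, 0.25, 0.65, 0.65, 0.65]    # hypothetical outcome regression estimate


def p0(values):
    return sum(p * v for p, v in zip(p_w, values))


# Reduced propensity: E0(A | g, Qn, Q) = E0(g0(W) | g, Qn, Q), computed by
# averaging g0 over the covariate values that share the same (g, Qn, Q).
mass = defaultdict(float)
weighted = defaultdict(float)
for p, g0_w, key in zip(p_w, g0, zip(g, qn, q0)):
    mass[key] += p
    weighted[key] += p * g0_w
g0nr = [weighted[key] / mass[key] for key in zip(g, qn, q0)]

# P0 (g - g0)/g (Qn - Q0)  versus  -P0 (g0nr - g)/g (Qn - Q0).
lhs = p0([(g_w - g0_w) / g_w * (qn_w - q0_w)
          for g_w, g0_w, qn_w, q0_w in zip(g, g0, qn, q0)])
rhs = -p0([(r_w - g_w) / g_w * (qn_w - q0_w)
           for r_w, g_w, qn_w, q0_w in zip(g0nr, g, qn, q0)])
print(lhs, rhs)
```

The reduced propensity takes only two distinct values here, which illustrates why it is an easier regression target than the full propensity score.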

Proof of Theorem 4

As in the proof of the previous theorem, we start with
$$\Psi(Q_n) - \Psi(Q_0) = (P_n - P_0) D(Q, g) + P_0 (\bar{Q}_n - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}_0}{\bar{g}_n} + o_P(1/\sqrt{n}),$$

where we use that $D(Q_n, g_n)$ falls in a Donsker class with probability tending to 1, and $P_0\{D(Q_n, g_n) - D(Q, g)\}^2 \to 0$ in probability as $n \to \infty$. The first term yields the first component $D(Q, g)$ of the influence curve of $\psi_n$.

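As in the proof of the previous theorem, we decompose this second term as follows:
$$\begin{aligned}
P_0 (\bar{Q}_n - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}_0}{\bar{g}_n} &= P_0 (\bar{Q}_n - \bar{Q} + \bar{Q} - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g} + \bar{g} - \bar{g}_0}{\bar{g}_n} \\
&= P_0 (\bar{Q}_n - \bar{Q})\,\frac{\bar{g}_n - \bar{g}}{\bar{g}_n} + P_0 (\bar{Q}_n - \bar{Q})\,\frac{\bar{g} - \bar{g}_0}{\bar{g}_n} + P_0 (\bar{Q} - \bar{Q}_0)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}_n} + P_0 (\bar{Q} - \bar{Q}_0)\,\frac{\bar{g} - \bar{g}_0}{\bar{g}_n},
\end{aligned}$$

resulting in four terms, which we will denote by Terms 1–4. We will now analyze these four terms.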

Term 1: The first term $P_0 (\bar{Q}_n - \bar{Q})\frac{\bar{g}_n - \bar{g}}{\bar{g}_n} = o_P(1/\sqrt{n})$, by assumption.

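Term 4: Due to our assumption that $P_0 (\bar{Q} - \bar{Q}_0)(\bar{g} - \bar{g}_0)/\bar{g} = 0$, this last term equals:
$$-P_0 (\bar{Q} - \bar{Q}_0)(\bar{g} - \bar{g}_0)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}_n \bar{g}} = -P_0 (\bar{Q} - \bar{Q}_0)(\bar{g} - \bar{g}_0)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}^2} + R_{1,n},$$

where, by assumption,
$$R_{1,n} = P_0 (\bar{Q} - \bar{Q}_0)(\bar{g} - \bar{g}_0)\,\frac{(\bar{g}_n - \bar{g})^2}{\bar{g}^2 \bar{g}_n} = o_P(1/\sqrt{n}).$$

We proceed as follows:
$$-P_0 (\bar{Q} - \bar{Q}_0)(\bar{g} - \bar{g}_0)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}^2} = -P_0 (\bar{Q} - \bar{Q}_0)(\bar{g} - A)\,\frac{\bar{g}_n - \bar{g}}{\bar{g}^2} = -P_0\,\frac{\bar{Q} - \bar{Q}_0}{\bar{g}}(\bar{g}_n - \bar{g}) + P_0 (\bar{Q} - \bar{Q}_0)\,\frac{A}{\bar{g}^2}(\bar{g}_n - \bar{g}).$$

The first term is asymptotically equivalent to minus Term 3, which shows that Term 3 is canceled by a component of Term 4 up to a second-order term that is $o_P(1/\sqrt{n})$, by assumption. The second term equals
$$\begin{aligned}
P_0 (\bar{Q} - \bar{Q}_0)\,\frac{A}{\bar{g}^2}(\bar{g}_n - \bar{g}) &= -P_0 (Y - \bar{Q})\,\frac{A}{\bar{g}^2}(\bar{g}_n - \bar{g}) = -P_0 E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g}, \bar{g}_n)\,\frac{A}{\bar{g}^2}(\bar{g}_n - \bar{g}) \\
&= -P_0 E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g}, \bar{g}_n)\,\frac{\bar{g}_{0,n}^r}{\bar{g}^2}(\bar{g}_n - \bar{g}) \\
&= -P_0 E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g})\,\frac{\bar{g}_0^r}{\bar{g}^2}(\bar{g}_n - \bar{g}) - P_0 (H_1(\bar{g}_n) - H_1(\bar{g}))(\bar{g}_n - \bar{g}),
\end{aligned}$$

where $H_1(\bar{g}_n) \equiv E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g}, \bar{g}_n)\,\bar{g}_{0,n}^r/\bar{g}^2$ approximates $H_1(\bar{g}) = E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g})\,\bar{g}_0^r/\bar{g}^2$, $\bar{g}_{0,n}^r = E_{P_0}(A \mid \bar{g}, \bar{g}_n)$, and $\bar{g}_0^r = E_{P_0}(A \mid \bar{g})$. Let $\bar{Q}_{0,n}^r = E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g}, \bar{g}_n)$ and $\bar{Q}_0^r = E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g})$. We assumed $\|\bar{g}_{0,n}^r - \bar{g}_0^r\|_0 \|\bar{g}_n - \bar{g}\|_0 = o_P(1/\sqrt{n})$ and $\|\bar{Q}_{0,n}^r - \bar{Q}_0^r\|_0 \|\bar{g}_n - \bar{g}\|_0 = o_P(1/\sqrt{n})$, which implies that $R_{2,n} = P_0 (H_1(\bar{g}_n) - H_1(\bar{g}))(\bar{g}_n - \bar{g}) = o_P(1/\sqrt{n})$.

By assumption, $\bar{g} = E_0(A \mid W_1)$ for some $W_1$ that is a function of $W$. Therefore, $\bar{g}_0^r(W) = E_{P_0}(A \mid \bar{g}(W)) = E_{P_0}(E_{P_0}(A \mid W_1) \mid \bar{g}(W)) = \bar{g}(W)$. Thus, it remains to analyze
$$-P_0\,\frac{E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g})}{\bar{g}}(\bar{g}_n - \bar{g}). \qquad (4)$$

This term is analyzed below, and it is shown that it equals $-(P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}) + o_P(1/\sqrt{n})$.

To conclude, we have then shown that the fourth term equals the latter expression minus the third term.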

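We now analyze (4), which can be represented as $-P_0 \frac{\bar{Q}_0^r}{\bar{g}}(\bar{g}_n - \bar{g})$, where $\bar{Q}_0^r = E_{P_0}(Y - \bar{Q} \mid A=1, \bar{g})$. In this proof, we will use the notation $H_0^r = \bar{Q}_0^r/\bar{g}$, $H_n^r = \bar{Q}_n^r/\bar{g}_n$. Since $\bar{g}(W) = E_0(A \mid W_1)$ for some $W_1$, and $\bar{Q}_0^r$ is thus also a function of $W_1$, we have $P_0 \frac{\bar{Q}_0^r}{\bar{g}}\bar{g} = P_0 \frac{\bar{Q}_0^r}{\bar{g}} A$.

We now proceed as follows:
$$\begin{aligned}
-P_0 H_0^r(\bar{g}_n - \bar{g}) &= -P_0 H_0^r(\bar{g}_n - A) = -P_0 H_n^r(\bar{g}_n - A) - P_0 (H_0^r - H_n^r)(\bar{g}_n - A) \\
&= P_0 H_n^r(A - \bar{g}_n) - P_0 (H_0^r - H_n^r)(\bar{g}_n - E_0(A \mid \bar{g}, \bar{Q}_n^r, \bar{g}_n)).
\end{aligned}$$

For the second term $R_{4,n}$, we can substitute
$$\bar{g}_n - E_0(A \mid \bar{g}, \bar{Q}_n^r, \bar{g}_n) = (\bar{g}_n - \bar{g}) + E_0(A \mid \bar{g}, \bar{Q}_0^r) - E_0(A \mid \bar{g}, \bar{Q}_n^r, \bar{g}_n),$$

by noting that $\bar{g} = E_0(A \mid \bar{g}, \bar{Q}_0^r)$. Thus, this second term results in two terms, one that can be bounded by $\|H_n^r - H_0^r\|_0 \|\bar{g}_n - \bar{g}\|_0$ and another that is bounded by $\|H_n^r - H_0^r\|_0 \|E_0(A \mid \bar{g}, \bar{Q}_0^r) - E_0(A \mid \bar{g}, \bar{Q}_n^r, \bar{g}_n)\|_0$.

By assumption, both terms are $o_P(1/\sqrt{n})$ and thus $R_{4,n} = o_P(1/\sqrt{n})$.

Since, by construction of $g_n$, $P_n H_n^r(A - \bar{g}_n) = 0$, the first term can be written as follows:
$$P_0 H_n^r(A - \bar{g}_n) = -(P_n - P_0) H_n^r(A - \bar{g}_n) = -(P_n - P_0) H_0^r(A - \bar{g}) + R_{5,n},$$

where $R_{5,n} = o_P(1/\sqrt{n})$ if $P_0\{D_A(\bar{Q}_n^r, \bar{g}_n) - D_A(\bar{Q}_0^r, \bar{g})\}^2 = o_P(1)$ and $D_A(\bar{Q}_n^r, \bar{g}_n)$ falls in a Donsker class with probability tending to 1, and we recall that $D_A(\bar{Q}_0^r, \bar{g}) = H_0^r(A - \bar{g})$. This completes the proof for the fourth term.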

Term 3: Our analysis of Term 4 showed that Term 3 cancels out, and thus that the sum of the third and fourth terms equals $-(P_n - P_0) D_A(\bar{Q}_0^r, \bar{g}) + o_P(1/\sqrt{n})$, which yields the second component $-D_A(\bar{Q}_0^r, \bar{g})$ of the influence curve of $\psi_n$.

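Analysis of Term 2: Up to a second-order term that can be bounded by $\|\bar{g}_n - \bar{g}\|_0 \|\bar{Q}_n - \bar{Q}\|_0 = o_P(1/\sqrt{n})$, we can represent Term 2 as
$$\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n) - \Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}),$$

where $\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n) = P_0 \bar{Q}_n \frac{\bar{g} - \bar{g}_0}{\bar{g}}$.

We have
$$\Psi_2(\bar{Q}_n) - \Psi_2(\bar{Q}) = P_0\,\frac{\bar{g} - \bar{g}_0}{\bar{g}}(\bar{Q}_n - \bar{Q}) = -P_0\,\frac{A - \bar{g}}{\bar{g}}(\bar{Q}_n - \bar{Q}) = -P_0\,\frac{E_{P_0}(A \mid \bar{g}, \bar{Q}_n, \bar{Q}) - \bar{g}}{\bar{g}}(\bar{Q}_n - \bar{Q}).$$

Recall that, by our assumption, $\bar{g} = E_{P_0}(A \mid \bar{g}, \bar{Q})$. Let $\bar{g}_{0,n}^r = E_{P_0}(A \mid \bar{g}, \bar{Q}_n, \bar{Q})$. By our assumptions,
$$P_0\,\frac{\bar{g}_{0,n}^r - \bar{g}}{\bar{g}}(\bar{Q}_n - \bar{Q}) = o_P(1/\sqrt{n}). \qquad (5)$$

This proves that $\Psi_2(\bar{Q}_n) - \Psi_2(\bar{Q}) = o_P(1/\sqrt{n})$. □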

Remark: Proof of additional result. In this analysis of Term 2, we assumed $\bar{g} = E_{P_0}(A \mid \bar{g}, \bar{Q})$ and condition (5). Let us now provide a different type of analysis for Term 2, relying on different conditions. We have
$$\Psi_2(\bar{Q}_n) - \Psi_2(\bar{Q}) = P_0\,\frac{\bar{g} - \bar{g}_0}{\bar{g}}(\bar{Q}_n - \bar{Q}) = P_0\,\frac{A - \bar{g}}{\bar{g}}(\bar{Q} - \bar{Q}_n) = P_0\,\frac{\bar{g}_{0,n}^r - \bar{g}}{\bar{g}_{0,n}^r\,\bar{g}}\,A\,(Y - \bar{Q}_n),$$

where $\bar{g}_{0,n}^r = E_0(A \mid \bar{g}, \bar{Q}, \bar{Q}_n)$, provided we assume that $P_0 \frac{\bar{g}_{0,n}^r - \bar{g}}{\bar{g}_{0,n}^r\,\bar{g}} A (Y - \bar{Q}) = 0$. The latter equality holds if, in the TMLE algorithm, we target $\bar{Q}_n$ with the clever covariate $H_Y(\bar{g}_n^r, \bar{g}_n) = \frac{\bar{g}_n^r - \bar{g}_n}{\bar{g}_n^r \bar{g}_n} A$, where $\bar{g}_n^r$ estimates a non-parametric regression of $A$ on $\bar{Q}_n, \bar{g}_n$, exactly as in Theorem 3. Under that assumption one can now show that we obtain another influence curve component $D_Y$ defined by
$$D_Y(\bar{Q}, \bar{g}_0^r, \bar{g}) = \frac{\bar{g}_0^r - \bar{g}}{\bar{g}_0^r\,\bar{g}}\,A\,(Y - \bar{Q}),$$

where $\bar{g}_0^r = E_0(A \mid \bar{g}, \bar{Q})$. Thus, we now have that $\Psi(Q_n)$ is asymptotically linear with influence curve $D(Q, g) - D_A(\bar{Q}_0^r, \bar{g}) - D_Y(\bar{Q}, \bar{g}_0^r, \bar{g})$. However, note that if $\bar{g}_0^r = \bar{g}$, i.e. if $\bar{g} = E_0(A \mid \bar{g}, \bar{Q})$, then $D_Y = 0$. To conclude, one can remove the condition that $\bar{g}_n$ needs to non-parametrically adjust for $\bar{Q}_n$, as arranged by the TMLE algorithm in Theorem 4, by adding the additional clever covariate $H_Y(\bar{g}_n^r, \bar{g}_n)$ to the submodel for $\bar{Q}_n$ in the TMLE algorithm, and the influence curve will now have another component $D_Y(P_0)$, as in Theorem 3. This results in a generalization of Theorem 3 which does not require that either $Q_n$ or $g_n$ is consistent, but only requires that their limits $\bar{g}, \bar{Q}$ satisfy $P_0 \frac{\bar{g} - \bar{g}_0}{\bar{g}}(\bar{Q} - \bar{Q}_0) = 0$. Thus, this latter generalization of Theorem 3 would provide an appropriate theorem for a C-TMLE that does not enforce the non-parametric adjustment for $\bar{Q}_n$, but still needs to adjust for $\bar{Q}_n - \bar{Q}_0$.
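For readers who want to see the extra clever covariate and the resulting influence curve component in computable form, the following Python sketch evaluates them for a single observation. The function names are hypothetical, and the inputs are assumed to be already evaluated at that observation's covariates.

```python
def clever_covariate_hy(a: int, g_r: float, g: float) -> float:
    """H_Y(g_r, g) = A * (g_r - g) / (g_r * g) for one observation."""
    return a * (g_r - g) / (g_r * g)


def influence_component_dy(y: float, a: int, q_bar: float, g_r: float, g: float) -> float:
    """D_Y(Q, g_r, g) = H_Y(g_r, g) * (Y - Q) for one observation."""
    return clever_covariate_hy(a, g_r, g) * (y - q_bar)


# When the reduced propensity g_r coincides with g, the component vanishes.
dy_equal = influence_component_dy(y=1.0, a=1, q_bar=0.3, g_r=0.5, g=0.5)

# Untreated observations (A = 0) never contribute.
dy_untreated = influence_component_dy(y=1.0, a=0, q_bar=0.3, g_r=0.6, g=0.5)

# A treated observation with g_r different from g contributes a non-zero value.
dy_general = influence_component_dy(y=1.0, a=1, q_bar=0.4, g_r=0.6, g=0.5)

print(dy_equal, dy_untreated, dy_general)
```

The first case mirrors the statement that $D_Y = 0$ whenever $\bar{g}_0^r = \bar{g}$.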

References

1. Bickel PJ, Klaassen CA, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Springer-Verlag, 1997.
2. Gill RD. Non- and semiparametric maximum likelihood estimators and the von Mises method (part 1). Scand J Stat 1989;16:97–128.
3. Gill RD, van der Laan MJ, Wellner JA. Inefficient estimators of the bivariate survival function for three models. Ann Inst Henri Poincaré 1995;31:545–97.
4. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York: Springer-Verlag, 1996.
5. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat 2008. Available at: http://www.bepress.com/ijb/vol4/iss1/17/
6. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2012.
7. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat 2006;20.
8. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, Berkeley, CA, November 2003.
9. van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol 2007;6:Article 25.
10. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis 2006;240:351–71.
11. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In Aids epidemiology. Methodological issues. Basel: Birkhäuser, 1992:297–331.
12. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Am Stat Assoc 1995;900:122–9.
13. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003.
14. Robins JM, Rotnitzky A, van der Laan MJ. Comment on “on profile likelihood” by S.A. Murphy and A.W. van der Vaart. J Am Stat Assoc – Theory Methods 2000;450:431–5.
15. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, 2000.
16. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “inference for semiparametric models: some questions and an answer”. Stat Sin 2001;110:920–36.
17. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder). J Am Stat Assoc 1999;940:1096–120 (1121–46).
18. Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, et al. Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection. Stat Med 2009;28:152–72.
19. Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Int J Biostat 2010;6:Article 26. Available at: www.bepress.com/ijb/vol6/iss1/26
20. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat 2010;60.
21. Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Technical Report 265, UC Berkeley, CA, 2010.
22. Rosenblum M, van der Laan MJ. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat 2010;60.
23. Sekhon JS, Gruber S, Porter K, van der Laan MJ. Propensity-score-based estimators and C-TMLE. In: MJ van der Laan and S Rose, editors. Targeted learning: prediction and causal inference for observational and experimental data, chapter 21. New York: Springer, 2011.
24. Gruber S, van der Laan MJ. Targeted minimum loss based estimation of a causal effect on an outcome with known conditional bounds. Int J Biostat 2012;8.
25. Zheng W, van der Laan MJ. Asymptotic theory for cross-validated targeted maximum likelihood estimation. Technical Report 273, Division of Biostatistics, University of California, Berkeley, CA, 2010.
26. Zheng W, van der Laan MJ. Cross-validated targeted minimum loss based estimation. In: MJ van der Laan and S Rose, editors. Targeted learning: causal inference for observational and experimental data, chapter 21. New York: Springer, 2011:459–74.
27. van der Vaart AW. Asymptotic statistics. New York: Cambridge University Press, 1998.
28. Rotnitzky A, Lei Q, Sued M, Robins J. Improved double-robust estimation in missing data and causal inference models. Biometrika 2012;99:439–56.
29. Gruber S, van der Laan MJ. Targeted minimum loss based estimator that outperforms a given estimator. Int J Biostat 2012;80:Article 11.
30. Gruber S, van der Laan MJ. Marginal structural models. In: MJ van der Laan and S Rose, editors. C-TMLE of an additive point treatment effect, chapter 19. New York: Springer, 2011.
31. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat 2011;70:1–34.
32. Stitelman OM, van der Laan MJ. Collaborative targeted maximum likelihood for time to event data. Int J Biostat 2010:Article 21.
33. van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. Int J Biostat 2010;60.
34. van der Laan MJ, Rose S. Targeted learning: prediction and causal inference for observational and experimental data. New York: Springer, 2011.
35. Wang H, Rose S, van der Laan MJ. Finding quantitative trait loci genes. In: MJ van der Laan and S Rose, editors. Targeted learning: causal inference for observational and experimental data, chapter 23. New York: Springer, 2011.
36. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000;110:561–70.
37. Györfi L, Kohler M, Krzyżak A, Walk H. A distribution-free theory of nonparametric regression. New York: Springer-Verlag, 2002.
38. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Stat Decis 2006;240:373–95.
39. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol Biol 2004;3:Article 4.
40. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat Methodol 2005;20:131–54.
41. Polley EC, Rose S, van der Laan MJ. Super learning. In: MJ van der Laan and S Rose, editors. Targeted learning: causal inference for observational and experimental data, chapter 3. New York: Springer, 2011.
42. Polley EC, van der Laan MJ. Super learner in prediction. Technical report 200. Division of Biostatistics, UC Berkeley, Working Paper Series, 2010.
43. van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors. Ensemble machine learning. New York: Springer, 2012:117–56. ISBN 978-1-4419-9326-7.
44. van der Laan MJ. Efficient and inefficient estimation in semiparametric models. Center for Mathematics and Computer Science, CWI-tract 114. 1996.
45. Lee BK, Lessler J, Stuart EA. Improved propensity score weighting using machine learning. Stat Med 2009;29:337–46.
46. Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 2009;20:512–22. DOI: 10.1097/EDE.0b013e3181a663cc.
47. Vansteelandt S, Bekaert M, Claeskens G. On model selection and model misspecification in causal inference. Stat Methods Med Res 2010;21:7–30. DOI: 10.1177/0962280210387717.
48. Westreich D, Cole SR, Funk MJ, Brookhart MA, Sturmer T. The role of the c-statistic in variable selection for propensity scores. Pharmacoepidemiol Drug Saf 2011;20:317–20.
49. van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. Int J Biostat 2009;6.

About the article

Published Online: 2014-02-11

Published in Print: 2014-05-01


Citation Information: The International Journal of Biostatistics, Volume 10, Issue 1, Pages 29–57, ISSN (Online) 1557-4679, ISSN (Print) 2194-573X, DOI: https://doi.org/10.1515/ijb-2012-0038.

© 2014 by Walter de Gruyter Berlin / Boston.
