Mark J. van der Laan

Targeted Estimation of Nuisance Parameters to Obtain Valid Statistical Inference

De Gruyter | Published online: February 11, 2014

Abstract

In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.

1 Introduction and overview

This introduction provides an atlas for the contents of this article. It starts with formulating the role of estimation of nuisance parameters to obtain asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target this estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtain such a targeted estimator of the nuisance parameter is described. Subsequently, we present our concrete example to which we will apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems. Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we provide an organization of the remainder of the article.

1.1 The role of nuisance parameter estimation

Suppose we observe $n$ independent and identically distributed copies of a random variable $O$ with probability distribution $P_0$. In addition, assume that it is known that $P_0$ is an element of a statistical model $\mathcal{M}$ and that we want to estimate $\psi_0 = \Psi(P_0)$ for a given target parameter mapping $\Psi : \mathcal{M} \to \mathbb{R}$. In order to guarantee that $P_0 \in \mathcal{M}$ one is forced to only incorporate real knowledge, and, as a consequence, such models $\mathcal{M}$ are always very large and, in particular, infinite dimensional. We assume that the target parameter mapping is path-wise differentiable and let $D^*(P)$ denote the canonical gradient of the path-wise derivative of $\Psi$ at $P \in \mathcal{M}$ [1]. An estimator $\psi_n = \hat\Psi(P_n)$ is a functional $\hat\Psi$ applied to the empirical distribution $P_n$ of $O_1, \ldots, O_n$ and can thus be represented as a mapping $\hat\Psi : \mathcal{M}^{NP} \to \mathbb{R}$ from the non-parametric statistical model $\mathcal{M}^{NP}$ into the real line. An estimator $\hat\Psi$ is efficient if and only if it is asymptotically linear with influence curve $D^*(P_0)$:

$$\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^n D^*(P_0)(O_i) + o_P(1/\sqrt{n}).$$

The empirical mean of the influence curve $D^*(P_0)$ represents the first-order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals (i.e. $\hat\Psi$) of the empirical distribution [2–4].

Suppose that $\Psi(P)$ only depends on $P$ through a parameter $Q(P)$ and that the canonical gradient depends on $P$ only through $Q(P)$ and a nuisance parameter $g(P)$. The construction of an efficient estimator requires the construction of estimators $Q_n$ and $g_n$ of these nuisance parameters $Q_0$ and $g_0$, respectively. Targeted minimum loss-based estimation (TMLE) represents a method for construction of (e.g. efficient) asymptotically linear substitution estimators $\Psi(Q_n^*)$, where $Q_n^*$ is a targeted update of $Q_n$ that relies on the estimator $g_n$ [5–7]. The targeting of $Q_n$ is achieved by specifying a parametric submodel $\{Q_n(\epsilon) : \epsilon\} \subset \{Q(P) : P \in \mathcal{M}\}$ through the initial estimator $Q_n$ and a loss function $O \mapsto L(Q)(O)$ for $Q_0 = \arg\min_Q P_0 L(Q) \equiv \int L(Q)(o)\, dP_0(o)$, so that the generalized score $\frac{d}{d\epsilon} L(Q_n(\epsilon))\big|_{\epsilon = 0}$ spans a desired user-supplied estimating function $D(Q_n, g_n)$. In addition, one may decide to target $g_n$ by specifying a parametric submodel $\{g_n(\epsilon_1) : \epsilon_1\} \subset \{g(P) : P \in \mathcal{M}\}$ and loss function $O \mapsto L_1(g)(O)$ for $g_0 = \arg\min_g P_0 L_1(g)$, so that the generalized score $\frac{d}{d\epsilon_1} L_1(g_n(\epsilon_1))\big|_{\epsilon_1 = 0}$ spans another desired estimating function $D_1(g_n, \eta_n)$ for some estimator $\eta_n$ of a nuisance parameter $\eta$. The parameter $\epsilon$ is fitted with the MLE $\epsilon_n = \arg\min_\epsilon P_n L(Q_n(\epsilon))$, providing the first-step update $Q_n^1 = Q_n(\epsilon_n)$, and similarly $\epsilon_{1,n} = \arg\min_{\epsilon_1} P_n L_1(g_n(\epsilon_1))$. This updating process that maps a current fit $(Q_n, g_n)$ into an update $(Q_n^1, g_n^1)$ is iterated till convergence, at which point the TMLE $(Q_n^*, g_n^*)$ solves $P_n D(Q_n^*, g_n^*) = 0$, i.e. the empirical mean of the estimating function equals zero at the final TMLE $(Q_n^*, g_n^*)$. If one also targeted $g_n$, then it also solves $P_n D_1(g_n^*, \eta_n) = 0$. The submodel through $Q_n$ will depend on $g_n$, while the submodel through $g_n$ will depend on another nuisance parameter estimator $\eta_n$. By setting $D(Q, g)$ equal to the efficient influence curve $D^*(Q, g)$, the resulting TMLE solves the efficient influence curve estimating equation $P_n D^*(Q_n^*, g_n^*) = 0$ and thereby will be asymptotically efficient when $(Q_n^*, g_n^*)$ is consistent for $(Q_0, g_0)$, under appropriate regularity conditions, where the targeting of $g_n$ is not needed.

The latter is shown as follows. By the property of the canonical gradient (in fact, any gradient) we have $\Psi(Q_n^*) - \Psi(Q_0) = -P_0 D^*(Q_n^*, g_n^*) + R_n(Q_n^*, Q_0, g_n^*, g_0)$, where $R_n$ involves integrals of second-order products of the differences $(Q_n^* - Q_0)$ and $(g_n^* - g_0)$. Combined with $P_n D^*(Q_n^*, g_n^*) = 0$, this implies the following identity:

$$\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q_n^*, g_n^*) + R_n(Q_n^*, Q_0, g_n^*, g_0).$$

The first term is an empirical process term that, under empirical process conditions (mentioned below), equals $(P_n - P_0) D^*(Q, g)$, where $(Q, g)$ denotes the limit of $(Q_n^*, g_n^*)$, plus an $o_P(1/\sqrt{n})$-term. This then yields

$$\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q, g) + R_n(Q_n^*, Q_0, g_n^*, g_0) + o_P(1/\sqrt{n}).$$

To obtain the desired asymptotic linearity of $\Psi(Q_n^*)$ one needs $R_n = o_P(1/\sqrt{n})$, which in general requires at a minimum that both nuisance parameters are consistently estimated: $Q = Q_0$ and $g = g_0$. However, in many problems of interest, $R_n$ only involves a cross-product of the differences $Q_n^* - Q_0$ and $g_n^* - g_0$, so that $R_n$ converges to zero if either $Q_n^*$ or $g_n^*$ is consistent: i.e. $Q = Q_0$ or $g = g_0$. In this latter case, the TMLE is so-called double robust. Either way, the consistency of the TMLE now relies on one of the nuisance parameter estimators being consistent, thereby requiring the use of non-parametric adaptive estimation such as super-learning [8–10] for at least one of the nuisance parameters. If only one of the nuisance parameter estimators is consistent, and we are in the double robust scenario, then the bias of the TMLE is of the same order as the bias of the consistent nuisance parameter estimator. However, if that nuisance parameter estimator is not based on a correctly specified parametric model, but instead is a data-adaptive estimator, then this bias will converge to zero at a rate slower than $1/\sqrt{n}$: i.e. $\sqrt{n} R_n$ converges to infinity as $n \to \infty$. Thus, in that case, the estimator of the target parameter may be overly biased and thereby will not be asymptotically linear.
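For instance, in the treatment specific mean example developed in Section 1.3 below (see eq. (1)), this cross-product structure is explicit:

$$R_n(Q_n^*, Q_0, g_n, g_0) = P_0 \frac{(\bar Q_0 - \bar Q_n^*)(\bar g_0 - \bar g_n)}{\bar g_n},$$

so that $R_n$ converges to zero if either the outcome regression or the propensity score is estimated consistently, but is only $o_P(1/\sqrt{n})$ under the rate conditions discussed in this article.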

1.2 Targeting the fit of the nuisance parameter: general approach

In this article, we demonstrate that if $Q \neq Q_0$, then it is essential that the consistent nuisance parameter estimator $g_n$ be targeted toward the estimand so that the bias for the estimand becomes second order: that is, in our new TMLEs relying on consistent estimation of $g_0$ presented in this article, one simultaneously updates $g_n$ into a $g_n^*$ so that certain smooth functionals of $g_n^*$, derived from the study of $R_n$, are asymptotically linear under appropriate conditions. Even if both estimators $Q_n$ and $g_n$ are consistent, but $Q_n$ converges at a slower rate than $g_n$, this targeting of the nuisance parameter estimator may still remove finite sample bias for the estimand. In addition, we also present such a TMLE when only relying on one of the nuisance parameters being consistently estimated, without knowing which one: i.e. either $Q = Q_0$ or $g = g_0$. The same argument applies to other double robust estimators, such as estimating equation based estimators and inverse probability of treatment weighted (IPTW) estimators [11–16]. In fact, we demonstrate such a targeted IPTW-estimator in our next section.

The current article concerns the construction of such targeted IPTW-estimators and TMLEs that are asymptotically linear under regularity conditions, even when only one of the nuisance parameters is estimated consistently and the estimators of the nuisance parameters are highly data adaptive. In order to be concrete, we will focus on a particular example throughout this article. In this example we can concretely present the second-order term $R_n$ mentioned above and thereby develop the concrete form of the TMLE.

The same approach for construction of such TMLE can be carried out in much greater generality, but that is beyond the scope of this article. Nonetheless, it is helpful for the reader to know that the general approach is the following (considering the case that $g = g_0$, but $Q$ can be misspecified): (1) approximate $R_n(Q_n^*, Q_0, g_n^*, g_0) = \Phi_{0,n}(g_n^*) - \Phi_{0,n}(g_0) + R_{1,n}$ for some mapping $\Phi_{0,n}$ that depends on $P_0$ (e.g. through $Q_0$) and the data (e.g. $Q_n^*, g_n^*$), and where $R_{1,n}$ is a second-order term so that it is reasonable to assume $R_{1,n} = o_P(1/\sqrt{n})$; (2) approximate $\Phi_{0,n}(g_n^*) - \Phi_{0,n}(g_0) = \Phi_n(g_n^*) - \Phi_n(g_0) + R_{2,n}$, where $R_{2,n}$ is a second-order term and $\Phi_n$ is now a known (only based on data) mapping approximating $\Phi_{0,n}$; (3) construct $g_n^*$ so that it is a TMLE of the target parameter $\Phi_n(g_0)$, thereby allowing an expansion $\Phi_n(g_n^*) - \Phi_n(g_0) = (P_n - P_0) D_{1,n}(P_0) + R_{3,n}$, with $D_{1,n}(P_0)$ being the efficient influence curve of $\Phi_n(g_0)$. That is, in step 3, $g_n$ is iteratively updated to solve $P_n D_{1,n}(g_n^*, \eta_n) = 0$, with $D_{1,n}(P_0)$ depending on $P_0$ through $g_0$ and a nuisance parameter $\eta_0$, so that $\Phi_n(g_n^*)$ is an asymptotically linear estimator of $\Phi_n(g_0)$ under regularity conditions. After these three steps, we have that $R_n(Q_n^*, Q_0, g_n^*, g_0) = (P_n - P_0) D_{1,n}(P_0) + R_{1,n} + R_{2,n} + R_{3,n}$, where $R_{1,n} + R_{2,n} + R_{3,n} = o_P(1/\sqrt{n})$, and these steps provide us with the parameter $\Phi_n(g_0)$ that needs to be targeted by $g_n$, thereby telling us how to target $g_n$ in the TMLE of $\psi_0$. In addition, we can then conclude that this TMLE is asymptotically linear with known influence curve $D^*(Q, g_0) + D_1(P_0)$, where $D_1(P_0)$ represents the limit of the efficient influence curve $D_{1,n}(P_0)$ of $\Phi_n(g_0)$: $\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0)\{D^*(Q, g_0) + D_1(P_0)\} + o_P(1/\sqrt{n})$.

1.3 Concrete example covered in this article

Let us now formulate the concrete example we will cover in this article. Let $O = (W, A, Y) \sim P_0$, with $W$ baseline covariates, $A$ a binary treatment, and $Y$ a final outcome. Let $\mathcal{M}$ be a model that makes at most some assumptions about the conditional distribution of $A$, given $W$, but leaves the marginal distribution of $W$ and the conditional distribution of $Y$, given $A, W$, unspecified. Let $\Psi : \mathcal{M} \to \mathbb{R}$ be defined as $\Psi(P) = E_P E_P(Y \mid A = 1, W)$, the so-called treatment specific mean controlling for the baseline covariates. The canonical gradient, also called the efficient influence curve, of $\Psi$ at $P$ is given by $D^*(P)(O) = \frac{A}{g(1 \mid W)}(Y - \bar Q(1, W)) + \bar Q(1, W) - \Psi(P)$, where $g(1 \mid W) = P(A = 1 \mid W)$ is the propensity score and $\bar Q(a, W) = E_P(Y \mid A = a, W)$ is the outcome regression [13]. Let $Q = (Q_W, \bar Q)$, where $Q_W$ is the marginal distribution of $W$, and note that $\Psi(P)$ only depends on $P$ through $Q = Q(P)$. For convenience, we will denote the target parameter with $\Psi(Q)$ in order to not have to introduce additional notation. A targeted minimum loss-based estimator (TMLE) is a plug-in estimator $\Psi(Q_n^*)$, where $Q_n^*$ is an update of an initial estimator $Q_n$ that relies on an estimator $g_n$ of $g_0$, and it has the property that it solves $P_n D^*(Q_n^*, g_n) = 0$, where we use the notation $P f \equiv \int f(o)\, dP(o)$.
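To fix ideas, the following is a minimal, self-contained numerical sketch of this plug-in TMLE (our own illustration, not code from the article): simulated data and plain logistic regressions stand in for the super-learners discussed below, and the single logistic fluctuation uses the clever covariate $H_Y(\bar g)(A, W) = A/\bar g(W)$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import LogisticRegression

# Sketch of the standard TMLE of psi_0 = E_0 E_0(Y | A=1, W).
# Simulated data and logistic working models are illustrative placeholders.
rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.4 * W[:, 0] - 0.3 * W[:, 1]))))
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * A + W[:, 0] - 0.5 * W[:, 1]))))

# Initial estimators g_n and Qbar_n.
gbar = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], 0.01, 0.99)
Qfit = LogisticRegression().fit(np.column_stack([A, W]), Y)
Qbar_A = Qfit.predict_proba(np.column_stack([A, W]))[:, 1]            # Qbar_n(A, W)
Qbar_1 = Qfit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]   # Qbar_n(1, W)

logit = lambda p: np.log(p / (1 - p))

# Targeting step: Logit Qbar_n(eps) = Logit Qbar_n + eps * A/gbar, with eps
# fit by MLE for the quasi-log-likelihood loss.
def negloglik(eps):
    p = np.clip(1 / (1 + np.exp(-(logit(Qbar_A) + eps * A / gbar))), 1e-10, 1 - 1e-10)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

eps = minimize_scalar(negloglik).x
Qbar1_star = 1 / (1 + np.exp(-(logit(Qbar_1) + eps / gbar)))  # update at A = 1

psi_tmle = Qbar1_star.mean()  # plug-in over the empirical distribution of W
print(psi_tmle)
```

At the fitted $\epsilon$, the score equation $P_n \frac{A}{\bar g_n}(Y - \bar Q_n^*) = 0$ holds, which, combined with taking the empirical mean of $\bar Q_n^*(1, W)$ as the plug-in, yields $P_n D^*(Q_n^*, g_n) = 0$.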

For this particular example, such TMLEs are presented in Scharfstein et al. [17]; van der Laan and Rubin [7]; Bembom et al. [18–21]; Rosenblum and van der Laan [22]; Sekhon et al. [23]; van der Laan and Rose [6, 24]. Since $P_0 D^*(Q, g) = \psi_0 - \Psi(Q) + P_0 (\bar Q_0 - \bar Q)(\bar g_0 - \bar g)/\bar g$ [25, 26], where we use the notation $\bar g(W) = g(1 \mid W)$ and $\bar Q(W) = \bar Q(1, W)$, and $P_n D^*(Q_n^*, g_n) = 0$, we obtain the identity:

$$\Psi(Q_n^*) - \psi_0 = (P_n - P_0) D^*(Q_n^*, g_n) + P_0 (\bar Q_0 - \bar Q_n^*)(\bar g_0 - \bar g_n)/\bar g_n. \qquad (1)$$

The first term equals $(P_n - P_0) D^*(Q, g) + o_P(1/\sqrt{n})$ if $D^*(Q_n^*, g_n)$ falls in a $P_0$-Donsker class with probability tending to 1, and $P_0\{D^*(Q_n^*, g_n) - D^*(Q, g)\}^2 \to 0$ in probability as $n \to \infty$ [4, 27]. If $\bar Q_n$ and $\bar g_n$ are consistent for the true $\bar Q_0$ and $\bar g_0$, respectively, then the second term is a second-order term. If one now assumes that this second-order term is $o_P(1/\sqrt{n})$, it has been proven that the TMLE is asymptotically efficient. This provides the general basis for proving asymptotic efficiency of the TMLE when both $Q_0$ and $g_0$ are consistently estimated.

However, if only one of these nuisance parameter estimators is consistent, then the second term is still a first-order term, and it remains to establish that it is also asymptotically linear with a second-order remainder. For the sake of discussion, suppose that $\bar Q_n$ converges to a wrong $\bar Q$ while $\bar g_n$ is consistent. In that case, this remainder behaves in first order as $P_0 (\bar Q_0 - \bar Q)(\bar g_n - \bar g_0)/\bar g_0$. Establishing that such a term is asymptotically linear requires that $\bar g_n$ solve a particular estimating equation: that is, $\bar g_n$ needs to be a TMLE itself, targeting the required smooth functional of $g_0$. This is naturally achieved within the TMLE framework by specifying a submodel through $g_n$ and a loss function with the appropriate generalized score, so that a TMLE update step involves updating both $Q_n$ and $g_n$, and the iterative TMLE algorithm now results in a final TMLE $(Q_n^*, g_n^*)$, not only solving $P_n D^*(Q_n^*, g_n^*) = 0$ but also the additional equations that allow us to establish asymptotic linearity of the desired smooth functional of $g_n^*$: see the general description of TMLE above.

In this article, we present TMLEs that target $g_n$ in a manner that allows us to prove the desired asymptotic linearity of the second term on the right-hand side of eq. (1) when either $\bar g_n$ or $\bar Q_n$ is consistent, under conditions that require specified second-order terms to be $o_P(1/\sqrt{n})$. The latter type of regularity conditions is typical for the construction of asymptotically linear estimators and is therefore considered appropriate for the sake of this article. Though it is of interest to study cases in which these second-order terms cannot be assumed to be $o_P(1/\sqrt{n})$, this is beyond the scope of this article.

1.4 Relation to current literature on targeted nuisance parameter estimators

The construction of TMLEs that utilize targeting of the nuisance parameter $g_n$ has been carried out in earlier papers. For example, in van der Laan and Rubin [7], we target $g_n$ to obtain a TMLE that, beyond being double robust and locally efficient, also equals the IPTW-estimator. In Gruber and van der Laan [29], we target $g_n$ to guarantee that the TMLE, beyond being double robust and locally efficient, also outperforms a user-supplied estimator, based on the original idea of Rotnitzky et al. [28]. In that sense, the distinction between the current article and these previous articles is that $g_n$ is now targeted to guarantee that the TMLE remains asymptotically linear when $Q_n$ is misspecified. This task of targeting $g_n$ appears to be one step more complicated than in these previous articles, since the smooth functionals of $g_n$ that need to be targeted are themselves indexed by parameters of the true data distribution $P_0$, and thus unknown. As mentioned above, our strategy is to approximate these unknown smooth functionals by an estimated smooth functional and develop the targeted estimator $g_n^*$ that targets this estimated parameter of $g_0$.

The TMLEs presented in this article are always iterative and thereby rely on convergence of the iterative updating algorithm. Since the empirical risk decreases at each updating step, such convergence is typically guaranteed by the existence of the MLE at each updating step (e.g. an MLE of the coefficient in a logistic regression). Either way, in this article, we assume this convergence to hold. Since the assumptions of our theorems require $g_n^*(1 \mid W)$ to be bounded away from zero, we demonstrate how this property can be achieved by using submodels for updating $g_n$ that guarantee this property. Detailed simulations will appear in a future article.

1.5 Organization

The organization of this paper is as follows. In Section 2, we introduce a targeted IPTW-estimator that relies on an adaptive consistent estimator of $g_0$, and we establish its asymptotic linearity with known influence curve, allowing for the construction of asymptotically valid confidence intervals based on this adaptive IPTW-estimator. In the remainder of the article, we focus on the construction of TMLEs involving the targeting of $g_n$ needed to establish the asymptotic linearity of the resulting TMLE under appropriate conditions. In Section 3, we introduce a novel TMLE that assumes that the targeted adaptive estimator $g_n^*$ is consistent for $g_0$, and we establish its asymptotic linearity. In Section 4, we introduce a novel TMLE that only assumes that either the targeted $\bar Q_n^*$ or the targeted $\bar g_n^*$ is consistent, and we establish its asymptotic linearity with known influence curve. This TMLE needs to protect the asymptotic linearity under misspecification of either $g_n$ or $\bar Q_n$, and, as a consequence, relies on targeting of $g_n$ (in order to preserve asymptotic linearity when $\bar Q_n$ is inconsistent), but also on extra targeting of $\bar Q_n$ (in order to preserve asymptotic linearity when $\bar Q_n$ is consistent, but $g_n$ is inconsistent). The explicit form of the influence curve of this TMLE allows us to construct asymptotic confidence intervals. Since this result allows statistical inference in the statistical model that only assumes that one of the estimators is consistent, we refer to this as “double robust statistical inference”. Even though double robust estimators have been extensively presented in the current literature, double robust statistical inference in these large semi-parametric models has been a difficult topic: typically, it has been suggested to use the non-parametric bootstrap, but there is no theory supporting that the non-parametric bootstrap is a valid method when the estimators rely on data-adaptive estimation.

In Section 5, we extend the TMLE of Section 3 (which relies on $g_n^*$ being consistent for $g_0$) to the case that $g_n^*$ converges to a possibly misspecified $g$, but one that suffices for consistent estimation of $\psi_0$ in the sense that $\Psi(Q_n^*)$ will be consistent. We present a corresponding asymptotic linearity theorem for this TMLE that is able to utilize the so-called collaborative double robustness of the efficient influence curve, which states that $\Psi(Q) = \psi_0$ if $P_0 D^*(Q, g) = 0$ and $g \in \mathcal{G}(Q, P_0)$ for a set $\mathcal{G}(Q, P_0)$ (including $g_0$). In order to construct a collaborative estimator $g_n$ that aims to converge to an element in $\mathcal{G}(Q, P_0)$ in collaboration with $Q_n$, we use the framework of the collaborative targeted minimum loss-based estimator (C-TMLE) [20, 29–35]. Our asymptotic linearity theorem can now be applied to this C-TMLE. Again, even though C-TMLEs have been presented in the current literature, statistical inference based on C-TMLEs has been another challenging topic, and Section 5 provides us with a C-TMLE with known influence curve. We conclude this article with a discussion. The proofs of the theorems are presented in the Appendix.

1.6 Notation

In the following sections, we will use the following notation. We have $O = (W, A, Y) \sim P_0 \in \mathcal{M}$, where $\mathcal{M}$ is a statistical model that makes only assumptions on the conditional distribution of $A$, given $W$. Let $g_0(a \mid W) = P_0(A = a \mid W)$ and $\bar g_0(W) = P_0(A = 1 \mid W)$. The target parameter is $\Psi : \mathcal{M} \to \mathbb{R}$ defined by $\Psi(P_0) = E_{Q_{W,0}} \bar Q_0(1, W)$, where $\bar Q_0(1, W) = E_{P_0}(Y \mid A = 1, W)$, which will also be denoted with $\bar Q_0(W)$, and $Q_{W,0}$ is the distribution of $W$ under $P_0$. We also use the notation $\Psi(Q)$, where $Q = (Q_W, \bar Q)$. In addition, $D^*(Q, g)$ denotes the efficient influence curve of $\Psi$ at $(Q, g)$. We also use the following notation:

$$H_A(\bar Q^r, \bar g) = \bar Q^r / \bar g$$
$$H_0^r = \bar Q_0^r / \bar g_0$$
$$D_A(\bar Q^r, \bar g)(A, W) = H_A(\bar Q^r, \bar g)(W)\,(A - \bar g(W))$$
$$H_Y(\bar g)(A, W) = A / \bar g(W)$$
$$\bar Q_0^r(\bar Q, \bar g) = E_{P_0}(Y - \bar Q \mid A = 1, \bar g)$$
$$\bar Q_0^r = \bar Q_0^r(\bar Q, \bar g_0)$$
$$\bar g_0^r(\bar g, \bar Q) = E_0(A \mid \bar g, \bar Q)$$
$$\bar Q_0^r(\bar g) = E_0(Y \mid A = 1, \bar g) \quad \text{(only used for the IPTW-estimator, Section 2)}$$
$$\bar Q_0^r = \bar Q_0^r(\bar g_0) \quad \text{(only used for the IPTW-estimator, Section 2)}$$
$$\|f\|_0 = \{P_0 f^2\}^{0.5}.$$

2 Statistical inference for IPTW-estimator when using super-learning to fit treatment mechanism

We first describe an IPTW-estimator that uses super-learning to fit the treatment mechanism $g_0$. Subsequently, we present this IPTW-estimator, but now using an update of the super-learning fit of $g_0$, and we present a theorem establishing the asymptotic linearity of this targeted IPTW-estimator under appropriate conditions. Finally, we discuss how this targeted IPTW-estimator compares with an IPTW-estimator that relies on a parametric model to fit the treatment mechanism.

2.1 An IPTW-estimator using super-learning to fit the treatment mechanism

We consider a simple IPTW-estimator $\hat\Psi(P_n) = P_n D(\hat g(P_n))$, where $D(g)(O) = Y A/\bar g(W)$, and $\hat g : \mathcal{M}^{NP} \to \mathcal{G}$ is an adaptive estimator of $g_0$ based on the log-likelihood loss function $L(g)(O) \equiv -\log g(A \mid W)$. For a general presentation of IPTW-estimators, we refer to Robins and Rotnitzky [11], van der Laan and Robins [13], and Hernan et al. [36]. We wish to establish conditions under which reliable statistical inference based on this estimator of $\psi_0$ can be obtained. One might wish to estimate $g_0$ with ensemble learning, and, in particular, super-learning, in which cross-validation [37] is used to determine the best weighted combination of a library of candidate estimators: van der Laan and Dudoit [8]; van der Laan et al. [9, 38, 39]; van der Vaart et al. [10]; Dudoit and van der Laan [40]; Polley et al. [41]; Polley and van der Laan [42]; van der Laan and Petersen [43]. The super-learner is a general template for construction of an adaptive estimator based on a library of candidate estimators, a loss function whose expectation is minimized over the parameter space by the true parameter value, and a parametric family that defines “weighted” combinations of the estimators in the library. We will start by presenting a succinct description of a particular super-learner. Consider a library of estimators $\hat g_j : \mathcal{M}^{NP} \to \mathcal{G}$, $j = 1, \ldots, J$, and a family of weighted (on the logistic scale) combinations of these estimators, $\mathrm{Logit}\, \hat g_\alpha(1 \mid W) = \sum_{j=1}^J \alpha_j \mathrm{Logit}\, \hat g_j(1 \mid W)$, indexed by vectors $\alpha$ for which $\alpha_j \in [0, 1]$ and $\sum_j \alpha_j = 1$. Consider a random sample split $B_n \in \{0, 1\}^n$ into a training sample $\{i : B_n(i) = 0\}$ of size $n(1 - p)$ and a validation sample $\{i : B_n(i) = 1\}$ of size $np$, and let $P_{n, B_n}^1$ and $P_{n, B_n}^0$ denote the empirical distributions of the validation and training samples, respectively. Define

$$\alpha_n = \arg\min_\alpha E_{B_n} P_{n, B_n}^1 L\big(\hat g_\alpha(P_{n, B_n}^0)\big) = \arg\min_\alpha E_{B_n} \frac{1}{np} \sum_{i : B_n(i) = 1} L\big(\hat g_\alpha(P_{n, B_n}^0)\big)(O_i)$$

as the weight vector that minimizes the cross-validated risk. The super-learner of $g_0$ is defined as the estimator $\hat g(P_n) = \hat g_{\alpha_n}(P_n)$.
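The following is a minimal sketch of this logit-scale super-learner (our own illustration: the two-algorithm library, the softmax parametrization of the simplex, and the use of $V$-fold cross-validation in place of the single split $B_n$ are all assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Cross-validated predictions for each candidate estimator in the library.
def cv_candidate_preds(W, A, library, folds=5):
    preds = np.zeros((len(A), len(library)))
    for train, valid in KFold(folds, shuffle=True, random_state=0).split(W):
        for j, make_est in enumerate(library):
            fit = make_est().fit(W[train], A[train])
            preds[valid, j] = np.clip(fit.predict_proba(W[valid])[:, 1], 1e-4, 1 - 1e-4)
    return preds

# alpha_n: convex weights on the logit scale minimizing the cross-validated
# negative log-likelihood, i.e. the loss L(g) = -log g(A | W) above.
def superlearner_weights(preds, A):
    logits = np.log(preds / (1 - preds))
    def cv_risk(beta):
        alpha = np.exp(beta) / np.exp(beta).sum()   # softmax keeps alpha in the simplex
        p = 1 / (1 + np.exp(-logits @ alpha))
        return -np.mean(A * np.log(p) + (1 - A) * np.log(1 - p))
    beta = minimize(cv_risk, np.zeros(preds.shape[1]), method="Nelder-Mead").x
    return np.exp(beta) / np.exp(beta).sum()

library = [lambda: LogisticRegression(),
           lambda: RandomForestClassifier(n_estimators=200, min_samples_leaf=25,
                                          random_state=0)]
rng = np.random.default_rng(1)
W = rng.normal(size=(1000, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))
alpha = superlearner_weights(cv_candidate_preds(W, A, library), A)
print(alpha)  # the final g_n refits each candidate on all data and combines with alpha
```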

2.2 Asymptotic linearity of a targeted data-adaptive IPTW-estimator

The next theorem presents an IPTW-estimator that uses a targeted fit $g_n^*$ of $g_0$, obtained by updating an initial estimator $g_n$, and conditions under which this IPTW-estimator of $\psi_0$ is asymptotically linear. For example, $g_n$ could be defined as a super-learner of the type presented above. In spite of the fact that such an IPTW-estimator uses a very data-adaptive and hard-to-understand estimator $g_n^*$, this theorem shows that its influence curve is known and can be well estimated.

Theorem 1. We consider a targeted IPTW-estimator $\hat\Psi(P_n) = P_n D(g_n^*)$, where $D(g)(O) = Y A/g(A \mid W)$, and $g_n^*$ is an update of an initial estimator $g_n$ of $g_0 \in \mathcal{G}$, defined below.

Definition of targeted estimator $g_n^*$: Let $\bar Q_n^r$ be obtained by non-parametric estimation of the regression function $E_{P_0}(Y \mid A = 1, \bar g_n(W))$, treating $\bar g_n$ as a fixed covariate (i.e. a function of $W$). This yields an estimator $H_n^r \equiv \bar Q_n^r / \bar g_n$ of $H_0^r = \bar Q_0^r / \bar g_0$, where $\bar Q_0^r = E_{P_0}(Y \mid A = 1, \bar g_0)$. Consider the submodel $\mathrm{Logit}\, \bar g_n(\epsilon) = \mathrm{Logit}\, \bar g_n + \epsilon H_n^r$, and fit $\epsilon$ with the MLE

$$\epsilon_n = \arg\max_\epsilon P_n \log g_n(\epsilon).$$

We define $g_n^* = g_n(\epsilon_n)$ as the corresponding targeted update of $g_n$. This TMLE $g_n^*$ satisfies

$$P_n D_A(\bar Q_n^r, \bar g_n^*) = 0.$$

Empirical process condition: Assume that $D(g_n^*)$ and $D_A(\bar Q_n^r, \bar g_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1.

Negligibility of second-order terms: Define $\bar Q_{0,n}^r \equiv E_{P_0}(Y \mid A = 1, \bar g_0(W), \bar g_n^*(W))$. Assume $\bar g_n^* > \delta > 0$ with probability tending to 1, and assume

$$\|\bar Q_n^r - \bar Q_0^r\|_0 = o_P(1)$$
$$\|\bar g_n^* - \bar g_0\|_0^2 = o_P(1/\sqrt{n})$$
$$\|\bar Q_n^r - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g_0\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar Q_{0,n}^r - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g_0\|_0 = o_P(1/\sqrt{n}).$$

Then,

$$\hat\Psi(P_n) - \psi_0 = (P_n - P_0)\, IC(P_0) + o_P(1/\sqrt{n}),$$

where

$$IC(P_0)(O) = Y A/g_0(A \mid W) - \psi_0 - H_0^r(W)\,(A - \bar g_0(W)).$$

So, under the conditions of this theorem, we can construct an asymptotic 0.95-confidence interval $\psi_n \pm 1.96\,\sigma_n/\sqrt{n}$ based on this targeted IPTW-estimator $\psi_n = \hat\Psi(P_n)$, where

$$\sigma_n^2 = P_n IC_n^2 = \frac{1}{n}\sum_{i=1}^n IC_n(O_i)^2,$$

and $IC_n(O) = Y A/\bar g_n^*(W) - \psi_n - H_n^r(W)(A - \bar g_n^*(W))$ is the plug-in estimator of the influence curve $IC(P_0)$, obtained by plugging in $g_n$ or $g_n^*$ for $g_0$ and $\bar Q_n^r$ for $\bar Q_0^r$.
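The following self-contained sketch (our own illustration) walks through the estimator of Theorem 1 on simulated data: a logistic regression stands in for the super-learner of $g_0$, and a $k$-nearest-neighbor regression stands in for the non-parametric estimator $\bar Q_n^r$ of $E_{P_0}(Y \mid A = 1, \bar g_n(W))$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor

# Sketch of the targeted IPTW-estimator of Theorem 1 (illustrative choices).
rng = np.random.default_rng(2)
n = 4000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * W[:, 0] - 0.4 * W[:, 1]))))
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + W[:, 0]))))

# Initial fit g_n of the treatment mechanism (placeholder for a super-learner).
gbar = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], 0.05, 0.95)

# Qbar_n^r: non-parametric regression of Y on gbar(W) among the treated.
knn = KNeighborsRegressor(n_neighbors=50).fit(gbar[A == 1][:, None], Y[A == 1])
H_r = knn.predict(gbar[:, None]) / gbar          # clever covariate H_n^r

# Targeting step: one-dimensional logistic fluctuation of gbar with covariate H_r.
logit_g = np.log(gbar / (1 - gbar))
def negloglik(eps):
    p = np.clip(1 / (1 + np.exp(-(logit_g + eps * H_r))), 1e-10, 1 - 1e-10)
    return -np.mean(A * np.log(p) + (1 - A) * np.log(1 - p))
eps = minimize_scalar(negloglik).x
gbar_star = 1 / (1 + np.exp(-(logit_g + eps * H_r)))

psi = np.mean(Y * A / gbar_star)                 # targeted IPTW-estimator
IC = Y * A / gbar_star - psi - H_r * (A - gbar_star)
se = IC.std() / np.sqrt(n)
print(psi, (psi - 1.96 * se, psi + 1.96 * se))   # 0.95-confidence interval
```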

Regarding the displayed second-order term conditions, we note that these are satisfied if $\bar g_n^* - \bar g_0$ converges to zero w.r.t. the $L^2(P_0)$-norm at rate $o_P(n^{-1/4})$, $\bar g_n^* > \delta > 0$ for some $\delta > 0$ with probability tending to 1 as $n \to \infty$, and the product of the rates at which $\bar g_n^*$ converges to $\bar g_0$ and at which $(\bar Q_n^r, \bar Q_{0,n}^r)$ converges to $\bar Q_0^r$ is $o_P(1/\sqrt{n})$.

Regarding the empirical process condition, we note that an example of a Donsker class is the class of multivariate real-valued functions with uniform sectional variation norm bounded by a universal constant [44]. It is important to note that if each estimator in the library falls in such a class, then the convex combinations also fall in that same class [4]. So this Donsker condition will hold if it holds for each of the candidate estimators in the library of the super-learner.

2.3 Comparison of targeted data-adaptive IPTW and an IPTW using parametric model

Consider an IPTW-estimator using an MLE $g_{n,1}$ according to a parametric model for $g_0$, and let us contrast this IPTW-estimator with the IPTW-estimator defined in the above theorem, based on an initial super-learner $g_n$ that includes $g_{n,1}$ as an element of the library of estimators. Let us first consider the case that the parametric model is correctly specified. In that case $g_{n,1}$ converges to $g_0$ at a parametric rate $1/\sqrt{n}$. From the oracle inequality for cross-validation [8, 10, 38], it follows that $g_n$ also converges to $g_0$ at the rate $1/\sqrt{n}$, possibly up to a $\log n$-factor, in case the number of algorithms in the library is of the order $n^q$ for some fixed $q$. As a consequence, all the consistency and second-order term conditions for the IPTW-estimator using a targeted $g_n^*$ based on $g_n$ hold. If one uses estimators in the library of algorithms that have a uniform sectional variation norm smaller than an $M < \infty$ with probability tending to 1, then a weighted average of these estimators will also have uniform sectional variation norm smaller than $M < \infty$ with probability tending to 1. Thus, in that case we will also have that $D(g_n^*)$ and $D_A(\bar Q_n^r, \bar g_n^*)$ fall in a $P_0$-Donsker class. Examples of estimators that control the uniform sectional variation norm are any parametric model with fewer than $K$ main terms that themselves have a bounded uniform sectional variation norm, but also penalized least-squares estimators (e.g. Lasso) using basis functions with bounded uniform sectional variation norm; and one could map any estimator into this space of functions with universally bounded uniform sectional variation norm through a smoothing operation. Thus, under this restriction on the library, the IPTW-estimator using the super-learner is asymptotically linear with influence curve $IC(P_0)(O)$ as stated in the theorem. We note that $IC(P_0)$ is the efficient influence curve for the target parameter $E_{P_0} E_{P_0}(Y \mid A = 1, \bar g_0(W))$ if the observed data were $(\bar g_0(W), A, Y)$ instead of $O = (W, A, Y)$.

The parametric IPTW-estimator is asymptotically linear with influence curve $O \mapsto Y A/g_0(A \mid W) - \psi_0 - \Pi(Y A/\bar g_0(W) \mid T_g)$, where $T_g$ is the tangent space of the parametric model for $g_0$, and $\Pi(f \mid T_g)$ denotes the projection of $f$ onto $T_g$ in the Hilbert space $L_0^2(P_0)$ [13]. This IPTW-estimator could be less or more efficient than the IPTW-estimator using the targeted super-learner, depending on the actual tangent space of the parametric model.

For example, if the parametric model happens to have a score equal to $O \mapsto \bar Q_0(W)(A/\bar g_0(W) - 1)$, then the parametric IPTW-estimator would be asymptotically efficient. Of course, a standard parametric model is not tailored to correspond with such optimal scores, but this shows that we cannot claim superiority of one versus the other in the case that the parametric model for $g_0$ is correctly specified.

If, on the other hand, the parametric model is misspecified, then the IPTW-estimator using $g_{n,1}$ is inconsistent. However, the super-learner $g_n$ will be consistent if the library contains a non-parametric adaptive estimator, and it will perform asymptotically as well as the oracle selector among all the weighted combinations of the algorithms in the library. To conclude, the IPTW-estimator using super-learning to estimate $g_0$ will be as good as the IPTW-estimator using a correctly specified parametric model (included in the library of the super-learner), but will remain consistent and asymptotically linear in a much larger model than the parametric IPTW-estimator, which relies on the true $g_0$ being an element of the parametric model.

3 Statistical inference for TMLE when using super-learning to consistently fit treatment mechanism

In the next subsection, we present a TMLE that targets the fit of the treatment mechanism, analogous to the targeted IPTW-estimator presented above. In addition, this subsection presents a formal asymptotic linearity theorem demonstrating that, under reasonable conditions, this TMLE will be asymptotically linear even when $\bar Q_n$ is inconsistent. We conclude this section with a subsection showing how the iterative updating of the treatment mechanism can be carried out in such a way that the final fit of the treatment mechanism is still bounded away from zero, as required to obtain a stable estimator.

3.1 Asymptotic linearity of a TMLE using a targeted estimator of the treatment mechanism

The following theorem presents a novel TMLE and a corresponding asymptotic linearity result with specified influence curve, where we rely on consistent estimation of $g_0$. The TMLE still uses the same updating step for the estimator of $\bar Q_0$ as the regular TMLE [7], but uses a novel updating step for the estimator of $g_0$, analogous to the updating step of the IPTW-estimator in the previous section. We remind the reader of the importance of using logistic fluctuations as working-submodels for $\bar Q_0$ in the definition of the TMLE, guaranteeing that the TMLE update stays within the bounded parameter space (see, e.g. Gruber and van der Laan [19]).

Theorem 2

Iterative targeted MLE of $\psi_0$:

Definitions: Given $(\bar Q, \bar g)$, let $\bar Q_n^r(\bar Q, \bar g)$ be a consistent estimator of the regression $\bar Q_0^r(\bar Q, \bar g) = E_{P_0}(Y - \bar Q \mid A = 1, \bar g)$ of $(Y - \bar Q)$ on $\bar g(W)$ among observations with $A = 1$. Let $(g_n, \bar Q_n)$ be an initial estimator of $(g_0, \bar Q_0)$.

Initialization: Let $g_n^0 = g_n$, $\bar Q_n^0 = \bar Q_n$, and $\bar Q_n^{r,0} = \bar Q_n^r(\bar Q_n^0, \bar g_n^0)$. Let $k = 0$.

Updating step for $g_n^k$: Consider the submodel $\mathrm{Logit}\, \bar g_n^k(\epsilon) = \mathrm{Logit}\, \bar g_n^k + \epsilon H_A(\bar Q_n^{r,k}, \bar g_n^k)$, and fit $\epsilon$ with the MLE

$$\epsilon_n = \arg\max_\epsilon P_n \log g_n^k(\epsilon).$$

We define $g_n^{k+1} = g_n^k(\epsilon_n)$ as the corresponding update of $g_n^k$. This $g_n^{k+1}$ satisfies

$$\frac{1}{n}\sum_{i=1}^n H_A\big(\bar Q_n^{r,k}, \bar g_n^k\big)(W_i)\,\big(A_i - \bar g_n^{k+1}(W_i)\big) = 0.$$

Updating step for $\bar Q_n^k$: Let $L(\bar Q)(O) \equiv -\{Y \log \bar Q(A, W) + (1 - Y)\log(1 - \bar Q(A, W))\}$ be the quasi-log-likelihood loss function for $\bar Q_0 = E_0(Y \mid A, W)$ (allowing that $Y$ is continuous in $[0, 1]$). Consider the submodel $\mathrm{Logit}\, \bar Q_n^k(\epsilon) = \mathrm{Logit}\, \bar Q_n^k + \epsilon H_Y(g_n^k)$, and let $\epsilon_n = \arg\min_\epsilon P_n L(\bar Q_n^k(\epsilon))$. Define $\bar Q_n^{k+1} = \bar Q_n^k(\epsilon_n)$ as the resulting update. Define $\bar Q_n^{r,k+1} = \bar Q_n^r(\bar Q_n^{k+1}, \bar g_n^{k+1})$.

Iterating till convergence: Now, set $k \leftarrow k + 1$, and iterate this updating process mapping a $(g_n^k, \bar Q_n^k, \bar Q_n^{r,k})$ into $(g_n^{k+1}, \bar Q_n^{k+1}, \bar Q_n^{r,k+1})$ till convergence, or till a large enough $K$ so that the estimating equations (2) below are solved up to an $o_P(1/\sqrt{n})$-term. Denote the limit of this iterative procedure with $(g_n^*, \bar Q_n^*, \bar Q_n^{r*})$.

Plug-in estimator: Let $Q_n^* = (Q_{W,n}, \bar Q_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Estimating equations solved by TMLE: This TMLE $(Q_n^*, g_n^*, \bar Q_n^{r*})$ solves

$$P_n D^*(Q_n^*, g_n^*) = 0$$
$$P_n D_A(\bar Q_n^{r*}, \bar g_n^*) = 0. \qquad (2)$$

Empirical process condition: Assume that $D^*(Q_n^*, g_n^*)$ and $D_A(\bar Q_n^{r*}, \bar g_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n \to \infty$.

Negligibility of second-order terms: Define

$$\bar Q_{0,n}^r(W) \equiv E_{P_0}\big(Y - \bar Q(1, W) \mid A = 1, \bar g_n^*(W), \bar g_0(W)\big)$$
$$\bar Q_0^r(W) \equiv E_{P_0}\big(Y - \bar Q(1, W) \mid A = 1, \bar g_0(W)\big)$$
$$H_{0,n}^r = \bar Q_{0,n}^r / \bar g_n^*$$
$$H_0^r = \bar Q_0^r / \bar g_0,$$

where $\bar g_n^*(W)$ is treated as a fixed covariate (i.e. a function of $W$) in the conditional expectation $\bar Q_{0,n}^r$. Assume that there exists a $\delta > 0$, so that $\bar g_n^* > \delta > 0$ with probability tending to 1, and

$$\|\bar Q_n^* - \bar Q\|_0 = o_P(1)$$
$$\|\bar Q_n^{r*} - \bar Q_0^r\|_0 = o_P(1)$$
$$\|\bar g_n^* - \bar g_0\|_0 \, \|\bar Q_n^* - \bar Q\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar Q_{0,n}^r - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g_0\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar g_n^* - \bar g_0\|_0^2 = o_P(1/\sqrt{n})$$
$$\|\bar Q_n^{r*} - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g_0\|_0 = o_P(1/\sqrt{n}).$$

Then,

$$\Psi(Q_n^*) - \psi_0 = (P_n - P_0)\, IC(P_0) + o_P(1/\sqrt{n}),$$

where $IC(P_0) = D^*(Q, g_0) - D_A(\bar Q_0^r, \bar g_0)$.

Thus, under the assumptions of this theorem, an asymptotic 0.95-confidence interval is given by $\psi_n \pm 1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2 = P_n IC_n^2$ and $IC_n = D^*(Q_n^*, g_n^*) - D_A(\bar Q_n^{r*}, \bar g_n^*)$.
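A schematic sketch of a single pass of this iterative algorithm follows (our own scaffolding: the kNN regression for $\bar Q_n^r$ and the offset-logistic MLEs for the fluctuations are placeholder choices, and the positivity-preserving submodel of Section 3.2 below is omitted).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.neighbors import KNeighborsRegressor

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return np.log(p / (1 - p))

def fit_eps(offset, H, target):
    # one-dimensional offset-logistic MLE for a fluctuation parameter
    def nll(eps):
        p = np.clip(expit(offset + eps * H), 1e-10, 1 - 1e-10)
        return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    return minimize_scalar(nll).x

def tmle_iteration(A, Y, gbar, Qbar1):
    # Qbar_n^{r,k}: regress Y - Qbar on gbar among the treated
    Qbar_r = (KNeighborsRegressor(50)
              .fit(gbar[A == 1][:, None], (Y - Qbar1)[A == 1])
              .predict(gbar[:, None]))
    # g-update with clever covariate H_A = Qbar_r / gbar
    H_A = Qbar_r / gbar
    gbar = expit(logit(gbar) + fit_eps(logit(gbar), H_A, A) * H_A)
    # Q-update with clever covariate H_Y = A / gbar; untreated units have
    # H_Y = 0, so only treated units (where Qbar(A,W) = Qbar(1,W)) inform eps
    eps = fit_eps(logit(Qbar1), A / gbar, Y)
    Qbar1 = expit(logit(Qbar1) + eps / gbar)   # evaluated at A = 1
    return gbar, Qbar1

# Iterate tmle_iteration until (gbar, Qbar1) stabilize; then the TMLE is
# psi = Qbar1.mean(), with estimated influence curve D*(Q*,g*) - D_A(Qbar_r*, gbar*).
```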

3.2 Using a $\delta$-specific submodel for targeting $g$ that guarantees the positivity condition

The following is an application of the constrained logistic regression approach of the type presented in Gruber and van der Laan [19], for the purpose of estimating $\bar g_0$ while respecting the constraint that $\bar g_0 > \delta > 0$ for a known $\delta > 0$. Recall that $A \in \{0, 1\}$. Suppose that it is known that $\bar g_0(W) \in (\delta, 1]$ for some $\delta > 0$, a condition the asymptotic linearity of our proposed estimators relies upon. Define $A_\delta \equiv \frac{A - \delta}{1 - \delta}$. We have $\bar g_0(W) = \delta + (1 - \delta)\bar g_{\delta,0}$, where $\bar g_{\delta,0} = E_0(A_\delta \mid W)$ is a regression that is known to lie in $[0, 1]$. Let $g_{\delta,n}^0$ be an initial estimator of the true conditional distribution $g_{\delta,0}$ of $A_\delta$, given $W$, which implies an estimator $\bar g_n^0 = \delta + (1 - \delta)\bar g_{\delta,n}^0$ of $\bar g_0$. Let $k = 0$. Consider the following submodel for the conditional distribution of $A_\delta$, given $W$, through a given estimator $g_{\delta,n}^k$:

$$\mathrm{Logit}\, \bar g_{\delta,n}^k(\epsilon) = \mathrm{Logit}\, \bar g_{\delta,n}^k + \epsilon H_A(\bar Q_n^{r,k}, \bar g_{\delta,n}^k).$$

The MLE is simply obtained with a logistic regression of $A_\delta$ on $W$ (see, e.g. Gruber and van der Laan [19]), based on the quasi-log-likelihood loss function:

$$\epsilon_n = \arg\min_\epsilon P_n L\big(\bar g_{\delta,n}^k(\epsilon)\big),$$

where

$$L(\bar g_\delta)(O) = -\big\{A_\delta \log \bar g_\delta(W) + (1 - A_\delta)\log(1 - \bar g_\delta(W))\big\}$$

is the quasi-log-likelihood loss. The update g ˉ δ , n k + 1 = g ˉ δ , n k ( n ) implies an update g ˉ n k + 1 = δ + ( 1 δ ) g ˉ δ , n k + 1 of g ˉ n k = δ + ( 1 δ ) g ˉ δ , n k , and, by construction g ˉ n k + 1 > δ > 0 . The above submodel g ˉ n k ( ) = δ + ( 1 δ ) g ˉ δ , n k ( ) and corresponding loss function L ( g ˉ ) = L ( g ˉ δ ) generates the same score equation as the submodel and loss function used in Theorem 2. Therefore, the TMLE algorithm presented in Theorem 2 but now using this δ -specific logistic regression model solves the same estimating equations, so that the same Theorem 2 immediately applies. However, using this submodel we have now guaranteed that g ˉ n k > δ > 0 for all k in the iterative TMLE algorithm, and thereby that g ˉ n > δ > 0 .

4 Double robust statistical inference for TMLE when using super-learning to fit outcome regression and treatment mechanism

In this section, our aim is to present a TMLE that is asymptotically linear with known influence curve if either $g_0$ or $Q_0$ is consistently estimated, without needing to know which one. Again, this requires a novel way of targeting the estimators $(g_n, \bar Q_n)$ in order to arrange that the relevant smooth functionals of these nuisance parameter estimators are indeed asymptotically linear under appropriate second-order term conditions. In this case, we also need to augment the submodel for the estimator of $\bar Q_0$ with another clever covariate: that is, our estimator of $\bar Q_0$ needs to be doubly targeted, once for solving the efficient influence curve equation, and once for achieving asymptotic linearity in the case that the estimator of $g_0$ is misspecified.

Theorem 3

Definitions: For any given $(\bar g, \bar Q)$, let $\bar g_n^r(\bar g, \bar Q)$ and $\bar Q_n^r(\bar g, \bar Q)$ be consistent estimators of $\bar g_0^r(\bar g, \bar Q) = E_{P_0}(A \mid \bar Q, \bar g)$ and $\bar Q_0^r(\bar g, \bar Q) = E_{P_0}(Y - \bar Q \mid A = 1, \bar g)$, respectively (e.g. using a super-learner or another non-parametric adaptive regression algorithm). Let $\bar Q_n^{r*} = \bar Q_n^r(\bar g_n^*, \bar Q_n^*)$ and $\bar g_n^{r*} = \bar g_n^r(\bar g_n^*, \bar Q_n^*)$ denote these estimators applied to the TMLEs $(\bar g_n^*, \bar Q_n^*)$ defined below.

Iterative targeted MLE of $\psi_0$:

Initialization: Let $(g_n, \bar Q_n)$ be an initial estimator of $(g_0, \bar Q_0)$. Let $g_n^0 = g_n$, $\bar Q_n^0 = \bar Q_n$, and let $k = 0$. Let $\bar g_n^{r,k} = \bar g_n^r(\bar g_n^k, \bar Q_n^k)$ be obtained by non-parametrically regressing $A$ on $(\bar Q_n^k, \bar g_n^k)$. Let $\bar Q_n^{r,k} = \bar Q_n^r(\bar g_n^k, \bar Q_n^k)$ be obtained by non-parametrically regressing $Y - \bar Q_n^k$ on $\bar g_n^k$ among observations with $A = 1$.

Updating step: Consider the submodel $\mathrm{Logit}\, \bar g_n^k(\epsilon) = \mathrm{Logit}\, \bar g_n^k + \epsilon H_A(\bar Q_n^{r,k}, \bar g_n^k)$, and fit $\epsilon$ with the MLE

$$\epsilon_{A,n} = \arg\max_\epsilon P_n \log g_n^k(\epsilon).$$

Define the submodel $\mathrm{Logit}\, \bar Q_n^k(\epsilon) = \mathrm{Logit}\, \bar Q_n^k + \epsilon_1 H_Y(\bar g_n^k) + \epsilon_2 H_Y^1(\bar g_n^{r,k}, \bar g_n^k)$, where

$$H_Y^1(\bar g^r, \bar g) \equiv \frac{A}{\bar g^r}\,\frac{\bar g^r - \bar g}{\bar g}.$$

Let $\epsilon_{Y,n} = \arg\min_\epsilon P_n L(\bar Q_n^k(\epsilon))$ be the MLE, where $L(\bar Q)$ is the quasi-log-likelihood loss and $\epsilon = (\epsilon_1, \epsilon_2)$.

We define $g_n^{k+1} = g_n^k(\epsilon_{A,n})$ as the corresponding targeted update of $g_n^k$, and $\bar Q_n^{k+1} = \bar Q_n^k(\epsilon_{Y,n})$ as the corresponding update of $\bar Q_n^k$. Let $\bar g_n^{r,k+1} = \bar g_n^r(\bar g_n^{k+1}, \bar Q_n^{k+1})$ and $\bar Q_n^{r,k+1} = \bar Q_n^r(\bar g_n^{k+1}, \bar Q_n^{k+1})$.

Iterate till convergence: Now, set $k \leftarrow k + 1$, and iterate this updating process mapping a $(g_n^k, \bar Q_n^k, \bar g_n^{r,k}, \bar Q_n^{r,k})$ into $(g_n^{k+1}, \bar Q_n^{k+1}, \bar g_n^{r,k+1}, \bar Q_n^{r,k+1})$ till convergence, or till a large enough $K$ so that the following three estimating equations are solved up to an $o_P(1/\sqrt{n})$-term:

$$P_n D^*(Q_n^K, g_n^K) = o_P(1/\sqrt{n})$$
$$P_n D_A(\bar Q_n^{r,K}, \bar g_n^K) = o_P(1/\sqrt{n})$$
$$P_n D_Y(\bar Q_n^K, \bar g_n^{r,K}, \bar g_n^K) = o_P(1/\sqrt{n}),$$

where

$$D_Y(\bar Q, \bar g_0^r, \bar g) = H_Y^1(\bar g_0^r, \bar g)\,(Y - \bar Q).$$

Final substitution estimator: Denote the limits of this iterative procedure with $(\bar Q_n^{r*}, \bar g_n^{r*}, g_n^*, \bar Q_n^*)$. Let $Q_n^* = (Q_{W,n}, \bar Q_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Equations solved by TMLE:

$$o_P(1/\sqrt{n}) = P_n D^*(Q_n^*, g_n^*)$$
$$o_P(1/\sqrt{n}) = P_n D_A(\bar Q_n^{r*}, \bar g_n^*)$$
$$o_P(1/\sqrt{n}) = P_n D_Y(\bar Q_n^*, \bar g_n^{r*}, \bar g_n^*).$$

Empirical process condition: Assume that $D^*(Q_n^*, g_n^*)$, $D_A(\bar Q_n^{r*}, \bar g_n^*)$, and $D_Y(\bar Q_n^*, \bar g_n^{r*}, \bar g_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n \to \infty$.

Negligibility of second-order terms: Define $\bar Q_{0,n}^r = E_{P_0}(Y - \bar Q \mid A = 1, \bar g, \bar g_n^*)$ and $\bar g_{0,n}^r = E_{P_0}(A \mid \bar g, \bar Q, \bar Q_n^*)$. Assume that there exists a $\delta > 0$ so that $\bar g_n^* > \delta > 0$ with probability tending to 1; that $(\bar g_n^*, \bar Q_n^*)$ is consistent for $(\bar g, \bar Q)$ w.r.t. the $\|\cdot\|_0$-norm, where either $\bar g = \bar g_0$ or $\bar Q = \bar Q_0$; and assume the following conditions:

$$\|\bar Q_n^* - \bar Q\|_0 = o_P(1)$$
$$\|\bar Q_n^{r*} - \bar Q_0^r\|_0 = o_P(1)$$
$$\|\bar g_n^{r*} - \bar g_0^r\|_0 = o_P(1)$$
$$\|\bar g_n^* - \bar g\|_0^2 = o_P(1/\sqrt{n})$$
$$\|\bar g_n^* - \bar g\|_0 \, \|\bar Q_n^* - \bar Q\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar Q_{0,n}^r - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar Q_n^{r*} - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar g_n^{r*} - \bar g_0^r\|_0 \, \|\bar Q_n^* - \bar Q\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar g_{0,n}^r - \bar g_0^r\|_0 \, \|\bar Q_n^* - \bar Q\|_0 = o_P(1/\sqrt{n}).$$

Then,

$$\Psi(Q_n^*) - \psi_0 = (P_n - P_0)\, IC(P_0) + o_P(1/\sqrt{n}),$$

where

$$IC(P_0) = D^*(Q, g) - D_A(\bar Q_0^r, \bar g) - D_Y(\bar Q, \bar g_0^r, \bar g).$$

Note that consistent estimation of the influence curve $IC(P_0)$ relies on consistency of $\bar g_n^{r*}, \bar Q_n^{r*}$ as estimators of $\bar g_0^r, \bar Q_0^r$, and on the estimators $\bar Q_n^*, \bar g_n^*$ converging to a $(\bar Q, \bar g)$ for which either $\bar Q = \bar Q_0$ or $\bar g = \bar g_0$. These estimators imply an estimated influence curve $IC_n$. An asymptotic 0.95-confidence interval is given by $\psi_n \pm 1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2 = P_n IC_n^2$.

If $\bar g = \bar g_0$, then $E_{P_0}(A \mid \bar g, \bar Q) = \bar g$, and therefore $D_Y(\bar Q, \bar g_0^r, \bar g) = 0$ for all $\bar Q$. If $\bar Q = \bar Q_0$, then it follows that $\bar Q_0^r = 0$, and thus that $D_A(\bar Q_0^r, \bar g) = 0$ for all $\bar g$. In particular, if both $\bar g = \bar g_0$ and $\bar Q = \bar Q_0$, then $IC(P_0) = D^*(Q_0, g_0)$. We also note that if $\bar g \neq \bar g_0$, but $\bar g$ is a true conditional distribution of $A$, given some function $W^r$ of $W$, for which $\bar Q(W)$ is only a function of $W^r$, then it follows that $E_{P_0}(A \mid \bar g, \bar Q) = \bar g$ and thus $D_Y = 0$.

As shown in the final remark of the Appendix, the condition of Theorem 3 that either $g = g_0$ or $\bar Q = \bar Q_0$ can be weakened to $(\bar g, \bar Q)$ having to satisfy $P_0(\bar Q - \bar Q_0)(\bar g - \bar g_0)/\bar g = 0$, allowing for the analysis of the collaborative double robust TMLE, as discussed in the next section. However, as shown in the next section, if one arranges in the TMLE algorithm that $\bar g_n^* = \bar g_n^{r*}$ (i.e. $\bar g_n^*$ already non-parametrically adjusts for $\bar Q_n^*$), then there is no need for the extra targeting in $\bar Q_n^k$, and the influence curve will be $D^*(Q, g) - D_A(\bar Q_0^r, \bar g)$.
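The distinguishing step of Theorem 3 relative to Theorem 2 is the two-dimensional fluctuation of $\bar Q_n^k$. The following is a sketch of that step (our own scaffolding; `gbar_r` is a current estimate of $E_{P_0}(A \mid \bar g, \bar Q)$, e.g. from a non-parametric regression of $A$ on $(\bar g_n^k, \bar Q_n^k)$):

```python
import numpy as np
from scipy.optimize import minimize

def expit(x):
    return 1 / (1 + np.exp(-x))

def double_target_Q(A, Y, Qbar, gbar, gbar_r):
    # Two clever covariates: H_Y = A/gbar (solves the efficient influence
    # curve equation) and H_Y^1 = (A/gbar_r)*(gbar_r - gbar)/gbar (protects
    # asymptotic linearity when g_n is inconsistent); joint offset-logistic MLE.
    p = np.clip(Qbar, 1e-10, 1 - 1e-10)
    off = np.log(p / (1 - p))
    H1 = A / gbar
    H2 = (A / gbar_r) * (gbar_r - gbar) / gbar
    def nll(eps):
        q = np.clip(expit(off + eps[0] * H1 + eps[1] * H2), 1e-10, 1 - 1e-10)
        return -np.mean(Y * np.log(q) + (1 - Y) * np.log(1 - q))
    e = minimize(nll, np.zeros(2), method="Nelder-Mead").x
    return expit(off + e[0] * H1 + e[1] * H2)
```

At the fitted $(\epsilon_1, \epsilon_2)$, the two score equations $P_n H_Y(\bar g)(Y - \bar Q^*) = 0$ and $P_n H_Y^1(\bar g^r, \bar g)(Y - \bar Q^*) = 0$ hold, the latter being the empirical analogue of the third estimating equation of the theorem.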

5 Collaborative double robust inference for C-TMLE when using super-learning to fit outcome regression and reduced treatment mechanism

We first review the theoretical underpinning for collaborative estimation of the nuisance parameters, in this case the outcome regression and the treatment mechanism. Subsequently, we explain that the desired collaborative estimation can be achieved by applying the previously established template for construction of a C-TMLE to a TMLE that solves certain estimating equations when given an initial estimator of $(Q_0, g_0)$. This C-TMLE template involves (1) creating a sequence of TMLEs $((g_{n,k}, Q_{n,k}^*) : k = 1, \ldots, K)$, constructed in such a manner that the empirical risk of both $g_{n,k}$ and $Q_{n,k}^*$ is decreasing in $k$, and (2) using cross-validation to select the $k$ for which $Q_{n,k}^*$ is the best fit of $Q_0$. Subsequently, we present the TMLE that maps an initial estimator of $(Q_0, g_0)$ into targeted estimators solving the desired estimating equations, and we establish its asymptotic linearity under appropriate conditions, including that the initial estimator of $(Q_0, g_0)$ is collaboratively consistent. Finally, we present a concrete C-TMLE algorithm that uses this TMLE algorithm as its basis, so that our theorem can be applied to this C-TMLE: a C-TMLE is still a TMLE, but one based on a data-adaptively selected initial estimator that is collaboratively consistent, so that we can apply the same theorem to this C-TMLE.

5.1 Motivation and theoretical underpinning of collaborative double robust estimation of nuisance parameters

We note that $P_0 D^*(Q, g) = P_0\big\{\frac{A}{\bar g}(\bar Q_0 - \bar Q) + \bar Q\big\} - \Psi(Q)$. If $Q_W = Q_{W,0}$, this reduces to

$$P_0 D^*(Q, g) = P_0 \frac{A}{\bar g}(\bar Q_0 - \bar Q) = \Psi(Q_0) - \Psi(Q) + P_0 \frac{A - \bar g}{\bar g}(\bar Q_0 - \bar Q).$$

Let $\mathcal{G}$ be the class of all possible conditional distributions of $A$, given $W$, and let $g_0 \in \mathcal{G}$ be the true conditional distribution of $A$, given $W$. We define the set $\mathcal{G}(P_0, \bar Q) \equiv \big\{g \in \mathcal{G} : 0 = P_0 (A - \bar g)\frac{\bar Q_0 - \bar Q}{\bar g}\big\}$. For any $g \in \mathcal{G}(P_0, \bar Q)$, we have $P_0 D^*(Q, g) = \Psi(Q_0) - \Psi(Q)$. Suppose we have an estimator $(Q_n, g_n)$ satisfying $P_n D^*(Q_n, g_n) = 0$ and converging to a $(Q, g)$ so that $g \in \mathcal{G}(P_0, \bar Q)$. Then it follows that $P_0 D^*(Q, g) = 0$ and $P_0 D^*(Q, g) = \Psi(Q_0) - \Psi(Q)$, thereby establishing that $\Psi(Q_n)$ is a consistent estimator of $\Psi(Q_0)$. Let us state this crucial result as a lemma.

Lemma 1 (van der Laan and Gruber [33]). If $P_0(A - \bar g)(\bar Q_0 - \bar Q)/\bar g = 0$ and $P_0 D^*(Q, g) = 0$, then $\Psi(Q) = \psi_0$. More generally, $P_0 D^*(Q, g) = \Psi(Q_0) - \Psi(Q) + P_0(A - \bar g)(\bar Q_0 - \bar Q)/\bar g$.

We note that $\mathcal{G}(P_0, \bar Q)$ contains the true conditional distributions $g_0^r$ of $A$, given $W^r$, for which $(\bar Q - \bar Q_0)/\bar g_0^r$ is a function of $W^r$, i.e. for which $\bar Q - \bar Q_0$ only depends on $W$ through $W^r$. We refer to such distributions as reduced treatment mechanisms. However, it contains many more conditional distributions, since any conditional distribution $g$ for which $(A - \bar g(W))$ is orthogonal to $(\bar Q_0 - \bar Q)/\bar g$ in $L_0^2(P_0)$ is an element of $\mathcal{G}(P_0, \bar Q)$. We refer to van der Laan and Gruber [33] and Gruber and van der Laan [29] for the introduction and general notion of collaborative double robustness.

5.2 C-TMLE

The general C-TMLE introduced in van der Laan and Gruber [33] provides a template for construction of a TMLE $(g_n^*, \bar Q_n^*)$ satisfying $P_n D^*(Q_n^*, g_n^*) = 0$ and converging to a $(g, \bar Q)$ with $g \in \mathcal{G}(P_0, \bar Q)$, so that $P_0 D^*(Q, g) = 0$ and thereby $\Psi(Q) - \Psi(Q_0) = 0$. Thus C-TMLE provides a template for construction of targeted MLEs that exploit the collaborative double robustness of TMLEs, in the sense that a TMLE will be consistent as long as $(Q_n^*, g_n^*)$ converges to a $(Q, g)$ for which $g \in \mathcal{G}(P_0, \bar Q)$. The goal is not to estimate the true treatment mechanism, but instead to construct a $g_n$ that converges to a conditional distribution, given a reduction $W^r$ of $W$, that is an element of $\mathcal{G}(P_0, \bar Q)$. We could state that, just as the propensity score provides a sufficient dimension reduction for the outcome regression, so, given $\bar Q$, does $\bar Q - \bar Q_0$ provide a sufficient dimension reduction for the propensity score regression in the TMLE. The current literature appears to agree that propensity score estimators are best evaluated with respect to their effect on estimation of the causal effect of interest, not by metrics such as likelihoods or classification rates [45–48], and the above-stated general collaborative double robustness provides a formal foundation for such claims.

The general C-TMLE has been implemented and applied to point treatment and longitudinal data [20, 29–33, 35]. A C-TMLE algorithm relies on a TMLE algorithm that maps an initial $(\bar Q_n, g_n)$ into a TMLE $(\bar Q_n^*, g_n^*)$, uses this algorithm in combination with a targeted variable selection algorithm for generating candidate models for the propensity score to generate a sequence of candidate TMLEs $(g_n^{k*}, \bar Q_n^{k*})$, increasingly non-parametric in $k$, and finally uses cross-validation to select the best TMLE among these candidate estimators of $\bar Q_0$.

5.3 A TMLE that allows for collaborative double robust inference

Our next theorem presents a TMLE algorithm and a corresponding influence curve under the assumption that the propensity score correctly adjusts for the possibly misspecified $\bar Q$ and for $\bar Q_0 - \bar Q = E_0(Y - \bar Q(W) \mid A = 1, W)$. The presented TMLE algorithm already arranges that this TMLE indeed non-parametrically adjusts for $\bar Q$. In the next subsection, we will present an actual C-TMLE algorithm that generates a TMLE for which the propensity score is targeted to adjust for $\bar Q - \bar Q_0$, so that this theorem can be applied.

Theorem 4

Definitions: For any given $(\bar g, \bar Q)$, let $\bar g_n^r(\bar g, \bar Q)$ and $\bar Q_n^r(\bar g, \bar Q)$ be consistent estimators of $\bar g_0^r(\bar g, \bar Q) = E_{P_0}(A \mid \bar g, \bar Q)$ and $\bar Q_0^r(\bar g, \bar Q) = E_{P_0}(Y - \bar Q \mid A = 1, \bar g)$, respectively (e.g. using a super-learner or another non-parametric adaptive regression algorithm). Let $\bar Q_n^{r*} = \bar Q_n^r(\bar g_n^*, \bar Q_n^*)$ and $\bar g_n^{r*} = \bar g_n^r(\bar g_n^*, \bar Q_n^*)$ denote these estimators applied to the TMLE $(\bar g_n^*, \bar Q_n^*)$ defined below.

“Score” equations the TMLE should solve: Below, we describe an iterative TMLE algorithm that results in estimators $(\bar g_n^{r*}, \bar Q_n^{r*}, g_n^*, \bar Q_n^*)$ that solve the following equations:

$$0 = P_n D^*(Q_n^*, g_n^*)$$
$$0 = P_n D_A(\bar Q_n^{r*}, \bar g_n^*). \qquad (3)$$

Iterative targeted MLE of $\psi_0$:

Initialization: Let $\bar Q_n$ and $g_n$ (e.g. aiming to adjust for $\bar Q_n - \bar Q_0$) be initial estimators.

Let $\bar Q_n^0 = \bar Q_n$, $\bar g_n^0 = \bar g_n^r(\bar g_n, \bar Q_n^0)$, and $\bar Q_n^{r,0} = \bar Q_n^r(\bar g_n^0, \bar Q_n^0)$.

Updating step: Consider the submodel $\mathrm{Logit}\, \bar g_n^k(\epsilon) = \mathrm{Logit}\, \bar g_n^k + \epsilon H_A(\bar Q_n^{r,k}, \bar g_n^k)$, and fit $\epsilon$ with the MLE

$$\epsilon_{A,n} = \arg\max_\epsilon P_n \log g_n^k(\epsilon).$$

Define the submodel $\mathrm{Logit}\, \bar Q_n^k(\epsilon) = \mathrm{Logit}\, \bar Q_n^k + \epsilon H_Y(g_n^k)$ and let $L(\bar Q)$ be the quasi-log-likelihood loss function for $\bar Q_0$. Let $\epsilon_{Y,n} = \arg\min_\epsilon P_n L(\bar Q_n^k(\epsilon))$ be the MLE. Let $\bar Q_n^{k+1} = \bar Q_n^k(\epsilon_{Y,n})$, $\bar g_n^{k+1} = \bar g_n^r(\bar g_n^k(\epsilon_{A,n}), \bar Q_n^{k+1})$, and $\bar Q_n^{r,k+1} = \bar Q_n^r(\bar g_n^{k+1}, \bar Q_n^{k+1})$.

Iterating till convergence: Now, set $k \leftarrow k + 1$, and iterate this updating process mapping a $(g_n^k, \bar Q_n^k, \bar Q_n^{r,k})$ into $(g_n^{k+1}, \bar Q_n^{k+1}, \bar Q_n^{r,k+1})$ till convergence, or till a large enough $K$ so that the following estimating equations are solved up to an $o_P(1/\sqrt{n})$-term:

$$o_P(1/\sqrt{n}) = P_n D^*(Q_n^K, g_n^K)$$
$$o_P(1/\sqrt{n}) = P_n D_A(\bar Q_n^{r,K}, \bar g_n^K).$$

Final substitution estimator: Denote the limits (in $k$) of this iterative procedure with $(g_n^*, \bar Q_n^*, \bar Q_n^{r*})$. Let $Q_n^* = (Q_{W,n}, \bar Q_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Assumption on the limits $\bar g, \bar Q$ of $\bar g_n^*, \bar Q_n^*$: Assume that $(\bar g_n^*, \bar Q_n^*)$ is consistent for $(\bar g, \bar Q)$ w.r.t. the $\|\cdot\|_0$-norm, where $\bar g(W) = E_{P_0}(A \mid W^r)$ for some function $W^r(W)$ of $W$ for which $\bar Q$ only depends on $W$ through $W^r$, and assume that $P_0 \frac{\bar Q - \bar Q_0}{\bar g}(A - \bar g) = 0$, where the latter holds, in particular, if $\bar Q - \bar Q_0$ only depends on $W$ through $W^r$ (e.g. $\bar g_n$ involves non-parametric adjustment by $\bar Q, \bar Q_0$). As a consequence, we have $\bar g = \bar g_0^r$.

Empirical process condition: Assume that $D^*(Q_n^*, g_n^*)$ and $D_A(\bar Q_n^{r*}, \bar g_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n \to \infty$.

Negligibility of second-order terms: Define

$$\bar Q_{0,n}^r \equiv E_{P_0}\big(Y - \bar Q \mid A = 1, \bar g, \bar g_n^*\big).$$

Assume that the following conditions hold for each of the following possible definitions of $\bar g_{0,n}^r$: $E_{P_0}(A \mid \bar g, \bar Q, \bar Q_n^*)$, $E_{P_0}(A \mid \bar g, \bar g_n^*)$, and $E_{P_0}(A \mid \bar g, \bar Q_n^{r*}, \bar g_n^*)$. Note that $\bar g_0^r = E_0(A \mid \bar g, \bar Q) = E_0(A \mid \bar g) = \bar g$ is the limit of each of these choices for $\bar g_{0,n}^r$.

We assume that $\bar g$ and $\bar g_n^*$ are bounded from below by some $\delta > 0$ with probability tending to one, and

$$\|\bar Q_n^{r*} - \bar Q_0^r\|_0 = o_P(1)$$
$$\|\bar g_n^* - \bar g\|_0^2 = o_P(1/\sqrt{n})$$
$$\|\bar Q_{0,n}^r - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar Q_n^* - \bar Q\|_0 \, \|\bar g_n^* - \bar g\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar Q_n^{r*} - \bar Q_0^r\|_0 \, \|\bar g_n^* - \bar g\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar g_{0,n}^r - \bar g_0^r\|_0 \, \|\bar Q_n^* - \bar Q\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar g_{0,n}^r - \bar g_0^r\|_0 \, \|\bar g_n^* - \bar g\|_0 = o_P(1/\sqrt{n})$$
$$\|\bar g_{0,n}^r - \bar g_0^r\|_0 \, \|\bar Q_n^{r*} - \bar Q_0^r\|_0 = o_P(1/\sqrt{n}).$$

Then,

$$\Psi(Q_n^*) - \psi_0 = (P_n - P_0)\, IC(P_0) + o_P(1/\sqrt{n}),$$

where

$$IC(P_0) = D^*(Q, g) - D_A(\bar Q_0^r, \bar g).$$

Thus, consistency of this TMLE relies upon the consistency of $\bar Q_n^{r*}$ as an estimator of $\bar Q_0^r$, and upon the estimator $(\bar Q_n^*, \bar g_n^*)$ converging to a $(\bar Q, \bar g)$ for which $\bar g$ equals a true conditional mean of $A$, given $W^r$, and $\bar Q_0 - \bar Q$ and $\bar Q$ only depend on $W$ through $W^r$. Note that $\bar Q_{0,n}^r - \bar Q_0^r$ depends on how well $\bar g_n^*$ approximates $\bar g$; $\bar Q_n^{r*} - \bar Q_0^r$ depends on how well $(\bar Q_n^*, \bar g_n^*)$ approximates $(\bar Q, \bar g)$, beyond the behavior of the non-parametric regression defining $\bar Q_n^r$; and $\bar g_{0,n}^r - \bar g_0^r$ depends on either how well $\bar g_n^*$ approximates $\bar g$ or how well $\bar Q_n^*$ approximates $\bar Q$. As a consequence, each of the second-order terms displayed in the theorem involves a product of two of the approximation errors $\bar g_n^* - \bar g$ and $\bar Q_n^* - \bar Q$.

It is also interesting to note that the algebraic form of the influence curve of this TMLE is identical to that of the influence curve of the TMLE of Theorem 2, which relied on $\bar g_n^*$ being consistent for $\bar g_0$.

5.4 A C-TMLE algorithm

The TMLE algorithm presented in Theorem 4 maps an initial estimator $(Q_n^0, g_n^0)$ into an updated estimator $(Q_n^*, g_n^*)$ that solves the two estimating equations (3), allowing for statistical inference with known influence curve if the initial estimator $(Q_n^0, g_n^0)$ is collaboratively consistent (i.e. the limits of $(Q_n^*, g_n^*)$ satisfy the condition of the theorem). The updating algorithm results in a $g_n^*$ that non-parametrically adjusts for $\bar Q_n^*$ itself, and thus, in the limit, for its limit $\bar Q$. The condition on the limit $g$ was that it should non-parametrically adjust not only for $\bar Q$ but also for $\bar Q - \bar Q_0$. If the initial estimator $g_n^0$ already adjusted for an approximation of $\bar Q_n^0 - \bar Q_0$, for example, if $(g_n^0, Q_n^0)$ is already a C-TMLE, then this condition might hold approximately. Nonetheless, we want to present a C-TMLE algorithm that simultaneously fits $g$ in response to $\bar Q - \bar Q_0$ and also carries out the non-parametric adjustment by $\bar Q$. The latter is normally not part of the C-TMLE algorithm, but we want to enforce it in order to be able to apply Theorem 4 and thereby obtain a known influence curve. We achieve this goal in this subsection by applying the C-TMLE algorithm as presented by van der Laan and Gruber [49] to the particular TMLE algorithm presented in Theorem 4.

First, we compute a set of $K$ univariate covariates $W_1, \ldots, W_K$, i.e. functions of $W$, which we will refer to as main terms, even though a term could be an interaction term or a super-learning fit of the regression of $A$ on a subset of the components of $W$. Let $\Omega = \{W_1, \ldots, W_K\}$ be the full collection of main terms. In the previous subsection, we defined an algorithm that maps an initial $(Q, g)$ into a TMLE $(Q^*, g^*)$. Let $O \mapsto L(Q)(O)$ be the loss function for $Q_0$.

The general template of a C-TMLE algorithm is the following: given a TMLE algorithm that maps any initial $(Q, g)$ into a TMLE $(Q^*, g^*)$, the C-TMLE algorithm generates a sequence of increasing sets $S_k \subset \Omega$ of $k$ main terms, where each set $S_k$ has an associated estimator $g_k$ of $g_0$, and simultaneously it generates a corresponding sequence of $Q_k$, $k = 1, \ldots, K$, where both $g_k$ and $Q_k$ are increasingly non-parametric in $k$. Here increasingly non-parametric means that the empirical mean of the loss function of the fit is decreasing in $k$. This sequence $(g_k, Q_k)$ maps into a corresponding sequence of TMLEs $(g_k^*, Q_k^*)$ using the TMLE algorithm presented in Theorem 4. In this variable selection algorithm, the choice of the next main term to add, mapping $S_k$ into $S_{k+1}$, is based on how much the TMLE using the $g$-fit implied by $S_{k+1}$, with $Q_k$ as initial estimator, improves the fit of the corresponding TMLE $Q_k^*$ of $Q_0$. Cross-validation is used to select $k$ among these candidate TMLEs $Q_k^*$, $k = 1, \ldots, K$, where the last TMLE $Q_K^*$ uses the most aggressive bias reduction by being based on the most non-parametric estimator $g_K$ implied by $\Omega$.

In order to present a precise C-TMLE algorithm we first introduce some notation. For a given subset of main terms $S \subset \Omega$, let $S^c$ be its complement within $\Omega$. In the C-TMLE algorithm, we use a forward selection algorithm that augments a given set $S_k$ into a next set $S_{k+1}$ obtained by adding the best main term among all main terms in the complement $S_k^c$ of $S_k$. Each choice $S$ corresponds with an estimator of $g_0$. In other words, the algorithm iteratively updates a current estimate $g_k$ into a new estimate $g_{k+1}$, but the criterion for $g$ does not measure how well $g$ fits $g_0$; it measures how well the TMLE of $Q_0$ that uses this $g$ (and, as initial estimator, $Q_k$) fits $Q_0$.

Given a set $S_k$ and an initial $(g_{k-1}, Q_{k-1})$, we define a corresponding $g_k$ obtained by MLE-fitting of $\beta$ in the logistic regression working model

$$\operatorname{Logit} \bar{g}_k = \operatorname{Logit} \bar{g}_0^r(\bar{g}_{k-1}, \bar{Q}_{k-1}) + \sum_{j \in S_k} \beta_j W_j,$$

where we remind the reader of the definition $\bar{g}_0^r(\bar{g}, \bar{Q}) = E_0(A \mid \bar{Q}(W), \bar{g}(W))$. Thus, this estimator $g_k$ involves non-parametric adjustment by $(\bar{g}_{k-1}, \bar{Q}_{k-1})$, augmented with a linear regression component implied by $S_k$. This function mapping $(S_k, g_{k-1}, Q_{k-1})$ into a fit $g_k$ will be denoted by $g(S_k, g_{k-1}, Q_{k-1})$. This also allows us to define a mapping from $(Q_k, S_k, Q_{k-1}, g_{k-1})$ into a TMLE $(Q_k^*, g_k^*)$, defined by the TMLE algorithm of Theorem 4 applied to initial $Q_k$ and $g_k = g(S_k, g_{k-1}, Q_{k-1})$. We will denote this mapping into $Q_k^*$ by $TMLE(Q_k, S_k, Q_{k-1}, g_{k-1})$.
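As a concrete illustration, the working-model fit $g(S_k, g_{k-1}, Q_{k-1})$ can be computed as a standard logistic regression with an offset. The sketch below assumes a vector `gbar0r_hat` holding an estimate of $\bar{g}_0^r(\bar{g}_{k-1}, \bar{Q}_{k-1})(W_i)$, e.g. from a non-parametric regression of $A$ on the two scalars $\bar{g}_{k-1}(W)$ and $\bar{Q}_{k-1}(W)$; this helper and all names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def fit_g_working_model(A, W_main_terms, gbar0r_hat):
    """Sketch: MLE of beta in Logit gbar_k = Logit gbar0r_hat + sum_j beta_j W_j.
    A: binary treatment (n,); W_main_terms: columns W_j, j in S_k (n, |S_k|);
    gbar0r_hat: (n,) non-parametric adjustment term, assumed in (0, 1)."""
    offset = np.log(gbar0r_hat / (1.0 - gbar0r_hat))   # Logit of the adjustment
    fit = sm.GLM(A, W_main_terms, family=sm.families.Binomial(),
                 offset=offset).fit()
    logit_g = offset + W_main_terms @ fit.params       # fitted Logit gbar_k
    return 1.0 / (1.0 + np.exp(-logit_g))              # fitted gbar_k(W)
```

Note that, as in the displayed working model, there is no free intercept: all adjustment beyond the main terms enters through the offset.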

The C-TMLE algorithm defined below generates a sequence $(Q_k, S_k)$ and thereby corresponding TMLEs $(Q_k^*, g_k^*)$, $k = 0, \ldots, K$, where $Q_k$ represents an initial estimate, $S_k$ a subset of main terms that defines $g_k$, and $(Q_k^*, g_k^*)$ the corresponding TMLE that starts at $(Q_k, g_k)$. These TMLEs $Q_k^*$ represent subsequent updates of the initial estimator $Q^0$. The corresponding main term set $S_k$ that defines $g_k$ in this $k$-specific TMLE increases in $k$, one unit at a time: $S_0$ is empty, $|S_{k+1}| = |S_k| + 1$, and $S_K = \Omega$. The C-TMLE uses cross-validation to select $k$, and thereby to select the TMLE $Q_k^*$ that yields the best fit of $Q_0$ among the $K + 1$ $k$-specific TMLEs $(Q_k^* : k = 0, \ldots, K)$ that are increasingly aggressive in their bias-reduction effort. This C-TMLE algorithm is defined as follows and uses the same format as presented in Wang et al. [35]:

Initiate algorithm: Set initial TMLE. Let $k = 0$, let $Q_k = Q^0$ and $g_{start}$ be initial estimates of $Q_0$ and $g_0$, and let $S_0$ be the empty set. Let $g_k = g(S_0, g_{start}, Q^0)$. This defines an initial TMLE

$$Q_0^* = TMLE(Q^0, S_0, Q^0, g_{start}).$$

Determine next TMLE. Determine the next best main term to add:

$$S_{k+1,cand} = \arg\min_{\{S_k \cup \{W_j\} \,:\, W_j \in S_k^c\}} P_n L\big(TMLE(Q_k, S_k \cup \{W_j\}, Q_{k-1}, g_{k-1})\big).$$

If

$$P_n L\big(TMLE(Q_k, S_{k+1,cand}, Q_{k-1}, g_{k-1})\big) \leq P_n L(Q_k^*),$$

then set $(S_{k+1} = S_{k+1,cand}, Q_{k+1} = Q_k)$; else, set $Q_{k+1} = Q_k^*$ and

$$S_{k+1} = \arg\min_{\{S_k \cup \{W_j\} \,:\, W_j \in S_k^c\}} P_n L\big(TMLE(Q_{k+1}, S_k \cup \{W_j\}, Q_{k-1}, g_{k-1})\big).$$

[In words: If the next best main term added to the fit of $E_{P_0}(A \mid W)$ yields a TMLE of $E_{P_0}(Y \mid A, W)$ that improves upon the previous TMLE $Q_k^*$, then we accept this best main term, and we have our next $(Q_{k+1}, S_{k+1})$ and corresponding TMLE $(Q_{k+1}^*, g_{k+1}^*)$ (which still uses the same initial estimate of $Q_0$ as $Q_k^*$ uses). Otherwise, we reject this best main term, update the initial estimate in the candidate TMLEs to the previous TMLE $Q_k^*$ of $E_{P_0}(Y \mid A, W)$, and determine the best main term to add again. This best main term will now always result in an improved fit of the corresponding TMLE of $Q_0$, since the updated initial estimate already achieves the fit of $Q_k^*$ and the TMLE update can only improve the empirical fit, so that we now have our next TMLE $(Q_{k+1}^*, g_{k+1}^*)$ (which now uses a different initial estimate than $Q_k^*$ used).]

Iterate. Run this from $k = 1$ to $K$, at which point $S_K = \Omega$. This yields a sequence $(Q_k, g_k)$ and corresponding TMLEs $(Q_k^*, g_k^*)$, $k = 0, \ldots, K$.

This sequence of candidate TMLEs $Q_k^*$ of $Q_0$ has the following property: the estimates $g_k$ are increasingly non-parametric in $k$, and $P_n L(Q_k^*)$ is decreasing in $k$, $k = 0, \ldots, K$. It remains to select $k$. For that purpose we use V-fold cross-validation. That is, for each of the V splits of the sample into a training and validation sample, we apply the above algorithm for generating a sequence of candidate estimates $(Q_k^* : k)$ to the training sample, and we evaluate the empirical mean of the loss function at the resulting $Q_k^*$ over the validation sample, for each $k = 0, \ldots, K$. For each $k$ we take the average across the V splits of the $k$-specific performance measure over the validation sample, which is called the cross-validated risk of the $k$-specific TMLE. We select the $k$ with the best cross-validated risk, which we denote with $k_n$. Our final C-TMLE of $Q_0$ is now defined as $Q_n^* = Q_{k_n}^*$, and the TMLE of $\psi_0$ is defined as $\psi_n = \Psi(Q_n^*)$.
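The greedy construction above involves some subtle bookkeeping, so the following schematic Python sketch spells it out. The callables `tmle`, `fit_g`, and `loss` are hypothetical stand-ins for the TMLE algorithm of Theorem 4, the working-model fit $g(S, g_{k-1}, Q_{k-1})$, and the empirical risk $P_n L(Q)$; the handling of the adjustment pair $(Q_{k-1}, g_{k-1})$ reflects one reading of the text, not a definitive implementation.

```python
def ctmle_sequence(Q0, g_start, omega, tmle, fit_g, loss):
    """Sketch: return the candidate TMLEs (Q*_k, g_k), k = 0, ..., K.

    Q0, g_start -- initial estimates of Q_0 and g_0 (e.g. super-learner fits)
    omega       -- list of candidate main terms W_1, ..., W_K
    tmle(Q, g)  -- targeted update Q* of initial Q using g (Theorem 4)
    fit_g(S, g_prev, Q_prev) -- offset-logistic fit g(S, g_prev, Q_prev)
    loss(Q)     -- empirical risk P_n L(Q)
    """
    S = []                            # S_0: no main terms selected yet
    Q_init = Q0                       # initial estimate fed to the TMLE
    Q_prev, g_prev = Q0, g_start      # adjustment pair (Q_{k-1}, g_{k-1})
    g = fit_g(S, g_prev, Q_prev)
    Q_star = tmle(Q_init, g)          # Q*_0
    sequence = [(Q_star, g)]
    remaining = list(omega)           # the complement S_k^c

    while remaining:
        # Score each candidate term by the fit of the TMLE of Q_0 it induces.
        risks = {j: loss(tmle(Q_init, fit_g(S + [j], g_prev, Q_prev)))
                 for j in remaining}
        best = min(risks, key=risks.get)
        if risks[best] > loss(Q_star):
            # No candidate improves on Q*_k: move the initial estimate to
            # Q*_k and score again; an improving term now always exists.
            Q_init = Q_star
            risks = {j: loss(tmle(Q_init, fit_g(S + [j], g_prev, Q_prev)))
                     for j in remaining}
            best = min(risks, key=risks.get)
        S = S + [best]
        remaining.remove(best)
        Q_prev, g_prev = Q_init, g    # roll the adjustment pair forward
        g = fit_g(S, g_prev, Q_prev)
        Q_star = tmle(Q_init, g)
        sequence.append((Q_star, g))
    return sequence
```

The returned sequence is then subjected to V-fold cross-validation as described above: rerun `ctmle_sequence` on each training sample, evaluate `loss` of each $Q_k^*$ on the corresponding validation sample, average across splits, and select the $k_n$ with smallest cross-validated risk.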

Fast version of the above C-TMLE: We could carry out the above C-TMLE algorithm but with the TMLE that maps an initial $(Q, g)$ into $(Q^*, g^*)$ replaced by the first step of the TMLE algorithm, which maps $(Q, g)$ into $(Q^1, g^1)$. In that manner, the selection of the sets $S_k$ is based on the bias reduction achieved in the first step of the TMLE algorithm, and most of the bias reduction occurs in this first step. After having selected the final one-step TMLE $Q_{k_n}^1$ and corresponding $g_{k_n}$, one should still carry out the full TMLE algorithm, so that the final $(Q_n^* = Q_{k_n}^*, g_{k_n}^*)$ is a real TMLE solving the estimating equations of Theorem 4.

Statistical inference for C-TMLE: Let $\bar{Q}_n^r = \bar{Q}_n^r(\bar{g}_n^*, \bar{Q}_n^*)$ be the final estimator of $\bar{Q}_0^r = \bar{Q}_0^r(\bar{g}, \bar{Q}) = E_{P_0}(Y - \bar{Q} \mid A = 1, \bar{g})$, a by-product of the TMLE algorithm. An estimate of the influence curve of $\psi_n$ is given by

$$IC_n = D^*(Q_n^*, \bar{g}_n^*) - D_A(\bar{Q}_n^r, \bar{g}_n^*).$$

The asymptotic variance of $\sqrt{n}(\psi_n - \psi_0)$ can thus be estimated with $\sigma_n^2 = \frac{1}{n} \sum_{i=1}^n IC_n(O_i)^2$. An asymptotically valid 0.95-confidence interval for $\psi_0$ is given by $\psi_n \pm 1.96\, \sigma_n / \sqrt{n}$.
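This variance estimate comes at essentially no computational cost once $IC_n$ has been evaluated at each observation. A minimal sketch, assuming `ic` is the length-$n$ array of values $IC_n(O_1), \ldots, IC_n(O_n)$ and `psi_n` the C-TMLE point estimate:

```python
import numpy as np

def ic_confidence_interval(psi_n, ic, z=1.96):
    """Sketch: Wald-type 0.95-CI based on the estimated influence curve."""
    n = len(ic)
    sigma_n = np.sqrt(np.mean(ic ** 2))   # sigma_n^2 = (1/n) sum_i IC_n(O_i)^2
    half = z * sigma_n / np.sqrt(n)
    return psi_n - half, psi_n + half
```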

6 Discussion

Targeted minimum loss-based estimation allows us to construct plug-in estimators $\Psi(Q_n^*)$ of a path-wise differentiable parameter $\Psi(Q_0)$ utilizing the state of the art in ensemble learning, such as super-learning, while guaranteeing that the estimator $Q_n^*$, together with an estimator $g_n^*$ of the nuisance parameter that the TMLE utilizes in its targeting step, solves a set of user-supplied estimating equations, i.e. sets the empirical means of user-supplied estimating functions equal to zero. These estimating functions can be selected so that the resulting TMLE of $\psi_0$ has certain statistical properties, such as being efficient, or being guaranteed to be more efficient than a given user-supplied estimator [28, 29], and so on. Most importantly, however, these estimating equations are necessary to make the TMLE asymptotically linear, i.e. to make the TMLE unbiased enough so that the first-order linear expansion can be used for statistical inference. For example, by selecting the estimating functions to be equal to the canonical gradient of $\Psi : \mathcal{M} \to \mathbb{R}$, one arranges that $\Psi(Q_n^*)$ is asymptotically efficient under conditions that assume consistency of both $Q_n^*$ and $g_n^*$.

However, we noted that this level of targeting is insufficient if one relies only on consistency of $g_n^*$, even though that already suffices for consistency of $\Psi(Q_n^*)$. Under such weaker assumptions, additional targeting is necessary so that a specific smooth functional of $g_n^*$ is asymptotically linear, which requires that this (unknown) smooth function of $g_n^*$ is itself a TMLE. The joint targeting of $Q_n^*$ and $g_n^*$ is achieved by a TMLE that also solves the extra equations making this smooth function of $g_n^*$ asymptotically linear, allowing one to establish asymptotic linearity of $\Psi(Q_n^*)$ under milder conditions that assume that the second-order terms are negligible relative to the first-order linear approximation.

In this article we also pushed this additional level of targeting further by demonstrating how it allows for double robust statistical inference: even if we estimate the nuisance parameter in a complicated manner, based on a criterion that measures how effective it is in helping the estimator fit $\psi_0$, as used by the C-TMLE, we can still determine a set of additional estimating equations that need to be targeted by the TMLE in order to establish asymptotic linearity, and thereby valid statistical inference based on the central limit theorem. This allows us to use the sophisticated, but often necessary, C-TMLE while still preserving valid statistical inference under regularity conditions.

It remains to evaluate the practical benefit of the modifications of IPTW, TMLE, and C-TMLE as presented in this article for both estimation and assessment of uncertainty. We plan to address this in future research.

Even though we focused in this article on a particular concrete estimation problem, TMLE is a general tool, and our TMLEs and theorems can be generalized to general statistical models and path-wise differentiable statistical target parameters.

We note that this targeting of nuisance parameter estimators in the TMLE is not only necessary to obtain a known influence curve but also to make the TMLE asymptotically linear in the first place. So it does not simply suffice to run a bootstrap as an alternative to influence curve-based inference, since the bootstrap can only work if the estimator is asymptotically linear and thus has a limit distribution. In addition, the established asymptotic linearity with known influence curve has the important by-product that one obtains statistical inference at no extra computational cost. This is particularly important in these large semi-parametric models that require the utilization of aggressive machine learning methods in order to cover the model space, making the estimators by necessity very computer intensive, so that a (disputable) bootstrap method might simply be too computationally expensive.

Acknowledgments

This research was supported by an NIH grant R01 AI074345-06. The author is grateful for the excellent, helpful, and insightful comments of the reviewers.

Appendix

Proof of Theorem 1

To start with, we note:

$$\begin{aligned}
P_n D(g_n) - P_0 D(g_0) &= (P_n - P_0) D(g_0) + P_n \{D(g_n) - D(g_0)\} \\
&= (P_n - P_0) \{D(g_0) - \psi_0\} + P_0 \{D(g_n) - D(g_0)\} + (P_n - P_0) \{D(g_n) - D(g_0)\}.
\end{aligned}$$

The first term of this decomposition yields the first component $D(g_0) - \psi_0$ of the influence curve. Since $D(g_n)$ falls in a $P_0$-Donsker class with probability tending to one, the rightmost term is $o_P(1/\sqrt{n})$, provided $P_0 \{D(g_n) - D(g_0)\}^2 \to 0$ in probability. So it remains to analyze the term $P_0 \{D(g_n) - D(g_0)\}$. We now note

$$\begin{aligned}
P_0 \{D(g_n) - D(g_0)\} &= P_0\, YA \left(\frac{1}{g_n} - \frac{1}{g_0}\right) = P_0\, YA\, \frac{g_0 - g_n}{g_n g_0} \\
&= P_0\, YA\, \frac{g_0 - g_n}{g_0^2} + P_0\, YA\, \frac{(g_0 - g_n)^2}{g_0^2\, g_n}.
\end{aligned}$$
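The last equality is an exact algebraic expansion rather than a Taylor approximation; for completeness, the identity behind it reads:

$$\frac{g_0 - g_n}{g_n g_0} = \frac{g_0 - g_n}{g_0^2} \cdot \frac{g_0}{g_n} = \frac{g_0 - g_n}{g_0^2} \left(1 + \frac{g_0 - g_n}{g_n}\right) = \frac{g_0 - g_n}{g_0^2} + \frac{(g_0 - g_n)^2}{g_0^2\, g_n}.$$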

By our assumptions, the last term satisfies

$$P_0\, YA\, \frac{(g_0 - g_n)^2}{g_0^2\, g_n} = P_0\, \bar{Q}_0\, \frac{(\bar{g}_n - \bar{g}_0)^2}{\bar{g}_0\, \bar{g}_n} = o_P(1/\sqrt{n}).$$

So it remains to study:

$$P_0\, YA\, \frac{g_0 - g_n}{g_0^2} = P_0\, \bar{Q}_0\, \frac{\bar{g}_0 - \bar{g}_n}{\bar{g}_0}.$$

Note that this equals $-\{\Psi_1(g_n) - \Psi_1(g_0)\}$, where $\Psi_1(g) = P_0 (\bar{Q}_0/\bar{g}_0)\, \bar{g}$ is an unknown smooth parameter of $g$. Our strategy is to first approximate this parameter by an easier (but still unknown) parameter $\Psi_1^r(g) = P_0 (\bar{Q}_0^r/\bar{g}_0)\, \bar{g}$, resulting in a second-order term: $\Psi_1(g_n) - \Psi_1(g_0) = \Psi_1^r(g_n) - \Psi_1^r(g_0) + o_P(1/\sqrt{n})$. This is carried out in the next lemma. The efficient influence curve of a target parameter $\Phi : \bar{g} \mapsto P_0 H \bar{g}$ (which treats $P_0$ and $H$ as known) at $g_0$ is given by $H(A - \bar{g}_0)$. Thus, one would like to construct $\bar{g}_n$ so that it solves the empirical mean of $H_0^r(A - \bar{g}_n)$ for $H_0^r = \bar{Q}_0^r/\bar{g}_0$, so that $\bar{g}_n$ targets the parameter $\Psi_1^r(g_0)$. However, $H_0^r$ is unknown. Therefore, $\bar{g}_n$ is instead constructed to solve the empirical mean of an estimate $H_n^r(A - \bar{g}_n)$ of the efficient influence curve $H_0^r(A - \bar{g}_n)$, and we will show that this indeed suffices to establish the asymptotic linearity of $\Psi_1^r(\bar{g}_n)$.
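In practice, solving $P_n H_n^r (A - \bar{g}_n) = 0$ can be arranged with a single logistic fluctuation of an initial $\bar{g}_n$ using $H_n^r$ as the fluctuation covariate: the MLE of the fluctuation parameter solves exactly this score equation. A minimal sketch, assuming arrays `gbar_n` (initial fit, with values in $(0,1)$) and `H_n_r` (the estimated $\bar{Q}_n^r/\bar{g}_n$) are available:

```python
import numpy as np
from scipy.optimize import brentq

def target_g(A, gbar_n, H_n_r):
    """Sketch: fluctuate logit gbar_n(eps) = logit gbar_n + eps * H_n_r and
    solve the score equation P_n H_n^r (A - gbar_n(eps)) = 0 in eps."""
    logit_g = np.log(gbar_n / (1.0 - gbar_n))

    def g_eps(eps):
        return 1.0 / (1.0 + np.exp(-(logit_g + eps * H_n_r)))

    def score(eps):                   # P_n H_n^r (A - gbar_n(eps))
        return np.mean(H_n_r * (A - g_eps(eps)))

    # score is strictly decreasing in eps (derivative -mean(H^2 g(1-g)) < 0),
    # so a root-bracketing solver works, assuming the bracket contains the root.
    eps_hat = brentq(score, -10.0, 10.0)
    return g_eps(eps_hat)
```

Equivalently, one could fit the fluctuation parameter by a standard logistic regression of $A$ on the single covariate $H_n^r$ with offset $\operatorname{Logit} \bar{g}_n$, whose score equation coincides with the displayed one.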

Lemma 2. Define $\Psi_1(g) = P_0 (\bar{Q}_0/\bar{g}_0)\, \bar{g}$, $\Psi_1^r(g) = P_0 (\bar{Q}_0^r/\bar{g}_0)\, \bar{g}$, $\bar{Q}_{0,n}^r \equiv E_{P_0}(Y \mid A = 1, \bar{g}_0(W), \bar{g}_n(W))$, and $\bar{Q}_0^r = E_{P_0}(Y \mid A = 1, \bar{g}_0(W))$, where $\bar{g}_n(W)$ is treated as a fixed function of $W$ when calculating the conditional expectation. Assume

$$R_{1,n} \equiv P_0 (\bar{Q}_{0,n}^r - \bar{Q}_0^r)(\bar{g}_n - \bar{g}_0)/\bar{g}_0 = o_P(1/\sqrt{n}).$$

Then,

$$\Psi_1(g_n) - \Psi_1(g_0) = \Psi_1^r(\bar{g}_n) - \Psi_1^r(\bar{g}_0) + o_P(1/\sqrt{n}).$$