In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. 
As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.
This introduction provides an atlas for the contents of this article. It starts with formulating the role of estimation of nuisance parameters to obtain asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target this estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtain such a targeted estimator of the nuisance parameter is described. Subsequently, we present our concrete example to which we will apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems. Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we provide an organization of the remainder of the article.
Suppose we observe n independent and identically distributed copies $O_1, \ldots, O_n$ of a random variable O with probability distribution $P_0$. In addition, assume that it is known that $P_0$ is an element of a statistical model $\mathcal{M}$ and that we want to estimate $\psi_0 = \Psi(P_0)$ for a given target parameter mapping $\Psi: \mathcal{M} \to \mathbb{R}$. In order to guarantee that $P_0 \in \mathcal{M}$, one is forced to only incorporate real knowledge, and, as a consequence, such models $\mathcal{M}$ are always very large and, in particular, are infinite dimensional. We assume that the target parameter mapping is path-wise differentiable and let $D^*(P)$ denote the canonical gradient of the path-wise derivative of $\Psi$ at $P$. An estimator $\psi_n = \hat{\Psi}(P_n)$ is a functional $\hat{\Psi}$ applied to the empirical distribution $P_n$ of $O_1, \ldots, O_n$ and can thus be represented as a mapping from the non-parametric statistical model into the real line. An estimator is efficient if and only if it is asymptotically linear with influence curve $D^*(P_0)$: $\psi_n - \psi_0 = \frac{1}{n} \sum_{i=1}^n D^*(P_0)(O_i) + o_P(1/\sqrt{n})$.
The empirical mean of the influence curve represents the first-order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals (i.e. ) of the empirical distribution [2–4].
Suppose that only depends on P through a parameter and that the canonical gradient depends on P only through and a nuisance parameter . The construction of an efficient estimator requires the construction of estimators and of these nuisance parameters and , respectively. Targeted minimum loss-based estimation (TMLE) represents a method for construction of (e.g. efficient) asymptotically linear substitution estimators , where is a targeted update of that relies on the estimator [5–7]. The targeting of is achieved by specifying a parametric submodel through the initial estimator and a loss function for , so that the generalized score spans a desired user-supplied estimating function . In addition, one may decide to target by specifying a parametric submodel and loss function for , so that the generalized score spans another desired estimating function for some estimator of nuisance parameter . The parameter is fitted with MLE , providing the first-step update , and similarly . This updating process that mapped a current fit into an update is iterated till convergence at which point the TMLE solves , i.e. the empirical mean of the estimating function equals zero at the final TMLE . If one also targeted , then it also solves . The submodel through will depend on , while the submodel through will depend on another nuisance parameter . By setting equal to the efficient influence curve , the resulting TMLE solves the efficient influence curve estimating equation and thereby will be asymptotically efficient when is consistent for under appropriate regularity conditions, where the targeting of is not needed.
The latter is shown as follows. By the property of the canonical gradient (in fact, any gradient) we have , where involves integrals of second-order products of the differences and . Combined with , this implies the following identity:
The first term is an empirical process term that, under empirical process conditions (mentioned below), equals , where denotes the limit of , plus an -term. This then yields
To obtain the desired asymptotic linearity of one needs , which in general requires at a minimum that both nuisance parameters are consistently estimated: and . However, in many problems of interest, only involves a cross-product of the differences and , so that converges to zero if either is consistent or is consistent: i.e. or . In this latter case, the TMLE is so-called double robust. Either way, the consistency of the TMLE relies now on one of the nuisance parameter estimators being consistent, thereby requiring the use of non-parametric adaptive estimation such as super-learning [8–10] for at least one of the nuisance parameters. If only one of the nuisance parameter estimators is consistent, and we are in the double robust scenario, then it follows that the bias is of the same order as the bias of the consistent nuisance parameter estimator. However, if the nuisance parameter estimator is not based on a correctly specified parametric model, but instead is a data-adaptive estimator, then this bias will be converging to zero at a rate slower than : i.e. converges to infinity as . In that case, the estimator of the target parameter may be overly biased and thereby will not be asymptotically linear.
In this article, we demonstrate that if , then it is essential that the consistent nuisance parameter estimator be targeted toward the estimand so that the bias for the estimand becomes second order: that is, in our new TMLEs relying on consistent estimation of presented in this article one simultaneously updates into a so that certain smooth functionals of , derived from the study of , are asymptotically linear under appropriate conditions. Even if both estimators and are consistent, but might be converging at a slower rate than , this targeting of the nuisance parameter estimator may still remove finite sample bias for the estimand. In addition, we also present such a TMLE when only relying on one of the nuisance parameters to be consistently estimated, but not knowing which one: i.e. either or . The same argument applies to other double robust estimators, such as estimating equation based estimators and inverse probability of treatment weighted (IPTW) estimators [11–16]. In fact, we demonstrate such a targeted IPTW-estimator in the next section.
The current article concerns the construction of such targeted IPTW and TMLE that are asymptotically linear under regularity conditions, even when only one of the nuisance parameters is consistent and the estimators of the nuisance parameters are highly data adaptive. In order to be concrete in this article, we will focus on a particular example. In such an example we can concretely present the second-order term mentioned above and thereby develop the concrete form of the TMLE.
The same approach for construction of such TMLE can be carried out in much greater generality, but that is beyond the scope of this article. Nonetheless, it is helpful for the reader to know that the general approach is the following (considering the case that , but Q can be misspecified): (1) approximate for some mapping that depends on (e.g. through ) and the data (e.g. ), and where is a second-order term so that it is reasonable to assume ; (2) approximate , where is a second-order term and is now a known (only based on data) mapping approximating ; (3) construct so that it is a TMLE of the target parameter thereby allowing an expansion with being the efficient influence curve of . That is, in step 3, is iteratively updated to solve with depending on through and a nuisance parameter , so that is an asymptotically linear estimator of under regularity conditions. After these three steps, we have that , where , and these steps provide us with the parameter that needs to be targeted by , thereby telling us how to target in the TMLE of . In addition, we can then conclude that this TMLE is asymptotically linear with known influence curve , where represents the limit of the efficient influence curve of : .
Let us now formulate the concrete example we will cover in this article. Let $O = (W, A, Y) \sim P_0$, with W baseline covariates, A a binary treatment, and Y a final outcome. Let $\mathcal{M}$ be a model that makes at most some assumptions about the conditional distribution of A, given W, but leaves the marginal distribution of W and the conditional distribution of Y, given $(A, W)$, unspecified. Let $\Psi: \mathcal{M} \to \mathbb{R}$ be defined as $\Psi(P) = E_P E_P(Y \mid A = 1, W)$, the so-called treatment specific mean controlling for the baseline covariates. The canonical gradient, also called the efficient influence curve, of $\Psi$ at $P$ is given by $D^*(P)(O) = \frac{A}{\bar{g}(W)} (Y - \bar{Q}(W)) + \bar{Q}(W) - \Psi(P)$, where $\bar{g}(W) = P(A = 1 \mid W)$ is the propensity score and $\bar{Q}(W) = E_P(Y \mid A = 1, W)$ is the outcome regression. Let $Q = (Q_W, \bar{Q})$, where $Q_W$ is the marginal distribution of W, and note that $\Psi(P)$ only depends on $P$ through $Q = Q(P)$. For convenience, we will denote the target parameter with $\Psi(Q)$ in order to not have to introduce additional notation. A targeted minimum loss-based estimator (TMLE) is a plug-in estimator $\Psi(Q_n^*)$, where $Q_n^*$ is an update of an initial estimator $Q_n^0$ that relies on an estimator $g_n$ of $g_0$, and it has the property that it solves $P_n D^*(Q_n^*, g_n) = 0$, where we used the notation $Pf = \int f(o) \, dP(o)$.
For this particular example, such TMLE are presented in Scharfstein et al. ; van der Laan and Rubin ; Bembom et al. [18–21]; Rosenblum and van der Laan ; Sekhon et al. ; van der Laan and Rose [6, 24]. Since [25, 26], where we use the notation and , and , we obtain the identity:
The first term equals if falls in a -Donsker class with probability tending to 1, and in probability as [4, 27]. If and are consistent for the true and , respectively, then the second term is a second-order term. If one now assumes that this second-order term is , it has been proven that the TMLE is asymptotically efficient. This provides the general basis for proving asymptotic efficiency of TMLE when both and are consistently estimated.
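The role of the second-order term can be checked numerically. The sketch below assumes the standard form of the identity for the treatment specific mean, namely that the mean of the efficient influence curve at a candidate pair (Q, g) equals the difference in the target parameter plus the mean of (g0 - g)/g times (Q0 - Q), and it assumes the marginal distribution of W is known. The three-point distribution and all numerical values are hypothetical and chosen only for illustration.

```python
# Exact check of the second-order remainder identity on a three-point W.
# All numbers below are hypothetical and chosen only for illustration.
p_w = [0.2, 0.5, 0.3]   # marginal distribution of W (treated as known)
g0 = [0.3, 0.5, 0.7]    # true propensity score P0(A = 1 | W = w)
Q0 = [0.1, 0.4, 0.8]    # true outcome regression E0(Y | A = 1, W = w)
g = [0.4, 0.4, 0.6]     # a misspecified propensity score
Q = [0.2, 0.5, 0.6]     # a misspecified outcome regression


def psi(Qbar):
    """Treatment specific mean: the average of Qbar(W) over the distribution of W."""
    return sum(p * q for p, q in zip(p_w, Qbar))


def mean_eic(Qbar, gbar):
    """Mean under P0 of A/g (Y - Qbar) + Qbar - psi(Qbar)."""
    inner = sum(p * (g0w / gw * (q0w - qw) + qw)
                for p, g0w, gw, q0w, qw in zip(p_w, g0, gbar, Q0, Qbar))
    return inner - psi(Qbar)


def remainder(Qbar, gbar):
    """Cross-product term: mean under P0 of (g0 - g)/g * (Q0 - Qbar)."""
    return sum(p * (g0w - gw) / gw * (q0w - qw)
               for p, g0w, gw, q0w, qw in zip(p_w, g0, gbar, Q0, Qbar))


lhs = mean_eic(Q, g)
rhs = psi(Q0) - psi(Q) + remainder(Q, g)
print(lhs, rhs)                             # agree up to rounding
print(remainder(Q0, g), remainder(Q, g0))   # vanish when either fit is correct
```

The remainder vanishes as soon as one of the two nuisance parameters is correct, which is the double robustness property; when only one estimator is consistent, the term is no longer a product of two small differences, which is why it then has to be analyzed as a first-order term.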
However, if only one of these nuisance parameter estimators is consistent, then the second term is still a first-order term, and it remains to establish that it is also asymptotically linear with a second-order remainder. For sake of discussion, suppose that converges to a wrong while is consistent. In that case, this remainder behaves in first order as . To establish that such a term is asymptotically linear requires that solves a particular estimating equation: that is, needs to be a TMLE itself targeting the required smooth functional of . This is naturally achieved within the TMLE framework by specifying a submodel through and loss function with the appropriate generalized score, so that a TMLE update step involves both updating and , and the iterative TMLE algorithm now results in a final TMLE , not only solving but also these additional equations that allow us to establish asymptotic linearity of the desired smooth functional of : see general description of TMLE above.
In this article, we present TMLE that targets in a manner that allows us to prove the desired asymptotic linearity of the second term in the right-hand side of eq. (1) when either or is consistent, under conditions that require specified second-order terms to be . The latter type of regularity conditions are typical for the construction of asymptotically linear estimators and are therefore considered appropriate for the sake of this article. Though it is of interest to study cases in which these second-order terms cannot be assumed to be , this is beyond the scope of this article.
The construction of TMLE that utilizes targeting of the nuisance parameter has been carried out in earlier papers. For example, in van der Laan and Rubin , we target to obtain a TMLE that, beyond being double robust locally efficient, also equals the IPTW-estimator. In Gruber and van der Laan  we target to guarantee that, beyond being double robust locally efficient, also outperforms a user-supplied given estimator, based on the original idea of Rotnitzky et al. . In that sense, the distinction of the current article with these previous articles is that is now targeted to guarantee that the TMLE remains asymptotically linear when is misspecified. This task of targeting appears to be one step more complicated than in these previous articles, since the smooth functionals of that need to be targeted are themselves indexed by parameters of the true data distribution , and thus unknown. As mentioned above, our strategy is to approximate these unknown smooth functionals by an estimated smooth functional and develop the targeted estimator that targets this estimated parameter of .
The TMLEs presented in this article are always iterative and thereby rely on convergence of the iterative updating algorithm. Since the empirical risk decreases at each updating step, such convergence is typically guaranteed by the existence of the MLE at each updating step (e.g. an MLE of the coefficient in a logistic regression). Either way, in this article, we assume this convergence to hold. Since the assumptions of our theorems require to be bounded away from zero, we demonstrate how this property can be achieved by using submodels for updating that guarantee this property. Detailed simulations will appear in a future article.
The organization of this paper is as follows. In Section 2, we introduce a targeted IPTW-estimator that relies on an adaptive consistent estimator of , and we establish its asymptotic linearity with known influence curve, allowing for the construction of asymptotically valid confidence intervals based on this adaptive IPTW-estimator. In the remainder of the article, we focus on construction of TMLE involving the targeting of to establish the asymptotic linearity of the resulting TMLE under appropriate conditions. In Section 3, we introduce a novel TMLE that assumes that the targeted adaptive estimator is consistent for , and we establish its asymptotic linearity. In Section 4, we introduce a novel TMLE that only assumes that either the targeted or the targeted is consistent, and we establish its asymptotic linearity with known influence curve. This TMLE needs to protect the asymptotic linearity under misspecification of either or , and, as a consequence, relies on targeting of (in order to preserve asymptotic linearity when is inconsistent), but also extra targeting of (in order to preserve asymptotic linearity when is consistent, but is inconsistent). The explicit form of the influence curve of this TMLE allows us to construct asymptotic confidence intervals. Since this result allows statistical inference in the statistical model that only assumes that one of the estimators is consistent, we refer to this as “double robust statistical inference”. Even though double robust estimators have been extensively presented in the current literature, double robust statistical inference in these large semi-parametric models has been a difficult topic: typically, the non-parametric bootstrap has been suggested, but there is no theory supporting that the non-parametric bootstrap is a valid method when the estimators rely on data-adaptive estimation.
In Section 5, we extend the TMLE of Section 3 (that relies on being consistent for ) to the case that converges to a possibly misspecified g but one that suffices for consistent estimation of in the sense that will be consistent. We present a corresponding asymptotic linearity theorem for this TMLE that is able to utilize the so-called collaborative double robustness of the efficient influence curve which states that if and for a set (including ). In order to construct a collaborative estimator that aims to converge to an element in in collaboration with , we use the framework of collaborative targeted minimum loss-based estimator (C-TMLE) [20, 29–35]. Our asymptotic linearity theorem can now be applied to this C-TMLE. Again, even though C-TMLEs have been presented in the current literature, statistical inference based on the C-TMLEs has been another challenging topic, and Section 5 provides us with a C-TMLE with known influence curve. We conclude this article with a discussion. The proofs of the theorems are presented in the Appendix.
In the following sections, we will use the following notation. We have , where is a statistical model that makes only assumptions on the conditional distribution of A, given W. Let , and . The target parameter is defined by , where , which will also be denoted with , and is the distribution of W under . We also use the notation , where . In addition, denotes the efficient influence curve of at . We also use the following notation:
We first describe an IPTW-estimator that uses super-learning to fit the treatment mechanism . Subsequently, we present this IPTW-estimator but now using an update of the super-learning fit of , and we present a theorem establishing the asymptotic linearity of this targeted IPTW-estimator under appropriate conditions. Finally, we discuss how this targeted IPTW-estimator compares with an IPTW-estimator that relies on a parametric model to fit the treatment mechanism.
We consider a simple IPTW-estimator , where , and is an adaptive estimator of based on the log-likelihood loss function . For a general presentation of an IPTW-estimator, we refer to Robins and Rotnitzky , van der Laan and Robins , and Hernan et al. . We wish to establish conditions under which reliable statistical inference based on this estimator of can be obtained. One might wish to estimate with ensemble learning, and, in particular, super-learning in which cross-validation  is used to determine the best weighted combination of a library of candidate estimators: van der Laan and Dudoit ; van der Laan et al. [9, 38, 39]; van der Vaart et al. ; Dudoit and van der Laan ; Polley et al. ; Polley and van der Laan ; van der Laan and Petersen . The super-learner is a general template for construction of an adaptive estimator based on a library of candidate estimators, a loss function whose expectation is minimized over the parameter space by the true parameter value, and a parametric family that defines “weighted” combinations of the estimators in the library. We will start with presenting a succinct description of a particular super-learner. Consider a library of estimators , and a family of weighted (on logistic scale) combinations of these estimators , indexed by vectors for which and . Consider a random sample split into a training sample of size n(1 - p) and a validation sample of size np, and let and denote the empirical distribution of the validation sample and training sample, respectively. Define
as the choice of estimator that minimizes cross-validated risk. The super-learner of is defined as the estimator .
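The selection step just described can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single sample split, a library of two hypothetical learners for a binary covariate (`fit_marginal` and `fit_stratified`), a coarse grid of weight vectors in place of a numerical optimizer, and simulated data. The candidates are combined on the logistic scale with non-negative weights summing to one, and the weights are chosen to minimize the log-likelihood loss on the validation sample.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def clip(p, lo=0.01, hi=0.99):
    return min(max(p, lo), hi)


def fit_marginal(train):
    """Hypothetical learner 1: ignores W and returns the treated fraction."""
    p = clip(sum(a for _, a in train) / len(train))
    return lambda w: p


def fit_stratified(train):
    """Hypothetical learner 2: treated fraction within each level of binary W."""
    est = {}
    for level in (0, 1):
        rows = [a for w, a in train if w == level]
        est[level] = clip(sum(rows) / len(rows)) if rows else 0.5
    return lambda w: est[w]


def combine(fits, alpha):
    """Weighted combination of the candidate fits on the logistic scale."""
    return lambda w: expit(sum(a * logit(f(w)) for a, f in zip(alpha, fits)))


def log_loss(g, sample):
    """Empirical mean of the negative log-likelihood loss for a fit g of P(A = 1 | W)."""
    return -sum(a * math.log(g(w)) + (1 - a) * math.log(1.0 - g(w))
                for w, a in sample) / len(sample)


def super_learner(data, learners, p=0.2, grid=10):
    n_val = int(len(data) * p)
    val, train = data[:n_val], data[n_val:]          # validation / training split
    fits = [learn(train) for learn in learners]      # candidates fit on training data
    candidates = [(k / grid, 1 - k / grid) for k in range(grid + 1)]
    risks = {alpha: log_loss(combine(fits, alpha), val) for alpha in candidates}
    best = min(risks, key=risks.get)                 # cross-validation selector
    full_fits = [learn(data) for learn in learners]  # refit on the whole sample
    return combine(full_fits, best), best, risks


# Simulated data (hypothetical): W binary, treatment probability 0.8 or 0.2.
random.seed(0)
data = []
for _ in range(500):
    w = 1 if random.random() < 0.5 else 0
    a = 1 if random.random() < (0.8 if w else 0.2) else 0
    data.append((w, a))

g_sl, alpha_n, cv_risks = super_learner(data, [fit_marginal, fit_stratified])
print(alpha_n, g_sl(0), g_sl(1))
```

Because the grid contains the two weight vectors that put all mass on a single learner, the selected combination has validation risk no larger than that of either learner alone. A V-fold scheme would average the validation risk over several splits instead of using one.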
The next theorem presents an IPTW-estimator that uses a targeted fit of , involving the updating of an initial estimator , and conditions under which this IPTW-estimator of is asymptotically linear. For example, could be defined as a super-learner of the type presented above. In spite of the fact that such an IPTW-estimator uses a very data adaptive and hard to understand estimator , this theorem shows that its influence curve is known and can be well estimated.
Theorem 1. We consider a targeted IPTW-estimator , where , and is an update of an initial estimator of defined below.
Definition of targeted estimator : Let be obtained by non-parametric estimation of the regression function treating as a fixed covariate (i.e. function of W). This yields an estimator of , where . Consider the submodel , and fit with the MLE
We define as the corresponding targeted update of . This TMLE satisfies
Empirical process condition: Assume that fall in a -Donsker class with probability tending to 1.
Negligibility of second-order terms: Define . Assume with probability tending to 1 and assume
So under the conditions of this theorem, we can construct an asymptotic 0.95-confidence interval based on this targeted IPTW-estimator , where
and is the plug-in estimator of the influence curve obtained by plugging in or for and for .
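The interval construction only needs the point estimate and the estimated influence curve evaluated at each observation. The sketch below shows that generic step. As a simplifying assumption, the toy illustration uses an IPTW-estimator with a known propensity score and the correspondingly simple influence curve A Y / g - psi, not the targeted fit and influence curve of the theorem; the data values are hypothetical.

```python
import math


def wald_ci(psi_n, ic_values, z=1.96):
    """Interval psi_n +/- z * sigma_n / sqrt(n), with sigma_n^2 the empirical mean of IC^2."""
    n = len(ic_values)
    sigma2 = sum(v * v for v in ic_values) / n
    half_width = z * math.sqrt(sigma2 / n)
    return psi_n - half_width, psi_n + half_width


# Hypothetical toy data: treatment indicators, outcomes, and propensity scores.
A = [1, 0, 1, 1, 0, 0, 1, 0]
Y = [0.9, 0.2, 0.4, 0.7, 0.1, 0.5, 0.6, 0.3]
g = [0.5] * len(A)   # assumed known here; the theorem uses a targeted fit instead

# IPTW point estimate and plug-in influence curve values.
psi_n = sum(a * y / gw for a, y, gw in zip(A, Y, g)) / len(A)
ic_n = [a * y / gw - psi_n for a, y, gw in zip(A, Y, g)]

lower, upper = wald_ci(psi_n, ic_n)
print(psi_n, lower, upper)
```

The same `wald_ci` helper applies to any asymptotically linear estimator once its influence curve can be estimated, which is the practical payoff of a theorem that identifies the influence curve explicitly.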
Regarding the displayed second-order term conditions, we note that these are satisfied if converges to zero w.r.t. -norm at rate , for some with probability tending to 1 as , and the product of the rates at which converges to and converges to is .
Regarding the empirical process condition, we note that an example of a Donsker class is the class of multivariate real-valued functions with uniform sectional variation norm bounded by a universal constant . It is important to note that if each estimator in the library falls in such a class, then also the convex combinations fall in that same class . So this Donsker condition will hold if it holds for each of the candidate estimators in the library of the super-learner.
Consider an IPTW-estimator using an MLE according to a parametric model for , and let us contrast this IPTW-estimator with an IPTW-estimator defined in the above theorem based on an initial super-learner that includes as an element of the library of estimators. Let us first consider the case that the parametric model is correctly specified. In that case converges to at a parametric rate . From the oracle inequality for cross-validation [8, 10, 38], it follows that also converges at the rate to possibly up to a -factor in case the number of algorithms in the library is of the order for some fixed q. As a consequence, all the consistency and second-order term conditions for the IPTW-estimator using a targeted based on hold. If one uses estimators in the library of algorithms that have a uniform sectional variation norm smaller than a with probability tending to 1, then also a weighted average of these estimators will have uniform sectional variation norm smaller than with probability tending to 1. Thus, in that case we will also have that fall in a -Donsker class. Examples of estimators that control the uniform sectional variation norm are any parametric model with fewer than K main terms that themselves have a uniform sectional variation norm, but also penalized least-squares estimators (e.g. Lasso) using basis functions with bounded uniform sectional variation norm, and one could map any estimator into this space of functions with universally bounded uniform sectional variation norm through a smoothing operation. Thus, under this restriction on the library, the IPTW-estimator using the super-learner is asymptotically linear with influence curve as stated in the theorem. We note that is the efficient influence curve for the target parameter if the observed data were instead of .
The parametric IPTW-estimator is asymptotically linear with influence curve , where is the tangent space of the parametric model for , and denotes the projection of f onto in the Hilbert space . This IPTW-estimator could be less or more efficient than the IPTW-estimator using the targeted super-learner depending on the actual tangent space of the parametric model.
For example, if the parametric model happens to have a score equal to , then the parametric IPTW-estimator would be asymptotically efficient. Of course, a standard parametric model is not tailored to correspond with such optimal scores, but this shows that we cannot claim superiority of one versus the other in the case that the parametric model for is correctly specified.
If, on the other hand, the parametric model is misspecified, then the IPTW-estimator using is inconsistent. However, the super-learner will be consistent if the library contains a non-parametric adaptive estimator, and will perform asymptotically as well as the oracle selector among all the weighted combinations of the algorithms in the library. To conclude, the IPTW-estimator using super-learning to estimate will be as good as the IPTW-estimator using a correctly specified parametric model (included in the library of the super-learner), but will remain consistent and asymptotically linear in a much larger model than the parametric IPTW-estimator relying on the true being an element of the parametric model.
In the next subsection, we present a TMLE that targets the fit of the treatment mechanism, analogous to the targeted IPTW-estimator presented above. In addition, this subsection presents a formal asymptotic linearity theorem demonstrating that this TMLE will be asymptotically linear even when is inconsistent under reasonable conditions. We conclude this section with a subsection showing how the iterative updating of the treatment mechanism can be carried out in such a way that the final fit of the treatment mechanism is still bounded away from zero, as required to obtain a stable estimator.
The following theorem presents a novel TMLE and corresponding asymptotic linearity with specified influence curve, where we rely on consistent estimation of . The TMLE still uses the same updating step for the estimator of as the regular TMLE , but uses a novel updating step for the estimator of , analogous to the updating step of the IPTW-estimator in the previous section. We remind the reader of the importance of using the logistic fluctuations as working-submodels for in the definition of the TMLE, guaranteeing that the TMLE update stays within the bounded parameter space (see, e.g. Gruber and van der Laan ).
Iterative targeted MLE of :
Definitions: Given , let be a consistent estimator of the regression of on and . Let be an initial estimator of .
Initialization: Let , and . Let .
Updating step for : Consider the submodel , and fit with the MLE
We define as the corresponding update of . This satisfies
Updating step for : Let be the quasi-log-likelihood loss function for (allowing that Y is continuous in ). Consider the submodel , and let . Define as the resulting update. Define .
Iterating till convergence: Now, set , and iterate this updating process mapping a into till convergence or till large enough K so that the estimating equations (2) below are solved up till an -term. Denote the limit of this iterative procedure with .
Plug-in estimator: Let , where is the empirical distribution estimator of . The TMLE of is defined as .
Estimating equations solved by TMLE: This TMLE solves
Empirical process condition: Assume that , falls in a -Donsker class with probability tending to 1 as .
Negligibility of second-order terms: Define
where is treated as a fixed covariate (i.e. function of W) in the conditional expectation . Assume that there exists a , so that with probability tending to 1, and
Thus, under the assumptions of this theorem, an asymptotic 0.95-confidence interval is given by , where , and .
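The updating step for the outcome regression, which this TMLE shares with the regular TMLE, can be sketched as follows. This is a minimal illustration under stated assumptions: it covers only the outcome regression update (not the novel targeting of the treatment mechanism), it assumes the usual clever covariate A / g, the propensity score fit is taken as given, and the fluctuation parameter is found by bisection on the score of the quasi-log-likelihood instead of by a logistic regression routine. Both approaches solve the same score equation because the score is decreasing in the fluctuation parameter. The simulated data and the constant initial fit are hypothetical.

```python
import math
import random


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logit(p):
    return math.log(p / (1.0 - p))


def score(eps, A, Y, Qbar, g):
    """Empirical mean of H (Y - Qbar(eps)) with clever covariate H = A / g."""
    total = 0.0
    for a, y, q, gw in zip(A, Y, Qbar, g):
        h = a / gw
        total += h * (y - expit(logit(q) + eps * h))
    return total / len(A)


def update_Q(A, Y, Qbar, g, lo=-10.0, hi=10.0, iters=80):
    """Fit eps by bisection on the decreasing score; return the updated fit and eps."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, A, Y, Qbar, g) > 0:
            lo = mid   # score still positive: the root lies to the right
        else:
            hi = mid
    eps = 0.5 * (lo + hi)
    # Updated outcome regression under treatment, evaluated for every unit.
    return [expit(logit(q) + eps / gw) for q, gw in zip(Qbar, g)], eps


# Simulated data (hypothetical): W uniform, binary treatment and binary outcome.
random.seed(1)
n = 400
W = [random.random() for _ in range(n)]
g_n = [0.3 + 0.4 * w for w in W]                       # propensity fit, taken as given
A = [1 if random.random() < gw else 0 for gw in g_n]
Y = [1 if random.random() < expit(-1 + 2 * w) else 0 for w in W]
Q_init = [0.5] * n                                     # deliberately crude initial fit

Q_star, eps_n = update_Q(A, Y, Q_init, g_n)
psi_n = sum(Q_star) / n                                # plug-in estimator
eic_mean = sum(a / gw * (y - q) for a, y, q, gw in zip(A, Y, Q_star, g_n)) / n
print(psi_n, eps_n, eic_mean)
```

After the update, the empirical mean of A / g (Y - Q*) is zero up to numerical precision, and the remaining part of the efficient influence curve has empirical mean zero by the definition of the plug-in estimator, so the efficient influence curve equation is solved.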
The following is an application of the constrained logistic regression approach of the type presented in Gruber and van der Laan  for the purpose of estimation of respecting the constraint that for a known . Recall that . Suppose that it is known that for some , a condition the asymptotic linearity of our proposed estimators relies upon. Define . We have , where is a regression that is known to be between . Let be an initial estimator of the true conditional distribution of , given W, which implies an estimator of . Let . Consider the following submodel for the conditional distribution of , given W, through a given estimator :
The MLE is simply obtained with logistic regression of on W (see, e.g. Gruber and van der Laan ) based on the quasi-log-likelihood loss function:
is the quasi-log-likelihood loss. The update implies an update of , and, by construction . The above submodel and corresponding loss function generates the same score equation as the submodel and loss function used in Theorem 2. Therefore, the TMLE algorithm presented in Theorem 2 but now using this -specific logistic regression model solves the same estimating equations, so that the same Theorem 2 immediately applies. However, using this submodel we have now guaranteed that for all k in the iterative TMLE algorithm, and thereby that .
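The effect of the constrained submodel can be illustrated with a small sketch. It assumes the rescaling g_delta = (g - delta) / (1 - delta), a logistic fluctuation of the rescaled fit, and a mapping back to the original scale; the covariate value `h`, the bound `delta`, and the fluctuation values are hypothetical. An unconstrained logistic fluctuation is included for comparison.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logit(p):
    return math.log(p / (1.0 - p))


def bounded_fluctuation(g, h, eps, delta):
    """Fluctuate g on the rescaled logit scale so that the update stays in [delta, 1]."""
    g_delta = (g - delta) / (1.0 - delta)            # rescaled fit, between 0 and 1
    g_delta_eps = expit(logit(g_delta) + eps * h)    # logistic fluctuation
    return delta + (1.0 - delta) * g_delta_eps       # back to the original scale


def plain_fluctuation(g, h, eps):
    """Unconstrained logistic fluctuation, shown only for comparison."""
    return expit(logit(g) + eps * h)


delta = 0.05
g, h = 0.2, 3.0                          # hypothetical fitted value and covariate
eps_values = (-5.0, -1.0, 0.0, 1.0, 5.0)
bounded = [bounded_fluctuation(g, h, e, delta) for e in eps_values]
plain = [plain_fluctuation(g, h, e) for e in eps_values]
print(bounded)
print(plain)
```

However large the fluctuation, the constrained update never falls below delta, while the unconstrained update can be pushed arbitrarily close to zero. At a fluctuation of zero both versions return the initial fit.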
In this section, our aim is to present a TMLE that is asymptotically linear with known influence curve if either or is consistently estimated, but we do not need to know which one. Again, this requires a novel way of targeting the estimators in order to arrange that the relevant smooth functionals of these nuisance parameter estimators are indeed asymptotically linear under appropriate second-order term conditions. In this case, we also need to augment the submodel for the estimator of with another clever covariate: that is, our estimator of needs to be double targeted, once for solving the efficient influence curve equation, but also for achieving asymptotic linearity in the case that the estimator of is misspecified.
Definitions: For any given , let and be consistent estimators of and , respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let and denote these estimators applied to the TMLEs defined below.
Iterative targeted MLE of :
Initialization: Let , be an initial estimator of . Let , and let . Let be obtained by non-parametrically regressing A on . Let be obtained by non-parametrically regressing on .
Updating step: Consider the submodel , and fit with the MLE
Define the submodel , where
Let be the MLE, where is the quasi-log-likelihood loss.
We define as the corresponding targeted update of , and as the corresponding update of . Let and .
Iterate till convergence: Now, set , and iterate this updating process mapping a into till convergence or till large enough K so that the following three estimating equations are solved up till an -term:
Final substitution estimator: Denote the limits of this iterative procedure with . Let , where is the empirical distribution estimator of . The TMLE of is defined as .
Equations solved by TMLE:
Empirical process condition: Assume that , , fall in a -Donsker class with probability tending to 1 as .
Negligibility of second-order terms: Define and . Assume that there exists a so that with probability tending to 1, that are consistent for w.r.t. -norm, where either or , and assume that the following second-order terms are :
Note that consistent estimation of the influence curve relies on consistency of as estimators of , and estimators converging to a for which either or . These estimators imply an estimated influence curve . An asymptotic 0.95-confidence interval is given by , where .
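The influence curve based interval described above can be computed directly once the estimated influence curve has been evaluated at each observation. The sketch below assumes the usual Wald form, with the variance estimated by the empirical second moment of the estimated influence curve; the function name and arguments are hypothetical.

```python
import numpy as np

def ic_confidence_interval(psi, ic, z=1.96):
    """Wald-type interval psi +/- z * sigma / sqrt(n).

    psi : point estimate of the target parameter
    ic  : estimated influence curve evaluated at each of the n observations
    z   : normal quantile (1.96 for a 0.95-interval)
    """
    ic = np.asarray(ic, dtype=float)
    n = ic.shape[0]
    sigma2 = np.mean(ic ** 2)          # empirical second moment of the influence curve
    half_width = z * np.sqrt(sigma2 / n)
    return psi - half_width, psi + half_width
```

Because the influence curve values are a by-product of the estimation procedure, this interval requires no resampling.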
If , then , and therefore for all . If , then it follows that , and thus that for all . In particular, if both and , then . We also note that if , but is a true conditional distribution of A, given some function of W for which is only a function of , then it follows that and thus .
As shown in the final remark of the Appendix, the condition of Theorem 3 that either or can be weakened to having to satisfy , allowing for the analysis of collaborative double robust TMLE, as discussed in the next section. However, as shown in the next section, if one arranges in the TMLE algorithm that (i.e. already non-parametrically adjusts for ), then there is no need for the extra targeting in , and the influence curve will be .
We first review the theoretical underpinning for collaborative estimation of nuisance parameters, in this case, the outcome regression and treatment mechanism. Subsequently, we explain that the desired collaborative estimation can be achieved by applying the previously established template for construction of a C-TMLE to a TMLE that solves certain estimating equations when given an initial estimator of . This C-TMLE template involves (1) creating a sequence of TMLEs constructed in such a manner that the empirical risk of both and is decreasing in k, and (2) using cross-validation to select the k for which is the best fit of . Subsequently, we present this TMLE that maps an initial of into targeted estimators solving the desired estimating equations and establish its asymptotic linearity under appropriate conditions, including that the initial estimator of is collaboratively consistent. Finally, we present a concrete C-TMLE algorithm that uses this TMLE algorithm as its basis: a C-TMLE is still a TMLE, but one based on a data-adaptively selected initial estimator that is collaboratively consistent, so that the same theorem can be applied to this C-TMLE.
We note that . If , this reduces to
Let be the class of all possible distributions of A, given W, and let be the true conditional distribution of A given W. We define the set . For any , we have . Suppose we have an estimator satisfying and converging to a so that . Then it follows that and , thereby establishing that is a consistent estimator of . Let us state this crucial result as a lemma.
Lemma 1 (van der Laan and Gruber). If , and , then . More generally, .
We note that contains the true conditional distributions of A, given , for which is a function of , i.e. for which only depends on W through . We refer to such distributions as reduced treatment mechanisms. However, it contains many more conditional distributions since any conditional distribution g for which is orthogonal to in is an element of . We refer to van der Laan and Gruber  and Gruber and van der Laan  for the introduction and general notion of collaborative double robustness.
The general C-TMLE introduced in van der Laan and Gruber  provides a template for construction of a TMLE satisfying and converging to a with so that and thereby . Thus C-TMLE provides a template for construction of targeted MLEs that exploit the collaborative double robustness of TMLEs in the sense that a TMLE will be consistent as long as converges to a for which . The goal is not to estimate the true treatment mechanism, but instead to construct a that converges to a conditional distribution given a reduction of W that is an element of . We could state that, just as the propensity score provides a sufficient dimension reduction for the outcome regression, so does, given , provide a sufficient dimension reduction for the propensity score regression in the TMLE. The current literature appears to agree that propensity score estimators are best evaluated with respect to their effect on estimation of the causal effect of interest, not by metrics such as likelihoods or classification rates [45–48], and the above-stated general collaborative double robustness provides a formal foundation for such claims.
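The collaborative double robustness described above can be checked numerically in a toy setting. The Python sketch below is an assumption-laden illustration and not the C-TMLE itself: it uses the simple estimating-equation estimator, two binary covariates, a misspecified outcome regression whose error depends on W only through the first covariate, and a propensity score that conditions on that covariate alone. All names and the data-generating distribution are invented for this example.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n, seed=0):
    """Toy data: both w1 and w2 affect treatment and outcome."""
    rng = np.random.default_rng(seed)
    w1 = rng.binomial(1, 0.5, n)
    w2 = rng.binomial(1, 0.5, n)
    g0 = expit(-1.0 + 2.0 * w1 - 1.0 * w2)          # true propensity score
    a = rng.binomial(1, g0)
    y = a * (1.0 + w1 + 2.0 * w2) + rng.normal(size=n)
    return w1, w2, a, y                              # true E[Y | A=1, W] averages to 2.5

def stratified_propensity(a, stratum):
    """Empirical P(A = 1 | stratum) for a discrete stratum variable."""
    g = np.empty(len(a))
    for s in np.unique(stratum):
        g[stratum == s] = a[stratum == s].mean()
    return g

def estimating_equation_estimator(y, a, q1, g1):
    """Mean of A / g * (Y - Q) + Q, with Q an estimate of E[Y | A=1, W]."""
    return float(np.mean(a / g1 * (y - q1) + q1))

w1, w2, a, y = simulate(100_000)
q1_misspecified = 1.0 + 2.0 * w2                     # omits w1, so the error is a function of w1 only

# Reduced treatment mechanism: adjusts only for what the outcome regression got wrong.
psi_reduced = estimating_equation_estimator(
    y, a, q1_misspecified, stratified_propensity(a, w1))

# Marginal treatment probability: fails to adjust for the residual bias in w1.
psi_marginal = estimating_equation_estimator(
    y, a, q1_misspecified, np.full(len(a), a.mean()))
```

In this design the estimator using the reduced propensity score targets the true value even though that propensity score is not the true treatment mechanism, whereas the estimator using the marginal treatment probability remains biased.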
The general C-TMLE has been implemented and applied to point treatment and longitudinal data [20, 29–33, 35]. A C-TMLE algorithm relies on a TMLE algorithm that maps an initial into a TMLE and uses this algorithm in combination with a targeted variable selection algorithm for generating candidate models for the propensity score to generate a sequence of candidate TMLEs , increasingly non-parametric in k, and finally uses cross-validation to select the best TMLE among these candidate estimators of .
Our next theorem presents a TMLE algorithm and a corresponding influence curve under the assumption that the propensity score correctly adjusts for the possibly misspecified and . The presented TMLE algorithm already arranges that this TMLE indeed non-parametrically adjusts for . In the next subsection, we will present an actual C-TMLE algorithm that generates a TMLE for which the propensity score is targeted to adjust for , so that this theorem can be applied.
Definitions: For any given , let and be consistent estimators of and , respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let and denote these estimators applied to the TMLE defined below.
“Score” equations the TMLE should solve: Below, we describe an iterative TMLE algorithm that results in estimators , , that solve the following equations:
Iterative targeted MLE of :
Initialization: Let and (e.g. aiming to adjust for ) be initial estimators.
Let , , and .
Updating step: Consider the submodel , and fit with the MLE
Define the submodel and let be the quasi-log-likelihood loss function for . Let be the MLE. Let , , and .
Iterating till convergence: Now, set and iterate this updating process mapping a into till convergence or till large enough K so that the following estimating equations are solved up till an -term:
Final substitution estimator: Denote these limits (in k) of this iterative procedure with , , . Let , where is the empirical distribution estimator of . The TMLE of is defined as .
Assumption on limits of : Assume that is consistent for w.r.t. -norm, where for some function of W for which only depends on W through , and assume that , where the latter holds, in particular, if only depends on W through (e.g. involves non-parametric adjustment by ). As a consequence, we have .
Empirical process condition: Assume that , fall in a -Donsker class with probability tending to 1 as .
Negligibility of second-order terms: Define
Assume that the following conditions hold for each of the following possible definitions of : , , . Note that is the limit of each of these choices for .
We assume are bounded away from with probability tending to one, and
Thus, consistency of this TMLE relies upon the consistency of as an estimator of , and estimator converging to a for which equals a true conditional mean of A, given , and only depend on W through . Since depends on how well approximates , depends on how well approximates , beyond the behavior of the non-parametric regression defining . In addition, depends on either how well approximates or how well approximates . As a consequence, it follows that each of the second-order terms displayed in the theorem involves square differences of approximation errors and .
It is also interesting to note that the algebraic form of the influence curve of this TMLE is identical to the influence curve of the TMLE of Theorem 2 that relied on being consistent for .
The TMLE algorithm presented in Theorem 4 maps an initial estimator into an updated estimator that solves the two estimating equations (3), allowing for statistical inference with known influence curve if the initial estimator is collaboratively consistent (i.e. the limits of satisfy the condition in the theorem). The updating algorithm results in a that non-parametrically adjusts for itself, and thus for its limit in the limit. The condition on the limit g was that it should non-parametrically adjust not only for but also for . If the initial estimator already adjusted for an approximation of , for example, is already a C-TMLE, then this condition might hold approximately. Nonetheless, we want to present a C-TMLE algorithm that simultaneously fits g in response to , but also carries out the non-parametric adjustment by . The latter is normally not part of the C-TMLE algorithm, but we want to enforce this in order to be able to apply Theorem 3 and thereby obtain a known influence curve. We achieve this goal in this subsection by applying the C-TMLE algorithm as presented by van der Laan and Gruber  and to the particular TMLE algorithm presented in Theorem 4.
First, we compute a set of K univariate covariates , i.e. functions of W, which we will refer to as main terms, even though a term could be an interaction term or a super-learning fit of the regression of A on a subset of the components of W. Let be the full collection of main terms. In the previous subsection, we defined an algorithm that maps an initial into a TMLE . Let be the loss function for .
The general template of a C-TMLE algorithm is the following: given a TMLE algorithm that maps any initial into a TMLE , the C-TMLE algorithm generates a sequence of increasing sets of k main terms, where each set has an associated estimator of , and simultaneously it generates a corresponding sequence of , , where both and are increasingly non-parametric in k. Here increasingly non-parametric means that the empirical mean of the loss function of the fit is decreasing in k. This sequence maps into a corresponding sequence of TMLEs using the TMLE algorithm presented in Theorem 4. In this variable selection algorithm, the choice of the next main term to add, mapping into , is based on how much the TMLE using the g-fit implied by , using as initial estimator, improves the fit of the corresponding TMLE for . Cross-validation is used to select k among these candidate TMLEs , , where the last TMLE uses the most aggressive bias reduction by being based on the most non-parametric estimator implied by .
In order to present a precise C-TMLE algorithm we will first introduce some notation. For a given subset of main terms , let be its complement within . In the C-TMLE algorithm, we use a forward selection algorithm that augments a given set into a next set obtained by adding the best main term among all main terms in the complement of . Each choice corresponds with an estimator of . In other words, the algorithm iteratively updates a current estimate into a new estimate , but the criterion for g does not measure how well g fits ; it measures how well the TMLE of that uses this g (and as initial estimator ) fits .
Given a set , an initial , we define a corresponding obtained by MLE-fitting of in the logistic regression working model
where we remind the reader of the definition . Thus, this estimator involves non-parametric adjustment by , augmented with a linear regression component implied by . This function mapping into a fit will be denoted with . This also allows us to define a mapping from into a TMLE defined by the TMLE algorithm of Theorem 4 applied to initial and . We will denote this mapping into with .
The C-TMLE algorithm defined below generates a sequence and thereby corresponding TMLEs , , where represents an initial estimate, a subset of main terms that defines , and the corresponding TMLE that starts with . These TMLEs represent subsequent updates of the initial estimator . The corresponding main term set that defines in this k-specific TMLE, increases in k, one unit at a time: is empty, , . The C-TMLE uses cross-validation to select k, and thereby to select the TMLE that yields the best fit of among the k-specific TMLEs that are increasingly aggressive in their bias-reduction effort. This C-TMLE algorithm is defined as follows and uses the same format as presented in Wang et al. :
Initiate algorithm: Set initial TMLE. Let , and , be initial estimates of , , and let be the empty set. Let . This defines an initial TMLE
Determine next TMLE. Determine the next best main term to add:
then , else , and
[In words: If the next best main term added to the fit of yields a TMLE of that improves upon the previous TMLE , then we accept this best main term, and we have our next and corresponding TMLE (which still uses the same initial estimate of as uses). Otherwise, reject this best main term, update the initial estimate in the candidate TMLEs to the previous TMLE of , and determine the best main term to add again. This best main term will now always result in an improved fit of the corresponding TMLE of , so that we now have our next TMLE (which now uses a different initial estimate than used).]
Iterate. Run this from to K at which point . This yields a sequence and corresponding TMLE , .
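The forward selection with its accept/reject rule can be sketched generically. In the Python sketch below, `tmle_fit` and `risk` are hypothetical callables standing in for the TMLE mapping of Theorem 4 and the empirical risk of the resulting outcome regression fit; the sketch only captures the control flow described in words above and does not verify that the restarted step improves the fit.

```python
def ctmle_forward_selection(all_terms, q_init, tmle_fit, risk):
    """Greedy construction of the candidate sequence.

    all_terms : iterable of main terms (any hashable labels)
    q_init    : initial estimate of the outcome regression
    tmle_fit  : callable (q_initial, tuple_of_terms) -> targeted fit (hypothetical)
    risk      : callable (fit) -> empirical risk, smaller is better (hypothetical)

    Returns a list of triples (initial estimate, selected terms, TMLE).
    """
    selected = []
    remaining = list(all_terms)
    q_star = tmle_fit(q_init, ())
    path = [(q_init, (), q_star)]

    def best_addition(q0):
        # Score every remaining main term by the risk of the TMLE it induces.
        scored = []
        for term in remaining:
            candidate = tmle_fit(q0, tuple(selected) + (term,))
            scored.append((risk(candidate), term, candidate))
        return min(scored, key=lambda s: s[0])

    while remaining:
        r, term, candidate = best_addition(q_init)
        if r > risk(q_star):
            # Reject: use the previous TMLE as the new initial estimator
            # and determine the best main term again.
            q_init = q_star
            r, term, candidate = best_addition(q_init)
        selected.append(term)
        remaining.remove(term)
        q_star = candidate
        path.append((q_init, tuple(selected), q_star))
    return path
```

Each element of the returned path corresponds to one candidate in the sequence, from which k is subsequently chosen by cross-validation.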
This sequence of candidate TMLEs of has the following property: the estimates are increasingly non-parametric in k and is decreasing in k, . It remains to select k. For that purpose we use V-fold cross-validation. That is, for each of the V splits of the sample in a training and validation sample, we apply the above algorithm for generating a sequence of candidate estimates to a training sample, and we evaluate the empirical mean of the loss function at the resulting over the validation sample, for each . For each k we take the average over the V splits of the k-specific performance measure over the validation sample, which is called the cross-validated risk of the k-specific TMLE. We select the k that has the best cross-validated risk, which we denote with . Our final C-TMLE of is now defined as , and the TMLE of is defined as .
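The V-fold cross-validation selector for k can be sketched as follows. The callables `fit_candidates` and `loss` are hypothetical placeholders for the sequence-generating algorithm applied to a training sample and for the empirical mean of the loss over a validation sample; the sketch assumes every training sample yields the same number of candidates.

```python
import numpy as np

def select_k_by_cv(n, V, fit_candidates, loss, seed=0):
    """Select the candidate index with the smallest cross-validated risk.

    n              : sample size
    V              : number of folds
    fit_candidates : callable (train_idx) -> list of candidate fits (hypothetical)
    loss           : callable (fit, val_idx) -> mean loss on the validation sample
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), V)
    total = None
    for v in range(V):
        val_idx = folds[v]
        train_idx = np.concatenate([folds[u] for u in range(V) if u != v])
        candidates = fit_candidates(train_idx)
        fold_risk = np.array([loss(f, val_idx) for f in candidates])
        total = fold_risk if total is None else total + fold_risk
    cv_risk = total / V                # average over the V splits
    return int(np.argmin(cv_risk)), cv_risk
```

After k has been selected in this way, the k-specific candidate is refitted on the full sample to obtain the final estimator.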
Fast version of above C-TMLE: We could carry out the above C-TMLE algorithm but with the TMLE that maps an initial into replaced by the first step of the TMLE that maps into . In that manner, the selection of the sets is based on the bias reduction achieved in a first step of the TMLE algorithm, and most bias reduction occurs in the first step. After having selected the final one-step TMLE and corresponding , one should still carry out the full TMLE algorithm so that the final is a real TMLE solving the estimating equations of Theorem 4.
Statistical inference for C-TMLE: Let be the final estimator of , a by-product of the TMLE algorithm. An estimate of the influence curve of is given by
The asymptotic variance of can thus be estimated with . An asymptotically valid 0.95-confidence interval for is given by .
Targeted minimum loss-based estimation allows us to construct plug-in estimators of a path-wise differentiable parameter utilizing the state of the art in ensemble learning such as super-learning, while guaranteeing that the estimator and an estimator of the nuisance parameter the TMLE utilizes in its targeting step solve a set of user-supplied estimating equations, empirical means of estimating functions. These estimating functions can be selected so that the resulting TMLE of has certain statistical properties such as being efficient, or guaranteed to be more efficient than a given user-supplied estimator [28, 29], and so on. However, most importantly, these estimating equations are necessary to make the TMLE asymptotically linear, i.e. to make the TMLE unbiased enough so that the first-order linear expansion can be used for statistical inference. For example, by selecting the estimating functions to be equal to the canonical gradient of one arranges that is asymptotically efficient under conditions that assume consistency of and .
However, we noted that this level of targeting is insufficient if one only relies on consistency of , even when that suffices for consistency of . Under such weaker assumptions, additional targeting is necessary so that a specific smooth functional of is asymptotically linear, which requires that an unknown smooth function of is itself a TMLE. The joint targeting of and is achieved by a TMLE that also solves the extra equations making this smooth function of asymptotically linear, allowing one to establish asymptotic linearity of under milder conditions that assume that the second-order terms are negligible relative to the first-order linear approximation.
In this article we also took this additional targeting a step further by demonstrating how it allows for double robust statistical inference. Moreover, even if we estimate the nuisance parameter in a complicated manner, based on a criterion that measures how much it helps the estimator to fit , as used by the C-TMLE, we can still determine a set of additional estimating equations that need to be targeted by the TMLE in order to establish asymptotic linearity and thereby valid statistical inference based on the central limit theorem. This allows us to use the sophisticated but often necessary C-TMLE while still preserving valid statistical inference under regularity conditions.
It remains to evaluate the practical benefit of the modifications of IPTW, TMLE, and C-TMLE as presented in this article for both estimation and assessment of uncertainty. We plan to address this in future research.
Even though we focussed in this article on a particular concrete estimation problem, TMLE is a general tool and our TMLE and theorems can be generalized to general statistical models and path-wise differentiable statistical target parameters.
We note that this targeting of nuisance parameter estimators in the TMLE is not only necessary to get a known influence curve but also necessary to make the TMLE asymptotically linear. So it does not suffice to run a bootstrap as an alternative to influence curve based inference, since the bootstrap can only work if the estimator is asymptotically linear, so that it has a limit distribution. In addition, the established asymptotic linearity with known influence curve has the important by-product that one now obtains statistical inference with no extra computational cost. This is particularly important in these large semi-parametric models that require the utilization of aggressive machine learning methods in order to cover the model-space, making the estimators by necessity very computer intensive, so that a (disputable) bootstrap method might simply be too computationally expensive.
This research was supported by an NIH grant R01 AI074345-06. The author is grateful for the excellent, helpful, and insightful comments of the reviewers.
To start with we note:
The first term of this decomposition yields the first component of the influence curve. Since falls in a Donsker class, the rightmost term is if in probability. So it remains to analyze the term . We now note
By our assumptions, the last term
So it remains to study:
Note that this equals , where is an unknown smooth parameter of g. Our strategy is to first approximate this parameter by an easier (still unknown) parameter resulting in a second-order term: . This is carried out in the next lemma. The efficient influence curve of a target parameter (which treats as known) at is given by . Thus, one likes to construct so that it solves the empirical mean of for , so that targets the parameter . However, is unknown. Therefore, instead is constructed to solve the empirical mean of an estimate of the efficient influence curve , and we will show that this indeed suffices to establish the asymptotic linearity of .
Lemma 2. Define , , , and , where is treated as a fixed function of W when calculating the conditional expectation. Assume