In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. 
As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.

Keywords:
asymptotic linearity;
cross-validation;
efficient influence curve;
influence curve;
targeted minimum loss-based estimation

This introduction provides a road map for the contents of this article. It starts by formulating the role that estimation of nuisance parameters plays in obtaining asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target the estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtaining such a targeted estimator of the nuisance parameter is then described. Subsequently, we present the concrete example to which we apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems. Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we outline the organization of the remainder of the article.

Suppose we observe *n* independent and identically distributed copies of a random variable *O* with probability distribution

The empirical mean of the influence curve

Suppose that
*P* through a parameter
*P* only through

The latter is shown as follows. By the property of the canonical gradient (in fact, any gradient) we have

The first term is an empirical process term that, under empirical process conditions (mentioned below), equals

To obtain the desired asymptotic linearity of

In this article, we demonstrate that if

The current article concerns the construction of such targeted IPTW and TMLE that are asymptotically linear under regularity conditions, even when only one of the nuisance parameters is consistent and the estimators of the nuisance parameters are highly data adaptive. In order to be concrete in this article, we will focus on a particular example. In such an example we can concretely present the second-order term

The same approach for construction of such TMLE can be carried out in much greater generality, but that is beyond the scope of this article. Nonetheless, it is helpful for the reader to know that the general approach is the following (considering the case that
*Q* can be misspecified): (1) approximate

Let us now formulate the concrete example that we cover in this article. Let
*W* baseline covariates, *A* a binary treatment, and *Y* a final outcome. Let
*A*, given *W*, but leaves the marginal distribution of *W* and the conditional distribution of *Y*, given
*P* is given by
*W*, and note that
*P* through

For this particular example, such TMLE are presented in Scharfstein et al. [17]; van der Laan and Rubin [7]; Bembom et al. [18–21]; Rosenblum and van der Laan [22]; Sekhon et al. [23]; van der Laan and Rose [6, 24]. Since

$$\Psi(Q_n^*) - \psi_0 = (P_n - P_0) D^*(Q_n^*, g_n) + P_0 (\bar{Q}_0 - \bar{Q}_n^*)(\bar{g}_0 - \bar{g}_n)/\bar{g}_n. \tag{1}$$
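Identity (1) can be checked numerically in a small discrete example. The sketch below is only an illustration under stated assumptions: it takes the efficient influence curve to have the standard form $D^*(Q,g)(O) = A/\bar{g}(W)\,(Y - \bar{Q}(W)) + \bar{Q}(W) - \Psi(Q)$, it uses the true marginal distribution of $W$ in the plug-in, and all numerical values (`p_w`, `g0`, `Q0`, `g`, `Q`) are hypothetical. It verifies the exact expansion $\Psi(Q) - \psi_0 = -P_0 D^*(Q,g) + P_0 (\bar{Q}_0 - \bar{Q})(\bar{g}_0 - \bar{g})/\bar{g}$, which gives (1) once the TMLE solves $P_n D^*(Q_n^*, g_n) = 0$.

```python
import numpy as np

# Hypothetical discrete example: W takes three values.
p_w = np.array([0.2, 0.5, 0.3])   # true marginal distribution of W
g0 = np.array([0.3, 0.5, 0.8])    # true P(A = 1 | W)
Q0 = np.array([1.0, 2.0, 4.0])    # true E(Y | A = 1, W)

# Arbitrary (misspecified) candidate values of the nuisance parameters.
g = np.array([0.40, 0.45, 0.70])
Q = np.array([1.5, 1.8, 3.0])

psi0 = np.sum(p_w * Q0)           # true treatment specific mean
psi = np.sum(p_w * Q)             # plug-in value Psi(Q), true marginal of W assumed

# P_0 D*(Q, g): conditional on W, E_0[A (Y - Q(W)) / g(W) | W]
# equals g0(W) (Q0(W) - Q(W)) / g(W).
P0_D = np.sum(p_w * (g0 / g * (Q0 - Q) + Q - psi))

# Second-order remainder P_0 (Q0 - Q)(g0 - g) / g.
remainder = np.sum(p_w * (Q0 - Q) * (g0 - g) / g)

lhs = psi - psi0
rhs = -P0_D + remainder
print(lhs, rhs)
```

The remainder is a product of the two nuisance errors, so it vanishes whenever either `Q` equals `Q0` or `g` equals `g0`; this is the double robustness that the expansion makes explicit.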

The first term equals

However, if only one of these nuisance parameter estimators is consistent, then the second term is still a first-order term, and it remains to establish that it is also asymptotically linear with a second-order remainder. For the sake of discussion, suppose that

In this article, we present TMLE that targets

The construction of TMLE that utilizes targeting of the nuisance parameter

The TMLEs presented in this article are always iterative and thereby rely on convergence of the iterative updating algorithm. Since the empirical risk decreases at each updating step, such convergence is typically guaranteed by the existence of the MLE at each updating step (e.g. an MLE of the coefficient in a logistic regression). Either way, in this article, we assume this convergence to hold. Since the assumptions of our theorems require

The organization of this paper is as follows. In Section 2, we introduce a targeted IPTW-estimator that relies on an adaptive consistent estimator of

In Section 5, we extend the TMLE of Section 3 (that relies on
*g* but one that suffices for consistent estimation of

In the following sections, we will use the following notation. We have
*A*, given *W*. Let
*W* under

We first describe an IPTW-estimator that uses super-learning to fit the treatment mechanism

We consider a simple IPTW-estimator
*np*, and let

as the choice of estimator that minimizes cross-validated risk. The super-learner of
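To make the cross-validation selector concrete, the following is a minimal sketch, not the estimator analysed in the theorems. It assumes a library of only two hypothetical candidate estimators of the treatment mechanism (`fit_marginal`, `fit_stratified`), the negative log-likelihood as loss function, a single binary covariate, and simulated data; it then plugs the selected fit into the simple (untargeted) IPTW-estimator, the empirical mean of $Y A/\bar{g}_n(W)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
W = rng.integers(0, 2, size=n)            # one binary baseline covariate
g0 = np.where(W == 1, 0.7, 0.3)           # true P(A = 1 | W) in this simulation
A = rng.binomial(1, g0)
Y = 1.0 + 2.0 * W + rng.normal(size=n)    # E(Y | A = 1, W) = 1 + 2W, so psi_0 = 2

def fit_marginal(W_tr, A_tr):
    """Candidate 1 (hypothetical): ignores W."""
    p = A_tr.mean()
    return lambda w: np.full(len(w), p)

def fit_stratified(W_tr, A_tr):
    """Candidate 2 (hypothetical): empirical proportion within each stratum of W."""
    p = np.array([A_tr[W_tr == v].mean() for v in (0, 1)])
    return lambda w: p[w]

library = {"marginal": fit_marginal, "stratified": fit_stratified}

def neg_loglik(g, a):
    g = np.clip(g, 1e-6, 1 - 1e-6)
    return -(a * np.log(g) + (1 - a) * np.log(1 - g))

V = 5
folds = rng.permutation(n) % V            # balanced fold labels
cv_risk = {}
for name, fit in library.items():
    losses = []
    for v in range(V):
        tr, va = folds != v, folds == v
        g_hat = fit(W[tr], A[tr])         # fit on training sample
        losses.append(neg_loglik(g_hat(W[va]), A[va]).mean())  # evaluate on validation sample
    cv_risk[name] = float(np.mean(losses))

best = min(cv_risk, key=cv_risk.get)      # candidate minimizing cross-validated risk
g_n = library[best](W, A)(W)              # refit the selected candidate on all data
psi_iptw = np.mean(Y * A / g_n)           # simple IPTW-estimator
print(best, cv_risk, psi_iptw)
```

This selects a single candidate; a super-learner that weights the candidates by a convex combination would replace the `min` step with a minimization over weights.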

The next theorem presents an IPTW-estimator that uses a targeted fit

**Theorem 1** *We consider a targeted IPTW-estimator*
*where*
*and*
*is an update of an initial estimator*
*of*
*defined below*.

**Definition of targeted estimator**
*Let*
*be obtained by non-parametric estimation of the regression function*
*treating*
*as a fixed covariate (i.e. function of W). This yields an estimator*
*of*
*where*
*. Consider the submodel*
*and fit*
*with the MLE*

*We define*
*as the corresponding targeted update of*
*. This TMLE*
*satisfies*

**Empirical process condition**: *Assume that*
*fall in a*
*-Donsker class with probability tending to 1*.

**Negligibility of second-order terms**: *Define*
*. Assume*
*with probability tending to 1 and assume*

*Then*,

*where*

So under the conditions of this theorem, we can construct an asymptotic 0.95-confidence interval

and

Regarding the displayed second-order term conditions, we note that these are satisfied if

Regarding the empirical process condition, we note that an example of a Donsker class is the class of multivariate real-valued functions with uniform sectional variation norm bounded by a universal constant [44]. It is important to note that if each estimator in the library falls in such a class, then also the convex combinations fall in that same class [4]. So this Donsker condition will hold if it holds for each of the candidate estimators in the library of the super-learner.

Consider an IPTW-estimator using an MLE
*q*. As a consequence, all the consistency and second-order term conditions for the IPTW-estimator using a targeted
*K* main terms that themselves have a uniform sectional variation norm, but also penalized least-squares estimators (e.g. Lasso) using basis functions with bounded uniform sectional variation norm, and one could map any estimator into this space of functions with universally bounded uniform sectional variation norm through a smoothing operation. Thus, under this restriction on the library, the IPTW-estimator using the super-learner is asymptotically linear with influence curve

The parametric IPTW-estimator is asymptotically linear with influence curve
*f* onto

For example, if the parametric model happens to have a score equal to

If, on the other hand, the parametric model is misspecified, then the IPTW-estimator using

In the next subsection, we present a TMLE that targets the fit of the treatment mechanism, analogous to the targeted IPTW-estimator presented above. In addition, this subsection presents a formal asymptotic linearity theorem demonstrating that this TMLE will be asymptotically linear even when

The following theorem presents a novel TMLE and corresponding asymptotic linearity with specified influence curve, where we rely on consistent estimation of

**Theorem 2**

**Iterative targeted MLE of**

**Definitions**: *Given*
*let*
*be a consistent estimator of the regression*
*of*
*on*
*and*
*. Let*
*be an initial estimator of*

**Initialization**: *Let*
*and*
*. Let*

**Updating step for**
*Consider the submodel*
*and fit*
*with the MLE*

*We define*
*as the corresponding update of*
*. This*
*satisfies*

**Updating step for**
*Let*
*be the quasi-log-likelihood loss function for*
*(allowing that Y is continuous in*
*). Consider the submodel*
*and let*
*. Define*
*as the resulting update. Define*

**Iterating till convergence**: *Now, set*
*and iterate this updating process mapping a*
*into*
*till convergence or till large enough K so that the estimating equations (2) below are solved up till an*
*-term. Denote the limit of this iterative procedure with*

**Plug-in estimator**: *Let*
*where*
*is the empirical distribution estimator of*
*. The TMLE of*
*is defined as*

**Estimating equations solved by TMLE**: *This TMLE*
*solves*

$$P_n D_A(\bar{Q}_n^{r*}, \bar{g}_n^*) = 0. \tag{2}$$

**Empirical process condition**: *Assume that*
*falls in a*
*-Donsker class with probability tending to 1 as*

**Negligibility of second-order terms**: *Define*

*where*
*is treated as a fixed covariate (i.e. function of W) in the conditional expectation*
*. Assume that there exists a*
*so that*
*with probability tending to 1, and*

*Then*,

*where*

Thus, under the assumptions of this theorem, an asymptotic 0.95-confidence interval is given by

The following is an application of the constrained logistic regression approach of the type presented in Gruber and van der Laan [19] for the purpose of estimation of
*W*, which implies an estimator
*W*, through a given estimator

The MLE is simply obtained with logistic regression of
*W* (see, e.g. Gruber and van der Laan [19]) based on the quasi-log-likelihood loss function:

where

is the quasi-log-likelihood loss. The update
*k* in the iterative TMLE algorithm, and thereby that
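As an illustration of fitting a logistic regression under the quasi-log-likelihood loss, the sketch below estimates the coefficient of a one-dimensional logistic submodel by Newton iterations. It is a sketch under assumptions: the loss is taken in its usual form $-\{Y\log \bar{Q} + (1-Y)\log(1-\bar{Q})\}$ for an outcome scaled to $[0,1]$, and the covariate `h`, the initial fit `q_init`, and the simulated data are hypothetical placeholders rather than the quantities defined in the theorems.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def quasi_loglik_loss(q, y):
    """Quasi-log-likelihood loss; usable for any outcome y in [0, 1]."""
    return -(y * np.log(q) + (1.0 - y) * np.log(1.0 - q))

def fit_epsilon(y, q_init, h, n_iter=50, tol=1e-12):
    """Newton iterations for eps in the submodel logit q(eps) = logit q_init + eps * h."""
    eps = 0.0
    offset = logit(q_init)
    for _ in range(n_iter):
        q = expit(offset + eps * h)
        score = np.sum(h * (y - q))            # minus the derivative of the summed loss in eps
        info = np.sum(h * h * q * (1.0 - q))   # second derivative of the summed loss in eps
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps

rng = np.random.default_rng(1)
n = 500
w = rng.uniform(size=n)
y = np.clip(0.2 + 0.6 * w + 0.1 * rng.normal(size=n), 0.0, 1.0)  # continuous outcome in [0, 1]
q_init = np.full(n, 0.5)      # hypothetical initial fit
h = 1.0 + w                   # hypothetical covariate defining the submodel

eps_n = fit_epsilon(y, q_init, h)
q_star = expit(logit(q_init) + eps_n * h)
score_at_fit = np.sum(h * (y - q_star))
print(eps_n, score_at_fit)
```

At the fitted coefficient the score `score_at_fit` is numerically zero, and the empirical mean of the loss at `q_star` does not exceed its value at `q_init`, because the submodel contains the initial fit at `eps = 0` and the loss is convex in `eps`.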

In this section, our aim is to present a TMLE that is asymptotically linear with known influence curve if either

**Theorem 3**

**Definitions**: *For any given*
*let*
*and*
*be consistent estimators of*
*and*
*respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let*
*and*
*denote these estimators applied to the TMLEs*
*defined below*.

**Iterative targeted MLE of**

**Initialization**: *Let*
*be an initial estimator of*
*. Let*
*and let*
*. Let*
*be obtained by non-parametrically regressing A on*
*. Let*
*be obtained by non-parametrically regressing*
*on*

**Updating step**: *Consider the submodel*
*and fit*
*with the MLE*

*Define the submodel*
*where*

*Let*
*be the MLE, where*
*is the quasi-log-likelihood loss*.

*We define*
*as the corresponding targeted update of*
*and*
*as the corresponding update of*
*. Let*
*and*

**Iterate till convergence**: *Now, set*
*and iterate this updating process mapping a*
*into*
*till convergence or till large enough K so that the following three estimating equations are solved up till an*
*-term*:

*where*

**Final substitution estimator**: *Denote the limits of this iterative procedure with*
*. Let*
*where*
*is the empirical distribution estimator of*
*. The TMLE of*
*is defined as*

**Equations solved by TMLE**:

**Empirical process condition**: *Assume that*
*fall in a*
*-Donsker class with probability tending to 1 as*

**Negligibility of second-order terms**: *Define*
*and*
*. Assume that there exists a*
*so that*
*with probability tending to 1, that*
*are consistent for*
*w.r.t*.
*-norm, where either*
*or*
*and assume that the following second-order terms are*

*Then*,

*where*

Note that consistent estimation of the influence curve

If
*A*, given some function
*W* for which

As shown in the final remark of the Appendix, the condition of Theorem 3 that either

We first review the theoretical underpinning for collaborative estimation of nuisance parameters, in this case, the outcome regression and treatment mechanism. Subsequently, we explain that the desired collaborative estimation can be achieved by applying the previously established template for construction of a C-TMLE to a TMLE that solves certain estimating equations when given an initial estimator of
*k*, and (2) using cross-validation to select the *k* for which

We note that

Let
*A*, given *W*, and let
*A* given *W*. We define the set

**Lemma 1** *(van der Laan and Gruber [33]) If*
*and*
*then*
*. More generally*,

We note that
*A*, given
*W* through
*g* for which

The general C-TMLE introduced in van der Laan and Gruber [33] provides a template for construction of a TMLE
*W* that is an element of

The general C-TMLE has been implemented and applied to point treatment and longitudinal data [20, 29–33, 35]. A C-TMLE algorithm relies on a TMLE algorithm that maps an initial
*k*, and finally uses cross-validation to select the best TMLE among these candidate estimators of

Our next theorem presents a TMLE algorithm and a corresponding influence curve under the assumption that the propensity score correctly adjusts for the possibly misspecified

**Theorem 4**

**Definitions**: *For any given*
*let*
*and*
*be consistent estimators of*
*and*
*respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let*
*and*
*denote these estimators applied to the TMLE*
*defined below*.

**“Score” equations the TMLE should solve**: *Below, we describe an iterative TMLE algorithm that results in estimators*
*that solve the following equations*:

$$0 = P_n D_A(\bar{Q}_n^{r*}, \bar{g}_n^*). \tag{3}$$

**Iterative targeted MLE of**

**Initialization**: *Let*
*and*
*(e.g. aiming to adjust for*
*) be initial estimators*.

*Let*
*and*

**Updating step**: *Consider the submodel*
*and fit*
*with the MLE*

*Define the submodel*
*and let*
*be the quasi-log-likelihood loss function for*
*. Let*
*be the MLE. Let*
*and*

**Iterating till convergence**: *Now, set*
*and iterate this updating process mapping a*
*into*
*till convergence or till large enough K so that the following estimating equations are solved up till an*
*-term*:

**Final substitution estimator**: *Denote these limits (in k) of this iterative procedure with*
*. Let*
*where*
*is the empirical distribution estimator of*
*. The TMLE of*
*is defined as*

**Assumption on limits**
**of**
*Assume that*
*is consistent for*
*w.r.t*.
*-norm, where*
*for some function*
*of W for which*
*only depends on W through*
*and assume that*
*where the latter holds, in particular, if*
*only depends on W through*
*(e.g*.
*involves non-parametric adjustment by*
*). As a consequence, we have*

**Empirical process condition**: *Assume that*
*fall in a*
*-Donsker class with probability tending to 1 as*

**Negligibility of second-order terms**: *Define*

*Assume that the following conditions hold for each of the following possible definitions of*
*. Note that*
*is the limit of each of these choices for*

*We assume*
*are bounded away from*
*with probability tending to one, and*

*Then*,

*where*

Thus, consistency of this TMLE relies upon the consistency of
*A*, given
*W* through

It is also interesting to note that the algebraic form of the influence curve of this TMLE is identical to the influence curve of the TMLE of Theorem 2 that relied on

The TMLE algorithm presented in Theorem 4 maps an initial estimator
*g* was that it should non-parametrically adjust not only for
*g* in response to

First, we compute a set of *K* univariate covariates
*W*, which we will refer to as main terms, even though a term could be an interaction term or a super-learning fit of the regression of *A* on a subset of the components of *W*. Let

The general template of a C-TMLE algorithm is the following: given a TMLE algorithm that maps any initial
*k* main terms, where each set
*k*. Here increasingly non-parametric means that the empirical mean of the loss function of the fit is decreasing in *k*. This sequence
*g*-fit implied by
*k* among these candidate TMLEs

In order to present a precise C-TMLE algorithm we will first introduce some notation. For a given subset of main terms
*g* does not measure how well *g* fits
*g* (and as initial estimator

Given a set

where we remind the reader of the definition

The C-TMLE algorithm defined below generates a sequence
*k*-specific TMLE, increases in *k*, one unit at a time:
*k*, and thereby to select the TMLE
*k*-specific TMLEs

**Initiate algorithm: Set initial TMLE**. Let

**Determine next TMLE**. Determine the next best main term to add:

If

then

[In words: If the next best main term added to the fit of

**Iterate**. Run this from
*K* at which point

This sequence of candidate TMLEs
*k* and
*k*,
*k*. For that purpose we use *V*-fold cross-validation. That is, for each of the *V* splits of the sample in a training and validation sample, we apply the above algorithm for generating a sequence of candidate estimates
*k* we take the average over the *V* splits of the *k*-specific performance measure over the validation sample, which is called the cross-validated risk of the *k*-specific TMLE. We select the *k* that has the best cross-validated risk, which we denote with
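The selection step just described reduces to averaging a table of validation-sample performance measures over the splits and taking the minimizer. The sketch below assumes that a smaller value of the performance measure is better, and the numbers in `risk` are made up for illustration; computing the entries (running the candidate-generating algorithm on each training sample) is not shown.

```python
import numpy as np

def select_k(validation_risk):
    """validation_risk[v, k]: performance of the k-th candidate on validation sample v.

    Returns the index with the smallest cross-validated risk and the risk vector.
    """
    cv_risk = np.asarray(validation_risk, dtype=float).mean(axis=0)  # average over the V splits
    return int(np.argmin(cv_risk)), cv_risk

# Hypothetical risks for V = 3 splits (rows) and K = 4 candidates (columns).
risk = [[0.90, 0.70, 0.65, 0.80],
        [1.00, 0.75, 0.60, 0.85],
        [0.95, 0.72, 0.70, 0.90]]
k_n, cv_risk = select_k(risk)
print(k_n, cv_risk)
```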

**Fast version of above C-TMLE**: We could carry out the above C-TMLE algorithm but replacing the TMLE that maps an initial

**Statistical inference for C-TMLE**: Let

The asymptotic variance of

Targeted minimum loss-based estimation allows us to construct plug-in estimators

However, we noted that this level of targeting is insufficient if one only relies on consistency of

In this article we also pushed this additional level of targeting to a new level by demonstrating how it allows for double robust statistical inference, and that even if we estimate the nuisance parameter in a complicated manner that is based on a criterion that cares about how it helps the estimator to fit

It remains to evaluate the practical benefit of the modifications of IPTW, TMLE, and C-TMLE as presented in this article for both estimation and assessment of uncertainty. We plan to address this in future research.

Even though we focussed in this article on a particular concrete estimation problem, TMLE is a general tool and our TMLE and theorems can be generalized to general statistical models and path-wise differentiable statistical target parameters.

We note that this targeting of nuisance parameter estimators in the TMLE is not only necessary to get a known influence curve but also necessary to make the TMLE asymptotically linear. So it does not suffice to run a bootstrap as an alternative to influence curve based inference, since the bootstrap can only work if the estimator is asymptotically linear so that it has an existing limit distribution. In addition, the established asymptotic linearity with known influence curve has the important by-product that one now obtains statistical inference with no extra computational cost. This is particularly important in these large semi-parametric models that require the utilization of aggressive machine learning methods in order to cover the model-space, making the estimators by necessity very computer intensive, so that a (disputable) bootstrap method might simply be too computationally expensive.
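To illustrate why influence curve based inference adds essentially no computational cost, the following sketch builds a Wald-type 0.95-confidence interval from estimated influence curve values, using the empirical mean of the squared influence curve as variance estimate. The example is deliberately simplified and is an assumption of this sketch: it uses an IPTW-estimator with a known treatment mechanism `g0`, for which the influence curve is $Y A/\bar{g}_0(W) - \psi$; for the targeted estimators of this article one would substitute an estimate of the influence curve given in the relevant theorem.

```python
import numpy as np

def wald_ci(psi_n, ic_values, z=1.96):
    """Confidence interval psi_n +/- z * sigma_n / sqrt(n), with sigma_n^2 = P_n IC^2."""
    ic_values = np.asarray(ic_values, dtype=float)
    n = ic_values.size
    se = np.sqrt(np.mean(ic_values ** 2) / n)
    return psi_n - z * se, psi_n + z * se, se

rng = np.random.default_rng(2)
n = 2000
W = rng.integers(0, 2, size=n)
g0 = np.where(W == 1, 0.7, 0.3)           # treatment mechanism, assumed known here
A = rng.binomial(1, g0)
Y = 1.0 + 2.0 * W + rng.normal(size=n)

psi_n = np.mean(Y * A / g0)               # IPTW-estimator with known g0
ic_n = Y * A / g0 - psi_n                 # estimated influence curve values
lower, upper, se = wald_ci(psi_n, ic_n)
print(psi_n, (lower, upper))
```

The interval requires a single pass over quantities already computed for the point estimate, in contrast to a bootstrap that would refit every nuisance parameter estimator on each resample.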

This research was supported by an NIH grant R01 AI074345-06. The author is grateful for the excellent, helpful, and insightful comments of the reviewers.

To start with we note:

The first term of this decomposition yields the first component

By our assumptions, the last term

So it remains to study:

Note that this equals
*g*. Our strategy is to first approximate this parameter by an easier (still unknown) parameter

**Lemma 2** *Define*
*and*
*where*
*is treated as a fixed function of W when calculating the conditional expectation. Assume*

*Then*,