## Abstract

Suppose we observe

### 1 Introduction

We consider the general statistical estimation problem defined by a statistical model for the data distribution, a Euclidean valued target parameter mapping defined on the statistical model, and observing

One can construct an asymptotically efficient estimator with the following two general methods. Firstly, the one-step estimator is defined by adding to an initial plug-in estimator of the target parameter the empirical mean of an estimator of the efficient influence curve at this same initial estimator [1]. In the special case that the efficient influence curve can be represented as an estimating function, this methodology can be represented as the first step of the Newton-Raphson algorithm for solving the estimating equation defined by setting the empirical mean of the efficient influence curve equal to zero. Such general estimating equation methodology for the construction of efficient estimators has been developed for censored and causal inference models in the literature (e.g., [2, 3]). Secondly, the TMLE defines a least favorable parametric submodel through an initial estimator of the relevant parts (nuisance parameters) of the data distribution, and updates the initial estimator with the MLE over this least favorable parametric submodel. The one-step TMLE of the target parameter is then the resulting plug-in estimator [4, 5, 6]. In this article we focus on the one-step TMLE since it is a more robust estimator, as it respects the global constraints of the statistical model. This becomes evident when comparing the one-step estimator and the TMLE in simulations in which the information for the target parameter is low: the one-step estimator can, for example, even yield estimates of probabilities outside the (0, 1) range (e.g., [7, 8, 9]). Nonetheless, the results in this article have immediate analogues for the one-step estimator and the estimating equation method.
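To make the first method concrete, the following is a minimal numerical sketch of a one-step estimator for the treatment-specific mean example of Section 2, in which the efficient influence curve is D(O) = (A/g(W))(Y - Qbar(W)) + Qbar(W) - psi. The toy data-generating mechanism and the deliberately crude initial estimators are ours, for illustration only.

```python
import numpy as np

def one_step(A, Y, Qbar, g):
    """One-step estimator of psi_0 = E[Qbar_0(1, W)].

    Qbar: initial estimate of E[Y | A=1, W], evaluated at each W_i.
    g:    estimate of P(A=1 | W), evaluated at each W_i.
    Adds the empirical mean of the efficient influence curve at the
    initial estimator to the initial plug-in estimator.
    """
    psi_init = np.mean(Qbar)                  # initial plug-in estimator
    correction = np.mean(A / g * (Y - Qbar))  # empirical mean of the EIC residual term
    return psi_init + correction

# Toy illustration: randomized treatment with known g = 0.5 and a
# deliberately misspecified initial outcome regression; the one-step
# correction removes the plug-in bias.
rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = W + A + rng.normal(size=n)   # true psi_0 = E[W] + 1 = 1
Qbar = np.full(n, Y.mean())      # crude initial estimator of E[Y | A=1, W]
g = np.full(n, 0.5)
psi = one_step(A, Y, Qbar, g)
```

Because treatment is randomized with known g, the correction term restores consistency even though the initial outcome regression is badly misspecified, illustrating the double robustness discussed later in the article.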

The asymptotic linearity and efficiency of the TMLE and one-step estimator relies on a second order remainder to be

In this article, for each nuisance parameter, we propose a new minimum loss-based estimator that minimizes the loss-specific empirical risk over its parameter space under the additional constraint that the variation norm is bounded by a set constant. The constant is selected with cross-validation. We show that these MLEs can be represented as the minimizer of the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute values of the coefficients is bounded by the constant: i.e., the variation norm corresponds with this
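The lasso representation just described can be illustrated with a minimal one-dimensional sketch: build the indicator basis at the observed points and minimize an L1-penalized empirical risk, so that the L1-norm of the fitted coefficients approximates the variation norm of the fit. The simple coordinate descent solver below is ours, written for illustration only (a penalized rather than constrained form, which traces out the same solution path).

```python
import numpy as np

def indicator_basis(x, knots):
    """One-dimensional HAL-style basis: phi_j(x) = 1{x >= knot_j}."""
    return (x[:, None] >= knots[None, :]).astype(float)

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - b0 - X b||^2 + lam*||b||_1
    (intercept b0 unpenalized)."""
    n, p = X.shape
    b0, beta = y.mean(), np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y - b0                       # residual (beta starts at zero)
    for _ in range(n_iter):
        m = r.mean()                 # refit the unpenalized intercept
        b0 += m
        r -= m
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r + col_sq[j] * beta[j]   # partial-residual correlation
            bj = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (beta[j] - bj)             # update residual in place
            beta[j] = bj
    return b0, beta

# Fit a noisy step function; the L1-norm of the coefficients approximates
# the variation norm of the fitted function (the truth has variation 2).
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-1, 1, size=n)
truth = np.where(x > 0, 1.0, -1.0)
y = truth + 0.1 * rng.normal(size=n)
Phi = indicator_basis(x, np.sort(x))
b0, beta = lasso_cd(Phi, y, lam=3.0)
fhat = b0 + Phi @ beta
variation_of_fit = np.abs(beta).sum()
```

The fit concentrates its coefficient mass on a few knots near the jump, so the L1-norm of the coefficients stays close to the variation norm of the underlying step function rather than growing with the number of basis functions.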

### 2 Example: Treatment specific mean in nonparametric model

Before starting the main part of this article, in this section we first introduce an example and use it to provide the reader with a guide through the different sections.

#### 2.1 Defining the statistical estimation problem

Let

Our target parameter

for some

Let

An estimator

where

We have that

where

Of course,

We define the following two log-likelihood loss functions for

We also define the corresponding Kullback-Leibler dissimilarities

Let the submodel

If we were to replace the log-likelihood loss

Let

#### 2.2 One step CV-TMLE

Let

through

through

This universal least favorable submodel implies a recursive construction of

Let

The CV-TMLE of

Note that this latter representation proves that we never have to carry out the TMLE-update step for

We also conclude that this one-step CV-TMLE solves the crucial cross-validated efficient influence curve equation

#### 2.3 Guide for article based on this example

**Section 3: Formulation of general estimation problem**. The goal of this article goes far beyond establishing asymptotic efficiency of the CV-TMLE eq. (3) in this example. Therefore, we start in Section 3 by defining a general model and general target parameter, essentially generalizing the above notation for this example. Having read the above example, the presentation in Section 3 of a very general estimation problem will be easier to follow. Our subsequent definitions and results for the HAL-estimator, the HAL-super-learner, and the CV-TMLE in Sections 4-6 then apply to our general model and target parameter, thereby establishing asymptotic efficiency of the CV-TMLE for an enormously large class of semi-parametric statistical estimation problems, including our example as a special case.

Let’s now return to our example to point out the specific tasks that are solved in each section of this article. By eqs (1) and (4), we have the following starting identity for the CV-TMLE:

By the Cauchy-Schwarz inequality and bounding

where

Thus, by selecting

**Section 4: Construction and analysis of an **. This challenge of constructing such estimators

**Section 5: Construction and analysis of an HAL-super-learner**. Instead of assuming that the variation norm of

The convergence results for this super-learner in terms of the Kullback-Leibler loss-based dissimilarities also imply corresponding results for

**Section 6: Construction and analysis of HAL-CV-TMLE**. To control the remainder we need to understand the behavior of the updated initial estimator

This will hold under weak conditions, given that we have estimators

**Section 7: Returning to our example**. In Section 7 we return to our example to present a formal Theorem 2 with specified conditions, involving an application of our general efficiency Theorem 1 in Section 6.

**Appendix**: Various technical results are presented in the Appendix.

### 3 Statistical formulation of the estimation problem

Let

Let

Our goal in this article is to construct a substitution estimator (i.e., a TMLE

**Relevant nuisance parameters **: Let

be the collection of these

Suppose that

We assume that

We also define

as the vector of

We define

as the vector of

We will also use the notation

when

We will abuse notation by also denoting

**Second order remainder for target parameter**: We define the second order remainder

We will also denote

Given the above double robustness property of the canonical gradient (i.e., of the target parameter), if

**Support of data distribution**: The support of

so that

**Cadlag functions on **: Suppose

For a subset

where

denote the set of cadlag functions

**Cartesian product of cadlag function spaces, and its component-wise operations**: Let

of variation norms. If

Our results will hold for general models and pathwise differentiable target parameters, as long as the statistical model satisfies the following key smoothness assumption:

### Assumption 1 (**Smoothness Assumption**)

For each

**Definition of bounds on the statistical model**: The properties of the super-learner and TMLE rely on bounds on the model

Note that

**Bounded and Unbounded Models**: We will call the model

are finite. In essence, a bounded model is one for which the support and the supremum norm

of

**Sequence of bounded submodels approximating the unbounded model**: For an unbounded model

This model

Let

We will assume that

In this paper our initial estimators of

### 4 Highly Adaptive Lasso estimator of nuisance parameters

Let

#### 4.1 Upper bounding the entropy of the parameter space for the HAL-estimator

We remind the reader that a covering number

where

The minimal rates of convergence of our HAL-estimator of

By eq. (17) it follows that any cadlag function with finite variation norm can be represented as a difference of two bounded monotone increasing functions (i.e., cumulative distribution functions). The class of

where
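The claim above, that a finite-variation cadlag function is a difference of two bounded monotone increasing functions, can be checked numerically with a discrete Jordan decomposition. The grid-based discretization below is ours, for illustration.

```python
import numpy as np

def jordan_decomposition(f_vals):
    """Given the values of a function on an ordered grid, write it as
    f = f(t_0) + f_plus - f_minus with f_plus, f_minus nondecreasing
    (the discrete Jordan decomposition), and return the discrete
    variation norm sum_i |f(t_{i+1}) - f(t_i)|."""
    df = np.diff(f_vals)
    f_plus = np.concatenate(([0.0], np.cumsum(np.maximum(df, 0.0))))
    f_minus = np.concatenate(([0.0], np.cumsum(np.maximum(-df, 0.0))))
    return f_plus, f_minus, np.abs(df).sum()

t = np.linspace(0.0, 1.0, 1001)
f = np.sin(4 * np.pi * t)          # two full periods: total variation 8
f_plus, f_minus, variation = jordan_decomposition(f)
recon = f[0] + f_plus - f_minus    # exact reconstruction of f on the grid
```

On a fine grid the discrete variation converges to the true variation norm, and each monotone piece is itself (up to scaling) a cumulative distribution function, which is exactly what the indicator-basis representation exploits.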

#### 4.2 Minimal rate of convergence of the HAL-estimator

Lemma 1 below proves that the minimal rates

Let

#### Lemma 1

For a given vector

Consider an estimator

where

and

**Proof**: We have

which proves eq. (22). Since

We now apply Lemma 7 with

This proves

### 5 Super-learning: HAL-estimator tuning the variation norm of the fit with cross-validation

**Defining the library of candidate estimators**: For an

Note that for all

**Super Learner**: Let

We define the cross-validation selector as the index

that minimizes the cross-validated risk
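Such a discrete cross-validation selector can be sketched in a few lines. In the sketch below we use polynomial degree as a hypothetical stand-in for the variation-norm bound indexing the candidate HAL-estimators; the data and candidate library are ours, for illustration only.

```python
import numpy as np

def cv_selector(x, y, candidates, fit, predict, V=5, seed=0):
    """V-fold cross-validation selector over a grid of candidate tuning
    values: returns the candidate minimizing the cross-validated
    squared-error risk, together with the vector of risks."""
    folds = np.random.default_rng(seed).integers(0, V, size=len(y))
    risks = np.empty(len(candidates))
    for k, c in enumerate(candidates):
        sse = 0.0
        for v in range(V):
            train, val = folds != v, folds == v
            model = fit(x[train], y[train], c)      # fit on training sample
            sse += np.sum((y[val] - predict(model, x[val])) ** 2)
        risks[k] = sse / len(y)                      # cross-validated risk
    return candidates[int(np.argmin(risks))], risks

# Illustration: the selector rejects the underfitting candidates
# (degrees 0 and 1) for a quadratic signal.
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-1, 1, size=n)
y = x ** 2 + 0.1 * rng.normal(size=n)
degrees = [0, 1, 2, 3, 8]
fit = lambda xs, ys, d: np.polyfit(xs, ys, d)
predict = lambda coefs, xs: np.polyval(coefs, xs)
best, risks = cv_selector(x, y, degrees, fit, predict)
```

The oracle inequality for cross-validation referenced below guarantees that, up to a negligible term, this selector performs as well as the best candidate in the library.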

The following lemma proves that the super-learner

### Lemma 2

Recall the definition of the model bounds

For any fixed

If for each fixed

If for each fixed

Suppose that for each finite

The proof of this lemma is a simple corollary of the finite sample oracle inequality for cross-validation [11, 13, 21, 33, 34], also presented in Lemma 5 in Section A of the Appendix. It uses the convexity of the loss function to bring the

In the Appendix we present the analogous super-learner eq. (37) of

### 6 One-step CV-HAL-TMLE

Cross-validated TMLE (CV-TMLE) robustifies the bias-reduction of the TMLE-step by selecting

#### 6.1 The CV-HAL-TMLE

**Definition of one-step CV-HAL-TMLE for general local least favorable submodel**: Let

be the MLE of

**One-step CV-HAL-TMLE solves cross-validated efficient score equation**: Our efficiency Theorem 1 assumes that

That is, it is assumed that the one-step CV-TMLE already solves the cross-validated efficient influence curve equation up to an asymptotically negligible approximation error. By definition of

**One-step CV-HAL-TMLE preserves fast rate of convergence of initial estimator**: Our efficiency Theorem 1 also assumes that the updated estimator

**A class of multivariate local least favorable submodels that separately updates each nuisance parameter component**: One way to guarantee that

for

**How to construct a local least favorable submodel of type eq. (26)**: A general approach for constructing such a

so that,

That is,

so that the submodel is indeed a local least favorable submodel.

Lemma 14 provides a sufficient set of minor conditions under which the one-step-HAL-CV-TMLE using a local least favorable submodel of the type eq. (26) will satisfy eq. (25). Therefore, the class of local least favorable submodels eq. (26) yields both crucial conditions for the HAL-CV-TMLE: it solves eq. (25) and it preserves the rate of convergence of the initial estimator.

#### 6.2 Preservation of the rate of initial estimator for the one-step CV-HAL-TMLE using eq. (26)

Consider the submodel

#### Lemma 3

Let

We have

By convexity of the loss function

We have

Thus, if for some

It then also follows that for each

#### 6.3 Efficiency of the one-step CV-HAL-TMLE

We have the following theorem.

#### Theorem 1

Consider the above defined corresponding one-step CV-TMLE

**Initial estimator conditions**: Consider the HAL-super-learners

where

**“Preserve rate of convergence of initial estimator”-condition**: In addition, assume that either (Case A) the CV-TMLE uses a local least favorable submodel of the type eq. (26) so that Lemma 3 applies, or (Case B) assume that for each split

**Efficient influence curve score equation condition and second order remainder condition**: Define

In Case A, for verification of assumption eq. (27) one could apply Lemma 14. In Case A, for verification of the two assumptions eqs (28) and (29) one can use that for each of the

Then,

Condition eq. (32) will practically always trivially hold for

**Proof**: By assumptions eqs (30) and (31),

we have

Consider Case A. Lemma 3 proves that under these same assumptions eqs (30), (31), we also have, for each

Suppose

Combining this with eq. (27) yields the following identity:

By assumption eq. (29) we have that

We now note

Thus, it remains to prove that

Note also that the envelope of

This proves also that

### 7 Example: Treatment specific mean

We will now apply Theorem 1 to the example introduced in Section 2. We have the following sieve model bounds (van der Laan et al., 2004):

Since the parameter space

**Verification of eqs (30) and (31)**: Let

Plugging in the above bounds for

Above we showed that if

**Verification of eq**. (28):

Using straightforward algebra and using the triangle inequality for a norm, we obtain

Using that

We bound the last term as follows:

where we used at the third equality that for each split

In order to bound the second empirical process term we apply Lemma 10 to the term

Thus, we have shown

We have

Lemma 4 below shows that

We can conservatively bound this as follows:

where we used conservative bounding by not utilizing that

We need that

**Verification of eq. (29)**: By eq. (6), we can bound the second order remainder as follows:

Thus, it suffices to assume that

We verified the conditions of Theorem 1. Application of Theorem 1 yields the following result.

### Theorem 2

Consider the nonparametric statistical model

Consider the above defined one-step CV-TMLE

Assume that

Then

Thus for large dimension

Above we used the following lemma.

### Lemma 4

We have

We also have

**Proof**: We first prove eq. (34). Let

be the Kullback-Leibler divergence between the Bernoulli laws with probabilities

In van der Vaart (1998, page 62) it is shown that for two densities

Applying the inequality

Now, note that

This completes the proof of eq. (34). We have

Completely analogously to the derivation above of eq. (36), we obtain

and thus

This proves eq. (35).

### 8 Discussion

In this article we established that a one-step CV-TMLE, using a super-learner with a library that includes

This remarkable fact is driven largely by the fact that this super-learner will always converge at a rate faster than

In order to prove our theorems it was also important to establish that a one-step TMLE already approximately solves the efficient influence curve equation, under very general and reasonable conditions. In this article we focused on a one-step TMLE that updates each nuisance parameter with its own one-dimensional MLE update step. This choice of local least favorable submodel guarantees that the one-step TMLE update of the super-learner of the nuisance parameters is not driven by the nuisance parameter component that is hardest to estimate, which might have finite-sample advantages. Nonetheless, our asymptotic efficiency theorem applies to any local least favorable submodel.
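A minimal sketch of such a separate one-dimensional update, for the outcome regression in a binary-outcome version of our example: the logistic fluctuation with clever covariate H = A/g is fitted by its own one-dimensional MLE, after which the corresponding component of the efficient influence curve equation is solved. The toy data and the crude initial estimator are ours, for illustration only.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_update(Y, H, Qbar, n_newton=20):
    """One-dimensional MLE update along the logistic fluctuation
    Qbar_eps = expit(logit(Qbar) + eps * H) for binary Y, solving the
    score equation sum_i H_i (Y_i - Qbar_eps,i) = 0 by Newton's method."""
    eps = 0.0
    for _ in range(n_newton):
        Q_eps = expit(logit(Qbar) + eps * H)
        score = np.sum(H * (Y - Q_eps))
        hess = -np.sum(H ** 2 * Q_eps * (1 - Q_eps))
        if abs(hess) < 1e-12:
            break
        eps -= score / hess
    return eps, expit(logit(Qbar) + eps * H)

rng = np.random.default_rng(2)
n = 2000
W = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, expit(W + A - 1))
g = np.full(n, 0.5)
Qbar_init = np.full(n, 0.5)      # crude initial outcome regression
H = A / g                        # clever covariate for the treatment-specific mean
eps, Qbar_star = tmle_update(Y, H, Qbar_init)
# After the update, this component of the EIC equation is solved:
score_after = np.mean(H * (Y - Qbar_star))
```

Because the fluctuation stays inside the logistic model, the updated Qbar_star respects the (0, 1) bounds of the statistical model, which is the robustness property of TMLE emphasized in the introduction.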

The fact that a one-step TMLE already solves the efficient influence curve equation is particularly important in problems in which the TMLE update step is very demanding due to the high complexity of the efficient influence curve. In addition, a one-step TMLE has more predictable and robust behavior than the limit of an iterative algorithm. We could have focused on the universal least favorable submodels, so that the TMLE is always a one-step TMLE, but in various problems local least favorable submodels are easier to fit and can thus have practical advantages.

By now, we have also implemented the HAL-estimator for nonparametric regression and dimensions

In this article we assumed independent and identically distributed observations. Nonetheless, this type of super learner and the resulting asymptotic efficiency of the one-step TMLE will be generalizable to a variety of dependent data structures such as data generated by a statistical graph that assumes sufficient conditional independencies so that the desired central limit theorems can still be established [4, 23, 24, 25, 26].

This article focused on a CV-TMLE that represents the statistical target parameter

Our general theorems, and specifically the theorems for our example, demonstrate that the model bound on the variance of the efficient influence curve heavily affects the stability of the TMLE, and that we can only let this bound converge to infinity at a slow rate when the dimension of the data is large. Therefore, knowing this bound, instead of enforcing it in a data-adaptive manner, is crucial for the good behavior of these efficient estimators. This is also evident from the well-known finite-sample behavior of various efficient estimators in causal inference and censored data models, which almost always rely on truncation of the treatment and/or censoring mechanism. If one uses highly data-adaptive estimators, then even when the censoring or treatment mechanism is bounded away from zero, the estimators of these nuisance parameters can easily get very close to zero, so that truncation is crucial. Careful data-adaptive selection of this truncation level is therefore an important component of the definition of these efficient estimators.

Alternatively, one can define target parameters in such a way that their variance of the efficient influence curve is uniformly bounded over the model (e.g., [32]). For example, in our example we could have defined the target parameter

## Acknowledgements

This research is funded by NIH-grant 5R01AI074345-07. The author thanks Marco Carone, Antoine Chambaz, and Alex Luedtke for stimulating discussions, and the reviewers for their very helpful comments.

#### References

1. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Berlin/Heidelberg/New York: Springer, 1997.

2. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS epidemiology. Basel: Birkhäuser, 1992:297–331. doi:10.1007/978-1-4757-1229-2_14.

3. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer, 2003. doi:10.1007/978-0-387-21700-0.

4. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat. 2008;4(1):Article 17. doi:10.2202/1557-4679.1114.

5. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. Berlin/Heidelberg/New York: Springer, 2011. doi:10.1007/978-1-4419-9782-1.

6. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):Article 11. doi:10.2202/1557-4679.1043.

7. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat. 2010;6(1). doi:10.2202/1557-4679.1182.

8. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat. 2011;7(1):Article 31. doi:10.2202/1557-4679.1308. Also available as U.C. Berkeley Division of Biostatistics Working Paper 279, http://www.bepress.com/ucbbiostat/paper279.

9. Sekhon JS, Gruber S, Porter KE, van der Laan MJ. Propensity-score-based estimators and C-TMLE. In: van der Laan MJ, Rose S, editors. Targeted learning: causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer, 2012.

10. Polley EC, Rose S, van der Laan MJ. Super learner. In: van der Laan MJ, Rose S, editors. Targeted learning: causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer, 2011.

11. van der Laan MJ, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat. 2016;12(1):351–378. doi:10.1515/ijb-2015-0054.

12. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6(1):Article 25. doi:10.2202/1544-6115.1309.

13. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis. 2006;24(3):351–371. doi:10.1524/stnd.2006.24.3.351.

14. Polley EC, Rose S, van der Laan MJ. Super learning. In: van der Laan MJ, Rose S, editors. Targeted learning: causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer, 2012.

15. Zheng W, van der Laan MJ. Cross-validated targeted minimum loss based estimation. In: van der Laan MJ, Rose S, editors. Targeted learning: causal inference for observational and experimental studies. New York: Springer, 2011.

16. van der Laan MJ. A generally efficient targeted minimum loss-based estimator. Technical report, Division of Biostatistics, UC Berkeley, 2015. http://biostats.bepress.com/ucbbiostat/paper343.

17. Neuhaus G. On weak convergence of stochastic processes with multidimensional time parameter. Ann Math Stat. 1971;42:1285–1295. doi:10.1214/aoms/1177693241.

18. van der Vaart AW, Wellner JA. A local maximal inequality under uniform entropy. Electron J Stat. 2011;5:192–203. doi:10.1214/11-EJS605.

19. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Berlin/Heidelberg/New York: Springer, 1996. doi:10.1007/978-1-4757-2545-2.

20. Gill RD, van der Laan MJ, Wellner JA. Inefficient estimators of the bivariate survival function for three models. Annales de l'Institut Henri Poincaré. 1995;31:545–597.

21. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Stat Decis. 2006;24(3):373–395. doi:10.1524/stnd.2006.24.3.373.

22. Benkeser D, van der Laan MJ. The highly adaptive lasso estimator. In: Proceedings of the IEEE Conference on Data Science and Advanced Analytics, 2016. doi:10.1109/DSAA.2016.93.

23. Chambaz A, van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, theoretical study. Int J Biostat. 2011;7(1):1–32. doi:10.2202/1557-4679.1247. Also available as Working Paper 258, www.bepress.com/ucbbiostat.

24. Chambaz A, van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, simulation study. Int J Biostat. 2011;7(1):Article 33. doi:10.2202/1557-4679.1310. Also available as Working Paper 258, www.bepress.com/ucbbiostat.

25. van der Laan MJ. Causal inference for networks. Technical Report 300, UC Berkeley, 2012. http://biostats.bepress.com/ucbbiostat/paper300. To appear in Journal of Causal Inference.

26. van der Laan MJ, Balzer LB, Petersen ML. Adaptive matching in randomized trials and observational studies. J Stat Res. 2013;46(2):113–156.

27. Gruber S, van der Laan MJ. tmle: targeted maximum likelihood estimation. R package version 1.2.0-1, 2012. http://cran.r-project.org/web/packages/tmle/tmle.pdf. doi:10.18637/jss.v051.i13.

28. Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted maximum likelihood estimation of dynamic and static marginal structural working models. J Causal Inference. 2013;2:147–185. doi:10.1515/jci-2013-0007.

29. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972. doi:10.1111/j.1541-0420.2005.00377.x.

30. Díaz I, van der Laan MJ. Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems. Int J Biostat. In press.

31. van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors. Ensemble machine learning: methods and applications. New York: Springer, 2012.

32. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1):Article 3. doi:10.2202/1557-4679.1022.

33. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, 2003.

34. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol Biol. 2004;3(1):Article 4. doi:10.2202/1544-6115.1036.

## Appendix

### A Oracle inequality for the cross-validation selector

Lemma 2 is a simple corollary of the following finite sample oracle inequality for cross-validation [11, 13], combined with exploiting the convexity of the loss function allowing us to bring the

### Lemma 5

For any

Similarly, for any

where

If

Similarly, if

### B Super-learner of G₀

Completely analogously to the super-learner eq. (23), we can define such a super-learner of

Let

We define the cross-validation selector as the index

that minimizes the cross-validated risk

The same Lemma 2 applies to this estimator

### Lemma 6

Recall the definition of the model bounds

If for each fixed

If for a fixed

Suppose that for each fixed

### C Empirical process results

Theorem 2.1 in [18] establishes the following result for a Donsker class

where

is the entropy integral from

Suppose we want a bound on

Suppose that

Thus, we have

Note that this is a decreasing function in

This proves the following lemma.

### Lemma 7

Consider

If

Consider eq. (39) again, but suppose now that

We can conservatively bound

By plugging this latter bound into eq. (39) we obtain

Note that the right-hand side is increasing in

### Lemma 8

Consider

The following lemma is proved by first applying the Lemma 7 to

### Lemma 9

Consider the following setting:

Then

where

**Proof**: We have

This shows

The right-hand side is of order

We can factor out

This completes the proof of the lemma.

The following lemma is needed in the analysis of the CV-TMLE, where

### Lemma 10

Let

and

**Proof**: For notational convenience, let’s denote

This completes the proof.

### D Implementing the HAL-estimator

For notational convenience, consider the case that

where in this section we redefine

#### D.1 Approximating a function with variation norm M by a linear combination of indicator basis functions with L1-norm of the coefficient vector equal to M

Any cadlag function

For each subset

Let

#### D.2 An approximation of the MLE over functions of bounded variation using L1-penalization

For an

as the collection of all these finite linear combinations of this collection of basis functions under the constraint that its

The next lemma proves that we can approximate such an MLE over

#### Lemma 11

Let

Consider now an

**Proof**: We want to show that

which proves that

We now want to show that

Then, by assumption and the dominated convergence theorem,