1 Introduction
We consider the general statistical estimation problem defined by a statistical model for the data distribution, a Euclidean valued target parameter mapping defined on the statistical model, and an observed sample of independent and identically distributed draws from the data distribution.
One can construct an asymptotically efficient estimator with the following two general methods. First, the one-step estimator is defined by adding to an initial plug-in estimator of the target parameter the empirical mean of an estimator of the efficient influence curve at this same initial estimator [1]. In the special case that the efficient influence curve can be represented as an estimating function, this methodology corresponds to the first step of the Newton-Raphson algorithm for solving the estimating equation defined by setting the empirical mean of the efficient influence curve equal to zero. Such general estimating equation methodology for the construction of efficient estimators has been developed for censored data and causal inference models in the literature (e.g., [2, 3]). Second, the TMLE defines a least favorable parametric submodel through an initial estimator of the relevant parts (nuisance parameters) of the data distribution, and updates the initial estimator with the MLE over this least favorable parametric submodel. The one-step TMLE of the target parameter is the resulting plug-in estimator [4, 5, 6]. In this article we focus on the one-step TMLE since it is the more robust estimator: as a plug-in estimator it respects the global constraints of the statistical model, which becomes evident when comparing the one-step estimator and TMLE in simulations in which the information for the target parameter is low (e.g., one-step estimators of probabilities can even fall outside the (0, 1) range) (e.g., [7, 8, 9]). Nonetheless, the results in this article have immediate analogues for the one-step estimator and the estimating equation method.
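To fix ideas, the two constructions can be written in generic notation, introduced here only for illustration (the paper's own symbols are defined in Sections 2 and 3). With the empirical distribution of the sample, an initial estimator of the data distribution, the target parameter mapping, and the efficient influence curve at a distribution, a minimal sketch is:

```latex
% One-step estimator: initial plug-in plus empirical mean of the estimated
% efficient influence curve (all symbols illustrative).
\psi_n^{\text{1-step}} \;=\; \Psi(\hat P_n) \;+\; \frac{1}{n}\sum_{i=1}^n D^{*}(\hat P_n)(O_i).

% TMLE: maximum likelihood update along a least favorable submodel
% \{\hat P_{n,\epsilon} : \epsilon\} through \hat P_n, followed by plug-in.
\epsilon_n \;=\; \arg\max_{\epsilon}\ \frac{1}{n}\sum_{i=1}^n
  \log \frac{d\hat P_{n,\epsilon}}{d\mu}(O_i),
\qquad
\psi_n^{\text{TMLE}} \;=\; \Psi\bigl(\hat P_{n,\epsilon_n}\bigr).
```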
The asymptotic linearity and efficiency of the TMLE and one-step estimator relies on a second order remainder to be
In this article, for each nuisance parameter, we propose a new minimum loss based estimator that minimizes the loss-specific empirical risk over its parameter space under the additional constraint that the variation norm is bounded by a set constant. The constant is selected with cross-validation. We show that these MLEs can be represented as the minimizer of the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute values of the coefficients is bounded by the constant: i.e., the variation norm corresponds with this L1-norm of the coefficient vector.
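As a concrete, purely illustrative sketch of this representation (not the estimator studied in this paper), the following Python code builds the indicator basis with knots at the observed data points and fits an L1-penalized regression; the penalized (Lagrangian) form is used here as a stand-in for the explicit variation-norm constraint, the helper `hal_basis` and the simulated data are ours, and the sum of the absolute coefficients plays the role of the variation norm of the fit.

```python
# Illustrative HAL-style fit: (i) zero-order (indicator) basis functions with
# knots at the observed points, for every nonempty subset of coordinates, and
# (ii) an L1-penalized least squares fit as a Lagrangian stand-in for the
# variation-norm constraint described in the text.
import itertools
import numpy as np
from sklearn.linear_model import LassoCV

def hal_basis(X, knots):
    """Columns are the indicators 1{x_s >= u_s} over nonempty coordinate
    subsets s and knot points u taken from the rows of `knots`."""
    n, d = X.shape
    cols = []
    for r in range(1, d + 1):
        for s in itertools.combinations(range(d), r):
            s = list(s)
            for u in knots:
                cols.append(np.all(X[:, s] >= u[s], axis=1).astype(float))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n, d = 300, 2
X = rng.uniform(size=(n, d))
y = np.sin(4 * X[:, 0]) + (X[:, 1] > 0.5) + 0.1 * rng.standard_normal(n)

Phi = hal_basis(X, knots=X)        # knots at the observed data points
fit = LassoCV(cv=5).fit(Phi, y)    # L1 penalty selected by cross-validation
print("L1-norm of fitted coefficients (proxy for the variation norm):",
      np.abs(fit.coef_).sum())
```

Every L1-penalized solution is also the solution of a variation-norm-constrained problem for some bound, so tracing out the penalty path is one practical way to approximate the family of constrained fits the paper works with.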
2 Example: Treatment specific mean in nonparametric model
Before starting the main part of this article, we first introduce an example in this section and use it to provide the reader with a guide through the different sections.
2.1 Defining the statistical estimation problem
Let
Our target parameter
for some
Let
An estimator
where
We have that
where
Of course,
We define the following two log-likelihood loss functions for
We also define the corresponding Kullback-Leibler dissimilarities
Let the submodel
If we were to replace the log-likelihood loss
Let
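For orientation, in the standard formulation of this example, with observed data consisting of covariates, a binary treatment, and a bounded outcome, a treatment mechanism, and an outcome regression, the efficient influence curve of the treatment specific mean and the logistic least favorable fluctuation of the outcome regression take the familiar forms below. We record them only as a reminder, in notation that may differ from the paper's.

```latex
% Efficient influence curve of \Psi(P) = E_P \bar Q(1, W) in the nonparametric model:
D^{*}(P)(O) \;=\; \frac{\mathbf{1}(A=1)}{g(1 \mid W)}\,\bigl(Y - \bar Q(A, W)\bigr)
  \;+\; \bar Q(1, W) \;-\; \Psi(P).

% Logistic (quasi-log-likelihood) least favorable submodel through \bar Q,
% using the "clever covariate" H_g(A, W) = \mathbf{1}(A=1)/g(1 \mid W):
\operatorname{logit} \bar Q_{\epsilon}(A, W) \;=\; \operatorname{logit} \bar Q(A, W)
  \;+\; \epsilon\, H_g(A, W),
% whose score at \epsilon = 0 equals the first term of D^{*}(P).
```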
2.2 One step CV-TMLE
Let
through
through
This universal least favorable submodel implies a recursive construction of
Let
The CV-TMLE of
Note that this latter representation proves that we never have to carry out the TMLE-update step for
We also conclude that this one-step CV-TMLE solves the crucial cross-validated efficient influence curve equation
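In generic notation (ours, not necessarily the paper's), with V sample splits, the empirical distribution of the v-th validation sample, and the updated outcome regression and treatment mechanism estimator based on the v-th training sample, this cross-validated efficient influence curve equation reads, up to the approximation error discussed in Section 6:

```latex
\frac{1}{V}\sum_{v=1}^{V} P^{1}_{n,v}
  \Bigl[\frac{\mathbf{1}(A=1)}{\hat g_{n,v}(1\mid W)}\bigl(Y-\bar Q^{*}_{n,v}(A,W)\bigr)
        + \bar Q^{*}_{n,v}(1,W) - \psi_n \Bigr] \;=\; o_P\bigl(n^{-1/2}\bigr),
\qquad
\psi_n \;=\; \frac{1}{V}\sum_{v=1}^{V} P^{1}_{n,v}\,\bar Q^{*}_{n,v}(1,\cdot).
```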
2.3 Guide for article based on this example
Section 3: Formulation of general estimation problem. The goal of this article goes far beyond establishing asymptotic efficiency of the CV-TMLE eq. (3) in this example. Therefore, we start in Section 3 by defining a general model and general target parameter, essentially generalizing the above notation for this example. Having read the above example, the presentation in Section 3 of a very general estimation problem will be easier to follow. Our subsequent definitions and results for the HAL-estimator, the HAL-super-learner, and the CV-TMLE in Sections 4-6 apply to our general model and target parameter, thereby establishing asymptotic efficiency of the CV-TMLE for an enormously large class of semi-parametric statistical estimation problems, including our example as a special case.
Let’s now return to our example to point out the specific tasks that are solved in each section of this article. By eqs (1) and (4), we have the following starting identity for the CV-TMLE:
By the Cauchy-Schwarz inequality and bounding
where
Thus, by selecting
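The structure of this argument is the standard one for plug-in estimators built around the canonical gradient; in generic notation (ours), with the updated fit produced by the CV-TMLE, it rests on the exact identity and decomposition sketched below.

```latex
% Exact identity defining the second order remainder R_2:
\Psi(P) - \Psi(P_0) \;=\; -P_0 D^{*}(P) \;+\; R_2(P, P_0).

% Applied to the updated fit P^{*}_n and rearranged:
\Psi(P^{*}_n) - \Psi(P_0)
  \;=\; (P_n - P_0) D^{*}(P^{*}_n) \;-\; P_n D^{*}(P^{*}_n) \;+\; R_2(P^{*}_n, P_0),
% so that efficiency follows once (i) the score term P_n D^{*}(P^{*}_n) is
% o_P(n^{-1/2}), (ii) the empirical process term behaves as
% (P_n - P_0) D^{*}(P_0) + o_P(n^{-1/2}), and (iii) the remainder
% R_2(P^{*}_n, P_0) is o_P(n^{-1/2}).
```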
Section 4: Construction and analysis of an
Section 5: Construction and analysis of an HAL-super-learner. Instead of assuming that the variation norm of
The convergence results for this super-learner in terms of the Kullback-Leibler loss-based dissimilarities also imply corresponding results for
Section 6: Construction and analysis of HAL-CV-TMLE. To control the remainder we need to understand the behavior of the updated initial estimator
This will hold under weak conditions, given that we have estimators
Section 7: Returning to our example. In Section 7 we return to our example to present a formal Theorem 2 with specified conditions, involving an application of our general efficiency Theorem 1 in Section 6.
Appendix: Various technical results are presented in the Appendix.
3 Statistical formulation of the estimation problem
Let
Let
Our goal in this article is to construct a substitution estimator (i.e., a TMLE
Relevant nuisance parameters
be the collection of these
Suppose that
We assume that
We also define
as the vector of
We define
as the vector of
We will also use the notation
when
We will abuse notation by also denoting
Second order remainder for target parameter: We define the second order remainder
We will also denote
Given the above double robustness property of the canonical gradient (i.e., of the target parameter), if
Support of data distribution: The support of
so that
Cadlag functions on
For a subset
where
denote the set of cadlag functions
Cartesian product of cadlag function spaces, and its component-wise operations: Let
of variation norms. If
Our results will hold for general models and pathwise differentiable target parameters, as long as the statistical model satisfies the following key smoothness assumption:
For each
Definition of bounds on the statistical model: The properties of the super-learner and TMLE rely on bounds on the model
Note that
Bounded and unbounded models: We will call the model
are finite. In essence, a bounded model is one for which the support and the supremum norm
of
Sequence of bounded submodels approximating the unbounded model: For an unbounded model
This model
Let
We will assume that
In this paper our initial estimators of
4 Highly adaptive Lasso estimator of nuisance parameters
Let
4.1 Upper bounding the entropy of the parameter space for the HAL-estimator
We remind the reader that a covering number
where
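For completeness, the definitions we have in mind (standard, and possibly stated in slightly different form in the paper) are:

```latex
% Covering number: the minimal number of L^2(Q)-balls of radius \epsilon needed
% to cover the class \mathcal{F}:
N\bigl(\epsilon, \mathcal{F}, L^2(Q)\bigr).

% Uniform entropy integral of \mathcal{F} with envelope F:
J\bigl(\delta, \mathcal{F}\bigr) \;=\;
  \int_0^{\delta} \sup_{Q}
  \sqrt{\log N\bigl(\epsilon \lVert F\rVert_{Q,2}, \mathcal{F}, L^2(Q)\bigr)}\, d\epsilon ,
% where the supremum is over finitely discrete probability measures Q.
```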
The minimal rates of convergence of our HAL-estimator of
By eq. (17) it follows that any cadlag function with finite variation norm can be represented as a difference of two bounded monotone increasing functions (i.e., cumulative distribution functions). The class of
where
4.2 Minimal rate of convergence of the HAL-estimator
Lemma 1 below proves that the minimal rates
Let
For a given vector
Consider an estimator
where
and
Proof: We have
which proves eq. (22). Since
We now apply Lemma 7 with
This proves
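The core of this type of argument is the elementary basic inequality for empirical risk minimizers, recorded here in generic notation (ours): if the true nuisance parameter lies in the constrained class over which the HAL-estimator minimizes the empirical risk, then the loss-based dissimilarity of the fit is bounded by an empirical process term over that class.

```latex
% Q_n = \arg\min_{Q : \lVert Q \rVert_v \le M} P_n L(Q), with Q_0 in the same class.
% Since P_n L(Q_n) \le P_n L(Q_0),
0 \;\le\; d_0(Q_n, Q_0) \;:=\; P_0 L(Q_n) - P_0 L(Q_0)
  \;\le\; -(P_n - P_0)\bigl(L(Q_n) - L(Q_0)\bigr),
% and the right-hand side is an empirical process term indexed by the class of
% cadlag functions with variation norm bounded by M, whose entropy was bounded
% in Section 4.1.
```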
5 Super-learning: HAL-estimator tuning the variation norm of the fit with cross-validation
Defining the library of candidate estimators: For an
Note that for all
Super Learner: Let
We define the cross-validation selector as the index
that minimizes the cross-validated risk
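As a small illustrative companion to the HAL sketch given in Section 1 (again not the authors' implementation; the helper `cv_selector` and the use of an L1 penalty in place of the explicit variation-norm bound are ours), the discrete cross-validation selector over a grid of candidate bounds can be written as:

```python
# V-fold cross-validation selector over a grid of L1 penalties, each penalty
# standing in for one candidate variation-norm bound: the selector returns the
# candidate with smallest cross-validated empirical risk (here, mean squared
# error for the squared error loss).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_selector(Phi, y, penalties, n_splits=5, seed=0):
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    risks = np.zeros(len(penalties))
    for train, valid in folds.split(Phi):
        for k, lam in enumerate(penalties):
            fit = Lasso(alpha=lam, max_iter=10000).fit(Phi[train], y[train])
            resid = y[valid] - fit.predict(Phi[valid])
            risks[k] += np.mean(resid ** 2) / n_splits
    return penalties[int(np.argmin(risks))], risks

# Example use, with Phi and y as constructed in the earlier HAL sketch:
# lam_cv, risks = cv_selector(Phi, y, penalties=np.logspace(-3, 0, 10))
```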
The following lemma proves that the super-learner
Recall the definition of the model bounds
For any fixed
If for each fixed
If for each fixed
Suppose that for each finite
This lemma is a simple corollary of the finite sample oracle inequality for cross-validation [11, 13, 21, 33, 34], also presented as Lemma 5 in Section A of the Appendix. The proof uses the convexity of the loss function to bring the
In the Appendix we present the analogue super-learner eq. (37) of
6 One-step CV-HAL-TMLE
Cross-validated TMLE (CV-TMLE) robustifies the bias-reduction of the TMLE-step by selecting
6.1 The CV-HAL-TMLE
Definition of one-step CV-HAL-TMLE for general local least favorable submodel: Let
be the MLE of
One-step CV-HAL-TMLE solves cross-validated efficient score equation: Our efficiency Theorem 1 assumes that
That is, it is assumed that the one-step CV-TMLE already solves the cross-validated efficient influence curve equation up to an asymptotically negligible approximation error. By definition of
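Written out in generic notation (ours), the one-step CV-TMLE uses one common fluctuation parameter across the V sample splits, chosen to minimize the cross-validated empirical risk along the least favorable submodels through the split-specific initial fits, and then averages the resulting plug-ins:

```latex
\epsilon_n \;=\; \arg\min_{\epsilon}\ \frac{1}{V}\sum_{v=1}^{V}
   P^{1}_{n,v}\, L\bigl(\hat Q_{n,v,\epsilon}\bigr),
\qquad
\hat Q^{*}_{n,v} \;=\; \hat Q_{n,v,\epsilon_n},
\qquad
\psi_n \;=\; \frac{1}{V}\sum_{v=1}^{V} \Psi\bigl(\hat Q^{*}_{n,v}\bigr),
% and this estimator is assumed (eq. (25)) to satisfy
\frac{1}{V}\sum_{v=1}^{V} P^{1}_{n,v}\, D^{*}\bigl(\hat Q^{*}_{n,v}, \hat g_{n,v}\bigr)
  \;=\; o_P\bigl(n^{-1/2}\bigr).
```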
One-step CV-HAL-TMLE preserves fast rate of convergence of initial estimator: Our efficiency Theorem 1 also assumes that the updated estimator
A class of multivariate local least favorable submodels that separately updates each nuisance parameter component: One way to guarantee that
for
How to construct a local least favorable submodel of type eq. (26): A general approach for constructing such a
so that,
That is,
so that the submodel is indeed a local least favorable submodel.
Lemma 14 provides a sufficient set of minor conditions under which the one-step HAL-CV-TMLE using a local least favorable submodel of the type eq. (26) will satisfy eq. (25). Therefore, the class of local least favorable submodels eq. (26) yields both crucial conditions for the HAL-CV-TMLE: it solves eq. (25) and it preserves the rate of convergence of the initial estimator.
6.2 Preservation of the rate of initial estimator for the one-step CV-HAL-TMLE using eq. (26)
Consider the submodel
Let
We have
By convexity of the loss function
We have
Thus, if for some
It then also follows that for each
6.3 Efficiency of the one-step CV-HAL-TMLE
We have the following theorem.
Consider the above defined corresponding one-step CV-TMLE
Initial estimator conditions: Consider the HAL-super-learners
where
“Preserve rate of convergence of initial estimator”-condition: In addition, assume that either (Case A) the CV-TMLE uses a local least favorable submodel of the type eq. (26) so that Lemma 3 applies, or (Case B) that for each split
Efficient influence curve score equation condition and second order remainder condition: Define
In Case A, for verification of assumption eq. (27) one could apply Lemma 14. In Case A, for verification of the two assumptions eqs (28) and (29) one can use that for each of the
Then,
Condition eq. (32) will practically always trivially hold for
we have
Consider Case A. Lemma 3 proves that under these same assumptions eqs (30), (31), we also have, for each
Suppose
Combining this with eq. (27) yields the following identity:
By assumption eq. (29) we have that
We now note
Thus, it remains to prove that
Note also that the envelope of
This proves also that
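The net conclusion of the theorem can be summarized, in our notation, by the familiar asymptotic linearity and the Wald-type confidence interval it licenses (the variance estimator shown is the natural cross-validated one; the paper may use a variant):

```latex
\psi_n - \psi_0 \;=\; (P_n - P_0) D^{*}(P_0) + o_P(n^{-1/2})
\;\;\Longrightarrow\;\;
\sqrt{n}\,(\psi_n - \psi_0) \rightsquigarrow N\bigl(0, \sigma_0^2\bigr),
\qquad \sigma_0^2 = P_0 \{D^{*}(P_0)\}^2 ,

\hat\sigma_n^2 \;=\; \frac{1}{V}\sum_{v=1}^{V} P^{1}_{n,v}
   \bigl\{D^{*}\bigl(\hat Q^{*}_{n,v}, \hat g_{n,v}\bigr)\bigr\}^2,
\qquad
\psi_n \pm z_{1-\alpha/2}\, \hat\sigma_n/\sqrt{n}.
```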
7 Example: Treatment specific mean
We will now apply Theorem 1 to the example introduced in Section 2. We have the following sieve model bounds (van der Laan et al., 2004):
Since the parameter space
Verification of eqs (30) and (31): Let
Plugging in the above bounds for
Above we showed that if
Verification of eq. (28):
Using straightforward algebra and the triangle inequality, we obtain
Using that
We bound the last term as follows:
where, at the third equality, we used that for each split
In order to bound the second empirical process term we apply Lemma 10 to the term
Thus, we have shown
We have
Lemma 4 below shows that
We can conservatively bound this as follows:
where we bounded conservatively by not using that
We need that
Verification of eq. (29): By eq. (6), we can bound the second order remainder as follows:
Thus, it suffices to assume that
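For reference, in the standard notation for this example the second order remainder and its Cauchy-Schwarz bound take the form below (our rendering; the paper's eq. (6) may carry additional bounding constants):

```latex
R_2(P, P_0) \;=\; \int \frac{g(1\mid w) - g_0(1\mid w)}{g(1\mid w)}
   \bigl(\bar Q(1, w) - \bar Q_0(1, w)\bigr)\, dP_0(w),

\bigl| R_2(P, P_0) \bigr| \;\le\;
  \Bigl\lVert \tfrac{1}{g(1\mid \cdot)} \Bigr\rVert_{\infty}\,
  \lVert g - g_0 \rVert_{P_0}\, \lVert \bar Q - \bar Q_0 \rVert_{P_0},
% so that R_2 = o_P(n^{-1/2}) as soon as both nuisance estimators converge to
% their targets at a rate faster than n^{-1/4} in L^2(P_0).
```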
We verified the conditions of Theorem 1. Application of Theorem 1 yields the following result.
Consider the nonparametric statistical model
Consider the above defined one-step CV-TMLE
Assume that
Then
Thus for large dimension
Above we used the following lemma.
We have
We also have
Proof: We first prove eq. (34). Let
be the Kullback-Leibler divergence between the Bernoulli laws with probabilities
In van der Vaart (1998, page 62) it is shown that for two densities
Applying the inequality
Now, note that
This completes the proof of eq. (34). We have
Completely analogously to the derivation of eq. (36) above, we obtain
and thus
This proves eq. (35).
8 Discussion
In this article we established that a one-step CV-TMLE, using a super-learner with a library that includes
This remarkable result is heavily driven by the fact that this super-learner will always converge at a rate faster than
In order to prove our theorems it was also important to establish that a one-step TMLE already approximately solves the efficient influence curve equation, under very general reasonable conditions. In this article we focused on a one-step TMLE that updates each nuisance parameter with its own one-dimensional MLE update step. This choice of local least favorable submodel guarantees that the one-step TMLE update of the super-learner of the nuisance parameters is not driven by the nuisance parameter component that is hardest to estimate, which might have finite sample advantages. Nonetheless, our asymptotic efficiency theorem applies to any local least favorable submodel.
The fact that a one-step TMLE already solves the efficient influence curve equation is particularly important in problems in which the TMLE update step is very demanding due to a high complexity of the efficient influence curve. In addition, a one-step TMLE has a more predictable robust behavior than a limit of an iterative algorithm. We could have focused on the universal least favorable submodels so that the TMLE is always a one-step TMLE, but in various problems local least favorable submodels are easier to fit and can thus have practical advantages.
By now, we have also implemented the HAL-estimator for nonparametric regression and dimensions
In this article we assumed independent and identically distributed observations. Nonetheless, this type of super learner and the resulting asymptotic efficiency of the one-step TMLE will be generalizable to a variety of dependent data structures such as data generated by a statistical graph that assumes sufficient conditional independencies so that the desired central limit theorems can still be established [4, 23, 24, 25, 26].
This article focused on a CV-TMLE that represents the statistical target parameter
Our general theorems and specifically the theorems for our example demonstrate that the model bound on the variance of the efficient influence curve heavily affects the stability of the TMLE, and that we can only let this bound converge to infinity at a slow rate when the dimension of the data is large. Therefore, knowing this bound instead of enforcing it in a data adaptive manner is crucial for good behavior of these efficient estimators. This is also evident from the well known finite sample behavior of various efficient estimators in causal inference and censored data models that almost always rely on using truncation of the treatment and/or censoring mechanism. If one uses highly data adaptive estimators, even when the censoring or treatment mechanism is bounded away from zero, the estimators of these nuisance parameters could easily get very close to zero, so that truncation is crucial. Careful data adaptive selection of this truncation level is therefore an important component in the definition of these efficient estimators.
Alternatively, one can define target parameters in such a way that the variance of their efficient influence curve is uniformly bounded over the model (e.g., [32]). For example, in our example we could have defined the target parameter
This research is funded by NIH-grant 5R01AI074345-07. The author thanks Marco Carone, Antoine Chambaz, and Alex Luedtke for stimulating discussions, and the reviewers for their very helpful comments.
References
- 1. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Berlin/Heidelberg/New York: Springer, 1997.
- 2. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS epidemiology. Basel: Birkhauser, 1992:297–331.
- 3. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.
- 4. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat. 2008;4(1):Article 17.
- 5. van der Laan MJ, Rose S. Targeted learning: Causal inference for observational and experimental data. Berlin/Heidelberg/New York: Springer, 2011.
- 6. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):Article 11.
- 7. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat. 2010;6(1).
- 8. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat. 2011;7(1):Article 31 (published online Aug 17, 2011). Also available as U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 279, http://www.bepress.com/ucbbiostat/paper279.
- 9. Sekhon JS, Gruber S, Porter KE, van der Laan MJ. Propensity score-based estimators and C-TMLE. In: van der Laan MJ, Rose S, editors, Targeted learning: Causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer, 2012.
- 10. Polley EC, Rose S, van der Laan MJ. Super learner. In: van der Laan MJ, Rose S, editors, Targeted learning: Causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer, 2011.
- 11. van der Laan MJ, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat. 2016;12(1):351–378.
- 12. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol. 2007;6(1):Article 25.
- 13. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis. 2006;24(3):351–371.
- 14. Polley EC, Rose S, van der Laan MJ. Super learning. In: van der Laan MJ, Rose S, editors, Targeted learning: Causal inference for observational and experimental data. New York/Dordrecht/Heidelberg/London: Springer, 2012.
- 15. Zheng W, van der Laan MJ. Cross-validated targeted minimum loss based estimation. In: van der Laan MJ, Rose S, editors, Targeted learning: Causal inference for observational and experimental studies. New York: Springer, 2011.
- 16. van der Laan MJ. A generally efficient targeted minimum loss-based estimator. Technical Report 300, UC Berkeley, 2015. http://biostats.bepress.com/ucbbiostat/paper343.
- 17. Neuhaus G. On weak convergence of stochastic processes with multidimensional time parameter. Ann Stat. 1971;42:1285–1295.
- 18. van der Vaart AW, Wellner JA. A local maximal inequality under uniform entropy. Electron J Stat. 2011;5:192–203.
- 19. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Berlin/Heidelberg/New York: Springer, 1996.
- 20. Gill RD, van der Laan MJ, Wellner JA. Inefficient estimators of the bivariate survival function for three models. Annales de l’Institut Henri Poincaré. 1995;31:545–597.
- 21. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Stat Decis. 2006;24(3):373–395.
- 22. Benkeser D, van der Laan MJ. The highly adaptive lasso estimator. In: Proceedings of the IEEE Conference on Data Science and Advanced Analytics, 2016. To appear.
- 23. Chambaz A, van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, theoretical study. Int J Biostat. 2011a;7(1):1–32. Working Paper 258, www.bepress.com/ucbbiostat.
- 24. Chambaz A, van der Laan MJ. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, simulation study. Int J Biostat. 2011b;7(1):33. Working Paper 258, www.bepress.com/ucbbiostat.
- 25. van der Laan MJ. Causal inference for networks. Technical Report 300, UC Berkeley, 2012. http://biostats.bepress.com/ucbbiostat/paper300. To appear in Journal of Causal Inference.
- 26. van der Laan MJ, Balzer LB, Petersen ML. Adaptive matching in randomized trials and observational studies. J Stat Res. 2013;46(2):113–156.
- 27. Gruber S, van der Laan MJ. tmle: Targeted maximum likelihood estimation. R package version 1.2.0-1, 2012. Available at http://cran.r-project.org/web/packages/tmle/tmle.pdf.
- 28. Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted maximum likelihood estimation of dynamic and static marginal structural working models. J Causal Inf. 2013;2:147–185.
- 29. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972.
- 30. Díaz I, van der Laan MJ. Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems. Int J Biostat. In press.
- 31. van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors, Ensemble machine learning: Methods and applications. New York: Springer, 2012.
- 32. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1):Article 3.
- 33. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, 2003.
- 34. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol. 2004;3(1):Article 4.
Lemma 2 is a simple corollary of the following finite sample oracle inequality for cross-validation [11, 13], combined with the convexity of the loss function, which allows us to bring the
For any
Similarly, for any
where
If
Similarly, if
Completely analogously to the super-learner eq. (23), we can define such a super-learner of
Let
We define the cross-validation selector as the index
that minimizes the cross-validated risk
The same Lemma 2 applies to this estimator
Recall the definition of the model bounds
If for each fixed
If for a fixed
Suppose that for each fixed
Theorem 2.1 in [18] establishes the following result for a Donsker class
where
is the entropy integral from
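In our reading of Theorem 2.1 in [18], the bound takes (up to a universal constant) the following form, for a class with envelope F satisfying a second-moment bound of the form written in the first comment line:

```latex
% Setting: \sup_{f \in \mathcal{F}} P f^2 \le \delta^2 P F^2, with envelope F.
E_P \bigl\lVert \mathbb{G}_n \bigr\rVert_{\mathcal{F}}
\;\lesssim\;
J(\delta, \mathcal{F})\,\lVert F \rVert_{P,2}
\left( 1 + \frac{J(\delta, \mathcal{F})}{\delta^{2}\sqrt{n}\,\lVert F \rVert_{P,2}} \right),
% with J(\delta, \mathcal{F}) the uniform entropy integral of Section 4.1 and
% \mathbb{G}_n = \sqrt{n}(P_n - P_0) the empirical process.
```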
Suppose we want a bound on
Suppose that
Thus, we have
Note that this is a decreasing function in
This proves the following lemma.
Consider
If
Consider eq. (39) again, but suppose now that
We can conservatively bound
By plugging this latter bound into eq. (39) we obtain
Note that the right-hand side is increasing in
Consider
The following lemma is proved by first applying Lemma 7 to
Consider the following setting:
Then
where
Proof: We have
This shows
The right-hand side is of order
We can factor out
This completes the proof of the lemma.
The following lemma is needed in the analysis of the CV-TMLE, where
Let
and
Proof: For notational convenience, let us denote
This completes the proof.
For notational convenience, consider the case that
where in this section we redefine
D.1 Approximating a function with bounded variation norm by a linear combination of indicator basis functions with L1-norm of the coefficient vector equal to the variation norm
Any cadlag function
For each subset
Let
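The representation we have in mind is the by now standard one, written here in our own notation: a cadlag function of finite variation can be written as a sum of section-specific integrals, and discretizing each integral yields a finite linear combination of indicator basis functions whose coefficients sum, in absolute value, to approximately the variation norm.

```latex
% Representation of a cadlag f on [0,\tau] with finite variation norm:
f(x) \;=\; f(0) \;+\; \sum_{\emptyset \ne s \subseteq \{1,\dots,d\}}
   \int_{(0_s, x_s]} f_s(du_s),
% where f_s(u_s) = f(u_s, 0_{s^c}) is the s-specific section of f.

% Discrete approximation with support points \{u_{s,j}\}_j in each section:
f(x) \;\approx\; f(0) \;+\; \sum_{s}\sum_{j} \beta_{s,j}\,
   \mathbf{1}\bigl(x_s \ge u_{s,j}\bigr),
\qquad
\beta_{s,j} \;=\; f_s\bigl((u_{s,j-1}, u_{s,j}]\bigr),
% and \sum_{s,j} \lvert \beta_{s,j} \rvert approximates the variation norm of f.
```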
D.2 An approximation of the MLE over functions of bounded variation using L1-penalization
For an
as the collection of all these finite linear combinations of this collection of basis functions under the constraint that its
The next lemma proves that we can approximate such an MLE over
Let
Consider now an
Proof: We want to show that
which proves that
We now want to show that
Then, by assumption and the dominated convergence theorem,
which proves that
D.3 An approximation of the MLE over the subspace by an MLE over an L1-constrained linear model
Above we defined a mapping from a function
Assume that if
Consider a
Similarly, consider a
Proof: We want to show that
which proves that
We now want to show that
Then, by assumption and the dominated convergence theorem,
which proves that
In this section we focus on the one-step TMLE, but the results can be straightforwardly generalized to the one-step CV-TMLE.
The following lemma proves that for a local least favorable submodel with a 1-dimensional
Let
Assume
- and ;
- ;
- falls in a -Donsker class with probability tending to 1;
- If for some density parameterization, then (43) holds;
- .
Then,
The first bullet point condition only assumes that the chosen least favorable submodel is smooth in
Proof of Lemma 13:
Firstly, by the fact that
By assumption,
By our Donsker class assumption, we have
Thus, it remains to show
By assumptions eq. (42), we have that the left-hand side of last expression equals
so that it remains to show that the first term equals zero. By
By assumption we have
Thus, it remains to show eq. (43), which holds by assumption. Suppose that
In the main article we have not proposed a 1-dimensional local least favorable submodel as in Lemma 13, even though our results are straightforwardly generalized to that case. Instead we proposed a
Let
Let
We wish to establish that
For each
- Suppose that by application of the previous lemma to , submodel , loss function , , and one-step TMLE , we establish its conclusion . For completeness, Lemma 15 below explicitly states these specific conditions of the previous lemma, which are sufficient for this conclusion.
- Let , and assume . For this to hold it suffices to assume that and a.e.
- Let , and assume . For this to hold it suffices to assume that and a.e. ; ; .
Then,
Let
Assume the following conditions:
- and ;
- ;
- falls in a -Donsker class with probability tending to 1;
- If for some density parameterization, then eq. (44) holds;
- .
Then,
Proof: This is an immediate application of Lemma 13.
Proof of Lemma 14: Consider a 1-dimensional submodel
By pathwise differentiability of
Since this holds for each
Firstly, we want to prove that
By assumption 2., the latter is
We now want to prove that
By assumption 4., we have
