Suppose we observe independent and identically distributed observations of a finite dimensional bounded random variable. This article is concerned with the construction of an efficient targeted minimum loss-based estimator (TMLE) of a pathwise differentiable target parameter of the data distribution based on a realistic statistical model. The only smoothness condition we enforce on the statistical model is that the nuisance parameters of the data distribution that are needed to evaluate the canonical gradient of the pathwise derivative of the target parameter are multivariate real-valued cadlag functions (right-continuous with left-hand limits; G. Neuhaus, On weak convergence of stochastic processes with multidimensional time parameter, Ann. Stat. 1971;42:1285–1295) with a finite supremum norm and (sectional) variation norm. Each nuisance parameter is defined as a minimizer of the expectation of a loss function over all functions in its parameter space. For each nuisance parameter, we propose a new minimum loss-based estimator that minimizes the loss-specific empirical risk over the functions in its parameter space under the additional constraint that the variation norm of the function is bounded by a set constant. The constant is selected with cross-validation. We show that such an MLE can be represented as the minimizer of the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute values of the coefficients is bounded by the constant: i.e., the variation norm corresponds with the $L_1$-norm of the vector of coefficients. We will refer to this estimator as the highly adaptive Lasso (HAL) estimator. We prove that for all models the HAL-estimator converges to the true nuisance parameter value at a rate faster than $n^{-1/4}$ w.r.t. the square-root of the loss-based dissimilarity.
We also show that if this HAL-estimator is included in the library of an ensemble super-learner, then the super-learner will at a minimum achieve the rate of convergence of the HAL-estimator, but, by previous results, it will actually be asymptotically equivalent with the oracle (i.e., in some sense best) estimator in the library. Subsequently, we establish that a one-step TMLE using such a super-learner as initial estimator for each of the nuisance parameters is asymptotically efficient at any data generating distribution in the model, under weak structural conditions on the target parameter mapping and model and a strong positivity assumption (e.g., the canonical gradient is uniformly bounded). We demonstrate our general theorem by constructing such a one-step TMLE of the average causal effect in a nonparametric model, and establishing that it is asymptotically efficient.
We consider the general statistical estimation problem defined by a statistical model for the data distribution, a Euclidean valued target parameter mapping defined on the statistical model, and observing independent and identically distributed draws from the data distribution. Our goal is to construct a generally asymptotically efficient substitution estimator of the target parameter. An estimator is asymptotically efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient (also called the efficient influence curve) of the pathwise derivative of the target parameter . For realistic statistical models, construction of efficient estimators requires using highly data adaptive estimators of the relevant parts of the data distribution that the efficient influence curve depends upon. We will refer to these relevant parts of the data distribution as nuisance parameters.
One can construct an asymptotically efficient estimator with the following two general methods. Firstly, the one-step estimator is defined by adding to an initial plug-in estimator of the target parameter an empirical mean of an estimator of the efficient influence curve at this same initial estimator . In the special case that the efficient influence curve can be represented as an estimating function, one can represent this methodology as the first step of the Newton-Raphson algorithm for solving the estimating equation defined by setting the empirical mean of the efficient influence curve equal to zero. Such general estimating equation methodology for construction of efficient estimators has been developed for censored and causal inference models in the literature (e.g., [2, 3]). Secondly, the TMLE defines a least favorable parametric submodel through an initial estimator of the relevant parts (nuisance parameters) of the data distribution, and updates the initial estimator with the MLE over this least favorable parametric submodel. The one-step TMLE of the target parameter is now the resulting plug-in estimator [4, 5, 6]. In this article we focus on the one-step TMLE since it is a more robust estimator by respecting the global constraints of the statistical model, which becomes evident when comparing the one-step estimator and TMLE in simulations for which the information is low for the target parameter (e.g., even resulting in one-step estimators of probabilities that are outside the (0, 1) range) (e.g., [7, 8, 9]). Nonetheless, the results in this article have immediate analogues for the one-step estimator and estimating equation method.
The asymptotic linearity and efficiency of the TMLE and one-step estimator rely on a second order remainder being $o_P(n^{-1/2})$, which typically requires that the nuisance parameters are estimated at a rate faster than $n^{-1/4}$ w.r.t. an $L^2$-norm (e.g., see our example in Section 7). To make the TMLE highly data adaptive, and thereby efficient for large statistical models, we have recommended estimating the nuisance parameters with a super-learner based on a large library of candidate estimators [10, 11, 12, 13]. Due to the oracle inequality for the cross-validation selector, the super-learner will be asymptotically equivalent with the oracle-selected estimator w.r.t. loss-based dissimilarity, even when the number of candidate estimators in the library grows polynomially in sample size. The loss-based dissimilarity (e.g., Kullback-Leibler divergence, or the loss-based dissimilarity for the squared error loss) behaves as the square of an $L^2$-norm (see, for example, Lemma 4 in our example). Therefore, in order to control the second order remainder, our goal should be to construct a candidate estimator in the library of the super-learner that converges at a rate faster than $n^{-1/4}$ w.r.t. the square-root of the loss-based dissimilarity.
In this article, for each nuisance parameter, we propose a new minimum loss-based estimator that minimizes the loss-specific empirical risk over its parameter space under the additional constraint that the variation norm is bounded by a set constant. The constant is selected with cross-validation. We show that these MLEs can be represented as minimizers of the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute values of the coefficients is bounded by the constant: i.e., the variation norm corresponds with the $L_1$-norm of the vector of coefficients. We will refer to this estimator as the highly adaptive Lasso (HAL) estimator. We prove that for all models the HAL-estimator converges at a rate faster than $n^{-1/4}$ w.r.t. the square-root of the loss-based dissimilarity. This even holds if the model only assumes that the true nuisance parameters have a finite variation norm. As a corollary of the general oracle inequality for cross-validation, we will then show that a super-learner including this HAL-estimator in its library is guaranteed to converge to its true counterparts at the same rate as this HAL-estimator (and thus faster than $n^{-1/4}$). By also including a large variety of other estimators in the library of the super-learner, the super-learner will also have excellent practical performance for finite samples relative to competing estimators. Based on this fundamental result for the HAL-estimator and the super-learner, we proceed in this article with proving a general theorem for asymptotic efficiency of the one-step TMLE for arbitrary statistical models. In this article we will use a one-step cross-validated TMLE (CV-TMLE), which avoids the Donsker-class entropy condition on the nuisance parameter space, in order to further minimize the conditions for asymptotic efficiency [5, 15]. In our accompanying technical report we present the analogous results for the one-step TMLE.
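The representation of the HAL fit as an $L_1$-constrained regression can be sketched numerically. The following is a minimal one-dimensional illustration with squared-error loss, placing indicator-basis knots at the observed data points; the penalized (rather than hard-constrained) Lasso form, the simple coordinate-descent solver, and all function names are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def indicator_basis(x, knots):
    # Zero-order basis phi_j(x) = 1{x >= knot_j}; a linear combination of these
    # is a cadlag step function whose variation norm (beyond the intercept)
    # equals the L1-norm of the coefficient vector.
    return (x[:, None] >= knots[None, :]).astype(float)

def lasso_cd(X, y, lam, n_iter=300):
    # Plain coordinate-descent Lasso for (1/2n)||y - b0 - X b||^2 + lam * ||b||_1.
    n, p = X.shape
    b0, b = y.mean(), np.zeros(p)
    col_ms = (X ** 2).mean(axis=0)
    for _ in range(n_iter):
        r = y - b0 - X @ b
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            r += X[:, j] * b[j]                  # remove j-th contribution
            rho = (X[:, j] * r).mean()
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ms[j]
            r -= X[:, j] * b[j]                  # add back updated contribution
        b0 = (y - X @ b).mean()
    return b0, b

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=200))
y = (x >= 0.5).astype(float) + rng.normal(scale=0.1, size=200)
X = indicator_basis(x, x)                        # knots at the data points
b0, b = lasso_cd(X, y, lam=0.01)
fit = b0 + X @ b                                 # fitted cadlag step function
```

In this sketch the fitted function recovers the true jump at 0.5, and the $L_1$-norm of the coefficients, i.e., the variation norm of the fit, stays close to the true variation norm of 1.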
Beyond establishing these fundamental theoretical general results, we will also discuss the practical implementation of the HAL-estimator and corresponding TMLE.
Before we start the main part of this article, in this section we will first introduce an example, and use this example to provide the reader with a guide through the different sections.
Let be a -dimensional random variable consisting of a -dimensional vector of baseline covariates , binary treatment and binary outcome . We observe i.i.d. copies of . Let and . Let be the marginal cumulative probability distribution of , and . Let the statistical model be of the form , where is a possibly restricted set, and is nonparametric. The only key assumption we will enforce on and is that for each , and are cadlag functions in on a set , and that the variation norm of these functions and are bounded. The definition of variation norm will be presented in the next section. Suppose that assumes that only depends on through a subset of covariates of dimension : if , then this does not represent an assumption.
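To make the example concrete, here is one hypothetical data-generating process with this structure; the covariate dimension, treatment mechanism, and outcome regression below are our own illustrative choices, not implied by the model.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def draw_example_data(n, seed=0):
    # One hypothetical distribution with the example's structure:
    # baseline covariates W, binary treatment A, binary outcome Y.
    rng = np.random.default_rng(seed)
    w = rng.uniform(size=(n, 3))                 # 3-dimensional W, illustrative
    g1 = expit(w[:, 0] - 0.5)                    # treatment mechanism P(A = 1 | W)
    a = rng.binomial(1, g1)
    qbar = expit(a + w.sum(axis=1) - 2.0)        # outcome regression P(Y = 1 | A, W)
    y = rng.binomial(1, qbar)
    return w, a, y

w, a, y = draw_example_data(1000)
```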
Our target parameter is defined by . For notational convenience, we will use for both mappings and . It is well known that is pathwise differentiable so that for each 1-dimensional parametric submodel through with score at , we have
for some , where is the Hilbert space of functions of with mean zero endowed with inner product . Here we use the notation . Such an object is called a gradient at of the pathwise derivative. The unique gradient that is also an element of the tangent space is defined as the canonical gradient. The tangent space at is defined as the closure of the linear span of the set of scores of the class of -dimensional parametric submodels we consider. In this example the canonical gradient at is given by:
Let and and note that .
An estimator of is asymptotically efficient (among the class of all regular estimators) if and only if it is asymptotically linear with influence curve equal to the canonical gradient :
where is the empirical probability distribution of . Therefore, the canonical gradient is also called the efficient influence curve.
We have that
where , , and the second order remainder is defined as follows:
Of course, .
We define the following two log-likelihood loss functions for and , respectively:
We also define the corresponding Kullback-Leibler dissimilarities , , and . Here represents an easy to estimate parameter which we will estimate with the empirical probability distribution of .
Let the submodel be defined by the extra restriction that and -a.e.
If we were to replace the log-likelihood loss (which becomes unbounded if approximates 0 or 1) by a squared error loss , then the restriction in the definition of could be removed. Given a sequence as , we can define a sequence of models which grows from below to as . By assumption, there exists an so that for we have .
Let and be the corresponding parameter spaces for and , respectively, and specifically, , while .
Let and be initial estimators of , respectively, where denotes a nonparametric model so that the estimator is defined for all realizations of the empirical probability distribution. Let be the estimator of . For a given cross-validation scheme , let be the empirical probability distributions of the validation sample and training sample , respectively. It is assumed that the proportion of observations in the validation sample (i.e., ) is between and for some . Let and be the estimators applied to the training sample . Given a , consider the universal least favorable submodel (van der Laan and Gruber, 2015)
through at , where . We indeed have for all . Given a , consider also the local least favorable submodel
through at . Indeed, . This local least favorable submodel implies the following universal least favorable submodel (van der Laan and Gruber, 2015): for
This universal least favorable submodel implies a recursive construction of for all -values, by starting at and moving upwards. For negative values of , we define . For all , , which shows that this is indeed a universal least favorable submodel for .
Let , and . The score equation for shows that . Let and . The score equation for shows that , which implies
The CV-TMLE of is defined as , where . By eq. (2) this implies that the CV-TMLE can also be represented as:
Note that this latter representation proves that we never have to carry out the TMLE-update step for , but that the CV-TMLE is a simple empirical mean of over the validation sample, averaged across the different splits .
We also conclude that this one-step CV-TMLE solves the crucial cross-validated efficient influence curve equation
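For this example, the fluctuation step of the TMLE can be sketched as follows. This is a minimal single-sample illustration (the CV-TMLE instead averages such updates across validation samples), using deliberately crude, hypothetical initial estimators; the simulated distribution and all names are ours.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_treatment_specific_mean(y, a, qbar1, g1, n_newton=30):
    # One-step TMLE of psi = E_W E(Y | A = 1, W): fluctuate the initial outcome
    # regression along logit(qbar1) + eps * H with clever covariate H = A / g1,
    # solving sum_i H_i (Y_i - qbar1_eps,i) = 0 in eps by Newton steps.
    h = a / g1
    off = logit(qbar1)
    eps = 0.0
    for _ in range(n_newton):
        p = expit(off + eps * h)
        score = np.sum(h * (y - p))
        info = np.sum(h ** 2 * p * (1.0 - p))
        eps += score / info
    # Plug-in step: evaluate the updated regression at A = 1 (so H = 1 / g1).
    return np.mean(expit(off + eps / g1))

# Crude illustrative inputs; real initial estimators would be a super-learner.
rng = np.random.default_rng(2)
n = 2000
w = rng.uniform(size=n)
g1 = 0.2 + 0.6 * w                               # true P(A = 1 | W), taken as known here
a = rng.binomial(1, g1)
qbar1_true = 0.2 + 0.5 * w                       # true E(Y | A = 1, W); psi_0 = 0.45
y = rng.binomial(1, np.where(a == 1, qbar1_true, 0.3))
qbar1_init = np.clip(np.full(n, y[a == 1].mean()), 0.01, 0.99)  # misspecified initial fit
psi = tmle_treatment_specific_mean(y, a, qbar1_init, g1)
```

Even with the misspecified (constant) initial outcome regression, the update along the clever covariate with the correct treatment mechanism yields an estimate near the truth, illustrating the double robustness of the plug-in TMLE in this example.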
Section 3: Formulation of the general estimation problem. The goal of this article goes far beyond establishing asymptotic efficiency of the CV-TMLE eq. (3) in this example. Therefore, we start in Section 3 by defining a general model and general target parameter, essentially generalizing the above notation for this example. Having read the above example, the presentation in Section 3 of a very general estimation problem will be easier to follow. Our subsequent definitions and results for the HAL-estimator, the HAL-super-learner, and the CV-TMLE in Sections 4-6 apply to this general model and target parameter, thereby establishing asymptotic efficiency of the CV-TMLE for an enormously large class of semi-parametric statistical estimation problems, including our example as a special case.
By the Cauchy-Schwarz inequality and bounding by , we can bound the second order remainder as follows:
where . Suppose we can construct estimators and of and so that and for some , . Since the size of the training sample is proportional to the sample size , this immediately implies and . In addition, it is easy to show (as we will formally establish in general) that the rate of convergence of the initial estimator carries over to its targeted version, so that . Thus, with such initial estimators, we obtain
Thus, by selecting so that , we obtain .
Section 4: Construction and analysis of an -specific HAL-estimator that converges at a rate faster than $n^{-1/4}$. The challenge of constructing such estimators and is addressed in Section 4. In the context of our example, in Section 4 we define a minimum loss estimator (MLE) that minimizes the empirical risk over all cadlag functions with variation norm smaller than . In Section 4 we then show that, if is chosen larger than the variation norm of , converges to zero at a faster rate than for some (for each dimension ). We provide an explicit representation eq. (17) of a cadlag function with finite variation norm as an infinite linear combination of indicator functions for which the sum of the absolute values of the coefficients is bounded by . As a consequence, it is shown in Appendix D that this -specific minimum loss-based estimator can be approximated by (or can be exactly defined as) a Lasso-generalized linear regression problem in which the sum of the absolute values of the coefficients is bounded by . Therefore, we will refer to as the -specific HAL-estimator. Our proof of Lemma 1 in Section 4, which establishes the rate of convergence of the -specific HAL-estimator, relies on an empirical process result by  that expresses the upper bound for this rate of convergence in terms of the entropy of the model space of . The representation eq. (17) demonstrates that the set of cadlag functions that have variation norm smaller than a constant is a difference of “convex” hulls of indicator functions, and, as a consequence of a general convex hull result in , this proves that it is a Donsker class with a specified upper bound on its entropy. In this way, we obtain an explicit entropy bound for our model space . Given this explicit upper bound for the entropy, the result in  establishes a rate of convergence of the -specific HAL-estimator faster than for a specified .
By selecting larger than the unknown variation norm of the true nuisance parameter value, we obtain an HAL-estimator that converges at a rate faster than $n^{-1/4}$.
Section 5: Construction and analysis of an HAL-super-learner. Instead of assuming that the variation norm of is bounded by a known and using the corresponding -specific HAL-estimator, in Section 5 we define a collection of such -specific estimators for a set of -values whose maximum value converges to infinity as sample size converges to infinity. We then use cross-validation to data-adaptively select . We show that the resulting cross-validation-selected estimator of will be asymptotically equivalent with the oracle (i.e., best w.r.t. loss-based dissimilarity) choice. This follows from a previously established oracle inequality for the cross-validation selector, as long as the supremum norm bound on the loss function at the candidate estimators does not grow too fast to infinity as a function of sample size (e.g., [11, 13]). By using such a data-adaptively selected bound one obtains an estimator with better practical performance, and it avoids having to know an upper bound . As a consequence, our statistical model does not need to assume a universal bound on the variation norm of the nuisance parameters; it only needs to assume that each nuisance parameter value has a finite variation norm. For the sake of finite sample performance, we want to use a super-learner that uses cross-validation to select an estimator from a library of candidate estimators that includes these -specific estimators as candidates, beyond other candidate estimators. In this way, the choice of estimator will be adapted to what works well for the actual data set. Therefore, in Section 5, we actually define such a general super-learner, and Theorem 2 states that it will converge at least as fast as the best choice in the library, and thus certainly as fast as the -specific HAL-estimator using equal to the true variation norm of . We refer to a super-learner whose library includes this collection of -specific HAL-estimators as an HAL-super-learner.
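The cross-validation selector described above can be sketched generically: given a library of candidate fitting routines, compute the V-fold cross-validated risk of each and select the minimizer (the discrete super-learner). In this sketch, polynomial fits of increasing degree stand in for the index-specific HAL-estimators; all names are illustrative.

```python
import numpy as np

def cv_selector(x, y, candidates, v=5, seed=0):
    # Cross-validation selector: V-fold cross-validated squared-error risk
    # for each candidate in the library; return the risk-minimizing name.
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(x)) % v
    risks = {}
    for name, fit in candidates.items():
        sq_err = np.empty(len(x))
        for k in range(v):
            tr, va = fold != k, fold == k
            predict = fit(x[tr], y[tr])          # fit on the training sample
            sq_err[va] = (y[va] - predict(x[va])) ** 2
        risks[name] = sq_err.mean()              # validation-sample risk
    return min(risks, key=risks.get), risks

def poly_candidate(degree):
    # Candidate estimator indexed by polynomial degree, a stand-in for the
    # variation-norm bound indexing the HAL-estimators in the library.
    def fit(xtr, ytr):
        coefs = np.polyfit(xtr, ytr, degree)
        return lambda xnew: np.polyval(coefs, xnew)
    return fit

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=400)
y = x ** 2 + rng.normal(scale=0.1, size=400)
library = {"deg0": poly_candidate(0), "deg1": poly_candidate(1), "deg2": poly_candidate(2)}
best, risks = cv_selector(x, y, library)
```

For this quadratic truth, the selector picks the quadratic candidate, mimicking how the oracle inequality guarantees performance comparable to the best candidate in the library.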
We will use an analogous HAL-super-learner of (Theorem 6).
The convergence results for this super-learner in terms of the Kullback-Leibler loss-based dissimilarities also imply corresponding results for $L^2$-convergence, as needed to control the second order remainder eq. (6): see Lemma 4.
Section 6: Construction and analysis of the HAL-CV-TMLE. To control the remainder we need to understand the behavior of the updated initial estimator instead of the initial estimator itself. In our example, since the updated estimator only involves a single updating step of the initial estimator, using a cross-validated MLE selector of , we can easily show that converges at the same rate to as the initial estimator . In general, in Section 6 we define a one-step CV-TMLE for our general model and target parameter so that the targeted version of the initial estimator of converges at the same rate as the initial HAL-super-learner estimator . (Since the initial estimator is an HAL-super-learner, we refer to this type of CV-TMLE as an HAL-CV-TMLE.) This concerns a choice of least favorable submodel for which the CV-TMLE-step separately updates each of the components of the initial estimator . We then show that with this choice of least favorable submodel the CV-TMLE-step preserves the convergence rate of the initial estimator (Lemma 3). We also establish in Appendix D that the one-step CV-TMLE already solves the desired cross-validated efficient influence curve equation (4) up to an -term, so that an iterative CV-TMLE can be avoided (Lemma 13 and Lemma 14). At that point, we have shown that the generalized analogue of eq. (7) indeed holds with a specified . In the final subsection of Section 6, Theorem 1 then establishes the asymptotic efficiency of the HAL-CV-TMLE, which also involves analyzing the cross-validated empirical process term, specifically, showing that
This will hold under weak conditions, given that we have estimators that converge at specified rates to their true counterparts and that, for each split , conditional on the training sample, the empirical process is indexed by a finite dimensional (i.e., dimension of ) class of functions.
Section 7: Returning to our example. In Section 7 we return to our example to present a formal Theorem 2 with specified conditions, involving an application of our general efficiency Theorem 1 in Section 6.
Appendix: Various technical results are presented in the Appendix.
Let be independent and identically distributed copies of a -dimensional random variable with probability distribution that is known to be an element of a statistical model .
Let be a one-dimensional target parameter, so that is the estimand of interest we aim to learn from the observations . We assume that is pathwise differentiable at any with canonical gradient : for a specified rich class of one-dimensional submodels through at and score , we have
Our goal in this article is to construct a substitution estimator (i.e., a TMLE for a targeted estimator of ) that is asymptotically efficient under minimal conditions.
Relevant nuisance parameters and their loss functions: Let be a nuisance parameter of so that for some , so that only depends on through . Let be the parameter space of this parameter . Suppose that has components, and are variation independent parameters . Let be the parameter space of . Thus, the parameter space of is a cartesian product . In addition, suppose that for , for specified loss functions . Let represent parameters that require data adaptive estimation trading off variance and bias (e.g., densities), while represents an easy-to-estimate parameter for which we have an empirical estimator available with negligible bias. In our treatment-specific mean example above , where the easy-to-estimate parameter was the probability distribution of , which is naturally estimated with the empirical probability distribution. The parameter will be estimated with our proposed loss-based HAL-super-learner. In the special case that each of the components of requires a super-learner-type estimator, we define as empty (or equivalently, a known value), and in that case . We define corresponding loss-based dissimilarities , . We assume that for a known rate of convergence . Let
be the collection of these loss-based dissimilarities. We use the notation for the vector of loss-based dissimilarities for .
Suppose that only depends on through and an additional nuisance parameter . In the special case that only depends on through , we define as empty (or equivalently, as a known value). Let be a collection of -variation independent parameters of for some integer . Thus the parameter space of is a cartesian product , where is the parameter space of . Let for a loss function , and let be the corresponding loss-based dissimilarity, . Let represent an easy-to-estimate parameter for which we have a well behaved and understood estimator available. The parameter will be estimated with our proposed HAL-super-learner.
We assume that for a known rate of convergence . As above, let be the collection of these loss-based dissimilarities, and let , where . In the special case that each requires a super-learner-based estimator, we define as empty, and .
We also define
as the vector of loss-based dissimilarities. We will also use the short-hand notation for .
as the vector of -loss functions for , and similarly we define
We will also use the notation and . We will assume that is a convex function in the sense that, for any , , , for each
when and . Similarly, we assume is a convex function. Our results for the TMLE generalize to non-convex loss functions, but the convexity of the loss functions allows a nicer representation for the super-learner oracle inequality, and in most applications a natural convex loss function is available.
We will abuse notation by also denoting and with and , respectively. A special case is that does not depend on an additional nuisance parameter : for example, if , is nonparametric, and is the integral of the square of the Lebesgue density of , then the canonical gradient is given by , so that one would define , and there is no .
Second order remainder for target parameter: We define the second order remainder as follows:
We will also denote with to indicate that it involves differences between and and and , beyond possibly some additional dependence on . In our experience, this remainder can be represented as a sum of terms of the type for some functionals and , where, typically, and represent functions of or . In certain classes of problems we have that only involves cross-terms of the type , so that if either or . In these cases, we say that the efficient influence curve is double robust w.r.t. misspecification of and :
Given the above double robustness property of the canonical gradient (i.e., of the target parameter), if solves , and either or , then . This allows for the construction of so-called double robust estimators of that will be consistent if either the estimator of is consistent or the estimator of is consistent.
Support of data distribution: The support of is defined as a set so that . It is assumed that for each , for some finite . We define
so that for all , where is allowed, in which case . That is, is an upper bound of all the supports, and the model states that the support of the data structure is known to be contained in .
Cadlag functions on , supremum norm and variation norm: Suppose is finite, and, in fact, if is not finite, then we will apply the definitions below to a that is finite and converges to . Let be the Banach space of -variate real valued cadlag functions (right-continuous with left-hand limits) . For a , let be the supremum norm. For a , we define the variation norm of  as
For a subset , , , and the in the above definition of the variation norm is over all subsets of . In addition, is the -specific section of that sets the coordinates in the complement of equal to . Note that is the sum of variation norms of -specific sections of (including itself). Therefore, one might refer to this norm as the sectional variation norm, but, for convenience, for the purpose of this article, we will just refer to it as variation norm. If , then we can, in fact, represent as follows :
where is the measure generated by the cadlag function . For a , let
denote the set of cadlag functions with variation norm bounded by .
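On a fine grid, the (sectional) variation norm can be approximated directly from the definition. A sketch for the cases d = 1 and d = 2 follows; for d = 2 the three terms beyond the value at the origin are the variations of the two one-dimensional sections and of the two-dimensional measure generated by the function (function names are ours).

```python
import numpy as np

def variation_norm_1d(grid, f):
    # d = 1: |f(0)| plus the total variation of f, approximated by summing
    # absolute increments over a fine grid.
    vals = f(grid)
    return abs(vals[0]) + np.abs(np.diff(vals)).sum()

def variation_norm_2d(grid1, grid2, f):
    # d = 2 sectional variation norm: |f(0,0)|, the variation of the sections
    # f(., 0) and f(0, .), and the variation of the two-dimensional measure
    # generated by f (absolute mixed second-order differences).
    F = f(grid1[:, None], grid2[None, :])
    sec1 = np.abs(np.diff(F[:, 0])).sum()
    sec2 = np.abs(np.diff(F[0, :])).sum()
    mixed = np.abs(np.diff(np.diff(F, axis=0), axis=1)).sum()
    return abs(F[0, 0]) + sec1 + sec2 + mixed

g = np.linspace(0.0, 1.0, 201)
```

For example, on the unit square f(x, y) = x + y has only section variation (norm 2), while f(x, y) = x * y has only mixed variation (norm 1).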
Cartesian product of cadlag function spaces, and its component-wise operations: Let be the product Banach space of -dimensional where each , . If , then we define as a vector whose -th component equals the supremum norm of the -th component of . Similarly we define a variation norm of as a vector
of variation norms. If , then is a vector whose components are the -norms of the components of . Generally speaking, in this paper any operation on a function , such as taking a norm , an expectation , operations on a pair of functions , such as , , or an inequality , is carried out component wise: for example, and . In a similar manner, for an , let denote the cartesian product. This general notation allows us to present results with minimal notation, avoiding the need to continuously enumerate all the components.
Our results will hold for general models and pathwise differentiable target parameters, as long as the statistical model satisfies the following key smoothness assumption:
For each , , , , , , and , have a finite supremum and variation norm.
Definition of bounds on the statistical model: The properties of the super-learner and TMLE rely on bounds on the model . Our estimators will also allow for unbounded models by using a sieve of models whose finite bounds slowly approximate the actual model bounds as sample size converges to infinity. These bounds are defined as follows:
Note that and are defined as vectors of constants, a constant for each component of and , respectively. The bounds guarantee excellent properties of the cross-validation selector based on the loss-function (e.g., [11, 13]). A bound on shows that the loss-based dissimilarity behaves as a square of a difference between and . Similarly, the bounds control the behavior of the cross-validation selector based on the loss function .
Bounded and Unbounded Models: We will call the model bounded if it is a model for which (i.e., universally bounded support), , , , ,
are finite. In words, a bounded model is a model for which the support and the supremum norm
of , , , and are uniformly (over the model) bounded. Any model that is not bounded will be called an unbounded model.
Sequence of bounded submodels approximating the unbounded model: For an unbounded model , our initial estimators of are defined in terms of a sequence of bounded submodels that are increasing in and approximate the actual model as converges to infinity. The counterparts of the above defined universal bounds on applied to are denoted with , , , , . The conditions of our general asymptotic efficiency Theorem 1 will enforce that these bounds converge slowly enough to infinity (in the case the corresponding true model bound is infinity).
This model could be defined as the largest subset of for which these latter bounds apply. By Assumption 1, with this choice of definition of , for any , there exists an , so that for . Either way, we assume that is defined such that the latter is true.
Let and be the parameter spaces of and under model , and let and be the parameter spaces of and . We define the following true parameters corresponding with this model :
We will assume that is chosen so that and , where . That is, our sieve does not affect the estimation of the “easy” nuisance parameters and . Note that for , we have and .
In this paper our initial estimators of and are always enforced to be in the parameter spaces of this sequence of models , but if the model is already bounded, then one can set for all . However, even for bounded models , the utilization of a sequence of submodels with stronger universal bounds than could result in finite sample improvements (e.g., if the universal bounds on are very large relative to sample size and the dimension of the data).
Let be given. Our -specific HAL-estimator of is defined as the minimizer of the empirical risk over for which has a variation norm bounded by (see eq. (21)). The rate of convergence of a minimum empirical risk estimator is driven by the rate at which the covering number of the parameter space over which one minimizes grows (e.g., ). This explains why the rate of growth of the covering number of this set of functions defines a minimal rate of convergence for this HAL-estimator (while will be selected with the cross-validation selector). Similarly, this applies to our HAL-estimator of . In the next subsection we define the relevant covering numbers and their rates , and establish an upper bound on them. Subsequently, we establish in Lemma 1 the minimal rate of convergence of the HAL-estimator in terms of these rates .
We remind the reader that a covering number is defined as the minimal number of balls of size w.r.t. -norm that are needed to cover the set of functions embedded in . Let and be such that for fixed
where , , and
The minimal rates of convergence of our HAL-estimator of and are defined in terms of and , respectively.
By eq. (17) it follows that any cadlag function with finite variation norm can be represented as a difference of two bounded monotone increasing functions (i.e., cumulative distribution functions). The class of -variate monotone increasing/cumulative distribution functions is a convex hull of -variate indicator functions, which is again concretely implied by the representation eq. (17) by noting that . Thus, consists of a difference of two convex hulls of -variate indicator functions. By Theorem 2.6.9 in , which maps the covering number of a set of functions into a covering number of the convex hull of these functions, for a fixed , we have that the universal covering number of is bounded as follows:
where . Let be the vector of integers indicating the dimension of the domain of , and similarly, let be the vector of integers indicating the dimension of the domain of . Since with , with , we have that and .
Lemma 1 below proves that the minimal rates and of our HAL-estimator of and w.r.t. the loss-based dissimilarities and are given by:
Let and be the rates of the simple estimators and of and , respectively. This defines and .
For a given vector of constants, let be the set of all functions in the parameter space for for which the variation norm of its loss is smaller than . (In this definition one can also incorporate some extra -constraints, as long as .) Let be so that . Assume that for a fixed ,
Consider an estimator for which
where . Then
Proof: We have
which proves eq. (22). Since falls in a -Donsker class , it follows that the right-hand side is , and thus . Since , this also implies that . By empirical process theory we have that if falls in a -Donsker class with probability tending to 1, and as . Applying this to shows that , which proves .
We now apply Lemma 7 with , (see eq. (19)), envelope bound and , which proves that
This proves .
Defining the library of candidate estimators: For an , let be the HAL-estimator eq. (21) and let . By Lemma 1 we have , assuming that the numerical approximation error is of smaller order. Let be an ordered collection of -dimensional constants, and consider the corresponding collection of candidate estimators with . We impose that this index set is increasing in such that equals , so that for any , there exists an such that for , .
Note that for all with , we have that . In addition, let , be an additional collection of estimators of . For example, these candidate estimators could include a variety of parametric-model-based as well as machine-learning-based estimators. This defines an index set representing a collection of candidate estimators .
Super Learner: Let denote a random cross-validation scheme that randomly splits the sample into a training sample and a validation sample . Let denote the proportion of observations in the validation sample. We impose throughout the article that for some and that this random vector has a finite number of possible realizations for a fixed . In addition, will denote the empirical probability distributions of the validation and training samples, respectively. Thus, the cross-validated risk of an estimator of is defined as .
We define the cross-validation selector as the index
that minimizes the cross-validated risk over all choices of candidate estimators. Our proposed super-learner is defined by
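In code, the cross-validation selector and the resulting discrete super-learner can be sketched as follows (a toy NumPy sketch under our own simplifying assumptions: squared-error loss and a library of two hypothetical candidates, the training-sample mean and a least-squares line; the function names are ours, not the paper's).

```python
import numpy as np

def cv_selector(X, Y, candidates, n_splits=5, seed=0):
    """Cross-validation selector: return the index of the candidate estimator
    minimizing the cross-validated empirical risk (here: squared-error loss).
    Each candidate is a function (X_train, Y_train) -> prediction function."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(Y)) % n_splits   # balanced random fold labels
    risks = np.zeros(len(candidates))
    for v in range(n_splits):
        tr, va = folds != v, folds == v          # training / validation split
        for k, fit in enumerate(candidates):
            pred = fit(X[tr], Y[tr])(X[va])      # train on tr, evaluate on va
            risks[k] += np.mean((Y[va] - pred) ** 2)
    return int(np.argmin(risks)), risks / n_splits

# two toy candidates: the training-sample mean and a least-squares line
mean_fit = lambda X, Y: (lambda Xv: np.full(len(Xv), Y.mean()))
ols_fit = lambda X, Y: (lambda Xv, b=np.polyfit(X, Y, 1): np.polyval(b, Xv))

rng = np.random.default_rng(1)
X = rng.uniform(size=200)
Y = 2.0 * X + rng.normal(scale=0.1, size=200)
k, risks = cv_selector(X, Y, [mean_fit, ols_fit])
print(k)  # → 1 (the linear fit wins on linear data)
```

The discrete super-learner then simply refits the selected candidate on the full sample; the oracle inequality cited below guarantees that this selector performs asymptotically as well as the best candidate in the library.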
The following lemma proves that the super-learner converges to at least at the rate at which the HAL-estimator converges to : . This lemma also shows that the super-learner is either asymptotically equivalent with the oracle-selected candidate estimator, or achieves the parametric rate of a correctly specified parametric model.
Recall the definition of the model bounds eq. (18), and let .
For any fixed ,
If for each fixed , divided by is , then
If for each fixed , , then
Suppose that for each finite , the conditions of Lemma 1 hold with negligible numerical approximation error , so that . Let be chosen so that . For each fixed , we have
The proof of this lemma is a simple corollary of the finite sample oracle inequality for cross-validation [11, 13, 21, 33, 34], also presented in Lemma 5 in Section A of the Appendix. It uses the convexity of the loss function to bring the inside the loss-based dissimilarity.
In the Appendix we present the analogous super-learner eq. (37) of and its corresponding Lemma 6.
Cross-validated TMLE (CV-TMLE) robustifies the bias-reduction of the TMLE-step by selecting based on the cross-validated risk [5, 15]. In the next subsection we define the CV-TMLE. In this subsection we propose a particular type of local least favorable submodel that separately updates the initial estimator of for each . Due to this choice, in subsection 2 we now easily establish that the CV-TMLE of converges at the same rate to as the initial estimator, which is important for control of the second order remainder in the asymptotic efficiency proof of the CV-TMLE. In subsection 3 we establish the asymptotic efficiency of the CV-TMLE.
Definition of one-step CV-HAL-TMLE for a general local least favorable submodel: Let be the sum loss function. For a given , let be a parametric submodel through at such that the linear span of at includes the canonical gradient . Let and be our initial estimators of and ; we recommend defining them to be the HAL-super-learners defined by eqs (23) and (37), so that and . Given a cross-validation scheme , let be the estimator applied to the training sample . Similarly, let . Let be the above submodel with through at . Let
be the MLE of minimizing the cross-validated empirical risk. This defines as the -specific targeted fit of . The one-step CV-TMLE of is defined as
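For the average causal effect example announced in the introduction, the one-step TMLE update can be sketched concretely as follows (a toy NumPy sketch under our own simplifying assumptions: binary outcome, known propensity score, constant initial outcome regressions, and a non-cross-validated fluctuation for readability; all names are ours, not the paper's).

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_ate_onestep(Q0, Q1, g, A, Y, n_newton=25):
    """One-step TMLE for the average treatment effect (toy sketch): fluctuate
    the initial outcome regressions (Q0, Q1) on the logit scale along the
    clever covariate H(a, w) = a/g(w) - (1-a)/(1-g(w)); the fluctuation
    parameter eps is the MLE for the logistic loss, found here by 1-D Newton
    iteration on its score equation."""
    H = A / g - (1 - A) / (1 - g)                 # clever covariate at observed (A, W)
    QA = np.where(A == 1, Q1, Q0)                 # initial fit at observed treatment
    eps = 0.0
    for _ in range(n_newton):
        p = expit(logit(QA) + eps * H)
        score = np.mean(H * (Y - p))              # efficient-score component in Y
        hess = -np.mean(H ** 2 * p * (1 - p))
        eps -= score / hess
    Q1s = expit(logit(Q1) + eps / g)              # targeted counterfactual fits
    Q0s = expit(logit(Q0) - eps / (1 - g))
    return np.mean(Q1s - Q0s), eps                # plug-in targeted ATE estimate

rng = np.random.default_rng(2)
n = 500
g = rng.uniform(0.2, 0.8, n)                      # (assumed known) propensity score
A = rng.binomial(1, g)
Y = rng.binomial(1, np.where(A == 1, 0.6, 0.4))
Q0 = np.full(n, 0.5)                              # deliberately biased initial fits
Q1 = np.full(n, 0.5)
ate, eps = tmle_ate_onestep(Q0, Q1, g, A, Y)
print(round(ate, 3))
```

After the update, the empirical mean of the clever-covariate score is (numerically) zero, which is the finite-sample analogue of the efficient influence curve equation that the efficiency theorem below relies on; the CV-TMLE differs only in that eps minimizes the cross-validated rather than the full-sample empirical risk.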
One-step CV-HAL-TMLE solves cross-validated efficient score equation: Our efficiency Theorem 1 assumes that
That is, it is assumed that the one-step CV-TMLE already solves the cross-validated efficient influence curve equation up to an asymptotically negligible approximation error. By definition of , it solves its score equation , which provides a basis for verifying eq. (25). As formalized by Lemma 13 in Appendix D, for our choice of -consistent initial estimators of , a one-step CV-TMLE will satisfy eq. (25) for one-dimensional local least favorable submodels under weak regularity conditions. We believe that such a result can be proved in great generality for arbitrary (also multivariate) local least favorable submodels. Instead, below we propose a particular class of multivariate local least favorable submodels eq. (26) for which we establish eq. (25) under regularity conditions. In (van der Laan and Gruber, 2015) it is shown that one can always construct a so-called universal least favorable submodel through with a one-dimensional such that holds exactly at each , independent of the properties of the initial estimator .
One-step CV-HAL-TMLE preserves fast rate of convergence of initial estimator: Our efficiency Theorem 1 also assumes that the updated estimator satisfies for each split . This is generally a very reasonable condition given that for a specified . Our proposed class of local least favorable submodels eq. (26) below guarantees that the rate of convergence of the initial estimator is completely preserved by , so that this condition is automatically guaranteed to hold.
A class of multivariate local least favorable submodels that separately updates each nuisance parameter component: One way to guarantee that is to make sure that the updated estimator converges as fast to as the initial estimator . For that purpose we propose a -dimensional local least favorable submodel of the type
for , and where . By using such a submodel we have and . Thus, in this case is updated with its own , . The advantage of such a least favorable submodel is that the one-step update of is not affected by the statistical behavior of the other estimators , . On the other hand, if one uses a local least favorable submodel with a single , the MLE is very much driven by the worst performing estimator . Lemma 3 below shows that, by using such a -variate local least favorable submodel satisfying eq. (26), the rate of convergence of the initial estimator is fully preserved by the TMLE-update.
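The separate-update idea can be sketched in code (a toy NumPy sketch, our own construction: each component gets its own hypothetical clever covariate, outcome, and logistic submodel, and its own fluctuation parameter solved by 1-D Newton iteration).

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def separate_fluctuations(components, n_newton=25):
    """Per-component fluctuation (toy sketch): each nuisance component k has
    its own initial fit Q_k, its own clever covariate H_k and outcome Y_k,
    and is updated along its own one-dimensional logistic submodel with its
    own eps_k solving its own score equation.  No eps is shared across
    components, so a poorly fitted component cannot distort the updates of
    the others."""
    eps = []
    for Q, H, Y in components:
        e, offset = 0.0, np.log(Q / (1.0 - Q))
        for _ in range(n_newton):              # 1-D Newton on the score equation
            p = expit(offset + e * H)
            e -= np.mean(H * (Y - p)) / (-np.mean(H ** 2 * p * (1 - p)))
        eps.append(e)
    return eps

rng = np.random.default_rng(3)
n = 400
comps = []
for _ in range(2):
    Q = rng.uniform(0.3, 0.7, n)   # initial (possibly misspecified) component fit
    H = rng.normal(size=n)         # component-specific clever covariate
    Y = rng.binomial(1, 0.5, n)    # component-specific binary outcome
    comps.append((Q, H, Y))
eps1, eps2 = separate_fluctuations(comps)
```

Each eps_k solves only its own score equation, so the size of the update for one component is unaffected by how badly the other components are estimated, which is exactly the rate-preservation property exploited below.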
How to construct a local least favorable submodel of type eq. (26): A general approach for constructing such a -variate least favorable submodel is the following. Let be the efficient influence curve at a for the parameter defined by that sets all the other components of with equal to their true values under , . Then, it follows immediately from the definition of the pathwise derivative that
so that, is an element of the linear span of . Let be a one-dimensional submodel through so that
That is, is a local least favorable submodel at for the parameter , . Now, define by . Then, we have
so that the submodel is indeed a local least favorable submodel.
Lemma 14 provides a sufficient set of minor conditions under which the one-step HAL-CV-TMLE using a local least favorable submodel of the type eq. (26) will satisfy eq. (25). Therefore, the class of local least favorable submodels eq. (26) yields both crucial conditions for the HAL-CV-TMLE: it solves eq. (25) and it preserves the rate of convergence of the initial estimator.
Consider the submodel of the type eq. (26) presented above. Given an initial estimator , recall the definition of as the fluctuated version of the initial estimator applied to the training sample, and . We want to show that converges to at the same rate as the initial estimator (and thus also ). The following lemma establishes this result; it is an immediate consequence of the oracle inequality of the cross-validation selector for the loss function , applied to the set of candidate estimators indexed by , for each .
By convexity of the loss function , this implies
Thus, if for some and for each , then
It then also follows that for each , .
We have the following theorem.
Consider the above defined corresponding one-step CV-TMLE of .
Initial estimator conditions: Consider the HAL-super-learners and defined by eqs (23) and (37), respectively, and recall that we are given simple estimators and of and . Let and be chosen so that and . Assume the conditions of Theorem 2 and Theorem 6, so that we have
where and . Let and be the corresponding estimators of and , respectively.
“Preserve rate of convergence of initial estimator”-condition: In addition, assume that either (Case A) the CV-TMLE uses a local least favorable submodel of the type eq. (26) so that Lemma 3 applies, or (Case B) assume that for each split for some .
Efficient influence curve score equation condition and second order remainder condition: Define and the class of functions . Assume
In Case A, for verification of assumption eq. (27) one could apply Lemma 14. In Case A, for verification of the two assumptions eqs (28) and (29) one can use that for each of the realizations of , and . In Case B, for verification of the latter two assumptions eqs (28) and (29) one can use that for each of the realizations of , and .
Then, is asymptotically efficient:
Condition eq. (32) will practically always hold trivially for equal to the dimension of : note that this is even true for unbounded models due to the normalizing constant . We already discussed the crucial condition eq. (27) in our subsection defining the CV-TMLE. Conditions eqs (30) and (31) are easily satisfied by controlling the speed at which the model bounds converge to infinity, and are always true for bounded models (as long as the size of the library of the super-learner behaves as a polynomial power of the sample size). For bounded models , condition eq. (28) will typically hold with and equal to the minimum of the components of and : i.e., the efficient influence curve estimator will converge to its true counterpart as fast as the slowest converging nuisance parameter estimator. If the model is unbounded so that the model bounds of the sieve converge to infinity, then eq. (28) will hold with for some converging to infinity (e.g., ). So, in the latter case one has to control the rate at which the model bounds of the sieve , such as the supremum-norm bound for the efficient influence curve, converge to infinity. Finally, the crucial condition eq. (29) will easily hold for bounded models if this slowest rate is larger than , which we know to be true for the HAL-estimator and its super-learner. For unbounded models, condition eq. (29) puts a serious brake on the speed at which the model bounds of can converge to infinity.
Proof: By assumptions eqs (30) and (31),
Suppose so that and . By the identity , we have
Combining this with eq. (27) yields the following identity:
By assumption eq. (29) we have that . Thus, we have shown
We now note
Thus, it remains to prove that . For this we apply Lemma 10 with