2 Example: Treatment-specific mean in a nonparametric model
Before presenting the main part of this article, we first introduce an example in this section and use it to guide the reader through the different sections.
2.1 Defining the statistical estimation problem
Let $O=(W,A,Y)\sim P_0$ be a $d$-dimensional random variable consisting of a $(d-2)$-dimensional vector of baseline covariates $W$, a binary treatment $A\in\{0,1\}$, and a binary outcome $Y\in\{0,1\}$. We observe $n$ i.i.d. copies $O_1,\ldots,O_n$ of $O\sim P_0$. Let $\bar{Q}(P)(W)=E_P(Y\mid A=1,W)$ and $\bar{G}(P)(W)=E_P(A\mid W)$. Let $Q_2(P)$ be the marginal cumulative probability distribution of $W$, and let $Q=(Q_1=\bar{Q},Q_2)$. Let the statistical model be of the form $\mathcal{M}=\{P:G(P)\in\mathcal{G},Q(P)\in\mathcal{Q}\}$, where $\mathcal{G}$ is a possibly restricted set and $\mathcal{Q}$ is nonparametric. The only key assumption we enforce on $\mathcal{Q}$ and $\mathcal{G}$ is that, for each $P\in\mathcal{M}$, $W\mapsto\bar{Q}(P)(W)$ and $W\mapsto\bar{G}(P)(W)$ are cadlag functions in $W$ on a set $[0,\tau_P]\subset\mathbb{R}^{d-2}$ [17], and that the variation norms of these functions $\bar{Q}(P)$ and $\bar{G}(P)$ are bounded. The definition of the variation norm is presented in the next section. Suppose that $\mathcal{G}$ assumes that $\bar{G}$ depends on $W$ only through a subset of covariates of dimension $d_2\leq d-2$; if $d_2=d-2$, this does not represent an assumption.
Our target parameter $\Psi:\mathcal{M}\to\mathbb{R}$ is defined by $\Psi(P)=\int\bar{Q}(w)\,dQ_2(w)\equiv\Psi_1(Q_1=\bar{Q},Q_2)$. For notational convenience, we will use $\Psi$ for both mappings $\Psi$ and $\Psi_1$. It is well known that $\Psi$ is pathwise differentiable, so that for each one-dimensional parametric submodel $\{P_\epsilon:\epsilon\}\subset\mathcal{M}$ through $P$ with score $S$ at $\epsilon=0$, we have

$\frac{d}{d\epsilon}\Psi(P_\epsilon)\big|_{\epsilon=0}=PD(P)S=\int_o D(P)(o)S(o)\,dP(o)$,

for some $D(P)\in L^2(P)$, where $L^2(P)$ is the Hilbert space of functions of $O$ with mean zero, endowed with the inner product $\langle f,g\rangle_P=Pfg$. Here we use the notation $Pf\equiv\int f(o)\,dP(o)$. Such an object $D(P)$ is called a gradient at $P$ of the pathwise derivative. The unique gradient that is also an element of the tangent space $T(P)$ is defined as the canonical gradient. The tangent space $T(P)$ at $P$ is defined as the closure of the linear span of the set of scores of the class of one-dimensional parametric submodels we consider. In this example the canonical gradient $D^*(P)=D^*(Q(P),G(P))$ at $P$ is given by:

$D^*(Q,G)(O)=\frac{A}{\bar{G}(W)}(Y-\bar{Q}(W))+\bar{Q}(W)-\Psi(Q)$.

Let $D_1^*(Q,G)=A/\bar{G}(W)\,(Y-\bar{Q}(W))$ and $D_2^*(Q)=\bar{Q}(W)-\Psi(Q)$, and note that $D^*(Q,G)=D_1^*(Q,G)+D_2^*(Q)$.
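As a concrete illustration, the efficient influence curve above can be evaluated directly from plug-in estimates of $\bar{Q}$ and $\bar{G}$, with $\Psi(Q)$ computed under the empirical distribution of $W$. The following minimal numpy sketch (the function name `eic` and the toy data are ours, not from the paper) computes one value of $D^*(Q,G)$ per observation:

```python
import numpy as np

def eic(A, Y, Qbar, Gbar):
    """D*(Q,G)(O) = A/Gbar(W) * (Y - Qbar(W)) + Qbar(W) - Psi(Q),
    with Psi(Q) the mean of Qbar(W) under the empirical distribution of W."""
    psi = Qbar.mean()
    return A / Gbar * (Y - Qbar) + Qbar - psi

# toy data: d = 3, so W is 1-dimensional here
rng = np.random.default_rng(0)
n = 1000
W = rng.uniform(size=n)
Gbar = 0.2 + 0.6 * W                 # plug-in estimate of P(A=1 | W)
A = rng.binomial(1, Gbar)
Qbar = 0.3 + 0.4 * W                 # plug-in estimate of E(Y | A=1, W)
Y = rng.binomial(1, Qbar)
D = eic(A, Y, Qbar, Gbar)            # one value of D*(Q,G) per observation
```

Solving the efficient influence curve equation then amounts to making the (cross-validated) empirical mean of these values zero, which is exactly what the TMLE update in Section 2.2 accomplishes.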
An estimator $\psi_n$ of $\psi_0=\Psi(P_0)$ is asymptotically efficient (among the class of all regular estimators) if and only if it is asymptotically linear with influence curve equal to the canonical gradient $D^*(P_0)$ [1]:

$\psi_n-\psi_0=P_nD^*(P_0)+o_P(n^{-1/2})$,

where $P_n$ is the empirical probability distribution of $O_1,\ldots,O_n$. Therefore, the canonical gradient is also called the efficient influence curve.
We have that

(1) $\Psi(P)-\Psi(P_0)=(P-P_0)D^*(Q,G)+R_{20}((\bar{Q},\bar{G}),(\bar{Q}_0,\bar{G}_0))$,

where $Q=Q(P)$, $G=G(P)$, and the second-order remainder $R_{20}()$ is defined as follows:

$R_{20}((\bar{Q},\bar{G}),(\bar{Q}_0,\bar{G}_0))\equiv\int\frac{\bar{G}(w)-\bar{G}_0(w)}{\bar{G}(w)}\,(\bar{Q}(w)-\bar{Q}_0(w))\,dP_0(w)$.

Of course, $PD^*(Q,G)=0$.
We define the following log-likelihood loss functions for $\bar{Q}$, $Q_2$ and $\bar{G}$, respectively:

$L_{11}(\bar{Q})(O)=-A\{Y\log\bar{Q}(W)+(1-Y)\log(1-\bar{Q}(W))\}$;
$L_{12}(Q_2)(O)=-\log dQ_2(W)$;
$L_2(\bar{G})(O)=-\{A\log\bar{G}(W)+(1-A)\log(1-\bar{G}(W))\}$.

We also define the corresponding Kullback-Leibler dissimilarities $d_{10,1}(\bar{Q},\bar{Q}_0)=P_0\{L_{11}(\bar{Q})-L_{11}(\bar{Q}_0)\}$, $d_{10,2}(Q_2,Q_{20})=P_0\{L_{12}(Q_2)-L_{12}(Q_{20})\}$, and $d_{20}(\bar{G},\bar{G}_0)=P_0\{L_2(\bar{G})-L_2(\bar{G}_0)\}$. Here $Q_2$ represents an easy-to-estimate parameter, which we will estimate with the empirical probability distribution $Q_{2n}=\hat{Q}_2(P_n)$ of $W_1,\ldots,W_n$.
Let the submodel $\mathcal{M}(\delta)\subset\mathcal{M}$ be defined by the extra restriction that $\delta<\bar{Q}(W)<1-\delta$ and $\bar{G}(W)>\delta$, $P_0$-a.e. If we were to replace the log-likelihood loss $L_{11}(\bar{Q})$ (which becomes unbounded as $\bar{Q}$ approaches 0 or 1) by the squared-error loss $A(Y-\bar{Q}(W))^2$, then one could remove the restriction $\delta<\bar{Q}(W)<1-\delta$ in the definition of $\mathcal{M}(\delta)$. Given a sequence $\delta_n\to 0$ as $n\to\infty$, we can define a sequence of models $\mathcal{M}_n=\mathcal{M}(\delta_n)$ which grows from below to $\mathcal{M}$ as $n\to\infty$. By assumption, there exists an $N_0=N(P_0)<\infty$ so that for $n>N_0$ we have $P_0\in\mathcal{M}_n$.
Let $\mathcal{Q}_n=\mathcal{Q}_{1n}\times\mathcal{Q}_{2n}$ and $\mathcal{G}_n$ be the corresponding parameter spaces for $Q=(\bar{Q},Q_2)$ and $\bar{G}$, respectively; specifically, $\mathcal{Q}_{1n}=\{\bar{Q}:\delta_n<\bar{Q}<1-\delta_n\}$, while $\mathcal{Q}_{2n}=\mathcal{Q}_2$.
2.2 One-step CV-TMLE
Let $\hat{\bar{Q}}:\mathcal{M}_{nonp}\to\mathcal{Q}_{1n}$ and $\hat{\bar{G}}:\mathcal{M}_{nonp}\to\mathcal{G}_n$ be initial estimators of $\bar{Q}_0$ and $\bar{G}_0$, respectively, where $\mathcal{M}_{nonp}$ denotes a nonparametric model, so that the estimator is defined for all realizations of the empirical probability distribution. Let $\hat{Q}:\mathcal{M}_{nonp}\to\mathcal{Q}_n$ be the estimator $\hat{Q}(P_n)=(\hat{\bar{Q}}(P_n),\hat{Q}_2(P_n))$ of $Q_0=(\bar{Q}_0,Q_{20})$. For a given cross-validation scheme $B_n\in\{0,1\}^n$, let $P_{n,B_n}^1$ and $P_{n,B_n}^0$ be the empirical probability distributions of the validation sample $\{O_i:B_n(i)=1\}$ and the training sample $\{O_i:B_n(i)=0\}$, respectively. It is assumed that the proportion of observations in the validation sample (i.e., $\sum_i B_n(i)/n$) is between $\delta$ and $1-\delta$ for some $0<\delta<1$. Let $Q_{n,B_n}=(\bar{Q}_{n,B_n},Q_{2n,B_n})=\hat{Q}(P_{n,B_n}^0)$ and $\bar{G}_{n,B_n}=\hat{\bar{G}}(P_{n,B_n}^0)$ be the estimators applied to the training sample $P_{n,B_n}^0$. Given a $(\bar{Q},\bar{G})$, consider the universal least favorable submodel (van der Laan and Gruber, 2015)

$\mathrm{Logit}\,\bar{Q}_{\epsilon_1}=\mathrm{Logit}\,\bar{Q}+\epsilon_1 H_{\bar{G}}$

through $\bar{Q}$ at $\epsilon_1=0$, where $H_{\bar{G}}(W)=1/\bar{G}(W)$. We indeed have $\frac{d}{d\epsilon_1}L_{11}(\bar{Q}_{\epsilon_1})=D_1^*(\bar{Q}_{\epsilon_1},\bar{G})$ for all $\epsilon_1$. Given a $Q=(\bar{Q},Q_2)$, consider also the local least favorable submodel

$dQ_{2,\epsilon_2}^{lfm}(W)=dQ_2(W)\left(1+\epsilon_2 D_2^*(Q)(W)\right)$

through $Q_2$ at $\epsilon_2=0$. Indeed, $\frac{d}{d\epsilon_2}L_{12}(Q_{2,\epsilon_2}^{lfm})\big|_{\epsilon_2=0}=D_2^*(\bar{Q},Q_2)$. This local least favorable submodel implies the following universal least favorable submodel (van der Laan and Gruber, 2015): for $\epsilon_2\geq 0$,

$dQ_{2,\epsilon_2}=dQ_2\exp\left(\int_0^{\epsilon_2}D_2^*(\bar{Q},Q_{2,x})\,dx\right)$.

This universal least favorable submodel implies a recursive construction of $Q_{2,\epsilon}$ for all $\epsilon$ values, by starting at $\epsilon=0$ and moving upwards. For negative values of $\epsilon_2$, we define $\int_0^{\epsilon_2}=\int_{\epsilon_2}^0$. For all $\epsilon_2$, $\frac{d}{d\epsilon_2}L_{12}(Q_{2,\epsilon_2})=D_2^*(\bar{Q},Q_{2,\epsilon_2})$, which shows that this is indeed a universal least favorable submodel for $Q_2$.
Let $\epsilon_{1n}=\arg\min_{\epsilon_1}E_{B_n}P_{n,B_n}^1L_{11}(\bar{Q}_{n,B_n,\epsilon_1})$, and let $\bar{Q}_{n,B_n}^*=\bar{Q}_{n,B_n,\epsilon_{1n}}$. The score equation for $\epsilon_{1n}$ shows that $E_{B_n}P_{n,B_n}^1D_1^*(\bar{Q}_{n,B_n}^*,\bar{G}_{n,B_n})=0$. Let $\epsilon_{2n}=\arg\min_{\epsilon_2}E_{B_n}P_{n,B_n}^1L_{12}(Q_{2n,B_n,\epsilon_2})$ and $Q_{2n,B_n}^*=Q_{2n,B_n,\epsilon_{2n}}$. The score equation for $\epsilon_{2n}$ shows that $E_{B_n}P_{n,B_n}^1D_2^*(\bar{Q}_{n,B_n}^*,Q_{2n,B_n}^*)=0$, which implies

(2) $E_{B_n}P_{n,B_n}^1\bar{Q}_{n,B_n}^*=E_{B_n}Q_{2n,B_n}^*\bar{Q}_{n,B_n}^*$.
The CV-TMLE of $\Psi(Q_0)$ is defined as $\psi_n^*\equiv E_{B_n}\Psi(Q_{n,B_n}^*)$, where $Q_{n,B_n}^*=(\bar{Q}_{n,B_n}^*,Q_{2n,B_n}^*)$. By eq. (2) this implies that the CV-TMLE can also be represented as:

(3) $\psi_n^*=E_{B_n}P_{n,B_n}^1\bar{Q}_{n,B_n}^*$.

Note that this latter representation proves that we never have to carry out the TMLE update step for $Q_{2n}$; rather, the CV-TMLE is a simple empirical mean of $\bar{Q}_{n,B_n}^*$ over the validation sample, averaged across the different splits $B_n$.
We also conclude that this one-step CV-TMLE solves the crucial cross-validated efficient influence curve equation

(4) $E_{B_n}P_{n,B_n}^1D^*(Q_{n,B_n}^*,\bar{G}_{n,B_n})=0$.
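To make the above recipe concrete, here is a minimal numpy/scipy/scikit-learn sketch (ours, not the paper's implementation) of the one-step CV-TMLE of eq. (3), using deliberately simple logistic-regression initial estimators in place of the HAL-super-learner of the later sections; since $Q_2$ is estimated with the empirical distribution, only the $\bar{Q}$-update is needed:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit, logit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_tmle(W, A, Y, n_splits=5, delta=0.01):
    """One-step CV-TMLE of Psi(P) = E_W E(Y | A=1, W), via eq. (3)."""
    W = np.asarray(W, float).reshape(len(A), -1)
    Qb, Hb, Av, Yv = [], [], [], []
    for train, val in KFold(n_splits, shuffle=True, random_state=0).split(W):
        treated = train[A[train] == 1]
        Qfit = LogisticRegression().fit(W[treated], Y[treated])
        Gfit = LogisticRegression().fit(W[train], A[train])
        # evaluate the training-sample fits on the validation sample
        Qb.append(np.clip(Qfit.predict_proba(W[val])[:, 1], delta, 1 - delta))
        Hb.append(1.0 / np.clip(Gfit.predict_proba(W[val])[:, 1], delta, 1.0))
        Av.append(A[val]); Yv.append(Y[val])
    Qb, Hb, Av, Yv = map(np.concatenate, (Qb, Hb, Av, Yv))

    # epsilon_1n minimizes the cross-validated empirical risk of L_11 along
    # the least favorable submodel Logit Qbar_eps = Logit Qbar + eps / Gbar
    def cv_risk(eps):
        Qe = np.clip(expit(logit(Qb) + eps * Hb), 1e-10, 1 - 1e-10)
        return -np.mean(Av * (Yv * np.log(Qe) + (1 - Yv) * np.log(1 - Qe)))

    eps1 = minimize_scalar(cv_risk, bounds=(-5, 5), method="bounded").x
    Qstar = expit(logit(Qb) + eps1 * Hb)
    return Qstar.mean()  # eq. (3): mean of updated Qbar* over validation folds

# toy data: the true value is Psi_0 = E expit(W - 0.5) (roughly 0.4)
rng = np.random.default_rng(1)
n = 2000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.5 * W))
Y = rng.binomial(1, expit(W - 0.5))
psi = cv_tmle(W, A, Y)
```

By construction, the fitted $\epsilon_{1n}$ sets the cross-validated empirical mean of $D_1^*$ to (approximately) zero, and averaging $\bar{Q}_{n,B_n}^*$ over the validation folds handles $D_2^*$, as in eqs. (2)-(4).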
2.3 Guide to the article based on this example
Section 3: Formulation of the general estimation problem. The goal of this article goes far beyond establishing asymptotic efficiency of the CV-TMLE eq. (3) in this example. Therefore, we start in Section 3 by defining a general model and a general target parameter, essentially generalizing the above notation for this example. Having read the above example, the presentation in Section 3 of a very general estimation problem will be easier to follow. Our subsequent definitions and results for the HAL-estimator, the HAL-super-learner, and the CV-TMLE in Sections 4-6 then apply to our general model and target parameter, thereby establishing asymptotic efficiency of the CV-TMLE for an enormously large class of semiparametric statistical estimation problems, including our example as a special case.
Let’s now return to our example to point out the specific tasks that are solved in each section of this article. By eqs (1) and (4), we have the following starting identity for the CV-TMLE:
(5) $E_{B_n}\Psi(Q_{n,B_n}^*)-\Psi(Q_0)=E_{B_n}(P_{n,B_n}^1-P_0)D^*(Q_{n,B_n}^*,\bar{G}_{n,B_n})+E_{B_n}R_{20}((\bar{Q}_{n,B_n}^*,\bar{G}_{n,B_n}),(\bar{Q}_0,\bar{G}_0))$.
By the Cauchy-Schwarz inequality and bounding $1/\bar{G}_{n,B_n}$ by $1/\delta_n$, we can bound the second-order remainder as follows:

(6) $\left|E_{B_n}R_{20}((\bar{Q}_{n,B_n}^*,\bar{G}_{n,B_n}),(\bar{Q}_0,\bar{G}_0))\right|\leq\frac{1}{\delta_n}E_{B_n}\|\bar{Q}_{n,B_n}^*-\bar{Q}_0\|_{P_0}\|\bar{G}_{n,B_n}-\bar{G}_0\|_{P_0}$,

where $\|f\|_{P_0}\equiv(P_0f^2)^{1/2}$. Suppose we can construct estimators $\hat{\bar{Q}}$ and $\hat{\bar{G}}$ of $\bar{Q}_0$ and $\bar{G}_0$ so that $\|\bar{Q}_n-\bar{Q}_0\|_{P_0}=O_P(n^{-1/4-\alpha_1})$ and $\|\bar{G}_n-\bar{G}_0\|_{P_0}=O_P(n^{-1/4-\alpha_2})$ for some $\alpha_1>0$, $\alpha_2>0$. Since the training sample size is proportional to the sample size $n$, this immediately implies $\|\bar{G}_{n,B_n}-\bar{G}_0\|_{P_0}=O_P(n^{-1/4-\alpha_2})$ and $\|\bar{Q}_{n,B_n}-\bar{Q}_0\|_{P_0}=O_P(n^{-1/4-\alpha_1})$. In addition, it is easy to show (as we will formally establish in general) that the rate of convergence of the initial estimator $\bar{Q}_{n,B_n}$ carries over to its targeted version, so that $\|\bar{Q}_{n,B_n}^*-\bar{Q}_0\|_{P_0}=O_P(n^{-1/4-\alpha_1})$. Thus, with such initial estimators, we obtain

(7) $E_{B_n}R_{20}((\bar{Q}_{n,B_n}^*,\bar{G}_{n,B_n}),(\bar{Q}_0,\bar{G}_0))=o_P(\delta_n^{-1}n^{-1/2-\alpha_1-\alpha_2})$.
Thus, by selecting $\delta_n$ so that $\delta_n^{-1}n^{-\alpha_1-\alpha_2}\to 0$, we obtain $E_{B_n}R_{20}((\bar{Q}_{n,B_n}^*,\bar{G}_{n,B_n}),(\bar{Q}_0,\bar{G}_0))=o_P(n^{-1/2})$.
Section 4: Construction and analysis of an $\mathcal{M}$-specific HAL-estimator that converges at a rate faster than $n^{-1/4}$. The challenge of constructing such estimators $\hat{\bar{Q}}$ and $\hat{\bar{G}}$ is addressed in Section 4. In the context of our example, in Section 4 we define a minimum loss estimator (MLE) $\bar{Q}_{n,M}=\arg\min_{\|\bar{Q}\|_v<M}P_nL_{11}(\bar{Q})$ that minimizes the empirical risk over all cadlag functions with variation norm smaller than $M$. In Section 4 we then show that, if $M$ is chosen larger than the variation norm of $\bar{Q}_0$, then $d_{10,1}^{1/2}(\bar{Q}_{n,M},\bar{Q}_0)$ converges to zero at a rate faster than $n^{-1/4-\alpha_1}$ for some $\alpha_1=\alpha_1(d)>0$ (for each dimension $d$). We provide an explicit representation eq. (17) of a cadlag function with finite variation norm $M$ as an infinite linear combination of indicator functions for which the sum of the absolute values of the coefficients is bounded by $M$. As a consequence, it is shown in Appendix D that this $M$-specific minimum loss-based estimator can be approximated by (or can be exactly defined as) a Lasso generalized linear regression problem in which the sum of the absolute values of the coefficients is bounded by $M$. Therefore, we will refer to $\bar{Q}_{n,M}$ as the $M$-specific HAL-estimator. Our proof of Lemma 1 in Section 4, which establishes the rate of convergence of the $M$-specific HAL-estimator, relies on an empirical process result by [18] that expresses the upper bound for this rate of convergence in terms of the entropy of the model space $\mathcal{Q}_1$ of $\bar{Q}$. The representation eq. (17) demonstrates that the set of cadlag functions with variation norm smaller than a constant $M$ is a difference of a "convex" hull of indicator functions, and, as a consequence of a general convex hull result in [19], this proves that it is a Donsker class with a specified upper bound on its entropy. In this way, we obtain an explicit entropy bound for our model space $\mathcal{Q}_1$. Given this explicit upper bound on the entropy, the result in [18] establishes a rate of convergence of the $M$-specific HAL-estimator faster than $n^{-1/4-\alpha_1}$ for a specified $\alpha_1>0$. By selecting $M$ larger than the unknown variation norm of the true nuisance parameter value, we obtain an HAL-estimator that converges at a rate faster than $n^{-1/4}$.
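To give a rough sense of the Lasso connection described above, the following is a minimal sketch (ours, not the estimator analyzed in Section 4) of an approximate $M$-specific HAL fit for a univariate $W$: the zero-order indicator basis is generated by the observed support points, and the L1 penalty strength stands in for the variation-norm bound $M$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hal_basis(W, knots):
    # zero-order (indicator) basis: phi_j(w) = 1{w >= knot_j}
    return (W[:, None] >= knots[None, :]).astype(float)

def hal_fit(W, Y, C=0.5):
    """Approximate M-specific HAL via L1-penalized logistic regression;
    the inverse penalty C plays the role of the variation-norm bound M."""
    knots = np.sort(np.unique(W))
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(hal_basis(W, knots), Y)
    return lambda Wnew: model.predict_proba(hal_basis(Wnew, knots))[:, 1]

# toy regression function with a single jump at W = 0.5
rng = np.random.default_rng(2)
W = rng.uniform(size=300)
Y = rng.binomial(1, np.where(W > 0.5, 0.8, 0.2))
Qbar_n = hal_fit(W, Y)
pred = Qbar_n(W)
```

The exact $M$-specific estimator instead minimizes the empirical risk subject to a hard bound $M$ on the sum of the absolute coefficients; the penalized form above is the common computational relaxation.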
Section 5: Construction and analysis of an HAL-super-learner. Instead of assuming that the variation norm of $\bar{Q}_0$ is bounded by a known $M$ and using the corresponding $M$-specific HAL-estimator, in Section 5 we define a collection of such $M$-specific estimators for a set of $M$ values whose maximum converges to infinity as sample size converges to infinity. We then use cross-validation to select $M$ data-adaptively. We show that the resulting cross-validation-selected estimator of $\bar{Q}_0$ is asymptotically equivalent to the oracle (i.e., best w.r.t. loss-based dissimilarity) choice. This follows from a previously established oracle inequality for the cross-validation selector, as long as the supremum-norm bound on the loss function at the candidate estimators does not grow too fast to infinity as a function of sample size (e.g., [11, 13]). By using such a data-adaptively selected $M$, one obtains an estimator with better practical performance, and one avoids having to know an upper bound $M$. As a consequence, our statistical model does not need to assume a universal bound $M$ on the variation norm of the nuisance parameters; it only needs to assume that each nuisance parameter value has a finite variation norm. For the sake of finite sample performance, we want to use a super-learner that uses cross-validation to select an estimator from a library of candidate estimators that includes these $M$-specific estimators as candidates, beyond other candidate estimators. In this way, the choice of estimator will be adapted to what works well for the actual data set. Therefore, in Section 5, we actually define such a general super-learner $\hat{\bar{Q}}$, and Theorem 2 states that it will converge at least as fast as the best choice in the library, and thus certainly as fast as the $M$-specific HAL-estimator using $M$ equal to the true variation norm of $\bar{Q}_0$. We refer to a super-learner whose library includes this collection of $M$-specific HAL-estimators as an HAL-super-learner. We will use an analogous HAL-super-learner of $\bar{G}_0$ (Theorem 6).
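In the same sketch-level spirit, the cross-validation selector of $M$ can be written generically as follows; `cv_select_M` and the indicator-basis fitter are our own illustrative helpers, not the HAL-super-learner defined in Section 5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_select_M(W, Y, fitter, M_grid, n_splits=5):
    """Pick the M in M_grid minimizing the cross-validated log-likelihood risk.
    `fitter(Wtr, Ytr, M)` must return a function w -> estimate of P(Y=1|W=w)."""
    risks = []
    for M in M_grid:
        risk = 0.0
        for tr, va in KFold(n_splits, shuffle=True, random_state=0).split(W):
            p = np.clip(fitter(W[tr], Y[tr], M)(W[va]), 1e-6, 1 - 1e-6)
            risk -= np.mean(Y[va] * np.log(p) + (1 - Y[va]) * np.log(1 - p))
        risks.append(risk / n_splits)
    return M_grid[int(np.argmin(risks))]

def indicator_lasso(Wtr, Ytr, M):
    # L1-penalized indicator-basis fit; the inverse penalty stands in for M
    knots = np.sort(np.unique(Wtr))
    basis = lambda v: (v[:, None] >= knots[None, :]).astype(float)
    f = LogisticRegression(penalty="l1", solver="liblinear", C=M).fit(basis(Wtr), Ytr)
    return lambda v: f.predict_proba(basis(v))[:, 1]

rng = np.random.default_rng(3)
W = rng.uniform(size=400)
Y = rng.binomial(1, np.where(W > 0.5, 0.8, 0.2))
M_n = cv_select_M(W, Y, indicator_lasso, M_grid=[0.01, 0.1, 1.0, 10.0])
```

A full super-learner would additionally include non-HAL candidate estimators in the library and select among all of them with the same cross-validated risk.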
The convergence results for this super-learner in terms of the Kullback-Leibler loss-based dissimilarities also imply corresponding results for the $L^2(P_0)$ convergence needed to control the second-order remainder eq. (6): see Lemma 4.
Section 6: Construction and analysis of the HAL-CV-TMLE. To control the remainder we need to understand the behavior of the updated initial estimator $\bar{Q}_{n,B_n}^*$ instead of the initial estimator $\bar{Q}_{n,B_n}$ itself. In our example, since the updated estimator involves only a single updating step of the initial estimator, using a cross-validated MLE selector of $\epsilon$, we can easily show that $\bar{Q}_{n,B_n}^*$ converges to $\bar{Q}_0$ at the same rate as the initial estimator $\bar{Q}_{n,B_n}$. In general, in Section 6 we define a one-step CV-TMLE for our general model and target parameter so that the targeted version of the initial estimator of $\bar{Q}_0$ converges at the same rate as the initial HAL-super-learner estimator $\bar{Q}_n$. (Since the initial estimator is an HAL-super-learner, we refer to this type of CV-TMLE as an HAL-CV-TMLE.) This concerns a choice of least favorable submodel for which the CV-TMLE step separately updates each of the components of the initial estimator $\hat{Q}$. We then show that with this choice of least favorable submodel the CV-TMLE step preserves the convergence rate of the initial estimator (Lemma 3). We also establish in Appendix D that the one-step CV-TMLE already solves the desired cross-validated efficient influence curve equation (4) up to an $o_P(n^{-1/2})$ term, so that an iterative CV-TMLE can be avoided (Lemma 13 and Lemma 14). At that point, we have shown that the generalized analogue of eq. (7) indeed holds with specified $\alpha_1>0$, $\alpha_2>0$. In the final subsection of Section 6, Theorem 1 then establishes the asymptotic efficiency of the HAL-CV-TMLE, which also involves analyzing the cross-validated empirical process term, specifically, showing that
(8) $E_{B_n}(P_{n,B_n}^1-P_0)D^*(Q_{n,B_n}^*,\bar{G}_{n,B_n})=(P_n-P_0)D^*(Q_0,\bar{G}_0)+o_P(n^{-1/2})$.
.
This will hold under weak conditions, given that we have estimators
Q
n
,
B
n
∗
,
G
n
,
B
n
that converge at specified rates to their true counterparts and that, for each split
B
n
, conditional on the training sample, the empirical process is indexed by a finite dimensional (i.e., dimension of
ϵ
) class of functions.
Section 7: Returning to our example. In Section 7 we return to our example to present a formal Theorem 2 with specified conditions, involving an application of our general efficiency Theorem 1 in Section 6.
Appendix: Various technical results are presented in the Appendix.
3 Statistical formulation of the estimation problem
Let $O_1,\ldots,O_n$ be $n$ independent and identically distributed copies of a $d$-dimensional random variable $O$ with probability distribution $P_0$, which is known to be an element of a statistical model $\mathcal{M}$.
Let $\Psi:\mathcal{M}\to\mathbb{R}$ be a one-dimensional target parameter, so that $\psi_0=\Psi(P_0)$ is the estimand of interest we aim to learn from the $n$ observations $o_1,\ldots,o_n$. We assume that $\Psi$ is pathwise differentiable at any $P\in\mathcal{M}$ with canonical gradient $D^*(P)$: for a specified rich class of one-dimensional submodels $\{P_\epsilon:\epsilon\in(-\delta,\delta)\}\subset\mathcal{M}$ through $P$ at $\epsilon=0$ with score $S=\frac{d}{d\epsilon}\log dP_\epsilon/dP\big|_{\epsilon=0}$, we have

$\frac{d}{d\epsilon}\Psi(P_\epsilon)\big|_{\epsilon=0}=PD^*(P)S\equiv\int_o D^*(P)(o)S(o)\,dP(o)$.
Our goal in this article is to construct a substitution estimator (i.e., a TMLE $\Psi(P_n^*)$ for a targeted estimator $P_n^*$ of $P_0$) that is asymptotically efficient under minimal conditions.
Relevant nuisance parameters $Q$, $G$ and their loss functions: Let $Q(P)$ be a nuisance parameter of $P$ such that $\Psi(P)=\Psi_1(Q(P))$ for some $\Psi_1$, so that $\Psi(P)$ depends on $P$ only through $Q(P)$. Let $\mathcal{Q}=Q(\mathcal{M})=\{Q(P):P\in\mathcal{M}\}$ be the parameter space of this parameter $Q:\mathcal{M}\to\mathcal{Q}$. Suppose that $Q(P)=(Q_j(P):j=1,\ldots,k_1+1)$ has $k_1+1$ components, and that $Q_j:\mathcal{M}\to\mathcal{Q}_j$ are variation-independent parameters, $j=1,\ldots,k_1+1$. Let $\mathcal{Q}_j=Q_j(\mathcal{M})$ be the parameter space of $Q_j$. Thus, the parameter space of $Q$ is a cartesian product $\mathcal{Q}=\prod_{j=1}^{k_1+1}\mathcal{Q}_j$. In addition, suppose that for $j=1,\ldots,k_1+1$, $Q_j(P_0)=\arg\min_{Q_j\in\mathcal{Q}_j}P_0L_{1j}(Q_j)$ for specified loss functions $(O,Q_j)\mapsto L_{1j}(Q_j)(O)$. Let $\bar{Q}=(Q_1,\ldots,Q_{k_1})$ represent the parameters that require data-adaptive estimation trading off variance and bias (e.g., densities), while $Q_{k_1+1}$ represents an easy-to-estimate parameter for which we have an empirical estimator $\hat{Q}_{k_1+1}$ available with negligible bias. In our treatment-specific mean example above, $Q=(Q_1=\bar{Q},Q_2)$, where the easy-to-estimate parameter $Q_2$ was the probability distribution of $W$, which is naturally estimated with the empirical probability distribution. The parameter $\bar{Q}(P_0)$ will be estimated with our proposed loss-based HAL-super-learner. In the special case that each of the components of $Q$ requires a super-learner-type estimator, we define $Q_{k_1+1}$ as empty (or, equivalently, as a known value), and in that case $Q=\bar{Q}$. We define the corresponding loss-based dissimilarities $d_{10j}(Q_j,Q_{j0})=P_0L_{1j}(Q_j)-P_0L_{1j}(Q_{j0})$, $j=1,\ldots,k_1+1$. We assume that $d_{10(k_1+1)}(\hat{Q}_{k_1+1}(P_n),Q_{(k_1+1)0})=O_P(r_{Q,k_1+1}(n))$ for a known rate of convergence $r_{Q,k_1+1}(n)$. Let

(9) $d_{10}(Q,Q_0)=(d_{10j}(Q_j,Q_{j0}):j=1,\ldots,k_1+1)$

be the collection of these $k_1+1$ loss-based dissimilarities. We use the notation $d_{10}(\bar{Q},\bar{Q}_0)=(d_{10j}(Q_j,Q_{j0}):j=1,\ldots,k_1)$ for the vector of $k_1$ loss-based dissimilarities for $\bar{Q}$.
Suppose that $D^*(P)$ depends on $P$ only through $Q(P)$ and an additional nuisance parameter $G(P)$. In the special case that $D^*(P)$ depends on $P$ only through $Q(P)$, we define $G$ as empty (or, equivalently, as a known value). Let $G=(G_1,\ldots,G_{k_2+1})$ be a collection of $(k_2+1)$ variation-independent parameters of $G$ for some integer $k_2+1\geq 1$. Thus the parameter space of $G$ is a cartesian product $\mathcal{G}=\prod_{j=1}^{k_2+1}\mathcal{G}_j$, where $\mathcal{G}_j$ is the parameter space of $G_j:\mathcal{M}\to\mathcal{G}_j$. Let $G_{j0}=\arg\min_{G\in\mathcal{G}_j}P_0L_{2j}(G_j)$ for a loss function $(O,G_j)\mapsto L_{2j}(G_j)(O)$, and let $d_{2j0}(G_j,G_{j0})=P_0L_{2j}(G_j)-P_0L_{2j}(G_{j0})$ be the corresponding loss-based dissimilarity, $j=1,\ldots,k_2+1$. Let $G_{k_2+1}$ represent an easy-to-estimate parameter for which we have a well-behaved and well-understood estimator $\hat{G}_{k_2+1}$ available. The parameter $\bar{G}(P_0)$ will be estimated with our proposed HAL-super-learner. We assume that $d_{20(k_2+1)}(\hat{G}_{k_2+1}(P_n),G_{(k_2+1)0})=O_P(r_{G,k_2+1}(n))$ for a known rate of convergence $r_{G,k_2+1}(n)$. As above, let $d_{20}(G,G_0)=(d_{20j}(G_j,G_{j0}):j=1,\ldots,k_2+1)$ be the collection of these loss-based dissimilarities, and let $d_{20}(\bar{G},\bar{G}_0)=(d_{20j}(G_j,G_{j0}):j=1,\ldots,k_2)$, where $\bar{G}=(G_1,\ldots,G_{k_2})$. In the special case that each $G_j$ requires a super-learner-based estimator, we define $G_{k_2+1}$ as empty, and $G=\bar{G}$.
We also define

(10) $d_0((Q,G),(Q_0,G_0))=\left(d_{10j_1}(Q_{j_1},Q_{j_10}),d_{20j_2}(G_{j_2},G_{j_20}):j_1,j_2\right)$

as the vector of $k_1+k_2+2$ loss-based dissimilarities. We will also use the shorthand notation $d_0(P,P_0)$ for $d_0((Q,G),(Q_0,G_0))$.
We define

(11) $L_1(Q)=(L_{1j}(Q_j):j=1,\ldots,k_1+1)$

as the vector of $k_1+1$ loss functions for $Q=(Q_1,\ldots,Q_{k_1+1})$, and similarly we define

(12) $L_2(G)=(L_{2j}(G_j):j=1,\ldots,k_2+1)$.
We will also use the notation $L_1(\bar{Q})=(L_{1j}(Q_j):j=1,\ldots,k_1)$ and $L_2(\bar{G})=(L_{2j}(G_j):j=1,\ldots,k_2)$. We will assume that $\bar{Q}\mapsto L_1(\bar{Q})$ is a convex function in the sense that, for any $\bar{Q}^1=(Q_j^1:j=1,\ldots,k_1),\ldots,\bar{Q}^m=(Q_j^m:j=1,\ldots,k_1)$, for each $j=1,\ldots,k_1$,

(13) $P_0L_{1j}\left(\sum_{k=1}^m\alpha_kQ_j^k\right)\leq\sum_{k=1}^m\alpha_kP_0L_{1j}(Q_j^k)$

when $\sum_k\alpha_k=1$ and $\min_k\alpha_k\geq 0$. Similarly, we assume $\bar{G}\mapsto L_2(\bar{G})$ is a convex function. Our results for the TMLE generalize to nonconvex loss functions, but the convexity of the loss functions allows a nicer representation for the super-learner oracle inequality, and in most applications a natural convex loss function is available.
We will abuse notation by also denoting $\Psi(P)$ and $D^*(P)$ by $\Psi(Q)$ and $D^*(Q,G)$, respectively. A special case is that $D^*(P)=D^*(Q(P))$ does not depend on an additional nuisance parameter $G$: for example, if $O\in\mathbb{R}$, $\mathcal{M}$ is nonparametric, and $\Psi(P)=\int p(o)^2do$ is the integral of the square of the Lebesgue density $p$ of $P$, then the canonical gradient is given by $D^*(P)=2p-2\Psi(P)$, so that one would define $Q(P)=p$, and there is no $G$.
Second-order remainder for the target parameter: We define the second-order remainder $R_2(P,P_0)$ as follows:

(14) $R_2(P,P_0)\equiv\Psi(P)-\Psi(P_0)+P_0D^*(P)$.
We will also denote $R_2(P,P_0)$ by $R_{20}((Q,G),(Q_0,G_0))$ to indicate that it involves differences between $Q$ and $Q_0$ and between $G$ and $G_0$, beyond possibly some additional dependence on $P_0$. In our experience, this remainder $R_2(P,P_0)$ can be represented as a sum of terms of the type $\int(H_1(P)-H_1(P_0))(H_2(P)-H_2(P_0))f(P,P_0)\,dP_0(o)$ for some functionals $H_1$, $H_2$ and $f$, where, typically, $H_1(P)$ and $H_2(P)$ represent functions of $Q(P)$ or $G(P)$. In certain classes of problems we have that $R_2(P,P_0)$ involves only cross-terms of the type $\int(H_1(Q)-H_1(Q_0))(H_2(G)-H_2(G_0))f(P,P_0)\,dP_0$, so that $R_{20}((Q,G),(Q_0,G_0))=0$ if either $Q=Q_0$ or $G=G_0$. In these cases, we say that the efficient influence curve is double robust w.r.t. misspecification of $Q_0$ and $G_0$: $P_0D^*(P)=\Psi(P_0)-\Psi(P)$ if $G(P)=G(P_0)$ or $Q(P)=Q(P_0)$.
Given the above double robustness property of the canonical gradient (i.e., of the target parameter), if $P$ solves $P_0D^*(P)=0$, and either $G(P)=G_0$ or $Q(P)=Q_0$, then $\Psi(P)=\Psi(P_0)$. This allows for the construction of so-called double robust estimators of $\psi_0$ that are consistent if either the estimator of $Q_0$ or the estimator of $G_0$ is consistent.
Support of the data distribution: The support of $P\in\mathcal{M}$ is defined as a set $\mathcal{O}_P\subset\mathbb{R}^d$ so that $P(\mathcal{O}_P)=1$. It is assumed that for each $P\in\mathcal{M}$, $\mathcal{O}_P\subset[0,\tau_P]$ for some finite $\tau_P\in\mathbb{R}^d_{>0}$. We define

(15) $\tau=\sup_{P\in\mathcal{M}}\tau_P$,

so that $[0,\tau_P]\subset[0,\tau]$ for all $P\in\mathcal{M}$, where $\tau=\infty$ is allowed, in which case $[0,\tau]\equiv\mathbb{R}^d_{\geq 0}$. That is, $[0,\tau]$ is an upper bound on all the supports, and the model $\mathcal{M}$ states that the support of the data structure $O$ is known to be contained in $[0,\tau]$.
Cadlag functions on $[0,\tau]$, supremum norm and variation norm: Suppose $\tau$ is finite; if $\tau$ is not finite, then we will apply the definitions below to a $\tau=\tau_n$ that is finite and converges to $\tau$. Let $\mathbb{D}[0,\tau]$ be the Banach space of $d$-variate real-valued cadlag functions (right-continuous with left-hand limits) [17]. For an $f\in\mathbb{D}[0,\tau]$, let $\|f\|_\infty=\sup_{x\in[0,\tau]}|f(x)|$ be the supremum norm. For an $f\in\mathbb{D}[0,\tau]$, we define the variation norm of $f$ [20] as

(16) $\|f\|_v=|f(0)|+\sum_{s\subset\{1,\ldots,d\}}\int_{(0_s,\tau_s]}|f(dx_s,0_{-s})|$.
For a subset $s\subset\{1,\ldots,d\}$, $x_s=(x_j:j\in s)$, $x_{-s}=(x_j:j\notin s)$, and the $\sum_s$ in the above definition of the variation norm is over all subsets of $\{1,\ldots,d\}$. In addition, $x_s\mapsto f(x_s,0_{-s})$ is the $s$-specific section of $x\mapsto f(x)$ that sets the coordinates in the complement of $s$ equal to $0$. Note that $\|f\|_v$ is the sum of the variation norms of the $s$-specific sections of $f$ (including $f$ itself). Therefore, one might refer to this norm as the sectional variation norm, but, for convenience, for the purpose of this article, we will simply refer to it as the variation norm. If $\|f\|_v<\infty$, then we can, in fact, represent $f$ as follows [20]:

(17) $f(x)=f(0)+\sum_{s\subset\{1,\ldots,d\}}\int_{(0_s,x_s]}f(du_s,0_{-s})$,

where $f(du_s,0_{-s})$ is the measure generated by the cadlag function $u_s\mapsto f(u_s,0_{-s})$. For an $M\in\mathbb{R}_{\geq 0}$, let $\mathcal{F}_{v,M}=\{f\in\mathbb{D}[0,\tau]:\|f\|_v<M\}$ denote the set of cadlag functions $f:[0,\tau]\to\mathbb{R}$ with variation norm bounded by $M$.
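For numerical intuition about eq. (16), the sectional variation norm can be approximated by accumulating the absolute increments of each section over a grid. The following sketch (our own helper, assuming $d=2$ and function values given on a rectangular grid whose first grid point is $0$ in each coordinate) illustrates this:

```python
import numpy as np

def variation_norm_2d(F):
    """Approximate sectional variation norm (16) for d = 2 from values
    F[i, j] = f(x_i, y_j) on a grid whose first point is 0 in each axis."""
    dF1 = np.abs(np.diff(F[:, 0])).sum()   # section x -> f(x, 0)
    dF2 = np.abs(np.diff(F[0, :])).sum()   # section y -> f(0, y)
    # full section: absolute mass of the mixed increments f(dx, dy)
    d2F = np.abs(np.diff(np.diff(F, axis=0), axis=1)).sum()
    return abs(F[0, 0]) + dF1 + dF2 + d2F

# check on f(x, y) = x + y over [0, 1]^2: each 1-d section has variation 1,
# the mixed increments vanish, and f(0) = 0, so the norm is 2
x = np.linspace(0, 1, 101)
F = x[:, None] + x[None, :]
v = variation_norm_2d(F)
```

The exact norm is the limit of this quantity as the grid is refined; for piecewise-constant or smooth $f$ of bounded sectional variation, a moderately fine grid already gives a close approximation.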
Cartesian product of cadlag function spaces, and its componentwise operations: Let $\mathbb{D}^k[0,\tau]$ be the product Banach space of $k$-dimensional $(f_1,\ldots,f_k)$ where each $f_j\in\mathbb{D}[0,\tau]$, $j=1,\ldots,k$. If $f\in\mathbb{D}^k[0,\tau]$, then we define $\|f\|_\infty=(\|f_j\|_\infty:j=1,\ldots,k)$ as a vector whose $j$th component equals the supremum norm of the $j$th component $f_j$ of $f$. Similarly, we define a variation norm of $f\in\mathbb{D}^k[0,\tau]$ as a vector $\|f\|_v=(\|f_j\|_v:j=1,\ldots,k)$ of variation norms. If $f\in\mathbb{D}^k[0,\tau]$, then $\|f\|_{P_0}=(\|f_j\|_{P_0}:j=1,\ldots,k)$ is a vector whose components are the $L^2(P_0)$ norms of the components of $f$. Generally speaking, in this paper any operation on a function $f\in\mathbb{D}^k[0,\tau]$, such as taking a norm $\|f\|_{P_0}$ or an expectation $P_0f$, and any operation on a pair of functions $f,g\in\mathbb{D}^k[0,\tau]$, such as $f/g$, $f\times g$, $\max(f,g)$, or an inequality $f<g$, is carried out componentwise: for example, $\max(f,g)=(\max(f_j,g_j):j=1,\ldots,k)$ and $\inf_{Q\in\mathcal{Q}}P_0L_1(Q)=(\inf_{Q_j\in\mathcal{Q}_j}P_0L_{1j}(Q_j):j=1,\ldots,k_1+1)$. In a similar manner, for an $M\in\mathbb{R}^k_{>0}$, let $\mathcal{F}_{v,M}=\prod_{j=1}^k\mathcal{F}_{v,M_j}$ denote the cartesian product. This general notation allows us to present results with minimal notation, avoiding the need to continuously enumerate all the components.
Our results will hold for general models and pathwise differentiable target parameters, as long as the statistical model satisfies the following key smoothness assumption:

Assumption 1 (Smoothness Assumption)
For each $P\in\mathcal{M}$, $\bar{Q}=\bar{Q}(P)\in\mathbb{D}^{k_1}[0,\tau]$, $\bar{G}=\bar{G}(P)\in\mathbb{D}^{k_2}[0,\tau]$, $D^*(P)=D^*(Q,G)\in\mathbb{D}[0,\tau]$, $L_1(\bar{Q})\in\mathbb{D}^{k_1}[0,\tau]$, $L_2(\bar{G})\in\mathbb{D}^{k_2}[0,\tau]$, and $\bar{Q}$, $\bar{G}$, $D^*(P)$, $L_1(\bar{Q})$, $L_2(\bar{G})$ have finite supremum and variation norms.
Definition of bounds on the statistical model: The properties of the super-learner and the TMLE rely on bounds on the model $\mathcal{M}$. Our estimators will also allow for unbounded models by using a sieve of models whose finite bounds slowly approximate the actual model bound as sample size converges to infinity. These bounds are defined as follows:

(18)
$\tau=\tau(\mathcal{M})=\sup_{P\in\mathcal{M}}\tau(P)$,
$M_1^Q=M_1^Q(\mathcal{M})=\sup_{Q,Q_0\in\mathcal{Q}}\|L_1(\bar{Q})-L_1(\bar{Q}_0)\|_\infty$,
$M_2^Q=M_2^Q(\mathcal{M})=\sup_{P,P_0\in\mathcal{M}}\|L_1(\bar{Q})-L_1(\bar{Q}_0)\|_{P_0}/\{d_{10}(\bar{Q},\bar{Q}_0)\}^{1/2}$,
$M_1^G=M_1^G(\mathcal{M})=\sup_{G,G_0\in\mathcal{G}}\|L_2(\bar{G})-L_2(\bar{G}_0)\|_\infty$,
$M_2^G=M_2^G(\mathcal{M})=\sup_{P,P_0\in\mathcal{M}}\|L_2(\bar{G})-L_2(\bar{G}_0)\|_{P_0}/\{d_{20}(\bar{G},\bar{G}_0)\}^{1/2}$,
$M_{D^*}=M_{D^*}(\mathcal{M})=\sup_{P\in\mathcal{M}}\|D^*(P)\|_\infty$.
Note that $M_1^Q,M_2^Q\in\mathbb{R}^{k_1}_{\geq 0}$ and $M_1^G,M_2^G\in\mathbb{R}^{k_2}_{\geq 0}$ are defined as vectors of constants, one constant for each component of $\bar{Q}$ and $\bar{G}$, respectively. The bounds $M_1^Q,M_2^Q$ guarantee excellent properties of the cross-validation selector based on the loss function $L_1(\bar{Q})$ (e.g., [11, 13]). A bound on $M_2^Q$ shows that the loss-based dissimilarity $d_{10}(\bar{Q},\bar{Q}_0)$ behaves as the square of a difference between $\bar{Q}$ and $\bar{Q}_0$. Similarly, the bounds $M_1^G,M_2^G$ control the behavior of the cross-validation selector based on the loss function $L_2(\bar{G})$.
Bounded and unbounded models: We call the model $\mathcal{M}$ bounded if $\tau<\infty$ (i.e., universally bounded support) and $M_1^Q$, $M_2^Q$, $M_1^G$, $M_2^G$, $M_{D^*}$ are finite. In words, a bounded model is in essence a model for which the support and the supremum norms of $\bar{Q}(P)$, $\bar{G}(P)$, $L_1(\bar{Q})$, $L_2(\bar{G})$ and $D^*(Q,G)$ are uniformly (over the model) bounded. Any model that is not bounded will be called an unbounded model.
Sequence of bounded submodels approximating the unbounded model: For an unbounded model $\mathcal{M}$, our initial estimators $(\bar{Q}_n,\bar{G}_n)$ of $(\bar{Q}_0,\bar{G}_0)$ are defined in terms of a sequence of bounded submodels $\mathcal{M}_n\subset\mathcal{M}$ that are increasing in $n$ and approximate the actual model $\mathcal{M}$ as $n$ converges to infinity. The counterparts of the above-defined universal bounds on $\mathcal{M}$ applied to $\mathcal{M}_n$ are denoted by $\tau_n$, $M_{1,n}^Q$, $M_{2,n}^Q$, $M_{1,n}^G$, $M_{2,n}^G$, $M_{D^*,n}$. The conditions of our general asymptotic efficiency Theorem 1 will enforce that these bounds converge slowly enough to infinity (in the case that the corresponding true model bound is infinite).
This model $\mathcal{M}_n$ could be defined as the largest subset of $\mathcal{M}$ for which these latter bounds apply. By Assumption 1, with this choice of definition of $\mathcal{M}_n$, for any $P_0\in\mathcal{M}$ there exists an $N_0=N(P_0)$ so that for $n>N_0$, $P_0\in\mathcal{M}_n$. Either way, we assume that $\mathcal{M}_n$ is defined such that the latter is true.
Let $\mathcal{Q}_n = \mathcal{Q}(\mathcal{M}_n)$ and $\mathcal{G}_n = \mathcal{G}(\mathcal{M}_n)$ be the parameter spaces of $Q$ and $G$ under model $\mathcal{M}_n$, and let $\bar{\mathcal{Q}}_n = \bar{\mathcal{Q}}(\mathcal{M}_n)$ and $\bar{\mathcal{G}}_n = \bar{\mathcal{G}}(\mathcal{M}_n)$ be the parameter spaces of $\bar{Q}$ and $\bar{G}$. We define the following true parameters corresponding with this model $\mathcal{M}_n$:

$$\bar{Q}_{0n} = \arg\min_{\bar{Q} \in \bar{\mathcal{Q}}_n} P_0 L_1(\bar{Q}), \qquad \bar{G}_{0n} = \arg\min_{\bar{G} \in \bar{\mathcal{G}}_n} P_0 L_2(\bar{G}).$$

We will assume that $\mathcal{M}_n$ is chosen so that $Q_{k_1+1}(P_{0n}) = Q_{k_1+1}(P_0)$ and $G_{k_2+1}(P_{0n}) = G_{k_2+1}(P_0)$, where $P_{0n} = \arg\max_{P \in \mathcal{M}_n} P_0 \log \frac{dP}{dP_0}$. That is, our sieve does not affect the estimation of the "easy" nuisance parameters $Q_{(k_1+1)0}$ and $G_{(k_2+1)0}$. Note that for $n > N_0$, we have $Q_{0n} = Q_0$ and $G_{0n} = G_0$.
In this paper our initial estimators of $\bar{Q}_0$ and $\bar{G}_0$ are always enforced to be in the parameter spaces of this sequence of models $\mathcal{M}_n$; if the model $\mathcal{M}$ is already bounded, then one can set $\mathcal{M}_n = \mathcal{M}$ for all $n$. However, even for bounded models $\mathcal{M}$, the utilization of a sequence of submodels $\mathcal{M}_n$ with stronger universal bounds than $\mathcal{M}$ could result in finite sample improvements (e.g., if the universal bounds on $\mathcal{M}$ are very large relative to sample size and the dimension of the data).
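For intuition, one concrete way such a sieve enforcement can look in practice is truncation: bounding an estimate of $\bar{G}_0$ away from zero at a level that shrinks slowly with $n$ keeps the supremum norm bound $M_{D^*,n}$ on the efficient influence curve growing slowly. The following minimal sketch illustrates the idea; the truncation level `c_n` and its `log n` rate are illustrative assumptions, not a prescription from this article.

```python
import numpy as np

def truncate_gbar(gbar, n, rate=0.1):
    # Sieve-style truncation: bound Gbar away from 0 at a slowly vanishing
    # level c_n, so the implied bound M_{D*,n} ~ 1/c_n grows slowly with n.
    # The choice c_n = rate / log(n) is a hypothetical tuning choice.
    c_n = rate / np.log(n)
    return np.clip(gbar, c_n, 1.0)
```

Larger `rate` (or a slower-vanishing `c_n`) trades bias from truncation against control of the sieve bounds in conditions such as eqs (30)–(32) below.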
5 Super-learning: HAL-estimator tuning the variation norm of the fit with cross-validation
Defining the library of candidate estimators: For an $M \in \mathbb{R}^{k_1}_{>0}$, let $\hat{\bar{Q}}_M : \mathcal{M}^{nonp} \to \bar{\mathcal{Q}}_{n,M} \subset \mathcal{F}_{v,M}$ be the HAL-estimator eq. (21), and let $\bar{Q}_{n,M} = \hat{\bar{Q}}_M(P_n)$. By Lemma 1 we have $d_{01}(\bar{Q}_{n,M} = \hat{\bar{Q}}_M(P_n), \bar{Q}_{0n,M}) = O_P(r_{\bar{Q}}^2(n))$, assuming that the numerical approximation error $r_n$ is of smaller order. Let $\mathcal{K}_{1,n,v}$ be an ordered collection $M_{1n} < M_{2n} < \ldots < M_{K_{1,n,v}}$ of $k_1$-dimensional constants, and consider the corresponding collection of $K_{1,n,v}$ candidate estimators $\hat{\bar{Q}}_M$ with $M \in \mathcal{K}_{1,n,v}$. We impose that this index set $\mathcal{K}_{1,n,v}$ is increasing in $n$ such that $\limsup_{n \to \infty} M_{K_{1,n,v}}$ equals $\sup_{P \in \mathcal{M}} \| L_1(\bar{Q}(P)) \|_v$, so that for any $P \in \mathcal{M}$ there exists an $N(P)$ so that for $n > N(P)$ we have $M_{K_{1,n,v}} > \| L_1(\bar{Q}(P)) \|_v$.
Note that for all $M \in \mathcal{K}_{1,n,v}$ with $M > \| L_1(\bar{Q}_0) \|_v$, we have $d_{01}(\hat{\bar{Q}}_M(P_n), \bar{Q}_0) = O_P(r_{\bar{Q}}^2(n))$. In addition, let $\hat{\bar{Q}}_j : \mathcal{M}^{nonp} \to \bar{\mathcal{Q}}_n$, $j \in \mathcal{K}_{1,n,a}$, be an additional collection of $K_{1,n,a}$ estimators of $\bar{Q}_0$. For example, these candidate estimators could include a variety of parametric-model-based as well as machine-learning-based estimators. This defines an index set $\mathcal{K}_{1,n} = \mathcal{K}_{1,n,v} \cup \mathcal{K}_{1,n,a}$ representing a collection of $K_{1n} = K_{1,n,v} + K_{1,n,a}$ candidate estimators $\{ \hat{\bar{Q}}_k : k \in \mathcal{K}_{1n} \}$.
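To make the variation-norm-indexed part of the library concrete in the running treatment-specific-mean example: for a one-dimensional covariate, a HAL-type fit with bound $M$ can be viewed as an empirical risk minimizer over the zero-order spline (indicator) basis $\{ w \mapsto 1(w \geq w_j) \}$ subject to an $\ell_1$ bound $M$ on the coefficients, and the grid $M_{1n} < \ldots < M_{K_{1,n,v}}$ indexes the library. The sketch below uses squared-error loss, a fixed knot grid, and a Frank-Wolfe solver as stand-ins; these are illustrative choices, not the authors' implementation of eq. (21).

```python
import numpy as np

def hal_basis(w, knots):
    # zero-order spline basis: phi_j(x) = 1{x >= knot_j}
    return (w[:, None] >= knots[None, :]).astype(float)

def fit_l1_bounded(X, y, M, iters=2000):
    # Frank-Wolfe for: min_b mean((y - X b)^2)  subject to  ||b||_1 <= M.
    # Each step moves toward a vertex +-M e_i of the l1 ball, so the
    # variation-norm bound M holds at every iterate.
    n, p = X.shape
    b = np.zeros(p)
    for t in range(iters):
        grad = -2.0 * X.T @ (y - X @ b) / n
        i = int(np.argmax(np.abs(grad)))
        s = np.zeros(p)
        s[i] = -M * np.sign(grad[i])
        gamma = 2.0 / (t + 2.0)
        b = (1.0 - gamma) * b + gamma * s
    return b

def hal_library(w, y, knots, M_grid):
    # the library of candidate fits indexed by the variation-norm grid
    X = hal_basis(w, knots)
    return {M: fit_l1_bounded(X, y, M) for M in M_grid}
```

A larger bound $M$ enlarges the function class, so the in-sample risk is (up to solver error) non-increasing in $M$; the cross-validation selector below is what trades this off against overfitting.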
Super Learner: Let $B_n \in \{0,1\}^n$ denote a random cross-validation scheme that randomly splits the sample $\{O_1, \ldots, O_n\}$ into a training sample $\{O_i : B_n(i) = 0\}$ and a validation sample $\{O_i : B_n(i) = 1\}$. Let $q_n = \sum_{i=1}^n B_n(i)/n$ denote the proportion of observations in the validation sample. We impose throughout the article that $q < q_n \leq 1/2$ for some $q > 0$, and that this random vector $B_n$ has a finite number $V$ of possible realizations for a fixed $V < \infty$. In addition, $P^1_{n,B_n}$ and $P^0_{n,B_n}$ will denote the empirical probability distributions of the validation and training sample, respectively. Thus, the cross-validated risk of an estimator $\hat{\bar{Q}} : \mathcal{M}^{nonp} \to \bar{\mathcal{Q}}_n$ of $\bar{Q}_0$ is defined as $E_{B_n} P^1_{n,B_n} L_1(\hat{\bar{Q}}(P^0_{n,B_n}))$.
We define the cross-validation selector as the index $k_{1n} = \hat{K}_1(P_n) = \arg\min_{k \in \mathcal{K}_{1n}} E_{B_n} P^1_{n,B_n} L_1(\hat{\bar{Q}}_k(P^0_{n,B_n}))$ that minimizes the cross-validated risk over all choices $k \in \mathcal{K}_{1n}$ of candidate estimators. Our proposed super-learner is defined by

(23) $\bar{Q}_n = \hat{\bar{Q}}(P_n) \equiv E_{B_n} \hat{\bar{Q}}_{k_{1n}}(P^0_{n,B_n}).$
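With squared-error loss standing in for $L_1$, the cross-validation selector and the averaging over splits in eq. (23) can be sketched as a discrete super-learner. The candidate estimators and the V-fold scheme below are illustrative choices, not the authors' software.

```python
import numpy as np

def cv_super_learner(candidates, X, y, V=5, seed=0):
    # candidates: list of fit(X, y) -> predict functions (the library)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)  # realizations of B_n
    risks = np.zeros(len(candidates))
    fits = [[None] * V for _ in candidates]
    for v, val in enumerate(folds):
        train = np.setdiff1d(np.arange(len(y)), val)
        for k, fit in enumerate(candidates):
            pred = fit(X[train], y[train])       # train on P^0_{n,B_n}
            fits[k][v] = pred
            # validation-sample risk, averaged over the V splits
            risks[k] += np.mean((y[val] - pred(X[val])) ** 2) / V
    k_star = int(np.argmin(risks))               # cross-validation selector k_{1n}
    # eq. (23): average the selected candidate's training-sample fits over B_n
    sl = lambda x: np.mean([fits[k_star][v](x) for v in range(V)], axis=0)
    return sl, k_star

def fit_mean(X, y):
    m = y.mean()
    return lambda x: np.full(len(x), m)

def fit_linear(X, y):
    b = np.polyfit(X, y, 1)
    return lambda x: np.polyval(b, x)
```

On data with a strong linear signal the selector picks the linear candidate, and the returned predictor is the average of its $V$ training-sample fits, matching the structure of eq. (23).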
The following lemma proves that the super-learner $\hat{\bar{Q}}(P_n)$ converges to $\bar{Q}_0$ at least at the rate $r_{\bar{Q}}^2(n)$ at which the HAL-estimator converges to $\bar{Q}_0$: $d_{01}(\hat{\bar{Q}}(P_n), \bar{Q}_0) = O_P(r_{\bar{Q}}^2(n))$. This lemma also shows that the super-learner is either asymptotically equivalent with the oracle-selected candidate estimator, or achieves the parametric rate $1/n$ of a correctly specified parametric model.
Lemma 2
Recall the definition of the model bounds $M_{1Q,n}, M_{2Q,n}$ eq. (18), and let

$$C(M_1, M_2, \delta) \equiv 2(1+\delta)^2 \left( \frac{2 M_1}{3} + \frac{M_2^2}{\delta} \right).$$

For any fixed $\delta > 0$,

$$d_{01}(\bar{Q}_n, \bar{Q}_{0n}) \leq (1 + 2\delta) \, E_{B_n} \min_{k \in \mathcal{K}_{1n}} d_{01}(\hat{\bar{Q}}_k(P^0_{n,B_n}), \bar{Q}_{0n}) + O_P\!\left( C(M_{1Q,n}, M_{2Q,n}, \delta) \frac{\log K_{1n}}{n} \right).$$

If for each fixed $\delta > 0$, $C(M_{1Q,n}, M_{2Q,n}, \delta) \log K_{1n}/n$ divided by $E_{B_n} \min_k d_{01}(\hat{\bar{Q}}_k(P^0_{n,B_n}), \bar{Q}_{0n})$ is $o_P(1)$, then

$$\frac{d_{01}(\hat{\bar{Q}}(P_n), \bar{Q}_{0n})}{E_{B_n} \min_k d_{01}(\hat{\bar{Q}}_k(P^0_{n,B_n}), \bar{Q}_{0n})} - 1 = o_P(1).$$

If for each fixed $\delta > 0$, $E_{B_n} \min_k d_{01}(\hat{\bar{Q}}_k(P^0_{n,B_n}), \bar{Q}_{0n}) = O_P(C(M_{1Q,n}, M_{2Q,n}, \delta) \log K_{1n}/n)$, then

$$d_{01}(\hat{\bar{Q}}(P_n), \bar{Q}_{0n}) = O_P\!\left( C(M_{1Q,n}, M_{2Q,n}, \delta) \frac{\log K_{1n}}{n} \right).$$

Suppose that for each finite $M$ the conditions of Lemma 1 hold with negligible numerical approximation error $r_n$, so that $d_{01}(\bar{Q}_{n,M} = \hat{\bar{Q}}_M(P_n), \bar{Q}_{0n,M}) = O_P(r_{\bar{Q}}^2(n))$. Let $\lambda_1 \in \mathbb{R}^{k_1}_{>0}$ be chosen so that $r_{\bar{Q}}^2(n) = O(n^{-\lambda_1})$. For each fixed $\delta > 0$, we have

(24) $d_{01}(\bar{Q}_n, \bar{Q}_{0n}) = O_P(n^{-\lambda_1}) + O_P\!\left( C(M_{1Q,n}, M_{2Q,n}, \delta) \frac{\log K_{1n}}{n} \right).$
The proof of this lemma is a simple corollary of the finite sample oracle inequality for cross-validation [11, 13, 21, 33, 34], also presented in Lemma 5 in Section A of the Appendix. It uses the convexity of the loss function to bring the $E_{B_n}$ inside the loss-based dissimilarity.

In the Appendix we present the analogue super-learner eq. (37) of $\bar{G}_0$ and its corresponding Lemma 6.
6 One-step CV-HAL-TMLE
Cross-validated TMLE (CV-TMLE) robustifies the bias reduction of the TMLE-step by selecting $\epsilon$ based on the cross-validated risk [5, 15]. In the next subsection we define the CV-TMLE and propose a particular type of local least favorable submodel that separately updates the initial estimator of $Q_{j0}$ for each $j = 1, \ldots, k_1$. Due to this choice, in subsection 6.2 we easily establish that the CV-TMLE of $\bar{Q}_0$ converges to $\bar{Q}_0$ at the same rate as the initial estimator, which is important for controlling the second-order remainder in the asymptotic efficiency proof of the CV-TMLE. In subsection 6.3 we establish the asymptotic efficiency of the CV-TMLE.
6.1 The CV-HAL-TMLE
Definition of one-step CV-HAL-TMLE for a general local least favorable submodel: Let $\bar{L}_1(Q) \equiv \sum_{j=1}^{k_1+1} L_{1j}(Q_j)$ be the sum loss function. For a given $(Q, G)$, let $\{Q_\epsilon : \epsilon\} \subset \mathcal{Q}_n \subset \mathcal{Q}$ be a parametric submodel through $Q$ at $\epsilon = 0$ such that the linear span of $\frac{d}{d\epsilon} \bar{L}_1(Q_\epsilon)$ at $\epsilon = 0$ includes the canonical gradient $D^*(Q, G)$. Let $\hat{Q} : \mathcal{M}^{nonp} \to \mathcal{Q}_n$ and $\hat{G} : \mathcal{M}^{nonp} \to \mathcal{G}_n$ be our initial estimators of $Q_0 = (\bar{Q}_0, Q_{0,k_1+1})$ and $G_0 = (\bar{G}_0, G_{0,k_2+1})$. We recommend defining the initial estimators $\hat{\bar{Q}}$ and $\hat{\bar{G}}$ of $\bar{Q}_0$ and $\bar{G}_0$ to be HAL-super-learners as defined by eqs (23) and (37), so that $d_{01}(\hat{Q}(P_n), Q_{0n}) = O_P(r_Q^2(n))$ and $d_{02}(\hat{G}(P_n), G_{0n}) = O_P(r_G^2(n))$. Given a cross-validation scheme $B_n \in \{0,1\}^n$, let $Q_{n,B_n} = \hat{Q}(P^0_{n,B_n}) \in \mathcal{Q}_n$ be the estimator $\hat{Q}$ applied to the training sample $P^0_{n,B_n}$. Similarly, let $G_{n,B_n} = \hat{G}(P^0_{n,B_n})$. Let $\{Q_{n,B_n,\epsilon} : \epsilon\}$ be the above submodel with $(Q, G) = (Q_{n,B_n}, G_{n,B_n})$ through $Q_{n,B_n}$ at $\epsilon = 0$. Let $\epsilon_n = \arg\min_\epsilon E_{B_n} P^1_{n,B_n} \bar{L}(Q_{n,B_n,\epsilon})$ be the MLE of $\epsilon$ minimizing the cross-validated empirical risk. This defines $Q^*_{n,B_n} = Q_{n,B_n,\epsilon_n}$ as the $B_n$-specific targeted fit of $Q_0$. The one-step CV-TMLE of $\psi_0$ is defined as $\psi^*_n = E_{B_n} \Psi(Q^*_{n,B_n})$.
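For the treatment-specific mean of Section 2, a common concrete choice of local least favorable submodel is the logistic fluctuation $\mathrm{logit}\,\bar{Q}_\epsilon = \mathrm{logit}\,\bar{Q} + \epsilon H$ with clever covariate $H(W) = 1/\bar{G}(W)$, with $\epsilon$ fitted among observations with $A = 1$. The sketch below assembles the one-step CV-TMLE $\psi^*_n = E_{B_n} \Psi(Q^*_{n,B_n})$ from this submodel; the grid search for $\epsilon$, the truncation levels, and the `Qbar_hat`/`Gbar_hat` interfaces are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def one_step_cv_tmle(W, A, Y, Qbar_hat, Gbar_hat, V=5, seed=0):
    # Qbar_hat(train), Gbar_hat(train): return functions of W estimating
    # E(Y | A=1, W) and P(A=1 | W) from the training sample (hypothetical API).
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), V)
    Q1 = np.zeros(n)   # training-sample Qbar evaluated at validation points
    H = np.zeros(n)    # clever covariate 1 / Gbar(W)
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        Q1[val] = np.clip(Qbar_hat(train)(W[val]), 1e-3, 1 - 1e-3)
        H[val] = 1 / np.clip(Gbar_hat(train)(W[val]), 0.01, 1)
    off = logit(Q1)
    keep = A == 1
    # epsilon_n minimizes the cross-validated log-likelihood loss along the
    # fluctuation, pooled over the validation samples (grid search for brevity)
    def cv_risk(eps):
        p = expit(off[keep] + eps * H[keep])
        return -np.mean(Y[keep] * np.log(p) + (1 - Y[keep]) * np.log(1 - p))
    grid = np.linspace(-1, 1, 401)
    eps_n = grid[np.argmin([cv_risk(e) for e in grid])]
    # psi*_n = E_{B_n} Psi(Q*_{n,B_n}): plug-in with the updated Qbar at all W
    return np.mean(expit(off + eps_n * H))
```

Note that one shared $\epsilon_n$ is fitted across the $V$ splits, exactly as in the pooled cross-validated risk minimization defining $\epsilon_n$ above.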
One-step CV-HAL-TMLE solves the cross-validated efficient score equation: Our efficiency Theorem 1 assumes that

(25) $E_{B_n} P^1_{n,B_n} D^*(Q^*_{n,B_n}, G_{n,B_n}) = o_P(n^{-1/2}).$

That is, it is assumed that the one-step CV-TMLE already solves the cross-validated efficient influence curve equation up to an asymptotically negligible approximation error. By definition of $\epsilon_n$, it solves its score equation $E_{B_n} P^1_{n,B_n} \frac{d}{d\epsilon_n} \bar{L}(Q_{n,B_n,\epsilon_n}) = 0$, which provides a basis for verifying eq. (25). As formalized by Lemma 13 in Appendix D, for our choice of $n^{-(1/4+)}$-consistent initial estimators $Q_n, G_n$ of $Q_0, G_0$, a one-step CV-TMLE will satisfy eq. (25) for one-dimensional local least favorable submodels under weak regularity conditions. We believe that such a result can be proved in great generality for arbitrary (also multivariate) local least favorable submodels. Instead, below we propose a particular class of multivariate local least favorable submodels eq. (26) for which we establish eq. (25) under regularity conditions. In (van der Laan and Gruber, 2015) it is shown that one can always construct a so-called universal least favorable submodel through $Q$ with a one-dimensional $\epsilon$ so that $\frac{d}{d\epsilon} \bar{L}_1(Q_\epsilon) = D^*(Q_\epsilon, G)$ at each $\epsilon$, so that $E_{B_n} P^1_{n,B_n} D^*(Q^*_{n,B_n,\epsilon_n}, G_{n,B_n}) = 0$ (exactly), independently of the properties of the initial estimator $(Q_n, G_n)$.
One-step CV-HAL-TMLE preserves the fast rate of convergence of the initial estimator: Our efficiency Theorem 1 also assumes that the updated estimator $Q^*_{n,B_n}$ satisfies, for each split $B_n$, $d_{01}(Q^*_{n,B_n}, Q_0) = o_P(n^{-1/2})$. This is generally a very reasonable condition given that $d_{01}(Q_{n,B_n}, Q_0) = O_P(n^{-\lambda_1})$ for a specified $\lambda_1 > 1/2$. Our proposed class of local least favorable submodels eq. (26) below guarantees that the rate of convergence of the initial estimator $Q_{n,B_n}$ is completely preserved by $Q^*_{n,B_n}$, so that this condition is automatically guaranteed to hold.
A class of multivariate local least favorable submodels that separately updates each nuisance parameter component: One way to guarantee that $d_{01}(Q^*_{n,B_n}, Q_0) = o_P(n^{-1/2})$ is to make sure that the updated estimator $Q^*_{n,B_n}$ converges to $Q_0$ as fast as the initial estimator $Q_{n,B_n}$. For that purpose we propose a $(k_1+1)$-dimensional local least favorable submodel of the type

(26) $Q_\epsilon = (Q_{1,\epsilon_1}, \ldots, Q_{k_1+1,\epsilon_{k_1+1}})$ such that $\left. \frac{d}{d\epsilon_j} L_{1j}(Q_{j,\epsilon_j}) \right|_{\epsilon_j = 0} = D^*_j(Q, G),$

for $j = 1, \ldots, k_1+1$, and where $D^*(Q, G) = \sum_{j=1}^{k_1+1} D^*_j(Q, G)$. By using such a submodel we have $Q^*_{j,n,B_n} = Q_{j,n,B_n,\epsilon_n(j)}$ with $\epsilon_n(j) = \arg\min_\epsilon E_{B_n} P^1_{n,B_n} L_{1j}(Q_{j,n,B_n,\epsilon})$. Thus, in this case $Q_{j,n,B_n}$ is updated with its own $\epsilon_n(j)$, $j = 1, \ldots, k_1+1$. The advantage of such a least favorable submodel is that the one-step update of $\bar{Q}_{j,n,B_n}$ is not affected by the statistical behavior of the other estimators $\bar{Q}_{l,n,B_n}$, $l \neq j$. On the other hand, if one uses a local least favorable submodel with a single $\epsilon$, the MLE $\epsilon_n$ is very much driven by the worst performing estimator $\bar{Q}_{j,n,B_n}$. Lemma 3 below shows that, by using such a $(k_1+1)$-variate local least favorable submodel satisfying eq. (26), the rate of convergence of the initial estimator $\bar{Q}_{j,n}$ is fully preserved by the TMLE-update $\bar{Q}^*_{j,n,B_n}$.
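The point that a single shared $\epsilon$ is driven by the worst performing component can be seen in a toy calculation. Below, two hypothetical cross-validated risk profiles are minimized either jointly over one shared $\epsilon$ or separately per component as in eq. (26); the quadratic risk profiles are invented purely for illustration.

```python
import numpy as np

def joint_eps(losses, grid):
    # one shared epsilon: minimize the summed loss over a single epsilon
    return grid[np.argmin([sum(L(e) for L in losses) for e in grid])]

def separate_eps(losses, grid):
    # eq. (26)-style update: each component j gets its own epsilon_n(j)
    return [grid[np.argmin([L(e) for e in grid])] for L in losses]

# toy cross-validated risks: component 1 is well estimated (minimum near 0),
# component 2 is poorly estimated (minimum far from 0, steep curvature)
L1 = lambda e: (e - 0.05) ** 2
L2 = lambda e: 50 * (e - 0.8) ** 2
grid = np.linspace(-1, 1, 2001)
```

With separate updates each component lands on its own minimizer ($\approx 0.05$ and $0.8$), while the shared $\epsilon$ is pulled to $\approx 0.785$ by the steeper, badly centered component, distorting the update of the well-estimated one.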
How to construct a local least favorable submodel of type eq. (26): A general approach for constructing such a $(k_1+1)$-variate least favorable submodel is the following. Let $D^*_j(P)$ be the efficient influence curve at $P$ of the parameter $\Psi_{j,P} : \mathcal{M} \to \mathbb{R}$ defined by $\Psi_{j,P}(P_1) = \Psi(Q_{-j}(P), Q_j(P_1))$, which sets all the other components $Q_l$ with $l \neq j$ equal to their true values under $P$, $j = 1, \ldots, k_1+1$. Then it follows immediately from the definition of the pathwise derivative that

$$D^*(P) = \sum_{j=1}^{k_1+1} D^*_j(P),$$

so that $D^*(P)$ is an element of the linear span of $\{ D^*_j(P) : j = 1, \ldots, k_1+1 \}$. Let $\{Q_{j,\epsilon(j)} : \epsilon(j)\} \subset \mathcal{Q}_{jn}$ be a one-dimensional submodel through $Q_j$ so that

$$\left. \frac{d}{d\epsilon(j)} L_{1j}(Q_{j,\epsilon(j)}) \right|_{\epsilon(j) = 0} = D^*_j(Q, G), \quad j = 1, \ldots, k_1+1.$$

That is, $\{Q_{j,\epsilon(j)} : \epsilon(j)\}$ is a local least favorable submodel at $(Q, G)$ for the parameter $\Psi_{j,Q} : \mathcal{M} \to \mathbb{R}$, $j = 1, \ldots, k_1+1$. Now, define $\{Q_\epsilon : \epsilon\} \subset \mathcal{Q}_n$ by $Q_\epsilon = (Q_{j,\epsilon(j)} : j = 1, \ldots, k_1+1)$. Then we have

$$\left. \frac{d}{d\epsilon} \bar{L}(Q_\epsilon) \right|_{\epsilon = 0} = \left( D^*_j(Q, G) : j = 1, \ldots, k_1+1 \right)^\top,$$

so that the submodel is indeed a local least favorable submodel.
Lemma 14 provides a sufficient set of minor conditions under which the one-step CV-HAL-TMLE using a local least favorable submodel of the type eq. (26) will satisfy eq. (25). Therefore, the class of local least favorable submodels eq. (26) yields both crucial conditions for the CV-HAL-TMLE: it solves eq. (25) and it preserves the rate of convergence of the initial estimator.
6.2 Preservation of the rate of the initial estimator for the one-step CV-HAL-TMLE using eq. (26)
Consider the submodel $\{Q_\epsilon : \epsilon\}$ of the type eq. (26) presented above. Given an initial estimator $\hat{Q} : \mathcal{M}^{nonp} \to \mathcal{Q}_n$, recall the definition $Q_{n,B_n,\epsilon} = \hat{Q}_\epsilon(P^0_{n,B_n})$ as the fluctuated version of the initial estimator applied to the training sample, and $\epsilon_n = \arg\min_\epsilon E_{B_n} P^1_{n,B_n} L_1(Q_{n,B_n,\epsilon})$. We want to show that $Q_{n,B_n,\epsilon_n}$ converges to $Q_0$ at the same rate as the initial estimator $Q_{n,B_n}$ (and thus also $\hat{Q}(P_n)$). The following lemma establishes this result; it is an immediate consequence of the oracle inequality of the cross-validation selector for the loss function $L_{1j}$, applied to the set of candidate estimators $P_n \to Q_{jn,\epsilon(j)} = \hat{Q}_{j,\epsilon(j)}(P_n)$ indexed by $\epsilon(j)$, for each $j = 1, \ldots, k_1+1$.
Lemma 3
Let $\epsilon_n = \arg\min_\epsilon E_{B_n} P^1_{n,B_n} L_1(Q_{n,B_n,\epsilon})$. We have

$$E_{B_n} d_{01}(\hat{Q}_{\epsilon_n}(P^0_{n,B_n}), Q_{0n}) \leq (1 + 2\delta) \min_\epsilon E_{B_n} d_{01}(\hat{Q}_\epsilon(P^0_{n,B_n}), Q_{0n}) + O_P\!\left( C(M_{1Q,n}, M_{2Q,n}, \delta) \frac{\log K_{1n}}{nq} \right).$$

By convexity of the loss function $L_1(Q)$, this implies

$$d_{01}(E_{B_n} \hat{Q}_{\epsilon_n}(P^0_{n,B_n}), Q_{0n}) \leq (1 + 2\delta) \min_\epsilon E_{B_n} d_{01}(\hat{Q}_\epsilon(P^0_{n,B_n}), Q_{0n}) + O_P\!\left( C(M_{1Q,n}, M_{2Q,n}, \delta) \frac{\log K_{1n}}{nq} \right).$$

We have

$$\min_\epsilon E_{B_n} d_{01}(\hat{Q}_\epsilon(P^0_{n,B_n}), Q_{0n}) \leq E_{B_n} d_{01}(\hat{Q}(P^0_{n,B_n}), Q_{0n}).$$

Thus, if for some $\lambda_1 > 0$ we have $C(M_{1Q,n}, M_{2Q,n}, \delta) \log K_{1n}/(nq) = O(n^{-\lambda_1})$ and, for each $B_n$, $d_{01}(\hat{Q}(P^0_{n,B_n}), Q_{0n}) = O_P(n^{-\lambda_1})$, then $d_{01}(E_{B_n} Q_{n,B_n,\epsilon_n}, Q_{0n}) = O_P(n^{-\lambda_1})$. It then also follows that, for each $B_n$, $d_{01}(\hat{Q}_{\epsilon_n}(P^0_{n,B_n}), Q_{0n}) = O_P(n^{-\lambda_1})$.
6.3 Efficiency of the one-step CV-HAL-TMLE
We have the following theorem.

Theorem 1
Consider the above defined one-step CV-TMLE $\psi^*_n = E_{B_n} \Psi(Q_{n,B_n,\epsilon_n})$ of $\Psi(Q_0)$.

Initial estimator conditions: Consider the HAL-super-learners $\hat{\bar{Q}}(P_n)$ and $\hat{\bar{G}}(P_n)$ defined by eqs (23) and (37), respectively, and recall that we are given simple estimators $\hat{Q}_{k_1+1}$ and $\hat{G}_{k_2+1}$ of $Q_{0,k_1+1}$ and $G_{0,k_2+1}$. Let $\lambda_1$ and $\lambda_2$ be chosen so that $r_{\bar{Q}}^2(n) = O(n^{-\lambda_1})$ and $r_{\bar{G}}^2(n) = O(n^{-\lambda_2})$. Assume the conditions of Theorem 2 and Theorem 6 so that we have

$$d_{01}(\hat{\bar{Q}}(P_n), \bar{Q}_0) = O_P(n^{-\lambda_1(1:k_1)}) + O_P\!\left( C(M_{1Q,n}, M_{2Q,n}, \delta) \log K_{1n}/n \right),$$
$$d_{02}(\hat{\bar{G}}(P_n), \bar{G}_0) = O_P(n^{-\lambda_2(1:k_2)}) + O_P\!\left( C(M_{1G,n}, M_{2G,n}, \delta) \log K_{2n}/n \right),$$

where $\lambda_1(1:k_1) > 1/2$ and $\lambda_2(1:k_2) > 1/2$. Let $\hat{Q} = (\hat{\bar{Q}}, \hat{Q}_{k_1+1})$ and $\hat{G} = (\hat{\bar{G}}, \hat{G}_{k_2+1})$ be the corresponding estimators of $Q_0$ and $G_0$, respectively.
"Preserve rate of convergence of initial estimator" condition: In addition, assume that either (Case A) the CV-TMLE uses a local least favorable submodel of the type eq. (26), so that Lemma 3 applies, or (Case B) for each split $B_n$, $d_{01}(Q^*_{n,B_n}, Q_0) = O_P(n^{-\lambda^*_1})$ for some $\lambda^*_1 > 1/2$.
Efficient influence curve score equation condition and second-order remainder condition: Define $f_{n,\epsilon} = D^*(\hat{Q}_\epsilon(P^0_{n,B_n}), G_{n,B_n}) - D^*(Q_0, G_0)$ and the class of functions $\mathcal{F}_n = \{ f_{n,\epsilon} : \epsilon \}$. Assume

(27) $E_{B_n} P^1_{n,B_n} D^*(Q_{n,B_n,\epsilon_n}, G_{n,B_n}) = o_P(n^{-1/2}),$

(28) $\| D^*(Q^*_{n,B_n}, G_{n,B_n}) - D^*(Q_0, G_0) \|_{P_0} = o_P(r_{D^*,n})$ for some $r_{D^*,n} = o(1),$

(29) $E_{B_n} R_{20}((Q^*_{n,B_n}, G_{n,B_n}), (Q_0, G_0)) = o_P(n^{-1/2}),$

(30) $\max(M_{1Q,n}, M_{2Q,n}^2) \frac{\log K_{1n}}{n} = O(n^{-\lambda_1}),$

(31) $\max(M_{1G,n}, M_{2G,n}^2) \frac{\log K_{2n}}{n} = O(n^{-\lambda_2}),$

(32) $\sup_\Lambda N(\epsilon M_{D^*,n}, \mathcal{F}_n, L^2(\Lambda)) < K \epsilon^{-p}$ for some $K < \infty$, $p < \infty$.
In Case A, for verification of assumption eq. (27) one could apply Lemma 14. In Case A, for verification of the two assumptions eqs (28) and (29) one can use that for each of the $V$ realizations of $B_n$, $d_0(Q^*_{n,B_n}, Q_0) = O_P(n^{-\lambda_1})$ and $d_{02}(G_{n,B_n}, G_0) = O_P(n^{-\lambda_2})$. In Case B, for verification of the latter two assumptions eqs (28) and (29) one can use that for each of the $V$ realizations of $B_n$, $d_0(Q^*_{n,B_n}, Q_0) = O_P(n^{-\lambda^*_1})$ and $d_{02}(G_{n,B_n}, G_0) = O_P(n^{-\lambda_2})$.
Then $\psi^*_n = E_{B_n} \Psi(Q_{n,B_n,\epsilon_n})$ is asymptotically efficient:

(33) $\psi^*_n - \psi_0 = (P_n - P_0) D^*(Q_0, G_0) + o_P(n^{-1/2}).$
Condition eq. (32) will practically always hold trivially with $p = k_1 + 1$ equal to the dimension of $\epsilon$; note that this is even true for unbounded models, due to the normalizing constant $M_{D^*,n}$. We already discussed the crucial condition eq. (27) in the subsection defining the CV-TMLE. Conditions eqs (30) and (31) are easily satisfied by controlling the speed at which the model bounds $M_{1Q,n}, M_{2Q,n}, M_{1G,n}, M_{2G,n}$ converge to infinity, and they always hold for bounded models (as long as the size of the super-learner library behaves as a polynomial power of sample size). For bounded models $\mathcal{M}$, condition eq. (28) will typically hold with $r_{D^*,n} = n^{-\lambda}$ and $\lambda$ equal to the minimum of the components of $\lambda_1/2$ and $\lambda_2/2$: i.e., the efficient influence curve estimator converges to its true counterpart as fast as the slowest converging nuisance parameter estimator. If the model $\mathcal{M}$ is unbounded, so that the model bounds of the sieve $\mathcal{M}_n$ converge to infinity, then eq. (28) will hold with $r_{D^*,n} = n^{-\lambda} M_n$ for some $M_n$ converging to infinity (e.g., $M_n = M_{D^*,n}$). So, in the latter case one has to control the rate at which the model bounds of the sieve $\mathcal{M}_n$, such as the supremum norm bound $M_{D^*,n}$ on the efficient influence curve, converge to infinity. Finally, the crucial condition eq. (29) will easily hold for bounded models $\mathcal{M}$ if this slowest rate $\lambda$ is larger than $1/4$, which we know to be true for the HAL-estimator and its super-learner. For unbounded models, condition eq. (29) puts a serious brake on the speed at which the model bounds of $\mathcal{M}_n$ can converge to infinity.

Proof: By assumptions eqs (30) and (31), we have

$$d_0\big( (\hat{Q}(P^0_{n,B_n}), \hat{G}(P^0_{n,B_n})), (Q_0, G_0) \big) = O_P(n^{-\lambda_1}, n^{-\lambda_2}).$$
Consider Case A. Lemma 3 proves that under these same assumptions eqs (30) and (31) we also have, for each $B_n$, $d_{01}(Q_{n,B_n,\epsilon_n}, Q_{0n}) = O_P(n^{-\lambda_1})$. This proves that for each $B_n$, $d_0\big( (Q^*_{n,B_n} = Q_{n,B_n,\epsilon_n}, G_{n,B_n}), (Q_0, G_0) \big) = O_P(n^{-\lambda_1}, n^{-\lambda_2})$. For Case B, we replace $\lambda_1$ in the latter expression by $\lambda^*_1$.
Suppose $n > N_0$, so that $Q_{0n} = Q_0$ and $G_{0n} = G_0$. By the identity $\Psi(Q^*_{n,B_n}) - \Psi(Q_0) = -P_0 D^*(Q^*_{n,B_n}, G_{n,B_n}) + R_{20}((Q^*_{n,B_n}, G_{n,B_n}), (Q_0, G_0))$, we have

$$E_{B_n} \Psi(Q^*_{n,B_n}) - \Psi(Q_0) = -E_{B_n} P_0 D^*(Q^*_{n,B_n}, G_{n,B_n}) + E_{B_n} R_{20}\big( (Q^*_{n,B_n}, G_{n,B_n}), (Q_0, G_0) \big).$$
Combining this with eq. (27) yields the following identity:

$$\psi^*_n - \Psi(Q_0) = E_{B_n} \Psi(Q^*_{n,B_n}) - \Psi(Q_0) = E_{B_n} (P^1_{n,B_n} - P_0) D^*(Q^*_{n,B_n}, G_{n,B_n}) + E_{B_n} R_{20}\big( (Q^*_{n,B_n}, G_{n,B_n}), (Q_0, G_0) \big) + o_P(n^{-1/2}).$$
By assumption eq. (29) we have $E_{B_n} R_{20}\big( (Q^*_{n,B_n}, G_{n,B_n}), (Q_0, G_0) \big) = o_P(n^{-1/2})$. Thus, we have shown

$$\Psi(Q^*_n) - \Psi(Q_0) = E_{B_n} (P^1_{n,B_n} - P_0) D^*(Q^*_{n,B_n}, G_{n,B_n}) + o_P(n^{-1/2}).$$
We now note

$$E_{B_n} (P^1_{n,B_n} - P_0) D^*(Q^*_{n,B_n}, G_{n,B_n}) = E_{B_n} (P^1_{n,B_n} - P_0) D^*(Q_0, G_0) + E_{B_n} (P^1_{n,B_n} - P_0) \{ D^*(Q^*_{n,B_n}, G_{n,B_n}) - D^*(Q_0, G_0) \}$$
$$= (P_n - P_0) D^*(Q_0, G_0) + E_{B_n} (P^1_{n,B_n} - P_0) \{ D^*(Q^*_{n,B_n}, G_{n,B_n}) - D^*(Q_0, G_0) \}.$$
Thus, it remains to prove that $E_{B_n} (P^1_{n,B_n} - P_0) \{ D^*(Q^*_{n,B_n}, G_{n,B_n}) - D^*(Q_0, G_0) \} = o_P(n^{-1/2})$. For this we apply Lemma 10 with $f_{n,\epsilon} = D^*(\hat{Q}_\epsilon(P^0_{n,B_n}), G_{n,B_n}) - D^*$