A Fundamental Measure of Treatment Effect Heterogeneity

We offer a non-parametric plug-in estimator for an important measure of treatment effect variability and provide minimal conditions under which the estimator is asymptotically efficient. The stratum-specific treatment effect function, or so-called blip function, is the average treatment effect for a randomly drawn stratum of confounders. The mean of the blip function is the average treatment effect (ATE), whereas the variance of the blip function (VTE), the main subject of this paper, measures overall clinical effect heterogeneity, perhaps providing a strong impetus to refine treatment assignment based on the confounders. VTE is also an important measure for assessing the reliability of the treatment for an individual. The CV-TMLE provides simultaneous plug-in estimates and inference for both ATE and VTE, guaranteeing asymptotic efficiency under one less condition than for TMLE. This condition is difficult to guarantee a priori, particularly when using the highly adaptive machine learning that we need to employ in order to eliminate bias. Even in defiance of this condition, CV-TMLE sampling distributions maintain normality, which is not guaranteed for TMLE, and have a lower mean squared error than their TMLE counterparts. In addition to verifying the theoretical properties of TMLE and CV-TMLE through simulations, we point out some of the challenges in estimating VTE, which lacks double robustness and might be unavoidably biased if the true VTE is small and the sample size insufficient. We provide an application of the estimator on a data set for treatment of acute trauma patients.


Introduction
A clinician might observe highly variable results for a treatment and want to know how much of this variation is due to confounders, thus motivating more precision in how treatment is assigned. Such variation also provides a measure of what to expect from treatment on an individual level beyond the average treatment effect (ATE). The stratum-specific treatment effect function, or so-called blip function, is defined as the average treatment effect for a randomly drawn stratum of confounders; its variance is the variance of the treatment effect (VTE). We employ targeted learning (van der Laan and Rose 2011) to construct simultaneous plug-in estimators of ATE and VTE, amounting to the sample mean and variance of blip function estimates. Under TMLE conditions, standard error estimates obtained by computing the standard deviation of the efficient influence curve approximation are asymptotically as small as those of any regular asymptotically linear estimator and nominally cover the truth. Without the targeting step in our estimator (see section 2.3), the non-parametric bootstrap might not guarantee valid inference (van der Vaart and Wellner 1996) for our plug-in estimates if we wish to employ a rich ensemble of machine learning algorithms in our prediction of the outcome model and possibly the treatment mechanism. When our prediction methods are costly, avoiding the bootstrap also saves considerable time.
We employ ensemble machine learning so as to break from narrow parametric model assumptions that do not respect real knowledge of the statistical model. In doing so, we will see that CV-TMLE has an advantage over TMLE in that it does not require a Donsker condition on the initial predictions, enabling more flexibility in the ensemble learning we employ. Estimating VTE lacks the desirable robustness properties present when estimating ATE. In particular, for a randomized trial, our CV-TMLE estimate of ATE is consistent, whereas knowledge of the treatment mechanism does not guarantee consistent VTE estimates. The lack of robustness is due to a stubborn second order remainder term, which we discuss at length. Despite the appeal of targeted learning as in this paper, the second order remainder can make coverage unreliable, an issue for which we offer future improvements. We are also limited in detecting VTE when the true VTE is relatively small for the sample size.

Background
Let us define Y_a as a random variable which is a counterfactual outcome drawn from a population under the intervention to set treatment to a, as per the Neyman-Rubin potential outcomes framework (Neyman 1923; Donald Rubin 1974). To obtain the variance in counterfactual treatment effect from individual to individual, one would seek var(Y_1 − Y_0). However, this parameter is not generally identifiable, a problem addressed by Neyman, 1923, when computing the standard errors for the counterfactual average difference in yield of two crop varietals. Neyman realized his standard errors relied upon the correlation between counterfactual outcomes on the same plot and thus he did not have the data to estimate it. Fisher, 1951, similarly to Neyman, realized he had difficulty directly estimating the joint distribution of Y_1 and Y_0 and suggested that pairing a treated and untreated varietal in the same pot might closely resemble a set of counterfactual differences, Y_1 − Y_0, to which one might apply a t-test. Cox, 1958, facing the same issue, assumes var(Y_1 − Y_0) = 0 for predefined homogeneous subgroups, which is very difficult to verify. Strong assumptions must be made to identify var(Y_1 − Y_0) in the event it is not 0, such as assuming quantiles are preserved between the counterfactual outcomes (Heckman and Smith 1997). Heckman, 1997, mentions combining the results of Cambanis, 1976, and Frechet, 1951, to bound var(Y_1 − Y_0) = var(Y_1) + var(Y_0) − 2γ_{Y_1,Y_0}√(var(Y_1)var(Y_0)) and use the bootstrap to test whether the lower bound of the confidence interval includes a variance of 0. For randomized trial data, Ding, Feller et al., 2016, construct a Fisher randomization test of the null hypothesis that var(Y_1 − Y_0) = 0 under the untestable assumption that there exists a universal τ so that Y_1 = Y_0 + τ. However, a null hypothesis of 0 is not helpful when it comes to assigning treatment based on confounders.
The assumption of Cox, 1958, amounts to assuming var(Y_1 − Y_0) = var(E[Y_1 − Y_0 | W]), where W indicates an a priori known homogeneous subgroup. Thus he assumes all variation of Y_1 − Y_0 is due to confounders, the exact variation we aim to capture by estimating var(E[Y_1 − Y_0 | W]), the variance of the blip function or VTE. In our case we need not consider any homogeneous subgroups a priori. As a simple case to give intuition as to what VTE captures, consider W an indicator of male or female, and a binary outcome indicating survival if the outcome is 1. Suppose the men have a blip value of E[Y_1 − Y_0 | W = male] = −0.3 and the women E[Y_1 − Y_0 | W = female] = 0.7. Assuming men and women are of equal proportion in the population at hand, the VTE is 0.25 and the ATE is 0.2. This would mean a patient gains an average of 20% in survival probability from treatment, with a standard deviation of 50%. One should be reminded that the VTE gives a more personal measure of what to expect from treatment, but not an individual effect variance. For instance, within the male subgroup one might have a high or low varying random variable, (Y_1 − Y_0 | male), and such variation does not count toward the VTE. Hence a clinician's perception of highly varying outcomes does not mean the VTE is high. Rather, one would want to estimate VTE to see whether the varying outcomes were due to lack of precision in applying the treatment.
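The subgroup arithmetic above can be checked directly; a minimal sketch (the blip values and proportions are those of the toy example in the text, not from any data set):

```python
import numpy as np

# Blip values for a balanced two-stratum population (toy example from the text)
blip = np.array([-0.3, 0.7])    # E[Y1 - Y0 | male], E[Y1 - Y0 | female]
weights = np.array([0.5, 0.5])  # equal proportions of men and women

ate = np.sum(weights * blip)               # mean of the blip function -> 0.2
vte = np.sum(weights * (blip - ate) ** 2)  # variance of the blip function -> 0.25
sd_te = np.sqrt(vte)                       # -> 0.5
```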
Estimating VTE with a plug-in estimator naturally depends on using an estimate of the outcome model, E[Y_a | W], from which to estimate the blip function, E[Y_1 | W] − E[Y_0 | W], and then taking its variance over an estimate of the distribution of W, in our case the empirical distribution of W. If one knows the blip function, for instance, one would know an optimal dynamic rule for treatment (Luedtke and van der Laan 2016). One could also find subgroup specific treatment effects via the blip function. Tian et al. (2014) offered a way to isolate interactions of treatment with confounders in a randomized trial by transforming the predictors of a parametric model. The main idea is to form a variable, z = 2A − 1, where A is the usual treatment indicator, and then put the interaction of this variable with the predictors in the outcome regression. This enables direct estimation of the blip function, from which one could obtain a point estimate of the VTE. One could also employ recursive partitioning to divide the data into subgroups that are homogeneous as far as treatment effects (Athey and Imbens 2016), as well as employ random forests (Athey and Imbens 2015). We could use such subgroups to compute the VTE, but as noted in (Bitler et al. 2014), establishing overly coarse subgroups can miss detecting treatment effect heterogeneity. In applying the CV-TMLE or TMLE, we also may use tree regression methods within our machine learning ensemble, but we are only interested in the predictive power of these methods in eliminating second order remainder term bias, as we will discuss.
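To illustrate the z = 2A − 1 device, here is a sketch under an assumed linear working model and a randomized treatment with g(1 | W) = 0.5; the data generating values (tau0, tau1, the noise scale) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)  # randomized treatment
z = 2 * A - 1                     # transformed treatment variable
tau0, tau1 = 0.2, 0.5             # hypothetical true blip: b(W) = tau0 + tau1 * W
Y = 0.1 + 0.3 * W + (z / 2) * (tau0 + tau1 * W) + rng.normal(scale=0.1, size=n)

# Outcome regression with main terms plus z/2 interactions; the coefficients on
# the z/2 columns directly estimate the blip function.
X = np.column_stack([np.ones(n), W, z / 2, z * W / 2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

blip_hat = beta[2] + beta[3] * W  # estimated blip values
ate_hat = blip_hat.mean()         # plug-in ATE
vte_hat = blip_hat.var()          # plug-in VTE point estimate
```

Because the trial is randomized, z is independent of W and the interaction coefficients recover the blip function without confounding adjustment.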

Methodology
For simplicity we consider only binary treatment, but all the analysis herein can be extended to multinomial treatment if we define the treatment effect to be a contrast of any two of the treatment levels. The reader may assume the outcome is either binary or a bounded continuous outcome. For the latter, we can scale the outcome to be in [0, 1] via the transformation Y_s = (Y − a)/(b − a), where a and b are the minimum and maximum outcomes respectively, obtained from the data or known a priori. The distribution P in our model defines an outcome model with conditional mean, Q̄(A, W) = E_P[Y | A, W], a treatment mechanism, g(A | W) = Pr[A | W], and p_W, the density of W. For binary outcome, the density of P can be factored as p(w, a, y) = Q̄(a, w)^y (1 − Q̄(a, w))^{1−y} g(a | w)^a (1 − g(a | w))^{1−a} p_W(w) and the mean of the log-likelihood loss is ∫ L(P)(w, a, y) dP = −∫ [y log(Q̄(a, w)) + (1 − y) log(1 − Q̄(a, w)) + log(g(a | w)^a (1 − g(a | w))^{1−a} p_W(w))] dP. For continuous outcome scaled to be in [0, 1], this is the mean of the so-called quasibinomial loss, also minimized at the true distribution, P_0 (Wedderburn 1974; McCullagh 1983). The benefit of scaling a continuous outcome to [0, 1] and then applying quasibinomial loss is that the predictions are constrained to be within 0 and 1 when applying logistic regression (Gruber and van der Laan 2010). Thus, when scaled back, estimates will be within a and b. We scale our continuous outcomes to be between 0 and 1 and use quasibinomial loss, making the targeting portion of the CV-TMLE or TMLE procedure (section 2.3) identical for bounded continuous or binary outcomes. After the CV-TMLE or TMLE algorithm is complete, one may convert the outcomes back to their original scale and form estimates and confidence bands, for which we offer instruction.
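The scaling and its inverse for the final estimates can be sketched as follows (a minimal illustration; since ATE is a mean difference and VTE a variance of mean differences, they rescale by (b − a) and (b − a)² respectively):

```python
import numpy as np

def scale_outcome(y, a, b):
    """Map a bounded outcome into [0, 1] via Y_s = (Y - a) / (b - a)."""
    return (y - a) / (b - a)

def unscale_estimates(ate_s, vte_s, a, b):
    """Return ATE and VTE estimates on the original outcome scale."""
    return ate_s * (b - a), vte_s * (b - a) ** 2

y = np.array([10.0, 25.0, 40.0])
ys = scale_outcome(y, a=10.0, b=40.0)                    # -> [0.0, 0.5, 1.0]
ate, vte = unscale_estimates(0.1, 0.02, a=10.0, b=40.0)  # -> (3.0, 18.0)
```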

Full Data Statistical Model and the link to the Observed Data
Our full data, including unobserved measures, is assumed to be generated according to structural equations of the form W = f_W(U_W), A = f_A(W, U_A), Y = f_Y(W, A, U_Y) (S. Wright 1921; Strotz and Wold 1960; Pearl 2000), where Y is the outcome, either binary or bounded continuous. We thus define a joint distribution, P_{U,X}, via (U, X) ∼ P_{U,X}.
Y_a is a random outcome under P_{U,X}, generated according to the structural equations above where we intervene to set treatment to a. Our true observed data distribution, P_0, is an element of M, which will be non-parametric. In the case of a randomized trial, or if we have some knowledge of the treatment mechanism, M is considered a semi-parametric model and we will incorporate such knowledge, which we will see is much more helpful for estimating ATE than for VTE.

Parameter of Interest and Identification
We define the stratum-specific treatment effect function, or blip function, as b_{P_{U,X}}(W) = E[Y_1 − Y_0 | W]. Imposing the randomization assumption (Robins 1986; Greenland and Robins 1986), Y_a ⊥ A | W, yields b_{P_{U,X}}(W) = b_P(W) = E_P[Y | A = 1, W] − E_P[Y | A = 0, W], and we can identify the parameter of interest as a mapping from the observed data model, M, to R² via the g-computation formula (Robins 1986): Ψ(P) = (E_P b_P(W), var_P b_P(W)) = Ψ(P^F_{U,X}), i.e., ATE and VTE respectively.

TMLE Conditions and Asymptotic Efficiency
We refer the reader to the Targeted Learning Appendix (van der Laan and Rose 2011) as well as (van der Laan 2016; van der Laan and Gruber 2016; van der Laan and Daniel Rubin 2006) for a more detailed look at the theory of TMLE and the use of targeted learning that yields our algorithm below. We offer the reader a brief overview in service of our estimation problem at hand.
The efficient influence curve at a distribution, P, for the parameter mapping, Ψ, is a function of the observed data, O ∼ P, notated as D_Ψ(P)(O). Its variance gives the generalized Cramer-Rao lower bound for the variance of any regular asymptotically linear estimator of Ψ (van der Vaart 2000). We also note, in our general discussion, we consider our case of a two dimensional efficient influence curve, D_Ψ(P) = (D_{Ψ1}(P), D_{Ψ2}(P)), where Ψ_1 and Ψ_2 are ATE and VTE respectively. One may generalize to any finite dimension from our discussion.
We will employ the notation P_n f(O) for the empirical average of a function, f(·), and P f(O) for its expectation, E_P f(O). Define a loss function, L(P)(O), which is a function of the observed data, O, and indexed by the distribution on which it is defined, P, such that E_{P_0} L(P)(O) is minimized at the true observed data distribution, P = P_0. The TMLE procedure maps an initial estimate, P⁰_n ∈ M, of the true data generating distribution to P*_n ∈ M such that P_n L(P*_n) ≤ P_n L(P⁰_n) and such that P_n D*_Ψ(P*_n) = 0_{2×1}. P*_n is called the TMLE of the initial estimate P⁰_n. We can then write an expansion with second order remainder term, R_2, as follows: Ψ(P*_n) − Ψ(P_0) = (P_n − P_0)D*_Ψ(P*_n) + R_2(P*_n, P_0).
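The empirical-measure notation can be made concrete (a minimal sketch with a hypothetical function f; P_n f is a sample mean, P f its population expectation):

```python
import numpy as np

def Pn(f, observations):
    """Empirical average P_n f(O) = (1/n) * sum_i f(O_i)."""
    return np.mean(f(observations))

rng = np.random.default_rng(1)
O = rng.normal(loc=2.0, scale=1.0, size=100_000)
f = lambda o: o ** 2

# P_n f(O) approximates P f(O) = E_P O^2 = var + mean^2 = 1 + 4 = 5
empirical = Pn(f, O)
```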

Conditions for Asymptotic Efficiency
Define the norm ‖f‖_{L²(P)} = √(E_P f²). Assume the following TMLE conditions:

1. D_{Ψj}(P*_n) is in a P-Donsker class for all j. This condition can be dropped in the case of using CV-TMLE (Zheng and van der Laan 2010). We show the advantages of CV-TMLE in our simulations.

2. Second order remainder condition: R_2(P*_n, P_0) = o_P(n^{−1/2}).

3. ‖D_{Ψj}(P*_n) − D_{Ψj}(P_0)‖_{L²(P_0)} converges to 0 in probability for all j.

Under the TMLE conditions, Ψ_j(P*_n) − Ψ_j(P_0) = P_n D_{Ψj}(P_0) + o_P(n^{−1/2}). Thus, our plug-in TMLE estimates and CIs, given by Ψ_j(P*_n) ± z_α σ_n(D*_j(P*_n))/√n, will be as small as possible for any regular asymptotically linear estimator at significance level 1 − α, where Pr(|Z| ≤ z_α) = α for Z standard normal and σ_n(D*_j(P*_n)) is the sample standard deviation of {D*_j(P*_n)(O_i) | i ∈ 1 : n} (van der Laan and Daniel Rubin 2006). Note that if the TMLE conditions hold for the initial estimate, P⁰_n, then they will also hold for the updated model, P*_n (van der Laan 2016), thereby placing importance on our ensemble machine learning in constructing P⁰_n.

For the cross terms of the remainder, which depend on ∫(Q̄*_n − Q̄_0)(g_n − g_0) dP_0, we need the product of the two rates of convergence to be o_P(n^{−0.5}): if the first factor is o_P(n^{r_Q̄}) and the second is o_P(n^{r_g}), then r_Q̄ + r_g ≤ −0.5 will satisfy remainder condition 2 above. It is notable that these terms disappear in the case of a randomized trial where we incorporate the known g_0, and that their bound is generous because the integrand can change sign. However, the remaining VTE term, ∫(b*_n − b_0)² dP_0, is not generous in this way because the integrand is a square. Precisely, we require ‖Q̄*_n − Q̄_0‖_{L²(P_0)} to be o_P(n^{−0.25}), with no help provided by knowing the treatment mechanism. Hence, VTE estimation is not doubly robust. We can apply a large data-adaptive ensemble of state-of-the-art machine learning algorithms to mitigate this remainder term, but we still have found it can cause bias and poor coverage.
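Given influence-curve values from a fitted TMLE, the Wald-style interval is straightforward to compute (a sketch with hypothetical numbers; `eic_vals` stands in for the evaluations D*_j(P*_n)(O_i)):

```python
import numpy as np
from statistics import NormalDist

def wald_ci(psi_hat, eic_vals, alpha=0.05):
    """CI: psi_hat +/- z * sd(EIC)/sqrt(n), sd being the sample standard deviation."""
    n = len(eic_vals)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = np.std(eic_vals, ddof=1) / np.sqrt(n)
    return psi_hat - z * se, psi_hat + z * se

rng = np.random.default_rng(2)
eic_vals = rng.normal(scale=0.5, size=400)  # hypothetical EIC evaluations
lo, hi = wald_ci(0.1, eic_vals)
```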

One-step CV-TMLE Algorithm for ATE and VTE
The one-step TMLE algorithm constructs a parametric submodel through the initial estimate, P⁰_n, indexed by a one-dimensional parameter, ε: {P_{n,ε} | ε ∈ [−δ, δ]}, where P_{n,ε=0} = P⁰_n. The construction is performed recursively in such a way that we arrive in one step at an element of the submodel, P*_n, which has minimum empirical average loss of all the elements and also solves the efficient influence curve equation, P_n D*_Ψ(P*_n)(O) = 0_{2×1}. We introduce in Appendix C a new iterative analog to the one-step TMLE procedure which also uses one-dimensional parametric submodels, called canonical least favorable submodels, from which one can define the universal least favorable submodel employed in the one-step TMLE algorithm. We also mention the iterative TMLE in van der Laan and Gruber, 2016, which utilizes parametric submodels of the same dimension as the parameter. It has been conjectured that the one-step TMLE may better preserve the properties of the initial fit, P⁰_n, than the aforementioned iterative versions, thereby leading to better finite-sample behavior of the second-order remainder term R_2(P*_n, P_0) (van der Laan and Gruber 2016). If this conjecture is correct, then we would expect similar gains by using the one-step TMLE in our setting; however, in our simulations we found no appreciable differences. Hence, we will only discuss the one-step TMLE and one-step CV-TMLE algorithms, leaving the difference in performance between TMLE procedures as a subject for future research.
The efficient influence curve for Ψ(P) = (Ψ_1(P), Ψ_2(P)), i.e., ATE and VTE, has two components given by

D*_{Ψ1}(P)(O) = (2A − 1)/g(A | W) (Y − Q̄(A, W)) + b_P(W) − Ψ_1(P)
D*_{Ψ2}(P)(O) = 2(b_P(W) − Ψ_1(P)) (2A − 1)/g(A | W) (Y − Q̄(A, W)) + (b_P(W) − Ψ_1(P))² − Ψ_2(P)

where W is a possibly high dimensional set of confounders, A is a binary treatment indicator, Y is a binary outcome or a continuous outcome scaled between 0 and 1, and b_P(W) = Q̄(1, W) − Q̄(0, W) is the blip function. The reader may visit Appendix A for the derivation.
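The two components can be evaluated directly from estimates of Q̄, g, the blip function and the parameter values (a sketch; all inputs below are hypothetical arrays):

```python
import numpy as np

def eic_components(blip, A, Y, Qbar_AW, g1W, psi1, psi2):
    """Evaluate the ATE and VTE efficient influence curve components per observation.

    blip : blip estimates b(W_i); Qbar_AW : Qbar(A_i, W_i); g1W : g(1 | W_i).
    """
    H = (2 * A - 1) / np.where(A == 1, g1W, 1 - g1W)  # clever covariate (2A-1)/g(A|W)
    resid = Y - Qbar_AW
    D1 = H * resid + blip - psi1
    D2 = 2 * (blip - psi1) * H * resid + (blip - psi1) ** 2 - psi2
    return D1, D2

# With plug-in psi1, psi2 and no residual (Y = Qbar), both components average to zero.
blip = np.array([0.1, -0.2, 0.3, 0.0])
psi1 = blip.mean()
psi2 = ((blip - psi1) ** 2).mean()
A = np.array([1, 0, 1, 0])
g1W = np.full(4, 0.5)
Qbar = np.array([0.6, 0.4, 0.7, 0.5])
D1, D2 = eic_components(blip, A, Qbar, Qbar, g1W, psi1, psi2)
```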
In the algorithm below, we adjusted the original CV-TMLE procedure (Zheng and van der Laan 2010) for ease of computation without losing any theoretical properties or finite sample performance. The convenience here is that once we obtain initial estimates, there is no difference between CV-TMLE and TMLE as far as implementation is concerned. The reader may consult Appendix D for the difference between this procedure and the originally defined CV-TMLE (Zheng and van der Laan 2010) regarding our parameter of interest, and why neither requires condition 1 of section 2.2.1.

The "Learning" Part of Targeted Learning: Obtaining Initial Estimates
To perform a one-step CV-TMLE we first randomly select V folds (usually 10), consisting of V disjoint validation sets of equal size comprising all n observations, and the corresponding training sets. Each training set is the complement of the corresponding validation set, so that for each fold the training set and validation set together comprise all n subjects. For v in 1:V we perform the following. Using the data-adaptive ensemble machine learning package, SuperLearner (Polley, 2009), we fit the outcome model with Q̄⁰_{n,v}(A, W) on the training set and use the fit to make predictions on the validation set. The V sets of validation set predictions yield one estimate of the outcome prediction for each of the n subjects, denoted Q̄⁰_n(A_i, W_i) for the i-th subject. In the case of an observational study, we also might use SuperLearner to estimate the treatment mechanism, E_{P_0}[A | W], with g_{n,v}(A | W) and predict on the validation set. The V sets of validation set predictions yield one prediction of the treatment assignment probability for each of the n subjects, denoted by g_n(A_i, W_i). The reader may note that for the one-step TMLE algorithm for ATE and VTE we would have just fit the outcome model and treatment mechanism on the whole data and computed predictions Q̄⁰_n and g_n on that same data using SuperLearner (Polley, 2009). Such is the fundamental difference between the two procedures: CV-TMLE only uses predictions on validation sets.
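The fold bookkeeping can be sketched as follows (a minimal illustration with a deliberately trivial stand-in learner; the authors use the SuperLearner ensemble in R, which is not reproduced here):

```python
import numpy as np

def cross_fit_predictions(Y, V=10, seed=0):
    """Return out-of-fold predictions: each subject is predicted by a fit
    trained on the complement of its validation fold.

    A real learner would also take covariates; here the 'fit' is simply
    the training-set mean, as a stand-in for SuperLearner.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % V  # V disjoint validation sets covering all n
    preds = np.empty(n)
    for v in range(V):
        val = fold == v
        preds[val] = Y[~val].mean()  # stand-in for a fit on the training set
    return preds

rng = np.random.default_rng(3)
Y = rng.binomial(1, 0.3, size=200)
preds = cross_fit_predictions(Y)
```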

Initialize Targeting Step
Compute the negative log-likelihood loss for our outcome predictions, −[Y_i log(Q̄⁰_n(A_i, W_i)) + (1 − Y_i) log(1 − Q̄⁰_n(A_i, W_i))], where Y_i is the true outcome. Compute the clever covariate H_1(A, W) = (2A − 1)/g_n(A | W), and note H_1 will stay fixed for the entire process for this parameter.
The reader can notice our initial estimate of the parameter is (Ψ⁰_{1,n}, Ψ⁰_{2,n}) = (1/n Σ_i b⁰_n(W_i), 1/n Σ_i (b⁰_n(W_i) − Ψ⁰_{1,n})²), the sample mean and variance of our estimated blip function values.

The Targeting Step
If |P_n D*_{Ψj}(P^m_n)| ≤ σ_n(D*_{Ψj}(P^m_n))/n for j ∈ {1, 2}, then P*_n = P^m_n and go to step 4. σ_n(·) denotes the sample standard deviation as in section 2.2.1. This ensures that we stop the process once the bias is second order; recursions after this occurs are not fruitful. If |P_n D*_{Ψj}(P^m_n)| > σ_n(D*_{Ψj}(P^m_n))/n for some j, then set m = m + 1 and go to step 3.

step 3
Define the following recursion, using Euclidean inner product notation, ⟨·, ·⟩, the same as a dot product, with step size dε set to 0.0001 (going smaller only costs more computation without improving accuracy). For the case of TMLE, this recursively defines an estimate, Q̄^m_n(A, W), of the true outcome model; then return to step 2. Otherwise Q̄^m_n = Q̄*_n and continue to step 4.
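To convey the flavor of the recursion, here is a deliberately simplified one-step-style targeting loop for the ATE component alone (a sketch, not the paper's two-parameter canonical least favorable submodel: it nudges logit Q̄ along the fixed clever covariate H_1 in small dε steps until the stopping rule of step 2 is met; all data and the initial fit are simulated):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

rng = np.random.default_rng(4)
n = 1000
W = rng.normal(size=n)
g1W = np.full(n, 0.5)                # known randomized treatment mechanism
A = rng.binomial(1, g1W)
Y = rng.binomial(1, expit(0.3 + 0.4 * W + 0.5 * A))  # true outcome model

# Deliberately biased initial estimates of Qbar(1, W) and Qbar(0, W)
Q1W = expit(0.3 + 0.4 * W)
Q0W = expit(0.1 + 0.4 * W)
H1W = 1.0 / g1W                      # clever covariate evaluated at A = 1
H0W = -1.0 / (1.0 - g1W)             # and at A = 0
d_eps = 1e-4

def eic_ate(Q1W, Q0W):
    QA = np.where(A == 1, Q1W, Q0W)
    H = np.where(A == 1, H1W, H0W)
    blip = Q1W - Q0W
    return H * (Y - QA) + blip - blip.mean()

D1 = eic_ate(Q1W, Q0W)
while abs(D1.mean()) > np.std(D1, ddof=1) / n:   # stopping rule from step 2
    step = d_eps * np.sign(D1.mean())            # move along the least favorable direction
    Q1W = expit(logit(Q1W) + step * H1W)
    Q0W = expit(logit(Q0W) + step * H0W)
    D1 = eic_ate(Q1W, Q0W)

ate_star = (Q1W - Q0W).mean()        # targeted plug-in estimate
```

Each small step lowers the empirical loss while shrinking the empirical mean of the influence curve, so the loop halts once the bias term is below σ_n/n.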

step 4
Our estimates for ATE and VTE are Ψ*_{1,n} = 1/n Σ_i b*_n(W_i) and Ψ*_{2,n} = 1/n Σ_i (b*_n(W_i) − Ψ*_{1,n})², the sample mean and variance of the targeted blip function estimates. If the outcome was scaled as Y_s = (Y − a)/(b − a) (see section 2, paragraph 1), then the ATE and VTE estimates are multiplied by (b − a) and (b − a)², respectively, to place them back on the original scale.

Simultaneous Estimation and Confidence bounds
We often want to provide confidence intervals that simultaneously cover all the coordinates of Ψ(P_0) at a given significance level. The following is an added benefit of having the efficient influence curve at hand, for we can account for correlated estimates in a tighter manner than a Bonferroni correction (Dunn 1961). The reader may note we can generalize this procedure to any dimension, but we will use dimension 2 here as that is relevant to our parameter. After completing the above algorithm we have D*_Ψ(P*_n)(O_i) for each subject indexed by i ∈ 1 : n. Consider the 2-dimensional random variable Z_n = (Z_{n,1}, Z_{n,2}) ∼ N(0_{2×1}, Σ_n), defined by the two by two matrix, Σ_n, the sample correlation matrix of D*_Ψ(P*_n). Let q_{n,α} be the α-th quantile of M_n = max_{j∈{1,2}} |Z_{n,j}|, the maximum number of standard deviations over the coordinates of Z_n. We Monte Carlo sample 5 million draws of M_n to find q_{n,α}; 5 million is a sufficient number to guarantee very little error in finding the true q_{n,α}. Applying the continuous mapping theorem and the extended continuous mapping theorem (van der Vaart and Wellner 1996) assures us, under TMLE conditions, that q_{n,α} −→ q_α. q_{n,α} is therefore an estimate of the number of standard errors needed to simultaneously cover both true parameter values with probability tending to α, i.e., at significance level 1 − α. This results in the confidence bands Ψ*_{j,n} ± q_{n,α} σ_n(D*_j(P*_n))/√n, which will asymptotically cover all coordinates, Ψ_j(P_0), of Ψ(P_0) simultaneously at that significance level. The reader may note q_{n,α} is approximately the Bonferroni correction (Dunn 1961) if Σ_n is the identity matrix. As in section 2.2.1, σ_n(·) is the sample standard deviation. If the outcomes were scaled according to Y_s = (Y − a)/(b − a), then the standard error estimates for ATE and VTE are multiplied by b − a and (b − a)², respectively.
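The Monte Carlo step can be sketched directly (fewer draws than the 5 million used in the paper, purely for illustration; Σ here is a hypothetical correlation matrix):

```python
import numpy as np

def simultaneous_quantile(corr, coverage=0.95, draws=200_000, seed=5):
    """Quantile of max_j |Z_j| for Z ~ N(0, corr): the number of standard
    errors needed for simultaneous coverage of all coordinates."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=draws)
    M = np.abs(Z).max(axis=1)
    return np.quantile(M, coverage)

# Correlated estimates need fewer standard errors than independent ones.
q_corr = simultaneous_quantile(np.array([[1.0, 0.8], [0.8, 1.0]]))
q_indep = simultaneous_quantile(np.eye(2))
```

With the identity matrix the result is close to the Bonferroni cutoff for two coordinates; positive correlation between the ATE and VTE influence curves tightens the bands.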

Simulations
We performed two different kinds of simulations, the first primarily to verify the remainder conditions in the theory of TMLE (condition 2, section 2.2.1). The rest were performed to get a sense of what might occur with real data. Inference for all TMLEs used the sample standard deviation of the efficient influence curve approximation to form confidence intervals as per section 2. For logistic regression plug-in estimators of ATE and VTE, confidence bands were formed using the delta method and the influence curve for the beta coefficients for intercept, main terms and interactions (see Appendix B for the derivation). SuperLearner initial estimates had no accompanying measure of uncertainty, since there is little theory for such, even with bootstrapping (van der Vaart and Wellner 1996).

Simulations with Controlled Noise
Instead of drawing W, then A and then Y under a data generating distribution and then trying to recover the truth with various predictors or SuperLearner as we do later, we directly add heteroskedastic noise to Q̄_0 in such a way that the conditions of TMLE hold, and then use the noisy estimate as the initial estimate in the TMLE process. This does not necessarily match what happens in practice, because the noise we add is not related to the noise in the draw of Y given A and W. However, it is a valid way to directly test the conditions of TMLE in that we can control the noise so that the TMLE conditions hold and watch the asymptotics at play. We also note that we will assume g_0 is known, because the other second order terms for VTE, involving bias in estimating g_0, depend on double robustness in the same way as for the ATE, for which the properties of TMLE are already well-known (van der Laan and Daniel Rubin 2006; van der Laan and Rose 2011).

Simulation Set-up
We kept our propensity scores, g_0(1 | W), between about 0.17 and 0.83 so as to avoid poor performance from positivity violations (Petersen et al. 2012) for a binary outcome. The blip function was defined so that the true VTE is 0.0636. This is a substantial VTE, chosen to avoid getting near the parameter boundary at 0.
Below we illustrate the process for one simulation. For each sample size, n, we performed the simulation 1000 times. We note that rate is a number which we set to less than -1/4 (-1/3 in this case) in order to satisfy the TMLE conditions.
We note that we placed correlated noise on the true Q̄_0(1, W) and Q̄_0(0, W) so as to make the blip function "estimates" of similar noise variance as the initial "estimates" for Q̄_0(A, W). By a Taylor series expansion about the truth, it is easy to see the above procedure will satisfy the remainder term conditions of 2.2.1.
We have that Q̄⁰_n(1, W) is the true Q̄_0(1, W) perturbed by noise of order n^{rate}, and likewise for Q̄⁰_n(0, W); thus, trivially, E_0(b⁰_n(W) − b_0(W))² is of order n^{rate} with rate < −1/4. As previously mentioned, we need not worry about any second order terms but E_0(b⁰_n(W) − b_0(W))², because we are using the true g_0. Condition 3 of section 2.2.1 is easily satisfied, and Condition 1, the Donsker condition, is satisfied since our "estimated" influence curve, D*(Q̄⁰_n, g_0), depends on a fixed function of A and W with the addition of independently added random normal noise.
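A sketch of the controlled-noise construction (the logistic form of Q̄_0, the correlation, and the noise scale are hypothetical stand-ins; the point is only that the perturbation shrinks as n^rate with rate = -1/3):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def noisy_initial_estimate(Qbar0, n, rate=-1/3, rho=0.9, seed=6):
    """Perturb the true (Qbar0(0,W), Qbar0(1,W)) on the logit scale with
    correlated normal noise whose variance shrinks as n**(2*rate)."""
    rng = np.random.default_rng(seed)
    cov = n ** (2 * rate) * np.array([[1.0, rho], [rho, 1.0]])
    Z = rng.multivariate_normal(np.zeros(2), cov, size=len(Qbar0[0]))
    Q0n = expit(logit(Qbar0[0]) + Z[:, 0])
    Q1n = expit(logit(Qbar0[1]) + Z[:, 1])
    return Q0n, Q1n

W = np.random.default_rng(7).uniform(-1, 1, size=50_000)
Qbar0 = (expit(-0.5 + W), expit(0.5 + 0.8 * W))  # hypothetical true outcome model
Q0n, Q1n = noisy_initial_estimate(Qbar0, n=1000)
blip_true = Qbar0[1] - Qbar0[0]
rmse_blip = np.sqrt(np.mean(((Q1n - Q0n) - blip_true) ** 2))
```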
The simulation result, displayed in figure 1, aligns with the theory established for the TMLE estimator of VTE, but how quickly the asymptotics come into play is an important issue for the relevance of the asymptotic theory.

Simulations That Are More Realistic
We will stick with binary outcome and treatment, though the results would be comparable for continuous outcome. Unless otherwise noted, the sample size is n = 1000 and the number of simulations is 1000. Throughout the simulations we generated covariates including W_3 ∼ standard normal and W_4 ∼ standard normal. These simulations are more realistic in that we try to recover, via machine learning, an "unknown" treatment mechanism and outcome model. When we specify the models correctly we are considering a "best case" scenario where our regressions achieve parametric rates of convergence to the truth. When we misspecify a model in the data generating system, we try to recover its non-linear functional form with ensemble machine learning, since a linear model including interactions (to pick up heterogeneity) can be catastrophic for estimating VTE. If we are going to estimate VTE, a main terms linear model will assume VTE is essentially 0, so comparing ensemble learning methods with such a model is not very informative.

Well-specified TMLE Initial Estimates, Skewing
"Well-specified" means the functional form is correct for both the outcome model and the treatment mechanism, as in fitting a logistic linear model for both. The only point of these simulations is to show that TMLE preserves excellent initial estimates and also to show approximately what size sample will lead to skewing (and therefore bias) of the sampling distribution for blip variance when the truth is near the lower parameter bound of 0. We can say as a rule of thumb that a sample size of 500 or more is probably needed to even hope to get reliable estimates for blip variances in the neighborhood of 0.025 (15.8% standard deviation), a rule confirmed by figures 2, 3 and 4. We used Q̄_0(A, W) = expit(0.14(2A + W_1 + aAW_1 − bAW_2 + W_2 − W_3 + W_4)) for the outcome regression, varying a and b to adjust the size of the blip variance. Initial estimates were formed with the SuperLearner ensemble (van der Laan, Polley, et al. 2007). Generally the best or nearly best learner in the library is the optimal convex combination of algorithms that forms the SuperLearner predictor in our simulations, and often in practice. The SuperLearner convex combination of algorithms might be more familiar to the reader as a form of model stacking (Wolpert 1992).
3.2.3 Use of the Highly Adaptive Lasso: Making Initial Predictions Q̄⁰_n and g_n
We refer the reader to the online supplementary materials for more complete details of the SuperLearner results. It is notable that any SuperLearner library containing the highly adaptive lasso will yield asymptotically efficient estimates for TMLE, assuming the true conditional mean outcome is right-hand continuous with left-hand limits (Neuhaus 1971) and has variation norm smaller than a constant M (van der Laan 2016). In finite samples, however, some machine learning algorithms might be better suited for prediction, and so we rely on ensemble learning.

SuperLearner Library 1, termed SL1, Avoiding Overfitting
This library will be indicated by "SL1" in the simulation results.
4. earthMain (Milborrow 2017) is a data-adaptive penalized regression spline fitting method, allowing for capturing the subtlety of the true functional form. We allowed degree = 2, which permits interaction terms, with the default penalty = 3 and minspan = 10 (minimum observations between knots).
5. SL.glm (R Core Team 2017) is logistic regression; we used main terms with the top 6 and the top 10 variables most correlated with the outcome, as well as a standard glm with main terms and interactions (glm mainint screen.Main).
6. SL.stepAIC (R Core Team 2017) uses the Akaike criterion in forward and backward stepwise regression.
7. SL.hal is the highly adaptive lasso (Benkeser and van der Laan 2016), which guarantees the necessary L² rates of convergence and therefore, if included in the SuperLearner library, guarantees asymptotic efficiency (van der Laan and Gruber 2016). hal output is guaranteed to be of finite sectional variation norm and thus is guaranteed to be Donsker and not to overfit the data.
8. SL.mean returns the mean outcome, as assurance against overfitting.
9. rpartPrune (Therneau et al. 2017) is recursive partitioning with cp = 0.001 (a split must decrease the loss by this factor), minsplit = 5 (minimum observations to make a split) and minbucket = 5 (minimum elements in a terminal node).

SuperLearner Library 2, termed SL2, More Aggressive, Overfits a Little
This library will be indicated by "SL2" in the simulation results. This library is identical to Library 1, except we added the following learners, which were tuned to minimize cross-validated loss on a few draws from the case 2a data generating distribution. Thus these additions do not severely overfit, in general.

Mixed Results and Need for Future Refinements
We again demonstrate how employing targeted learning with CV-TMLE can recover from misspecified propensity score and outcome models when parametric models are terrible, but coverage is below nominal and at times very poor, depending on the situation. This is not a problem solely for the case of an observational study, as the authors have found the main culprit in poor coverage to be the second order remainder term consisting of the integral of the squared difference between the true and estimated blip functions (see Appendix B for the remainder term derivation). Standard parametric models are again disastrous for both cases 2 and 3, never covering the truth and missing almost all of the VTE, as in case 1. Only in case 2 does CV-TMLE, using a pretty small SuperLearner library of 8 algorithms including xgboost, neural networks, glm with main terms and interactions, earth, the sample mean and the highly adaptive lasso (van der Laan 2016), achieve decent coverage of 83% and reduce the bias of the initial estimate from -0.015 to -0.009. In case 3, CV-TMLE with the same SuperLearner library only covers at 32%.
For both case 2 and case 3 we used the following model for the true treatment mechanism:
Case 2 true outcome model:
Case 3 true outcome model:

Demonstration on Real Data
The estimator described in the previous sections was applied to a real dataset. GER-INF is a placebo-controlled randomized trial published in 2002 that evaluated the impact on mortality of low-dose steroid administration in patients hospitalized in the intensive care unit (ICU) for septic shock (Annane et al. 2002).
This study was performed in 19 ICUs in France and enrolled a total of 299 patients. Because steroid supplementation in this context was expected to be beneficial in patients with relative adrenal insufficiency, a corticotropin stimulation test was performed in all patients. A pre-specified subgroup analysis was planned in patients who were nonresponders to the corticotropin stimulation test (relative adrenal insufficiency) and in those who responded to the test (no relative adrenal insufficiency). In this sample, 7 days of low-dose hydrocortisone combined with fludrocortisone was associated with a reduced risk of death in patients with septic shock. As expected, this reduction was significant in patients with relative adrenal insufficiency, as reflected by a lack of response to the corticotropin stimulation test. Because of the apparent heterogeneity in treatment effect, at least partially explained by the presence of relative adrenal insufficiency, we used the proposed estimator to quantify treatment effect variability across patients.
We controlled for the following confounders in fitting the outcome model: SAPS2 severity score (Simplified Acute Physiology Score, used to assess mortality), Sequential Organ Failure Assessment (SOFA) severity score at baseline, lactate level, cortisol level before corticotropin, an indicator of response to the corticotropin stimulation test, site of infection, mechanical ventilation at baseline, patient origin (hospital-acquired infection or not), an indicator of elective or urgent surgery, maximum difference in cortisol concentration before and after stimulation, an indicator of use of etomidate for anesthesia (a drug known to alter adrenal function), blood sugar, and the pathogen responsible for the infection. In this case, treatment assignment was random, and thus we can identify VTE from the data as a measure of how much of the heterogeneity in treatment effect is due to confounders normally used to assign treatment. We note that, for variables with missing data, we created an indicator of missingness and imputed the median or, for categorical variables, the most frequent category. The table below summarizes our data. We provide estimates below for ATE and VTE simultaneously, and we used the delta method to also give a confidence interval for √VTE, since it is on the scale of measurement of the ATE. We can see the left bounds of the confidence intervals for VTE and √VTE strayed into the negative numbers, which are not possible values for these parameters. Log-scaling the confidence intervals only makes them excessively large, and therefore not useful, because the unscaled confidence bands center close to 0. If we reference our discussion of skewing in section 3.2.1, we see that if the true VTE were as small as our estimate, then our sampling distribution, even under the best-case scenario of a well-specified outcome model, is skewed.
We would need a true VTE of around 0.06 to have any hope, even under the best-case scenario of a well-specified outcome model, of obtaining normal sampling distributions when estimating VTE. There is also the issue of second-order remainder term bias, as discussed, which could account for missing a truly larger VTE in our estimates.
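The delta-method interval for √VTE mentioned above follows from the derivative d√v/dv = 1/(2√v), so SE(√VTE) ≈ SE(VTE)/(2√VTE). A minimal Python sketch; the numbers are illustrative only, not the estimates from our data analysis:

```python
import math
from statistics import NormalDist

def sqrt_vte_ci(vte_hat, se_vte, alpha=0.05):
    # Delta method: d/dv sqrt(v) = 1 / (2 sqrt(v)), so
    # SE(sqrt(VTE)) ~= SE(VTE) / (2 * sqrt(VTE)).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    root = math.sqrt(vte_hat)
    se_root = se_vte / (2 * root)
    return root - z * se_root, root + z * se_root

# Illustrative numbers only, not the estimates from our data analysis:
lo, hi = sqrt_vte_ci(vte_hat=0.004, se_vte=0.008)
print(lo, hi)  # the left bound strays below zero
```

As with our real-data interval, when the VTE estimate is small relative to its standard error, the left bound of the Wald-type interval falls below zero, an impossible value for the parameter.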
The second-order remainder term in section 2.2.2 is the squared L^2-norm of the bias in estimating the blip function, so prioritizing the blip function in our outcome model estimation, in order to improve finite-sample performance, is a subject for future work.

SuperLearner
We used a SuperLearner library consisting of 40 algorithms, including main terms and interactions when applying any regression method, such as logistic regression, Bayesian generalized linear models, lasso and ridge regressions (combinations of L^1 and L^2 penalties) and earth, which uses data-adaptive regression splines. For boosting trees (xgboost), we used depth-2 trees to allow interaction, as well as depth-1 trees with main terms and interactions as the covariates. Also for boosting, we used different hyperparameters for the number of trees in combination with different learning rates. We used recursive partitioning and random forests, which are tree methods and therefore account for interactions, as well as neural networks, which are more non-parametric approaches. We also applied k-nearest neighbors for prediction. In addition, we applied screening of the top 25 variables correlated with the outcome before running each machine learning algorithm, and did the same for the top 5 variables when running Bayesian generalized linear models as well as glm. The convex combination of learners, or SuperLearner, had the lowest cross-validated risk (negative log-likelihood) of 0.536, with the random forest algorithms and the lasso with all interactions included as variables performing equally well. The discrete SuperLearner (Polley et al. 2017), which chooses the lowest-risk algorithm to form each fold's validation-set predictions, had a cross-validated risk of 0.57.
Without knowing which algorithm would perform the best a priori, we can see SuperLearner did the job it was supposed to do in combining our ensemble in an optimal way, according to cross-validated risk.
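The mechanics of the discrete versus convex-combination SuperLearner can be sketched as follows. This toy Python example uses squared-error loss and simulated out-of-fold predictions in place of our actual library and negative log-likelihood loss, and the NNLS-plus-normalization weighting is one simple metalearning choice, not necessarily the one used by the R packages:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Toy stand-ins for the cross-validated (out-of-fold) predictions of three
# candidate learners on n observations. Squared-error loss is used here
# for simplicity in place of the negative log-likelihood in the text.
n = 2000
y = rng.normal(size=n)
Z = np.column_stack([
    y + rng.normal(scale=0.5, size=n),  # a decent learner
    y + rng.normal(scale=1.0, size=n),  # a noisier learner
    np.zeros(n),                        # an SL.mean-style fallback
])

# Discrete super learner: pick the single learner with the lowest CV risk.
cv_risks = np.mean((Z - y[:, None]) ** 2, axis=0)
discrete = int(np.argmin(cv_risks))

# Convex-combination super learner: nonnegative least squares on the
# out-of-fold predictions, with the weights normalized to sum to one.
w, _ = nnls(Z, y)
w = w / w.sum()
sl_risk = np.mean((Z @ w - y) ** 2)
print(discrete, cv_risks[discrete], sl_risk)
```

In this toy example the weighted combination attains a lower cross-validated risk than the best single learner, mirroring the 0.536 versus 0.57 comparison reported above.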

Discussion
We can see there are two great challenges in estimating VTE, one being that the parameter is bounded below at 0, skewing and biasing the estimates when the true variance is too small for the sample size. In the future we might develop an improvement over log-scaling to form adjusted confidence bounds when they stray into the negative zone. To our eyes, this problem is not as crucial as obtaining reliable inference when the true variance is large enough to be estimated at the sample size. On this front, the second-order remainder term, the squared L^2(P_0) norm of the error in estimating the blip function, has proven difficult to contain in finite samples. We are confident that for larger samples we can show a SuperLearner library including the highly adaptive lasso (Benkeser and van der Laan 2016) will achieve the necessary rates of convergence for this term to be truly second order; however, this is not trivial to show conclusively via very time-consuming and computer-intensive simulations and, besides, a sample size of 1000 with 4 covariates and treatment is certainly of practical importance. To this end, the authors plan to propose, in a follow-up paper, a novel way to estimate the blip function that will also yield outcome predictions necessarily between 0 and 1, all the while performing asymptotically as well as just fitting the outcome model. We need blip function estimates, b(W), that are compatible with the outcome predictions in order to perform the crucial targeting step of the TMLE procedure, i.e., we need b(W) = Q̄(1, W) − Q̄(0, W).
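To illustrate the compatibility requirement, here is a minimal sketch, with a hypothetical fitted outcome model Q̄ (its functional form is invented for illustration), of plug-in ATE and VTE estimates computed from blip estimates that are compatible with the outcome predictions by construction:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fitted outcome model Q-bar(a, w) on the probability scale.
def Q_hat(a, w):
    return 1 / (1 + np.exp(-(-0.5 + 0.8 * a * w + 0.3 * w)))

# A compatible blip estimate is b(W) = Q-bar(1, W) - Q-bar(0, W), so the
# targeting step can fluctuate the outcome model and the blip together.
W = rng.uniform(-1, 1, size=5000)
blip = Q_hat(1, W) - Q_hat(0, W)

# Plug-in estimates: the sample mean of the blip estimates the ATE and
# its sample variance estimates the (un-targeted) VTE.
ate_hat = blip.mean()
vte_hat = blip.var()
print(ate_hat, vte_hat)
```

A blip estimate fit directly, outside the outcome model, would generally not satisfy b(W) = Q̄(1, W) − Q̄(0, W), and the targeting step could not then fluctuate both consistently.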
Another approach to yielding better coverage would be to account for the second-order remainder term via the new highly adaptive lasso (HAL) non-parametric bootstrap (van der Laan 2017) to form our confidence intervals. HAL, as mentioned previously in this paper, guarantees the necessary rates of convergence of the second-order remainder terms under very weak conditions and is thus guaranteed to yield asymptotically efficient TMLE estimates (van der Laan and Gruber 2016), using the empirical variance of the efficient influence curve approximation for the standard error. However, in finite samples such a procedure yields lower than nominal coverage due to the unforgiving second-order remainder term of non-doubly-robust estimators such as ours for VTE. We feel the HAL CV-TMLE, using a non-parametric bootstrap, addresses this issue and accounts for the second-order bias; this is the subject of further future work. Perhaps best will be combining the novel blip-fitting procedure with the HAL bootstrap.

Set Up
Our observed data, O, is of the form O = (W, A, Y), where W is a set of confounders, A is a binary treatment indicator and Y is the outcome, continuous or binary. O ∼ P ∈ M, a non-parametric model. It is important that we can factorize the density of P as p(o) = p_Y(y | a, w) g(a | w) p_W(w).

A.0.1 Tangent Space for Nonparametric Model
We consider the one-dimensional set of submodels that pass through P at ε = 0 (van der Vaart 2000): {P_ε defined by density p_ε = (1 + εS)p : ∫ S dP = 0, ∫ S^2 dP < ∞}. The tangent space, T, is the closure in L^2 norm of the set of scores, S, or directions for the paths defined above. We write T = T_W ⊕ T_A ⊕ T_Y. For a non-parametric model, T = L^2_0(P) forms a Hilbert space with inner product defined as ⟨f, g⟩ = E_P[fg].
Our notion of orthogonality is now f ⊥ g if and only if ⟨f, g⟩ = 0 and, therefore, the above direct sum is valid. In other words, every score, S, can be written as a sum of orthogonal components, one in each of T_W, T_A and T_Y.

A.0.2 Efficiency Theory in brief
Our parameter of interest is a mapping from the model, M, to the real numbers, Ψ : M → R. We note to the reader that we imply a direction, S, when we write P_ε, which has density p(1 + εS), but generally leave it off the notation as understood.
By the Riesz representation theorem (Riesz 1909) for Hilbert spaces, assuming the mapping in (1) is a bounded linear functional on T, it can be written in the form of an inner product, ⟨D*(P), S⟩, where D*(P) is a unique element of T, which we call the canonical gradient or efficient influence curve. Thus, in the case of a nonparametric model, the only gradient is the canonical gradient. Notably, the efficient influence curve has a variance that is the lower bound for that of any regular asymptotically linear estimator (van der Vaart 2000). Since the TMLE, under the conditions discussed in this paper, asymptotically achieves variance equal to that of the efficient influence curve, the estimator is asymptotically efficient.
As a note to the reader: our parameter mapping does not depend on the treatment mechanism, g, and also T_A ⊥ T_Y ⊕ T_W, which means our efficient influence curve must be in T_Y ⊕ T_W for the nonparametric model. Therefore, our efficient influence curve will have two orthogonal components, in T_Y and T_W respectively. We have no component in T_A, which is why we need not perform a TMLE update of the initial prediction, g_n, of g_0(A | W). This also teaches us that for the semi-parametric model, where the treatment mechanism is known, the efficient influence function will be the same as for the non-parametric model.

A.1 Derivation of the efficient influence curve
For the proof below, when we take derivatives we will assume they are in the direction of a given score, S ∈ T. We will assume a dominating measure ν and use ν throughout to denote the dominating measure of all densities involved. We could also just assume continuous densities and use Lebesgue measure, and nothing would change below.
Theorem A.1. Let Ψ(P) = var_P(b(W)). The efficient influence curve for Ψ at P is given by

D*(P)(o) = 2(b(w) − E_P b(W)) [(2a − 1)/g(a | w)] (y − Q̄(a, w)) + (b(w) − E_P b(W))^2 − Ψ(P).

Proof. By our previous discussion, we are guaranteed a unique representer, D*, in L^2_0(P) such that the pathwise derivative of Ψ can be written as ⟨D*, S⟩, and this D* will be our desired formula. Computing the first term in (4), the integral against (y − Q̄(a, w)) S(w, a, y) p(w, a, y) ν(d(o)) yields the aforementioned representer, completing the proof.

A.2 Remainder terms and asymptotic linearity:
The reader may recall the three conditions assuring asymptotic efficiency of the TMLE estimator. Here we will focus on the second condition, regarding the remainder term.
Theorem A.2. If P_0 is the true distribution, it is necessary to estimate the true blip function b_0 at a rate of n^(−1/4) in the L^2(P_0) norm in order for TMLE to be a consistent, asymptotically efficient estimator under a known treatment mechanism, g_0. If g_0 is unknown, we also need the product of the L^2(P_0) rates for estimating g_0 and Q̄_0 to be n^(−1/2).
Proof. For this discussion we will drop the subscript, n, and superscript, *, in P*_n and merely consider P as an estimate of the truth, P_0. We will use b(W) to denote the TE function where the conditional expectation is with respect to the distribution P, i.e., the estimated TE function, and b_0(W) to denote the true TE function.
Likewise, E 0 is the expectation with respect to the true observed data distribution, P 0 , and leaving the subscript, 0, off the expectation sign means the expectation is with respect to P .
Consider the (E_0 b_0(W) − E b(W))^2 term in (6) and note that, for an unknown g_0, it is well known that the double robustness of TMLE in estimating the causal risk difference, E_0 b_0(W), implies that if we estimate both g_0 and Q̄_0 so that the product of the respective L^2 rates of convergence is o(n^(−0.5)), then E_0 b_0(W) − E b(W) is asymptotically linear with influence curve D_1(P_0), the efficient influence curve for the causal risk difference. We therefore know E_0 b_0(W) − E b(W) converges to 0 in probability and, by Slutsky's theorem, the squared term converges weakly to 0. Therefore this term poses no additional problem relative to the rest of the terms. Now we can address the standard "double robust" term, where the last inequality follows from Cauchy-Schwarz and the strict positivity assumption on g_0. The difficult term in (6) is E_0(b_0(W) − b(W))^2, but if we obtain the required L^2 rates this term is clearly second order, and we have proven the theorem.

B.1 Defining the parameter of interest
Let us consider the nonparametric model, M. We will employ basic statistics to obtain the influence curve for a logistic regression plug-in estimator for both the mean and variance of the blip function, b. We define the plug-in estimator by plugging in the outcome conditional density given by the MLE for β, where Y and A are binary and P_γ(Y | A, W) is defined as a conditional density, for a fixed function m(· | ·).

B.2 Finding the MLE for the coefficients, β
Now, if we take n i.i.d. draws of O, we obtain the likelihood of the sample. We can then derive the multidimensional influence function via a Taylor series expansion about the truth. Note also that the derivative of the log-likelihood, or score, S_β(O), has mean 0 by assumption, so P_β S_β(O) = 0. Thus, by virtue of β_n being the MLE, we can consider the ‖β_n − β‖^2 term as second order. Consider the parameter Ψ_{a,w}(P) = expit(m(a, w | β)) for fixed (a, w). Since expit(m(a, w | β)) is a continuously differentiable function of β, we can apply the ordinary delta method as follows to find the influence curve of the plug-in estimator of Ψ_{a,w}(P).
We thus have the influence curve for the logistic regression MLE plug-in estimator; by the delta method we easily obtain the following.

B.5 Telescoping to find the 2-d influence curve for the 2-d parameter Ψ(P)
Note that Ψ_n = (Ψ_{1,n}, Ψ_{2,n}) is the MLE plug-in estimate for Ψ(P) = (Ψ_1, Ψ_2).
Now we can ignore the terms that are set aside, as they are part of the influence curve in the tangent space T_W. Therefore, the influence curve for the MLE-based estimate of Ψ_1(P), denoted IC_{Ψ_1,n}, is obtained from the terms not set aside. Thus we arrive at the following influence curve for the plug-in estimator of the two-dimensional parameter (Ψ_1(P), Ψ_2(P)). Note that this IC is written as a sum of two 2-dimensional vectors, one for the components of the IC in T_Y and the other for the components in T_W.

C Canonical Least Favorable Submodels Introduction
We offer a new way to construct a targeted maximum likelihood estimator for multidimensional parameters by defining the canonical least favorable submodel (clfm). A more detailed version of this section by Jonathan Levy is on arxiv.org, where we provide a precise general algorithm as currently implemented (gentmle2). For the purposes of this appendix, we will just show the reader how the clfm leads very naturally to the one-step TMLE (van der Laan and Gruber 2016) and can be viewed as an iterative version of the one-step TMLE. Other possible advantages of using the clfm have yet to be explored fully.
C.1 Mapping P_n^0 to P*_n: The Targeting Step
Here we will provide an alternate and simple construction of the universal least favorable submodel employed in this paper via the clfm.
Definition C.1. We can define a canonical 1-dimensional locally least favorable submodel (clfm) of an estimate, P_n^0, of the true distribution, where P_{n,ε=0}^0 = P_n^0 and ‖·‖_2 is the Euclidean norm, for a d-dimensional parameter mapping. This definition differs from the locally least favorable submodel (lfm) in van der Laan and Gruber, 2016, where the empirical mean of the loss spans the efficient influence curve. Here we can define a clfm with only a single ε, whereas with an lfm, we use an ε of dimension at least the dimension of the parameter of interest.
We can construct the universal least favorable submodel (ulfm) in terms of the clfm if we use the difference equation P_n(L(P_{n,dt}^0) − L(P_n^0)) ≈ ‖P_n D*(P_n^0)‖_2 dt, where P_n^{dt} = P_{n,dt}^0 is an element of the clfm of P_n^0. More generally, we can map any partition point t = m × dt, for an arbitrarily small dt, to an equation P_n(L(P_n^{t+dt}) − L(P_n^t)) ≈ ‖P_n D*(P_n^t)‖_2 dt, where P_n^{t+dt} is an element of the clfm of P_n^t. We can therefore recursively define the integral equation P_n(L(P_n^ε) − L(P_n^0)) = ∫_0^ε ‖P_n D*(P_n^t)‖_2 dt, and P_n^ε will thus be an element of the ulfm of P_n^0. For log-likelihood loss, which is valid both for a continuous outcome scaled between 0 and 1 and for binary outcomes, an analytic formula for a ulfm of a distribution with density p is therefore defined by the density p^ε = p × exp(∫_0^ε D*(P^t) dt) (van der Laan and Gruber 2016), where P^{t+dt} is an element of the clfm of P^t.
In applying the one-step TMLE, when the empirical loss is minimized at a given ε, we will have solved ‖P_n D*(P_n^ε)‖_2 = 0. Therefore, the loss is decreased and all influence curve equations are solved simultaneously, with a single ε, in one step. Specifically, P_n D*_j(P_n^ε) = 0 for all j. Thus P*_n = P_n^ε and we have defined the required TMLE mapping.
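As a toy illustration of the ulfm recursion, the following Python sketch targets the mean of a distribution, whose efficient influence curve is D*(P)(y) = y − μ(P), by repeatedly applying small exponential tilts in the least favorable direction until the empirical influence curve equation P_n D*(P_t) = 0 is solved up to tolerance. This is a crude discretization for intuition only, not the VTE procedure:

```python
import numpy as np

# Toy universal least favorable submodel recursion for the mean,
# mu(P) = E_P[Y], with efficient influence curve D*(P)(y) = y - mu(P).
# We fluctuate a discretized density in small steps dt, recomputing D*
# at each step, until P_n D*(P_t) is (approximately) solved at zero.
rng = np.random.default_rng(4)
y_obs = rng.normal(loc=0.3, scale=1.0, size=500)  # observed data
grid = np.linspace(-5, 5, 2001)
p = np.exp(-grid ** 2 / 2)                        # initial fit: N(0, 1)
p /= p.sum()                                      # discrete density on grid

dt = 0.001
for _ in range(10_000):
    mu_t = (grid * p).sum()                       # mu(P_t)
    pn_D = y_obs.mean() - mu_t                    # P_n D*(P_t)
    if abs(pn_D) < 2e-3:
        break
    # Small exponential tilt along the least favorable direction.
    p *= np.exp(dt * np.sign(pn_D) * (grid - mu_t))
    p /= p.sum()

print((grid * p).sum())  # close to y_obs.mean()
```

Because the influence curve is recomputed at every step, the whole path stays least favorable, which is the defining feature distinguishing the ulfm from a single local fluctuation.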

D An Easy Implementation of CV-TMLE Introduction
The original formulation and theoretical results of the cross-validated targeted maximum likelihood estimator, CV-TMLE (Zheng and van der Laan 2010), lead to an algorithm that generally requires 10 targeting steps, one for each of 10 validation folds, for each iteration of an iterative targeted maximum likelihood estimator, or TMLE (van der Laan and Rubin 2006). This can be done in one regression, which solves the efficient influence curve equation averaged over the validation folds. However, in this pooled regression, we must keep track of the means used in each fold, making the process different from a regular TMLE once the initial predictions have been formed. The formulation of the CV-TMLE herein leads to a simpler implementation of the targeting step, in that the targeting step can be applied identically as for a regular TMLE once the initial estimates for each validation fold have been computed. The CV-TMLE as discussed here is currently implemented in the tlverse R software package (Coyle et al. 2018).

D.1 CV-TMLE Definition for General Estimation Problem
We refer the reader to the sources cited above. This definition coincides with the least favorable submodel if d = 1, because in that case we will have the same fluctuation, where the above ‖·‖_2 is the Euclidean norm. We then define B_n ∈ {0, 1}^n to be a random split of the data into a training set and a validation set.

D.2 Illustrative Example, VTE
We will now go through the CV-TMLE algorithm for the VTE, the variance of the treatment effect. Here, we notice that we never target the distribution of W, but rather use the unbiased estimator, the empirical distribution. This is discussed in Zheng and van der Laan, 2010, so we refer the reader there for more detail as to why this is often the case. In short, the component of the efficient influence curve in the tangent space of mean-0 functions of W (van der Vaart 2000) is given by D*_W; for any approximation to this function, its empirical mean will automatically be zero. We denote the following to avoid heavy notation: Q̄^k_{B_n} is the approximation of the outcome model at the kth iteration. This fit is entirely dependent on the training-set (P^0_{n,B_n}) observations and the fluctuations to the model, performed on the corresponding validation set.
Q̄^k_{1,B_n} is the approximation of the outcome model at the kth iteration under our proposed procedure. We will see it actually depends on P^0_{n,B_n} and P^1_{n,B_n}, and hence on the entire empirical draw of the data.
• STEP 1: Initial estimates. For each split, B_n, as in standard 10-fold cross-validation, we use an ensemble learning package such as sl3 (Coyle et al. 2018) or SuperLearner (Polley et al. 2017) to fit a model on the training set, denoting the model P^0_{n,B_n}. In this case we fit the relevant factors of the likelihood, such as the propensity score and the outcome model, but not the distribution of the covariates, W; for that we use the empirical distribution as an unbiased estimator and do not target it. The initial fit of E_P[Y | A, W] is denoted Q̄^0_{B_n}. For both procedures the initial fits are all the same.
• STEP 2: Check tolerance. For each fold, evaluate the so-called clever covariate and the influence curve approximation; our proposed procedure instead computes the empirical mean within the clever covariate over the entire sample, yielding an alternate influence curve approximation. Thus, in our procedure we need not keep track of the folds, since the average within the clever covariate is merely taken over the entire sample. The process is therefore identical to a TMLE once the initial estimates are made: we just stack them on top of each other and act as if it is all one initial fit, as with the regular TMLE.
We then compute the influence curve approximation for each fold and take the sample mean. Since the T_W component, as stated above, always has empirical average 0, we only need to take the mean of the component of the influence curve approximation in the tangent space T_Y, the mean-0 functions of Y given A, W with finite variance (van der Vaart 2000). We then check whether the mean of the influence curve is below the tolerance level σ̂/n, where σ̂ is the sample standard deviation of the above influence curve computations. This assures we stop the process once the bias is second order, as any further fluctuations beyond that point are not helpful. If we are below the tolerance, we go to step 4.
Otherwise we continue onward.
• STEP 3: Targeting step. Run a pooled logistic regression over all the folds: a model which suppresses the intercept, uses the clever covariate as the sole covariate, and uses the initial predictions as the offset. This is identical to our method, except we would use the slightly different clever covariate as stated above.
• STEP 4: Compute the estimate and CI. Set Ψ_{k_n}(P*_n) = E_{B_n} Ψ_{Q̄}(P^0_{n,B_n}(ε_n^{k_n})) and estimate the standard error via the standard deviation of the influence curve in step 3 divided by root n, which we will just call σ̂/√n, forming the confidence bands Ψ_{k_n}(P*_n) ± z_α σ̂/√n, where z_α is the 1 − α/2 normal quantile. This entails computing the parameter separately per validation set before averaging the 10 estimates, i.e., computing the sample variance over the validation set for b̂^k_{B_n}, obtaining 10 estimates and then averaging them. In our procedure we just have a list of n values of b̂^k_{1,B_n} and compute the sample variance over the entire sample.
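The pooled targeting step described above can be sketched as follows. This is a minimal, single-iteration Python sketch assuming the VTE clever covariate 2(b(W) − mean(b))(2A − 1)/g(A | W), consistent with the T_Y component of the efficient influence curve in Appendix A, with simulated stand-ins for the stacked out-of-fold fits:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def target_once(Y, A, Q1, Q0, g1):
    """One pooled targeting step on the stacked out-of-fold fits, treated
    exactly like a regular TMLE initial fit. The clever covariate is
    2 * (b(W) - mean(b)) * (2A - 1) / g(A | W), with the mean of the
    blip taken over the entire stacked sample."""
    QA = np.where(A == 1, Q1, Q0)
    gA = np.where(A == 1, g1, 1 - g1)
    b = Q1 - Q0
    H = 2 * (b - b.mean()) * (2 * A - 1) / gA
    # Fit the fluctuation Y ~ offset(logit(QA)) + eps * H with intercept
    # suppressed, via Newton steps on the log-likelihood in eps.
    eps = 0.0
    for _ in range(25):
        p = expit(logit(QA) + eps * H)
        eps += np.sum(H * (Y - p)) / np.sum(H ** 2 * p * (1 - p))
    # Update both potential-outcome predictions with the fitted eps.
    Q1s = expit(logit(Q1) + eps * 2 * (b - b.mean()) / g1)
    Q0s = expit(logit(Q0) - eps * 2 * (b - b.mean()) / (1 - g1))
    return Q1s, Q0s

# Simulated stand-ins for the stacked out-of-fold initial fits:
rng = np.random.default_rng(3)
n = 4000
W = rng.uniform(-1, 1, n)
g1 = expit(0.4 * W)                      # propensity score
A = rng.binomial(1, g1)
Q1 = expit(0.5 + 0.7 * W)
Q0 = expit(-0.2 + 0.3 * W)
Y = rng.binomial(1, np.where(A == 1, Q1, Q0))

Q1s, Q0s = target_once(Y, A, Q1, Q0, g1)
vte_hat = np.var(Q1s - Q0s)              # plug-in VTE from targeted blip
print(vte_hat)
```

The key simplification is visible here: the mean inside the clever covariate is taken over the whole stacked sample, so no fold bookkeeping is required during targeting.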

D.3 Donsker Condition
In the original formulation of the CV-TMLE, we view the estimator as 10 plug-in estimators. To compute each of the 10 estimators, the targeting step is performed on the validation set. Since we can therefore condition on the training set from which the initial estimate is formed, we essentially have fixed functions, Q̄^0_{B_n} and ĝ_{B_n}, which we fluctuate on the validation set with a one-dimensional parametric submodel.
Thus the entropy is very low for the class of functions containing Q̄^k_{B_n} in the above algorithm. With our procedure the entropy is a little bigger, in that the function Q̄^k_{1,B_n} can be viewed as fixed, yet depends on an average over all validation sets (and is therefore very slightly inbred before the targeting step) as well as on the fluctuation parameter, ε, determined by the validation set. The influence curve approximation, D^k_{B_n}, defined above, will thus have similarly low entropy, as if we allowed another parameter in the parametric submodel.
Consider the following, which we pull from Zheng and van der Laan, 2010, for the convenience of the reader, where the index set contains ε_n with probability tending to 1. For a deterministic sequence δ_n → 0, define the subclasses F^{δ_n}(P^0_{n,B_n}) = {f_ε ∈ F(P^0_{n,B_n}) : ‖ε − ε_0‖ < δ_n}. If for a deterministic sequence δ_n → 0 we have E[Entro(F^{δ_n}(P^0_{n,B_n})) P_0 F(δ_n, P^0_{n,B_n})^2] → 0 as n → ∞, where F(δ_n, P^0_{n,B_n}) is the envelope of F^{δ_n}(P^0_{n,B_n}), then √n(P^1_{n,B_n} − P_0)(f_{ε_n} − f_{ε_0}) converges to 0 in probability. We note to the reader that we keep lemma 3.2 identical to what was in Zheng and van der Laan, 2010, except we do not condition solely on P^0_{n,B_n} when defining F(P^0_{n,B_n}). Such does not at all affect the truth of the lemma.

Again, assuming the remainder is o_P(1/√n), our estimator is asymptotically efficient if D^{k_n}_{B_n} converges to the true influence curve in L^2(P_0) (van der Laan and Rubin 2006).
For blip variance the remainder term conditions are no more strict than for the original formulation of the CV-TMLE.

D.4 Conclusion
This slight adjustment to the CV-TMLE algorithm is easier to implement and retains the same theoretical properties, as shown in our example here. It remains to formally generalize the class of TMLE's for which it is valid, but the example used herein gives the reader sufficient intuition for when this can be done. For one, it is clear that if any polynomial factor of a mean (assuming the mean converges) appears as a factor in the clever covariate, then the entropy will be similarly small, so this procedure covers many examples one might find in practice. The procedure overlaps exactly with the originally formulated CV-TMLE for many common parameters whose clever covariates contain no empirical means. It is a subject for future research whether this procedure has any advantages in finite samples, such as in the case of simultaneously estimating the ATE, which is then used as the centering in the VTE computation. This appears perhaps more sensible, but simulations have shown no appreciable difference in performance for VTE.