Open Access, CC BY 4.0 license. Published by De Gruyter, May 25, 2021.

A fundamental measure of treatment effect heterogeneity

Jonathan Levy, Mark van der Laan, Alan Hubbard and Romain Pirracchio

Abstract

The stratum-specific treatment effect function is a random variable giving the average treatment effect (ATE) for a randomly drawn stratum of potential confounders a clinician may use to assign treatment. In addition to the ATE, the variance of the stratum-specific treatment effect function is fundamental in determining the heterogeneity of treatment effect values. We offer two non-parametric plug-in estimators, the targeted maximum likelihood estimator (TMLE) and the cross-validated TMLE (CV-TMLE), to simultaneously estimate both the average and the variance of the stratum-specific treatment effect function. The CV-TMLE is preferable because it guarantees asymptotic efficiency under two conditions without needing entropy conditions on the initial fits of the outcome model and treatment mechanism, as required by TMLE. Particularly, in circumstances where data-adaptive fitting methods are very important to eliminate bias but hold no guarantee of satisfying the entropy condition, we show that the CV-TMLE sampling distributions maintain normality with a lower mean squared error than TMLE. In addition to verifying the theoretical properties of TMLE and CV-TMLE through simulations, we highlight some of the challenges in estimating the variance of the treatment effect, whose estimation lacks double robustness and might be biased if the true variance is small and the sample size insufficient.

MSC 2010: 62G05; 62G20; 62G35

1 Introduction and background

The stratum-specific treatment effect function (also referred to as the CATE function, for conditional average treatment effect [ATE]) is a random variable defined as the ATE for a randomly drawn stratum of patient characteristics, which are confounders (potential causal parents of treatment) in that they possibly affect the outcome and occur before treatment assignment. The average of the CATE function is the ATE. However, beyond the mean effect, of interest is also the variation in this effect across the strata of potential confounders. One measure of this variation is simply the variance of the CATE function (as a random function of confounders), or the variance of the treatment effect (VTE). For instance, though the ATE might indicate that a treatment is beneficial on average, a comparatively large VTE would indicate there might be a significant portion of the population that experiences deleterious effects from treatment. We construct the targeted maximum likelihood estimator (TMLE) [1,2] and the cross-validated TMLE (CV-TMLE) [3] to simultaneously estimate ATE and VTE, corresponding to the sample mean and variance of the CATE function estimates. We also provide conditions under which the estimators are asymptotically efficient.

Our implementation of both the TMLE and CV-TMLE fits under the general framework of targeted learning [2] based on the use of ensemble machine learning. We will see the CV-TMLE has an advantage over the TMLE in that it does not require a Donsker (entropy) condition on the initial predictions for the relevant nuisance parameters, enabling more flexibility in the ensemble learning we employ. The inference method proposed in this study relies on a sample standard deviation of the efficient influence curve approximation, thereby allowing substantial savings in computational time over a re-sampling based (bootstrap) approach. In fact, we show in Section 2.3 that without the targeting step in our estimator, employing the non-parametric bootstrap might not guarantee valid inference [4].

We also discuss the limitations of our VTE estimator. Despite clear advantages of targeted learning, we will see that estimating VTE lacks the desirable robustness properties present when estimating the ATE. Particularly for a randomized trial, although our CV-TMLE estimate for ATE is consistent, knowing the treatment mechanism does not guarantee consistent VTE estimates. This lack of robustness is due to a second-order remainder term that can make coverage unreliable. We offer future improvements to address this problem. Finally, we also discuss limitations in detecting VTE when the true VTE is small relative to the sample size.

Let us define $Y_a$ as a random variable, a counterfactual outcome drawn from a population under an intervention setting treatment to $a$, as per the Neyman–Rubin potential outcomes framework [5,6]. The variance in the counterfactual treatment effect from individual to individual can be defined as $\mathrm{var}(Y_1 - Y_0)$. However, this parameter is not generally identifiable, a problem first addressed by Neyman (1923) when attempting to compute the standard errors for the counterfactual average difference in yield of two crop varietals. Fisher (1951) also realized the difficulty in directly estimating the joint distribution of $Y_1$ and $Y_0$ and suggested that pairing a treated and untreated varietal in the same pot might closely resemble a set of counterfactual differences, $Y_1 - Y_0$ [7]. Strong assumptions must be made to identify $\mathrm{var}(Y_1 - Y_0)$ in the event it is not 0, such as assuming that for any given stratum of patient covariates, $X$, the quantiles of $Y_1 \mid X$ and $Y_0 \mid X$ are the same [8]. Heckman (1997) mentions combining the results of Cambanis (1976) and Frechet (1951) to bound $\mathrm{var}(Y_1 - Y_0) = \mathrm{var}(Y_1) + \mathrm{var}(Y_0) - 2\gamma_{Y_1, Y_0}\sqrt{\mathrm{var}(Y_1)\,\mathrm{var}(Y_0)}$ as a means to test whether the lower bound of the confidence interval includes a variance of 0 [8,9,10]. For randomized trial data, Ding, Feller et al. (2016) construct a Fisher randomization test of the null hypothesis that $\mathrm{var}(Y_1 - Y_0) = 0$ under the untestable assumption that there exists a universal $\tau$ so that $Y_1 = Y_0 + \tau$. However, this approach is not adapted to the situation where treatment is assigned based on confounders [11].

Cox (1958) assumes $\mathrm{var}(Y_1 - Y_0) = 0$ for predefined homogeneous subgroups, essentially assuming all the variation of $Y_1 - Y_0$ is due to known variables that affect the response to treatment, i.e., patient characteristics used to assign treatment [12]. Although we do not make this assumption, this is the variation we aim to capture by estimating $\mathrm{var}(E[(Y_1 - Y_0) \mid W])$, i.e., the variance of the CATE function, where $W$ is a vector of confounders. As an illustration, let $W$ be an indicator of gender, and $Y$ be a binary outcome indicating survival if the outcome is 1. Suppose males have a treatment effect value of $E[(Y_1 - Y_0) \mid W = \text{male}] = -0.3$ and the females, $E[(Y_1 - Y_0) \mid W = \text{female}] = 0.7$. Assuming males and females are of equal proportion in the population at hand, the VTE is 0.25 and the ATE is 0.2, thus giving a sense of the spread of CATE function values, which one could generally tail bound using Chebyshev's inequality. VTE does not capture the variance of the random variable $(Y_1 - Y_0 \mid \text{male})$. Hence, a clinician's perception of highly varying outcomes does not mean the VTE is high. Rather, one would want to estimate VTE to see if the varying outcomes were due to measured confounders.
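The arithmetic in the two-stratum example can be checked directly; the short R sketch below computes the ATE and VTE for the values above.

```r
# Check of the two-stratum example: effects -0.3 (males) and 0.7 (females),
# each stratum having probability 0.5.
cate  <- c(male = -0.3, female = 0.7)
probs <- c(0.5, 0.5)

ate <- sum(probs * cate)             # 0.2
vte <- sum(probs * (cate - ate)^2)   # 0.25
c(ATE = ate, VTE = vte)
```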

Estimating VTE with a substitution estimator depends on using an estimate of the outcome model, $E[Y_a \mid W]$, from which to estimate the CATE function, $E[Y_1 \mid W] - E[Y_0 \mid W]$, and then its variance over an estimate of the distribution of $W$, in our case the empirical distribution of $W$ (see Section 2.3). If one knows the CATE function, for instance, one would know an optimal rule for assigning treatment based on confounders [13]. One could also find subgroup-specific treatment effects. Tian et al. (2014) proposed a way to isolate interactions of treatment with confounders in a randomized trial by transforming the predictors of a parametric model [14]. The main idea is to form a variable, $z = 2A - 1$, where $A$ is the usual treatment indicator, and then put the interaction of this variable with the predictors in the outcome regression. This enables direct estimation of the CATE function from which one could obtain a point estimate of the VTE. One could also employ recursive partitioning [15] or random forests [16] to establish homogeneous (constant treatment effect) subgroups of the data. Such subgroups could be used to compute the VTE but, as noted in ref. [17], establishing overly coarse subgroups can fail to detect treatment effect heterogeneity. Instead, we propose to use TMLE or CV-TMLE to estimate the CATE function (and hence, the VTE) without limiting ourselves to tree-based or parametric modeling.

2 Methodology

For simplicity, we consider only a binary outcome but all the analyses herein can be extended to bounded continuous outcomes. For the latter, we can scale the outcome to be in $[0, 1]$ via the transformation $Y_s = \frac{Y - a}{b - a}$, where $a$ and $b$ are the minimum and maximum outcomes, respectively, obtained from the data or known a priori. We denote the mean outcome model as $\bar{Q}_P(A, W) = E_P[Y \mid A, W]$, where $W$, $A$ and $Y$ are the random variables for the data, i.e., confounders, treatment and outcome, respectively, and $P$ is the data generating distribution. For a binary outcome or a continuous outcome scaled as above, the mean loss given by $-E_{P_0}[Y \log(\bar{Q}_P(A, W)) + (1 - Y)\log(1 - \bar{Q}_P(A, W))]$ will be minimized at the true outcome model, $\bar{Q}_{P_0}$ [18,19]. Thus, for a scaled continuous outcome as described above or a binary outcome, the above loss is valid and, therefore, the targeting step (see Sections 2.2 and 2.3) is identical. After the CV-TMLE or TMLE algorithm is complete, one may convert the outcomes back to their original scale [20], and thus can form confidence intervals for ATE and VTE in the original outcome scale.
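As a small illustration of the scaling just described (a sketch with placeholder values, not the authors' code), a bounded continuous outcome is mapped to $[0,1]$ before fitting and the resulting estimates and standard errors are mapped back afterwards.

```r
# Scale a bounded continuous outcome to [0, 1]; map estimates back afterwards.
set.seed(1)
Y <- rnorm(100, mean = 5, sd = 2)        # example continuous outcome
a <- min(Y); b <- max(Y)                 # bounds from the data (or known a priori)
Y_scaled <- (Y - a) / (b - a)

# ... fit TMLE / CV-TMLE on Y_scaled; suppose it returns these (hypothetical) values:
ate_s <- 0.10; se_ate_s <- 0.02
vte_s <- 0.01; se_vte_s <- 0.004

# Back-transform: ATE and its SE scale by (b - a); VTE and its SE by (b - a)^2.
ate <- (b - a) * ate_s;   se_ate <- (b - a) * se_ate_s
vte <- (b - a)^2 * vte_s; se_vte <- (b - a)^2 * se_vte_s
```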

2.1 Full data statistical model and the link to the observed data

Our full data, including unobserved measures, are assumed to be generated according to the following structural equations [21,22,23]. We assume a joint distribution, $U = (U_W, U_A, U_Y) \sim P_U$, an unknown distribution of unmeasured variables. $O = (W, A, Y)$ are the measured variables. In the time ordering of occurrence we have $W = f_W(U_W)$, where $W$ is a vector of confounders; $A = f_A(U_A, W)$, where $A$ is a binary treatment; and $Y = f_Y(U_Y, W, A)$, where $Y$ is the outcome. Thus, we define a distribution $P_{U,O}$ via $(U, O) \sim P_{U,O}$.

Under this framework, $Y_a$ is a random outcome under $P_{U,O}$, where we intervene on the structural equations to set treatment to $a$, i.e., $Y_a = f_Y(U_Y, W, a)$. The full model, $\mathcal{F}$, consists of all possible $P_{U,O}$. The observed data model, $\mathcal{M}$, is linked to $\mathcal{F}$ in that we observe $O = (W, A, Y) \sim P$, where $O = (W, A, Y)$ is generated by $P_{U,O}$ according to the structural equations above. Our true observed data distribution is an element of the non-parametric statistical model, $\mathcal{M}$. In the case of a randomized trial, or if we have some knowledge of the treatment mechanism, $\mathcal{M}$ is considered a semi-parametric model and we will incorporate such knowledge. We will show that this knowledge is more helpful for estimating ATE than for VTE.

We define the stratum-specific treatment effect function, or CATE function, as $b_{P_{U,O}}(W) = E_{P_{U,O}}[Y_1 \mid W] - E_{P_{U,O}}[Y_0 \mid W]$. Our parameter of interest is a mapping from $\mathcal{F}$ to $\mathbb{R}^2$ defined by $\Psi^{\mathcal{F}}(P_{U,O}) = (E_{P_{U,O}} b_{P_{U,O}}(W),\ \mathrm{var}_{P_{U,O}} b_{P_{U,O}}(W))$. We will impose the randomization assumption [22,23] on $\mathcal{F}$, which states $Y_a \perp A \mid W$, as well as the positivity assumption on $\mathcal{M}$ so that $0 < \Pr(A = a \mid W) < 1$ for all $a$ and $W$. Defining $b_P(W) = E_P[Y \mid A = 1, W] - E_P[Y \mid A = 0, W]$ yields $b_{P_{U,O}}(W) = b_P(W)$ and we can identify the parameter of interest as a mapping from the observed data model, $\mathcal{M}$, to $\mathbb{R}^2$ via the g-computation formula (for the ATE) [24] and by taking the variance of $b_P(W)$. That is, $\Psi(P) = (E_P[b_P(W)],\ \mathrm{var}_P[b_P(W)]) = \Psi^{\mathcal{F}}(P_{U,O})$, i.e., ATE and VTE, respectively.

2.2 TMLE conditions and asymptotic efficiency

We refer the reader to refs. [1,2,26,27] for a more detailed description of the asymptotic theory of TMLE.

The efficient influence curve at a distribution, $P$, for the parameter mapping, $\Psi$, is a function of the observed data, $O \sim P$, notated as $D^{\Psi}(P)(O)$. Its variance gives the generalized Cramer–Rao lower bound for the variance of any regular asymptotically linear estimator [28]. In the present case, we consider a two-dimensional efficient influence curve, $D^{\Psi}(P) = (D_1(P), D_2(P))$, where the two components are the efficient influence curves for the ATE and VTE, respectively.

We will employ the notation $P_n f$ for the empirical average of a function, $f(O)$, and $P f$ for $E_P f(O)$. We will only consider the case in which our observed data consist of $n$ independent, identically distributed draws from the true data distribution, $P_0$. We also define a loss function, $L(P)(O)$, which is a function of the observed data, $O$, and is indexed by an element, $P$, of the observed data model, such that $E_{P_0} L(P)(O)$ is minimized when $P = P_0$. The TMLE procedure maps an initial estimate, $P_n^0$, of the true data generating distribution to $P_n^*$ such that $P_n L(P_n^*) \leq P_n L(P_n^0)$ and such that $P_n D^{\Psi}(P_n^*) = 0_{2\times 1}$. $P_n^*$ is called the TMLE of the initial estimate $P_n^0$. We can then write an expansion with second-order remainder term, $R_2$, as follows: $\Psi(P_n^*) - \Psi(P_0) = (P_n - P_0) D^{\Psi}(P_n^*) + R_2(P_n^*, P_0)$. The TMLE of the parameter is notated as $\Psi(P_n^*)$, since it is a substitution estimator (or plug-in estimator). This ensures our estimates will be within the parameter bounds for VTE, i.e., greater than or equal to 0.

2.2.1 Conditions for asymptotic efficiency

Let $\|f\|_{L^2(P)} = \sqrt{E_P f^2}$ be the $L^2$ norm. We assume the following TMLE conditions:

  1. $D_j(P_n^*)$ is in a $P$-Donsker class for all $j$. Note that this condition can be dropped in the case of using CV-TMLE [3].

  2. Second-order remainder condition: $R_{2,j}(P_n^*, P_0)$ is $o_P(1/\sqrt{n})$ for all $j$.

  3. $D_j(P_n^*) \to D_j(P_0)$ in $L^2(P_0)$ for all $j$.

Then, $\sqrt{n}(\Psi(P_n^*) - \Psi(P_0)) \xrightarrow{D} N[0_{2\times 1},\ \mathrm{cov}_{P_0}(D^{\Psi}(P_0))_{2\times 2}]$, where $\mathrm{cov}_{P_0}(D^{\Psi}(P_0)(O))$ is, in our case, a $2\times 2$ matrix with the $(i, j)$ entry given by $E_{P_0} D_i(P_0)(O) D_j(P_0)(O)$. The $i$th diagonal entry of $\mathrm{cov}_{P_0} D^{\Psi}(P_0)(O)$ is the variance of $D_i(P_0)$ and the limiting variance of $\sqrt{n}(\Psi_i(P_n^*) - \Psi_i(P_0))$ under the TMLE conditions. Thus, our TMLE and its CI given by

$$\Psi_j(P_n^*) \pm z_{\alpha}\, \frac{\hat{\sigma}_n(D_j(P_n^*))}{\sqrt{n}}$$

are guaranteed to be as small as possible for any regular asymptotically linear estimator, where $\Pr(Z \geq z_{\alpha}) = \alpha$, the user-chosen significance level, $Z$ is standard normal, and $\hat{\sigma}_n(D_j(P_n^*))$ is the sample standard deviation of $\{D_j(P_n^*)(O_i) : i \in 1{:}n\}$ [1]. Furthermore, if the TMLE conditions hold for the initial estimate, $P_n^0$, then they will also hold for the updated model, $P_n^*$ [26], thereby motivating the importance of using ensemble machine learning in constructing $P_n^0$.
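As a concrete illustration (a sketch, not the authors' implementation), given the vector of estimated efficient influence curve values for one component and the corresponding point estimate, the Wald-style interval above can be computed as follows, using the usual two-sided normal quantile.

```r
# Wald-style confidence interval from the efficient influence curve approximation.
ic_ci <- function(est, ic, alpha = 0.05) {
  n  <- length(ic)
  se <- sd(ic) / sqrt(n)              # sigma_hat_n(D_j) / sqrt(n)
  z  <- qnorm(1 - alpha / 2)          # two-sided normal quantile
  c(estimate = est, lower = est - z * se, upper = est + z * se)
}

# Example with simulated influence-curve values:
set.seed(1)
ic_ci(est = 0.05, ic = rnorm(500, 0, 0.4))
```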

2.2.2 Remainder term in VTE estimation

We define the true outcome model $\bar{Q}_0(A, W) = E_{P_0}[Y \mid A, W]$ and the true treatment mechanism $g_0(A \mid W) = \Pr(A \mid W)$, which we assume to be bounded away from 0 and 1. Let $\bar{Q}_n^0$ be the initial estimate of $\bar{Q}_0$, and $g_n$ the estimate of $g_0$. For estimating ATE and VTE, we will fluctuate an initial outcome model fit, $\bar{Q}_n^0$, to $\bar{Q}_n^*$ (see Section 2.3) but $g_n$ will not change, as in the study of van der Laan and Rubin (2006). We define the true CATE function, $b_0$, via $b_0(W) = \bar{Q}_0(1, W) - \bar{Q}_0(0, W)$ and $b_n$ as its estimate. The second-order remainder term for VTE, notated as $R_2(P_n^*, P_0)$, is given by:

(1) $\sqrt{n}\, R_2(P_n^*, P_0) = \sqrt{n}\left[\Psi(P_n^*) - \Psi(P_0) + P_0(D^{\Psi}(P_n^*))\right] = \sqrt{n}\,(E_0 b_0(W) - E b_n(W))^2$

(2) $+\ \sqrt{n}\, E_0\left[2(b_n(W) - E b_n(W))\left(\frac{g_0(1 \mid W) - g_n(1 \mid W)}{g_n(1 \mid W)}\big(\bar{Q}_0(1, W) - \bar{Q}_n^*(1, W)\big) - \frac{g_0(0 \mid W) - g_n(0 \mid W)}{g_n(0 \mid W)}\big(\bar{Q}_0(0, W) - \bar{Q}_n^*(0, W)\big)\right)\right]$

(3) $-\ \sqrt{n}\, E_0(b_0(W) - b_n(W))^2,$

which is computed in Theorem A.2 in the Appendix. Due to equation (3), it is necessary that $\|b_n - b_0\|_{L^2(P_0)} = o_P(n^{-1/4})$ and thus VTE estimation is not doubly robust. $\|\bar{Q}_n - \bar{Q}_0\|_{L^2(P_0)} = o_P(n^{-1/4})$ ensures the necessary condition and is sufficient to satisfy condition 2 of Section 2.2.1 if we know $g_0$, as in a randomized trial. If we do not know $g_0$, then it is sufficient that $\|\bar{Q}_n - \bar{Q}_0\|_{L^2(P_0)},\ \|g_n - g_0\|_{L^2(P_0)} = o_P(n^{-1/4})$. We therefore rely upon ensemble machine learning and its associated oracle inequality [29,30] to mitigate the remainder term bias, especially from equation (3), which can lead to poor coverage.

2.3 One-step CV-TMLE algorithm for ATE and VTE

The one-step TMLE algorithm [27] constructs a parametric submodel through the initial estimate, $P_n^0$, of the true distribution, $P_0$. The submodel is indexed by a single-dimensional parameter, $\varepsilon$, and given by $\{P_{n,\varepsilon}^0 : \varepsilon \in [-\delta, \delta]\}$, where $P_{n,\varepsilon=0}^0 = P_n^0$. The construction is performed recursively in such a way that we arrive in one step at an element of the submodel, $P_n^*$, which has minimum empirical average loss of all the elements and also solves the efficient influence curve equation, $P_n D^{\Psi}(P_n^*)(O) = 0_{d\times 1}$ (in our case, $d = 2$, but this holds for general dimension).

The efficient influence curve for Ψ ( P ) = ( Ψ 1 ( P ) , Ψ 2 ( P ) ) , i.e., ATE and VTE, has two components given by

$$D_1(P)(W, A, Y) = \frac{2A - 1}{g(A \mid W)}(Y - \bar{Q}(A, W)) + b_P(W) - \Psi_1(P)$$
$$D_2(P)(W, A, Y) = 2(b_P(W) - E_P b_P(W))\,\frac{2A - 1}{g(A \mid W)}(Y - \bar{Q}(A, W)) + (b_P(W) - E_P b_P(W))^2 - \Psi_2(P)$$

where $W$, $A$ and $Y$ are the confounders, binary treatment indicator and either a binary outcome or a continuous outcome scaled between 0 and 1, respectively, and $b_P(W) = E_P[Y \mid A = 1, W] - E_P[Y \mid A = 0, W]$. We refer the reader to Appendix A for details on the derivation. In our ensuing algorithm, if no heterogeneity is identified in the data, that is, if the estimated CATE function is identically 0, then the estimator will yield 0 as an estimate of the VTE with no confidence interval at all. So, for our algorithm to yield a confidence interval, there must be some heterogeneity detected.
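To make the two components concrete, the following R sketch (our own illustration, not the authors' code) evaluates $D_1$ and $D_2$ at given estimates of the outcome model and treatment mechanism.

```r
# Evaluate the two efficient influence curve components at estimates
# Qbar(a, W) (outcome model) and g1W = P(A = 1 | W) (treatment mechanism).
eif_ate_vte <- function(Y, A, Q1W, Q0W, QAW, g1W) {
  gAW <- ifelse(A == 1, g1W, 1 - g1W)
  bW  <- Q1W - Q0W                          # estimated CATE function
  ate <- mean(bW)
  vte <- mean((bW - ate)^2)
  H   <- (2 * A - 1) / gAW                  # clever covariate for the ATE
  D1  <- H * (Y - QAW) + bW - ate
  D2  <- 2 * (bW - ate) * H * (Y - QAW) + (bW - ate)^2 - vte
  list(ate = ate, vte = vte, D1 = D1, D2 = D2)
}
```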

In the algorithm below, we adjusted the original CV-TMLE procedure [3] for ease of computation without losing any theoretical properties or finite sample performance. Once we obtain initial estimates, there is no difference between CV-TMLE and TMLE as far as implementation is concerned. The reader may consult the technical report by Levy (2018b), for more details regarding the difference between this procedure and the originally defined CV-TMLE [3] regarding our parameter of interest and why neither require condition 1 in Section 2.2.1 [31].

2.3.1 Initial estimates

To perform a one-step CV-TMLE we will first randomly select $V$ folds (e.g., $V = 10$), consisting of $V$ disjoint validation sets of equal size, comprising all $n$ observations, and the corresponding training sets. Each training set is the complement of the corresponding validation set so that, for each fold, training set and validation set comprise all $n$ subjects. For $v$ in $1{:}V$ we perform the following: using the data-adaptive ensemble machine learning package, SuperLearner [32], we fit the true outcome model, $\bar{Q}_0(A, W) = E_{P_0}[Y \mid A, W]$, with $\bar{Q}_{n,v}^0(A, W)$ on the training set and use the fit to make predictions on the validation set. The $V$ sets of validation set predictions yield one estimate of the outcome prediction for each of the $n$ subjects, denoted $\bar{Q}_n^0(A_i, W_i)$ for the $i$th subject. In the case of an observational study, we also might use SuperLearner to estimate the treatment mechanism, $E_{P_0}[A \mid W]$, with $g_{n,v}(A \mid W)$ and predict on the validation set. The $V$ sets of validation set predictions yield one prediction of the treatment assignment probability for each of the $n$ subjects, denoted by $g_n(A_i \mid W_i)$. For the one-step TMLE algorithm for ATE and VTE, we would have just fit the outcome model and treatment mechanism on the whole data and computed predictions $\bar{Q}_n^0(A_i, W_i)$ and $g_n(A_i \mid W_i)$ on that same data using SuperLearner. Hence, the fundamental difference between CV-TMLE and TMLE is that CV-TMLE only uses predictions on validation sets.
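A minimal sketch of the cross-fitted initial estimates is given below; a plain logistic regression stands in for the SuperLearner fits described in the paper, and all names are ours.

```r
# W: data.frame of confounders; A: binary treatment; Y: binary outcome.
# Each observation's predictions come from a fit trained on the other folds.
cross_fit <- function(W, A, Y, V = 10) {
  n <- length(Y)
  folds <- sample(rep(seq_len(V), length.out = n))
  Q1 <- Q0 <- g1 <- numeric(n)
  dat <- data.frame(W, A = A, Y = Y)
  for (v in seq_len(V)) {
    tr <- folds != v
    va <- folds == v
    fitQ <- glm(Y ~ ., data = dat[tr, ], family = binomial())
    fitg <- glm(A ~ ., data = data.frame(W, A = A)[tr, ], family = binomial())
    Q1[va] <- predict(fitQ, newdata = transform(dat[va, ], A = 1), type = "response")
    Q0[va] <- predict(fitQ, newdata = transform(dat[va, ], A = 0), type = "response")
    g1[va] <- predict(fitg, newdata = W[va, , drop = FALSE], type = "response")
  }
  list(Q1 = Q1, Q0 = Q0, g1 = g1, folds = folds)
}
```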

Step 1: Initialization of targeting step

This step begins with computing the negative log-likelihood loss for our outcome predictions:

$$P_n L(P_n^0) = -\frac{1}{n}\sum_{i=1}^{n}\left[ Y_i \log \bar{Q}_n^0(A_i, W_i) + (1 - Y_i)\log(1 - \bar{Q}_n^0(A_i, W_i)) \right] = L^0,$$

the starting loss for the outcome, where $Y_i$ is the observed outcome, and we use the empirical distribution as the estimate of the distribution of $W$, as in ref. [1]. The initial estimate of $g_0$ does not need to be further modified. The only portion of the empirical mean loss lowered from the initial estimate is that of the outcome model, which is why this is the only portion of the mean loss included here. Continuous outcomes (between 0 and 1) would require minimization of the quasibinomial loss, while binary outcomes would require minimization of the negative log-likelihood (see Section 2).

The targeting step then requires computing $H_1^0(A_i, W_i) = \frac{2A_i - 1}{g_n(A_i \mid W_i)}$ and $H_2^0(A_i, W_i) = 2\left(b_n^0(W_i) - \frac{1}{n}\sum_{i=1}^{n} b_n^0(W_i)\right)\frac{2A_i - 1}{g_n(A_i \mid W_i)}$, where $b_n^0(W) = \bar{Q}_n^0(1, W) - \bar{Q}_n^0(0, W)$. We note that $H_1$ remains fixed for the entire process for this parameter. The next step is to compute $\|P_n D^{\Psi}(P_n^0)(O)\|_2$, where $\|\cdot\|_2$ is the Euclidean norm and

$$P_n D^{\Psi}(P_n^0) = (P_n D_1(P_n^0),\ P_n D_2(P_n^0)) = \left( \frac{1}{n}\sum_{i=1}^{n} H_1^0(A_i, W_i)(Y_i - \bar{Q}_n^0(A_i, W_i)),\ \frac{1}{n}\sum_{i=1}^{n} H_2^0(A_i, W_i)(Y_i - \bar{Q}_n^0(A_i, W_i)) \right).$$

The initial estimate of the parameter is $\left( \frac{1}{n}\sum_{i=1}^{n} b_n^0(W_i),\ \frac{1}{n}\sum_{i=1}^{n}\left( b_n^0(W_i) - \frac{1}{n}\sum_{j=1}^{n} b_n^0(W_j) \right)^2 \right)$, i.e., the sample mean and variance of the estimated CATE function.
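Continuing from the cross_fit sketch above, the Step 1 quantities can be computed in a few lines (again a sketch; the object names are ours).

```r
# Step 1 quantities, continuing from the cross_fit sketch (W, A, Y assumed given).
cf <- cross_fit(W, A, Y, V = 10)
Q1 <- cf$Q1; Q0 <- cf$Q0
QA <- ifelse(A == 1, Q1, Q0)
gA <- ifelse(A == 1, cf$g1, 1 - cf$g1)

b0    <- Q1 - Q0                                      # b_n^0(W_i)
psi10 <- mean(b0)                                     # initial ATE estimate
psi20 <- mean((b0 - psi10)^2)                         # initial VTE estimate
L0    <- -mean(Y * log(QA) + (1 - Y) * log(1 - QA))   # starting loss L^0

H1  <- (2 * A - 1) / gA
H2  <- 2 * (b0 - psi10) * H1
PnD <- c(mean(H1 * (Y - QA)), mean(H2 * (Y - QA)))    # P_n D^Psi(P_n^0)
```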

Step 2 to 4: The targeting steps

Step 2

If $|P_n D_j(P_n^m)| < \hat{\sigma}_n(D_j(P_n^m))/\sqrt{n}$ for $j \in \{1, 2\}$, then $P_n^* = P_n^m$ and we go directly to Step 4. $\hat{\sigma}_n(\cdot)$ denotes the sample standard deviation as described in Section 2.2.1. This step ensures that we stop the process once the bias is second-order, as recursions after this occurs are not fruitful. If $|P_n D_j(P_n^m)| > \hat{\sigma}_n(D_j(P_n^m))/\sqrt{n}$ for some $j$, then we set $m = m + 1$ and proceed to Step 3.

Step 3

Define the following recursion, using the Euclidean inner product notation, $\langle \cdot, \cdot \rangle_2$, the so-called dot product.

(4) $$\bar{Q}_n^m(A, W) = \mathrm{expit}\left( \mathrm{logit}(\bar{Q}_n^{m-1}(A, W)) + d\varepsilon \left\langle (H_1^{m-1}(A, W),\ H_2^{m-1}(A, W)),\ \frac{P_n(D^{\Psi}(P_n^{m-1}))}{\| P_n(D^{\Psi}(P_n^{m-1})) \|_2} \right\rangle_2 \right),$$

where d ε is set to 0.0001 (see Section 5).

This step then requires computing $L^m = -\frac{1}{n}\sum_{i=1}^{n}[Y_i \log \bar{Q}_n^m(A_i, W_i) + (1 - Y_i)\log(1 - \bar{Q}_n^m(A_i, W_i))]$. If $L^m \leq L^{m-1}$, then Step 2 is reiterated. Otherwise, $\bar{Q}_n^m = \bar{Q}_n^*$ and we proceed to Step 4.

Step 4

The proposed estimate for ATE and VTE is $\Psi(P_n^*) = \left( \frac{1}{n}\sum_{i=1}^{n} b_n^*(W_i),\ \frac{1}{n}\sum_{i=1}^{n}\left( b_n^*(W_i) - \frac{1}{n}\sum_{j=1}^{n} b_n^*(W_j) \right)^2 \right)$, where $b_n^*(W_i) = \bar{Q}_n^*(1, W_i) - \bar{Q}_n^*(0, W_i)$. If the outcome was scaled as $(Y - a)/(b - a)$ (see Section 2, paragraph 1), then the estimate is $\left( \frac{b - a}{n}\sum_{i=1}^{n} b_n^*(W_i),\ \frac{(b - a)^2}{n}\sum_{i=1}^{n}\left( b_n^*(W_i) - \frac{1}{n}\sum_{j=1}^{n} b_n^*(W_j) \right)^2 \right)$. Standard errors are given by $\hat{\sigma}_n(D_j(P_n^*))/\sqrt{n}$ as in Section 2.2.1, where $\hat{\sigma}_n(\cdot)$ is the sample standard deviation. If the outcomes are scaled according to $Y_s = (Y - a)/(b - a)$, then the standard error estimates for ATE and VTE need to be multiplied by $b - a$ and $(b - a)^2$, respectively.
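Putting Steps 2 to 4 together, the recursion can be sketched compactly in R (our own unoptimized sketch; for brevity it omits the loss-monitoring check of Step 3 and simply caps the number of iterations).

```r
# One-step targeting recursion for ATE and VTE (a sketch of Steps 2-4).
target_vte <- function(Y, A, Q1, Q0, g1, d_eps = 1e-4, max_iter = 1e5) {
  n <- length(Y)
  for (m in seq_len(max_iter)) {
    QA  <- ifelse(A == 1, Q1, Q0)
    b   <- Q1 - Q0
    ate <- mean(b); vte <- mean((b - ate)^2)
    H1A <- (2 * A - 1) / ifelse(A == 1, g1, 1 - g1)
    D1  <- H1A * (Y - QA) + b - ate
    D2  <- 2 * (b - ate) * H1A * (Y - QA) + (b - ate)^2 - vte
    PnD <- c(mean(D1), mean(D2))
    # Step 2: stop once |P_n D_j| < sigma_hat_j / sqrt(n) for both components.
    if (all(abs(PnD) < c(sd(D1), sd(D2)) / sqrt(n))) break
    # Step 3: small move along the normalized direction P_n D / ||P_n D||_2.
    dir <- PnD / sqrt(sum(PnD^2))
    H1_1 <- 1 / g1;         H2_1 <- 2 * (b - ate) * H1_1   # clever covariates at a = 1
    H1_0 <- -1 / (1 - g1);  H2_0 <- 2 * (b - ate) * H1_0   # clever covariates at a = 0
    Q1 <- plogis(qlogis(Q1) + d_eps * (H1_1 * dir[1] + H2_1 * dir[2]))
    Q0 <- plogis(qlogis(Q0) + d_eps * (H1_0 * dir[1] + H2_0 * dir[2]))
  }
  # Step 4: plug-in estimates and influence-curve standard errors.
  b <- Q1 - Q0; ate <- mean(b); vte <- mean((b - ate)^2)
  QA  <- ifelse(A == 1, Q1, Q0)
  H1A <- (2 * A - 1) / ifelse(A == 1, g1, 1 - g1)
  D1  <- H1A * (Y - QA) + b - ate
  D2  <- 2 * (b - ate) * H1A * (Y - QA) + (b - ate)^2 - vte
  list(ate = ate, vte = vte, se = c(sd(D1), sd(D2)) / sqrt(n))
}
```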

2.3.2 Simultaneous estimation and confidence intervals

From the steps described above, we have $D^{\Psi}(P_n^*)(O_i) = (D_1(P_n^*)(O_i),\ D_2(P_n^*)(O_i))$ for each subject indexed by $i \in 1{:}n$. Consider the two-dimensional random variable $Z_n = (Z_{n,1}, Z_{n,2}) \sim N(0_{2\times 1}, \Sigma_n)$, where $\Sigma_n$ is the two-by-two sample correlation matrix of $D^{\Psi}(P_n^*)$. Let $q_{n,\alpha}$ be the quantile of the random variable $M_n = \max(|Z_{n,1}|, |Z_{n,2}|)$ satisfying $\Pr(M_n \leq q_{n,\alpha}) = 1 - \alpha$. Let $Z = (Z_1, Z_2) \sim N[0_{2\times 1}, \Sigma]$, where $\Sigma$ is the correlation matrix of $D^{\Psi}(P_0)$, and let $q_{\alpha}$ be the corresponding quantile of $M = \max(|Z_1|, |Z_2|)$, i.e., the quantile of the random variable giving the maximum number of standard deviations over the coordinates of $Z$. We used 5 million Monte-Carlo samples from the random variable $M_n$ to find $q_{n,\alpha}$. Five million is a sufficient number to guarantee very little error in finding the true $q_{n,\alpha}$. Applying the continuous mapping theorem [4] assures, under the TMLE conditions, that $q_{n,\alpha} \to q_{\alpha}$ and that intervals of $\pm q_{\alpha}$ standard errors cover both coordinates of $\Psi(P_0)$ simultaneously at $(1 - \alpha)\times 100\%$. $q_{n,\alpha}$ is therefore an estimate of the number of standard errors needed to simultaneously cover both true parameter values at $(1 - \alpha)\times 100\%$. This results in the confidence intervals

$$\hat{\Psi}_{j,n} \pm q_{n,\alpha}\, \frac{\hat{\sigma}_n(D_j(P_n^*))}{\sqrt{n}},$$

which will asymptotically cover all coordinates, $\Psi_j(P_0)$, of $\Psi(P_0)$ simultaneously at the $(1 - \alpha)$ confidence level. The reader may note that $q_{n,\alpha}$ is the same as the Bonferroni correction [33] if $\Sigma_n$ is the identity matrix.
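The Monte-Carlo step for $q_{n,\alpha}$ is straightforward; below is a sketch (not the authors' code) that uses MASS::mvrnorm for the multivariate normal draws.

```r
# Monte-Carlo estimate of q_{n,alpha}: the cutoff such that
# max(|Z_1|, |Z_2|) <= q with probability 1 - alpha, where Z ~ N(0, Sigma_n).
simultaneous_cutoff <- function(D1, D2, alpha = 0.05, n_mc = 5e6) {
  Sigma_n <- cor(cbind(D1, D2))                 # sample correlation of the EIC components
  Z <- MASS::mvrnorm(n_mc, mu = c(0, 0), Sigma = Sigma_n)
  M <- pmax(abs(Z[, 1]), abs(Z[, 2]))
  quantile(M, 1 - alpha)                        # q_{n,alpha}
}
```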

3 Simulations

Two different types of simulations were performed: the first one primarily aimed at verifying the remainder conditions in the theory of TMLE (condition 2, Section 2.2.1); the second one evaluated the performance of our proposed estimators in the more realistic situation where the treatment mechanism and the outcome model are unknown and have to be estimated by modelling the observed data. For all TMLE candidates, inference was based on the sample standard deviation of the efficient influence curve approximation to form confidence intervals as per Section 2.2.1. For logistic regression (LR) substitution estimators of ATE and VTE, confidence intervals were formed by using the delta method and the influence curve for the beta coefficients for intercept, main terms and interactions (see Appendix B for the derivation).

3.1 Simulation study 1: simulations with controlled noise

In this first simulation study, we directly added heteroskedastic noise to Q ¯ 0 in such a way that the conditions of TMLE hold and then use the noisy estimate as the initial estimate in the TMLE procedure. This simulation does not necessarily match what happens in practice because the added noise is not related to the noise in the draw of Y given A and W . However, it is a valid way to directly test the conditions of TMLE. In this simulation, we assumed g 0 is known because the other second-order terms for VTE, involving bias in estimating g 0 , are dependent on double robustness in the same way as for the ATE, for which the properties of TMLE are already well known [1,2].

3.1.1 Simulation set-up

$W$ were simulated as follows: $W_1 \sim \mathrm{Uniform}[-3, 3]$, $W_2 \sim \mathrm{Binomial}(1, 0.5)$, $W_3 \sim N(0, 1)$ and $W_4 \sim N(0, 1)$. We defined $g_0(A \mid W) = \mathrm{expit}(0.5(0.8 W_1 + 0.39 W_2 + 0.08 W_3 - 0.12 W_4 - 0.15))$, which is the true density of $A$ given $W$. We kept our propensity scores between about 0.17 and 0.83 so as to avoid large practical positivity violations [34]. The density of $Y$ given $A$ and $W$ for a binary outcome was generated by $E_0[Y \mid A, W] = \bar{Q}_0(A, W) = \mathrm{expit}(0.2(0.1 A + 2 A W_1 - 10 A W_2 + 3 A W_3 + W_1 + W_2 + 0.4 W_3 + 0.3 W_4))$. Based on our definition of the CATE function, $b_0(W) = E_0[Y \mid A = 1, W] - E_0[Y \mid A = 0, W]$, we had $\Psi_2(P_0) = \mathrm{var}_0(b_0(W)) = 0.0636$.
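The data-generating process just described can be transcribed directly into R; this sketch follows the operators as reconstructed above and is intended only for orientation.

```r
# Data-generating process for simulation study 1 (a sketch).
expit <- plogis
sim1_draw <- function(n) {
  W1 <- runif(n, -3, 3); W2 <- rbinom(n, 1, 0.5); W3 <- rnorm(n); W4 <- rnorm(n)
  g1 <- expit(0.5 * (0.8 * W1 + 0.39 * W2 + 0.08 * W3 - 0.12 * W4 - 0.15))
  A  <- rbinom(n, 1, g1)
  Qbar <- function(a) expit(0.2 * (0.1 * a + 2 * a * W1 - 10 * a * W2 + 3 * a * W3 +
                                   W1 + W2 + 0.4 * W3 + 0.3 * W4))
  Y <- rbinom(n, 1, Qbar(A))
  data.frame(W1, W2, W3, W4, A, Y, g1 = g1, Q1 = Qbar(1), Q0 = Qbar(0))
}
```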

Below we illustrate the process for one simulation. For each sample size, $n$, we performed the simulation 1,000 times. We note that rate is a number which we set to be greater than $1/4$ ($1/3$ in this case) in order to satisfy the TMLE conditions.

  1. Define $\mathrm{bias}(A, W, n) = 1.5\, n^{-\mathrm{rate}}(0.2 + 1.5 A + 0.2 W_1 + W_2 - A W_3 + W_4)$.

  2. Define heteroskedasticity: $\sigma(A, W, n) = 0.8\, n^{-\mathrm{rate}}\,|3.5 + 0.5 W_1 + 0.15 W_2 + 0.33 W_3 W_4 - W_4|$.

  3. Define $b(A, W, n, Z) = \mathrm{bias}(A, W, n) + Z \times \sigma(A, W, n)$, where $Z$ is standard normal.

  4. Draw $\{Z_i\}_{i=1}^{n}$ and $\{X_i\}_{i=1}^{n}$, each from standard normals.

  5. $\bar{Q}_n^0(1, W_i) = \mathrm{expit}(\mathrm{logit}(\bar{Q}_0(1, W_i)) + b(1, W_i, n, Z_i))$.

  6. $\bar{Q}_n^0(0, W_i) = \mathrm{expit}(\mathrm{logit}(\bar{Q}_0(0, W_i)) + 0.5\, b(1, W_i, n, Z_i) + 0.75\, b(0, W_i, n, X_i))$.

  7. $\bar{Q}_n^0(A, W) = A\,\bar{Q}_n^0(1, W) + (1 - A)\,\bar{Q}_n^0(0, W)$.

We note that we placed correlated noise on the true $\bar{Q}_0(1, W)$ and $\bar{Q}_0(0, W)$ in order to make the CATE function "estimates" of similar noise variance as the initial "estimates" for $\bar{Q}_0(A, W)$. By a Taylor series expansion about the truth, the above procedure will satisfy the remainder term conditions of Section 2.2.1. We have that $\bar{Q}_n^0(1, W) = \bar{Q}_0(1, W) + \bar{Q}_0(1, W)(1 - \bar{Q}_0(1, W))\, b(1, W, n, Z) + O(b^2(1, W, n, Z))$ and likewise for $\bar{Q}_n^0(0, W)$, and thus $E_0(b_n^0(W) - b_0(W))^2$ is of order $n^{-2\cdot\mathrm{rate}}$, which is $o(n^{-1/2})$ since $\mathrm{rate} > 1/4$. In this case, using the true $g_0$ allows us to neglect all second-order terms but $E_0(b_n^0(W) - b_0(W))^2$. In addition, conditions 2 and 3 of Section 2.2.1 are satisfied and condition 1, the Donsker condition, is satisfied since our "estimated" influence curve, $D(\bar{Q}_n^0, g_0)$, depends on a fixed function of $A$ and $W$ with independently added random normal noise.
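The noise-injection procedure in steps 1-7 can be written compactly; the following R sketch (our reconstruction, with the operators as reconstructed above) produces the noisy initial fits from a draw of the data-generating process.

```r
# Noisy initial estimates for simulation 1 (a sketch; `dat` is a draw from sim1_draw).
noisy_initial <- function(dat, rate = 1/3) {
  n <- nrow(dat)
  bias  <- function(A, W) 1.5 * n^(-rate) *
    (0.2 + 1.5 * A + 0.2 * W$W1 + W$W2 - A * W$W3 + W$W4)
  sigma <- function(A, W) 0.8 * n^(-rate) *
    abs(3.5 + 0.5 * W$W1 + 0.15 * W$W2 + 0.33 * W$W3 * W$W4 - W$W4)
  Z <- rnorm(n); X <- rnorm(n)
  b1 <- bias(1, dat) + Z * sigma(1, dat)       # b(1, W, n, Z)
  b0 <- bias(0, dat) + X * sigma(0, dat)       # b(0, W, n, X)
  Q1n <- plogis(qlogis(dat$Q1) + b1)
  Q0n <- plogis(qlogis(dat$Q0) + 0.5 * b1 + 0.75 * b0)
  list(Q1 = Q1n, Q0 = Q0n, QA = ifelse(dat$A == 1, Q1n, Q0n))
}
```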

The simulation result, displayed in Figure 1, is in alignment with the theory established for the TMLE estimator of VTE.

Figure 1

3.2 Simulation study 2: realistic simulations

These simulations were more realistic in that the treatment mechanism and the outcome model are unknown and need to be approximated by using machine learning to model the observed data. Unless otherwise noted, the sample size was $n = 1{,}000$ and the number of simulations was 1,000. $W$ were generated as follows: $W_1 \sim \mathrm{Uniform}[-3, 3]$ and $W_2, W_3, W_4 \sim N(0, 1)$. Two situations were simulated: the first one a "best case" scenario where the models are correctly specified and thus our regressions achieve parametric rates of convergence to the truth; the second one where the models are misspecified and we try to recover their non-linear functional forms with ensemble machine learning.

3.2.1 TMLE with well-specified initial estimates

In this first simulation scenario, a LR was used to model the correct functional form for both the outcome model, $E[Y \mid A, W] = \bar{Q}(A, W)$, and the treatment mechanism, $E[A \mid W] = g_0(1 \mid W)$, thus achieving parametric rates of convergence. The goal of these simulations was to show that TMLE preserves well-specified initial estimates and also to show what sample size would result in substantial bias of the sampling distribution for VTE estimates when the truth is near the lower parameter bound of 0. We can say, as a rule of thumb, that a sample size of 500 or more is needed to get reliable estimates for VTEs in the neighborhood of 0.025, a rule that was confirmed by Figures 2, 3 and 4. We used $\bar{Q}(A, W) = \mathrm{expit}(0.14(2A + W_1 + a\,A W_1 - b\,A W_2 + W_2 W_3 + W_4))$ for the outcome regression, varying $a$ and $b$ to adjust the size of the VTE. $E[A \mid W] = g_0 = \mathrm{expit}(0.4 W_1 + 0.195 W_2 + 0.04 W_3 - 0.06 W_4 - 0.075)$ was the true treatment mechanism, and we avoided positivity violations by keeping our propensity scores mostly between 0.10 and 0.90.

Figure 2

Figure 3

Figure 4

3.2.2 Misspecified TMLE initial estimates

In this second simulation scheme, the models for the treatment mechanism and/or the outcome are misspecified, and we try to estimate the corresponding portion of the likelihood using ensemble machine learning.

3.2.2.1 SuperLearner

Targeted learning [2] features the use of data-adaptive prediction methods optimized by ensemble learning R packages, such as SuperLearner [32], H2O [35] or the most recent sl3 [36]. SuperLearner, which picks the best single algorithm in the library as decided by the cross-validation of a valid loss function, has mean loss that converges to that of the oracle selector at rate $O(\log(k(n))/n)$, where $k(n)$ is the number of candidate algorithms, under very mild assumptions on the library of estimators [29]. Generally, the best or nearly best learner in the library is the optimal convex combination of algorithms that forms the SuperLearner predictor, which is a form of model stacking [37].
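For orientation, a minimal SuperLearner call with a small library might look like the following sketch; the wrapper names shown are standard wrappers shipped with the package, not the customized wrappers described in the libraries below, and W, A, Y are assumed to be given.

```r
library(SuperLearner)

# Small illustrative library (standard wrappers, not the paper's custom wrappers).
sl_lib <- c("SL.mean", "SL.glm", "SL.glmnet", "SL.gam", "SL.ranger")

# W: data.frame of confounders, A: binary treatment, Y: binary outcome.
fitQ <- SuperLearner(Y = Y, X = data.frame(W, A = A),
                     family = binomial(), SL.library = sl_lib)
fitg <- SuperLearner(Y = A, X = W, family = binomial(), SL.library = sl_lib)

# Counterfactual predictions used to form the CATE function estimate.
Q1 <- predict(fitQ, newdata = data.frame(W, A = 1))$pred
Q0 <- predict(fitQ, newdata = data.frame(W, A = 0))$pred
```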

3.2.2.2 SuperLearner library 1

The algorithms included in this library (SL1) were selected to avoid overfitting:

  1. SL.hal is the highly adaptive lasso (HAL) [26,38]. Assuming the true outcome model and treatment mechanism (in the case it is unknown) are right-hand continuous with left-hand limits and have variation norm smaller than a constant M, then our estimates will be asymptotically efficient if we include SL.hal in the library [26].

  2. SL.gam3, a gam [39] using degree-3 smoothing splines, with screening of main terms, of the top ten variables most correlated with the outcome, and of the top six.

  3. SL.glmnet_1, SL.glmnet_2 and SL.glmnet_3 [40] performed lasso, an equal mix of lasso and ridge penalties, and ridge regression, respectively.

  4. nnetMain_screen.Main [41] is a neural network with decay = 0.1 and size = 5 using main terms.

  5. earthMain [42] is a data-adaptive penalized regression spline fitting method, which allows for capturing the subtlety of the true functional form. We allowed degree = 2, i.e., interaction terms, with the default penalty = 3 and minspan = 10 (minimum observations between knots).

  6. SL.glm [43] is LR; we used main terms, the top six variables most correlated with the outcome, and the top ten, as well as a standard glm with main terms and interactions (glm_mainint_screen.Main).

  7. SL.stepAIC [43] uses Akaike criterion in forward and backward step regression

  8. SL.mean returns the mean outcome for assurance against overfitting

  9. rpartPrune [44] is recursive partitioning with cp = 0.001 (a split must decrease the loss by this factor), minsplit = 5 (minimum observations to make a split) and minbucket = 5 (minimum elements in a terminal node).

3.2.2.3 SuperLearner library 2

This second library (SL2) was identical to Library 1, except the following learners were added:

  1. SL.ranger [45]: a random forest that sampled 3 features at each split, grew 2,500 trees and had a minimum leaf size set to 10.

  2. SL.xgboost [46]: one xgboost fit on all main terms and interactions with stumps (depth-1 trees), allowing a minimum of 3 observations per node, a learning rate of 0.001 and summing 10,000 trees. We also included an xgboost using depth-4 trees on main terms only, with the same shrinkage and minimum observations per node but only 2,500 trees.

3.2.3 Case 1: Well-specified treatment mechanism, misspecified outcome

In this case, the true models for the outcome and the treatment mechanism were defined as follows:

$E[Y \mid A, W] = \bar{Q}_0 = \mathrm{expit}(0.28 A + 2.8\cos(W_1) A + \cos(W_1) - 0.56 A (W_2 - 2) + 0.42\cos(W_4) A + 0.14 A W_1^2)$. $E[A \mid W] = g_0 = \mathrm{expit}(0.4 W_1 + 0.195 W_2 + 0.04 W_3 - 0.06 W_4 - 0.075)$, which we will specify correctly in all cases with a logistic linear fit. The resulting true ATE was 0.078 and the true CATE variance was equal to 0.085.

The results from case 1, as summarized in Figure 5, highlight the following points:

  1. Benefit of ensemble learning. Using a LR with main terms and interactions as a substitution estimator, with the delta method for inference, yielded a bias of 0.065 (the truth is 0.079), missing almost the entire VTE and covering at 0%. The TMLE could not help initial estimates based on the same LR, highlighting the risk of using a parametric model as opposed to ensemble learning.

  2. Difference in robustness between estimating ATE and VTE. The severely misspecified LR (main terms and interactions) initial estimate for the outcome model, paired with a well-specified treatment mechanism, yielded a TMLE for ATE (which is doubly robust) that covered at 95.6%, whereas for VTE it never covered the truth.

  3. The advantage of CV-TMLE over TMLE. The same SuperLearner used for the initial estimates yields bigger bias and variance for TMLE, as opposed to a normally distributed CV-TMLE sampling distribution. Some degree of overfitting, introduced by the random forest in about 20% of the simulations out of the library of 18 learners, generated outliers for TMLE, resulting in a lack of normality of the sampling distribution and causing higher bias and variance, whereas CV-TMLE appeared unaffected by the overfitting.

Figure 5

Truth at black vertical line. CV-TMLE SL2 uses a SuperLearner that overfits yet retains normality, has the lowest MSE and covers at 87.4%. TMLE SL2 uses the same SuperLearner yet is skewed, is the most biased and covers at 83.1% with the highest MSE. TMLE SL1 covers at 93.2% and uses a SuperLearner that does not overfit. Orange and yellow lines are mean estimates of the TMLE with logistic regression including interactions as initial estimates and of the plug-in of the outcome fit with logistic regression including interactions, neither of which ever covers the truth and both of which miss almost the entire CATE variance.

Regarding the Donsker condition 1 of Section 2.2.1, or entropy condition, a large Donsker class is the set of functions of uniformly bounded variation [4], meaning the function class is smooth in some sense, limiting the variation in outcome predictions that overfitting might allow. The efficient influence curve approximation, $D(P_n^*)$, is defined partially in terms of the outcome model fit and thus condition 1 of Section 2.2.1 depends on the outcome model being constrained to a Donsker class; otherwise, the TMLE might not be asymptotically normal. For instance, overfitting, which might manifest in modeling random noise in the data, might then estimate false heterogeneous responses to treatment and cause large outlier estimates for VTE, right-skewing the sampling distribution as we saw in the simulations for case 1. CV-TMLE does not require the Donsker condition and, therefore, allows the highly adaptive machine learning we might need to minimize the second-order remainder bias without losing the normality of our sampling distribution (Table 1).

Table 1

Performance of the estimators, case 1

Estimator      var      bias     MSE      Coverage
TMLE LR        0.00001  0.08207  0.00675  0
LR plug-in     0.00005  0.07134  0.00514  0
CV-TMLE SL2    0.00028  0.00930  0.00037  0.87375
CV-TMLE SL2    0.00028  0.00924  0.00037  0.88577
TMLE SL2       0.00057  0.01584  0.00082  0.83100
TMLE SL2       0.00057  0.01591  0.00082  0.86000
TMLE SL1       0.00033  0.00802  0.00040  0.93193
TMLE SL1       0.00033  0.00804  0.00040  0.93994

ATE and VTE were estimated simultaneously with the one-step TMLE, covering both parameters with 95% simultaneous confidence intervals. LR indicates logistic regression with main terms and interactions. The SL1 library did not overfit; the SL2 library overfit in 20% of the simulations with 1 out of 18 algorithms, still causing outliers and necessitating CV-TMLE as a precaution.

3.2.4 Case 2 and 3

For both cases 2 and 3, we used the following model for the true treatment mechanism:

$$E_0[A \mid W] = \mathrm{expit}(0.4(0.4 W_1 W_2 + 0.63 W_2^2 - 0.66\cos(W_1) - 0.25)).$$

Case 2 true outcome model:

$$E_0[Y \mid A, W] = \mathrm{expit}(0.1 W_1 W_2 + 1.5 A\cos(W_1) + 0.15 W_1 - 0.4 W_2\,(\mathrm{abs}(W_2) > 1) - W_2\,(\mathrm{abs}(W_2) \leq 1)).$$

Case 3 true outcome model:

$$E_0[Y \mid A, W] = \mathrm{expit}(0.2 W_1 W_2 + 0.1 W_2^2 - 0.8 A(\cos(W_1) + 0.5 A W_1 W_2^2) - 0.35).$$

Cases 2 and 3 again demonstrate how employing targeted learning with CV-TMLE can recover a misspecified treatment mechanism as well as misspecified outcome models when parametric models lead to substantially biased results. Indeed, for both cases, standard parametric models were shown to have almost zero coverage and to miss almost all of the VTE. However, TMLE coverage was below nominal and at times very poor, depending on the situation. This was not a problem solely for the case of an observational study, as the authors have found the main culprit in poor coverage to be the second-order remainder term consisting of the integral of the squared difference between the true and estimated CATE functions (see Section 2.2.2). Only in case 2 did CV-TMLE, using a fairly small SuperLearner library of eight algorithms (including xgboost, neural networks, glm with main terms and interactions, earth, the sample mean and the HAL [26]), achieve a coverage of 83% and reduce the bias of the initial estimate from 0.015 to 0.009. In case 3, CV-TMLE with the same SuperLearner library only covered 32%.

3.3 Supplementary results

The reader may visit Jonathan Levy’s github for instructions and software on how to reproduce the results obtained in this article, as well as more detailed results that were not included in this manuscript.

4 Demonstration on real data

The CV-TMLE estimator, as detailed in this article, was applied to a real dataset. GER-INF-05 is a placebo-controlled randomized trial published in 2002 that evaluated the impact on mortality of low-dose steroid administration in patients hospitalized in the intensive care unit (ICU) for septic shock [47]. This study was performed in 19 ICUs in France and enrolled a total of 299 patients. Because steroid supplementation in this context was expected to be beneficial in patients with relative adrenal insufficiency, a corticotropin stimulation test was performed in all patients. A pre-specified subgroup analysis was planned in patients who were non-responders to the corticotropin stimulation test (relative adrenal insufficiency) and in those who were responders to the corticotropin stimulation test (no relative adrenal insufficiency). In this sample, 7 days of low-dose hydrocortisone combined with fludrocortisone was associated with a reduced risk of death in patients with septic shock. As expected, this reduction was significant in patients with relative adrenal insufficiency, as reflected by a lack of response following the corticotropin stimulation test. Because of the apparent heterogeneity in treatment effect, at least partially explained by the presence of a relative adrenal insufficiency, we used the proposed estimator to quantify treatment effect variability across the strata of patients.

We controlled for the following confounders when fitting the outcome model: Simplified Acute Physiology Score (SAPS2) severity score to assess mortality, Sequential Organ Failure Assessment (SOFA0) severity score at baseline, lactate level, cortisol level before corticotropin (baseline cortisol), maximal cortisol level after corticotropin stimulation (delta cortisol), an indicator of responding to the corticotropin stimulation test (responder), site of infection, mechanical ventilation at baseline, patient origin (hospital acquired infection or not) (origin), an indicator for the reason of ICU admission, i.e., medical, elective surgery or urgent surgery (admission type), an indicator of use of etomidate for anesthesia (a drug known to alter the adrenal function), blood sugar (glycemia) and the responsible pathogen (germ). In this case, the treatment assignment was random and thus we can identify VTE from the data as a measure of how much of the heterogeneity in treatment effect is due to confounders normally used to assign treatment. We note that, for variables with missing data, we created an indicator of missingness and imputed the median or, for categorical variables, the mode. Table 2 summarizes our data.

Table 2

GER-INF summary, N = 299

Variable
Outcome: renal failure
Yes 173
No 126
Treatment: Rec. steroid
Yes 150
No 149
Age in years 60.8 (16.1)
SAPS2 62.6 (23.6)
SOFA0 11 (3.2)
Missing 24
Lactate 4.4 (3.3)
Baseline cortisol 23.3 (30.9)
Delta cortisol 6.1 (22.3)
Responder
Yes 70
No 229
Glycemia 175.8 (106.2)
Missing 6
Mechanical ventilation
Yes 298
No 1
Origin
Yes 112
No 187
Etomidate
Yes 76
No 213
Infection site
Multiple 144
Lung 89
GI 31
Soft tissue 16
Bacteremia 6
Other 12
Missing info 1
Admission type
1: 179
2: 10
3: 110
Germ
Type 1 72
Type 2 38
Type 3 4
Type 4 2
Type 5 24
Type 6 9
Type 7 150

We used a SuperLearner library consisting of 40 algorithms, including main terms and interactions when applying any regression methods, such as LR (glm), Bayes generalized linear models, lasso and ridge regressions (combinations of $L_1$ and $L_2$ penalty) and earth, which uses data-adaptive regression splines. For boosting trees (xgboost), we used depth-2 trees to allow interaction, as well as depth-1 trees with main terms and interactions as the covariates. Also for boosting, we used different hyperparameters for the number of trees in combination with different learning rates. We used recursive partitioning and random forest, as well as neural networks. We also applied $k$-nearest neighbors for prediction. In addition, we applied screening of the top 25 variables most correlated with the outcome before running the machine learning algorithm, and did the same for the top 5 variables when running Bayes generalized linear models as well as glm. The convex combination of learners (a.k.a. SuperLearner) had the lowest cross-validated risk (negative log-likelihood) of 0.536, with the random forest algorithms and the lasso with all interactions included as variables performing equally well. The discrete SuperLearner [32], or the one that chooses the lowest-risk algorithm to use for each fold's validation set predictions, had a cross-validated risk of 0.57.

We provide estimates in Table 3 for ATE and VTE simultaneously and used the delta method to also give a confidence interval for $\sqrt{\mathrm{VTE}}$, because it is on the same scale of measurement as the ATE. The lower bounds of the confidence intervals for VTE and $\sqrt{\mathrm{VTE}}$ are negative and therefore are not possible values for these parameters. Log-scaling the confidence intervals only makes them excessively large and therefore not useful, because the unscaled confidence intervals center close to 0. As mentioned in Section 3.2.1, if the true VTE is as small as the estimate, then the sampling distribution, even under the best-case scenario of a well-specified outcome model, is skewed. A true VTE of approximately 0.06 would be needed to have normal sampling distributions for estimating VTE in this scenario. As previously discussed, the second-order remainder term can also generate some bias, which could account for missing a truly larger VTE in our estimates. The second-order remainder term in Section 2.2.2 involves the square of the $L^2$-norm bias in estimating the CATE function, and so prioritizing the CATE function in the outcome model estimation is the subject of future work to improve CATE function estimation in finite samples.

Table 3

CV-TMLE results for simultaneous estimation of ATE, VTE and $\sqrt{\mathrm{VTE}}$: 95% confidence to cover the truth for all parameters

           est      se      Lower    Upper
ATE        -0.088   0.050   -0.199   0.023
VTE        0.002    0.004   -0.008   0.012
sqrt(VTE)  0.045    0.048   -0.061   0.152
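The $\sqrt{\mathrm{VTE}}$ row follows from the VTE estimate and its standard error by the delta method, $\mathrm{se}(\sqrt{v}) = \mathrm{se}(v)/(2\sqrt{v})$; the sketch below reproduces it approximately from the rounded values in Table 3, with the simultaneous cutoff $q$ set to an assumed illustrative value.

```r
# Delta-method standard error for sqrt(VTE): se(sqrt(v)) = se(v) / (2 * sqrt(v)).
# q is the simultaneous cutoff of Section 2.3.2 (value assumed for illustration).
vte <- 0.002; se_vte <- 0.004; q <- 2.24
sqrt_vte    <- sqrt(vte)
se_sqrt_vte <- se_vte / (2 * sqrt_vte)
c(est   = sqrt_vte,
  lower = sqrt_vte - q * se_sqrt_vte,
  upper = sqrt_vte + q * se_sqrt_vte)
```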

5 Discussion

The results of the simulation and the real case study highlight the existence of two great challenges in estimating VTE, one being the fact that the parameter is bounded below by 0, which skews and biases the estimates when the true variance is too small for the sample size. In the future, we might develop an improvement over log-scaling to form adjusted confidence bounds when the lower bound is negative. This problem does not seem as crucial as obtaining reliable inference when the true variance is large enough to be estimated for the sample size. On this front, the second-order remainder term condition has proven difficult to satisfy in finite samples so as to avoid bias and sub-optimal coverage. For larger samples, a SuperLearner library including the HAL [38] will likely achieve the necessary rates of convergence for this term to be truly second-order; however, this is not trivial to show conclusively via very time-consuming and computer-intensive simulations, and besides, a sample size of 1,000 with four covariates and a binary treatment is certainly of practical importance. To this end, the authors plan to propose, in a follow-up paper, a novel way to estimate the CATE function that will also yield estimates of the outcome predictions that are necessarily between 0 and 1, all the while performing asymptotically as well as just fitting the outcome model. We need CATE function estimates that are compatible with the outcome predictions in order to perform the crucial targeting step of the TMLE procedure. That is, we need the estimate of the CATE function to satisfy $b_n(W) = \bar{Q}_n(1, W) - \bar{Q}_n(0, W)$, where $\bar{Q}_n(A, W)$ is the estimate of the outcome model.

Another approach to yielding better coverage would be to account for the second-order remainder term via the use of the new HAL non-parametric bootstrap [48] to form our confidence intervals. HAL, as mentioned, guarantees the necessary rates of convergence of the second-order remainder terms under very weak conditions and is thus guaranteed to yield asymptotically efficient TMLE estimates [27], using the empirical variance of the efficient influence curve approximation for the standard error. However, in finite samples such a procedure yields lower than nominal coverage due to the second-order remainder term of Section 2.2.2 that arises when estimating VTE. The HAL CV-TMLE, using a non-parametric bootstrap, addresses this issue and can possibly account for the second-order bias, the subject of more future work.

There are two other procedures for mapping $P_n^0$ to $P_n^*$. Levy introduces a new iterative analog to the one-step TMLE procedure, which also uses one-dimensional parametric submodels, called canonical least favorable submodels [49]. The iterative TMLE in ref. [1] uses parametric submodels of the same dimension as the parameter of interest. It has been conjectured that the one-step TMLE may better preserve the properties of the initial fit, $P_n^0$, than the aforementioned iterative versions, thereby leading to better finite-sample behavior of the second-order remainder term $R_2(P_n^*, P_0)$ [27]. If this conjecture is correct, then we would expect similar gains by using the one-step TMLE in our setting. However, in our simulations we found no appreciable differences.

In Step 3 of Section 2.3, we used $d\varepsilon = 0.0001$ because this step size was small enough that we lost no appreciable accuracy in constructing the integrals that define the elements of the universal least favorable model of van der Laan and Gruber.

Acknowledgement

We would like to extend a special acknowledgment to Professor Djillali Annane who granted us real data for use in this paper.

  1. Conflict of interest: Prof. Mark van der Laan is one of the Editors in the Journal of Causal Inference but was not involved in the review process of this article.

Appendix The influence curve and remainder term derivations

Set up

Our observed data, $O$, are of the form $O = (W, A, Y)$, where $W$ is a set of confounders, $A$ is a binary treatment indicator and $Y$ is the outcome, continuous or binary. $O \sim P \in \mathcal{M}$, non-parametric. It is important that we can factorize the density of $P$ as $p(o) = p_Y(y \mid a, w)\, g(a \mid w)\, p_W(w)$.

A.0.1 Tangent space for nonparametric model

We consider the one-dimensional set of submodels that pass through $P$ at $\varepsilon = 0$ [26]: $\{P_{\varepsilon}$ defined by the density $p_{\varepsilon} = (1 + \varepsilon S)p : \int S\, dP = 0,\ \int S^2\, dP < \infty\}$. The tangent space is the closure in $L^2$ norm of the set of scores, $S$, or directions for the paths defined above. We write:

$$T = \overline{\{S(o) : ES = 0,\ ES^2 < \infty\}} = \overline{\{S(y \mid a, w) : E_{P_Y} S = 0,\ ES^2 < \infty\}} \oplus \overline{\{S(a \mid w) : E_{P_A} S = 0,\ ES^2 < \infty\}} \oplus \overline{\{S(w) : E_{P_W} S = 0,\ ES^2 < \infty\}} = T_Y \oplus T_A \oplus T_W.$$

For a non-parametric model, $T = L_0^2(P)$ forms a Hilbert space with inner product defined as $\langle f, g \rangle = E_P f g$. Our notion of orthogonality is now $f \perp g$ if and only if $\langle f, g \rangle = 0$ and, therefore, the above direct sum is valid. In other words, every score, $S$, can be written as $\frac{d}{d\varepsilon}\log(p_{\varepsilon})\big|_{\varepsilon=0} = S(w, a, y) = S_Y(y \mid a, w) + S_A(a \mid w) + S_W(w)$, where, due to the fact that $p_{\varepsilon} = (1 + \varepsilon S)p = p_{Y\varepsilon}\, p_{A\varepsilon}\, p_{W\varepsilon}$, it follows that $\frac{d}{d\varepsilon}\log(p_{Y\varepsilon})\big|_{\varepsilon=0} = S_Y(y \mid a, w)$, $\frac{d}{d\varepsilon}\log(p_{A\varepsilon})\big|_{\varepsilon=0} = S_A(a \mid w)$ and $\frac{d}{d\varepsilon}\log(p_{W\varepsilon})\big|_{\varepsilon=0} = S_W(w)$. Furthermore, we know that the projection of $S$ on $T_Y$ is

$$S_Y(y \mid w, a) = S(w, a, y) - E[S(W, A, Y) \mid W = w, A = a] = S(w, a, y) - \int S(w, a, y)\, p_Y(y \mid w, a)\, d\nu(y) = \frac{d}{d\varepsilon}\log p_{Y\varepsilon}(y \mid a, w)\Big|_{\varepsilon=0}$$

and thus we have the following identities:

(5) $\quad \frac{d}{d\varepsilon} p_{Y\varepsilon}(y \mid a, w)\Big|_{\varepsilon=0} = \big(S(o) - E[S(O) \mid w, a]\big)\, p_Y(y \mid a, w)$

(6) $\quad \frac{d}{d\varepsilon} p_{W\varepsilon}(w)\Big|_{\varepsilon=0} = \big(E[S(O) \mid w] - ES(O)\big)\, p_W(w) = E[S(O) \mid w]\, p_W(w)$

A.0.2 Efficiency theory in brief

Our parameter of interest is a mapping from the model, $\mathcal{M}$, to the real numbers given by $\Psi(P) = E_P(b(W) - E_P b)^2$, where $b(W) = E[Y \mid A = 1, W] - E[Y \mid A = 0, W]$. We can borrow from van der Vaart, who defines an influence function, otherwise known as a gradient, as a continuous linear map from $T$ to the reals given by

(7) $\quad \lim_{\varepsilon \to 0}\frac{\Psi(P_{\varepsilon}) - \Psi(P)}{\varepsilon} = \dot{\Psi}_P(S).$

We note to the reader that we imply a direction, $S$, when we write $P_{\varepsilon}$, which has density $p(1 + \varepsilon S)$, but we generally leave it out of the notation as understood.

By the Riesz representation theorem [50] for Hilbert spaces, if the functional defined in equation (7) is a bounded linear functional on $T$, it can be written in the form of an inner product $\langle D(P), S \rangle$, where $D(P)$ is a unique element of $T$, which we call the canonical gradient or efficient influence curve. Thus, in the case of a nonparametric model, the only gradient is the canonical gradient. It is notable that the efficient influence curve has variance equal to the lower bound for the variance of any regular asymptotically linear estimator [26]. Since the TMLE, under the conditions discussed in this article, asymptotically achieves variance equal to that of the efficient influence curve, the estimator is asymptotically efficient.

As a note to the reader: our parameter mapping does not depend on the treatment mechanism, $g$, and also $T_A \perp (T_Y \oplus T_W)$, which means our efficient influence curve must therefore be in $T_Y \oplus T_W$ for the nonparametric model. Therefore, our efficient influence curve will have two orthogonal components in $T_Y$ and $T_W$, respectively. We have no component in $T_A$, which is why we need not perform a TMLE update of the initial prediction, $g_n$, of $g_0(A \mid W)$. This also teaches us that for the semi-parametric model, where the treatment mechanism is known, the efficient influence function will be the same as for the non-parametric model.

A.1 Derivation of the efficient influence curve

For the proof below, when we take derivatives we will assume they are in the direction of a given score, $S \in T$. We will assume a dominating measure $\nu$ and use $\nu$ to denote the dominating measure of all densities involved. We could also just assume continuous densities and use Lebesgue measure, and nothing would change below.

Theorem A.1

Let $\Psi(P) = \mathrm{var}_P(b(W))$. The efficient influence curve for $\Psi$ at $P$ is given by:

$$D(P)(W, A, Y) = 2(b(W) - Eb(W))\,\frac{2A - 1}{g(A \mid W)}(Y - \bar{Q}(A, W)) + (b(W) - Eb)^2 - \Psi(P),$$

where $\bar{Q}(A, W) = E(Y \mid A, W)$.

Proof

(8)
$$\begin{aligned}
\frac{d}{d\varepsilon}\Psi(P_{\varepsilon})(S)\Big|_{\varepsilon=0} &= \frac{d}{d\varepsilon} E_{P_{\varepsilon}}\big(b_{P_{\varepsilon}}(W) - E_{P_{\varepsilon}} b_{P_{\varepsilon}}(W)\big)^2\Big|_{\varepsilon=0} = \frac{d}{d\varepsilon}\int \big(b_{P_{\varepsilon}}(w) - E_{P_{\varepsilon}} b_{P_{\varepsilon}}(W)\big)^2\, p_{\varepsilon}(o)\,\nu(do)\Big|_{\varepsilon=0} \\
&= \int 2\big(b_{P_{\varepsilon}}(w) - E_{P_{\varepsilon}} b_{P_{\varepsilon}}(W)\big)\,\frac{d}{d\varepsilon}\big(b_{P_{\varepsilon}}(w) - E_{P_{\varepsilon}} b_{P_{\varepsilon}}\big)\, p(o)\,\nu(do)\Big|_{\varepsilon=0} + \int \big(b_P(w) - E_P b_P(W)\big)^2\,\frac{d}{d\varepsilon} p_{W,\varepsilon}(w)\Big|_{\varepsilon=0}\,\nu(do) \\
&= \int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{d}{d\varepsilon} b_{P_{\varepsilon}}(w)\Big|_{\varepsilon=0}\, p(o)\,\nu(do) + \int \big(b_P(w) - E_P b_P(W)\big)^2\, E[S(O) \mid w]\, p_W(w)\,\nu(dw) \quad \text{from (6)} \\
&= \int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{d}{d\varepsilon} b_{P_{\varepsilon}}(w)\Big|_{\varepsilon=0}\, p(o)\,\nu(do) + E\big[\big((b_P(W) - E_P b_P(W))^2 - \Psi(P)\big) S(O)\big],
\end{aligned}$$
where, in passing to the third line, the derivative of the centering term $E_{P_{\varepsilon}} b_{P_{\varepsilon}}$ drops out because it is constant in $w$ and $\int 2(b_P(w) - E_P b_P(W))\, p_W(w)\,\nu(dw) = 0$.

Now we can compute the first term in (8).

$$\begin{aligned}
&\int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{d}{d\varepsilon}\left[\int y\, p_{Y\varepsilon}(y \mid a = 1, w)\,\nu(dy) - \int y\, p_{Y\varepsilon}(y \mid a = 0, w)\,\nu(dy)\right] p_W(w)\,\nu(dw)\Big|_{\varepsilon=0} \\
&\overset{(5)}{=} \int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{2a - 1}{g(a \mid w)}\, y\,\big(E_P[S(O) \mid y, a, w] - E_P[S(O) \mid a, w]\big)\, p_Y(y \mid a, w)\, g(a \mid w)\,\nu(d(y \times a))\, p_W(w)\,\nu(dw) \\
&\overset{\text{Fubini}}{=} \int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{2a - 1}{g(a \mid w)}\, y\, S(o)\, p(o)\,\nu(do) - \int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{2a - 1}{g(a \mid w)}\,\bar{Q}(a, w)\, S(o)\, p(o)\,\nu(do) \\
&= \int 2\big(b_P(w) - E_P b_P(W)\big)\,\frac{2a - 1}{g(a \mid w)}\,\big(y - \bar{Q}(a, w)\big)\, S(o)\, p(o)\,\nu(do).
\end{aligned}$$

Thus, $2(b(W) - Eb(W))\,\frac{2A - 1}{g(A \mid W)}(Y - \bar{Q}(A, W)) + (b(W) - Eb)^2 - \Psi(P)$ is the aforementioned representer, completing the proof.□

A.2 Remainder terms and asymptotic linearity

The reader may recall the three conditions assuring asymptotic efficiency of the TMLE estimator. Here, we will focus on condition 2 in Section 2.2.1, which requires the remainder term,

(9) $\quad R_2(P_n^*, P_0) = o_P(1/\sqrt{n}).$

Theorem A.2

Let $(W, A, Y)$ be the confounders, treatment indicator and outcome, respectively, with $(W, A, Y) \sim P_0$, the true distribution; let $\bar{Q}_0(A, W) = E_{P_0}[Y \mid A, W]$, let $g_0(A \mid W)$ be the true treatment mechanism, and let $\bar{Q}_n$ and $g_n$ be their estimated counterparts. Assume there exists $\delta > 0$ s.t. $\delta \leq g(A \mid W) \leq 1 - \delta$ for all $W$. A sufficient condition to guarantee (9) is $\|\bar{Q}_n - \bar{Q}_0\|_{L^2(P_0)},\ \|g_n - g_0\|_{L^2(P_0)} = o_P(n^{-0.25})$. It is necessary to estimate the true CATE function $b_0$ with $b_n$ so that $\|b_n - b_0\|_{L^2(P_0)} = o_P(n^{-0.25})$ in order to satisfy (9). Assume $\|\bar{Q}_n - \bar{Q}_0\|_{L^2(P_0)} = o_P(n^{-r_{\bar{Q}}})$ and $\|g_n - g_0\|_{L^2(P_0)} = o_P(n^{-r_g})$; then, in addition to the necessary condition, it is sufficient that also $r_{\bar{Q}} + r_g \geq 1/2$.