Skip to content
Publicly Available Published by De Gruyter January 10, 2015

Balancing Score Adjusted Targeted Minimum Loss-based Estimation

  • Samuel David Lendle EMAIL logo , Bruce Fireman and Mark J. van der Laan


Adjusting for a balancing score is sufficient for bias reduction when estimating causal effects including the average treatment effect and effect among the treated. Estimators that adjust for the propensity score in a nonparametric way, such as matching on an estimate of the propensity score, can be consistent when the estimated propensity score is not consistent for the true propensity score but converges to some other balancing score. We call this property the balancing score property, and discuss a class of estimators that have this property. We introduce a targeted minimum loss-based estimator (TMLE) for a treatment-specific mean with the balancing score property that is additionally locally efficient and doubly robust. We investigate the new estimator’s performance relative to other estimators, including another TMLE, a propensity score matching estimator, an inverse probability of treatment weighted estimator, and a regression-based estimator in simulation studies.

1 Introduction

Estimators based on the propensity score (PS), the probability of receiving a treatment given baseline covariates, are popular for estimation of causal effects such as the average treatment effect (ATE), average treatment effect among the treated (ATT), or the average outcome under treatment. Such methods can be thought of as adjusting for the propensity score in place of baseline covariates, and generally require consistent estimation of the propensity score if it is not known. Common propensity score methods include stratification or subclassification [13], inverse probability of treatment weighting (IPTW) [4, 5], and propensity score matching [68].

A “balancing score” as defined by Rosenbaum and Rubin [8] is a function of baseline covariates such that treatment and baseline covariates are independent conditional on that function. The propensity score is perhaps the most well-known example of a balancing score, but balancing scores are more general. Typically, propensity score-based methods are said to be consistent when the true propensity score is consistently estimated. Methods that adjust for the propensity score nonparametrically, such as matching or stratification by the propensity score, actually only need that the estimated propensity score converge to some balancing score in order for the parameter of interest to be estimated consistently. However, we are not aware of specific claims in the literature that particular propensity score-based methods are consistent under this weaker condition. We say that an estimator using the propensity score or other balancing score has the balancing score property if it is consistent when the estimated propensity score converges to a balancing score.

Though not guaranteed in general, it is possible for an estimated propensity score based on a misspecified model to converge to a balancing score that is not equal to the true propensity score. Propensity score-based estimators that have the balancing score property are robust to this sort of estimator misspecification of the PS, while other propensity score-based estimators are not. The balancing score property is desirable because, even though most such estimators were initially developed based on the PS specifically, they inherit this robustness for free. Estimators with the balancing score property are in general not efficient.

An efficient estimator is one that achieves the minimum asymptotic variance of all regular estimators. In many cases, for example when estimating the ATE, ATT, and average outcome under treatment, doubly robust estimators can be constructed. A doubly robust estimator is one that relies on an estimate of both the propensity score and of the outcome regression, the conditional mean of the outcome given baseline covariates and treatment. Doubly robust estimators are consistent if either the estimated propensity score or outcome regression is consistent. Examples include targeted minimum loss-based estimation (TMLE) [9, 10] and augmented inverse probability of treatment weighted estimation (A-IPTW) [11, 12]. In addition to being doubly robust, both TMLE and A-IPTW are efficient when both the propensity score and outcome regression are consistently estimated.

In this article, we discuss a general class of estimators that have the balancing score property. We also construct a TMLE [9, 10] with the balancing score property. This new TMLE not only has the benefit of the robustness provided by the balancing score property, it also is a locally efficient, doubly robust plug-in estimator. This means that our new estimator retains all of the attractive properties of a traditional TMLE while gaining robustness that other estimators with the balancing score property enjoy when the propensity score only converges to a balancing score.

In Section 2, we introduce notation and define the statistical parameter we wish to estimate. In Section 3 we describe a TMLE for the statistical parameter. In Section 4 we discuss the balancing score property and describe the proposed new TMLE. In Section 5 we compare the performance of the new estimator to a traditional TMLE as well as other common estimator and conclude with a discussion in Section 6. A list of notation used throughout the article is provided in Appendix A. Some results and proofs not included in the main text are in Appendix A.2 and two modifications to the TMLE algorithm are presented in Appendix A.3. An example implementation of the proposed new TMLE in R [13] is provided in Appendix A.4.

2 Preliminaries

Consider the random variable O=(W,A,Y) where W is a real-valued vector, A is binary with values in {0,1} and Y is univariate real number. Call the probability distribution of OP0M where M is the statistical model. Assume P0(A=1|W)>0 for almost every W. This is sometimes called a positivity assumption. Define the parameter mapping Ψ from M to that maps P to EP(EP(Y|A=1,W)) where EP denotes expected value under probability distribution PM.

Suppose A=1 indicates some treatment of interest and A=0 represents some control or reference treatment, W represents a vector of baseline covariates measured before treatment, and Y represents some outcome measured after treatment. Then under additional causal assumptions, Ψ(P0) can be interpreted as a causal quantity. In particular, we may assume that observed treatment A is independent of the counterfactual outcome had each observation received treatment 1 given covariates W. This is known as the randomization assumption or the “no unmeasured confounders” assumption, and the validity depends on the particular application. Under the randomization positivity assumptions, Ψ(P0) can be interpreted as the average outcome had everyone in the population received treatment 1. In this paper we focus on estimation of the statistical parameter Ψ(P0), but other similar statistical parameters can, under assumptions, be interpreted as causal parameters such as the ATE or the ATT [14].

For a probability distribution PM, Qˉ(a,w)=EP(Y|A=a,W=w) is the regression of the outcome on covariates and treatment. Let QW(w)=P(W=w) be the distribution of baseline covariates. The conditional distribution of treatment on baseline covariates is called g(a|w)=P(A=a|W=w), and define the propensity score as gˉ(w)=g(1|w), the probability of treatment given covariates w. The parameter mapping Ψ depends on P only through Q=(Qˉ,QW), so recognizing the abuse of notation, we sometimes write Ψ(P)=Ψ(Q)=Ψ(Qˉ,QW).

For a distribution PM, we make no assumptions on the outcome regression Qˉ or on the distribution QW of W. We may put some restriction on possible functions g, for example we may know that P(A|W) depends only on a subset of W. The model M is therefore nonparametric or semiparametric.

Let O1,,On be a data set of n independent and identically distributed random variables drawn from P0 where Oi=(Wi,Ai,Yi). We use the subscript 0 to denote the true probability distribution, and n to denote an estimate based on a dataset of size n, so, for example, E0 denotes expectation with respect to P0, Qˉ0(a,w)=E0(Y|A=a,W=w), and Qˉn is an estimate of Qˉ0. Let ψ0=Ψ(P0).

3 Targeted minimum loss-based estimation

A plug-in estimator takes an estimate of the distribution P0, or relevant parts of P0, and plugs it into the parameter mapping Ψ. In this case, Ψ depends on P through Qˉ and QW. Using an estimate Qˉn of Qˉ0, and letting QWn be the empirical distribution of W, we can calculate the plug-in estimate as


That is, we take the mean of Qˉn(1,W) with respect to the empirical distribution of W. Plug-in estimators are desirable because they fully utilize known global constraints of Q0 (by using an estimate Qn that satisfies these constraints) and guarantee that estimates are in the parameter space, even in small samples. Non-plug-in estimators such as IPTW can produce estimates outside of the parameter space. For instance if our estimand is a probability, a method like IPTW could yield an estimate outside of [0,1] when the sample size is small.

TMLE is a general framework for constructing a plug-in estimator for ψ0 with additional properties such as efficiency. TMLE takes an initial estimate of the outcome regression Qˉ0, say Qˉn0, and, using an estimate gˉn(W) of the propensity score, updates it to Qˉn. Using the empirical distribution of W along with the updated Qˉn, the final estimate is calculated as Ψ(Qˉn,QWn). The updated Qˉn is constructed in such a way that the final estimate is efficient or attains other properties. We now review some background and a specific implementation of the TMLE procedure for Ψ(P0).

An estimator that is asymptotically linear can be written as


for some mean zero function IC(P0) where oP(1) is a term that converges in probability to 0. The function IC(P0) is called the influence curve of the estimator at P0. For an estimator to be efficient, that is, to have the minimum asymptotic variance among all regular estimators, it must be asymptotically linear with influence curve equal to the so called efficient influence curve [9, 15]. The efficient influence curve for a particular parameter mapping Ψ depends on the model. For our model, regardless of the model for g0, the efficient influence curve at a PM written in terms of Q and g is


A derivation of the efficient influence curve is presented in Chapter 4 van der Laan and Rose [9].

Suppose for now Y is binary or bounded by 0 and 1. A modification to the algorithm and a different TMLE are described in Appendix A.3 if this is not the case. The initial estimate Qˉn0 can be obtained via a parametric model for E0(Y|A,W), such as a generalized linear model [16], or with a data adaptive machine learning algorithm such as the SuperLearner algorithm [9, 17], which combines parametric and data adaptive estimators using cross-validation.

The updating step is defined by a choice of loss function L for Q such that E0L(Q)(O) is minimized at Q0, and a working parametric submodel with finite dimensional real-valued parameter ε, {Q(ε):ε} such that Q(0)=Q. The submodel is typically chosen so that the efficient influence curve is in the linear span of the components of the “score” ddεL(Q(ε)(O)) at ε=0. When L is the negative log likelihood, ddεL(Q(ε)(O)) is the score in the usual sense. Starting with k=0, the empirical risk minimizer εnk=argminεi=1nL(Qnk(ε))(Oi) is calculated and Qnk is updated to Qnk+1=Qnk(εnk). The process is iterated until εk0, sometimes converging in one step. Details can be found in Refs [9, 10, 18, 19].

Define the loss function L(Q)(O)=LY(Qˉ)(O)+LW(QW)(O) where


and LW(QW)(O)=log(QW(W)). When Y is binary, LY(Qˉ)(O) is the negative conditional log likelihood of the Bernoulli distribution. Because Y is at least bounded by 0 and 1 if not binary, LY(Qˉ)(O) is a valid loss function for the conditional mean. That is, Qˉ0=argminQˉE0LY(Qˉ)(O) [20]. The function LW(QW)(O) is the negative log likelihood of the distribution of W, and its true mean is minimized by QW0. Thus, the sum loss function is a valid loss function for Q0=(Qˉ0,QW0).

For a working submodel for Qˉ, we use


indexed by ε. We call this a logistic working model because it is a logistic regression model with offset logitQˉ(A,W) and single covariate A/g(1|W). The score of this model at ε=0 is


For QW, we can use as working submodel


which has score Qˉ(1,W)Ψ(Qˉ,QW) at ε=0. We can see that the efficient influence curve D(P0) can be written as a linear combination of the scores of these submodels when Q=Q0 and g=g0.

The estimate εn0 can be calculated using standard logistic regression software with logit(Qˉn0(A,W)) as a fixed offset term, and A/gn(1|W) as a covariate. By using the empirical distribution of W as an initial estimate for QWn0, and negative log likelihood loss function for LW, the empirical risk is already minimized at QWn0, so εn0=0 and no update is needed. In this case, the algorithm converges in one step, because Agn(1|W) is not updated between iterations, so an additional update to Qˉn1 will yield εn1=0. The estimate Qˉn=Qˉn0(εn0) and the TMLE estimate of Ψ(P0) is calculated as


Under regularity conditions, the TMLE is asymptotically linear and doubly robust, meaning that if the initial estimate Qˉn0 is consistent for Qˉ0, or gˉn is consistent for gˉ0, then Ψ(Qˉn,QWn) is consistent for Ψ(P0). Additionally, when both Qˉn0 and gn are consistent, the influence curve of the TMLE is equal to the efficient influence curve, so the estimator achieves the semiparametric efficiency bound. Precise regularity conditions for asymptotic linearity and efficiency are presented in Appendix A.2 in Theorem 3.

4 Balancing score property and proposed estimator

A function b of W is called a balancing score if AW|b(W) [8]. Trivially, b(W)=W is a balancing score, and by definition of the propensity score, gˉ0(W), is a balancing score. In general, any function b(W) is a balancing score if and only if there exists some function f such that gˉ0(W)=f(b(W)) (Theorem 2 [8]). For example, any monotone transformation of the propensity is a balancing score. Such a function is called a “balancing score” because, conditional on b(W), the distribution of W between the treated and untreated observations is equal or balanced. That is, P0(W|A=1,b(W))=P0(W|A=0,b(W)). Rosenbaum and Rubin [8] show that adjusting for a balancing score yields the same estimand as adjusting for the full set of covariates W which we state in Lemma 1 and offer a different proof in Appendix A.2.

Lemma 1

Ifb(W)is a balancing score under distribution P, thenEP(EP(Y|A=1,b(W)))=Ψ(P).

This result gives rise to methods for estimating Ψ(P0) based on a balancing score and not on an estimate of Qˉ0. The propensity score is the balancing score most commonly used for estimating Ψ(P0), and frequently used estimators include propensity score matching, stratification, and IPTW. When the propensity score is not known, these estimators rely on an estimated propensity score gˉn, and, under regularity conditions, are consistent when gˉn is consistent for gˉ0. The IPTW estimator, in particular, requires that gˉn converges to gˉ0 for consistency. However, many of these methods, such as propensity score matching and stratification by the propensity score, can be seen as nonparametrically adjusting for the propensity score and only rely on the propensity score being a balancing score. For these estimators, it is sufficient for gˉn to converge to some balancing score under P0. We call this property the balancing score property.

In practice, an estimator gˉn can approximate a balancing score well but not converge to the true propensity score. A parametric logistic regression estimator will estimate some function of the covariates that is a projection of gˉ0 onto the model determined by the parametrization of the estimator. If the parametric estimator is correctly specified, this projection will be gˉ0. Depending on the true gˉ0 and distribution of covariates, it is possible for this projection to be a balancing score or at least approximate some balancing score when the estimator is not correctly specified. For example, suppose the true gˉ0 depends on higher order interactions of covariates. Though not the case in general, in some settings a main terms logistic regression may approximate a balancing score well. We explore such a setting via simulation in Section 5. In another example, suppose gˉ0 depends on covariates in an additive on the logit scale but not necessarily linear or even smooth way. A logistic regression estimator with linear or possibly higher order polynomial main terms may again approximate some balancing score.

Estimators based only on the propensity score are not doubly robust. We now construct a locally efficient doubly robust estimator with the balancing score property. We start with initial estimators Qˉn for Qˉ0 and gˉn for gˉ0. We then update Qˉn by nonparametrically regressing Y on A and gˉn(W) using Qˉn(A,W) as an offset. Similarly to the TMLE procedure in Section 3, we use this updated estimate of Qˉ0 to estimate ψ0 by plugging it in to the parameter mapping Ψ along with the empirical distribution of W.

To update Qˉn by further adjusting for A and gˉn, we specify a working model and loss function pair. The working model and loss function pair is somewhat analogous to that in the updating step in the TMLE procedure described in Section 3. The loss function can be the same as that in the TMLE procedure’s updating step, but it need not be. Define Qˉ and b to be the limits of Qˉn and gˉn, respectively, as n. Let Θ be the class of all functions of A and b(W), and let θ be some function in that class. Here Qˉ is not necessarily Qˉ0 and b is not necessarily gˉ0 or even a balancing score. For concreteness, consider two working model and loss function pairs: a logistic working model


with loss function


which is the negative log likelihood loss when Y is binary, and a linear working model


with loss function


the squared error loss. In both working models, we leave the function θ unspecified. We can view a working model used for the updating step in the TMLE procedure as a special case of the working model here by restricting θ to have the form


where ε is real, using notation b(W) in place of g(1|W) as used in Section 3.



Given Qˉ, the limit of some estimate for Qˉ0, one can think of θ0, a function of A and b(W), as the residual bias between E0(Qˉ(A,W)|A,b(W)) and E0(Y|A,b(W)) on either the logistic or linear scale. When the initial estimator Qˉn is consistent, so Qˉ=Qˉ0, θ0(A,b(W)) will be 0, because Qˉ will already be fully adjusting for A and b(W).

Suppose for now that we have an estimate of θ0 which we call θn. We return to the problem of estimating θ0 later in this section. Calculate the update of Qˉn as Qˉngˉn,θn and using this updated regression, a final estimate of ψ0 is calculated as Ψ(Qˉngˉn,θn,QWn), which we call a doubly robust balancing score adjusted (DR-BSA) plug-in estimator. In Theorem 1 in Appendix A.2, we show that the DR-BSA estimator doubly robust in the sense that it is consistent when either Qˉ=Qˉ0 or θn consistently estimates θ0 and b is a balancing score.

When initial estimator Qˉn does not consistently estimate Qˉ0, consistency of the DR-BSA estimate requires that b is a balancing score and θ0 is consistently estimated. To weaken this requirement, we now construct a TMLE with the balancing score property by using Qˉn0=Qˉngn,θn as the initial estimate in the TMLE procedure in Section 3 and updating it to Qˉn. The TMLE of Ψ(P0) is calculated as Ψ(Qˉn,QWn). We call this a balancing score adjusted TMLE (BSA-TMLE). In Theorem 2 in Appendix A.2, we show that the BSA-TMLE is consistent if any of the three conditions hold: (1) Qˉ=Qˉ0, (2) b=gˉ0, or (3) b is a balancing score and θn consistently estimates θ0. The BSA-TMLE is therefore doubly robust in the usual sense and also has the balancing score property. The BSA-TMLE is a TMLE as described in Section 3 where in addition to attempting to adjust for W, the initial estimator Qˉn0 is making an extra attempt to adjust for a balancing score. If θ0 is consistently estimated, then like the standard TMLE, when both the initial estimates of Qˉ0 and g0 are consistent, the influence curve of the BSA-TMLE is the efficient influence curve. Therefore, under regularity conditions, the BSA-TMLE is locally efficient and keeps all of the attractive properties of TMLE while also having the balancing score property.

We now return to the problem of estimating θ0. The working model in the definition of θ0 depends is Qˉb,θ which depends on limits Qˉ and b. To estimate θn, we use Qˉngˉn,θ as the working model. If gˉn(W) is discrete and θ0 is estimated in a saturated parametric model, Ψ(Qˉngˉn,θn,QWn) is exactly a TMLE as proved in Lemma 2 in Appendix A.2. When gˉn(W) is not discrete, it can be discretized into k categories based on quantiles. The parameter θ0 can be estimated with a saturated parametric model with standard logistic regression software with dummy variables for each stratum and treatment combination, and logitQˉn(A,W) as an offset. When Qˉn(A,W) is unadjusted for W, for example Qˉn is estimated in a GLM with only an intercept and treatment as a main term, this reduces to usual propensity score stratification. In general, when the number of categories k is fixed and does not grow with sample size, stratification is not consistent, though one hopes that the residual bias is small [2]. If k is too large, there is a possibility of all observations in a particular stratum having the same value for A, in which case θn(A,W) is not well defined. In many applications, the number of strata is often set based on the rule of thumb k=5 recommended by Rosenbaum and Rubin [3]. Though the stratification estimator of ψ0 is not root-n consistent when k is fixed, the BSA-TMLE removes this remaining bias if gn consistently estimates the true propensity score while preserving the balancing score property. In practice, the number of strata k can be chosen based on cross-validation in such a way that it can grow with sample size.

Alternatively, when gˉn(W) is not discrete or has many levels, θ0 can be estimated in an generalized additive model [21] with Qˉn as an offset. We can parameterize this model as


with θ=(θ1,θ2) where θ1 and θ2 are unspecified. Other parametric or nonparametric methods can be used and cross-validation based SuperLearning can be used to select the best weighted combination of estimators for θ0 [9, 17]. When the linear model (2) is used, θ0(A,W)=E0(YQˉ(A,W)|A,gn(1|W)). In this case, a nearest neighbor or kernel regression can be used where residuals from the initial estimate, Ri=YiQˉn(Ai,Wi), are treated as an outcome. This is similar to the bias corrected matching estimator presented by Abadie and Imbens [22].

5 Simulations

We demonstrate properties of the proposed BSA-TMLE in various scenarios, and compare it to other estimators. The estimators compared in simulations include a plug-in estimator based on just the initial estimator of Qˉ0 without balancing score adjustment, DR-BSA plug-in estimators without a TMLE update, non-doubly robust BSA plug-in estimators, an inverse probability of treatment weighted estimator, and a TMLE using an initial estimator for Qˉ0 not directly adjusted for a balancing score.

The plug-in estimator not adjusted for a balancing score is calculated as Ψ(Qˉn,QWn) with Qˉn as defined in Section 4. We call this the simple plug-in estimator. The DR-BSA plug-in estimator uses the balancing score adjusted Qˉn0 as in Section 4 and is calculated as Ψ(Qˉn0,QWn). The non-doubly robust BSA plug-in estimator adjusts for the balancing score, but uses as initial Qˉn an unadjusted estimate that is not a function of W. The non-DR-BSA plug-in estimator can be thought of as only adjusting for gn(1|W) and not the whole covariate vector W. The IPTW estimator is calculated as


The estimators we compare are summarized in Table 1.

Table 1

Summary of properties of compared estimators

EstimatorPlug-inConsistent ifEfficient if
QˉnQˉ0gˉngˉ0gˉn BSQˉnQˉ0&gˉngˉ0
Simple plug-in

In the simulation studies, we use two methods for adjusting the initial estimator with the propensity score. All simulations were conducted in R [13]. The initial estimator Qˉn was adjusted with either a generalized additive model (GAM) in eq. (3), or a nearest neighbor approach analogous to propensity score matching. The non-DR-BSA plug-in estimator based on nearest neighbors reduces exactly to a propensity score matching estimator. The GAM was fitted with the mgcv package [21] and the nearest neighbor/propensity score matching type estimator was implement with the Matching package [23].

The initial estimates for Qˉ0 and gˉ0 are estimated using generalized linear models. Specifically, gˉ0 is estimated using logistic regression, and Qˉ0 is estimated with least squares when Y is continuous, and logistic regression when Y is binary. To investigate robustness to various kinds of model misspecification, models are either correctly specified, or some relevant covariates are excluded.

The data generating distribution in the simulations was as follows. Baseline covariates W1, W2 and W3 have independent uniform distributions on [0,1]. Treatment A is Bernoulli with mean


Outcome Y is either Bernoulli or normal with variance 1 and mean


where m is logit1 if Y is Bernoulli, or the identity if Y is normal. All estimators were evaluated on 1,000 datasets of size n=100 and n=1,000. Bias, variance, and mean squared error (MSE) are calculated for each estimator.

In the first scenario, which we call distribution one, α=(α0,α1,α2,α3,α4)=(3,2,2,0.5) and β=(β0,β1,β2,β3,β4)=(3,1,1,0,5) so W1 and W2 are confounders, and the propensity score depends on the product W1W2. The true parameter ψ00.0985 and the variance bound is approximately 1.5691/n. The variance bound of a parameter in a semiparametric model is the minimum asymptotic variance that a regular estimator can achieve, and depends on the parameter mapping Ψ and the true distribution P0 [15]. This is analogous with the Cramér-Rao bound in a parametric model. An estimator that asymptotically achieves the variance bound is called efficient.

The first set of results in Table 2 demonstrate the balancing score property. The initial estimate Qˉn is unadjusted. A correct logistic regression model is specified for gˉ0, but predictions are transformed by the Beta cumulative distribution function with both shape parameters equal to 2. Although artificial, this means that gˉn converges to a monotone transformation of gˉ0, which is a balancing score, but does not converge to the true gˉ0. We can see that the TMLE not adjusted for the propensity score and the IPTW estimators are not consistent as the bias is not decrease substantially when sample size increase. Conversely, methods where the initially estimate Qˉn is adjusted with the propensity score, are consistent, as bias is decreasing quickly with sample size.

Table 2

Simulation results for distribution one with Qˉn unadjusted and gˉn correctly specified but transformed with Beta CDF

Estimatorn = 100n = 1,000
BSA, NN0.02760.01800.01880.00260.00180.0018
BSA, GAM0.00750.01630.01630.00410.00150.0015
BSA-TMLE, NN0.02760.01800.01880.00260.00180.0018
BSA-TMLE, GAM0.00700.01640.01650.00370.00150.0015

Table 3 shows similar performance in a more realistic scenario. In this setting, the initial estimator for Qˉn is unadjusted, but the logistic regression model for the propensity score is misspecified by excluding the interaction term W1W2. Here predictions are not transformed. Here gˉn is close to but not exactly a balancing score, but it is close enough that the bias in estimators that nonparametrically adjust for gˉn is small. The IPTW estimator, however, is still biased at large n because gˉn is not converging to gˉ0. In this case TMLE performs well even with an unadjusted initial estimator but this is not guaranteed when gˉn is misspecified.

Table 3

Simulation results for distribution one with Qˉn unadjusted, and gˉn misspecified but close to a balancing score

Estimatorn = 100n = 1,000
BSA, NN0.03110.01660.01760.00270.00160.0016
BSA, GAM0.01470.01590.01610.00330.00140.0014
BSA-TMLE, NN0.03110.01660.01760.00270.00160.0016
BSA-TMLE, GAM0.01010.01890.0190–0.00420.00150.0016

Table 4 examines the performance of estimators when the model for gˉ0 is misspecified (only including W1 in the logistic regression model,) but the initial estimate Qˉn is a correctly specified model. Here we see that estimates that rely only on estimated propensity score (the non-doubly robust BSA estimators and IPTW,) fail to be consistent, but estimates that use the correctly specified initial estimate of Qˉ0, are consistent. Importantly, even when the initial estimate is adjusted with the completely misspecified gˉn, final estimates are still consistent when the initial Qˉn is correctly specified.

Table 4

Simulation results for distribution one with Qˉn correctly specified and gˉn misspecified

Estimatorn = 100n = 1,000
Simple plug-in0.00710.01200.01200.00110.00130.0013
BSA, NN0.11900.01260.02680.10640.00140.0128
DR-BSA, NN0.00640.01390.01400.00030.00150.0015
BSA, GAM0.11390.01160.02460.10960.00120.0133
DR-BSA, GAM0.01520.01290.01320.00150.00130.0013
BSA-TMLE, NN0.00640.01390.01400.00030.00150.0015
BSA-TMLE, GAM0.01540.01330.01360.00140.00130.0013

In a second scenario, called distribution two, Y is conditionally normal with α=(0,10,8,0,2) and β=(1,0,0,3,0). Here Y depends on W1 and W2 but A does not, so they are not confounders. Additionally, A depends on W3, but Y does not, so W3 is an instrumental variable. In this setting, because none of the baseline covariates are confounders, an unadjusted estimator of ψ0 will be consistent but not efficient, because it will fail to take into account the relationship with the non-confounding baseline covariates W1 and W2. Here, the true ψ0 is 2 and the variance bound is approximately 5.1979/n.

Table 5 shows results from distribution two where the initial estimate for Qˉ0 is the least squares estimate from a linear regression model with A, W1, W2, and W3 are main terms, and the initial estimate for the propensity score is the MLE from a logistic regression model with main terms W1, W2, and W3. Here we see that, although all estimators have low bias, those that only adjust for gˉn, (the non-doubly robust BSA estimators and IPTW,) have much higher variance than those with a correctly specified initial estimate. This demonstrates the importance in terms of efficiency of attempting to estimate Qˉ0 well with the initial estimate even when confounding is not a concern.

Table 5

Simulation results from distribution two with Qˉn correctly specified and gˉn correctly specified and includes an instrumental variable

Estimatorn = 100n = 1,000
Simple plug-in–0.01120.05050.05060.00070.00480.0048
BSA, NN0.00800.18150.18150.00200.01850.0185
DR-BSA, NN–0.01080.05780.05790.00240.00590.0060
BSA, GAM–0.00610.32070.3208–0.00080.00970.0097
DR-BSA, GAM–0.01120.05650.05660.00100.00510.0051
BSA-TMLE, NN–0.01080.05780.05790.00240.00590.0060
BSA-TMLE, GAM–0.01810.05870.05900.00090.00530.0053

6 Discussion

In this paper, we discuss the balancing score property of estimators that nonparametrically adjust for the propensity score. We see in simulations that, even when the propensity score estimator is not consistent, Ψ(P0) can be estimated with low bias if the estimate of the propensity score approximates a balancing score well enough. Additionally, we introduce a balancing score adjusted TMLE which has the balancing score property and is also doubly robust and locally efficient, and provide regularity conditions for asymptotic linearity in Appendix A.2.

In order for an estimator to have the balancing score property, we need to estimate some balancing score. We acknowledge that in practice, one does not expect an estimate of the propensity score to converge exactly to a balancing score that is not g0 in general. However, because the propensity score is a single element of the large class of balancing scores, the condition that an estimated propensity score gn converges to some balancing score is strictly weaker than requiring gn to converge to g0. When gn fails to converge to g0, we may still have a chance at approximating a balancing score, and the proposed BSA-TMLE can still reduce bias relative to an estimator that requires that gn converges to g0 without sacrificing double robustness or efficiency.

We now discuss some possible generalizations to the work in this paper and areas for further research. The estimators present in this paper are for the statistical parameter E0[E0(Y|A=1,W)], which, under assumptions, can be interpreted as the population mean of a variable Y when Y is subject to missingness [24]. The results and similar estimators are immediately applicable to other interesting statistical parameters such as




which, under non-testable causal assumptions, can be interpreted as causal parameters called the ATE or ATT, respectively [9, 14]. Additionally, the results are immediately generalizable to the estimation of parameters in marginal structural models [25, 26].

Propensity score-based methods are most often applied in settings where the treatment variable is binary. In settings where the treatment variable is not binary, Imai and Van Dyk [27] generalize the notion of the propensity score to the propensity function, the conditional probability of observed treatment given covariates. Imai and Van Dyk [27] show that the propensity function is a balancing score. When the propensity function can be characterized by a finite dimensional parameter, one can estimate parameters of the distribution of counterfactuals by adjusting for the dimensional characterization of the propensity function in place of all covariates. Using the approach of Imai and Van Dyk [27], the methods in this paper may be extended to develop estimators that are doubly robust and efficient with the balancing score property for more general situations where treatment is categorical or potentially even continuous.

Traditionally, propensity score-based estimators estimate the propensity score based on how well gˉn approximates the true gˉ0. Collaborative targeted minimum loss-based estimation (CTMLE) is a method that chooses an estimator for the propensity score based on how well it helps reduce bias in the estimation of Ψ(P0) in collaboration with an initial estimate of Qˉ0 using cross-validation [9, 28]. In doing so, CTMLE attempts to adjust the propensity score for the most important confounders first and avoid adjustment for instrumental variables. This can lead to improvements in efficiency and robustness to violations of the assumption P0(A=a|W)>0. Applying an analogous techniques of estimator selection for balancing score adjusted estimators is an area of further research.

Funding statement: Funding: U.S. Department of Health and Human Services-National Institutes of Health 5R01AI074345-05.


Mark van der Laan was supported by NIH grant 5R01AI074345-05. We thank the associate editor and anonymous referees for their insightful comments, which we believe helped greatly improve the quality of the paper.


A.1 Notation

  1. O=(W,A,Y): observed data structure

    1. W: vector of covariates

    2. A: treatment indicator, 0 or 1

    3. Y: univariate outcome

    4. P: a distribution of O

  2. M: statistical model, set of possible probability distributions P

  3. Ep(): expectation under distribution P

  4. Q=(Qˉ,QW)

    1. Qˉ(a,w)=EP(Y|A=a,W=w)

    2. QW(w)=P(W=w)

  5. g(a|w)=P(A=a|W=w)

  6. gˉ(w)=g(1|W), also called the propensity score when.

  7. Ψ: statistical parameter mapping from M to .

    1. In particular, Ψ(P)=EP[EP(Y|A=1,W)]

    2. Also written as Ψ(Q)

  8. ψ=Ψ(P)

  9. Subscript 0: indicates the truth, e.g. ψ0=Ψ(P0) is the true parameter value

  10. Subscript n: indicates an estimate based on n observations, e.g. Qˉn is an estimate of Qˉ0

  11. Qˉn0 an initial estimate of Qˉ0

  12. L: loss function

  13. LY: loss function for Qˉ

  14. LW: loss function for QW

  15. Q(ε) a working submodel through Q

  16. IC: an influence curve

  17. D: the efficient influence curve

  18. Qˉn a TMLE updated estimate of some initialQˉn0

  19. b(w): some function of w that is a potential balancing score

  20. θ: some function of a and b(w)

  21. Qˉb,θ: a working submodel through Qˉ for a particular b and θ

  22. La loss function for Qˉb,θ, used in Section 4

A.2 Some results and proofs

Proof of Lemma~1. In this proof, E means expectation with respect to P. First note that E(Y|A=1,W,b(W))=E(Y|A=1,W) because b is a function of only W. Next,


because the inner conditional expectation is a function of only W and WA|b(W) when b is a balancing score. Thus,


Theorem 1



In addition, assume that either gˉ is a balancing score or Qˉ=Qˉ0. Then Ψ((Qˉngn,θn,QWn)) is consistent for ψ0.

Proof By definition of θ0, we have


for all functions h of A and b(W). Theorem 2 Rosenbaum and Rubin [8] show that b is a balancing score if and only if there exists a function f so that gˉ0(w)=f(b(w)) a.e., so we can select the function


In addition, we also have that E0Qˉb,θ0(1,W)Ψ((Qˉb,θ0,QW,0))=0. This proves that


where D is the efficient influence curve of Ψ at P, and notation


for some function ϕ of O and distribution P. Since P0D(Qˉ,QW,g0)=ψ0Ψ(Q), this shows


This proves that under the stated consistency condition, we indeed have that Ψ((Qˉngn,θn,QWn)) is consistent for ψ0. This proves the consistency under the condition that b is a balancing score.

Consider now the case that Qˉ=Qˉ0. Then θ0=0 and thus Qˉb,θ0=Qˉ0. Thus, the limit Ψ((Qˉb,θ0,QW0))=Ψ((Qˉ0,QW0)), which proves the second claim of the theorem.□

Theorem 2



where ε0=argminεP0L(Qˉb,θ0(ε)).

In addition, assume that b is a balancing score, or Qˉ=Qˉ0. Then ε0=0 and Ψ((Qˉngˉn,θn(εn),QWn)) is consistent for ψ0.

Proof. Firstly, assume b is a balancing score so by Theorem 2 Rosenbaum and Rubin [8] there exists a mapping f so that g0(w)=f(b(w)) a.e.. In the proof of the previous theorem we showed that


The left-hand side equals ddεP0L(Q¯b,θ0(ε))|ε=0 and this score equation in ε is solved by ε0. This proves that ε0=0 under the assumption that this score equation P0L(Qˉb,θ0(ε))=0 has a unique solution. The latter follows from the fact that the submodel with single parameter ε has an expected loss that is strictly convex.

This now proves that the limit Ψ((Qˉb,θ0(ε0),QW0))=Ψ((Qˉb,θ0,QW,0)) so that we can apply the previous theorem which shows that the latter limit equals ψ0. This proves the consistency of the TMLE when b is a balancing score.

Consider now the case that Qˉ=Qˉ0. Then θ0=0 and thus Qˉb,θ0=Qˉ0. Thus, the limit Ψ((Qˉb,θ0,QW0))=Ψ((Qˉ0,QW0)), which proves the consistency under the condition that Qˉ=Qˉ0. In the latter case, it also follows that ε0=0.□

Lemma 2

Ifgˉntakes only discrete values with support G, thenΨ((Qˉngˉn,θn,QWn))is a TMLE ifθ0is estimated asθnusing MLE in a saturated parametric model


where Qˉn is some initial estimator for Qˉ0 and I is the indicator function.

Proof of Lemma~2. The MLE θn (or empirical risk minimizer for the negative quasi-binomial log likelihood, if Y is not binary), solves the score equations for each parameter θa,c:


Additionally, any function h of A and gˉn(W) is in the linear span of basis functions I(A=a,gˉn(W)=c) for all a{0,1}, cG, so


In particular, the above equation is solved when h(a,w)=agˉn(w), which is the score from the parametric submodel in eq. (4). Thus if the TMLE update is applied to the initial estimate Qˉn0=Qˉngˉn,θn, εn=0, and Qˉn=Qˉn0 so Ψ((Qˉngˉn,θn,QWn)) is a TMLE.□

Theorem 3

DefineΦ1(Q)=P0Qˉgˉgˉ0gˉandΦ2(g)=P0(QˉQˉ0)gˉgˉ0. AssumeD(Qn,gn)falls in aP0-Donsker class with probability tending to 1; P0{D(Qn,gn)D(Q,g)}20in probability asn;

Φ1(Qˉn) and Φ2(gˉn) are asymptotically linear estimators of Φ1(Qˉ) and Φ2(gˉ) with influence curves IC1 and IC2, respectively.

Then Ψ(Qn) is asymptotically linear with influence curve D(Q,g)+IC1+IC2.

Proof. Since P0D(Q,g)=ψ0Ψ(Q)+P0(Qˉ0Qˉ)(gˉ0gˉ)/gˉ (e.g., Zheng and Laan [29]); Zheng and van der Laan [30]), where we use the notation Qˉ(W)=Qˉ(1,W), this results in the identity:


The first term equals (PnP0)D(Q,g)+oP(1/n) if D(Qn,gn) falls in a P0-Donsker class with probability tending to 1, and P0{D(Qn,gn)D(Q,g)}20 in probability as n [31, 32]. We write


Assume that the last term is oP(1/n). We now write


where Φ1(Q)=P0Qˉgˉgˉ0gˉ and Φ2(g)=P0(QˉQˉ0)gˉgˉ0. We assume that the first term is oP(1/n), the last term equals zero (i.e., either g=g0 or Qˉ=Qˉ0), and Φ1(Qˉn) and Φ2(gˉn) are asymptotically linear estimators with influence curves IC1 and IC2, respectively. This proves Ψ(Qn) is asymptotically linear with influence curve D(Q,g)+IC1+IC2.□

A.3 TMLE when Y is not bounded by 0 and 1

If Y is not bounded by 0 and 1, but we can assume Y is bounded by l and u with <l<u<, Y can be transformed to Y=Ylul. Similarly Qˉn0 can be transformed to Qˉn0=Qˉn0lul. The procedure described in Section 3 can be applied to the data structure (W,A,Y) using Qˉn0 as initial estimator, and the final estimate can be transformed back to the original scale as Ψ((Qˉn,QWn))(ul)+l. When l and u are not known, they can be set to the minimum and maximum of the observed Y as described in [20].

For completeness we can define an alternative TMLE using a linear working model where


with loss function


the squared error loss. Here, ε0=argminεE0LY(Qˉ)(O) can be estimated by standard least squares regression software, with Qˉn0(A,W) as an offset.

Asymptotically, a TMLE using a linear working model (or linear fluctuation) is the equivalent to a TMLE with a logistic working model, but in practice can perform poorly. This is because if gn(1|Wi) is very small for some observations, which is more likely in small samples, εn0 can be large in absolute value, having a large effect on Qˉn with a linear fluctuation, which is unbounded. Because of this, if it is reasonable to bound Y by some l and u, it the logistic working model is recommended because Qˉn always respects these bounds, even if εn0 is large.

A.4 Example implementation of a BSA-TMLE estimator in R

bsatmle <- function(QnA1, QnA0, gn1, A, Y, family = “binomial”) {

# computes estimates of E(E(Y|A=1, W)) (called ey1 in the

# output), E(E(Y|A=0, W)) (called ey0), and

# E(E(Y|A=1, W)) – E(E(Y|A=1, W)) (called ate)


# Inputs:

# QnA1, QnA0: vectors, initial estimates of \bar{Q}_n(1, W)

# and \bar{Q}_n(O, W)

# gn1: vector, estimates of g_n(1|W)

# A: vector, indicator of treatment

# Y: vector, outcome

# family: “binomial” for logistic fluctuation, “gaussian”

# for linear fluctuation.

# if “binomial”, Y should be binary or bounded

# by 0 and 1

if (!require(mgcv)) stop(“mgcv package is required”)

if (family==“binomial”) {

#use quasibinomial to suppress error messages about

#non-integer Y

family <- “quasibinomial”

link <- qlogis

} else {

link <- identity


QnAA <- ifelse(A==1, QnA1, QnA0)

# Use a generalized additive model to estimate theta_0

# using the initial estimate of \bar{Q}

gamfit <- gam(Y factor(A)+s(gn1, by=factor(A))+offset(off),

family, data=data.frame(A=A, gn1=gn1, off=link(QnAA)))

#Get predictions from gam fit

QnA1.gam <- predict(gamfit, type=“response”,

newdata=data.frame(A=1, gn1=gn1, off=link(QnA1)))

QnA0.gam <- predict(gamfit, type=“response”,

newdata=data.frame(A=0, gn1=gn1, off=link(QnA0)))

QnAA.gam <- ifelse(A==1, QnA1.gam, QnA0.gam)

# compute a/g_n(1|W)

hA1 <- 1/gn1

hA0 <- –1/(1 – gn1)

hAA <- ifelse(A==1, hA1, hA0)

#using glm, fluctuate the gam-updated initial fit of \bar{Q}

glmfit <- glm(Y -1+h + offset(off), family,

data=data.frame(h=hAA, off=link(QnAA.gam))) <- predict(glmfit, type=“response”,

newdata=data.frame(h=hA1, off=link(QnA1.gam))) <- predict(glmfit, type=“response”,

newdata=data.frame(h=hA0, off=link(QnA0.gam)))

#compute the final estimates

ey1 <- mean(

ey0 <- mean(

ate <- ey1-ey0

list(ey1=ey1, ey0=ey0, ate=ate)



1. AustinPC. The performance of different propensity-score methods for estimating differences in proportions (risk differences or absolute risk reductions) in observational studies. Stat Med2010;29:213748.10.1002/sim.3854Search in Google Scholar PubMed PubMed Central

2. LuncefordJ, DavidianM. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med2004;23:293760.10.1002/sim.1903Search in Google Scholar PubMed

3. RosenbaumP, RubinD. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc1984;79:51624.10.1080/01621459.1984.10478078Search in Google Scholar

4. RobinsJ, HernánM, BrumbackB. Marginal structural models and causal inference in epidemiology. Epidemiology2000;11:55060.10.1097/00001648-200009000-00011Search in Google Scholar PubMed

5. RosenbaumP. Model-based direct adjustment. J Am Stat Assoc1987;82:38794.10.1080/01621459.1987.10478441Search in Google Scholar

6. CaliendoM, KopeinigS. Some practical guidance for the implementation of propensity score matching. J Econ Surv2008;22:3172.10.1111/j.1467-6419.2007.00527.xSearch in Google Scholar

7. DehejiaR, WahbaS. Propensity score-matching methods for nonexperimental causal studies. Rev Econ Stat2002;84:15161.10.1162/003465302317331982Search in Google Scholar

8. RosenbaumP, RubinD. The central role of the propensity score in observational studies for causal effects. Biometrika1983;70:41.10.1093/biomet/70.1.41Search in Google Scholar

9. van der LaanMJ, RoseS. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.10.1007/978-1-4419-9782-1Search in Google Scholar

10. van der LaanMJ, RubinD. Targeted maximum likelihood learning. Int J Biostat2006;2. Available at: in Google Scholar

11. RobinsJM, RotnitzkyA, ZhaoLP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc1994;89:84666.10.1080/01621459.1994.10476818Search in Google Scholar

12. van der LaanMJ, RobinsJM. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.10.1007/978-0-387-21700-0Search in Google Scholar

13. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2012. Available at:, ISBN 3-900051-07-0Search in Google Scholar

14. HahnJ. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica1998;66:31531.10.2307/2998560Search in Google Scholar

15. BickelPJ, KlaassenCAJ, RitovY, WellnerJA. Efficient and adaptive estimation for semiparametric models. Baltimore, MD: The Johns Hopkins University Press, 1993.Search in Google Scholar

16. McCullaghP, NelderJ. Generalized linear models, Vol. 37. Boca Raton, FL: Chapman & Hall/CRC, 1989.Search in Google Scholar

17. van der LaanMJ, PolleyEC, HubbardAE. Super learner. Stat Appl Genet Mol Biol2007;6. Available at: in Google Scholar

18. van der LaanMJ. Targeted maximum likelihood based causal inference: part I. Int J Biostat2010;6. Available at: in Google Scholar

19. van der LaanMJ. Targeted maximum likelihood based causal inference: part II. Int J Biostat2010;6. Available at:$002fijb.2010.6.2$002fijb.2010.6.2.1241$002fijb.2010.6.2.1241.xmlSearch in Google Scholar

20. GruberS, Van Der LaanM. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Int J Biostat2010;6. Available at: in Google Scholar

21. WoodSN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc Ser B (Stat Methodol)2011;73:336.10.1111/j.1467-9868.2010.00749.xSearch in Google Scholar

22. AbadieA, ImbensG. Bias-corrected matching estimators for average treatment effects. J Bus Econ Stat2011;29:111.10.1198/jbes.2009.07333Search in Google Scholar

23. SekhonJS. Multivariate and propensity score matching software with automated balance optimization: the matching package for R. J Stat Softw2011;42:152. Available at: in Google Scholar

24. KangJ, SchaferJ. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci2007;22:52339.10.1214/07-STS227Search in Google Scholar PubMed PubMed Central

25. RobinsJM. Marginal structural models. In Proceedings of the American Statistical Association. Section on Bayesian Statistical Science, 1–10, 1997.Search in Google Scholar

26. RosenblumM, van der LaanMJ. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat2010;6:Article 19.10.2202/1557-4679.1238Search in Google Scholar PubMed PubMed Central

27. ImaiK, Van DykDA. Causal inference with general treatment regimes. J Am Stat Assoc2004;99:854–66.10.1198/016214504000001187Search in Google Scholar

28. van der LaanMJ, GruberS. Collaborative double robust targeted maximum likelihood estimation. Int J Biostat2010;6. Available at: in Google Scholar

29. ZhengW, LaanMJ Asymptotic theory for cross-validated targeted maximum likelihood estimation. Working Paper 273, U.C. Berkeley Division of Biostatistics Working Paper Series, 2010. Available at: in Google Scholar

30. ZhengW, van der LaanMJ. Targeted maximum likelihood estimation of natural direct effects. Int J Biostat2012;8. Available at: in Google Scholar

31. van der VaartAW. Asymptotic statistics. Cambridge and New York: Cambridge University Press, 1998.Search in Google Scholar

32. van der VaartAW, WellnerJA. Weak convergence and empirical processes. New York: Springer, 1996.10.1007/978-1-4757-2545-2Search in Google Scholar

Published Online: 2015-1-10
Published in Print: 2015-9-1

©2015 by De Gruyter

Downloaded on 6.2.2023 from
Scroll Up Arrow