Estimating the causal effect of an exposure A on an outcome Y when the relation between them is confounded by a set of covariates is a very common problem in causal inference, of high relevance for applications in epidemiology, medical, and social research, among other fields.
In general, causal effects are defined as parameters of the distribution of the counterfactual outcome process [1, 2], which contains the variables that would have been observed if, possibly contrary to fact, the subject had received level a of the exposure. Computation of causal parameters involves expectations with respect to the distribution of the stochastic process that one would have observed if, for each subject, all the counterfactual outcomes were observed. Since the observed data for subject i contain only one of the counterfactuals, namely the one corresponding to the exposure level actually received (this is often called the consistency assumption), additional untestable assumptions are needed in order to identify parameters of the counterfactual process distribution as parameters of the observed data distribution. These assumptions are usually described in terms of the so-called no unmeasured confounders assumption, a particular case of the coarsening at random assumption, which roughly states that the censoring or exposure processes cannot depend on unobserved covariates that are also related to the outcome.
In spite of the large number of causal inference problems that are inherently defined in terms of exposures of continuous nature, most of the attention in the field of causal inference has focused on the definition and estimation of parameters for binary treatments, in which it is natural to compare the counterfactual outcome under two possible exposure levels. Estimation of causal parameters for binary exposures has been widely studied [3–9]. The main reason why consistent and efficient estimators of the causal dose–response curve (CDRC) for continuous treatments in the nonparametric model have not yet been developed is that it is not a pathwise differentiable parameter [10, chapters 3 and 5] and therefore cannot be estimated at a consistency rate of $\sqrt{n}$. Examples of pathwise differentiable parameters that measure the causal effect of a continuous exposure on an outcome of interest are given by the parameters defined in Díaz and van der Laan [11, 12]. These approaches make use of stochastic interventions [13–15] as a means to define a counterfactual outcome in a post-intervention world, which, compared to the expectation of the actual outcome, defines the causal effect of an intervention.
The most widely known method for estimation of the CDRC for continuous exposures is the so-called marginal structural model (MSM) framework, which was first proposed by Robins et al. In addition to the structural causal assumptions required for identifiability, the validity of the statistical claims of MSM analyses relies on the correct specification of a parametric model for the CDRC, that is, the model needs to contain the true dose–response curve. Neugebauer and van der Laan generalize the MSM methodology to avoid dependence on the correct specification of a parametric model by defining the parameter of interest as the projection of the true CDRC on the space of functions defined by the parameterization implied by the MSM, providing robustness against misspecification of the parametric MSM. Their work also includes identification results for this projection parameter, as well as inverse probability of treatment weighted (IPTW), G-computation (G-comp) and augmented IPTW (A-IPTW) double robust estimators. Marginal structural models represent only a provisional solution to the problem, because in many instances the interest lies in estimating the actual CDRC and not its projection on some parametric space of functions.
An alternative and widely used method for estimating non–pathwise differentiable parameters is the selection of the best performing candidate among a list of algorithm estimators, where performance is defined in terms of the cross-validated risk. Formal analytical asymptotic arguments backing the use of cross-validation as an estimator selection tool were first given by van der Laan and Dudoit, van der Vaart, and van der Laan et al., among others. The main result of these works is a finite sample size inequality that bounds the risk of the cross-validation selector by the risk of the oracle selector (the selector based on the true distribution), which in turn is used to establish, under certain conditions, the asymptotic equivalence between the cross-validation and the oracle selectors. These results are later explored in specific contexts by Dudoit and van der Laan and van der Laan et al., among others. Of special interest is the work of van der Vaart et al., in which the cross-validation oracle inequalities are extended to candidate libraries with a continuous index set and unbounded loss functions. van der Laan et al. demonstrate that this oracle property for cross-validation combined with the right library of estimators results in a minimal adaptive optimal estimator. van der Laan et al. use these optimality results in order to define the super learner prediction algorithm, implemented in the SuperLearner R library. van der Laan and Petersen propose a general methodology for estimating the risk of a loss function indexed by nuisance parameters using cross-validated targeted minimum loss–based estimators (CV-TMLE).
For the particular case of the CDRC, van der Laan and Dudoit [18, p. 52] prove that under convergence of the initial estimators, the candidate selector based on the cross-validated A-IPTW risk is asymptotically equivalent to the oracle selector. An important assumption for identification and estimation of causal effects is the positivity assumption, which loosely ensures that there is “enough experimentation” in the data so that all the subjects have a positive probability of receiving every level of the exposure. Since the general A-IPTW methodology does not provide bounded estimators, the estimates can fall outside the parameter space and be overly sensitive to violations of the positivity assumption, which are very likely to occur when working with continuous exposures.
The main contribution of this article is to present a CV-TMLE of the risk of a CDRC candidate estimator that is endowed with an oracle inequality analogous to that of the A-IPTW. The CV-TMLE we propose is more robust to empirical violations of the positivity assumption than the CV-A-IPTW of Dudoit and van der Laan. The CV-TMLE is also a substitution estimator, which guarantees estimates that are within the bounds of the parameter space. These two estimators have also been proven to be asymptotically linear with influence function equal to the efficient influence function, under certain conditions, which implies that they are consistent and efficient estimators of the risk.
The article is organized as follows. In Section 2, we formally describe the inference problem, define the loss and risk functions, and present the efficient influence function of the parameter of interest. In order to first introduce relevant concepts, in Section 2.1 we present four estimators (G-comp, IPTW, A-IPTW, and TMLE) of the risk when the candidate estimators of the CDRC are assumed fixed functions. In Section 2.2, we generalize these estimators to the case when the candidates are estimated from the sample and present the corresponding cross-validated versions of the A-IPTW and TMLE. In Section 3, we present a theorem describing the conditions under which the CV-TML estimator of the risk is an asymptotically linear estimator, the conditions under which it is consistent and efficient, as well as a discussion of the estimation of its variance. Section 4 presents the main contribution of this article: an oracle inequality for the selector based on the CV-TML estimator of the risk, and the conditions under which it is asymptotically equivalent to the oracle selector. In Section 5, we use Monte Carlo simulation to compare the performance of CV-TMLE and CV-A-IPTW selectors and estimators of the risk in finite sample sizes. We conclude in Section 6 with a summary of the work and a discussion of the limitations of our proposal and directions of future research.
2 Definition and estimation of the risk of an estimator of the CDRC
Consider an experiment in which an exposure variable A, a continuous or binary outcome Y and a set of covariates W are measured for n randomly sampled subjects. Let $O = (W, A, Y)$ represent a random variable with distribution $P_0$, and $O_1, \ldots, O_n$ represent n i.i.d. observations of O. The range of W, A and Y will be denoted by $\mathcal{W}$, $\mathcal{A}$ and $\mathcal{Y}$, respectively. Assume that the following non-parametric structural equation model (NPSEM) holds:
where , and are exogenous random variables such that (randomization assumption). The true distribution of O can be factorized as
where we denote , , , , and for a given function f. For a given value , the counterfactual of Y is defined as the value , the counterfactual process of Y is given by and the full data is denoted by .
In this article, we will discuss the estimation of the causal dose–response curve within strata of the covariates , given by the expression
where , , the superscript f stands for full data, and h is a non-negative function such that . The second equality in eq. (2) is true because is the projection of into the space of functions of Z, and is the integral over of the squared norm of . The randomization assumption implies that , which allows identification of the full data parameter (2) in terms of a function of the observed data distribution as the mapping
where we denote . If A is continuous, is not a pathwise differentiable parameter in the non-parametric model, and $\sqrt{n}$-consistent estimation is not possible [10, chapters 3 and 5]. However, the risk of a given candidate value , is a pathwise differentiable parameter for which it is possible to find regular asymptotically linear estimators.
Following the ideas of Wang et al., consider a list of candidate values for . Throughout the article, we will make a distinction between candidate values (denoted ) and candidate estimators (denoted ), where the difference is that the former are given functions, whereas the latter are functions of estimated from the sample.
If the full data X were observed, a general selection procedure would involve computing , and estimating based on , where . Of course, this optimization procedure cannot be carried out as described because: (1) only a coarsened version of X denoted by O is observed, (2) the distribution of O is unknown and (3) in most cases we have a list of candidate estimators , as opposed to a list of candidate values , which raises the issue of over-fitting.
In order to overcome these obstacles one needs to:
Find a mapping that identifies , i.e., a mapping such that equals , under certain assumptions. It is common that for a loss function that is now indexed by a nuisance parameter .
If is known, the value suffices to find a selector among the candidate values. However, since is unknown, we now need to estimate . At this point, it is worth noting that even though is not a pathwise differentiable parameter, the mapping is pathwise differentiable and can therefore be $\sqrt{n}$-consistently estimated under regularity conditions.
If candidate values are not available, it is necessary to estimate the risk of candidate estimators that are trained in the sample, which requires the use of cross-validated versions of these estimators.
lead to the same definition of the risk. Loss functions (5) and (7) come from more intuitive definitions of the risk, whereas the loss function (8) comes from efficient estimation theory and is closely related to the efficient influence function of . This fact is exploited by Wang et al.  in order to define estimators of the risk as a cross-validated average of estimators of these loss functions. We will work toward the definition of a CV-TMLE analogue of those estimators and present similar results to those obtained by van der Laan and Dudoit  in terms of an oracle inequality, as well as the conditions under which the estimator of the risk is asymptotically linear.
The loss function (8), referred to as the double robust loss function, defines the efficient influence function of parameter and plays a very important role in double robust and efficient estimation of , as explained in the next section.
Parameter (4) is a pathwise differentiable parameter, for which consistent asymptotically linear estimators can be found. Note that , as defined in eq. (4), depends on P only through , where . In an abuse of notation, we will use and interchangeably, and the true value will be denoted by . We will also use the notations and interchangeably. In Section 2.1, we will focus on the estimation of the risk when the candidates are given values. Given candidate values constitute a situation that is not very common in research problems, but provides an easy way to introduce the estimators that are going to be developed in Section 2.2, in which we will generalize these estimators to the case of a candidate estimated from the sample. Cross-validation will be used as a tool to avoid over-fitting and will lead to an oracle inequality presented in Section 4.
The efficient influence function of the risk is given by the expression
with defined in eq. (8).
2.1 Estimators of the risk of a candidate parameter value
In this section, we exploit the definitions of the risk in terms of loss functions given in the previous section in order to define various estimators of the risk. As we will see, the definitions of the risk through the different loss functions previously described lead to the definition of G-comp, IPTW and A-IPTW estimators. We will also use the efficient influence function of in order to define a targeted maximum likelihood estimator of . The A-IPTW loss function is closely related to the efficient influence curve of , which results in the consistency and efficiency of the A-IPTW and TMLE. Analytical properties of these estimators have been discussed elsewhere [8, 9, 28].
We will assume that is a given function of a and Z in the sense that it is not estimated from the sample. Such a scenario is attainable, for example, in situations in which a pilot study is conducted in order to postulate candidate estimators with the objective of assessing their performance with data from a subsequent study.
Let and be initial estimators of and , respectively. These estimators will be denoted or , depending on whether it is necessary to emphasize their dependence on the empirical distribution
with denoting a Dirac delta with a point mass at x. Estimation of is only necessary when the risk is a parameter of interest in itself, and as we will see in Section 4 it is unnecessary for candidate selection.
2.1.1 G-comp, IPTW and A-IPTW estimators
The equivalent definitions of the risk through G-comp, IPTW and A-IPTW loss functions allow the straightforward definition of three estimators of the risk of a candidate value, given by:
which can be seen as solutions in R of the corresponding estimating equations , , and , where
According to theorem 5.11 of van der Vaart , if falls in a Glivenko–Cantelli class with probability tending to one, and , then the IPTW estimator is consistent for . Under an appropriate Donsker condition and consistency of , the IPTW estimator is also asymptotically linear with influence function , as explained in theorem 6.18 of van der Vaart  and the theorems of Chapter 2 of van der Laan and Robins . As a consequence, it is an inefficient estimator of the risk , and its variance can be estimated with the empirical variance of . The G-comp estimator is asymptotically linear and efficient in a parametric model that is correctly specified, but it is inefficient otherwise.
Following similar arguments, the A-IPTW estimator is double robust in the sense that it is consistent if either of or is consistent. It is also efficient if both and are consistent. Even though the A-IPTW represents an important improvement with respect to the G-comp or the IPTW, it suffers from some of the drawbacks inherited from the estimating equation methodology. One of the most important problems of such methodology is the possibility of solutions out of the parameter space, or very unstable estimators if the positivity assumption is practically violated. For this reason, we prefer estimators that are substitution estimators, i.e. estimators that are the result of applying the map to a certain estimated distribution . As we will see, the TMLE is such a substitution estimator.
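As a rough illustration of the IPTW risk estimator and of the variance estimate based on the empirical variance of its influence function, the following Python sketch may help. It assumes, purely for illustration, that the IPTW loss takes the form h(A, Z) / g(A | W) · (Y − θ(A, Z))² and that g is known, so that the centred loss plays the role of the influence function. The function name `iptw_risk` and its arguments are hypothetical and are not part of the original work.

```python
import numpy as np

def iptw_risk(y, theta_az, h_az, g_aw):
    """Weighted squared-error risk estimate and a plug-in standard error.

    y        : observed outcomes Y_i
    theta_az : candidate value evaluated at (A_i, Z_i)
    h_az     : non-negative weight function h evaluated at (A_i, Z_i)
    g_aw     : exposure density g(A_i | W_i), treated as known (assumption)
    """
    y, theta_az, h_az, g_aw = (np.asarray(v, dtype=float)
                               for v in (y, theta_az, h_az, g_aw))
    loss = h_az / g_aw * (y - theta_az) ** 2  # assumed form of the IPTW loss
    risk = loss.mean()                         # root of the mean of (loss - R)
    # Empirical variance of the centred loss, divided by n, gives the
    # estimated variance of the risk estimate when g is known.
    se = np.sqrt(np.mean((loss - risk) ** 2) / loss.size)
    return risk, se
```

Small values of `g_aw` inflate individual loss terms without bound, which is one way to see why estimators of this type can become unstable under practical positivity violations.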
2.1.2 Targeted minimum loss–based estimator
For a review on TMLE and its properties, we refer the interested reader to van der Laan and Rose . TML estimation requires the specification of three components: a valid loss function for the relevant part of the likelihood, a parametric submodel whose generalized score equals the efficient influence function and initial estimators of the relevant parts of the likelihood.
We will assume that Y is binary or that for known values a and b, in which case we can work with and interpret the results accordingly. Consider the loss functions
, for , and the parametric fluctuations given by , where
corresponding with the first two parts of the efficient influence curve presented in eq. (9). The marginal distribution of W is estimated with the empirical distribution of . It can be shown that solves (the third part of the efficient influence curve equation) at any .
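The rescaling of a bounded outcome mentioned at the start of this subsection can be sketched as follows. The snippet assumes the usual linear map Y* = (Y − lower) / (upper − lower), with `lower` and `upper` standing for the known bounds; the function names are hypothetical. Under this assumption, a squared-error risk computed on the unit scale is converted back to the original scale by multiplying by (upper − lower)², provided the candidate is rescaled in the same way.

```python
import numpy as np

def to_unit_interval(y, lower, upper):
    """Map an outcome with known bounds [lower, upper] into [0, 1]."""
    return (np.asarray(y, dtype=float) - lower) / (upper - lower)

def risk_on_original_scale(risk_unit, lower, upper):
    """Convert a squared-error risk from the unit scale to the original scale."""
    return (upper - lower) ** 2 * risk_unit
```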
For initial estimators and , the first step TMLE of is given by , where
The TMLE of is now defined as the plug-in estimator , where .
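The targeting step can be illustrated with a minimal Python sketch of a one-dimensional logistic fluctuation. It assumes a submodel of the form logit Q(ε) = logit Q + ε H, where H (called `clever` below) stands in for the fluctuation covariate derived from the efficient influence function; its exact form is not reproduced here, and the function names are hypothetical. The fluctuation parameter is found by bisection on the score equation, which is a choice made for this sketch rather than a description of the authors' implementation.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fluctuate(y, q_init, clever, lo=-50.0, hi=50.0, n_iter=200):
    """Logistic fluctuation of an initial fit q_init with values in (0, 1).

    Solves sum_i clever_i * (y_i - expit(logit(q_init_i) + eps * clever_i)) = 0
    for eps by bisection; the score is non-increasing in eps, and the root is
    assumed to lie in [lo, hi].
    """
    y, q_init, clever = (np.asarray(v, dtype=float) for v in (y, q_init, clever))
    offset = logit(q_init)

    def score(eps):
        return np.sum(clever * (y - expit(offset + eps * clever)))

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:   # root lies to the right of mid
            lo = mid
        else:                # root lies at or to the left of mid
            hi = mid
    eps = 0.5 * (lo + hi)
    return eps, expit(offset + eps * clever)
```

Because the updated fit is obtained through the inverse logit, its values stay within the unit interval, which is the mechanism behind the boundedness of substitution estimators of this kind.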
Under certain conditions explained in detail in van der Laan and Rose [9, appendix A.18], if and are consistently estimated, this TMLE of is asymptotically linear with influence curve , which means that it is consistent and efficient. If is consistent but is not, the TMLE is consistent but inefficient, and its variance can be conservatively estimated by
If one uses data-adaptive estimators in and , it is often appropriate to replace the estimate of the variance by a cross-validated estimator.
The conditions needed for asymptotic linearity of the TMLE [9, appendix 18] include a Donsker condition on the class of functions that contains the estimated efficient influence function D. Such Donsker conditions impose certain restrictions on the type of algorithms that can be used for estimation of and , forcing the user to find a trade-off between obtaining the best possible prediction algorithms and not using algorithms that are too data-adaptive, because data-adaptive algorithms might lead to estimators that do not belong to a Donsker class (e.g. random forest).
The cross-validated TMLE, whose theoretical properties are discussed in Zheng and van der Laan , provides a template for the joint use of cross-validation and TMLE methodology that avoids Donsker conditions and therefore allows the use of very data-adaptive techniques in order to find consistent estimators of and . An additional advantage of CV-TMLE in this setting is that it allows us to have a valid estimator of the risk of an estimated CDRC, solving the issue of over-fitting through the use of cross-validation.
2.2 Estimators of the risk of a candidate estimator
The previous section provided an algorithm to estimate the risk of a candidate value for the causal dose–response curve, when the value is given and not estimated from the sample. That scenario is very rare in real data applications, and it is very common that the CDRC candidates have to be estimated from the sample as well. In such situations, if the algorithms are trained in the whole sample, the use of the estimators of the risk presented in the previous sections would lead to the selection of the candidates that overfit the data.
van der Laan and Dudoit, van der Vaart, and van der Laan et al., among others, show that cross-validation is a powerful tool for estimating the risk of a candidate estimator of a non–pathwise differentiable parameter and show that such cross-validation–based selection endows the selector with an oracle inequality that translates into asymptotic optimality.
Assume now that is a mapping that maps elements in a non-parametric statistical model into a space of functions of a and Z (). An estimate of is now seen as such map evaluated in the empirical distribution of , i.e., .
Consider the following cross-validation scheme. Let a random variable S taking values in index a random sample split into a validation sample and a training sample , where S has a uniform distribution over a given set such that as for all . Note that this restriction excludes certain widely used types of cross-validation (e.g. leave-one-out cross-validation). This restriction is necessary because the asymptotic results of Sections 3 and 4 rely on empirical process theory applied to the validation sample and therefore assume that its size converges to infinity. We also note that the union of the validation samples equals the total sample: , and the validation samples are disjoint: for . Denote and the empirical distributions of a training and validation sample, respectively. For a function , we denote .
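A V-fold scheme is one example satisfying these requirements: the validation samples are disjoint, their union is the whole sample, and with a fixed number of folds the validation sample size grows with n. The following Python sketch builds such splits; the function name `vfold_splits` and its arguments are hypothetical.

```python
import numpy as np

def vfold_splits(n, n_folds, seed=0):
    """Return a list of (training_indices, validation_indices) pairs.

    The validation index sets are disjoint and their union is {0, ..., n-1}.
    """
    if n_folds < 2 or n_folds > n:
        raise ValueError("n_folds must be between 2 and n")
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    splits = []
    for v in range(n_folds):
        validation = np.sort(folds[v])
        training = np.sort(
            np.concatenate([folds[u] for u in range(n_folds) if u != v])
        )
        splits.append((training, validation))
    return splits
```

Leave-one-out cross-validation corresponds to `n_folds = n`, in which case each validation sample has size one and the restriction described above is not met.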
Since is now a value that depends on the sample, it does not make sense to talk about a parameter , because it does not agree with the formal definition of a parameter. Nonetheless, in an abuse of language, we will talk about “estimation” of the “parameter” , which we call the conditional (on the sample) risk of .
2.2.1 Cross-validated augmented IPTW
This estimator is also discussed by Wang et al.  and is given by the solution of the cross-validated version of the A-IPTW estimating equation, given by
This estimator is asymptotically linear under the conditions presented in van der Laan and Dudoit . An oracle inequality for the selector based on the A-IPTW risk estimator is also proved in the original paper.
2.2.2 Cross-validated TMLE
The cross-validated targeted maximum likelihood estimator was introduced by Zheng and van der Laan as an alternative to the TMLE that avoids the Donsker conditions on the efficient influence curve (discussed in Section 2.1.2). Donsker conditions on the class of functions generated by the estimated efficient influence function D represent an important limitation to the kind of algorithms that can be used in the initial estimators of and : very data-adaptive techniques will result in functions that do not belong to a Donsker class. As discussed in Section 2.1.2, the consistency and efficiency of the risk estimator depend on the consistency of the initial estimators of and . It is common practice in statistics to assume parametric models in order to estimate these quantities. Such parametric models are often chosen ad hoc, based on arbitrary preferences of the researcher, and do not encode legitimate knowledge about the data generating process. Thus, we avoid such parametric assumptions and prefer to use data-adaptive techniques to find the algorithm that best approximates and .
As we will see in the next section, the use of cross-validation also equips the CV-TML selector with an oracle inequality, meaning that such selector performs asymptotically as well as a selector in which the risk is computed based on the true (unknown) probability distribution.
Zheng and van der Laan  present two types of CV-TML estimators: one for a general parameter, and a specific CV-TMLE for the case in which Q can be partitioned as and the mapping that defines the parameter is linear in . As discussed in Section 2, the risk of a given candidate depends on P only through , and it can be easily verified that is linear in .
The construction of a CV-TML estimator requires the specification of the same three components discussed in Section 2.1.2: a logistic loss function, a logistic parametric fluctuation and an initial estimator of Q. For each S, let
where were defined in Section 2.1.2. This is the same fluctuation considered before, but defined only based on the training sample. With this modification, the CV-TMLE is defined analogously to the regular TMLE. Let
and for each S define the updates
which results in the plug-in estimator of the oracle risk
where , denotes the empirical distribution of W in the validation sample S, and denotes the size of .
For a definition of the CV-TMLE for general parameters, the interested reader is referred to the original article. In the next sections, we will present the asymptotic linearity of the previous estimator, as well as an oracle inequality for the selector based on it.
3 Asymptotic linearity of CV-TML estimator of the risk
In this section, we present a theorem establishing asymptotic linearity of the CV-TML estimator of the risk. This theorem is analogous to the theorems presented in Zheng and van der Laan, and its proof uses the same ideas presented in that article.
An analogous version of this theorem for the CV-A-IPTW is presented in van der Laan and Dudoit. The CV-TMLE is expected to perform better than the CV-A-IPTW in finite sample sizes, in which practical positivity violations are often present and lead to CV-A-IPTW estimators that are either very unstable or provide solutions out of the range of the parameter of interest.
Theorem 1 (Asymptotic linearity). Define
with . For a function of O, define the norm . Assume:
There exist constants and such that and
and converge to some fixed and in the sense that
For some mean zero function , we have
Then we have that
for the efficient influence function of .
The proof of this theorem is presented in Appendix A. Next we will discuss the plausibility and implications of the assumptions of Theorem 1.
3.1 Discussion of the assumptions of Theorem 1
This assumption is a natural assumption, equivalent to the positivity assumption for binary treatments, and needed to identify and also needed to estimate the risk using IPTW or A-IPTW estimators.
This is a very important assumption stating that is a consistent estimator of . It is required that the rate of convergence is or faster. This condition is automatically true in randomized controlled trials (RCT), in which the treatment mechanism is known. It is also true if g is known to belong to a parametric model, and in semi-parametric models that assume enough smoothness of . If is completely unknown, it is important to use aggressive data-adaptive estimation techniques such as the super learner to find an estimator that is more likely to satisfy this assumption.
This assumption states that the updated estimator converges to some unspecified limit at a certain rate. It is worth noting that this limit is not assumed to be ; the only requirement is convergence to some value at a certain rate that depends on the rate of convergence of to . The desired rate of convergence can be achieved if, for example, is $\sqrt{n}$-consistent (i.e. ) and converges to at any rate (i.e. ). The same is true for and .
In an RCT, in which is known, one could set and this condition would be trivially satisfied. On the other hand, since cross-validation allows for the use of very aggressive techniques for estimation of , we could have that , and the condition would also be satisfied.
In other cases, this assumption seems to conflict with assumption 2. If the treatment mechanism is completely unknown, it is necessary to use very aggressive data-adaptive techniques to find estimators that satisfy assumption 2. The use of such estimators will usually lead to estimates of that do not provide the asymptotic linearity needed in 4. Likewise, the use of an inconsistent estimator that satisfies this condition (e.g. a parametric model) will violate assumption 2. In that case, it is necessary to rely on the consistency of in the sense that , in which case assumption 4 will be trivially satisfied. This condition seems to suggest that the initial estimator must also be fluctuated to target a smooth functional of . This is a direction of future research beyond the scope of this article.
As opposed to the regular TMLE or A-IPTW, in which the Donsker conditions on D limit the use of very aggressive techniques for estimation of , the use of cross-validation allows us to implement any type of algorithm, which in turn makes consistency of a very sensible assumption. We encourage the use of super learning for estimation of both and . Super learner is a methodology that uses cross-validated risks to find an optimal estimator among a library defined by the convex hull of a user-supplied list of candidate estimators. One of its most important theoretical properties is that its solution converges to the oracle estimator (i.e. the candidate in the library that minimizes the loss function with respect to the true probability distribution). Proofs and simulations regarding these and other asymptotic properties of the super learner can be found in van der Laan et al.  and van der Laan and Dudoit.
4 Asymptotic optimality of the CDRC estimate selector based on CV-TMLE risk
If the objective is to choose the best candidate among a list of candidate estimators , it suffices to construct a ranking based on the pseudo-risk
which has the advantage that does not need to be estimated, providing additional robustness of the candidate selector. In an abuse of notation and will also be denoted by R and whenever the difference is clear from the context. Estimation of this pseudo-risk can be carried out in a similar fashion to estimation of the full risk presented in the previous section, with efficient influence function given by
which results in a CV-TMLE defined as
with exactly as in eq. (12). We will discuss now asymptotic optimality of the selector based on the CV-TMLE. Assume that we have a list of candidate estimators for the CDRC given by . Each of these algorithms is viewed as a map , where is the space of functions of a and Z. Define the CV-TMLE selector as
with . The following theorem proves that these two selectors are asymptotically equivalent under certain consistency conditions of the initial estimator of .
Theorem 2 (Oracle inequality). For each k, define
where for finite c,
Let be the CV-TMLE targeted toward estimation of the true conditional risk
Assume that , , , , and have supremum norm smaller than a constant with probability 1. Let be the total number of possible points for across , so that . Define and , where
is the TMLE of . The expression means that for a constant c. For a function of O, define the norm . We have for each , there exists a so that
where is the CV-TMLE of obtained when the target parameter is , and is either or , whichever gives the worst bound.
A proof of this theorem is provided in Appendix B. The use of a grid of size for constant c when estimating does not represent a limitation of the result of the theorem, since the result without the grid will be similar up to a term that does not affect the asymptotic behavior of the CV-TMLE selector. However, a grid of size allows the proof presented in Appendix B.
The following corollary provides the conditions under which the CV-TMLE selector is asymptotically equivalent to the oracle selector.
Corollary 1 (Asymptotic optimality). In addition to the conditions of Theorem 2, assume that
The convergences assumed in Corollary 1 are expected to hold, for example, if converges to at a rate faster than converges to .
In the following section, we present the results of a simulation study in which the finite sample properties of the CV-TMLE-based selector and of the risk estimators are explored for a specific data generating process.
In order to explore some of the finite sample size properties of the risk estimators and the selectors based on them, we performed a Monte Carlo simulation. We generated 500 samples of sizes 100, 500 and 1,000 from the following data generating process:
Under this parameterization . We considered four candidate algorithms given by marginal structural models (MSM) of the form , where is a polynomial of degree on a with coefficients . The coefficients were estimated with IPTW estimators as presented in Robins et al. and Neugebauer and van der Laan. The true value of was computed from this data generating distribution by drawing a sample of size 100,000 and, for each a, computing the empirical mean of . All the simulations were performed assuming the true parametric model for the outcome and treatment mechanism was known. Figure 1 presents the true dose–response curve, as well as the expectation of the candidate estimators across the 500 samples. From this graph, we can see that among the candidates chosen, a polynomial of degree 2 seems to provide the closest approximation to the true dose–response curve with fewer parameters, therefore providing the best bias-variance trade-off. Table 1 shows the expectation of the random variable , which from Theorem 1 should approach zero as the sample size increases. As we can see, that is not the case for the CV-A-IPTW estimator with sample size 100 due to the presence of empirical violations of the positivity assumption that cause very small treatment weights and therefore very unstable, non-regular estimates. However, that problem seems to disappear asymptotically, since for large sample sizes empirical violations of the positivity assumption are less likely to occur.
Table 2 shows the proportion of estimates that fell outside the interval (–10, 10) or fell out of the parameter space. The interval (–10, 10) was chosen arbitrarily and represents an extreme of inadmissible bounds for an estimator of a parameter that ranges in the interval (0, 1). Since the TMLE is a substitution estimator, all the estimates fell within the parameter space and are thus not presented. Due to practical violations of the positivity assumption previously mentioned, an important proportion (around 5%) of the A-IPTW estimates fell outside the parameter space for sample size 100.
Table 3 contains the expected values of across 500 simulated samples once the estimates that fell outside the interval (0, 1) were removed. In this case, the expectation of the A-IPTW–based estimator of the risk is much closer to what is expected theoretically and had already been achieved by the TML estimator.
Finally, Table 4 shows the proportion of times that a given candidate is chosen according to the A-IPTW, TMLE and oracle selectors. As we can see, both the A-IPTW- and the TMLE-based selectors perform similarly to the oracle selector, particularly as the sample size increases, thus showing no apparent advantage (at least for this particular data generating mechanism) of either method when evaluated as a candidate selector procedure.
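Selection proportions of this kind can be tabulated as in the sketch below, which assumes that a selector picks the candidate with the smallest estimated risk in each simulated sample. The function name and the risk values in the usage example are hypothetical and are not results from the article.

```python
import numpy as np


def selection_proportions(risk_estimates):
    """Proportion of samples in which each candidate attains the minimum risk.

    risk_estimates: array of shape (n_samples, n_candidates), one row per
    simulated sample and one column per candidate estimator.
    """
    risk_estimates = np.asarray(risk_estimates, dtype=float)
    chosen = np.argmin(risk_estimates, axis=1)  # selected candidate per sample
    counts = np.bincount(chosen, minlength=risk_estimates.shape[1])
    return counts / risk_estimates.shape[0]


if __name__ == "__main__":
    # Hypothetical risk estimates for 4 samples and 3 candidates.
    risks = [[0.30, 0.10, 0.20],
             [0.20, 0.10, 0.40],
             [0.05, 0.10, 0.20],
             [0.30, 0.20, 0.10]]
    print(selection_proportions(risks))
```

Applying the same function to the estimated risks of each method and to the true risks gives proportions that can be compared with those of the oracle selector.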
In this article, we discussed estimation of the causal dose–response curve for cross-sectional studies and proposed a procedure for selecting an estimator from a list of candidate algorithms. In particular, we introduced a cross-validated targeted minimum loss–based estimator of the risk of an estimator and its associated candidate selector. The approach we propose differs from commonly used approaches in that we aim to choose an algorithm among a list of candidates to fit the true dose–response curve in situations in which its functional form is unknown.
The use of augmented inverse probability of treatment weighted estimators of the risk has already been discussed in van der Laan and Dudoit. In this article, we provide an analogous targeted minimum loss–based estimator that has improved performance, as demonstrated by the simulations and analytical properties of the estimators. We provide a theorem establishing asymptotic linearity of the risk estimator, which together with its variance estimator can be used to construct hypothesis tests and confidence intervals. We also provide a theorem proving an oracle inequality bounding the expectation of the CV-TMLE risk estimator. The main corollary of this theorem is a result that establishes asymptotic equivalence of the CV-TMLE selector and the oracle selector.
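As an illustration of how asymptotic linearity leads to confidence intervals, the sketch below builds a Wald-type interval from a point estimate and estimated influence function values. It assumes the generic recipe in which the asymptotic variance is estimated by the empirical second moment of the influence function; the function name and inputs are hypothetical, and the article's own variance estimator may differ in its details (for example, in how cross-validation folds are handled).

```python
import numpy as np
from statistics import NormalDist


def wald_interval(estimate, influence_values, level=0.95):
    """Wald-type confidence interval for an asymptotically linear estimator."""
    d = np.asarray(influence_values, dtype=float)
    n = d.size
    # The influence function has mean zero, so its empirical second moment
    # estimates the asymptotic variance of sqrt(n) * (estimate - truth).
    se = np.sqrt(np.mean(d**2) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # standard normal quantile
    return estimate - z * se, estimate + z * se


if __name__ == "__main__":
    # Hypothetical risk estimate and influence function values.
    print(wald_interval(0.5, [-1.0, 1.0, -1.0, 1.0]))
```

A test of a null value for the risk can be based on the same standard error by checking whether the null value falls inside the interval.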
The asymptotic optimality of cross-validation procedures such as the super learner and our proposal is best exploited with very rich libraries of candidate estimators that contain algorithms that can provide an accurate description of the true underlying statistical (or causal) parameter. The aspects that must be taken into account when postulating candidate estimators for the dose–response curve are analogous to those in estimation of the regression function. For example, one can decide to add a candidate to the library because it represents prior knowledge about the nature of the phenomena under study, or because it is data-adaptive enough to accommodate various data generating processes. Since the extent to which the postulated algorithms approximate the true curve will often be unknown to the user, cross-validation selection will be optimal with libraries that contain a rich class of algorithms, some of which should be data-adaptive or non-parametric. Unfortunately, estimation of the causal dose–response curve has received very little attention from the machine learning research community and few data-adaptive estimators are available.
However, in a situation in which the library consists of a few parametric models (e.g. the MSMs in the simulations), our proposal provides an optimal model selection tool. Asymptotic equivalence to the oracle selector guarantees that our selector will asymptotically pick the parametric model that is closest to the true dose–response curve, among those in the library.
In this appendix, we will prove Theorem 1. This proof closely follows the proofs presented in Zheng and van der Laan for a general CV-TMLE. Those proofs rely heavily on empirical process theory, of which van der Vaart and Wellner provide a complete study.
Proof of Theorem 1. First of all note that
which implies that
so that we can write
We will first handle the term (16). By Cauchy–Schwarz, eq. (16) can be bounded by
Using a similar argument, the last term can be bounded (using assumption 1 and up to universal constants) by
whereas the second term in eq. (17) is bounded by
for constants .
On the other hand, since when f does not depend on S, we may rewrite (15) as
Following similar arguments to those presented by Zheng and van der Laan, and using the assumptions of the theorem, it can be proven that (19) is , which implies
Before proceeding to prove Theorem 2, we will first present and prove the following useful theorem.
Theorem 3. Define and , where
is the TMLE of the true conditional risk
for some function that depends on such that
Proof of Theorem 3.
Proof of Theorem 2. Recall from Theorem 1 that
where D is the efficient influence function
Applying this same equality to the constant algorithm and subtracting it from eq. (20) yields
where denotes the function inside square brackets in eq. (24), and , and denote eqs (22), (23) and (25), respectively. From the definition of the efficient influence function D, note that , which implies
where the term inside square brackets in eq. (27) is a constant, and (27) equals zero. Note that depends on only through , thus allowing us to rewrite
From the identity ( denotes the g-comp loss function) we have that
This fact together with eq. (26) prove that satisfies the conditions of Theorem 3, with . By application of Theorem 3 we obtain
It remains to study . Let us first consider . Note that
Since and are assumed bounded away from zero (positivity assumption), we can apply the Cauchy–Schwarz inequality to obtain
We now consider . By applying the Cauchy–Schwarz inequality we obtain
From the definition of in the CV-TMLE of , note that satisfies the equation
Applying the same equation for and , and subtracting it from the previous one yields
which can be written as
By empirical process theory [31, theorem 2.14.1], noting that the first two terms are empirical processes applied to functions in a class of functions , the expectations of the first two terms are bounded by . The third term is bounded by . These facts together with eq. (28) yield
Finally, consider the term . We can bound this term by , for
Let be the envelope of the class of functions , over which we take the maximum. We have . We will apply the following inequality for empirical processes indexed by a finite class of functions :
where F is an envelope of . Thus, given , we can bound the conditional expectation of by , which results in the following bound for the marginal expectation:
Accumulation of these bounds for the different components of and yields the following inequality:
where is either or , whichever gives the worst bound. This inequality can be written as , for
and can be solved using the quadratic formula as , which in turn implies , proving Theorem 2. □
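The quadratic-formula step rests on an elementary fact, sketched here in generic placeholder symbols. The symbols $x$, $b$ and $c$ are assumptions introduced for illustration and stand in for the nonnegative quantities in the inequality above; the article's specific terms and constants are not reproduced.

```latex
% For x >= 0 and b, c >= 0, x must lie below the larger root of
% t^2 - b t - c = 0:
\[
  x^2 \le b\,x + c
  \quad\Longrightarrow\quad
  x \le \frac{b + \sqrt{b^2 + 4c}}{2}.
\]
% Squaring and using (u + v)^2 <= 2u^2 + 2v^2 then gives
\[
  x^2 \le \frac{\bigl(b + \sqrt{b^2 + 4c}\bigr)^2}{4}
      \le \frac{2b^2 + 2\,(b^2 + 4c)}{4}
      = b^2 + 2c.
\]
```

The bound on $x^2$ is therefore of the same order as $b^2 + c$, which is how a bound on the squared quantity follows from the linear-in-$x$ inequality.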
Pearl J. Causality: models, reasoning, and inference. Cambridge: Cambridge University Press, 2000.
Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974;66(5):688. Available at: http://www.eric.ed.gov/ERICWebPortal/detail?accno=EJ118470.
Robins J. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Modell 1986;7:1393–512.
Rubin D. Matched sampling for causal effects. Cambridge, MA: Cambridge University Press, 2006.
van der Laan M, Robins J. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.
Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press, 1997.
Díaz I, van der Laan M. Assessing the causal effect of policies: an approach based on stochastic interventions. Technical report, 2012.
Díaz I, van der Laan M. Population intervention causal effects based on stochastic interventions. Biometrics 2012;68:541–9. Available at: http://dx.doi.org/10.1111/j.1541-0420.2011.01685.x.
Dawid AP, Didelez V. Identifying the consequences of dynamic treatment strategies: a decision-theoretic overview. CoRR 2010;abs/1010.3425.
Didelez V, Dawid AP, Geneletti S. Direct and indirect effects of sequential treatments. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. UAI, 2006:138–46.
Korb K, Hope L, Nicholson A, Axnick K. Varieties of causal intervention. In: Zhang C, Guesgen HW, Yeap W-K, editors. PRICAI 2004: trends in artificial intelligence. Lecture Notes in Computer Science, volume 3157. Berlin: Springer, 2004:322–31.
Neugebauer R, van der Laan M. Nonparametric causal effects based on marginal structural models. J Stat Plann Inference 2007;137:419–34. Available at: http://www.sciencedirect.com/science/article/pii/S0378375806000334.
van der Laan M, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, Berkeley, CA, 2003.
van der Vaart A. Notes on cross-validation. Technical report, Department of Mathematics, Free University, Amsterdam, 2003.
van der Laan M, Dudoit S, van der Vaart A. The cross-validated adaptive epsilon-net estimator. Stat Decisions 2006;24:373–95.
van der Vaart A, Dudoit S, van der Laan M. Oracle inequalities for multi-fold cross-validation. Stat Decisions 2006;24:351–71.
van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors. Ensemble machine learning. New York: Springer, 2012:117–56.
Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: van der Laan M, Rose S, editors. Targeted learning. New York: Springer Series in Statistics, Springer, 2011:459–74.
Wang Y, Bembom O, van der Laan M. Data adaptive estimation of the treatment specific mean. J Stat Plann Inference 2006;137(6):1871–87.
van der Vaart A. Semiparametric statistics. In: Bolthausen E, Perkins EA, van der Vaart AW. Lectures on probability theory and statistics: Ecole d’Ete de Probabilites de Saint-Flour XXIX-1999. Bernard P, editor. Springer, 2002:331–457.
Zheng W, van der Laan MJ. Asymptotic theory for cross-validated targeted maximum likelihood estimation. Division of Biostatistics Working Paper Series, 2010.
van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York: Springer-Verlag, 1996.