# Epidemiologic Methods

### Edited by faculty of the Harvard School of Public Health

Ed. by Tchetgen Tchetgen, Eric J / VanderWeele, Tyler J. / Daniel, Rhian

1 Issue per year

Online ISSN 2161-962X
Volume 3, Issue 1

# Discussion of Identification, Estimation and Approximation of Risk under Interventions that Depend on the Natural Value of Treatment Using Observational Data, by Jessica Young, Miguel Hernán, and James Robins

Mark J. van der Laan / Alexander R. Luedtke / Iván Díaz
Published Online: 2014-11-07 | DOI: https://doi.org/10.1515/em-2014-0012

## Abstract

Young, Hernán, and Robins consider the mean outcome under a dynamic intervention that may rely on the natural value of treatment. They first identify this value with a statistical target parameter, and then show that this statistical target parameter can also be identified with a causal parameter which gives the mean outcome under a stochastic intervention. The authors then describe estimation strategies for these quantities. Here we augment the authors’ insightful discussion by sharing our experiences in situations where two causal questions lead to the same statistical estimand, or the newer problem that arises in the study of data adaptive parameters, where two statistical estimands can lead to the same estimation problem. Given a statistical estimation problem, we encourage others to always use a robust estimation framework where the data generating distribution truly belongs to the statistical model. We close with a discussion of a framework which has these properties.

## 1 Basic summary of article to set stage for discussion

The authors of this excellent article discuss the identification and estimation of the mean outcome under a dynamic intervention that assigns treatment not only in response to the observed past before treatment but also in response to the actual observed treatment itself under this intervention, where the latter is called the natural value of treatment. We want to congratulate the authors for this nice, welcome, and inspiring article.

Let us consider a single time point example of the type studied in Díaz and van der Laan (2012) in order to provide a very basic summary of the article and set the stage for this discussion. Even though the article considers much more complex, general longitudinal data structures, we believe this simpler example is useful as a starting point for discussion.

Consider a nonparametric structural equation model [NPSEM, Pearl, 2009] with $W={f}_{W}\left({U}_{W}\right)$, $A={f}_{A}\left(W,{U}_{A}\right)$, $Y={f}_{Y}\left(W,A,{U}_{Y}\right)$, defined by unspecified functions ${f}_{W},{f}_{A},{f}_{Y}$, and some model on the probability distribution of $U=\left({U}_{W},{U}_{A},{U}_{Y}\right)$. This defines the model on the full data $\left(U,W,A,Y\right)$ and the observed data $O=\left(W,A,Y\right)$. Here W are baseline characteristics, A is the intervention node (e.g. treatment variable, missingness indicator, etc.), and Y is the outcome of interest. The full data model (i.e. the allowed set of probability distributions of $\left(U,O\right)$) implies the observed data model (i.e. the allowed set of probability distributions of O). The latter is called the statistical model. Consider now an intervention defined by $W={f}_{W}\left({U}_{W}\right)$, $A={f}_{A}\left(W,{U}_{A}\right)$, ${A}_{d}=d\left(A,W\right)$, ${Y}_{d}={f}_{Y}\left(W,{A}_{d},{U}_{Y}\right)$, where d is a deterministic function mapping the observed treatment A and covariates W into the treatment value that is assigned to the unit under the intervention. The authors refer to such an intervention as a dynamic intervention that depends on the natural value of treatment. The authors show that the mean outcome under this intervention d is equivalent to the mean outcome under a stochastic intervention ${g}_{0}^{\ast }$ on A that is only a function of W. The counterfactuals under this stochastic intervention are denoted as ${Y}_{{g}_{0}^{\ast }}$. 
We note that in the special case of a single time point, the natural value of treatment actually equals the observed treatment, while in a longitudinal data structure the natural value of treatment is the counterfactual treatment that would have been observed given the intervention was followed in the past.

Instead of using ${f}^{\mathrm{o}\mathrm{b}\mathrm{s}}$ for the treatment/censoring mechanism, we will use the commonly used notation g. This differs from the use of g in the main text and appendix B of the work of interest where g was, respectively, used to represent dynamic regimes which do not depend on the natural value of treatment and dynamic regimes which may depend on the natural value of treatment. We instead use d to represent a dynamic treatment that may depend on the natural value of treatment. In the main text, the authors use ${f}^{d}$ to describe the distribution of such a (possibly stochastic) rule, whereas in this commentary we will focus on deterministic rules d for simplicity. Finally, we use ${g}^{\ast }$ instead of ${f}^{\mathrm{i}\mathrm{n}\mathrm{t}}$ for the stochastic intervention that corresponds with the dynamic intervention that relies on the natural value of treatment.

Let $O=\left(W,A,Y\right)\sim {P}_{0}$ and ${O}_{d}=\left(W,A,{A}_{d},{Y}_{d}\right)\sim {P}_{0,d}$ where ${P}_{0,d}$ is called the post-intervention probability distribution. The notation for the variables has changed slightly from the original work to emphasize that we are considering the simpler point treatment case in this commentary. Note that ${P}_{0,d}$ is determined by the probability distribution of the full data $\left(U,O\right)$. The first goal is to identify the full-data parameter ${E}_{{P}_{0,d}}{Y}_{d}$ as a mapping depending only on the observed data distribution ${P}_{0}$, so that the mean outcome under intervention d can be learned from the observed data. Robins et al. (2004) proposed the extended g-computation formula for this parameter for general longitudinal data structures, and Richardson and Robins (2013) establish the desired identifiability conditions, under the statistical assumption that the extended g-computation formula is well-defined (i.e. the positivity condition holds).

As the authors nicely demonstrate, in many applications this type of intervention might be considered unrealistic and therefore uninteresting: after the unit has received its natural treatment, one cannot turn back the clock and undo this treatment by replacing it with a new treatment value $d\left(A,W\right)$. In these applications, the authors propose approximating the target intervention by a dynamic intervention $\left({A}_{1},W\right)\to d\left({A}_{1},W\right)$, where ${A}_{1}$ is now an intended treatment value instead of the actual realized treatment value, under the assumption that one actually observes such an intended treatment value.

On the other hand, one can also imagine applications in which $\left(A,W\right)\to d\left(A,W\right)$ corresponds with an augmentation of the observed treatment value, in which case such an intervention measures the effect of augmenting the treatment by a certain amount that possibly depends on the characteristics of the unit. Thus clearly identification of the mean outcome under such a type of intervention is of both theoretical and practical interest. The SWIG causal graph theory developed in Richardson and Robins (2013) provides a graphical methodology to establish such identification results for general complex, longitudinal data structures. In the single time point example, it is also possible to establish the desired identifiability mathematically, as in Díaz and van der Laan (2012). If

• (i) Randomization: A is independent of ${U}_{Y}$, given W, and

• (ii) Positivity: $P\left(A=a|W=w\right)=0$ implies $P\left(d\left(A,W\right)=a|W=w\right)=0$ for all w in the support of W,

then $\begin{array}{rll}{E}_{0}{Y}_{d}& ={E}_{0}{f}_{Y}\left(W,d\left(A,W\right),{U}_{Y}\right)& \\ & =\sum _{a,w}{E}_{0}\left[{f}_{Y}\left(W,d\left(A,W\right),{U}_{Y}\right)|d\left(A,W\right)=a,W=w\right]{P}_{0}\left(d\left(A,W\right)=a,W=w\right)& \\ & =\sum _{a,w}{E}_{0}\left[{f}_{Y}\left(w,a,{U}_{Y}\right)|A\in {d}_{W}^{-1}\left(a\right),W=w\right]{P}_{0}\left(d\left(A,W\right)=a,W=w\right)& \\ & =\sum _{a,w}{E}_{0}\left[{f}_{Y}\left(w,a,{U}_{Y}\right)|W=w\right]{P}_{0}\left(d\left(A,W\right)=a|W=w\right){P}_{0}\left(W=w\right)& \text{by (i)}\\ & =\sum _{a,w}{E}_{0}\left[{f}_{Y}\left(w,a,{U}_{Y}\right)|A=a,W=w\right]{P}_{0}\left(d\left(A,W\right)=a|W=w\right){P}_{0}\left(W=w\right)& \text{by (i) and (ii)}\\ & =\sum _{a,w}{E}_{{P}_{0}}\left(Y|A=a,W=w\right){g}_{0}^{\ast }\left(a|w\right){P}_{0}\left(W=w\right),& \end{array}$ where ${d}_{W}^{-1}\left(a\right)=\left\{{a}^{\prime }:{d}_{W}\left({a}^{\prime },W\right)=a\right\}$ and ${g}_{0}^{\ast }\left(\cdot |w\right)$ is the conditional distribution of $d\left(A,W\right)$ given $W=w$, which can be identified as a function of ${P}_{0}\left(A|W\right)$.

Note that, in this simple case where such mathematical derivation is tractable, the necessary identification conditions arise naturally in the derivation process. The mathematical derivation above also shows that ${E}_{0}{Y}_{d}$ equals the mean outcome ${E}_{0}{Y}_{{g}_{0}^{\ast }}$ under a stochastic intervention that replaces the equation $A={f}_{A}\left(W,{U}_{A}\right)$ by drawing A given W, from ${g}_{0}^{\ast }$. This point is made in general by the authors: the extended g-computation formula of the mean outcome under dynamic interventions depending on the natural value of treatment equals the regular g-computation formula for the mean outcome under a stochastic intervention that first involves drawing from the actual conditional distribution of treatment, before evaluating the deterministic rule. The authors stress that the equivalence of the two g-computation formulas shows that the positivity assumption for this stochastic intervention equals the positivity assumption for the dynamic intervention depending on the natural value of treatment, and that one can use estimators developed for stochastic interventions to estimate the mean under these dynamic interventions depending on natural value of treatment. The authors propose a particular inverse probability of treatment weighted estimator, and contrast estimators based on parametric models and estimators based on semi-parametric statistical models.
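The equality between the two g-computation formulas can be checked numerically in a small discrete setting. The sketch below is only an illustration under assumed inputs: the distributions `p_w`, `g0`, the outcome regression `q_bar`, and the rule `d` (raise the dose by one level, capped at 2) are hypothetical and are not taken from the article. It builds ${g}_{0}^{\ast }$ as the conditional distribution of $d\left(A,W\right)$ given W, checks the positivity condition (ii), and compares the g-computation formula under ${g}_{0}^{\ast }$ with the extended g-computation formula written in terms of the natural value of treatment.

```python
from collections import defaultdict

# Hypothetical discrete example; every number below is an illustrative assumption.
p_w = {0: 0.6, 1: 0.4}                             # P0(W = w)
g0 = {0: {0: 0.5, 1: 0.3, 2: 0.2},                 # g0(a | w), observed treatment mechanism
      1: {0: 0.2, 1: 0.3, 2: 0.5}}
q_bar = {(0, 0): 1.0, (1, 0): 1.5, (2, 0): 2.5,    # Qbar0(a, w) = E0[Y | A = a, W = w]
         (0, 1): 0.5, (1, 1): 2.0, (2, 1): 3.0}


def d(a, w):
    """Hypothetical rule using the natural treatment value: one level higher, capped at 2."""
    return min(a + 1, 2)


def induced_stochastic_intervention(g0, d):
    """Return g0*(. | w), the conditional distribution of d(A, W) given W = w."""
    g_star = {}
    for w, dist in g0.items():
        pushed = defaultdict(float)
        for a_natural, prob in dist.items():
            pushed[d(a_natural, w)] += prob
        g_star[w] = dict(pushed)
    return g_star


def positivity_holds(g0, g_star):
    """Condition (ii): g0*(a | w) > 0 requires g0(a | w) > 0."""
    return all(g0[w].get(a, 0.0) > 0
               for w in g_star for a, prob in g_star[w].items() if prob > 0)


def g_formula(q_bar, g, p_w):
    """Sum over (a, w) of Qbar0(a, w) g(a | w) P0(W = w)."""
    return sum(q_bar[(a, w)] * g[w][a] * p_w[w] for w in p_w for a in g[w])


def extended_g_formula(q_bar, g0, d, p_w):
    """Sum over (a, w) of Qbar0(d(a, w), w) g0(a | w) P0(W = w)."""
    return sum(q_bar[(d(a, w), w)] * g0[w][a] * p_w[w] for w in p_w for a in g0[w])


g_star = induced_stochastic_intervention(g0, d)
psi_stochastic = g_formula(q_bar, g_star, p_w)
psi_dynamic = extended_g_formula(q_bar, g0, d, p_w)
print(positivity_holds(g0, g_star), psi_stochastic, psi_dynamic)
```

Both quantities are computed from the same observed data distribution, which is why an estimator built for one can be reused for the other.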

## 2 Discussion items

We focus the discussion on the following points, which are indirectly or directly raised by this article:

Separation of statistical estimation and causal modeling: By recognizing that the identifiability results for two different causal parameters result in the same estimand and statistical model, one can borrow statistical methods and their properties (including their statistical assumptions such as the positivity assumption) developed for one causal parameter to solve the statistical estimation problem for the other.

Enhancing statistical interpretation by using multiple nested identifiability results: Consider two identifiability results that correspond with the same estimand and statistical model, but one result relies on weaker or the same causal assumptions as the other (i.e. one set of assumptions is a subset of the other set of assumptions). Should one not use the former in the interpretation of the statistical results?

Data adaptive target parameters: The estimand for many causal effects of interest corresponds with the estimand for the mean outcome under a stochastic intervention that itself needs to be learned from the data: the mean outcome under a dynamic treatment depending on the natural value of treatment represents one such example. Is it not of interest to define data adaptive target parameters/estimands defined by replacing the stochastic intervention by a data dependent fit of this stochastic intervention? What are the implications for statistical inference for such data adaptive target parameters?

Robust statistical inference: When pursuing statistical inference for the causal quantity of interest, what is the scientific rationale (if any) to select statistical methods that rely on parametric assumptions?

We discuss each of these points in some detail in the remainder of this commentary.

## 3 Separating statistical estimation from causal modeling

The full data model and full-data target parameter play an important role in obtaining knowledge from subject-matter experts about the data generating experiment and determining the full data target parameter that represents the answer to the scientific question of interest. The full-data model ${\mathcal{M}}^{F}$ should represent a priori knowledge about the phenomena under study, and the full-data target parameter ${\Psi }^{F}:{\mathcal{M}}^{F}\to \mathbb{R}$ should provide the answer ${\mathrm{\psi }}_{0}^{F}={\mathrm{\Psi }}^{F}\left({P}_{0}^{F}\right)$ to the scientific question of interest.

Subsequently, it is necessary to establish identifiability of the full-data target parameter from the probability distribution of the observed data, under assumptions which might exceed the assumptions coded by the full-data model. Based on these findings, one will need to commit to a statistical model $\mathcal{M}$ and a statistical target parameter, $\Psi :\mathcal{M}\to \mathbb{R}$, with the following two main considerations: (1) the statistical model incorporates the realistic assumptions in the full data model ${\mathcal{M}}^{F}$, but not the possibly unrealistic extra assumptions that were needed for the identifiability result, in order to guarantee that the statistical model contains the true probability distribution of the data (i.e. ${P}_{0}\in \mathcal{M}$), and (2) the target parameter defines an estimand ${\mathrm{\psi }}_{0}=\mathrm{\Psi }\left({P}_{0}\right)$ that approximates the full data parameter value ${\mathrm{\psi }}_{0}^{F}$ as well as the data allow. In particular, the estimand ${\mathrm{\psi }}_{0}$ should equal the full data target parameter value ${\mathrm{\psi }}_{0}^{F}$ when the identifiability assumptions hold. At this point, the statistical estimation problem is well defined. The full-data target parameter and underlying full-data model can be completely ignored in the process of developing estimators and corresponding statistical inference for the statistical parameter.

Consider two of these exercises, possibly starting with different full-data models and full-data parameters, but leading to the same statistical model and statistical target parameter, so that the two statistical estimation problems are identical. In this case, it would be most scientifically coherent to have an estimation procedure that depends only on assumptions affecting the statistical model and statistical target parameter. Therefore, it is very good practice to always be explicit in the formulation of the statistical model $\mathcal{M}$ and target parameter $\mathrm{\Psi }$ so that the scientific community knows what statistical estimation problem has been addressed, which might be relevant for answering other scientific questions of interest as well. The roadmaps for causal inference presented in Rose and van der Laan (2011), Pearl (2009), and Petersen and van der Laan (2014) make each of these steps explicit. The only role of the full-data model, full data target parameter, and the identifiability result in the estimation process is to generate a statistical model and statistical target parameter. The authors of this article have exemplified this insight by borrowing statistical results for mean outcomes under stochastic interventions: for a well-defined stochastic intervention, that problem has the same statistical model and statistical target parameter as the mean outcome under dynamic treatments depending on the natural value of treatment.

We have used this general insight in our work as well. For example, in Hubbard et al. (2011), we noted that the identifiability result for a particular type of natural direct effect yielded the same estimand and statistical model as used for the causal effect among the treated, even though the two problems assumed a different time ordering of the data and thus incomparable sets of causal assumptions. This allowed us to use the efficient and double robust targeted minimum loss-based estimator (TMLE) developed for the causal effect among the treated (Rose and van der Laan 2011) to efficiently estimate this natural direct effect parameter. Similarly, in Lendle et al. (2013) we borrowed the latter TMLE for the effect among the untreated to efficiently estimate the natural direct effect among the untreated. In section A5 of the appendix in Rose and van der Laan (2011), we discuss this general and useful (although trivial in some sense) point in more detail: statistical theorems are invariant to (e.g. non-testable) assumptions that do not change the statistical model and statistical target parameter, allowing us to use the same theorems across very different applications and causal models.

## 4 Enhancing interpretation of statistical output by referencing multiple identifiability results

The authors discuss identification and estimation of the mean outcome under a dynamic intervention that depends on the natural value of treatment. Suppose that a data analyst uses the extended g-computation formula to define an estimator and also provides a $95\%$ confidence interval under statistical assumptions S. The data analyst could now make the following two statements: (1) the confidence interval (as a random interval) contains the statistical estimand with probability 0.95 under the statistical assumptions S; (2) the confidence interval contains the mean outcome under the dynamic intervention depending on the natural value of treatment with probability 0.95 under the statistical assumptions S and the additional identifiability (causal) assumptions C. Statement (1) concerns the pure statistical interpretation of the estimand. Statement (2) concerns the desired causal quantity, under additional assumptions C. As shown by the authors, under the same causal assumptions C, the estimand also equals the mean outcome under a corresponding stochastic intervention where A is drawn from ${g}_{0}^{\ast }\left(\cdot |w\right)$ conditional on $W=w$. Thus, the data analyst could make a third statement: (3) the confidence interval contains the mean outcome under the stochastic intervention ${g}_{0}^{\ast }$ with probability 0.95 under the same assumptions S and C. For a longitudinal rather than point treatment data structure, the causal assumptions for (3) can be a subset of the causal assumptions for (2), so that the causal interpretations that can be applied will also vary with which causal assumptions hold.

In our point treatment example (Díaz and van der Laan, 2012), the application of interest might be one where the dynamic intervention $\left(A,W\right)\to d\left(A,W\right)$ cannot be carried out in the real world, but the stochastic intervention represents a perfectly plausible experiment of interest. In that case, the additional statement (3) is important for the interpretation of the statistical output.

Let us consider another example. Suppose $O=\left(W,A,Z,Y\right)\sim {P}_{0}$ and assume the nonparametric structural equation model $W={f}_{W}\left({U}_{W}\right)$, $A={f}_{A}\left(W,{U}_{A}\right)$, $Z={f}_{Z}\left(W,A,{U}_{Z}\right)$ and $Y={f}_{Y}\left(W,A,Z,{U}_{Y}\right)$. Suppose that one is concerned with estimation of the natural direct effect defined as ${\mathrm{\psi }}_{0}^{F}={E}_{0}\left[Y\left(1,{Z}_{0}\right)-Y\left(0,{Z}_{0}\right)\right]$, where $Y\left(a,{Z}_{0}\right)={f}_{Y}\left(W,a,{Z}_{0},{U}_{Y}\right)$ and ${Z}_{0}={f}_{Z}\left(W,0,{U}_{Z}\right)$, $a\in \left\{0,1\right\}$. One may now use the following identifiability result from the current literature [e.g. Petersen et al. (2006)]: if (${C}_{1}$) $\left(A,Z\right)\perp Y\left(a,z\right)|W$ for all values $\left(a,z\right)$, (${C}_{2}$) $A\perp Z\left(a\right)|W$ for all values a, and (${C}_{3}$) ${E}_{0}\left[Y\left(1,z\right)-Y\left(0,z\right)|Z\left(0\right)=z,W\right]={E}_{0}\left[Y\left(1,z\right)-Y\left(0,z\right)|W\right]$ for all z, then ${\psi }_{0}^{F}=\int \left\{{\overline{Q}}_{0}\left(1,w,z\right)-{\overline{Q}}_{0}\left(0,w,z\right)\right\}d{P}_{0}\left(z|A=0,w\right)d{P}_{0}\left(w\right)\equiv \Psi \left({P}_{0}\right).$ Here we denote ${E}_{0}\left[Y|A=a,W=w,Z=z\right]$ by ${\overline{Q}}_{0}\left(a,w,z\right)$. 
The data analyst who just computed a $95\%$ confidence interval for ${\mathrm{\psi }}_{0}$ under statistical assumptions S can now make the following statements: (1) the confidence interval contains ${\mathrm{\psi }}_{0}$ with probability 0.95 under assumptions S; (2) the confidence interval contains ${\mathrm{\psi }}_{0}^{F}$ with probability 0.95 under assumptions S and the above listed causal assumptions ${C}_{1},{C}_{2},{C}_{3}$. However, many will argue that assumption ${C}_{3}$ is particularly hard to defend. One may now note that the estimand ${\mathrm{\psi }}_{0}$ also equals $ND{E}^{\ast }\equiv {E}_{0}\left[{Y}_{{g}_{0,1}^{\ast }}-{Y}_{{g}_{0,0}^{\ast }}\right]$, where ${g}_{0,a}^{\ast }$ is the stochastic intervention on $\left(A,Z\right)$ defined as: ${g}_{0,a}^{\ast }\left({a}^{\prime },z|W\right)=I\left({a}^{\prime }=a\right){P}_{0}\left(Z=z|A=0,W\right)$, $a\in \left\{0,1\right\}$.

In other words, the full data parameter $ND{E}^{\ast }$ is now defined in terms of the mean outcomes under two stochastic interventions on $\left(A,Z\right)$ that deterministically set $A=1$ or $A=0$ and draw Z from the conditional distribution of Z, given $A=0,W$ (which equals the conditional distribution of ${Z}_{0}$, given W, by ${C}_{2}$). Since $ND{E}^{\ast }$ equals $\mathrm{\Psi }\left({P}_{0}\right)$ under the randomization assumptions ${C}_{1}$ and ${C}_{2}$ only, the data analyst can now also state (3): the confidence interval contains ${\mathrm{\psi }}_{0}^{F\ast }\equiv ND{E}^{\ast }$ with probability 0.95 under S and ${C}_{1},{C}_{2}$. In this manner, one might still obtain reliable inference for $ND{E}^{\ast }$ while reliable inference for NDE is out of the question, due to the indefensible assumption ${C}_{3}$ (Zheng and van der Laan, 2011). Using this approach, Zheng and van der Laan (2012) obtain an identifiability result for a natural direct effect on a time to event outcome, controlling for a time-dependent intermediate process defined in terms of a mean outcome under a stochastic intervention only differing in a static intervention on treatment, where the identifiability only relies on the sequential randomization assumptions required for identification of the mean outcome under these two stochastic interventions.
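To make the estimand $\mathrm{\Psi }\left({P}_{0}\right)$ concrete, the following sketch evaluates it for a fully discrete, hypothetical observed data distribution, once through the mediation formula and once as the difference of the g-computation formulas under the two stochastic interventions ${g}_{0,1}^{\ast }$ and ${g}_{0,0}^{\ast }$. The tables `p_w`, `p_z_a0` and `q_bar` are made-up numbers chosen for illustration, and the function names are assumptions of this sketch rather than anything defined in the cited work.

```python
# Hypothetical discrete observed data distribution; all numbers are illustrative.
p_w = {0: 0.5, 1: 0.5}                                   # P0(W = w)
p_z_a0 = {0: {0: 0.7, 1: 0.3},                           # P0(Z = z | A = 0, W = w), keyed by w
          1: {0: 0.4, 1: 0.6}}
q_bar = {(0, 0, 0): 1.0, (0, 0, 1): 2.0,                 # Qbar0(a, w, z) = E0[Y | A=a, W=w, Z=z]
         (0, 1, 0): 1.5, (0, 1, 1): 2.5,
         (1, 0, 0): 2.0, (1, 0, 1): 3.5,
         (1, 1, 0): 2.0, (1, 1, 1): 4.0}


def g_star(a_set, w):
    """g*_{0,a}(a', z | w) = I(a' = a) P0(Z = z | A = 0, W = w), stored on its support."""
    return {(a_set, z): prob for z, prob in p_z_a0[w].items()}


def mean_under(a_set):
    """g-computation formula for the mean outcome under the stochastic intervention g*_{0,a}."""
    return sum(p_w[w] * prob * q_bar[(a, w, z)]
               for w in p_w
               for (a, z), prob in g_star(a_set, w).items())


# Difference of the two stochastic-intervention means (the NDE* representation).
psi_stochastic = mean_under(1) - mean_under(0)

# Direct evaluation of the integral defining Psi(P0).
psi_direct = sum(p_w[w] * p_z_a0[w][z] * (q_bar[(1, w, z)] - q_bar[(0, w, z)])
                 for w in p_w for z in p_z_a0[w])

print(psi_stochastic, psi_direct)
```

The two numbers agree because they are the same functional of the observed data distribution; only the causal interpretation, and the assumptions needed for it, differ.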

## 5 Statistical inference for data adaptive target parameters such as the mean outcome under a stochastic intervention learned from data

It appears that many causal parameters of interest are defined by a mean outcome under a stochastic intervention that itself needs to be learned from data. Let us denote this causal quantity with ${E}_{0}{Y}_{{g}_{0}^{\ast }}$, where ${g}_{0}^{\ast }$ denotes a stochastic intervention that can be identified as a function of ${P}_{0}$. For example, as argued above, the article under discussion defines a full data parameter whose g-computation formula equals the extended g-computation formula for the mean outcome under a dynamic treatment that depends on the natural value of treatment. The authors might agree that in some applications, in which the dynamic intervention is impossible to carry out and “intended treatment values” are not available, ${E}_{0}{Y}_{{g}_{0}^{\ast }}$ might be of more interest than the original dynamic treatment parameter. The discussion in this section is relevant in such cases.

Above, we indicated that natural direct effect parameters inspire analogous natural direct effect parameters which are now defined in terms of stochastic interventions. The mean outcome ${E}_{0}{Y}_{{d}_{0}}$ under an optimal dynamic treatment ${d}_{0}=\arg {\min }_{d\in \mathcal{D}}{E}_{0}{Y}_{d}$ is another example of interest, where ${g}_{0}^{\ast }={d}_{0}$ is now deterministic but unknown nonetheless. van der Laan and Petersen (2007) and Robins et al. (2008) recommend defining causal quantities (e.g. working marginal structural models) that correspond with realistic dynamic treatment interventions defined as rules that satisfy the strong positivity assumption, where it is often possible to define such rules in terms of the actual (unknown) treatment mechanism ${g}_{0}$. The mean outcome under such a realistic rule is now ${E}_{0}{Y}_{{d}_{0}}$ where ${d}_{0}$ is a dynamic treatment defined in terms of ${g}_{0}$. An example of a realistic rule that would belong to this class for the point treatment data structure $O=\left(W,A,Y\right)$ is the dynamic treatment ${d}_{0}\left(W\right)=I\left({g}_{0}\left(1|W\right)>\mathrm{\delta }\right)$ for some $\mathrm{\delta }>0$ that sets $A=1$ if there is support, but sets $A=0$ otherwise.

Suppose that ${g}_{n}^{\ast }={\hat{g}}^{\ast }\left({P}_{n}\right)$ is an estimator of this unknown stochastic intervention ${g}_{0}^{\ast }$, mapping the empirical probability distribution ${P}_{n}$ of the observed data sample ${O}_{1},\dots ,{O}_{n}$ into a realized estimate of ${g}_{0}^{\ast }$. Given a data set, we have this estimate ${g}_{n}^{\ast }$ in hand. One can imagine that after we have presented our collaborator with a confidence interval for ${E}_{0}{Y}_{{g}_{0}^{\ast }}$, he or she might ask, what is ${g}_{0}^{\ast }$ like? The natural answer is to show the collaborator a plot of our estimate ${g}_{n}^{\ast }$. Our collaborator might then also consider the target parameter ${\left.{E}_{0}{Y}_{{g}^{\ast }}\right|}_{{g}^{\ast }={g}_{n}^{\ast }}$, which would tell us what would happen if the tangible rule ${g}_{n}^{\ast }$ were actually implemented in the population. This parameter is known, given the data, and thus well-defined. Our collaborator might want statistical inference for this data adaptive target parameter as well: that is, one wants a confidence interval that contains the random data adaptive parameter with probability 0.95. In van der Laan et al. (2013) we defined such general data adaptive target parameters and established various theorems for statistical inference. In particular, statistical inference can be developed for such data adaptive parameters under appropriate conditions, including a Donsker class condition and a stabilization condition on ${g}_{n}^{\ast }$ [see theorem 1 in van der Laan et al. (2013)]. The main message is that one can use the same estimator as developed for ${E}_{0}{Y}_{{g}_{0}^{\ast }}$, but the influence curve is different since it contains no contribution due to estimation of ${g}_{0}^{\ast }$. As a consequence, one often ends up with narrower confidence intervals. 
In fact, it might be difficult or impossible to develop valid inference for ${E}_{0}{Y}_{{g}_{0}^{\ast }}$, while statistical inference for the data adaptive target parameter can be simply based on the estimator of ${E}_{0}{Y}_{{g}^{\ast }}$ for a fixed ${g}^{\ast }$, but setting ${g}^{\ast }={g}_{n}^{\ast }$.

For example, in van der Laan (2013) and van der Laan and Luedtke (2014b), we developed such estimators and such confidence intervals for the mean outcome under an estimate ${d}_{n}$ of the optimal dynamic treatment ${d}_{0}$. In this case, one can show that ${E}_{0}{Y}_{{d}_{n}}-{E}_{0}{Y}_{{d}_{0}}$ is a second-order term so that one might assume that it is ${o}_{P}\left(1/\sqrt{n}\right)$. Under that assumption and the assumption that the blip functions are nonzero with probability one (Robins and Rotnitzky, 2014; van der Laan and Luedtke, 2014a), the statistical inference for ${E}_{0}{Y}_{{d}_{0}}$ and ${\left.{E}_{0}{Y}_{d}\right|}_{d={d}_{n}}$ relies on the same estimator and same confidence intervals. Nonetheless, even in this case, the confidence interval for ${E}_{0}{Y}_{{d}_{n}}$ avoids reliance on this assumption ${E}_{0}{Y}_{{d}_{n}}-{E}_{0}{Y}_{{d}_{0}}={o}_{P}\left(1/\sqrt{n}\right)$, and one obtains better finite sample performance of the estimator and confidence interval even if this assumption holds.

As another example, consider the point treatment data structure and a realistic rule ${d}_{0}$ that sets $A=1$ if ${g}_{0}\left(1|W\right)>\mathrm{\delta }>0$ and sets $A=0$ otherwise. Statistical inference for ${E}_{0}{Y}_{{d}_{0}}$ is problematic due to the fact that the unknown ${g}_{0}$ appears within an indicator defining the treatment rule. As a consequence, the contribution ${E}_{0}{Y}_{{d}_{n}}-{E}_{0}{Y}_{{d}_{0}}$ obtained by estimating this rule might not behave well, so that, contrary to the optimal dynamic treatment example, it is unreasonable to assume that ${E}_{0}{Y}_{{d}_{n}}-{E}_{0}{Y}_{{d}_{0}}={o}_{P}\left(1/\sqrt{n}\right)$. If one is willing to assume that $w\to I\left({g}_{n}\left(1|w\right)>\mathrm{\delta }\right)$ has a limit in ${L}_{2}\left({P}_{0}\right)$ as n gets large, then these serious statistical inference problems for ${E}_{0}{Y}_{{d}_{0}}$ are completely avoided by simply targeting ${E}_{0}{Y}_{{d}_{n}}$ where the given rule ${d}_{n}$ is now defined by setting $A=1$ if ${g}_{n}\left(1|W\right)>\mathrm{\delta }>0$ and $A=0$ otherwise. Few people would claim that the latter is less interesting than the mean outcome under the unknown realistic rule ${d}_{0}$.
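The realized rule ${d}_{n}$ in this example is straightforward to construct once ${g}_{n}$ is available. The sketch below uses a small made-up sample with a binary covariate, estimates ${g}_{n}\left(1|w\right)$ by stratum proportions, forms ${d}_{n}\left(w\right)=I\left({g}_{n}\left(1|w\right)>\mathrm{\delta }\right)$, and then computes a simple plug-in (g-computation) estimate of ${E}_{0}{Y}_{{d}_{n}}$ from stratum means. The data, the threshold `delta`, and the saturated plug-in estimator are assumptions made for illustration; the targeted estimators and the inference discussed in this commentary are not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy sample of (w, a, y) triples; the values are illustrative only.
data = ([(0, 1, 3.0)] * 5 + [(0, 0, 1.0)] * 5 +
        [(1, 1, 4.0)] * 1 + [(1, 0, 2.0)] * 9)
delta = 0.2  # assumed support threshold


def fit_g_n(sample):
    """Empirical estimate g_n(1 | w): proportion treated within each stratum of W."""
    n_w, n_treated = defaultdict(int), defaultdict(int)
    for w, a, _ in sample:
        n_w[w] += 1
        n_treated[w] += a
    return {w: n_treated[w] / n_w[w] for w in n_w}


def fit_q_bar(sample):
    """Empirical stratum means Qbar_n(a, w) of the outcome."""
    total, count = defaultdict(float), defaultdict(int)
    for w, a, y in sample:
        total[(a, w)] += y
        count[(a, w)] += 1
    return {key: total[key] / count[key] for key in total}


g_n = fit_g_n(data)
# Realized rule: treat only in strata where the estimated treatment probability exceeds delta.
d_n = {w: int(g_n[w] > delta) for w in g_n}
q_bar_n = fit_q_bar(data)

# Plug-in estimate of E0 Y_{d_n}: average Qbar_n(d_n(W_i), W_i) over the sample.
psi_n = sum(q_bar_n[(d_n[w], w)] for w, _, _ in data) / len(data)
print(g_n, d_n, psi_n)
```

Here the rule `d_n` is treated as fixed once the data are observed, which is the sense in which ${E}_{0}{Y}_{{d}_{n}}$ is a data adaptive target parameter.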

In van der Laan and Luedtke (2014b), our estimator ${d}_{n}$ is based on a highly data adaptive super-learner of ${d}_{0}$ developed in Luedtke and van der Laan (2014), so that one might be concerned that the Donsker class condition on ${d}_{n}$ might be violated theoretically or negatively affect the finite sample coverage of the confidence interval for ${E}_{0}{Y}_{{d}_{n}}$. To deal with this challenge, in van der Laan et al. (2013), van der Laan (2013), and van der Laan and Luedtke (2014b) we started a general theory for estimation and inference for data adaptive parameters, such as theorem 2 in van der Laan et al. (2013) that avoids any conditions on the estimator ${\hat{g}}^{\ast }$, beyond convergence to some fixed ${g}^{\ast }$. First, we defined data adaptive target parameters of the type $\frac{1}{V}{\sum }_{v=1}^{V}{\left.{E}_{0}{Y}_{g}\right|}_{g=\hat{g}\left({P}_{n,v}^{0}\right)}$, where ${P}_{n,v}^{0}$ is the training sample for split v in a V-fold sample splitting scheme. That is, one uses the vth training sample to generate a v-specific data adaptive target parameter ${E}_{0}{Y}_{g}$ with $g=\hat{g}\left({P}_{n,v}^{0}\right)$, and the final data adaptive target parameter is defined as the average of these v-specific data adaptive parameters across the V splits. As shown in van der Laan et al. (2013), one can estimate and obtain inference for such a V-fold data adaptive target parameter by estimating each v-specific data adaptive target parameter based on the v-specific complementary sample. However, if estimators of these v-specific target parameters are highly non-linear, such an estimator will suffer from large second-order terms. 
Therefore, in van der Laan (2012), van der Laan and Luedtke (2014b) we developed a cross-validated TMLE in which only the targeting step relies on cross-validation, and as a consequence the actual estimator of this V-fold data adaptive target parameter will have nice theoretical and practical behavior, not negatively affected by the sample splitting. Specifically, in van der Laan and Luedtke (2014b) we present a cross-validated TMLE of $\frac{1}{V}{\sum }_{v=1}^{V}{E}_{0}{Y}_{\hat{d}\left({P}_{n,v}^{0}\right)}$, where ${P}_{n}\to \hat{d}\left({P}_{n}\right)$ is a super-learner of the optimal dynamic treatment ${d}_{0}$. In this case the CV-TMLE presents a general method for general cross-validated data adaptive parameters.
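The V-fold construction can be sketched in a few lines. The code below illustrates only the simple sample-splitting estimator (learn the rule on the training sample, evaluate it on the complementary validation sample, average over splits); it is not the CV-TMLE. It assumes a hypothetical randomized study with known treatment probability `G0`, a deterministic toy data set, a naive rule learner that compares stratum means, and an inverse probability weighted evaluation on each validation sample. All of these choices are assumptions made for the sake of a runnable example.

```python
from collections import defaultdict

V = 5
G0 = 0.5  # assumed known randomization probability P(A = 1 | W)

# Hypothetical deterministic toy data: treatment helps when w = 1 and hurts when w = 0.
pairs = [(i % 2, (i // 2) % 2) for i in range(40)]
data = [(w, a, 1.0 + a * (2 * w - 1)) for w, a in pairs]
folds = [(i // 4) % V for i in range(len(data))]  # deterministic fold assignment


def learn_rule(train):
    """Naive rule learner: treat in stratum w if the treated mean exceeds the untreated mean."""
    total, count = defaultdict(float), defaultdict(int)
    for w, a, y in train:
        total[(a, w)] += y
        count[(a, w)] += 1
    mean = {key: total[key] / count[key] for key in total}
    strata = {w for w, _, _ in train}
    return {w: int(mean[(1, w)] > mean[(0, w)]) for w in strata}


def ipw_value(rule, valid, g1):
    """IPW estimate of E0 Y_d for a fixed rule d, using the known treatment probability g1."""
    terms = []
    for w, a, y in valid:
        prob_following = g1 if rule[w] == 1 else 1.0 - g1
        terms.append((a == rule[w]) * y / prob_following)
    return sum(terms) / len(terms)


rules, fold_estimates = [], []
for v in range(V):
    train = [obs for obs, f in zip(data, folds) if f != v]
    valid = [obs for obs, f in zip(data, folds) if f == v]
    rule_v = learn_rule(train)            # v-specific data adaptive rule
    rules.append(rule_v)
    fold_estimates.append(ipw_value(rule_v, valid, G0))

# Estimate of the V-fold data adaptive target parameter: average over the splits.
psi_cv = sum(fold_estimates) / V
print(rules, psi_cv)
```

Because each rule is learned without touching its validation sample, the rule can be treated as fixed when evaluating it, which is what removes the need for conditions on the learner beyond convergence.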

We refer to van der Laan et al. (2013) for many other motivating examples demonstrating that statistical inference for data adaptive target parameters opens up a wealth of new scientific questions (that one would not know before looking at the data) and corresponding statistical inference, allowing for data mining to generate the parameters and hypotheses of interest (thereby also avoiding massive multiple testing adjustments).

## 6 Robust statistical inference: lack of scientific rationale to rely on parametric assumptions

As the authors point out, there is absolutely no reason to use the parametric extended g-computation formula method to estimate the desired mean outcome under the dynamic intervention depending on the natural value of treatment, especially since researchers also have access to more robust methods in the semi-parametric model literature. In particular, the authors present an inverse probability of treatment weighted type estimator of the desired extended g-computation formula estimand. In this section, we will discuss this point in more detail.

The identification results in causal inference aim to rely on minimal assumptions; in particular, these results typically avoid any statistical assumptions (i.e. restrictions on the probability distribution of the data). That is, many of the identifiability results correspond with nonparametric statistical models. All the hard work done in this part of the causal inference literature for the purpose of reliable inference about causal quantities goes to waste if one uses estimators that are biased due to relying on parametric assumptions that are known to be false. It is not scientifically sensible to be nonparametric for the sake of identification but parametric for the sake of estimation, given that parametric assumptions are made out of convenience. Yet that is exactly what we do when we use, for example, parametric model-based estimators to estimate the estimand defined by the extended g-computation formula, or IPTW estimators based on parametric models for the treatment mechanism. Using such a parametric model-based approach for causal inference makes it less relevant to worry about the causal assumptions, since one cannot even trust the estimator of the statistical estimand. This makes one wonder whether there is any theoretical scientific argument to use estimation procedures based on arbitrary parametric assumptions.

One argument might be that estimators based on parametric models can be shown to be asymptotically normally distributed. In other words, we have theorems that show that the confidence intervals have the desired coverage asymptotically under the assumption that these parametric assumptions are true. But what is the point of relying on a theorem whose assumptions are known to be false?

In addition, by not enforcing that a statistical model needs to be correctly specified (i.e. contain the true distribution), different statisticians often end up generating different statistically significant output, even when they are addressing identical statistical estimation problems and have equal access to all the statistical information about the data generating experiment. The problem here is that the choice of statistical model is viewed as an art instead of a choice driven by scientific knowledge, missing the fact that this choice heavily affects the choice of target estimand, the corresponding estimator, and its statistical properties. Some data analysts like to quote “all models are wrong, but some are useful” and use it as an argument that we should not worry too much about the model choice. The truth is that as long as the field of applied statistics is driven by arbitrary model choices, we do not satisfy common sense scientific standards.

Important advances have been made in empirical process theory and weak convergence theory (e.g. van der Vaart and Wellner, 1996), efficiency theory for semi-parametric models (e.g. Bickel et al., 1997), and general methods for construction of efficient estimators (e.g. Robins and Rotnitzky, 1992; van der Laan and Robins, 2003; van der Laan and Rose, 2012; Hernan and Robins, 2014), providing us with theorems establishing asymptotic consistency, normality, and efficiency of highly data adaptive estimators in large statistical models. Let us give a concrete example of such a theorem, concerning the estimation of a pathwise differentiable target parameter $\mathrm{\Psi }:\mathcal{M}\to ℝ$ with canonical gradient/efficient influence curve $\left(P,O\right)\to {D}^{\ast }\left(P\right)\left(O\right)$ at P. Given this $\mathrm{\Psi }$ and ${D}^{\ast }\left(P\right)$ one obtains, by definition of pathwise differentiability, that $\mathrm{\Psi }\left(P\right)-\mathrm{\Psi }\left({P}_{0}\right)=-{P}_{0}{D}^{\ast }\left(P\right)+{R}_{2}\left(P,{P}_{0}\right)$, where ${R}_{2}\left(P,{P}_{0}\right)$ is a second-order difference between P and ${P}_{0}$ that can be explicitly determined for each choice of target parameter $\mathrm{\Psi }$ and model $\mathcal{M}$ [see, for example, van der Laan (2012, 2014) for a detailed demonstration]. It is assumed that we select the statistical model $\mathcal{M}$ so that one feels confident that ${P}_{0}\in \mathcal{M}$.
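
As an illustration of how ${R}_{2}$ can be written out explicitly, consider a standard point-treatment example that is not taken from the article under discussion. Suppose $O=(W,A,Y)$ with binary $A$, the model $\mathcal{M}$ is nonparametric, and $\Psi(P)=E_{P}\bar{Q}(W)$ with $\bar{Q}(W)=E_{P}(Y\mid A=1,W)$ and $g(W)=P(A=1\mid W)$; all of this notation is introduced here for the example only. Assuming $g>0$, a direct calculation gives

$$D^{\ast}(P)(O)=\frac{A}{g(W)}\left(Y-\bar{Q}(W)\right)+\bar{Q}(W)-\Psi(P),$$

$$R_{2}(P,P_{0})=\Psi(P)-\Psi(P_{0})+P_{0}D^{\ast}(P)=P_{0}\left\{\frac{g-g_{0}}{g}\left(\bar{Q}-\bar{Q}_{0}\right)\right\}.$$

If $g\ge \delta >0$, the Cauchy-Schwarz inequality yields $|R_{2}(P,P_{0})|\le \delta^{-1}\,\|g-g_{0}\|_{P_{0}}\,\|\bar{Q}-\bar{Q}_{0}\|_{P_{0}}$, so in this example the remainder is a product of the errors in the two nuisance parameters, which is the sense in which it is second order.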

Consider a substitution estimator $\mathrm{\Psi }\left({P}_{n}^{\ast }\right)$, such as a TMLE based on an initial super-learner-based estimator ${P}_{n}^{0}$ (van der Laan and Dudoit, 2003; van der Vaart et al., 2006; van der Laan et al., 2006, 2007; Polley et al., 2012) that is then updated into a targeted estimator ${P}_{n}^{\ast }$. The estimator $\mathrm{\Psi }\left({P}_{n}^{\ast }\right)$ might also be a parametric g-computation estimator relying on a parametric model-based MLE ${P}_{n}^{\ast }$ of ${P}_{0}$. As shown in van der Laan and Rubin (2006) (and in many subsequent articles, including van der Laan and Rose, 2012), $\mathrm{\Psi }\left({P}_{n}^{\ast }\right)$ is asymptotically normally distributed and efficient for $\mathrm{\Psi }\left({P}_{0}\right)$ if

• (i) ${P}_{n}{D}^{\ast }\left({P}_{n}^{\ast }\right)={o}_{P}\left(1/\sqrt{n}\right)$,

• (ii) ${D}^{\ast }\left({P}_{n}^{\ast }\right)$ falls in a ${P}_{0}$-Donsker class with probability tending to 1,

• (iii) ${P}_{0}{\left({D}^{\ast }\left({P}_{n}^{\ast }\right)-{D}^{\ast }\left({P}_{0}\right)\right)}^{2}\to 0$ in probability, and

• (iv) ${R}_{2}\left({P}_{n}^{\ast },{P}_{0}\right)={o}_{P}\left(1/\sqrt{n}\right)$.

If one uses the TMLE, then condition (i) is automatically satisfied. Condition (ii) would be satisfied, for example, if ${D}^{\ast }\left({P}_{n}^{\ast }\right)$ falls in the class of multivariate real-valued functions with uniform sectional variation norm bounded by some $M<\mathrm{\infty }$ (Gill et al., 1995), a much less stringent assumption than requiring that ${P}_{n}^{\ast }$ is estimated in a parametric model. In addition, if one uses a CV-TMLE (Zheng and van der Laan, 2011; Rose and van der Laan, 2011), then condition (ii) can be removed. Let us consider such a CV-TMLE so that the only conditions for asymptotic efficiency are the weak asymptotic consistency condition (iii) and condition (iv). Clearly, condition (iv) is the condition to worry about (if (iv) holds one certainly expects (iii) to hold).
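
To make concrete why the targeting step delivers condition (i), the following minimal sketch carries out a TMLE for a simple point-treatment problem and returns the estimated efficient influence curve, whose empirical mean is then zero up to numerical error. Everything specific here is an assumption made for illustration and is not the authors' implementation: the data structure $O=(W,A,Y)$ with binary $Y$, the target $E_{0}E_{0}(Y\mid A=1,W)$, a deliberately crude constant initial estimator, a main-terms logistic fit of the treatment mechanism, and the helper names `logistic_fit`, `tmle_treated_mean`, and `simulate`.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def logistic_fit(X, y, offset=None, iters=50):
    """Newton-Raphson maximum likelihood for logistic regression with an offset."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(offset + X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def tmle_treated_mean(W, A, Y):
    n = len(Y)
    # Treatment mechanism g_n(W) = P(A=1 | W), bounded away from 0 and 1.
    Xg = np.column_stack([np.ones(n), W])
    g_n = np.clip(expit(Xg @ logistic_fit(Xg, A)), 0.025, 0.975)
    # Deliberately crude initial estimator of E(Y | A=1, W): a constant.
    Q0_1 = np.full(n, Y[A == 1].mean())
    # Targeting step: logistic fluctuation along the clever covariate H = A / g_n.
    H = A / g_n
    eps = logistic_fit(H[:, None], Y, offset=logit(Q0_1))[0]
    Q_star_1 = expit(logit(Q0_1) + eps / g_n)
    # Substitution estimator and estimated efficient influence curve.
    psi = Q_star_1.mean()
    eic = H * (Y - Q_star_1) + Q_star_1 - psi
    return psi, eic

def simulate(n, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=n)
    A = rng.binomial(1, expit(0.5 * W))
    Y = rng.binomial(1, expit(-0.5 + A + W))
    return W, A, Y
```

After `psi, eic = tmle_treated_mean(*simulate(5000))`, the quantity `eic.mean()` is the empirical mean of the efficient influence curve at the targeted fit, and the maximum likelihood step for `eps` forces it to zero. A standard error for `psi` could then be based on `eic.std() / np.sqrt(len(eic))`, provided the remaining conditions hold.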

If ${P}_{n}^{\ast }$ is based on a misspecified parametric model, then there is no hope that ${R}_{2}\left({P}_{n}^{\ast },{P}_{0}\right)$ will converge to zero, i.e. (iv) will not hold. To make this crucial condition as realistic as possible we have promoted the use of super-learning, a cross-validated ensemble learner which incorporates the state of the art of machine learning algorithms and possibly a large variety of parametric model-based estimators. The oracle inequality for the super-learner (see above references) teaches us that we make this condition more and more likely to hold by selecting a library of diverse estimators whose size may grow polynomially in sample size. That is, there is no trade-off that forces us to limit how data adaptive we are; on the contrary, we have to push the envelope and be as data adaptive as possible in order to ensure that ${R}_{2}\left({P}_{n}^{\ast },{P}_{0}\right)={o}_{P}\left(1/\sqrt{n}\right)$. In addition, under this condition (iv), the estimator is asymptotically efficient and thus also asymptotically regular, a nice by-product for reliable confidence intervals.
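
The cross-validation selector at the core of super-learning can be sketched as follows. This is a simplified illustration under stated assumptions, not the super-learner software: it uses squared-error loss, picks the single candidate with the smallest V-fold cross-validated risk (often called the discrete super-learner) rather than an optimal weighted combination, and the candidate library of polynomial regressions built by `make_poly_learner` is hypothetical.

```python
import numpy as np

def make_poly_learner(degree):
    """Hypothetical candidate: least-squares polynomial regression of given degree."""
    def fit(x, y):
        coef = np.polyfit(x, y, degree)
        return lambda x_new: np.polyval(coef, x_new)
    return fit

def cv_risks(library, x, y, V=10, seed=0):
    """V-fold cross-validated squared-error risk of every candidate."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V
    risks = {}
    for name, fit in library.items():
        sq_err = np.empty(len(y))
        for v in range(V):
            val = folds == v
            predict = fit(x[~val], y[~val])            # train on the training sample
            sq_err[val] = (y[val] - predict(x[val])) ** 2  # evaluate on the validation sample
        risks[name] = sq_err.mean()
    return risks

def discrete_super_learner(library, x, y, V=10, seed=0):
    """Select the candidate with the smallest cross-validated risk, refit on all data."""
    risks = cv_risks(library, x, y, V, seed)
    best = min(risks, key=risks.get)
    return best, library[best](x, y), risks
```

A library would be passed as a dictionary, for example `{f"poly{k}": make_poly_learner(k) for k in (0, 1, 2)}`. Enlarging it with more diverse candidates only adds entries to this dictionary, which reflects the point made above that a richer library is the way to make condition (iv) more plausible.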

In order to move our field forward, we need to fully acknowledge these issues and start defining the estimation problem truthfully. In our work, we defined the field Targeted Learning as the subfield of statistics that is concerned with theory, estimation, and statistical inference (i.e. confidence intervals) for target parameters (representing the answer to the actual scientific question of interest) in realistic statistical models (i.e. incorporating actual knowledge). By necessity, Targeted Learning requires integrating the state of the art in data adaptive estimation, beyond incorporation of subject-matter driven estimators. Given these estimators, Targeted Learning requires targeting the estimation procedure toward the target parameter of interest. Targeted minimum loss-based estimation (and its variants such as CV-TMLE, C-TMLE), combined with Super-Learning, provides a general template to construct such targeted substitution estimators (van der Laan and Rubin, 2006; van der Laan and Rose, 2012).

An example of this methodology, relevant to the paper under discussion, is the longitudinal TMLE of summary measures of the mean outcome under dynamic interventions (such as defined by working MSM) in Gruber and van der Laan (2012) and Petersen et al. (2013). The TMLE for this problem is inspired by important double robust targeted estimators established in earlier work of Bang and Robins (2005). This TMLE is implemented in the R package ltmle and fully utilizes the important sequential regression representation presented in Bang and Robins (2005). This TMLE is easily extended to TMLE of summary measures of mean outcomes under stochastic interventions. The extended g-computation formula corresponds with an estimated stochastic intervention, so that the statistical inference will now also need to take into account that the stochastic intervention was estimated. On the other hand, if we go after the mean outcome under a data adaptive fit of the desired stochastic intervention, then the statistical inference is identical to treating the fitted stochastic intervention as known. In this manner, by extending the current ltmle R package to stochastic (and possibly unknown) interventions (instead of only dynamic interventions), this method would become accessible to many practitioners, thereby allowing data analysts to significantly improve on the parametric extended g-computation formula approach and IPTW estimators relying on parametric models for the treatment mechanism.
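
For readers unfamiliar with the sequential regression representation, it can be stated as follows. The notation is assumed here for illustration and is not fixed by the discussion above: $O=(L_{0},A_{0},\dots ,L_{K},A_{K},Y)$, $\bar{L}_{k}=(L_{0},\dots ,L_{k})$, and $d=(d_{0},\dots ,d_{K})$ is a dynamic rule whose kth component assigns treatment based on $\bar{L}_{k}$. The g-computation estimand for the mean outcome under $d$ is then obtained by iterating conditional expectations backward in time:

$$\bar{Q}_{K+1}=Y,\qquad \bar{Q}_{k}(\bar{L}_{k})=E_{0}\left(\bar{Q}_{k+1}\mid \bar{L}_{k},\ \bar{A}_{k}=\bar{d}_{k}(\bar{L}_{k})\right),\quad k=K,\dots ,0,\qquad \Psi(P_{0})=E_{0}\bar{Q}_{0}(L_{0}).$$

Each step is a regression of the previous conditional mean on the observed past among subjects who followed the rule through time k, so the estimand can be estimated with a sequence of regressions instead of a full model for the conditional densities of the $L_{k}$.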

We again commend the excellent work of the authors. The field needs more important observations such as this one, which allow the straightforward application of previously described identifiability results and robust estimators to new problems. We further advocate the consideration of newly developed data adaptive target parameters, which often similarly allow for the application of existing estimators to interesting new problems.

## Acknowledgments

This research was supported by NIH grant R01 AI074345-06. AL was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

## References

• Bang, H., and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972.

• Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. (1997). Efficient and Adaptive Estimation for Semiparametric Models. New York: Springer-Verlag.

• Díaz, I., and van der Laan, M. (2012). Population intervention causal effects based on stochastic interventions. Biometrics, 68(2):541–549. ISSN 1541–0420. URL http://dx.doi.org/10.1111/j.1541-0420.2011.01685.x

• Gill, R. D., van der Laan, M. J., and Wellner, J. A. (1995). Inefficient estimators of the bivariate survival function for three models. Annales de l’Institut Henri Poincaré, 31:545–597.

• Gruber, S., and van der Laan, M. J. (2012). Targeted minimum loss based estimator that outperforms a given estimator. The International Journal of Biostatistics, 8(1):Article 11.

• Hernan, M., and Robins, J. M. (2014). Causal Inference. Progress edition. London: Chapman & Hall.

• Hubbard, A. E., Jewell, N. P., and van der Laan, M. J. (2011). Direct effects and effect among the treated, Chapter 8. In: Targeted Learning, Springer Series in Statistics, 133–145. New York: Springer. ISBN 978-1-4419-9781-4.

• Lendle, S. D., Subbaraman, M. S., and van der Laan, M. J. (2013). Identification and efficient estimation of the natural direct effect among the untreated. Biometrics, 69:301–317.

• Luedtke, A. R., and van der Laan, M. J. (2014). Super learning of an optimal dynamic treatment rule. Technical Report 326, UC Berkeley, 2014. http://biostats.bepress.com/ucbbiostat/paper326, revised for publication in Journal of Causal Inference.

• Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd Edition. New York: Cambridge University Press.

• Petersen, M., Schwab, J., Gruber, S., Blaser, N., Schomaker, M., and van der Laan, M. J. (2013). Targeted minimum loss based estimation of marginal structural working models. Journal of Causal Inference, to appear. Technical report. http://biostats.bepress.com/ucbbiostat/paper312/

• Petersen, M. L., Sinisi, S. E., and van der Laan, M. J. (2006). Estimation of direct causal effects. Epidemiology, 17(3):276–284.

• Petersen, M. L., and van der Laan, M. J. (2014). Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiology, 25(3):418–426.

• Polley, E. C., Rose, S., and van der Laan, M. J. (2012). Super learning. In: Targeted Learning: Causal Inference for Observational and Experimental Data, M.J. van der Laan and S. Rose (Eds.), 43–66. New York, Dordrecht, Heidelberg, and London: Springer.

• Richardson, T. S., and Robins, J. M. (2013). Single world intervention graphs (SWIGs): a unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper, 128.

• Robins, J. M., Hernán, M. A., and Siebert, U. (2004). Effects of multiple interventions. In: Comparative Quantification of Health Risks: Global and Regional Burden of Disease Attributable to Selected Major Risk Factors, M. Ezzati, A. D. Lopez, A. Rodgers, and C. J. L. Murray (Eds.), 2191–2230. Geneva: World Health Organization.

• Robins, J. M., Orellana, L., and Rotnitzky, A. (2008). Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine, 27:4678–4721.

• Robins, J. M., and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology, Methodological Issues, Nicholas P. Jewell, Klaus Dietz, and Vernon T. Farewell (Eds.). Basel, Switzerland: Birkhäuser.

• Robins, J., and Rotnitzky, A. (2014). Discussion of dynamic treatment regimes: technical challenges and applications. Electronic Journal of Statistics, 8(1):1273–1289. http://dx.doi.org/10.1214/14-EJS908

• Rose, S., and van der Laan, M. J. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer.

• van der Laan, M. J. (2012). Statistical inference when using data adaptive estimators of nuisance parameters. Technical Report 302, Division of Biostatistics, University of California, Berkeley, CA.

• van der Laan, M. J. (2013). Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome. Technical Report 317, UC Berkeley. http://biostats.bepress.com/ucbbiostat/paper317

• van der Laan, M. J. (2014). Targeted estimation of nuisance parameters to obtain valid statistical inference. International Journal of Biostatistics, pii:/j/ijb.ahead-of-print/ijb-2012-0038/ijb-2012-0038.xml. doi: 10.1515/ijb-2012-0038.

• van der Laan, M. J., and Dudoit, S. (2003). Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, Berkeley, November 2003.

• van der Laan, M. J., Dudoit, S., and van der Vaart, A. W. (2006a). The cross-validated adaptive epsilon-net estimator. Statistics and Decisions, 24(3):373–395.

• van der Laan, M. J., Hubbard, A. E., and Kherad, S. (2013). Statistical inference for data adaptive target parameters. Technical Report 314, UC Berkeley, 2013. http://biostats.bepress.com/ucbbiostat/paper314, revised for publication in Biometrics

• van der Laan, M. J., and Luedtke, A. R. (2014a). Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. Technical Report 329, UC Berkeley.

• van der Laan, M. J., and Luedtke, A. R. (2014b). Targeted learning of the mean outcome under an optimal dynamic treatment rule. Technical Report 325, UC Berkeley, http://biostats.bepress.com/ucbbiostat/paper325, revised for publication in Journal of Causal Inference.

• van der Laan, M. J., and Petersen, M. L. (2007). Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics, 3(1):Article 3.

• van der Laan, M. J., Polley, E., and Hubbard, A. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):Article 25. ISSN 1544–6115.

• van der Laan, M. J., and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. New York: Springer.

• van der Laan, M. J., and Rose, S. (2012). Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer.

• van der Laan, M. J., and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1):Article 11.

• van der Vaart, A. W., Dudoit, S., and van der Laan, M. J. (2006). Oracle inequalities for multi-fold cross-validation. Statistics and Decisions, 24(3):351–371.

• van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag.

• Zheng, W., and van der Laan, M. J. (2011). Cross-validated targeted minimum loss based estimation. In: Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose (Eds.), 459–474. New York: Springer.

• Zheng, W., and van der Laan, M. J. (2012). Causal mediation in a survival setting with time-dependent mediators. Technical Report 295, UC Berkeley, 2012. http://biostats.bepress.com/ucbbiostat/paper295

## About the article

Published Online: 2014-11-07

Published in Print: 2014-12-01

Research funding: National Institute of Allergy and Infectious Diseases (Grant/Award Number: “R01 AI074345-06”).

Citation Information: Epidemiologic Methods, Volume 3, Issue 1, Pages 21–31, ISSN (Online) 2161-962X, ISSN (Print) 2194-9263,

©2014 by De Gruyter.