
# Journal of Causal Inference

Ed. by Imai, Kosuke / Pearl, Judea / Petersen, Maya Liv / Sekhon, Jasjeet / van der Laan, Mark J.

2 Issues per year

Online ISSN: 2193-3685
Volume 3, Issue 1

# Targeted Learning of the Mean Outcome under an Optimal Dynamic Treatment Rule

Mark J. van der Laan
/ Alexander R. Luedtke
Published Online: 2014-10-14 | DOI: https://doi.org/10.1515/jci-2013-0022

## Abstract

We consider estimation of and inference for the mean outcome under the optimal dynamic two time-point treatment rule defined as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to depend only on a user-supplied subset of the baseline and intermediate covariates. This estimation problem is addressed in a statistical model for the data distribution that is nonparametric beyond possible knowledge about the treatment and censoring mechanism. This contrasts with the current literature, which relies on parametric assumptions. We establish that the mean of the counterfactual outcome under the optimal dynamic treatment is a pathwise differentiable parameter under conditions, and develop a targeted minimum loss-based estimator (TMLE) of this target parameter. We establish asymptotic linearity and statistical inference for this estimator under specified conditions. In a sequentially randomized trial the statistical inference relies on the assumption that a second-order difference between the estimator of the optimal dynamic treatment and the optimal dynamic treatment is asymptotically negligible, which may be a problematic condition when the rule is based on multivariate time-dependent covariates. To avoid this condition, we also develop TMLEs and statistical inference for data adaptive target parameters that are defined in terms of the mean outcome under the estimate of the optimal dynamic treatment. In particular, we develop a novel cross-validated TMLE approach that provides asymptotic inference under minimal conditions, avoiding the need for any empirical process conditions. We offer simulation results to support our theoretical findings.

## 1 Introduction

Suppose we observe n independent and identically distributed observations of a time-dependent random variable consisting of baseline covariates, initial treatment and censoring indicator, intermediate covariates, subsequent treatment and censoring indicator, and a final outcome. For example, this could be data generated by a sequentially randomized controlled trial (RCT) in which one follows up a group of subjects, and treatment assignment at two time points is sequentially randomized, where the probability of receiving treatment might be determined by a baseline covariate for the first-line treatment, and by a time-dependent intermediate covariate (such as a biomarker of interest) for the second-line treatment [1]. Such trials are often called sequential multiple assignment randomized trials (SMART). A dynamic treatment rule deterministically assigns treatment as a function of the available history. If treatment is assigned at two time points, then this dynamic treatment rule consists of two rules, one for each time point [14]. The mean outcome under a dynamic treatment is a counterfactual quantity of interest representing what the mean outcome would have been if everyone had received treatment according to the dynamic treatment rule [5–11]. Dynamic treatments represent prespecified multiple time-point interventions that at each treatment-decision stage are allowed to respond to the currently available treatment and covariate history. Examples of multiple time-point dynamic treatment regimes are given in Lavori and Dawson [12, 13]; Murphy [14]; Rosthøj et al. [15]; Thall et al. [16, 17]; Wagner et al. [18]; Petersen et al. [19]; van der Laan and Petersen [20]; and Robins et al. [21], ranging from rules that change the dose of a drug, change or augment the treatment, to making a decision on when to start a new treatment, in response to the history of the subject.

More recently, SMART designs have been implemented in practice: Lavori and Dawson [12, 22]; Murphy [14]; Thall et al. [16]; Chakraborty et al. [23]; Kasari [24]; Lei et al. [25]; Nahum-Shani et al. [26, 27]; Jones [28]. For an extensive list of SMARTs, we refer the reader to the website http://methodology.psu.edu/ra/adap-inter/projects. For an excellent and recent overview of the literature on dynamic treatments we refer to Chakraborty and Murphy [29].

We define the optimal dynamic multiple time-point treatment regime as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to only respond to a user-supplied subset of the baseline and intermediate covariates. The literature on Q-learning shows that the optimal dynamic treatment among all dynamic treatments can be described in a sequential manner [14, 30–33]. The optimal rule can be learned through fitting the likelihood and then calculating the optimal rule under this fit of the likelihood. This approach can be implemented with maximum likelihood estimation based on parametric models. It has been noted (e.g., Robins [32], Chakraborty and Murphy [29]) that, when using parametric regression models, the estimator of the parameters of each regression after the first is a non-smooth function of the estimator of the parameters of the previous regression, and that this results in non-regularity of the estimators of the parameter vector. This raises challenges for obtaining statistical inference, even when assuming that these parametric regression models are correctly specified. Chakraborty and Murphy [29] discuss various approaches and advances that aim to resolve this delicate issue, such as inverting hypothesis tests [32], establishing non-normal limit distributions of the estimators (E. Laber, D. Lizotte, M. Qian, S. Murphy, submitted), or using the m out of n bootstrap.

Murphy [30] and Robins [31, 32] developed structural nested mean models tailored to optimal dynamic treatments. These models assume a parametric model for the “blip function,” defined as the additive effect of a blip in current treatment on a counterfactual outcome, conditional on the observed past, in the counterfactual world in which future treatment is assigned optimally. Statistical inference for the parameters of the blip function proceeds accordingly, but Robins [32] points out the irregularity of the estimator, which results in the serious challenges for statistical inference referenced above. Structural nested mean models have also been generalized to blip functions that condition on a (counterfactual) subset of the past, thereby allowing the learning of optimal rules that are restricted to using only this subset of the past (Robins [32]; Section 6.5 of van der Laan and Robins [34]).

An alternative approach, referenced as the direct approach in Chakraborty and Murphy [29], uses marginal structural models (MSMs) for the dynamic regime-specific mean outcome for a user-supplied class of dynamic treatments. If one assumes the marginal structural models are correctly specified, then the parameters of the marginal structural model map into a dynamic treatment that is optimal among the user-supplied class of dynamic regimes. In addition, the MSM also provides the complete dose–response curve, that is, the mean counterfactual outcome for each dynamic treatment in the user-supplied class. This generalization of the original marginal structural models for static interventions to MSMs for dynamic treatments was developed independently by Orellana et al. [35] and van der Laan and Petersen [20]. These articles present inverse probability of treatment and censoring weighted (IPCW) estimators and double robust augmented IPCW estimators based on general longitudinal data structures, allowing for right censoring, time-dependent covariates, and survival outcomes. Double robust estimating equation-based methods that estimate the nuisance parameters with sequential parametric regression models using clever covariates were developed for static intervention MSMs by Bang and Robins [36]. An analogous targeted minimum loss-based estimator (TMLE) [37–39] was developed for marginal structural models for a user-supplied class of dynamic treatments by Petersen et al. [40]. This estimator builds on the TMLE for the mean outcome for a single dynamic treatment developed by van der Laan and Gruber [41]. Additional application papers of interest are [42–44], which involve fitting MSMs for dynamic treatments defined by treatment-tailoring thresholds using IPCW methods.

Each of the above referenced approaches for learning an optimal dynamic treatment that also aims to provide statistical inference relies on parametric assumptions: obviously, Q-learning based on parametric models, but also the structural nested mean models and the marginal structural models, which rely on parametric models for the blip function and the dose–response curve, respectively. As a consequence, even in a SMART, statistical inference for the optimal dynamic treatment heavily relies on assumptions that are generally believed to be false, and the resulting estimators can thus be expected to be biased.

To avoid such biases, we define the statistical model for the data distribution as nonparametric, beyond possible knowledge about the treatment mechanism (e.g., known in an RCT) and the censoring mechanism. This forces us to define the optimal dynamic treatment and the corresponding mean outcome as parameters on this nonparametric model, and to develop data adaptive estimators of the optimal dynamic treatment. In order not to consider only the most ambitious fully optimal rule, we define the V-optimal rule as the optimal rule among those that respond only to a user-supplied subset V of the available covariates. This allows us to consider suboptimal rules that are easier to estimate, and thereby to obtain statistical inference for the counterfactual mean outcome under such a rule. This is analogous to the generalized structural nested mean models whose blip functions only condition on a counterfactual subset of the past. In a companion article we describe how to estimate the V-optimal rule.

In Example 4 of Robins et al. [45], the authors develop an asymptotic confidence set for the optimal treatment regime in an RCT under a large semiparametric model that only assumes that the treatment mechanism is known. This confidence set is certainly of interest and warrants further consideration in the optimal treatment literature. They obtain this confidence set by deriving the efficient influence curve for the mean squared blip function. They propose selecting a data adaptive estimate of the optimal treatment rule by a particular cross-validation scheme over a set of basis functions, and show that this estimator achieves a data adaptive rate of convergence under smoothness assumptions on the blip function. Our work is distinct from this earlier work in that the earlier work does not directly consider the mean outcome under the optimal rule and only considers data generated by a point treatment RCT.

In this article we describe how to obtain semiparametric inference about the mean outcome under the two time-point V-optimal rule. We will show that the mean outcome under the optimal rule is a pathwise differentiable parameter of the data distribution, indicating that it is possible to develop asymptotically linear estimators of this target parameter under conditions. In fact, we obtain the surprising result that the pathwise derivative of this target parameter equals the pathwise derivative of the mean counterfactual outcome under a given dynamic treatment rule set at the optimal rule, treating the latter as known. By appealing to the current literature on double robust and efficient estimation of the mean outcome under a given rule, we then obtain a TMLE for the mean outcome under the optimal rule. Subsequently, we prove asymptotic linearity and efficiency of this TMLE, allowing us to construct confidence intervals for the mean outcome under the optimal dynamic treatment or its contrast with respect to a standard treatment. Thus, contrary to the irregularity of the estimators of the unknown parameters in the semiparametric structural nested mean model, we can construct regular estimators of the mean outcome under the optimal rule in the nonparametric model.

In a SMART, the statistical inference would rely only on the assumption that a second-order difference between the estimator of the optimal dynamic treatment and the optimal dynamic treatment itself is asymptotically negligible. This is a reasonable condition if we restrict ourselves to rules responding to a one-dimensional time-dependent covariate, or if we are willing to make smoothness assumptions. To avoid this condition, we also develop TMLEs and statistical inference for data adaptive target parameters that are defined in terms of the mean outcome under the estimate of the optimal dynamic treatment (see van der Laan et al. [46] for a general approach to statistical inference for data adaptive target parameters). In particular, we develop a novel cross-validated TMLE (CV-TMLE) approach that provides asymptotic inference under minimal conditions.

For the sake of presentation, we focus on two time-point treatments in this article. In the appendices of our earlier technical reports [47, 48] we generalize these results to general multiple time-point treatments, and develop general (sequential) super-learning based on the efficient CV-TMLE of the risk of a candidate estimator. In those appendices we also develop a TMLE of a projection of the blip functions on a parametric working model (with corresponding statistical inference, a result of interest in its own right). We emphasize that these technical reports are distinct from our companion paper in this issue, which focuses on the data adaptive estimation of optimal treatment strategies.

## 1.1 Organization of article

Section 2 defines the mean outcome under the optimal rule as a causal parameter and gives identifiability assumptions under which the causal parameter is identified with a statistical parameter of the observed data distribution.

The remainder of the paper describes strategies to estimate the counterfactual mean outcome under the optimal rule and related quantities. This paper assumes that we have an estimate of the optimal rule in our semiparametric model. In our companion paper we describe how to obtain estimates of the V-optimal rule.

The first part of this article concerns estimation of the mean outcome under the optimal rule. Section 3 establishes the pathwise differentiability of the mean outcome under the V-optimal rule under stated conditions. A closed form expression for the efficient influence curve of this statistical parameter is given, which represents a key ingredient for semiparametric inference about the statistical target parameter. We obtain the surprising result that, under straightforward conditions, estimating the mean outcome under the unknown optimal treatment rule is, to first order, the same as estimating the mean outcome under the optimal rule when the rule is known from the outset. Section 4 presents the key properties of a TMLE for the mean outcome under the optimal rule, which is presented in detail in “TMLE of the mean outcome under a given rule” in Appendix B due to its similarity to TMLEs presented previously in the literature. Section 5 presents an asymptotic linearity theorem for this TMLE and corresponding statistical inference.

The second part of this article concerns statistical inference for data adaptive target parameters that are defined in terms of the mean outcome under the estimate of the optimal dynamic treatment, thereby avoiding the consistency and rate condition for the fitted V-optimal rule as required for asymptotic linearity of the TMLE of the mean outcome under the actual V-optimal rule. These results are of interest in practice because an estimated, possibly suboptimal, rule will be implemented in the population, not some unknown optimal rule. Section 6 presents an asymptotic linearity theorem for the TMLE presented in Section 4, but now with the target parameter defined as the mean outcome under the estimated rule. In Section 7 we present the CV-TMLE framework. A specific CV-TMLE algorithm is described in “CV-TMLE of the mean outcome under data adaptive V-optimal rule” in Appendix B due to its similarity to CV-TMLEs presented previously in the literature. The CV-TMLE provides asymptotic inference under minimal conditions for the mean outcome under a dynamic treatment fitted on a training sample, averaged across the different splits in training sample and validation sample. Both results allow us to construct confidence intervals that have the correct asymptotic coverage of the random true target parameter, and the fixed mean outcome under the optimal rule under conditions, but statistical inference based on the CV-TMLE does not require an empirical process condition that would put a brake on the allowed data adaptivity of the estimator.

Section 8 presents the simulation methods. The simulations estimate the optimal rule using an ensemble algorithm presented in our companion paper, and then given this estimate apply the estimators of the optimal rule presented in this paper. Section 9 presents the coverage and efficiency of the various estimators in our simulation. Appendix C gives analytic intuition as to why some of the simulation results may have occurred. Section 10 closes with a discussion and directions for future work.

All proofs can be found in Appendix A.

## 2 Formulation of optimal dynamic treatment estimation problem

Suppose we observe $n$ i.i.d. copies $O_1, \dots, O_n \in \mathcal{O}$ of $$O = (L(0), A(0), L(1), A(1), Y) \sim P_0,$$ where $A(j) = (A_1(j), A_2(j))$, $A_1(j)$ is a binary treatment, and $A_2(j)$ is an indicator of not being right censored at “time” $j$, $j = 0, 1$. That is, $A_2(0) = 0$ implies that $(L(1), A_1(1), Y)$ is not observed, and $A_2(1) = 0$ implies that $Y$ is not observed. Each time point $j$ has covariates $L(j)$ that precede treatment, $j = 0, 1$, and the outcome of interest $Y$ occurs after time point 1. For a time-dependent process $X(\cdot)$, we use the notation $\bar{X}(t) = (X(s) : s \le t)$, where $\bar{X}(-1) = \varnothing$. Let $\mathcal{M}$ be a statistical model that makes no assumptions on the marginal distribution $Q_{0,L(0)}$ of $L(0)$ and the conditional distribution $Q_{0,L(1)}$ of $L(1)$, given $A(0), L(0)$, but might make assumptions on the conditional distributions $g_{0A(j)}$ of $A(j)$, given $\bar{A}(j-1), \bar{L}(j)$, $j = 0, 1$. We will refer to $g_0$ as the intervention mechanism, which can be factorized into a treatment mechanism $g_{01}$ and a censoring mechanism $g_{02}$ as follows: $$g_0(O) = \prod_{j=0}^{1} g_{01}\big(A_1(j) \mid \bar{A}(j-1), \bar{L}(j)\big) \, g_{02}\big(A_2(j) \mid A_1(j), \bar{A}(j-1), \bar{L}(j)\big).$$ In particular, the data might have been generated by a SMART, in which case $g_{01}$ is known.

Let $V\left(1\right)$ be a function of $\left(L\left(0\right),A\left(0\right),L\left(1\right)\right)$, and let $V\left(0\right)$ be a function of $L\left(0\right)$. Let $V=\left(V\left(0\right),V\left(1\right)\right)$. Consider dynamic treatment rules $V\left(0\right)\to {d}_{A\left(0\right)}\left(V\left(0\right)\right)\in \left\{0,1\right\}×\left\{1\right\}$ and $\left(A\left(0\right),V\left(1\right)\right)\to {d}_{A\left(1\right)}\left(A\left(0\right),$ $V\left(1\right)\right)\in \left\{0,1\right\}×\left\{1\right\}$ for assigning treatment $A\left(0\right)$ and $A\left(1\right)$, respectively, where the rule for $A\left(0\right)$ is only a function of $V\left(0\right)$, and the rule for $A\left(1\right)$ is only a function of $\left(A\left(0\right),V\left(1\right)\right)$. Note that these rules are restricted to set the censoring indicators ${A}_{2}\left(j\right)=1$, $j=0,1$. Let $\mathcal{D}$ be the set of all such rules. We assume that $V\left(0\right)$ is a function of $V\left(1\right)$ (i.e., observing $V\left(1\right)$ includes observing $V\left(0\right)$), but in the theorem below we indicate an alternative assumption. 
For $d \in \mathcal{D}$, we let $d(a(0), v) \equiv \big(d_{A(0)}(v(0)), d_{A(1)}(a(0), v(1))\big)$. If we assume a structural equation model [7] stating that $$\begin{aligned} L(0) &= f_{L(0)}(U_{L(0)}) \\ A(0) &= f_{A(0)}(L(0), U_{A(0)}) \\ L(1) &= f_{L(1)}(L(0), A(0), U_{L(1)}) \\ A(1) &= f_{A(1)}(\bar{L}(1), A(0), U_{A(1)}) \\ Y &= f_Y(\bar{L}(1), \bar{A}(1), U_Y), \end{aligned}$$ where the collection of functions $f = (f_{L(0)}, f_{A(0)}, f_{L(1)}, f_{A(1)}, f_Y)$ is unspecified or partially specified, we can define counterfactuals $Y_d$ by the modified system in which the equations for $A(0)$ and $A(1)$ are replaced by $A(0) = d_{A(0)}(V(0))$ and $A(1) = d_{A(1)}(A(0), V(1))$, respectively. Denote the distribution of these counterfactual quantities by $P_{0,d}$, noting that $P_{0,d}$ is implied by the collection of functions $f$ and the joint distribution of the exogenous variables $(U_{L(0)}, U_{A(0)}, U_{L(1)}, U_{A(1)}, U_Y)$. We can now define the causally optimal rule as $d_0^{\ast} = \arg\max_{d \in \mathcal{D}} E_{P_{0,d}} Y_d$.
If we assume the sequential randomization assumption, namely that $A(0)$ is independent of $(U_{L(1)}, U_Y)$, given $L(0)$, and $A(1)$ is independent of $U_Y$, given $\bar{L}(1), A(0)$, then we can identify $P_{0,d}$ from the observed data distribution $P_0$ using the G-computation formula: $$p_{0,d}\big(L(0), A(0), L(1), A(1), Y\big) \equiv I\big(A = d(A(0), V)\big) \, q_{0,L(0)}(L(0)) \, q_{0,L(1)}\big(L(1) \mid L(0), A(0)\big) \, q_{0,Y}\big(Y \mid \bar{L}(1), \bar{A}(1)\big), \tag{1}$$ where $p_{0,d}$ is the density of $P_{0,d}$ and $q_{0,L(0)}$, $q_{0,L(1)}$, and $q_{0,Y}$ are the densities of $Q_{0,L(0)}$, $Q_{0,L(1)}$, and $Q_{0,Y}$, respectively, with $Q_{0,Y}$ the conditional distribution of $Y$ given $\bar{L}(1), \bar{A}(1)$. We assume that all densities above are absolutely continuous with respect to some dominating measure $\mu$. A similar identifiability result/G-computation formula holds under the Neyman–Rubin causal model [8]. For the right-censoring indicators $A_2(0)$ and $A_2(1)$, we note the parallel between the coarsening at random assumption and the sequential randomization assumption [49]. Thus our missingness assumptions are here encoded in our causal assumptions.
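As an illustration, the mean outcome under a rule implied by the G-computation formula can be approximated by Monte Carlo: draw each L-factor from its conditional distribution and set both treatment nodes deterministically by the rule. The Python sketch below uses a made-up data-generating distribution and a made-up rule (`draw_L0`, `draw_L1`, `draw_Y`, `d_A0`, `d_A1` are all hypothetical choices, not from the paper); only the sampling structure mirrors eq. (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional distributions q_{L(0)}, q_{L(1)}, q_Y,
# made up purely for illustration.
def draw_L0(n):
    return rng.normal(size=n)

def draw_L1(L0, A0):
    return 0.5 * L0 + A0 + rng.normal(size=L0.shape[0])

def draw_Y(L0, A0, L1, A1):
    logit = 0.3 * L0 + 0.4 * A0 + 0.5 * L1 + 0.6 * A1
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# A candidate rule d = (d_A0, d_A1); the censoring components A_2(j) are
# implicitly set to 1. Here V(0) = L(0) and V(1) = L(1).
def d_A0(V0):
    return (V0 > 0).astype(int)

def d_A1(A0, V1):
    return (V1 > 0.5).astype(int)

def g_comp_mean(n=200_000):
    """Monte Carlo evaluation of E_{P_d} Y_d: sample the L-factors from
    their conditionals and set both treatment nodes by the rule d."""
    L0 = draw_L0(n)
    A0 = d_A0(L0)
    L1 = draw_L1(L0, A0)
    A1 = d_A1(A0, L1)
    return draw_Y(L0, A0, L1, A1).mean()

print(round(g_comp_mean(), 3))
```

Because both treatment nodes are set by $d$, censoring never occurs in the sampled data, matching the restriction that the rules set $A_2(j) = 1$.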

More generally, for a distribution $P \in \mathcal{M}$ we can define the G-computation distribution $P_d$ as the distribution with density $$p_d\big(L(0), A(0), L(1), A(1), Y\big) \equiv I\big(A = d(A(0), V)\big) \, q_{L(0)}(L(0)) \, q_{L(1)}\big(L(1) \mid L(0), A(0)\big) \, q_Y\big(Y \mid \bar{L}(1), \bar{A}(1)\big),$$ where $q_{L(0)}$, $q_{L(1)}$, and $q_Y$ are the counterparts of $q_{0,L(0)}$, $q_{0,L(1)}$, and $q_{0,Y}$, respectively, under $P$.

For the remainder of this article, if for a static or dynamic intervention $d$ we use the notation $L_d$ (or $Y_d$, $O_d$), we mean the random variable with probability distribution $P_d$ as in (1), so that all of our quantities are statistical parameters. For example, the quantity $E_{P_0}(Y_{a(0)a(1)} \mid V_{a(0)}(1))$ defined in the next theorem denotes the conditional expectation of $Y_{a(0)a(1)}$, given $V_{a(0)}(1)$, under the probability distribution $P_{0,a(0)a(1)}$ (i.e., the G-computation formula presented above for the static intervention $(a(0), a(1))$). In addition, whenever we write down these parameters for some $P_d$, we implicitly assume the positivity assumption at $P$ required for the G-computation formula to be well defined. For this it suffices to assume the following positivity assumption at $P$: $$\Pr_P\Big(0 < \min_{a_1 \in \{0,1\}} g_{A(0)}\big(a_1, 1 \mid L(0)\big)\Big) = 1, \qquad \Pr_P\Big(0 < \min_{a_1 \in \{0,1\}} g_{A(1)}\big(a_1, 1 \mid \bar{L}(1), A(0)\big)\Big) = 1. \tag{2}$$ The strong positivity assumption is defined as the above assumption with the 0 replaced by a $\delta > 0$.
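The strong positivity assumption lends itself to a simple empirical diagnostic: verify that every estimated intervention probability exceeds the chosen $\delta$. A minimal sketch, where `check_strong_positivity` and the example probabilities are hypothetical, not from the paper:

```python
import numpy as np

def check_strong_positivity(g_hat, delta):
    """Return (ok, min_prob): ok is True when every estimated intervention
    probability exceeds delta, i.e. the strong positivity assumption
    (assumption (2) with 0 replaced by delta) holds empirically."""
    g_hat = np.asarray(g_hat, dtype=float)
    return bool(np.all(g_hat > delta)), float(g_hat.min())

# Hypothetical fitted probabilities of (treatment, uncensored) across stages.
ok, g_min = check_strong_positivity([0.45, 0.55, 0.30, 0.70], delta=0.05)
print(ok, g_min)  # True 0.3
```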

We now define a statistical parameter representing the mean of $Y_d$ under $P_d$. For any rule $d \in \mathcal{D}$, let $\Psi_d(P) \equiv E_{P_d} Y_d$. For a distribution $P$, define the V-optimal rule as $$d_P = \arg\max_{d \in \mathcal{D}} E_{P_d} Y_d.$$ For simplicity, we will write $d_0$ instead of $d_{P_0}$ for the V-optimal rule under $P_0$. Define the parameter mapping $\Psi : \mathcal{M} \to \mathbb{R}$ as $\Psi(P) = E_{P_{d_P}} Y_{d_P}$. The first part of this article is concerned with inference for the parameter $$\psi_0 \equiv \Psi(P_0) = E_{P_{0,d_0}} Y_{d_0}.$$ Under our identifiability assumptions, $d_0$ is equal to the causally optimal rule $d_0^{\ast}$. Even if the sequential randomization assumption does not hold, the statistical parameter $\psi_0$ represents a statistical parameter of interest in its own right. We will not concern ourselves with the sequential randomization assumption for the remainder of this paper.

The next theorem presents an explicit form of the V-optimal individualized treatment rule ${d}_{0}$ as a function of ${P}_{0}$.

Theorem 1. Suppose $V\left(0\right)$ is a function of $V\left(1\right)$. The V-optimal rule ${d}_{0}$ can be represented as the following explicit parameter of ${P}_{0}$:

$$\begin{aligned} \bar{Q}_{20}(a(0), v(1)) &= E_{P_0}\big(Y_{a(0), A(1)=(1,1)} \mid V_{a(0)}(1) = v(1)\big) - E_{P_0}\big(Y_{a(0), A(1)=(0,1)} \mid V_{a(0)}(1) = v(1)\big) \\ d_{0,A(1)}(A(0), V(1)) &= \big(I(\bar{Q}_{20}(A(0), V(1)) > 0), 1\big) \\ \bar{Q}_{10}(v(0)) &= E_{P_0}\big(Y_{(1,1), d_{0,A(1)}} \mid V(0) = v(0)\big) - E_{P_0}\big(Y_{(0,1), d_{0,A(1)}} \mid V(0) = v(0)\big) \\ d_{0,A(0)}(V(0)) &= \big(I(\bar{Q}_{10}(V(0)) > 0), 1\big), \end{aligned}$$

where $a\left(0\right)\in \left\{0,1\right\}×\left\{1\right\}$. If $V\left(1\right)$ does not include $V\left(0\right)$, but, for all $\left(a\left(0\right),a\left(1\right)\right)\in \left\{\left\{0,1\right\}×\left\{1\right\}{\right\}}^{2}$,

$$E_{P_0}\big(Y_{a(0),a(1)} \mid V(0), V_{a(0)}(1)\big) = E_{P_0}\big(Y_{a(0),a(1)} \mid V_{a(0)}(1)\big), \tag{3}$$

then the above expression for the V-optimal rule ${d}_{0}$ is still true.
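Theorem 1 suggests a backward-recursive construction: estimate the second-stage blip $\bar{Q}_{20}$, derive $d_{0,A(1)}$, then estimate the first-stage blip $\bar{Q}_{10}$ under that second-stage rule. The sketch below implements this recursion on simulated data with linear working models purely for illustration; the paper itself argues against relying on such parametric models for inference, and its companion paper develops data adaptive estimators of the rule. All distributions and model choices here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Simulated SMART-like data; the generating process is made up.
L0 = rng.normal(size=n)
A0 = rng.binomial(1, 0.5, size=n)
L1 = 0.5 * L0 + A0 + rng.normal(size=n)
A1 = rng.binomial(1, 0.5, size=n)
Y = 0.3 * L0 + 0.2 * A0 + L1 * A1 + rng.normal(size=n)

def ols_fit_predict(X, y, Xnew):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return Xnew @ beta

# Stage-2 blip: difference of outcome regressions at A1 = 1 vs A1 = 0,
# using a linear working model in (1, L0, A0, L1, A1, L1*A1).
X2 = np.column_stack([np.ones(n), L0, A0, L1, A1, L1 * A1])
X2_1 = X2.copy(); X2_1[:, 4] = 1.0; X2_1[:, 5] = L1
X2_0 = X2.copy(); X2_0[:, 4] = 0.0; X2_0[:, 5] = 0.0
blip2 = ols_fit_predict(X2, Y, X2_1) - ols_fit_predict(X2, Y, X2_0)
d_A1 = (blip2 > 0).astype(int)      # d_{0,A(1)} = (I(blip2 > 0), 1)

# Stage-1 blip: regress the predicted outcome under A1 = d_A1
# ("optimal continuation") on (1, L0, A0), then difference at A0 = 1 vs 0.
X2_d = X2.copy(); X2_d[:, 4] = d_A1; X2_d[:, 5] = L1 * d_A1
pseudo = ols_fit_predict(X2, Y, X2_d)
X1 = np.column_stack([np.ones(n), L0, A0])
X1_1 = X1.copy(); X1_1[:, 2] = 1.0
X1_0 = X1.copy(); X1_0[:, 2] = 0.0
blip1 = ols_fit_predict(X1, pseudo, X1_1) - ols_fit_predict(X1, pseudo, X1_0)
d_A0 = (blip1 > 0).astype(int)      # d_{0,A(0)} = (I(blip1 > 0), 1)
```

In this simulation the true stage-2 blip equals $L(1)$, so the fitted $d_{A(1)}$ essentially treats exactly when $L(1) > 0$.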

## 3 The efficient influence curve of the mean outcome under V-optimal rule

In this section we establish the pathwise differentiability of $\Psi$ and give an explicit expression for the efficient influence curve [34, 50, 51]. Before presenting this result, we give the efficient influence curve of the parameter $\Psi_d : \mathcal{M} \to \mathbb{R}$, where $\Psi_d(P) \equiv E_P Y_d$ and the rule $d = (d_{A(0)}, d_{A(1)}) \in \mathcal{D}$ is treated as known. This influence curve has previously been presented in the literature [36, 41]. The parameter mapping $\Psi_d$ has efficient influence curve $$D^{\ast}(d, P) = \sum_{k=0}^{2} D_k^{\ast}(d, P),$$

where $$\begin{aligned} D_0^{\ast}(d,P) &= E_P\big[Y_d \mid L(0), A(0) = d_{A(0)}(V(0))\big] - E_P Y_d \\ D_1^{\ast}(d,P) &= \frac{I\big(A(0) = d_{A(0)}(V(0))\big)}{g_{A(0)}(O)} \Big( E_P\big[Y \mid \bar{A}(1) = d(A(0), V), \bar{L}(1)\big] - E_P\big[Y_d \mid L(0), A(0) = d_{A(0)}(V(0))\big] \Big) \\ D_2^{\ast}(d,P) &= \frac{I\big(\bar{A}(1) = d(A(0), V)\big)}{\prod_{j=0}^{1} g_{A(j)}(O)} \Big( Y - E_P\big[Y \mid \bar{A}(1) = d(A(0), V), \bar{L}(1)\big] \Big). \end{aligned} \tag{4}$$ Above, $(g_{A(0)}, g_{A(1)})$ is the intervention mechanism under the distribution $P$. We remind the reader that $Y_d$ has the G-computation distribution from (1), so that $$E_P\big[Y_d \mid L(0), A(0) = d_{A(0)}(V(0))\big] = E_P\Big[ E_P\big[Y \mid \bar{A}(1) = d(A(0), V), \bar{L}(1)\big] \,\Big|\, L(0), A(0) = d_{A(0)}(V(0)) \Big].$$ At times it will be convenient to write $D_k^{\ast}(d, Q^d, g)$ instead of $D_k^{\ast}(d, P)$, where $Q^d$ represents both of the conditional expectations above and the marginal distribution of $L(0)$ under $P$, and $g$ represents the intervention mechanism under $P$. We will denote these conditional expectations under $P_0$ for a given rule $d$ by $Q_0^d$. We will similarly at times denote $D^{\ast}(d, P)$ by $D^{\ast}(d, Q^d, g)$.
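Given estimated sequential regressions and a known intervention mechanism, the components $D_0^{\ast}, D_1^{\ast}, D_2^{\ast}$ can be combined into a one-step (augmented IPW) estimator of $E_{P_0} Y_d$ for a known rule $d$: the empirical mean of $D^{\ast}(d, P) + \Psi_d(P)$. This is not the TMLE of Appendix B, only a sketch that uses the same influence-curve ingredients; the data-generating process, the fixed rule, and the linear working models below are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# SMART-like data with known intervention mechanism g: both treatment
# probabilities equal 0.5 and there is no censoring.
L0 = rng.normal(size=n)
A0 = rng.binomial(1, 0.5, size=n)
L1 = 0.5 * L0 + A0 + rng.normal(size=n)
A1 = rng.binomial(1, 0.5, size=n)
Y = 0.3 * L0 + 0.2 * A0 + L1 * A1 + rng.normal(size=n)

d0 = (L0 > 0).astype(int)   # a fixed rule d, treated as known
d1 = (L1 > 0).astype(int)

def ols_fit_predict(X, y, Xnew):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return Xnew @ beta

# Sequential regressions: Q2d approximates E[Y | Lbar(1), Abar(1) = d],
# Q1d approximates E[Q2d | L(0), A(0) = d_{A(0)}(V(0))].
X2 = np.column_stack([np.ones(n), L0, A0, L1, A1, L1 * A1])
X2d = X2.copy(); X2d[:, 4] = d1; X2d[:, 5] = L1 * d1
Q2d = ols_fit_predict(X2, Y, X2d)
X1 = np.column_stack([np.ones(n), L0, A0])
X1d = X1.copy(); X1d[:, 2] = d0
Q1d = ols_fit_predict(X1, Q2d, X1d)

# One-step estimator: psi_hat = P_n[ Q1d + h0 (Q2d - Q1d) + h1 (Y - Q2d) ],
# where h0, h1 are the inverse-probability weights from D1* and D2*.
g0 = g1 = 0.5
h0 = (A0 == d0) / g0
h1 = h0 * (A1 == d1) / g1
contrib = Q1d + h0 * (Q2d - Q1d) + h1 * (Y - Q2d)
psi_hat = contrib.mean()
se = contrib.std() / np.sqrt(n)
print(f"{psi_hat:.3f} +/- {1.96 * se:.3f}")
```

The double robustness of (4) is visible here: the first-stage working model for `Q1d` is misspecified, but with $g$ known the weighted correction terms remove the resulting bias.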

Whenever ${D}^{\ast}(P)$ does not contain an argument for a rule d, this ${D}^{\ast}(P)$ refers to the efficient influence curve of the parameter mapping $\mathrm{\Psi}$ for which $\mathrm{\Psi}(P)={E}_{P}{Y}_{{d}_{P}}$, where the optimal rule ${d}_{P}$ under P is not treated as known. Not treating ${d}_{P}$ as known means that ${d}_{P}$ depends on the input distribution P in the mapping $\mathrm{\Psi}(P)$. The following theorem presents the efficient influence curve of $\mathrm{\Psi}$ at a distribution P. The main condition on this distribution P is that
$$\max_{{a}_{0}(0)\in\{0,1\}}\Pr{}_{P}\left({\bar{Q}}_{2}\left(({a}_{0}(0),1),{V}_{a(0)=({a}_{0}(0),1)}\right)=0\right)=0,\qquad\Pr{}_{P}\left({\bar{Q}}_{1}(V(0))=0\right)=0,\qquad(5)$$
where ${\bar{Q}}_{2}$ and ${\bar{Q}}_{1}$ are defined analogously to ${\bar{Q}}_{20}$ and ${\bar{Q}}_{10}$ in Theorem 1, with the expectations under ${P}_{0}$ replaced by expectations under P. That is, we assume that each blip function under P is nonzero with probability 1. Distributions that do not satisfy this assumption have been referred to as “exceptional laws” [32, 52]. These laws are indeed exceptional when one expects treatment to have a beneficial or harmful effect in all V-strata of individuals. When one expects treatment to affect the outcome in some but not all strata, this assumption may be violated. We will make this assumption about ${P}_{0}$ for all subsequent asymptotic linearity results about ${E}_{{P}_{0}}{Y}_{{d}_{0}}$, and we will make a weaker but still nontrivial assumption for the data adaptive target parameters in Sections 6 and 7.

Theorem 2. Suppose $P\in\mathcal{M}$ is such that $\Pr{}_{P}(|Y|<M)=1$ for some $M<\mathrm{\infty}$, and that the positivity assumption (2) and condition (5) hold. Then the parameter $\mathrm{\Psi}:\mathcal{M}\to\mathbb{R}$ is pathwise differentiable at P with canonical gradient given by

${D}^{\ast }\left(P\right)\equiv {D}^{\ast }\left({d}_{P},P\right)=\sum _{k=0}^{2}{D}_{k}^{\ast }\left({d}_{P},P\right).$

That is, ${D}^{\ast }\left(P\right)$ equals the efficient influence curve ${D}^{\ast }\left({d}_{P},P\right)$ for the parameter ${\mathrm{\Psi }}_{d}\left(P\right)\equiv {E}_{P}{Y}_{d}$ at the V-optimal rule $d={d}_{P}$, where ${\mathrm{\Psi }}_{d}$ treats d as given.

The above theorem is proved as Theorem 8 in van der Laan and Luedtke [48], so the proof is omitted here.

We will at times denote ${D}^{\ast }\left(P\right)$ by ${D}^{\ast }\left(Q,g\right)$, where Q represents ${Q}^{{d}_{P}}$, along with portions of the likelihood which suffice to compute the V-optimal rule ${d}_{P}$. We denote ${d}_{P}$ by ${d}_{Q}$ when convenient. We explore which parts of the likelihood suffice to compute the V-optimal rule in our companion paper, though Theorem 1 shows that ${\stackrel{ˉ}{Q}}_{20}$ and ${\stackrel{ˉ}{Q}}_{10}$ suffice for ${d}_{0}$ (and analogous functions suffice for a more general ${d}_{P}$). We have the following property of the efficient influence curve, which will provide a fundamental ingredient in the analysis of the TMLE presented in the next section.

Theorem 3. Let ${d}_{Q}$ be the V-optimal rule corresponding to Q. For any $Q,g$, we have

${P}_{0}{D}^{\ast }\left(Q,g\right)=\mathrm{\Psi }\left({Q}_{0}\right)-\mathrm{\Psi }\left(Q\right)+{R}_{1{d}_{Q}}\left({Q}^{{d}_{Q}},{Q}_{0}^{{d}_{Q}},g,{g}_{0}\right)+{R}_{2}\left(Q,{Q}_{0}\right)$

where for all $d\in \mathcal{D}$

${R}_{1d}\left({Q}^{d},{Q}_{0}^{d},g,{g}_{0}\right)\equiv {P}_{0}{D}^{\ast }\left(d,{Q}^{d},g\right)-\left({\mathrm{\Psi }}_{d}\left({Q}_{0}^{d}\right)-{\mathrm{\Psi }}_{d}\left({Q}^{d}\right)\right),$

${\mathrm{\Psi }}_{d}\left(P\right)={E}_{P}{Y}_{d}$ is the statistical target parameter that treats d as known, and ${D}^{\ast }\left(d,{Q}_{0}^{d},{g}_{0}\right)$ is the efficient influence curve of ${\mathrm{\Psi }}_{d}$ at ${P}_{0}$ as given in Theorem 2. In addition,

$$\begin{array}{rl}{R}_{2}(Q,{Q}_{0})&\equiv{\mathrm{\Psi}}_{{d}_{Q}}({Q}_{0}^{{d}_{Q}})-{\mathrm{\Psi}}_{{d}_{0}}({Q}_{0}^{{d}_{0}})\\ &={E}_{{P}_{0}}\left({d}_{Q,A(0)}-{d}_{0,A(0)}\right)(V(0))\,{\bar{Q}}_{10}(V(0))\\ &\quad+{E}_{{P}_{0}}\left({d}_{Q,A(1)}-{d}_{0,A(1)}\right)\left((0,1),{V}_{(0,1)}(1)\right)\,{\bar{Q}}_{20}\left((0,1),{V}_{(0,1)}(1)\right)\\ &\equiv{R}_{2A(0)}(Q,{Q}_{0})+{R}_{2A(1)}(Q,{Q}_{0}).\end{array}$$

From the study of the statistical target parameter ${\mathrm{\Psi}}_{d}$ in van der Laan and Gruber [41], we know that ${P}_{0}{D}^{\ast}(d,{Q}^{d},g)={\mathrm{\Psi}}_{d}({Q}_{0}^{d})-{\mathrm{\Psi}}_{d}({Q}^{d})+{R}_{1d}({Q}^{d},{Q}_{0}^{d},g,{g}_{0})$, where ${R}_{1d}$ is a closed-form second-order term involving integrals of the differences ${Q}^{d}-{Q}_{0}^{d}$ multiplied by the differences $g-{g}_{0}$.

The following lemma bounds ${R}_{2}$. We note that this lemma, which concerns how well we can estimate ${d}_{0}$ rather than how well we can make inference about ${E}_{{P}_{0}}{Y}_{{d}_{0}}$, does not require condition (5) to hold. We showed in Theorem 1 that knowing the blip functions ${\stackrel{ˉ}{Q}}_{10}$ and ${\stackrel{ˉ}{Q}}_{20}$ suffices to define the optimal rule ${d}_{0}$. For general Q, we will let ${\stackrel{ˉ}{Q}}_{1}$ and ${\stackrel{ˉ}{Q}}_{2}$ represent the blip functions under this parameter mapping.

Lemma 1. Let ${R}_{2}$ be as in Theorem 3. Let ${P}_{0,(0,1)}$ represent the static intervention-specific G-computation distribution where treatment $(0,1)$ is given at the first time point. Suppose there exist some ${\mathrm{\beta}}_{1},{\mathrm{\beta}}_{2}>1$ such that:
$${E}_{{P}_{0}}\left[{\left|{\bar{Q}}_{10}(V(0))\right|}^{-{\mathrm{\beta}}_{1}}I\left(\left|{\bar{Q}}_{10}(V(0))\right|>0\right)\right]<\mathrm{\infty}$$

$${E}_{{P}_{0,(0,1)}}\left[{\left|{\bar{Q}}_{20}\left((0,1),{V}_{(0,1)}(0)\right)\right|}^{-{\mathrm{\beta}}_{2}}I\left(\left|{\bar{Q}}_{20}\left((0,1),{V}_{(0,1)}(0)\right)\right|>0\right)\right]<\mathrm{\infty},\qquad(6)$$

where the expression in each expectation is taken to be 0 when the indicator is 0. Fix $p\in(1,\mathrm{\infty}]$ and define $h:(1,\mathrm{\infty}]\times(1,\mathrm{\infty})\to(1,\mathrm{\infty})$ as the function for which $h(p,\mathrm{\beta})=\frac{p(\mathrm{\beta}+1)}{p+\mathrm{\beta}}$ when $p<\mathrm{\infty}$ and $h(p,\mathrm{\beta})=\mathrm{\beta}+1$ otherwise. Then:

$${R}_{2A(0)}(Q,{Q}_{0})\le{K}_{1}{\left\|{\bar{Q}}_{1}-{\bar{Q}}_{10}\right\|}_{p,{P}_{0}}^{h(p,{\mathrm{\beta}}_{1})},\qquad{R}_{2A(1)}(Q,{Q}_{0})\le{K}_{2}{\left\|{\bar{Q}}_{2}-{\bar{Q}}_{20}\right\|}_{p,{P}_{0,(0,1)}}^{h(p,{\mathrm{\beta}}_{2})},$$

where ${∥\cdot ∥}_{p,P}$ denotes the ${L}_{p,P}$ norm for the distribution P and ${K}_{1},{K}_{2}\ge 0$ are finite constants that respectively rely on p, ${P}_{0}$, ${\mathrm{\beta }}_{1}$ and p, ${P}_{0,\left(0,1\right)}$, ${\mathrm{\beta }}_{2}$.

The conditions in (6) are moment bounds which ensure that ${\stackrel{ˉ}{Q}}_{10}$ and ${\stackrel{ˉ}{Q}}_{20}$ do not put too much mass around zero. To get the tightest bound, we should always choose ${\mathrm{\beta }}_{1},{\mathrm{\beta }}_{2}$ to be as large as possible. We remind the reader that convergence in ${L}_{p,P}$ implies convergence in ${L}_{q,P}$ for all distributions P and $1\le q\le p\le \mathrm{\infty }$. Hence there is a trade-off between the chosen bounding norm, ${L}_{p,P}$, and the rate we need to obtain with respect to that norm so that the term can be expected to be of order ${n}^{-1/2}$. See Table 1 for some examples of rates of convergence that suffice to give ${R}_{2A\left(0\right)}={o}_{{P}_{0}}\left({n}^{-1/2}\right)$.
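To make this trade-off concrete, the following sketch (our own illustration, not part of the paper) evaluates $h(p,\mathrm{\beta})$ from Lemma 1 and the implied minimal $L_p$ convergence exponent: if ${\|{\bar{Q}}_{1}-{\bar{Q}}_{10}\|}_{p,{P}_{0}}={O}_{{P}_{0}}({n}^{-r})$, then ${R}_{2A(0)}={o}_{{P}_{0}}({n}^{-1/2})$ whenever $r>1/(2h(p,\mathrm{\beta}_{1}))$.

```python
import math

def h(p, beta):
    """Exponent h(p, beta) from Lemma 1: p(beta+1)/(p+beta) for finite p, beta+1 for p = infinity."""
    if math.isinf(p):
        return beta + 1
    return p * (beta + 1) / (p + beta)

def required_rate(p, beta):
    """Smallest L_p rate exponent r such that n^{-r h(p,beta)} = o(n^{-1/2}); any r > 1/(2h) suffices."""
    return 1.0 / (2.0 * h(p, beta))

# beta_1 = 2, p = 2: h = 3/2, so any L_2 rate faster than n^{-1/3} suffices
print(h(2, 2))              # 1.5
print(required_rate(2, 2))  # 0.333...
# p = infinity: h = beta + 1; with beta_1 = 1, a sup-norm rate faster than n^{-1/4} suffices
print(h(math.inf, 1))
```

As the printed values show, larger moments $\mathrm{\beta}$ (or a stronger norm) permit slower convergence of the blip function estimator.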

Using the upper bound on ${\bar{Q}}_{10}$ and applying the Cauchy–Schwarz inequality to eq. (15) in the proof of the lemma shows that:
$${R}_{2A(0)}(Q,{Q}_{0})\le{\left\|{\bar{Q}}_{1}-{\bar{Q}}_{10}\right\|}_{2,{P}_{0}}\sqrt{\Pr{}_{{P}_{0}}\left(0<\left|{\bar{Q}}_{10}\right|<\left|{\bar{Q}}_{1}-{\bar{Q}}_{10}\right|\right)}.$$
Hence ${R}_{2A(0)}={o}_{{P}_{0}}({n}^{-1/2})$ without any moment condition when ${\|{\bar{Q}}_{1}-{\bar{Q}}_{10}\|}_{2,{P}_{0}}={O}_{{P}_{0}}({n}^{-1/2})$, which occurs when one has correctly specified a parametric model for ${\bar{Q}}_{10}$. In general it is unlikely that one can correctly specify a parametric model for ${\bar{Q}}_{10}$. In these cases, Lemma 1 shows that the term ${R}_{2A(0)}$ will still be ${o}_{{P}_{0}}({n}^{-1/2})$ if a moment condition holds and ${\bar{Q}}_{10}$ is estimated at a sufficient rate. The analogue holds for ${\bar{Q}}_{20}$.

The bounds given in Lemma 1 are loose. It is not in general necessary to estimate the blip functions ${\bar{Q}}_{10}$ and ${\bar{Q}}_{20}$ correctly, only their signs. As an extreme example of the looseness of the bounds, one can have ${inf}_{v(0)}|{\bar{Q}}_{1n}(v(0))-{\bar{Q}}_{10}(v(0))|\to\mathrm{\infty}$ as $n\to\mathrm{\infty}$ and still have ${R}_{2A(0)}({Q}_{n},{Q}_{0})=0$ for all n. Nonetheless, these bounds give interpretable sufficient conditions under which the term ${R}_{2}$ converges faster than a root-n rate. We consider methods that do not directly estimate the blip functions in our companion paper.

Table 1

Convergence rates of estimators of ${\stackrel{ˉ}{Q}}_{10}$ which suffice for ${R}_{2A\left(0\right)}$ to be ${o}_{{P}_{0}}\left({n}^{-1/2}\right)$ according to Lemma 1. The higher the moments of ${\stackrel{ˉ}{Q}}_{10}^{-1}$ that are finite, the slower the estimator needs to converge. It is of course preferable to have an estimator which converges according to the ${P}_{0}$ essential supremum than just in ${L}_{2,{P}_{0}}$, but whether or not there is convergence in ${L}_{\mathrm{\infty },{P}_{0}}$ depends on the estimator used and the underlying distribution ${P}_{0}$

## 4 TMLE of the mean outcome under V-optimal rule

Throughout this and the next section we assume that condition (5) holds at ${P}_{0}$. Our proposed TMLE first estimates the optimal rule ${d}_{0}$, giving us an estimated rule ${d}_{n}(A(0),V)=\left({d}_{n,A(0)}(V(0)),{d}_{n,A(1)}(A(0),V(1))\right)$, and subsequently applies the TMLE of $E{Y}_{d}$ for a fixed rule d at $d={d}_{n}$ as presented in van der Laan and Gruber [41]. This TMLE is an analogue of the double robust estimating equation method presented in Bang and Robins [36]; see also Petersen et al. [40] for a generalization of the TMLE to marginal structural models for dynamic treatments. In a companion paper we describe a data adaptive estimator of ${d}_{0}$; in this paper we take ${d}_{n}$ as given. We review the TMLE for ${\mathrm{\Psi}}_{d}({P}_{0})={E}_{{P}_{0}}{Y}_{d}$ at a fixed rule d in “TMLE of the mean outcome under a given rule” in Appendix B. Observations which are only partially observed due to right censoring do not cause a problem for the TMLE. In particular, the TMLE only uses individuals who are not right censored at the first or second time point to obtain initial estimates of ${E}_{{P}_{0}}\left[{Y}_{d}|A(0)={d}_{A(0)}(V(0)),L(0)\right]$ and ${E}_{{P}_{0}}\left[Y|\bar{A}(1)=d(A(0),V),\bar{L}(1)\right]$ in (4), respectively. See the appendix for details.

Here we note some of the key properties of the TMLE. Let ${Q}_{n}^{{d}_{n}\ast}$ consist of the empirical distribution ${Q}_{L(0),n}$ of $L(0)$, a regression function $l(0)\mapsto{E}_{n}^{\ast}\left[{Y}_{d}|L(0)=l(0)\right]$ that estimates ${E}_{{P}_{0}}\left[{Y}_{d}|L(0)\right]$, and a regression function $(a(0),\bar{l}(1))\mapsto{E}_{n}^{\ast}\left[Y|\bar{A}(1)=d(a(0),v),\bar{L}(1)=\bar{l}(1)\right]$ that estimates ${E}_{{P}_{0}}\left[Y|\bar{A}(1)=d(A(0),V),\bar{L}(1)\right]$, where we note that v is a function of $\bar{l}(1)$. In the appendix we describe our proposed algorithm to obtain the estimates in ${Q}_{n}^{{d}_{n}\ast}$. The proposed TMLE for ${\mathrm{\psi}}_{0}={E}_{{P}_{0}}{Y}_{{d}_{0}}$ is given by
$${\mathrm{\psi}}_{{d}_{n},n}^{\ast}={\mathrm{\Psi}}_{{d}_{n}}({Q}_{n}^{{d}_{n}\ast})=\frac{1}{n}\sum_{i=1}^{n}{E}_{n}^{\ast}\left[{Y}_{{d}_{n}}|L(0)={L}_{i}(0)\right],$$
where we have applied the TMLE in the appendix to the case where $d={d}_{n}$, treating ${d}_{n}$ as known. Note that ${\mathrm{\Psi}}_{{d}_{n}}({Q}_{n}^{{d}_{n}\ast})$ is a plug-in estimator in that it is obtained by plugging ${Q}_{n}^{{d}_{n}\ast}$ into the parameter mapping ${Q}^{d}\mapsto{\mathrm{\Psi}}_{d}({Q}^{d})$ for $d={d}_{n}$. We expect our plug-in estimator to give reasonable estimates in finite samples because it naturally respects the constraints of our model. In the next section we show that this estimator also enjoys many desirable asymptotic properties.
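As a purely illustrative sketch of the final plug-in step (the predictions below are hypothetical numbers; in practice they come from the targeting step of Appendix B), the TMLE is just the empirical average of the targeted regression predictions over the observed baseline covariates:

```python
import numpy as np

# Hypothetical targeted predictions E_n^*[Y_{d_n} | L(0) = L_i(0)] for n = 5 subjects;
# in practice these are produced by the targeted regression fit described in Appendix B.
targeted_predictions = np.array([0.61, 0.55, 0.72, 0.48, 0.66])

# Plug-in TMLE: average the targeted outcome regression over the empirical
# distribution of the baseline covariates L(0).
psi_star = targeted_predictions.mean()
print(psi_star)  # 0.604
```

The averaging over the empirical distribution of $L(0)$ is what makes the estimator a plug-in: it evaluates the parameter mapping at the targeted fit rather than solving an estimating equation directly.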

Recall that ${D}^{\ast}(d,{Q}^{d},g)$ is the efficient influence curve of the target parameter ${E}_{{P}_{0}}{Y}_{d}$ that treats d as fixed, and Theorem 2 showed that ${D}^{\ast}({d}_{0},{Q}_{0}^{{d}_{0}},{g}_{0})$ is the efficient influence curve of the target parameter $E{Y}_{{d}_{0}}$, where ${d}_{0}$ is the V-optimal rule. The TMLE $({d}_{n},{Q}_{n}^{{d}_{n}\ast})$ described in the appendix solves the efficient influence curve estimating equation:
$${P}_{n}{D}^{\ast}({d}_{n},{Q}_{n}^{{d}_{n}\ast},{g}_{n})=0.\qquad(7)$$
Further, one can show using standard M-estimator analysis that the targeted ${Q}_{n}^{{d}_{n}\ast}$ proposed in the appendix maintains the same rate of convergence as the initial estimator ${Q}_{n}^{{d}_{n}}$ under very mild conditions. We do not concern ourselves with these conditions in this paper, and instead state all conditions directly in terms of ${Q}_{n}^{{d}_{n}\ast}$. The identity (7) will be a key ingredient in proving the asymptotic linearity of the TMLE of ${\mathrm{\psi}}_{0}={E}_{{P}_{0}}{Y}_{{d}_{0}}$.

## 5 Asymptotic efficiency of the TMLE of the mean outcome under the V-optimal rule

We now wish to analyze the TMLE ${\mathrm{\psi}}_{n}^{\ast}={\mathrm{\Psi}}_{{d}_{n}}({Q}_{n}^{{d}_{n}\ast})$ of ${\mathrm{\psi}}_{0}={\mathrm{\Psi}}_{{d}_{0}}({Q}_{0}^{{d}_{0}})=\mathrm{\Psi}({Q}_{0})$. We first give a representation that will allow us to prove the asymptotic linearity of the TMLE under conditions. The result allows ${Q}_{n}^{{d}_{n}\ast}$ to be misspecified, while the estimators ${g}_{n}$ and ${d}_{n}$ are assumed to be consistent for the intervention mechanism ${g}_{0}$ and the optimal rule ${d}_{0}$, respectively.

Theorem 4. Assume $Y\in \left[0,1\right]$, the strong positivity assumption, condition (5) at ${P}_{0}$, ${D}_{n}^{\ast }\equiv {D}^{\ast }\left({d}_{n},{Q}_{n}^{{d}_{n}\ast },{g}_{n}\right)$ falls in a ${P}_{0}$-Donsker class with probability tending to 1, ${P}_{0}\left\{{D}_{n}^{\ast }-{D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right){\right\}}^{2}$ converges to zero in probability for some ${Q}^{{d}_{0}}$, and

${R}_{2}\left({Q}_{n},{Q}_{0}\right)={o}_{{P}_{0}}\left(1/\sqrt{n}\right),$

where ${R}_{2}$ is defined in Theorem 3 and an upper bound is established in Lemma 1. Then ${\mathrm{\psi }}_{n}^{\ast }-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right){D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)+{R}_{1{d}_{n}}\left({Q}_{n}^{{d}_{n}},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)+{o}_{{P}_{0}}\left({n}^{-\frac{1}{2}}\right),$(8)

where ${R}_{1d}$ is defined in Theorem 3.

The proof of the above theorem, which is given in the appendix, makes use of the fact that the TMLE satisfies (7). We now give two sets of conditions which control the remainder term ${R}_{1{d}_{n}}$ in (8) to prove the asymptotic linearity of the TMLE. The first result is an immediate consequence of the fact that ${R}_{1{d}_{n}}\left({Q}_{n}^{{d}_{n}},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)=0$ whenever ${g}_{n}={g}_{0}$.

Corollary 1. Suppose the conditions of Theorem 4 hold, and further suppose that ${g}_{n}={g}_{0}$ (i.e., an RCT). Then:

${\mathrm{\psi}}_{n}^{\ast}-{\mathrm{\psi}}_{0}=({P}_{n}-{P}_{0}){D}^{\ast}({d}_{0},{Q}^{{d}_{0}},{g}_{0})+{o}_{{P}_{0}}({n}^{-1/2}).$

That is, ${\mathrm{\psi }}_{n}^{\ast }$ is asymptotically linear with influence curve ${D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)$.

The next corollary is more general in that it applies to situations where the intervention mechanism ${g}_{0}$ is estimated from the data. The above result emerges as a special case.

Corollary 2. Suppose all of the conditions of Theorem 4 hold, and that

${R}_{1{d}_{n}}\left({Q}_{n}^{{d}_{n}\ast },{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)-{R}_{1{d}_{n}}\left({Q}^{{d}_{n}},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)={o}_{{P}_{0}}\left(1/\sqrt{n}\right)$

for some ${Q}^{{d}_{n}}$. In addition, we assume the following asymptotic linearity condition on a smooth functional of ${g}_{n}$:

${R}_{1{d}_{n}}\left({Q}^{{d}_{n}},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)=\left({P}_{n}-{P}_{0}\right){D}_{g}\left({P}_{0}\right)+{o}_{{P}_{0}}\left(1/\sqrt{n}\right),$(9)

for some function ${D}_{g}\left({P}_{0}\right)\left(O\right)\in {L}_{0}^{2}\left({P}_{0}\right)\equiv \left\{h:{P}_{0}h=0,{P}_{0}{h}^{2}<\mathrm{\infty }\right\}$. Then,

${\mathrm{\psi }}_{n}^{\ast }-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right)\left\{{D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)+{D}_{g}\left({P}_{0}\right)\right\}+{o}_{{P}_{0}}\left(1/\sqrt{n}\right).$(10)

If it is also known that ${g}_{n}$ is an MLE of ${g}_{0}$ according to a correctly specified model $G$ for ${g}_{0}$ with tangent space ${T}_{g}({P}_{0})$ at ${P}_{0}$, then (9) holds with

${D}_{g}\left({P}_{0}\right)=-\mathrm{\Pi }\left({D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)|{T}_{g}\left({P}_{0}\right)\right),$(11)

where $\mathrm{\Pi }\left(\cdot |{T}_{g}\left({P}_{0}\right)\right)$ denotes the projection operator onto ${T}_{g}\left({P}_{0}\right)\subset {L}_{0}^{2}\left({P}_{0}\right)$ in the Hilbert space ${L}_{0}^{2}\left({P}_{0}\right)$.

Equation (11) is a corollary of Theorem 2.3 of van der Laan and Robins [34]. The rest of the corollary follows from a simple rearrangement of terms, so the proof is omitted.

Condition (9) is trivially satisfied in a randomized clinical trial without missingness, where we can take ${g}_{n}={g}_{0}$ and thus ${D}_{g}\left({P}_{0}\right)$ is the constant function 0. Nonetheless, (11) suggests that it would be better to estimate ${g}_{0}$ using a parametric model that contains the true (known) intervention mechanism. For example, at each time point one may use a main terms linear logistic regression with treatment and covariate histories as predictors. If ${Q}_{n}^{{d}_{n}}$ consistently estimates ${Q}_{0}^{{d}_{0}}$, then ${D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)$ is orthogonal to ${T}_{g}\left({P}_{0}\right)$ and hence the projection in (11) is the constant function 0. Otherwise the projection will decrease the variance of ${\mathrm{\psi }}_{n}^{\ast }-{\mathrm{\psi }}_{0}$ without affecting asymptotic bias, thereby increasing the asymptotic efficiency of the estimator. One can then use an empirical estimate of the variance of ${D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)$ to get asymptotically conservative confidence intervals for ${\mathrm{\psi }}_{0}$.

## 5.1 Asymptotic linearity of TMLE in a SMART setting

Suppose the data are generated by a sequential RCT with no missingness, so that ${g}_{0}$ is known. Further suppose that (5) holds at ${P}_{0}$: that is, treating at each time point has either a positive or negative effect with probability 1, regardless of the choice of regimen at earlier time points. In addition, assume that $V(0)$ and $V(1)$ are both univariate scores, and assume condition (3) so that the optimal rule ${d}_{0,A(1)}$ based on $(A(0),V(0),V(1))$ is the same as the optimal rule ${d}_{0,A(1)}$ based on $(A(0),V(1))$: for example, $V(1)$ is the same score as $V(0)$ but measured at the next time point, so that it is reasonable to assume that an effect of $V(0)$ on Y will be fully blocked by $V(1)$. Suppose we want to use the data from the RCT to learn the V-optimal rule ${d}_{0}$ and provide statistical inference for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$. Further suppose that the moment conditions in Lemma 1 hold with ${\mathrm{\beta}}_{1}={\mathrm{\beta}}_{2}=2$. Since both $V(0)$ and $V(1)$ are one-dimensional, we can use kernel smoothers or sieve-based estimation to generate a library of candidate estimators for the sequential loss-based super-learner of the blip functions $({\bar{Q}}_{10},{\bar{Q}}_{20})$ described in our companion paper. This yields an estimator ${\bar{Q}}_{n}=({\bar{Q}}_{1n},{\bar{Q}}_{2n})$ of ${\bar{Q}}_{0}=({\bar{Q}}_{10},{\bar{Q}}_{20})$ that converges in ${L}_{2}$ at a rate such as ${n}^{-2/5}$ under the assumption that ${\bar{Q}}_{10}$ and ${\bar{Q}}_{20}$ are continuously differentiable with uniformly bounded derivatives, or at a better rate under additional smoothness assumptions.
As a consequence, in this case ${R}_{2}({Q}_{n},{Q}_{0})={O}_{{P}_{0}}({n}^{-3/5})={o}_{{P}_{0}}({n}^{-1/2})$ by Lemma 1. Thus all conditions of Theorem 4 hold, and it follows that the proposed TMLE is asymptotically linear with influence curve ${D}^{\ast}({d}_{0},{Q}^{{d}_{0}},{g}_{0})$, where ${Q}^{{d}_{0}}$ is the possibly misspecified limit of ${Q}_{n}^{{d}_{n}\ast}$ in the TMLE. To conclude, sequential RCTs allow us to learn V-optimal rules at adaptive optimal rates of convergence, and allow valid asymptotic statistical inference for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$. If $V(j)$ is higher dimensional, then one will have to rely on stronger smoothness assumptions on the blip functions and/or moment conditions on $1/|{\bar{Q}}_{10}|$ and $1/|{\bar{Q}}_{20}|$ from Lemma 1 in order to guarantee that ${R}_{2}({Q}_{n},{Q}_{0})={o}_{{P}_{0}}(1/\sqrt{n})$.
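For concreteness, the stated rate follows from Lemma 1 by direct arithmetic with $p={\mathrm{\beta}}_{1}=2$:

```latex
h(2,2) = \frac{2(2+1)}{2+2} = \frac{3}{2},
\qquad
\|\bar{Q}_{1n}-\bar{Q}_{10}\|_{2,P_0}^{h(2,2)}
  = O_{P_0}\!\left((n^{-2/5})^{3/2}\right)
  = O_{P_0}(n^{-3/5})
  = o_{P_0}(n^{-1/2}),
```

and similarly for the ${\bar{Q}}_{20}$ term, so that ${R}_{2}({Q}_{n},{Q}_{0})={O}_{{P}_{0}}({n}^{-3/5})$.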

If there is right censoring, then ${g}_{0}={g}_{01}{g}_{02}$ factors into a treatment mechanism ${g}_{01}$ and a censoring mechanism ${g}_{02}$, where ${g}_{01}$ is known but ${g}_{02}$ typically is not. Substantial knowledge about how censoring depends on the observed past may make it possible to obtain a good estimator of ${g}_{02}$. In that case, the above conclusions still apply, but one now estimates the nuisance parameters of the loss function (e.g., one uses a double robust loss function in which ${g}_{02}$ is replaced by an estimator; see our companion paper).

## 5.2 Statistical inference

Suppose one wishes to estimate the mean outcome under the optimal rule ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ and that (5) holds. Above we developed the TMLE ${\mathrm{\psi }}_{n}^{\ast }$ for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$. By Corollary 1, if ${g}_{n}={g}_{0}$ is known, this TMLE of ${\mathrm{\psi }}_{0}$ is asymptotically linear with influence curve $IC\left({P}_{0}\right)={D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)$. If ${g}_{n}$ is an MLE according to a model with tangent space ${T}_{g}\left({P}_{0}\right)$, then the TMLE is asymptotically linear with influence curve $IC\left({P}_{0}\right)-\mathrm{\Pi }\left(IC\left({P}_{0}\right)|{T}_{g}\left({P}_{0}\right)\right),$

so that one could use $IC\left({P}_{0}\right)$ as a conservative influence curve. Let $I{C}_{n}$ be an estimator of this influence curve $IC\left({P}_{0}\right)$ obtained by plugging in the available estimates of its unknown components. The asymptotic variance of the TMLE ${\mathrm{\psi }}_{n}^{\ast }$ of ${\mathrm{\psi }}_{0}$ can now be (conservatively) estimated with ${\mathrm{\sigma }}_{n}^{2}=\frac{1}{n}\sum _{i=1}^{n}I{C}_{n}^{2}\left({O}_{i}\right).$

An asymptotic 95% confidence interval for ${\mathrm{\psi }}_{0}$ is given by ${\mathrm{\psi }}_{n}^{\ast }±1.96{\mathrm{\sigma }}_{n}/\sqrt{n}$.
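A minimal sketch of this Wald-type interval, assuming one already has the TMLE ${\mathrm{\psi}}_{n}^{\ast}$ and estimated influence curve values $I{C}_{n}({O}_{i})$ (the numbers below are hypothetical placeholders):

```python
import numpy as np

def tmle_confidence_interval(psi_star, ic_values, z=1.96):
    """Wald-type CI psi* +/- z * sigma_n / sqrt(n), where
    sigma_n^2 = (1/n) sum_i IC_n(O_i)^2 is the empirical variance estimate."""
    n = len(ic_values)
    sigma_n = np.sqrt(np.mean(np.square(ic_values)))
    half_width = z * sigma_n / np.sqrt(n)
    return psi_star - half_width, psi_star + half_width

# Hypothetical estimated influence curve values IC_n(O_i), roughly mean zero
ic = np.array([0.3, -0.2, 0.1, -0.4, 0.2])
print(tmle_confidence_interval(0.6, ic))
```

When $I{C}_{n}$ estimates a conservative influence curve, as in the text, the resulting interval has asymptotic coverage of at least 95%.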

## 6 Statistical inference for mean outcome under data adaptively determined dynamic treatment

Let $\hat{d}:\mathcal{M}\to\mathcal{D}$ be an estimator that maps an empirical distribution into an individualized treatment rule. See our companion paper for examples of possible estimators $\hat{d}$. Let ${d}_{n}=\hat{d}({P}_{n})$ be the estimated rule. Up until now we have been concerned with statistical inference for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$, where ${d}_{0}$ is the unknown V-optimal rule and ${d}_{n}$ is an estimator of this rule. As a consequence, statistical inference for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ based on the TMLE relied not only on consistency of ${d}_{n}$ for ${d}_{0}$, but also on the rate at which ${d}_{n}$ converges to ${d}_{0}$, that is, on ${R}_{2}({Q}_{n},{Q}_{0})={o}_{{P}_{0}}(1/\sqrt{n})$. In this section we present statistical inference for the data adaptive target parameter
$${\mathrm{\psi}}_{0n}={\mathrm{\Psi}}_{{d}_{n}}({P}_{0})={E}_{{P}_{0}}{Y}_{d}{\big|}_{d={d}_{n}}.$$

That is, we construct an estimator ${\stackrel{ˆ}{\mathrm{\Psi }}}_{\stackrel{ˆ}{d}\left({P}_{n}\right)}\left({P}_{n}\right)$ of ${\mathrm{\Psi }}_{\stackrel{ˆ}{d}\left({P}_{n}\right)}\left({P}_{0}\right)$ and a confidence interval so that $\underset{n\to \mathrm{\infty }}{lim\phantom{\rule{thinmathspace}{0ex}}}\mathrm{P}{\mathrm{r}}_{{P}_{0}}\left({\mathrm{\Psi }}_{\stackrel{ˆ}{d}\left({P}_{n}\right)}\left({P}_{0}\right)\in {\stackrel{ˆ}{\mathrm{\Psi }}}_{\stackrel{ˆ}{d}\left({P}_{n}\right)}\left({P}_{n}\right)±1.96\stackrel{ˆ}{\mathrm{\sigma }}\left({P}_{n}\right)/\sqrt{n}\right)=0.95,$

where $\stackrel{ˆ}{\mathrm{\sigma }}\left({P}_{n}\right)$ is a consistent estimator of the standard error of ${\stackrel{ˆ}{\mathrm{\Psi }}}_{\stackrel{ˆ}{d}\left({P}_{n}\right)}\left({P}_{n}\right)$. Note that in this definition of the confidence interval the target parameter is itself also a random variable through the data ${P}_{n}$.

We do not assume that (5) holds in this section, but we do implicitly make the weaker assumption that ${d}_{n}\to {d}_{1}$ for some ${d}_{1}\in \mathcal{D}$ in assumption (12) of Theorem 5. Statistical inference will be based on the same TMLE of ${\mathrm{\Psi }}_{d}\left({P}_{0}\right)$ at $d={d}_{n}$, and our variance estimator will also be the same, but since the target is not ${\mathrm{\Psi }}_{{d}_{0}}\left({P}_{0}\right)$ but ${\mathrm{\Psi }}_{{d}_{n}}\left({P}_{0}\right)$, there will be no need for ${d}_{n}$ to even be consistent for ${d}_{0}$, let alone converge at a particular rate. As a consequence, this approach is particularly appropriate in cases where V is high dimensional so that it is not reasonable to expect that ${d}_{n}$ converges to ${d}_{0}$ at the required rate. Another motivation for this data adaptive target parameter is that, even when statistical inference for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ is feasible, one might be interested in statistical inference for the mean outcome under the concretely available rule ${d}_{n}$ instead of under the unknown rule ${d}_{0}$.

As shown in the proof of Theorem 3, ${P}_{0}{D}^{\ast}({d}_{n},{Q}_{n}^{{d}_{n}\ast},{g}_{n})={\mathrm{\psi}}_{0n}-{\mathrm{\psi}}_{n}^{\ast}+{R}_{1{d}_{n}}({Q}_{n}^{{d}_{n}\ast},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0})$. Further, ${P}_{n}{D}^{\ast}({d}_{n},{Q}_{n}^{{d}_{n}\ast},{g}_{n})=0$, which yields
$${\mathrm{\psi}}_{n}^{\ast}-{\mathrm{\psi}}_{0n}=({P}_{n}-{P}_{0}){D}^{\ast}({d}_{n},{Q}_{n}^{{d}_{n}\ast},{g}_{n})+{R}_{1{d}_{n}}({Q}_{n}^{{d}_{n}\ast},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}).$$

This relation is key to the proof of the following theorem, which is analogous to Theorem 4. Note crucially that the theorem does not have any conditions on the remainder term ${R}_{2}$, nor does it require that ${d}_{n}$ converge to the optimal rule ${d}_{0}$.

Theorem 5. Assume $Y\in[0,1]$. Let $\hat{d}({P}_{n})\in\mathcal{D}$ with probability tending to 1, and assume the strong positivity assumption. Let ${\mathrm{\psi}}_{0n}={\mathrm{\Psi}}_{{d}_{n}}({P}_{0})={E}_{{P}_{0}}{Y}_{d}{\big|}_{d={d}_{n}}$ be the data adaptive target parameter of interest. Let ${R}_{1d}$ be as defined in Theorem 3.

Assume ${D}_{n}^{\ast}\equiv{D}^{\ast}({d}_{n},{Q}_{n}^{{d}_{n}\ast},{g}_{n})$ falls in a ${P}_{0}$-Donsker class with probability tending to 1,

${P}_{0}{\left\{{D}_{n}^{\ast }-{D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},{g}_{0}\right)\right\}}^{2}={o}_{{P}_{0}}\left(1\right)$(12)

for some ${d}_{1}\in \mathcal{D}$ and ${Q}^{{d}_{1}}$. Then,

${\mathrm{\psi }}_{n}^{\ast }-{\mathrm{\psi }}_{0n}=\left({P}_{n}-{P}_{0}\right){D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},{g}_{0}\right)+{R}_{1{d}_{n}}\left({Q}_{n}^{{d}_{n}\ast },{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)+{o}_{{P}_{0}}\left({n}^{-1/2}\right).$

If ${g}_{n}={g}_{0}$ (i.e., an RCT), then ${R}_{1{d}_{n}}({Q}_{n}^{{d}_{n}\ast},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0})=0$, so that ${\mathrm{\psi}}_{n}^{\ast}$ is asymptotically linear with influence curve ${D}^{\ast}({d}_{1},{Q}^{{d}_{1}},{g}_{0})$.

The proof of the above theorem is nearly identical to the proof of Theorem 4 and so is omitted. For general ${g}_{n}$, ${R}_{1{d}_{n}}({Q}_{n}^{{d}_{n}\ast},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0})={o}_{{P}_{0}}({n}^{-1/2})$ under second-order term conditions analogous to those assumed in Corollary 2. As in Corollary 2, the asymptotic efficiency may improve (and will not worsen) when a known intervention mechanism is fit using a correctly specified parametric model. See Theorem 11 in our online technical report for details [47].

## 7 Statistical inference for the average of sample-split specific mean counterfactual outcomes under data adaptively determined dynamic treatments

Again let $\hat{d}:\mathcal{M}\to\mathcal{D}$ be an estimator that maps an empirical distribution into an individualized treatment rule. Let ${B}_{n}\in\{0,1{\}}^{n}$ denote a random vector encoding a cross-validation split, and for a split ${B}_{n}$, let ${P}_{n,{B}_{n}}^{0}$ be the empirical distribution of the training sample $\{i:{B}_{n}(i)=0\}$ and ${P}_{n,{B}_{n}}^{1}$ the empirical distribution of the validation sample $\{i:{B}_{n}(i)=1\}$. Consider a J-fold cross-validation scheme. In J-fold cross-validation, the data are split uniformly at random into J mutually exclusive and exhaustive sets of size approximately $n/J$. Each set is then used as the validation set once, with the union of all other sets serving as the corresponding training set. For each $j\in\{1,...,J\}$, with probability $1/J$ the vector ${B}_{n}$ takes the value 1 at all indices in validation set j and 0 at all remaining indices, which constitute training set j.
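The J possible realizations of the split vector ${B}_{n}$ can be sketched as follows (our own illustration; the function name is hypothetical):

```python
import numpy as np

def jfold_split_vectors(n, J, seed=None):
    """Return a J x n 0/1 matrix whose row j is the realization of B_n selecting
    validation set j: 1 at the indices of validation set j, 0 on training set j."""
    rng = np.random.default_rng(seed)
    # Assign each of the n indices to one of J folds of near-equal size, at random.
    fold_of = rng.permutation(np.arange(n) % J)
    return np.array([(fold_of == j).astype(int) for j in range(J)])

B = jfold_split_vectors(n=10, J=5, seed=0)
# Each index lands in exactly one validation set, so the rows sum to the all-ones vector.
print(B.sum(axis=0))
```

Drawing ${B}_{n}$ uniformly from the rows of this matrix reproduces the description above: each row is selected with probability $1/J$, and its zeros mark the corresponding training sample.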

In this section, we present a method that provides an estimator and statistical inference for the data adaptive target parameter ${\stackrel{˜}{\mathrm{\psi }}}_{0n}={E}_{{B}_{n}}{\mathrm{\Psi }}_{\stackrel{ˆ}{d}\left({P}_{n,{B}_{n}}^{0}\right)}\left({P}_{0}\right).$ Note that ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ differs from the data adaptive target parameter ${\mathrm{\psi }}_{0n}$ presented in the previous section: it is defined as the average of J data adaptive parameters, each learned from a training sample of size approximately $n\left(J-1\right)/J$, whereas in the previous section the data adaptive target parameter was defined as the mean outcome under the rule ${d}_{n}$ estimated on the entire data set. Again the target parameter is a random quantity that depends on the sample of size n.

One applies the estimator $\stackrel{ˆ}{d}$ to each of the J training samples, giving a target parameter value ${\mathrm{\Psi }}_{\stackrel{ˆ}{d}\left({P}_{n,{B}_{n}}^{0}\right)}\left({P}_{0}\right)$ for each, and our target parameter ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ is defined as the average of these J target parameters. Below we present a CV-TMLE ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }$ of this data adaptive target parameter ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$. As in the previous section, we will be able to establish statistical inference for our estimate ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }$ without requiring that the estimated rules converge to ${d}_{0}$ or satisfy any rate condition. Unlike the asymptotic linearity results in all previous sections, the results in this section do not rely on an empirical process condition (i.e., a Donsker class condition). This means we obtain valid asymptotic statistical inference under essentially no conditions in a sequential RCT, even when ${d}_{n}$ is a highly data adaptive estimator of a V-optimal rule for a possibly high dimensional V. Under a consistency and rate condition (but no empirical process condition) on ${d}_{n}$, we also get inference for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$.

The next subsection defines the general CV-TMLE for data adaptive target parameters. We subsequently present an asymptotic linearity theorem allowing us to construct asymptotic 95% confidence intervals.

## 7.1 General description of CV-TMLE

Here we give a general overview of the CV-TMLE procedure. In “CV-TMLE of the mean outcome under data adaptive V-optimal rule” in Appendix B we present a particular CV-TMLE which satisfies all of the properties described in this section. Denote the realizations of ${B}_{n}$ by $j=1,\dots ,J$, and let ${d}_{nj}=\stackrel{ˆ}{d}\left({P}_{n,j}^{0}\right)$ for some estimator $\stackrel{ˆ}{d}$ of the optimal rule. Let $\left(a\left(0\right),\stackrel{ˉ}{l}\left(1\right)\right)\phantom{\rule{thinmathspace}{0ex}}↦\phantom{\rule{thinmathspace}{0ex}}{E}_{nj}\left[Y|\stackrel{ˉ}{A}\left(1\right)={d}_{nj}\left(a\left(0\right),v\right),\stackrel{ˉ}{L}\left(1\right)=\stackrel{ˉ}{l}\left(1\right)\right]$ represent an initial estimate of ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right)={d}_{nj}\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]$ based on training sample j. Similarly, let $l\left(0\right)\phantom{\rule{thinmathspace}{0ex}}↦\phantom{\rule{thinmathspace}{0ex}}{E}_{nj}\left[{Y}_{{d}_{nj}}|L\left(0\right)=l\left(0\right)\right]$ represent an initial estimate of ${E}_{{P}_{0}}\left[{Y}_{{d}_{nj}}|L\left(0\right)\right]$ based on training sample j. Finally, let ${Q}_{L\left(0\right),nj}$ represent the empirical distribution of $L\left(0\right)$ in validation sample j.
We then fluctuate these three components using the following submodels: $\left\{{E}_{nj}^{\left({\mathrm{\epsilon }}_{2}\right)}\left[Y|\stackrel{ˉ}{A}\left(1\right)={d}_{nj}\left(a\left(0\right),v\right),\stackrel{ˉ}{L}\left(1\right)=\stackrel{ˉ}{l}\left(1\right)\right]:{\mathrm{\epsilon }}_{2}\in \mathbb{R}\right\}$ $\left\{{E}_{nj}^{\left({\epsilon }_{1}\right)}\left[{Y}_{{d}_{nj}}|L\left(0\right)=l\left(0\right)\right]:{\epsilon }_{1}\in \mathbb{R}\right\}$ $\left\{{Q}_{L\left(0\right),nj}^{\left({\epsilon }_{0}\right)}:{\epsilon }_{0}\in \mathbb{R}\right\},$ where these submodels rely on an estimate ${g}_{nj}$ of ${g}_{0}$ based on training sample j and are such that: ${E}_{nj}^{\left(0\right)}\left[Y|\stackrel{ˉ}{A}\left(1\right)={d}_{nj}\left(a\left(0\right),v\right),\stackrel{ˉ}{L}\left(1\right)\right]={E}_{nj}\left[Y|\stackrel{ˉ}{A}\left(1\right)={d}_{nj}\left(a\left(0\right),v\right),\stackrel{ˉ}{L}\left(1\right)\right]$ ${E}_{nj}^{\left(0\right)}\left[{Y}_{{d}_{nj}}|L\left(0\right)\right]={E}_{nj}\left[{Y}_{{d}_{nj}}|L\left(0\right)\right]$ ${Q}_{L\left(0\right),nj}^{\left(0\right)}={Q}_{L\left(0\right),nj}.$ Let ${Q}_{nj}^{{d}_{nj}}\left(\mathrm{\epsilon }\right)$ represent the parameter mapping that gives the three components above fluctuated by $\mathrm{\epsilon }\equiv \left({\mathrm{\epsilon }}_{0},{\mathrm{\epsilon }}_{1},{\mathrm{\epsilon }}_{2}\right)$. For a fixed $\mathrm{\epsilon }$, ${Q}_{nj}^{{d}_{nj}}\left(\mathrm{\epsilon }\right)$ relies on ${P}_{n,j}^{1}$ only through the empirical distribution of $L\left(0\right)$ in validation sample j.
Let $\mathrm{\varphi }$ be a valid loss function for ${Q}_{0}^{d}$ so that ${Q}_{0}^{d}=arg{min}_{{Q}^{d}}{P}_{0}\mathrm{\varphi }\left({Q}^{d}\right)$, and let $\mathrm{\varphi }$ and the submodels above satisfy ${D}^{*}\left(d,{Q}^{d},g\right)\in 〈{\frac{d}{d\epsilon }\varphi \left({Q}^{d}\left(\epsilon \right)\right)|}_{\epsilon =0}〉,$ where $〈f〉=\left\{{\sum }_{j}{\beta }_{j}{f}_{j}:\beta \right\}$ denotes the linear space spanned by the components of f. We choose ${\mathrm{\epsilon }}_{n}$ to minimize the cross-validated empirical risk ${E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}\mathrm{\varphi }\left({Q}_{nj}^{{d}_{nj}}\left(\mathrm{\epsilon }\right)\right)$ over $\mathrm{\epsilon }\in {\mathbb{R}}^{3}$. We then define the targeted estimate ${Q}_{nj}^{{d}_{nj}\ast }\equiv {Q}_{nj}^{{d}_{nj}}\left({\mathrm{\epsilon }}_{n}\right)$ of ${Q}_{0}^{{d}_{nj}}$. We note that ${Q}_{nj}^{{d}_{nj}\ast }$ maintains the rate of convergence of ${Q}_{nj}$ under mild conditions that are standard in M-estimator analysis. The key property that we need from ${\mathrm{\epsilon }}_{n}$ and the corresponding update ${Q}_{nj}^{{d}_{nj}\ast }$ is that it (approximately) solves the cross-validated empirical mean of the efficient influence curve: ${E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}{D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)={o}_{{P}_{0}}\left(1/\sqrt{n}\right).$ (13) The CV-TMLE implementation presented in the appendix satisfies this equation with ${o}_{{P}_{0}}\left(1/\sqrt{n}\right)$ replaced by 0. The proposed estimator of ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ is given by ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }\equiv {E}_{{B}_{n}}{\mathrm{\Psi }}_{{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast }\right).$
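To make the procedure concrete, the following is a deliberately simplified sketch in Python of a single time point analogue (the paper's simulations were implemented in R). It assumes a binary treatment randomized with known ${g}_{0}=1/2$, uses the misspecified constant initial estimate $1/2$ (so the logistic fluctuation has offset zero), and replaces the paper's data adaptive rule estimator with a crude linear blip fit on each training sample; `fit_rule` and `cv_tmle` are our own hypothetical names. A single pooled $\mathrm{\epsilon }$ is fit on the stacked validation samples, so the update solves the cross-validated efficient influence curve equation exactly.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rule(W, A, Y):
    """Crude stand-in rule estimator: OLS of Y on (1, W, A, A*W) on the
    training sample; treat when the estimated blip is positive."""
    X = np.column_stack([np.ones_like(W), W, A, A * W])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda w: (beta[2] + beta[3] * w > 0).astype(float)

def cv_tmle(W, A, Y, J=10, g0=0.5, seed=0):
    """Simplified single time point CV-TMLE of the mean outcome under the
    fitted rule, with the misspecified constant initial estimate 1/2."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), J)
    H_parts, Y_parts = [], []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        d = fit_rule(W[train], A[train], Y[train])    # rule from training sample only
        H_parts.append((A[fold] == d(W[fold])) / g0)  # clever covariate, validation
        Y_parts.append(Y[fold])
    H, Yv = np.concatenate(H_parts), np.concatenate(Y_parts)
    eps = 0.0                                         # pooled logistic fluctuation
    for _ in range(50):                               # one-dimensional Newton steps
        p = expit(eps * H)
        score = np.sum(H * (Yv - p))
        if abs(score) < 1e-10:
            break
        eps += score / np.sum(H**2 * p * (1 - p))
    # Plug-in: the updated prediction at A = d(W) has clever covariate 1/g0
    return expit(eps / g0)
```

With the constant initial estimate, this estimator reduces to the proportion of successes among subjects whose observed treatment matched the rule fitted on their training sample, which is a sensible known-${g}_{0}$ estimator in this toy setting; the paper's actual CV-TMLE fluctuates nontrivial initial regressions.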

In the current literature we have referred to this estimator as the CV-TMLE [53–56]. We give a concrete CV-TMLE algorithm for ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }$ in “CV-TMLE of the mean outcome under data adaptive V-optimal rule” in Appendix B, but note that other CV-TMLE algorithms can be derived using the approach in this section for different choices of loss function $\mathrm{\varphi }$ and submodels.

## 7.2 Statistical inference based on the CV-TMLE

We now proceed with the analysis of this CV-TMLE ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }$ of ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$. We first give a representation theorem for the CV-TMLE that is analogous to Theorem 5.

Theorem 6. Let ${g}_{nj}$ and ${d}_{nj}$ represent estimates of ${g}_{0}$ and ${d}_{0}$ based on training sample j. Let ${Q}_{nj}^{{d}_{nj}\ast }$ represent a targeted estimate of ${Q}_{0}^{{d}_{nj}}$ as presented in Section 7.1 so that ${Q}_{nj}^{{d}_{nj}\ast }$ satisfies (13). Let ${R}_{1d}$ be as in Theorem 3. Further suppose that the supremum norm of ${max}_{j}{D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)$ is bounded by some $M<\mathrm{\infty }$ with probability tending to 1, and that

$\underset{j\in \left\{1,\dots ,J\right\}}{max}{P}_{0}{\left\{{D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)-{D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},g\right)\right\}}^{2}\to 0\phantom{\rule{1pt}{0ex}}\phantom{\rule{thickmathspace}{0ex}}in\phantom{\rule{thickmathspace}{0ex}}probability\phantom{\rule{1pt}{0ex}}$

for some ${d}_{1}\in \mathcal{D}$ and possibly misspecified ${Q}^{{d}_{1}}$ and g. Then: $\begin{array}{rl}{\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }-{\stackrel{˜}{\mathrm{\psi }}}_{0n}& =\left({P}_{n}-{P}_{0}\right){D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},g\right)\\ & \phantom{\rule{1em}{0ex}}+\frac{1}{J}\sum _{j=1}^{J}{R}_{1{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast },{Q}_{0}^{{d}_{nj}},{g}_{nj},{g}_{0}\right)+{o}_{{P}_{0}}\left({n}^{-1/2}\right).\end{array}$

Note that ${d}_{1}$ in the above theorem need not be the same as the optimal rule ${d}_{0}$, though later we will discuss the desirable special case where ${d}_{1}={d}_{0}$. The above theorem also does not require that ${g}_{0}$ is known, or even that the limit of our intervention mechanisms g is equal to ${g}_{0}$. Nonetheless, we get the following asymptotic linearity result when $g={g}_{0}$ and ${g}_{nj}$ satisfies an asymptotic linearity condition on a smooth functional of ${g}_{nj}$.

Corollary 3. Suppose the conditions from Theorem 6 hold with $g={g}_{0}$. Further suppose that:

$\frac{1}{J}\sum _{j=1}^{J}\left({R}_{1{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast },{Q}_{0}^{{d}_{nj}},{g}_{nj},{g}_{0}\right)-{R}_{1{d}_{nj}}\left({Q}^{{d}_{nj}},{Q}_{0}^{{d}_{nj}},{g}_{nj},{g}_{0}\right)\right)={o}_{{P}_{0}}\left({n}^{-1/2}\right),$

for some ${Q}^{{d}_{nj}}$ and that:

$\frac{1}{J}\sum _{j=1}^{J}{R}_{1{d}_{nj}}\left({Q}^{{d}_{nj}},{Q}_{0}^{{d}_{nj}},{g}_{nj},{g}_{0}\right)=\left({P}_{n}-{P}_{0}\right){D}_{g}\left({P}_{0}\right)+{o}_{{P}_{0}}\left({n}^{-1/2}\right).$(14)

We can conclude that:

${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }-{\stackrel{˜}{\mathrm{\psi }}}_{0n}=\left({P}_{n}-{P}_{0}\right)\left({D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},{g}_{0}\right)+{D}_{g}\left({P}_{0}\right)\right)+{o}_{{P}_{0}}\left({n}^{-1/2}\right).$

The proof of the above result is a simple rearrangement of terms and is therefore omitted. Now consider our setting, and suppose that ${g}_{0}$ is known so that we can take ${g}_{nj}={g}_{0}$ for all j. Consider the estimator ${\mathrm{\sigma }}_{n}^{2}=\frac{1}{J}\sum _{j=1}^{J}{P}_{n,j}^{1}{\left\{{D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)\right\}}^{2}$

of the asymptotic variance ${\mathrm{\sigma }}_{0}^{2}={P}_{0}\left\{{D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},{g}_{0}\right){\right\}}^{2}$ of the CV-TMLE ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }$. An asymptotic 95% confidence interval for ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ is given by ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }±1.96{\mathrm{\sigma }}_{n}/\sqrt{n}$. This same variance estimator and confidence interval can be used in the case that ${g}_{0}$ is not known and each ${g}_{nj}$ is an MLE of ${g}_{0}$ according to some model. In that case, it is an asymptotically conservative confidence interval (analogous to eq. (11) applied to Corollary 3).
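The variance estimator above amounts to averaging the empirical second moment of the estimated efficient influence curve over the validation samples. A minimal sketch in Python (the paper's simulations were implemented in R; `cv_ci` is our own name, and the per-fold influence curve values are assumed to have been computed already):

```python
import numpy as np

def cv_ci(D_star_by_fold, psi_star, n, z=1.96):
    """Cross-validated variance estimate and asymptotic 95% confidence
    interval for the CV-TMLE.  D_star_by_fold[j] holds the estimated
    efficient influence curve D*(d_nj, Q*_nj, g_nj) evaluated at the
    observations in validation sample j."""
    # sigma_n^2 = (1/J) sum_j P^1_{n,j} {D*}^2
    sigma2 = float(np.mean([np.mean(D**2) for D in D_star_by_fold]))
    half = z * np.sqrt(sigma2 / n)
    return sigma2, (psi_star - half, psi_star + half)

# Toy usage with two folds of influence curve values:
sigma2, ci = cv_ci([np.array([1.0, -1.0]), np.array([3.0, -3.0])],
                   psi_star=0.5, n=4)
# sigma2 = (1 + 9) / 2 = 5
```

Note that `n` is the total sample size, not the validation-sample size: the interval width shrinks at the usual $1/\sqrt{n}$ rate even though each fold's influence curve uses fold-specific nuisance estimates.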

Now consider the case where ${d}_{1}$ from the above theorem is equal to the optimal rule ${d}_{0}$ and condition (5) holds. For simplicity, also assume that ${g}_{0}$ is known and ${g}_{nj}={g}_{0}$. Then ${R}_{1{d}_{nj}}$ is equal to 0 for all j, so Theorem 6 shows that the CV-TMLE for ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ is asymptotically linear with influence curve ${D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},{g}_{0}\right)={D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)$. If ${\stackrel{˜}{\mathrm{\psi }}}_{0n}-{\mathrm{\psi }}_{0}=\frac{1}{J}\sum _{j=1}^{J}{R}_{2}\left({Q}_{nj},{Q}_{0}\right)$ is second order, that is, ${o}_{{P}_{0}}\left({n}^{-1/2}\right)$, where ${Q}_{nj}$ is analogous to ${Q}_{n}$ but estimated only on training sample j, then the CV-TMLE is a consistent and asymptotically normal estimator of the mean outcome under the optimal rule. If ${Q}^{{d}_{0}}={Q}_{0}^{{d}_{0}}$, then the CV-TMLE is also asymptotically efficient among all regular asymptotically linear estimators. One can apply bounds like those in Lemma 1 to each of the J terms above to understand the behavior of ${\stackrel{˜}{\mathrm{\psi }}}_{0n}-{\mathrm{\psi }}_{0}$. Note crucially that this result does not rely on the restrictive empirical process conditions used in the previous sections, although it does rely on a consistency and rate condition for asymptotic linearity with respect to the non-data adaptive parameter ${E}_{{P}_{0}}{Y}_{{d}_{0}}$.

## 8 Simulation methods

We start by presenting two single time point simulations. In earlier technical reports we treat the single time point problem directly [47, 48]. Here, we instead note that a single time point optimal treatment is a special case of a two time point treatment in which only the second treatment is of interest. In particular, we can see this by taking $L\left(0\right)=V\left(0\right)=\mathrm{\varnothing }$, estimating ${\stackrel{ˉ}{Q}}_{2,0}$ without any dependence on $a\left(0\right)$, and correctly estimating ${\stackrel{ˉ}{Q}}_{1,0}$ with the constant function zero. We note that, in this one time point formulation, we do not need (5) to hold for ${\stackrel{ˉ}{Q}}_{1,0}$, so it may be more natural to view the single time point problem directly and use the single time point pathwise differentiability result in Theorem 2 of van der Laan and Luedtke [48]. We can then let $I\left(A\left(0\right)={d}_{n,A\left(0\right)}\left(V\left(0\right)\right)\right)=1$ for all $A\left(0\right),V\left(0\right)$ wherever the indicator appears in our calculations. Because the first time point is not of interest, we describe only the second time point treatment mechanism for this simulation. We refer the interested reader to the earlier technical report for a thorough discussion of the single time point case. Finally, we present a two time point data generating distribution to show the effectiveness of our proposed method in the longitudinal setting.

## 8.1 Data generating distributions

## 8.1.1 Single time point

We simulate 1,000 data sets of 1,000 observations from an RCT without missingness. We have that: ${L}_{1}\left(1\right),{L}_{2}\left(1\right),{L}_{3}\left(1\right),{L}_{4}\left(1\right)|A\left(0\right)\stackrel{iid}{\sim }N\left(0,1\right)$ ${A}_{1}\left(1\right)|A\left(0\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1/2\right)$ ${A}_{2}\left(1\right)|{A}_{1}\left(1\right),A\left(0\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1\right)$ $\begin{array}{rl}& \mathrm{l}\mathrm{o}\mathrm{g}\mathrm{i}\mathrm{t}\text{\hspace{0.17em}}{E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right),H=0\right]\\ & \phantom{\rule{2em}{0ex}}=1-{L}_{1}\left(1{\right)}^{2}+3{L}_{2}\left(1\right)+{A}_{1}\left(1\right)\left(5{L}_{3}{\left(1\right)}^{2}-4.45\right)\end{array}$ $\begin{array}{rl}& \mathrm{l}\mathrm{o}\mathrm{g}\mathrm{i}\mathrm{t}\text{\hspace{0.17em}}{E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right),H=1\right]\\ & \phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}=-0.5-{L}_{3}\left(1\right)+2{L}_{1}\left(1\right){L}_{2}\left(1\right)+{A}_{1}\left(1\right)\left(3|{L}_{2}\left(1\right)|-1.5\right)\end{array}$

where Y is a Bernoulli random variable and H is an unobserved $\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1/2\right)$ variable independent of $\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right)$. The above distribution was selected so that the static treatments (treating everyone or no one at the second time point) yield approximately the same mean outcome of 0.464.
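For reference, the single time point data generating distribution above can be simulated as follows. This is an illustrative sketch in Python rather than the paper's R implementation; `simulate_single` is our own name, and $A\left(0\right)$ and ${A}_{2}\left(1\right)$ are suppressed since there is no missingness and only ${A}_{1}\left(1\right)$ is of interest.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_single(n, rng):
    """Draw one data set from the single time point data generating
    distribution of Section 8.1.1 (no missingness, so A2(1) = 1)."""
    L = rng.normal(size=(n, 4))            # L1(1), ..., L4(1) iid N(0, 1)
    L1, L2, L3, _ = L.T
    A1 = rng.binomial(1, 0.5, size=n)      # randomized treatment A1(1)
    H = rng.binomial(1, 0.5, size=n)       # unobserved mixture label
    logit0 = 1 - L1**2 + 3 * L2 + A1 * (5 * L3**2 - 4.45)
    logit1 = -0.5 - L3 + 2 * L1 * L2 + A1 * (3 * np.abs(L2) - 1.5)
    p = expit(np.where(H == 1, logit1, logit0))
    Y = rng.binomial(1, p)
    return L, A1, Y
```

Averaging Y over a large draw from this randomized design should be close to the 0.464 reported above, since both static second time point treatments have approximately that mean outcome.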

We consider two choices for $V\left(1\right)$. For the first we consider $V\left(1\right)={L}_{3}\left(1\right)$, and for the second we consider $V\left(1\right)$ to be the entire covariate history $\stackrel{ˉ}{L}\left(1\right)$. We have shown via Monte Carlo simulation that the optimal rule has mean outcome ${E}_{{P}_{0}}{Y}_{{d}_{0}}\approx 0.536$ when $V\left(1\right)={L}_{3}\left(1\right)$ and the optimal rule has mean outcome ${E}_{{P}_{0}}{Y}_{{d}_{0}}\approx 0.563$ when $V\left(1\right)=\left({L}_{1}\left(1\right),{L}_{2}\left(1\right),{L}_{3}\left(1\right),{L}_{4}\left(1\right)\right)$. One can verify that the blip function at the second time point is nonzero with probability 1 for both choices of $V\left(1\right)$.

## 8.1.2 Two time point

We again simulate 1,000 data sets of 1,000 observations from an RCT without missingness. The observed variables have the following distribution: ${L}_{1}\left(0\right),{L}_{2}\left(0\right)\phantom{\rule{thinmathspace}{0ex}}\stackrel{iid}{\sim }\phantom{\rule{thinmathspace}{0ex}}\mathrm{U}\mathrm{n}\mathrm{i}\mathrm{f}\left(-1,1\right)$ ${A}_{1}\left(0\right)|L\left(0\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1/2\right)$ ${A}_{2}\left(0\right)|{A}_{1}\left(0\right),L\left(0\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1\right)$ ${U}_{1},{U}_{2}|A\left(0\right),L\left(0\right)\phantom{\rule{thinmathspace}{0ex}}\stackrel{iid}{\sim }\phantom{\rule{thinmathspace}{0ex}}\mathrm{U}\mathrm{n}\mathrm{i}\mathrm{f}\left(-1,1\right)$ ${L}_{1}\left(1\right)|A\left(0\right),L\left(0\right),{U}_{1},{U}_{2}\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}{U}_{1}\left(1.25{A}_{1}\left(0\right)+0.25\right)$ ${L}_{2}\left(1\right)|A\left(0\right),L\left(0\right),{L}_{1}\left(1\right),{U}_{1},{U}_{2}\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}{U}_{2}\left(1.25{A}_{1}\left(0\right)+0.25\right)$ ${A}_{1}\left(1\right)|A\left(0\right),\stackrel{ˉ}{L}\left(1\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1/2\right)$ ${A}_{2}\left(1\right)|A\left(0\right),{A}_{1}\left(1\right),\stackrel{ˉ}{L}\left(1\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(1\right)$ $Y|\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right)\phantom{\rule{thinmathspace}{0ex}}\sim \phantom{\rule{thinmathspace}{0ex}}\mathrm{B}\mathrm{e}\mathrm{r}\mathrm{n}\left(0.4+0.069\phantom{\rule{1pt}{0ex}}b\left(\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right)\right)\right),$

where $\begin{array}{rl}b\left(\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right)\right)& \equiv \phantom{\rule{thickmathspace}{0ex}}0.5{A}_{1}\left(0\right)\left(-0.8-3\left(\mathrm{s}\mathrm{g}\mathrm{n}\left({L}_{1}\left(0\right)\right)+{L}_{1}\left(0\right)\right)-{L}_{2}{\left(0\right)}^{2}\right)\\ & \phantom{\rule{1em}{0ex}}+{A}_{1}\left(1\right)\left(-0.35+{\left({L}_{1}\left(1\right)-0.5\right)}^{2}\right)+0.08{A}_{1}\left(0\right){A}_{1}\left(1\right).\end{array}$Note that ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right),\stackrel{ˉ}{L}\left(1\right)\right]$ is contained in the unit interval by the bounds on $\stackrel{ˉ}{A}\left(1\right)$ and $\stackrel{ˉ}{L}\left(1\right)$ so that Y is indeed a valid Bernoulli random variable. We will let $V\left(0\right)=L\left(0\right)$ and $V\left(1\right)=\left(A\left(0\right),\stackrel{ˉ}{L}\left(1\right)\right)$. One can verify that (5) is satisfied for this choice of V.
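This two time point distribution can likewise be simulated directly. The following is an illustrative Python sketch (the paper's simulations used R); `simulate_two` is our own name, and ${A}_{2}\left(0\right)={A}_{2}\left(1\right)=1$ are suppressed since there is no missingness.

```python
import numpy as np

def simulate_two(n, rng):
    """Draw one data set from the two time point data generating
    distribution of Section 8.1.2 (A2(0) = A2(1) = 1: no missingness)."""
    L10 = rng.uniform(-1, 1, size=n)              # L1(0)
    L20 = rng.uniform(-1, 1, size=n)              # L2(0)
    A10 = rng.binomial(1, 0.5, size=n)            # A1(0), randomized
    U1 = rng.uniform(-1, 1, size=n)
    U2 = rng.uniform(-1, 1, size=n)
    L11 = U1 * (1.25 * A10 + 0.25)                # L1(1)
    L21 = U2 * (1.25 * A10 + 0.25)                # L2(1)
    A11 = rng.binomial(1, 0.5, size=n)            # A1(1), randomized
    b = (0.5 * A10 * (-0.8 - 3 * (np.sign(L10) + L10) - L20**2)
         + A11 * (-0.35 + (L11 - 0.5)**2)
         + 0.08 * A10 * A11)
    p = 0.4 + 0.069 * b   # lies in (0, 1) by the bounds on A and L
    Y = rng.binomial(1, p)
    return (L10, L20), A10, (L11, L21), A11, Y
```

Note that when ${A}_{1}\left(0\right)={A}_{1}\left(1\right)=0$ the blip $b$ vanishes, so the success probability under control at both time points is exactly 0.4, matching the static treatment value reported below.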

Static treatments yield mean outcomes ${E}_{{P}_{0}}{Y}_{\left(0,1\right),\left(0,1\right)}=0.400$, ${E}_{{P}_{0}}{Y}_{\left(0,1\right),\left(1,1\right)}\approx 0.395$, ${E}_{{P}_{0}}{Y}_{\left(1,1\right),\left(0,1\right)}\approx 0.361$, and ${E}_{{P}_{0}}{Y}_{\left(1,1\right),\left(1,1\right)}\approx 0.411$. The true optimal treatment has mean outcome ${E}_{{P}_{0}}{Y}_{{d}_{0}}\approx 0.485$.

## 8.2 Optimal rule estimation methods

For now suppose we have estimators of the optimal rule with reasonable convergence properties, by which we mean that the true mean outcome under the fitted rule is close to the mean outcome under the optimal rule. In our companion paper in this volume we describe these estimators and show precisely how close they come to achieving the optimal mean outcome. Here we note that our estimation algorithms correspond to using the full candidate library of weighted classification and blip function-based estimators proposed in table 2 of our companion paper, with the weighted log loss function used to determine the convex combination of candidates. We provide oracle inequalities for this estimator in our companion paper, and argue that it represents a powerful approach to data adaptively estimating the optimal rule without over- or underfitting the data. For a sample size n, we denote the rule estimated on the whole sample by ${d}_{n}$, and the rule estimated on training sample j by ${d}_{nj}$.

## 8.3 Inference procedures

We use four procedures to estimate the mean outcome under the fitted rule. All inference procedures rely on the intervention mechanism, which we always estimate with the true mechanism ${g}_{0}$, as one may do in an RCT without missingness. We do not consider here the efficiency gains that can result from estimating the known treatment mechanism.

The first method uses the TMLE described in “TMLE of the mean outcome under a given rule” in Appendix B. The second method uses the analogous estimating equation approach that uses the double robust inverse probability of censoring weighted (DR-IPCW) estimating equation implied by ${D}^{\ast }\left({d}_{n},{Q}_{n}^{{d}_{n}},{g}_{0}\right)$, where ${Q}_{n}^{{d}_{n}}$ represents the unfluctuated initial estimates of ${Q}_{0}^{{d}_{n}}$. See van der Laan and Robins [34] for a general outline of such an estimating equation approach. This approach is valid whenever the TMLE is valid. We also use the CV-TMLE described in “CV-TMLE of the mean outcome under data adaptive V-optimal rule” in Appendix B, where we use a 10-fold cross-validation scheme. Finally, we use the CV-DR-IPCW cross-validated estimating equation implied by ${\sum }_{j}{P}_{n,j}^{1}{D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}},{g}_{0}\right)$, where ${Q}_{nj}^{{d}_{nj}}$ represents the unfluctuated initial estimates of ${Q}_{0}^{{d}_{nj}}$. This approach is valid whenever the CV-TMLE is valid.

All inference procedures also rely on an estimate of ${Q}_{0}^{d}$ for some estimated d. For the two time point case, we use the empirical distribution of $L\left(0\right)$ to estimate the marginal distribution of $L\left(0\right)$. We compare plugging in both of the true values of ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right)=d\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]$ and ${E}_{{P}_{0}}\left[{Y}_{d}|L\left(0\right),A\left(0\right)={d}_{A\left(0\right)}\left(V\left(0\right)\right)\right]$ as initial estimates with plugging in the incorrectly specified constant function $1/2$ as initial estimates.

For the single time point case, we compare plugging in the true value of ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right)=d\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]$ with the incorrectly specified constant function $1/2$. We always estimate ${E}_{{P}_{0}}\left[{Y}_{d}|L\left(0\right),A\left(0\right)={d}_{A\left(0\right)}\left(V\left(0\right)\right)\right]$ by averaging $\left(A\left(0\right),\stackrel{ˉ}{L}\left(1\right)\right)\phantom{\rule{thinmathspace}{0ex}}↦\phantom{\rule{thinmathspace}{0ex}}{E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right)=d\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]$over the empirical distribution of $L\left(1\right)$ from the entire sample for non-cross-validated methods, and from the training sample for cross-validated methods. The empirical distribution of $L\left(0\right)$ will not play a role for the single time point case because $L\left(0\right)=\mathrm{\varnothing }$.

The procedures used to estimate the optimal rule rely on similar conditional means. Whenever we supply the inference procedures with the incorrect values of these conditional means, we also supply the rule estimation procedures with the incorrect value $1/2$, and whenever we supply the inference procedures with the correct values, we also supply the rule estimation procedures with the correct values.

The simulation was implemented in R [57]. The code used to run the simulations is available upon request. We are currently working to implement the methods of this paper and the companion paper in an R package.

## 8.4 Evaluating performance

We use the coverage of asymptotic 95% confidence intervals to evaluate the performance of the various methods. As we establish in the earlier parts of this paper, each inference approach yields two interesting target parameters with respect to which we can compute coverage. All approaches give asymptotically valid inference for the mean outcome under the optimal rule under conditions, and thus the coverage with respect to this parameter is assessed across all methods.

The TMLE and DR-IPCW estimating equation-based approaches also estimate the data adaptive target parameter ${\mathrm{\psi }}_{0n}$ as presented in Section 6. Given a fitted rule ${d}_{n}$, we approximate the expected value in this parameter definition using ${10}^{6}$ Monte Carlo simulations for the single time point case and $5×{10}^{5}$ Monte Carlo simulations for the two time point case. We then assess confidence interval coverage with respect to this approximation.

The CV-TMLE and cross-validated DR-IPCW estimating equation approaches estimate the data adaptive target parameter ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ as presented in Section 7. Given the ten rules estimated on each of the training sets, the expectation over the sample split random variable ${B}_{n}$ becomes an average over ten target parameters, one for each estimated rule. Again we approximate the expectation under ${P}_{0}$ using ${10}^{6}$ Monte Carlo simulations for each of the ten target parameters in the single time point case, and $5×{10}^{5}$ Monte Carlo simulations in the two time point case.

## 9 Simulation results

Figure 1 shows that the (CV-)TMLE is more efficient than the (CV-)DR-IPCW estimating equation methods in our single time point simulation, except for the cross-validated methods when $V={L}_{1}\left(1\right),\dots ,{L}_{4}\left(1\right)$ and the regressions are misspecified. Note that the MSEs relative to ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ are the typical ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\mathrm{\psi }}_{0}{\right)}^{2}$ for an estimate ${\mathrm{\psi }}_{n}$, while the MSEs relative to the data adaptive parameter are the slightly less typical ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\mathrm{\psi }}_{0n}{\right)}^{2}$ for the TMLE and DR-IPCW, and ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\stackrel{˜}{\mathrm{\psi }}}_{0n}{\right)}^{2}$ for the cross-validated methods. That is, the target parameters vary across the 1,000 data sets considered. We also confirmed that, as is typical in missing data problems, the methods in which the conditional means were correctly specified were more efficient than the methods in which the conditional means were incorrectly specified. Figure 2 shows that the (CV-)TMLE in general has better coverage than the (CV-)DR-IPCW estimating equation approaches in our single time point simulation, with the only exception being the CV-TMLE for ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ when the regressions are misspecified and $V={L}_{1}\left(1\right),\dots ,{L}_{4}\left(1\right)$.

Figure 1

Relative efficiency of the TMLE and DR-IPCW methods with respect to both ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ and the data adaptive parameter, i.e. the MSEs ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\mathrm{\psi }}_{0n}{\right)}^{2}$ for the TMLE and DR-IPCW and ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\stackrel{˜}{\mathrm{\psi }}}_{0n}{\right)}^{2}$ for the cross-validated methods. Results are provided both for the case where the estimate ${E}_{n}\left[Y|\stackrel{ˉ}{A}\left(1\right),W\right]$ of ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right),W\right]$ is correctly specified and the case where this estimate is incorrectly specified with the constant function 1/2. Error bars indicate 95% confidence intervals to account for uncertainty from the finite number of Monte Carlo draws in our simulation. (a) V=L1(1), (b) V=L1(1), …, L4(1)

Figure 2

Coverage of 95% confidence intervals from the TMLE and DR-IPCW methods with respect to both ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ and the data adaptive parameter, i.e. ${\mathrm{\psi }}_{0n}$ for the TMLE and DR-IPCW and ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ for the cross-validated methods. Results are provided both for the case where the estimate ${E}_{n}\left[Y|\stackrel{ˉ}{A}\left(1\right),W\right]$ of ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right),W\right]$ is correctly specified and the case where this estimate is incorrectly specified with the constant function 1/2. The (CV-)TMLE outperforms the (CV-)DR-IPCW estimating equation approach in almost all settings. Error bars indicate 95% confidence intervals to account for uncertainty from the finite number of Monte Carlo draws in our simulation. (a) V=L1(1), (b) V=L1(1), …, L4(1)

Figure 3a shows that the (CV-)TMLE is always more efficient than the (CV-)DR-IPCW estimating equation methods in our two time point simulation. Figure 3b shows that this increased efficiency does not come at the expense of coverage: the (CV-)TMLE always has better coverage than the (CV-)DR-IPCW estimators in our two time point simulation. We also see that the cross-validated methods consistently achieve approximately 95% coverage for the data adaptive parameter. This is to be expected because the cross-validated methods learn the optimal rule on the training samples only, and thus avoid finite sample bias when the conditional means of the outcome are averaged over the validation samples.

Figure 3

(a) Relative efficiency of the TMLE and DR-IPCW methods with respect to both ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ and the data adaptive parameter, i.e. the MSEs ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\mathrm{\psi }}_{0n}{\right)}^{2}$ for the TMLE and DR-IPCW and ${E}_{{P}_{0}}\left({\mathrm{\psi }}_{n}-{\stackrel{˜}{\mathrm{\psi }}}_{0n}{\right)}^{2}$ for the cross-validated methods. (b) Coverage of 95% confidence intervals from the TMLE and DR-IPCW methods with respect to both ${E}_{{P}_{0}}{Y}_{{d}_{0}}$ and the data adaptive parameter, i.e. ${\mathrm{\psi }}_{0n}$ for the TMLE and DR-IPCW and ${\stackrel{˜}{\mathrm{\psi }}}_{0n}$ for the cross-validated methods. Both (a) and (b) give results both for the case where the estimates of ${E}_{{P}_{0}}\left[Y|\stackrel{ˉ}{A}\left(1\right)={d}_{n}\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]$ and ${E}_{{P}_{0}}\left[{Y}_{{d}_{n}}|L\left(0\right)\right]$ are correctly specified and the case where these estimates are incorrectly specified with the constant function 1/2. Error bars indicate 95% confidence intervals to account for uncertainty from the finite number of Monte Carlo draws in our simulation

It may at first be surprising that the TMLE outperforms the DR-IPCW estimating equation method in a randomized clinical trial, especially given that the CV-TMLE and CV-DR-IPCW achieve similar coverage. In Appendix C we give intuition as to why this may be the case in a single time point randomized clinical trial. In short, this difference in coverage appears to occur because our proposed TMLE only fluctuates the conditional means for individuals who received the fitted treatment, thereby reducing finite sample bias that may result from estimating the optimal rule on the same sample that is used to estimate the mean outcome under this fitted rule.

We also looked at the average confidence interval width across Monte Carlo simulations for each method and simulation setting. For a given simulation setting, all four estimation methods gave approximately the same ($\pm 0.002$) average confidence interval width: 0.08 for both single time point simulations and 0.12 for the multiple time point simulation. These average widths show that informatively narrow confidence intervals are attainable even at our relatively small sample size of 1,000 individuals. Unlike Figures 1 and 3a, these values should not be used to gauge the efficiency of the proposed estimators, since they do not take the true parameter value into account.

## 10 Discussion

This article investigated semiparametric statistical inference for the mean outcome under the V-optimal rule and statistical inference for the data adaptive target parameter defined as the mean outcome under a data adaptively determined V-optimal rule (treating the latter as given).

We proved a surprising and useful result stating that the mean outcome under the V-optimal rule is represented by a statistical parameter whose pathwise derivative is identical to what it would have been if the unknown rule had been treated as known, under the condition that the data is generated by a non-exceptional law [52]. As a consequence, the efficient influence curve is immediately known, and any of the efficient estimators for the mean outcome under a given rule can be applied at the estimated rule. In particular, we demonstrate a TMLE, and present asymptotic linearity results. However, the dependence of the statistical target parameter on the unknown rule affects the second-order terms of the TMLE, and, as a consequence, the asymptotic linearity of the TMLE requires that a second-order difference between the estimated rule and the V-optimal rule converges to zero at a rate faster than $1/\sqrt{n}$. We show that this can be expected to hold for rules that are only a function of one continuous score (such as a biomarker), but when V is higher dimensional, only strong smoothness assumptions will guarantee this, so that, even in an RCT, we cannot be guaranteed valid statistical inference for such V-optimal rules.

Therefore, we proceeded to pursue statistical inference for so-called data adaptive target parameters. Specifically, we presented statistical inference for the mean outcome under the dynamic treatment regime we fitted based on the data. We showed that statistical inference for this data adaptive target parameter does not rely on the convergence rate of our estimated rule to the optimal rule, and in fact only requires that the data adaptively fitted rule converges to some (possibly suboptimal) fixed rule. However, even in a sequential RCT, the asymptotic linearity theorem still relies on an empirical process condition that limits the data adaptivity of the estimator of the rule. So, even though the assumptions are much weaker, they can still cause problems in finite samples when V is high dimensional, and possibly even asymptotically.

Therefore, we proceeded with the average of sample split specific target parameters, as proposed in general form by van der Laan et al. [46], and showed that statistical inference can then avoid the empirical process condition. Specifically, our data adaptive target parameter is now defined as the average, over J splits of the sample into training and validation sets, of the mean outcome under the dynamic treatment rule fitted on the corresponding training sample. We presented a CV-TMLE of this data adaptive target parameter, and we established an asymptotic linearity theorem that does not require that the estimated rule is consistent for the optimal rule, let alone at a particular rate. The CV-TMLE also does not require the empirical process condition. As a consequence, in a sequential RCT, this method provides valid asymptotic statistical inference without any conditions beyond the requirement that the estimated rule converges to some (possibly suboptimal) fixed rule.
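The sample-splitting scheme can be sketched as follows. This is a toy illustration, not the paper's CV-TMLE: the rule learner is a crude blip-threshold stand-in, each validation-sample evaluation uses a simple IPW estimator with a known randomization probability of 1/2, and the data generating process is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rule(W, A, Y):
    """Crude stand-in for a rule learner: estimate the blip within a
    median split of the covariate and treat where the blip is positive."""
    cut = np.median(W)
    rules = {}
    for stratum, mask in (("lo", W <= cut), ("hi", W > cut)):
        blip = Y[mask & (A == 1)].mean() - Y[mask & (A == 0)].mean()
        rules[stratum] = 1 if blip > 0 else 0
    return lambda w: np.where(w <= cut, rules["lo"], rules["hi"])

def ipw_mean_under_rule(W, A, Y, rule, g1=0.5):
    """IPW estimate of E[Y_d] plus per-observation influence-curve values
    (randomized trial with known treatment probability g1)."""
    d = rule(W)
    wts = (A == d) / np.where(d == 1, g1, 1 - g1)
    est = np.mean(wts * Y)
    return est, wts * Y - est

# Data adaptive target parameter: average over J splits of the mean
# outcome under the rule fitted on each training sample.
n, J = 1000, 10
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, 1 / (1 + np.exp(-0.5 * W * (2 * A - 1))))

folds = np.arange(n) % J
ests, ics = [], np.empty(n)
for j in range(J):
    tr, va = folds != j, folds == j
    rule_j = fit_rule(W[tr], A[tr], Y[tr])  # rule learned on training sample j only
    est_j, ic_j = ipw_mean_under_rule(W[va], A[va], Y[va], rule_j)
    ests.append(est_j)
    ics[va] = ic_j

psi_tilde = float(np.mean(ests))       # estimate of the split-averaged target
se = ics.std(ddof=1) / np.sqrt(n)      # pooled influence-curve standard error
ci = (psi_tilde - 1.96 * se, psi_tilde + 1.96 * se)
```

Because each rule is fitted without the observations used to evaluate it, the averaged estimate targets the split-specific data adaptive parameter rather than the mean outcome under the (unknown) optimal rule.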

We supported our theoretical findings with simulations, both in the single and two time point settings. Our simulations supported our claim that it is easier to attain good coverage for the proposed data adaptive target parameters than for the mean outcome under the optimal rule, though the results for this harder parameter were also promising. In future work we hope to apply these methods to actual data sets of interest, generated by observational studies as well as RCTs.

It might also be of interest to propose working models for the mean outcome $E_{P_0}\left[Y_{d_0} \mid S\right]$ under the optimal rule, conditional on some baseline covariates $S \subset W$. This is a function of S, and we would define the target parameter of interest as a projection of this true underlying function onto the working model. It would then be of interest to develop a TMLE for this finite dimensional pathwise differentiable parameter, and we expect results similar to those found here. Such parameters provide information about how the mean outcome under the optimal rule is affected by certain baseline characteristics.

Drawing inferences concerning optimal treatment strategies is an important topic that will hopefully help guide future health policy decisions. We believe that working with a large semiparametric model is desirable because it helps to ensure that the projected health benefits from implementing an estimated treatment strategy are not due to bias from a misspecified model. The TMLEs presented in this article have many desirable statistical properties and represent one way to get estimates and make inference in this large model. We look forward to future advances in statistical inference for parameters that involve optimal dynamic treatment regimes.

## Acknowledgements

This research was supported by an NIH grant R01 AI074345-06. AL was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) program. The authors would like to thank the anonymous reviewers and Erica Moodie for their invaluable comments and suggestions to improve the quality of the paper. The authors would also like to thank Oleg Sofrygin for valuable discussions.

## Proofs

Proof of Theorem 1. Let ${V}_{d}=\left(V\left(0\right),{V}_{d}\left(1\right)\right)$. For a rule in $\mathcal{D}$, we have $\begin{array}{rl}{E}_{{P}_{d}}{Y}_{d}&={E}_{{P}_{d}}{E}_{{P}_{d}}\left({Y}_{d}\mid {V}_{d}\right)\\ &={E}_{{V}_{d}}\left(E\left({Y}_{a\left(0\right),a\left(1\right)}\mid {V}_{a\left(0\right)}\right)I\left(a\left(1\right)={d}_{A\left(1\right)}\left(a\left(0\right),{V}_{a\left(0\right)}\left(1\right)\right)\right)I\left(a\left(0\right)={d}_{A\left(0\right)}\left(V\left(0\right)\right)\right)\right).\end{array}$ For each value of $a\left(0\right)$, ${V}_{a\left(0\right)}=\left(V\left(0\right),{V}_{a\left(0\right)}\left(1\right)\right)$, and ${d}_{A\left(0\right)}\left(V\left(0\right)\right)$, the inner conditional expectation is maximized over ${d}_{A\left(1\right)}\left(a\left(0\right),{V}_{a\left(0\right)}\left(1\right)\right)$ by ${d}_{0,A\left(1\right)}$ as presented in the theorem, where we used that $V\left(1\right)$ includes $V\left(0\right)$. This proves that ${d}_{0,A\left(1\right)}$ is indeed the optimal rule for assignment of $A\left(1\right)$. Suppose now that $V\left(1\right)$ does not include $V\left(0\right)$, but the stated assumption holds.
Then the optimal rule ${d}_{0,A\left(1\right)}$ that is restricted to be a function of $\left(V\left(0\right),V\left(1\right),A\left(0\right)\right)$ is given by $I\left(\bar{Q}_{20}\left(A\left(0\right),V\left(0\right),V\left(1\right)\right)>0\right)$, where $\bar{Q}_{20}\left(a\left(0\right),v\left(0\right),v\left(1\right)\right)={E}_{{P}_{0}}\left({Y}_{a\left(0\right),A\left(1\right)=\left(1,1\right)}-{Y}_{a\left(0\right),A\left(1\right)=\left(0,1\right)}\mid {V}_{a\left(0\right)}\left(1\right)=v\left(1\right),V\left(0\right)=v\left(0\right)\right).$ However, by assumption, the latter function depends on $\left(a\left(0\right),v\left(0\right),v\left(1\right)\right)$ only through $\left(a\left(0\right),v\left(1\right)\right)$, and equals $\bar{Q}_{20}\left(a\left(0\right),v\left(1\right)\right)$. Thus, we still have that ${d}_{0,A\left(1\right)}\left(V\right)=\left(I\left(\bar{Q}_{20}\left(A\left(0\right),V\left(1\right)\right)>0\right),1\right)$, and, in fact, it is now also an optimal rule among the larger class of rules that are allowed to use $V\left(0\right)$ as well.

Given we found ${d}_{0,A\left(1\right)}$, it remains to determine the rule ${d}_{0,A\left(0\right)}$ that maximizes $\begin{array}{rl}&{E}_{{V}_{d}}\left({E}_{P}\left({Y}_{a\left(0\right),{d}_{0,A\left(1\right)}}\mid {V}_{a\left(0\right)}\right)I\left(a\left(0\right)={d}_{A\left(0\right)}\left(V\left(0\right)\right)\right)\right)\\ &\phantom{\rule{1em}{0ex}}={E}_{{P}_{0}}E\left({Y}_{a\left(0\right),{d}_{0,A\left(1\right)}}\mid V\left(0\right)\right)I\left(a\left(0\right)={d}_{A\left(0\right)}\left(V\left(0\right)\right)\right),\end{array}$ where we used the iterative conditional expectation rule, taking the conditional expectation of ${V}_{a\left(0\right)}$ given $V\left(0\right)$. This last expression is maximized over ${d}_{A\left(0\right)}$ by ${d}_{0,A\left(0\right)}$ as presented in the theorem. This completes the proof. □
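The backward-induction characterization in Theorem 1, treat at each stage if and only if the corresponding blip function is positive, can be written out directly once the blip functions are available. A minimal sketch: the blip functions `q2` and `q1` below are hypothetical stand-ins (in the theorem, $\bar{Q}_{10}$ already accounts for following $d_{0,A(1)}$ at the later stage), not estimates from any data.

```python
import numpy as np

def optimal_rule(q2_blip, q1_blip):
    """Backward induction as in Theorem 1: at each stage, treat iff the
    corresponding blip function is positive."""
    d1 = lambda a0, v1: (q2_blip(a0, v1) > 0).astype(int)  # d_{0,A(1)}(a(0), v(1))
    d0 = lambda v0: (q1_blip(v0) > 0).astype(int)          # d_{0,A(0)}(v(0))
    return d0, d1

# Hypothetical blip functions, purely for illustration.
q2 = lambda a0, v1: v1 - 0.5 * a0   # stand-in for Q20-bar
q1 = lambda v0: np.sin(v0)          # stand-in for Q10-bar

d0, d1 = optimal_rule(q2, q1)
print(d1(np.array([1, 1, 0]), np.array([-1.0, 0.2, 0.7])))  # → [0 0 1]
print(d0(np.array([0.5, -0.5])))                            # → [1 0]
```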

The following lemma will be useful for proving Theorem 2.

Lemma A.1. Recall the definitions of ${\bar{Q}}_{20}$ and ${\bar{Q}}_{10}$ in Theorem 1. We can represent $\mathrm{\Psi }\left({P}_{0}\right)={E}_{{P}_{0}}{Y}_{{d}_{0}}$ as follows:

$\begin{array}{rl}\mathrm{\Psi }\left({P}_{0}\right)=&{E}_{{P}_{0}}{Y}_{\left(0,1\right),\left(0,1\right)}+{E}_{{P}_{0}}\left[{d}_{0,A\left(1\right)}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\,\bar{Q}_{20}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\right]\\ &+{E}_{{P}_{0}}\,{d}_{0,A\left(0\right)}\left(V\left(0\right)\right)\,\bar{Q}_{10}\left(V\left(0\right)\right),\end{array}$

where ${V}_{\left(0,1\right)}\left(1\right)$ is drawn under the G-computation distribution for which treatment $\left(0,1\right)$ is given at the first time point.

Proof of Lemma A.1. For a point treatment data structure $O=\left(L\left(0\right),A\left(0\right),Y\right)$ and binary treatment $A\left(0\right)$, we have for a rule $V\to d\left(V\right)$ that ${E}_{{P}_{0}}{Y}_{d}={E}_{{P}_{0}}{Y}_{0}+{E}_{{P}_{0}}\,d\left(V\right)\,\bar{Q}_{0}\left(V\right)$ with $\bar{Q}_{0}\left(V\right)={E}_{{P}_{0}}\left[{Y}_{1}-{Y}_{0}\mid V\right]$. This identity is applied twice in the following derivation:

$\begin{array}{rl}\mathrm{\Psi }\left({P}_{0}\right)&={E}_{{P}_{0}}{Y}_{\left(0,1\right),{d}_{0,A\left(1\right)}}+{E}_{{P}_{0}}\,{d}_{0,A\left(0\right)}\left(V\left(0\right)\right)\,\bar{Q}_{10}\left(V\left(0\right)\right)\\ &={E}_{{P}_{0}}{E}_{{P}_{0}}\left[{Y}_{\left(0,1\right),{d}_{0,A\left(1\right)}}\mid {V}_{\left(0,1\right)}\left(1\right)\right]+{E}_{{P}_{0}}\,{d}_{0,A\left(0\right)}\left(V\left(0\right)\right)\,\bar{Q}_{10}\left(V\left(0\right)\right)\\ &={E}_{{P}_{0}}{E}_{{P}_{0}}\left[{Y}_{\left(0,1\right),\left(0,1\right)}\mid {V}_{\left(0,1\right)}\left(1\right)\right]\\ &\phantom{\rule{1em}{0ex}}+{E}_{{P}_{0}}\,I\left(\bar{Q}_{20}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)>0\right)\bar{Q}_{20}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\\ &\phantom{\rule{1em}{0ex}}+{E}_{{P}_{0}}\,{d}_{0,A\left(0\right)}\left(V\left(0\right)\right)\,\bar{Q}_{10}\left(V\left(0\right)\right)\\ &={E}_{{P}_{0}}{E}_{{P}_{0}}\left[{Y}_{\left(0,1\right),\left(0,1\right)}\mid {V}_{\left(0,1\right)}\left(1\right)\right]+{E}_{{P}_{0}}\,{d}_{0,A\left(1\right)}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\bar{Q}_{20}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\\ &\phantom{\rule{1em}{0ex}}+{E}_{{P}_{0}}\,{d}_{0,A\left(0\right)}\left(V\left(0\right)\right)\,\bar{Q}_{10}\left(V\left(0\right)\right)\\ &={E}_{{P}_{0}}{Y}_{\left(0,1\right),\left(0,1\right)}+{E}_{{P}_{0}}\,{d}_{0,A\left(1\right)}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\bar{Q}_{20}\left(\left(0,1\right),{V}_{\left(0,1\right)}\left(1\right)\right)\\ &\phantom{\rule{1em}{0ex}}+{E}_{{P}_{0}}\,{d}_{0,A\left(0\right)}\left(V\left(0\right)\right)\,\bar{Q}_{10}\left(V\left(0\right)\right).\end{array}$ □

Proof of Theorem 3. By the definition of ${R}_{1d}$ we have

$\begin{array}{rl}{P}_{0}{D}^{\ast }\left(Q,g\right)&={P}_{0}{D}^{\ast }\left({d}_{Q},Q,g\right)={\mathrm{\Psi }}_{{d}_{Q}}\left({Q}_{0}^{{d}_{Q}}\right)-{\mathrm{\Psi }}_{{d}_{Q}}\left({Q}^{{d}_{Q}}\right)+{R}_{1{d}_{Q}}\left({Q}^{{d}_{Q}},{Q}_{0}^{{d}_{Q}},g,{g}_{0}\right)\\ &={\mathrm{\Psi }}_{{d}_{0}}\left({Q}_{0}^{{d}_{0}}\right)-{\mathrm{\Psi }}_{{d}_{Q}}\left({Q}^{{d}_{Q}}\right)+\left\{{\mathrm{\Psi }}_{{d}_{Q}}\left({Q}_{0}^{{d}_{Q}}\right)-{\mathrm{\Psi }}_{{d}_{0}}\left({Q}_{0}^{{d}_{0}}\right)\right\}+{R}_{1{d}_{Q}}\left({Q}^{{d}_{Q}},{Q}_{0}^{{d}_{Q}},g,{g}_{0}\right)\\ &=\mathrm{\Psi }\left({Q}_{0}\right)-\mathrm{\Psi }\left(Q\right)+{R}_{2}\left(Q,{Q}_{0}\right)+{R}_{1{d}_{Q}}\left({Q}^{{d}_{Q}},{Q}_{0}^{{d}_{Q}},g,{g}_{0}\right).\end{array}$ □

Proof of Lemma 1. Below we omit the dependence of ${d}_{Q,A\left(0\right)}$, ${d}_{0,A\left(0\right)}$, $\bar{Q}_{1}$, and $\bar{Q}_{10}$ on $V\left(0\right)$: $\begin{array}{rl}{R}_{2A\left(0\right)}&={E}_{{P}_{0}}\left[\left({d}_{Q,A\left(0\right)}-{d}_{0,A\left(0\right)}\right)\bar{Q}_{10}\right]\\ &\le {E}_{{P}_{0}}\left|\left({d}_{Q,A\left(0\right)}-{d}_{0,A\left(0\right)}\right)\bar{Q}_{10}\right|\\ &={E}_{{P}_{0}}\left|\left({d}_{Q,A\left(0\right)}-{d}_{0,A\left(0\right)}\right)\bar{Q}_{10}\,I\left(|\bar{Q}_{10}|\ge |\bar{Q}_{1}-\bar{Q}_{10}|\right)\right|\\ &\phantom{\rule{1em}{0ex}}+{E}_{{P}_{0}}\left|\left({d}_{Q,A\left(0\right)}-{d}_{0,A\left(0\right)}\right)\bar{Q}_{10}\,I\left(0<|\bar{Q}_{10}|<|\bar{Q}_{1}-\bar{Q}_{10}|\right)\right|.\end{array}$

The first term in the final equality is always 0 because ${d}_{Q,A\left(0\right)}={d}_{0,A\left(0\right)}$ whenever the indicator is 1. In the second term, ${d}_{Q,A\left(0\right)}$ and ${d}_{0,A\left(0\right)}$ can only disagree when the indicator is 1, so: $\begin{array}{rl}{R}_{2A\left(0\right)}&\le {E}_{{P}_{0}}\left[|\bar{Q}_{10}|\,I\left(0<|\bar{Q}_{10}|<|\bar{Q}_{1}-\bar{Q}_{10}|\right)\right]\\ &\le {E}_{{P}_{0}}\left[|\bar{Q}_{10}|\,I\left(0<|\bar{Q}_{10}{|}^{\frac{p\left({\mathrm{\beta }}_{1}+1\right)}{p+{\mathrm{\beta }}_{1}}}<|\bar{Q}_{1}-\bar{Q}_{10}{|}^{\frac{p\left({\mathrm{\beta }}_{1}+1\right)}{p+{\mathrm{\beta }}_{1}}}\right)I\left(|\bar{Q}_{10}|>0\right)\right]\\ &\le {E}_{{P}_{0}}\left[|\bar{Q}_{1}-\bar{Q}_{10}{|}^{\frac{p\left({\mathrm{\beta }}_{1}+1\right)}{p+{\mathrm{\beta }}_{1}}}\,|\bar{Q}_{10}{|}^{-\frac{{\mathrm{\beta }}_{1}\left(p-1\right)}{p+{\mathrm{\beta }}_{1}}}\,I\left(|\bar{Q}_{10}|>0\right)\right]\\ &\le {\|\bar{Q}_{1}-\bar{Q}_{10}\|}_{p,{P}_{0}}^{\frac{p\left({\mathrm{\beta }}_{1}+1\right)}{p+{\mathrm{\beta }}_{1}}}\,{\|\bar{Q}_{10}^{-1}I\left(|\bar{Q}_{10}|>0\right)\|}_{{\mathrm{\beta }}_{1},{P}_{0}}^{\frac{{\mathrm{\beta }}_{1}\left(p-1\right)}{p+{\mathrm{\beta }}_{1}}}\end{array}$ (15), where the final inequality holds by Hölder’s inequality. The above also holds when the limit is taken as $p\to \mathrm{\infty }$, yielding the essential supremum result. The result for ${R}_{2A\left(1\right)}$ follows by the same argument. □

Proof of Theorem 4. By Theorem 3, we have

${P}_{0}{D}^{\ast }\left({d}_{n},{Q}_{n}^{{d}_{n}\ast },{g}_{n}\right)={\mathrm{\psi }}_{0}-{\mathrm{\Psi }}_{{d}_{n}}\left({Q}_{n}^{{d}_{n}\ast }\right)+{R}_{n},$ where ${R}_{n}={R}_{1{d}_{n}}\left({Q}_{n}^{{d}_{n}},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)+{R}_{2}\left({Q}_{n},{Q}_{0}\right)$. Combining this with the fact that ${D}_{n}^{\ast }\equiv {D}^{\ast }\left({d}_{n},{Q}_{n}^{{d}_{n}\ast },{g}_{n}\right)$ has empirical mean 0 yields ${\mathrm{\psi }}_{n}^{\ast }-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right){D}_{n}^{\ast }+{R}_{n}=\left({P}_{n}-{P}_{0}\right){D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)+\left({P}_{n}-{P}_{0}\right)\left({D}_{n}^{\ast }-{D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)\right)+{R}_{n}.$ The Donsker condition and the mean square consistency of ${D}_{n}^{\ast }$ to ${D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)$ give $\left({P}_{n}-{P}_{0}\right)\left({D}_{n}^{\ast }-{D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)\right)={o}_{{P}_{0}}\left({n}^{-1/2}\right);$ see, for example, van der Vaart and Wellner [58]. By assumption, ${R}_{2}\left({Q}_{n},{Q}_{0}\right)={o}_{{P}_{0}}\left({n}^{-1/2}\right)$. Thus ${\mathrm{\psi }}_{n}^{\ast }-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right){D}^{\ast }\left({d}_{0},{Q}^{{d}_{0}},{g}_{0}\right)+{R}_{1{d}_{n}}\left({Q}_{n}^{{d}_{n}},{Q}_{0}^{{d}_{n}},{g}_{n},{g}_{0}\right)+{o}_{{P}_{0}}\left({n}^{-1/2}\right),$ as desired. □

Proof of Theorem 6. For all $j=1,\dots ,J$, we have that:

$\begin{array}{rl}{\Psi }_{{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast }\right)-{\Psi }_{{d}_{nj}}\left({Q}_{0}^{{d}_{nj}\ast }\right)=&-{P}_{0}{D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)\\ &+{R}_{1{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast },{Q}_{0}^{{d}_{nj}\ast },{g}_{nj},{g}_{0}\right).\end{array}$

Summing over j and using (13) gives: ${\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }-{\stackrel{˜}{\mathrm{\psi }}}_{0n}=\frac{1}{J}\sum _{j=1}^{J}\left(\left({P}_{n,j}^{1}-{P}_{0}\right){D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)+{R}_{1{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast },{Q}_{0}^{{d}_{nj}\ast },{g}_{nj},{g}_{0}\right)\right).$

We also have that: $\frac{1}{J}\sum _{j=1}^{J}\left({P}_{n,j}^{1}-{P}_{0}\right)\left({D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)-{D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},g\right)\right)={o}_{{P}_{0}}\left({n}^{-1/2}\right).$

This follows by applying the law of total expectation, conditioning on the training sample, and then noting that each ${\stackrel{ˆ}{Q}}^{\ast }\left({P}_{n,{B}_{n}}^{0},{\mathrm{\epsilon }}_{n}\right)$ depends on the data beyond the training sample ${P}_{n,{B}_{n}}^{0}$ only through the finite dimensional parameter ${\mathrm{\epsilon }}_{n}$. Because GLM-based parametric classes easily satisfy an entropy integral condition [58], the consistency assumption on ${D}^{\ast }\left({d}_{nj},{Q}_{nj}^{{d}_{nj}\ast },{g}_{nj}\right)$ shows that the above term is second order. We refer the reader to Zheng and van der Laan [55] for a detailed proof of this result for general cross-validation schemes, including J-fold cross-validation.

It follows that: $\begin{array}{rl}{\stackrel{˜}{\mathrm{\psi }}}_{n}^{\ast }-{\stackrel{˜}{\mathrm{\psi }}}_{0n}=& \left({P}_{n}-{P}_{0}\right){D}^{\ast }\left({d}_{1},{Q}^{{d}_{1}},g\right)\\ & +\frac{1}{J}\sum _{j=1}^{J}{R}_{1{d}_{nj}}\left({Q}_{nj}^{{d}_{nj}\ast },{Q}_{0}^{{d}_{nj}\ast },{g}_{nj},{g}_{0}\right)+{o}_{{P}_{0}}\left({n}^{-1/2}\right).\end{array}$

## TMLE of the mean outcome under a given rule

This TMLE for a fixed dynamic treatment rule has been presented in the literature, but for the sake of self-containedness we describe it briefly here. The TMLE is a substitution estimator that empirically solves the estimating equation corresponding to the efficient influence curve, analogous to Theorem 2 for general d. By substitution estimator, we mean that the TMLE can be written as the mapping $\mathrm{\Psi }$ applied to a particular Q.

Assume without loss of generality that $Y\in \left[0,1\right]$. In this section we use lower case letters to emphasize when quantities are values taken on by random variables rather than the random variables themselves; for example, our sample is given by $\left({o}_{1},\dots ,{o}_{n}\right)$, where ${o}_{i}=\left(\bar{l}\left(1\right)_{i},\bar{a}\left(1\right)_{i},{y}_{i}\right)$. The indicator of not being right censored at time j for individual i is given by ${a}_{2}\left(j\right)_{i}$.

Regress $\left({y}_{i}:{a}_{2}\left(0\right)_{i}={a}_{2}\left(1\right)_{i}=1\right)$ on $\left(\bar{a}\left(1\right)_{i},\bar{l}\left(1\right)_{i}:{a}_{2}\left(0\right)_{i}={a}_{2}\left(1\right)_{i}=1\right)$ to get an estimate $\left({a}_{1}\left(0\right),{a}_{1}\left(1\right),\bar{l}\left(1\right)\right)\mapsto {E}_{n}\left[Y\mid \bar{A}\left(1\right)=\left(\left({a}_{1}\left(0\right),1\right),\left({a}_{1}\left(1\right),1\right)\right),\bar{L}\left(1\right)=\bar{l}\left(1\right)\right].$ (16) Note that we have only used individuals who are not right censored at time 1 to obtain this fit. The above regression can be fitted using a data adaptive technique such as super-learning [59]. To estimate ${E}_{{P}_{0}}\left[Y\mid \bar{A}\left(1\right)=d\left(a\left(0\right),v\right),\bar{l}\left(1\right)\right]$, use $\left(a\left(0\right),\bar{l}\left(1\right)\right)\mapsto {E}_{n}\left[Y\mid \bar{A}\left(1\right)=d\left(a\left(0\right),v\right),\bar{L}\left(1\right)=\bar{l}\left(1\right)\right],$ where we remind the reader that we are treating the rule $d={d}_{n}$ as a known function and that v is a function of $\bar{l}\left(1\right)$ that sets the indicators of not being censored to 1.
Consider the fluctuation submodel $\mathrm{logit}\,{E}_{n}^{\left({\mathrm{\epsilon }}_{2}\right)}\left[Y\mid \bar{A}\left(1\right)=d\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]=\mathrm{logit}\,{E}_{n}\left[Y\mid \bar{A}\left(1\right)=d\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]+{\mathrm{\epsilon }}_{2}{H}_{2}\left({g}_{n}\right)\left(O\right),$ where ${H}_{2}\left({g}_{n}\right)\left(O\right)=\frac{I\left(\bar{A}\left(1\right)=d\left(A\left(0\right),V\right)\right)}{{\prod }_{j=0}^{1}{g}_{n,A\left(j\right)}\left(O\right)}.$ Let ${\mathrm{\epsilon }}_{2n}$ be the estimate of ${\mathrm{\epsilon }}_{2}$ obtained by running a univariate logistic regression of $\left({y}_{i}:i=1,\dots ,n\right)$ on $\left({H}_{2}\left({g}_{n}\right)\left({o}_{i}\right):i=1,\dots ,n\right)$ using $\left(\mathrm{logit}\,{E}_{n}\left[Y\mid \bar{A}\left(1\right)=d\left(a\left(0\right)_{i},{v}_{i}\right),\bar{L}\left(1\right)=\bar{l}\left(1\right)_{i}\right]:i=1,\dots ,n\right)$ as offset.
This defines a targeted estimate ${E}_{n}^{\ast }\left[Y|\stackrel{ˉ}{A}\left(1\right)=d\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]\equiv {E}_{n}^{\left({\mathrm{\epsilon }}_{2n}\right)}\left[Y|\stackrel{ˉ}{A}\left(1\right)=d\left(A\left(0\right),V\right),\stackrel{ˉ}{L}\left(1\right)\right]$(17)of the regression function, where we remind the reader that the targeted estimate is chosen to ensure that the empirical mean of the component ${D}_{2}^{\ast }$ is 0 when we plug in the estimate of the intervention mechanism and the targeted estimate of the regression function for the unknown true quantities.

We now develop a targeted estimator of the second regression function in ${D}_{1}^{\ast }$ to ensure that the substitution estimator of ${D}_{1}^{\ast }$ will have empirical mean 0. Regress $\left({E}_{n}\left[Y\mid \bar{A}\left(1\right)=d\left(a\left(0\right)_{i},{v}_{i}\right),\bar{L}\left(1\right)=\bar{l}\left(1\right)_{i}\right]:{a}_{2}\left(0\right)_{i}=1\right)$ on $\left(l\left(0\right)_{i},a\left(0\right)_{i}:{a}_{2}\left(0\right)_{i}=1\right)$ to get the regression function $\left({a}_{1}\left(0\right),l\left(0\right)\right)\mapsto {E}_{n}\left[{E}_{n}\left[Y\mid \bar{A}\left(1\right)=d\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]\,\middle|\,A\left(0\right)=\left({a}_{1}\left(0\right),1\right),L\left(0\right)=l\left(0\right)\right].$ (18) One can estimate this quantity using the super-learner algorithm among all individuals who are not right censored at time 0. For honest cross-validation in the super-learner algorithm, the nuisance parameter ${E}_{n}\left[Y\mid \bar{A}\left(1\right)=d\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]$ should be fit on the training samples in the super-learner algorithm. We refer the reader to Appendix B of van der Laan and Gruber [41] for a detailed explanation of this procedure. The same strategy holds for estimating the nuisance parameter ${g}_{0}$ when necessary (e.g., in an observational study).

For an estimate of ${E}_{{P}_{0}}\left[{Y}_{d}\mid L\left(0\right)\right]$, one can use the regression function above, but with $a\left(0\right)$ fixed at ${d}_{A\left(0\right)}\left(v\left(0\right)\right)$, which is itself a function of $l\left(0\right)$. We will denote this function by $l\left(0\right)\mapsto {E}_{n}\left[{Y}_{d}\mid L\left(0\right)=l\left(0\right)\right]$. We now wish to fluctuate this initial estimator so that the plug-in estimator of ${D}_{1}^{\ast }\left({P}_{0}\right)$ has empirical mean 0. In particular, we use the submodel $\mathrm{logit}\,{E}_{n}^{\left({\mathrm{\epsilon }}_{1}\right)}\left[{Y}_{d}\mid L\left(0\right)\right]=\mathrm{logit}\,{E}_{n}\left[{Y}_{d}\mid L\left(0\right)\right]+{\mathrm{\epsilon }}_{1}{H}_{1}\left({g}_{n}\right),$ where ${H}_{1}\left({g}_{n}\right)=\frac{I\left(A\left(0\right)={d}_{A\left(0\right)}\left(V\left(0\right)\right)\right)}{{g}_{n,A\left(0\right)}\left(O\right)}.$ Let ${\mathrm{\epsilon }}_{1n}$ be the estimate of ${\mathrm{\epsilon }}_{1}$ obtained by running a univariate logistic regression of $\left({E}_{n}^{\ast }\left[Y\mid \bar{A}\left(1\right)=d\left(a\left(0\right)_{i},{v}_{i}\right),\bar{L}\left(1\right)=\bar{l}\left(1\right)_{i}\right]:i=1,\dots ,n\right)$ on $\left({H}_{1}\left({g}_{n}\right)\left({o}_{i}\right):i=1,\dots ,n\right)$ using $\left(\mathrm{logit}\,{E}_{n}\left[{Y}_{d}\mid L\left(0\right)=l\left(0\right)_{i}\right]:i=1,\dots ,n\right)$ as offset.
A targeted estimate of ${E}_{{P}_{0}}\left[{Y}_{d}\mid L\left(0\right)\right]$ is given by ${E}_{n}^{\ast }\left[{Y}_{d}\mid L\left(0\right)\right]\equiv {E}_{n}^{\left({\mathrm{\epsilon }}_{1n}\right)}\left[{Y}_{d}\mid L\left(0\right)\right].$ (19) Plugging the targeted regressions and ${g}_{n}$ into the expression for ${D}_{1}^{\ast }$ shows that this estimate of ${D}_{1}^{\ast }$ has empirical mean 0.

Let ${Q}_{L\left(0\right),n}$ be the empirical distribution of $L\left(0\right)$, and let ${Q}_{n}^{d\ast }$ be the parameter mapping representing the collection containing ${Q}_{L\left(0\right),n}$ and the targeted regression functions in (17) and (19). This concludes the presentation of the components of the TMLE of ${E}_{{P}_{0}}{Y}_{d}$. The discussion of properties of this estimator is continued in the main text.
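The two-stage targeting procedure above can be illustrated in a stripped-down version: a single time point, no censoring, and a known treatment mechanism. An initial outcome regression is fluctuated on the logit scale along the clever covariate $I(A = d(W))/g(A \mid W)$, and the fluctuated fit is evaluated at $A = d(W)$ and averaged over the empirical distribution. This is a sketch under invented names and toy data, not the paper's two time point implementation; the Newton loop plays the role of the univariate logistic regression with offset.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def tmle_mean_under_rule(W, A, Y, d, Qbar_init, g1):
    """TMLE of E[Y_d] for a given rule d: single time point, no censoring.
    Qbar_init(a, w): initial estimate of E[Y | A=a, W=w], values in (0,1).
    g1(w): P(A=1 | W=w). The targeting step is a univariate logistic
    regression of Y on H = I(A = d(W)) / g(A | W) with logit(Qbar_init)
    as offset, fit here by Newton iterations."""
    dW = d(W)
    g_d = np.where(dW == 1, g1(W), 1 - g1(W))  # g(d(W) | W)
    H = (A == dW) / g_d                        # clever covariate
    offset = logit(np.clip(Qbar_init(A, W), 1e-6, 1 - 1e-6))
    eps = 0.0
    for _ in range(50):                        # 1-d logistic MLE for epsilon
        p = expit(offset + eps * H)
        score = np.sum(H * (Y - p))
        info = np.sum(H ** 2 * p * (1 - p))
        if info == 0:
            break
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break
    # Plug-in step: evaluate the fluctuated fit at A = d(W) and average.
    offset_d = logit(np.clip(Qbar_init(dW, W), 1e-6, 1 - 1e-6))
    return float(np.mean(expit(offset_d + eps / g_d)))

# Toy randomized-trial data; the initial outcome regression is deliberately
# misspecified as the constant 1/2, so the targeting step does all the work.
rng = np.random.default_rng(1)
n = 2000
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, expit(W * (2 * A - 1)))
d = lambda w: (w > 0).astype(int)
psi = tmle_mean_under_rule(W, A, Y, d,
                           Qbar_init=lambda a, w: np.full_like(w, 0.5),
                           g1=lambda w: np.full_like(w, 0.5))
```

With the constant initial fit and known $g = 1/2$, the fluctuation reduces to matching the mean outcome among subjects whose observed treatment follows the rule, which is the sense in which the targeting step removes the bias of a misspecified initial regression.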

## CV-TMLE of the mean outcome under data adaptive V-optimal rule

Let $\stackrel{ˆ}{d}:\mathcal{M}\to \mathcal{D}$ be an estimator of the V-optimal rule ${d}_{0}$. Without loss of generality, assume that $Y\in \left[0,1\right]$. Denote the realizations of ${B}_{n}$ by $j=1,\dots ,J$, and let ${d}_{nj}\equiv \stackrel{ˆ}{d}\left({P}_{n,j}^{0}\right)$ denote the rule estimated on training sample j. Let $\left(a\left(0\right),\bar{l}\left(1\right)\right)\mapsto {E}_{nj}\left[Y\mid \bar{A}\left(1\right)={d}_{nj}\left(a\left(0\right),v\right),\bar{L}\left(1\right)=\bar{l}\left(1\right)\right]$ (20) represent an initial estimate of ${E}_{{P}_{0}}\left[Y\mid \bar{A}\left(1\right)={d}_{nj}\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]$ based on training sample j, obtained analogously to the estimator in (16). Similarly, let ${g}_{nj}$ represent the intervention mechanism estimated on training sample ${P}_{n,j}^{0}$, $j=1,\dots ,J$.
Consider the fluctuation submodel $\mathrm{logit}\,{E}_{nj}^{\left({\mathrm{\epsilon }}_{2}\right)}\left[Y\mid \bar{A}\left(1\right)={d}_{nj}\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]=\mathrm{logit}\,{E}_{nj}\left[Y\mid \bar{A}\left(1\right)={d}_{nj}\left(A\left(0\right),V\right),\bar{L}\left(1\right)\right]+{\mathrm{\epsilon }}_{2}{H}_{2}\left({g}_{nj}\right)\left(O\right),$ where ${H}_{2}\left({g}_{nj}\right)\left(O\right)=\frac{I\left(\bar{A}\left(1\right)={d}_{nj}\left(A\left(0\right),V\left(1\right)\right)\right)}{{\prod }_{l=0}^{1}{g}_{nj,A\left(l\right)}\left(O\right)}.$ Note that the fluctuation ${\mathrm{\epsilon }}_{2}$ does not rely on j.
Let
$$\epsilon_{2n} = \operatorname*{arg\,min}_{\epsilon_2} \frac{1}{J}\sum_{j=1}^{J} P_{n,j}^{1}\,\tilde{\varphi}\big(E_{nj}^{(\epsilon_2)}\big),$$
where $E_{nj}^{(\epsilon_2)}$ represents the fluctuated function in (20) and
$$-\tilde{\varphi}(f)(o) = y\log f(o) + (1-y)\log\big(1-f(o)\big) \quad (21)$$
for all $f:\mathcal{O}\to(0,1)$. For each $i=1,\dots,n$, let $j(i)\in\{1,\dots,J\}$ represent the value of $B_n$ for which element $i$ is in the validation set. The fluctuation $\epsilon_{2n}$ can be obtained by fitting a univariate logistic regression of $(y_i : i=1,\dots,n)$ on $\big(H_2(g_{nj(i)})(o_i) : i=1,\dots,n\big)$ using
$$\Big(\operatorname{logit} E_{nj(i)}\big[Y\mid\bar{A}(1)=d_{nj(i)}(a(0)_i,v_i),\bar{L}(1)=\bar{l}(1)_i\big] : i=1,\dots,n\Big)$$
as offset. Thus each observation $i$ is paired with nuisance parameter estimates fit on the training sample that does not contain observation $i$. This defines a targeted estimate
$$E_{nj}^{*}\big[Y\mid\bar{A}(1)=d_{nj}(A(0),V),\bar{L}(1)\big] \equiv E_{nj}^{(\epsilon_{2n})}\big[Y\mid\bar{A}(1)=d_{nj}(A(0),V),\bar{L}(1)\big] \quad (22)$$
of $E_{P_0}[Y\mid\bar{A}(1)=d_{nj}(A(0),V),\bar{L}(1)]$. We note that this targeted estimate only depends on $P_n$ through the training sample $P_{n,j}^{0}$ and the one-dimensional $\epsilon_{2n}$.
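Concretely, the pooled minimization over $\epsilon_2$ is just a one-parameter logistic regression with a fixed offset, which can be fit by Newton-Raphson. The following is a minimal illustration with simulated stand-ins for the cross-validated offsets and clever covariates (the data-generating choices and the function name `fit_epsilon` are ours, not part of the paper):

```python
import numpy as np

def fit_epsilon(y, offset, h, tol=1e-10):
    """Fit logit E[Y] = offset + eps * h by Newton-Raphson: a univariate
    logistic regression of y on the covariate h with a fixed offset."""
    eps = 0.0
    for _ in range(100):
        p = 1.0 / (1.0 + np.exp(-(offset + eps * h)))
        score = np.sum(h * (y - p))        # gradient of the log-likelihood
        info = np.sum(h**2 * p * (1 - p))  # observed information
        if info == 0:
            break
        eps += score / info
        if abs(score / info) < tol:
            break
    return eps

rng = np.random.default_rng(0)
n = 500
# Stand-in for the logits of the cross-validated initial estimates (the offset).
offset = rng.normal(0.0, 1.0, n)
# Stand-in for the clever covariate H_2: an indicator that treatment followed
# the estimated rule, divided by the product of treatment probabilities.
h = rng.binomial(1, 0.5, n) * 4.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-offset)), n).astype(float)

eps2 = fit_epsilon(y, offset, h)
# At the fit, the pooled score equation sum_i h_i (y_i - p_i(eps)) = 0 holds;
# this is the residual component of the cross-validated efficient influence
# curve equation for this fluctuation.
p_star = 1.0 / (1.0 + np.exp(-(offset + eps2 * h)))
print(np.sum(h * (y - p_star)))
```

Solving this score equation is exactly what makes the updated estimate satisfy the corresponding component of the cross-validated efficient influence curve equation.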

We now aim to obtain a targeted estimate of $E_{P_0}[Y_{d_{nj}}\mid L(0)]$. We can obtain an estimate
$$(a_1(0),l(0)) \mapsto E_{nj}\Big[E_{nj}\big[Y\mid\bar{A}(1)=d_{nj}(A(0),V),\bar{L}(1)\big] \,\Big|\, A(0)=(a_1(0),1), L(0)=l(0)\Big] \quad (23)$$
in the same manner as we estimated the quantity in (18), with the caveat that we replace $E_n[Y\mid\bar{A}(1)=d_{nj}(A(0),V),\bar{L}(1)]$ by $E_{nj}[Y\mid\bar{A}(1)=d_{nj}(A(0),V),\bar{L}(1)]$ and only fit the regression on samples that are not right censored at time 0 and are in training set $j$. For an estimate $E_{nj}[Y_{d_{nj}}\mid L(0)]$ of $E_{P_0}[Y_{d_{nj}}\mid L(0)]$, we can use the regression function above with $a(0)$ fixed at $d_{nj,A(0)}(v(0))$.

Consider the fluctuation submodel
$$\operatorname{logit} E_{nj}^{(\epsilon_1)}\big[Y_{d_{nj}}\mid L(0)\big] = \operatorname{logit} E_{nj}\big[Y_{d_{nj}}\mid L(0)\big] + \epsilon_1 H_1(g_{nj})(O),$$
where
$$H_1(g_{nj})(O) = \frac{I\big(A(0)=d_{nj,A(0)}(V(0))\big)}{g_{nj,A(0)}(O)}.$$
Again the fluctuation parameter $\epsilon_1$ does not rely on $j$. Let
$$\epsilon_{1n} = \operatorname*{arg\,min}_{\epsilon_1} \frac{1}{J}\sum_{j=1}^{J} P_{n,j}^{1}\,\tilde{\varphi}\big(E_{nj}^{(\epsilon_1)}\big),$$
where $\tilde{\varphi}$ is defined in (21). For each $i=1,\dots,n$, again let $j(i)\in\{1,\dots,J\}$ represent the value of $B_n$ for which element $i$ is in the validation set. The fluctuation $\epsilon_{1n}$ can be obtained by fitting a univariate logistic regression of
$$\Big(E_{nj(i)}^{*}\big[Y\mid\bar{A}(1)=d_{nj(i)}(a(0)_i,v_i),\bar{l}(1)_i\big] : i=1,\dots,n\Big)$$
on $\big(H_1(g_{nj(i)})(o_i) : i=1,\dots,n\big)$ using
$$\Big(\operatorname{logit} E_{nj(i)}\big[Y_{d_{nj(i)}}\mid L(0)=l(0)_i\big] : i=1,\dots,n\Big)$$
as offset.
This defines a targeted estimate
$$E_{nj}^{*}\big[Y_{d_{nj}}\mid L(0)\big] \equiv E_{nj}^{(\epsilon_{1n})}\big[Y_{d_{nj}}\mid L(0)\big] \quad (24)$$
of $E_{P_0}[Y_{d_{nj}}\mid L(0)]$. We note that this targeted estimate only depends on $P_n$ through the training sample $P_{n,j}^{0}$ and the one-dimensional $\epsilon_{1n}$.

Let $Q_{L(0),nj}$ be the empirical distribution of $L(0)_i$ over the validation sample $P_{n,j}^{1}$. For each $j=1,\dots,J$, let $Q_{nj}^{d_{nj}*}$ denote the collection containing $Q_{L(0),nj}$ and the targeted regressions in (22) and (24). This defines an estimator $\psi_{nj}^{*}=P_{n,j}^{1}\bar{Q}_{1nj}^{*}$ of $\psi_{d_{nj}0}=\Psi_{d_{nj}}(P_0)$ for each $j=1,\dots,J$. The CV-TMLE is now defined as $\psi_n^{*}=\frac{1}{J}\sum_{j=1}^{J}\psi_{nj}^{*}$. This CV-TMLE solves the cross-validated efficient influence curve equation:
$$\frac{1}{J}\sum_{j=1}^{J} P_{n,j}^{1}\, D^{*}\big(d_{nj},Q_{nj}^{d_{nj}*},g_{nj}\big)=0.$$
Further, each $Q_{nj}^{d_{nj}*}$ only relies on $P_{n,j}^{1}$ through the univariate parameters $\epsilon_{1n}$ and $\epsilon_{2n}$. This allows us to use the entropy integral arguments presented in Zheng and van der Laan [55], which show that no restrictive empirical process conditions are needed on the initial estimates in (20) and (23).

The only modification relative to the original CV-TMLE presented in Zheng and van der Laan [55] is that we change the target on each training sample into the training sample-specific target parameter implied by the fit $\hat{d}(P_{n,B_n}^{0})$, whereas in the original CV-TMLE formulation the target would still be $\Psi_{d_0}(P_0)$. With this minor twist, the (same) CV-TMLE now targets the average of the training sample-specific target parameters across the $J$ training samples. CV-TMLE was already used in this way to estimate the average (across training samples) of the true risk of an estimator based on a training sample in Díaz and van der Laan [53] and van der Laan and Petersen [54], so this represents a generalization of that application of CV-TMLE to the estimation of general data adaptive target parameters as proposed in van der Laan et al. [46].
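To make the structure of the procedure concrete, the following is a self-contained sketch of a CV-TMLE for the single time-point analogue of this problem: randomized treatment with known $g_0=1/2$, rule and outcome regression estimated on each training fold, and one common fluctuation parameter fit on the pooled validation folds. All modeling choices here (the logistic working model, the fold scheme, the simulated data, the helper names `fit_logistic` and `design`) are our own simplifications for illustration, not the estimator used in the paper's simulations:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, y):
    """Logistic regression via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(100):
        p = expit(X @ beta)
        wgt = p * (1 - p) + 1e-9
        step = np.linalg.solve((X * wgt[:, None]).T @ X, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def design(a, w):
    # working model: intercept, treatment, covariate, interaction
    return np.column_stack([np.ones_like(w), a, w, a * w])

rng = np.random.default_rng(1)
n, J, g0 = 300, 3, 0.5
w = rng.normal(size=n)
a = rng.binomial(1, g0, n).astype(float)        # randomized treatment
y = rng.binomial(1, expit(0.5 * w * (2 * a - 1))).astype(float)

fold = rng.integers(0, J, n)                    # the split variable B_n
offset = np.empty(n)                            # logit of Q_nj(d_nj(w_i), w_i)
h = np.empty(n)                                 # clever covariate H(g)(o_i)
for j in range(J):
    tr, va = fold != j, fold == j
    beta = fit_logistic(design(a[tr], w[tr]), y[tr])   # initial fit, fold j
    m = int(va.sum())
    q1 = expit(design(np.ones(m), w[va]) @ beta)
    q0 = expit(design(np.zeros(m), w[va]) @ beta)
    d = (q1 > q0).astype(float)                 # estimated rule d_nj
    qd = np.where(d == 1.0, q1, q0)
    offset[va] = np.log(qd / (1 - qd))
    h[va] = (a[va] == d).astype(float) / g0

# one common fluctuation parameter fit on the pooled validation folds
eps = 0.0
for _ in range(100):
    p = expit(offset + eps * h)
    step = np.sum(h * (y - p)) / np.sum(h**2 * p * (1 - p))
    eps += step
    if abs(step) < 1e-12:
        break

# psi* averages the updated regression, evaluated at A = d(W) (so H = 1/g0),
# over the validation folds
psi_star = np.mean(expit(offset + eps / g0))
print(round(psi_star, 3))
```

Each observation enters the pooled regression with nuisance fits from the fold that excludes it, mirroring the two-stage targeting steps (22) and (24) above in compressed, single time-point form.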

## Appendix C: Why the TMLE may have better coverage than the estimating equation approach in a randomized clinical trial

We wrote this section after performing our simulations because we wanted to understand why the TMLE outperformed the DR-IPCW estimating equation approach by such a wide margin. The two approaches do not typically give such disparate estimates in a randomized clinical trial, so it is natural to ask why this happened in our simulations. Part of this section is conjecture (consistent with our simulations), but we offer some justification to support this conjecture.

We now offer a heuristic explanation of why the TMLE may have better coverage than the DR-IPCW estimating equation approach when estimating the data adaptive parameter $\psi_{0n}$. Suppose we have a single time point data structure $O=(W,A,Y)$ drawn according to the distribution $P_0$ in a randomized clinical trial without missingness. Here we use notation which directly describes the single time point data structure, rather than forcing this problem into the longitudinal context of Section 8.1.1. Let $d_0=\arg\max_d E_{P_0} E_{P_0}[Y\mid A=d(V),W]$ for some $V$ that is a function of $W$. Suppose we observe $o_1,\dots,o_n$, and let $d_n$ be an estimate of $d_0$ obtained using the methods in our accompanying technical report [47]. For any fixed rule $d$, the efficient influence curve at some $P\in\mathcal{M}$ is given by
$$\frac{I(A=d(V))}{g(A\mid W)}\big(Y-E_P[Y\mid A=d(V),W]\big) + E_P[Y\mid A=d(V),W] - E_P E_P[Y\mid A=d(V),W],$$
where $g$ is the intervention mechanism under $P$. Again we have that $E_{P_0}Y_{d_0}$ has the same efficient influence curve as above with $d=d_0$ (see our online technical report). Suppose that $g_0=1/2$ is known and that we have estimated $E_{P_0}[Y\mid A=d(V),W]$ perfectly, though we continue to work in the model where $E_{P_0}[Y\mid A=d(V),W]$ is treated as unknown, so that simply averaging over this quantity is not appropriate if we want inference or robustness.

For any fixed rule $V\mapsto d(V)$, it is easy to show that
$$E_{P_0}\left[\frac{I(A=d(V))}{g_0(A\mid W)}\big(Y-E_{P_0}[Y\mid A=d(V),W]\big)\right]=0,$$
where $g_0(a\mid w)$ represents the probability under $P_0$ that $A=a$ given $W=w$. Similarly, we expect that
$$\beta_d(P_n) \equiv \frac{1}{n}\sum_{i=1}^{n}\frac{I(a_i=d(v_i))}{g_0(a_i\mid w_i)}\big(y_i-E_{P_0}[Y\mid A=d(v_i),W=w_i]\big)\approx 0.$$
Further, $E_{P_0}\beta_d(P_n)=0$ for fixed $d$, where the expectation is over the observed sample $P_n$ but not the fixed rule $d$. In the first part of this paper we argued that one can learn an estimated rule $d_n$ on the entire data set, and then treat this rule $d_n$ as known when estimating $E_{P_0}Y_{d_n}$. This is asymptotically valid under the conditions given in this paper, but even if these conditions hold we may expect some finite sample bias. In our simulation this finite sample bias manifests as
$$E_{P_0}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{I(a_i=d_n(v_i))}{g_0(a_i\mid w_i)}\big(y_i-E_{P_0}[Y\mid A=d_n(v_i),W=w_i]\big)\right]>0,$$
where the expectation is over the observed sample $P_n$ and the estimated rule $d_n$. For a single time point simulation with $V=L_3(1)$, this sample average is approximately 0.013 on average across 1,000 simulations. When $V=(L_1(1),\dots,L_1(4))$, this sample average is approximately 0.040 on average across 1,000 simulations.
Because this was a follow-up analysis, we ran these simulations on different Monte Carlo draws than those used for our results in the main text. We conjecture that the above phenomenon is not specific to our simulation settings and will occur in more general settings. Our companion paper in this issue explores the estimation of $d_0$, and a careful look at the mean performance-based loss function presented in that paper shows that one way to make the empirical risk smaller is to choose $d_n$ so that $\beta_{d_n}(P_n)>0$. Nonetheless, selecting $d_n$ by a cross-validation selector as we propose in our companion paper should help mitigate this issue, since $\beta_{d_n}$ for $d_n$ trained on a training sample should have empirical mean close to 0 in the validation sample.
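The conjectured positive bias is easy to reproduce in a toy experiment: when the true conditional mean outcome does not depend on treatment at all, a rule "learned" on the full sample still selects, within each stratum, the arm whose empirical mean happens to be largest, which makes $\beta_{d_n}(P_n)$ positive on average. A sketch (our own toy setup, not the paper's simulation):

```python
import numpy as np

rng = np.random.default_rng(7)

def one_draw(n=100, g0=0.5):
    # Binary effect modifier V, randomized treatment, and an outcome whose
    # true conditional mean is 0.5 regardless of (A, V): every rule is
    # optimal, so any apparent benefit of the learned rule is overfitting.
    v = rng.binomial(1, 0.5, n)
    a = rng.binomial(1, g0, n)
    y = rng.normal(0.5, 0.2, n)
    # "Learn" d_n in-sample: per stratum of V, pick the arm with the higher
    # empirical mean outcome.
    d = np.empty(n)
    for s in (0, 1):
        m1 = y[(v == s) & (a == 1)].mean()
        m0 = y[(v == s) & (a == 0)].mean()
        d[v == s] = 1.0 if m1 > m0 else 0.0
    # beta_{d_n}(P_n), with the true outcome regression (= 0.5) plugged in
    return np.mean((a == d) / g0 * (y - 0.5))

betas = [one_draw() for _ in range(2000)]
print(np.mean(betas))  # positive: the learned rule chases noise upward
```

For a fixed rule the same average would be centered at zero; the positive mean arises entirely from selecting the rule on the data being averaged over.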

The DR-IPCW estimating equation gives the estimator
$$\hat{\Psi}_{EE}^{d_n}(P_n) \equiv \psi_{n,EE} \equiv \beta_{d_n}(P_n) + \frac{1}{n}\sum_{i=1}^{n} E_{P_0}[Y\mid A=d_n(V_i),W=W_i].$$
This estimator has bias $E_{P_0}\beta_{d_n}(P_n)$, where the expectation is over the random sample $P_n$ and the estimated rule $d_n$.

Consider the simple linear TMLE which fluctuates $w\mapsto E_{P_0}[Y\mid A=d_n(v),W=w]$ using the submodel
$$E_{P_0}^{(\epsilon)}[Y\mid A=d_n(V),W] = E_{P_0}[Y\mid A=d_n(V),W] + \epsilon\,\frac{I(A=d_n(V))}{g_0(A\mid W)},$$
where we recall that $w\mapsto E_{P_0}[Y\mid A=d_n(v),W=w]$ is being treated as unknown. A valid TMLE is given by choosing $\epsilon_n$ to minimize the mean squared error between $Y$ and $E_{P_0}^{(\epsilon)}[Y\mid A=d_n(V),W]$. When $Y$ is bounded, the logistic fluctuations that we have presented in this paper are preferable to the linear fluctuation because they respect our model constraints; we consider the linear fluctuation here for simplicity. The minimizer $\epsilon_n$ is given by
$$\epsilon_n = \frac{\frac{1}{n}\sum_i \frac{I(a_i=d_n(v_i))}{g_0(a_i\mid w_i)}\big(y_i-E_{P_0}[Y\mid A=d_n(v_i),W=w_i]\big)}{\frac{1}{n}\sum_i \frac{I(a_i=d_n(v_i))}{g_0(a_i\mid w_i)^2}} = \frac{1}{2}\,\frac{\beta_{d_n}(P_n)}{\frac{1}{n}\sum_i \frac{I(a_i=d_n(v_i))}{g_0(a_i\mid w_i)}}$$
if $\frac{1}{n}\sum_i \frac{I(a_i=d_n(v_i))}{g_0(a_i\mid w_i)}>0$, and we take $\epsilon_n=0$ if this average equals zero.
The denominator above is the same as the denominator in a modified Horvitz-Thompson estimator [60] and, more importantly, appears in one of the terms of the TMLE, which is given by
$$\psi_{n,\mathrm{TMLE}}^{*} \equiv \frac{1}{n}\sum_{i=1}^{n} E_{P_0}^{(\epsilon_n)}[Y\mid A=d_n(V_i),W=W_i] = \frac{1}{n}\sum_{i=1}^{n} E_{P_0}[Y\mid A=d_n(V_i),W=W_i] + \frac{\epsilon_n}{n}\sum_{i=1}^{n}\frac{I(A_i=d_n(V_i))}{g_0(A_i\mid W_i)} = \frac{1}{n}\sum_{i=1}^{n} E_{P_0}[Y\mid A=d_n(V_i),W=W_i] + \frac{\beta_{d_n}(P_n)}{2}.$$
This linear fluctuation TMLE has bias $E_{P_0}\big[\beta_{d_n}(P_n)/2\big]$, which is half the bias of $\hat{\Psi}_{EE}^{d_n}(P_n)$.
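That the update adds exactly $\beta_{d_n}(P_n)/2$ to the plug-in estimate is a purely algebraic consequence of $g_0=1/2$. A quick numerical check, with simulated stand-ins for the rule and the (assumed known) outcome regression:

```python
import numpy as np

rng = np.random.default_rng(3)
n, g0 = 200, 0.5
a = rng.binomial(1, g0, n)
d = rng.binomial(1, 0.5, n)        # stand-in for the estimated rule d_n(v_i)
qbar = rng.uniform(0.2, 0.8, n)    # stand-in for E_{P0}[Y | A=d_n(v_i), W=w_i]
y = rng.binomial(1, qbar).astype(float)

ind = (a == d).astype(float)
beta = np.mean(ind / g0 * (y - qbar))                 # beta_{d_n}(P_n)

# Least-squares minimizer of sum_i (y_i - qbar_i - eps * ind_i/g0)^2 ...
eps = np.mean(ind / g0 * (y - qbar)) / np.mean(ind / g0**2)
# ... which, because g0 = 1/2, equals (1/2) * beta / mean(ind/g0):
assert np.isclose(eps, 0.5 * beta / np.mean(ind / g0))

# The linear-fluctuation TMLE then adds exactly beta/2 to the plug-in mean:
psi_tmle = np.mean(qbar) + eps * np.mean(ind / g0)
assert np.isclose(psi_tmle, np.mean(qbar) + beta / 2)
print(beta, psi_tmle - np.mean(qbar))
```

The key step is that $(I/g_0)^2 = I/g_0^2$ for an indicator $I$, so with constant $g_0=1/2$ the denominator is twice the Horvitz-Thompson weight average and the fluctuation contributes half of $\beta_{d_n}(P_n)$.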

The arguments presented in this section are mainly interesting if $E_{P_0}[\beta_{d_n}(P_n)]\neq 0$. We have conjectured that $E_{P_0}[\beta_{d_n}(P_n)]>0$ for many data generating distributions $P_0$ and estimators of the optimal rule, though we have not analytically justified this claim. If the conditions of Theorem 5 hold, then this bias will only occur in finite samples. For simplicity we analyzed a different TMLE than the ones presented in this paper. First, we analyzed a TMLE for the single time point problem. We show in our online technical report that the single and multiple time point problems are closely related, so we expect these results to carry over to the two time point case. Second, we analyzed a linear rather than logistic fluctuation, simply so that we could obtain a straightforward expression for the bias of the TMLE without having to linearize the fluctuation submodel in a neighborhood of 0. Similar results should hold for the logistic fluctuations. Finally, we assumed that $E_{P_0}[Y\mid A=d_n(V),W]$ was estimated perfectly, which of course is not true in practice. Nonetheless, this assumption makes our results clearer because we then do not have to deal with a resulting empirical process term.

The term $\beta_{d_n}(P_n)$ only causes problems because $d_n$ is learned from the same data used to estimate $E_{P_0}Y_{d_n}$. The cross-validated approaches that we have presented in this paper do not suffer from this conjectured bias, because we can condition on the training sample and then treat $d_n$ as known: for fixed $d$, $E_{P_0}[\beta_d(P_n)]=0$, and thus $\beta_d(P_n)$ does not cause problems.

## References

1. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods: application to control of the healthy worker survivor effect. Math Mod 1986;7:1393–512.
2. Robins JM. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers. In: Proceedings of the Biopharmaceutical Section. American Statistical Association, 1993.
3. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent variable modeling and applications to causality. New York: Springer, 1997:69–117.
4. Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran ME, Berry D, editors. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997). New York: Springer, 2000:95–133.
5. Holland PW. Statistics and causal inference. J Am Stat Assoc 1986;81:945–60.
6. Neyman J. Sur les applications de la théorie des probabilités aux expériences agricoles: essai des principes (1923). Excerpts reprinted in English (D. Dabrowska and T. Speed, trans.). Stat Sci 1990;5:463–72.
7. Pearl J. Causality: models, reasoning and inference, 2nd ed. New York: Cambridge University Press, 2009.
8. Robins JM. Addendum to: "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect". Comput Math Appl 1987;14:923–45.
9. Robins JM. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. J Chron Dis 1987;40(Suppl 2):139s–161s.
10. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974;66:688–701.
11. Rubin DB. Matched sampling for causal effects. New York: Cambridge University Press, 2006.
12. Lavori P, Dawson R. A design for testing clinical strategies: biased adaptive within-subject randomization. J R Stat Soc Ser A 2000;163:29–38.
13. Lavori P, Dawson R. Adaptive treatment strategies in chronic disease. Annu Rev Med 2008;59:443–53.
14. Murphy S. An experimental design for the development of adaptive treatment strategies. Stat Med 2005;24:1455–81.
15. Rosthøj S, Fullwood C, Henderson R, Stewart S. Estimation of optimal dynamic anticoagulation regimes from observational data: a regret-based approach. Stat Med 2006;25:4197–215.
16. Thall P, Millikan R, Sung H-G. Evaluating multiple treatment courses in clinical trials. Stat Med 2000;19:1011–28.
17. Thall P, Sung H, Estey E. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. J Am Stat Assoc 2002;97:29–39.
18. Wagner E, Austin B, Davis C, Hindmarsh M, Schaefer J, Bonomi A. Improving chronic illness care: translating evidence into action. Health Aff 2001;20:64–78.
19. Petersen ML, Deeks SG, Martin JN, van der Laan MJ. History-adjusted marginal structural models to estimate time-varying effect modification. Am J Epidemiol 2007;166:985–93.
20. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat 2007;3:Article 3.
21. Robins J, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Stat Med 2008;27:4678–721.
22. Lavori P, Dawson R. Dynamic treatment regimes: practical design considerations. Clin Trials 2004;1:9–20.
23. Chakraborty B, Murphy SA, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res 2010;19:317–43.
24. Kasari C. Developmental and augmented intervention for facilitating expressive language. ClinicalTrials.gov database, 2009 (updated April 26, 2012; accessed July 24, 2013).
25. Lei H, Nahum-Shani I, Lynch K, Oslin D, Murphy S. A SMART design for building individualized treatment sequences. Annu Rev Clin Psychol 2011;8:21–48.
26. Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, et al. Experimental design and primary data analysis methods for comparing adaptive interventions. Psychol Methods 2012;17:457–77.
27. Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, et al. Q-learning: a data analysis method for constructing adaptive interventions. Psychol Methods 2012;17:478–94.
28. Jones H. Reinforcement-based treatment for pregnant drug abusers. ClinicalTrials.gov database, 2010 (updated October 19, 2012; accessed July 24, 2013).
29. Chakraborty B, Murphy SA. Dynamic treatment regimes. Annu Rev Stat Appl 2013;1:1–18.
30. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B 2003;65:331–55.
31. Robins JM. Discussion of "Optimal dynamic treatment regimes" by Susan A. Murphy. J R Stat Soc Ser B 2003;65:355–66.
32. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. New York: Springer, 2004:189–326.
33. Sutton RS, Barto AG. Reinforcement learning: an introduction. Cambridge, MA: MIT Press, 1998.
34. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.
35. Orellana L, Rotnitzky A, Robins JM. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. Int J Biostat 2010;6:Article 8.
36. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–72.
37. van der Laan MJ. The construction and analysis of adaptive group sequential designs. Technical Report 232, Division of Biostatistics, University of California, Berkeley, CA, 2008.
38. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2012.
39. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat 2006;2:Article 11.
40. Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted minimum loss based estimation of marginal structural working models. J Causal Inference 2013 (submitted).
41. van der Laan MJ, Gruber S. Targeted minimum loss based estimation of causal effects of multiple time point interventions. Int J Biostat 2012;8:Article 9.
42. Cotton C, Heagerty P. A data augmentation method for estimating the causal effect of adherence to treatment regimens targeting control of an intermediate measure. Stat Biosci 2011;3:28–44.
43. Hernán MA, Lanoy E, Costagliola D, Robins JM. Comparison of dynamic treatment regimes via inverse probability weighting. Basic Clin Pharmacol 2006;98:237–42.
44. Shortreed S, Moodie E. Estimating the optimal dynamic antipsychotic treatment regime: evidence from the sequential multiple-assignment randomized CATIE schizophrenia study. J R Stat Soc Ser C 2012;61:577–99.
45. Robins JM, Li L, Tchetgen E, van der Vaart AW. Higher order influence functions and minimax estimation of non-linear functionals. In: Probability and statistics: essays in honor of David A. Freedman. Beachwood, OH: Institute of Mathematical Statistics, 2008:335–421. doi:10.1214/193940307000000527. Available at: http://projecteuclid.org/euclid.imsc/1207580092
46. van der Laan MJ, Hubbard AE, Kherad S. Statistical inference for data adaptive target parameters. Technical Report 314, Division of Biostatistics, University of California, Berkeley, CA, 2013.
47. van der Laan MJ. Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome. Technical Report 317, UC Berkeley, CA, 2013.
48. van der Laan MJ, Luedtke AR. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. Technical Report 329, UC Berkeley, CA, 2014.
49. Robins JM, Rotnitzky A, Scharfstein DO. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Halloran ME, Berry D, editors. Statistical models in epidemiology, the environment and clinical trials. IMA Volumes in Mathematics and Its Applications. New York: Springer, 1999.
50. Bickel PJ, Klaassen CA, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. New York: Springer, 1997.
51. van der Vaart AW. Asymptotic statistics. New York: Cambridge University Press, 1998.
52. Robins J, Rotnitzky A. Discussion of "Dynamic treatment regimes: technical challenges and applications". Electron J Stat 2014;8:1273–89. Available at: http://dx.doi.org/10.1214/14-EJS908
53. Díaz I, van der Laan MJ. Targeted data adaptive estimation of the causal dose response curve. Technical Report 306, Division of Biostatistics, University of California, Berkeley, CA, submitted to JCI, 2013.
54. van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors. Ensemble machine learning. New York: Springer, 2012.
55. Zheng W, van der Laan MJ. Asymptotic theory for cross-validated targeted maximum likelihood estimation. Technical Report 273, Division of Biostatistics, University of California, Berkeley, CA, 2010.
56. Zheng W, van der Laan MJ. Cross-validated targeted minimum loss based estimation. In: van der Laan MJ, Rose S, editors. Targeted learning: causal inference for observational and experimental studies. New York: Springer, 2012.
57. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. Available at: http://www.R-project.org/
58. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York: Springer, 1996.
59. van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol 2007;6:Article 25.
60. Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Community Health 2006;60:578–86.

## About the article

Published Online: 2014-10-14

Published in Print: 2015-03-01

Funding: National Institute of Allergy and Infectious Diseases, (Grant / Award Number: ‘R01 AI074345-06’)

Citation Information: Journal of Causal Inference, Volume 3, Issue 1, Pages 61–95, ISSN (Online) 2193-3685, ISSN (Print) 2193-3677.

©2015 by De Gruyter.