# Journal of Causal Inference

Ed. by Imai, Kosuke / Pearl, Judea / Petersen, Maya Liv / Sekhon, Jasjeet / van der Laan, Mark J.

2 Issues per year

Online ISSN: 2193-3685
Volume 2, Issue 2

# Targeted Maximum Likelihood Estimation for Dynamic and Static Longitudinal Marginal Structural Working Models

Maya Petersen
/ Joshua Schwab
/ Susan Gruber
/ Nello Blaser
/ Michael Schomaker
/ Mark van der Laan
Published Online: 2014-06-18 | DOI: https://doi.org/10.1515/jci-2013-0007

## Abstract

This paper describes a targeted maximum likelihood estimator (TMLE) for the parameters of longitudinal static and dynamic marginal structural models. We consider a longitudinal data structure consisting of baseline covariates, time-dependent intervention nodes, intermediate time-dependent covariates, and a possibly time-dependent outcome. The intervention nodes at each time point can include a binary treatment as well as a right-censoring indicator. Given a class of dynamic or static interventions, a marginal structural model is used to model the mean of the intervention-specific counterfactual outcome as a function of the intervention, time point, and possibly a subset of baseline covariates. Because the true shape of this function is rarely known, the marginal structural model is used as a working model. The causal quantity of interest is defined as the projection of the true function onto this working model. Iterated conditional expectation double robust estimators for marginal structural model parameters were previously proposed by Robins (2000, 2002) and Bang and Robins (2005). Here we build on this work and present a pooled TMLE for the parameters of marginal structural working models. We compare this pooled estimator to a stratified TMLE (Schnitzer et al. 2014) that is based on estimating the intervention-specific mean separately for each intervention of interest. The performance of the pooled TMLE is compared to the performance of the stratified TMLE and the performance of inverse probability weighted (IPW) estimators using simulations. Concepts are illustrated using an example in which the aim is to estimate the causal effect of delayed switch following immunological failure of first line antiretroviral therapy among HIV-infected patients. Data from the International Epidemiological Databases to Evaluate AIDS, Southern Africa are analyzed to investigate this question using both TML and IPW estimators. 
Our results demonstrate practical advantages of the pooled TMLE over an IPW estimator for working marginal structural models for survival, as well as cases in which the pooled TMLE is superior to its stratified counterpart.

This article offers supplementary material which is provided at the end of the article.

## 1 Introduction

Many studies aim to learn about the causal effects of longitudinal exposures or interventions using data in which these exposures are not randomly assigned. Specifically, consider a study in which baseline covariates, time-varying exposures or treatments, time-varying covariates, and an outcome of interest, such as death, are observed on a sample of subjects followed over time. The exposures of interest can both depend on past covariates and affect future covariates, as well as the outcome. Censoring may also occur, possibly in response to past treatment and covariates. Such data structures are ubiquitous in observational cohort studies. For example, a sample of HIV-infected patients might be followed longitudinally in clinic and data collected on antiretroviral prescriptions, determinants of prescription decisions including CD4+ T cell counts and plasma HIV RNA levels (viral loads), and vital status. Such data structures also occur in randomized trials when the exposure of interest is (non-random) compliance with a randomized exposure or includes non-randomized mediators of an exposure’s effect.

The causal effects of longitudinal exposures (as well as the effects of single time point exposures when the outcome is subject to censoring) can be formally defined by contrasting the distribution of a counterfactual outcome under different “interventions” to set the values of the exposure and censoring variables. For example, the counterfactual survival curve of HIV-infected subjects following immunological failure of antiretroviral therapy might be contrasted under a hypothetical intervention in which all subjects were switched immediately to a new antiretroviral regimen versus an intervention in which all subjects remained on their failing therapy [1]. In the presence of censoring due to losses to follow up, these counterfactuals of interest might be defined under a further intervention to prevent censoring. Interventions such as these, under which all subjects in a population are deterministically assigned the same vector of exposure and censoring decisions (for example, do not switch and remain under follow up) are referred to as “static regimes.”

More generally, counterfactuals can be defined under interventions that assign a treatment or exposure level to each subject at each time point based on that subject’s observed past. For example, counterfactual survival might be compared under interventions to switch all patients to second line antiretroviral therapy the first time their CD4+ T cell count crosses a certain threshold, for some specified set of thresholds [2]. Such subject-responsive treatment strategies have been referred to as individualized treatment rules, adaptive treatment strategies, or “dynamic regimes” (see, for example, Robins [3]; Murphy et al. [4]; Hernan et al. [5]). Additional examples include strategies for deciding when to start antiretroviral therapy [6, 7] and strategies for modifying dose or drug choice based on prior response and adverse effects. Investigation of the effects of such dynamic regimes makes it possible to learn effective strategies for assigning an intervention based on a subject’s past and is thus relevant to any discipline that seeks to learn how best to use past information to make decisions that will optimize future outcomes.

The static and dynamic regimes described above are longitudinal – they involve interventions to set the value of multiple treatment and censoring variables over time. For example, counterfactual survival under no switch to second line therapy corresponds to a subject’s survival under an intervention to prevent a patient from switching at each time point from immunologic failure until death or the end of the study. A time-dependent causal dose–response curve, which plots the mean of the intervention-specific counterfactual outcome at time t as a function of the interventions through time t, can be used to summarize the effects of these longitudinal interventions. For example, a plot of the counterfactual survival probability as a function of time since immunologic failure, for a range of alternative CD4+ T cell count thresholds used to initiate a switch captures the effect of alternative switching strategies on survival.

Formal causal frameworks provide a tool to establish the conditions under which such causal dose–response curves can be identified from the observed data. Longitudinal static and dynamic regimes are often subject to time-dependent confounding – time-varying variables may confound the effect of future treatments while being affected by past treatment [8]. Traditional approaches to the identification of point treatment effects, which are based on selection of a single set of covariates for regression or stratification-based adjustment, break down when such time-dependent confounding is present. However, the mean counterfactual outcome under a longitudinal static or dynamic regime may still be identified under the appropriate sequential randomization and positivity assumptions (reviewed in Robins and Hernan [9]).

Under these assumptions, causal dose–response curves can be estimated by generating separate estimates of the mean counterfactual outcome for each time point and intervention (or regime) of interest. For example, one could generate separate estimates of the counterfactual survival curve for each CD4-based threshold for switching to second line therapy. In this manner, one obtains fits of the time-dependent causal dose–response curve for each of a range of possible thresholds, which together summarize how the mean counterfactual outcome at time t depends on the choice of threshold.

A number of estimators can be used to estimate intervention-specific mean counterfactual outcomes. These include inverse probability weighted (IPW) estimators (for example, [3, 5, 10]), “G-computation” estimators (typically based on parametric maximum likelihood estimation of the non-intervention components of the data generating process) (for example, [7, 11, 12]), augmented-IPW estimators (for example, [13–16, 31]), and targeted maximum likelihood (or minimum loss) estimators (TMLEs) (for example, [17, 18]). In particular, van der Laan and Gruber [19] combine the targeted maximum likelihood framework [20, 21] with important insights and the iterated conditional expectation estimators established in Robins [3, 29] and Bang and Robins [22].

Both the theoretical validity and the practical utility of these estimators rely, however, on reasonable support for each of the interventions of interest, both in the true data generating distribution and in the sample available for analysis. For example, in order to estimate how survival is affected by the threshold CD4 count used to initiate an antiretroviral treatment switch, a reasonable number of subjects must in fact switch at the time indicated by each threshold of interest. Without such support, estimators of the intervention-specific outcome will be ill-defined or extremely variable. Although one might respond to this challenge by creating coarsened versions of the desired regimes, so that sufficient subjects follow each coarsened version, such a method introduces bias and leaves open the question of how to choose an optimal degree of coarsening.

Since adequate support for every intervention of interest is often not available, Robins [23] introduced marginal structural models (MSMs) that pose parametric or small semiparametric models for the counterfactual conditional mean outcome as a function of the choice of intervention and time. For example, static MSMs have been used to summarize how the counterfactual hazard of death varies as a function of when antiretroviral therapy is initiated [24] and when an antiretroviral regimen is switched [25]. The extrapolation assumptions implicitly defined by non-saturated MSMs make it possible to estimate the coefficients of the model, and thereby the causal dose–response curve, even when few or no subjects follow some interventions of interest.

While MSMs were originally developed for static interventions [8, 10, 23, 24] they naturally generalize to classes of dynamic (or even more generally, stochastic) interventions as shown in van der Laan and Petersen [2] and Robins et al. [26]. Dynamic MSMs have been used, for example, to investigate how counterfactual hazard of death varies as a function of CD4+ T cell count threshold used to initiate antiretroviral therapy [6] or to switch antiretroviral therapy regimens [2]. Because the true shape of the causal dose–response curve is typically unknown, we have suggested that MSMs be used as working models. The target causal coefficients can then be defined by projecting the true causal dose–response curve onto this working model [20, 27].

The coefficients of both static and dynamic MSMs are frequently estimated using IPW estimators [2, 8, 10, 26]. These estimators have a number of attractive qualities: they can be intuitively understood, they are easy to implement, and they provide an influence curve-based approach to standard error estimation. However, IPW estimators also have substantial shortcomings. In particular, they are biased if the treatment mechanism used to construct the weights is estimated poorly (for example, using a misspecified parametric model). Further, IPW estimators are unstable in settings of strong confounding (near or partial positivity violations) and the resulting bias in both point and standard error estimates can result in poor inference (for a review of this issue see Petersen et al. [28]). Dynamic MSMs can exacerbate this problem, as the options for effective weight stabilization are limited [6, 26].

Asymptotically efficient and double robust augmented-IPW estimators of the estimand corresponding to longitudinal static MSM parameters were developed by Robins and Rotnitzky [14], Robins [13], Robins et al. [16]. These estimators are defined as a solution of an estimating equation, and as a result may be unstable due to failure to respect the global constraints implied by the model and the parameter. Robins [13, 29] and Bang and Robins [22] introduced an alternative double robust estimating equation-based estimator of longitudinal MSM parameters based on the key insight that both the statistical target parameter and the corresponding augmented-IPW estimating function (efficient influence curve) for MSMs on the intervention-specific mean can be represented as a series of iterated conditional expectations. In addition, they proposed a targeted sequential regression method to estimate the nuisance parameters of the augmented-IPW estimating equation. This innovative idea allowed construction of a double robust estimator that relies only on estimation of minimal nuisance parameters beyond the treatment mechanism.

In this paper, we describe a double robust substitution estimator of the parameters of a longitudinal marginal structural working model. The estimator presented incorporates the key insights and prior estimator of Robins [13, 29] and Bang and Robins [22] into the TMLE framework. Specifically, we expand on this prior work in several ways. We propose a TMLE for marginal structural working models for longitudinal dynamic regimes, possibly conditional on pre-treatment covariates. The TMLE described is defined as a substitution estimator rather than as a solution to an estimating equation, and it incorporates data-adaptive/machine learning methods in generating initial fits of the sequential regressions. Finally, we further generalize the TMLE to apply to a larger class of parameters defined as arbitrary functions of intervention-specific means across a user-supplied class of interventions.

TMLEs for the parameters of an MSM for “point treatment” problems, in which adjustment for a single set of covariates known not to be affected by the intervention of interest is sufficient to control for confounding, including history-adjusted MSMs, have been previously described [30, 31]. However, the parameter of a longitudinal MSM on the intervention-specific mean under sequential interventions subject to time-dependent confounding is identified as a distinct, and substantially more complex, estimand than the estimand corresponding to a point treatment MSM, and thus requires distinct estimators. An alternative TMLE for longitudinal static MSMs, which we refer to as a stratified TMLE, was described by Schnitzer et al. [32]. The stratified TMLE uses the longitudinal TMLE of van der Laan and Gruber [19] for the intervention-specific mean to estimate the mean outcome under each of a set of static treatments and combines these estimates into a fit of the coefficients of a static longitudinal MSM on both survival and hazard functions. The stratified TMLE [32] resulted in substantially lower standard error estimates than an IPW estimator in an applied data analysis and naturally generalizes to dynamic MSMs. However, it remains vulnerable when there is insufficient support for some interventions of interest. In contrast, the TMLE we describe here pools over the set of dynamic or static interventions of interest, and optionally over time, when updating initial fits of the likelihood. It thus substantially relaxes the degree of data support required to remain an efficient double robust substitution estimator.

In summary, a large class of causal questions can be formally defined using static and dynamic longitudinal MSMs, and the parameters of these models can be identified from non-randomized data under well-studied assumptions. This article describes a TMLE that builds on the work of Robins [13, 29] and Bang and Robins [22] in order to directly target the coefficients of a marginal structural (working) model for a user-supplied class of longitudinal static or dynamic interventions. The theoretical properties of the pooled TMLE are presented, its implementation is reviewed, and its practical performance is compared to alternatives using both simulated and real data. R code [33] implementing the estimator and evaluating it in simulations is provided in online supplementary materials and as an open source R library ltmle [34].

## 1.1 Organization of paper

In Section 2, we define the observed data and a statistical model for its distribution. We then specify a non-parametric structural equation model for the process assumed to generate the observed data. We define counterfactual outcomes over time based on static or dynamic interventions on multiple treatment and censoring nodes in this system of structural equations. Our target causal quantity is defined using a marginal structural working model on the mean of these intervention-specific counterfactual outcomes at time t. The general case we present includes marginal structural working models on both the counterfactual survival and the hazard. We briefly review the assumptions under which this causal quantity is identified as a parameter of the observed data distribution. The statistical estimation problem is thus defined in terms of the statistical model and statistical target parameter.

Section 3 presents the TMLE defined by (a) representation of the statistical target parameter in terms of an iteratively defined set of conditional mean outcomes, (b) an initial estimator for the intervention mechanism and for these conditional means, (c) a submodel through this initial estimator and a loss function chosen so that the generalized score of the submodel with respect to this loss spans the efficient influence curve, (d) a corresponding updating algorithm that updates the initial estimator and iterates the updating till convergence, and (e) final evaluation of the TMLE as a plug-in estimator. We also present corresponding influence curve-based confidence intervals for our target parameter.

Section 4 illustrates the results presented in Section 3 using a simple three time point example and focusing on a marginal structural working model for counterfactual survival probability over time. This example is used to clarify understanding of notation and to provide a step-by-step overview of implementation of the pooled TMLE.

Section 5 compares the pooled TMLE described in this paper with alternative estimators for the parameters of longitudinal dynamic MSMs for survival. We provide a brief overview of the stratified TMLE [32], discuss scenarios in which each estimator may be expected to offer superior performance, and illustrate the breakdown of the stratified TMLE in a finite sample setting in which some interventions of interest have no support. As IPW estimators are currently the most common approach used to fit longitudinal dynamic MSMs, we also discuss two IPW estimators for these parameters.

Section 6 presents a simulation study in which the pooled TMLE is implemented for a marginal structural working model for survival at time t. Its performance is compared to IPW estimators and to the stratified TMLE for a simple data generating process and in a simulation designed to be similar to the data analysis presented in the following section, which includes time-dependent confounding and right censoring.

Section 7 presents the results of a data analysis investigating the effect of switching to second line therapy following immunologic failure of first line therapy using data from HIV-infected patients in the International Epidemiological Databases to Evaluate AIDS (IeDEA), Southern Africa. Throughout the paper, we illustrate notation and concepts using a simplified data structure based on this example.

Appendices contain a derivation of the efficient influence curve, further simulation details, an alternative TMLE, and a reference table for notation. In online supplementary files, we present R code that implements the pooled TMLE, the stratified TMLE, and two IPW estimators for a marginal structural working model of survival. A corresponding publicly available R-package, ltmle, was released in May 2013 (http://cran.r-project.org/web/packages/ltmle/).

## 2 Definition of statistical estimation problem

Consider a longitudinal study in which the observed data structure O on a randomly sampled subject is coded as $O=\left(L(0),A(0),\dots ,L(K),A(K),L(K+1)\right),$ where $L(0)$ are baseline covariates, $A(t)$ denotes an intervention node at time t, and $L(t)$ denotes covariates measured between intervention nodes $A(t-1)$ and $A(t)$. Assume that there is an outcome process $Y(t)\subseteq L(t)$ for $t=1,\dots ,K+1$, where $L(K+1)=Y(K+1)$ is the final outcome measured after the final treatment $A(K)$. The intervention node $A(t)=\left(A_{1}(t),A_{2}(t)\right)$ has a treatment node $A_{1}(t)$ and a censoring indicator $A_{2}(t)$, where $A_{2}(t)=1$ indicates that the subject is right censored by time t. We observe n independent and identically distributed (i.i.d.) copies $O_{1},\dots ,O_{n}$ of O, and we will denote the probability distribution of O with $P_{O,0}$, or more simply, as $P_{0}$. Throughout, we use subscript 0 to denote the true distribution.

Running example. Here and in subsequent sections, we illustrate notation using an example in which n i.i.d. HIV-infected subjects with immunological failure on first line therapy are sampled from some target population. Here $t=0$ denotes time of immunological failure. $L(t)$ denotes time-varying covariates, and includes CD4+ T cell count at time t and $Y(t)$, an indicator of death by time t. In addition to baseline values of these time-varying covariates, $L(0)$ includes non-time-varying covariates such as sex. The intervention nodes of interest are $A(t),t=0,\dots ,K$, where $A(t)$ is defined as an indicator of switch to second line therapy by time t; in our simplified example, we assume no right censoring. For notational convenience, after a subject dies all variables for that subject are defined as equal to their last observed value.
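As a rough illustration of this data structure, the following Python sketch stores one subject's record from the running example and applies the carry-forward convention after death. The `Subject` class, its field names, and `carry_forward_after_death` are hypothetical names introduced here for illustration; they are not taken from the paper's supplementary code or from the ltmle package.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subject:
    """One observation O = (L(0), A(0), ..., L(K), A(K), L(K+1)), no censoring."""
    sex: int              # non-time-varying baseline covariate, part of L(0)
    cd4: List[float]      # CD4 count at t = 0, ..., K+1
    died: List[int]       # Y(t): death by time t, t = 0, ..., K+1
    switched: List[int]   # A(t): switched to second line by time t, t = 0, ..., K


def carry_forward_after_death(s: Subject) -> Subject:
    """After death, set every later variable to its last observed value."""
    cd4, died, switched = list(s.cd4), list(s.died), list(s.switched)
    for t in range(1, len(died)):
        if died[t - 1] == 1:          # dead before L(t) would be measured
            died[t], cd4[t] = 1, cd4[t - 1]
    for t in range(1, len(switched)):
        if died[t] == 1:              # A(t) follows L(t) in the time ordering
            switched[t] = switched[t - 1]
    return Subject(s.sex, cd4, died, switched)


# Hypothetical record with K = 2: the subject dies by t = 1, and the raw
# entries after death are placeholders that the convention overwrites.
raw = Subject(sex=1, cd4=[200.0, 150.0, 0.0, 0.0], died=[0, 1, 0, 0], switched=[0, 0, 1])
clean = carry_forward_after_death(raw)
```

The sketch assumes every subject is alive at $t=0$, so that a previous value always exists to carry forward.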

## 2.1 Statistical model

We use the notation $\bar{L}(k)=\left(L(0),\dots ,L(k)\right)$ to denote the history of time-dependent variable L from $t=0,\dots ,k$. Define the “parents” of a variable $L(k)$, denoted $Pa(L(k))$, as those variables that precede $L(k)$ (i.e., $Pa(L(k))=\bar{L}(k-1),\bar{A}(k-1)$). Similarly, $\bar{A}(k)=\left(A(0),\dots ,A(k)\right)$ is used to denote the history of the intervention process and $Pa(A(k))$ to denote a specified subset of the variables that precede $A(k)$ such that the distribution of $A(k)$ given the whole past is equal to the distribution of $A(k)$ given its parents ($Pa(A(k))\subseteq \bar{L}(k),\bar{A}(k-1)$). Under our causal model, which we introduce below, these parent sets $Pa(L(k))$ and $Pa(A(k))$ correspond to the set of variables that may affect the values taken by $L(k)$ and $A(k)$, respectively.

We use $Q_{L(k),0}$ to denote the conditional distribution of $L(k)$, given $Pa(L(k))$, and $g_{A(k),0}$ to denote the conditional distribution of $A(k)$, given $Pa(A(k))$. We also use the notation ${g}_{0:k}\equiv {\prod }_{j=0}^{k}{g}_{A(j),0}$, ${g}_{0}\equiv {g}_{0:K}$ and define ${Q}_{0:k}\equiv {\prod }_{j=0}^{k}{Q}_{L(j),0}$ and ${Q}_{0}\equiv {Q}_{0:K+1}$. In our example, $Q_{L(k),0}$ denotes the joint conditional distribution of CD4 count and death at time k, given the observed past (including past CD4 count and switching history), and $g_{A(k),0}$ denotes the conditional probability of having switched to second line by time k given the observed past (deterministically equal to one for those time points at which a subject has already switched).
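To make the cumulative product $g_{0:k}$ concrete, the sketch below evaluates it for one subject in the running example, where the node $A(k)$ is an indicator of having switched by time k. The function names and the per-time switch probabilities are hypothetical; in practice the probabilities would come from a fitted model of the treatment mechanism.

```python
def switch_factor(a_k, already_switched, p_switch):
    """g_{A(k)}(O): probability of the observed A(k) given the observed past.

    Once a subject has switched, A(k) = 1 deterministically, so the factor is 1.
    """
    if already_switched:
        return 1.0 if a_k == 1 else 0.0
    return p_switch if a_k == 1 else 1.0 - p_switch


def cumulative_g(switched, p_switch):
    """Return [g_{0:0}, g_{0:1}, ..., g_{0:K}] for one subject."""
    out, prod = [], 1.0
    for k, a_k in enumerate(switched):
        already = k > 0 and switched[k - 1] == 1
        prod *= switch_factor(a_k, already, p_switch[k])
        out.append(prod)
    return out


# Hypothetical subject who first switches at t = 1. The third probability is
# never used because the subject has already switched by then.
g_cum = cumulative_g(switched=[0, 1, 1], p_switch=[0.2, 0.3, 0.9])
```

Here the factors are 0.8, 0.3 and 1, so the cumulative products are approximately 0.8, 0.24 and 0.24. The reciprocals of products like these are what IPW estimators use as weights.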

The probability distribution $P_{0}$ of O can be factorized according to the time-ordering as $\begin{aligned}P_{0}(O)&=\prod_{k=0}^{K+1}P_{0}\left(L(k)\mid Pa(L(k))\right)\prod_{k=0}^{K}P_{0}\left(A(k)\mid Pa(A(k))\right)\\ &=\prod_{k=0}^{K+1}Q_{L(k),0}(O)\prod_{k=0}^{K}g_{A(k),0}(O)\\ &=Q_{0}(O)\,g_{0}(O).\end{aligned}$ We consider a statistical model $\mathcal{M}$ for $P_{0}$ that possibly assumes knowledge of the intervention mechanism $g_{0}$. For example, the treatment of interest, such as switch time, may be known to be randomized, or to be assigned based on only a subset of the observed past. If $\mathcal{Q}$ is the set of all values for $Q_{0}$ and $\mathcal{G}$ the set of possible values of $g_{0}$, then this statistical model can be represented as $\mathcal{M}=\left\{P=Q\,g:Q\in \mathcal{Q},g\in \mathcal{G}\right\}$. In this statistical model, $\mathcal{Q}$ puts no restrictions on the conditional distributions $Q_{L(k),0}$, $k=0,\dots ,K+1$.

## 2.2 Causal model and counterfactuals of interest

By specifying a structural causal model [35, 36] or, equivalently, a system of non-parametric structural equations, it is assumed that each component of the observed longitudinal data structure (e.g. $A(k)$ or $L(k)$) is a function of a set of observed parent variables and an unmeasured exogenous error term. Specifically, consider the non-parametric structural equation model (NPSEM) defined by $L(k)=f_{L(k)}\left(Pa(L(k)),U_{L(k)}\right),\quad k=0,\dots ,K+1,$ and $A(k)=f_{A(k)}\left(Pa(A(k)),U_{A(k)}\right),\quad k=0,\dots ,K,$ in terms of a set of deterministic functions $\left(f_{L(k)}:k=0,\dots ,K+1\right),\left(f_{A(k)}:k=0,\dots ,K\right)$, and a vector of unmeasured random errors or background factors $U=\left(U_{L(0)},\dots ,U_{L(K+1)},U_{A(0)},\dots ,U_{A(K)}\right)$ [35, 36].

To continue our HIV example, we might specify a causal model in which time-varying CD4 count, death, and the decision to switch each potentially depend on a subject’s entire observed past, as well as on unmeasured factors. Alternatively, if we knew that switching decisions were made only in response to a subject’s most recent CD4 count and baseline covariates, the parent set of $A(k)$ could be restricted to exclude earlier CD4 count values.

This causal model represents a model $\mathcal{M}^{F}$ for the distribution of $(O,U)$ and provides a parameterization of the distribution of the observed data structure O in terms of the distribution of the random variables $(O,U)$ modeled by the system of structural equations. Let $P_{O,U,0}$ denote the latter distribution. The causal model $\mathcal{M}^{F}$ encodes knowledge about the process, including both measured and unmeasured variables, that generated the observed data. It also implies a model for the distribution of counterfactual random variables under specific interventions on (or changes to) the observed data generating process. Specifically, a post-intervention (or counterfactual) distribution is defined as the distribution that O would have had under a specified intervention to set the value of the intervention nodes $\bar{A}=\left(A(0),\dots ,A(K)\right)$.

The intervention of interest might be static, with $f_{A(k)},k=0,\dots ,K$ replaced by some constant for all subjects. For example, an intervention to set $A(k)=0$ for $k=0,\dots ,K$ corresponds to a static intervention to delay switching indefinitely for all subjects. Alternatively, the intervention might be dynamic, with $f_{A(k)},k=0,\dots ,K$ replaced by some specified function $d_{k}\left(\bar{L}(k)\right)$ of a subject’s observed covariates. For example, an intervention could set $A(k)$ to 1 the first time a subject’s CD4 count drops below some threshold. As static regimes are a special case of dynamic regimes, in the following sections we define the statistical estimation problem and develop our estimator for the more general dynamic case. Throughout, we use “rule” to refer to a specific intervention, static or dynamic, that sets the values of $\bar{A}$.
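The distinction between static and dynamic rules can be sketched in a few lines of Python. The example below is illustrative only: the function names and the CD4 values are hypothetical, and the rule is written for the running example in which $A(k)$ indicates having switched by time k, so a subject who has switched stays switched.

```python
def never_switch(k, cd4_history):
    """Static rule: set A(k) = 0 at every time point for every subject."""
    return 0


def make_threshold_rule(threshold):
    """Dynamic rule d_k: switch the first time CD4 drops below `threshold`."""
    def rule(k, cd4_history):
        # Only CD4 counts at times 0..k may be used, i.e. the part of the
        # observed covariate history available when A(k) is assigned.
        return int(any(c < threshold for c in cd4_history[: k + 1]))
    return rule


def follows_rule(rule, switched, cd4_history):
    """True if the observed treatment history agrees with the rule at every k."""
    return all(a == rule(k, cd4_history) for k, a in enumerate(switched))


# Hypothetical subject with K = 2: CD4 at t = 0..3 and switching at t = 0..2.
cd4 = [250.0, 180.0, 300.0, 320.0]
switched = [0, 1, 1]
d200 = make_threshold_rule(200.0)
assignments = [d200(k, cd4) for k in range(3)]
```

For this subject the threshold-200 rule assigns no switch at time 0 and switch from time 1 onward, which matches the observed history, whereas the static never-switch rule does not. Counting how many subjects follow each rule is one simple way to gauge the data support discussed in the Introduction.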

Given a rule d, the counterfactual random variable $L_{d}=\left(L(0),L_{d}(1),\dots ,L_{d}(K+1)\right)$ is defined by deterministically setting all the $A(k)$ nodes equal to $d_{k}\left(\bar{L}(k)\right)$ in the system of structural equations. The probability distribution of this counterfactual $L_{d}$ is called the post-intervention or counterfactual distribution of L and is denoted with $P_{d,0}$. Causal effects are defined as parameters of a collection of post-intervention distributions under a specified set of rules. For example, we might compare mean counterfactual survival over time under a range of possible switch times.
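The definition of a counterfactual as the output of the structural equations with $f_{A(k)}$ replaced by a rule can be illustrated with a small simulation. Everything in the sketch below is a toy assumption made for illustration: the functional forms of `f_L` and `f_A`, the numeric constants, the choice of independent uniform errors, and all names. None of it comes from the paper's simulations.

```python
import random

K = 2  # hypothetical: intervention nodes A(0..K), final outcome Y(K+1)


def draw_errors(rng):
    """Exogenous errors U; independent uniforms are an assumption of this toy model."""
    return {"L": [(rng.random(), rng.random()) for _ in range(K + 2)],
            "A": [rng.random() for _ in range(K + 1)]}


def f_L(k, cd4, died, switched, u):
    """Toy equation for L(k) = (CD4(k), Y(k)) given its parents and error u."""
    u_cd4, u_death = u
    if k == 0:
        return 100.0 + 300.0 * u_cd4, 0        # baseline CD4; alive at t = 0
    if died[-1] == 1:
        return cd4[-1], 1                      # carry forward after death
    drift = 40.0 if switched[-1] == 1 else -30.0
    new_cd4 = max(cd4[-1] + drift + 40.0 * (u_cd4 - 0.5), 0.0)
    p_death = 0.25 if new_cd4 < 150.0 else 0.05
    return new_cd4, int(u_death < p_death)


def f_A(k, cd4, died, switched, u):
    """Toy equation for the observed switching decision A(k)."""
    if died[-1] == 1:
        return switched[-1]                    # carry forward after death
    if switched and switched[-1] == 1:
        return 1                               # "switched by time k" is absorbing
    return int(u < (0.5 if cd4[-1] < 200.0 else 0.1))


def generate(u, rule=None):
    """Evaluate the equations in time order; a rule d_k, if given, replaces f_A."""
    cd4, died, switched = [], [], []
    for k in range(K + 2):
        c, y = f_L(k, cd4, died, switched, u["L"][k])
        cd4.append(c)
        died.append(y)
        if k <= K:
            if rule is None:
                a = f_A(k, cd4, died, switched, u["A"][k])
            else:
                a = rule(k, cd4)
            switched.append(a)
    return cd4, died, switched


def never_switch(k, cd4):
    return 0


def switch_now(k, cd4):
    return 1


# Monte Carlo means of Y_d(K+1) under two static rules, using the same errors U.
rng = random.Random(0)
errors = [draw_errors(rng) for _ in range(2000)]
mean_never = sum(generate(u, never_switch)[1][-1] for u in errors) / len(errors)
mean_now = sum(generate(u, switch_now)[1][-1] for u in errors) / len(errors)
```

Because both counterfactuals are generated from the same errors, the comparison isolates the effect of the rule. In this toy model switching raises CD4 and lowers the death probability by construction, so `mean_never` cannot be smaller than `mean_now`.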

## 2.3 Marginal structural working model

Our causal quantity of interest is defined using a marginal structural working model to summarize how the mean counterfactual outcome at time t varies as a function of the intervention rule d, time point t, and possibly some baseline covariate V that is a function of the collection of all baseline covariates $L\left(0\right)$. Specifically, given a class of dynamic treatment rules $\mathcal{D}$, we can define a true time-dependent causal dose–response curve $\left({E}_{{P}_{d,0}}\left({Y}_{d}\left(t\right)|V\right):d\in \mathcal{D},t\in \mathrm{\tau }\right)$ for some subset $\mathrm{\tau }\subseteq \left\{1,\dots ,K+1\right\}$. Note that choice of V (as well as choice of $\mathrm{\tau }$ and $\mathcal{D}$) depends on the scientific question of interest. In many cases V will be defined as the empty set. In other cases, it may be of interest to estimate how the causal dose–response curve varies depending on the value of some subset of baseline variables.

We specify a working model $\Theta \equiv \left\{m_{\beta}:\beta\right\}$ for this true time-dependent causal dose–response curve. Our causal quantity of interest is then defined as a projection of the true causal dose–response curve onto this working model, which yields the element $m_{\beta_0}$ of the working model that represents this projection. For example, if $Y(t)\in[0,1]$, we may use a logistic working model $\mathrm{Logit}\,m_{\beta}(d,t,V)=\sum_{j=1}^{J}\beta_j\varphi_j(d,t,V)$ for a set of basis functions $\varphi_j$, and define our causal quantity of interest as $\Psi^{F}(P_{O,U,0})=\text{argmin}_{\beta}-E_0\sum_{t\in\tau}\sum_{d\in\mathcal{D}}h(d,t,V)\left\{Y_d(t)\log m_{\beta}(d,t,V)+\left(1-Y_d(t)\right)\log\left(1-m_{\beta}(d,t,V)\right)\right\},$ where $h(d,t,V)$ is a user-specified weight function. We discuss choice of $h(d,t,V)$ further below.

Such a $\Psi_0^{F}=\beta_0$ solves the equation $0=E_0\sum_{t\in\tau}\sum_{d\in\mathcal{D}}h(d,t,V)\frac{\frac{d}{d\beta_0}m_{\beta_0}(d,t,V)}{m_{\beta_0}(1-m_{\beta_0})}\left(E_0\left(Y_d(t)\mid V\right)-m_{\beta_0}(d,t,V)\right).$ Because V is a function of $L(0)$, the tower property of conditional expectation allows this equation to be replaced by $0=E_0\sum_{t\in\tau}\sum_{d\in\mathcal{D}}h(d,t,V)\frac{\frac{d}{d\beta_0}m_{\beta_0}(d,t,V)}{m_{\beta_0}(1-m_{\beta_0})}\left(E_0\left(Y_d(t)\mid L(0)\right)-m_{\beta_0}(d,t,V)\right),$ which corresponds with $\Psi^{F}(P_{O,U,0})=\text{argmin}_{\beta}-E_0\sum_{t\in\tau}\sum_{d\in\mathcal{D}}h(d,t,V)\left\{E_0\left(Y_d(t)\mid L(0)\right)\log m_{\beta}(d,t,V)+\left(1-E_0\left(Y_d(t)\mid L(0)\right)\right)\log\left(1-m_{\beta}(d,t,V)\right)\right\}.$ For the logistic working model we have that $\frac{\frac{d}{d\beta_0}m_{\beta_0}(d,t,V)}{m_{\beta_0}(1-m_{\beta_0})}=\left(\varphi_j(d,t,V):j=1,\dots,J\right)$.

To be completely general, we will define our causal quantity of interest as a function f of $E\left({Y}_{d}\left(t\right)|L\left(0\right)\right)$ across $d\in \mathcal{D},t\in \mathrm{\tau }$ and the distribution ${Q}_{L\left(0\right)}$ of $L\left(0\right)$. Thus we define ${\mathrm{\Psi }}^{F}\left({P}_{O,U,0}\right)=f\left(\left({E}_{0}\left({Y}_{d}\left(t\right)|L\left(0\right)\right):d\in \mathcal{D},t\in \mathrm{\tau }\right),{Q}_{L\left(0\right),0}\right).$In addition to including the above example, this general formulation allows us to include marginal structural working models on continuous outcomes and on the intervention-specific hazard.

## 2.3.1 Choice of a weight function $h\left(d,t,V\right)$

Unless one is willing to assume that the MSM ${m}_{\mathrm{\beta }}$ is correctly specified, choice of the weight function changes the target quantity being estimated. Choice of the weight function should thus be guided by the motivating scientific question. For example, the simple weight function $h\left(d,t,V\right)=1$ gives equal weight to all time points and switch times. Alternatively, choice of a weight function equal to the marginal probability of following rule d through time t within strata of V gives greater weight to those rule, time, and baseline strata combinations with more support in the data, and zero weight to values $\left(d,t,v\right)$ without support. As discussed further below, choice of a weight function can thus also affect both identifiability of the target parameter and the asymptotic and finite sample properties of IPW and TML estimators.
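To make the second choice concrete, the following sketch computes an empirical version of the marginal probability of following a static switch-time rule d through time t, with V empty. It is a minimal illustration under stated assumptions rather than the implementation used in this paper: the function names `follows` and `empirical_weight`, the coding of a switch time as an integer (with `None` for "never switch"), and the four toy observations are all hypothetical, and $A(k)$ is taken to equal 1 once a subject has switched by time k.

```python
def follows(observed_switch, rule_switch, t):
    """True if the observed A(0), ..., A(t-1) equal the rule's assignment."""
    for k in range(t):
        # A(k) = 1 once the switch has occurred by time k (assumed coding).
        a_obs = int(observed_switch is not None and observed_switch <= k)
        a_rule = int(rule_switch is not None and rule_switch <= k)
        if a_obs != a_rule:
            return False
    return True


def empirical_weight(observed_switches, rule_switch, t):
    """Fraction of subjects whose observed history follows the rule through t."""
    n = len(observed_switches)
    return sum(follows(s, rule_switch, t) for s in observed_switches) / n


# Hypothetical observed switch times for four subjects.
observed = [0, 1, None, None]
h_never_t1 = empirical_weight(observed, None, 1)   # 0.75
h_never_t2 = empirical_weight(observed, None, 2)   # 0.5
h_switch2_t3 = empirical_weight(observed, 2, 3)    # 0.0: no support in the data
```

In this toy data no subject switches at time 2, so the combination (switch at 2, t = 3) receives weight zero and drops out of the projection, which is the behavior described above for values $\left(d,t,v\right)$ without support.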

## 2.3.2 Running example

Continuing our HIV example, recall that static regimes are a special case of dynamic regimes and define the set of treatment rules of interest $\mathcal{D}$ as the set of possible switch times (switch at time 0, switch at time 1, $\dots$, never switch). We might focus on the marginal counterfactual survival curves under a range of switch times (with V defined as the empty set). Alternatively, we might investigate how survival under a specific switch time differs among subjects that have a CD4+ T cell count $<50$ versus $\ge 50$ cells/$\mathrm{\mu }\mathrm{l}$ at time of failure ($V=I\left(CD4\left(0\right)<50\right)$ where $CD4\left(0\right)\subset L\left(0\right)$). For simplicity, for the remainder of the paper we use a running example in which we avoid conditioning on baseline covariates (i.e. $V=\left\{\right\}$).

The true time-dependent causal dose–response curve $\left(E_0\left(Y_d(t)\right):d\in\mathcal{D},t\in 1,\dots,K+1\right)$ corresponds to the set of counterfactual survival curves (through time $K+1$) that would have been observed for the population as a whole under each possible switch time. In this example, each rule d implies a single vector $\bar{a}$; we use $d(t)$ to refer to the value $a(t)$ implied by rule d and $s_d$ to refer to the switch time assigned by rule d. One might then specify the following marginal structural working model to summarize how the counterfactual probability of death by time t varies as a function of t and assigned switch time: $\mathrm{Logit}\,m_{\beta}(d,t)=\beta_0+\beta_1 t+\beta_2\left(d(t-1)(t-s_d)\right),$ where $d(t-1)(t-s_d)$ is time since switch for subjects who have switched by time $t-1$, and otherwise 0.
For simplicity, we choose $h(d,t)=1$ and define the target causal quantity of interest as the projection of $\left(E_0\left(Y_d(t)\right):d\in\mathcal{D},t\in 1,\dots,K+1\right)$ onto $m_{\beta}(d,t)$ according to $\Psi^{F}(P_{O,U,0})=\text{argmin}_{\beta}-E_0\sum_{t\in\tau}\sum_{d\in\mathcal{D}}\left\{Y_d(t)\log m_{\beta}(d,t)+\left(1-Y_d(t)\right)\log\left(1-m_{\beta}(d,t)\right)\right\}.$ (1)
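The projection in eq. (1) can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' software: it takes a hypothetical dose–response curve (generated here from the working model itself, with made-up coefficients and $K+1=3$), and minimizes the cross-entropy risk with $h(d,t)=1$ by Newton iterations. The helper names `basis`, `solve`, and `project` are invented for this example.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def basis(t, s):
    """(1, t, time since switch); s is the rule's switch time, None = never."""
    since = t - s if (s is not None and s <= t - 1) else 0
    return [1.0, float(t), float(since)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def project(curve, n_iter=30):
    """Newton iterations for the minimizer of the risk in eq. (1), h(d, t) = 1.

    curve maps (switch time, t) to the mean counterfactual outcome.
    """
    beta = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for (s, t), p in curve.items():
            phi = basis(t, s)
            mu = expit(sum(b * x for b, x in zip(beta, phi)))
            for i in range(3):
                grad[i] += phi[i] * (p - mu)
                for j in range(3):
                    hess[i][j] += mu * (1.0 - mu) * phi[i] * phi[j]
        beta = [b + d for b, d in zip(beta, solve(hess, grad))]
    return beta


# Hypothetical "true" curve generated from the working model (assumed values).
beta_true = [-2.0, 0.3, -0.2]
rules = [0, 1, 2, None]
curve = {(s, t): expit(sum(b * x for b, x in zip(beta_true, basis(t, s))))
         for s in rules for t in (1, 2, 3)}
beta_hat = project(curve)
```

Because the hypothetical curve lies inside the working model, the projection recovers `beta_true`. For a curve outside the model the same routine returns the best-approximating coefficients under the chosen weights.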

## 2.4 Identifiability and definition of statistical target parameter

We assume the sequential randomization assumption [11] $A\left(k\right)\coprod {L}_{d}\mid Pa\left(A\left(k\right)\right),\ k=0,\dots ,K$ (2) (noting that weaker identifiability assumptions are also possible; see, for example, Robins and Hernan [9]). In our HIV example, the plausibility of this assumption would be strengthened by measuring all determinants of the decision to switch to second line therapy that also affect mortality via pathways other than switch time.

We further assume positivity, informally an assumption of support for each rule of interest across covariate histories compatible with that rule of interest. Specifically, for each $d,t,V$ for which $h(d,t,V)\ne 0$, we assume $P_0\left(A(k)=d_k\left(\bar{L}(k)\right)\mid \bar{L}(k),\bar{A}(k-1)=\bar{d}\left(\bar{L}(k-1)\right)\right)>0,\ k=0,\dots,K,\ \text{almost everywhere}.$ (3) In our HIV example, in which $h(d,t)=1$, a subject who has not already switched should have some positive probability of both switching and not switching regardless of his covariate history. Under these assumptions, the counterfactual probability distribution of $L_d$ is identified from the true observed data distribution $P_0$ and given by the G-computation formula $P_0^{d}$ [11]: $P_0^{d}(l)=\prod_{k=0}^{K+1}Q_{L(k),0}^{d}\left(\bar{l}(k)\right),$ (4) where $Q_{L(k),0}^{d}\left(\bar{l}(k)\right)=Q_{L(k),0}\left(l(k)\mid \bar{l}(k-1),\bar{A}(k-1)=\bar{d}\left(\bar{L}(k-1)\right)\right)$. Thus this G-computation formula $P_0^{d}$ is defined by the product over all $L(k)$-nodes of the conditional distribution of the $L(k)$-node, given its parents, and given $\bar{A}(k-1)=\bar{d}\left(\bar{L}(k-1)\right)$.
If identifiability assumptions (2) and (3) hold for each rule $d\in \mathcal{D}$, then the time-dependent causal dose–response curve $\left({E}_{0}\left({Y}_{d}\left(t\right)|V\right):d\in \mathcal{D},t\in \mathrm{\tau }\right)$ is also identified from ${P}_{0}$ through the collection of G-computation formulas $\left({P}_{0}^{d}:d\in \mathcal{D}\right)$. For the remainder of the paper, we choose $\mathrm{\tau }=1,\dots ,K+1$ and at times suppress the index set $\mathrm{\tau }$.
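As a worked illustration of the G-computation formula (4), the sketch below evaluates the mean of the final outcome under a rule by direct summation over covariate histories. Everything in it is hypothetical and chosen only for illustration: a two-time-point structure ($K=1$) with binary $L(0)$, $L(1)$ and a binary final outcome, made-up conditional distributions, and the names `p_l1`, `mean_y`, and `g_computation`.

```python
from itertools import product

# Hypothetical data-generating distributions (assumed for illustration only).
P_L0 = {0: 0.6, 1: 0.4}


def p_l1(l1, l0, a0):
    """P(L(1) = l1 | L(0) = l0, A(0) = a0)."""
    p1 = 0.2 + 0.3 * l0 + 0.2 * a0
    return p1 if l1 == 1 else 1.0 - p1


def mean_y(l0, l1, a0, a1):
    """E(Y | L(0), L(1), A(0), A(1)) for the final outcome."""
    return 0.1 + 0.1 * l0 + 0.2 * l1 + 0.1 * a0 + 0.1 * a1


def g_computation(rule):
    """Mean outcome under the rule: sum over histories with A set by the rule."""
    total = 0.0
    for l0, l1 in product((0, 1), repeat=2):
        a0 = rule(0, (l0,))
        a1 = rule(1, (l0, l1))
        total += P_L0[l0] * p_l1(l1, l0, a0) * mean_y(l0, l1, a0, a1)
    return total


def never_treat(k, lbar):
    """Static rule: A(k) = 0 at every time point."""
    return 0


def treat_if_flag(k, lbar):
    """Dynamic rule: set A(k) to the most recent covariate value."""
    return lbar[-1]


mean_static = g_computation(never_treat)     # 0.204 up to rounding
mean_dynamic = g_computation(treat_if_flag)  # 0.30 up to rounding
```

The static and dynamic rules are handled by the same function because a static rule is simply a rule that ignores its covariate argument.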

Let $L^{d}=\left(L(0),L^{d}(1),\dots,L^{d}(K+1)\right)$ denote a random variable with probability distribution $P_0^{d}$, which includes as a component the process $Y^{d}=\left(Y^{d}(0),Y^{d}(1),\dots,Y^{d}(K+1)\right)$. The above-defined causal quantities can now be defined as a parameter of $P_0$. For example, if $Y(t)\in[0,1]$ and the causal parameter of interest is a vector of coefficients in a logistic MSM, then we have $\Psi^{F}(P_{O,U,0})=\text{argmin}_{\beta}-E_0\sum_{t}\sum_{d\in\mathcal{D}}h(d,t,V)\left\{E_0\left(Y^{d}(t)\mid L(0)\right)\log m_{\beta}(d,t,V)+\left(1-E_0\left(Y^{d}(t)\mid L(0)\right)\right)\log\left(1-m_{\beta}(d,t,V)\right)\right\}\equiv\Psi(P_0).$ The estimand $\psi_0=\beta_0$ solves the equation $0=E_0\sum_{t}\sum_{d\in\mathcal{D}}h(d,t,V)\frac{\frac{d}{d\beta_0}m_{\beta_0}(d,t,V)}{m_{\beta_0}(1-m_{\beta_0})}\left(E_0\left(Y^{d}(t)\mid L(0)\right)-m_{\beta_0}(d,t,V)\right).$ The causal identifiability assumptions put no restrictions on the probability distribution $P_0$ so that our statistical model is unchanged, with the exception that we now also assume positivity (3). The statistical target parameter is now defined as a mapping $\Psi:\mathcal{M}\to\mathbb{R}^{J}$ that maps a probability distribution $P\in\mathcal{M}$ of O into a vector of parameter values $\Psi(P)$.

The statistical estimation problem is now defined: We observe n i.i.d. copies ${O}_{1},\dots ,{O}_{n}$ of $O\sim {P}_{0}\in \mathcal{M}$ and we want to estimate $\mathrm{\Psi }\left({P}_{0}\right)$ for a defined target parameter mapping $\mathrm{\Psi }:\mathcal{M}\to \mathbb{R}^{J}$. For this estimation problem, the causal model plays no further role. Even when one does not believe any of the causal assumptions, one might still argue that the statistical parameter $\mathrm{\Psi }\left({P}_{0}\right)={\mathrm{\psi }}_{0}$ represents an effect measure of interest controlling for all the measured confounders.

## 3 Pooled TMLE of working MSM for dynamic treatments and time-dependent outcome process

The TMLE algorithm starts out with defining the target parameter as a mapping $\Psi\left(\bar{Q},Q_{L(0)}\right)$ for a particular choice $\bar{Q}$ that is easier to estimate than the whole likelihood Q. It requires the derivation of the efficient influence curve $D^{\ast}(P)$, which can also be represented as $D^{\ast}\left(\bar{Q},Q_{L(0)},g\right)$. Subsequently, it defines a loss function $\mathcal{L}\left(\bar{Q},Q_{L(0)}\right)$ for $\left(\bar{Q}_0,Q_{L(0),0}\right)$ and a submodel $\left(\left(\bar{Q}(\epsilon,g),Q_{L(0)}(\epsilon_0)\right):\epsilon,\epsilon_0\right)$ through $\left(\bar{Q},Q_{L(0)}\right)$ at $(\epsilon,\epsilon_0)=0$, indexed by the intervention mechanism g, chosen so that $\left.\frac{d}{d(\epsilon,\epsilon_0)}\mathcal{L}\left(\bar{Q}(\epsilon,g),Q_{L(0)}(\epsilon_0)\right)\right|_{(\epsilon,\epsilon_0)=0}$ spans the efficient influence curve $D^{\ast}\left(\bar{Q},Q_{L(0)},g\right)$. Given these choices, it remains to define the updating algorithm, which simply uses the submodel through the initial estimator to determine the update by fitting $(\epsilon,\epsilon_0)$ with minimum loss based estimation (MLE). This updating step is iterated until convergence, at which point the MLE of $(\epsilon,\epsilon_0)$ equals 0. By the fact that an MLE solves its score equation, it then follows that the final update $\bar{Q}_n^{\ast},Q_{L(0),n}^{\ast}$ also solves the efficient influence curve equation $\sum_i D^{\ast}\left(\bar{Q}_n^{\ast},Q_{L(0),n}^{\ast},g_n\right)(O_i)=0$, which provides the foundation for its asymptotic linearity and efficiency. The remainder of this section presents each of these steps in detail.

An estimator of ${\mathrm{\psi }}_{0}$ is efficient among the class of regular estimators if and only if it is asymptotically linear with influence curve ${D}^{\ast }\left({Q}_{0},{g}_{0}\right)$ [37]. The efficient influence curve can thus be used as an ingredient for the construction of an efficient estimator. One approach is to represent the efficient influence curve as an estimating function ${D}^{\ast }\left(Q,g,\mathrm{\psi }\right)$ and define an estimator ${\mathrm{\psi }}_{n}$ as the solution of ${P}_{n}{D}^{\ast }\left({Q}_{n},{g}_{n},\mathrm{\psi }\right)=0$, given initial estimators ${Q}_{n},{g}_{n}$. This is referred to as the estimating equation methodology for construction of locally efficient estimators [38]. Here, we instead use the efficient influence curve to define a targeted maximum likelihood (substitution) estimator $\mathrm{\Psi }\left({Q}_{n}^{\ast }\right)$ that, as a by-product of the procedure, satisfies ${P}_{n}{D}^{\ast }\left({Q}_{n}^{\ast },{g}_{n}\right)=0$ and thus also solves the efficient influence curve estimating equation. Under regularity conditions, one can now establish that, if ${D}^{\ast }\left({Q}_{n}^{\ast },{g}_{n}\right)$ consistently estimates ${D}^{\ast }\left({Q}_{0},{g}_{0}\right)$, then $\mathrm{\Psi }\left({Q}_{n}^{\ast }\right)$ is asymptotically linear with influence curve equal to the efficient influence curve, so that $\mathrm{\Psi }\left({Q}_{n}^{\ast }\right)$ is asymptotically efficient. In addition, robustness properties of the efficient influence curve are naturally inherited by the TMLE.

Robins [13, 29] and Bang and Robins [22] reformulate the statistical target parameter and corresponding efficient influence curve for longitudinal MSMs on the intervention-specific mean as a series of iterated conditional expectations. For completeness, and to generalize to dynamic marginal structural working models possibly conditional on baseline covariates, as well as to general functions of the intervention-specific mean across a user-supplied class of interventions, we present this reformulation of the statistical target parameter below. The corresponding efficient influence curve is given in Appendix B. We will use the common notation $Ph=\int h\left(O\right)dP\left(O\right)$ for the expectation of a function $h\left(O\right)$ with respect to P.

## 3.1 Reformulation of the statistical target parameter in terms of iteratively defined conditional means

For the case ${Y}_{d}\left(t\right)\in \left[0,1\right]$ in Section 2.4 we defined $\mathrm{\Psi }\left(P\right)$ as $\mathrm{\Psi }\left(Q\right)=\underset{\mathrm{\beta }}{\mathrm{a}\mathrm{r}\mathrm{g}\mathrm{m}\mathrm{i}\mathrm{n}}-E{\sum }_{t}{\sum }_{d\in \mathcal{D}}h\left(d,t,V\right)\left\{{\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}log{m}_{\mathrm{\beta }}\left(d,t,V\right)+\left(1-{\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}\right)log\left(1-{m}_{\mathrm{\beta }}\left(d,t,V\right)\right)\right\},$(5) where ${\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}={E}_{P}\left({Y}^{d}\left(t\right)|L\left(0\right)\right)$. Thus, $\mathrm{\Psi }\left(P\right)$ only depends on P through ${\stackrel{ˉ}{Q}}_{L\left(1\right)}=\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}:d\in \mathcal{D},t\right)$ and ${Q}_{L\left(0\right)}$. Therefore, we will also refer to the statistical target parameter $\mathrm{\Psi }\left(P\right)$ as $\mathrm{\Psi }\left(Q\right)$ where we redefine $Q\equiv \left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$. For each given t, we can use the following recursive definition of ${E}_{P}\left({Y}^{d}\left(t\right)|L\left(0\right)\right)$: for $k=t,t-1,\dots ,1$ we have $\begin{array}{rl}{\stackrel{ˉ}{Q}}_{L\left(k\right)}^{d,t}& =E\left({Y}^{d}\left(t\right)|{\stackrel{ˉ}{L}}^{d}\left(k-1\right)\right)\\ & ={E}_{L\left(k\right)}\left({\stackrel{ˉ}{Q}}_{L\left(k+1\right)}^{d,t}|\stackrel{ˉ}{L}\left(k-1\right),\stackrel{ˉ}{A}\left(k-1\right)={\stackrel{ˉ}{d}}_{k-1}\left(\stackrel{ˉ}{L}\left(k-1\right)\right)\right),\end{array}$ where we define ${\stackrel{ˉ}{Q}}_{L\left(t+1\right)}^{d,t}=Y\left(t\right)$. This defines ${\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}$ as an iteratively defined conditional mean [22].

To obtain $\mathrm{\Psi }\left(Q\right)$ we simply put ${\stackrel{ˉ}{Q}}_{L\left(1\right)}=\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}:d\in \mathcal{D},t\right)$, combined with the marginal distribution of $L\left(0\right)$ into the above representation $\mathrm{\Psi }\left(Q\right)=\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$. As mentioned in the previous section, we have that $\mathrm{\Psi }\left(Q\right)$ solves the score equations given by $\begin{array}{rl}0& =E{\sum }_{t}{\sum }_{d\in \mathcal{D}}h\left(d,t,V\right)\frac{\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t,V\right)}{{m}_{\mathrm{\beta }}\left(1-{m}_{\mathrm{\beta }}\right)}\left(E\left({Y}^{d}\left(t\right)|L\left(0\right)\right)-{m}_{\mathrm{\beta }}\left(d,t,V\right)\right)\\ & \equiv E{\sum }_{t}{\sum }_{d\in \mathcal{D}}{h}_{1}\left(d,t,V\right)\left(E\left({Y}^{d}\left(t\right)|L\left(0\right)\right)-{m}_{\mathrm{\beta }}\left(d,t,V\right)\right),\end{array}$ where we defined ${h}_{1}\left(d,t,V\right)\equiv h\left(d,t,V\right)\frac{\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t,V\right)}{{m}_{\mathrm{\beta }}\left(1-{m}_{\mathrm{\beta }}\right)}.$The TMLE for the linear working model using the squared error loss function is obtained by simply redefining ${h}_{1}\left(d,t,V\right)\equiv h\left(d,t,V\right)\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t,V\right)$.

In general, the above shows that we can represent $\Psi(P)=f\left(\left(E\left(Y^{d}(t)\mid L(0)\right):t,d\right),Q_{L(0)}\right)$ as a function $f(Q)$, where $Q=\left(\bar{Q}_{L(1)}=\left(E\left(Y^{d}(t)\mid L(0)\right):d,t\right),Q_{L(0)}\right)$ and we have an explicit representation of the derivative equation corresponding with f.

## 3.2 Estimation of intervention mechanism ${g}_{0}$

The log-likelihood loss function for ${g}_{0}$ is $-\log g$. Specifically, we can factorize the likelihood ${g}_{0}$ as ${g}_{0}=\prod _{k=0}^{K}{g}_{1,k}\left({A}_{1}\left(k\right)\mid Pa\left({A}_{1}\left(k\right)\right)\right){g}_{2,k}\left({A}_{2}\left(k\right)\mid Pa\left({A}_{2}\left(k\right)\right)\right),$ where $\left({g}_{1,k}:k\right)$ represents the treatment mechanism and $\left({g}_{2,k}:k\right)$ represents the censoring mechanism. Both mechanisms can be estimated separately with a log-likelihood based logistic regression estimator, either according to parametric models, or preferably using the state of the art in machine learning. In particular, we can use the log-likelihood based super learner based on a library of candidate machine learning algorithms, which uses cross-validation to determine the best performing weighted combination of the candidate machine learning algorithms [39]. Use of such aggressive data-adaptive algorithms is recommended in order to ensure consistency of ${g}_{n}$.
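Once each factor of the intervention mechanism has been fitted, the cumulative products over time points follow from a running product. The sketch below assumes the fitted probabilities are already available for one subject as two lists (treatment and censoring factors, one entry per time point); the function name `cumulative_g` and the numbers are hypothetical and serve only to illustrate the factorization.

```python
def cumulative_g(p_treat, p_uncens):
    """Running product over k of the treatment and censoring factors.

    p_treat[k]  : fitted probability of the treatment value of interest at k
    p_uncens[k] : fitted probability of remaining uncensored at k
    Returns a list whose k-th entry is the product of all factors through k.
    """
    out, running = [], 1.0
    for pt, pc in zip(p_treat, p_uncens):
        running *= pt * pc
        out.append(running)
    return out


# Hypothetical fitted probabilities for one subject at k = 0, 1.
g_cum = cumulative_g([0.5, 0.8], [0.9, 0.9])
```

Here the entry at k = 0 is 0.5 × 0.9 and the entry at k = 1 multiplies that by 0.8 × 0.9.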

If there are certain variables in the $Pa\left(A\left(k\right)\right)$ that are known to be instrumental variables (variables that affect future Y nodes only via their effects on $A\left(k\right)$), then these variables should be excluded from our estimates of ${g}_{0}$ in the TMLE procedure. In that case our estimate of the conditional distribution of $A\left(k\right)$ is in fact not estimating the conditional distribution of $A\left(k\right)$ given its parents; however, for simplicity we do not make this explicit in our notation.

## 3.3 Loss functions and initial estimator of ${Q}_{0}$

We will alternate between the notations $\bar{Q}_k^{d,t}$ and $\bar{Q}_{L(k)}^{d,t}$. Recall that $\Psi(Q)$ depends on Q through $Q_{L(0)}$ and $\bar{Q}\equiv\left(\bar{Q}_k^{d,t}:d\in\mathcal{D},t\in\tau,k=1,\dots,t\right)$. Note $\bar{Q}_k^{d,t}$ is a function of $\bar{l}(k-1)$, $t=1,\dots,K+1$, $k=1,\dots,t$, $d\in\mathcal{D}$. We will use the following loss function for $\bar{Q}_k^{d,t}$: $-\mathcal{L}_{d,t,k,\bar{Q}_{k+1}^{d,t}}\left(\bar{Q}_k^{d,t}\right)=I\left(\bar{A}(k-1)=\bar{d}_{k-1}\left(\bar{L}(k-1)\right)\right)\left\{\bar{Q}_{k+1}^{d,t}\log\bar{Q}_k^{d,t}+\left(1-\bar{Q}_{k+1}^{d,t}\right)\log\left(1-\bar{Q}_k^{d,t}\right)\right\}.$ This is an application of the log-likelihood loss function for the conditional mean of $\bar{Q}_{k+1}^{d,t}$ given past covariates and given that past treatment has been assigned according to rule d. For example, fitting a parametric logistic regression model of $\bar{Q}_{k+1}^{d,t}$ on past covariates among subjects with $\bar{A}(k-1)=\bar{d}_{k-1}\left(\bar{L}(k-1)\right)$ would minimize the empirical mean of this loss function over the unknown parameters of the logistic regression model. Alternatively, one could use loss-based machine learning algorithms, such as loss-based super learning, with this loss function.

In this loss function, the outcome ${\stackrel{ˉ}{Q}}_{k+1}^{d,t}$ is treated as known. In implementation of our estimator, it will be replaced by an estimate; we thus refer to ${\stackrel{ˉ}{Q}}_{k+1}^{d,t}$ as a nuisance parameter in this loss function. The collection of loss functions from $k=1,\dots ,t$ implies a sequential regression procedure where one starts at $k=t$ and sequentially fits ${\stackrel{ˉ}{Q}}_{k}^{d,t}$ for $k=t,\dots ,1$. We describe this procedure in greater detail in the next subsection, for a sum-loss function that sums the above loss function over a collection of rules $d\in \mathcal{D}$.

By summing over $d\in \mathcal{D}$, the time points t, and $k=1,\dots ,t$, we obtain the loss function ${\mathcal{L}}_{\stackrel{ˉ}{Q}}\left(\stackrel{ˉ}{Q}\right)\equiv {\sum }_{t}{\sum }_{k=1}^{t}{\sum }_{d\in \mathcal{D}}{\mathcal{L}}_{d,t,k,{\stackrel{ˉ}{Q}}_{k+1}^{d,t}}\left({\stackrel{ˉ}{Q}}_{k}^{d,t}\right)$for the whole $\stackrel{ˉ}{Q}=\left({\stackrel{ˉ}{Q}}_{k}^{d,t}:d\in \mathcal{D},k=1,\dots ,t,t=1,\dots ,K+1\right)$.

We will use the log-likelihood loss $\mathcal{L}\left({Q}_{L\left(0\right)}\right)=-log{Q}_{L\left(0\right)}$ as loss function for the distribution ${Q}_{0,L\left(0\right)}$ of $L\left(0\right)$, but this loss will play no role since we will estimate ${Q}_{0,L\left(0\right)}$ with the empirical distribution function ${Q}_{L\left(0\right),n}$. To conclude, we have presented a loss function for all components of $\left(\stackrel{ˉ}{Q},{Q}_{L\left(0\right)}\right)$ our target parameter depends on, and the sum-loss function ${\mathcal{L}}_{Q}\left(\stackrel{ˉ}{Q},{Q}_{L\left(0\right)}\right)\equiv {\mathcal{L}}_{\stackrel{ˉ}{Q}}\left(\stackrel{ˉ}{Q}\right)-log{Q}_{L\left(0\right)}$ is a valid loss function for $\left(\stackrel{ˉ}{Q},{Q}_{L\left(0\right)}\right)$ as a whole.

## 3.4 Non-targeted substitution estimator

These loss functions imply a sequential regression methodology for fitting each of the required components of $\left(\bar{Q},Q_{L(0)}\right)$. These initial fits can then be used to construct a non-targeted plug-in estimator of the target parameter $\psi_0$. As noted, we estimate the marginal distribution of $L(0)$ with the empirical distribution. We now describe how to obtain an estimator $\bar{Q}_{k,n}^{d,t}$, $d\in\mathcal{D}$, $k=1,\dots,t$, for any given $t=1,\dots,K+1$. We define $\bar{Q}_{t+1}^{d,t}=Y(t)$ for all d, and recall that $\bar{Q}_{t,n}^{d,t}$ is the regression of $Y(t)$ on $\bar{A}(t-1)=\bar{d}_{t-1}\left(\bar{L}(t-1)\right)$ and $\bar{L}(t-1)$. This latter regression can be carried out conditional on $\bar{A}_1(t-1),\bar{L}(t-1)$, stratifying only on not being censored through time $t-1$ (i.e. $\bar{A}_2(t-1)=0$). The resulting fit for all $\bar{A}_1(t-1)$ values can then be evaluated at $\bar{A}(t-1)=\bar{d}_{t-1}\left(\bar{L}(t-1)\right)$. In this manner, if certain rules have little support, one can still obtain an initial estimator that smooths across all observations.

Given the regression fit ${\stackrel{ˉ}{Q}}_{t,n}^{d,t}$, for a $d\in \mathcal{D}$, we regress ${\stackrel{ˉ}{Q}}_{t,n}^{d,t}$ onto $\stackrel{ˉ}{A}\left(t-2\right),\stackrel{ˉ}{L}\left(t-2\right)$ and evaluate it at $\stackrel{ˉ}{A}\left(t-2\right)={\stackrel{ˉ}{d}}_{t-2}\left(\stackrel{ˉ}{L}\left(t-2\right)\right)$, giving us ${\stackrel{ˉ}{Q}}_{t-1,n}^{d,t}$. This is carried out for each $d\in \mathcal{D}$, giving us ${\stackrel{ˉ}{Q}}_{t-1,n}^{d,t}$ for each $d\in \mathcal{D}$. Again, given this regression ${\stackrel{ˉ}{Q}}_{t-1,n}^{d,t}$, we regress this on $\stackrel{ˉ}{A}\left(t-3\right),\stackrel{ˉ}{L}\left(t-3\right)$, and evaluate it at $\stackrel{ˉ}{A}\left(t-3\right)={\stackrel{ˉ}{d}}_{t-3}\left(\stackrel{ˉ}{L}\left(t-3\right)\right)$, giving us ${\stackrel{ˉ}{Q}}_{t-2,n}^{d,t}$. We carry this out for each $d\in \mathcal{D}$, giving us ${\stackrel{ˉ}{Q}}_{t-2,n}^{d,t}$, for each $d\in \mathcal{D}$. This process is iterated until we obtain an estimator of ${\stackrel{ˉ}{Q}}_{1,n}^{d,t}\left(L\left(0\right)\right)$ for each $d\in \mathcal{D}$. Since this process is carried out for each $t=1,\dots ,K+1$, this results in an estimator ${\stackrel{ˉ}{Q}}_{1,n}^{d,t}$ for each $d\in \mathcal{D}$ and $t=1,\dots ,K+1$. We denote this estimator of ${\stackrel{ˉ}{Q}}_{1,0}=\left({\stackrel{ˉ}{Q}}_{1,0}^{d,t}:d,t\right)$ with ${\stackrel{ˉ}{Q}}_{1,n}=\left({\stackrel{ˉ}{Q}}_{1,n}^{d,t}:d,t\right)$. Note that a plug-in estimator $\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{1,n},{Q}_{L\left(0\right),n}\right)$ of ${\mathrm{\psi }}_{0}=\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{1,0},{Q}_{L\left(0\right)}\right)$ is now obtained by regressing ${\stackrel{ˉ}{Q}}_{1,n}^{d,t}$ onto $d,t,V$ according to the working marginal structural model using weighted logistic regression based on the pooled sample $\left({\stackrel{ˉ}{Q}}_{1,n}^{d,t}\left({L}_{i}\left(0\right)\right),{V}_{i},d,t\right)$, $d\in \mathcal{D},i=1,\dots ,n$, $t=1,\dots ,K+1$, with weight $h\left(d,t,{V}_{i}\right)$.
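The sequential regression steps described above can be sketched for a single rule and a single final time point. This is a minimal illustration under several assumptions, not the estimator as implemented for this paper: the data are simulated from a made-up process with binary covariates and no censoring, each "regression" is a saturated stratum mean fitted among subjects who follow the rule (rather than a pooled fit evaluated at the rule, or a super learner), and the names `simulate`, `stratum_means`, and `sequential_regression` are hypothetical. The final weighted regression onto the working model is omitted.

```python
import random


def simulate(n, seed=0):
    """Draw (L0, A0, L1, A1, Y) from a hypothetical confounded process."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        l0 = int(rng.random() < 0.4)
        a0 = int(rng.random() < 0.3 + 0.4 * l0)
        l1 = int(rng.random() < 0.2 + 0.3 * l0 + 0.2 * a0)
        a1 = int(rng.random() < 0.2 + 0.5 * l1)
        y = int(rng.random() < 0.1 + 0.1 * l0 + 0.2 * l1 + 0.1 * a0 + 0.1 * a1)
        data.append((l0, a0, l1, a1, y))
    return data


def stratum_means(pairs):
    """Map each key to the mean of its values (a saturated regression)."""
    sums, counts = {}, {}
    for key, val in pairs:
        sums[key] = sums.get(key, 0.0) + val
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def sequential_regression(data, rule):
    # k = 2: regress Y on (L0, L1) among subjects following the rule through A(1).
    q2 = stratum_means(
        ((l0, l1), y) for l0, a0, l1, a1, y in data
        if a0 == rule(0, (l0,)) and a1 == rule(1, (l0, l1)))
    # k = 1: regress the fitted values from k = 2 on L0 among rule followers at A(0).
    q1 = stratum_means(
        (l0, q2[(l0, l1)]) for l0, a0, l1, a1, y in data
        if a0 == rule(0, (l0,)))
    # Average the k = 1 fit over the empirical distribution of L(0).
    return sum(q1[l0] for l0, *_ in data) / len(data)


def never_treat(k, lbar):
    """Static rule: A(k) = 0 at every time point."""
    return 0


estimate = sequential_regression(simulate(50000), never_treat)
```

Under this hypothetical data-generating process the G-computation value for the "never treat" rule is 0.204, so the estimate should land close to that value even though treatment in the simulated data depends on the covariates.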

The pooled TMLE presented below utilizes this same sequential regression algorithm and makes use of these initial fits of ${Q}_{0}$. In order to provide a consistent initial estimator of ${Q}_{0}$ and thereby improve the efficiency of the TMLE, use of an aggressive data-adaptive algorithm such as super learning [39] when generating the initial regression fits is recommended. These initial fits are then updated to remove bias in a series of targeting steps that rely on the fit ${g}_{n}$ of ${g}_{0}$. The updating steps involve submodels whose score spans the efficient influence curve.

## 3.5 Loss function and least favorable submodel that span the efficient influence curve

Recall that we use the notation $g_{0:k}=\prod_{j=0}^{k}g_{A(j)}$ for the cumulative product of conditional intervention distributions. Consider the submodel $\overline{Q}_{k}^{t}(\epsilon,g)=\left(\overline{Q}_{k}^{d,t}(\epsilon,g):d\in \mathcal{D}\right)$ with parameter $\epsilon$ defined by $\mathrm{Logit}\,\overline{Q}_{L(k)}^{d,t}(\epsilon,g)=\mathrm{Logit}\,\overline{Q}_{L(k)}^{d,t}+\epsilon \frac{h_{1}(d,t,V)}{g_{0:k-1}},\quad k=t,\dots ,1.$ This parameter $\epsilon$ is of the same dimension as $\beta$ and $h_{1}$. This defines a submodel $\overline{Q}^{t}(\epsilon,g)$ with parameter $\epsilon$ through $\overline{Q}^{t}=\left(\overline{Q}_{k}^{d,t}:d\in \mathcal{D},k=1,\dots ,t\right)$.
Note that $\left.\frac{d}{d\epsilon}\mathcal{L}_{d,t,k,\overline{Q}_{k+1}^{d,t}}\left(\overline{Q}_{k}^{d,t}(\epsilon,g)\right)\right|_{\epsilon=0}=h_{1}(d,t,V)\frac{I\left(\overline{A}(k-1)=\overline{d}_{k-1}\left(\overline{L}(k-1)\right)\right)}{g_{0:k-1}}\left(\overline{Q}_{k+1}^{d,t}-\overline{Q}_{k}^{d,t}\right).$ This shows that $\begin{array}{l}\left.\frac{d}{d\epsilon}\sum_{t=1}^{K+1}\sum_{d\in \mathcal{D}}\sum_{k=1}^{t}\mathcal{L}_{d,t,k,\overline{Q}_{k+1}^{d,t}}\left(\overline{Q}_{k}^{d,t}(\epsilon,g)\right)\right|_{\epsilon=0}\\ \quad=\sum_{t=1}^{K+1}\sum_{d\in \mathcal{D}}\sum_{k=1}^{t}h_{1}(d,t,V)\frac{I\left(\overline{A}(k-1)=\overline{d}_{k-1}\left(\overline{L}(k-1)\right)\right)}{g_{0:k-1}}\left(\overline{Q}_{k+1}^{d,t}-\overline{Q}_{k}^{d,t}\right)\\ \quad=c(Q)\left[D^{\ast}(P)-D_{L(0)}^{\ast}(Q)\right],\end{array}$ where $D^{\ast}(P)$ is the efficient influence curve as presented in Corollary (1), Appendix (B), and we define $c(Q)\equiv E_{Q_{L(0)}}\sum_{t,d}h_{1}(d,t,V)\frac{d}{d\beta}m_{\beta}(d,t,V),$ giving $D_{L(0)}^{\ast}(Q)=c(Q)^{-1}\sum_{t,d}h_{1}(d,t,V)\left(\overline{Q}_{L(1)}^{d,t}-m_{\beta}(d,t,V)\right).$ In other words, the sum-loss function $\mathcal{L}_{\overline{Q}}(\overline{Q})=\sum_{t=1}^{K+1}\sum_{d\in \mathcal{D}}\sum_{k=1}^{t}\mathcal{L}_{d,t,k,\overline{Q}_{k+1}^{d,t}}\left(\overline{Q}_{k}^{d,t}\right)$ and submodel $\overline{Q}(\epsilon,g)=\left(\overline{Q}_{k}^{d,t}(\epsilon,g):k,d,t\right)$ through $\overline{Q}=\left(\overline{Q}_{k}^{d,t}:k,d,t\right)$ generate the component $D^{\ast}(P)-D_{L(0)}^{\ast}(Q)$ of the efficient influence curve $D^{\ast}(P)$.

Consider also a submodel $Q_{L(0)}(\epsilon_{0})$ of $Q_{L(0)}$ with score $D_{L(0)}^{\ast}(Q)$. This submodel and its loss play no role in the TMLE algorithm, since we estimate $Q_{L(0)}$ with its NPMLE, the empirical distribution of $L_{i}(0)$, $i=1,\dots ,n$, so that the MLE of $\epsilon_{0}$ equals zero. This defines our submodel $\left(Q_{L(0)}(\epsilon_{0}),\overline{Q}(\epsilon,g):\epsilon_{0},\epsilon\right)$. The sum-loss function $\mathcal{L}_{\overline{Q}}\left(Q_{L(0)},\overline{Q}\right)=\mathcal{L}(\overline{Q})-\log Q_{L(0)}$ and this submodel satisfy the condition that the generalized score spans the efficient influence curve: $D^{\ast}(Q,g)\in \left\langle \left.\frac{d}{d(\epsilon,\epsilon_{0})}\mathcal{L}_{\overline{Q}}\left(Q_{L(0)}(\epsilon_{0}),\overline{Q}(\epsilon,g)\right)\right|_{(\epsilon,\epsilon_{0})=0}\right\rangle .$ (6)

## 3.6 Pooled TMLE

We now describe the TMLE algorithm based on the above choices of (1) the representation of $\Psi(P)$ as $\Psi\left(\overline{Q},Q_{L(0)}\right)$, (2) the loss function for $\left(\overline{Q},Q_{L(0)}\right)$, and (3) the least favorable submodels $\left(\left(\overline{Q}(\epsilon,g):\epsilon\right),\left(Q_{L(0)}(\epsilon_{0}):\epsilon_{0}\right)\right)$ through $\left(\overline{Q},Q_{L(0)}\right)$ at $(\epsilon,\epsilon_{0})=0$ for fluctuating these parameters $\left(\overline{Q},Q_{L(0)}\right)$. We utilize the same sequential regression approach described in Section 3.4, but now incorporate sequential targeted updating of the initial regression fits. We assume an estimator $g_{n}$ of $g_{0}$. We first specify where in the algorithm updating occurs and then describe the updating process.

Recall that we define $\overline{Q}_{t+1}^{d,t}=Y(t)$ for all d and that $\overline{Q}_{t,n}^{d,t}$ is the regression of $Y(t)$ on $\overline{A}(t-1)=\overline{d}_{t-1}\left(\overline{L}(t-1)\right),\overline{L}(t-1)$. For any given $t=1,\dots ,K+1$, the initial estimator $\overline{Q}_{t,n}^{d,t}$ is first updated to $\overline{Q}_{t,n}^{d,t,\ast}$ using a logistic regression fit of our least favorable submodels, as described below. For each $d\in \mathcal{D}$, we then regress the updated fit $\overline{Q}_{t,n}^{d,t,\ast}$ onto $\overline{A}(t-2),\overline{L}(t-2)$ and evaluate it at $\overline{A}(t-2)=\overline{d}_{t-2}\left(\overline{L}(t-2)\right)$, giving us $\overline{Q}_{t-1,n}^{d,t}$. The regressions $\overline{Q}_{t-1,n}^{d,t}$ are then updated for each $d\in \mathcal{D}$, as described below, giving us $\overline{Q}_{t-1,n}^{d,t,\ast}$. For each $d\in \mathcal{D}$, we next regress the updated fit $\overline{Q}_{t-1,n}^{d,t,\ast}$ on $\overline{A}(t-3),\overline{L}(t-3)$ and evaluate it at $\overline{A}(t-3)=\overline{d}_{t-3}\left(\overline{L}(t-3)\right)$, giving us $\overline{Q}_{t-2,n}^{d,t}$, and again update the resulting regressions, giving us $\overline{Q}_{t-2,n}^{d,t,\ast}$. This process is iterated until we obtain an updated estimator $\overline{Q}_{1,n}^{d,t,\ast}\left(L(0)\right)$ for each $d\in \mathcal{D}$.
Since this process is carried out for each $t=1,\dots ,K+1$, this results in an estimator ${\stackrel{ˉ}{Q}}_{1,n}^{d,t,\ast }$ for each $d\in \mathcal{D}$ and $t=1,\dots ,K+1$. We denote this estimator of ${\stackrel{ˉ}{Q}}_{1,0}=\left({\stackrel{ˉ}{Q}}_{1,0}^{d,t}:d,t\right)$ with ${\stackrel{ˉ}{Q}}_{1,n}^{\ast }=\left({\stackrel{ˉ}{Q}}_{1,n}^{d,t,\ast }:d,t\right)$.

The updating steps are implemented as follows: for each $t\in \left\{1,\dots ,K+1\right\}$, and for $k=t$ to $k=1$, we compute $\epsilon_{k,n}\equiv \underset{\epsilon_{k}}{\arg\min}\,P_{n}\sum_{d\in \mathcal{D}}\mathcal{L}_{d,t,k,\overline{Q}_{k+1,n}^{d,t,\ast}}\left(\overline{Q}_{k,n}^{d,t}(\epsilon_{k},g_{n})\right),$ and compute the corresponding update $\overline{Q}_{k,n}^{d,t,\ast}=\overline{Q}_{k,n}^{d,t}(\epsilon_{k,n},g_{n})$, for all $d\in \mathcal{D}$. Note that $\begin{array}{rl}\epsilon_{k,n}&=\arg\min_{\epsilon}\sum_{d\in \mathcal{D}}\mathcal{L}_{d,t,k,\overline{Q}_{k+1,n}^{d,t,\ast}}\left(\overline{Q}_{k,n}^{d,t}(\epsilon,g_{n})\right)\\ &=\arg\min_{\epsilon}\sum_{d\in \mathcal{D}}\sum_{i=1}^{n}I\left(\overline{A}_{i}(k-1)=\overline{d}_{k-1}\left(\overline{L}_{i}(k-1)\right)\right)\\ &\quad\times\left\{\overline{Q}_{k+1,n}^{d,t,\ast}\left(\overline{L}_{i}(k)\right)\log \overline{Q}_{k,n}^{d,t}(\epsilon,g_{n})\left(\overline{L}_{i}(k-1)\right)\right.\\ &\quad\left.+\left(1-\overline{Q}_{k+1,n}^{d,t,\ast}\left(\overline{L}_{i}(k)\right)\right)\log \left(1-\overline{Q}_{k,n}^{d,t}(\epsilon,g_{n})\left(\overline{L}_{i}(k-1)\right)\right)\right\},\quad k=1,\dots ,K+1.\end{array}$ Thus $\epsilon_{k,n}$ can be obtained by fitting a logistic regression of the outcome $\overline{Q}_{k+1,n}^{d,t,\ast}\left(\overline{L}_{i}(k)\right)$ with offset $\mathrm{Logit}\,\overline{Q}_{k,n}^{d,t}$ on multivariate covariate $h_{1}(d,t,V_{i})I\left(\overline{A}_{i}(k-1)=\overline{d}_{k-1}\left(\overline{L}_{i}(k-1)\right)\right)/g_{0:k-1}(O_{i}),$ using a data set pooled across $i=1,\dots ,n,d\in \mathcal{D}$ (consisting of $n\times |\mathcal{D}|$ observations).
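The offset logistic regression described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `fit_fluctuation`, the Newton-Raphson solver, the clipping bound, and the use of a least-squares solve (which returns a minimum-norm step when the columns of the covariate matrix are collinear) are all assumptions made here. The input `H` stands for the pooled covariate $h_{1}I(\cdot)/g_{0:k-1}$, `q_init` for the initial fit, and `y_next` for the outcome $\overline{Q}_{k+1,n}^{d,t,\ast}$.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def logit(p):
    return np.log(p / (1.0 - p))


def fit_fluctuation(y_next, q_init, H, n_iter=100, tol=1e-10):
    """Fit epsilon in logit(q) = logit(q_init) + H @ epsilon (no intercept).

    y_next : pooled outcomes in [0, 1], one entry per (subject, rule) pair
    q_init : pooled initial fits in (0, 1), used as the offset
    H      : pooled covariate matrix, one column per component of epsilon
    Returns the fitted epsilon and the updated fit q_star.
    """
    # Clipping away from 0 and 1 is a numerical safeguard (an assumption here).
    offset = logit(np.clip(q_init, 1e-6, 1.0 - 1e-6))
    eps = np.zeros(H.shape[1])
    for _ in range(n_iter):
        q = expit(offset + H @ eps)
        score = H.T @ (y_next - q)                    # gradient of the log-likelihood
        info = (H * (q * (1.0 - q))[:, None]).T @ H   # Fisher information
        # Least-squares solve: a minimum-norm Newton step if info is singular.
        step = np.linalg.lstsq(info, score, rcond=1e-10)[0]
        eps = eps + step
        if np.max(np.abs(step)) < tol:
            break
    q_star = expit(offset + H @ eps)
    return eps, q_star
```

At convergence the empirical score `H.T @ (y_next - q_star)` is numerically zero, which is the property the targeting step relies on. Rows whose indicator is zero have a zero covariate and therefore do not influence the fitted $\epsilon$.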

This defines the TMLE $\overline{Q}_{n}^{\ast}=\left(\overline{Q}_{k,n}^{d,t,\ast}:d\in \mathcal{D},t,k=1,\dots ,t\right)$. In particular, $\overline{Q}_{1,n}^{\ast}=\left(\overline{Q}_{1,n}^{d,t,\ast}:d\in \mathcal{D},t\right)$ is the TMLE of $\overline{Q}_{1,0}=\left(E_{0}\left(Y^{d}(t)|L(0)\right):d\in \mathcal{D},t\right)$. This in turn defines the TMLE $\left(Q_{L(0),n},\overline{Q}_{n}^{\ast}\right)$ of $\left(Q_{L(0),0},\overline{Q}_{0}\right)$, where $Q_{L(0),n}$ is the empirical distribution of $L(0)$.

The TMLE of ${\mathrm{\psi }}_{0}$ is the plug-in estimator corresponding with ${\stackrel{ˉ}{Q}}_{1,n}^{\ast }$ and ${Q}_{L\left(0\right),n}$: ${\mathrm{\psi }}_{n}^{\ast }=\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{1,n}^{\ast },{Q}_{L\left(0\right),n}\right).$This plug-in estimator $\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{1,n}^{\ast },{Q}_{L\left(0\right),n}\right)$ of ${\mathrm{\psi }}_{0}=\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{1,0},{Q}_{L\left(0\right),0}\right)$ is obtained by regressing ${\stackrel{ˉ}{Q}}_{1,n}^{d,t,\ast }$ onto $d,t,V$ according to the marginal structural working model in the pooled sample $\left({\stackrel{ˉ}{Q}}_{1,n}^{d,t,\ast }\left({L}_{i}\left(0\right)\right),{V}_{i},d,t\right)$, $d\in \mathcal{D},i=1,\dots ,n$, $t=1,\dots ,K+1$, using weights $h\left(d,t,{V}_{i}\right)$.

An alternative pooled TMLE that only fits a single $\epsilon$ to compute the update is described in Appendix C.

## 3.7 Statistical inference for pooled TMLE

By construction, the TMLE solves the efficient influence curve equation $0={P}_{n}{D}^{\ast }\left({\stackrel{ˉ}{Q}}_{n}^{\ast },{g}_{n},\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{n}^{\ast },{Q}_{L\left(0\right),n}\right)\right)$, thereby making it a double robust locally efficient substitution estimator under regularity conditions, and positivity (3) (van der Laan [20], theorem 8.5, appendix A.18). Here, we provide standard error estimates and thereby confidence intervals for the case that ${g}_{n}$ is a maximum likelihood estimator for ${g}_{0}$ using a correctly specified semiparametric model for ${g}_{0}$.

Specifically, if $g_{n}$ is a maximum likelihood estimator of $g_{0}$ according to a correctly specified semiparametric model for $g_{0}$, and $\overline{Q}_{n}^{\ast}$ converges to some possibly misspecified $\overline{Q}$, then under regularity conditions the TMLE $\psi_{n}^{\ast}$ is asymptotically linear with an influence curve given by $D^{\ast}\left(\overline{Q},g_{0},\psi_{0}\right)$ minus its projection onto the tangent space of this semiparametric model for $g_{0}$. As a consequence, the asymptotic variance of $\sqrt{n}\left(\psi_{n}^{\ast}-\psi_{0}\right)$ is smaller than or equal to the covariance matrix $\Sigma_{0}=P_{0}D^{\ast}\left(\overline{Q},g_{0},\psi_{0}\right)^{2}$. A consistent estimator of $\Sigma_{0}$ is given by $\Sigma_{n}=P_{n}\left\{D^{\ast}\left(\overline{Q}_{n}^{\ast},g_{n},\psi_{n}^{\ast}\right)\right\}^{2}.$ It follows that $\psi_{n}^{\ast}(j)\pm 1.96\frac{\sqrt{\Sigma_{n}(j,j)}}{\sqrt{n}}$ is an asymptotically conservative 95% confidence interval for $\psi_{0}(j)$, and we can also use the multivariate normal limit result, $\psi_{n}^{\ast}\sim N\left(\psi_{0},\Sigma_{0}/n\right)$, to construct a simultaneous confidence interval for $\psi_{0}$ and to test null hypotheses about $\psi_{0}$. This variance estimator treats the weight function h as known. If h is estimated, then this variance estimator still provides valid statistical inference for the statistical target parameter defined by the estimated h.
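As a small illustration of the interval construction, the following sketch computes $\Sigma_{n}$ and the coordinate-wise Wald-type intervals from a matrix of estimated influence curve values. The function name `ic_confidence_intervals` and the input layout (one row per subject, one column per component of $\psi$) are assumptions for this example; computing the influence curve values themselves is not shown.

```python
import numpy as np


def ic_confidence_intervals(psi, ic, z=1.96):
    """Wald-type intervals from estimated influence curve values.

    psi : (p,) point estimate
    ic  : (n, p) matrix whose i-th row is the estimated D*(O_i)
    Returns a (p, 2) array of [lower, upper] limits and the (p, p) matrix Sigma_n.
    """
    psi = np.asarray(psi, dtype=float)
    ic = np.asarray(ic, dtype=float)
    n = ic.shape[0]
    sigma_n = ic.T @ ic / n               # P_n {D*}^2, read as an outer product
    se = np.sqrt(np.diag(sigma_n) / n)    # sqrt(Sigma_n(j, j)) / sqrt(n)
    return np.column_stack([psi - z * se, psi + z * se]), sigma_n
```

The off-diagonal entries of the returned matrix are what a simultaneous interval or a joint test would use.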

In the case that $g_{n}$ is a data-adaptive estimator converging to $g_{0}$, we suggest (without proof) that this variance estimator will still provide an asymptotically conservative confidence interval under regularity conditions. However, ideally the data-adaptive estimator $g_{n}$ should also be targeted [40]. An approach to valid inference in the case where $g_{n}$ is inconsistent but $Q_{n}$ is consistent is also discussed in van der Laan [40]; however, it remains to be generalized to the parameters in this paper.

## 4 Implementation of the pooled TMLE

The previous section reformulated the statistical parameter in terms of iteratively defined conditional means and described a pooled TMLE for this representation. In this section, we illustrate notation and implementation of this TMLE to estimate the parameters of a marginal structural working model on counterfactual survival over time.

## 4.1 The statistical estimation problem

We continue our motivating example, in which the goal is to learn the effect of switch time on survival. For illustration, focus on the two time point case where $K=1$. Let the observed data consist of n i.i.d. copies ${O}_{1},\dots ,{O}_{n}$ of ${O}_{i}=\left({L}_{i}\left(0\right),{A}_{i}\left(0\right),{L}_{i}\left(1\right),{A}_{i}\left(1\right),{Y}_{i}\left(2\right)\right)\sim {P}_{0}$. Let $L\left(t\right)=\left(Y\left(t\right),CD4\left(t\right)\right)$, where $Y\left(t\right)$ is an indicator of death by time t, and $CD4\left(t\right)$ is CD4 count at time t. Assume all subjects are alive at baseline ($Y\left(0\right)=0$). As above, $A\left(t\right)$ is an indicator of switch to second line by time t. We assume no right censoring so that all subjects are followed until death or the end of the study (for convenience define variable values after death as equal to their last observed value). We specify a NPSEM such that each variable may be a function of all variables that precede it and an independent error, and assume the corresponding non-parametric statistical model for ${P}_{0}$.

Define the set of treatment rules of interest $\mathcal{D}$ as the set of all possible switch times $\left\{0,1,2\right\}$ (where 2 corresponds to no switch). Each rule d implies a single vector $\overline{a}=\left(a(0),a(1)\right)$; we use $d(t)$ to refer to the value $a(t)$ implied by rule d, and $s_{d}$ to refer to the switching time implied by rule d. We specify the following marginal structural working model for counterfactual probability of death by time t under rule d: $\mathrm{Logit}\,m_{\beta}(d,t)=\beta_{0}+\beta_{1}t+\beta_{2}\left(d(t-1)(t-s_{d})\right).$ (7) The target causal parameter is defined as the projection of $\left(E_{0}\left(Y_{d}(t)\right):d\in \mathcal{D},t\in \left\{1,2\right\}\right)$ onto $m_{\beta}(d,t)$ according to eq. (1).
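To make working model (7) concrete, the following sketch tabulates its covariates for each rule and time point. The helper names (`rule_value`, `msm_covariates`, `msm_design`, `msm_mean`) are hypothetical; the only content taken from the text is the coding of rules by switch time, with $d(t)=1$ once the rule has switched by time t, and the covariate vector $\left(1,t,d(t-1)(t-s_{d})\right)$.

```python
import numpy as np

SWITCH_TIMES = (0, 1, 2)  # 2 encodes "no switch" in the two time point example


def rule_value(s_d, t):
    """d(t): 1 if the rule with switch time s_d has switched by time t, else 0."""
    return int(s_d <= t)


def msm_covariates(s_d, t):
    """Covariate vector (1, t, d(t-1)(t - s_d)) of working model (7)."""
    return np.array([1.0, t, rule_value(s_d, t - 1) * (t - s_d)], dtype=float)


def msm_design(times=(1, 2)):
    """One row per (t, rule) pair, ordered by time and then by switch time."""
    return np.vstack([msm_covariates(s, t) for t in times for s in SWITCH_TIMES])


def msm_mean(beta, s_d, t):
    """m_beta(d, t): the working-model probability of death by time t under rule d."""
    return 1.0 / (1.0 + np.exp(-(msm_covariates(s_d, t) @ np.asarray(beta, dtype=float))))
```

For example, `msm_covariates(0, 2)` gives `(1, 2, 2)` and `msm_covariates(2, 2)` gives `(1, 2, 0)`, so $\beta_{2}$ scales with the time already spent on second line therapy.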

Under the sequential randomization (2) and positivity (3) assumptions, $E\left({Y}_{d}\left(t\right)|L\left(0\right)\right)=E\left({Y}^{d}\left(t\right)|L\left(0\right)\right)$. The target statistical parameter is defined as the projection of $\left(E\left({Y}^{d}\left(t\right)|L\left(0\right)\right):d\in \mathcal{D},t\in \left\{1,2\right\}\right)$, onto the marginal structural working model ${m}_{\mathrm{\beta }}$, according to eq. (5) with $h\left(d,t\right)=1$.

## 4.1.1 Reformulation of the statistical target parameter

Note that $E\left({Y}^{d}\left(t=2\right)|L\left(0\right)\right)$ for rule d (denoted ${\stackrel{ˉ}{Q}}_{1}^{d,2}$) can be expressed in terms of iteratively defined conditional means: ${E}_{L\left(1\right)}\left({E}_{Y\left(2\right)}\left(Y\left(2\right)|L\left(1\right),L\left(0\right),A\left(1\right)=d\left(1\right),A\left(0\right)=d\left(0\right)\right)|L\left(0\right),A\left(0\right)=d\left(0\right)\right),$while $E\left({Y}^{d}\left(t=1\right)|L\left(0\right)\right)$ (denoted ${\stackrel{ˉ}{Q}}_{1}^{d,1}$) equals $E\left(Y\left(1\right)|L\left(0\right),A\left(0\right)=d\left(0\right)\right)$. The statistical target parameter $\mathrm{\Psi }\left(Q\right)$ is defined by plugging $\left({\stackrel{ˉ}{Q}}_{1}^{d,1},{\stackrel{ˉ}{Q}}_{1}^{d,2}\right):d\in \mathcal{D}$, and the marginal distribution of $L\left(0\right)$ into eq. (5).

## 4.2 Estimator implementation

We begin by describing implementation of a simple plug-in estimator of $\mathrm{\Psi }\left(Q\right)$.

## 4.2.1 Non-targeted substitution estimator

• 1.

For each rule of interest $d\in \mathcal{D}$, corresponding to each possible switch time, generate a vector ${\stackrel{ˉ}{Q}}_{1,n}^{d,2}$ of length n for $t=2$:

• (a)

Fit a logistic regression of $Y\left(2\right)$ on $L\left(1\right),L\left(0\right),A\left(1\right),A\left(0\right)$ and generate a predicted value for each subject by evaluating this regression fit at $A\left(1\right)=d\left(1\right),A\left(0\right)=d\left(0\right)$. Note ${E}_{0}\left(Y\left(2\right)|Y\left(1\right)=1\right)=1$, so the regression need only be fit and evaluated among subjects who remain alive at time 1. This gives a vector ${\stackrel{ˉ}{Q}}_{2,n}^{d,2}$ of length n.

• (b)

Fit a logistic regression of the predicted values generated in the previous step on $L\left(0\right),A\left(0\right)$. Generate a new predicted value for each subject by evaluating this regression fit at $A\left(0\right)=d\left(0\right)$. This gives a vector ${\stackrel{ˉ}{Q}}_{1,n}^{d,2}$ of length n.

• 2.

For each rule of interest $d\in \mathcal{D}$ generate a vector ${\stackrel{ˉ}{Q}}_{1,n}^{d,1}$ of length n for $t=1$: Fit a logistic regression of $Y\left(1\right)$ on $L\left(0\right),A\left(0\right)$ and generate a predicted value for each subject by evaluating this regression fit at $A\left(0\right)=d\left(0\right)$.

• 3.

The previous steps generated $\overline{Q}_{1,n}=\left(\overline{Q}_{1,n}^{d,t}:d\in \mathcal{D},t\in \left\{1,2\right\}\right)$. Stack these vectors to give a single vector with length equal to the number of subjects n times the number of rules $\left|\mathcal{D}\right|$ times the number of time points $\left(n\times 3\times 2\right)$. Fit a pooled logistic regression of $\overline{Q}_{1,n}$ on $\left(d,t\right)$ according to model $m_{\beta}$ (eq. 7), with weights given by $h\left(d,t\right)$ (here equal to 1). This gives an estimator of the target parameter $\Psi \left(Q\right)$.
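The three steps above can be sketched end to end for the two time point example. This is an illustrative sketch under several assumptions that are not in the text: main-terms logistic regressions stand in for whatever regression method is chosen, $L(t)$ is reduced to a single CD4 variable, the regressions are fit by a plain Newton-Raphson routine, and all function and argument names are hypothetical.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_logistic(X, y, n_iter=100, tol=1e-10):
    """Logistic regression by Newton-Raphson; y may be fractional in [0, 1]."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.lstsq(hess, grad, rcond=1e-10)[0]
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def d_of(s, t):
    """d(t): indicator that the rule with switch time s has switched by time t."""
    return float(s <= t)


def substitution_estimator(cd4_0, A0, y1, cd4_1, A1, y2, switch_times=(0, 1, 2)):
    n = len(y2)
    one = np.ones(n)
    alive = y1 == 0
    # Step 1(a): regress Y(2) on L(1), L(0), A(1), A(0) among subjects alive at time 1.
    b2 = fit_logistic(np.column_stack([one, cd4_1, cd4_0, A1, A0])[alive], y2[alive])
    # Step 2: regress Y(1) on L(0), A(0).
    X1 = np.column_stack([one, cd4_0, A0])
    b1 = fit_logistic(X1, y1)
    outcome, design = [], []
    for s in switch_times:
        d0, d1 = d_of(s, 0), d_of(s, 1)
        # Evaluate the time 2 fit at A(1)=d(1), A(0)=d(0); the prediction is 1 after death.
        X2_d = np.column_stack([one, cd4_1, cd4_0, d1 * one, d0 * one])
        q2 = np.where(alive, expit(X2_d @ b2), 1.0)
        # Step 1(b): regress these predictions on L(0), A(0), then set A(0)=d(0).
        b21 = fit_logistic(X1, q2)
        X1_d = np.column_stack([one, cd4_0, d0 * one])
        for t, q in ((1, expit(X1_d @ b1)), (2, expit(X1_d @ b21))):
            outcome.append(q)
            design.append(np.tile([1.0, t, d_of(s, t - 1) * (t - s)], (n, 1)))
    # Step 3: pooled logistic regression on the covariates of model (7), weights h = 1.
    return fit_logistic(np.vstack(design), np.concatenate(outcome))
```

The returned vector holds the fitted $\left(\beta_{0},\beta_{1},\beta_{2}\right)$. In practice the inner regressions would be replaced by a more flexible learner, as recommended for the initial fits.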

We now describe how the pooled TMLE modifies this algorithm to update the initial estimator ${\stackrel{ˉ}{Q}}_{1,n}$. In the following section, we compare the pooled TMLE to this non-targeted substitution estimator and with other available estimators.

## 4.2.2 Pooled TMLE

• 1.

Estimate $P_{0}\left(A(1)|A(0),L(1),L(0)\right)$ and $P_{0}\left(A(0)|L(0)\right)$. Denote these estimators $g_{1,n}$ and $g_{0,n}$, respectively, and let $g_{0:1,n}=g_{0,n}g_{1,n}$ denote their product. In our example, this step involves estimating the conditional probability of switching at time 0 given baseline CD4 count, and estimating the conditional probability of switching at time 1, given that a subject did not switch at time 0, did not die at time 1, and CD4 count at times 0 and 1.

• 2.

Generate a vector ${\stackrel{ˉ}{Q}}_{2,n}^{d,2,\ast }$ of length $n×\left|\mathcal{D}\right|$ for $t=2$, $k=2$:

• (a)

Fit a logistic regression of $Y\left(2\right)$ on $L\left(1\right),L\left(0\right),A\left(1\right),A\left(0\right)$. Generate a predicted value for each subject and each $d\in \mathcal{D}$ by evaluating this regression fit at $A\left(1\right)=d\left(1\right),A\left(0\right)=d\left(0\right)$. Note that ${E}_{0}\left(Y\left(2\right)|Y\left(1\right)=1\right)=1$, so the regression need only be fit and evaluated among subjects who remain alive at time 1. This gives a vector of initial values ${\stackrel{ˉ}{Q}}_{2,n}^{d,2}$ of length $n×\left|\mathcal{D}\right|$.

• (b)

For each subject, $i=1,\dots ,n$, create a vector consisting of one copy of ${Y}_{i}\left(2\right)$ for each $d\in \mathcal{D}$. Stack these copies to create a single vector of length $n×\left|\mathcal{D}\right|$, denoted ${\stackrel{ˉ}{Q}}_{3,n}^{d,2,\ast }$.

• (c)

For each subject $i=1,\dots ,n$ and each $d\in \mathcal{D}$, create a new multidimensional weighted covariate: $h\left(d,t=2\right)\frac{\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t=2\right)}{{m}_{\mathrm{\beta }}\left(1-{m}_{\mathrm{\beta }}\right)}\frac{I\left({\stackrel{ˉ}{A}}_{i}=d\right)}{{g}_{0:1,n}\left({O}_{i}\right)}.$

In our example, $h\left(d,t\right)=1$, and $\frac{\frac{d}{d\beta}m_{\beta}(d,t)}{m_{\beta}\left(1-m_{\beta}\right)}$ equals 1, t, and $d(t-1)(t-s_{d})$ for the derivative taken with respect to $\beta_{0},\beta_{1},$ and $\beta_{2}$, respectively. The following $3\times 3$ matrix would thus be generated for each subject i, with rows corresponding to switch at time 0, time 1, or do not switch: $\left(\begin{array}{ccc}\frac{1\times I\left(A_{i}(0)=1,A_{i}(1)=1\right)}{g_{0:1,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=1,A_{i}(1)=1\right)}{g_{0:1,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=1,A_{i}(1)=1\right)}{g_{0:1,n}(O_{i})}\\ \frac{1\times I\left(A_{i}(0)=0,A_{i}(1)=1\right)}{g_{0:1,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=0,A_{i}(1)=1\right)}{g_{0:1,n}(O_{i})}& \frac{1\times I\left(A_{i}(0)=0,A_{i}(1)=1\right)}{g_{0:1,n}(O_{i})}\\ \frac{1\times I\left(A_{i}(0)=0,A_{i}(1)=0\right)}{g_{0:1,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=0,A_{i}(1)=0\right)}{g_{0:1,n}(O_{i})}& \frac{0\times I\left(A_{i}(0)=0,A_{i}(1)=0\right)}{g_{0:1,n}(O_{i})}\end{array}\right)$

Stack these matrices to create a matrix with $n×\left|\mathcal{D}\right|$ rows and one column for each component of $\mathrm{\beta }$ (here, $3n×3$).

• (d)

Among those subjects still alive at the previous time point ($Y(1)=0$), fit a pooled logistic regression of $\overline{Q}_{3,n}^{d,2,\ast}$ (the $Y(2)$ vector) on the weighted covariates created in the previous step, suppressing the intercept and using as offset $\mathrm{Logit}\,\overline{Q}_{2,n}^{d,2}$, the logit of the initial predicted values for $t=2$ and $k=2$. This gives a fit for multivariate $\epsilon_{2}$. Denote this fit $\epsilon_{2,n}=\left(\epsilon_{2,n}^{\beta_{0}},\epsilon_{2,n}^{\beta_{1}},\epsilon_{2,n}^{\beta_{2}}\right)$.

• (e)

Generate $\overline{Q}_{2,n}^{d,2,\ast}$ by evaluating the logistic regression fit in the previous step at each $d\in \mathcal{D}$ among those subjects for whom $Y(1)=0$. For subject i and rule d, evaluate $\mathrm{Expit}\left(\mathrm{Logit}\left(\overline{Q}_{2,n}^{d,2}\left(\overline{L}_{i},d\right)\right)+\frac{\epsilon_{2,n}^{\beta_{0}}}{g_{0,n}\left(d(0)|L_{i}(0)\right)g_{1,n}\left(d(1)|L_{i}(0),d(0),L_{i}(1)\right)}+\frac{\epsilon_{2,n}^{\beta_{1}}\times 2}{g_{0,n}\left(d(0)|L_{i}(0)\right)g_{1,n}\left(d(1)|L_{i}(0),d(0),L_{i}(1)\right)}+\frac{\epsilon_{2,n}^{\beta_{2}}\times d(1)\left(2-s_{d}\right)}{g_{0,n}\left(d(0)|L_{i}(0)\right)g_{1,n}\left(d(1)|L_{i}(0),d(0),L_{i}(1)\right)}\right).$

For subjects with $Y\left(1\right)=1$, ${\stackrel{ˉ}{Q}}_{2,n}^{d,2,\ast }={\stackrel{ˉ}{Q}}_{2,n}^{d,2}=1$. This gives an updated vector ${\stackrel{ˉ}{Q}}_{2,n}^{d,2,\ast }$ of length $n×\left|\mathcal{D}\right|$.

• 3.

Generate a vector ${\stackrel{ˉ}{Q}}_{1,n}^{d,2,\ast }$ of length $n×\left|\mathcal{D}\right|$ for $t=2$, $k=1$:

• (a)

Fit a logistic regression of ${\stackrel{ˉ}{Q}}_{2,n}^{d,2,\ast }$ (generated in the previous step) on $L\left(0\right),A\left(0\right)$. Generate a predicted value for each subject and each $d\in \mathcal{D}$ by evaluating this regression fit at $A\left(0\right)=d\left(0\right)$. This gives a vector of initial values ${\stackrel{ˉ}{Q}}_{1,n}^{d,2}$ of length $n×\left|\mathcal{D}\right|$.

• (b)

For each subject $i=1,\dots ,n$ and each $d\in \mathcal{D}$, create the multidimensional weighted covariate as above, now for $k=1$: $h\left(d,t=2\right)\frac{\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t=2\right)}{{m}_{\mathrm{\beta }}\left(1-{m}_{\mathrm{\beta }}\right)}\frac{I\left({A}_{i}\left(0\right)=d\left(0\right)\right)}{{g}_{0,n}\left({O}_{i}\right)}.$

The following $3\times 3$ matrix would thus be generated for each subject i, with rows corresponding to switch at time 0, time 1, or do not switch: $\left(\begin{array}{ccc}\frac{1\times I\left(A_{i}(0)=1\right)}{g_{0,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=1\right)}{g_{0,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=1\right)}{g_{0,n}(O_{i})}\\ \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}\\ \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{2\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{0\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}\end{array}\right)$

Stack these matrices to create a matrix with $n×\left|\mathcal{D}\right|$ rows and one column for each dimension of $\mathrm{\beta }$.

• (c)

Fit a pooled logistic regression of $\overline{Q}_{2,n}^{d,2,\ast}$ (the updated fit generated in step 2) on these weighted covariates, suppressing the intercept and using as offset $\mathrm{Logit}\,\overline{Q}_{1,n}^{d,2}$, the logit of the initial predicted values for $t=2$ and $k=1$. This gives a fit for multivariate $\epsilon_{1}$. Denote this fit $\epsilon_{1,n}=\left(\epsilon_{1,n}^{\beta_{0}},\epsilon_{1,n}^{\beta_{1}},\epsilon_{1,n}^{\beta_{2}}\right)$.

• (d)

Generate ${\stackrel{ˉ}{Q}}_{1,n}^{d,2,\ast }$ by evaluating the logistic regression fit in the previous step at each $d\in \mathcal{D}$. For subject i and rule d, evaluate $\text{Expit}\left(\text{Logit}\left({\overline{Q}}_{1,n}^{d,2}\left({L}_{i}\left(0\right),d\left(0\right)\right)\right)+\frac{{ϵ}_{1,n}^{{\beta }_{0}}}{{g}_{0,n}\left(d\left(0\right)|{L}_{i}\left(0\right)\right)}+\frac{{ϵ}_{1,n}^{{\beta }_{1}}×2}{{g}_{0,n}\left(d\left(0\right)|{L}_{i}\left(0\right)\right)}+\frac{{ϵ}_{1,n}^{{\beta }_{2}}×d\left(1\right)\left(2-{s}_{d}\right)}{{g}_{0,n}\left(d\left(0\right)|{L}_{i}\left(0\right)\right)}\right).$

This gives an updated vector $\overline{Q}_{1,n}^{d,2,\ast}$ of length $n\times \left|\mathcal{D}\right|$.

• 4.

Generate a vector ${\stackrel{ˉ}{Q}}_{1,n}^{d,1,\ast }$ of length $n×\left|\mathcal{D}\right|$ for $t=1$, $k=1$:

• (a)

Fit a logistic regression of $Y\left(1\right)$ on $L\left(0\right),A\left(0\right)$. Generate a predicted value for each subject and each $d\in \mathcal{D}$ by evaluating this regression fit at $A\left(0\right)=d\left(0\right)$. This gives a vector of initial values ${\stackrel{ˉ}{Q}}_{1,n}^{d,1}$ of length $n×\left|\mathcal{D}\right|$.

• (b)

For each subject, $i=1,\dots ,n$, create a vector consisting of one copy of ${Y}_{i}\left(1\right)$ for each $d\in \mathcal{D}$. Stack these copies to create a single vector of length $n×\left|\mathcal{D}\right|$, denoted ${\stackrel{ˉ}{Q}}_{2,n}^{d,1,\ast }$.

• (c)

For each subject $i=1,\dots ,n$ and each $d\in \mathcal{D}$, create a new multidimensional weighted covariate, for $t=1,k=1$: $h\left(d,t=1\right)\frac{\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t=1\right)}{{m}_{\mathrm{\beta }}\left(1-{m}_{\mathrm{\beta }}\right)}\frac{I\left({A}_{i}\left(0\right)=d\left(0\right)\right)}{{g}_{0,n}\left({O}_{i}\right)}.$

The following $3\times 3$ matrix would thus be generated for each subject i, with rows corresponding to switch at time 0, time 1, or do not switch: $\left(\begin{array}{ccc}\frac{1\times I\left(A_{i}(0)=1\right)}{g_{0,n}(O_{i})}& \frac{1\times I\left(A_{i}(0)=1\right)}{g_{0,n}(O_{i})}& \frac{1\times I\left(A_{i}(0)=1\right)}{g_{0,n}(O_{i})}\\ \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{0\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}\\ \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{1\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}& \frac{0\times I\left(A_{i}(0)=0\right)}{g_{0,n}(O_{i})}\end{array}\right)$

Stack these matrices to create a matrix with $n×\left|\mathcal{D}\right|$ rows and one column for each component of $\mathrm{\beta }$.

• (d)

Fit a pooled logistic regression of $\overline{Q}_{2,n}^{d,1,\ast}$ (the $Y(1)$ vector) on these weighted covariates, suppressing the intercept and using as offset $\mathrm{Logit}\,\overline{Q}_{1,n}^{d,1}$, the logit of the initial predicted values for $t=1$ and $k=1$. This gives a fit for multivariate $\epsilon_{1}$. Denote this fit $\epsilon_{1,n}=\left(\epsilon_{1,n}^{\beta_{0}},\epsilon_{1,n}^{\beta_{1}},\epsilon_{1,n}^{\beta_{2}}\right)$.

• (e)

Generate ${\stackrel{ˉ}{Q}}_{1,n}^{d,1,\ast }$ by evaluating the logistic regression fit in the previous step at each $d\in \mathcal{D}$. For subject i and rule d, evaluate $\text{Expit}\left(\text{Logit}\left({\overline{Q}}_{1,n}^{d,1}\left({L}_{i}\left(0\right),d\left(0\right)\right)\right)+\frac{{ϵ}_{1,n}^{{\beta }_{0}}}{{g}_{0,n}\left(d\left(0\right)|{L}_{i}\left(0\right)\right)}+\frac{{ϵ}_{1,n}^{{\beta }_{1}}}{{g}_{0,n}\left(d\left(0\right)|{L}_{i}\left(0\right)\right)}+\frac{{ϵ}_{1,n}^{{\beta }_{2}}×d\left(0\right)\left(1-{s}_{d}\right)}{{g}_{0,n}\left(d\left(0\right)|{L}_{i}\left(0\right)\right)}\right).$

This gives an updated vector ${\stackrel{ˉ}{Q}}_{1,n}^{d,1,\ast }$ of length $n×\left|\mathcal{D}\right|$.

• 5.

The previous steps generated ${\stackrel{ˉ}{Q}}_{1,n}^{\ast }=\left({\stackrel{ˉ}{Q}}_{1,n}^{d,t,\ast }:d\in \mathcal{D},t=1,2\right)$. Stack these vectors to give a single vector with length equal to the number of subjects n times the number of rules $\left|\mathcal{D}\right|$ times the number of time points $\left(n×3×2\right)$. Fit a pooled logistic regression of ${\stackrel{ˉ}{Q}}_{1,n}^{\ast }$ on $\left(d,t\right)$ according to model ${m}_{\mathrm{\beta }}$ (eq. 7), with weights given by $h\left(d,t\right)$ (here equal to 1). This gives the pooled TMLE of the target parameter $\mathrm{\Psi }\left(Q\right)$.
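The update in steps (d) and (e) is a logistic regression with a fixed offset and no intercept. The following is a minimal sketch of that single update in Python with NumPy, not the implementation in the ltmle package. The function names `fit_fluctuation` and `update_predictions`, the toy inputs, and the use of a least-squares Newton step (chosen here so that collinear covariate columns do not break the solve) are all assumptions made for illustration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_fluctuation(y, q_init, H, n_iter=100, tol=1e-10):
    """Newton's method for eps in expit(logit(q_init) + H @ eps).

    No intercept is fitted; logit(q_init) enters as a fixed offset.
    """
    offset = logit(q_init)
    eps = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = expit(offset + H @ eps)
        score = H.T @ (y - p)                        # gradient of the log-likelihood
        info = H.T @ (H * (p * (1.0 - p))[:, None])  # Fisher information
        # lstsq tolerates collinear columns of H (minimum-norm Newton step)
        step = np.linalg.lstsq(info, score, rcond=None)[0]
        eps = eps + step
        if np.max(np.abs(step)) < tol:
            break
    return eps

def update_predictions(q_init, H, eps):
    """Updated predictions on the probability scale."""
    return expit(logit(q_init) + H @ eps)

# Toy inputs (hypothetical, for illustration only)
rng = np.random.default_rng(0)
m = 400
g = rng.uniform(0.3, 0.7, size=m)        # stand-in for g_n
followed = rng.binomial(1, g)            # stand-in for I(A_i(0) = d(0))
x = rng.binomial(1, 0.5, size=m)         # stand-in for a rule-level covariate
H = np.column_stack([followed / g, followed * x / g])
q_init = np.full(m, 0.3)                 # deliberately crude initial fit
y = rng.binomial(1, 0.5, size=m).astype(float)

eps_hat = fit_fluctuation(y, q_init, H)
q_star = update_predictions(q_init, H, eps_hat)
score_at_fit = H.T @ (y - q_star)        # close to zero at the fitted eps
```

At the fitted value the weighted residuals sum to approximately zero in each covariate direction, which is the sense in which the update solves its component of the score equation. Evaluating the update at every rule, as in step (e), would apply the fitted coefficients to the covariate without the indicator.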

## 5 Comparison with alternative estimators

In this section we compare the TMLE described above with several alternative estimators available for dynamic MSMs for survival: non-targeted substitution estimators, IPW estimators, and the stratified TMLE of Schnitzer et al. [32].

## 5.1 Non-targeted substitution estimator

The consistency of non-targeted substitution estimators of $\mathrm{\Psi }\left(Q\right)$ relies entirely on consistent estimation of the Q portions of the observed data likelihood. For estimators based on the parametric G-formula this requires correctly specifying parametric estimators for the conditional distributions of all non-intervention nodes given their parents [7, 11, 12, 41]. For the non-targeted estimator described in Section 3.4, this requires consistently estimating the iteratively defined conditional means $\stackrel{ˉ}{Q}\equiv \left({\stackrel{ˉ}{Q}}_{k}^{d,t}:d\in \mathcal{D},t\in \mathrm{\tau },k=1,\dots ,t\right)$.

Correct a priori specification of parametric models for $\stackrel{ˉ}{Q}$ in either case is rarely possible, rendering such non-targeted plug-in estimators susceptible to bias. Further, while machine learning methods, such as Super Learning, can be used to estimate Q non-parametrically, the resulting plug-in estimator has no theory supporting its asymptotic linearity, and will generally be overly biased for the target parameter $\mathrm{\beta }$ [21].

## 5.2 Inverse probability weighted estimators

The IPW estimator described in van der Laan and Petersen [2] and Robins et al. [26] is commonly used to estimate the parameters of a dynamic MSM. In brief, this estimator is implemented by creating one data line for each subject i, for each t, and for each d for which $\stackrel{ˉ}{A}\left(t-1\right)=d\left(\stackrel{ˉ}{L}\left(t-1\right)\right)$. Each data line consists of ${Y}_{i}\left(t\right)$, any functions of $\left(d,t,{V}_{i}\right)$ included as covariates in the MSM, and a weight $\frac{h\left(d,t,{V}_{i}\right)I\left({\stackrel{ˉ}{A}}_{i}\left(t-1\right)=d\left({\stackrel{ˉ}{L}}_{i}\left(t-1\right)\right)\right)}{{\prod }_{j=0}^{t-1}{g}_{n}\left({A}_{i}\left(j\right)|Pa\left({A}_{i}\left(j\right)\right)\right)}$. A weighted logistic regression is then fit, pooling over time and rules d.

The parameter mapping presented here for the TMLE suggests an alternative IPW estimator for dynamic MSM – namely, implement an IPW estimator for $E\left({Y}_{d}\right)$ (possibly within strata of V if V is discrete) for $d\in \mathcal{D},t$ and project these estimates onto ${m}_{\mathrm{\beta }}$. The IPW estimator employed could be either the standard Horvitz–Thompson estimator: $\frac{1}{n}{\sum }_{i=1}^{n}\frac{I\left({\stackrel{ˉ}{A}}_{i}\left(t-1\right)=d\left({\stackrel{ˉ}{L}}_{i}\left(t-1\right)\right)\right)}{\prod _{j=0}^{t-1}{g}_{n}\left({A}_{i}\left(j\right)|Pa\left({A}_{i}\left(j\right)\right)\right)}{Y}_{i},$ or its bounded counterpart (Robins and Rotnitzky [14]): ${\sum }_{i=1}^{n}\frac{I\left({\stackrel{ˉ}{A}}_{i}\left(t-1\right)=d\left({\stackrel{ˉ}{L}}_{i}\left(t-1\right)\right)\right)}{\prod _{j=0}^{t-1}{g}_{n}\left({A}_{i}\left(j\right)|Pa\left({A}_{i}\left(j\right)\right)\right)}{Y}_{i}/{\sum }_{i=1}^{n}\frac{I\left({\stackrel{ˉ}{A}}_{i}\left(t-1\right)=d\left({\stackrel{ˉ}{L}}_{i}\left(t-1\right)\right)\right)}{\prod _{j=0}^{t-1}{g}_{n}\left({A}_{i}\left(j\right)|Pa\left({A}_{i}\left(j\right)\right)\right)}.$
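The two estimators of a single intervention-specific mean can be written in a few lines. The sketch below is illustrative only: the function names and the four-subject toy data are assumptions, `followed` stands for the indicator that a subject's observed treatment history agrees with rule d, and `g_cum` stands for the cumulative product of the estimated treatment probabilities in the denominator.

```python
import numpy as np

def ipw_weights(followed, g_cum):
    """Indicator of following rule d divided by the cumulative product of g_n."""
    return np.asarray(followed, dtype=float) / np.asarray(g_cum, dtype=float)

def horvitz_thompson(y, followed, g_cum):
    """Standard Horvitz-Thompson estimator: mean of the weighted outcomes."""
    w = ipw_weights(followed, g_cum)
    return np.mean(w * np.asarray(y, dtype=float))

def bounded_ipw(y, followed, g_cum):
    """Bounded counterpart: weighted outcomes normalized by the sum of weights."""
    w = ipw_weights(followed, g_cum)
    return np.sum(w * np.asarray(y, dtype=float)) / np.sum(w)

# Hypothetical toy data: four subjects, binary outcome
y = [1, 0, 1, 1]
followed = [1, 1, 0, 1]
g_cum = [0.25, 0.25, 0.5, 0.5]

ht = horvitz_thompson(y, followed, g_cum)   # 1.5, outside [0, 1]
bd = bounded_ipw(y, followed, g_cum)        # 0.6, inside [0, 1]
```

In this toy example the Horvitz–Thompson estimate of a probability exceeds 1, while the normalized version is a weighted average of observed outcomes and therefore stays within their range.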

The consistency of both IPW estimators relies on having a consistent estimator ${g}_{n}$ of ${g}_{0}$; further, even if ${g}_{n}$ is estimated consistently, neither will be asymptotically efficient. Both also suffer from the general sensitivity of IPW estimators to strong confounding (data sparsity or near positivity violations). Standard IPW estimators for dynamic MSMs are typically more susceptible to instability in such settings than their counterparts for static MSMs, due to the limited ability to stabilize weights (with stabilizing function restricted to $h\left(d,t,V\right)$ versus $h\left({\stackrel{ˉ}{A}}_{i}\left(t-1\right),t,V\right)$).

## 5.3 Stratified targeted maximum likelihood estimator

Similar to the pooled TMLE, the stratified TMLE [32] also relies on reformulating the statistical target parameter in terms of iteratively defined conditional means and updating initial fits of these conditional means using covariates that are functions of an estimator ${g}_{n}$ of the intervention mechanism. The stratified and pooled TMLEs differ, however, in several respects. In particular, in the pooled TMLE the update step is accomplished by fitting a single multivariate $\epsilon$ for each time point t and non-intervention node k, pooling across all rules of interest $d\in \mathcal{D}$. In contrast, the stratified TMLE fits a separate $\epsilon$ for each time point t, non-intervention node k, and rule of interest $d\in \mathcal{D}$. Specifically, the stratified TMLE consists of implementing the longitudinal TMLE of van der Laan and Gruber [19] for $E\left({Y}^{d}\left(t\right)\right)$ separately for each time point t and each rule of interest $d\in \mathcal{D}$, and then combining these estimates into a fit of ${m}_{\mathrm{\beta }}$.

Let ${\stackrel{ˉ}{Q}}_{n}$ denote the initial estimator of the iteratively defined conditional means that forms the basis of the pooled and the stratified TMLEs. Let ${\stackrel{ˉ}{Q}}_{n}^{\ast }$ denote the targeted update of ${\stackrel{ˉ}{Q}}_{n}$ for the two estimators (noting that the update is accomplished differently for the pooled and stratified estimators). As long as their corresponding update ${Q}_{n}^{\ast }$ solves the efficient influence curve equation ${P}_{n}{D}^{\ast }\left({Q}_{n}^{\ast },{g}_{n}\right)=0$, both the stratified and the pooled estimators will share the desirable asymptotic properties of a TMLE. Both estimators will be consistent if either ${\stackrel{ˉ}{Q}}_{n}$ or ${g}_{n}$ is a consistent estimator of ${Q}_{0}$ or ${g}_{0}$, respectively. Further, both the pooled and the stratified estimators will be asymptotically efficient if both the initial estimator ${\stackrel{ˉ}{Q}}_{n}$ and the estimator ${g}_{n}$ are consistent.

The pooled and stratified estimators may nonetheless differ in both their asymptotic and finite sample performance. The stratified TMLE uses a more saturated model when updating ${\stackrel{ˉ}{Q}}_{n}$ than does the pooled TMLE. Thus if the initial estimator ${\stackrel{ˉ}{Q}}_{n}$ is misspecified, the update of this initial estimator will be more extensive (the update will be further from the initial misspecified estimator) for the stratified, as compared to the pooled, TMLE, resulting asymptotically in a ${Q}_{n}^{\ast }$ that is closer to the true ${Q}_{0}$ and thus improving efficiency (recall that the efficiency bound is achieved at ${Q}_{0},{g}_{0}$). The extent to which this asymptotic property translates into meaningful finite sample gains in settings where ${\stackrel{ˉ}{Q}}_{n}$ is misspecified remains to be investigated.

On the other hand, in some cases, it is no longer clear how to implement the stratified TMLE. For example, the target parameter may be defined using a MSM ${m}_{\mathrm{\beta }}\left(d,t,V\right)$, conditional on a subset of baseline covariates V. If V is discrete with adequate finite sample support at each value, the stratified TMLE can be applied by estimating $E\left({Y}^{d}\left(t\right)|V=v\right)$ within each stratum v. When V is continuous, however, or has levels which are not represented in a given finite sample, such an approach will break down for many choices of weight function $h\left(d,t,V\right)$. Similarly, whenever there are some rules $d\in \mathcal{D}$ with no support in a given finite sample, no data will be available to fit $\epsilon$ for some rules of interest, and the key update step will no longer be possible using the stratified TMLE. Note that such lack of support in finite samples can occur even when the assumption of positivity in the observed data distribution (3) is satisfied.

In cases where V is discrete and there is support for some but not all rules $d\in \mathcal{D}$ within each stratum of V, one option is to define a stratified quasi-TMLE using the initial fit ${\stackrel{ˉ}{Q}}_{n}^{d,t}$ for those rules, time points, and strata of V where no data are available to fit $\epsilon$. This estimator, implemented in the simulations below, remains defined even when not all rules of interest are supported within each stratum of V in a given sample. However, in such cases, the initial estimator ${\stackrel{ˉ}{Q}}_{n}$ is only partially updated, and thus ${Q}_{n}^{\ast }$ may no longer solve the efficient influence curve equation ${P}_{n}{D}^{\ast }\left({Q}_{n}^{\ast },{g}_{n}\right)=0$, even if ${g}_{n}$ is a consistent estimator of ${g}_{0}$. If the initial estimator ${\stackrel{ˉ}{Q}}_{n}$ is poor (for example, if it is a misspecified parametric model), the resulting estimator of $\mathrm{\beta }$ will be biased. In contrast, the pooled TMLE retains the ability to fit $\epsilon$ and thus update ${\stackrel{ˉ}{Q}}_{n}$ by pooling over rules d. In other words, both estimators rely on the theoretical positivity assumption on ${P}_{0}$ (3) for identifiability; however, they may respond differently to practical positivity violations in finite samples.

## 5.3.1 Numerical illustration

We use a simple simulation to illustrate a setting in which the positivity assumption holds, but many rules of interest have no support in a given finite sample. In this setting, the stratified estimator will be biased if the initial estimator ${\stackrel{ˉ}{Q}}_{n}$ is misspecified. In contrast, the pooled TMLE remains asymptotically linear if ${g}_{n}$ is a consistent estimator of ${g}_{0}$. Simulation studies comparing the performance of the pooled and stratified TMLEs as well as IPW estimators under more realistic scenarios are provided in the following section.

We implemented a simulation with observed data consisting of n i.i.d. copies of $O=\left(L\left(0\right),A=\left(A\left(0\right),\dots ,A\left(6\right)\right),Y\right)$, where $L\left(0\right)$ is a baseline covariate, $A\left(t\right)$ is a binary treatment assigned randomly at seven time points ($t=0,\dots ,6$), and Y is a binary outcome. The data for a given individual i were generated by drawing sequentially from the following distributions: $L\left(0\right)\sim N\left(0,1\right),$ $A\left(t\right)\sim Bern\left(p=0.6\right)\text{ for }t=0,\dots ,6,$ $Y|L\left(0\right),A\sim Bern\left(p=\mathrm{expit}\left(L\left(0\right)-4{\left(\frac{1}{7}\sum _{t=0}^{6}A\left(t\right)\right)}^{2}\right)\right).$ The target parameter was defined as the projection of ${E}_{0}\left({Y}_{d}:d\in \mathcal{D}\right)$ onto marginal structural working model ${m}_{\mathrm{\beta }}\left(d\right)={\mathrm{\beta }}_{0}+{\mathrm{\beta }}_{1}{\sum }_{t=0}^{6}d\left(t\right)$ according to eq. (5), with weight function $h\left(d\right)=1$, $\mathcal{D}$ equal to the ${2}^{7}=128$ possible values of A, and $d\left(t\right)$ denoting the treatment level $a\left(t\right)$ assigned by a given rule d at time t.
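The data generating process above is short enough to sketch directly, which also makes it easy to see how few of the 128 rules are supported at $n=128$. This is an illustrative Python sketch, not the authors' simulation code; the function name `simulate_illustration` and the random seed are arbitrary assumptions.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_illustration(n, rng):
    """Draw (L(0), A(0..6), Y) following the distributions described in the text."""
    L0 = rng.normal(0.0, 1.0, size=n)
    A = rng.binomial(1, 0.6, size=(n, 7))        # treatment randomized at 7 time points
    p_y = expit(L0 - 4.0 * A.mean(axis=1) ** 2)  # mean over the 7 time points, squared
    Y = rng.binomial(1, p_y)
    return L0, A, Y

rng = np.random.default_rng(2014)                # arbitrary seed
L0, A, Y = simulate_illustration(128, rng)

# Number of distinct treatment vectors (rules) observed in this sample
n_supported = np.unique(A, axis=0).shape[0]
n_unsupported = 2 ** 7 - n_supported
```

Because the 128 observed treatment vectors are drawn with replacement from 128 possible values, repeated vectors are essentially certain, so some rules are left with no observations in any given sample.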

Stratified and pooled TMLEs for $\mathrm{\beta }$ were implemented using estimators ${g}_{n}$ and ${\stackrel{ˉ}{Q}}_{n}$ based on intercept only logistic regressions; thus ${\stackrel{ˉ}{Q}}_{n}$ was an inconsistent estimator of ${\stackrel{ˉ}{Q}}_{0}$. Table 1 shows estimated bias, bias to standard error ratio, variance, mean squared error, and 95% confidence interval coverage (using the variance estimator described in Section 3.7) based on 500 samples of size $n=128$. Note that at this sample size many of the 128 rules of interest have no support in the data in a given sample, while the remainder have few observations available to fit $\epsilon$ in the stratified TMLE. As predicted by its double robust property, the pooled TMLE remains without meaningful bias despite use of a poor initial estimator ${\stackrel{ˉ}{Q}}_{n}$. In contrast, the stratified TMLE has bias of approximately double the magnitude of its standard error, posing a substantial threat to valid inference. The stratified TMLE also exhibits markedly lower variance than its pooled counterpart, explained by the fact that for those rules d without support in the data, the stratified estimator uses an intercept only model to estimate ${\stackrel{ˉ}{Q}}^{d}$. Although unbiased, the pooled TMLE provides anti-conservative confidence interval coverage; we return to this point below.

Table 1

Breakdown of stratified TMLE when some rules $d\in \mathcal{D}$ have no support

This breakdown of the stratified TMLE in settings with no support for some rules of interest will not occur if the function h is chosen to give a weight of 0 to any rule (or more generally, any $\left(d,V,t\right)$ combination) without support in the sample. For example, in the illustration above we could have defined $h\left(d\right)={P}_{0}\left(A=d\right)$ and estimated it using the empirical distribution. Unless one is willing to assume that the MSM ${m}_{\mathrm{\beta }}$ is correctly specified, however, choice of h changes the target parameter being estimated [27]. Further, even with this choice of weight function, the estimators may still exhibit different finite sample performance in settings with marginal data support. We investigate this possibility further in the following section.
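Estimating $h\left(d\right)={P}_{0}\left(A=d\right)$ by its empirical counterpart is a matter of counting how often each treatment vector occurs. The sketch below assumes static rules indexed by binary treatment vectors; the function name `empirical_rule_weights` and the four-subject, two-time-point data are hypothetical.

```python
from itertools import product

import numpy as np

def empirical_rule_weights(A):
    """Empirical proportion of subjects whose treatment vector equals each rule d."""
    A = np.asarray(A)
    n, n_times = A.shape
    counts = {}
    for row in A:
        key = tuple(int(a) for a in row)
        counts[key] = counts.get(key, 0) + 1
    # Rules never observed in the sample receive weight 0
    return {d: counts.get(d, 0) / n for d in product((0, 1), repeat=n_times)}

# Hypothetical observed treatment vectors for four subjects, two time points
A_obs = [[1, 0], [1, 0], [0, 1], [1, 1]]
h_n = empirical_rule_weights(A_obs)
# h_n == {(0, 0): 0.0, (0, 1): 0.25, (1, 0): 0.5, (1, 1): 0.25}
```

The unobserved rule `(0, 0)` drops out of the weighted projection, which is why this choice keeps the estimators defined, at the cost of changing the target parameter unless the working model is correct.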

## 6.1 Overview

In this section, we investigate the relative performance of the pooled TMLE, stratified TMLE, and IPW estimators for the parameters of a marginal structural working model. For each candidate estimator, we report bias, variance, MSE, and 95% confidence interval coverage estimates based on influence curve variance estimators. We note that our influence curve-based estimators assume the weight function $h\left(d,t\right)$ is known; if the weight function is estimated, the influence curve should be corrected for this additional estimated component. We investigate two basic data generating processes. Simulation 1 investigates a simple process, in which the effect of the longitudinal treatment (time to switch) is confounded by baseline variables only, the outcome is observed at a single time point, and there is no censoring. Simulation 2 introduces more realistic complexity, designed to resemble the data analysis presented in the following section.

## 6.2.1 Data generating process

We implemented a simulation with observed data consisting of n i.i.d. copies of $O=\left(L\left(0\right),A=\left(A\left(0\right),\dots ,A\left(K\right)\right),Y\right)$, where $L\left(0\right)$ is a baseline covariate, $A\left(t\right)$ is a binary treatment assigned at time point $t=0,\dots ,K$, and Y is a binary outcome. The data for a given individual were generated by drawing sequentially from the following distributions: $L\left(0\right)\sim N\left(0,1\right),$ $A\left(t\right)|L\left(0\right)\sim Bern\left(p=\mathrm{max}\left(\mathrm{min}\left(L\left(0\right)+0.5,0.62\right),0.38\right)\right)\text{ for }t=0,\dots ,K,$ $Y|L\left(0\right),A\sim Bern\left(p=\mathrm{expit}\left(L\left(0\right)-\frac{1}{K+1}{\sum }_{t=0}^{K}A\left(t\right)\right)\right).$ In order to investigate the impact of decreasing support in the data (practical violations or near violations of the positivity assumption), we considered two versions of this data generating process, with $K=2$ (Simulation 1a, lower bound on ${g}_{0}$ of 0.05) and $K=6$ (Simulation 1b, lower bound on ${g}_{0}$ of 0.001).
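As a rough check on the stated lower bounds, note that each per-time-point treatment probability is truncated to $[0.38, 0.62]$, so the probability of any full treatment vector is at least $0.38^{K+1}$. The Python sketch below simulates the process and computes that bound; it is an illustration under the distributions given in the text, with hypothetical function names and an arbitrary seed, not the authors' code.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def treatment_prob(L0):
    """P(A(t) = 1 | L(0)): L(0) + 0.5 truncated to [0.38, 0.62]."""
    return np.clip(L0 + 0.5, 0.38, 0.62)

def simulate_sim1(n, K, rng):
    """Draw (L(0), A(0..K), Y) following the distributions described in the text."""
    L0 = rng.normal(0.0, 1.0, size=n)
    p_a = treatment_prob(L0)
    # Same treatment probability at every time point, given L(0)
    A = (rng.random((n, K + 1)) < p_a[:, None]).astype(int)
    Y = rng.binomial(1, expit(L0 - A.mean(axis=1)))
    return L0, A, Y

def g0_lower_bound(K):
    """Each of the K + 1 factors of g_0 is at least 0.38."""
    return 0.38 ** (K + 1)

rng = np.random.default_rng(1)       # arbitrary seed
L0, A, Y = simulate_sim1(500, 2, rng)

bound_1a = g0_lower_bound(2)         # about 0.055
bound_1b = g0_lower_bound(6)         # about 0.0011
```

These values agree, after rounding, with the bounds of 0.05 and 0.001 quoted for Simulations 1a and 1b.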

## 6.2.2 Target parameter

The target parameter was defined as the projection of ${E}_{0}\left({Y}_{d}:d\in \mathcal{D}\right)$ onto marginal structural working model ${m}_{\mathrm{\beta }}\left(d\right)={\mathrm{\beta }}_{0}+{\mathrm{\beta }}_{1}{\sum }_{t=0}^{K}d\left(t\right)$ according to eq. (5), with $\mathcal{D}$ equal to the ${2}^{\left(K+1\right)}$ possible values of A, $d\left(t\right)$ equal to the treatment level assigned by a given rule d at time t, and weight function $h\left(d\right)={P}_{0}\left(A=d\right)$. In the case that some rules of interest had no support in a given sample, this choice of weight function (when estimated as the empirical proportion of subjects that followed rule d) ensured that the IPW estimator remained defined and that the updated fit ${Q}_{n}^{\ast }$ used by the stratified TMLE solved the efficient influence curve equation when ${g}_{n}$ was a consistent estimator of ${g}_{0}$.

## 6.2.4 Estimators

This is a static point treatment problem, and thus a number of additional estimators are available. However, we use this as a special case of longitudinal dynamic MSMs and investigate the relative performance of three estimators described in Section 5: the pooled TMLE, the stratified TMLE, and the standard IPW estimator. All estimators were implemented using two estimators of ${g}_{0}$: an estimator based on a correctly specified model for the conditional distribution of A given $L\left(0\right)$ and an estimator using an intercept only model. The estimators ${g}_{n}$ were bounded from below at 0.001. TMLEs were implemented using two estimators of ${\stackrel{ˉ}{Q}}_{0}$: an estimator based on main terms logistic regression models using the correct set of parents for a given node as independent variables (in a slight abuse, we refer to these as “correctly specified”) and an estimator using intercept only models. Performance was evaluated across 500 samples of size $n=500$; 95% confidence interval coverage was based on the variance estimator described in Section 3.7.

## 6.2.5 Results

Results for Simulation 1a are shown in Table 2. When both ${\stackrel{ˉ}{Q}}_{n}$ and ${g}_{n}$ were based on correctly specified models, all three estimators were unbiased, had similar variance, and achieved close to nominal coverage. Table 2 also demonstrates double robustness; when either ${\stackrel{ˉ}{Q}}_{n}$ or ${g}_{n}$ was based on a misspecified model, both TMLEs remained without meaningful bias. In this simulation, both TMLEs continued to achieve close to nominal coverage even when ${g}_{n}$ was an inconsistent estimator of ${g}_{0}$. In contrast and as expected, the IPW estimator was substantially biased with poor coverage when ${g}_{n}$ was based on an intercept only model.

Results for Simulation 1b, with both ${\stackrel{ˉ}{Q}}_{n}$ and ${g}_{n}$ based on correctly specified models, are shown in Table 3. In this simulation, in which the lower bound for ${g}_{0}$ is 0.001, the IPW estimator was minimally biased and retained good 95% confidence interval coverage. Interestingly, the performance of the stratified TMLE suffered in this setting, with bias approximately equal to the standard error, and 95% confidence interval coverage of 83% and 80% for ${\mathrm{\beta }}_{0}$ and ${\mathrm{\beta }}_{1}$, respectively. The pooled TMLE remained unbiased and retained good confidence interval coverage.

Table 2

Simulation 1a

Table 3

Simulation 1b

## 6.3 Simulation 2: resembling data analysis

In this simulation, we used a data generating process designed to resemble the data analysis presented in the following section, in which the goal was to investigate the effect of delayed switch to a new antiretroviral regimen on mortality among HIV-infected patients who have failed first line therapy. The data generating process thus contains both baseline and time-dependent confounders of a longitudinal binary treatment (time to switch), a repeated measures binary outcome (survival over time), and informative right censoring due to two causes (database closure and loss to follow up).

## 6.3.1 Data generating process

We implemented a simulation with observed data consisting of n i.i.d. copies of $O=\left(L\left(0\right),A\left(0\right),L\left(1\right),A\left(1\right),\dots ,L\left(K\right),A\left(K\right),Y\left(K+1\right)\right),$ for $K=9$. Here, $L\left(0\right)=\left(W,CD4\left(0\right)\right)$ and $L\left(t\right)=\left(Y\left(t\right),CD4\left(t\right)\right)$, where W was a non-time-varying baseline covariate ($W=\left({W}_{1},\dots ,{W}_{4}\right)$, representing baseline age, sex, and disease stage), $CD4\left(t\right)$ was a time-varying covariate representing most recent measured CD4 count at time t (square root transformed), and $Y\left(t\right)$ was an indicator of death by time t. The intervention nodes for a given time point t were $A\left(t\right)=\left({C}_{1}\left(t\right),{C}_{2}\left(t\right),{A}_{1}\left(t\right)\right)$, where ${C}_{1}\left(t\right)$ was an indicator of database closure by time t, ${C}_{2}\left(t\right)$ was an indicator of loss to follow up by time t, and ${A}_{1}\left(t\right)$ was an indicator of having switched to second line therapy by time t. In brief, the data for a given individual were generated by first drawing baseline characteristics W, then for each time point t, for as long as the subject remained alive and uncensored,

• 1.

Drawing a time updated CD4 count $CD4\left(t\right)$ given W, prior CD4 counts, and regimen status at the prior time point (${A}_{1}\left(t-1\right)$).

• 2.

Determining censoring due to database closure ${C}_{1}\left(t\right)$ using a Bernoulli trial with probability dependent on W.

• 3.

If still uncensored, determining censoring due to loss to follow up ${C}_{2}\left(t\right)$ using a Bernoulli trial with probability dependent on W, prior CD4 count and regimen status at the prior time point (${A}_{1}\left(t-1\right)$).

• 4.

If still uncensored and not yet switched, determining switching using a Bernoulli trial with probability dependent on W and prior CD4 count.

• 5.

Determining death using a Bernoulli trial with probability dependent on W, prior CD4 counts and regimen status $A\left(t\right)$.

The full data generating process is provided in Appendix D. Coefficients in the data generating process were chosen to approximate the degree of censoring, treatment, and death in the analysis data set, as detailed in Appendix D, Tables 7 and 9. The data generating process results in a true intervention mechanism ${g}_{0}$ not bounded away from 0.

## 6.3.2 Target parameter

The target parameter was defined as the projection of the counterfactual survival curve for each switch time, $\left(E\left({Y}_{d}\left(t\right)\right):d\in \mathcal{D},t=1,\dots ,10\right)$, onto marginal structural working model $\text{Logit}\,{m}_{\beta }\left(d,t\right)={\beta }_{0}+{\beta }_{1}t+{\beta }_{2}\left(d\left(t-1\right)\left(t-{s}_{d}\right)\right)$ according to eq. (5), where we use $d\left(t\right)$ to denote the value $a\left(t\right)$ assigned by rule d, ${s}_{d}$ to denote the switch time assigned by rule d, and with $\mathcal{D}$ consisting of each possible switch time, $\left\{0,\dots ,10\right\}$, combined with an intervention to prevent censoring. We used the following weight function: $h\left(d,t\right)=\frac{{P}_{0}\left(\stackrel{ˉ}{A}\left(t-1\right)=\stackrel{ˉ}{a}\left(t-1\right)\right)}{{n}_{t}}I\left(t<{t}^{\ast }\right),$ where ${n}_{t}$ was the number of unique values $\stackrel{ˉ}{a}\left(t-1\right)$ compatible with $d\in \mathcal{D}$, ${t}^{\ast }$ was the first time point at which all subjects had either died or been censored, and both ${P}_{0}\left(\stackrel{ˉ}{A}\left(t-1\right)=\stackrel{ˉ}{a}\left(t-1\right)\right)$ and ${t}^{\ast }$ were estimated with their empirical counterparts. This weight function gave 0 weight to rules without support in a given sample. It further avoided up-weighting specific values $\stackrel{ˉ}{a}\left(t-1\right)$ proportional to the number of rules (assigned switch times) that they were compatible with.

## 6.3.3 Estimators

We implemented the pooled TMLE, the stratified TMLE, the IPW estimator based on estimating $E\left({Y}^{d}\left(t\right):d\in \mathcal{D}\right)$ using the bounded Horvitz–Thompson estimator and projecting the resulting estimates onto the model ${m}_{\mathrm{\beta }}$ (referred to as “Stratified IPW”) and the standard IPW estimator for dynamic MSM (referred to as a “Standard IPW” estimator).

The intervention rules $d\in \mathcal{D}$ could be considered static, in that they assign a fixed vector of treatment decisions $\stackrel{ˉ}{a}$ irrespective of a subject’s covariate values. However, when the target parameter is defined using a MSM on survival, a static IPW estimator cannot be implemented in the standard way (fitting a weighted pooled regression of $Y\left(t\right)$ on observed treatment history $\stackrel{ˉ}{A}\left(t-1\right)$) because the full $\stackrel{ˉ}{A}\left(t-1\right)$ is not observed for subjects who die before time t. As noted by Picciotto et al. [42], one option, adopted by us here, is to instead define the interventions of interest as dynamic (switch at time ${s}_{d}$ if still alive).

All estimators were implemented using an estimator ${g}_{n}$ based on a correctly specified parametric model for ${g}_{0}$, but bounding the resulting estimates from below at 0.01 in order to ensure that the denominator in the covariate used in the updating step remained bounded away from 0. The TMLEs were implemented using estimators of ${\stackrel{ˉ}{Q}}_{0}$ based on main terms logistic regression models using the correct set of parents for a given node as independent variables (not equivalent to use of a correctly specified parametric model to estimate ${\stackrel{ˉ}{Q}}_{0}$). The performance of each estimator was evaluated across 500 samples of size $n=2,627$, corresponding to the sample size in the data analysis. 95% confidence interval coverage was based on the variance estimator described in Section 3.7; calculation of non-parametric bootstrap-based coverage was computationally prohibitive.

## 6.3.4 Results

Results for Simulation 2 are shown in Table 4. In this simulation, both the pooled and the stratified TMLEs were essentially unbiased for all coefficients, and the two TMLEs had comparable MSEs. Both TMLEs exhibited less than nominal 95% confidence interval coverage when using influence curve-based variance estimators. The anti-conservative performance of the influence curve-based variance estimator is likely due to the presence of practical positivity violations and relatively rare outcomes; the fact that the weight function was treated as known may also make a small contribution. Further work is needed to develop improved diagnostics and variance estimators in these settings.

Table 4

Simulation 2: resembling data analysis

In contrast, both IPW estimators were substantially biased for ${\mathrm{\beta }}_{2}$, which reflected the treatment effect, despite use of an estimator ${g}_{n}$ based on a correctly specified parametric model. Both IPW estimators also showed higher MSE for ${\mathrm{\beta }}_{2}$, and achieved 95% confidence interval coverage for ${\mathrm{\beta }}_{2}$ substantially below that of the TMLEs. This finding is consistent with the known susceptibility of IPW estimators to positivity violations and data sparsity, exacerbated by the limited ability when using a dynamic regime formulation to choose an effectively stabilizing weight function. Across simulations, the median minimum value of ${g}_{n}$ used by IPW prior to bounding at 0.01 was 0.000297; 1.55% of values of ${g}_{n}$ used by IPW were less than 0.01. Tables 8 and 10 provide further details on data support and number of events.

## 7 Data analysis

We analyzed data from the International Epidemiological Databases – Southern Africa in order to investigate the effect of switching to second line therapy on mortality among HIV-infected patients with immunological failure on first line antiretroviral therapy. The data set and clinical question are described in detail in Gsponer et al. [1]. In brief, data were drawn from clinical care facilities in Zambia and Malawi, in which HIV-infected patients were followed longitudinally in clinic and data were collected on baseline demographic and clinical variables (sex, age, and baseline disease stage), time-varying CD4 count, and time-varying treatment, summarized here as switch to second line therapy. Death was independently reported. The 2,627 subjects meeting WHO immunological failure criteria were included in the current analysis beginning at time of immunologic failure. Following common practice and prior analysis, time was discretized into 3-month intervals; time updated CD4 count was coded such that CD4 count for an interval preceded switching decisions in that interval. Data on a subject were censored at time of database closure or after four consecutive intervals without clinical contact.

The data structure and target parameter were identical to those described in Simulation 2, with $W=\left({W}_{1},{W}_{2},{W}_{3},{W}_{4}\right)$, ${W}_{1}$ equal to sex, ${W}_{2}$ and ${W}_{3}$ representing two levels of a three-level categorical age variable ($\phantom{\rule{negativethinmathspace}{0ex}}<30,30-39,\phantom{\rule{thickmathspace}{0ex}}\mathrm{a}\mathrm{n}\mathrm{d}\phantom{\rule{thickmathspace}{0ex}}>39$), and ${W}_{4}$ equal to disease stage. The analysis was implemented under the causal model assumed for Simulation 2; in particular assuming that monitoring times did not affect the outcome other than via effects on switching. We acknowledge that this assumption may not hold; however, relaxing it introduces a number of additional complications, as described in the Appendix. We implemented the estimators described in Simulation 2: pooled and stratified TMLEs and standard and stratified IPW estimators. Estimators of ${g}_{0}$ and ${\stackrel{ˉ}{Q}}_{0}$ were based on main term logistic regression models, analogous to Simulation 2. Given the results in Simulation 2 suggesting the poor performance of the influence curve-based variance estimator, we also estimated the variance using a non-parametric bootstrap.

Results are given in Table 5. The IPW estimates for the effect of switching on mortality (${\mathrm{\beta }}_{2}$) are close to zero and non-significant. Both TMLE point estimates suggest a 0.88 relative odds of death per 3-month earlier switch, and all except for the stratified TMLE combined with bootstrap-based variance estimation were significant at the $\mathrm{\alpha }=0.05$ level. Such a protective effect of switching is consistent with clinical knowledge. Interestingly, these results appear consistent with those of Simulation 2, which suggested that the IPW estimator was substantially positively biased, underestimating the harm of delayed switch, while both TMLEs performed well in terms of bias. In summary, our results in both the simulation and data analysis are consistent with the TMLEs controlling for measured confounders more completely than the corresponding IPW estimators.

The poor coverage observed in Simulation 2, despite absence of bias for the TMLEs, suggests that the influence curve-based variance estimators may be systematically underestimating the true variance in this analysis. While the non-parametric bootstrap offers an alternative approach, it is not expected to resolve the challenge of anti-conservative variance estimation in the setting of practical positivity violations. Intuitively, rare treatment/covariate combinations, despite being theoretically possible, may simply not occur in a given finite sample and as a result, the corresponding extreme weights implied by these combinations will not occur. Because the non-parametric bootstrap resamples from the same finite sample, it fails to address the underlying problem. Indeed, the bootstrap-based confidence intervals in the data analysis were slightly smaller than confidence intervals based on the influence curve. Thus in this realistic setting of rare outcomes and moderately strong confounding, our results caution against reliance on either approach to variance estimation, for either IPW or TML estimators, and suggest that additional work developing robust variance estimators in this setting is urgently needed.
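The point about the bootstrap can be illustrated with a small sketch. It is not the analysis code used in the paper: the toy data generator `simulate`, the single-time-point estimator `ipw_mean`, and the known treatment probability `g` are all assumptions made for illustration. The sketch resamples subjects with replacement and records the largest inverse probability weight in each resample, which by construction can never exceed the largest weight in the original sample.

```python
import numpy as np


def simulate(n=2000, seed=1):
    """Toy data (hypothetical): columns are A, Y, g with g = P(A = 1 | W)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)
    g = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * w)))  # treatment is rare for low w
    a = rng.binomial(1, g)
    y = rng.binomial(1, 0.1, size=n)              # relatively rare outcome
    return np.column_stack([a, y, g])


def ipw_mean(data):
    """Simple IPW estimate of the mean outcome under treatment."""
    a, y, g = data[:, 0], data[:, 1], data[:, 2]
    return np.mean(a * y / g)


def bootstrap(data, estimator, n_boot=500, seed=0):
    """Non-parametric bootstrap: resample rows with replacement."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = np.empty(n_boot)
    largest_weight = np.empty(n_boot)
    for b in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]
        estimates[b] = estimator(resample)
        # Largest weight A / g present in this resample
        largest_weight[b] = np.max(resample[:, 0] / resample[:, 2])
    return estimates, largest_weight


data = simulate()
estimates, largest_weight = bootstrap(data, ipw_mean, n_boot=200)
bootstrap_se = estimates.std(ddof=1)
observed_max = np.max(data[:, 0] / data[:, 2])
```

Comparing `largest_weight.max()` with `observed_max` shows that resampling cannot produce weights more extreme than those already observed, so variability due to unobserved rare treatment/covariate combinations is not reflected in `bootstrap_se`.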

Table 5

Data analysis

In addition to the issues raised above, limitations of the analysis include the potential for unmeasured confounding by factors such as unmeasured health status and adherence, as well as bias due to incomplete health reporting, resulting in censoring due to loss to follow up that directly depends on death [43, 44]. These results also do not contradict the previously published results of Gsponer et al. [1], which found a protective effect of switching using a static IPW estimator for a hazard MSM; such an IPW estimator might perform substantially better than the dynamic IPW estimator for a survival MSM implemented here.

## 8 Discussion

In summary, we have presented a pooled TMLE for the parameters of static and dynamic longitudinal MSMs that builds on prior work by Robins [13, 29] and Bang and Robins [22]. We evaluated the performance of this estimator using simulated data and applied it, together with alternatives, in a data analysis. Both theory and simulations suggest settings in which the pooled TMLE offers advantages over alternative estimators. Software implementing this estimator, together with competitors, is included in supplementary files and is publicly available as part of the R package ltmle (http://cran.r-project.org/web/packages/ltmle/).

The pooled TMLE presented in this paper, together with corresponding open source software, provides a new tool for estimation of the parameters of static or dynamic MSMs. It has clear theoretical advantages over available alternatives. Unlike IPW and augmented-IPW estimators, it is a substitution estimator. Unlike IPW estimators, it is double robust and asymptotically efficient depending on the initial estimators of g and Q. Unlike the previously proposed stratified TMLE, it does not require support in the data for every rule of interest and remains defined in the case that the target parameter is defined using a marginal structural working model conditional on a continuous baseline covariate.

In settings where some subset of discrete intervention rules has adequate support in a given sample, an alternative approach is to compare only this subset of rules [2]. However, in many settings smoothing over a large number of poorly supported rules is appropriate. For example, for a set of rules indexed by a continuous or multiple level ordered variable, smoothing over this variable provides a way to define a causal effect of interest despite inadequate support to estimate the counterfactual outcome under any of these rules individually.

The TMLE presented in this paper was developed for a causal model in which the non-intervention variables may be a function of the entire observed past ($Pa\left(L\left(k\right)\right)=\left(\stackrel{ˉ}{A}\left(k-1\right),\stackrel{ˉ}{L}\left(k-1\right)\right)$) and the intervention variables may be a function of some subset of the observed past ($Pa\left(A\left(k\right)\right)\subseteq \left(\stackrel{ˉ}{A}\left(k-1\right),\stackrel{ˉ}{L}\left(k\right)\right)$) – in other words, for a model in which exclusion restrictions were assumed, if at all, only for the intervention variables. In some cases, a causal model that also restricts the parent set of the non-intervention variables to a subset of the observed past may be appropriate. This smaller model is included in the larger model assumed in the current paper. As a result, while the TMLE developed for the larger model will still be valid, it will no longer be efficient.

Our simulations suggest that both stratified and pooled TMLEs may outperform both the IPW estimator typically used for dynamic MSMs, as well as an alternative “stratified” IPW estimator, in some settings with sparse data/near positivity violations. However, further work is needed to confirm this preliminary observation. Although the theory in this paper was developed for the general case of dynamic MSMs, including models of the time-specific hazard or survival functions, we focused our software implementation, examples, and simulations on MSMs for survival. The practical performance of the pooled TMLE relative to alternative estimators also remains to be investigated for the case of static and dynamic MSMs on the hazard. It also remains to implement and evaluate the relative performance of the alternative pooled TMLE described in Appendix C, in which the updating step pools not only over all rules of interest but also over all time points. Finally, the relative performance of the TMLE compared to double robust efficient estimating equation-based estimators for longitudinal MSM parameters, including those of Robins [13, 29] and Bang and Robins [22], remains to be evaluated.

Importantly, our simulations and data analysis illustrate the need for improved variance estimators for both TMLE and IPW in settings with moderately strong confounding, multiple time points, and relatively rare outcomes. Improved approaches to variance estimation and valid inference, as well as appropriate diagnostics to warn applied practitioners of settings in which coverage is likely to be poor, are crucial research priorities.

## Acknowledgments

This work was supported by NIH Grant #U01AI069924 (NIAID, NICHD, and NCI) (PIs: Egger and Davies), the Doris Duke Charitable Foundation Grant #2011042 (PI: Petersen), and NIH Grant #R01AI074345-06 (NIAID) (PI: van der Laan).

## Notation

Table 6

Partial list of notation

## The efficient influence curve of the statistical target parameter

We have the following theorem for a general parameter $\mathrm{\Psi }\left(P\right)=f\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}=\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}:d,t\right),{Q}_{L\left(0\right)}\right)$.

Theorem 1. Consider a parameter $\mathrm{\Psi }:\mathcal{M}\to {\mathbb{R}}^{J}$ that can be represented as $\mathrm{\Psi }\left(P\right)=f\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$, where ${\stackrel{ˉ}{Q}}_{L\left(1\right)}=\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}=E\left({Y}^{d}\left(t\right)|L\left(0\right)\right):d,t\right)$. Let $Q=\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$. Assume that $f$ is such that $\mathrm{\Psi }$ is pathwise differentiable. Let $f{}_{d,t}^{\mathrm{\prime }}\left(Q\right)\left(w\right)=\frac{d}{d{\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}\left(w\right)}f\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$ be the partial derivative of $f$ with respect to ${\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}\left(w\right)=E\left({Y}^{d}\left(t\right)|L\left(0\right)=w\right)$ at $Q$. Let ${D}_{L\left(0\right)}^{\ast }\left(Q\right)$ be the influence curve of $f\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right),n}\right)$ as an estimator of $f\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$, where ${Q}_{L\left(0\right),n}$ is the empirical distribution of ${L}_{i}\left(0\right)$, $i=1,\dots ,n$.

Then, the efficient influence curve of $\mathrm{\Psi }$ at $P$ can be represented as follows: ${D}^{\ast }\left(P\right)={D}_{L\left(0\right)}^{\ast }\left(Q\right)+{\sum }_{d,t}f{}_{d,t}^{\mathrm{\prime }}\left(L\left(0\right)\right){\sum }_{k=1}^{t}\frac{I\left(\stackrel{ˉ}{A}\left(k\right)={\stackrel{ˉ}{d}}_{k}\left(\stackrel{ˉ}{L}\left(k\right)\right)\right)}{{g}_{0:k}}\left({\stackrel{ˉ}{Q}}_{L\left(k+1\right)}^{d,t}-{\stackrel{ˉ}{Q}}_{L\left(k\right)}^{d,t}\right)$

Proof: The efficient influence curve of the parameter $E\left({Y}^{d}\left(t\right)|L\left(0\right)=w\right)$ (assuming discrete random variable $L\left(0\right)$) is given by ${D}_{d,t,w}^{\ast }=\frac{I\left(L\left(0\right)=w\right)}{{Q}_{L\left(0\right)}\left(w\right)}{\sum }_{k=1}^{t}\frac{I\left(\stackrel{ˉ}{A}\left(k\right)={\stackrel{ˉ}{d}}_{k}\left(\stackrel{ˉ}{L}\left(k\right)\right)\right)}{{g}_{0:k}}\left({\stackrel{ˉ}{Q}}_{L\left(k+1\right)}^{d,t}-{\stackrel{ˉ}{Q}}_{L\left(k\right)}^{d,t}\right)$ (appendix A3, van der Laan and Rose [20]). By the delta-method, the efficient influence curve of $f\left({\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)$ is thus given by ${D}^{\ast }={D}_{L\left(0\right)}^{\ast }+{\sum }_{w,d,t}f{}_{d,t}^{\mathrm{\prime }}\left(w\right){D}_{d,t,w}^{\ast }$ $={D}_{L\left(0\right)}^{\ast }+{\sum }_{w,d,t}f{}_{d,t}^{\mathrm{\prime }}\left(w\right)\frac{I\left(L\left(0\right)=w\right)}{{Q}_{L\left(0\right)}\left(w\right)}{\sum }_{k=1}^{t}\frac{I\left(\stackrel{ˉ}{A}\left(k\right)={\stackrel{ˉ}{d}}_{k}\left(\stackrel{ˉ}{L}\left(k\right)\right)\right)}{{g}_{0:k}}\left({\stackrel{ˉ}{Q}}_{L\left(k+1\right)}^{d,t}-{\stackrel{ˉ}{Q}}_{L\left(k\right)}^{d,t}\right)$ $={D}_{L\left(0\right)}^{\ast }+{\sum }_{d,t}f{}_{d,t}^{\mathrm{\prime }}\left(L\left(0\right)\right){\sum }_{k=1}^{t}\frac{I\left(\stackrel{ˉ}{A}\left(k\right)={\stackrel{ˉ}{d}}_{k}\left(\stackrel{ˉ}{L}\left(k\right)\right)\right)}{{g}_{0:k}}\left({\stackrel{ˉ}{Q}}_{L\left(k+1\right)}^{d,t}-{\stackrel{ˉ}{Q}}_{L\left(k\right)}^{d,t}\right)$. This completes the proof. □

In order to determine the partial derivative of the function $f$ and ${D}_{L\left(0\right)}^{\ast }$, the following is useful. Suppose that, as in our examples, $f\left(Q\right)=\mathrm{arg}\,{\mathrm{max}}_{\mathrm{\beta }}M\left(\mathrm{\beta },Q\right)$ for some function $M$, and suppose that $Q=\left(\stackrel{ˉ}{Q},{Q}_{L\left(0\right)}\right)$. Then $\mathrm{\beta }\left(Q\right)=f\left(Q\right)$ solves the equation $0=\frac{d}{d\mathrm{\beta }}M\left(\mathrm{\beta },Q\right)\equiv U\left(\mathrm{\beta },Q\right)$. By the implicit function theorem we have that $\frac{d}{dQ}\mathrm{\beta }\left(Q\right)=-{\left\{\frac{d}{d\mathrm{\beta }}U\left(\mathrm{\beta },Q\right)\right\}}^{-1}\frac{d}{dQ}U\left(\mathrm{\beta },Q\right).$ In particular, $\frac{d}{d\stackrel{ˉ}{Q}}\mathrm{\beta }\left(\stackrel{ˉ}{Q},{Q}_{L\left(0\right)}\right)=-{\left\{\frac{d}{d\mathrm{\beta }}U\left(\mathrm{\beta },Q\right)\right\}}^{-1}\frac{d}{d\stackrel{ˉ}{Q}}U\left(\mathrm{\beta },Q\right),$ and $\frac{d}{d{Q}_{L\left(0\right)}}\mathrm{\beta }\left(Q\right)=-{\left\{\frac{d}{d\mathrm{\beta }}U\left(\mathrm{\beta },Q\right)\right\}}^{-1}\frac{d}{d{Q}_{L\left(0\right)}}U\left(\mathrm{\beta },\stackrel{ˉ}{Q},{Q}_{L\left(0\right)}\right).$ We can now apply our general Theorem 1 to the example with $Y\left(t\right)\in \left[0,1\right]$ and the logistic regression working MSM in which $\mathrm{\beta }\left(Q\right)$ solves the equation $U\left(\mathrm{\beta },{\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)=0$ with $U\left(\mathrm{\beta },{\stackrel{ˉ}{Q}}_{L\left(1\right)},{Q}_{L\left(0\right)}\right)\equiv {E}_{{Q}_{L\left(0\right)}}{\sum }_{t}{\sum }_{d\in \mathcal{D}}{h}_{1}\left(d,t,V\right)\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}\left(L\left(0\right)\right)-{m}_{\mathrm{\beta }}\left(d,t,V\right)\right).$ The same equation applies for the linear working MSM but with the other definition of ${h}_{1}$ as mentioned above. Define $c\left(Q\right)\equiv {E}_{{Q}_{L\left(0\right)}}{\sum }_{t,d}{h}_{1}\left(d,t,V\right)\frac{d}{d\mathrm{\beta }}{m}_{\mathrm{\beta }}\left(d,t,V\right).$ Note that $-\frac{d}{d\mathrm{\beta }}U\left(\mathrm{\beta },Q\right)=c\left(Q\right)$. Thus, ${D}_{L\left(0\right)}^{\ast }\left(Q\right)=c{\left(Q\right)}^{-1}{\sum }_{t,d}{h}_{1}\left(d,t,V\right)\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}-{m}_{\mathrm{\beta }}\left(d,t,V\right)\right),$ and $f{}_{d,t}^{\mathrm{\prime }}\left(L\left(0\right)\right)=c{\left(Q\right)}^{-1}{h}_{1}\left(d,t,V\right).$ This proves the following corollary [13, 22, 29].

Corollary 1 Consider the target parameter $\mathrm{\Psi }\left(Q\right)$ defined by eq. (5). This target parameter is pathwise differentiable at P with efficient influence curve given by $\begin{array}{rl}{D}^{\ast }\left(P\right)=\phantom{\rule{thickmathspace}{0ex}}& c{\left(Q\right)}^{-1}{\sum }_{t,d}{h}_{1}\left(d,t,V\right)\left({\stackrel{ˉ}{Q}}_{L\left(1\right)}^{d,t}-{m}_{\mathrm{\beta }}\left(d,t,V\right)\right)\\ & +c{\left(Q\right)}^{-1}{\sum }_{t}{\sum }_{k=1}^{t}{\sum }_{d}{h}_{1}\left(d,t,V\right)\frac{I\left(\stackrel{ˉ}{A}\left(k\right)={\stackrel{ˉ}{d}}_{k}\left(\stackrel{ˉ}{L}\left(k\right)\right)\right)}{{g}_{0:k}}\left({\stackrel{ˉ}{Q}}_{L\left(k+1\right)}^{d,t}-{\stackrel{ˉ}{Q}}_{L\left(k\right)}^{d,t}\right)\end{array}$The efficient influence curve is double robust. In other words we have that $-{P}_{0}{D}^{\ast }\left(Q,{g}_{0}\right)=\mathrm{\Psi }\left(Q\right)-{\mathrm{\psi }}_{0}$, so that, in particular, if $\mathrm{\Psi }\left(Q\right)={\mathrm{\psi }}_{0}$, then ${P}_{0}{D}^{\ast }\left(Q,{g}_{0}\right)=0$. As a consequence, our TMLE $\mathrm{\Psi }\left({Q}_{n}^{\ast }\right)$ will be a consistent estimator of ${\mathrm{\psi }}_{0}$ if either ${Q}_{n}^{\ast }$ is consistent for ${Q}_{0}$ or ${g}_{n}$ is consistent for ${g}_{0}$ [38].
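The estimating equation $U\left(\mathrm{\beta },Q\right)=0$ and the normalizing matrix $c\left(Q\right)$ used in this derivation can be evaluated numerically. The following is a minimal sketch rather than the ltmle implementation. It assumes a logistic working model ${m}_{\mathrm{\beta }}=\mathrm{expit}\left({\mathrm{\beta }}^{\top }\mathrm{\phi }\right)$ and takes ${h}_{1}=h\,\mathrm{\phi }$, where the design array `phi` and the weight array `h` are names introduced here for illustration. Because $-\frac{d}{d\mathrm{\beta }}U=c\left(Q\right)$, a Newton step is $\mathrm{\beta }\leftarrow \mathrm{\beta }+c{\left(Q\right)}^{-1}U$.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def score_and_c(beta, q_bar, phi, h):
    """Empirical versions of U(beta, Q) and c(Q).

    q_bar: (n, R) values of Qbar^{d,t}(L_i(0)) for R (rule, time) pairs
    phi:   (n, R, J) working-model design (assumed form of the model)
    h:     (n, R) non-negative weights (assumed, with h1 = h * phi)
    """
    n = q_bar.shape[0]
    m = expit(phi @ beta)  # working-model predictions, shape (n, R)
    # U = E sum_{d,t} h1 * (Qbar - m_beta)
    u = np.einsum("ir,irj->j", h * (q_bar - m), phi) / n
    # c = E sum_{d,t} h1 * d/dbeta m_beta, with d/dbeta m_beta = m (1 - m) phi
    c = np.einsum("ir,irj,irk->jk", h * m * (1.0 - m), phi, phi) / n
    return u, c


def solve_working_msm(q_bar, phi, h, tol=1e-10, max_iter=50):
    """Newton iterations for U(beta, Q) = 0, using -dU/dbeta = c(Q)."""
    beta = np.zeros(phi.shape[2])
    for _ in range(max_iter):
        u, c = score_and_c(beta, q_bar, phi, h)
        step = np.linalg.solve(c, u)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy check: four rules indexed by switch time 0..3, intercept plus slope.
n = 200
switch_times = np.arange(4.0)
phi = np.zeros((n, switch_times.size, 2))
phi[:, :, 0] = 1.0
phi[:, :, 1] = switch_times
h = np.ones((n, switch_times.size))
beta_true = np.array([-2.0, 0.3])
q_bar = expit(phi @ beta_true)  # Qbar lies exactly on the working model
beta_hat = solve_working_msm(q_bar, phi, h)
u_hat, c_hat = score_and_c(beta_hat, q_bar, phi, h)
```

When `q_bar` lies exactly on the working model, as in this toy check, the projection returns the generating coefficients; otherwise `beta_hat` is the projection of `q_bar` onto the working model under the chosen weights.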

## An alternative pooled TMLE that only fits a single $\epsilon$ to compute the update

The TMLE described in the main text relies on a separate $\epsilon$ for each $k=1,\dots ,t$ and for each $t=1,\dots ,K+1$, resulting in a collection of ${\sum }_{t=1}^{K+1}t$ estimators of $\epsilon$ that define the TMLE. A nice feature of this TMLE is that it exists in closed form. The following alternative TMLE only relies on fitting a single $\epsilon$, but in this case the updating needs to be iterated until convergence. First construct an initial estimator ${\stackrel{ˉ}{Q}}_{n}^{0}$ of ${\stackrel{ˉ}{Q}}_{0}=\left({\stackrel{ˉ}{Q}}_{k,0}^{d,t}:k=1,\dots ,t,d\in \mathcal{D},t=1,\dots ,K+1\right)$ as described above. Now, consider the above-presented submodel $\left({\stackrel{ˉ}{Q}}_{n}^{0}\left(\epsilon \right)=\left({\stackrel{ˉ}{Q}}_{k,n}^{d,t}\left(\epsilon ,g\right)\right):\epsilon \right)$ through ${\stackrel{ˉ}{Q}}_{n}^{0}$ at $\epsilon =0$. Compute ${\epsilon }_{n}^{0}=\mathrm{arg}\,{\mathrm{min}}_{\epsilon }{\sum }_{t=1}^{K+1}{\sum }_{d\in \mathcal{D}}{\sum }_{k=1}^{t}{\mathcal{L}}_{d,t,k,{\stackrel{ˉ}{Q}}_{k+1,n}^{d,t,0}}\left({\stackrel{ˉ}{Q}}_{k,n}^{d,t,0}\left(\epsilon ,{g}_{n}\right)\right),$ where the nuisance parameters of the loss function are estimated with the initial estimator ${\stackrel{ˉ}{Q}}_{n}^{0}$. Note that ${\epsilon }_{n}^{0}$ can be fit with a pooled logistic regression as stated above. This yields an update ${\stackrel{ˉ}{Q}}_{n}^{1}={\stackrel{ˉ}{Q}}_{n}^{0}\left({\epsilon }_{n}^{0}\right)$. In general, at the $m$th step, given the estimator ${\stackrel{ˉ}{Q}}_{n}^{m}$, we compute ${\epsilon }_{n}^{m}=\mathrm{arg}\,{\mathrm{min}}_{\epsilon }{\sum }_{t=1}^{K+1}{\sum }_{d\in \mathcal{D}}{\sum }_{k=1}^{t}{\mathcal{L}}_{d,t,k,{\stackrel{ˉ}{Q}}_{k+1,n}^{d,t,m}}\left({\stackrel{ˉ}{Q}}_{k,n}^{d,t,m}\left(\epsilon ,{g}_{n}\right)\right),$ and the resulting update ${\stackrel{ˉ}{Q}}_{n}^{m+1}={\stackrel{ˉ}{Q}}_{n}^{m}\left({\epsilon }_{n}^{m},{g}_{n}\right)$. This updating process is iterated until ${\epsilon }_{n}^{m}\approx 0$. The resulting final update is denoted with ${\stackrel{ˉ}{Q}}_{n}^{\ast }$ and is the TMLE of ${\stackrel{ˉ}{Q}}_{0}$. By construction, we have that this TMLE also solves the efficient influence curve equation ${P}_{n}{D}^{\ast }\left({Q}_{n}^{\ast },{g}_{n}\right)=0$ with arbitrary precision. The TMLE of ${\mathrm{\psi }}_{0}$ is now computed with the corresponding plug-in estimator $\mathrm{\Psi }\left({\stackrel{ˉ}{Q}}_{n}^{\ast },{Q}_{L\left(0\right),n}\right)$, as above. The potential advantage of this alternative TMLE is that it is able to smooth across all time points $t$ and $k$ when computing the update, while the closed form TMLE presented above only smooths over the rules $d\in \mathcal{D}$.
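The statement that the converged update solves the efficient influence curve equation can be spelled out in a short calculation. The sketch below rests on two assumptions that are not restated in this appendix: that the submodel is a logistic fluctuation, and that the loss is the quasi-binomial log-likelihood. The symbol $H_{d,t,k}$ is shorthand introduced here for the product of any loss weight and the covariate multiplying the fluctuation parameter, taken to equal ${h}_{1}\left(d,t,V\right)I\left(\stackrel{ˉ}{A}\left(k\right)={\stackrel{ˉ}{d}}_{k}\left(\stackrel{ˉ}{L}\left(k\right)\right)\right)/{g}_{0:k}$ as in Corollary 1.

```latex
% Assumed fluctuation through the current fit at \epsilon = 0:
\operatorname{logit}\bar{Q}_{k,n}^{d,t,m}(\epsilon, g_n)
  = \operatorname{logit}\bar{Q}_{k,n}^{d,t,m} + \epsilon^{\top} H_{d,t,k}(g_n)

% Assumed loss, with \bar{Q}_{k+1,n}^{d,t,m} playing the role of the outcome:
\mathcal{L}_{d,t,k,\bar{Q}_{k+1}}(\bar{Q}_{k})
  = -\left\{ \bar{Q}_{k+1}\log \bar{Q}_{k}
           + (1-\bar{Q}_{k+1})\log(1-\bar{Q}_{k}) \right\}

% Derivative of the pooled empirical loss with respect to \epsilon:
\frac{d}{d\epsilon}\sum_{i=1}^{n}\sum_{t=1}^{K+1}\sum_{d\in\mathcal{D}}\sum_{k=1}^{t}
  \mathcal{L}_{d,t,k,\bar{Q}_{k+1,n}^{d,t,m}}
  \left(\bar{Q}_{k,n}^{d,t,m}(\epsilon, g_n)\right)(O_i)
  = -\sum_{i=1}^{n}\sum_{t,d,k} H_{d,t,k}(g_n)(O_i)
    \left\{ \bar{Q}_{k+1,n}^{d,t,m} - \bar{Q}_{k,n}^{d,t,m}(\epsilon, g_n) \right\}(O_i)

% If the minimizer is \epsilon_n^m = 0, the derivative vanishes at \epsilon = 0:
0 = \sum_{i=1}^{n}\sum_{t,d,k} H_{d,t,k}(g_n)(O_i)
    \left\{ \bar{Q}_{k+1,n}^{d,t,m} - \bar{Q}_{k,n}^{d,t,m} \right\}(O_i)
```

Under these assumptions the last display is, up to the factor $c{\left(Q\right)}^{-1}$, the empirical mean of the second term of ${D}^{\ast }$ in Corollary 1. The first term has empirical mean zero because the plug-in $\mathrm{\beta }$ solves $U\left(\mathrm{\beta },{\stackrel{ˉ}{Q}}_{n}^{\ast },{Q}_{L\left(0\right),n}\right)=0$, so both parts of ${P}_{n}{D}^{\ast }\left({Q}_{n}^{\ast },{g}_{n}\right)$ vanish once ${\epsilon }_{n}^{m}\approx 0$.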

## Overview

Below we describe the true data generating process used in Simulation 2 in greater detail. We also provide a summary table comparing the support and number of events over time in the simulated data and in the real data set it was designed to resemble.

In our presentation of the simulation below we have altered our notation slightly from that presented in the paper to match notation in the accompanying R code. Specifically, $L\left(t\right)$ is used to refer to time varying CD4 count $CD4\left(t\right)$. The observed data generated on a given subject consisted of $O=\left(W,Y\left(0\right),L\left(0\right),{C}_{1}\left(0\right),{C}_{2}\left(0\right),{A}_{1}\left(0\right),\dots ,Y\left(9\right),L\left(9\right),{C}_{1}\left(9\right),{C}_{2}\left(9\right),{A}_{1}\left(9\right),Y\left(10\right)\right)$. Further, the data generating process included a non-monotone monitoring process (denoted $M\left(t\right)$) designed to mimic when subjects come into clinic, have their CD4 counts measured, and have an opportunity to switch regimens. This adds several additional complexities. First, observed CD4 count, denoted here as $L\left(t\right)$ (and in the main text as $CD4\left(t\right)$), is only updated to reflect the true underlying CD4 count process when a patient is seen. Below, $\stackrel{ˆ}{L}\left(t\right)$ is used to denote the true underlying CD4 value. Subsequent CD4 values and death are functions of this true underlying value, while the intervention nodes are functions of the observed values only. Because both switching and monitoring are generated only in response to the observed past, however, this time-dependent non-monotone monitoring process is a multivariate instrumental variable, warranting its exclusion from the adjustment set. Further, its inclusion would both be expected to harm efficiency and introduce positivity violations (for example, subjects not seen in clinic at a given time point have zero probability of switch at that time point). The non-monotone monitoring process, while retained to mimic the data analysis, is thus omitted from the observed data and presentation in the main text in order to simplify discussion. Second, in accordance with common practice in clinical cohort data, censoring due to loss to follow up ${C}_{2}$ is defined deterministically based on not being seen in clinic for a certain number of consecutive time points. Finally, a subject can only switch treatment when seen (${A}_{1}\left(t\right)$ is only at risk of jumping when $M\left(t\right)=1$). As above, $W=\left({W}_{1},{W}_{2},{W}_{3},{W}_{4}\right)$ and $Y\left(t\right)$ denotes an indicator of death by time $t$.

## Data generating process

Data were generated for a given individual according to the following process, where ${\epsilon }_{1}$ and ${\epsilon }_{2}$ are draws from a standard normal distribution, and all binary variables were drawn from a Bernoulli distribution with the conditional probabilities given below. Data for a given subject were drawn sequentially until either $Y\left(t\right)$ jumped to 1, ${C}_{1}\left(t\right)$ jumped to 1, ${C}_{2}\left(t\right)$ jumped to 1, or $Y\left(10\right)$ was generated. Tables 7 and 8 compare the number of deaths in the simulated data (median of 501 samples) and actual data among patients following a given regime.

$P\left({W}_{1}=1\right)=0.3$

$P\left({W}_{2}=1|{W}_{1}=0\right)=0.5$

$P\left({W}_{3}=1\right)=0.5$

$P\left({W}_{4}=1\right)=0.3$

$P\left(Y\left(t\right)=1|W,\stackrel{ˆ}{L}\left(t-1\right),{A}_{1}\left(t-1\right)\right)=\left\{\begin{array}{ll}0,& \mathrm{if}\ t=0\\ \mathrm{expit}\left(-5.8-0.1{W}_{1}-0.1{W}_{2}+0.1{W}_{3}-0.2{W}_{4}-0.7\stackrel{ˆ}{L}\left(t-1\right)-0.9{A}_{1}\left(t-1\right)\right),& \mathrm{if}\ t>0\end{array}\right.$

$\stackrel{ˆ}{L}\left(t\right)=\left\{\begin{array}{ll}\mathrm{max}\left(\mathrm{min}\left({\epsilon }_{1}\left(t\right)-{W}_{4},4\right),-4\right),& \mathrm{if}\ t=0\\ \mathrm{max}\left(\mathrm{min}\left({\epsilon }_{1}\left(t\right)+0.1{W}_{1}-0.1{W}_{2}-0.1{W}_{3}-0.5{W}_{4}+0.9\stackrel{ˆ}{L}\left(t-1\right)+{A}_{1}\left(t-1\right),4\right),-4\right),& \mathrm{if}\ t>0\end{array}\right.$

$P\left(M\left(t\right)=1|W,L\left(t-1\right),{A}_{1}\left(t-1\right)\right)=\left\{\begin{array}{ll}1,& \mathrm{if}\ t=0\\ \mathrm{expit}\left(0.4+0.1{W}_{1}-0.2{W}_{2}+0.3{W}_{3}+0.1{W}_{4}-0.1L\left(t-1\right)+0.2{A}_{1}\left(t-1\right)\right),& \mathrm{if}\ t>0\end{array}\right.$

$L\left(t\right)=\left\{\begin{array}{ll}\stackrel{ˆ}{L}\left(t\right),& \mathrm{if}\ M\left(t\right)=1\\ L\left(t-1\right),& \mathrm{if}\ M\left(t\right)=0\end{array}\right.$

$P\left({C}_{1}\left(t\right)=1|W,L\left(0\right)\right)=1-\mathrm{expit}\left(2+0.1{W}_{1}+0.2{W}_{2}+0.1{W}_{3}+0.1{W}_{4}+0.1L\left(0\right)\right)$

${C}_{2}\left(t\right)=I\left(M\left(t-2\right)=0\ \mathrm{and}\ M\left(t-1\right)=0\ \mathrm{and}\ M\left(t\right)=0\right)$

$P\left({A}_{1}\left(t\right)=1|M\left(t\right),{A}_{1}\left(t-1\right),W,L\left(t\right)\right)=\left\{\begin{array}{ll}1,& \mathrm{if}\ t>0\ \mathrm{and}\ {A}_{1}\left(t-1\right)=1\\ 0,& \mathrm{if}\ t>0\ \mathrm{and}\ {A}_{1}\left(t-1\right)=0\ \mathrm{and}\ M\left(t\right)=0\\ \mathrm{expit}\left(-5+0.1{W}_{1}+0.1{W}_{2}+0.2{W}_{3}+0.2{W}_{4}-1.5L\left(t\right)+{\epsilon }_{2}\left(t\right)\right),& \mathrm{otherwise}\end{array}\right.$
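The process above can be sketched in a few lines of code. This is an illustrative Python translation, not the accompanying R code, and it makes several assumptions that are flagged in the comments: the death probability is taken to be the expit of the stated linear predictor, ${W}_{2}$ is drawn with probability 0.5 regardless of ${W}_{1}$, and monitoring indicators before time 0 are treated as "seen" when evaluating ${C}_{2}\left(t\right)$. The function name `simulate_subject` and the dictionary layout are hypothetical.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def clip(x, lo=-4.0, hi=4.0):
    return max(min(x, hi), lo)


def simulate_subject(rng, last_time=10):
    """Draw one subject's history; returns (W, list of per-time records)."""
    w1 = int(rng.binomial(1, 0.3))
    w2 = int(rng.binomial(1, 0.5))  # ASSUMPTION: 0.5 regardless of W1
    w3 = int(rng.binomial(1, 0.5))
    w4 = int(rng.binomial(1, 0.3))
    history, seen = [], []
    lhat_prev = l_prev = l0 = 0.0   # placeholders, overwritten at t = 0
    a_prev = 0
    for t in range(last_time + 1):
        # Death indicator Y(t); ASSUMPTION: linear predictor passed through expit
        if t == 0:
            y = 0
        else:
            y = int(rng.binomial(1, expit(
                -5.8 - 0.1 * w1 - 0.1 * w2 + 0.1 * w3 - 0.2 * w4
                - 0.7 * lhat_prev - 0.9 * a_prev)))
        if y == 1 or t == last_time:
            history.append({"t": t, "Y": y})
            break
        # True underlying CD4 value, truncated to [-4, 4]
        eps1 = rng.standard_normal()
        if t == 0:
            lhat = clip(eps1 - w4)
        else:
            lhat = clip(eps1 + 0.1 * w1 - 0.1 * w2 - 0.1 * w3 - 0.5 * w4
                        + 0.9 * lhat_prev + a_prev)
        # Monitoring M(t) and observed CD4 L(t)
        if t == 0:
            m = 1
        else:
            m = int(rng.binomial(1, expit(
                0.4 + 0.1 * w1 - 0.2 * w2 + 0.3 * w3 + 0.1 * w4
                - 0.1 * l_prev + 0.2 * a_prev)))
        seen.append(m)
        l_obs = lhat if m == 1 else l_prev
        if t == 0:
            l0 = l_obs
        # Censoring: C1 is random, C2 is deterministic given monitoring history
        c1 = int(rng.binomial(1, 1.0 - expit(
            2 + 0.1 * w1 + 0.2 * w2 + 0.1 * w3 + 0.1 * w4 + 0.1 * l0)))
        # ASSUMPTION: C2 cannot occur before three monitoring times exist
        c2 = int(t >= 2 and sum(seen[-3:]) == 0)
        record = {"t": t, "Y": y, "L": l_obs, "M": m, "C1": c1, "C2": c2}
        if c1 == 1 or c2 == 1:
            history.append(record)
            break
        # Switching A1(t): absorbing, and only possible when seen in clinic
        if t > 0 and a_prev == 1:
            a = 1
        elif t > 0 and m == 0:
            a = 0
        else:
            a = int(rng.binomial(1, expit(
                -5 + 0.1 * w1 + 0.1 * w2 + 0.2 * w3 + 0.2 * w4
                - 1.5 * l_obs + rng.standard_normal())))
        record["A1"] = a
        history.append(record)
        lhat_prev, l_prev, a_prev = lhat, l_obs, a
    return (w1, w2, w3, w4), history


rng = np.random.default_rng(2014)
subjects = [simulate_subject(rng) for _ in range(300)]
```

Each record holds the variables generated at one time point, in the order in which they appear in $O$; a history ends at the first death, the first censoring event, or $Y\left(10\right)$.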

Tables 9 and 10 compare the number of patients in the simulated data and actual data who are uncensored and following a given regime. In the data analysis, for example, in the first 3-month interval following immunologic failure (time = 1) there were no patient deaths among the 137 uncensored patients who switched immediately (switch time = 0) and 13 deaths among the 2,285 uncensored patients who did not switch immediately (switch time = 1). In the second 3-month interval following immunologic failure (time = 2), there was one patient death among the 120 uncensored patients who switched immediately (switch time = 0), no patient deaths among the 88 uncensored patients who switched during the first 3-month interval (switch time = 1) and 8 deaths among the 1,962 uncensored patients who did not switch immediately or during the first 3-month interval (switch time = 2).

Table 7

Events in data analysis

Table 8

Events in simulated data

Table 9

Support (# uncensored and following rule) in data analysis

Table 10

Support (# uncensored and following rule) in simulated data

## References

• 1.

Gsponer T, Petersen M, Egger M, Phiri S, Maathuis M, Boulle A, et al., and O. Keiser for IeDEA Southern Africa. The causal effect of switching to second line ART in programmes without access to routine viral load monitoring. AIDS 2012;26:57–65.

• 2.

van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat 2007;3(1):Article 3.

• 3.

Robins JM. Analytic methods for estimating HIV treatment and cofactor effects. In: Ostrow DG, Kessler R, editors. Methodological issues of AIDS Mental Health Research. New York: Plenum Publishing, 1993:213–90.

• 4.

Murphy SA, van der Laan MJ, Robins JM. Marginal mean models for dynamic regimes. J Am Stat Assoc 2001;96:1410–23.

• 5.

Hernan MA, Lanoy E, Costagliola D, Robins JM. Comparison of dynamic treatment regimes via inverse probability weighting. Basic Clin Pharmacol 2006;98:237–42.

• 6.

Cain LE, Robins JM, Lanoy E, Logan R, Costagliola D, Hernan MA. When to start treatment? A systematic approach to the comparison of dynamic regimes using observational data. Int J Biostat 2010;6(2):Article 18.

• 7.

Schomaker M, Egger M, Ndirangu J, Phiri S, Moultrie H, Technau K, et al. When to start antiretroviral therapy in children aged 2–5 years: A collaborative causal modelling analysis of cohort studies from Southern Africa. PLoS Med 2013;10:e1001555.

• 8.

Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60.

• 9.

Robins JM, Hernan MA. Estimation of the causal effects of time-varying exposures. In: Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G, editors. Advances in longitudinal data analysis. New York: Chapman and Hall/CRC Press, 2009:553–599.

• 10.

Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran ME, Berry D, editors. Statistical models in epidemiology, the environment, and clinical trials. New York: Springer, 2000:95–133.

• 11.

Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Model 1986;7:1393–512.

• 12.

Taubman SL, Robins JM, Mittleman MA, Hernan MA. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. Int J Epidemiol 2009;38:1599–611.

• 13.

Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association on Bayesian Statistical Science, 1999, 2000:6–10.

• 14.

Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell NP, Dietz K, Farewell VT, editors. AIDS epidemiology. Boston: Birkhäuser, 1992:297–331.

• 15.

Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: some questions and an answer”. Stat Sin 2001;11:920–36.

• 16.

Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On profile likelihood”. J Am Stat Assoc 2000;450:431–5.

• 17.

Rosenblum M, van der Laan MJ. Simple examples of estimating causal effects using targeted maximum likelihood estimation. UC Berkeley Division of Biostatistics Working Paper Series, Working Paper 209. Available at: http://biostats.bepress.com/jhubiostat/paper209, 2011.

• 18.

Stitelman OM, De Gruttola V, van der Laan MJ. A general implementation of TMLE for longitudinal data applied to causal inference in survival analysis. UC Berkeley Division of Biostatistics Working Paper Series, Working Paper 281. Available at: http://biostats.bepress.com/ucbbiostat/paper281, 2011.

• 19.

van der Laan MJ, Gruber S. Targeted minimum loss based estimation of causal effects of multiple time point interventions. Int J Biostat 2012;8(1):Article 8.

• 20.

van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. Berlin/Heidelberg/New York: Springer, 2011.

• 21.

van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat 2006;6(2):Article 2.

• 22.

Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–72.

• 23.

Robins JM. Marginal structural models. In: 1997 Proceedings of the American Statistical Association, Section on Bayesian Statistical Science, 1998:1–10.

• 24.

Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000;11:561–70.

• 25.

Petersen ML, van der Laan MJ, Napravnik S, Eron J, Moore R, Deeks S. Long term consequences of the delay between virologic failure of highly active antiretroviral therapy and regimen modification: a prospective cohort study. AIDS 2008;22:2097–106.

• 26.

Robins JM, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Stat Med 2008;27:4678–721.

• 27.

Neugebauer R, van der Laan M. Nonparametric causal effects based on marginal structural models. J Stat Plann Inference 2007;137:419–34.

• 28.

Petersen ML, Porter KE, Gruber S, Wang Y, van der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res 2012;21:31–54.

• 29.

Robins JM. Commentary on “using inverse weighting and predictive inference to estimate the effects of time-varying treatments on the discrete-time hazard”. Stat Med 2002;21:1663–80.

• 30.

Rosenblum M, van der Laan MJ. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat 2010;6(2):Article 19.

• 31.

Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder). J Am Stat Assoc 1999;94:1096–1120 (1121–1146).

• 32.

Schnitzer ME, Moodie EM, van der Laan MJ, Platt RW, Klein MB. Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation. Biometrics 2014;70(1):144–52.

• 33.

R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. Available at: http://www.R-project.org/. ISBN 3-900051-07-0.

• 34.

Schwab J, Lendle S, Petersen M, van der Laan M. LTMLE: longitudinal targeted maximum likelihood estimation, 2013. Available at: http://cran.r-project.org/web/packages/ltmle/. R package version 0.9.3.

• 35.

Pearl J. Causal diagrams for empirical research. Biometrika 1995;82:669–710.

• 36.

Pearl J. Causality: models, reasoning, and inference. Cambridge: Cambridge University Press, 2000.

• 37.

Bickel PJ, Klaassen CA, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press, 1993.

• 38.

van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Berlin/Heidelberg/New York: Springer, 2003.

• 39.

van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol 2007;6:Article 25.

• 40.

van der Laan MJ. Statistical inference when using data adaptive estimators of nuisance parameters. UC Berkeley Division of Biostatistics Working Paper Series, Working Paper 302, 2012. Available at: http://www.bepress.com/ucbbiostat/paper302

• 41.

Young JG, Cain LE, Robins JM, O'Reilly EJ, Hernan MA. Comparative effectiveness of dynamic treatment regimes: an application of the parametric g-formula. Stat Biosci 2011;3:119–43.

• 42.

Picciotto S, Hernan M, Page J, Young J, Robins J. Structural nested cumulative failure time models to estimate the effects of interventions. J Am Stat Assoc 2012;107:866–900.

• 43.

Geng EH, Glidden D, Bangsberg DR, Bwana MB, Musinguzi N, Metcalfe J, et al. Causal framework for understanding the effect of losses to follow-up on epidemiologic analyses in clinic based cohorts: the case of HIV-infected patients on antiretroviral therapy in Africa. Am J Epidemiol 2012;175:1080–7.

• 44.

Schomaker M, Gsponer T, Estill J, Fox M, Boulle A. Non-ignorable loss to follow-up: correcting mortality estimates based on additional outcome ascertainment. Stat Med 2014;33:129–42.

## About the article

Published Online: 2014-06-18

Published in Print: 2014-09-01

Citation Information: Journal of Causal Inference, Volume 2, Issue 2, Pages 147–185, ISSN (Online) 2193-3685, ISSN (Print) 2193-3677.

©2014 by De Gruyter.
