A commonly estimated causal effect parameter of survival data is the difference in marginal treatment-specific survival probabilities. That is, we wish to know the difference in the overall population survival proportion under two different treatment regimes. The estimation of treatment effects may be confounded by imbalances of certain covariates within each treatment group, and in most applications, right-censoring, which may be dependent on an observed time-dependent covariate process, poses an additional challenge to estimation. Consistent estimation of causal effects from right-censored survival data therefore requires proper adjustment for both confounding due to covariate imbalances in treatment groups and confounding due to time-dependent censoring.
With complex medical data, it is often preferred to work under a nonparametric statistical model. In light of this, many authors in the medical literature construct the Kaplan-Meier (KM) product limit estimator stratified within each treatment group. This procedure yields consistent estimation of the treatment-specific marginal survival probabilities if treatment assignment is independent of covariates and if right-censoring is independent of survival time. While covariate imbalance may sometimes be addressed via a randomized study design, i.e. a randomized controlled trial (RCT), this is not always possible. Further, there is concern that causal effect estimates from RCTs may not generalize to the clinical patient population in which approved drugs are routinely prescribed, because trial participants tend to be healthier and medical care provided in the RCT may differ from typical practice. The assumption of independence of censoring and survival times may not hold in practice because sicker subjects are more likely than healthy subjects to drop out of a study early, as continuing may be too difficult or unethical.
In the last two decades several estimators have been proposed that, under certain assumptions, achieve consistent estimation of causal effects in the presence of covariate imbalances in treatment groups or right-censoring that is explainable by baseline and time-dependent covariate processes. They are suitable for estimation of causal effects in RCTs or from observational data. As is the case with conventional estimators, the consistency of these estimators requires that survival time and censoring time be independent given the covariate process. In addition, their consistency relies on the assumptions that (1) the distribution of treatment assignment and censoring time can be estimated consistently as a function of the observed covariate process, and (2) the distribution of the outcome can be estimated consistently as a function of the observed covariate process. For example, the consistency of the Inverse Probability of Censoring Weighted (IPCW) estimator, which is constructed as the solution to the IPCW estimating equation, relies on (1). The consistency of Maximum Likelihood Estimation (MLE) substitution estimators relies on (2). Double robust inverse probability of censoring weighted estimators (DRIPCW) [2, 27] and Targeted Minimum Loss Estimators (TMLE) [4, 5] are doubly robust in the sense that they are consistent under (1) or (2), and they achieve the efficiency bound of the nonparametric model, i.e. have the highest possible precision amongst all regular estimators, if both (1) and (2) hold.
In this article, we implement the recent TMLE algorithm developed in van der Laan and Gruber for estimation of intervention-specific marginal means in general longitudinal data structures. This algorithm was based on key ideas from the double robust estimating equation methodology of Bang and Robins. In brief, the algorithm computes the estimator by exploiting the iterative law, i.e. the tower property, of conditional expectation. The implementation here is specific to right-censored survival data with time-dependent covariates. The marginal mean parameter corresponds to the cumulative probability of a terminal event by a specific time point. The right-censored survival data model imposes particular constraints on the probability distribution, which allows a more computationally efficient implementation than that for the general algorithm presented in van der Laan and Gruber. We apply the TMLE to the ATRIA-1 research study to estimate the additive causal effect of sustained warfarin therapy versus no warfarin therapy on the marginal cumulative probability of stroke or death by 1 year.
2 Data, model, and parameter
2.1 Data structure, O
Consider a right-censored survival data structure in which subject covariates are assessed at a baseline time point. After this assessment, the subjects are assigned a treatment and are then followed up until they experience some terminal event of interest or are censored, where censoring events may include being lost to follow-up, study drop out, or administrative censoring at the end of the study. Time-dependent covariates are collected throughout the follow-up period. These may be predictive of the survival/death outcome or may be predictive of censoring. It is convenient to provide a formal representation of this data as a random variable in a counting process framework. Assume that time points are discrete and are indexed by the time counter, $t$, that starts at 0 and increases by one unit until some ultimate time point of interest, $K + 1$.
Let $N(t)$ be the terminal event counting process. In brief, it is a random variable that takes value 0 until an event is observed. When an event is observed, the value switches to 1 and remains fixed for subsequent time points. We assume that $N(0) = 0$, i.e. each subject is alive at the beginning of follow-up. For reasons that will become apparent later we define $Y \equiv N(K+1)$ to be the value of the event process at the ultimate time point of interest. Let $L(t)$ be the covariate process, which includes both fixed and time-dependent covariates and is observed until the penultimate time K. We refer to the event and covariate processes collectively as $\big(N(t), L(t)\big)$, and for notational convenience we define $\bar N(t) = \big(N(0), \ldots, N(t)\big)$ and $\bar L(t) = \big(L(0), \ldots, L(t)\big)$ to be the histories of these processes up to time point t.
Let $A(t)$ be the treatment process. It is a random variable that takes value 0 if the subject is not treated at time t and 1 if he is treated. Let $C(t)$ be the counting process for censoring. Like the event counting process, this is a random variable that takes value 0 before the subject is censored and then becomes fixed to value 1 at the time when censoring is observed. The treatment and censoring processes, like the covariate process, are only observed up to the penultimate time K. We refer to the treatment and censoring processes collectively as $\big(A(t), C(t)\big)$, and we define $\bar A(t)$ and $\bar C(t)$ to be the histories of these processes.
By convention, we assume that within each time point the event and covariate processes precede the treatment and censoring processes, and when factorizing the likelihood at a particular time t we use the ordering $\big(N(t), L(t), A(t), C(t)\big)$. Our data can now be written as the longitudinal data structure $O = \big(L(0), A(0), C(0), N(1), L(1), A(1), C(1), \ldots, A(K), C(K), N(K+1)\big)$.
2.2 Factorization of the likelihood, $p_0(O)$
We assume that our data is a random variable with a particular likelihood, $p_0(O)$. This likelihood is the joint density of the random variables that make up O. As will be seen, it is useful to factorize $p_0(O)$ into orthogonal components indexed by the time-ordering t. Start by noting that the joint density of O can be factorized into the product of the time-specific conditional densities of the event, covariate, treatment, and censoring processes given the history:

$p_0(O) = \prod_{t} p_0\big(N(t) \mid Pa(N(t))\big)\, p_0\big(L(t) \mid Pa(L(t))\big)\, p_0\big(A(t) \mid Pa(A(t))\big)\, p_0\big(C(t) \mid Pa(C(t))\big),$

where $Pa(X)$ denotes the parents of $X$, i.e. all random variables that precede X. Note that this factorization separates into two groups of factors:

$Q_0 \equiv \prod_{t} p_0\big(N(t) \mid Pa(N(t))\big)\, p_0\big(L(t) \mid Pa(L(t))\big), \qquad g_0 \equiv \prod_{t} p_0\big(A(t) \mid Pa(A(t))\big)\, p_0\big(C(t) \mid Pa(C(t))\big).$

From here on out we will refer to the conditional densities of the event and covariate processes as Q-factors and the conditional densities of the treatment and censoring processes as g-factors of the likelihood.
We denote the conditional expectations of the event and covariate processes given the history with the bar notation, $\bar Q_N(t) \equiv E\big[N(t) \mid Pa(N(t))\big]$ and $\bar Q_L(t) \equiv E\big[L(t) \mid Pa(L(t))\big]$. Note that these conditional expectations are functions of the histories.
2.3 Statistical model, $\mathcal M$
At this point we must commit to a statistical model, $\mathcal M$, which encodes the set of assumptions we wish to impose on $p_0$. Formally, $\mathcal M$ represents a collection of probability distributions (or likelihoods) of which $p_0$ is the particular member that generates our data, O.
In general we would like to avoid strong distributional assumptions on $p_0$, thus allowing our statistical model, $\mathcal M$, to be as nonparametric as possible. There are, however, several constraints that are implicit in right-censored survival data. These are related to the fact that the time-dependent processes become degenerate after censoring or after an event has been observed.
After an event is observed:
The event process remains fixed at 1.
Censoring cannot be observed and remains fixed at 0.
The covariate and treatment processes can no longer be observed and are therefore fixed to a degenerate value, e.g. 0.
After censoring is observed:
The event cannot be observed and remains fixed at 0.
The censoring process remains fixed at 1.
The covariate and treatment processes can no longer be observed and are therefore fixed to a degenerate value, e.g. 0.
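These degeneracy conventions amount to a simple carry-forward rule. A minimal sketch (Python; the array names are illustrative, not the paper's notation, and degeneracy is applied from the time point after the jump onward):

```python
import numpy as np

def apply_degeneracy(N, C, L, A):
    """Enforce the conventions above from the time point after a jump onward.

    N, C : 0/1 event and censoring indicators over time points 0..K
    L, A : covariate and treatment values over the same time points
    (all names are illustrative, not the paper's notation).
    """
    N, C, L, A = (np.array(x) for x in (N, C, L, A))
    for t in range(1, len(N)):
        if N[t - 1] == 1:            # event already observed
            N[t], C[t], L[t], A[t] = 1, 0, 0, 0
        elif C[t - 1] == 1:          # censoring already observed
            N[t], C[t], L[t], A[t] = 0, 1, 0, 0
    return N, C, L, A
```

The degenerate value 0 for unobservable covariates and treatments follows the convention stated in the text.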
2.4 Statistical parameter
Our statistical parameter of interest, $\psi_0$, is the marginal cumulative event probability (or cumulative incidence) at the ultimate time point $K + 1$ according to a specific intervention, $\bar a$. Note that “intervention” here corresponds to both treatment and censoring processes. We define our parameter as a mapping, $\Psi$, applied to the probability distribution in the statistical model, so that $\psi_0 = \Psi(P_0)$.
Note that this parameter is only well-defined under the positivity assumption: at every time point t, the treatment and censoring values prescribed by the intervention must have positive conditional probability given the observed past. If some patient subgroup will never receive a particular treatment, then a marginal intervention-specific outcome that includes this subgroup would be nonsensical. Similarly, if a certain subgroup is always censored before the ultimate time point, then again a marginal parameter is not well-defined. If positivity holds, then we may proceed in identifying the parameter.
The parameter mapping may be identified as a function of the Q-factors of $p_0$, i.e. $\Psi(P_0) = \Psi(Q_0)$. This mapping is given below. For notational convenience here we drop the 0 subscript from the Q-factors. Start by recalling the definition $Y \equiv N(K+1)$, the value of the event process at the ultimate time point.
By the law of iterated conditional expectation, define $\bar Q_{K+2} \equiv Y$ and, recursively for $t = K+1, K, \ldots, 1$,

$\bar Q_t \equiv E\big[\bar Q_{t+1} \mid \bar N(t-1), \bar L(t-1), \bar A(t-1) = \bar a(t-1), \bar C(t-1) = 0\big],$

so that

$\Psi(Q) = E\big[\bar Q_1\big],$

where the outer expectation is taken over the marginal distribution of the baseline covariates.
As noted above, the true parameter, $\psi_0$, is simply this same mapping applied to the true distribution, i.e. $\psi_0 = \Psi(Q_0)$. This representation makes it clear that our parameter is a function of the Q-factors only through iterated conditional means at each time point. This is very convenient from an estimation perspective. In general, the estimation of conditional means is much easier than the estimation of conditional densities.
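The iterated-mean representation can be checked numerically in a toy one-time-point example. The following sketch (Python, with a hypothetical data-generating distribution) verifies that averaging the stratum-specific conditional means recovers the marginal mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-covariate example: L0 ~ Bernoulli(0.5),
# Y | L0 ~ Bernoulli(expit(-1 + L0)).
n = 100_000
L0 = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + L0))))

# Innermost conditional mean E[Y | L0], then its marginal mean:
Qbar1 = np.where(L0 == 1, Y[L0 == 1].mean(), Y[L0 == 0].mean())
direct = Y.mean()        # E[Y]
iterated = Qbar1.mean()  # E[ E[Y | L0] ]
```

In the longitudinal setting the same idea is applied recursively, one time point at a time, which is exactly what the estimation algorithm of Section 3 exploits.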
2.5 Efficient influence curve
The efficient influence curve, $D^*(P_0)$, is a functional of the density of the data that plays a central role in robust estimation in nonparametric statistical models. The mean of $D^*(P_0)$ is 0, and the variance of $D^*(P_0)$ divided by the sample size defines the efficiency bound, i.e. the minimal variance that may be achieved by regular asymptotically linear estimators.
Let

$g_{0:t} \equiv \prod_{k=0}^{t} p_0\big(A(k) \mid Pa(A(k))\big)\, p_0\big(C(k) \mid Pa(C(k))\big)$

be the probability of treatment and censoring by time t conditional on the history, and let $\bar Q_t$, $t = 1, \ldots, K+2$ (with $\bar Q_{K+2} \equiv Y$), denote the iterated conditional expectations of the previous section. The efficient influence curve for the parameter mapping of the previous section is then:

$D^*(P)(O) = \sum_{t=1}^{K+1} \frac{I\big(\bar A(t-1) = \bar a(t-1), \bar C(t-1) = 0\big)}{g_{0:t-1}}\,\big(\bar Q_{t+1} - \bar Q_t\big) + \bar Q_1 - \Psi(P).$

This efficient influence curve was originally presented by Bang and Robins in the context of estimating equations and was derived in the context of TMLE in the recent article of van der Laan and Gruber. When the Q- and g-factors of the likelihood are estimated consistently, the TMLE solves an efficient (targeted minimum loss) score equation and also solves the efficient influence curve estimating equation, $\sum_{i=1}^{n} D^*(\hat P)(O_i) = 0$.
2.6 Causal treatment-specific mean parameter
We now move from the statistical parameter to the causal parameter. Here we assume the existence of a counterfactual data distribution indexed by a particular intervention and then define the causal parameter as a marginal mean over this counterfactual distribution.
Consider an intervention that corresponds with a fixed intervention on the treatment and censoring processes. For example, we may choose

$a(t) = 1 \text{ and } c(t) = 0 \text{ for all } t = 0, \ldots, K,$

corresponding to a fixed treatment and no censoring up to time point K. Setting the values of all intervention nodes according to this intervention results in a data structure with a new probability density, $P_{\bar a}$. Our causal parameter is now defined as a mapping applied to the intervention-specific probability distribution: the marginal mean, under $P_{\bar a}$, of the event indicator at the ultimate time point.
Our observed data O is not generated by the counterfactual distribution of the data under the intervention, but rather from the observed data distribution $P_0$. Fortunately, under certain assumptions we are able to identify the causal quantity as the mapping applied to $P_0$ described in the previous section. Positivity is required for the same reasons discussed previously. Also required for the identification of the causal parameter from $P_0$ is the consistency assumption, i.e. that the observed outcome equals the counterfactual outcome corresponding with the treatment and censoring history we actually observe, and the sequential randomization assumption, which requires that at each time point t the treatment and censoring indicators are conditionally independent of the counterfactual outcomes given the observed past. Note that the latter encompasses the usual assumption that censoring time is independent of survival time given covariates, and the heuristic translation is that there should be no unmeasured confounding variables. This assumption is untestable in practice, and it therefore represents a point of controversy in causal effect estimation. However, even if this assumption is not valid, we may still arrive at an interesting statistical parameter, best described as the association of the intervention with the outcome adjusted for a subset of confounders. If these assumptions are met, then the causal parameter equals the statistical parameter, $\Psi(P_0)$, and consistent estimation of the statistical parameter is equivalent to consistent estimation of the causal parameter.
3 Targeted minimum loss estimation
TMLE was first introduced by van der Laan and Rubin. TMLE is a two-stage substitution estimator for a finite-dimensional parameter represented as a mapping from the density of the data to a vector of real numbers.
The first stage of the TMLE is to construct an estimate, $\hat Q^0$, of the relevant portion, $Q_0$, of the density of the data. An increasingly popular approach here is the use of loss-based machine learning algorithms that aim to minimize a global loss function for the (typically high-dimensional) density. The Super Learner [3, 10] is an example of a loss-based machine learning algorithm that relies on cross validation. This methodology has been explored and utilized in a number of studies [11–14] and has been made available as an R package. Substitution of $\hat Q^0$ for $Q_0$ yields a typical minimum loss-based estimator, $\Psi(\hat Q^0)$.
The second stage of the TMLE is the “targeting” stage. Recall that our initial estimator was based on the minimization of a global loss function for the (typically high-dimensional) density of the data. As such, this estimator is unlikely to be optimal for, and will almost certainly be biased for, our finite-dimensional target parameter. Thus the goal of the targeting step is to update our initial density estimate in such a way that we reduce bias and minimize the loss for our parameter of interest.
TMLE achieves this by fitting a least favorable parametric submodel that takes the initial density estimator as an offset and includes a (possibly multivariate) covariate. Heuristically, the covariate defines a direction in which we must stretch our initial density estimator in order to remove bias for the parameter of interest, and it is therefore often called the “clever covariate.” Typically, the clever covariate involves the treatment assignment or censoring mechanisms, so its construction usually requires an estimator of these factors of the likelihood. The parameter of the submodel represents the magnitude of the stretching required to reduce the bias in our initial estimate and can usually be estimated with standard regression software. The submodel estimate defines an updated density fit, $\hat Q^1$, and the TMLE is defined by the substitution, $\Psi(\hat Q^1)$.
TMLEs have been derived for several parameters including marginal means and causal effects parameters in point treatment data [5, 16–18, 26], right-censored survival data [19, 20], longitudinal data structures with time-dependent covariates [6, 16, 17], and case-control settings. TMLE has also been used to estimate variable importance measures and for prediction calibration.
The TMLE presented here is a special case of the general algorithmic template given in van der Laan and Gruber for the estimation of intervention-specific means in general longitudinal data structures. The latter was inspired in part by the estimating equation methods proposed by Bang and Robins. The particular TMLE implementation discussed here is appropriate for right-censored survival data structures with a baseline treatment intervention and a covariate process that may include both fixed and time-dependent covariates. It allows for both treatment and censoring mechanisms that are dependent on either fixed or time-dependent covariates or on time itself.
3.1 TMLE algorithm
Van der Laan and Gruber introduced the TMLE algorithm for the estimation of marginal means in the nonparametric statistical model for general longitudinal data structures. Here we outline a particular implementation of this algorithm that exploits additional information in the nonparametric statistical model for right-censored survival data.
3.1.1 Step K + 1:
Initial Estimator: Propose an initial estimator, $\bar Q^0_{K+1}$, of the innermost conditional expectation, $\bar Q_{K+1} = E\big[Y \mid \bar N(K), \bar L(K), \bar A(K) = \bar a(K), \bar C(K) = 0\big]$. While one may be tempted to fit a regression straight away, some simple algebra allows more insight. Note that

$\bar Q_{K+1} = N(K) + \big(1 - N(K)\big)\, P\big(N(K+1) = 1 \mid N(K) = 0, \bar L(K), \bar A(K) = \bar a(K), \bar C(K) = 0\big).$
The first term reflects our knowledge that if the failure occurred at or before time K then the event counting process is fixed at value 1 for all subsequent time points with probability 1. This implies that no update is required for observations with $N(K) = 1$. The second term is the conditional expectation of the event process for persons who have not yet had the event and (due to the intervention settings) are not censored. In other words, this term corresponds to the intensity or hazard at time $K + 1$. A reasonable approach to constructing the initial estimator here might include estimation of the intensity/hazard by smoothing over all time points and substituting the estimate at time $K + 1$.
Parametric submodel:

$\operatorname{logit} \bar Q_{K+1}(\epsilon) = \operatorname{logit} \bar Q^0_{K+1} + \epsilon H_{K+1},$

where the logit transform of our initial estimator, $\operatorname{logit} \bar Q^0_{K+1}$, is used as an offset. The score of this submodel at $\epsilon = 0$ is $H_{K+1}\big(Y - \bar Q^0_{K+1}\big)$, where $H_{K+1} = I\big(\bar A(K) = \bar a(K), \bar C(K) = 0\big)/g_{0:K}$, i.e. the inverse probability of censoring weights constructed from our estimate of the g-factors of the likelihood. $H_{K+1}$ is called the “clever covariate” because it is constructed so that the above score at $\epsilon = 0$ spans the $t = K + 1$ component of the efficient influence curve. The parameter estimate $\hat\epsilon$ is computed through minimization of the log-loss function

$-\frac{1}{n}\sum_{i=1}^{n}\Big[Y_i \log \bar Q_{K+1}(\epsilon)(O_i) + (1 - Y_i)\log\big(1 - \bar Q_{K+1}(\epsilon)(O_i)\big)\Big].$
Standard software for logistic regression is used to fit this submodel.
Targeted update: The targeted update is then computed by plugging our estimate, $\hat\epsilon$, into the parametric submodel:

$\bar Q^1_{K+1} = \operatorname{expit}\big(\operatorname{logit} \bar Q^0_{K+1} + \hat\epsilon H_{K+1}\big).$

This solves the $t = K + 1$ component of the efficient influence curve estimating equation.
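The fluctuation just described (offset logit, clever covariate, log-loss minimization) can be sketched directly. The inputs below are hypothetical, and a few Newton iterations stand in for the call to logistic regression software:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def fluctuate(Q_init, H, Y, max_iter=200, tol=1e-10):
    """One fluctuation step (illustrative; names are not the paper's).

    Q_init : initial conditional event probabilities,
    H      : clever covariate (inverse probability of treatment/censoring),
    Y      : outcome regressed at this step (binary here, may be in [0,1]).
    Fits logit Q(eps) = logit(Q_init) + eps * H by maximizing the Bernoulli
    log-likelihood over eps with Newton steps (standing in for glm software).
    """
    off = logit(np.clip(Q_init, 1e-6, 1 - 1e-6))
    eps = 0.0
    for _ in range(max_iter):
        p = expit(off + eps * H)
        score = np.sum(H * (Y - p))        # d log-lik / d eps
        info = np.sum(H**2 * p * (1 - p))  # observed information
        if info == 0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return expit(off + eps * H), eps
```

By construction the updated fit solves the score equation, i.e. the sum of the clever covariate times the residuals is (numerically) zero, which is exactly the targeting property described above.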
3.1.2 Step K:
Initial Estimator: Now we must construct an initial estimator, $\bar Q^0_K$, of the iterated conditional expectation, i.e. the conditional expectation (at time K) of the targeted conditional expectation at time $K + 1$ given the history up to time $K - 1$. Note that

$\bar Q_K = N(K-1) + \big(1 - N(K-1)\big)\, E\big[\bar Q^1_{K+1} \mid N(K-1) = 0, \bar L(K-1), \bar A(K-1) = \bar a(K-1), \bar C(K-1) = 0\big].$
Here again the first term reflects our knowledge that if the failure occurred at or before time $K - 1$ then the event counting process is fixed at value 1 for all subsequent time points with probability 1. We do not need to update this value in the targeted maximum likelihood step. Like the previous step, the second term here corresponds to people who have not yet had the event. Unlike the previous step, however, it is no longer the hazard or intensity at time K. It can instead be thought of as a random variable that consists of predicted probabilities. This initial estimator is constructed from a regression of the previously targeted fit, $\bar Q^1_{K+1}$, on the history up to time $K - 1$.
Parametric submodel:

$\operatorname{logit} \bar Q_K(\epsilon) = \operatorname{logit} \bar Q^0_K + \epsilon H_K,$

where $\operatorname{logit} \bar Q^0_K$ is an offset, and $H_K = I\big(\bar A(K-1) = \bar a(K-1), \bar C(K-1) = 0\big)/g_{0:K-1}$ is the covariate. The score of this submodel at $\epsilon = 0$ is $H_K\big(\bar Q^1_{K+1} - \bar Q^0_K\big)$. The parameter estimate $\hat\epsilon$ is computed through minimization of the log-loss function, now with the targeted fit $\bar Q^1_{K+1}$ from the previous step playing the role of the outcome.
Most standard software packages for logistic regression can be applied to data with outcomes in the range [0,1] and can therefore be used to fit this submodel.
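The reason such software applies is that the Bernoulli log-loss remains a valid loss function when the “outcome” is itself a probability in [0,1]: its minimizing constant prediction is still the mean of the outcomes. A small numerical check (values hypothetical):

```python
import numpy as np

# "Outcomes" that are themselves predicted probabilities, not 0/1
# (values hypothetical):
y = np.array([0.2, 0.7, 0.9, 0.4])

def logloss(p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# The minimizing constant prediction coincides with the mean of y,
# just as it would for binary outcomes:
grid = np.linspace(0.01, 0.99, 981)
best = grid[np.argmin([logloss(p) for p in grid])]
```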
Targeted update: The targeted update is then given by

$\bar Q^1_K = \operatorname{expit}\big(\operatorname{logit} \bar Q^0_K + \hat\epsilon H_K\big).$

This solves the $t = K$ component of the efficient influence curve estimating equation.
3.1.3 Steps K − 1 through 1
The process for Step K is repeated backward for every remaining time point $t = K - 1, \ldots, 1$, conditioning at each step on the history up to time $t - 1$.
3.1.4 The final step
The final step is to take the empirical mean of the targeted conditional expectations $\bar Q^1_1(O_i)$ over all observations in the data set. Note that the empirical distribution of the pre-intervention variables is already targeted towards the estimation of our parameter, and it therefore does not require any update. More specifically, given the empirical distribution as initial estimator, the maximum likelihood estimator of the parameter $\epsilon$ in the parametric submodel is exactly equal to 0. We therefore arrive at the TMLE:

$\hat\psi_{TMLE} = \frac{1}{n}\sum_{i=1}^{n} \bar Q^1_1(O_i).$
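For intuition, the whole recursion collapses nicely in the simplest point-treatment case (one time point, no censoring). The sketch below uses saturated strata means for both Q and g, which are illustrative choices only, and ends with the plug-in empirical mean of the targeted fit:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def tmle_point_treatment(Y, A, L, a_star=1):
    """Minimal TMLE sketch for the intervention-specific mean with one
    binary baseline covariate L, baseline treatment A, and no censoring,
    i.e. the K = 0 special case of the recursion. Saturated strata means
    are used for both Q and g purely for illustration.
    """
    Y, A, L = (np.asarray(x, dtype=float) for x in (Y, A, L))
    # Initial Q: saturated E[Y | A, L], and its evaluation at A = a_star
    Q = np.array([Y[(A == a) & (L == l)].mean() for a, l in zip(A, L)])
    Q_star = np.array([Y[(A == a_star) & (L == l)].mean() for l in L])
    # g: P(A = a_star | L), saturated, truncated away from 0
    g = np.clip(np.array([(A[L == l] == a_star).mean() for l in L]), 0.01, None)
    H = (A == a_star) / g          # clever covariate at the observed A
    H_star = 1 / g                 # clever covariate evaluated at A = a_star
    off = logit(np.clip(Q, 1e-6, 1 - 1e-6))
    off_star = logit(np.clip(Q_star, 1e-6, 1 - 1e-6))
    eps = 0.0
    for _ in range(100):           # Newton steps for the fluctuation
        p = expit(off + eps * H)
        info = np.sum(H**2 * p * (1 - p))
        if info == 0:
            break
        step = np.sum(H * (Y - p)) / info
        eps += step
        if abs(step) < 1e-12:
            break
    # Final step: empirical mean of the targeted fit under the intervention
    return expit(off_star + eps * H_star).mean()
```

With saturated initial fits the fluctuation parameter is already (near) zero and the TMLE reduces to the G-computation estimate; the fluctuation matters when the initial Q fit is misspecified.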
3.2 Statistical properties
The TMLE estimator is a substitution estimator based on targeted fits of functions of the Q-factors of the likelihood. Each t-specific targeted Q-factor solves the score equation for the t-specific parametric submodel. The submodel is a generalized linear model, and its score is therefore the “clever covariate” times the residual. Thus, the TMLE iteratively solves the score equation of each t-specific submodel and also iteratively solves each t-specific component of the efficient influence curve estimating equation. Upon completion of the algorithm, the TMLE solves the entire efficient influence curve estimating equation, which is the sum over all components, in a single pass through the data.
This implies that the TMLE is doubly robust in the sense that it will be consistent if either the intervention distribution (the g-factors) or the initial estimators of the relevant Q-factors are estimated consistently. If both are estimated consistently, then the TMLE has an asymptotically normal distribution centered at the truth with variance that can be estimated with

$\hat\sigma^2/n, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} D^*\big(\hat P\big)(O_i)^2,$

the empirical variance of the estimated efficient influence curve divided by the sample size.
The bootstrap can be used for inference in the absence of assumptions regarding the consistency of the estimators of the g-factors or the Q-factors.
4 Simulation results
This section investigates the performance of the TMLE on a simulated dataset with a known distribution. Its performance is compared with that of the KM survival curve, the IPCW estimating equation estimator, the MLE substitution estimator, and the DRIPCW estimating equation estimator. The data consisted of fixed baseline covariates and a fixed treatment drawn from specified baseline distributions.
We also included a time-dependent covariate that was allowed to jump from value 0 to 1 once and then stay fixed over the course of follow-up. This covariate, along with the censoring indicator and event indicator at each time point of follow-up, was drawn from the distributions implied by specified hazard/intensity functions. Note that for convenience we abuse notation to say that a subject is in the “risk set” at time t, i.e. has neither had the event nor been censored by time t.
With this notation, our data can be written as the longitudinal data structure of Section 2.1. The parameter of interest is the marginal cumulative event probability at the final time point of follow-up, computed under two specific interventions: treatment fixed at 0 with no censoring (“no treatment without censoring”), and treatment fixed at 1 with no censoring (“treatment without censoring”). The true parameter values were computed using an empirical sample consisting of 1,000,000 independent draws from the data-generating distribution under each intervention.
The KM, IPCW, MLE, DRIPCW, and TMLE were computed for 1,000 samples of size n = 1,000 drawn from the observed-data distribution. Bias was computed as the difference between the mean of the sample estimates and the true values. Mean squared error was computed in the usual way. For the KM, IPCW, DRIPCW, and TMLE, we computed standard errors and coverage probabilities of 95% confidence intervals based on the variance of their respective influence curves. Influence curve-based inference is not available for the MLE, and this explains the “na” values for the coverage probabilities for this estimator in the tables of our results. Data simulations and estimators were implemented in SAS software v 9.2.
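Given the estimated influence curve values for each observation, the standard errors and confidence intervals used here take a common form. A sketch (Python; the influence-curve values passed in are whatever estimator-specific formula applies):

```python
import numpy as np

def ic_confidence_interval(psi_hat, ic, z=1.96):
    """Influence curve-based standard error and 95% confidence interval.

    `ic` holds the estimated influence curve evaluated at each observation
    (whatever estimator-specific formula applies); values are treated as
    given here.
    """
    ic = np.asarray(ic, dtype=float)
    se = np.sqrt(np.var(ic, ddof=1) / len(ic))
    return psi_hat - z * se, psi_hat + z * se
```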
4.1 Performance under positivity violations
Positivity is a requirement for the identification of both statistical and causal effect parameters. As a heuristic example, if a treatment or intervention of interest is absolutely never given to a particular subgroup of patients, perhaps because it is against medical guidelines or the law, then the average causal effect (of treatment) parameter for such patients is nonsensical. Mathematically, a violation of positivity corresponds with a parameter whose efficient influence curve is unbounded. For the parameter considered here, it is easy to see that a positivity violation means the influence curve is comprised of at least one fraction with a 0 in the denominator, which means the influence curve, and hence the variance bound, is undefined.
In practice, we often encounter data that exhibits near-positivity violations. For example, in the ATRIA-1 study, it was uncommon for persons with a history of bleeding events and falls to be prescribed warfarin. The case where treatment is particularly rare for certain subgroups is generally known as a practical positivity violation. For estimators whose influence curves have factors that involve the inverse probability of treatment, practical positivity violations are a concern. Because such estimators behave statistically like the empirical mean of their respective influence curves, a very small probability of treatment in the denominator can lead to a very large outlying value. The effect of this type of outlier may be likened to the effect of an outlier in the typical estimation of an empirical mean. The TMLE, IPCW, and DRIPCW are all estimators of this type, and a careful investigation of their respective influence curves can give insight into their behaviors under practical positivity violations.
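A tiny numerical example makes the instability concrete: a single subject with a very small estimated treatment probability can dominate an inverse-probability-weighted mean, and truncation of the estimated probabilities bounds the damage at the cost of some bias. All numbers here are hypothetical:

```python
import numpy as np

# Five subjects, all treated; one has a tiny estimated treatment
# probability (all numbers hypothetical):
Y = np.array([1, 0, 1, 0, 1])
A = np.array([1, 1, 1, 1, 1])
g = np.array([0.5, 0.5, 0.5, 0.5, 0.001])

ipcw = np.mean(A / g * Y)                              # dominated by one subject
ipcw_trunc = np.mean(A / np.clip(g, 0.01, 0.99) * Y)   # truncated weights
```

Here the untruncated weighted mean is two orders of magnitude larger than the truncated one, driven entirely by the single near-positivity-violating subject.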
Given the true g-factors, the TMLE has influence curve equal to the efficient influence curve of Section 2.5 evaluated at $g_0$ and at $Q^*$, where $Q^*$ is the asymptotic limit of our estimator of the Q-factors. When $g_0$ is estimated with $\hat g$, the influence curve of the TMLE is the above influence curve minus its projection on the tangent space of the model for g. Confidence intervals may be based on the empirical variance of the estimated influence curve. The DRIPCW influence curve is similar except that it is based on the initial (non-targeted) estimator of the Q-factors of the likelihood.
Given the true g-factors, the IPCW estimator influence curve is

$D_{IPCW}(O) = \frac{I\big(\bar A(K) = \bar a(K), \bar C(K) = 0\big)}{g_{0:K}}\, Y - \psi_0.$
Note that the KM can be expressed as an IPCW where the estimator of the g-factors of the likelihood does not depend on the covariates. Its influence curve follows the same form as that of the IPCW above.
Under regularity conditions, including boundedness of the inverse weights $1/g_{0:t}$, the TMLE is efficient when the estimators of the Q-factors and g-factors both converge to the truth, and it remains consistent if either one of them is correct. Despite this, its performance relative to the IPCW under positivity violations is not obvious. The reason is that the TMLE influence curve involves a sum of fractions with cumulative treatment and censoring probabilities in the denominators, whereas the IPCW influence curve has only one such fraction. In the TMLE, however, the numerator of each fraction is a difference of conditional means bounded by [–1, 1], while in the IPCW the numerator is exactly equal to 0 or 1.
The simulation study presented here investigates the performance of each estimator under positivity violations, introduced by scaling the linear component of the treatment assignment mechanism by constant factors 1, 10, and 20, corresponding, respectively, to “no”, “substantial”, and “extreme” positivity violations. In the “no positivity” scenario, no truncation of the g-factors was required. In the “substantial” positivity scenario, the estimated treatment probabilities fell outside the truncation bounds for 28% of the treatment group and 4% of the control group. In the “extreme” case, 41% of the treatment group and 12% of the controls required truncation.
For the TMLE, IPCW, and DRIPCW, the treatment assignment and censoring mechanisms were estimated with consistent parametric logistic regressions. To reflect common practice, the estimated g-factors at every time point were truncated to the interval [0.01, 0.99]. To carry out the TMLE, MLE, and DRIPCW, we first stratified our data set according to each treatment intervention, and then fit initial estimators of the intervention-specific Q-factors at every time point with Super Learners [10–14]. In brief, the Super Learner at each time point is a convex weighted average of the predictions from a library of five candidate estimators. The convex weights are estimated from the data to minimize the 5-fold cross-validated negative Bernoulli log-likelihood. In the simulations here, the library of candidate estimators included (1) the unconditional mean; (2) a logistic regression for continuous outcomes in the interval [0,1]; (3) ordinary least squares regression; (4) a neural network 3-layer perceptron with two hidden units; and (5) a recursive partitioning decision tree.
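The convex-weighting idea behind the Super Learner can be sketched with two candidate predictors: choose the convex combination minimizing the Bernoulli log-loss on held-out predictions. The grid search below is a simplification; the implementation described in the text uses five candidates with weights fit on 5-fold cross-validated predictions:

```python
import numpy as np

def convex_weights_logloss(preds, y, grid_size=101):
    """Choose convex weights for two candidate predictors by minimizing
    the Bernoulli log-loss (two-learner sketch; the text's Super Learner
    uses five candidates with weights fit on cross-validated predictions).

    preds : (n, 2) array of cross-validated predictions in (0, 1)
    y     : outcomes (0/1, or probabilities in [0, 1])
    """
    preds, y = np.asarray(preds, float), np.asarray(y, float)
    alphas = np.linspace(0.0, 1.0, grid_size)
    losses = []
    for a in alphas:
        p = np.clip(a * preds[:, 0] + (1 - a) * preds[:, 1], 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    a = alphas[int(np.argmin(losses))]
    return np.array([a, 1 - a])
```

With more than two candidates, the grid search is replaced by constrained optimization over the probability simplex, but the loss being minimized is the same.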
Results are shown in Table 1. In the absence of positivity violations, the KM is heavily biased due to the fact that both the treatment and censoring probabilities depend on covariate values. The KM also had the highest mean squared error. The MLE substitution estimator performed somewhat better, but was still more biased and had higher standard error than the remaining estimators. The IPCW, DRIPCW, and TMLE had the lowest mean squared errors. The biases for the TMLE and DRIPCW were higher than that of the IPCW, though their magnitudes were negligible with respect to statistical inference and 95% confidence intervals based on their respective influence curves had reasonably good coverage probabilities.
The bias and mean squared error of the IPCW increased with higher levels of positivity violation. The DRIPCW and TMLE performed better than the IPCW under extreme positivity violations, though the best estimator with respect to overall mean squared error in this scenario was the MLE. Interestingly, the performance of the MLE, which does not rely on estimation of the g-factors of the likelihood, also suffered with increasing positivity violation.
4.2 Performance under model misspecification
Model misspecification is a major concern for estimation in nonparametric statistical models. As noted previously the IPCW is consistent in the nonparametric model only if the g-factors of the likelihood are estimated consistently. The TMLE and DRIPCW are double robust in that they achieve consistency if either the g-factors or the relevant Q-factors are estimated consistently. The TMLE and DRIPCW are also locally efficient in that they achieve the nonparametric efficiency bound when both the g and Q-factors are estimated consistently. This section investigates these properties via simulation. In order to avoid redundancy with the section on positivity violations above, the linear part of the treatment assignment mechanism was scaled by a factor of 2.
We assess the performance of the TMLE, IPCW, and DRIPCW estimators built upon parametric estimators of correctly specified or misspecified g-factors. Model misspecification was specific to the treatment assignment probabilities: both versions used a correctly specified logistic regression model for the censoring hazard function, while for the treatment assignment one version used the correctly specified logistic regression model and the other used a misspecified logistic regression model.
The gains in efficiency for TMLE and DRIPCW are investigated by comparing three different approaches to the estimation of the Q-factors. The first approach used the NULL model, in which each Q-factor was assumed to be independent of the covariates. The second approach estimated the Q-factors with logistic regression functions of the baseline covariates and the most recent time-dependent covariate. The third approach constructed Super Learners for each Q-factor, as in the previous section.
Results are shown in Table 2. Again the KM, which assumes independence of both the treatment and censoring processes, is quite biased. In this sense, the KM is in fact an IPCW estimator based on a highly misspecified estimator of the g-factors of the likelihood. The IPCW is largely unbiased when the g-factors are correctly specified, but quite biased when they are misspecified. The TMLEs and DRIPCWs presented here had very similar bias and mean squared error under all model specification settings. The performance of both the TMLE and DRIPCW was better than or equal to that of the IPCW and KM in terms of both bias and MSE. The TMLE and DRIPCW based on the NULL estimators of the Q-factors behaved very similarly to the IPCW. The double robustness property of the TMLE and DRIPCW is best illustrated in the simulations where the g-factors are misspecified. This is evidenced by the major bias reduction associated with using GLM or SL fits for the Q-factors, and it underscores the importance of estimating the Q-factors when the model for g is unknown. In practice, one might choose between different TMLEs indexed by different Q-factor estimators on the basis of the cross-validated variance of the influence curve, though in the simulations presented here the observed differences in efficiency were modest at best. The MLE in some settings had lower mean squared error than both the TMLE and DRIPCW, though it appeared to have somewhat higher bias.
5 The causal effect of warfarin
Warfarin is a medication used in the prevention of thromboembolic stroke. However, its physiological mechanism of action also puts users at higher risk for adverse bleeding events, whose health consequences may be just as devastating as thromboembolic stroke. The efficacy of warfarin has been shown in randomized clinical trials, though some practitioners have questioned whether the extent of the benefits is the same in the general atrial fibrillation population. The reason is that typical patients have more comorbidities and may not be as healthy as trial participants. One of the major research objectives of the ATRIA-1 cohort study was to estimate the causal effect of warfarin on adverse outcomes in the general atrial fibrillation patient population at Kaiser Permanente Northern California.
As with most drugs, warfarin prescription decisions are based on the patient’s medical history; in particular, those who are thought to be at high risk for stroke are more likely to receive the drug. The risk factors for stroke, however, are largely the same as those for falls, bleeds, and death by any cause. The warfarin prescription decision must therefore weigh the potential benefits of stroke prevention against the potential risks of adverse outcomes. Thus treatment is highly dependent on clinical covariates, and estimation of causal effects must take this into account.
Here we estimate intervention-specific marginal probabilities of “stroke or death” within a 1-year time frame. The first intervention consists of sustained warfarin usage and no censoring, while the second intervention withholds warfarin therapy and enforces no censoring. The causal effect of warfarin is then defined as the difference in these two intervention-specific marginal probabilities.
5.1 ATRIA-1 cohort data
We use the same counting process framework outlined in Section 2.1: a counting process records the stroke-or-death event, and we are interested in its value at the 1-year (365-day) time point; the fixed and time-dependent covariate processes, the fixed baseline warfarin treatment, and the censoring counting process are defined analogously, where censoring events include disenrollment from Kaiser, end of study, or switching warfarin treatment. We regard the 13,559 observed subjects as an empirical sample drawn from an underlying probability distribution, and denote the corresponding empirical distribution accordingly.
At the beginning of the study, 5,289 subjects were taking warfarin and 8,270 were not. After the baseline assessment, 2,267 subjects started using warfarin and 1,228 stopped using warfarin within the first year; these individuals were censored on the day when the treatment was switched. One hundred sixty-eight subjects were censored due to disenrollment from the Kaiser system, and none were administratively censored within the 1-year time frame. Among those who were not censored, 171 had a thromboembolic stroke and 485 died within the first year of follow-up. The remaining 9,240 were observed to be alive, stroke-free, and had not switched treatments at the end of the first year of study. Under the data structure described above the working data set contained 13,559 persons × 365 days = 4,949,035 rows, though after the timing of censoring and terminal events were taken into account the dataset consisted of 3,915,087 person-days at risk.
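The person-day bookkeeping above can be sketched as follows (a minimal illustration with hypothetical subject records, not the actual ATRIA-1 processing code): each subject contributes one row per day at risk, up to the earlier of the terminal or censoring event and day 365, so a complete cohort would contribute 13,559 × 365 = 4,949,035 rows before the at-risk restriction.

```python
# Sketch: expand each subject into one row per day at risk. "subjects" is a
# hypothetical list of (subject_id, last_day_at_risk) pairs; rows after a
# subject's terminal or censoring event are simply never created.

def person_days_at_risk(subjects, horizon=365):
    """Return one (subject_id, day) row per person-day at risk."""
    rows = []
    for sid, last_day in subjects:
        for day in range(1, min(last_day, horizon) + 1):
            rows.append((sid, day))
    return rows

# Two hypothetical subjects: one followed the full year, one censored at day 100.
rows = person_days_at_risk([("a", 365), ("b", 100)])
print(len(rows))  # 365 + 100 = 465
```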
The covariate process for each individual included fixed indicators of gender, race, educational attainment, and income along with a time-dependent covariate for age at time t. Other time-dependent covariates included indicators for diagnoses of prior stroke, hypertension, intracranial hemorrhage, other bleeds, coronary heart disease, congestive heart failure, diabetes mellitus, retinopathy, dialysis, history of fall, dementia, seizure, peripheral artery disease, or hepatitis. The values of these covariates were set to 0 for persons with no history of the diagnoses and were fixed to 1 on the date of diagnosis and all subsequent time points. Time-dependent values for lab measurements of total hemoglobin, HbA1c, international normalized ratio, glomerular filtration rate, and proteinuria were abstracted from Kaiser’s administrative databases. These were measured under standard medical protocols and the values between measurements were imputed with the last observation carried forward.
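The last-observation-carried-forward imputation of the lab values can be sketched as follows (a minimal stand-alone version; in the analysis this is applied within each subject's daily time series, with days before the first measurement left missing):

```python
def locf(values):
    """Last observation carried forward: replace each missing entry (None)
    with the most recent observed value; leading gaps stay missing."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

# Hypothetical daily hemoglobin series with two lab draws.
print(locf([None, 12.1, None, None, 13.0, None]))
# → [None, 12.1, 12.1, 12.1, 13.0, 13.0]
```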
5.2 Causal effect of warfarin parameter
Our outcome here is defined as “stroke or death” within a 1-year time frame, and our parameter of interest is the intervention-specific marginal mean of this outcome. Equivalently, this is the marginal cumulative probability of stroke or death within 1 year, or the proportion of the population who would experience stroke or death within 1 year under the specified intervention. The fixed interventions of interest are no warfarin for 1 year versus sustained warfarin therapy for 1 year. Both interventions also prevent early censoring. We define the additive causal effect parameter as the difference between the two intervention-specific marginal means.
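In potential-outcome notation (reconstructed here under the assumption that $Y_{\bar a}(365)$ denotes the counterfactual stroke-or-death indicator at day 365 under a fixed treatment regime $\bar a$ with censoring prevented; the article's own symbols were introduced in Section 2), the target parameters may be written as

```latex
\Psi_{\bar a}(P) \;=\; \mathbb{E}\bigl[\,Y_{\bar a}(365)\,\bigr],
\qquad \bar a \in \{1 = \text{sustained warfarin},\; 0 = \text{no warfarin}\},
```

with the additive causal effect parameter defined as $\psi = \Psi_{1}(P) - \Psi_{0}(P)$.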
5.3 TMLE implementation
The g-factor corresponding to the fixed treatment assignment mechanism was given by the conditional expectation of warfarin treatment given the medical history at baseline. This was estimated with a Super Learner [10, 13] for the negative Bernoulli loglikelihood risk, fitted on a dataset with 13,559 rows – i.e. one per subject. The g-factor corresponding to the time-dependent censoring mechanism was given by the conditional intensity of censoring given the history up to the most recent time point. This was estimated with a Super Learner for the risk defined as the expectation of the sum of the negative Bernoulli loglikelihood loss at each time point.
The initial fit for the terminal Q-factor corresponding to the time-dependent event intensity mechanism at day 365 was estimated with a Super Learner for the risk defined as the expectation of the sum of the negative Bernoulli loglikelihood at each time point. To improve precision we used the entire ATRIA-1 dataset (even time points beyond day 365) to fit this Super Learner. The initial fits for the remaining Q-factors were fit with Super Learners for the negative Bernoulli loglikelihood risk. The Super Learner for each time point specific Q-factor was fit using only those observations that had not yet been censored or experienced the event.
All of the Super Learners (both for the g and Q-factors) used 5-fold cross validation and a library with the following candidates:
MEAN – the unconditional mean, i.e. the NULL model
LOGIT or LOGIT_CTS01 – logistic regression, or logistic regression for continuous outcomes in [0,1]
LDA – linear discriminant analysis (not used for the Q-fits with continuous outcomes)
NN2 – neural network: a 3-layer perceptron with two hidden units
TREE – recursive partitioning decision tree
The Super Learners and the TMLE algorithm were implemented in SAS v 9.2.
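The Super Learner weighting step can be illustrated with a toy two-candidate version (hypothetical data and a simplified grid search; the actual implementation combined the five candidates above via 5-fold cross-validation in SAS): cross-validated predictions from each candidate are combined with the convex weight that minimizes the cross-validated negative Bernoulli log-likelihood risk.

```python
import math

def bernoulli_risk(y, p, eps=1e-12):
    """Mean negative Bernoulli log-likelihood of predictions p for outcomes y."""
    return -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                for yi, pi in zip(y, p)) / len(y)

def super_learner_weight(y, preds0, preds1, grid=101):
    """Toy two-candidate Super Learner: return the convex weight alpha on
    candidate 0 (1 - alpha on candidate 1) that minimizes the risk over the
    out-of-fold predictions preds0 and preds1 (assumed cross-validated)."""
    candidates = [k / (grid - 1) for k in range(grid)]
    return min(candidates,
               key=lambda a: bernoulli_risk(
                   y, [a * u + (1 - a) * v for u, v in zip(preds0, preds1)]))

# Hypothetical example: candidate 0 predicts well, candidate 1 is the NULL mean.
y = [1, 0, 1, 1, 0, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.3]
null = [0.5] * 8
alpha = super_learner_weight(y, good, null)
print(alpha)  # → 1.0: all weight on the better candidate
```

In the real algorithm the optimization is over a simplex of weights for all candidates rather than a one-dimensional grid, but the risk criterion is the same.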
The Super Learner for the fixed treatment assignment mechanism was a weighted average of the unconditional mean (9%), logistic regression (37%), linear discriminant analysis (12%), neural network (39%), and decision tree (3%). The Super Learner for the time-dependent censoring mechanism involved only the neural network (6%) and the decision tree (94%). The fitted g-factors did not suggest major practical positivity violations, and we did not truncate inverse probability weights. The inverse probability weights for both the warfarin and no warfarin interventions are plotted as a function of follow-up time in Figure 1.
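The time-dependent weights behind Figure 1 follow the usual inverse-probability construction, sketched here with hypothetical hazard values (in the analysis the censoring hazards came from the Super Learner fit): the weight at day t is the reciprocal of the baseline treatment probability times the cumulative probability of remaining uncensored through day t.

```python
# Sketch: weight_t = 1 / ( g_A * prod_{s <= t} (1 - lambda_C(s)) ), where g_A
# is the estimated probability of the observed baseline treatment and
# lambda_C(s) the estimated censoring hazard at day s (hypothetical values).

def ip_weights(g_treatment, censoring_hazards):
    weights, surv = [], 1.0
    for lam in censoring_hazards:
        surv *= 1.0 - lam            # probability of remaining uncensored
        weights.append(1.0 / (g_treatment * surv))
    return weights

w = ip_weights(0.5, [0.01, 0.02, 0.0])
print([round(x, 4) for x in w])  # → [2.0202, 2.0614, 2.0614]
```

Because the uncensored probability can only shrink over follow-up, the weights are non-decreasing in time, which is why untruncated weights can grow large under practical positivity violations.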
The Super Learner for the event intensity was a weighted average of logistic regression (66%), linear discriminant analysis (4%), neural network (20%), and decision tree (10%). The weights for the Super Learners for the remaining Q-factors (364 in total) are too numerous to describe in detail, though generally the highest weight was given to the logistic regression candidate estimator. The initial estimators of the Q-factors for every time point, days 0 to 365, were updated with TMLE to solve the efficient score equation.
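For a single time point, the targeting update can be sketched as follows (a one-dimensional logistic fluctuation fit by Newton-Raphson on hypothetical toy data; the clever covariate H is taken as given, playing the role of the inverse-probability-of-treatment-and-censoring factor):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def tmle_update(y, q_init, h, steps=50):
    """Fluctuate the initial fit via logit Q*(eps) = logit Q + eps * H,
    choosing eps by maximum likelihood (Newton-Raphson). At the MLE the
    score equation sum_i h_i * (y_i - q*_i) = 0 is solved."""
    eps = 0.0
    for _ in range(steps):
        q = [expit(logit(qi) + eps * hi) for qi, hi in zip(q_init, h)]
        score = sum(hi * (yi - qi) for yi, qi, hi in zip(y, q, h))
        info = sum(hi * hi * qi * (1 - qi) for qi, hi in zip(q, h))
        if info == 0:
            break
        eps += score / info
    return [expit(logit(qi) + eps * hi) for qi, hi in zip(q_init, h)]

# Hypothetical outcomes, initial Q-fits, and clever covariate values.
y = [1, 0, 1, 0, 1]
q0 = [0.6, 0.4, 0.7, 0.5, 0.5]
h = [2.0, 1.5, 2.5, 1.0, 1.8]
q_star = tmle_update(y, q0, h)
score = sum(hi * (yi - qi) for yi, qi, hi in zip(y, q_star, h))
print(abs(score) < 1e-8)  # → True: the score equation is solved
```

Because the update stays on the logistic scale, the targeted fits remain probabilities in (0, 1), which is the substitution-estimator property discussed in the final section.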
Table 3 presents the KM, IPCW, MLE, DRIPCW, and TMLE estimators and their standard errors based on their respective influence curves. All five estimators yielded point estimates indicating that the sustained warfarin intervention decreased the proportion of stroke or death relative to the no warfarin intervention. The KM estimator suggests that the marginal stroke or death probability under the sustained warfarin intervention was 3.5% less than that under the no warfarin intervention. However, because warfarin treatment decisions are made based on the patient’s medical covariate history, the KM is not consistent as an estimator of the marginal probability. It is instead a stratified estimator of the probability of stroke or death amongst persons on warfarin and amongst those who were not, respectively. Further, the KM is subject to bias when there is informative censoring, which appears to have been the case here.
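The influence-curve-based standard errors in Table 3 follow the usual recipe, sketched here with a hypothetical influence-curve sample: the standard error is the sample standard deviation of the estimated influence curve divided by the square root of the sample size, yielding a Wald-type confidence interval.

```python
import math

def ic_confidence_interval(estimate, ic, z=1.96):
    """Standard error and 95% Wald interval from an estimated influence
    curve evaluated at each observation (ic is a list of length n)."""
    n = len(ic)
    mean = sum(ic) / n
    var = sum((x - mean) ** 2 for x in ic) / (n - 1)
    se = math.sqrt(var / n)
    return se, (estimate - z * se, estimate + z * se)

# Hypothetical example: estimate -0.02 with a tiny influence-curve sample.
se, ci = ic_confidence_interval(-0.02, [0.5, -0.3, 0.1, -0.2, 0.4, -0.5])
print(round(se, 4))  # → 0.1633
```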
The IPCW, MLE, DRIPCW, and TMLE are of more interest here because these may be consistent causal effect estimators. The range of the estimates of the g-factors of the likelihood used in both TMLE and IPCW suggested that there was no major positivity violation for the estimation of the parameter of interest. Thus, if our estimates of the treatment assignment and the censoring mechanisms are consistent, then the TMLE, IPCW, and DRIPCW are consistent for the statistical parameter of interest. If the fits of the Q-factors are consistent then the TMLE, MLE, and DRIPCW are consistent for the statistical parameter of interest. Under the additional (untestable) assumption that there are no unmeasured confounders, then these estimators are also consistent for the causal effect parameter. The IPCW suggests a reduction of 2.0%, the MLE 2.7%, the DRIPCW 2.0% and the TMLE 1.9%. As expected, the influence curves of these estimators indicate that the TMLE and DRIPCW were slightly more precise than the IPCW.
It is noteworthy that the TMLE, DRIPCW, MLE, and IPCW all yielded higher “stroke or death” probability estimates than KM under the warfarin treatment and lower estimates under the no warfarin treatment. Because warfarin is prescribed to patients who are considered to be at higher risk for stroke, one might anticipate the opposite finding, i.e. that the KM estimator should make the drug look less beneficial. On the other hand, warfarin is often withheld from patients with a high risk of death for safety reasons. For example, recent falls and bleeding events are common contraindications for warfarin.
We explored this apparent confounding by regressing our Super Learner fit for the “stroke or death” event intensity on the baseline warfarin treatment and found that a higher event intensity at baseline was associated with a lower probability of having received warfarin. Recall that individuals who started warfarin therapy later and those who stopped taking it were censored. As an informal exploration, we also fit an estimator of the censoring mechanism as a function of the event intensity, separately for those who started out on warfarin and those who did not. Amongst persons who started the study on warfarin, higher event intensities were positively associated with censoring, which suggests that when a patient’s underlying risk of “stroke or death” increased they were subsequently taken off warfarin.
In summary, an intervention defined by giving warfarin to all atrial fibrillation patients significantly reduced the marginal probability of stroke or death by 2% relative to the null intervention (no warfarin at all). This effect estimate was smaller than the estimate that ignored confounding, seemingly introduced by clinicians’ apparent reluctance to prescribe warfarin to patients whose medical history implied a higher underlying risk of death.
Recently developed estimators, e.g. IPCW, MLE, DRIPCW, or TMLE, extend the analysis of causal effects to observational data settings. This is remarkable in that it allows the estimation of causal effects directly from the patient populations in which therapies are typically prescribed, as opposed to trial populations that tend to be comprised of healthier subjects. Increasing awareness of such estimators has led to academic debate over which estimators are more useful in theory and practice. The TMLE combines attractive theoretical features of the estimating equation approaches with those of minimum loss-based estimation and avoids several of their respective shortcomings. Estimating equation approaches require that the parameter of interest be defined as the solution to an estimating equation, and this is not always possible. Even when it is, the estimating equation may have multiple solutions without a clear-cut optimality criterion. Further, inconsistent estimation of the g-factors of the likelihood can lead to parameter estimates that do not obey the constraints of the statistical model. For example, it is possible to compute IPCW confidence intervals for probabilities that lie outside the range [0,1]. Because the TMLE is a substitution estimator, parameter estimates never escape the bounds of the statistical model. The parameter is defined as a mapping applied to the probability distribution of the data and therefore extends to cases where it cannot be written as the solution to an estimating equation. The minimum loss, or maximum likelihood, principle provides a clear-cut theoretical optimality criterion for use in practical estimation problems.
In this article we implemented a TMLE algorithm that can be applied in either the RCT or observational scenario to estimate the marginal intervention-specific cumulative event probability for right-censored survival data with time-dependent covariates. The TMLE solves the efficient influence curve estimating equation for the marginal intervention-specific cumulative event probability and is doubly robust in the sense that it is consistent if either the intervention distribution or the relevant components of the event process distribution are estimated consistently. If both are estimated consistently, it achieves the efficiency bound in the nonparametric model. The implementation here assumes that complete histories of covariate, treatment, event, and censoring processes are observed for each individual at each time point in a discrete set of time points. The benefits of the double robust estimators TMLE and DRIPCW were evident in our simulation studies. These two estimators, however, performed very similarly with regard to bias, mean squared error, and coverage probabilities for 95% confidence intervals. Although not readily apparent in the settings studied here, the work of Stitleman et al. documented notable gains in the TMLE as compared to the DRIPCW in settings where the solution to the DRIPCW estimating equation fell outside the bounds of the model.
In applied work, time-dependent covariate measures are nearly always measured at different times for different subjects, and the researcher must make some plausible imputation of the histories at each time point. Our analysis of the atrial fibrillation data used the standard “last observation carried forward” imputation. Imputation of this type does not cause bias, because it does not come at any loss of information. It is important to note here that this imputation is not an attempt to control for unmeasured confounders (such as time-specific values of biomarkers that were not observed). We are merely adjusting for the measured covariate processes, and we assume the sequential randomization assumption with respect to these measured processes in order to interpret our statistical estimands as causal effects. Loss of information (and possible bias) only occurs if we use too coarse a time scale, such that the actual observed data must be somehow truncated or otherwise modified.
The more important assumption in this discrete framework is that the intervention nodes (censoring and treatment) are assumed to be correctly defined at each time point. As noted above for the covariate process, an overly coarse discrete time scale may induce bias with respect to the desired causal effect. On the other hand, an overly fine discrete time scale may result in practical violations of the positivity assumption in cases where the treatment process is allowed to change at any point in time. In our case, we were able to use a very fine time scale (daily) since the treatment process and censoring process only jump once. Methods to estimate the causal effects of treatment processes that can jump in continuous time at various time points, instead of enforcing the data to adhere to the discrete time scale, are the subject of ongoing research.
The results of the data analysis indicated that a medical policy of treating all patients with warfarin versus no treatment would have reduced the 1-year stroke or death probability from 7% to 5%. It must be restated that these interventions were chosen to illustrate the TMLE method. Such stringent interventions, i.e. sustained warfarin for all patients or no warfarin for any, are unrealistic in practice. Future applied work in this area should focus on comparisons of realistic treatment rules.
Finally, we reiterate that the TMLE approach presented here is quite general and may be adapted to a variety of other problems. The same method may be used to estimate the causal effects of dynamic treatment rules, in which treatment decisions are made in response to the evolution of the time-dependent covariate process. Extension to the estimation of counterfactual survival curves (i.e. survival probabilities at multiple time points) is straightforward, and simultaneous confidence bands may be computed as described in Quale and van der Laan. Further extensions that involve marginal structural models may be used to estimate continuous treatment effects or treatment effects that depend on a function of baseline covariates.
Robins JM, Rotnitzky A. AIDS epidemiology – methodological issues. Boston, MA: Birkhäuser, 1992.
van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, 2003.
van der Laan MJ, Gruber S. Targeted minimum loss-based estimation of an intervention specific mean outcome. Technical Report 290, Division of Biostatistics, University of California, 2011.
Bickel PJ, Klaassen CA, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. New York: Springer-Verlag, 1993.
Pearl J. Causality: models, reasoning, and inference, 2nd edn. New York: Cambridge University Press, 2009.
Chaffee P, Hubbard AE, van der Laan MJ. Permutation-based pathway testing using the super learner algorithm. Technical Report 263, Division of Biostatistics, University of California, 2010. Available at: http://www.bepress.com/ucbbiostat/paper263.
Diaz Munoz I, van der Laan MJ. Super learner based conditional density estimation with application to marginal structural models. Int J Biostat 2011;7. Available at: http://www.bepress.com/ijb/vol7/iss1/38.
Polley EC, van der Laan MJ. Super learner in prediction. Technical Report 266, Division of Biostatistics, University of California, 2010. Available at: http://www.bepress.com/ucbbiostat/paper266.
Sinisi SE, Polley EC, Petersen ML, Rhee S-Y, van der Laan MJ. Super learning: an application to the prediction of HIV-1 drug resistance. Stat Appl Genet Mol Biol 2007;6. Available at: http://www.bepress.com/sagmb/vol6/iss1/art7.
van der Laan MJ. Targeted maximum likelihood based causal inference Part I. Int J Biostat 2010;6. Available at: http://www.bepress.com/ijb/vol6/iss2/2.
van der Laan MJ. Targeted maximum likelihood based causal inference Part II. Int J Biostat 2010;6. Available at: http://www.bepress.com/ijb/vol6/iss2/3.
Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat 2011;7. Available at: http://www.bepress.com/ijb/vol7/iss1/31.
Stitleman OM, Wester CW, De Gruttola V, van der Laan MJ. Targeted maximum likelihood estimation of effect modification parameters in survival analysis. Int J Biostat 2011;7. Available at: http://www.bepress.com/ijb/vol7/iss1/19.
Tuglus C, van der Laan MJ. Repeated measures semiparametric regression using targeted maximum likelihood methodology with application to transcription factor activity discovery. Stat Appl Genet Mol Biol 2011;10. Available at: http://www.bepress.com/sagmb/vol10/iss1/art2.
Petersen ML, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Technical Report 312, Division of Biostatistics, University of California, 2013.
van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003.