# The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.


Online ISSN: 1557-4679
Volume 12, Issue 1

# Second-Order Inference for the Mean of a Variable Missing at Random

Iván Díaz
/ Marco Carone
/ Mark J. van der Laan
Published Online: 2016-05-26 | DOI: https://doi.org/10.1515/ijb-2015-0031

## Abstract

We present a second-order estimator of the mean of a variable subject to missingness, under the missing at random assumption. The estimator improves upon existing methods by using an approximate second-order expansion of the parameter functional, in addition to the first-order expansion employed by standard doubly robust methods. This results in weaker assumptions about the convergence rates necessary to establish consistency, local efficiency, and asymptotic linearity. The general estimation strategy is developed under the targeted minimum loss-based estimation (TMLE) framework. We present a simulation comparing the sensitivity of the first and second-order estimators to the convergence rate of the initial estimators of the outcome regression and missingness score. In our simulation, the second-order TMLE always had a coverage probability equal or closer to the nominal value 0.95, compared to its first-order counterpart. In the best-case scenario, the proposed second-order TMLE had a coverage probability of 0.86 when the first-order TMLE had a coverage probability of zero. We also present a novel first-order estimator inspired by a second-order expansion of the parameter functional. This estimator only requires one-dimensional smoothing, whereas implementation of the second-order TMLE generally requires kernel smoothing on the covariate space. The first-order estimator proposed is expected to have improved finite sample performance compared to existing first-order estimators. In the best-case scenario of our simulation study, the novel first-order TMLE improved the coverage probability from 0 to 0.90. We provide an illustration of our methods using a publicly available dataset to determine the effect of an anticoagulant on health outcomes of patients undergoing percutaneous coronary intervention. We provide R code implementing the proposed estimator.

This article offers supplementary material which is provided at the end of the article.

## 1 Introduction

Estimation of the mean of an outcome subject to missingness has been extensively studied in the literature. Under the assumption that missingness is independent of the outcome conditional on observed covariates, the marginal expectation is identified as a parameter depending on the conditional expectation given covariates among observed individuals (outcome regression henceforth) and the marginal distribution of the covariates. If the covariate vector consists of a few categorical variables, a nonparametric maximum likelihood estimator yields an optimal (i. e., asymptotically efficient) estimator of the mean outcome. However, if the covariate vector contains continuous variables or its dimension is large, estimation of the outcome regression requires smoothing on the covariate space. This has often been achieved by means of a parametric model. Unfortunately, the correct specification of a parametric model is a chimerical task in high-dimensional settings or in the presence of continuous variables [1], and data-adaptive estimation methods such as those developed in the statistical learning literature (e. g., super learning, model stacking, bagging) must be used.

Our methods are developed in the context of targeted learning [2, 3], a branch of statistics that deals with the use of data-adaptive methods coupled with optimal estimation theory for infinite-dimensional models. In particular, the targeted minimum loss-based estimation (TMLE) framework allows consistent and locally efficient estimation of arbitrary low-dimensional parameters in high-dimensional models under regularity and smoothness conditions. In our context, targeted learning allows the incorporation of flexible data-adaptive estimators of the outcome regression into the estimation procedure.

Several doubly robust and locally efficient estimators have been proposed for the missing data problem. These estimators are based on a first-order expansion of the parameter functional and are asymptotically efficient under certain conditions. Arguably, the most important condition is that the outcome regression and the probability of missingness conditional on covariates (missingness score henceforth) are estimated consistently at an appropriate rate. A sufficient assumption for establishing $\sqrt{n}$-consistency of doubly robust estimators is that the outcome regression and the missingness score converge to their true values at rates faster than $n^{-1/4}$. In this paper we are concerned with asymptotically efficient estimation under slower consistency rates for these estimators. In particular, we present a second-order TMLE that incorporates a second-order expansion of the parameter functional in order to relax this assumption, which may be implausible in high dimensions and for certain data-adaptive estimators. The method we present is an application of the general higher-order estimation theory presented in Ref. [4]. We refer to the second-order estimator as 2-TMLE, in contrast to the first-order TMLE discussed in Ref. [2], referred to as 1-TMLE.

A complete literature review of higher-order estimation theory is presented in Ref. [4]. The most relevant references for the problem studied here are [5] and [6]. In particular, [5] presents a second-order expansion of the target parameter, as well as a second-order estimator based on that expansion. This estimator directly uses inverse weighting by a kernel estimate of the covariate density. As a result of the curse of dimensionality, the estimator may perform poorly in finite samples as the dimension of the covariate space increases. In particular, it may fall outside of the parameter space. In contrast, the 2-TMLE presented here is a substitution estimator that always falls in the parameter space. The results presented in Ref. [7] establish the asymptotic properties of various calibration estimators in the context of missing data problems, concluding that some of them are second-order estimators. However, their results are not directly related to this manuscript since they assume a Euclidean parametrization of the outcome model and a known missingness score.

As with the estimator presented in Ref. [5], implementation of the 2-TMLE requires approximating the second-order influence function by means of kernel smoothing. When the covariate space is high-dimensional, this approximation is subject to the curse of dimensionality. This issue may be circumvented by utilizing an alternative second-order expansion that uses kernel smoothing on the missingness score, which is a one-dimensional function of the covariate vector. Since the true missingness score is generally unknown, implementation of this estimator must be carried out using an estimated missingness score. Unfortunately, introduction of the estimated missingness score in place of its true value yields a second-order remainder term in the analysis of the estimator. As a consequence, the estimator obtained is not a second-order estimator. We refer to this estimator as a 1*-TMLE in accordance with this observation. Notably, the second-order remainder term obtained with the 1*-TMLE is different from that of the 1-TMLE, which implies they have different finite sample properties. We conjecture that the 1*-TMLE improves finite sample performance over the 1-TMLE, and present a case study in which there are considerable finite sample gains.

Compared to the standard 1-TMLE, implementation of the 1*-TMLE requires the inclusion of one additional covariate in the outcome regression. As a result, its implementation is straightforward and comes at no computational cost. Moreover, the potential finite sample gains in performance can be overwhelming, as we illustrate in a simulation studying the coverage probability and mean squared error of the two estimators.

The paper is organized as follows. In Section 2 we review first-order efficient estimation theory for the mean outcome in a missing data model. In Section 3 we present the second-order expansion of the parameter functional and use it in Section 3.1 to construct a 2-TMLE. In Section 3.2 we introduce the 1*-TMLE discussed above. Section 4 presents a simulation showing that the 1*-TMLE and the 2-TMLE have improved coverage probabilities and mean squared error for slow convergence rates of the estimated outcome regression and missingness score. We conclude with Section 5, illustrating the use of the 1*-TMLE in a real data application.

## 2 Review of first-order estimation theory

Let W denote a d-dimensional vector of covariates, and let Y denote an outcome of interest measured only when a missingness indicator A is equal to one. To simplify the exposition, we assume that Y is binary or continuous taking values in the interval (0, 1). The observed data $O = (W, A, AY)$ is assumed to have a distribution $P_0$ in the nonparametric model $\mathcal{M}$. Assume we observe an i.i.d. sample $O_1, \ldots, O_n$, and denote the empirical distribution by $P_n$. For every element $P \in \mathcal{M}$, we define $\begin{array}{rl} Q_W(P)(w) &:= P(W \le w)\\ g(P)(w) &:= P(A = 1 \mid W = w)\\ \bar{Q}(P)(w) &:= E_P(Y \mid A = 1,\, W = w), \end{array}$

where $E_P$ denotes expectation under P. We denote $Q_{W,0} := Q_W(P_0)$, $g_0 := g(P_0)$, and $\bar{Q}_0 := \bar{Q}(P_0)$. We refer to $\bar{Q}$ as the outcome regression, and to g as the missingness score. We suppress the argument P from the notation $Q_W(P)$, $g(P)$, and $\bar{Q}(P)$ whenever it does not cause confusion. For a function f of o, we use the notation $Pf := \int f(o)\, dP(o)$. Let $\Psi: \mathcal{M} \to \mathbb{R}$ be a parameter mapping defined as $\Psi(P) := E_P\{\bar{Q}(W)\}$, and let $\psi_0 := \Psi(P_0)$. Under the assumptions that missingness A is independent of the outcome Y conditional on the covariates W and that $P_0(g_0(W) > 0) = 1$, it can be shown that $\psi_0 = E_{F_0}(Y)$, where $F_0$ is the true distribution of the full data (W, Y). Because $\Psi$ depends on P only through $Q := (Q_W, \bar{Q})$, we also use the alternative notation $\Psi(Q)$ to refer to $\Psi(P)$.

First-order inference for $\psi_0$ is based on the following expansion of the parameter functional $\Psi(P)$ around the true $P_0$: $\Psi(P) - \Psi(P_0) = -P_0 D^{(1)}(P) + R_2(P, P_0), \qquad (1)$

where $D^{(1)}(P)$ is a function of an observation $o = (w, a, y)$ that depends on P, and $R_2(P, P_0)$ is a second-order remainder term. The superscript (1) denotes a first-order approximation. This expansion may be seen as analogous to a Taylor expansion when P is indexed by a finite-dimensional quantity, and the expression second-order may be interpreted accordingly.

We use the expression first-order estimator to refer to estimators based on first-order approximations as in eq. (1). Analogously, the expression second-order estimator is used to refer to estimators based on second-order approximations, e. g., as presented in Section 3 below.

Doubly robust locally efficient inference is based on approximation (1) with $D^{(1)}(P)(o) = \frac{a}{g(w)}\{y - \bar{Q}(w)\} + \bar{Q}(w) - \Psi(P), \qquad (2)$ $R_2(P, P_0) = \int \left\{1 - \frac{g_0(w)}{g(w)}\right\}\{\bar{Q}(w) - \bar{Q}_0(w)\}\, dQ_{W,0}(w). \qquad (3)$

Straightforward algebra suffices to check that eq. (1) holds with the definitions given above. D(1) as defined in eq. (2) is referred to as the canonical gradient or the efficient influence function [8, 3].
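To make the first-order machinery concrete, display (2) can be evaluated directly. The following sketch (in Python rather than the paper's R; the data-generating process is an assumption chosen purely for illustration) computes $D^{(1)}$ at the true nuisance parameters and forms the corresponding one-step bias-corrected estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
w = rng.uniform(size=n)                       # covariate W ~ Uniform(0, 1)
g0 = 0.2 + 0.6 * w                            # true missingness score g0(W) = P(A=1|W)
a = rng.binomial(1, g0)                       # missingness indicator A
q0 = 1 / (1 + np.exp(-(w - 0.5)))             # true outcome regression Qbar0(W)
y = np.where(a == 1, rng.binomial(1, q0), 0)  # Y is observed only when A = 1

def d1(y, a, qbar, g, psi):
    """Efficient influence function D^(1)(P)(o) of eq. (2)."""
    return a / g * (y - qbar) + qbar - psi

# Plug-in estimator and its one-step (AIPW-type) bias correction,
# here evaluated at the true nuisances for illustration:
psi_plugin = np.mean(q0)
psi_onestep = psi_plugin + np.mean(d1(y, a, q0, g0, psi_plugin))
```

Under this data-generating process the true value is $\psi_0 = 0.5$, and both estimates land close to it; in practice the nuisances `q0` and `g0` would be replaced by (data-adaptive) estimates.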

First-order targeted minimum loss-based estimation of ${\mathrm{\psi }}_{0}$ is performed in the following steps [2]:

Step 1. Initial estimators. Obtain initial estimators $\stackrel{ˆ}{g}$ and $\stackrel{ˆ}{\stackrel{ˉ}{Q}}$ of g0 and ${\stackrel{ˉ}{Q}}_{0}$. In general, the functional form of g0 and ${\stackrel{ˉ}{Q}}_{0}$ will be unknown to the researcher. Since consistent estimation of these quantities is key to achieving asymptotic efficiency of $\stackrel{ˆ}{\mathrm{\psi }}$, we advocate the use of data-adaptive predictive methods that allow flexibility in the specification of these functional forms.

Step 2. Compute auxiliary covariate. For each subject i, compute the auxiliary covariate ${\stackrel{ˆ}{H}}^{\left(1\right)}\left({W}_{i}\right):=\frac{1}{\stackrel{ˆ}{g}\left({W}_{i}\right)}.$

Step 3. Solve estimating equations. Estimate the parameter $\epsilon$ in the logistic regression model $\operatorname{logit} \hat{\bar{Q}}_{\epsilon}(w) = \operatorname{logit} \hat{\bar{Q}}(w) + \epsilon \hat{H}^{(1)}(w), \qquad (4)$

by fitting a standard logistic regression model of $Y_i$ on $\hat{H}^{(1)}(W_i)$, with no intercept and with offset $\operatorname{logit} \hat{\bar{Q}}(W_i)$, among observations with $A = 1$. Alternatively, fit the model $\operatorname{logit} \hat{\bar{Q}}_{\epsilon}(w) = \operatorname{logit} \hat{\bar{Q}}(w) + \epsilon$

with weights $\hat{H}^{(1)}(W_i)$ among observations with $A = 1$. In either case, denote the estimate of $\epsilon$ by $\hat{\epsilon}$.

Step 4. Update initial estimator and compute 1-TMLE. Update the initial estimator as $\hat{\bar{Q}}^{*}(w) = \hat{\bar{Q}}_{\hat{\epsilon}}(w)$, and define the 1-TMLE as $\hat{\psi} = \Psi(\hat{\bar{Q}}^{*})$.

Note that this estimator $\stackrel{ˆ}{P}$ of P0 satisfies ${P}_{n}{D}^{\left(1\right)}\left(\stackrel{ˆ}{P}\right)=0$. For a full presentation of the TMLE algorithm the interested reader is referred to [3] and the references therein. Using eq. (1) along with ${P}_{n}{D}^{\left(1\right)}\left(\stackrel{ˆ}{P}\right)=0$ we obtain that $\stackrel{ˆ}{\mathrm{\psi }}-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right){D}^{\left(1\right)}\left(\stackrel{ˆ}{P}\right)+{R}_{2}\left(\stackrel{ˆ}{P},\phantom{\rule{thinmathspace}{0ex}}{P}_{0}\right).$

Provided that

• (i)

${D}^{\left(1\right)}\left(\stackrel{ˆ}{P}\right)$ converges to D(1)(P0) in L2(P0) norm, and

• (ii)

the size of the class of functions considered for estimation of $\stackrel{ˆ}{P}$ is bounded (technically, there exists a Donsker class $\mathcal{H}$ so that $D^{(1)}(\hat{P}) \in \mathcal{H}$ with probability tending to one),

results from empirical process theory (e. g., theorem 19.24 of Ref. [9]) allow us to conclude that $\stackrel{ˆ}{\mathrm{\psi }}-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right){D}^{\left(1\right)}\left({P}_{0}\right)+{R}_{2}\left(\stackrel{ˆ}{P},\phantom{\rule{thinmathspace}{0ex}}{P}_{0}\right).$

In addition, if $R_2(\hat{P}, P_0) = o_P(n^{-1/2}), \qquad (5)$

we obtain that $\stackrel{ˆ}{\mathrm{\psi }}-{\mathrm{\psi }}_{0}=\left({P}_{n}-{P}_{0}\right){D}^{\left(1\right)}\left({P}_{0}\right)+{o}_{P}\left({n}^{-1/2}\right)$. This implies, in particular, that $\stackrel{ˆ}{\mathrm{\psi }}$ is a $\sqrt{n}$-consistent estimator of ${\mathrm{\psi }}_{0}$, it is asymptotically normal, and it is locally efficient.
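Steps 1–4 above can be condensed into a short numerical sketch. The Python code below (the paper's own implementation is in R; the data-generating process is hypothetical) solves the one-parameter logistic fluctuation (4) by Newton's method rather than by calling a regression routine, using a deliberately misspecified constant initial outcome regression together with the true missingness score, so that the double robustness of the targeting step can be seen at work:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def tmle1(y, a, qbar_hat, g_hat):
    """First-order TMLE of E(Y): fluctuate qbar_hat along H1 = 1/g_hat."""
    h1 = 1 / g_hat                                  # Step 2: auxiliary covariate
    off = logit(np.clip(qbar_hat, 1e-6, 1 - 1e-6))  # offset on the logit scale
    eps = 0.0
    for _ in range(50):                             # Step 3: Newton solve for epsilon
        p = expit(off + eps * h1)
        score = np.sum(a * h1 * (y - p))            # score among observations with A = 1
        info = np.sum(a * h1**2 * p * (1 - p))      # observed information
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break
    qbar_star = expit(off + eps * h1)               # Step 4: updated estimator
    return np.mean(qbar_star)                       # plug-in: Psi(Qbar*)

# toy data: misspecified initial outcome regression, correct missingness score
rng = np.random.default_rng(1)
n = 5000
w = rng.uniform(size=n)
g0 = 0.2 + 0.6 * w
a = rng.binomial(1, g0)
y = np.where(a == 1, rng.binomial(1, expit(w - 0.5)), 0)
psi_hat = tmle1(y, a, qbar_hat=np.full(n, 0.5), g_hat=g0)
```

Here the true mean is $\psi_0 = 0.5$; despite the constant initial $\hat{\bar{Q}} \equiv 0.5$, the targeting step recovers a consistent estimate because $g_0$ is correctly specified.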

The first-order TMLE requires convergence of the second-order term ${R}_{2}\left(\stackrel{ˆ}{P},{P}_{0}\right)$ to zero at ${n}^{-1/2}$ rate or faster. When this convergence holds with one of ${Q}_{0}$ or ${g}_{0}$ replaced by a misspecified limit ${Q}^{†}$ or ${g}^{†}$, an additional assumption (stating that a certain functional of the data-adaptive estimator ${g}_{n}$ or ${Q}_{n}$ is asymptotically linear) is necessary to prove asymptotic linearity of the first-order TMLE. A method is presented in Ref. [10] that tackles this problem by proposing an estimator that satisfies the required asymptotic linearity assumption.

In this paper we discuss ways of constructing an estimator that requires a consistency assumption weaker than eq. (5). Note that eq. (5) is an assumption about the convergence rate of a second-order term involving the product of the differences $\hat{\bar{Q}} - \bar{Q}_0$ and $\hat{g} - g_0$. Using the Cauchy-Schwarz inequality, $|R_2(\hat{P}, P_0)|$ may be bounded as $|R_2(\hat{P}, P_0)| \le \|1/\hat{g}\|_{\infty}\, \|\hat{g} - g_0\|_{P_0}\, \|\hat{\bar{Q}} - \bar{Q}_0\|_{P_0},$

where $\|f\|_P^2 := \int f^2(o)\, dP(o)$, and $\|f\|_{\infty} := \sup\{f(o) : o \in \mathcal{O}\}$. For assumption (5) to hold, it is sufficient to have that

• (i)

$\stackrel{ˆ}{g}$ is bounded away from zero with probability tending to one;

• (ii)

$\stackrel{ˆ}{g}$ is the MLE of $g_0 \in \mathcal{G} = \{g(w; \beta) : \beta \in \mathbb{R}^d\}$ (i.e., $g_0$ is estimated in a correctly specified parametric model), since this implies $\|\hat{g} - g_0\|_{P_0} = O_P(n^{-1/2})$; and

• (iii)

$||\stackrel{ˆ}{\stackrel{ˉ}{Q}}-{\stackrel{ˉ}{Q}}_{0}|{|}_{{P}_{0}}={o}_{P}\left(1\right).$

Alternatively, the roles of $\stackrel{ˆ}{g}$ and $\stackrel{ˆ}{\stackrel{ˉ}{Q}}$ may be interchanged in (ii) and (iii). As discussed in Ref. [1], however, correct specification of a parametric model is hardly achievable in high-dimensional settings. Data-adaptive estimators must then be used for the outcome regression and missingness score, but they may yield a remainder term $R_2$ with a convergence rate slower than $n^{-1/2}$. In the next section we present a second-order expansion of the parameter functional that allows the construction of estimators requiring consistency assumptions weaker than eq. (5).
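The Cauchy-Schwarz bound on $R_2$ above can also be checked numerically for a discrete covariate. In the sketch below, all nuisance values are arbitrary numbers chosen for illustration; the remainder (3) is computed exactly on a three-point support and compared against the bound:

```python
import numpy as np

# three-point support for W, with true marginal distribution q_w0
q_w0 = np.array([0.2, 0.5, 0.3])
g0   = np.array([0.4, 0.6, 0.8])   # true missingness score g0(w)
qb0  = np.array([0.3, 0.5, 0.7])   # true outcome regression Qbar0(w)
g    = np.array([0.5, 0.5, 0.7])   # candidate estimate of g0
qb   = np.array([0.4, 0.4, 0.6])   # candidate estimate of Qbar0

# second-order remainder term, eq. (3)
r2 = np.sum((1 - g0 / g) * (qb - qb0) * q_w0)

# Cauchy-Schwarz bound: ||1/g||_inf * ||g - g0||_{P0} * ||Qbar - Qbar0||_{P0}
bound = (np.max(1 / g)
         * np.sqrt(np.sum((g - g0) ** 2 * q_w0))
         * np.sqrt(np.sum((qb - qb0) ** 2 * q_w0)))
```

With these values `r2` is roughly 0.018 and `bound` is 0.02, so the inequality holds, and it is nearly tight because both nuisance errors have constant magnitude.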

## 3 Second-order estimation

Let us first introduce some notation. For a function ${f}^{\left(2\right)}$ of a pair of observations $\left({o}_{1},{o}_{2}\right)$, let ${P}_{0}^{2}{f}^{\left(2\right)}:=\int \int {f}^{\left(2\right)}\left({o}_{1},{o}_{2}\right)\mathrm{d}{P}_{0}\left({o}_{1}\right)\mathrm{d}{P}_{0}\left({o}_{2}\right)$ denote the expectation of ${f}^{\left(2\right)}$ with respect to the product measure ${P}_{0}^{2}$.
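The empirical analogue of $P_0^2$, used later to solve the second-order estimating equation, is the V-statistic $P_n^2 f^{(2)} := n^{-2} \sum_{i=1}^n \sum_{j=1}^n f^{(2)}(O_i, O_j)$, a double sum over all ordered pairs of observations. A minimal sketch with a toy two-argument function:

```python
import numpy as np

def p_n2(f2, obs):
    """V-statistic P_n^2 f^(2) = n^{-2} * sum of f2 over all ordered pairs (i, j)."""
    n = len(obs)
    return sum(f2(obs[i], obs[j]) for i in range(n) for j in range(n)) / n**2

obs = np.array([1.0, 2.0, 3.0, 4.0])

# For a product function f2(o1, o2) = o1 * o2 the double sum factorizes,
# so P_n^2 f2 = (mean of obs)^2 = 2.5^2 = 6.25
val = p_n2(lambda o1, o2: o1 * o2, obs)
```

This double-loop form is $O(n^2)$; in practice the pairwise structure of a specific $f^{(2)}$ (such as a kernel) is exploited to vectorize the computation.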

Second-order estimators are based on second-order expansions of the parameter functional of the form $\Psi(P) - \Psi(P_0) = -P_0 D^{(1)}(P) - \frac{1}{2} P_0^2 D^{(2)}(P) + R_3(P, P_0), \qquad (6)$

where $D^{(2)}(P)$ is a function of a pair of observations $(o_1, o_2)$ that depends on P, and $R_3(P, P_0)$ is a third-order remainder term. $D^{(2)}$ is referred to as a second-order gradient. This representation exists only if W has finite support. If the support of W is infinite, it is necessary to use an approximate second-order influence function relying on smoothing, which yields a bias term referred to as the representation error. This may introduce challenges due to the curse of dimensionality. In this section we discuss two possible estimation strategies: (i) an estimator that implements kernel smoothing on the covariate vector, and (ii) an estimator that implements kernel smoothing on the missingness score. Strategy (i) is only practical in the presence of a few, possibly data-adaptively selected, covariates, although a greater number of covariates may be included as sample size increases. Strategy (ii) requires a priori knowledge of the true missingness score, and is therefore not applicable in most practical situations. As a solution, we propose to use strategy (ii) with the estimated missingness score, yielding an estimator we refer to as the 1*-TMLE. As discussed below, the 1*-TMLE is not a second-order estimator, since introduction of an estimated missingness score yields a second-order term in the remainder. Nevertheless, the potential finite sample gains obtained with the 1*-TMLE compared to the standard 1-TMLE are worth further investigation. In Section 4.2 we present a simulation study in which the 1*-TMLE showed considerable finite sample improvement in both mean squared error and coverage probability of associated confidence intervals.

## 3.1 Second-order estimator with kernel smoothing on the covariate vector

Assume momentarily that W is discretely supported. Then the second-order expansion (6) holds with $\begin{array}{rl} D^{(2)}(P)(o_1, o_2) &= \frac{2 a_1 \mathbb{1}\{w_1 = w_2\}}{g(w_1)\, q_W(w_1)} \left\{1 - \frac{a_2}{g(w_1)}\right\} \{y_1 - \bar{Q}(w_1)\},\\ R_3(P, P_0) &= \int \left\{1 - \frac{g_0(w)\, q_{W,0}(w)}{g(w)\, q_W(w)}\right\} \left\{1 - \frac{g_0(w)}{g(w)}\right\} \{\bar{Q}(w) - \bar{Q}_0(w)\}\, dQ_{W,0}(w),\end{array}$

where $q_W$ denotes the probability mass function associated with $Q_W$, and $D^{(1)}$ is defined in eq. (2). It is easy to check explicitly that eq. (6) holds.

In most practical situations, however, W is high-dimensional or contains continuous variables, so that the indicator $\mathbb{1}\{w_1 = w_2\}$ is essentially always zero. To circumvent this issue, we propose to use the above expansion with the indicator function replaced by a kernel function $K_h(w_1 - w_2)$ for a given bandwidth h. If W takes values on a discrete set, we define $K_h(w) = \mathbb{1}(w = 0)$, so that the estimator $\hat{g}_h$ below is the nonparametric estimator using empirical means in strata defined by W. We denote the corresponding approximation of $D^{(2)}$ by $D_h^{(2)}$. The following lemma establishes conditions under which the representation error is negligible.

Lemma 1. Suppose that the distribution of W has compact support and is absolutely continuous with respect to Lebesgue measure, with density $q_{W,0}$. Suppose that $\hat{Q}_W$ is a working estimate of $Q_{W,0}$. If

• 1.

both ${g}_{0}$ and ${Q}_{W,0}$ are $\left({m}_{0}+1\right)$ -times continuously differentiable almost surely;

• 2.

K is orthogonal to all polynomial powers up until ${m}_{0}$;

• 3.

there exists some $\mathrm{\delta }>0$ such that g0 is bounded below by $\mathrm{\delta }$, and both $\stackrel{ˆ}{g}$ and ${\stackrel{ˆ}{Q}}_{W}$ are bounded below by $\mathrm{\delta }$ with probability tending to one,

then we have that $P_0^2 D_h^{(2)}(\hat{\bar{Q}}^*, \hat{g}, \hat{Q}_W) - \lim_{h \to 0} P_0^2 D_h^{(2)}(\hat{\bar{Q}}^*, \hat{g}, \hat{Q}_W) = O_P\left(h^{m_0+1}\, \|\hat{\bar{Q}}^* - \bar{Q}_0\|\right),$

where $\|\hat{\bar{Q}}^* - \bar{Q}_0\|^2 := \int (\hat{\bar{Q}}^* - \bar{Q}_0)^2(w)\, dQ_{W,0}(w)$.

The result above explicitly deals with kernel smoothing with common bandwidth in all dimensions. The lemma also holds, however, if a multivariate bandwidth is utilized, with h substituted by maxj hj in the statement of the lemma.

## 3.1.1 A corresponding 2-TMLE

Analogous to the 1-TMLE discussed in the previous section, we construct an estimator $\hat{P}$ satisfying $P_n D^{(1)}(\hat{P}) = P_n^2 D_h^{(2)}(\hat{P}) = 0$. Solving these equations allows us to exploit expansion (6) and construct a $\sqrt{n}$-consistent estimator in which the assumption $R_2(\hat{P}, P_0) = o_P(n^{-1/2})$ is replaced by the weaker assumption $R_3(\hat{P}, P_0) = o_P(n^{-1/2})$.

For a fixed bandwidth h, the proposed 2-TMLE is given by the following algorithm, which is implemented in the R code provided in the supplementary material.

Step 1. Initial estimators. See the previous section on the 1-TMLE.

Step 2. Compute auxiliary covariates. For each subject i, compute auxiliary covariates $\begin{array}{rl} \hat{H}^{(1)}(W_i) &:= \frac{1}{\hat{g}(W_i)}\\ \hat{H}_h^{(2)}(W_i) &:= \frac{1}{\hat{g}(W_i)} \left\{1 - \frac{\hat{g}_h(W_i)}{\hat{g}(W_i)}\right\}, \end{array}$ where $\hat{g}_h(w) = \frac{\sum_{i=1}^n K_h(w - W_i) A_i}{\sum_{i=1}^n K_h(w - W_i)}$

is a kernel regression estimator of $g_0(w)$.

Step 3. Solve estimating equations. Estimate the parameter $\epsilon = (\epsilon_1, \epsilon_2)$ in the logistic regression model $\operatorname{logit} \hat{\bar{Q}}_{\epsilon,h}(w) = \operatorname{logit} \hat{\bar{Q}}(w) + \epsilon_1 \hat{H}^{(1)}(w) + \epsilon_2 \hat{H}_h^{(2)}(w), \qquad (7)$

by fitting a standard logistic regression model of $Y_i$ on $\hat{H}^{(1)}(W_i)$ and $\hat{H}_h^{(2)}(W_i)$, with no intercept and with offset $\operatorname{logit} \hat{\bar{Q}}(W_i)$, among observations with $A = 1$. Denote the estimate of $\epsilon$ by $\hat{\epsilon}$.

Step 4. Update initial estimator and compute 2-TMLE. Update the initial estimator as $\hat{\bar{Q}}_h^{*}(w) = \hat{\bar{Q}}_{\hat{\epsilon},h}(w)$, and define the h-specific 2-TMLE as $\hat{\psi}_h = \Psi(\hat{\bar{Q}}_h^{*})$.
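The kernel regression estimator $\hat{g}_h$ appearing in Step 2 is a standard Nadaraya-Watson smoother. Below is a numpy sketch for a univariate covariate with a Gaussian kernel; the bandwidth h = 0.1 and the data-generating process are assumptions made purely for illustration (bandwidth selection is discussed below):

```python
import numpy as np

def g_hat_h(w_eval, w_obs, a_obs, h=0.1):
    """Nadaraya-Watson estimate of g0(w) = P(A=1|W=w) with a Gaussian kernel K_h."""
    # K_h(w - W_i): kernel weights for each evaluation point against every observation
    diffs = (w_eval[:, None] - w_obs[None, :]) / h
    k = np.exp(-0.5 * diffs**2)
    # weighted average of A_i in a neighborhood of each evaluation point
    return (k * a_obs).sum(axis=1) / k.sum(axis=1)

# toy data with true missingness score g0(w) = 0.2 + 0.6 w
rng = np.random.default_rng(2)
n = 2000
w = rng.uniform(size=n)
a = rng.binomial(1, 0.2 + 0.6 * w)
g_est = g_hat_h(np.array([0.5]), w, a, h=0.1)  # estimate of g0(0.5) = 0.5
```

For multivariate W a product kernel would be used in place of the univariate Gaussian, which is where the curse of dimensionality enters.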

Computation of $\hat{H}_h^{(2)}(W)$ involves inverse weighting by the square of $\hat{g}(W)$. If the exposure is rare, these weights may be highly variable, causing instability and losses in finite sample performance, as for any estimator using inverse probability weighting. The remedy provided by the theory of targeted minimum loss-based learning [11, 12] in these cases is to remove one factor of $\hat{g}(W)$ from the denominators of $\hat{H}^{(1)}(W)$ and $\hat{H}_h^{(2)}(W)$ and to fit a weighted logistic regression model in Step 3, with weights given by $1/\hat{g}(W)$. This method has been seen to perform well in practice and does not affect the validity of the asymptotic claims of this section.

The estimators presented above require a user-selected bandwidth h. Here we briefly discuss two possible ways to select a bandwidth $\hat{h}$ in practice. Certain convergence rates are required of this bandwidth so that the resulting estimators achieve second-order properties (see Theorem 1 below). The first and easiest option is to select the bandwidth that maximizes the log-likelihood of the density $q_{W,0}$. However, because this choice is targeted to estimation of $q_{W,0}$, it may be sub-optimal for estimation of $\psi_0$. The second alternative is to use the collaborative TMLE (C-TMLE) presented in Ref. [13], which may result in correct convergence rates, as argued in Ref. [4]. The question of whether these selectors achieve the required convergence rate is an open research problem and will be the subject of future research.

The theorem below provides the exact conditions that guarantee asymptotic linearity of $\stackrel{ˆ}{\mathrm{\psi }}$.

Theorem 1. Under the conditions of Lemma 1, and provided that

• 1.

each of $\stackrel{ˆ}{g}-{g}_{0},\phantom{\rule{thinmathspace}{0ex}}\stackrel{ˆ}{\stackrel{ˉ}{Q}}\ast -{\stackrel{ˉ}{Q}}_{0}$ and ${\stackrel{ˆ}{Q}}_{W}-{Q}_{W,0}$ tend to zero in ${L}^{2}\left({Q}_{W,0}\right)$-norm;

• 2.

there exists some $\mathrm{\delta }>0$ such that ${g}_{0},\phantom{\rule{thinmathspace}{0ex}}\stackrel{ˆ}{g}$ and ${\stackrel{ˆ}{Q}}_{W}\cdot \stackrel{ˆ}{g}$ are bounded below by $\mathrm{\delta }$ with probability tending to one;

• 3.

each of $\stackrel{ˆ}{g},\phantom{\rule{thinmathspace}{0ex}}\stackrel{ˆ}{\stackrel{ˉ}{Q}}\ast$ and ${\stackrel{ˆ}{Q}}_{W}$ have uniform sectional variation norm bounded by some $M<\mathrm{\infty }$ with probability tending to one;

• 4.

the kernel function K is 2d-times differentiable and ${\stackrel{ˆ}{h}}^{2d}n\to +\mathrm{\infty }$,

and either of

• 5a.

$R_2(\hat{P}, P_0) = o_P(n^{-1/2})$; or,

• 5b.

$R_3(\hat{P}, P_0) = o_P(n^{-1/2})$ and $\|\hat{\bar{Q}}^* - \bar{Q}_0\|\, \hat{h}^{m_0+1} = o_P(n^{-1/2})$

holds, then $\hat{\psi}_{\hat{h}}$ is an asymptotically efficient estimator of $\psi_0$.

The proof of this theorem is presented in the supplementary materials. A key argument in the proof is that $\hat{P}$ solves the estimating equations $P_n D^{(1)}(\hat{P}) = P_n^2 D_{\hat{h}}^{(2)}(\hat{P}) = 0$. The score equations of the logistic regression model (7) are equal to $\sum_{i=1}^n \hat{H}^{(1)}\{Y_i - \hat{\bar{Q}}_{\epsilon,\hat{h}}(W_i)\} = 0 \quad \text{and} \quad \sum_{i=1}^n \hat{H}_{\hat{h}}^{(2)}\{Y_i - \hat{\bar{Q}}_{\epsilon,\hat{h}}(W_i)\} = 0.$

Because the maximum likelihood estimator solves the score equations, it can be readily seen that $\sum _{i=1}^{n}{\stackrel{^}{H}}^{\left(1\right)}\left\{{Y}_{i}-{\stackrel{^}{\overline{Q}}}_{h}^{*}\left({W}_{i}\right)\right\}=0\text{}\text{and}\text{}\sum _{i=1}^{n}{\stackrel{^}{H}}_{h}^{\left(2\right)}\left\{{Y}_{i}-{\stackrel{^}{\overline{Q}}}_{h}^{*}\left({W}_{i}\right)\right\}=0,$

which, from the definitions of ${\stackrel{ˆ}{H}}^{\left(1\right)}$ and ${\stackrel{ˆ}{H}}_{\stackrel{ˆ}{h}}^{\left(2\right)}$, correspond to ${P}_{n}{D}^{\left(1\right)}\left(\stackrel{ˆ}{P}\right)=0$ and ${P}_{n}^{2}{D}_{\stackrel{ˆ}{h}}^{\left(2\right)}\left(\stackrel{ˆ}{P}\right)=0$, respectively.

As is evident from the conditions of the theorem, the rate at which the bandwidth $\hat{h}$ decreases plays a critical role in the asymptotic behavior of the 2-TMLE described. On one hand, condition 5b of the theorem requires that the bandwidth converge to zero sufficiently quickly in order for $n^{1/2}\,\|\hat{\bar{Q}}^{\ast}-\bar{Q}_0\|\,\hat{h}^{m_0+1}$ to itself converge to zero, where $m_0$ is the order of the kernel $K$ used. This ensures that the representation error is negligible. On the other hand, condition 4 requires $\hat{h}$ to converge to zero slowly enough to allow control of a V-statistic term displayed in the proof of the theorem in the appendix.

Scrutiny of the theorem above reveals that a 2-TMLE will indeed generally be asymptotically linear and efficient in a larger model compared to a corresponding 1-TMLE. On one hand, as explicitly reflected in Theorem 1, for example, it is generally true that whenever a 1-TMLE is efficient, so is the corresponding 2-TMLE. This illustrates that the 2-TMLE operates in a safe haven: the additional targeting required to construct a 2-TMLE is not expected to (asymptotically) harm a 1-TMLE. On the other hand, we note that a 2-TMLE will be efficient in many instances in which a 1-TMLE is not. As an illustration, suppose in the setting of our motivating example that W is a univariate random variable with a sufficiently smooth density function. Suppose also that $g_0$ is smooth enough that an optimal univariate second-order kernel smoother can be utilized to produce an estimate of $g_0$, so that $\|\hat{g}-g_0\|_{P_0} = O_P(n^{-2/5})$. In this case, efficiency of a 1-TMLE requires that $\hat{\bar{Q}}$ tend to $\bar{Q}_0$ at a rate faster than $n^{-1/10}$. In contrast, the corresponding 2-TMLE built upon a second-order canonical gradient approximated using an optimal second-order kernel smoother will be efficient provided that $\hat{\bar{Q}}$ is consistent for $\bar{Q}_0$, irrespective of the actual rate of convergence. The difference between these requirements may not seem drastic in settings where $\bar{Q}_0$ is sufficiently smooth, since then constructing an estimator $\hat{\bar{Q}}$ which satisfies both requirements is easy. This is certainly not so if $\bar{Q}_0$ fails to be smooth, in which case achieving convergence even at the $n^{-1/10}$ rate may be a challenge. This problem is exacerbated further if W has several components.
For example, if W is 5-dimensional, a 1-TMLE requires that $\hat{\bar{Q}}$ tend to $\bar{Q}_0$ faster than $n^{-5/18}$, whereas the corresponding 2-TMLE based on a third-order kernel-smoothed approximation requires that $\hat{\bar{Q}}$ tend to $\bar{Q}_0$ faster than $n^{-1/5}$. While the latter is achievable using an optimal second-order kernel smoother, the former is not, and without further smoothness assumptions on $\bar{Q}_0$, a 1-TMLE will generally not be efficient.
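The rate requirement for the 1-TMLE in this example can be recovered from a Cauchy–Schwarz bound on the second-order remainder (a sketch only; constants and smoothness conditions are omitted):

```latex
% R_2 is a cross-product of estimation errors, so Cauchy--Schwarz gives
\[
|R_2(\hat{P}, P_0)| \;\lesssim\; \|\hat{g}-g_0\|_{P_0}\,\|\hat{\bar{Q}}-\bar{Q}_0\|_{P_0},
\]
% and efficiency requires R_2 = o_P(n^{-1/2}), i.e. the two rate exponents
% must sum to more than 1/2. With d = 5 and an optimal second-order kernel,
\[
\|\hat{g}-g_0\|_{P_0} = O_P\!\left(n^{-2/(4+d)}\right) = O_P\!\left(n^{-2/9}\right),
\quad\text{so}\quad
\|\hat{\bar{Q}}-\bar{Q}_0\|_{P_0} = o_P\!\left(n^{-(1/2-2/9)}\right) = o_P\!\left(n^{-5/18}\right).
\]
```

The univariate case works the same way: with $\|\hat{g}-g_0\|_{P_0}=O_P(n^{-2/5})$, the outcome regression must converge faster than $n^{-1/2+2/5}=n^{-1/10}$.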

## 3.1.2 Comparison with Alternative Second-Order Estimators

To the best of our knowledge, the only second-order estimator preceding our proposal is discussed in Ref. [5]. For a bandwidth $\stackrel{ˆ}{h}$, their estimator is defined as ${\stackrel{ˆ}{\mathrm{\psi }}}_{h}=\mathrm{\Psi }\left(\stackrel{ˆ}{P}\right)+{P}_{n}{D}^{\left(1\right)}\left(\stackrel{ˆ}{P}\right)+\frac{1}{2}{P}_{n}^{2}{D}_{\stackrel{ˆ}{h}}^{\left(2\right)}\left(\stackrel{ˆ}{P}\right).$(8)

Unlike our proposal, this estimator involves direct computation of $D_{\hat{h}}^{(2)}$, which in turn involves inverse weighting by an estimate $\hat{q}_W(w)$ of the multivariate covariate density. As a consequence of the curse of dimensionality, these weights may be very unstable, which may lead to a highly variable estimator in practice. In addition, the above estimator does not always satisfy the global constraints on the parameter space. In contrast, our proposed 2-TMLE is always in the parameter space, since it is defined as a substitution estimator.

## 3.2 Second-order estimator with kernel smoothing on the missingness score

As transpires from the developments above, even if the support of W is finite but nonetheless rich, large samples will be required to ensure that the nonparametric estimator behaves sufficiently well. Given the sufficiency property of the propensity score as a summary of potential confounders, it is natural to inquire whether the use of a second-order partial gradient based on the propensity score (see discussion in Ref. [4]) may allow us to circumvent the dimensionality of W. Suppose that W is finitely supported, and consider the second-order expansion (6) with $$D^{(2)}(P)(o_1,o_2) = \frac{2 a_1 1\{g_0(w_1)=g_0(w_2)\}}{g(w_1)\,q_W(w_1)}\left\{1-\frac{a_2}{g(w_1)}\right\}\left\{y_1-\bar{Q}(w_1)\right\},$$ $$R_3(P,P_0) = \int \left\{1-\frac{g_0(w)\,q_{W,0}(w)}{g(w)\,q_W(w)}\right\}\left\{1-\frac{g_0(w)}{g(w)}\right\}\left\{\bar{Q}(w)-\bar{Q}_0(w)\right\}\mathrm{d}Q_{W,0}(w).$$

In contrast to the previous section, here $q_{W,0}(w)$ represents the density of $g_0(W)$ under $P_0$ evaluated at $g_0(w)$, i.e., $\frac{d}{dx}P_0(g_0(W)\le x)\big|_{x=g_0(w)}$, and $q_W(w)$ represents $\frac{d}{dx}P(g_0(W)\le x)\big|_{x=g_0(w)}$. Analogous to the multivariate case, it is often necessary to replace the indicator $1\{g_0(w_1)=g_0(w_2)\}$, which may not be well supported in the data, with a kernel function $K_h(g_0(w_1)-g_0(w_2))$. We again denote the approximate second-order influence function obtained with such an approximation by $D_h^{(2)}$ to emphasize the dependence on the choice of bandwidth. Using this approximation, the estimation procedure described in the previous section may be carried out in exactly the same fashion, but with $\hat{g}$ replaced by $$\hat{g}_h(w)=\frac{\sum_{i=1}^{n}K_h(g_0(w)-g_0(W_i))A_i}{\sum_{i=1}^{n}K_h(g_0(w)-g_0(W_i))}.$$ This algorithm yields an asymptotically linear estimator of $\psi_0$ under the assumption that $R_3(\hat{P},P_0)=o_P(n^{-1/2})$, among other regularity conditions.

Since ${g}_{0}$ is often unknown, we must instead use an estimate $\stackrel{ˆ}{g}$ of ${g}_{0}$; for example, we may take: ${\stackrel{ˆ}{g}}_{h}\left(w\right):=\frac{{\sum }_{i=1}^{n}{K}_{{h}^{}}\left(\stackrel{ˆ}{g}\left(w\right)-\stackrel{ˆ}{g}\left({W}_{i}\right)\right){A}_{i}}{{\sum }_{i=1}^{n}{K}_{h}\left(\stackrel{ˆ}{g}\left(w\right)-\stackrel{ˆ}{g}\left({W}_{i}\right)\right)}.$
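As a concrete sketch of the display above (illustrative only; the authors' implementation is in R, and all names here are ours), the smoother is a Nadaraya–Watson regression of A on the estimated propensity score. The $1/h$ normalization of $K_h$ cancels in the ratio, so an unnormalized kernel suffices:

```python
import numpy as np

def smooth_propensity(w_new, W, A, g_hat, h):
    """Kernel smoother of A on the estimated propensity score g_hat(W),
    evaluated at the point w_new, with bandwidth h (Gaussian kernel)."""
    u = (g_hat(w_new) - g_hat(W)) / h
    k = np.exp(-0.5 * u ** 2)  # unnormalized Gaussian kernel; 1/h cancels
    return float(np.sum(k * A) / np.sum(k))
```

Because the output is a weighted average of the binary $A_i$, it automatically lies in $[0, 1]$, unlike unconstrained inverse-density-weighted corrections.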

Unfortunately, a careful analysis of the remainder term associated with this estimator reveals that the introduction of an estimate $\hat{g}$ in place of $g_0$ yields a second-order remainder term. Asymptotic efficiency of this estimator, denoted 1*-TMLE, therefore requires that second-order term to be $o_P(n^{-1/2})$. The second-order term associated with this 1*-TMLE, however, is different from $R_2$ defined in eq. (3) and required for asymptotic linearity of the 1-TMLE. As a consequence, these estimators are expected to have different finite-sample properties. We conjecture that the 1*-TMLE of this section has improved finite-sample properties over the 1-TMLE, and present a case study in Section 4 supporting our conjecture.

## 4 Simulation studies

In this section we present the results of two simulation studies, illustrating the improvements obtained by the 1*-TMLE and 2-TMLE compared to the 1-TMLE. We use covariate dimensions $d=1$ and $d=3$ and sample sizes $n\in \left\{500,1000,2000,10000\right\}$ to assess the performance of the estimators in different scenarios. Kernel smoothers were computed using the R package ks [14]. The bandwidth was chosen using the default method of that package [15].

## 4.1.1 Simulation setup

For each sample size n, we simulated 1,000 datasets from the joint distribution implied by the conditional distributions $$W \sim 6\times \mathrm{Beta}(1/2,1/2)-3,\qquad A\,|\,W \sim \mathrm{Ber}(\mathrm{expit}(1+0.7\,W)),\qquad Y\,|\,A=1,W \sim \mathrm{Ber}(\mathrm{expit}(-3+0.5\exp(W)+0.5\,W)),$$

where $\mathrm{Ber}(\cdot)$ denotes the Bernoulli distribution, expit denotes the inverse of the logit function, and $\mathrm{Beta}(a, b)$ denotes the beta distribution.
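This data-generating mechanism can be simulated directly. The sketch below (in Python for illustration; the paper's code is in R) makes explicit that Y is generated only under the missingness model and is unobserved when A = 0:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n, rng):
    # W ~ 6 * Beta(1/2, 1/2) - 3
    W = 6.0 * rng.beta(0.5, 0.5, size=n) - 3.0
    # A | W ~ Ber(expit(1 + 0.7 W)); A = 1 indicates Y is observed
    A = rng.binomial(1, expit(1.0 + 0.7 * W))
    # Y | A = 1, W ~ Ber(expit(-3 + 0.5 exp(W) + 0.5 W)); missing when A = 0
    Y = rng.binomial(1, expit(-3.0 + 0.5 * np.exp(W) + 0.5 * W)).astype(float)
    Y[A == 0] = np.nan
    return W, A, Y
```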

For each dataset, we fitted correctly specified parametric models for $\bar{Q}_0$ and $g_0$. For a perturbation parameter p, we then varied the convergence rate of $\hat{\bar{Q}}$ by multiplying the linear predictor by a random variable with distribution $U(1-n^{-p},1)$ and subtracting a Gaussian random variable with mean $3\times n^{-p}$ and standard deviation $n^{-p}$. Analogously, the convergence rate of $\hat{g}$ was varied using a perturbation parameter q, by multiplying the linear predictor by a random variable distributed $U(1-n^{-q},1)$ and subtracting a Gaussian random variable with mean $3\times n^{-q}$ and standard deviation $n^{-q}$. We varied the values of p and q over the grid $\{0.01,0.02,0.05,0.1,0.2,0.5\}^2$. This perturbation of the MLE in a correctly specified parametric model yields initial estimators with varying consistency rates, allowing us to assess the performance of the estimators under such scenarios. To see how this procedure achieves varying consistency rates, denote the MLE of $g_0$ in the correct parametric model by $\hat{g}^{\mathrm{MLE}}$, and denote the perturbed estimate by $\hat{g}_q^{\mathrm{MLE}}$. Let $U_n$ and $V_n$ be random variables distributed $U(1-n^{-q},1)$ and $N(-3n^{-q},n^{-2q})$, respectively.
Then, substituting $\hat{g}_q^{\mathrm{MLE}}(W)=\hat{g}^{\mathrm{MLE}}(W)\,U_n+V_n$ into $\|\hat{g}_q^{\mathrm{MLE}}-g_0\|_{P_0}^2$ yields $$\|\hat{g}_q^{\mathrm{MLE}}-g_0\|_{P_0}^2 \le \|U_n(\hat{g}^{\mathrm{MLE}}-g_0)\|_{P_0}^2+\|g_0(U_n-1)\|_{P_0}^2+\|V_n\|_{P_0}^2 = O_P(n^{-1}+n^{-2q}).$$
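The perturbation scheme can be sketched as follows (an illustrative Python fragment, not the paper's R code; `eta` stands for the fitted model's linear predictor, a name of ours). With `rate = 0.5` the perturbation vanishes at the parametric rate; with `rate = 0` it never vanishes, giving an inconsistent estimator:

```python
import numpy as np

def perturb(eta, rate, n, rng):
    """Degrade an MLE linear predictor so the resulting estimator
    converges at rate n^{-rate} (rate plays the role of p or q)."""
    U = rng.uniform(1.0 - n ** (-rate), 1.0, size=np.shape(eta))  # multiplicative shrinkage
    V = rng.normal(3.0 * n ** (-rate), n ** (-rate), size=np.shape(eta))  # additive bias + noise
    return eta * U - V
```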

Consider now different values of q. For example, $q=0.5$ yields the parametric consistency rate

$\parallel {\stackrel{ˆ}{g}}_{q}^{\mathrm{M}\mathrm{L}\mathrm{E}}-{g}_{0}{\parallel }_{{P}_{0}}^{2}={O}_{P}\left(1/n\right),$ whereas $q=0$ yields an inconsistent estimator.

We computed a 1-TMLE, a 1*-TMLE, and a 2-TMLE for each initial estimator $(\hat{\bar{Q}},\hat{g})$ obtained through this perturbation. We compare the performance of the three estimators through their bias inflated by a factor $\sqrt{n}$, their variance relative to the nonparametric efficiency bound, and the coverage probability of a 95 % confidence interval assuming a known variance. We treat the variance as known (computing it as the empirical variance across simulated datasets) in order to isolate coverage from randomness and bias in variance estimation. The variance, bias, and coverage probabilities are approximated through empirical means across the 1,000 simulated datasets.
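These performance measures can be computed from the Monte Carlo draws as below (a sketch; `psi0` and `eff_bound` denote the true parameter value and the nonparametric efficiency bound, both known by design in the simulation):

```python
import numpy as np

def performance(estimates, psi0, eff_bound, n):
    """Root-n bias, relative variance, and coverage across simulated datasets."""
    est = np.asarray(estimates, dtype=float)
    root_n_bias = np.sqrt(n) * abs(est.mean() - psi0)
    rvar = n * est.var() / eff_bound          # variance relative to the efficiency bound
    se = est.std()                            # "known" SE: empirical across datasets
    coverage = np.mean(np.abs(est - psi0) <= 1.96 * se)
    return root_n_bias, rvar, coverage
```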

## 4.1.2 Simulation results

Table 1 shows the relative variance (rVar, defined as n times the variance divided by the efficiency bound), the absolute bias inflated by a factor $\sqrt{n},$ as well as the coverage probability of a 95 % confidence interval for selected values of the perturbation parameter (p, q). Figure 1 shows the absolute bias of each estimator multiplied by $\sqrt{n},$ and Figure 2 shows the coverage probability of a 95 % confidence interval.

Table 1:

Performance of the estimators for different sample sizes and convergence rates of the initial estimators of ${\stackrel{ˉ}{Q}}_{0}$ and g0, when $d=1.$

Figure 1:

Absolute bias of the estimators (multiplied by $\sqrt{n}$) for different sample sizes and convergence rates of the initial estimators of ${\stackrel{ˉ}{Q}}_{0}$ and g0, when $d=1.$

Figure 2:

Coverage probabilities of confidence intervals for different sample sizes and varying convergence rates of the initial estimators of ${\stackrel{ˉ}{Q}}_{0}$ and g0, when $d=1.$

First, we notice that for certain slow convergence rates all the estimators have a very large bias (e. g., $p=0.01$ and $q=0.01$; $p=0.1$ and $q=0.01$). In contrast, for some other slow convergence rates, the absolute bias scaled by $\sqrt{n}$ of the 1-TMLE diverges very fast in comparison to the 2-TMLE and 1*-TMLE (e. g., $p=0.1$ and $q=0.1$). The improvement in asymptotic absolute bias of the proposed estimators comes at the price of increased variance in certain small-sample scenarios $(n\le 2000)$, such as when the outcome model converges at a fast enough rate $(p=0.5)$ but the missingness mechanism does not $(q=0.1)$. In this case, the 1-TMLE has lower variance than its competitors. This advantage of the first-order TMLE disappears asymptotically, as predicted by theory.

In terms of coverage, the improvement obtained with the 1*-TMLE and the 2-TMLE is overwhelming for small values of both p and q. As an example, consider the case $n=2000,\phantom{\rule{thinmathspace}{0ex}}p=0.01,\phantom{\rule{thinmathspace}{0ex}}q=0.1,$ in which the coverage probability is 0 and 0.91 for the 1-TMLE and the 1*-TMLE, respectively. This simulation illustrates the potential for dramatic improvement obtained by using the 1*-TMLE and the 2-TMLE, which comes at the cost of over-coverage in small sample sizes with a fast enough convergence rate $\left(n\le 2000,\phantom{\rule{thinmathspace}{0ex}}p=0.5,\phantom{\rule{thinmathspace}{0ex}}q=0.1\right).$

Figures 1 and 2 show clearly a region of slow convergence rates in which the proposed estimators outperform the standard first-order TMLE. In addition, as seen in Figure 1, we observe a small advantage of the 2-TMLE over the 1*-TMLE in terms of $\sqrt{n}$ bias.

## 4.2.1 Simulation setup

For each sample size $n\in\{500, 1000, 2000, 10000\}$, we simulated 1,000 datasets from the joint distribution implied by the conditional distributions $$W_1 \sim \mathrm{Beta}(2,2),\qquad W_2\,|\,W_1 \sim \mathrm{Beta}(2W_1,2),\qquad W_3\,|\,W_1,W_2 \sim \mathrm{Beta}(2W_1,2W_2),$$ $$A\,|\,W \sim \mathrm{Ber}(\mathrm{expit}(1+0.12W_1+0.1W_2+0.5W_3)),\qquad Y\,|\,A=1,W \sim \mathrm{Ber}(\mathrm{expit}(-4+0.2W_1+0.3W_2+0.5\exp(W_3))),$$

where $Ber\left(\cdot \right)$ denotes the Bernoulli distribution, expit denotes the inverse of the logit function, and $Beta\left(\cdot \right)$ denotes the beta distribution. For each dataset, we fitted correctly-specified parametric models for ${\stackrel{ˉ}{Q}}_{0}$ and g0. We then varied the convergence rate of $\stackrel{ˆ}{\stackrel{ˉ}{Q}}$ and $\stackrel{ˆ}{g}$ by adding Gaussian random variables as in the previous subsection.

## 4.2.2 Simulation results

Table 2 shows the $\sqrt{n}$ absolute bias, relative variance, and coverage probability of each estimator for selected values of the convergence perturbation (p, q). Figures 3 and 4 show the $\sqrt{n}$ absolute bias and coverage probability of a 95 % confidence interval for all values of (p, q) used in the simulation.

Table 2:

Performance of the estimators for different sample sizes and convergence rates of the initial estimators of ${\stackrel{ˉ}{Q}}_{0}$ and g0, when $d=3.$

Figure 3:

Absolute bias of the estimators (multiplied by $\sqrt{n}$) for different sample sizes and varying convergence rates of the initial estimators of ${\stackrel{ˉ}{Q}}_{0}$ and g0.

Figure 4:

Coverage probabilities of confidence intervals for different sample sizes and varying convergence rates of the initial estimators of ${\stackrel{ˉ}{Q}}_{0}$ and g0.

The remarks of the previous section regarding the trade-offs between variance and bias in different regions of the convergence rates also hold for this simulation. The main difference observed here is that the 2-TMLE has poorer performance in terms of $\sqrt{n}$ bias than the 1-TMLE and the 1*-TMLE for small samples when one of the models converges at a fast enough rate $(p=0.5\ \mathrm{or}\ q=0.5)$. This problem diminishes as n increases, but it highlights the point that the 2-TMLE should be used with caution in small samples.

In this simulation we do not see any practical advantage of the 2-TMLE over the 1*-TMLE. In fact, the 1*-TMLE performs better than the 2-TMLE for small samples, and outperforms the 1-TMLE in all sample sizes, with the caveat of increased variance in certain scenarios as discussed in the previous section.

## 5 Data illustration

In order to illustrate the methods presented, we make use of the dataset lindner, available in the R package PSAgraphics. The dataset contains data on 996 patients treated at the Lindner Center, Christ Hospital, Cincinnati, in 1997, originally analyzed in Ref. [16]. All patients received a percutaneous coronary intervention (PCI). One of the primary goals of the original study was to assess whether administration of abciximab, an anticoagulant, during PCI improves short- and long-term health outcomes of patients undergoing PCI. We reanalyze the lindner dataset focusing on the cardiac-related costs incurred within 6 months of a patient's initial PCI as the outcome. The covariates measured are: an indicator of coronary stent deployment during the PCI, height, sex, diabetes status, prior acute myocardial infarction, left ejection fraction, and number of vessels involved in the PCI.

As noted by several authors [e. g. Refs 8, 17, 18], causal inference problems may be tackled using methods for missing data. Let T denote an indicator of having received abciximab. Adopting the potential outcomes framework, consider the potential outcomes $Y_t,\ t\in\{0,1\}$, given by the outcomes that would have been observed in a hypothetical world in which, contrary to fact, $P(T=t)=1$. The consistency assumption states that $T=t$ implies $Y_t=Y$, where Y is the observed outcome. Thus, $E(Y_t)$ may be estimated using methods for missing outcomes, where $Y_t$ is observed only when $T=t$. In particular, estimation of $E(Y_1)$ and $E(Y_0)$ is carried out using the methods described in the previous sections with $A=T$ and $A=1-T$, respectively. Our parameter of interest is the average treatment effect $E(Y_1)-E(Y_0)$.

Since the outcome is continuous, we first used the transformation $(y-\min(y))/(\max(y)-\min(y))$ to map it to the interval $[0, 1]$. We then used the approach outlined in Ref. [19] to construct the 1-TMLE and the 1*-TMLE. We do not consider the 2-TMLE, since the curse of dimensionality precludes estimation of the propensity score via kernel regression. The distribution of both estimators was estimated with the bootstrap as discussed in Section 4.2 of Ref. [4], which involves bootstrapping the second-order expansion of the estimator. This bootstrapped distribution is preferred to one based on the first-order influence function, as it is expected to capture the second-order behavior of the estimators and may therefore yield finite-sample gains for the 1*-TMLE. For comparison, we also present the confidence interval obtained using the asymptotic normal distribution with the variance estimated as the empirical variance of the first-order efficient influence function.
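The preliminary rescaling is a simple affine map; a minimal sketch (function name ours, for illustration):

```python
import numpy as np

def scale_to_unit(y):
    """Linearly map a continuous outcome onto [0, 1]; estimates on the
    transformed scale are mapped back by the inverse affine transformation."""
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())
```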

The mean of the outcome conditional on covariates was estimated separately for the two treatment groups. Both the outcome regression and the treatment mechanism were estimated using a model-stacking technique called Super Learning [20]. Super Learning takes a collection of candidate estimators and combines them in a weighted average, where the weights are chosen to minimize the cross-validated prediction error of the final predictor, measured in terms of the L2 loss function. The collection of algorithms used is described in Table 3. Table 4 shows the cross-validated risks of the algorithms as well as their weights in the final predictor of ${\bar{Q}}_0$ and $g_0$.

Table 3:

Prediction algorithms used to estimate ${\stackrel{ˉ}{Q}}_{0}$ and ${g}_{0}$.

Table 4:

Cross-validated risk and weight of each algorithm in the Super Learner for estimation of ${\stackrel{ˉ}{Q}}_{0}$ and ${g}_{0}.$

For bandwidth selection, we use a loss function that targets directly the first-order expansion of the parameter of interest, which is equivalent to the first step of the collaborative TMLE (C-TMLE) presented in Ref. [13]. This approximation of the C-TMLE is computationally more tractable and is justified theoretically as argued below.

Following [21], let $s\in \left\{1,\dots ,S\right\}$ index a random sample split into a validation sample $V\left(s\right)$ and a training sample $T\left(s\right).$ The cross-validation bandwidth selector is defined as $\stackrel{ˆ}{h}:=\underset{h}{\mathrm{a}\mathrm{r}\mathrm{g}\mathrm{m}\mathrm{i}\mathrm{n}}\left\{c\mathrm{\upsilon }\phantom{\rule{thinmathspace}{0ex}}RSS\left(h\right)+\phantom{\rule{thinmathspace}{0ex}}c\mathrm{\upsilon }\phantom{\rule{thinmathspace}{0ex}}Var\left(h\right)+n×{\left[c\mathrm{\upsilon }\phantom{\rule{thinmathspace}{0ex}}Bias\left(h\right)\right]}^{2}\right\},$

where $\begin{array}{rl}c\mathrm{\upsilon }\phantom{\rule{thinmathspace}{0ex}}RSS\left(h\right):=& \sum _{s=1}^{S}{\sum _{i\in V\left(s\right)}\left\{{Y}_{i}-{\stackrel{ˆ}{\stackrel{ˉ}{Q}}}_{h,s}^{\ast }\phantom{\rule{thinmathspace}{0ex}}\left({W}_{i}\right)\right\}}^{2},\\ c\mathrm{\upsilon }\phantom{\rule{thinmathspace}{0ex}}Var\left(h\right):=& \sum _{s=1}^{S}{\sum _{i\in V\left(s\right)}\left[\frac{{A}_{i}}{{\stackrel{ˆ}{g}}_{s}\left({W}_{i}\right)}\left\{{Y}_{i}-{\stackrel{ˆ}{\stackrel{ˉ}{Q}}}_{h,s}^{\ast }\phantom{\rule{thinmathspace}{0ex}}\left({W}_{i}\right)\right\}+{\stackrel{ˆ}{\stackrel{ˉ}{Q}}}_{h,s}^{\ast }\phantom{\rule{thinmathspace}{0ex}}\left({W}_{i}\right)-{\stackrel{ˆ}{\mathrm{\psi }}}_{h,s}\right]}^{2},\mathrm{a}\mathrm{n}\mathrm{d}\\ c\mathrm{\upsilon }\phantom{\rule{thinmathspace}{0ex}}Bias\left(h\right):=& \frac{1}{S}\sum _{s=1}^{S}\left({\stackrel{ˆ}{\mathrm{\psi }}}_{h,s}-{\stackrel{ˆ}{\mathrm{\psi }}}_{h}\right)\end{array}$

are the cross-validated residual sum of squares (RSS), cross-validated variance estimate, and cross-validated bias estimate, respectively. The key idea is to select the bandwidth h that makes ${\stackrel{ˆ}{H}}_{h}^{\left(2\right)}$ most predictive of Y, while adding an asymptotically negligible penalty term for increases in bias and variance in estimation of ${\mathrm{\psi }}_{0}.$ Here, ${\stackrel{ˆ}{\stackrel{ˉ}{Q}}}_{h,s}^{\ast },\phantom{\rule{thinmathspace}{0ex}}{\stackrel{ˆ}{\mathrm{\psi }}}_{h,s},$ and ${\stackrel{ˆ}{g}}_{s}$ are the result of applying the estimation algorithms described in Section 3 using only data in the training sample T(s).

This loss function is the result of adding a mean squared error (MSE) term $cv\,Var(h)+n\times[cv\,Bias(h)]^2$ to the usual RSS loss function used in regression problems. Since the MSE contribution is asymptotically negligible compared to the RSS, the result remains a valid loss function for the parameter $\bar{Q}_0$. Intuitively, the cross-validated MSE term serves the purpose of penalizing bandwidths that are solely targeted to estimation of $\bar{Q}_0$ but perform poorly for $\psi_0$. This bandwidth selection algorithm, as well as the estimator, is implemented in the R code provided in the supplementary materials.
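For one candidate bandwidth, the criterion above can be sketched as follows (illustrative Python, not the supplementary R code; each fold dictionary carries the validation-set quantities defined in the display, under names of our choosing):

```python
import numpy as np

def cv_criterion(folds, psi_full, n):
    """cvRSS(h) + cvVar(h) + n * cvBias(h)^2 for one candidate bandwidth h.
    Each fold holds validation arrays Y, A, g (= g-hat_s(W_i)), Q (= Q-bar*_{h,s}(W_i))
    and the training-sample estimate psi (= psi-hat_{h,s}); psi_full is psi-hat_h."""
    rss = sum(np.sum((f["Y"] - f["Q"]) ** 2) for f in folds)
    var = sum(
        np.sum((f["A"] / f["g"] * (f["Y"] - f["Q"]) + f["Q"] - f["psi"]) ** 2)
        for f in folds
    )
    bias = np.mean([f["psi"] - psi_full for f in folds])
    return rss + var + n * bias ** 2
```

The selected bandwidth is then the minimizer of this criterion over a grid of candidate values of h.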

## 5.1 Results

The unadjusted difference in the outcome between the two groups is equal to US$1512. The 1-TMLE and the 1*-TMLE give adjusted differences of US$765 and US$561, with 95 % bootstrap confidence intervals $(-667, 2732)$ and $(-1212, 2174)$, respectively. The bootstrap standard errors of the two estimators are 803 and 826, respectively. For comparison, the confidence interval obtained for the 1-TMLE using its asymptotic Gaussian distribution and the empirical variance of the first-order efficient influence function is $(-1078, 2201)$. The larger variance of the 1*-TMLE may be a consequence of our conjectured property that the 1*-TMLE has a better finite-sample bias-variance trade-off. In this illustration, the use of an estimator with improved asymptotic properties considerably changes the point estimate and confidence intervals.

## 6 Discussion

We proposed a second-order estimator of the mean of an outcome missing at random, and presented a theorem giving conditions under which it is expected to be asymptotically efficient. Our main accomplishment is to show that the second-order TMLE achieves efficiency under slower convergence rates of the initial estimators than those required for efficiency of first-order estimators. The conditions for efficiency of our proposed second-order procedure include convergence of a kernel bandwidth at a rate that is neither too fast nor too slow. The construction of algorithms that achieve the required rates remains an open question.

In addition to the second-order estimator, we presented a novel first-order estimator whose construction is inspired by a second-order expansion of the parameter functional. We showed dramatic improvements in bias and coverage probability of this estimator compared to a first-order competitor in simulations. We conjecture that gains of this kind are expected to hold in general for finite samples, but a formal study of the remainder terms of both estimators remains to be done.

The properties of our proposed methods under inconsistent estimation of one of $g_0$ and $\bar{Q}_0$ remain to be studied. In particular, an extension of the methodology of Ref. [10] to obtain second-order, doubly robust asymptotic inference is the subject of future research in this area.

## References

• 1. Starmans R.J. Models, inference, and truth: probabilistic reasoning in the information era. In: M van der Laan and S Rose, editors. Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer, 2011.

• 2. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat 2006;2. Available at: http://www.bepress.com/ijb/vol2/iss1/11.

• 3. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.

• 4. Carone M, Díaz I, van der Laan MJ. Higher-order targeted minimum loss-based estimation. 2014.

• 5. Robins J, Li L, Tchetgen E, van der Vaart AW. Quadratic semiparametric von Mises calculus. Metrika 2009;69:227–47.

• 6. Robins J, Tchetgen ET, Li L, van der Vaart A, et al. Semiparametric minimax rates. Electron J Stat 2009;3:1305–21.

• 7. Tan Z. Second-order asymptotic theory for calibration estimators in sampling and missing-data problems. J Multivariate Anal 2014;131:240–53.

• 8. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–73.

• 9. van der Vaart AW. Asymptotic statistics. Cambridge: Cambridge University Press, 1998.

• 10. van der Laan MJ. Targeted estimation of nuisance parameters to obtain valid statistical inference. Int J Biostat 2014;10:29–57.

• 11. Díaz I, Rosenblum M. Targeted maximum likelihood estimation using exponential families. Int J Biostat 2015;11:233–51.

• 12. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat 2011;7:1–34.

• 13. van der Laan MJ, Gruber S. Collaborative double robust targeted maximum likelihood estimation. Int J Biostat 2010;6(1):17.

• 14. Duong T. ks: Kernel Smoothing, 2015. Available at: http://CRAN.R-project.org/package=ks. R package version 1.9.4.

• 15. Wand MP, Jones MC. Multivariate plug-in bandwidth selection. Comput Stat 1994;9:97–116.

• 16. Bertrand ME, Simoons ML, Fox KA, Wallentin LC, Hamm CW, McFadden E, et al. Management of acute coronary syndromes in patients presenting without persistent ST-segment elevation. Eur Heart J 2002;23(23):1809–40.

• 17. Mohan K, Pearl J, Tian J. Missing data as a causal inference problem. In Proceedings of the Neural Information Processing Systems Conference (NIPS). Citeseer, 2013.

• 18. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55.

• 19. Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Int J Biostat 2010;6(1).

• 20. van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol 2007;6.

• 21. Gruber S, van der Laan M. C-tmle of an additive point treatment effect. In: M van der Laan and S Rose, editors. Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer, 2011.

## Supplemental Material

The online version of this article (DOI: 10.1515/ijb-2015-0031) offers supplementary material, available to authorized users

Published Online: 2016-05-26

Published in Print: 2016-05-01

Funding: Marco Carone was supported in part by NIH grant UM1AI068635, by an endowment generously provided by Genentech, and by the University of Washington Department of Biostatistics Career Development Fund. Mark J. van der Laan was supported by NIH grant R01AI07434506.

Citation Information: The International Journal of Biostatistics, Volume 12, Issue 1, Pages 333–349, ISSN (Online) 1557-4679, ISSN (Print) 2194-573X,
