
Journal of Causal Inference

Ed. by Imai, Kosuke / Pearl, Judea / Petersen, Maya Liv / Sekhon, Jasjeet / van der Laan, Mark J.

2 Issues per year

Online ISSN 2193-3685

Causal Inference for a Population of Causally Connected Units

Mark J. van der Laan
  • Corresponding author
  • University of California – Berkeley, Berkeley, CA, USA
Published Online: 2014-01-14 | DOI: https://doi.org/10.1515/jci-2013-0002

Abstract

Suppose that we observe a population of causally connected units. On each unit at each time-point on a grid we observe a set of other units the unit is potentially connected with, and a unit-specific longitudinal data structure consisting of baseline and time-dependent covariates, a time-dependent treatment, and a final outcome of interest. The target quantity of interest is defined as the mean outcome for this group of units if the exposures of the units would be probabilistically assigned according to a known specified mechanism, where the latter is called a stochastic intervention. Causal effects of interest are defined as contrasts of the mean of the unit-specific outcomes under different stochastic interventions one wishes to evaluate. This covers a large range of estimation problems from independent units, independent clusters of units, and a single cluster of units in which each unit has a limited number of connections to other units. The allowed dependence includes treatment allocation in response to data on multiple units and so-called causal interference as special cases. We present a few motivating classes of examples, propose a structural causal model, define the desired causal quantities, address the identification of these quantities from the observed data, and define maximum likelihood estimators based on cross-validation. In particular, we present maximum likelihood based super-learning for this network data. Nonetheless, such smoothed/regularized maximum likelihood estimators are not targeted and will thereby be overly biased w.r.t. the target parameter, and, as a consequence, generally not result in asymptotically normally distributed estimators of the statistical target parameter.

To formally develop estimation theory, we focus on the simpler case in which the longitudinal data structure is a point-treatment data structure. We formulate a novel targeted maximum likelihood estimator of this estimand and show that the double robustness of the efficient influence curve implies that the bias of the targeted minimum loss-based estimation (TMLE) will be a second-order term involving squared differences of two nuisance parameters. In particular, the TMLE will be consistent if either one of these nuisance parameters is consistently estimated. Due to the causal dependencies between units, the data set may correspond with the realization of a single experiment, so that establishing a (e.g. normal) limit distribution for the targeted maximum likelihood estimators, and corresponding statistical inference, is a challenging topic. We prove two formal theorems establishing the asymptotic normality using advances in weak-convergence theory. We conclude with a discussion and refer to an accompanying technical report for extensions to general longitudinal data structures.

Keywords: networks; causal inference; targeted maximum likelihood estimation; stochastic intervention; efficient influence curve

1 Introduction and motivation

Most of the literature on causal inference has focussed on assessing the causal effect of a single or multiple time-point intervention on some outcome based on observing n longitudinal data structures on n independent units that are not causally connected. For literature reviews, we refer to a number of books on this topic: Rubin [1], Pearl [2], van der Laan and Robins [3], Tsiatis [4], Hernán and Robins [5], and van der Laan and Rose [6].

Such a causal effect is defined as an expectation of the effect of the intervention assigned to the unit on the unit’s outcome, and causal effects of the intervention on other units on the unit’s outcome are assumed non-existent. As a consequence, causal models only have to be concerned about the modeling of causal relations between the components of the unit-specific data structure. Statistical inference is based on the assumption that the n data structures can be viewed as n independent realizations of a random variable, so that central limit theorems (CLTs) for sums of independent random variables can be employed. The latter requires that the sample size n is large enough so that statistical inference based on the normal limit distributions is indeed appropriate.

In many applications, one may define the unit as a group of causally connected individuals, often called a community or cluster. It is then assumed that the communities are not causally connected, and that the community-specific data structures can be represented as n independent random variables. One can then define a community-specific outcome and assess the causal effect of the community level intervention/exposure on this community-specific outcome with methods from the causal inference literature. Such causal effects incorporate the total effect of community level intervention, where the effect of the community level exposure on an individual in a community also occurs through other individuals in that same community.

We refer to Halloran and Struchiner [7], Hudgens and Halloran [8], VanderWeele et al. [9], and Tchetgen Tchetgen and VanderWeele [10] for defining different types of causal effects in the presence of causal interference between units. Lacking a general methodological framework, many practical studies assume away interference for the sake of simplicity. The risk of this assumption is practically demonstrated in Sobel [11], who shows that ignoring interference can lead to completely wrong conclusions about the effectiveness of the program. We also refer to Donner and Klar [12], Hayes and Moulton [13], and Campbell et al. [14] for reviews on cluster randomized trials and cluster level observational studies.

In many such community randomized trials or observational studies, the number of communities is very small (e.g. around 10 or so), so that the number of independent units itself is not large enough for statistical inference based on limit distributions. In the extreme, but not uncommon, case, one may observe a single community of causally connected individuals. Can one now still statistically evaluate a causal effect of an intervention assigned at the community level on a community level outcome, such as the average of individual outcomes? This is the very question we aim to address in this article. Clearly, causal models incorporating all units are needed in order to define the desired causal quantity, and identifiability of these causal quantities under (minimal) assumptions needs to be established without relying on asymptotics in a number of independent units.

An important ingredient of the modeling approach carried out in this article is the incorporation of network information that describes, for each unit i (in a finite population of N units) at certain points in time t, a set of other units from which this unit may receive input. This allows us to pose a structural equation model for this group of units in which the observed data node at time t of a unit i is causally affected only by the observed data on the units in this set, beyond exogenous errors. This set of friends needs to include the actual immediate friends of unit i that directly affect the data at time t of unit i, and if one knows the actual immediate friends, it should not include anybody else. Such a structural equation model could be visualized through a so-called causal graph involving all N units, which one might call a network. Our assumptions on the exogenous errors in the structural equation model will correspond with assuming sequential conditional independence of the unit-specific data nodes at time t, conditional on the past of all units at time t. That is, conditional on the most recent past of all units, including the recent network information, the data on the units at the next time-point are independent across units. The smaller these friend sets can be selected, the fewer incoming edges for each node in the causal graph, and the larger the effective sample size will be for targeting the desired quantity. Even though these causal graphs allow the units to depend on each other in complex ways, if the size of each friend set is bounded universally in N, then under our independence assumptions on the exogenous errors it will follow that the likelihood of the data on all N units allows statistical inference driven by the number of units N instead of by the number of communities (e.g. 1). In future work, we will generalize our formal asymptotic results, in which the friend sets are universally bounded, to the case in which their size can grow with N.
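To make the bounded-friend-set setup concrete, here is a small simulated sketch. All functional forms, coefficients, and the network itself are hypothetical illustrations, not taken from the paper: each unit's outcome depends on its own covariate and treatment and on a fixed-dimensional summary (here, the mean) of at most K friends' treatments, and, conditional on the data of all units in the past, the unit-specific log-likelihood contributions simply add up over the N units.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500   # number of units in the single connected population
K = 2     # hypothetical universal bound on the number of friends, fixed in N

# Friend sets: unit i receives input from at most K other units.
friends = [rng.choice(np.delete(np.arange(N), i), size=K, replace=False)
           for i in range(N)]

W = rng.normal(size=N)              # baseline covariates
A = rng.binomial(1, 0.5, size=N)    # randomized treatments

# Hypothetical structural equation: Y_i depends on (W_i, A_i) and on a
# fixed-dimensional summary measure of the friends' treatments.
friend_mean_A = np.array([A[friends[i]].mean() for i in range(N)])
sigma = 0.3
Y = 1.0 + 0.5 * W + A + 0.8 * friend_mean_A + rng.normal(scale=sigma, size=N)

# Conditional on (W, A) of all units, the Y_i are independent, so the
# log-likelihood is a sum of N unit-specific terms: inference is driven by
# the number of units N, not by the number of connected components (here 1).
mu = 1.0 + 0.5 * W + A + 0.8 * friend_mean_A
loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                - (Y - mu) ** 2 / (2 * sigma**2))
print(loglik / N)   # average per-unit log-likelihood contribution
```

The point of the sketch is only the factorization in the last lines: even though all N units live in one network, the conditional likelihood contributes one term per unit.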

To precisely define and solve the estimation problem, we will apply the roadmap for targeted learning of a causal effect (e.g. Refs [2, 6, 15]). We start out with defining a structural causal model [2] that models how each data node is a function of parent data nodes and exogenous variables, and defining the causal quantity of interest in terms of stochastic interventions on the unit-specific treatment nodes. The structural assumptions of the structural causal model could be visualized by a causal graph describing the causal links between the N units and how these links evolve over time, and from that it is clear that this structural causal model describes what one might call a dynamic causal network.

As mentioned above, our structural equation model also makes strong independence assumptions on the exogenous errors, which imply that the unit-specific data nodes at time t are independent across the N units, conditionally on the past of all N units. We refer to this assumption as a sequential conditional independence assumption. Thus, it is assumed that any dependence of the unit-specific data nodes at time t can be fully explained by the observed past on all N units. (In our technical report, we weakened this assumption to allow for residual dependence after this adjustment, among units that are causally connected.) As a next step in the roadmap, we then establish the identifiability of the causal quantity from the data distribution under transparent additional (often non-testable) assumptions. This identifiability result allows us to define and commit to a statistical model that contains the true probability distribution of the data, and an estimand (i.e. a target parameter mapping applied to the true data distribution) that reduces to this causal quantity if the required causal assumptions hold. The statistical model needs to contain the true data distribution, so that the statistical estimand can be interpreted as a pure statistical target parameter, while under the stated additional causal conditions that were needed to identify the causal effect, it can be interpreted as the causal quantity of interest. This statistical model, and the target parameter mapping that maps data distributions in this statistical model into parameter values, defines the pure statistical estimation problem. As a next step in the roadmap, we develop targeted estimators of the statistical estimand and the theory for statistical inference. To understand the deviation between the estimand and the causal quantity under a variety of violations of these causal assumptions, one may carry out a sensitivity type analysis [16–18, 36], which represents the final step of the roadmap.

Since the statistical model does not assume that the data generating experiment involves the repetition of independent experiments, the development of targeted estimators and inference presents novel challenges in estimation and inference that, to the best of our knowledge, have not been addressed by the current literature. TMLE was developed for estimation in semi-parametric models for i.i.d. data [6, 19, 20] and extended to a particular form of dependent treatment/censoring allocation as present in group sequential adaptive designs [19, 21, 22] and community randomized trials [23]. In this article, we generalize TMLE to the complex semi-parametric statistical model presented here, and we also develop the corresponding statistical inference.

Our models generalize the models in the causal inference literature for independent units. Even though in this article our causal model describes a single group of units, it obviously includes the case that the units can be partitioned into multiple causally independent groups of units. In addition, our models also incorporate group sequential adaptive designs in which treatment allocation to an individual can be based on what has been observed on previously recruited individuals in the trial [19, 21, 22, 24]. Our models also allow that the outcome of an individual is a function of the treatments other individuals received. The latter is referred to as interference in the causal inference literature. Thus the causal models proposed in this article not only generalize the existing causal models for independent units, but they also generalize causal models that incorporate previously studied causal dependencies between units. Finally, we note that our models and corresponding methodology can also be used to establish a methodology for assessing causal effects of interventions on the network on the average of the unit-specific outcomes. For example, one might want to know how the community level outcome changes if we change the network structure of the community through some intervention, such as increasing the connectivity between certain units in the community. In this case, our treatment nodes need to be defined as properties of the friend sets, so that a change in treatment corresponds with a change in the network structure.

Nonetheless, the universal bounds we assume on the size of the friend sets in our formal results exclude many realistic and important types of networks, demonstrating that our asymptotic theorems need to be further generalized in order to capture such networks; that, however, is beyond the scope of this article.

1.1 A bibliographic remark and possible relation to network literature

We acknowledge that our contribution does not fit neatly into the current literature on networks, which is much more concerned with properties of the network structure and uses particular types of models and estimands that are often not embedded within a causal model as we have done here (e.g. Ref. [25]). Our contribution aligns with and builds on the current causal inference literature (Neyman–Rubin or Pearl's structural equation models) to define the causal quantity of interest and establish identifiability from observed data. In addition, it builds on the modern literature of targeted learning in semi-parametric models and weak-convergence theory in order to deal with the estimation problem based on dependent data. Nonetheless, we think it is appropriate to define and model networks of units in terms of a structural equation model, so that the impact of interventions on this network of units can be formally defined, and methods for assessing such causal effects can be developed, as we do in this article. Therefore, we suggest and hope that our contributions may become relevant to the literature on networks.

In this article, we focussed on the case that we observe all N units in the population, while we refer to our technical report for generalizing this to sampling a random sample from this population of N units. We also restricted our attention to particular types of causal quantities, namely the counterfactual mean under a stochastic intervention on the unit-specific treatment nodes (and thereby also causal contrasts). The network literature, on the other hand, has been much more focussed on particular types of direct/indirect and peer effects among others (e.g. see Bakshy et al. [26] and Airoldi et al. [27] for estimation of causal peer influence effects, and the above references). We hope to apply our framework and approach to tackle such questions as well in future research.

We refer to Aronow and Samii [28] for an inverse probability of treatment weighted approach for estimation of an average causal effect (ACE) under general interference, relying on the experimental design to generate these required generalized propensity scores. In addition, these authors also provide finite sample positively biased estimators of the true (non-identifiable) conditional variance of this IPTW-estimator, conditioning on the underlying counterfactuals, again relying on knowing the generalized propensity score. In addition, the authors consider asymptotics when one observes multiple independent samples from subpopulations, the number of subpopulations converging to infinity, each sample allowing for their general type of interference.

Their innovative approach relies on defining an exposure model that maps the treatment nodes of the N units and specified characteristics of unit i into a generalized exposure of unit i. For example, one might define this generalized exposure as the vector of exposures of the friends of unit i, beyond the exposure of unit i itself. The approach defines for each unit i the counterfactual outcome corresponding with the static intervention that sets this generalized exposure to a certain value, the same for each unit i, and then defines the counterfactual mean outcome as the expectation of the average of these unit-specific counterfactuals. It weights by the inverse of the conditional probability of this generalized exposure to obtain an unbiased estimator of this expectation of the average of these counterfactual outcomes.
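A minimal Horvitz–Thompson-style sketch of that idea, in a deliberately toy setting (the ring network, the outcome model, and the target exposure value are all hypothetical, and the generalized propensity score is known from the randomized design, as the approach above requires):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
A = rng.binomial(1, 0.5, size=N)   # independently randomized treatments

# Toy network: unit i's single friend is unit i+1 (ring). The generalized
# exposure of unit i is (own treatment, friend's treatment).
friend_A = np.roll(A, -1)
Y = 2.0 + A + 0.5 * friend_A + rng.normal(size=N)

# Target: mean outcome had every unit's generalized exposure been set to
# (A_i = 1, friend treated). Under this design, the known probability of
# realizing that exposure is 0.5 * 0.5 = 0.25 for every unit.
indicator = (A == 1) & (friend_A == 1)
pi = 0.25
psi_iptw = float(np.mean(indicator * Y / pi))
print(psi_iptw)   # should be near 2.0 + 1.0 + 0.5 = 3.5 in this toy model
```

The estimator is unbiased because the weight 1/pi exactly inverts the design probability of observing the target exposure, even though neighboring indicator terms are dependent through the shared treatments.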

Our model includes the case of observing many independent clusters of units as a special case, but by making more general conditional independence assumptions we also allow for asymptotic statistical inference when we only observe one population of interconnected units. We define causal quantities in terms of stochastic interventions on the N unit-specific exposures, allow for more general dependencies than interference, and develop highly efficient estimators that are very different from the above-mentioned IPTW-type estimator, overall making our approach distinct from that of Aronow and Samii [28].

1.2 Organization of article

The organization of this article is as follows.

Section 2: We formulate a counterfactual causal model that can be viewed as an analogue of the structural causal model actually used in this article. This section places the contribution of this article in the context of the causal inference literature that relies on the Neyman–Rubin model, demonstrating that in essence it corresponds with allowing for (statistical) dependence between the unit-specific counterfactuals indexed by interventions on the totality of the N unit-specific exposures, allowing the unit-specific counterfactuals to be affected by the treatments of other units (i.e. causal interference between the units), and allowing the treatment assigned to a unit to be informed by data on other units in the population. This section is succinct and is not necessary for understanding the remainder of the article.

Section 3: We present our structural causal model that models the data generating process for a population of interconnected units, where changes of the connections over time are themselves part of the randomness. Specifically, it represents a model for the joint distribution of the observed data on each unit i and the vector of exogenous errors for that unit's structural equations. This structural causal model allows us to define stochastic interventions on the collection of unit-specific treatment nodes, and corresponding counterfactual outcomes. The causal quantity is defined in terms of the (possibly conditional) expectation of the intervention-specific counterfactual outcomes, and it represents a parameter of the distribution of the underlying system. Subsequently, we establish identifiability of the causal quantity from the distribution of the data on the N units, commit to a statistical model for the probability distribution of the data O, and define the statistical target parameter mapping that defines the estimand, where the latter reduces to the causal quantity under the additional assumptions that were needed to establish the identifiability. The statistical estimation problem is now defined by the data, the statistical model, and the target parameter. The target parameter depends on the data distribution P only through a relevant factor of P, and we also use notation that makes this dependence explicit.

Section 4: We discuss maximum likelihood estimation (MLE), unified loss-based cross-validation [29–31], and likelihood based super-learning [32, 33] of the relevant factor of the data distribution (which implies the target parameter). The resulting smoothed/regularized maximum likelihood substitution estimators are not targeted and will thereby be overly biased w.r.t. the target parameter, and, as a consequence, generally not result in asymptotically normally distributed estimators of the statistical target parameter. Thus there is a need for targeted learning (targeting the fit toward the target parameter) instead of MLE.

Section 5: We present heuristic arguments demonstrating that the log-likelihood of O will satisfy a local asymptotic normality condition [34, 35] so that efficiency theory can be applied to pathwise differentiable target parameters of the data distribution. As demonstrated in van der Vaart [35], under local asymptotic normality the normal limit distribution of the MLE (ignoring all regularity conditions that would be needed to establish the asymptotic normality of the MLE) is optimal in the sense of the convolution theorem [34]. In this section, we demonstrate that the variance of the efficient influence curve (i.e. the canonical gradient of the pathwise derivative of the target parameter) corresponds with the asymptotic variance of a maximum likelihood estimator of the target parameter. From this, we learn that our goal should be to construct estimators that are asymptotically normally distributed with variance equal to the standardized variance of the efficient influence curve (and thus asymptotically equivalent with an MLE), while appropriately dealing with the curse of dimensionality through super-learning and TMLE [6, 20].

In the remainder of the article, we focus on the simpler single time-point data structure in which each unit's observed data consist of baseline covariates, a subsequently assigned treatment, and a final outcome of interest measured on that unit. This simplification allows us to present a TMLE in closed form and formally analyze this TMLE, while much of what we learn can be generalized to general longitudinal data structures.

Section 6: We derive the efficient influence curve, also called the canonical gradient of the pathwise derivative of the statistical target parameter [34, 35]. We also establish that the expectation of the efficient influence curve under misspecified nuisance parameters can be represented as the bias of the corresponding substitution estimand plus a second-order term involving a product of differences between the misspecified and true outcome-regression and treatment-mechanism factors. This result provides a fundamental ingredient in establishing a first-order expansion of the TMLE under conditions that make these second-order terms negligible relative to the first-order term, while a separate analysis of the first-order term (which is a sum of dependent random variables) establishes the asymptotic normality of the TMLE.

Section 7: We present the TMLE for the causal effect of a single time-point intervention on an outcome, controlling for the baseline covariates across the units. This TMLE generalizes the TMLE of the causal effect of a single time-point intervention under causal and statistical independence of the units [6, 36–39]. It is shown that the efficient influence curve satisfies a double robustness property, which implies the double robustness of the TMLE. We also present an estimator defined as a solution of the efficient influence curve based estimating equation (Robins and Rotnitzky [40]; van der Laan and Robins [3]). We propose effective schemes for implementing the TMLE.
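For orientation, the classical i.i.d. single time-point TMLE that Section 7 generalizes can be sketched as follows. Everything here is a toy illustration under i.i.d. units: the simulation, the (here correctly specified) initial fits, and the grid search for the fluctuation parameter are all simplifications, not the network TMLE of this article. The target is the mean outcome under setting treatment to 1.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

rng = np.random.default_rng(2)
n = 5000
W = rng.normal(size=n)
g1 = expit(0.4 * W)                        # treatment mechanism g(1 | W)
A = rng.binomial(1, g1)
Y = rng.binomial(1, expit(-0.5 + A + 0.3 * W))

# Step 1: initial estimates of Qbar(A, W) = E[Y | A, W] and g(1 | W).
# For illustration both are taken at their true, known forms; in practice
# they would be estimated, e.g. by super-learning.
Qbar_A = expit(-0.5 + A + 0.3 * W)
Qbar_1 = expit(-0.5 + 1.0 + 0.3 * W)

# Step 2: fluctuate the initial fit along the "clever covariate"
# H(A, W) = 1{A = 1} / g(1 | W) with a one-dimensional logistic fluctuation,
# choosing epsilon by (grid-searched) maximum likelihood.
H = A / g1

def negloglik(eps):
    p = expit(logit(Qbar_A) + eps * H)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

eps_hat = min(np.linspace(-0.5, 0.5, 2001), key=negloglik)

# Step 3: update the fit at A = 1 and plug in.
Qbar_1_star = expit(logit(Qbar_1) + eps_hat / g1)
psi_tmle = float(Qbar_1_star.mean())
print(round(psi_tmle, 3))
```

Because the initial outcome regression is correct here, the fitted fluctuation is close to zero; the targeting step matters when the initial fit is biased, which is exactly when the double robustness discussed above protects the estimator.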

Section 8: We present a theorem establishing asymptotic normality of this TMLE for the causal effect of a single time-point intervention and discuss statistical inference based on its normal limit distribution. The theorem relies on modern advances in weak convergence of processes as presented in van der Vaart and Wellner [41] and van der Vaart [35]. The proof of the theorem is deferred to the Appendix. The generalization of the formal asymptotic results for this TMLE to the TMLE for general longitudinal data structures is discussed in the Appendix of our accompanying technical report.

Section 9: We present an analogous theorem for this TMLE as an estimator of the intervention-specific mean outcome, conditional on all baseline covariates. This result avoids making any independence assumptions on the distribution of the baseline covariates, and the asymptotic variance of the TMLE is reduced.

Section 10: We conclude with a summary and some remarks.

We will address the actual implementation of the proposed TMLE and simulation studies in an article in the near future. We refer to our accompanying technical report for various additional results such as weakening of the sequential conditional independence assumption (still heavily restricting the amount of dependence, but allowing that, even conditional on the observed past, a subject can be dependent on maximally K other subjects), and only observing a random sample of the complete population of causally connected units, among others.

2 Formulation of estimation problem in terms of Neyman–Rubin model for counterfactuals

The estimation problem defined in the next section in terms of a semi-parametric structural equation model corresponds with the following counterfactual missing data problem formulation, also called the Neyman–Rubin causal model [1, 42–46].

Let the full-data structure for unit i consist of all static regimen-specific counterfactual processes, where a regimen a specifies the static treatments of all N units over time; each counterfactual process is a time-dependent process up till the final time-point, and the counterfactual of unit i depends on the joint regimen a only through a subset of its components. Note that the counterfactuals are indexed by a and not just by the treatment of unit i: we refer to Halloran and Struchiner [7], Hudgens and Halloran [8], VanderWeele et al. [9], Tchetgen Tchetgen and VanderWeele [10], and Aronow and Samii [28] for discussions of counterfactuals under interference.

Let the full-data model be the collection of possible probability distributions of this full-data structure. This full-data model will thus incorporate additional assumptions, such as that the counterfactuals of unit i only depend on the regimens of a subset of the N individuals, and conditional independence assumptions, as presented below. We view the observed data O as a missing data structure on the full-data with censoring variable A; in other words, O is a specified function of A and the full-data. We assume that the conditional density of A, given the full-data, satisfies

where the conditional distribution of each treatment node depends only on the observed past. Note that this corresponds with assuming that, at each time t, the treatment nodes of the N subjects are independent, conditional on the past of the N subjects (i.e. a sequential randomization assumption (SRA)). We remind the reader that one definition of coarsening at random [47–49] is that the conditional density of the censoring variable A, given the full-data, w.r.t. an appropriate dominating measure, only depends on the full-data through the censored data structure O. The conditioning statistics above are measurable functions of O, so the SRA indeed implies that the missingness mechanism on the full-data satisfies coarsening at random.

Due to this coarsening at random assumption, the likelihood of O factorizes into a full-data distribution factor and a joint intervention mechanism factor:

Note that the full-data distribution factor equals the likelihood of the counterfactual data at the observed regimen and is thus identified by the full-data distribution. We could model this full-data distribution factor of the likelihood as follows:

where the second equality assumes coarsening at random, the third equality assumes that the unit-specific data nodes at each time are conditionally independent given the past of all units, the fourth equality assumes that each unit's node only depends on the past through an i-specific fixed (in N) dimensional summary measure of that past, and the final equality assumes that each unit's node is drawn from a common conditional distribution. These assumptions define the full-data model. Because of these assumptions, the full-data distribution factor of the distribution of O only depends on the full-data distribution through this common conditional distribution, so that the data distribution can be parameterized by this common conditional distribution and the intervention mechanism. The statistical model is now defined accordingly, with both components unspecified beyond the specifications presented above.

Our full-data target parameter is a parameter defined on the full-data model. The factorization of the likelihood of O due to coarsening at random establishes its identifiability as a parameter of the distribution of O, under the assumption that it only depends on the full-data distribution through the identified factor. As a consequence, we can now define a statistical target parameter that equals the full-data parameter under these assumptions. We need to construct an estimator of this parameter based on a single draw of O, and we need to establish a limit distribution of the standardized estimator as N converges to infinity (e.g. a normal limit distribution).

Consider a user-supplied conditional distribution of the treatment variable, given the full-data, satisfying coarsening at random so that it only depends on the full-data through the observed past. We refer to this choice as a stochastic intervention, which can be used to define a modified version of the data distribution P by replacing the treatment mechanism g by this intervention distribution, resulting in the probability distribution

The latter distribution is the so-called G-computation formula for the post-intervention distribution of the data under the stochastic intervention [45] and is a parameter of P. Under the causal model, including the SRA and a positivity assumption, this G-computation distribution equals the post-intervention distribution of the counterfactual random variable obtained by first drawing the counterfactuals, then drawing a treatment from the stochastic intervention, and reporting the corresponding counterfactual data. A possible statistical target parameter is now given by the mean outcome under this post-intervention distribution, as addressed in this article, which equals the full-data parameter under the causal model.
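In the point-treatment case the G-computation formula under a stochastic intervention reduces to averaging the outcome regression over the intervention distribution of the treatment and the covariate distribution. A Monte Carlo sketch, with an entirely hypothetical outcome regression and intervention (neither is from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def Qbar(a, w):
    # Hypothetical outcome regression E[Y | A = a, W = w], assumed known
    # here purely for illustration.
    return 1.0 / (1.0 + np.exp(-(-0.5 + a + 0.3 * w)))

def g_star(w):
    # Hypothetical stochastic intervention: treat each unit with
    # probability 0.7, regardless of the covariates.
    return np.full_like(w, 0.7)

# G-computation: psi = E_W [ sum_a g*(a | W) * Qbar(a, W) ],
# approximated by Monte Carlo draws from the covariate distribution.
W = rng.normal(size=100_000)
p1 = g_star(W)
psi = float(np.mean(p1 * Qbar(1, W) + (1 - p1) * Qbar(0, W)))
print(round(psi, 3))
```

Replacing g_star by a degenerate distribution at a = 1 recovers the familiar static-intervention G-computation formula as a special case.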

The fact that the counterfactual outcome of subject i can be a function of the treatments of other subjects is referred to as interference in the causal inference literature. In addition, the above formulation allows that treatment allocation for unit i depends on data collected on other units. The above formulation also allows dependence between the counterfactuals of different units. It can thus be viewed as the causal inference estimation problem when interference, adaptive treatment allocation, and dependence between the counterfactuals of different units are allowed. Our structural equation model defined in the next section implies such restrictions on the distribution of the counterfactuals and defines this same particular full-data model.

3 Formulation of estimation problem using a structural causal model

For a unit i, consider a time-ordered longitudinal observed data structure consisting of baseline covariates; a process of action/treatment/exposure nodes at times t, which will play the role of intervention nodes in the structural equation model below; time-dependent measurements on unit i, possibly including an outcome process; and a final outcome, realized after the final intervention node. The set of friends from whom individual i may receive input at time t is a component of the time-dependent covariate process and is thus observed.

If we define , and similarly we define , then the observed data can be represented by a single time-ordered data structure

The latter ordering is the only causally relevant ordering, and the ordering of units within a time-point is user supplied but inconsequential. We define and , as the parents of and , respectively, w.r.t. this ordering. The parents of , denoted with , are defined to be equal to , and the parents of , denoted with , are also defined to be equal to , , .

In order to define causal quantities, we assume that O is generated by a structural equation model of the following type: first generate a collection of exogenous errors across the N units, where the exogenous errors for unit i are given by

and then generate O deterministically by evaluating functions as follows:

These functions , are unspecified at this point, but will be subjected to modeling below.

Since and , an alternative succinct way to represent this structural equation model is

Recall that the set of friends, , is a component of and is thus also a random variable defined by this structural equation model, although we decided to condition on in our formal theorems for the point-treatment data structure in our later sections, representing the case .

Counterfactuals and stochastic interventions: This structural equation model for

allows us to define counterfactuals corresponding with a dynamic intervention d on A [46, 50–53]. For example, one could define at time t as a particular deterministic function of the parents of subject . Such an intervention corresponds with replacing the equations for by this deterministic equation , . More generally, we can replace the equations for that describe a degenerate distribution for drawing , given , and , by a user-supplied conditional distribution of an , given . Such a conditional distribution defines a so-called stochastic intervention; see Dawid and Didelez [54], Didelez et al. [55], and Diaz and van der Laan [56].

Let denote our selection of a stochastic intervention identified by a set of conditional distributions of , given , . For convenience, we represent the stochastic intervention with equations in terms of random errors . This implies the following modified system of structural equations:

where is the same set of variables as , but with replaced by . Let , or, in short-hand, , denote the corresponding counterfactual outcome for unit i. A causal effect at the unit level could now be defined as a contrast such as for two interventions and . Note that, for a given , is a deterministic function of the error-terms that are input into the structural equations.

Post-intervention distribution, and SRA: We assume the SRA on U,

(1)

and . Then, the probability distribution of is given by the so-called G-computation formula [45, 52, 53, 55, 57]

where is the conditional distribution of , given , and . We will denote the distribution of with . Thus, under this SRA, the post-intervention distribution is identified from the observed data distribution of O generated by the structural equation model. The distribution of corresponds now with a marginal distribution of .

ACE: One might now define an ACE as the following target parameter of this distribution of :

Let , so that we can also write this causal effect as . Since the distribution is indexed by N, the parameter depends on N. In particular, the effect of a stochastic intervention on a population of N interconnected units will naturally depend on the size N of that population and on the network information F: i.e. adding a unit will change the dynamics. As we will do in our point-treatment sections, one might decide to replace these marginal expectations by conditional expectations, conditioning on , , or even on . We will focus on the causal quantity for a user-supplied stochastic intervention; our results naturally generalize to causal quantities that are Euclidean-valued functions of a collection of such intervention-specific means.

Iterative conditional expectation representation of ACE: The parameter can be represented as an iterative conditional expectation w.r.t. the probability distribution of [58, 59]:

where . Thus, this mapping involves iteratively integrating w.r.t. the observed data distribution of , given its parents, and the conditional intervention distribution of , given , respectively, starting at , till .
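As an illustration of this iterative conditional expectation mapping, the sketch below computes the intervention-specific mean for a hypothetical two time-point structure (W, A(0), L(1), A(1), Y) with binary variables, integrating backward: first over A(1) w.r.t. g*, then over L(1) w.r.t. its conditional distribution, then over A(0) w.r.t. g*, and finally over W. All numerical inputs are made up.

```python
# Illustrative two time-point backward induction; all numbers invented.
pW = {0: 0.5, 1: 0.5}

def pL1(l1, w, a0):                    # P(L(1)=l1 | W=w, A(0)=a0)
    p1 = 0.2 + 0.3 * a0 + 0.2 * w
    return p1 if l1 == 1 else 1.0 - p1

def qbar2(w, a0, l1, a1):              # E[Y | W, A(0), L(1), A(1)]
    return 0.1 + 0.2 * a0 + 0.25 * a1 + 0.1 * l1 + 0.05 * w

def g_star(a):                         # intervention g*: treat with prob 0.8
    return 0.8 if a == 1 else 0.2

# Step 1: integrate out A(1) w.r.t. g*.
def qbar_l1(w, a0, l1):
    return sum(g_star(a1) * qbar2(w, a0, l1, a1) for a1 in (0, 1))

# Step 2: integrate out L(1) w.r.t. its conditional distribution.
def qbar_a0(w, a0):
    return sum(pL1(l1, w, a0) * qbar_l1(w, a0, l1) for l1 in (0, 1))

# Step 3: integrate out A(0) w.r.t. g*.
def qbar_w(w):
    return sum(g_star(a0) * qbar_a0(w, a0) for a0 in (0, 1))

# Step 4: marginal expectation over W.
psi = sum(pW[w] * qbar_w(w) for w in (0, 1))
print(round(psi, 4))  # → 0.539
```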

Dimension reduction and exchangeability assumptions: The above-stated identifiability of is of little interest by itself, since we cannot estimate the distribution of O based on a single observation. Therefore, we will need to make much more stringent assumptions that allow us to learn the distribution of O from a single draw. One could make such assumptions directly on the distribution of O, but below we present them as assumptions on the structural equations and exogenous errors.

Beyond the assumptions above, we will also assume that for each node and , we can define known functions, and , that map into a Euclidean set with a dimension that does not depend on N, and corresponding common (in i) functions , so that

(2)

(As mentioned above, an interesting variation of this structural causal model treats as given and thus removes that data generating equation.) Examples of such dimension reductions are , i.e. the observed past of unit i itself and the observed past of its current friends, and, similarly, we can define . By augmenting these reductions to data on maximally K friends, filling up the empty cells for units with fewer than K friends with a missing value, these dimension reductions have a fixed dimension and include the information on the number of friends. This structural equation model assumes that, across all units i, the data on unit i at the next time-point t is a common function of its own past and past of its friends. In our formal asymptotic results for the TMLE based on the point-treatment data structure , we assume this particular type of summary measure of maximally K friends in order to enforce enough independence to establish an asymptotic normal limit distribution, but the sequel and the TMLE are defined for any summary measure, and in future work we hope to address the analysis of the TMLE for more general summary measures.
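The padding construction described above can be sketched as follows; `friend_summary` is a hypothetical helper, not notation from the paper. It maps a unit's own value and the values of up to K friends into a vector of fixed dimension, filling empty slots with a missing-value code and appending the number of friends.

```python
import math

K = 2  # assumed maximal number of friends entering the summary

def friend_summary(i, friends, x, fill=float("nan")):
    """Fixed-dimension reduction: own value, values of up to K friends
    (padded with a missing-value code), and the friend count."""
    vals = [x[j] for j in friends[i]][:K]
    vals += [fill] * (K - len(vals))          # pad units with < K friends
    return [x[i]] + vals + [len(friends[i])]

friends = {0: [1, 2], 1: [0], 2: []}
x = {0: 1.5, 1: -0.3, 2: 2.0}
print(friend_summary(0, friends, x))  # [1.5, -0.3, 2.0, 2]
```

Every unit's summary has the same length K + 2, regardless of how many friends it has, which is what makes a common (in i) function of the summary well defined.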

Independence assumptions on exogenous errors: Beyond the SRA (1), we make the following (conditional) independence assumptions on the exogenous errors. Firstly, we assume independence assumptions on (and thereby , ) such as that , , are independent (so that , , are independent), or that is independent of if . We will estimate the joint distribution of with the empirical counterpart that puts mass 1 on the actual observed ; the resulting empirical expectation w.r.t. this empirical distribution in our estimator, i.e. in the iterative algorithm above, then has to converge to a normal distribution. The key assumption for this convergence in distribution is that depends on at most K for a universal K. So we will assume, at a minimum, a model on the distribution of that encodes the latter.

In addition, for all , conditional on , , are independent and identically distributed, and for all , conditional on , , , are independent and identically distributed. The important implication of the latter assumptions is that, given the observed past , for any two units i and j that have the same value for their summaries as functions of , we have that and are independent and identically distributed, and similarly, we have this statement for the treatment nodes. This allows us to factorize the likelihood of the observed data as done below, parameterized by common conditional distributions and that can actually be learned from a single (but growing) O when .

Identifiability: G-computation formula for stochastic intervention. For notational convenience, let , and let be defined accordingly with replaced by . Due to the exchangeability and dimension reduction assumptions, the probability distribution of now simplifies:

(3)

where are the above-defined conditional distributions of , given , where, by our assumptions, these i-specific conditional densities are constant in , as functions of , . We will also use the notation for the conditional distribution of , given , which is thus parameterized in terms of . Similarly, we use the notation or to denote the conditional distribution of , given , which is thus parameterized in terms of . We introduced the notation for the right-hand side in eq. (3), which thus represents an expression in terms of the distribution of the data under the assumption that the conditional densities of , given , are constant in i as functions of , indexed by the choice of stochastic intervention , while one needs the causal model and randomization assumption in order for the right-hand side to actually model the counterfactual post-intervention distribution . This shows that for a mapping from the distribution of O to the real line. Strictly speaking, this does not yet establish the desired identifiability result, since we cannot learn based on a single draw O. To start with, we need to realize that , , and are indexed by N, and we only observe one draw from . Therefore, we still need to show that we can construct an estimator based on a single draw that is consistent for as . For that purpose, we note that the distribution is identified by the common conditional distributions , , and with . We can construct estimators of these common conditional distributions, based on MLE, that are consistent as , which follows from our presentation of estimators and theory. This demonstrates the identifiability of as , . In addition, our target parameter involves an average w.r.t. , which can be consistently estimated by its empirical counterpart under our independence assumptions, as discussed above. This demonstrates the desired identifiability of from the observed data as .

Likelihood and statistical model: Let denote the distribution of . By our assumptions, the likelihood of the data

is given by:

(4)

We denoted the factors representing the conditional distributions of with , where these conditional densities at , given , are constant in i, as functions of and . Similarly, we modeled the g-factor in terms of common conditional distributions . Let represent the collection of all these factors, and , so that the distribution of O is parameterized by . The conditional distributions are unspecified functions of and , beyond that for each value of it is a conditional density, and satisfies a particular independence model discussed above. Similarly, the conditional distributions are unspecified conditional densities. This defines now a statistical parameterization of the distribution of O in terms of , and a corresponding statistical model

(5)

where and denote the parameter spaces for Q and g, respectively. Note that we derived the same likelihood and statistical model based on the Neyman–Rubin model in Section 2: instead of making assumptions on the structural equation model, we assumed coarsening at random, and made assumptions on the full-data distribution factor of the likelihood.

Statistical target parameter: Let denote a random variable with distribution (eq. 3), defined as a function of the data distribution P of O. We define our statistical target parameter as which is a function of the intervention-specific distribution , so that it equals the causal quantity under the above-stated causal assumptions. Thus

(6)

depends on the distribution P of the data O through . Note that Q is determined by the distribution of , and the conditional distributions of , given , which, by assumption, equal a common function , . As shown above, we can represent this statistical target parameter also as an iterative conditional expectation involving the iterative integration w.r.t. , , starting at and moving backward till the expectation over :

This representation allows the effective evaluation of : one first evaluates a conditional expectation w.r.t. the conditional distribution of , and thus w.r.t. ; then the conditional mean of the previous conditional expectation w.r.t. the conditional distribution of ; one iterates this process of taking conditional expectations w.r.t. and until one ends up with a conditional expectation over , given ; and finally one takes the marginal expectation w.r.t. the distribution of . Note that each conditional expectation involves an expectation over the vector or w.r.t. a product measure of common conditional distributions or , .

One can also define an -conditional statistical target parameter as , which can still be effectively evaluated by the iterative conditional expectations presented above, but one simply removes the final integration over the distribution of .

Statistical estimation problem: We have now defined a statistical model (eq. 5) for the distribution (eq. 4) of O, and a statistical target parameter mapping (eq. 6) for which only depends on Q. We will also denote this target parameter with , with some abuse of notation by letting represent these two mappings. Given a single draw , we want to estimate . In addition, we want to construct an asymptotically valid confidence interval. Recall that our notation suppressed the dependence on N and F of the data distribution , statistical model , and target parameter . In the conditional model for the conditional distribution of O, given , we will make the dependence on of the data distribution , , and explicit.

Summary: So we defined a structural causal model (eq. 2), including the stated independence (and i.i.d.) assumptions on the exogenous errors, the dimension reduction assumptions, and the SRA (eq. 1). This resulted in the likelihood (eq. 4) and corresponding statistical model (eq. 5) for the distribution of O. In addition, these assumptions allowed us to write the causal quantity as a statistical estimand (eq. 6): , where can be learned from a single draw O as . The purely statistical estimation problem is now defined: , and we want to learn where . Under the non-testable causal assumptions, beyond the statistical assumption , we can interpret as , but, even without these non-testable assumptions, one might interpret (and its contrasts) purely statistically as an effect measure of interest controlling for the observed confounders.

3.1 Example

In order to give the reader a sense of the types of applications that can be addressed with our modeling approach, we present a few examples.

Consider a study in which we wish to evaluate the effect of starting HIV treatment early after HIV-infection on the rate of HIV-infection in the population of interest. For that purpose, the study tracks the cohort of individuals for 5 years; for each individual one obtains baseline characteristics , one regularly tests for HIV-infection (), one measures when the individual starts treatment and whether the person was lost to follow-up (), one regularly measures biomarkers and other time-dependent characteristics of interest such as condom use (), and one regularly measures the set of sexual partners. Let , where the th time-point represents the end-point 5 years after baseline. Suppose one is interested in the effect of early HIV treatment () on the proportion of HIV-infections at 5 years. One knows that an HIV-infected person who is being treated is much less infectious than an untreated HIV-infected person, so that early treatment might have a strong beneficial effect on the spread of HIV-infection. One might be interested in estimating the mean outcome under a stochastic intervention on , . For example, the stochastic intervention deterministically starts HIV-treatment after the first observed HIV-infection, and it enforces no right-censoring. This would be an example of a deterministic dynamic intervention. In our model, we may assume that the conditional distributions of , , and , given the past on all individuals, only depend on the individual pasts of the sexual partners of subject i, beyond the past of subject i itself. In particular, it is clear that the HIV-infection status at time t for individual i is very much a function of the treatment status of its sexual partners.

A simplified version of this example is the case that we only observe on individual i baseline covariates (including baseline HIV-infection status), treatment status , and subsequent HIV-infection , for the N individuals. One might now assume that the treatment status of individual i is not only a function of its own baseline characteristics, but also of the baseline characteristics of its sexual partners, and that its outcome status is a function of the baseline characteristics and treatment status of its friends as well as its own.
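A toy data-generating process for this simplified example might look as follows; all coefficients and the network construction are invented for illustration. Treatment for unit i depends on its own and its partners' baseline covariates (adaptive treatment allocation), and its outcome depends on the treatment status of its partners (interference).

```python
import random

random.seed(0)
N, K = 200, 2
# Each unit gets K "sexual partner" links (toy network, not from the paper).
friends = {i: random.sample([j for j in range(N) if j != i], K)
           for i in range(N)}

def simulate():
    # W: baseline HIV-infection indicator; A: early-treatment indicator;
    # Y: infection at follow-up. All coefficients are made up.
    W = {i: random.random() < 0.3 for i in range(N)}
    A, Y = {}, {}
    for i in range(N):   # treatment depends on own and partners' baselines
        frac = sum(W[j] for j in friends[i]) / K
        A[i] = random.random() < 0.2 + 0.5 * W[i] + 0.2 * frac
    for i in range(N):   # risk driven by untreated infected partners
        frac_bad = sum(W[j] and not A[j] for j in friends[i]) / K
        p = 0.05 + 0.3 * W[i] * (1 - 0.7 * A[i]) + 0.25 * frac_bad
        Y[i] = random.random() < min(p, 1.0)
    return W, A, Y

W, A, Y = simulate()
```

A stochastic intervention would replace the treatment-assignment line by a user-supplied rule, after which the infection rate under that rule could be compared with the observed one.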

Similarly, the treatment node could be defined as the indicator of condom use, so that the counterfactual mean outcome evaluates the effect of condom use on the spread of the HIV-epidemic. One could also think about interventions on itself, such as interventions that decrease the number of sexual partners. This corresponds with specifying a conditional distribution of , given the past, at each time t, where such a conditional distribution might be a part of the actual distribution of the set , given the past.

It is also of interest to note that a stochastic intervention could target only a random subset of the total set of intervention nodes, , by focussing on a subset of individuals and a subset of the time-points. That is, a stochastic intervention could equal the actual treatment mechanism that generated the at certain times and for certain individuals, while enforcing an intervention elsewhere. For example, resources might allow one to carry out only a limited number of interventions, and one wishes to evaluate different strategies for selecting the nodes at which the intervention will be enforced.

Another example of interest is one in which taking an anti-depressant drug or receiving some other intervention is the treatment node, and a depression score at a final time-point is the outcome of interest. Consider a group of individuals that are socially connected and for which a reasonable proportion is subjected to this intervention or drug-treatment. One might expect that the drug/intervention node of the friends of individual i (indirectly) affects the psychological health, and thereby the outcome, of individual i, so that this would be an example of causal interference. In addition, one expects that the drug/intervention node of individual i is affected by the drug/intervention nodes and other features of its friends, so that this is also an example of adaptive treatment allocation (i.e. treatment for individual i is affected by the past of the friends of individual i, beyond the past of individual i itself). Thus, this is an example where one naturally needs to allow that both the treatment node and the outcome node of an individual are affected by the observed past of its friends. Clearly, the causal effect of different stochastic interventions on the anti-depressant drug/intervention nodes for the individuals in the population will include the peer effects.

4 Maximum likelihood estimation, cross-validation, super-learning, and targeted maximum likelihood estimation

We could estimate the distribution of with the empirical distribution that puts mass 1 on . This choice also corresponds with a TMLE of the intervention-specific mean outcome that conditions on , as we formally show in our later sections for the single time-point data structure. If it is assumed that are independent, then we estimate the distribution of with the NPMLE that maximizes the log-likelihood over all possible distributions of that the statistical model allows. In particular, if it is known that are i.i.d., then we would estimate the common distribution of with the empirical distribution that puts mass on , .

Regarding estimation of for , we consider the log-likelihood loss function for :

Note that is minimized in by the true , since, conditional on , the true distribution of is given by , . In addition, this expectation is well approximated by , since, conditional on , this is a sum of independent random variables , . The latter allows us to prove convergence of the empirical mean process to the true mean process uniformly over large parameter spaces for , using techniques similar to those used in the Appendix, based on the weak-convergence theory in van der Vaart and Wellner [41]. As a consequence, one could pose a parametric model for , say , and use the standard MLE

as if the observations , , are independent and identically distributed and we are targeting this common conditional density of given . More importantly, we can use loss-based cross-validation and super-learning to fit this function of , thereby allowing for adaptive estimation of . Specifically, consider a collection of candidate estimators that map a data set into an estimate, , and let denote the empirical distribution that puts mass onto each . Given a random split vector , define and as the empirical distributions of the validation sample and training sample , respectively. We can now define the cross-validation selector of k as

If is continuous, one could code in terms of binary variables across the different levels l of , and model the conditional distribution/hazard of , given and , as a function of and l, as in van der Laan [60, 61]. One could now construct candidate estimators of this conditional hazard, possibly smoothing in the level l, by utilizing estimators of predictors of binary variables from the machine learning literature, including standard logistic regression software for fitting parametric models. Similarly, this can be extended to multivariate by first factorizing the conditional distribution of into univariate conditional distributions. In this manner, one then obtains candidate estimators of based on a large variety of algorithms from the literature.
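The cross-validation selector described above can be sketched for a binary outcome as follows; the two candidate estimators and all tuning choices are hypothetical. Each candidate maps a training sample into a fitted conditional probability, and the selector returns the index of the candidate minimizing V-fold cross-validated log-likelihood loss.

```python
import math, random

random.seed(1)

# Toy data: (x, y) with y ~ Bernoulli(0.3 + 0.4 x); we select among two
# hypothetical candidate estimators of P(y=1 | x).
data = [(x, int(random.random() < 0.3 + 0.4 * x))
        for x in (random.randint(0, 1) for _ in range(100))]

def fit_constant(train):               # ignores x (lightly smoothed)
    p = (sum(y for _, y in train) + 0.5) / (len(train) + 1.0)
    return lambda x: p

def fit_by_x(train):                   # stratifies on x (lightly smoothed)
    ps = {v: (sum(y for x, y in train if x == v) + 0.5)
             / (sum(1 for x, _ in train if x == v) + 1.0) for v in (0, 1)}
    return lambda x: ps[x]

def log_loss(est, valid):              # minus average log-likelihood
    return -sum(y * math.log(est(x)) + (1 - y) * math.log(1.0 - est(x))
                for x, y in valid) / len(valid)

def cv_select(data, candidates, V=5):
    """Index of the candidate minimizing V-fold cross-validated log-loss."""
    folds = [data[v::V] for v in range(V)]
    risks = []
    for fit in candidates:
        risk = 0.0
        for v in range(V):
            train = [o for u in range(V) if u != v for o in folds[u]]
            risk += log_loss(fit(train), folds[v])
        risks.append(risk / V)
    return min(range(len(candidates)), key=risks.__getitem__)

k = cv_select(data, [fit_constant, fit_by_x])
```

In a super-learner, the candidate library would contain many such fits, and the selector (or a convex combination of candidates) would be chosen by the same cross-validated risk.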

We could fit each separately for , but it is also possible to pool across t by constructing estimators and using cross-validation based on the sum loss function

Similarly, we can use the log-likelihood loss function for :

and use loss-based cross-validation and super-learning to fit , possibly pooling across time based on the sum loss function

Given the resulting estimator of , one can evaluate as an estimator of , according to the iterative conditional expectation mapping presented earlier. Since is optimized to fit (i.e. trading off bias and variance w.r.t. , not ), such a data-adaptive plug-in estimator, although it inherits the (e.g. minimax adaptive) rate of convergence at which converges to , is overly biased for , so that will generally not converge to at rate .

TMLE: TMLE involves modifying an initial estimator into a targeted version , , through utilization of an estimator of , a least-favorable submodel (w.r.t. the target parameter ) through a current fit at , fitting for each t and each step k with standard MLE , and iteratively updating , , until convergence in . The resulting TMLE of is defined accordingly as the substitution estimator . Thus, a TMLE also involves estimation of the intervention mechanism . To define such a TMLE, we need to determine the efficient influence curve of the statistical target parameter, which implies these least-favorable submodels. We refer to our technical report van der Laan [62] for a derivation of the efficient influence curve, a study of its robustness, and a detailed presentation of this general TMLE. (In the next section, we also show the formula for this efficient influence curve.) In this article, we will instead focus on the single time-point longitudinal data structure with and present a complete self-contained analysis of the TMLE.
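For orientation, the following sketch carries out one TMLE targeting step for the mean outcome under a stochastic intervention in a single time-point setting with i.i.d. units; it ignores the network dependence and summary measures that the paper's TMLE handles, and it uses the true outcome regression and treatment mechanism as initial fits purely to keep the example short. The clever covariate H = g*/g and the logistic fluctuation are standard TMLE ingredients; names such as `fluctuate` are ours.

```python
import math, random

random.seed(2)
expit = lambda z: 1.0 / (1.0 + math.exp(-z))
logit = lambda p: math.log(p / (1.0 - p))

# Simulated i.i.d. point-treatment data (W, A, Y); all coefficients made up.
n = 500
W = [random.random() for _ in range(n)]
g1 = lambda w: 0.2 + 0.6 * w                          # P(A=1 | W=w)
A = [int(random.random() < g1(w)) for w in W]
qbar = lambda w, a: expit(-1.0 + 1.2 * a + 0.8 * w)   # E[Y | W, A]
Y = [int(random.random() < qbar(w, a)) for w, a in zip(W, A)]

g_star = lambda a, w: 0.9 if a == 1 else 0.1          # stochastic intervention

def H(a, w):                                          # clever covariate g*/g
    return g_star(a, w) / (g1(w) if a == 1 else 1.0 - g1(w))

def fluctuate():
    """MLE of eps in logit Qbar_eps = logit Qbar + eps * H (Newton steps)."""
    eps = 0.0
    for _ in range(25):
        score, hess = 0.0, 0.0
        for w, a, y in zip(W, A, Y):
            h = H(a, w)
            p = expit(logit(qbar(w, a)) + eps * h)
            score += h * (y - p)
            hess -= h * h * p * (1.0 - p)
        eps -= score / hess
    return eps

eps = fluctuate()
# Targeted substitution estimator: average the updated Qbar over W and g*.
psi = sum(g_star(a, w) * expit(logit(qbar(w, a)) + eps * H(a, w))
          for w in W for a in (0, 1)) / n
```

By construction, the updated fit solves the (i.i.d.) efficient influence curve equation in eps, which is the targeting property that the general network TMLE mimics with its least-favorable submodels.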

5 Characterizing the optimal asymptotic variance of the MLE in terms of efficient influence curve

Due to our sequential conditional independence assumption, the log-likelihood of O, i.e. the log of the data-density (eq. 4) of O, can be represented as a double sum over time-points t and units i, and for each t, the sum over i consists of independent random variables, conditional on the past. As a consequence, under regularity conditions, one can show that the log-likelihood is asymptotically normally distributed. Therefore, we conjecture that we can establish so-called local asymptotic normality of our statistical model, which involves establishing asymptotic normality of the log-likelihood under sampling from fluctuations/submodels of a fixed data distribution P across all possible fluctuations. As shown in van der Vaart [35], for models satisfying the local asymptotic normality condition, the normal limit distribution of an MLE is an optimal limit distribution based on the convolution theorem [34]. In this section, we informally demonstrate the importance of the efficient influence curve as the random variable whose variance characterizes the normal limit distribution of an MLE of the target parameter in our semi-parametric model for , and thereby characterizes the normal limit distribution of optimal estimators. As part of this we use a template for establishing the normal limit distribution of the MLE, which can equally well be applied to the TMLE.

Even though it is well known that a regular estimator based on a sample of n i.i.d. observations is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve, here we are not interested in asymptotics when we observe n of our data structures indexed by this parameter N (like observing an i.i.d. sample , where each describes the data on N causally connected units); rather, we are interested in the asymptotics in N based on a single draw of O. Therefore, we think it is important to point out the asymptotic behavior of the MLE based on such a single draw as , showing that the asymptotic variance of the MLE is still characterized by the efficient influence curve. The lesson is that our goal should still be to construct an estimator that is asymptotically normally distributed with variance equal to the variance of the efficient influence curve, appropriately normalized, and our proposed TMLE achieves this goal by using least-favorable submodels whose scores span the efficient influence curve.

Specifically, we show that, under appropriate regularity conditions required for an MLE to be valid (i.e. all observables are discrete, so that the MLE is well defined asymptotically), the asymptotic variance of a standardized MLE of the target parameter equals the limit in N of , where is the variance of the efficient influence curve . The formal analysis of an MLE requires understanding of an empirical process (specified below) uniformly in Q, which is challenging due to the fact that, contrary to , at misspecified Q, the time-specific components of cannot be represented as sums of independent random variables, conditional on the history at that time. Since the TMLE is tailored to deal with the curse of dimensionality (and the MLE is a special case of the TMLE, obtained by defining the initial estimator for the TMLE as the MLE, assuming this MLE is a well-defined estimator), while a regularized MLE will not be asymptotically normally distributed when the observables are continuous valued, the analysis of a TMLE is the more important one. Such a formal analysis is presented for the point-treatment case in a later section, and much can be learned from that analysis for the purpose of analyzing the TMLE or MLE for general K. Nonetheless, the template below can be used to establish asymptotic normality for both the MLE and the TMLE, under the assumption that the initial estimator is consistent for .

Let be an MLE, assuming it is well defined for N large enough (i.e. all covariates are discrete). We wish to analyze the plug-in MLE of . We can represent the efficient influence curve as for some parameter , as shown in our technical report. In our accompanying technical report we show that , where is a second-order term defined as a sum of two terms and . The first involves squared differences of , while the second involves the product of differences and . We will assume that , which basically corresponds with assuming that the relevant parts of are estimated by at a rate faster than . Since is an MLE, and is a score at , we have that the MLE solves the efficient influence curve equation

We also have for any h, as explicitly shown in our technical report. This allows us to establish a first-order expansion of the standardized MLE:

Thus under the assumption that , it follows that the asymptotic distribution of equals the limit distribution of

A non-trivial analysis as carried out for the case , and using appropriate conditions, can be used to establish that , so that behaves as . Under these assumptions, it then remains to investigate weak convergence of as N converges to infinity.

In our technical report, we establish the following representation of the efficient influence curve:

where

, , and . Here, we assumed that , , are independent. Thus, we can represent the efficient influence curve as

where we defined

Note that has conditional mean zero, given . In order to claim that has finite variance one needs that the summation over l reduces essentially to a finite sum due to being conditionally independent of , given , for most m.

This yields the following representation (suppressing the dependence of on ):

where is a function of and with conditional mean zero, given . Due to factorization of the likelihood in terms of and that is a score of , it follows that is an orthogonal sum over in , so that the variance of is given by

We have

Thus, the asymptotic variance of is given by limit of

where , and it can be expected that converges to a fixed function as . Here we defined

Note that we also have that equals N times the variance of the efficient influence curve for the target parameter at . This demonstrates that the asymptotic variance of (and thus of the standardized MLE) is given by .

This does not yet demonstrate the asymptotic normality of the MLE. For that purpose, we note that , where , with , is a sum of independent random variables , conditional on . As a consequence of the latter, it follows that for (i.e. just condition on , making fixed, and use that ). Using CLTs, we can therefore establish that, for each , converges weakly to a normal distribution . Under weak regularity conditions, this also implies that for , and thus that these t-specific limit normally distributed random variables are pairwise independent. As a consequence, the sum across t converges to a normal distribution with variance equal to the sum of the t-specific variances, and thus as defined above. To conclude, under appropriate regularity conditions, we will have that converges weakly to

This demonstrates that the efficient influence curve characterizes the limit distribution of the maximum likelihood estimator, and thus indeed characterizes an asymptotically optimal mean zero normal limit distribution with variance equal to the asymptotic variance of the “efficient influence curve empirical process” .

6 The TMLE of causal effect of single time-point intervention

We will present the TMLE for the point-treatment intervention case (i.e. ). This case is of great interest in itself, extends estimation of a causal effect of a single time-point intervention to dependent data of the form studied in this article, and thereby covers important applications. In the next section, we will formally analyze this TMLE; the tools of the proof will generalize to the general case. In addition, the single time-point case allows for a TMLE that is actually double robust, in the sense that it remains consistent if either or is consistently estimated, while the efficient influence curve for the general case with appears not to satisfy such a double robustness result, as is evident from the efficient influence curve representation provided in our technical report van der Laan [62].

6.1 Structural equation model

Using notation for the baseline covariate , and for , the structural equation model for the case now reduces to

where the fixed-dimensional summary measures and are determined by and with , respectively. We assume throughout that A is discrete-valued, so that conditional densities of A, given W, are just conditional probability distributions: this is by no means a necessary condition, but simplifies presentation. The “friends” of subject i may be included in : . The function includes , beyond summary measures of , and might be defined as , assuming that for some fixed K, so that can indeed be defined as a fixed-dimensional multivariate function not depending on N. Similarly, the function includes , beyond summary measures of , and might be defined as . We also use the shorthand notation and . The above structural equation model assumes that and are the same function of this dimension reduction and , respectively, for each i, so that two units with the same number of friends who have the same individual covariate and treatment values, and also have the same values for the covariates and treatments of their friends, will be subjected to the same conditional distribution for drawing their treatment and outcome. In our asymptotics theorem in the next section, we treat , , as fixed, so that the probability distribution of O and the target parameter are also indexed by the fixed value of .

In addition, we assume that conditional on W, (1) , , are i.i.d. and (2) for each i, is independent of . In one model, we assume that , , are i.i.d.: note that (since is allowed to be different for each i) this corresponds to assuming that are independent, but not necessarily identically distributed. We will also highlight the case in which this latter assumption is considerably weakened; this will be made explicit in our theorem. These independence assumptions on the ’s imply that (1) are independent (or, more generally, that their dependence is weak enough), (2) conditional on , are independent, and (3) conditional on , are independent. Thus, all the dependence between units is explained by the observed pasts of the units themselves and of their friends.

Causal quantity: Let be a user-supplied conditional distribution of A, given W, and let us denote the random variable with this distribution with . For simplicity, let us assume that under this are conditionally independent, given W, and that for a common conditional density and summary measure . Our goal is to estimate the mean of the counterfactual outcome of under the stochastic intervention . Let be the counterfactual indexed by a stochastic intervention on A and . The causal quantity of interest is defined as , which is a parameter of the distribution of modeled by the above structural equation model. In this expectation defining , we actually condition on the vector of sets of friends.
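For intuition, drawing the counterfactual treatment vector under a known stochastic intervention can be sketched as follows (a minimal hypothetical example in which the intervention assigns treatment with a probability depending only on the unit's own covariate, and the draws are conditionally independent given W, as assumed above):

```python
import random

def draw_stochastic_intervention(W, g_star, rng):
    """Draw counterfactual treatments A*_1, ..., A*_N independently, given W,
    from a user-supplied conditional probability g_star(w) = P(A* = 1 | W = w).
    Hypothetical sketch: the summary measure is just the unit's own covariate."""
    return [1 if rng.random() < g_star(w) else 0 for w in W]

# Example intervention: treat with probability 0.8 when W_i > 0, else 0.2.
rng = random.Random(0)
W = [-1.0, 0.5, 2.0, -0.3]
A_star = draw_stochastic_intervention(W, lambda w: 0.8 if w > 0 else 0.2, rng)
```

The counterfactual mean under the intervention would then be approximated by averaging outcomes generated (or predicted) under many such draws.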

Identifiability from observed data distribution: We observe , where . Due to the above assumptions, the probability distribution of O is given by:

(7)

where is a common (in i) density for for each value , and is a common density for for each value . Our model also implies a model on the distribution of W, such as the model that assumes that all are independent.

Since our assumptions imply the randomization assumption stating that is independent of , given , the post-intervention probability distribution of is identified by the following G-computation formula applied to the probability distribution P of O:

(8)

where is defined as the conditional distribution of , given with A in the parents replaced by . We denoted the probability distribution of on the right-hand side with , which is thus always defined as a parameter of the data distribution P of O for a P in the statistical model for implied by our causal model for the underlying distribution . The random variable with distribution is denoted with .

Statistical model, statistical target parameter, and statistical estimation problem: Let be the statistical model for the data distribution P of O defined by (eq. 7) in which for a specified model , and the common for some model , while is unspecified. Thus, the density of O factorizes in three factors:

where , is unspecified, and . This defines the statistical model .

Let the statistical target parameter mapping be defined as . Under the stated causal model and identifiability assumptions under which , we have , so that in that case can be interpreted as the desired causal quantity. Our goal is to construct an estimator of based on , which defines the statistical estimation problem.

Let be the conditional mean under . Note that . The target parameter only depends on P through , and :

(9)

where denotes integration w.r.t. measure implied by density w.r.t. some dominating measure . If we want to emphasize that only depends on P through , then we will also use (and abuse) the notation to indicate the mapping from Q into the desired estimand.

6.2 Efficient influence curve

In our technical report van der Laan [62] we established a general representation of the efficient influence curve of for the longitudinal data structure and the model that assumes that the baseline covariates are independent, and it is given by:

For a different model for the covariate distribution of , only the first component would be different. In our case, we have , giving the following two terms:

where , , and are densities w.r.t. some appropriate dominating measure . We have, using short-hand notation for ,

and

and thereby

Thus,

Therefore,

which thus does not depend on m.

This proves the following representation of the efficient influence curve in the case :

We will state this result and the double robustness of the efficient influence curve in the following theorem.

Theorem 1 Consider the model in which are assumed to be independent. The efficient influence curve at of target parameter is given by

where

and is the conditional probability that equals c, given , which is a probability determined by . In addition, and with are densities defined w.r.t. a dominating measure and it is assumed that is uniformly bounded on a set that contains with probability 1 for all i.

Double robustness of efficient influence curve: Represent the efficient influence curve as . We have

,

so that

Since the efficient influence curve at depends on only through , we have that if , then

and thus

Let denote the conditional distribution of O, given W, and let be the degenerate distribution of W that puts mass 1 on W. We also note that

(10)

We also have that for all g,

Explicit proof of double robustness: Even though our general theorem in the technical report can be applied to this single time-point case and this double robustness result follows by noting that the second-order term in that theorem equals 0, here we provide an explicit proof of the stated double robustness for this single time-point case. Firstly, we have

We also have

This derivation with replaced by also establishes (eq. 10).

This proves that with , , we have

This proves the robustness w.r.t. misspecification of . In addition, it follows trivially that for any choice g.

6.3 Double robustness for an inefficient influence curve

In the following lemma, we present an inefficient influence curve and establish its double robustness. This could be used to construct an inefficient TMLE analogue to the efficient TMLE presented below.

Lemma 1 Suppose only depends on through . For notational convenience, in this lemma let include i itself: . Define the conditional probability densities and , and define

Let and . We have

We also have for all g.

Proof: We have

6.4 Estimating equation approach

Consider the efficient influence curve and let us represent it as an estimating function in :

where now . We will represent it as to stress that it only relies on through . We have , so that is a targeted estimating function for fitting . Given an estimator and of and , respectively, based on the data O, we can estimate with the solution of

Since , this solution is given by

This estimator, like the TMLE presented below, is double robust w.r.t. misspecification of and is asymptotically efficient if both are estimated consistently, assuming the required regularity conditions hold (as presented in our theorem below). Since it is not a substitution estimator, it will be more sensitive to practical violations of the positivity assumption due to being large.
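Since the efficient influence curve is linear in the target parameter, the estimating-equation solution has the familiar augmented-IPW shape. A schematic sketch (all inputs are hypothetical stand-ins: H for the estimated clever covariate, Qbar for the outcome regression at the observed data, and m for the estimated intervention-specific mean of the outcome regression):

```python
def estimating_equation_psi(H, Y, Qbar, m):
    """Solve (1/N) * sum_i [ H_i * (Y_i - Qbar_i) + m_i - psi ] = 0 for psi:
    the solution is simply the empirical mean of the per-unit contributions."""
    N = len(Y)
    return sum(h * (y - q) + mi for h, y, q, mi in zip(H, Y, Qbar, m)) / N

psi_hat = estimating_equation_psi([2.0, 0.0], [1, 0], [0.5, 0.5], [0.6, 0.6])
```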

Remark regarding the balance of the two contributions in the efficient influence curve: The factor in might come as a surprise in relation to . Let us consider the case that . To intuitively understand that this efficient influence curve does indeed represent a balance between these two contributions, we note the following:

where , and . Thus, the contribution is indeed of the same order, as a function of N, as , under the assumption that for some , which is indeed an assumption we made to establish -asymptotics.

6.5 TMLE

Recall the target parameter representation defined by (eq. 9).

Let be an estimator of , where . Suppose or that is continuous with values in . This estimator could be based on the log-likelihood loss function

For example, suppose that we assume a logistic regression model . Then we can estimate with the standard maximum likelihood based logistic regression estimator:

More generally, one can also use cross-validation based on this loss function and thereby estimate with an -based super-learner. The super-learner takes as input a library of candidate logistic regression estimators (including machine learning algorithms) and uses cross-validation to select the optimal weighted combination of this library of estimators, where the weight vector is obtained by minimizing the cross-validated risk based on this loss function. Thus, if one uses V-fold cross-validation, then one divides the sample , , into V subgroups, defines one of the subgroups as the validation sample, and the remainder as the training sample. One then trains the jth algorithm on the vth training sample and evaluates the v-specific cross-validated risk for this jth algorithm. This is done for each choice of sample split , and the V cross-validated risks are averaged, giving a single cross-validated risk for the jth algorithm. One could now select the algorithm that has the smallest cross-validated risk. The resulting estimator is referred to as the discrete super-learner. Similarly, one can define a candidate algorithm for a vector of weights and select the optimal choice that minimizes the cross-validated risk of over the choice . This estimator is referred to as the super-learner. The estimator could also be based on a squared error loss function
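The discrete super-learner selection described above can be sketched as follows (a minimal illustration with a hypothetical two-algorithm library under squared-error loss; the full super-learner would additionally optimize over weighted combinations of the library fits):

```python
import random

def discrete_super_learner(data, algorithms, V=5, seed=1):
    """For each candidate algorithm, average the validation-sample risk over
    V sample splits, then select the algorithm with the smallest
    cross-validated risk. `algorithms` maps a name to a (fit, loss) pair."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[v::V] for v in range(V)]  # V disjoint validation samples
    risks = {}
    for name, (fit, loss) in algorithms.items():
        cv_risk = 0.0
        for v in range(V):
            valid = set(folds[v])
            train = [data[i] for i in idx if i not in valid]
            model = fit(train)             # train on the vth training sample
            cv_risk += sum(loss(model, data[i]) for i in folds[v]) / len(folds[v])
        risks[name] = cv_risk / V          # average over the V splits
    return min(risks, key=risks.get), risks

# Toy library: the empirical mean vs the constant 0, under squared-error loss.
data = [(i, 2.0 + 0.1 * (-1) ** i) for i in range(20)]
algorithms = {
    "mean": (lambda tr: sum(y for _, y in tr) / len(tr),
             lambda mod, d: (d[1] - mod) ** 2),
    "zero": (lambda tr: 0.0,
             lambda mod, d: (d[1] - mod) ** 2),
}
best, risks = discrete_super_learner(data, algorithms)
```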

Let be a nonparametric maximum likelihood estimator of , thus respecting the model for the joint distribution of . For example, if are i.i.d., then we would estimate this marginal distribution of with the empirical distribution of . If are only known to be independent, then we would estimate each marginal distribution of with the discrete distribution that puts mass 1 on the singleton , : note that this empirical distribution is equivalent to the joint distribution that puts mass 1 on . If the model is larger than the independence model, then we would still estimate with this degenerate distribution .

Given the estimators and of and , one could now define a corresponding plug-in estimator . However, the TMLE differs from this estimator by using a targeted version of instead.

Let be an estimator of , and let be the corresponding estimator of the conditional distribution of A, given W. Given the model assumption for a common conditional density , this estimator can be based on the log-likelihood loss:

As explained above, this could be a simple logistic regression estimator or a super-learner based on this loss function based on the sample , .

Given , , and , let be a target-parameter-specific submodel through defined by

where , with , and, similarly, , with , all defined as densities w.r.t. a dominating measure .

Let

be the maximum likelihood estimator, which simply involves running a univariate logistic regression on a pooled data set with outcomes and covariate , using as offset . This now defines an update .
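The targeting step can be sketched as follows (a minimal illustration for a binary outcome: the initial fit Qbar and the clever covariate H are taken as given, and the one-dimensional maximum likelihood fit of epsilon is done by simple gradient ascent; any univariate logistic regression routine with offset logit(Qbar) would do):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def tmle_fluctuation(Y, Qbar, H, steps=100, lr=0.5):
    """Fit epsilon in logit Qbar_eps = logit Qbar + eps * H by maximizing the
    log-likelihood sum_i [Y_i log Qbar_eps_i + (1 - Y_i) log(1 - Qbar_eps_i)],
    and return the targeted update Qbar_eps."""
    eps = 0.0
    for _ in range(steps):
        # Score of the log-likelihood in eps: sum_i H_i (Y_i - Qbar_eps_i).
        grad = sum(h * (y - expit(logit(q) + eps * h))
                   for y, q, h in zip(Y, Qbar, H))
        eps += lr * grad / len(Y)
    return [expit(logit(q) + eps * h) for q, h in zip(Qbar, H)]

Y = [1, 0, 1, 1]
Qbar = [0.5, 0.5, 0.5, 0.5]
H = [1.0, 1.0, 1.0, 1.0]
Qbar_star = tmle_fluctuation(Y, Qbar, H)
```

By construction, the fitted epsilon (approximately) solves the score equation sum_i H_i (Y_i - Qbar*_i) = 0, i.e. the outcome-regression component of the efficient influence curve equation solved by the TMLE.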

The TMLE of is defined as the corresponding plug-in estimator

We note that this TMLE solves the efficient influence curve equation
which is a key ingredient in our proof of asymptotic normality of . Or, using the notation , and , we can write this as

Specifically, being a substitution estimator and using an NPMLE of , we have , while the targeted update of guarantees that

6.6 The clever covariate

Computation of the above TMLE requires the construction of an estimator of the clever covariate (density ratio). This estimator needs to be evaluated at for each , in order to compute the TMLE update . In addition, since involves integration of over any point in support of w.r.t. product measure , we also need to evaluate at any such point. One possible estimator is a plug-in estimator

obtained by plugging in our empirical counterpart for , and an estimator of . Let us consider the case that puts mass 1 on W. In that case, this simplifies to

In addition, one can use that so that for each i, the integral only integrates over , where we used the convention that . Nonetheless, this type of implementation can easily become computationally prohibitive.

Therefore, we use this subsection to formulate insights about the clever covariate that will allow a much easier implementation of an estimator of this clever covariate. The basic idea is that we will directly estimate instead of indirectly through plugging in estimators of and . These insights are formulated in the following lemma, where we consider the case that with .

Lemma 2 We note that is a mixture of densities of (living in a single space common in i) and thus represents a density of a random variable which we will denote with . Suppose for some k, representing covariates and treatment values of the subject and its friends.

We have

(11)

where we maximize over a set of densities of that contains the true .

The density can be factorized as

where is the conditional density of , given , and is the marginal density of , under the joint density .

We also have

(12)

where we maximize over a set of conditional densities of , given , that contains the true , and,

(13)By the same arguments, , where is a density of random variable ,
(14)

is the conditional density of , given , and is the marginal of , under the joint density . The latter equals the defined above as the marginal density under .

As a consequence, we can conclude that

(15)

Thus, the take-home point of this lemma is (eq. 15), which teaches us that we only need to estimate and , where can be fitted as if we were estimating a conditional density of , given , based on data , , , as if these N observations were i.i.d. That is, an important practical implementation is to fit with maximum likelihood based estimation, treating as i.i.d., as if we were fitting the common conditional distribution of , given . For example, if is binary, , then such a conditional distribution could be factorized in terms of a product of k binary conditional distributions. Each of these binary conditional distributions can be fitted with logistic regression, possibly incorporating adaptive estimation. The asymptotic consistency of such a maximum likelihood based estimator, and the validity of cross-validation ignoring the dependence, would rely on being dependent on only a finite (universal in N) number of . Such an estimator yields an actual fitted function that is easily evaluated at any required value.
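A minimal sketch of this pooled, i.i.d.-style maximum likelihood fit for a discrete summary measure (hypothetical names; with binary components one would instead factorize into a product of logistic regressions, as described above):

```python
from collections import Counter

def fit_conditional_pmf(pairs):
    """Pooled maximum likelihood fit of a conditional pmf P(C = c | Ws = w)
    for discrete (C, Ws), treating the N unit-specific pairs as i.i.d."""
    joint = Counter(pairs)
    marg = Counter(w for _, w in pairs)
    def pmf(c, w):
        return joint[(c, w)] / marg[w] if marg[w] else 0.0
    return pmf

# Pooled sample of (C_i, Ws_i) pairs across units, treated as i.i.d. draws.
pairs = [(1, "a"), (1, "a"), (0, "a"), (1, "b")]
q_c = fit_conditional_pmf(pairs)
```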

Suppose now that is known, as in an RCT. The above-mentioned approach would ignore the knowledge of and is thus not necessarily appropriate. If is very simple, as is often the case in an RCT, then one might simply be able to show that is known (e.g. if the randomization probability for does not depend on covariates), in which case there is no need to estimate . In such cases, one could also use a simple marginal empirical distribution for this conditional density in the estimation procedure outlined in the previous paragraph. Consider now the case that is known, but is a quite complex function. In that case, one could decide to simulate a very large number of draws of from and use an adaptive maximum likelihood based estimator of based on this large sample, using the method presented in the previous paragraph. This maximum likelihood based estimator would obviously utilize that it is known that only depends on certain covariates, so that the estimator can be simplified as much as possible. That is, we use the above-described estimation procedure for estimation of , but now applied to a very large data set simulated from the distribution of under . In this manner, one can still obtain an excellent approximation of the true that fully utilizes that we know the true .

Let us now discuss estimation of . As above for the case that is known, one might be able to determine directly (e.g. if the randomization probabilities of do not depend on covariates); for complex , we can simulate a very large number of draws from and use an adaptive maximum likelihood based estimator of based on this large sample, using the above-described estimation procedure.

In this manner, we obtain a functional form that approximates and well (by utilization of being known), and that one can evaluate for any . The TMLE can now be computed, and the target parameter evaluation as well.

Suppose now that , where is a summary measure of the treatment nodes of the friends of subject i. In this case, by a simple generalization of the lemma above, it follows that only involves fitting the conditional density of , given , treating these i-specific data points as i.i.d., as above. Thus, a reduction of the dependence of on the treatment nodes (i.e. a model assumption in our model) would result in a significantly less variable estimated clever covariate , and the above method can still be applied. For example, one might consider it reasonable to assume that the mean outcome for unit i depends on only through the treatment node for subject i and the proportion of treated subjects among the friends of i, beyond dependence on all the covariates.
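For instance, the "own treatment plus proportion of treated friends" summary measure mentioned above can be computed as follows (a sketch with a hypothetical adjacency-list representation of the friend sets):

```python
def treated_friend_proportion(A, friends):
    """For each unit i, return the summary (A_i, proportion treated among F_i),
    where friends[i] lists the indices of the friends of unit i."""
    out = []
    for i, F_i in enumerate(friends):
        prop = sum(A[j] for j in F_i) / len(F_i) if F_i else 0.0
        out.append((A[i], prop))
    return out

A = [1, 0, 1]
friends = [[1, 2], [0], [0, 1]]
summaries = treated_friend_proportion(A, friends)
```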

7 Asymptotic normality of TMLE of counterfactual mean of single time-point stochastic intervention

In this section, we state a theorem establishing the asymptotics of the TMLE of under conditions. Subsequently, we discuss the implications of this theorem regarding statistical inference in terms of confidence intervals. The proof is deferred to the Appendix. In the Appendix of our technical report, we demonstrate that our proof is generalizable to the general longitudinal data structures. In this section, we define to include i itself: i.e. .

Theorem 2 Consider the statistical formulation of data , , statistical model , and statistical target parameter , all defined conditionally on the network-profile . Recall that this network-profile F implies that only depends on through and that depends on W through . Suppose , and that satisfies an independence assumption specified below, and is unspecified. A probability distribution of O is thus parameterized by as follows:

(16)

where , , is a density for Y for each possible , but is otherwise unspecified, is a density for A for each possible , and . This defines the statistical model for the probability distribution of O.

For a specified stochastic intervention , the target parameter is defined by

where (defined as density w.r.t. some dominating measure), denotes integration w.r.t. the measure implied by , , and is the mean under density .

Let be the efficient influence curve of as defined in Theorem 1:

where

and . We will also denote these functions with and to emphasize that they only depend on g through . We use the definitions of , , , , defined as densities w.r.t. a dominating measure , and let .

Let be the distribution that puts mass 1 on . Consider the TMLE defined above using in . As shown above, this TMLE solves

Note that is a plug-in estimator of implied by and .

We make the following assumptions:

Entropy condition: Consider a class of functions on a set in that contains with probability 1. Assume that with probability 1. Consider a class of functions on . Assume that with probability 1. Define the dissimilarity measure on the Cartesian product of :

Assume that there exists some , so that , where is the number of balls of size w.r.t. metric d needed to cover .

In particular, this assumption holds if , , , where is the uniform sectional variation norm as defined in Gill et al. [41] and van der Laan [63].

Universal bound: Assume , where the supremum of O is over a set that contains O with probability 1. This assumption will typically be a consequence of the entropy condition, such as it is a consequence of the uniform sectional variation norm condition above.

Uniform consistency and rate condition: Assume in probability as ,

and

Asymptotic linearity condition on :

where only depends on O through , and .

Positivity condition: Assume

Universal bound on connectivity between units: Assume that there exists a so that for all a.s.

Universal bound on dependence of the W-distribution and the stochastic intervention: Assume that there exists a , so that only depends on with , and, for each i, is independent of with , where , and K does not depend on N.

First-order approximation: Then,

where

Weak convergence of first-order approximation: We can orthogonally decompose

where

For , let be the indicator that and are dependent, , and . For example, if are independent, then . We have

and

assuming these limits exist, and denotes the marginal expectation of , given F. As a consequence, .

Alternative expression of asymptotic variance: One can also represent as

To provide the reader with a general understanding of the asymptotic normality of the TMLE, we note the following. In the Appendix, we provide general conditions under which a process , where , converges weakly to a Gaussian process as random functionals in the Banach space of real valued functionals on a family of functions, endowed with the supremum norm [41], where the dependence between the s is restricted by assuming that can only depend on a set of at most K s, where the integer bound K does not depend on N. For completeness, we state here the general theorem that is a corollary of the results established in the Appendix and provides the key building block for the probabilistic component of our proofs:

Theorem 3 Consider a process , with , where , for each i, is independent of for a set with for a universal K, where , and is a set of multivariate uniformly bounded real valued functions . Let be the indicator that and are dependent. We make the following additional assumptions:

For all integers , for supremum norm on , and universal .

There exists an so that the entropy integral for w.r.t. norm is finite.

The marginal distributions converge to a normal distribution for all .

Then converges weakly to a Gaussian process Z identified by the covariance operator defined by

In particular, is asymptotically equicontinuous in the sense that if w.r.t. supremum norm, where , then converges to zero in probability.

7.1 Statistical inference

One can estimate by plugging estimators into the expressions for . Given an estimator of , one can then construct a confidence interval . If is consistent for , then this will be an asymptotically valid 0.95-confidence interval. The expression for suggests that a consistent estimator of relies on consistent estimation of , even though the consistency of relies only on a consistent estimator of and thus on the relevant part of (since the expectation w.r.t. W is consistently estimated). Even if relied on a less nonparametric estimator of , this suggests using a super-learner incorporating flexible machine learning algorithms when estimating this asymptotic variance . However, below we provide alternative estimators of the asymptotic variance that appear to avoid having to estimate .
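Concretely, given the TMLE and any (possibly conservative) estimate of the asymptotic standard error, the interval takes the usual Wald form (a straightforward sketch with hypothetical inputs):

```python
import math

def wald_ci(psi_hat, sigma_hat, N, z=1.96):
    """0.95-confidence interval psi_hat +/- z * sigma_hat / sqrt(N), where
    sigma_hat**2 estimates the asymptotic variance of the standardized TMLE."""
    half = z * sigma_hat / math.sqrt(N)
    return psi_hat - half, psi_hat + half

lo, hi = wald_ci(0.3, 0.5, 100)
```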

Ignoring contribution of : We claim that if is unknown, and one uses an MLE according to some model, then ignoring the contribution in due to estimation of will result in an upper bound for the actual asymptotic variance of the TMLE, based on a generalization of the result in van der Laan and Robins [3]. This result relies on the fact that is an orthogonal nuisance parameter w.r.t. . Such a result would then allow us to use this simplified plug-in estimator (using for ) in the statistical model in which is not known but a correctly specified model for (i.e. ) is available. Again, such a result will need to be formally established in future research.

In the remainder of this subsection, we present several practical proposals for variance estimation.

Assuming a consistent : Suppose that one is willing to assume that is consistent for . In that case, ignoring the contribution by the argument above, it follows that , so that we can estimate with

(17)

where

Assuming a rare outcome: Suppose now that one is not willing to assume that is consistent, but it is known that is close to zero (e.g. a rare outcome). In addition, assume that as well, which can be guaranteed by incorporating such a constraint in the logistic regression submodel of the TMLE as in Balzer and van der Laan [64]. In that case it follows that the contributions to the variance of and are second-order relative to the contributions of w.r.t. . As a consequence, in that case, it would be appropriate to still use this estimate (eq. 17), and the inconsistency of will only make the estimate of conservative. In fact, by this argument one could even drop the contribution, but for the sake of being conservative, we would recommend including this term.

A generally appropriate variance estimator: We now proceed with deriving a more general variance estimator under reasonable assumptions. Firstly, we will ignore the contribution due to estimation of , and as mentioned above we conjecture (based on i.i.d. theory) that this will only make the variance estimator conservative. Secondly, we note that (recall )

where . We now note that

where in this last expression denotes the empirical distribution that puts mass 1 on W. As shown in the next section, it follows that converges to a normal distribution, and therefore one expects that the conditional bias . We will assume that indeed . Under this assumption, we have that . As a consequence,

In addition, the first sum on the right-hand side already has conditional mean zero, given W, so that the asymptotic variance of the left-hand side equals the variance of the first sum plus the variance of the second sum. The second variance can be consistently estimated with

The variance of the first sum can be represented as:

If one is willing to assume that

then a conservative estimate of the first variance is defined as:

To what degree this is a reasonable assumption will need to be studied further. Under this assumption, the proposed estimator of the asymptotic variance is given by

8 TMLE of intervention-specific mean, conditional on W

In our target parameter, we conditioned on the network information , but marginalized over W, given F. As a consequence, in order to establish asymptotic normality of the TMLE we had to rely on an independence assumption on the joint distribution of W (given F), such as that all are independent, or only that each only depends on maximally K ’s. In this section, we define the target parameter conditional on all of W, which happens to equal , where is the empirical distribution that puts mass 1 on W. Our target parameter is now a parameter of the conditional distribution of O, given W, modeled in the same way as above (but without the need to model a distribution of W). Its efficient influence curve is now just the -component, where for the sake of notational convenience we will still denote with (just in this section and in the proof of the next theorem in the Appendix). We will use the same TMLE as presented in the previous sections. In the Appendix, we show how our template for analyzing the TMLE can be modified to analyze the TMLE with respect to this conditional W-specific target parameter, and that the terms due to estimation of now drop while the other terms are essentially the same. As a consequence, there is no need to redo all the technical proofs. Our proof now relies on the identity , as established by (eq. 10). This results in the following Theorem 4. This theorem differs from Theorem 2 in that it drops the independence assumption on the distribution of W, and in that the asymptotic variance of the TMLE (w.r.t. instead of ) no longer includes the -term. Thus, by changing our target parameter to this conditional version, we removed a restrictive assumption and we reduced the asymptotic variance of the TMLE w.r.t. this conditional target parameter.

Theorem 4 The conditional probability distribution of O, given W, is parameterized by as follows:

(18)

where , , is a density for Y for each possible , but is otherwise unspecified, is a density for A for each possible , and . This defines the statistical model for the conditional probability distribution of O, given W. Let denote the probability distribution of W that puts mass 1 on the observed .

For a specified stochastic intervention , the target parameter is defined by

where . Since only depends on through , we will also denote this parameter with .

The efficient influence curve of at is given by:

where

Consider the TMLE defined above using in . As shown above, this TMLE solves

We use the definitions , , , , defined as densities w.r.t. a dominating measure , and let , , where . Note that is a plug-in estimator of implied by and .

We make the following assumptions:

Entropy condition: Consider a class of functions on a set in that contains with probability 1. Assume that with probability 1. Consider a class of functions on . Assume that with probability 1. Define the dissimilarity measure on the Cartesian product of :

Assume that there exists some , so that , where is the number of balls of size w.r.t. metric d needed to cover .

In particular, this assumption holds if , where is the uniform sectional variation norm as defined in Gill et al. [65] and van der Laan [63].

Universal bound: Assume , where the supremum of O is over a set that contains O with probability 1. This assumption will typically be a consequence of the entropy condition, such as it is a consequence of the uniform sectional variation norm condition above.

Uniform consistency and rate condition: Assume in probability as ,

and

Asymptotic linearity condition on :

where only depends on O through , and .

Positivity condition:

Universal bound on connectivity: Assume that there exists a so that for all a.s.

Restriction on stochastic intervention: Assume only depends on W through with for some universal .

First-order approximation: Then,

where

.

Weak convergence of first-order approximation: We can orthogonally decompose

where

For , let . We have

and

assuming these limits exist, and denotes the conditional expectation of , given W. As a consequence, .

Alternative expression of asymptotic variance: One can also represent as

8.1 Variance estimation

Known and consistent : Let us consider an RCT so that and the term . If one is willing to assume that is consistent for , then . Therefore, in this case, the asymptotic variance can be estimated as

(19)

Known , rare outcome: Suppose now that we still have an RCT, but we are not willing to assume is consistent, but is close to zero (e.g. rare outcome). In addition, assume : i.e. one might incorporate this constraint on in the submodel of the TMLE, as in Balzer and van der Laan [64]. It follows that a first-order (w.r.t. approximating zero) approximation of the asymptotic variance can still ignore the -contribution. As a consequence, in that case an appropriate approximation of the asymptotic variance is given by . That is, this asymptotic variance is approximated by

However, the latter is conservatively estimated by using a possibly inconsistent , showing that we can still use (eq. 19) as the estimator of the asymptotic variance.

Ignoring contribution of is conservative: Even when is estimated with , as argued before, we suggest that the contribution only reduces the asymptotic variance, so that ignoring this contribution still yields reliable (if conservative) statistical inference. Thus, our overall conclusion is that (eq. 19) is an appropriate (possibly conservative) estimator of the asymptotic variance when either is consistent or if .
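Equation (19) itself is not rendered here; generically, influence-curve-based inference of the kind this section describes estimates the asymptotic variance by the empirical mean of the squared (estimated) influence curve and forms a Wald-type interval. A minimal sketch under that generic recipe (the function name and interface are hypothetical, not the article's):

```python
import numpy as np

def ic_variance_ci(psi_hat, ic_values, z=1.959963984540054):
    """Generic influence-curve-based inference: sigma2_hat = mean(IC^2) and a
    Wald-type 95% CI psi_hat +/- z * sqrt(sigma2_hat / N). This is a sketch of
    the general recipe, not the article's specific estimator (eq. 19)."""
    ic = np.asarray(ic_values, dtype=float)
    sigma2 = float(np.mean(ic ** 2))          # empirical mean of squared IC
    half = z * np.sqrt(sigma2 / len(ic))      # Wald half-width
    return sigma2, (psi_hat - half, psi_hat + half)
```

A conservative variance estimate of this form widens the interval, which is consistent with the section's conclusion that ignoring the extra contribution is acceptable.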

A general variance estimator: Assume that

Since this represents the bias term of the TMLE , and we have asymptotic normality of , and is consistent for , this should be true under the assumptions of the previous theorem. However, under this assumption we have that, ignoring ,

and, as a consequence,

Note that indeed

Thus, under the assumptions of the Theorem, and ignoring the contribution from , we have

where the linear term has conditional mean zero w.r.t. . The conditional variance of the linear term on the right-hand side is thus given by the following expression:

Suppose that

Then a conservative estimate of the last expression is defined as:

Note that if is consistent for , then this estimator is asymptotically equivalent with (eq. 19), but we expect the latter to be significantly larger for finite samples when is not a good approximation of . If is not known, then is replaced by its estimator .

9 Summary and concluding remarks

We formulated a general causal model for the longitudinal data structure generated by a finite population of causally connected units. This allows us to define counterfactuals indexed by interventions on the treatment nodes of the units, and corresponding causal contrasts. We established identifiability of the causal quantities from the data observed on the units, when observing either all units or a random sample of the units, under appropriate assumptions and assuming that the size of the population converges to infinity. Our causal assumptions implied conditional independence across units at time t, conditional on the past of all units, resulting in a factorized likelihood of the observed data (even though the observed data is generated by a single experiment, not by a repetition of independent experiments). To deal with the curse of dimensionality, we assumed that a unit’s dependence on the past of other units can be summarized by a finite-dimensional measure and that this dependence is described by a common function across the units. This now defines the statistical model for the data distribution and the statistical target parameter, and thereby the statistical estimation problem. We demonstrated that we can use cross-validation and super-learning to estimate the different factors of the likelihood. Given the statistical model and statistical target parameter that identifies the counterfactual mean under an intervention, we derived the efficient influence curve of the target parameter. We showed that this efficient influence curve characterizes the normal limit distribution of a maximum likelihood estimator and thus still represents an optimal asymptotic variance among estimators of the target parameter. However, due to the curse of dimensionality, maximum likelihood estimators will be ill-defined for finite samples, and smoothing will be needed.

Such smoothed/regularized maximum likelihood estimators are not targeted and will thereby be overly biased w.r.t. the target parameter, and, as a consequence, generally not result in asymptotically normally distributed estimators of the statistical target parameter. Therefore, we formulated targeted maximum likelihood estimators of this estimand and showed that the robustness of the efficient influence curve implies that the bias of the TMLE will be a second-order term involving squared differences and for two nuisance parameters and the relevant factor of the likelihood . Subsequently, as showcased in this article, we focused on defining and analyzing the TMLE of the causal effect of an intervention on a single treatment node on a future outcome. In this special case, we showed that the efficient influence curve is double robust w.r.t. these two nuisance parameters , where depends on the intervention mechanism and the distribution of the covariates, and is a common conditional mean function for the outcome. We established two formal asymptotic normality theorems for the TMLE under the assumption that each unit is connected to fewer than K other units for a universal K.

In future work, it will be of interest to extend our asymptotic theorems to the case in which a unit can depend on a summary measure of fixed (in N) dimension that may itself depend on a number of units converging to infinity with sample size. We can also be less restrictive and allow these summary measures to have a dimension K that increases with N, and then establish rates of convergence slower than , together with corresponding (e.g. normal) limit distributions. In addition, the finite sample behavior of these estimators and confidence intervals will need to be evaluated through simulation studies. We will also generalize our TMLE to the TMLE of parameters defined by marginal structural working models for the causal dose–response curve for a collection of stochastic interventions. Finally, we plan to investigate whether there are other causal models for causally connected units that allow the formulation of a TMLE for the general longitudinal data structure in terms of sequential regressions, as in the double robust estimating equation based estimators for i.i.d. data presented in Bang and Robins [58] and the subsequent analogous TMLEs in van der Laan and Gruber [59] and Petersen et al. [66].

Overall, we believe that the statistical study of these causal models for dynamic networks of units provides a fascinating and important area of future research, relying on deep advances in empirical process and statistical estimation theory, while raising new challenges. In the meantime, these advances will be needed to move statistical practice forward.

Acknowledgments

This research was supported by NIH grant R01 AI074345-05. The author owes thanks to Elizabeth Ogburn, Maya Petersen, and Oleg Sofrygin for helpful discussions. The author also thanks Tyler VanderWeele for suggesting that we weaken the independence assumptions on W, resulting in our Theorem 4.

Appendix

Introduction to Appendix

We start by presenting a general template of our proof of Theorem 2, which establishes the asymptotics of the TMLE for the case . In this template, we define the remaining ingredients (eq. A1), (eq. A2), and (eq. A3) that will need to be established in the remainder of the proof. Each of these three ingredients is established in a separate section. These sections are themselves organized by the specific tasks that need to be carried out. We conclude with a similar template of the proof of Theorem 4, demonstrating that the technical components needed are the same as for Theorem 2. At the end of the Appendix, we provide a notation index that will be helpful for reading through the article as well as through the Appendix.

General template of proof of Theorem 2

Recall that is a sum over the units j. We will use the notation , while is its expectation w.r.t. distribution . Due to Theorem 1, we have , , , and . In particular, this yields

We now proceed as follows:

We note that

where , , and . From this, it follows that

where we used that for a given function f . We assumed that the second-order term . In addition, we define

We also note that

where

and

We used here that . Define the process indexed by . Note that . As a consequence, showing that corresponds with proving that for a sequence that converges to zero w.r.t. supremum norm. Therefore, our proof will involve studying this empirical process and establishing the required asymptotic equicontinuity. In this manner, we will establish

(A2)

Thus, we have obtained the following expansion:

We have

We will show that

(A3)

To understand these last two terms, define the process

which is a sum of the form indexed by , where plays the role of . Note that , while . Thus, showing that and comes down to showing that for converging to zero w.r.t. the supremum norm. Therefore, our proof will involve studying this process and establishing the required asymptotic equicontinuity. Specifically, we will decompose this process into three orthogonal processes that can be represented as sums over functions of conditionally independent random variables identified by the sets (analogous to the orthogonal decomposition of the first-order approximation below) and establish this asymptotic equicontinuity for each of the three orthogonal processes.

Consider now the term

(20)

This term equals

where

We assumed that . Using that , , and , it follows that (eq. 20) reduces to

where we note that , and we defined

where we defined the process with

The term is included in the first-order expansion and thus partly characterizes the normal limit distribution of , so that its analysis will be part of the analysis of the first-order approximation. Since only depends on through , where we condition on , we will indeed be able to show that such a term is a nicely behaved empirical process (converging to a normal distribution), even though each i-specific term is correlated with the j-specific terms when .

Showing that comes down to showing that is an asymptotically equicontinuous process w.r.t. the supremum norm, and that converges to zero w.r.t. the supremum norm. In this manner, we show that

(A4)

We assumed that

where only depends on O through , and .

Thus, if we prove (eq. A2), (eq. A3), and (eq. A4), then we have obtained the following first-order expansion:

Analysis of first-order approximation: Let . The first-order approximation equals

It remains to prove that this first-order expansion converges to a normal limit distribution. This proof has its own outline. Firstly, we decompose by , where , , and . We can represent as , where , , and . It follows that simplifies to:

In addition,

and . We also note that, conditional on , is a sum of independent mean zero random variables (functions of ); conditional on W, is a sum of with conditional mean zero, given W; , are (conditionally) independent, given W; and, finally, , with W satisfying our independence assumption (e.g. are independent). Recall that the sets are defined such that only depends on W through .

Exploiting these independence structures, we will show that

(A1)

with the expressions for , , and as specified in the theorem. Here (eq. A1) represents all three convergence statements. Due to the orthogonality of the three empirical processes, using moment generating functions, our results also imply . For example, we can analyze and use convergence of moments of each process separately to establish convergence to . Once we have convergence of all moments, and we can bound for some , which follows from results established in our separate analysis, then we obtain convergence of the moment generating function, and thereby weak convergence of the sum . In this manner, the desired weak convergence of the sum is shown.

This finishes the outline of the proof. It remains to establish (eq. A1), (eq. A2), (eq. A3), and (eq. A4).


(A3): Outline of proof

Let and we will denote with . Our goal is to prove that and . Let , , and , denote the expectation operators w.r.t. their respective conditional distributions. We have

We now note that, for a fixed , conditional on , is a sum of independent mean zero random variables (functions of ). We also note that for a fixed , conditional on W, is a sum of mean zero , where , are (conditionally) independent. Finally, for a fixed , , and, by assumption on , for each i, is only dependent on at most K .

Let be the limit of , and let be the limit of . By exploiting these independence structures, we will use empirical process theory to establish that

This then establishes and .

(A3): Outline of establishing asymptotic equicontinuity of a process

For that purpose, we will apply Lemma 5 in van der Vaart and Wellner [41], which concerns establishing weak convergence of a process , indexed by a . Given that is a subset of some metric space of functions with metric d, one defines as the minimal number of balls of size needed to cover . In addition, for a given strictly monotone function , let be the so-called Orlicz norm of the random variable .

For example, one can select the -norm of for arbitrarily large p, which corresponds with the choice of Orlicz norm defined by . The Orlicz norm implied by is the typical Orlicz norm pursued in the case of sums of independent random variables, and it is the one we will also use.
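Concretely, the Orlicz norm of a random variable X is the smallest C with E psi(|X|/C) <= 1. The following sketch (the function name and the Monte Carlo approximation of the expectation are our own illustrative choices, not part of the article) approximates it from a sample by bisection, using psi_p(x) = exp(x^p) - 1:

```python
import numpy as np

def orlicz_norm(x, p=2, tol=1e-6):
    """Empirical Orlicz norm inf{C > 0 : E[psi_p(|X|/C)] <= 1} with
    psi_p(u) = exp(u^p) - 1, approximated from a sample x by bisection."""
    x = np.abs(np.asarray(x, dtype=float))
    def mean_psi(c):
        return np.mean(np.expm1((x / c) ** p))
    lo, hi = 1e-8, 1.0
    while mean_psi(hi) > 1.0:        # grow upper bracket until E[psi] <= 1
        hi *= 2.0
    while hi - lo > tol:             # bisect for the smallest feasible C
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean_psi(mid) > 1.0 else (lo, mid)
    return hi
```

For a degenerate sample X ≡ 1 and p = 2, the norm is 1/sqrt(log 2) ≈ 1.201, since exp(1/C²) − 1 = 1 exactly when C = 1/sqrt(log 2).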

This Lemma 5 states that, if (1) is bounded by for some universal constant c and metric , (2) is totally bounded w.r.t. this metric d, (3) for some , , (4) the marginal distributions converge to a normal distribution , then converges weakly to a Gaussian process Z in , where is the metric space of functions endowed with supremum norm . We assumed that our parameter space for consists of uniformly bounded functions on a set that contains with probability 1, and we defined the metric d as the supremum norm. Thus, (2) holds. We posed (3) as an entropy condition on the parameter space , which will thus hold by assumption. For example, could be the class of functions on that have uniform sectional variation norm bounded by a , in which case this entropy condition holds. Under conditions 1–3 we have that the process is asymptotically tight, and, for any sequence , we have for each ,

So once we have established the Orlicz-norm condition (1), then this tightness can be used to establish that terms for random converging to w.r.t. metric d in probability, assuming satisfies the entropy condition and is totally bounded w.r.t. this metric d.
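As a sanity check on entropy condition (3), suppose (illustratively; this is an assumption, not the article's class) the covering number is polynomial, N(ε) = (C/ε)^V, as for VC-type classes; the entropy integral ∫₀¹ sqrt(log N(ε)) dε is then finite:

```python
import numpy as np

# Illustrative check of an entropy-integral condition: for an assumed
# polynomial covering number N(eps) = (C/eps)^V (VC-type classes), the
# integral of sqrt(log N(eps)) over (0, 1] is finite.
def entropy_integral(C=2.0, V=1.0, n_grid=200000):
    eps = np.linspace(1e-8, 1.0, n_grid)
    h = eps[1] - eps[0]
    return np.sum(np.sqrt(V * np.log(C / eps))) * h   # Riemann sum
```

The integral scales exactly as sqrt(V), so richer classes enlarge it but never break finiteness in this polynomial regime.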

Bounding the Orlicz norm of our empirical processes

The Orlicz norm indexed by function is defined as

We consider a stochastic process indexed by for a class of functions . In our application, we have that, for example, represents two real-valued functions and defined on a set that contains with probability 1. In addition, our processes can be represented as , where, for example, for each i there is an associated set , and, if , then and are independent, and, in general, it is known that for each i, is independent of for sets with . For some of our processes, these independencies are conditional on a random variable (e.g. conditional on the infinite sequence ). In that case, we will apply our general proof below conditional on this random variable and obtain a bound on the Orlicz norm that holds for almost every value of the conditioning random variable. For example, one establishes a universal bound C in (with the P in the Orlicz norm being a conditional distribution, given a value of the conditioning random variable), where C does not depend on the value of the random variable one conditions upon. Finally, we really need to bound , so that we will apply the lemmas below to instead of .

So our goal is to bound for some universal (in N and ). As outlined in the previous subsection, the choice of Orlicz norm and the norm for is important, since the corresponding entropy requirement on is that . We will establish our results for the strongest Orlicz norm, which corresponds with , while we select the supremum norm for the functions .

Lemma 3 Let be the Orlicz norm defined above w.r.t. . Suppose that for each p

Let be a number so that

Then,

In particular, if can be bounded from above by a constant in N, and one finds a D (constant in N) so that , then it follows that .

Proof. We first note

Suppose that for each even p . Then, we have

So is bounded by a C chosen so that

Let be a number so that

Then, C can be selected so that , or equivalently, . Thus, we have shown that . The last statement is straightforwardly shown. □

Thus, it suffices to establish a bound of the type for some that is sufficiently well behaved as a function of p so that the previous lemma applies.

We use the following lemma to bound the pth moment of .

Lemma 4 Assume that, for each , and each integer p, we have a universal constant C so that

(21)

Then, we have

The bound (eq. 21) is a straightforward consequence of the conditions stated in the theorem, where we use the supremum norm on , thereby allowing us to apply this lemma.

Proof. By repeatedly applying the Cauchy–Schwarz inequality, it follows that

By assumption, , so that the latter is bounded by . □

The following lemma provides us with an upper bound for so that .

Lemma 5 Assume that, for each i, and each integer p, we have a universal constant C so that

Let be an indicator, identified by indices , which equals 1 if there exists a set among the sets that is disjoint from the other sets. More generally, we can define to equal 1 if there exists an element so that is independent of for all with .

Let

.

Then

Proof. We have

By the previous lemma, we have for a , so that we obtain

By putting a bound on , we can obtain a nice bound on , so that the previous lemma combined with Lemma 3 results in the following lemma, providing the desired universal bound on the Orlicz norm.

Lemma 6 Assume that, for each i, and each p, we have a universal constant C so that

Assume that is independent of for a set and . For p an integer, we have , where

For , we have for some for some universal .

Proof. We first need to show that . Selecting one particular corresponds with selecting an element in p times in a row. Without restrictions on this sequence of p draws, one has N options at each of the p subsequent steps, resulting in vectors . Suppose we have arrived at the lth draw, so that we have a sequence with corresponding sets . For a next we define a binary if . Suppose . , is an island, and one cannot find a single element in for which is an element of both (1) and (2) , since we arranged that . As a consequence, an element with will need at least one future selection with in order to connect with , and such a future selection s cannot simultaneously connect with another with . Therefore, if the sequence of p elements has more than 1’s, then there will be at least one island among of size 1 with . Thus, in that case . Hence, we only need to count the vectors for which has at most 1’s.

For a choice with , we have at most possible choices since we cannot select any of the elements in . For a choice with , we have at most Kp choices. The total number of sequences for which there are at most 1’s is upper-bounded by . The total number of sequences present in one such sequence is given by . To conclude, we have the following upper bound

which proves our first result.

Thus, we have with bounded by this upper bound. We now want to bound the Orlicz norm . Let us first do this for the Orlicz norm . Using that , , we have

Thus there exists a so that the term on the left of the inequality is smaller than or equal to 1, so that we have shown . It also follows that can be bounded by a universal constant times . This completes the proof for this Orlicz norm identified by .

Let us now do the proof for the Orlicz norm identified by . Note . Thus, we have

The term within can be made smaller than an arbitrary number by selecting large enough. Therefore, we need to show that is bounded for some small enough . The proof then proceeds as above for the -Orlicz norm. Now, we note that, using for , and ,

Thus, behaves as . Since , by selecting small enough with , this sum can be made arbitrarily small. As before, it follows that can be bounded by a universal constant times . □

(A3): Asymptotic equicontinuity of

The process is a sum of independent random variables conditional on , so that its analysis is a simple imitation of the general analysis presented in the previous subsection, conditional on . The proof that the -norm (conditional on ) of is bounded by a universal constant times the supremum norm of is as follows:

where, because and , we have that .

(A3): Asymptotic equicontinuity of

Conditional on W, for a fixed , we can represent this process as , where depends on A through , while all , , are independent. As a consequence, for each i, conditional on W, is independent of . Again, the above general analysis can be applied, and the proof that the -norm of is bounded by a universal constant times the supremum norm of is as follows. Firstly,

We have and . By our uniform bound on the class of functions we have that for some . We also have

where . The same bounding applies to . This proves that indeed is bounded by C times , which completes the proof.

(A3): Asymptotic equicontinuity of

Conditional on , we can represent as . Specifically, . Under our independence assumption, we know that, for each i, only depends on at most K . Thus, we can apply our general proof above to establish the bound on its Orlicz norm. As above, we can show that the norm of is bounded by a constant C times the supremum norm of .

Proof of (A2)

Define the process indexed by , where . We need to prove that . This proof is completely analogous to the proofs of asymptotic equicontinuity for the processes analyzed above, but now with respect to the supremum norm for .

Proof of (A4)

Recall the definition of the process

where

and is the conditional distribution of , given W, implied by . We need to prove that . This proof is completely analogous to the proofs of asymptotic equicontinuity for the processes analyzed above, but now with respect to the supremum norm for .

(A1): Establishing weak convergence of first-order approximation of standardized estimator

Outline of proof

Recall

where

We will establish weak convergence of each of the three terms separately.

The proof of weak convergence of can be based on the standard CLT since, conditional on , is a sum of mean zero independent random variables.

Lemma 7 converges weakly to a normal distribution with mean zero and variance

assuming this limit exists, where

For example, if is binary, then the latter expression equals

Recall that .
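The CLT step behind Lemma 7 can be illustrated generically. The following toy simulation (our own illustrative choice of Bernoulli-type terms, unrelated to the article's data structure) shows the standardized sum of independent mean-zero terms behaving like a mean-zero normal with the predicted variance:

```python
import numpy as np

# Toy illustration of the CLT behind Lemma 7: a standardized sum of N
# independent mean-zero terms is approximately N(0, sigma^2). The Bernoulli
# terms below are an arbitrary illustrative choice, not the article's model.
rng = np.random.default_rng(42)
N, reps = 500, 2000
terms = rng.binomial(1, 0.3, size=(reps, N)) - 0.3  # i.i.d. mean-zero terms
Z = terms.sum(axis=1) / np.sqrt(N)                  # standardized sums
sigma2 = 0.3 * 0.7                                  # predicted limit variance
```

Across the simulated replicates, Z has mean close to 0 and variance close to sigma2 = 0.21, as the lemma's normal limit predicts.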

We establish weak convergence of by establishing convergence of its pth moment. Specifically, we establish that for p even, and for p odd, as , where represents the limit of the second moment . This convergence in moments implies that converges weakly to a normal distribution , where we utilize the following two lemmas.

Lemma 8 A random variable Z with for p even, and for p odd has probability distribution equal to , the normal distribution with mean zero and variance .

Proof. We have

which is the moment generating function of , i.e. a normal distribution with mean zero and variance equal to . □
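Assuming the elided moment sequence in Lemma 8 is the standard centered-normal one, E[Z^p] = sigma^p (p-1)!! for p even and 0 for p odd, this can be checked numerically (the helper names are our own):

```python
import numpy as np

def normal_moment(p, sigma=1.0):
    """Numerical E[Z^p] for Z ~ N(0, sigma^2), via a fine Riemann sum."""
    z = np.linspace(-12.0 * sigma, 12.0 * sigma, 200001)
    pdf = np.exp(-z**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    return np.sum(z**p * pdf) * (z[1] - z[0])

def double_factorial(n):
    """n!! = n(n-2)(n-4)...; by convention (-1)!! = 0!! = 1."""
    return 1 if n <= 0 else n * double_factorial(n - 2)
```

For instance, the numerical fourth moment of a standard normal agrees with 3!! = 3, and all odd moments vanish by symmetry.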

Lemma 9 Suppose for a universal . Suppose that for p even, and for p odd, as . Then converges in distribution to , as .

Proof. Consider the moment generating function when . By Fubini’s theorem,

Because , we have

which converges to zero in . Therefore, we can truncate the summation defining the moment generating function of and focus on establishing convergence of , but the latter follows from as . This proves that

This proves that converges in distribution to as . □
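The truncation argument in Lemma 9's proof works because the even-moment series reconstructs the normal moment generating function term by term: with the (assumed standard) moments E[Z^p] = sigma^p (p-1)!! for p even, sum_p E[Z^p] t^p / p! = exp(t^2 sigma^2 / 2). A quick numerical sanity check (the helper name is ours):

```python
import math

def normal_mgf_series(t, sigma=1.0, terms=40):
    """Partial sum of sum_p E[Z^p] t^p / p!, using the assumed standard
    normal moments E[Z^p] = sigma^p (p-1)!! for p even and 0 for p odd."""
    total = 0.0
    for p in range(0, 2 * terms, 2):
        dfact = math.prod(range(p - 1, 0, -2))  # (p-1)!!; empty product = 1
        total += (sigma ** p) * dfact * t ** p / math.factorial(p)
    return total
```

Each even term equals (t²σ²/2)^k / k!, so the truncated series converges rapidly to exp(t²σ²/2), which is why the tail of the moment generating function can be ignored in the proof.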

(A1): Establishing convergence of pth moment for

We consider the case in which are independent, given F. The proof can be generalized to handle our weaker independence assumption on the distribution of W.

Lemma 10 Consider the empirical mean . Let

For example, if , we have

where is the conditional distribution of , given W, which only depends on A through .

Let . For two integers , define as the indicator that the intersection of and is non-empty. Assume that for a constant , we have

We have for p even,

For p odd, this pth moment converges to zero.

Proof. Given an index (one among ), we can draw a graph by drawing a line between two elements in whenever the two corresponding sets and have a non-empty intersection. Classify an element by the sizes of the connected sets that make up the graph of . One category of indices consists of those for which each connected set is of size 2 (assuming p is even); let be the indicator of falling in this category. For each of the other categories, with all connected sets of size greater than or equal to 2 but at least one of size greater than 2, we can show that its number X of elements is of smaller order than : as , using that . The latter shows, in particular, that the moment for p odd converges to zero. In addition, for with , let index the pairs that are connected, and let denote the two indices in corresponding with the jth pair. We also note that are independent across the pairs j, conditional on W. We have

Let represent the i-specific baseline covariates, so that is separate from . We now want to take a conditional expectation, given , of the last expression in order to obtain an expression for the pth moment only conditioning on F. Conditional on , the indicators are fixed. Since only depends on W through , the sets in the product over j are disjoint across j, and are independent, it follows that, conditional on ,

Let . For two integers , define as the indicator that the intersection of and is non-empty. Let , and is the Cartesian product of this set. Let , where we are reminded that is the indicator of all connected sets among being of size 2. We have the following lemmas.

Lemma 11 We have

Proof of Lemma 11. Note that the right-hand side sums over vectors in while the left-hand side sums over vectors that are both in and satisfy that the corresponding p-dimensional vector is an element of . Since a vector made up of -connected pairs can correspond with connected sets of larger size than 2, we have that , i.e. the right-hand side sums over more elements. However, the number of these extra vectors that should not have been counted is of smaller order than , so that the contribution is negligible. □

Lemma 12 We have

Proof of Lemma 12. Consider a vector of three connected pairs (i.e. ). These three connected pairs appear (i.e. ) times on the right-hand side. However, on the left-hand side, any vector of length 6 with two 1s, two 2s, and two 3s is counted, and there are (i.e. ) such vectors: the number of ordered vectors of length 6 is 6!, but flipping the two 1s or two 2s or two 3s does not yield a different vector. □
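The multiset-permutation count used in this proof can be verified by brute force:

```python
from itertools import permutations
import math

# Brute-force check of the counting step in the proof of Lemma 12: the number
# of distinct ordered vectors of length 6 containing two 1s, two 2s, and two
# 3s is 6!/(2!*2!*2!) = 720/8 = 90.
distinct = set(permutations((1, 1, 2, 2, 3, 3)))
print(len(distinct))  # 90
assert len(distinct) == math.factorial(6) // (2 * 2 * 2)
```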

Finally, we state the following trivial result:

Lemma 13 We have

This proves that

Finally, we assumed that the latter summation within the power converges to . Thus, for p even, we have

(A1): Convergence of pth moment of .

The same proof can be applied to establish the convergence of the pth moment of resulting in the following lemma.

Lemma 14 Let , and for set defined by F with for some fixed , where we condition on F. Let

Specifically, for we have

We assumed that only depends on for sets implied by F. Thus, in this case

For two integers , define as the indicator that the and are dependent (conditional on F). Assume

We have for p even,

For p odd, this pth moment converges to zero.

General template of proof of Theorem 4

We have . We now proceed as follows:

We denote the second term with . We note that equals

Thus, we have obtained the following expansion:

where

By assumption, we have .

We have

Analogously to our proof for Theorem 2, we can show that

(A3)

Consider now the term

We have

where

We assumed that . We also assumed

where only depends on O through , and .

Thus, we have obtained the following first-order expansion:

Analysis of first-order approximation: Let

Then, the first-order approximation is given by , where . It remains to prove that this first-order expansion converges to a normal limit distribution. This proof has its own outline. Firstly, we decompose using , where , and . Denote the two corresponding terms with .

Note that

and . We also note that, conditional on , is a sum of independent mean zero random variables (functions of ), and, conditional on W, for some which depends on A through , while , are pairwise (conditionally) independent.

Analogously to our proof of Theorem 2, we can show that

with the expressions for , as specified in the Theorem. Due to the orthogonality of the two empirical processes, using moment generating functions, it also follows that . □

Notation index

TMLE: Targeted Minimum Loss-Based Estimation/Estimator

Oi: Data observed on unit i. In general, , and the special case is denoted with

: Intervention node for unit i at time t

: Measurements/covariates for unit i at time t in between intervention nodes and

: Final outcome for unit i

: . Similarly, we define

:

: Friends of unit i at time t indicating that and causally only depends on the history of all subjects through the history of unit i itself and the history of its friends . This defines exclusion restrictions in the structural equation model for the equations for and . For the -data structure, we denote this set with

O: is the collection of all data on the N units

: P is a possible probability distribution of the data O under our model assumptions, and is the true probability distribution of O

L:

A:

: the history of L for all N units

: the history of the treatment/intervention process A on all N subjects

Y: the outcomes on all N subjects

: the average outcome for the combined N units

: , parent nodes of according to the following time-ordering only:

: parent nodes of according to time-ordering only

F: the friend-process/network-process for all N units. In all probability distributions, we always condition on

U: the exogenous errors in the structural equation model for O defined as , , , , where are functions of the parent nodes and exogenous errors, modeled as in the article

: A possible probability distribution of as modeled by the structural equation model

: The true distribution of

: The set of possible probability distributions of as specified by the structural equation model formulated in the article. We also refer to this as the full-data model

: The set of possible probability distributions of O, implied by , or defined without reference to the underlying model . is called the statistical model for the data distribution

: , is a conditional distribution of a , given , . The distribution is modeled through a common : . . represents a stochastic intervention on the intervention nodes A representing the intervention that replaces the true conditional distribution of , given , by this user-supplied choice , for all . One can also denote

g: . is a possible conditional distribution of , given , . The distribution is modeled through a common : . . is the true conditional distribution parametrized in terms of the true

: The post-intervention random version of L obtained by replacing the structural equations for A by the stochastic intervention . It is also called an intervention-specific counterfactual

: The post-intervention random version of Y. Note is a component of

: , the average outcome under intervention

: A possible probability distribution of the counterfactual ; is the true probability distribution of implied by the true distribution of .

: The G-computation formula expression for , purely defined as a function of P. Under the posed causal model , we would have

: A random variable with probability distribution . Similarly, we define and

: represents the parameter mapping that maps a distribution of the underlying into the desired quantity of interest: represents the true causal quantity value. In this article, we defined [:] represents the parameter mapping that maps a distribution P of O into a parameter value of interest. represents the true statistical parameter value/estimand. In this article, , i.e. the expectation of under the G-computation distribution . Under the causal model , we have

Statistical estimation problem: Estimation of based on , i.e. defined separately from the underlying causal model, but the causal model allows a causal interpretation

c^A_i, c^Y_i: The i-specific summary measures of the past that A_i and Y_i depend upon, respectively.

Q, G: Q denotes the non-intervention factors of the likelihood of P (the covariate and outcome factors). The statistical model is M, where Q is left unspecified and G is some model for g. The common (in i) conditional distributions of Y_i, given c^Y_i, and of A_i, given c^A_i, are denoted q and g, respectively, and we also use corresponding short-hand notation.

c^{Y,g*}_i: The same summary measure as c^Y_i, but with A replaced by a draw A* from g*.

Ψ(Q): Same as Ψ(P), but stressing that Ψ(P) only depends on P through Q.

D*(P): The canonical gradient/efficient influence curve of Ψ at P. Also denoted D*(Q, h) to stress that it only depends on g through a specified h, and it can be viewed as an estimating function in ψ.

L(Q): A loss function for Q satisfying Q_0 = arg min_Q E_{P_0} L(Q)(O). In our case, we define a loss L_t(Q_t) for each t and define L(Q) as the sum-loss L(Q) = Σ_t L_t(Q_t). For example, one can use the log-likelihood loss. We use a separate loss function for g.

L(g): A loss function for g. See L(Q) for the analogous sum-loss representation.

Cross-validation: For example, suppose we want to estimate g_0, the common conditional distribution of A_i, given c^A_i. Create a data set (c^A_i, A_i), i = 1, …, N. Consider a V-fold sample split of these N observations into a so-called validation sample and its complement, the so-called training sample, v = 1, …, V. Let ĝ_v be an estimator applied to the v-th training sample. The cross-validated risk w.r.t. the loss L(g) of this estimator is defined as the average, over the V splits, of the empirical mean of the loss L(ĝ_v) over the corresponding validation sample. A cross-validation selector among a set of candidate estimators is defined as the one that minimizes this cross-validated risk across the candidate estimators. Similarly, we can define cross-validation for estimation of Q_0.
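
The V-fold selector described above can be sketched as follows; the simulated data, the two candidate estimators, and all names are hypothetical, and the log-likelihood loss plays the role of L(g):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated (c^A_i, A_i) pairs: A | c ~ Bernoulli(expit(c)).
N = 2000
c = rng.normal(size=N)
expit = lambda x: 1 / (1 + np.exp(-x))
A = rng.binomial(1, expit(c))

# Two candidate estimators of g_0(1 | c) = P(A = 1 | c):
def fit_constant(c_tr, A_tr):
    p = A_tr.mean()                      # ignores c entirely
    return lambda c_new: np.full(len(c_new), p)

def fit_binned(c_tr, A_tr, bins=10):
    edges = np.quantile(c_tr, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, c_tr) - 1, 0, bins - 1)
    probs = np.array([A_tr[idx == b].mean() for b in range(bins)])
    def predict(c_new):
        j = np.clip(np.searchsorted(edges, c_new) - 1, 0, bins - 1)
        return probs[j]
    return predict

def neg_loglik(p, a):                    # the loss L(g)(c, A)
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -(a * np.log(p) + (1 - a) * np.log(1 - p))

def cv_risk(fit, V=5):
    folds = np.arange(N) % V             # V-fold sample split
    risks = []
    for v in range(V):
        tr, va = folds != v, folds == v
        g_hat = fit(c[tr], A[tr])        # train on the training sample
        risks.append(neg_loglik(g_hat(c[va]), A[va]).mean())
    return np.mean(risks)                # cross-validated risk

risks = {f.__name__: cv_risk(f) for f in (fit_constant, fit_binned)}
selected = min(risks, key=risks.get)     # the cross-validation selector
print(selected)
```

Because the true g_0(1 | c) varies strongly with c, the binned estimator achieves a lower cross-validated risk and is selected; with more candidates this is exactly the discrete super-learner step referenced in the article.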

h: Density ratios comparing the intervention-specific (under g*) and observed (under g) distributions of the summary measures. Similarly, we define the analogous quantities for A, with short-hand notation used for the i-specific and common versions.

Analogue point-treatment notation: In the point-treatment analogue, O_i = (W_i, A_i, Y_i), with the common conditional density of Y_i, given (A_i, W_i), the probability density of W, and the common density of A_i, given W_i; the remaining objects (g*, P^{g*}, Ψ, D*, h) are defined analogously.

D = D_W + D_{O|W}: An orthogonal decomposition of D into a function of W and a function of O with conditional mean zero, given W; both components are elements of the tangent space at P of the statistical model M.

P_W: The conditional distribution of O, given W.

P_{N,W}: P_{N,W} f = f(W), where P_{N,W} is the probability distribution of W that puts mass 1 on the observed W.

Pf: f always represents a function of O: Pf = E_P f(O).

P_N f: P_N f = f(O), since P_N represents the probability distribution that puts mass 1 on the observed O.

Z_N(θ): A process indexed by a class of functions {f_θ : θ}, for specified f_θ, which we aim to analyze; in our processes, the candidate (Q, g) plays the role of θ.

f_i, f_{Y,i}, f_{A,i}, f_{W,i}: Given a function f_i of O, we orthogonally decompose f_i − E f_i = f_{Y,i} + f_{A,i} + f_{W,i}, with f_{Y,i} = f_i − E(f_i | A, W), f_{A,i} = E(f_i | A, W) − E(f_i | W), and f_{W,i} = E(f_i | W) − E f_i, where the conditional expectations are taken under the conditional distribution of O, given X.

f_{W,i}, f_{A,i}, f_{Y,i}: The subscript W indicates that the function only depends on O through W and will be centered marginally; A indicates that it only depends on O through (A, W) and will be centered conditional on W; and Y indicates that it is centered to have mean zero conditionally, given (A, W). In addition, we use superscripts to obtain notation for multiple such functions within a single proof, e.g. f^1_{W,i}, f^2_{W,i}. In separate parts of proofs, we often reuse the same notation, so that the same symbol can denote one thing in one proof and another in a different proof.

Z_N^A, Z_N^Y, Z_N^W: Given a mean zero centered process Z_N, we define a corresponding orthogonal decomposition Z_N = Z_N^W + Z_N^A + Z_N^Y, with components obtained by applying the above orthogonal decomposition to the functions indexing the process.

N(ε, F, d): The number of balls of size ε needed to cover F w.r.t. the metric d (the covering number).
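
As an illustration of this definition (a hypothetical finite function class under the sup-norm; the greedy count below is an upper bound on the covering number, not its exact value):

```python
import numpy as np

rng = np.random.default_rng(3)

# Represent each function by its values on a grid; d is the sup-norm.
grid = np.linspace(0, 1, 50)
# F = {f_theta(x) = sin(theta * x) : theta in a finite set}.
thetas = np.linspace(0, 10, 200)
F = np.array([np.sin(t * grid) for t in thetas])

def covering_number(F, eps):
    """Greedy cover: repeatedly pick an uncovered function as a ball
    center and discard everything within sup-distance eps of it; the
    number of centers upper-bounds N(eps, F, sup-norm)."""
    uncovered = np.ones(len(F), dtype=bool)
    n_balls = 0
    while uncovered.any():
        center = F[np.argmax(uncovered)]           # first uncovered f
        dist = np.max(np.abs(F - center), axis=1)  # sup-norm distances
        uncovered &= dist > eps
        n_balls += 1
    return n_balls

n_coarse = covering_number(F, 0.5)
n_fine = covering_number(F, 0.1)
print(n_coarse, n_fine)  # fewer balls suffice at the coarser resolution
```

The growth of N(ε, F, d) as ε shrinks is exactly the entropy quantity that the empirical-process bounds on Z_N(θ) are phrased in.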

Y, A: Y is a set that contains Y_i for all i with probability 1; it is a subset of a Euclidean space of dimension k (constant in N). Similarly, A is a set that contains A_i with probability 1.

‖·‖_ψ: The Orlicz norm of a random variable, implied by a strictly monotone function ψ. We are concerned with bounding the Orlicz norm of the random variable Z_N(θ) uniformly in θ and N.


About the article

Published Online: 2014-01-14

Published in Print: 2014-03-01


Citation Information: Journal of Causal Inference, ISSN (Online) 2193-3685, ISSN (Print) 2193-3677, DOI: https://doi.org/10.1515/jci-2013-0002.
