There is a rich literature on estimation of individual-level causal effects using data from randomized trials and observational studies. In such studies, treatment is “assigned” to individuals in some target population, and one typically collects data on the treatment, covariates, and outcomes of a sample of individuals from this population. Causal effect estimation in this setting aims to make the best use of measured covariates to control for the fact that the treatment is empirically or theoretically a function of such covariates.
Many important causal inference problems, however, require evaluating the causal effects of treatments or exposures assigned at the community level based on data collected from samples of individuals within these communities. In particular, interest in methods for estimating the causal effects of community-based interventions has been driven by the need to evaluate the effectiveness of interventions when delivered on a large scale in realistic settings, as compared to the efficacy of the individual level counterparts of these interventions, which can frequently be evaluated in a classic randomized controlled trial. For example, understanding the impact of an early childhood nutritional intervention on children’s weight-for-age scores or subsequent educational achievement, as compared to the efficacy of nutritional supplements as delivered in the controlled context of an individual randomized trial, requires estimating the effect of the community-wide intervention. The communities of interest in such applications are typically based on a common set of characteristics or connections among their members (e.g., shared neighborhoods, schools, clinics, and work sites).
In reviewing the causal inference literature on the related topic of neighborhood effects, Oakes points out the lack of rigorous discussion addressing causal inference in multilevel settings. There is now a growing body of literature addressing multilevel settings and community level effect estimation (e.g., Small et al., Hansen and Bowers, and Imai et al.), as well as discussions on causal inference in the presence of interference (which we refer to later). However, there is a paucity of literature that addresses the scenario in which data are only available from a very small number of communities (in the extreme case, one community per level of the intervention). Such a scenario arises commonly in practice, either due to logistical or financial barriers to studying a larger number of communities, or because the size of the target population of communities is fixed (see, e.g., the studies reviewed in Varnell et al. and Atienza and King).
Community level causal effects are commonly estimated using one of two model-based approaches: mixed models (a.k.a. conditional models) and population-average models (see, e.g., a comprehensive review of group randomized trials published in American Journal of Public Health and Preventive Medicine from 1998 through 2002 in Varnell et al. ). Both of these methods have limited applicability when data are only available from few communities. At least 10 communities per treatment arm are commonly recommended when generalized estimating equations are used to estimate the parameters of a population average model [7–10], while a minimum of 10–30 communities are generally recommended when fitting mixed models. Further, both mixed models and population average models typically define the effect of the intervention as a function of a parametric data-generating distribution. Background knowledge is rarely if ever sufficient to justify the parametric model employed, and misspecification can result in both biased point estimates and misleading inference.
Oakes points out that “multilevel regression models, no matter how sophisticated, appear unable to identify useful neighborhood effects from observational data,” and proposes randomized community trials as a superior alternative to observational data when estimating the effect of community level interventions. We refer the readers to Imai et al. and Zhang and Small for discussions on designing such experiments when resources permit. However, the success of such designs requires random allocation of the intervention among a large enough number of communities to ensure that all sources of confounding are distributed evenly across treatment arms. The benefits of randomization thus decrease with the number of communities; in the extreme case of one community per level of the intervention, randomization confers no advantage at all and the experiment is de facto an observational study. In this paper, we address definition, identification, and efficient estimation of the causal effect of a community level intervention when data are only available from few communities. We focus on the following data generating experiment: 1) two levels of an intervention (referred to as treatment and control, respectively) are assigned to two communities, either by the investigator or by Nature; 2) a sample of independent units (referred to here as individuals) is drawn from each of the two communities; and 3) covariates known not to have been affected by the intervention (referred to as pre-intervention covariates) and an outcome of interest are measured on each individual. The results presented can also be straightforwardly generalized to the case in which a community level intervention with two possible levels is assigned to a fixed number K of communities.
We hope to contribute to the literature on community-based interventions in several ways. First, we use a rigorous causal inference framework to nonparametrically define a community level causal effect in such settings, and to establish conditions under which this effect is identified as a parameter of the observed data distribution. Specifically, the causal effect of interest is defined in terms of a hypothetical experiment where one first assigns the treatment level of the intervention to both communities and records individual outcomes in the combined population of the two communities, and then one turns back the clock to the moment before the intervention assignment, assigns the control level of the intervention to both communities, and records the individual outcomes of the combined population. We refer to this effect as the marginal causal effect; in addition we consider the causal effect of the intervention in the treated community. In the case of a fixed number K communities, this definition generalizes to the causal effect of the intervention on the individual-level outcome distribution within a target population consisting of the combined population of the K communities included in the study.
Causal effects are commonly defined and identified under a “no interference” assumption [12–15]: the potential outcomes of one observation unit are assumed to be independent of the treatment received by the other units. There is a growing body of literature on causal inference in the presence of interference (e.g., [16–24]) and contagion. Related problems have also been rigorously formulated and studied in Manski, Graham, and Graham et al. In this paper, we make the assumption that there is no treatment interference at either the community or individual levels.
Identification of the causal effects of interest given the two community data generating experiment described is challenging because the treated and control communities differ not only in their intervention levels, but also in a number of “environmental” factors. All individuals within a community share both the same intervention level and the same environment. Any difference between the individual outcomes in the treated and the control communities may thus be due either to an effect of the intervention or to differences in environment. In other words, the effect of the intervention is confounded by any environmental factors that differ between communities and also affect the outcome. When there are only two (or few) communities in a study, these confounders cannot be properly adjusted for, even if measured. In order to establish identifiability of causal effects in such cases, we are forced to assume that one collects pre-intervention individual covariates that are sufficient to “block” the effect of the environment on the outcome of interest (i.e., control for confounding by the environment). We formulate this assumption as an exclusion restriction assumption in a nonparametric structural equation model .
Particular instances of the identifiability conditions we present have been discussed in the context of group randomized trials. Our goal here is to provide an explicit, general, and distribution-free statement of these conditions, with the aim of providing a common platform for discussion of causal inference in this challenging setting. We emphasize that our results are not meant to advocate or justify the use of studies with few communities. On the contrary, we provide a rigorous argument that inferring causal effects with few communities is feasible only in very limited situations, as the identifiability assumptions needed are strong. In situations where the investigator must contend with very few communities, the interpretable identifiability conditions we present may inform study design and assessment of the extent to which results can be interpreted causally.
In addition to nonparametric definition and identification of causal effects for the two community data generating experiment, we present semiparametric efficient estimators for the estimands corresponding to the marginal treatment effect and the treatment effect among the treated under these identifiability results. The targeted maximum likelihood estimators proposed employ estimators of both the distribution of the intervention given individual level covariates, and the conditional expectation of the individual level outcome given intervention level and covariates. They are double robust, in the sense that they remain consistent if either of these nuisance parameters is estimated consistently, and they are efficient if both are estimated consistently. Further, the proposed estimators naturally integrate the state of the art in machine learning through loss-based super learning. The estimators presented have been previously described (see, e.g., [31, 32]). Here, we illustrate their utility for the estimation of a novel causal target parameter and in the context of a novel data generating experiment.
Finally, for the case of one community per treatment arm, we consider a design in which individuals are sampled from the treated and control communities by matching on a set of individual level covariates, thereby aiming to make the different communities similar in their individual covariate distributions. We provide additional conditions under which our target causal parameters will be identified given such a biased sampling scheme. This individual-level matching provides a design alternative in situations where matching at the community-level is not feasible. We present efficient weighted targeted maximum likelihood estimators for these matched cohort designs, by application of general results on semiparametric models for case–control biased sampling in van der Laan , and we evaluate the theoretical gain in information for the causal effects of interest using the matched relative to independent random sampling designs.
1.1 Organization of the article
This article is organized as follows. In Sections 2–4 we address the case in which one follows up a sample of individuals from each of two communities, one of which is assigned a treatment and one a control level of an intervention, and collects on each individual pre-intervention covariates and an outcome. In Section 2, we focus on identification and estimation of the marginal treatment effect. The key identifiability assumption is formulated as an exclusion restriction on an individual level nonparametric structural equation model, and corresponding targeted maximum likelihood estimators (TMLE) are proposed that involve adjustment by pre-treatment covariates measured at the individual level in order to control for the confounding due to different environments. Section 3 presents an analogous identification result and TMLE for the effect among individuals in the treated community. Section 4 evaluates the benefit of matching individuals in the treated community to individuals in the control community, and proposes a TMLE for that type of design. Section 5 uses simulations to illustrate the properties of the proposed estimators for the effect among the treated, and to investigate gains in efficiency made possible through use of a matched cohort design. Section 6 summarizes how the results can be generalized to the case of K fixed communities. Additional generalizations are presented in the technical report by van der Laan.
2 Average causal effect: two communities
We consider an experiment in which a treatment level of an intervention is assigned to the population of one community and a control level of the intervention is assigned to the population of another community. Let A be a variable indicating the two treatment regimens (A = 1 for treatment and A = 0 for control). Let E be a variable indicating environmental factors that differ between the two communities and affect the outcome of interest, where e_0 denotes the level of these factors in the control community and e_1 the level of these factors in the treated community. Typically, a realization of (E, A) is generated as follows: one selects two populations, whose environment defines two e-profiles, and then assigns the control level of the intervention to one of these populations, and the treatment level of the intervention to the other, giving the realized (e_0, 0) and (e_1, 1), respectively, for (E, A). The combined population of the two communities represents the target population of individuals.
2.1 Observed data
Individuals are sampled from both the treatment and control populations, and pre-intervention covariates and a post-intervention outcome are measured on each sampled individual. Let W denote these individual covariates and Y denote individual outcomes in the combined population. We sample units from the treated population, providing i.i.d draws from the conditional probability distributions of given . Similarly, we sample units from the control population, providing i.i.d draws from the conditional distribution of given .
2.1.1 The causal model and target causal parameter
In this subsection we formally define the causal effect using a nonparametric structural equation model (NPSEM), and provide an explicit link between this model and the observed data. This lays the groundwork for defining our target causal effect and addressing its identifiability from the observed data.
2.1.1.1 Nonparametric structural equation model
Since our target population of units is the combined population in the given study, we assume that . We use the variable to denote an individual’s community and its intervention level. We assume the following causal NPSEM to describe the mechanism for assigning to a sampled individual from the target population, and for generating covariate and outcome data on that individual.
The NPSEM for an individual, with endogenous variables , is given by:
with denoting the exogenous variables. This encodes an experiment where an individual is drawn by first selecting which community E to draw from and what treatment A is assigned to that community; then, for this given , one has a fixed distribution on the errors of individual level variables; individual covariates and outcomes are drawn according to these error distributions. Note that it is assumed here that individual errors are i.i.d. draws from the environment–treatment-specific . Further, we assume that the individual covariates W are measured before the community intervention was implemented or, more generally, that they are not affected by the community intervention. Therefore, W is only a function of E, and not of A. This NPSEM defines a random variable on an individual. We denote its distribution by .
We define the counterfactual as , which is the outcome of an individual had the community been assigned while the environment E remains the same, that is, , and the individual’s covariates W are kept fixed. Note that the outcome error is now , since the change in intervention is implemented at the community level and therefore affects the choice of environment–treatment-specific error. Similarly, we define , , and .
2.1.1.2 Link to observed data
Consider the study design presented above where one draws individuals from the treated community, and individuals from the control community. Thus, the observed data consists of i.i.d. draws from the -specific counterfactual and i.i.d. draws from the -specific counterfactual . Let denote the distribution of that results from restricting the above NPSEM by setting . This way, the observed data on the individuals can be represented as n i.i.d. copies of whose probability distribution P is implied by . That is, one first draws according to the Bernoulli distribution ; then given , one draws implied by the NPSEM with individual level errors drawn from . This representation of the two observed samples as n i.i.d. draws simplifies our presentation.
Our goal now is to define the causal effect of interest on the NPSEM, as a parameter of the distribution , and then, under certain additional assumptions on the NPSEM, establish identifiability of this causal effect from the observed data distribution .
2.1.1.3 Target parameter on the NPSEM
This NPSEM allows us to define the outcome distribution under set intervention, keeping the selection of the environment random, and to define corresponding causal effects of the intervention on the outcome.
We define our causal parameter of interest in the NPSEM for as
where the reader is reminded that is the counterfactual defined by setting . This additive causal effect of A on Y corresponds to a hypothetical experiment where one first assigns treatment to both communities, records the individual outcomes, then turns back the clock and assigns control to both communities and records the individual outcomes. As is apparent from the NPSEM, this causal quantity assesses the effect of the intervention in the context of the given environments and . That is, this effect is conditional on the values of the given and , and should not be generalized to situations outside of the study without further assumptions.
2.2 Identifiability of target causal effect from observed data
We next address the identifiability of from the probability distribution of O. For this purpose, we make the following assumption on the NPSEM: we assume that E affects Y only via its effects on W. More specifically, we assume
(No direct effect of environment on individual outcome).
It follows that . Hereafter, we also refer to this as the “no residual environmental confounding” assumption. In addition, we will assume that is independent of , .
2.2.1 Heuristics behind “no residual environmental confounding” assumption
The idea behind this assumption is that , although common to all units in community j, results in unit specific effects of on Y, which is some function of characteristics C of the unit and . Suppose we are able to observe this particular function of the characteristics C of the unit and for each unit, so that it is captured by W: for example, is this particular function of and the characteristics. Similarly, is this same function of and the characteristics of the unit in the control population. By controlling for , we are then able to control for the difference in environments between the two populations (i.e., ) at the individual level.
Let us consider an example. Consider a study that is interested in evaluating the causal effect of an intervention such as a community-wide program to improve early childhood nutrition, consisting, for example, of community-wide educational outreach and availability of vouchers for nutritional supplements. Consider two cohorts of children from two different regions. In one region all children are exposed to the intervention in the sense that they live in a community in which the media outreach is implemented and vouchers are available. Note that this community-level definition of exposure is chosen to reflect the policy question of interest, and differs from possible individual-level definitions of the exposure, such as use of a voucher by a child’s caregiver. The outcome measured at the individual level is weight for age at a specific time point following implementation of the intervention. A simple comparison of weight-for-age scores between the two cohorts is problematic since it is known that access to high quality food sources is different between the two regions (e.g., because the treatment region is located in greater proximity to a major trading center). Thus, the environments in the two regions differ in ways expected to affect the chance that a child has a low weight-for-age score: i.e., there is a higher probability of being underweight in one cohort than in the other.
Which covariates should we measure to block this effect of the different region-specific access to food? Suppose we measure at the individual level the availability of food in a child’s household prior to implementation of the intervention. We include this covariate as a component of W, giving us a component of and . One would expect that this covariate will help to block the effect of the differential access to food in the two regions on individual child nutrition, and thereby make the “no residual environmental confounding” assumption more reasonable. In other words, one might argue that a person in the treated community and a person in the control community who have the same value for this “blocking” covariate (and for other pre-treatment covariates), would have the same probability of developing a low weight-for-age score over the course of the study, were the communities to receive the same level of the intervention.
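The heuristic above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' simulation: a single binary blocking covariate (household food availability), made-up probabilities and effect sizes, and an outcome that depends on the community only through the intervention level and the covariate. All names (`simulate`, `naive_difference`, `adjusted_difference`) are hypothetical.

```python
import random

def simulate(n=200_000, seed=0):
    """Simulate the two-community design with one binary blocking covariate.

    Assumed (illustrative) numbers: food availability W is more common in the
    treated community, and the environment affects Y only through W.
    """
    rng = random.Random(seed)
    true_effect = 0.10
    p_w = {1: 0.8, 0: 0.3}  # P(W = 1 | community); differs by environment
    rows = []
    for _ in range(n):
        a = 1 if rng.random() < 0.5 else 0     # community = intervention level
        w = 1 if rng.random() < p_w[a] else 0  # covariate depends on environment only
        y = 0.2 + true_effect * a + 0.5 * w + rng.gauss(0.0, 0.1)
        rows.append((a, w, y))
    return rows, true_effect

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_difference(rows):
    """Unadjusted difference in mean outcome between the two communities."""
    return (mean(y for a, _, y in rows if a == 1)
            - mean(y for a, _, y in rows if a == 0))

def adjusted_difference(rows):
    """Stratum-specific differences averaged over the pooled distribution of W."""
    total = 0.0
    for w in (0, 1):
        p_w = mean(1.0 if wi == w else 0.0 for _, wi, _ in rows)
        diff = (mean(y for a, wi, y in rows if a == 1 and wi == w)
                - mean(y for a, wi, y in rows if a == 0 and wi == w))
        total += p_w * diff
    return total

rows, truth = simulate()
naive = naive_difference(rows)
adjusted = adjusted_difference(rows)
print(f"truth={truth:.3f} naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this sketch the unadjusted comparison mixes the intervention effect with the difference in food availability, whereas the covariate-adjusted comparison is close to the assumed effect. The agreement relies entirely on the outcome equation not depending on the community directly, which is the untestable part of the argument.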
2.2.2 Formal statement of identifiability result
We state the identifiability result formally as a theorem.
Theorem 1. NPSEM. Consider an NPSEM with structural equations for the endogenous variables :
and exogenous variables . Suppose the marginal distribution is known.
Counterfactuals. For , let denote the counterfactuals corresponding with setting . We also define as the counterfactuals corresponding with setting and . Finally, we define as the counterfactual of Y corresponding with intervention .
Observed data. Let , where . Conditional on , the pair is distributed as . In particular, we note that the distribution of B in the observed data equals the marginal distribution of E. We also note that . Let be the probability distribution of O. We observe n i.i.d. observations on O.
Relevance to two sample problem. We note that the distribution of also approximates the two sample experiment in which one samples i.i.d. observations from , and i.i.d. observations from , with .
Target parameter on NPSEM. Consider the following parameter of the distribution of :
Exclusion and Randomization assumptions on NPSEM. Suppose the following hold:
No direct effect: E has no direct effect on Y. In other words, for , and .
Assume is independent of for .
Identifiability Result. Then,
Proof. Firstly, for the full data parameter, we have
where we used that W is independent of , which holds since is independent of and thereby is independent of . We note that .
Consider now the parameter of observed data. Since, given , is distributed as , we have
where we use that is independent of .
Similarly, . In addition, involves averaging w.r.t .
Thus, this proves
which is thus identical to . This completes the proof. □
2.2.3 Commitment to statistical parameter and model for observed data
The model for the probability distribution of implied by the NPSEM is nonparametric. Based on theorem 1, we propose the following target parameter of the distribution of the observed data structure with B Bernoulli in :
This statistical parameter corresponds to a statistical experiment where, for each stratum of the individual covariate in the combined population, one obtains the difference between the expected outcome for the treated community and the expected outcome for the control community, then averages this difference over the covariate distribution in the combined population. At face value, this statistical parameter provides a nonparametric treatment effect measure. The causal interpretation of this treatment effect measure, however, is contingent on the non-testable assumptions A1–A2. In particular, with respect to A1, we note that even if W does not succeed in capturing the complete effect of e on the unit-specific outcome Y (in other words, one is not able to establish identifiability of the target causal parameter), adjusting for a rich set of covariates W will still help to remove some of the difference in outcome distributions of Y that is purely due to the differences between the two environmental profiles. Thus, the statistical estimand defined above, fully adjusted for W, may still be of interest as the closest possible approximation to the desired causal effect. Of course, in such a case, the resulting estimate should be carefully and transparently interpreted as at best an approximation.
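For reference, the verbal description above can be written as a single display. The symbols $\Psi$, $P_0$, and $E_0$ are assumed here, since the extracted text does not preserve the original notation; B, W, and Y are as in the text.

```latex
\Psi(P_0) \;=\; E_{W}\Bigl[\, E_0(Y \mid B = 1, W) \;-\; E_0(Y \mid B = 0, W) \,\Bigr]
```

The outer expectation is taken over the marginal distribution of W in the combined population, and the inner conditional expectations are the outcome regressions in the treated (B = 1) and control (B = 0) communities.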
2.3 Estimation and inference
The targeted maximum likelihood estimator of the statistical parameter defined above has been defined previously, and statistical inference for this estimator has been described (see, e.g., van der Laan and Rubin and van der Laan and Rose for the targeted MLE, and van der Laan and Gruber and Gruber and van der Laan for the collaborative targeted MLE).
Since we have arrived at the statistical estimation stage, we will denote and the pooled sample with , . We will also use to denote the realization , for . Note that the data generating distribution is determined by the marginal distribution of W, the conditional distribution of B, given W, and the conditional distribution of Y, given . The parameter depends on through both and , as well as the “treatment” mechanism . We will denote the treatment mechanism with and the other two factors of the likelihood with .
Implementation of the estimator begins with application of a data adaptive algorithm such as super learner (van der Laan et al. , Polley and van der Laan , van der Laan and Rose ) to fit . This initial estimate is subsequently updated using targeted maximum likelihood estimation. The marginal distribution of W is estimated with the empirical of the pooled sample , . The targeted maximum likelihood estimate also requires a fit of . The estimator is double robust in the sense that it remains unbiased if one either consistently estimates or . The estimator is efficient if the initial estimator is consistent and is consistent as well. If is misspecified but is consistent, the estimator can be either super efficient or inefficient, depending on the limit of . The targeted maximum likelihood estimator can be further refined with the collaborative targeted maximum likelihood estimation method, resulting in a collaborative double robust estimator which has generally better finite sample efficiency and is consistent under weaker conditions.
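The targeting step described above can be sketched as follows. This is a minimal illustration, assuming an outcome bounded in [0, 1], a logistic fluctuation with the usual clever covariate B/g(W) − (1 − B)/(1 − g(W)), and initial fits supplied by the caller (for instance from super learner). The function name `tmle_ate` and its arguments are hypothetical, and the sketch omits the collaborative refinement.

```python
import math

def _logit(p):
    return math.log(p / (1.0 - p))

def _expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def _clip(p, lo=1e-6):
    # Keep probabilities away from 0 and 1 so that logit is finite.
    return min(max(p, lo), 1.0 - lo)

def tmle_ate(y, b, q1, q0, g, max_iter=50, tol=1e-10):
    """One-step TMLE sketch for E_W[E(Y|B=1,W) - E(Y|B=0,W)].

    y  : outcomes in [0, 1]
    b  : community indicators (1 = treated, 0 = control)
    q1 : initial fits of E(Y | B=1, W_i)
    q0 : initial fits of E(Y | B=0, W_i)
    g  : fits of P(B=1 | W_i)
    """
    n = len(y)
    q1 = [_clip(v) for v in q1]
    q0 = [_clip(v) for v in q0]
    g = [_clip(v) for v in g]
    # Clever covariate evaluated at the observed community indicator.
    h = [bi / gi - (1 - bi) / (1.0 - gi) for bi, gi in zip(b, g)]
    offset = [_logit(a if bi == 1 else c) for bi, a, c in zip(b, q1, q0)]
    # Newton-Raphson for the single fluctuation parameter epsilon in
    # logit Q_eps = logit Q + eps * H, using the binomial log-likelihood.
    eps = 0.0
    for _ in range(max_iter):
        p = [_expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (yi - pi) for hi, yi, pi in zip(h, y, p))
        info = sum(hi * hi * pi * (1.0 - pi) for hi, pi in zip(h, p))
        if info == 0.0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    # Updated regressions under each community, then the plug-in estimate
    # averaged over the empirical distribution of W in the pooled sample.
    q1_star = [_expit(_logit(a) + eps / gi) for a, gi in zip(q1, g)]
    q0_star = [_expit(_logit(c) - eps / (1.0 - gi)) for c, gi in zip(q0, g)]
    psi = sum(a - c for a, c in zip(q1_star, q0_star)) / n
    return psi, eps, q1_star, q0_star

# Toy data: uninformative initial fits, balanced treatment mechanism.
y = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 1, 0, 0, 0, 1, 0]
psi, eps, q1_star, q0_star = tmle_ate(y, b, [0.5] * 8, [0.5] * 8, [0.5] * 8)
print(psi)
```

With a correctly specified treatment mechanism, the update moves a poor initial outcome fit toward one that solves the clever-covariate score equation, which is the mechanism behind the double robustness described above.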
The double robustness of the targeted maximum likelihood estimator in terms of the components of the distribution of observed data structure , translates into the following robustness in terms of the underlying counterfactual distributions and of the two samples. Firstly, we note that . We also define , , and we note that
Thus, the double robustness of the targeted MLE for estimation of in a nonparametric model for in terms of can be restated as follows: the targeted MLE will be consistent if either the outcome regressions on the covariates are consistently estimated for both samples, or if the ratio of the covariate distributions for the two samples is consistently estimated. In particular, the identifiability condition almost everywhere (a.e.) translates into , and that the Radon–Nikodym derivatives and for the covariate distributions are bounded. Thus, if a covariate can have a certain value in the treated population, then that value should also occur in the control population, and vice versa.
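The link between the treatment mechanism and the ratio of covariate distributions follows from Bayes' rule. In the display below, $g_0(W) = P_0(B = 1 \mid W)$ and $P_{W \mid B = b}$ are assumed notation for the treatment mechanism and the community-specific covariate distributions, since the original symbols are not preserved in the extracted text.

```latex
\frac{g_0(W)}{1 - g_0(W)}
  \;=\;
  \frac{P_0(B = 1)}{P_0(B = 0)} \cdot
  \frac{dP_{W \mid B = 1}}{dP_{W \mid B = 0}}(W)
```

Because the sampling proportion $P_0(B = 1)$ is known or trivially estimated, estimating the treatment mechanism is equivalent to estimating the density ratio of the covariates in the two communities, and $0 < g_0(W) < 1$ holds exactly when this ratio and its reciprocal are finite.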
Statistical inference for the targeted MLE can be based on an estimate of the efficient influence curve for in the nonparametric model for , given by
That is, one can estimate the asymptotic variance of the targeted MLE with and an asymptotic 0.95-confidence interval for is given by , where is the estimate of the efficient influence curve obtained by substituting the estimates for .
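The influence-curve-based interval can be sketched in a few lines. The sketch assumes the standard form of the efficient influence curve for the W-adjusted difference in a nonparametric model, namely (B/g − (1 − B)/(1 − g))(Y − Q(B, W)) + Q(1, W) − Q(0, W) − ψ; the display in the extracted text is missing, so this form and the names `ate_influence_curve` and `wald_interval` are assumptions.

```python
import math

def ate_influence_curve(y, b, q1, q0, g, psi):
    """Evaluate the (assumed) efficient influence curve at each observation."""
    ic = []
    for yi, bi, q1i, q0i, gi in zip(y, b, q1, q0, g):
        q_obs = q1i if bi == 1 else q0i              # fit at the observed community
        h = bi / gi - (1 - bi) / (1.0 - gi)          # inverse-probability weight
        ic.append(h * (yi - q_obs) + (q1i - q0i) - psi)
    return ic

def wald_interval(psi, ic, z=1.96):
    """psi +/- z * sigma_n / sqrt(n), with sigma_n^2 the sample variance of the IC."""
    n = len(ic)
    mean_ic = sum(ic) / n
    var_ic = sum((v - mean_ic) ** 2 for v in ic) / n
    se = math.sqrt(var_ic / n)
    return psi - z * se, psi + z * se

# Toy inputs: targeted fits equal to the arm-specific means, balanced design.
y = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 1, 0, 0, 0, 1, 0]
ic = ate_influence_curve(y, b, [0.75] * 8, [0.25] * 8, [0.5] * 8, psi=0.5)
lower, upper = wald_interval(0.5, ic)
print(lower, upper)
```

With only eight observations the interval is of course very wide; the toy data are there to make the arithmetic easy to follow rather than to suggest a realistic sample size.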
3 Causal effect among the treated: two communities
Using the same NPSEM and notation as in Theorem 1, we now consider an alternative target causal parameter, referred to as the treatment effect among the treated.
This causal parameter corresponds to an effect assessed by an ideal experiment where after recording the individual outcomes of the treated community, the investigator can turn back the clock and implement the control intervention on this same community, and record its individual outcomes under the control regimen.
A similar argument as in theorem 1 shows that if the following assumptions on the NPSEM hold:
No direct effect: E has no direct effect on Y, that is, .
Randomization: is independent of , ,
We consider estimation and inference for the alternative target statistical parameter implied by this result:
As we discuss below, the identifiability of the effect among the treated has weaker requirements for data support than that of the marginal effect. More specifically, one only needs that the support of the covariate distribution in the control community contains the support of the covariate distribution in the treated community. Moreover, in the following sections, we show that an individually matched cohort design is particularly well suited for targeting this effect among the treated.
3.1 Targeted MLE for the causal effect among the treated
In this subsection we consider estimation of the statistical parameter eq.  defined above. Similar to the previous section, we denote the observed data as i.i.d. observations of , where B is binary. We make no assumptions on the data generating distribution ; that is, we assume a nonparametric statistical model.
3.1.1 The efficient influence curve of the target parameter
3.1.2 Double robustness of efficient influence curve
This efficient influence curve of can be represented as an estimating function , where we suppress the dependence on the scalar . We note that this estimating function is double robust in the sense that it is an unbiased estimating function for , if either Q is correctly specified or g is correctly specified. Formally, this is stated as
and a.e. Here we recall the notation . This double robustness result can be explicitly verified.
One can use this estimating function to define a closed form asymptotically efficient double robust estimator defined as the solution of the efficient influence curve estimating equation,
given estimators of and of .
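A closed-form solution of this kind can be sketched as follows. The sketch assumes the standard efficient influence curve for the effect among the treated, in which the control observations are weighted by g(W)/(1 − g(W)); the corresponding display is missing from the extracted text, so this form is an assumption, and the function name `att_estimating_equation` is hypothetical. Under that form, the fit of the outcome regression in the treated community cancels, and the solution depends on the outcome regression only through its fit in the control community.

```python
def att_estimating_equation(y, b, q0, g):
    """Solve the (assumed) EIC estimating equation for the effect among the treated.

    y  : outcomes
    b  : community indicators (1 = treated, 0 = control)
    q0 : fits of E(Y | B=0, W_i)
    g  : fits of P(B=1 | W_i), required to be strictly below 1
    """
    n_treated = sum(b)
    total = 0.0
    for yi, bi, q0i, gi in zip(y, b, q0, g):
        residual = yi - q0i
        if bi == 1:
            total += residual                       # treated: Y - Q(0, W)
        else:
            total -= gi / (1.0 - gi) * residual     # controls: odds-weighted residual
    return total / n_treated

y = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 1, 0, 0, 0, 1, 0]
g = [0.5] * 8

# Correct control-arm regression (the control mean is 0.25).
psi_good_q = att_estimating_equation(y, b, [0.25] * 8, g)
# Badly misspecified regression, but correct treatment mechanism.
psi_bad_q = att_estimating_equation(y, b, [0.0] * 8, g)
print(psi_good_q, psi_bad_q)
```

In this toy example both calls return the same value, which illustrates the double robustness property: the weighted control residuals correct for a misspecified outcome regression when the treatment mechanism is right.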
3.1.3 Targeted maximum likelihood estimator
We next explain the targeted maximum likelihood algorithm that maps an initial estimator into a targeted fit . Suppose Y is bounded in the unit interval. Given an initial estimator of , an initial estimator of , and the empirical distribution of W, in order to compute the targeted MLE we define the fluctuation , and , where
These two one-dimensional fluctuations of the regression and the treatment mechanism represent a fluctuation of , where the empirical distribution of W is held fixed. The empirical distribution is already unbiased for the parameter of interest so that no fluctuation is needed. We estimate with maximum likelihood: note that is estimated with standard linear logistic regression fixing as an offset, and is estimated with standard linear logistic regression fixing as an offset in the logistic regression model for .
This maximum likelihood estimator now defines an update . This targeted maximum likelihood updating is iterated till convergence and the final , identified by a (and the empirical for ), is called the targeted maximum likelihood estimator of the distribution , while is called the targeted maximum likelihood estimator of . The targeted maximum likelihood estimator solves the efficient influence curve estimating equation,
Note that we can use machine learning/super learning to obtain the initial (i.e., and ).
Since solves, in particular,
it follows that the targeted MLE can also be evaluated as
that is, as the empirical mean of among the observations with . This evaluation makes use of only through .
4 Matching the treated and control community at individual level
We now understand the estimator of the statistical parameter and the conditions under which this statistical parameter equals the target additive causal effect (marginally or among the treated). From this, we conclude the importance of measuring individual characteristics that can “block” the effect of the two communities’ differing environments (i.e., and ) on the individual outcome, so that units in the treated population with are exchangeable with units in the control population with with respect to their counterfactual outcome distributions. However, if is very different from , then the covariate distributions and will also be very different, thereby possibly generating lack of experimentation for : i.e., may approach 0 or 1 for some W-values. Even if such imbalances do not result in non-identifiability, large imbalances can still increase the asymptotic variance of the targeted MLE; in other words, the variance of the efficient influence curve for increases when the covariate distributions and become more separated. As a consequence, even if all the wished W can be measured so that the effect of E on Y can be blocked, it is still crucial that the two populations are fairly comparable with respect to the factors e that have an impact on the outcome. The more comparable the communities, the smaller the asymptotic variance of the targeted MLE adjusting for W will be.
Despite the desirability of comparable treatment and control communities, achieving this comparability is often not feasible. In particular, the investigator may have minimal control over selection of the study communities. Nonetheless, in such settings, the investigator may retain some control over how individuals from the study communities are sampled. This suggests the potential utility of an individually matched cohort design, in which a unit from the treated population is matched with a unit from the control population based on a set of individual level variables that are not affected by the treatment.
4.1 Two target parameters: average causal effect, and average causal effect among treated
Recall that we defined a random variable representing the data on a random draw from the combined population of the two communities, and representing the two sample problem of sampling observations from and observations from . We work with the random experiment defined by O because it allows us to view the data set as one sample of i.i.d observations, while we fully respect the true two-sample estimation problem. The model for is nonparametric.
As above, we consider the following two target statistical parameters of
We saw previously that under the exclusion and randomization assumptions and , equals , the marginal causal effect of the intervention on the individuals in the combined population of the two communities, and equals , the causal effect on the individuals of the treated community. As we will see, the latter parameter is easier to identify from the data and makes the matched cohort design (defined below) particularly effective and optimal.
Recall that the efficient influence curves for and are given by
4.2 Matched cohort sampling
As an alternative to the two sample design considered above, which we treat as the equivalent of sampling i.i.d. copies of , we consider the following J-to-1 matched cohort sampling:
Let be a subset of the individual level covariates W that we will match on.
Sample from the conditional distribution of , given . Let denote the observed value of .
Sample J times from the conditional distribution given and .
be the cluster of matched observations.
Repeat this experiment n times, resulting in n clusters , .
Note that the dependence of observations within a cluster is only due to the matching on variable M: e.g., if the matching variable is empty, each cluster consists of independent copies.
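The sampling steps above can be sketched in code. This is only a minimal illustration under stated assumptions: the two communities are represented by finite arrays of matching-variable values, and drawing with replacement from those arrays stands in for sampling from the conditional distributions. The function name `matched_cohort_sample` and its arguments are hypothetical.

```python
import numpy as np

def matched_cohort_sample(m_treated, m_control, n, J, seed=0):
    """Draw n clusters, each holding one treated index and J matched control indices.

    m_treated : matching-variable value for each unit in the treated community
    m_control : matching-variable value for each unit in the control community
    (Finite populations are an assumption made for this sketch.)
    """
    rng = np.random.default_rng(seed)
    m_treated = np.asarray(m_treated)
    m_control = np.asarray(m_control)

    # Index the control community by matching category once.
    pools = {m.item(): np.flatnonzero(m_control == m) for m in np.unique(m_control)}

    clusters = []
    for _ in range(n):
        i = int(rng.integers(len(m_treated)))    # draw a treated unit
        m = m_treated[i].item()                  # its observed matching value
        pool = pools.get(m)
        if pool is None:
            raise ValueError(f"no control units with matching value {m!r}")
        # Draw J controls that share the same matching value.
        controls = rng.choice(pool, size=J, replace=True)
        clusters.append((i, controls.tolist()))
    return clusters
```

Within a cluster, the treated unit and its controls are linked only through the shared matching value, which is the dependence structure noted above.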
4.3 Estimation in matched cohort designs
Matched cohort designs provide a biased sample from the distribution, , so that a new identifiability result is required since the previous identifiability results were based on sampling i.i.d. copies of . Targeted maximum likelihood estimation, and efficient estimation in general, based on this type of biased sampling, including matched case–control/cohort sampling, was studied in van der Laan  and Rose and van der Laan . This previous work assumed that the following quantities are known
Knowledge of these quantities allows one to identify any parameter that would have been identifiable under regular i.i.d. sampling of . Therefore, this knowledge allows us to target the causal effect parameters and of interest.
The case–control weighted targeted MLE is now defined by applying the targeted MLE of or presented above, based on i.i.d. sampling of , but giving each observation a weight. The observations with are assigned weights . The J observations with , which were matched to a , receive weights . The resulting case–control weighted targeted MLE now targets the same parameter or , is asymptotically efficient, and has the same double robustness property as the targeted MLE applied to the i.i.d. sampling of . The efficient influence curve of , for this matched cohort sampling model, as given below, can as usual be used for statistical inference based on the case–control weighted targeted MLE.
4.3.1 Knowledge needed to determine weights used to correct for matched sampling
Suppose that one can determine for each matching category m the proportion of units that have in the two populations/communities. This yields, , for each m. In addition, we can set , which corresponds with in the NPSEM and thereby affects the interpretation of the marginal causal effects. This particular choice corresponds with the sampling actually used, and thereby is well supported by the data, but other choices can be accommodated as well. For example, if one aims to target the combined population, while are not proportional to population size, then is different from . Of course, the required weights are now determined by the Bayes rule.
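A minimal sketch of the Bayes-rule computation follows. It assumes the weighting scheme of the cited case–control work as we understand it: treated units receive weight q0 = P(B = 1), and each of the J matched controls receives q0 · P(B = 0 | M) / P(B = 1 | M), divided by J. The exact expressions are not reproduced in this text, so the form used here, the function name `matched_control_weights`, and its arguments are all assumptions.

```python
def matched_control_weights(p_m_treated, p_m_control, q0, J=1):
    """Weights correcting for matched cohort sampling (assumed form).

    p_m_treated : dict mapping matching value m to P(M = m | B = 1)
    p_m_control : dict mapping matching value m to P(M = m | B = 0)
    q0          : chosen marginal probability P(B = 1)
    J           : number of controls matched to each treated unit
    Returns the treated-unit weight and a dict of control weights by m.
    """
    if not 0.0 < q0 < 1.0:
        raise ValueError("q0 must lie strictly between 0 and 1")
    control_weights = {}
    for m, p1 in p_m_treated.items():
        if p1 <= 0.0:
            raise ValueError(f"P(M = {m!r} | B = 1) must be positive")
        p0 = p_m_control.get(m, 0.0)
        # Bayes rule: P(B = 1 | M = m)
        p_b1_given_m = q0 * p1 / (q0 * p1 + (1.0 - q0) * p0)
        # Control weight before dividing across the J matched controls
        q0_bar = q0 * (1.0 - p_b1_given_m) / p_b1_given_m
        control_weights[m] = q0_bar / J
    return q0, control_weights
```

Under this assumed form, the control weights average to 1 - q0 over the matching distribution in the treated community, so the total weight on a cluster with J = 1 averages to one.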
4.4 Evaluating gain of matching cohorts, relative to no-matching of the two cohorts
In van der Laan  it is shown that the efficient influence curve for the parameter based on sampling the cluster equals a “case–control”-weighted efficient influence curve for the parameter based on sampling the data structure O. That is,
for the marginal causal effect , and
for the causal effect among the treated . This design includes the “no-matching” choice by setting M equal to the empty set, and , in which case , and the case and control observations in the cluster are now independent. That is, if we set M empty, then this design corresponds with our original two i.i.d. samples study design.
4.4.1 Evaluating the benefit of matching in the design
The variance of the efficient influence curve for a parameter is the information bound for that parameter in the semiparametric model. As a consequence, any regular estimator has a larger asymptotic variance than the variance of the efficient influence curve, and an estimator is asymptotically efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve.
Therefore, by studying the variance of the efficient influence curve of we can investigate if matching decreases the variance relative to the no-matching design, and thereby increases the amount of information generated by the matching design for the purpose of estimation of .
To consider the comparison of a matching design with no-matching, we focus on the case that , since the argument should not depend on the number of controls that are matched to the case. It follows from the formulas above that the efficient influence curve for based on sampling the cluster observation is given by
and the efficient influence curve for based on sampling the cluster observation is given by:
First, note that if M is empty, and thereby , then and correspond with i.i.d. sampling of and the respective efficient influence curves and , for and given above. In other words, and simply combine observations from control and treatment sample in the cluster , but if the observations in a cluster are independent, this coupling serves no purpose beyond allowing us to compare the efficient influence curve under matching with the efficient influence curve under the original no-matching two sample design. Therefore, the question “Does matching improve the design with respect to estimation of the target parameter?” corresponds with “Do the variances of the efficient influence curves and become smaller as the subset M becomes closer to W?”
To answer this question we simply write down the efficient influence curves for the cases of , and , for both target parameters. With a slight abuse of notation, we write and to stress that these are related to the conditional probability distribution of B, given W. This means that we can denote .
Marginal Causal effect, no matching ():
Marginal Causal effect, partial matching ():
Marginal Causal effect, full matching ():
where with probability 1. We note that the inverse weighting by and is reduced to inverse weighting by only, due to the matching. Therefore, it seems that matching reduces the variance in many cases and, at the least, weakens the required identifiability condition to only a.e. Explicit calculations, not carried out here, will have to provide more support for this claim.
Causal effect among treated, no matching ():
Causal effect among treated, partial matching ():
Causal effect among treated, full matching ():
Note that the estimator that solves the efficient influence curve equation is given by , a difference of sample means between the two groups, where the observations are from a subject with the same covariate W. This suggests strongly that the efficient influence curve for the full-matching design has the smallest variance, thereby establishing the benefit of matching for the purpose of estimation of .
4.4.2 Remark: matched case–control sampling vs. matched cohort sampling
It was shown in Rose and van der Laan , by practical demonstration in simulation studies, that matched case–control sampling designs typically provide less information about the parameter of interest than non-matched case–control designs. On the other hand, we see from the above formulas that matched cohort sampling is likely to improve upon unmatched cohort sampling. This distinction can be understood by noting that in order for matching to be effective, the weights ought to cancel or stabilize the inverse treatment weightings in the i.i.d.-based efficient influence curve. In matched cohort sampling, we see that this is indeed the case since one conditions on the “treatment” B. In matched case–control sampling, however, one conditions on the outcome; the weights therefore introduce additional inverse outcome weighting without stabilizing the treatment weights.
4.4.3 Summary: optimizing the design
If one can select two populations for which the pre-treatment covariate distributions are almost equivalent, that is, , while W blocks any effect of E on Y, then that implies that , and thereby will result in an excellent information bound for any target parameter . On the other hand, if this is not possible, then one still has the good option of using a matched cohort design, and targeting the causal effect for the treatment-population, .
5 Simulation study
In previous sections we showed that under the exclusion and randomization assumptions and , causal effects of the intervention are identifiable from the observed data of the combined population. Moreover, the resulting statistical parameters are equivalent to those arising from a causal model in which the intervention is assigned at the individual level. Properties of the targeted MLE for this estimand, which corresponds to the marginal treatment effect under identifiability results presented here, have been illustrated in previous work (e.g., ). In this section, we illustrate the double robustness of the targeted MLE for the estimand corresponding to the causal effect among the treated, under both the independent sampling and matched cohort sampling two community designs. In addition, we use simulations to investigate cases in which the matched cohort design provides an efficiency gain over an independent sampling design when using a substitution estimator.
5.1 Data generation
We evaluated the relative efficiency of the matched cohort design, compared to independent sampling, while varying the extent to which a matching variable affects the outcome Y and the extent to which it predicts the intervention A (or equivalently, community membership). We considered 3 levels (low, medium, high) for each of these parameters, resulting in 3 × 3 = 9 data generating distributions. The data generating process for an individual i is described below.
where ranges over three distributions with different effects of E on :
and the outcome conditional mean ranges over three functions with different effects of on Y:
As mentioned previously, when one only has two communities, the causal effect assessed is in terms of the two observed communities. Therefore, we fixed the observed communities to be , and assigned to community using a single Bernoulli trial. The outcome of this trial determined the distributions of individual covariates and outcome in each community, as well as the value of the true causal effect parameter. All samples were drawn from these distributions.
5.2 Implementation of sampling schemes
Independent sampling was implemented as independent draws with replacement from each community j. We used a balanced design with , so that there were independent observations under the independent sampling design.
Matched cohort sampling with one-to-one matching was implemented as follows: we drew an independent sample of size from the treated community; for each observation i in this sample, with , we then drew an individual from the control population by drawing conditional on (call its realization ) and conditional on . The true probability distribution was used in computing the weights for each matched control. There were independent pairs of matched observations under this matched design.
5.3 Estimators
The following estimators were applied to each dataset. Note that TMLE was implemented with the logistic fluctuation, with the proper linear transformations.
: Substitution estimator using the estimator of which only adjusts for A but not individual level covariates.
: TMLE with initial estimator and a correctly specified , which is obtained using the correctly specified and Bayes rules.
: Substitution estimator using a misspecified estimator of :
: TMLE with initial estimator and a correctly specified g.
: Substitution estimator using the correctly specified estimator of .
: TMLE with initial estimator and a misspecified treatment mechanism .
: TMLE with initial estimator and a correctly specified g.
5.4 Results: double robustness of TMLE
We illustrate the double robustness of the TMLE for the matched and independent designs, respectively, using the data generating distribution with and . The value of the causal effect among the treated in these simulations is 23.48. For each sampling design and for two choices of (100 and 1,000), bias, variance, and mean squared error (MSE) were estimated over 1,000 datasets. The results in Tables 1 and 2 demonstrate that when g was correctly specified, the targeting step reduced the bias of a misspecified initial outcome model. Further, the mean squared error of the unbiased estimators decreased roughly in proportion to 1/n.
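The performance measures used here can be computed from a vector of replicate estimates as in the following sketch. The function name `simulation_summary` is hypothetical, and the variance uses the divide-by-count form so that the MSE decomposes exactly into squared bias plus variance; the article does not state which form was used.

```python
import numpy as np

def simulation_summary(estimates, truth):
    """Bias, variance, and MSE of replicate estimates around a known true value."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    variance = est.var()                    # ddof=0, so mse == bias**2 + variance
    mse = np.mean((est - truth) ** 2)
    return {"bias": float(bias), "variance": float(variance), "mse": float(mse)}
```

For an unbiased estimator the MSE equals the variance, so a tenfold increase in sample size would be expected to cut the MSE by about a factor of ten.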
5.5 Results: when does the matched cohort design provide efficiency gain?
We investigated the relative efficiency of the matched cohort design as compared to the independent design using targeted and non-targeted substitution estimators. For sample sizes of and for each data generating distribution, we compared the ratio of the sample variances of each estimator over 1,000 datasets under independent versus matched sampling.
Results of these simulations (Table 3) show that in general there was no loss of efficiency as long as data from the matched design were analyzed with the targeted estimator. The matched design, when analyzed with the targeted estimator, provided more meaningful gains in efficiency when the matching variable strongly predicted the intervention. When the initial outcome model () did not adjust for the matching variable or other confounders, matching provided a large efficiency gain when the data were analyzed with the targeted estimator, while matching without targeting did not improve efficiency. Adjustment for confounders, even using a misspecified initial outcome model, reduced the relative gain in efficiency provided by matching and targeting. When the initial outcome model was correctly specified (), the efficiency gain provided by matching was further reduced, and the relative efficiency gains were equivalent when data were analyzed with the targeted or non-targeted estimator.
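The relative efficiency comparison described above amounts to a ratio of sample variances across replicate datasets. A minimal sketch, with the hypothetical function name `relative_efficiency`:

```python
import numpy as np

def relative_efficiency(est_independent, est_matched):
    """Ratio of sample variances: independent sampling over matched sampling.

    Values above 1 indicate that the matched design gave the less variable estimator.
    """
    var_independent = np.var(np.asarray(est_independent, dtype=float), ddof=1)
    var_matched = np.var(np.asarray(est_matched, dtype=float), ddof=1)
    if var_matched == 0.0:
        raise ValueError("matched-design estimates have zero sample variance")
    return float(var_independent / var_matched)
```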
5.6 Results: variance of in matched vs independent sampling
Recall that is in fact the Cramér–Rao lower bound for the variance of a locally unbiased and asymptotically linear estimator of the parameter in a nonparametric model. On the other hand, an estimator that is asymptotically linear with influence curve D will have asymptotic variance . Therefore, we can assess the theoretical gain in efficiency of using matching and targeting by comparing the variance of vs variance of , at , and , , or . To this end, for each data generating distribution and sampling design, we used a sample of observations. The limits and were approximated using the sample. was approximated with the sample variance of over the independent observations, while was approximated with the sample variance of over 100,000 independent pairs of matched observations.
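The variance approximation described above can be sketched for the independent sampling design as follows. The influence curve is written in the standard form for the effect among the treated with a binary indicator B, which is an assumption here since the article's display is not reproduced in this text; the function names and arguments are likewise hypothetical.

```python
import numpy as np

def eic_effect_among_treated(y, b, q1, q0, g1, psi):
    """Evaluate an assumed efficient influence curve at each observation.

    q1, q0 : outcome regression values at B = 1 and B = 0
    g1     : probabilities P(B = 1 | W)
    psi    : value of the target parameter
    """
    y, b, q1, q0, g1 = (np.asarray(v, dtype=float) for v in (y, b, q1, q0, g1))
    q = b.mean()                                   # marginal proportion treated
    q_obs = np.where(b == 1, q1, q0)               # regression at the observed B
    h = np.where(b == 1, 1.0 / q, -g1 / (q * (1.0 - g1)))
    return h * (y - q_obs) + (b / q) * (q1 - q0 - psi)

def influence_curve_variance(ic_values):
    """Sample variance of influence curve values over a large sample."""
    return float(np.var(np.asarray(ic_values, dtype=float), ddof=1))
```

Plugging in different choices of the outcome regression (correct or misspecified) and comparing the resulting variances corresponds to the comparison reported in this subsection.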
Tables 4–6 show that when the matching variable was a strong predictor of A, the matched cohort design provided a significant gain in efficiency over the independent design, a gain which was more pronounced when the outcome model was incorrect ( and ). On the other hand, the effect size of the matching variable on outcome seemed to have little bearing on efficiency.
6 Generalization to K communities
Full generalization of the results presented here to K fixed communities is presented in the technical report van der Laan . Let denote the set of K communities. An NPSEM analogous to eq.  is defined by using the set . Under this NPSEM, the counterfactual and causal parameter are defined as in Section 2.1.
In a real experiment, one only observes environment–treatment combinations . Therefore, the observed data may be considered as sampling from a restricted NPSEM, where is restricted to . More specifically, our observed data correspond with observing K samples of i.i.d. observations from , . The identifiability result stated in the theorem in van der Laan  teaches us that, under the same exclusion restriction and randomization assumption, the causal parameter corresponds with the statistical parameter
The efficient influence curve, targeted MLE, collaborative targeted MLE, and statistical inference based on an estimate of the efficient influence curve correspond exactly with those presented earlier for the statistical target parameter based on observing n i.i.d. copies of with , , and . Thus, the practical conclusion is that one can create a single combined sample from the K community-specific samples, reduce each observation to by ignoring the data on environmental factors, and apply the targeted MLE for .
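The pooling step in this practical conclusion can be sketched as below. The representation of each community sample as a dictionary with keys `"A"` (community-level treatment), `"W"` (individual covariates), `"Y"` (outcomes), and an optional `"E"` (environmental factors) is an assumption made for illustration, as is the function name `pool_communities`.

```python
import numpy as np

def pool_communities(samples):
    """Stack K community-specific samples into one sample of (A, W, Y).

    Any environmental factors stored under "E" are deliberately ignored.
    """
    a_parts, w_parts, y_parts = [], [], []
    for s in samples:
        w = np.asarray(s["W"], dtype=float)
        if w.ndim == 1:
            w = w[:, None]                       # treat a vector as one covariate
        y = np.asarray(s["Y"], dtype=float)
        if w.shape[0] != y.shape[0]:
            raise ValueError("W and Y must have one row per individual")
        # The community-level treatment is repeated for every individual.
        a_parts.append(np.full(y.shape[0], s["A"]))
        w_parts.append(w)
        y_parts.append(y)
    return np.concatenate(a_parts), np.vstack(w_parts), np.concatenate(y_parts)
```

The pooled arrays can then be passed to whatever estimator of the statistical target parameter is being used, with the pooled number of rows as the sample size.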
7 Summary and concluding remarks
This article provides contributions to the literature on causal inference for community-based interventions, concerning inference conditional on a given finite set of communities and asymptotics in the number of individuals sampled, under the assumption of no-interference.
For the experiment in which we observe two populations under two different treatment regimens, while collecting data at the individual level, the article makes the following contributions. We define the causal effect of a treatment versus control level of an intervention as the difference in the mean individual outcome under an experiment where one assigns the treatment level to both communities versus under an experiment where one assigns the control level to both communities. We establish that this causal effect can be identified under a randomization assumption, together with an exclusion restriction that assumes that measured individual level covariates which are not affected by the treatment are sufficient to block the effect of any environmental factors that differ between the two communities on the outcome. Covariates in this setting, while remaining crucial for control of confounding, thus play a slightly different role than in classical causal inference for individual-based treatments. The main purpose of the covariates here is to remove bias by blocking the effect of different environments between the populations on the outcome distributions. Moreover, the statistical parameter representing this additive causal effect involves computing the mean outcome as a function of the covariates for the treatment population, repeating the process for the control population, taking the difference between these covariate-value specific mean outcomes, and averaging the difference over all units in the combined sample.
We emphasize that the identifiability conditions required are both strong and untestable – they concern the causal mechanism, and are not related to any statistical model or analytical methods used by the researcher. The strength of the assumptions required should serve as a warning that inferring causal effects of community-based interventions when data are available from only a few communities is sensible only in very limited cases, and with a very careful study design. The statistical parameters eqs.  and  provide nonparametric measures for quantifying the intervention effect. However, when the identifiability conditions are not satisfied, the investigator is warned against moving beyond a statistical to a causal interpretation of these parameters. Nonetheless, by providing a nonparametric measure of treatment effect, these statistical parameters are more interpretable than those defined based on parametric model specifications.
We present (collaborative) targeted maximum likelihood estimators for the statistical parameters provided by these identifiability results. The estimators are based on first using super learning to estimate the outcome regression, and subsequently targeting this initial fit using a targeted maximum likelihood step, which relies on an estimate of the probability of being selected in population 1 as a function of the covariates. The resulting targeted maximum likelihood estimator is double robust and efficient. That is, with our formulation we make it possible to use state of the art statistical methodology in causal inference to obtain fully efficient and double robust estimators of the causal effect of an intervention assigned at the population level.
We show in the accompanying technical report that this approach, somewhat surprisingly, can also be used to estimate the additive causal effect of setting treatment at time t (choosing between the observed treatment level at time t for population 1 versus the observed treatment level at time t for population 2). The past treatment and past environment are viewed as the environment variable assigned at the population level, current treatment at time t is viewed as the treatment assigned at the population level, and the individual past before treatment at time t represents the covariates that can potentially be used to block the effect of differential environment. In this manner, the same methodology can be applied, providing us with double robust and efficient targeted maximum likelihood estimators of the t-specific causal effects of community-based interventions, and user-supplied summary measures of these t-specific causal effects.
We extend our results to identification and estimation of the effects of community-based interventions using individually matched two community cohort designs, and investigate practical and theoretical efficiency gains under this design compared to one in which individuals are sampled independently. We use general results on efficient influence curves and targeted maximum likelihood estimation for case–control sampling as established in van der Laan  to compute the semiparametric information bounds and present the targeted maximum likelihood estimator for these matched cohort designs. We show that the matched cohort design is very much targeted towards the causal effect among the treated population, showing the strong potential benefits of matching for the purpose of this target causal parameter.
Finally, we generalize our identifiability theorems, efficient influence curve, and estimators to the case that one assigns two possible interventions to K communities. In particular, our i.i.d. representation of this multi-sample data structure in terms of shows that the effective sample size will be , the pooled sample size across all communities.
If a large number of communities is sampled from some target population of communities, so that asymptotics in the number of communities is sensible, then one can obtain identifiability of the causal effects of community level interventions under the usual randomization and positivity assumptions known from the classical causal inference literature. In such a case, one need not rely on the strong identifiability assumptions made in this article.
The targeted maximum likelihood estimators presented here for a fixed number of K communities are still consistent and asymptotically linear if the intervention is randomly assigned at the community level, as in cluster randomized trial designs. They do not involve adjustment by environmental factors, however, and are, therefore, somewhat inefficient. On the other hand, even in the case that one aims to identify a causal effect for a target population of communities, the targeted maximum likelihood estimator presented here remains of interest when the number of communities is small, the number of sampled individuals is large, and the exclusion restriction assumption holds up to a reasonable approximation. We refer to the technical report van der Laan  for results for this TMLE with respect to a causal target parameter defined with respect to a target population of communities in the context that K converges to infinity, in the case where the exclusion restriction assumption does not necessarily hold.
Mark van der Laan and Wenjing Zheng are supported by NIH grant R01 AI074345-06. Maya Petersen is a recipient of a Doris Duke Clinical Scientist Development Award.
Oakes JM. The (mis)estimation of neighborhood effects: causal inference for a practicable social epidemiology. Soc Sci Med 2004;58(10):1953–1960.
Small DS, Ten Have TR, Rosenbaum PR. Randomization inference in a group-randomized trial of treatments for depression. J Am Stat Assoc 2008;103:271–79.
Hansen B, Bowers J. Attributing effects to a cluster-randomized get-out-the-vote campaign. J Am Stat Assoc 2009;104:873–85.
Imai K, King G, Nall C. The essential role of pair matching in cluster-randomized experiments, with application to the Mexican universal health insurance evaluation. Stat Sci 2009;24:29–53.
Kenward MG, Roger JH. Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 1997;53:983–97.
Skene S, Kenward MG. The analysis of very small samples of repeated measurements I: a modified Box correction. Stat Med 2010b;29:2837–56.
Zhang K, Small DS. Comment: the essential role of pair matching in cluster-randomized experiments, with application to the Mexican universal health insurance evaluation. Stat Sci 2009;24:59–64.
Cox DR. The planning of experiments. New York: Wiley, 1958.
Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann Stat 1978;6:34–58.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55.
Holland PW. Statistics and causal inference. J Am Stat Assoc 1986;81:945–60.
Hong G, Raudenbush S. Evaluating kindergarten retention policy. J Am Stat Assoc 2006;101:901–10.
Sobel M. What do randomized studies of housing mobility demonstrate? J Am Stat Assoc 2006;101:1398–407.
Halloran ME, Struchiner CJ. Causal inference in infectious diseases. Epidemiology 1995;6:142–51.
Rosenbaum P. Interference between units in randomized experiments. J Am Stat Assoc 2007;102:191–200.
Tchetgen Tchetgen EJ, VanderWeele TJ. On causal inference in the presence of interference. Statistical Methods in Medical Research – Special Issue on Causal Inference 2012;21:55–75.
Sinclair B, McConnell M, Green D. Detecting spillover in social networks: design and analysis of multi-level experiments. Am J Political Sci 2012;56(4):1055–1069.
Shalizi CR, Thomas AC. Homophily and contagion are generically confounded in observational social network studies. Sociol Methods Res 2011;40:211–39.
Manski CF. Identification of endogenous social effects: the reflection problem. Rev Econ Stud 1993;60:531–42.
Graham B, Imbens GW, Ridder G. Measuring the effects of segregation in the presence of social spillovers: a nonparametric approach. Technical report, 2010.
Pearl J. Causality: models, reasoning and inference, 2nd ed. New York: Cambridge University Press, 2009.
Varnell SP, Murray DM, Baker W. An evaluation of analysis options for the one-group-per-condition design: can any of the alternatives overcome the problems inherent in this design? Eval Rev 2001;25:440–53.
Polley EC, van der Laan MJ. Super learner in prediction. Technical Report 266, Division of Biostatistics, University of California, Berkeley, 2010.
Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies. Int J Biostat 2008. Available at: http://www.bepress.com/ijb/vol4/iss1/19/.
van der Laan MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat 2008. Available at: http://www.bepress.com/ijb/vol4/iss1/17/.
van der Laan MJ. Estimation of causal effects of community based interventions. Working paper 268, U.C. Berkeley Division of Biostatistics Working Paper Series, http://www.bepress.com/ucbbiostat/paper268, June 2010.
van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat 2006;2. Available at: http://www.degruyter.com/view/j/ijb.2006.2.1/ijb.2006.2.1.1043/ijb.2006.2.1.1043.xml?format=INT.
van der Laan MJ, Rose S. Targeted learning. New York: Springer, 2011.
van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. Int J Biostat 2010;6. Available at: http://www.degruyter.com/view/j/ijb.2010.6.1/ijb.2010.6.1.1181/ijb.2010.6.1.1181.xml?format=INT.
Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat 2010;6.
van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol 2007;6(1).
Rose S, van der Laan MJ. Why match? Investigating matched case-control study designs with causal effect estimation. Int J Biostat 2009. Available at: http://www.bepress.com/ijb/vol5/iss1/1/.
Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Technical report 265, Division of Biostatistics, University of California, Berkeley, May 2010.