Skip to content
BY-NC-ND 3.0 license Open Access Published by De Gruyter August 24, 2018

Propensity Score Weighting for Causal Inference with Clustered Data

Shu Yang ORCID logo

Abstract

Propensity score weighting is a tool for causal inference to adjust for measured confounders in observational studies. In practice, data often present complex structures, such as clustering, which make propensity score modeling and estimation challenging. In addition, for clustered data, there may be unmeasured cluster-level covariates that are related to both the treatment assignment and outcome. When such unmeasured cluster-specific confounders exist and are omitted in the propensity score model, the subsequent propensity score adjustment may be biased. In this article, we propose a calibration technique for propensity score estimation under the latent ignorable treatment assignment mechanism, i. e., the treatment-outcome relationship is unconfounded given the observed covariates and the latent cluster-specific confounders. We impose novel balance constraints which imply exact balance of the observed confounders and the unobserved cluster-level confounders between the treatment groups. We show that the proposed calibrated propensity score weighting estimator is doubly robust in that it is consistent for the average treatment effect if either the propensity score model is correctly specified or the outcome follows a linear mixed effects model. Moreover, the proposed weighting method can be combined with sampling weights for an integrated solution to handle confounding and sampling designs for causal inference with clustered survey data. In simulation studies, we show that the proposed estimator is superior to other competitors. We estimate the effect of School Body Mass Index Screening on prevalence of overweight and obesity for elementary schools in Pennsylvania.

1 Introduction

A main statistical approach to causal inference is built on the potential outcomes framework [1], in which a causal effect is defined as a comparison of the potential outcomes of the same units under different treatment levels. Observational studies are often used to infer causal effects in medical and social science studies. In observational studies, there often is confounding by indication: some covariates are predictors of both the treatment and outcome. One implication is that the covariate distributions differ between the treatment groups. Under the assumption of ignorable treatment assignment and that all confounders are observed, the causal effect of treatments can be obtained by comparing the outcomes for units from different treatment groups, adjusting for the observed confounders. Rosenbaum and Rubin [2] further demonstrated the central role of the propensity score, and showed that adjusting for the propensity score removes confounding bias. An extensive literature thereafter proposed a number of estimators based on the propensity score, including matching [3], [4], [5], weighting [6], [7], [8], and stratification [9], [10]. In particular, propensity score weighting can be used to create a weighted population where the covariate distributions are balanced between the treatment groups, on average. Therefore, under some assumptions, the comparison between the weighted outcomes has a causal interpretation; see Imbens and Rubin [11] for a textbook discussion.

Propensity score weighting has been mainly developed and applied in settings with independently and identically distributed (i.i.d.) data. However, in many research areas, data often present complex structures, such as clustering. Clustering can be formed in diverse ways. First, clusters are created by experimental design. The classical examples are given in educational and health studies, where students are nested in schools (e. g. [12]) and patients are grouped in hospitals (e. g. [13]). Second, clusters are induced by high-level features, such as common environmental and contextual factors, shared by individual units (e. g. [14]). For a motivating example, we examine the 2007–2010 body mass index (BMI) surveillance data from Pennsylvania Department of Health to estimate the effect of School Body Mass Index Screening (SBMIS) on the annual overweight and obesity prevalence in elementary schools in Pennsylvania. The data set includes 493 schools (units) in Pennsylvania, which are clustered by two factors: type of community (rural, suburban, and urban), and population density (low, median, and high). The data structure is schools nested within high-level of environments.

In this article, we address the problem of estimating the average treatment effect from clustered data, where the treatment is administered at the unit level, the covariates are measured at both unit and cluster levels (where cluster-level covariates are cluster characteristics shared by all units within clusters), and finally the outcome is measured at the unit level. Even if we collect a rich set of unit-level covariates, there may be unobserved cluster-level covariates that are related to both the treatment assignment and outcome. This problem is ubiquitous in clustered data where data are collected sufficiently at the unit level, however insufficient information is available at the cluster level. In our motivating example, we have unit-level school covariates including the baseline prevalence of overweight and obesity and percentage of reduced and free lunch. However, certain key contextual factors, such as accessibility to and quality of care, socioeconomic and environmental variables, which can be very different across clusters, are perceivably important factors for schools implementing prevention screening strategy and children’s obesity rate. Unfortunately, these cluster-specific confounders are not available. When such unmeasured confounders exist and are omitted in the propensity score model, the subsequent analysis may be biased.

We make the stable unit and treatment version assumption (SUTVA; [15]). Under the SUTVA, potential outcomes for each unit are not affected by the treatments assigned to other units. This assumption is not uncommon. In our application, the treatment was implemented school-wise. The potential outcomes for one school are likely to be unaffected by the treatments implemented at other schools, and therefore the SUTVA is plausible. However, in other settings, this assumption may not hold. A classical example is given in infectious diseases [16], [17], where whether one person becomes infected depends on who else in the population is vaccinated. In this article, we will not discuss the case when the SUTVA is violated.

The literature has proposed different methods for clustered data. Oakes [18] and VanderWeele [19] used multi-level models for the potential outcomes to draw causal conclusions in neighborhood effect studies. A series of papers has proposed various propensity score matching algorithms with multi-level models for the propensity score; see, e. g., [12], [20], [21], [22], [23], [13], [24], [25], and [26]. Recently, Li et al. [14], Leite, Jimenez, Kaya, Stapleton, MacInnes and Sandbach [27] and Schuler, Chu and Coffman [28] examined propensity score weighting methods to reduce selection bias in multi-level observational studies. Xiang and Tarasawa [29] employed propensity score stratification and multi-level models to balance key covariates between treatment groups of a cross-state sample of students. For comparison of the effectiveness of various propensity score strategies; see, e. g., [30], [13] and [31]. Among these works, researchers considered different modeling choices for the propensity score and outcome, such as generalized linear fixed/mixed effects models. The fixed effects models create dummy variables for each cluster, regarding the cluster variables as fixed; while the random effects models use random intercepts for each cluster, treating the cluster variables as random. Nonetheless, all existing methods require correct specification of the propensity score and outcome models.

The goal of this article is to develop a novel propensity score weighting method for causal inference with clustered data in the presence of unmeasured cluster-level confounders. An important contribution is to provide a robust construction of inverse propensity score weights under the latent ignorable treatment assignment mechanism; i. e., the treatment-outcome relationship is unconfounded given the observed confounders and the latent cluster-level confounders. The key insight is based on the central role of the propensity score in balancing the covariate distributions between the treatment groups. For propensity score estimation, we then adopt the calibration technique and impose balance constraints for moments of the observed and latent cluster-level confounders between the treatment groups. Because the latent cluster-level confounders are not observed, we impose stronger balance constraints enforcing the sum of weighted treatments equal the cluster size for all clusters, which imply the exact balance of the cluster-level confounders. The proposed propensity score weighting estimator is doubly robust (e. g., [32], [33], [7], [34]) in the sense that it is consistent for the average treatment effect if either the propensity score model is correctly specified or the outcome follows a linear mixed effect model. In general cases, if the conditional mean of the outcome given the observed confounders and the latent cluster-level confounders can be well approximated by the power series of confounders, imposing constraints on these power series can also eliminate confounding biases, and the propensity score weighting estimator is consistent. The simulation results demonstrate that the proposed estimator has improved robustness to model misspecification compared to existing methods.

Importantly, our results are in agreement with some existing findings that misspecification of the propensity score model may have minor impact on the bias of the estimator for the average treatment effect ([35]). This is especially true in the matching and stratification algorithm, because the estimated propensity score is only used to balance the covariate distributions instead of directly in estimation. Our results suggest that if both individual and cluster-level confounders achieve a good balance between the treatment groups, the proposed weighting estimator for the average treatment effect is robust.

Clustered data often arise in survey sampling. In complex surveys, the challenge is to take design information or design weights into account when developing propensity score methods for causal inference. The proposed weighting method can be combined with sampling weights for an integrated solution to handle confounding and sampling designs for causal inference with clustered survey data.

This article is organized as follows. Section 2 introduces the data structure and assumptions, defines the estimands, and presents existing inverse probability of treatment weighting estimators for clustered data. Section 3 proposes our estimators. Section 4 presents the main theoretical results. Section 5 extends the proposed calibration estimator to clustered survey data. Section 6 reports a simulation study to evaluate finite sample properties of our estimator. Section 7 applies our methods to investigate the effect of SBMIS on the annual overweight and obesity prevalence in elementary schools in Pennsylvania. A concluding remark is given in Section 8. Finally, proofs of the main theoretical results and additional simulation results are provided in the Appendix.

2 Basic setup

2.1 Observed data structure

To fix the ideas, we first focus on two-level clustered data. The extension to clustered survey data will be addressed in Section 5.

Suppose we have a two-level data structure where at the first level we have m clusters, and at the second level each cluster i includes ni units. Denote the sample size by n=i=1mni. For unit j in cluster i, we observe a p-dimensional vector of pre-treatment covariates Xij, which may include the observed individual and cluster characteristics, a binary treatment variable Aij{0,1}, with 0 and 1 being the labels of control and active treatments, respectively, and lastly an outcome variable Yij.

2.2 Potential outcomes and assumptions

We use the potential outcomes framework [1]. Assume that each unit has two potential outcomes: Yij(0), the outcome that would be realized, possibly contrary to the fact, had the unit received the control treatment, and Yij(1), the outcome that would be realized, possibly contrary to the fact, had the unit received the active treatment. The observed outcome is Yij=Yij(Aij). This notation implicitly makes the SUTVA [15] that there is no interference between units and no versions of each treatment.

Suppose that clusters are random draws from a super-population of clusters, and that for observations within cluster i, {Aij,Xij,Yij(0),Yij(1):j=1,,ni} i.i.d. follow a cluster-specific super-population model. Our goal is to estimate the average treatment effect, τ=E[n1i=1mj=1ni{Yij(1)Yij(0)}], where the expectation is taken with respect to the super-population model ξ of all random variables, which will be specified later. In the binary case, τ is called the causal risk difference.

The fundamental problem is that not all potential outcomes can be observed for each unit in the sample; only one potential outcome, the outcome corresponding to the treatment the unit actually followed, can be observed [36]. Throughout, we use Z1Z2Z3 to denote the conditional independence of Z1 and Z2 given Z3 [37]. With unstructured i.i.d. data, Rubin [1] described the following assumption for identifying the average treatment effect.

Assumption 1 (Ignorability).

{Yij(0),Yij(1)}AijXij.

Assumption 1 indicates that all confounders are included in Xij.

For clustered data, confounding may vary across clusters and are related to some cluster-level covariates that are not always observable. In these cases, Assumption 1 does not hold. Instead, we assume a cluster-specific latent variable Ui that summarizes the effect of unobserved cluster-level confounders, and consider the following modified ignorability assumption.

Assumption 2 (Latent ignorability).

{Yij(0),Yij(1)}AijXij,Ui.

Under Assumption 2, the propensity score becomes pr{Aij=1Xij,Ui,Yij(0),Yij(1)}=pr(Aij=1Xij,Ui)e(Xij,Ui). Moreover, we make the standard positivity assumption for the propensity score.

Assumption 3 (Positivity).

There exist constantse_ande¯such that, with probability 1,0<e_<e(Xij,Ui)<e¯<1.

Remark 1.

In our setting, the treatment is assigned at the unit level. Assumption3implies that each unit in each cluster has a positive probability to receive either treatment or control. Therefore, our setting does not apply to the settings with a cluster-level treatment where all units in one cluster receive one treatment level. For these settings, we refer the interested readers to Stuart [38] and VanderWeele [19].

Under Assumption 2, we write the joint distribution of {(Aij,Xij,Ui,Yij):i=1,,m;j=1,ni} as

i=1mf(Ui)j=1nif(XijUi)f1(YijXij,Ui)e(Xij,Ui)Aijf0(YijXij,Ui){1e(Xij,Ui)}1Aij,

where fa(·Xij,Ui) is a conditional distribution of Yij(a) given (Xij,Ui), for a=0,1.

The literature has considered generalized linear mixed effects models for fa(YijXij,Ui) and e(Xij,Ui); see, e. g., [24], [25], [14]. Following the literature, we assume generalized linear mixed effects models for the outcome and propensity score.

Assumption 4 (Outcome model).

The potential outcomeYij(a)follows a generalized linear mixed effects model with a random interceptUias

(1)μij(a)=ga(XijTβa+Ui),
whereμij(a)=E{Yij(a)Xij,Ui},ga(·)is an unspecified inverse link function, andβais a p-dimensional vector.

Assumption 5 (Propensity score model).

The actual treatmentAijgiven(Xij,Ui)follows a generalized linear mixed effects model with a random interceptUias

(2)e(Xij,Ui;η)=h(XijTη+Ui),
whereh(·)is an unspecified inverse link function, and η is a q-dimensional vector of parameters.

There are two different model specifications regarding the cluster-level confounders. The fixed effects model treats Ui as fixed but unknown parameters across clusters. In this fixed-effects approach, treatment assignment is an ignorable process, which complies with Assumption 1 given that Xij stacks all observed confounders and cluster dummy variables. On the other hand, the random effects model treats Ui as random and i.i.d. drawn from a distribution. The difference between the two modeling strategies has been addressed in both statistics and econometrics literature; see, e. g., [39] and [40]. Briefly, there are both statistical and logical considerations. First, if the number of clusters is relatively large, the parameter estimates in the fixed effects model are inconsistent [41]. In this case, the random effects approach is preferred. Second, the fixed effects approach does not make distributional assumptions of the cluster-level confounders; whereas, the random effects approach assumes that Ui is random and i.i.d. drawn from a distribution.

2.3 Inverse probability of treatment weighting estimator

To estimate the average treatment effect τ, let ν=(U1,,Um) denote the vector of cluster-level confounders. Under (2), the inverse propensity score or probability of treatment weighting (IPTW) estimator of τ can be expressed as

(3)τˆIPTW(ν,η)=1ni=1mj=1niAijYije(Xij,Ui;η)(1Aij)Yij1e(Xij,Ui;η).

Under Assumptions 2 and 3, if the propensity score is known, it is straightforward to verify that τˆIPTW(ν,η) is unbiased for τ. Moreover, if it is unknown but depends only on fixed parameters, τˆIPTW(ν,η) with the consistently estimated propensity score is asymptotically unbiased for τ. The challenge with clustered data is that τˆIPTW(ν,η) may depend on a growing number of unobserved cluster-level confounders. To resolve this issue, there are several options:

(i)

Weight based on predicted random intercepts; i. e., treat the Ui’s in model (2) as random intercepts, and predict the propensity score as e(Xij,Uˆiran;ηˆran), where Uˆiran is the mode of a predictive distribution for Ui given the observed Aij and Xij, and ηˆran is the maximum likelihood estimator of η.

(ii)

Weight based on estimated fixed intercepts; i. e., treat the Ui’s in model (2) as fixed intercepts, and estimate the propensity score as e(Xij,Uˆifix;ηˆfix), where Uˆifix and ηˆfix are maximum likelihood estimators.

Let τˆIPTW(ν,η) in (3) be denoted as τˆran or τˆfix when the propensity score is predicted under option (i) or estimated under option (ii), respectively. The two approaches suffer several drawbacks. First, to obtain τˆran often involves numerical integration, which can be computationally heavy. Second, the predicted value of the propensity score does not guarantee the balance of covariate distributions between the treatment groups, due to the shrinkage of random intercepts toward zero. Lastly, it is well-known that under (2), τˆfix is not consistent as m increases, because when treating Ui as fixed, the number of parameters has an order similar to m [42].

3 Proposed methodology

To motivate our estimation of the propensity score, we note

(4)EAe(X,U)XU=EEAe(X,U)X,UXU=EXU,

and

(5)E1A1e(X,U)XU=EE1A1e(X,U)X,UXU=EXU.

Equations (4) and (5) clarify the central role of the propensity score in balancing the covariate distributions between the treatment groups in the super-population. For simplicity of exposition, let eij be the propensity score for unit j in cluster i, and let eˆij be the corresponding estimate. We consider the propensity score estimate to satisfy the following constraints:

(6)i=1mj=1niAijeˆijXij=i=1mj=1ni1Aij1eˆijXij=i=1mj=1niXij,
(7)j=1niAijeˆij=j=1ni1Aij1eˆij=j=1ni1=ni,(i=1,,m).
Note that (6) is the empirical version of (4). The empirical version of (5) is

(8)i=1mj=1niAijeˆijUi=i=1mj=1ni1Aij1eˆijUi=i=1mj=1niUi,

which however is infeasible because the Ui’s are unobserved. Instead, we impose (7) which implies (8), without the need to observe the Ui’s.

To obtain the propensity score estimate that achieves (6) and (7), we use the calibration technique in the following steps:

Step 1.

Obtain an initial propensity score estimate eˆij0 under a working propensity score model, e. g. a logistic linear mixed effects model. This in turn provides an initial set of inverse propensity score weights, W0={dij:i=1,,m;j=1,,ni}, where dij=1/eij0 if Aij=1 and dij=1/(1eij0) if Aij=0.

Step 2.

Modify the initial set of weights W0 to a new set of weights W={αij:i=1,,m;j=1,,ni} by minimizing the Kullback-Leibler distance [43] of W0 and W:

(9)i=1mj=1niG(αij,dij)=i=1mj=1niαijlogαijdij,

subject to (6) and (7). By the Lagrange Multipliers technique, the minimizer of (9) subject to (6) and (7) is

(10)αij(λ1,λ2)=niAijdijexp(λ1TXijAij)j=1niAijdijexp(λ1TXijAij)+ni(1Aij)dijexp{λ2TXij(1Aij)}j=1ni(1Aij)dijexp{λ2TXij(1Aij)},

where (λ1,λ2) is the solution to the following equation

(11)Qˆ(λ1,λ2)=Qˆ1(λ1,λ2)Qˆ2(λ1,λ2)=n1i=1mj=1niAijαij(λ1,λ2)1Xijn1i=1mj=1ni(1Aij)αij(λ1,λ2)1Xij=0.
Step 3.

Obtain the propensity score estimate as

eˆij=αij(λˆ1,λˆ2)Aij{1αij(λˆ1,λˆ2)}1+Aij.
Finally, our proposed IPTW estimator is

(12)τˆIPTW=1ni=1mj=1niAijYijeˆij(1Aij)Yij1eˆij.

Remark 2 (Calibration).

Calibration has been used in many scenarios. In survey sampling, calibration is widely used to integrate auxiliary information in estimation or handle nonresponse; see, e. g., [44], [45], [46], [47], [48] and [49]. In causal inference, calibration has been used such as in Constrained Empirical Likelihood [50], Entropy Balancing [51], Inverse Probability Tilting [52], and Covariate Balance Propensity Score of Imai and Ratkovic [53]. Chan, Yam and Zhang [54] showed that estimation of average treatment effects by empirical balance calibration weighting can achieve global efficiency. However, all these works were developed in settings with i.i.d. variables and they assumed that there are no unmeasured confounders. Our article is the first to use calibration for causal inference with unmeasured cluster-level confounders.

Remark 3 (Distance function).

In Step 2 of the calibration algorithm, different distance functions, other than the Kullback-Leibler distance, can be considered. For example, if we chooseG(αij,dij)=dij(αij/dij1)2, then the minimum distance estimation leads to generalized regression estimation [55] of theαij’s. If we chooseG(αij,dij)=dijlog(αij/dij), then it leads to empirical likelihood estimation [56]. We use the Kullback–Leibler distance function, which leads to exponential tilting estimation [57], [58], [59]. The advantage of using the Kullback-Leibler distance is that the resulting weights are always non-negative. Also, with Kullback-Leibler distance, the calibration constraint (7) can be built into a closed form expression for the weights, and thus avoiding solving a large number of equations. This reduces the computation burden greatly, when there is a large number of clusters.

Remark 4 (Nonparametric methods).

It is worth commenting on the existing robust nonparametric methods and the advantages of our estimator. In the i.i.d. data setting, many nonparametric and machine learning methods have been proposed to capture the complex relationship of different variables without parametric assumptions, such as generalized boosted regression, causal trees, random forest, and neural networks. Indeed, many studies have shown the superiority of these methods to the parametric propensity score estimation; see, e. g., [60], [61], [62], [63]. However, these data-driven methods assume that all confounders are observed, and therefore they can not handle unobserved cluster-level confounders, unlike our proposed method.

4 Main results

To discuss the asymptotic properties of the proposed estimator, we assume that the cluster sample sizes satisfy the condition that min1imni and sup1imni=O(n1/2). To show the double robustness of the proposed estimator τˆIPTW, we distinguish two cases and indicate different roles of calibration in estimation. We provide a heuristic argument below and relegate the technical details to the Appendix.

First, consider the case when the initial propensity score model is correctly specified. The weighting estimator with the initial propensity score estimates is then consistent for the average treatment effect. In this case, the role of calibration is to improve estimation efficiency by incorporating additional covariate information. This role of calibration has been demonstrated extensively in the survey literature (e. g., [64]) to modify the initial design weights to incorporate known auxiliary information.

Second, consider the case when the outcome models are linear mixed effects models:

EYij(a)Xij,Ui=XijTβa+Ui+ea,ij,

where the ea,ij’s are independent with E(ea,ijXij,Ui)=0, for a=0,1. In this case, the role of calibration is to balance the confounders between the treatment groups for reducing the selection bias. We note that eˆij does not depend on outcome variables, and therefore under Assumptions 2, 4 and 5, eˆijYij(1)Xij,Ui. Then, we have

(13)E1ni=1mj=1niAijeˆij1Yij(1)=1ni=1mj=1niEAijeˆij1EYij(1)Xij,Ui=E1ni=1mj=1niAijeˆij1XijTβ1+Ui=0,

where the last equality follows by the constraints (6) and (7). Using Assumption 2 and (13), it follows

(14)E1ni=1mj=1niAijeˆijYij=E1ni=1mj=1niAijeˆijYij(1)=E1ni=1mj=1niYij(1).

Similarly, we establish

(15)E1ni=1mj=1ni1Aij1eˆijYij=E1ni=1mj=1niYij(0).

Combining (14) and (15), we have E(τˆIPTW)=τ, which yields the unbiasedness of τˆIPTW. Under standard regularity conditions specified in the Appendix, we show that τˆIPTW is consistent for τ.

In general cases, if the conditional mean of the outcome given the observed confounders and the latent cluster-level confounders can be well approximated by the power series of confounders, imposing constraints on these power series can also eliminate confounding biases, and the propensity score weighting estimator is consistent.

We derive the asymptotic distribution of τˆIPTW in the following theorem and relegate the proof to the Appendix.

Theorem 1.

Suppose that Assumptions2,3, and the regularity conditions specified in the Appendix hold. Suppose further that the cluster sample sizesni, fori=1,,m, satisfy the condition thatmin1imniandsup1imni=O(n1/2). If the outcome model (1) is a linear mixed effects model or the propensity score model (2) is correctly specified, the proposed propensity score weighting estimator in (12), subject to constraints (6) and (7), satisfies

V11/2(τˆIPTWτ)N(0,1),
in distribution, asn, whereV1=var(n1i=1mj=1niτij),
τij={αij(λ1,λ2)Aij(YijB1TXij)+B1TXij}{αij(λ1,λ2)(1Aij)(YijB2TXij)+B2TXij},B1=Ei=1mj=1niαij(λ1,λ2)1αij(λ1,λ2)niAijYijXijTEi=1mj=1niαij(λ1,λ2)1αij(λ1,λ2)niAijXijXijT1,B2=Ei=1mj=1niαij(λ1,λ2)1αij(λ1,λ2)ni(1Aij)YijXijTEi=1mj=1niαij(λ1,λ2)1αij(λ1,λ2)ni(1Aij)XijXijT1,
and(λ1,λ2)satisfiesE{Qˆ(λ1,λ2)}=0withQˆ(λ1,λ2)defined in (11).

The asymptotic result in Theorem 1 allows for variance estimation of τˆIPTW. We now discuss variance estimation. Let τˆij=αij(λˆ1,λˆ2){Aij(YijBˆ1TXij)(1Aij)(YijBˆ2TXij)}+(Bˆ1Bˆ2)TXij, where

Bˆ1=i=1mj=1niαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)niAijYijXijTi=1mj=1niαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)niAijXijXijT1,Bˆ2=i=1mj=1niαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)ni(1Aij)YijXijTi=1mj=1niαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)ni(1Aij)XijXijT1.

Let τˆi=ni1j=1niτˆij and Vˆi=(ni1)1j=1ni(τˆijτˆi)2. The variance estimator can be constructed as

Vˆ(τˆIPTW)=1n1m1i=1m(τˆiτˆIPTW)2+1mi=1mVˆi.

5 Extension to clustered survey data

In this section, we extend the proposed propensity score weighting estimator to clustered survey data. Consider a finite population FN with M clusters and Ni units in the ith cluster, where N=i=1MNi denotes the population size. We assume that in the finite population, {Aij,Xij,Yij(0),Yij(1):i=1,,M;j=1,,Ni} follows the super-population model ξ as described in Section 2. We are interested in estimating the population average treatment effect τ=E[N1i=1Mj=1Ni{Yij(1)Yij(0)}].

We assume that the sample is selected according to a two-stage cluster sampling design. Specifically, at the first stage, cluster i is sampled with the first-order inclusion probability πi=pr(iSI), where SI is the index set for the sampled clusters. Let πij=pr{(i,j)SI} be the second-order inclusion probability for clusters i and j being sampled. At the second stage, given that cluster i was selected at the first stage, unit j is sampled with conditional probability πj|i=pr(jSIIiSI), where SII is the index set for the sampled units. The final sample size is n=iSIni. The design weight for unit j in cluster i be ωij=(πiπj|i)1, which reflects the number of units for cluster i in the finite population this unit j represents. We assume that the design weights are positive and known throughout the sample. Also, let πkl|i=pr{(k,l)SIIiSI} be the second-order inclusion probability for units k and l being sampled given that cluster i was selected. The second-order inclusion probabilities, πij and πkl|i, are often used for variance estimation.

For clustered survey data, if the propensity score e(Xij,Ui) is known, we can express the IPTW estimator of τ as

(16)τˆIPTW=1NiSIj=1niωijAijYije(Xij,Ui)(1Aij)Yij1e(Xij,Ui).

Let Eξ and Ep denote expectation under the super-population model and the sampling design, respectively. It is easy to verify that

E(τˆIPTW)=Eξ{Ep(τˆIPTW)}=Eξ1Ni=1Mj=1NiAijYije(Xij,Ui)(1Aij)Yij1e(Xij,Ui)=τ.

In practice, because the propensity score e(Xij,Ui) is often unknown, (16) is not feasible. To estimate the propensity score, we now require the propensity score estimate eˆij satisfy the following design-weighted moment constraints

(17)iSIj=1niωijAijeˆijXij=iSIj=1niωij1Aij1eˆijXij=iSIj=1niωijXij,
(18)j=1niωijAijeˆij=j=1niωij1Aij1eˆij=Ni,(iSI).
These moment constraints (17) and (18) are the sample version of (4) and (5), respectively.

To obtain the propensity score estimate that achieves (17) and (18), we use the calibration technique in the following steps:

Step 1.

Obtain an initial propensity score estimate eˆij0 under some working propensity score model, e. g. a logistic linear mixed effect model, each unit weighted by the design weight ωij. This in turn provides an initial set of inverse propensity score weights, W0={dij:i=1,,m;j=1,,ni}, where dij=1/eij0 if Aij=1 and dij=1/(1eij0) if Aij=0.

Step 2.

Modify the initial set of weights W0 to a new set of weights W={αij:i=1,,m;j=1,,ni} by minimizing i=1mj=1niωijαijlog(αij/dij), subject to (17) and (18). By Lagrange Multiplier, αij can be obtained as

(19)αij(λ1,λ2)=NiAijdijexp(λ1XijAij)j=1niωijAijdijexp(λ1XijAij)+Ni(1Aij)dijexp{λ2Xij(1Aij)}j=1niωij(1Aij)dijexp{λ2Xij(1Aij)},

where (λ1,λ2) is the solution to the following equation

(20)Qˆ(λ1,λ2)=Qˆ1(λ1,λ2)Qˆ2(λ1,λ2)=N1iSIj=1niωijAijαij(λ1,λ2)1XijN1iSIj=1niωij(1Aij)αij(λ1,λ2)1Xij=0.
Step 3.

Obtain the propensity score estimate as

eˆij=αij(λˆ1,λˆ2)Aij{1αij(λˆ1,λˆ2)}1+Aij.
Finally, our proposed IPTW estimator is

(21)τˆIPTW=1NiSIj=1niωijAijYijeˆij(1Aij)Yij1eˆij.

In the above procedure, the design wights are used in both the propensity score estimates and the weighting estimator.

We now consider the asymptotic property of τˆIPTW in (21). We use an asymptotic framework, where the sample size n indexes a sequence of finite populations and samples ([65]; Section 1.3), such that the population size N increases with n. In addition, we have the following regularity conditions for the sampling mechanism.

Assumption 6.

(i) The first-order inclusion probabilityπiπj|iis positive and uniformly bounded in the sense that there exist positive constantsC1andC2that do not depend on N, such thatC1πiπj|iNn1C2, for any i and j; (ii) the sequence of Horvitz-Thompson estimatorsYˆHT=N1iSIj=1niωijyisatisfiesvarp(YˆHT)=O(n1)and{varp(YˆHT)}1/2(YˆHTY¯)FNN(0,1), in distribution, asn, whereY¯=N1i=1Mj=1Niyiis the population mean of Y, and the reference distribution is the randomization distribution generated by the sampling mechanism.

Sufficient conditions for the asymptotic normality of the Horvitz-Thompson estimators are discussed in Chapter 1 of Fuller [65].

Theorem 2.

Suppose that Assumptions26, and the regularity conditions specified in the Appendix hold. Suppose further that the cluster sample sizesNi, fori=1,,M, satisfy the condition thatmin1iMNiandsup1iMNi=O(N1/2). If the outcome model (1) is a linear mixed effects model or the propensity score model (2) is correctly specified, the proposed propensity score weighting estimator in (21), subject to constraints (17) and (18), satisfies

V21(τˆIPTWτ)N(0,1),
in distribution, asn, whereV2=var(N1iSIj=1niωijτij),
τij={αij(λ1,λ2)Aij(YijB1TXij)+B1TXij}{αij(λ1,λ2)(1Aij)(YijB2TXij)+B2TXij},B1=Ei=1Mj=1Niαij(λ1,λ2)1αij(λ1,λ2)niAijYijXijTEi=1Mj=1Niαij(λ1,λ2)1αij(λ1,λ2)niAijXijXijT1,B2=Ei=1Mj=1Niαij(λ1,λ2)1αij(λ1,λ2)ni(1Aij)YijXijTEi=1Mj=1Niαij(λ1,λ2)1αij(λ1,λ2)ni(1Aij)XijXijT1,
and(λ1,λ2)satisfiesE{Qˆ(λ1,λ2)}=0withQˆ(λ1,λ2)defined in (20).

For variance estimation of τˆIPTW, let τˆij=αij(λˆ1,λˆ2){Aij(YijBˆ1TXij)(1Aij)(YijBˆ2TXij)}+(Bˆ1Bˆ2)TXij, where

Bˆ1=iSIj=1niωijαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)niAijYijXijTiSIj=1niωijαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)niAijXijXijT1,Bˆ2=iSIj=1niωijαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)ni(1Aij)YijXijTiSIj=1niωijαij(λˆ1,λˆ2)1αij(λˆ1,λˆ2)ni(1Aij)XijXijT1.

Let τˆi=j=1niπj|i1τˆij and

Vˆi=k=1nil=1niπkl|iπk|iπl|iπkl|iτˆikπk|iτˆilπl|i.

The variance estimator can be constructed as

Vˆ(τˆIPTW)=1N2iSIjSIπijπiπjπijτˆiπiτˆjπj+iSIVˆiπi.

6 Simulation studies

We conduct two simulation studies to evaluate the finite-sample performance of the proposed estimator, assessing its robustness against model misspecification in Section 6.1 and the robustness against omitting a unit-level confounder in Section 6.2.

6.1 Robustness against model misspecification

We first generate finite populations and then select a sample from each finite population using a two-stage cluster sampling design. In the first setting, we specify the number of clusters in the population to be M=10,000, and the size of the ith cluster size Ni to be the integer part of 500×exp(2+Ui)/{1+exp(2+Ui)}, where UiN(0,1). The cluster sizes range from 100 to 500. The potential outcomes are generated according to linear mixed effects models, Yij(0)=Xij+Ui+eij and Yij(1)=Xij+τ+τUi+eij, where τ=2, XijN(0,1), eijN(0,1), Ui, Xij, and eij are independent, for i=1,,M, j=1,,Ni. The parameter of interest is τ. We consider three propensity score models, pr(Aij=1Xij;Ui)=h(γ0+γ1Ui+Xij), with h(·) being the inverse logit, probit and complementary log-log link function, for generating Aij. We set (γ0,γ1) to be (−0.5,1), (−0.25,0.5) and (−0.5,0.1) for the above three propensity score models, respectively. The observed outcome is Yij=AijYij(1)+(1Aij)Yij(0). From each realized population, m clusters are sampled by Probability-Proportional-to-Size (PPS) sampling with the measure of size Ni. So the first-order inclusion probability of selecting cluster i is equal to πi=mNi/i=1mNi, which implicitly depends on the unobserved random effect. Once the clusters are sampled, the ni units in the ith selected cluster are sampled by Poisson sampling with the corresponding first-order inclusion probability πj|i=nezij/(j=1Mizij), where zij=0.5 if eij<0 and 1 if eij>0. With this sampling design, the units with eij>0 are sampled with a chance twice as big as the units with eij<0. We consider four combinations of m and ne: (i) (m,ne)=(50,50); (ii) (m,ne)=(100,30), representing a large number of small clusters; (iii) (m,ne)=(30,100); and (iv) (m,ne)=(5,100), representing a small number of large clusters.

In the second setting, all data-generating mechanisms are the same as in the first setting, except that the potential outcomes are generated according to logistic linear mixed effects models, Yij(0)Bernoulli(pij0) with logit(pij0)=Xij+Ui and Yij(1)Bernoulli(pij1) with logit(pij1)=Xij+τ+τui. Moreover, in the 2-stage cluster sampling, πj|i=nezij/(j=1Mizij), where zij=0.5 if Yij=0 and 1 if Yij=1. With this sampling design, the units with Yij=1 are sampled with a chance twice as big as the units with Yij=0.

We compare four estimators for τ: (i) τˆsimp, the simple design-weighted estimator without propensity score adjustment; (ii) τˆfix, the weighting estimator in (3) with the propensity score estimated by a logistic linear fixed effects model with fixed cluster intercepts; (iii) τˆran, the weighting estimator in (3) with the propensity score estimated by a logistic linear mixed effects model with random cluster intercepts; (iv) τˆIPTW, the proposed estimator with calibrations (17) and (18).

Table 1 shows biases, variances and coverages for 95% confidence intervals from 1,000 simulated data sets. The simple estimator shows large biases across difference scenarios, even adjusting for sampling design. This suggests that the covariate distributions are different between the treatment groups in the finite population, contributing to the bias. τˆfix works well under Scenario 1 with the linear mixed effects model for the outcome and the logistic linear mixed effects model for the propensity score; however, its performance is not satisfactory in other scenarios. Moreover, τˆfix shows the largest variance among the four estimators in most of scenarios. This is because for a moderate or large number of clusters, there are too many free parameters, and hence the propensity score estimates may not be stable. For τˆran, we assume that the cluster effect is random, which reduces the number of free parameters. As a result, τˆran shows less variability than τˆfix. Nonetheless, both τˆfix and τˆran cannot control the bias well. The proposed estimator shows small bias and good empirical coverage across all scenarios. Notably, to compute τˆIPTW, we used a working model, a logistic linear model, to provide an initial set of weights. When the true propensity score is probit or complementary log-log model, τˆIPTW still has small biases. This suggests that our proposed estimator achieves improved robustness compared to the existing weighting estimators.

6.2 Robustness against omitting a unit-level (higher moment of) confounder

The data generating mechanisms are the same as in Section 6.1, except that in the potential outcomes and the treatment assignment models, we use a squared covariate instead of the original covariate. Now, for the potential outcomes models, we have in the first setting, Yij(0)=Xij2+Ui+eij and Yij(1)=Xij2+τ+τUi+eij, where τ=2, XijN(0,1), eijN(0,1), Ui, Xij, and eij are independent, for i=1,,M, j=1,,Ni; while in the second setting, Yij(0)Bernoulli(pij0) with logit(pij0)=Xij2+Ui and Yij(1)Bernoulli(pij1) with logit(pij1)=Xij2+τ+τui. For three propensity score models, we now have pr(Aij=1Xij;Ui)=h(γ0+γ1Ui+Xij2), with h(·) being the inverse logit, probit and complementary log-log link function, for generating Aij. We set (m,ne)=(50,50). In the proposed method, the calibration condition (17) is imposed only for the first moment of Xij. This represents the case of omitting a unit-level confounder.

Table 1

Simulation results: bias, variance (×103) and coverage (%) of 95 % confidence intervals based on 1,000 Monte Carlo samples; the outcome is linear and logistic linear mixed effects model and the propensity score is logistic, probit or complementary log-log (C-loglog).

(m,ne)(50,50)(100,30)(30,100)(5,100)
biasvarcvgbiasvarcvgbiasvarcvgbiasvarcvg
Scenario 1: Linear outcome & Logistic propensity score
τˆsimp-0.372227.4-0.38128.7-0.383542.3-0.3822855.1
τˆfix-0.013695.60.002195.6-0.014295.2-0.0129882.0
τˆran0.142690.20.211464.60.073794.70.0726387.1
τˆcal0.012694.50.021195.10.003395.60.0024593.3
Scenario 2: Linear outcome & Probit propensity score
τˆsimp-0.291634.4-0.0892.3-0.223065.6-0.3016258.0
τˆfix0.083590.3-0.10194.50.126990.40.0734182.5
τˆran0.242873.9-0.071629.90.216085.50.1530386.7
τˆcal0.012294.90.011195.40.003394.60.0025295.1
Scenario 3: Linear outcome & C-loglog propensity score
τˆsimp-0.212062.0-0.211041.2-0.223065.6-0.2124065.0
τˆfix0.124888.80.123682.70.126990.40.1344580.6
τˆran0.293869.10.362232.50.216085.50.2244184.6
τˆcal0.002195.30.001095.10.003394.6-0.0124894.1
Scenario 4: Logistic outcome & Logistic propensity score
τˆsimp-0.111009.1-0.115400.5-0.1116020.5-0.11962.9
τˆfix-0.11440.3-0.11380.1-0.11390.1-0.11330.6
τˆran-0.09331.3-0.08210.5-0.10340.3-0.10245.8
τˆcal0.017496.30.015595.20.017495.90.01594.4
Scenario 5: Logistic outcome & Probit propensity score
τˆsimp-0.085813.1-0.08342.3-0.088125.3-0.07565.9
τˆfix-0.10936.9-0.10854.5-0.10733.8-0.10550.5
τˆran-0.086723.0-0.074829.9-0.09618.3-0.09467.0
τˆcal0.018994.70.016595.40.018495.00.01695.4
Scenario 6: Logistic outcome & C-loglog propensity score
τˆsimp-0.060.33.2-0.060.21.0-0.060.23.7-0.06262.0
τˆfix-0.050.544.6-0.050.543.6-0.050.543.0-0.05376.8
τˆran-0.030.595.4-0.030.497.3-0.030.492.8-0.03393.4
τˆcal-0.010.795.50.000.695.8-0.010.795.20.00595.9

Table 2

Simulation results: bias, variance (×103) and coverage (%) of 95 % confidence intervals based on 1,000 Monte Carlo samples; the outcome is linear and logistic linear mixed effects model and the propensity score is logistic, probit or complementary log-log (C-loglog).

Scenario123
Linear outcomeLinear outcomeLinear outcome
Logistic PSProbit PSC-loglog PS
biasvarcvgbiasvarcvgbiasvarcvg
τˆsimp-0.84342.5-0.88220.6-0.66284.6
τˆfix0.005295.2-0.068792.70.5912147.8
τˆran0.114195.70.087296.70.679133.8
τˆcal-0.023194.7-0.084293.90.333984.0
Scenario456
Logistic outcomeLogistic outcomeLogistic outcome
Logistic PSProbit PSC-loglog PS
biasvarcvgbiasvarcvgbiasvarcvg
τˆsimp-0.140.841.1-0.130.520.2-0.120.220.0
τˆfix-0.100.808.9-0.111.7921.4-0.010.8693.9
τˆran-0.080.6015.0-0.091.2832.40.010.7199.3
τˆcal0.000.8195.5-0.011.0895.70.060.7886.1

Table 2 shows biases, variances and coverages for 95% confidence intervals from 1,000 simulated data sets. The proposed estimator τˆIPTW does not control bias well in some scenarios, similar to all other estimators. This is in line with the consensus in the causal inference literature that all unit-level confounders, including higher moments if present, must be properly controlled for in order to obtain a consistent causal effect estimator.

7 An application

Ethical approval: The conducted research uses an existing de-identified dataset and is not considered as human subject research by the authors’ institutional review board. We examine the 2007–2010 BMI surveillance data from Pennsylvania Department of Health to investigate the effect of School Body Mass Index Screening (SBMIS) on the annual overweight and obesity prevalence in elementary schools in Pennsylvania. Early studies have shown that SBMIS has been associated with increased parental awareness of child weight [66], [67]. However, there have been mixed findings about the effect of screening on reducing prevalence of overweight and obesity [66], [68]. The data set includes 493 schools in Pennsylvania. The baseline is the school year 2007. Previous studies (e. g. [69]) have shown that two high-level contextual factors are strongly associated with school policies for SBMIS: type of community (rural, suburban, and urban), and population density (low, median, and high). Therefore, we cluster schools according to these two factors. This results in five clusters: rural-low, rural-median, rural-high, suburban-high, and urban-high, with cluster sample sizes n1=33, n2=118, n3=116, n4=84, and n5=142, respectively.

Let A=1 if the school implemented SBMIS, and A=0 if the school did not. In this data set, 63% of schools implemented SBMIS, and the percentages of schools implemented SBMIS across the clusters range from 45% to 70%, indicating cluster-level heterogeneity of treatment. The outcome variable Y is the annual overweight and obesity prevalence for each school in the school year 2010. The prevalence is calculated by dividing the number of students with BMI > 85th by the total number of students screened for each school. Therefore, the outcome was measured for each school. For each school, we obtain school characteristics including the baseline prevalence of overweight and obesity (X1), and percentage of reduced and free lunch (X2).

For a direct comparison, the average difference of the prevalence of overweight and obesity for schools that implemented SBMIS and those that did not is 8.78%. This unadjusted difference in the prevalence of overweight and obesity ignores differences in schools and clusters. To take the cluster-level heterogeneity of treatment into account, we consider three propensity score models: (i) a logistic linear fixed effects model with linear predictors including X1, X2, and a fixed intercept for each cluster; (ii) a logistic linear mixed effects model with linear predictors including X1, X2, and a random intercept for each cluster; (iii) the proposed calibrated propensity score. Using the estimated propensity score, we estimate the average treatment effect τ by the weighting method.

Table 3 displays the standardized differences of means for X1 and X2 between the treated and control groups for each cluster and the whole population, standardized by the standard errors in the whole population. Without any adjustment, there are large differences in means for X1 and X2. For this specific data set, the three methods for modeling and estimating the propensity score are similar in balancing the covariate distributions between the treated and control groups. All three propensity score weighting methods improve the balance for X1 and X2. Table 4 displays point estimates and variance estimates based on 500 bootstrap replicates. The simple estimator shows that the screening has a significant effect in reducing the prevalence of overweight and obesity. However, this may be due to confounders. After adjusting for the confounders, the screening does not have a significant effect. Given the different sets of assumptions for the different methods, this conclusion is reassuring.

Table 3

Balance Check.

simplefixedrandomcalibration
X1Cluster 11.68-0.220.680.20
Cluster 21.210.10-0.410.10
Cluster 31.75-0.020.990.02
Cluster 40.86-0.04-1.050.02
Cluster 5-0.360.37-1.390.33
Whole Pop1.28-0.02-0.020
X2Cluster 10.480.020.300.03
Cluster 20.430.13-0.010.14
Cluster 30.730.010.460.02
Cluster 40.18-0.08-0.34-0.07
Cluster 5-0.57-0.39-1.53-0.44
Whole Pop0.39-0.003-0.0010

Table 4

Results: estimate, variance estimate (ve) based on 500 bootstrap replicates, and 95 % confidence interval (c.i.).

estimateve95% c.i.
simple8.782.11(5.94, 11.63)
fixed0.470.44(-0.83, 1.77)
random0.520.44(-0.77, 1.82)
calibration0.530.39(-0.71, 1.76)

8 Discussion

We provide a doubly robust construction of inverse propensity score weights by imposing the exact balance of unit- and (observed and unobserved) cluster-level covariate distributions between the treatment groups. When either the treatment assignment is correctly specified or the outcome follows a linear mixed effects model, we show that consistent estimation of the average treatment effect is possible. Our simulation examines the robustness property of the proposed estimator under various data generating mechanisms. The results suggest that if all confounders in the linear predictors of the treatment and outcome models (including all higher moments) achieve a good balance between two treatment groups, the proposed estimator is robust. The balance conditions help to satisfy the underlying latent ignorable treatment assignment assumption, and may be particularly useful in the case where not sufficient cluster-level confounders are available. In this case, misspecification of the propensity score model has little impact on the bias of the causal effect estimator.

Moreover, our simulation results also indicate that robustness may not hold in the case where higher moments of unit-level confounders are present however are omitted in the balance constraints. This is similar to the case when there are unmeasured unit-level confounders. We therefore emphasize that unbiased estimation of the average treatment effect requires that all unit-level confounders be sufficiently controlled for.

It is well known that the IPTW estimator is sensitive to near-zero values of the estimated propensity score (e. g. [70], [34], [8]). Our proposed estimator prevents the occurrence of extreme values of weights through the calibration constraints where the weights within each cluster are positive and sum to the cluster sample sizes. Therefore, it is unlikely that some units receive extremely large weights that dominate.

We have focused on two-level data with a binary treatment and the average treatment effect over the full population in this article. Our proposed method can be easily generalized to many other scenarios not discussed here, such as multi-level data, multiple treatments [10] or other causal estimands, e. g., the average causal effects over a subset of population [71], [72], [73], including the average causal effect on the treated.

The IPTW estimator is not efficient in general. Semiparametric efficiency bounds for estimating the average treatment effects in the setting with i.i.d. random variables were derived by Hahn [74]. He showed that the efficient influence function for the average treatment effect depends on both the propensity score and the outcome model. An important implication is that combining the propensity score model and the outcome regression model can improve efficiency of the IPTW estimator. For clustered data, because the data are correlated through the random cluster variables, the efficiency theory established for the i.i.d. data is not applicable. It remains an interesting avenue for future research to develop the semiparametric efficiency theory for clustered data.

Funding source: Division of Mathematical Sciences

Award Identifier / Grant number: 1811245

Funding statement: The author acknowledges the support in part by Ralph E. Powe Junior Faculty Enhancement Award from Oak Ridge Associated Universities and NSF grant DMS 1811245.

Acknowledgment

The author would like to thank Peng Ding and Fan Li for insightful and fruitful discussions. The author is grateful to the Associated Editor and two referees for helpful comments and suggestions that have greatly improved the original submission. Conflict of interest: Authors state no conflict of interest.

Appendix Regularity conditions and proof of Theorem 1

We formulate the proposed estimator as a Z-estimator (e. g., [75]), which invokes the standard Z-estimation theory to show the asymptotic properties. Write τˆIPTW(λ1,λ2)=n1i=1mj=1niαij(λ1,λ2)Yij, where αij(λ1,λ2) is defined in (10). The proposed estimator is τˆIPTW(λˆ1,λˆ2), where (λˆ1,λˆ2) satisfies Qˆ(λˆ1,λˆ2)=0, where Qˆ(λ1,λ2) is defined in (11). Define Q(λ1,λ2)=limnE{Qˆ(λ1,λ2)}, and define (λ1,λ2) that satisfy Q(λ1,λ2)=0. Denote AB as A=B+oP(n1/2), where the reference distribution is the super-population model, as n.

We impose the following regularity conditions.

Condition A1.Qˆ(λ1,λ2)Q(λ1,λ2)in probability uniformly for(λ1,λ2)Basn, and there exists a unique(λ1,λ2)Bsuch thatQ(λ1,λ2)=0;

Condition A2.τˆIPTW(λ1,λ2)/(λ1T,λ2T)andQˆ(λ1,λ2)/(λ1T,λ2T)are continuous at(λ1,λ2)Balmost surely;

Condition A3.The matrix

EQˆ(λ1,λ2)(λ1T,λ2T)T=E1ni=1mj=1niαij(λ1,λ2)1αij(λ1,λ2)niAijXijXijT
is invertible;

Condition A4.E||Xij||3<,E|Yij(0)|3<, andE|Yij(1)|3<.

The convergence in Condition A1 is uniform convergence. That is, given ϵ>0, there exists n0=n0(ϵ) such that pr{|Qˆ(λ1,λ2)Q(λ1,λ2)|>ϵ}ϵ holds for all nn0 and (λ1,λ2)B. A sufficient condition for the uniform convergence is that |αij(λ1,λ2)|<M for all (i,j) and (λ1,λ2)B, where M is a constant. Condition A4 specifies moment conditions for the central limit theorem. Conditions A1–A4 are standard regularity conditions on Z-estimation; see, e. g., [75]. However, the regularity conditions may be difficult to check in practice. We give an extreme example where a certain condition is violated. For example, if two covariates in X are perfectly correlated, then Condition A3 fails to hold. Aside from extreme cases, the regularity conditions are often satisfied for the models we are considering and reasonable choices of covariates in practice.

Under Conditions A1–A4, using the standard linearization technique, we obtain

(A1)τˆIPTW(λˆ1,λˆ2)τˆIPTW(λ1,λ2)EτˆIPTW(λ1,λ2)(λ1T,λ2T)EQˆ(λ1,λ2)(λ1T,λ2T)1Qˆ(λ1,λ2)τˆIPTW(λ1,λ2)B1TQˆ1(λ1,λ2)B2TQˆ2(λ1,λ2).

First, consider the case when the initial propensity score model is correctly specified and consistently estimated. We have eij0eij and λ1=λ2=0. This is because with λ1=λ2=0, limnE{Qˆ(λ1,λ2)}=0. We now evaluate the terms in (A1) further. We express τˆIPTW(0,0) as

(A2)n1i=1mj=1niαij(0,0)Yij=n1i=1mj=1nidijAijYijni1j=1nidijAijn1i=1mj=1nidij(1Aij)Yijni1j=1nidij(1Aij)=n1i=1mj=1nidijAijYij(1)ni1j=1nidijAijn1i=1mj=1nidij(1Aij)Yij(0)ni1j=1nidij(1Aij)n1i=1mj=1nieij1AijYij(1)ni1j=1nieij1Aijn1i=1mj=1ni(1eij)1(1Aij)Yij(0)ni1j=1ni(1eij)1(1Aij)τ,

where the third line follows by the consistency assumption, the forth line follows by the assumption that the initial propensity score model is correctly specified, and the last line follows by the strong law of large numbers and the condition of min1imni. Also, following the similar argument, we obtain

(A3)Qˆ1(0,0)=n1i=1mj=1niAijαij(0,0)1Xij0,
(A4)Qˆ2(0,0)=n1i=1mj=1ni(1Aij)αij(0,0)1Xij0.
Combining (A1)–(A4), we obtain τˆIPTW(λˆ1,λˆ2)τ.

Second, consider the case when the outcome follows a linear mixed effects model. We do not assume that the initial propensity score model is correctly specified, and therefore λ1 and λ2 in (A1) are not necessarily zero. Conditions A1–A4 ensure that (A1) is consistent for some parameter. We have shown in Section 4 that the proposed estimator is asymptotically unbiased for τ. It follows that τˆIPTW(λˆ1,λˆ2)τ.

To derive asymptotic variance formula, continuing (A1), we obtain

τˆIPTW(λˆ1,λˆ2)=1ni=1mj=1niαij(λ1,λ2)Aij(YijB1TXij)+B1TXij1ni=1mj=1niαij(λ1,λ2)(1Aij)(YijB2TXij)+B2TXij=1ni=1mj=1niτij,

where

τij=αij(λ1,λ2)Aij(YijB1TXij)+B1TXijαij(λ1,λ2)(1Aij)(YijB2TXij)+B2TXij.

Therefore, var(τˆIPTW)=var(n1i=1mj=1niτij), denoted as V1.

To establish the asymptotic normality of τˆIPTW, we use the central limit theory for dependent variables [76], [77]. Let var(τij)=στ2 and cov(τij,τik)=ντ for jk. Arrange the τij’s in a n-length sequence {τ11,,τ1n1,τ21,,τmnm}. To simplify the notation, let the kth random variable in this sequence be denoted by τk, for k=1,,n. We now consider such sequences {τk:k=1,,n} are indexed by n. By Condition A4, the absolute central moments E|τkE(τk)|3 are bounded uniformly in k. Moreover, by the assumption of sup1imni=O(n1/2), we then have var(k=a+1a+nτk)nA2, uniformly in a, as n, where A2 is a positive constant. Following Serfling [77], these are typical criterion for verifying the Lindeberg condition [78], and therefore V11/2(τˆIPTWτ)N(0,1), in distribution, as n.

Appendix Regularity conditions for Theorem 2

For the clustered survey data, we now write τˆIPTW(λ1,λ2)=N1iSIj=1niωij×αij(λ1,λ2)Yij, where αij(λ1,λ2) is defined in (19). The proposed estimator is τˆIPTW(λˆ1,λˆ2), where (λˆ1,λˆ2) satisfies Qˆ(λˆ1,λˆ2)=0, where Qˆ(λ1,λ2) is defined in (20). We assume Conditions A1–A4 holds with the new definitions of τˆIPTW(λ1,λ2), αij(λ1,λ2), and Qˆ(λ1,λ2).

References

1. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.10.1037/h0037350Search in Google Scholar

2. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.10.21236/ADA114514Search in Google Scholar

3. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33–8.10.1017/CBO9780511810725.019Search in Google Scholar

4. Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25:1–21.10.1214/09-STS313Search in Google Scholar

5. Abadie A, Imbens GW. Matching on the estimated propensity score. Econometrica. 2016;84:781–807.10.3386/w15301Search in Google Scholar

6. Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Serv Outcomes Res Methodol. 2001;2:259–78.10.1023/A:1020371312283Search in Google Scholar

7. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–73.10.1111/j.1541-0420.2005.00377.xSearch in Google Scholar

8. Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–34.10.1093/biomet/asp033Search in Google Scholar

9. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79:516–24.10.1017/CBO9780511810725.018Search in Google Scholar

10. Yang S, Imbens GW, Cui Z, Faries DE, Kadziola Z. Propensity score matching and subclassification in observational studies with multi-level treatments. Biometrics. 2016;72:1055–65.10.1111/biom.12505Search in Google Scholar

11. Imbens GW, Rubin DB. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge UK: Cambridge University Press; 2015.10.1017/CBO9781139025751Search in Google Scholar

12. Hong G, Raudenbush SW. Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. J Am Stat Assoc. 2006;101:901–10.10.1198/016214506000000447Search in Google Scholar

13. Griswold ME, Localio AR, Mulrow C. Propensity score adjustment with multilevel data: setting your sites on decreasing selection bias. Ann Intern Med. 2010;152:393–5.10.7326/0003-4819-152-6-201003160-00010Search in Google Scholar

14. Li F, Zaslavsky AM, Landrum MB. Propensity score weighting with multilevel data. Stat Med. 2013;32:3373–87.10.1002/sim.5786Search in Google Scholar

15. Rubin DB. Bayesian inference for causal effects: The role of randomization. Ann Stat. 1978;6:34–58.10.1214/aos/1176344064Search in Google Scholar

16. Ross R. An application of the theory of probabilities to the study of a priori pathometry. part i. Proc R Soc Lond, a Contain Pap Math Phys Character. 1916;92:204–30.10.1098/rspa.1916.0007Search in Google Scholar

17. Hudgens MG, Halloran ME. Toward causal inference with interference. J Am Stat Assoc. 2008;103:832–42.10.1198/016214508000000292Search in Google Scholar

18. Oakes JM. The (mis) estimation of neighborhood effects: causal inference for a practicable social epidemiology. Soc Sci Med. 2004;58:1929–52.10.1016/j.socscimed.2003.08.004Search in Google Scholar

19. VanderWeele TJ. Ignorability and stability assumptions in neighborhood effects research. Stat Med. 2008;27:1934–43.10.1002/sim.3139Search in Google Scholar

20. Hong G, Yu B. Early-grade retention and children’s reading and math learning in elementary years. Educ Eval Policy Anal. 2007;29:239–61.10.3102/0162373707309073Search in Google Scholar

21. Hong G, Yu B. Effects of kindergarten retention on children’s social-emotional development: An application of propensity score method to multivariate, multilevel data. Dev Psychol. 2008;44:407–21.10.1037/0012-1649.44.2.407Search in Google Scholar

22. Kim J, Seltzer M. Causal inference in multilevel settings in which selection processes vary across schools. Technical Report Working Paper 708. University of California, Los Angeles, Center for the Study of Evaluation; 2007.10.1037/e644002011-001Search in Google Scholar

23. Kelcey BM. Improving and assessing propensity score based causal inferences in multilevel and nonlinear settings. PhD thesis. University of Michigan; 2009.Search in Google Scholar

24. Arpino B, Mealli F. The specification of the propensity score in multilevel observational studies. Comput Stat Data Anal. 2011;55:1770–80.10.1016/j.csda.2010.11.008Search in Google Scholar

25. Thoemmes FJ, West SG. The use of propensity scores for nonrandomized designs with clustered data. Multivar Behav Res. 2011;46:514–43.10.1080/00273171.2011.569395Search in Google Scholar

26. Kim J-S, Steiner PM. Multilevel propensity score methods for estimating causal effects: A latent class modeling strategy. In: Quantitative Psychology Research. Springer; 2015. p. 293–306.10.1007/978-3-319-19977-1_21Search in Google Scholar

27. Leite WL, Jimenez F, Kaya Y, Stapleton LM, MacInnes JW, Sandbach R. An evaluation of weighting methods based on propensity scores to reduce selection bias in multilevel observational studies. Multivar Behav Res. 2015;50:265–84.10.1080/00273171.2014.991018Search in Google Scholar

28. Schuler MS, Chu W, Coffman D. Propensity score weighting for a continuous exposure with multilevel data. Health Serv Outcomes Res Methodol. 2016;16:271–92.10.1007/s10742-016-0157-5Search in Google Scholar

29. Xiang Y, Tarasawa B. Propensity score stratification using multilevel models to examine charter school achievement effects. J School Choice. 2015;9:179–96.10.1080/15582159.2015.1028862Search in Google Scholar

30. Su Y-S, Cortina J. What do we gain? combining propensity score methods and multilevel modeling. In: Annual Meeting of the American Political Science Association. Toronto, Canada; 2009.Search in Google Scholar

31. Eckardt P. Propensity score estimates in multilevel models for causal inference. Nurs Res. 2012;61:213–23.10.1097/NNR.0b013e318253a1c4Search in Google Scholar

32. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–66.10.1080/01621459.1994.10476818Search in Google Scholar

33. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23:2937–60.10.1002/sim.7231Search in Google Scholar

34. Kang JD, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22:523–39.Search in Google Scholar

35. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiol Drug Saf. 2004; 13:855–7.10.1002/pds.968Search in Google Scholar

36. Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81:945–60.10.1080/01621459.1986.10478354Search in Google Scholar

37. Dawid AP. Conditional independence in statistical theory. J R Stat Soc, Ser B, Stat Methodol. 1979;41:1–31.10.1111/j.2517-6161.1979.tb01052.xSearch in Google Scholar

38. Stuart EA. Estimating causal effects using school-level data sets. Educ Res. 2007;36:187–98.10.3102/0013189X07303396Search in Google Scholar

39. Baltagi B. Econometric Analysis of Panel Data. New York: John Wiley & Sons, Wiley; 1995.Search in Google Scholar

40. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT press; 2002.Search in Google Scholar

41. Wallace TD, Hussain A. The use of error components models in combining cross section with time series data. Econometrica. 1969;37:55–72.10.2307/1909205Search in Google Scholar

42. Skinner CJ, et al.. Inverse probability weighting for clustered nonresponse. Biometrika. 2011;98:953–66.10.1093/biomet/asr058Search in Google Scholar

43. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22:79–86.10.1214/aoms/1177729694Search in Google Scholar

44. Wu C, Sitter RR. A model-calibration approach to using complete auxiliary information from survey data. J Am Stat Assoc. 2001;96:185–93.10.1198/016214501750333054Search in Google Scholar

45. Chen J, Sitter R, Wu C. Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika. 2002;89:230–7.10.1093/biomet/89.1.230Search in Google Scholar

46. Särndal C-E, Lundström S. Estimation in Surveys with Nonresponse. New York: John Wiley & Sons; 2005.10.1002/0470011351Search in Google Scholar

47. Kott PS. Using calibration weighting to adjust for nonresponse and coverage errors. Surv Methodol. 2006;32:133–42.Search in Google Scholar

48. Chang T, Kott PS. Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika. 2008;95:555–71.10.1093/biomet/asn022Search in Google Scholar

49. Kim JK, Kwon Y, Paik MC. Calibrated propensity score method for survey nonresponse in cluster sampling. Biometrika. 2016;103:461–73.10.1093/biomet/asw004Search in Google Scholar

50. Qin J, Zhang B. Empirical-likelihood-based inference in missing response problems and its application in observational studies. J R Stat Soc B. 2007;69:101–22.10.1111/j.1467-9868.2007.00579.xSearch in Google Scholar

51. Hainmueller J. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Polit Anal. 2012;20:25–46.10.1093/pan/mpr025Search in Google Scholar

52. Graham BS, Pinto CCDX, Egel D. Inverse probability tilting for moment condition models with missing data. Rev Econ Stud. 2012;79:1053–79.10.3386/w13981Search in Google Scholar

53. Imai K, Ratkovic M. Covariate balancing propensity score. J R Stat Soc B. 2014;76:243–63.10.1111/rssb.12027Search in Google Scholar

54. Chan KCG, Yam SCP, Zhang Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J R Stat Soc B. 2015;78:673–700.10.1111/rssb.12129Search in Google Scholar

55. Park M, Fuller WA. Generalized regression estimators. Encycl Environmetrics. 2012;2:1162–6.10.1002/9780470057339.vag012.pub2Search in Google Scholar

56. Newey WK, Smith RJ. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica. 2004;72:219–55.10.1111/j.1468-0262.2004.00482.xSearch in Google Scholar

57. Kitamura Y, Stutzer M. An information-theoretic alternative to generalized method of moments estimation. Econometrica. 1997;65:861–74.10.2307/2171942Search in Google Scholar

58. Imbens G, Johnson P, Spady RH. Information theoretic approaches to inference in moment condition models. Econometrica. 1998;66:333–57.10.3386/t0186Search in Google Scholar

59. Schennach SM. Point estimation with exponentially tilted empirical likelihood. Ann Stat. 2007;35:634–72.10.1214/009053606000001208Search in Google Scholar

60. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods. 2004. 403–425.10.1037/1082-989X.9.4.403Search in Google Scholar

61. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf. 2008;17:546–55.10.1002/pds.1555Search in Google Scholar

62. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29:337–46.10.1002/sim.3782Search in Google Scholar

63. Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators’ robustness to model misspecification using super learner. Am J Epidemiol. 2014;181:108–19.10.1093/aje/kwu253Search in Google Scholar

64. Deville J-C, Särndal C-E. Calibration estimators in survey sampling. J Am Stat Assoc. 1992;87:376–82.10.1080/01621459.1992.10475217Search in Google Scholar

65. Fuller WA. Sampling Statistics. Hoboken, NJ: Wiley; 2009.10.1002/9780470523551Search in Google Scholar

66. Harris KC, Kuramoto LK, Schulzer M, Retallack JE. Effect of school-based physical activity interventions on body mass index in children: a meta-analysis. Can Med Assoc J. 2009;180:719–26.10.1503/cmaj.080966Search in Google Scholar

67. Ebbeling CB, Feldman HA, Chomitz VR, Antonelli TA, Gortmaker SL, Osganian SK, Ludwig DS. A randomized trial of sugar-sweetened beverages and adolescent body weight. N Engl J Med. 2012;367:1407–16.10.1056/NEJMoa1203388Search in Google Scholar

68. Thompson JW, Card-Higginson P. Arkansas’ experience: statewide surveillance and parental information on the child obesity epidemic. Pediatrics. 2009;124:73–82.10.1542/peds.2008-3586JSearch in Google Scholar

69. Peyer KL, Welk G, Bailey-Davis L, Yang S, Kim J-K. Factors associated with parent concern for child weight and parenting behaviors. Childhood Obesity. 2015;11:269–74.10.1089/chi.2014.0111Search in Google Scholar

70. Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Stat Sci. 2007;22:544–59.10.1214/07-STS227DSearch in Google Scholar

71. Crump R, Hotz VJ, Imbens G, Mitnik O. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical report, 330. Cambridge, MA: National Bureau of Economic Research; 2006.10.3386/t0330Search in Google Scholar

72. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. J Am Stat Assoc. 2017. 10.1080/01621459.2016.1260466.Search in Google Scholar

73. Yang S, Ding P. Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika. 2018;105:487–93.10.1093/biomet/asy008Search in Google Scholar

74. Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–31.10.2307/2998560Search in Google Scholar

75. van der Vaart. Asymptotic Statistics. vol. 3. Cambridge: Cambridge university press; 2000.Search in Google Scholar

76. Hoeffding W, Robbins H, et al.. The central limit theorem for dependent random variables. Duke Math J. 1948;15:773–80.10.1215/S0012-7094-48-01568-3Search in Google Scholar

77. Serfling RJ. Contributions to central limit theory for dependent variables. Ann Math Stat. 1968;39:1158–75.10.1214/aoms/1177698240Search in Google Scholar

78. Loève M. Probability Theory. 2nd ed. Princeton: Van Nostrand; 1960.Search in Google Scholar

Received: 2017-12-05
Revised: 2018-08-18
Accepted: 2018-08-19
Published Online: 2018-08-24
Published in Print: 2018-09-25

© 2018 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.