Propensity score weighting is a tool for causal inference to adjust for measured confounders in observational studies. In practice, data often present complex structures, such as clustering, which make propensity score modeling and estimation challenging. In addition, for clustered data, there may be unmeasured cluster-level covariates that are related to both the treatment assignment and outcome. When such unmeasured cluster-specific confounders exist and are omitted in the propensity score model, the subsequent propensity score adjustment may be biased. In this article, we propose a calibration technique for propensity score estimation under the latent ignorable treatment assignment mechanism, i. e., the treatment-outcome relationship is unconfounded given the observed covariates and the latent cluster-specific confounders. We impose novel balance constraints which imply exact balance of the observed confounders and the unobserved cluster-level confounders between the treatment groups. We show that the proposed calibrated propensity score weighting estimator is doubly robust in that it is consistent for the average treatment effect if either the propensity score model is correctly specified or the outcome follows a linear mixed effects model. Moreover, the proposed weighting method can be combined with sampling weights for an integrated solution to handle confounding and sampling designs for causal inference with clustered survey data. In simulation studies, we show that the proposed estimator is superior to other competitors. We estimate the effect of School Body Mass Index Screening on prevalence of overweight and obesity for elementary schools in Pennsylvania.
A main statistical approach to causal inference is built on the potential outcomes framework , in which a causal effect is defined as a comparison of the potential outcomes of the same units under different treatment levels. Observational studies are often used to infer causal effects in medical and social science studies. In observational studies, there often is confounding by indication: some covariates are predictors of both the treatment and outcome. One implication is that the covariate distributions differ between the treatment groups. Under the assumption of ignorable treatment assignment and that all confounders are observed, the causal effect of treatments can be obtained by comparing the outcomes for units from different treatment groups, adjusting for the observed confounders. Rosenbaum and Rubin  further demonstrated the central role of the propensity score, and showed that adjusting for the propensity score removes confounding bias. An extensive literature thereafter proposed a number of estimators based on the propensity score, including matching , , , weighting , , , and stratification , . In particular, propensity score weighting can be used to create a weighted population where the covariate distributions are balanced between the treatment groups, on average. Therefore, under some assumptions, the comparison between the weighted outcomes has a causal interpretation; see Imbens and Rubin  for a textbook discussion.
Propensity score weighting has been mainly developed and applied in settings with independently and identically distributed (i.i.d.) data. However, in many research areas, data often present complex structures, such as clustering. Clustering can be formed in diverse ways. First, clusters are created by experimental design. The classical examples are given in educational and health studies, where students are nested in schools (e. g. ) and patients are grouped in hospitals (e. g. ). Second, clusters are induced by high-level features, such as common environmental and contextual factors, shared by individual units (e. g. ). For a motivating example, we examine the 2007–2010 body mass index (BMI) surveillance data from Pennsylvania Department of Health to estimate the effect of School Body Mass Index Screening (SBMIS) on the annual overweight and obesity prevalence in elementary schools in Pennsylvania. The data set includes 493 schools (units) in Pennsylvania, which are clustered by two factors: type of community (rural, suburban, and urban), and population density (low, median, and high). The data structure is schools nested within high-level of environments.
In this article, we address the problem of estimating the average treatment effect from clustered data, where the treatment is administered at the unit level, the covariates are measured at both unit and cluster levels (where cluster-level covariates are cluster characteristics shared by all units within clusters), and finally the outcome is measured at the unit level. Even if we collect a rich set of unit-level covariates, there may be unobserved cluster-level covariates that are related to both the treatment assignment and outcome. This problem is ubiquitous in clustered data where data are collected sufficiently at the unit level, however insufficient information is available at the cluster level. In our motivating example, we have unit-level school covariates including the baseline prevalence of overweight and obesity and percentage of reduced and free lunch. However, certain key contextual factors, such as accessibility to and quality of care, socioeconomic and environmental variables, which can be very different across clusters, are perceivably important factors for schools implementing prevention screening strategy and children’s obesity rate. Unfortunately, these cluster-specific confounders are not available. When such unmeasured confounders exist and are omitted in the propensity score model, the subsequent analysis may be biased.
We make the stable unit and treatment version assumption (SUTVA; ). Under the SUTVA, potential outcomes for each unit are not affected by the treatments assigned to other units. This assumption is not uncommon. In our application, the treatment was implemented school-wise. The potential outcomes for one school are likely to be unaffected by the treatments implemented at other schools, and therefore the SUTVA is plausible. However, in other settings, this assumption may not hold. A classical example is given in infectious diseases , , where whether one person becomes infected depends on who else in the population is vaccinated. In this article, we will not discuss the case when the SUTVA is violated.
The literature has proposed different methods for clustered data. Oakes  and VanderWeele  used multi-level models for the potential outcomes to draw causal conclusions in neighborhood effect studies. A series of papers has proposed various propensity score matching algorithms with multi-level models for the propensity score; see, e. g., , , , , , , , , and . Recently, Li et al. , Leite, Jimenez, Kaya, Stapleton, MacInnes and Sandbach  and Schuler, Chu and Coffman  examined propensity score weighting methods to reduce selection bias in multi-level observational studies. Xiang and Tarasawa  employed propensity score stratification and multi-level models to balance key covariates between treatment groups of a cross-state sample of students. For comparison of the effectiveness of various propensity score strategies; see, e. g., ,  and . Among these works, researchers considered different modeling choices for the propensity score and outcome, such as generalized linear fixed/mixed effects models. The fixed effects models create dummy variables for each cluster, regarding the cluster variables as fixed; while the random effects models use random intercepts for each cluster, treating the cluster variables as random. Nonetheless, all existing methods require correct specification of the propensity score and outcome models.
The goal of this article is to develop a novel propensity score weighting method for causal inference with clustered data in the presence of unmeasured cluster-level confounders. An important contribution is to provide a robust construction of inverse propensity score weights under the latent ignorable treatment assignment mechanism; i. e., the treatment-outcome relationship is unconfounded given the observed confounders and the latent cluster-level confounders. The key insight is based on the central role of the propensity score in balancing the covariate distributions between the treatment groups. For propensity score estimation, we then adopt the calibration technique and impose balance constraints for moments of the observed and latent cluster-level confounders between the treatment groups. Because the latent cluster-level confounders are not observed, we impose stronger balance constraints enforcing the sum of weighted treatments equal the cluster size for all clusters, which imply the exact balance of the cluster-level confounders. The proposed propensity score weighting estimator is doubly robust (e. g., , , , ) in the sense that it is consistent for the average treatment effect if either the propensity score model is correctly specified or the outcome follows a linear mixed effect model. In general cases, if the conditional mean of the outcome given the observed confounders and the latent cluster-level confounders can be well approximated by the power series of confounders, imposing constraints on these power series can also eliminate confounding biases, and the propensity score weighting estimator is consistent. The simulation results demonstrate that the proposed estimator has improved robustness to model misspecification compared to existing methods.
Importantly, our results are in agreement with some existing findings that misspecification of the propensity score model may have minor impact on the bias of the estimator for the average treatment effect (). This is especially true in the matching and stratification algorithm, because the estimated propensity score is only used to balance the covariate distributions instead of directly in estimation. Our results suggest that if both individual and cluster-level confounders achieve a good balance between the treatment groups, the proposed weighting estimator for the average treatment effect is robust.
Clustered data often arise in survey sampling. In complex surveys, the challenge is to take design information or design weights into account when developing propensity score methods for causal inference. The proposed weighting method can be combined with sampling weights for an integrated solution to handle confounding and sampling designs for causal inference with clustered survey data.
This article is organized as follows. Section 2 introduces the data structure and assumptions, defines the estimands, and presents existing inverse probability of treatment weighting estimators for clustered data. Section 3 proposes our estimators. Section 4 presents the main theoretical results. Section 5 extends the proposed calibration estimator to clustered survey data. Section 6 reports a simulation study to evaluate finite sample properties of our estimator. Section 7 applies our methods to investigate the effect of SBMIS on the annual overweight and obesity prevalence in elementary schools in Pennsylvania. A concluding remark is given in Section 8. Finally, proofs of the main theoretical results and additional simulation results are provided in the Appendix.
2 Basic setup
2.1 Observed data structure
To fix the ideas, we first focus on two-level clustered data. The extension to clustered survey data will be addressed in Section 5.
Suppose we have a two-level data structure where at the first level we have m clusters, and at the second level each cluster i includes units. Denote the sample size by . For unit j in cluster i, we observe a p-dimensional vector of pre-treatment covariates , which may include the observed individual and cluster characteristics, a binary treatment variable , with 0 and 1 being the labels of control and active treatments, respectively, and lastly an outcome variable .
2.2 Potential outcomes and assumptions
We use the potential outcomes framework . Assume that each unit has two potential outcomes: , the outcome that would be realized, possibly contrary to the fact, had the unit received the control treatment, and , the outcome that would be realized, possibly contrary to the fact, had the unit received the active treatment. The observed outcome is . This notation implicitly makes the SUTVA  that there is no interference between units and no versions of each treatment.
Suppose that clusters are random draws from a super-population of clusters, and that for observations within cluster i, i.i.d. follow a cluster-specific super-population model. Our goal is to estimate the average treatment effect, , where the expectation is taken with respect to the super-population model ξ of all random variables, which will be specified later. In the binary case, τ is called the causal risk difference.
The fundamental problem is that not all potential outcomes can be observed for each unit in the sample; only one potential outcome, the outcome corresponding to the treatment the unit actually followed, can be observed . Throughout, we use to denote the conditional independence of and given . With unstructured i.i.d. data, Rubin  described the following assumption for identifying the average treatment effect.
Assumption 1 (Ignorability).
Assumption 1 indicates that all confounders are included in .
For clustered data, confounding may vary across clusters and are related to some cluster-level covariates that are not always observable. In these cases, Assumption 1 does not hold. Instead, we assume a cluster-specific latent variable that summarizes the effect of unobserved cluster-level confounders, and consider the following modified ignorability assumption.
Assumption 2 (Latent ignorability).
Under Assumption 2, the propensity score becomes . Moreover, we make the standard positivity assumption for the propensity score.
Assumption 3 (Positivity).
There exist constantsandsuch that, with probability 1,.
In our setting, the treatment is assigned at the unit level. Assumption3implies that each unit in each cluster has a positive probability to receive either treatment or control. Therefore, our setting does not apply to the settings with a cluster-level treatment where all units in one cluster receive one treatment level. For these settings, we refer the interested readers to Stuart  and VanderWeele .
Under Assumption 2, we write the joint distribution of as
where is a conditional distribution of given , for .
The literature has considered generalized linear mixed effects models for and ; see, e. g., , , . Following the literature, we assume generalized linear mixed effects models for the outcome and propensity score.
Assumption 4 (Outcome model).
The potential outcomefollows a generalized linear mixed effects model with a random interceptas
Assumption 5 (Propensity score model).
The actual treatmentgivenfollows a generalized linear mixed effects model with a random interceptas
There are two different model specifications regarding the cluster-level confounders. The fixed effects model treats as fixed but unknown parameters across clusters. In this fixed-effects approach, treatment assignment is an ignorable process, which complies with Assumption 1 given that stacks all observed confounders and cluster dummy variables. On the other hand, the random effects model treats as random and i.i.d. drawn from a distribution. The difference between the two modeling strategies has been addressed in both statistics and econometrics literature; see, e. g.,  and . Briefly, there are both statistical and logical considerations. First, if the number of clusters is relatively large, the parameter estimates in the fixed effects model are inconsistent . In this case, the random effects approach is preferred. Second, the fixed effects approach does not make distributional assumptions of the cluster-level confounders; whereas, the random effects approach assumes that is random and i.i.d. drawn from a distribution.
2.3 Inverse probability of treatment weighting estimator
To estimate the average treatment effect τ, let denote the vector of cluster-level confounders. Under (2), the inverse propensity score or probability of treatment weighting (IPTW) estimator of τ can be expressed as
Under Assumptions 2 and 3, if the propensity score is known, it is straightforward to verify that is unbiased for τ. Moreover, if it is unknown but depends only on fixed parameters, with the consistently estimated propensity score is asymptotically unbiased for τ. The challenge with clustered data is that may depend on a growing number of unobserved cluster-level confounders. To resolve this issue, there are several options:
Weight based on predicted random intercepts; i. e., treat the ’s in model (2) as random intercepts, and predict the propensity score as , where is the mode of a predictive distribution for given the observed and , and is the maximum likelihood estimator of η.
Weight based on estimated fixed intercepts; i. e., treat the ’s in model (2) as fixed intercepts, and estimate the propensity score as , where and are maximum likelihood estimators.
3 Proposed methodology
To motivate our estimation of the propensity score, we note
Equations (4) and (5) clarify the central role of the propensity score in balancing the covariate distributions between the treatment groups in the super-population. For simplicity of exposition, let be the propensity score for unit j in cluster i, and let be the corresponding estimate. We consider the propensity score estimate to satisfy the following constraints:
- Step 1.
Obtain an initial propensity score estimate under a working propensity score model, e. g. a logistic linear mixed effects model. This in turn provides an initial set of inverse propensity score weights, , where if and if .
- Step 2.
Modify the initial set of weights to a new set of weights by minimizing the Kullback-Leibler distance  of and :(9)(10)
where is the solution to the following equation(11)
- Step 3.
Obtain the propensity score estimate as
Remark 2 (Calibration).
Calibration has been used in many scenarios. In survey sampling, calibration is widely used to integrate auxiliary information in estimation or handle nonresponse; see, e. g., , , , ,  and . In causal inference, calibration has been used such as in Constrained Empirical Likelihood , Entropy Balancing , Inverse Probability Tilting , and Covariate Balance Propensity Score of Imai and Ratkovic . Chan, Yam and Zhang  showed that estimation of average treatment effects by empirical balance calibration weighting can achieve global efficiency. However, all these works were developed in settings with i.i.d. variables and they assumed that there are no unmeasured confounders. Our article is the first to use calibration for causal inference with unmeasured cluster-level confounders.
Remark 3 (Distance function).
In Step 2 of the calibration algorithm, different distance functions, other than the Kullback-Leibler distance, can be considered. For example, if we choose, then the minimum distance estimation leads to generalized regression estimation  of the’s. If we choose, then it leads to empirical likelihood estimation . We use the Kullback–Leibler distance function, which leads to exponential tilting estimation , , . The advantage of using the Kullback-Leibler distance is that the resulting weights are always non-negative. Also, with Kullback-Leibler distance, the calibration constraint (7) can be built into a closed form expression for the weights, and thus avoiding solving a large number of equations. This reduces the computation burden greatly, when there is a large number of clusters.
Remark 4 (Nonparametric methods).
It is worth commenting on the existing robust nonparametric methods and the advantages of our estimator. In the i.i.d. data setting, many nonparametric and machine learning methods have been proposed to capture the complex relationship of different variables without parametric assumptions, such as generalized boosted regression, causal trees, random forest, and neural networks. Indeed, many studies have shown the superiority of these methods to the parametric propensity score estimation; see, e. g., , , , . However, these data-driven methods assume that all confounders are observed, and therefore they can not handle unobserved cluster-level confounders, unlike our proposed method.
4 Main results
To discuss the asymptotic properties of the proposed estimator, we assume that the cluster sample sizes satisfy the condition that and . To show the double robustness of the proposed estimator , we distinguish two cases and indicate different roles of calibration in estimation. We provide a heuristic argument below and relegate the technical details to the Appendix.
First, consider the case when the initial propensity score model is correctly specified. The weighting estimator with the initial propensity score estimates is then consistent for the average treatment effect. In this case, the role of calibration is to improve estimation efficiency by incorporating additional covariate information. This role of calibration has been demonstrated extensively in the survey literature (e. g., ) to modify the initial design weights to incorporate known auxiliary information.
Second, consider the case when the outcome models are linear mixed effects models:
where the ’s are independent with , for . In this case, the role of calibration is to balance the confounders between the treatment groups for reducing the selection bias. We note that does not depend on outcome variables, and therefore under Assumptions 2, 4 and 5, . Then, we have
Similarly, we establish
In general cases, if the conditional mean of the outcome given the observed confounders and the latent cluster-level confounders can be well approximated by the power series of confounders, imposing constraints on these power series can also eliminate confounding biases, and the propensity score weighting estimator is consistent.
We derive the asymptotic distribution of in the following theorem and relegate the proof to the Appendix.
Suppose that Assumptions2,3, and the regularity conditions specified in the Appendix hold. Suppose further that the cluster sample sizes, for, satisfy the condition thatand. If the outcome model (1) is a linear mixed effects model or the propensity score model (2) is correctly specified, the proposed propensity score weighting estimator in (12), subject to constraints (6) and (7), satisfies
The asymptotic result in Theorem 1 allows for variance estimation of . We now discuss variance estimation. Let , where
Let and . The variance estimator can be constructed as
5 Extension to clustered survey data
In this section, we extend the proposed propensity score weighting estimator to clustered survey data. Consider a finite population with M clusters and units in the ith cluster, where denotes the population size. We assume that in the finite population, follows the super-population model ξ as described in Section 2. We are interested in estimating the population average treatment effect .
We assume that the sample is selected according to a two-stage cluster sampling design. Specifically, at the first stage, cluster i is sampled with the first-order inclusion probability , where is the index set for the sampled clusters. Let be the second-order inclusion probability for clusters i and j being sampled. At the second stage, given that cluster i was selected at the first stage, unit j is sampled with conditional probability , where is the index set for the sampled units. The final sample size is . The design weight for unit j in cluster i be , which reflects the number of units for cluster i in the finite population this unit j represents. We assume that the design weights are positive and known throughout the sample. Also, let be the second-order inclusion probability for units k and l being sampled given that cluster i was selected. The second-order inclusion probabilities, and , are often used for variance estimation.
For clustered survey data, if the propensity score is known, we can express the IPTW estimator of τ as
Let and denote expectation under the super-population model and the sampling design, respectively. It is easy to verify that
In practice, because the propensity score is often unknown, (16) is not feasible. To estimate the propensity score, we now require the propensity score estimate satisfy the following design-weighted moment constraints
- Step 1.
Obtain an initial propensity score estimate under some working propensity score model, e. g. a logistic linear mixed effect model, each unit weighted by the design weight . This in turn provides an initial set of inverse propensity score weights, , where if and if .
- Step 2.
where is the solution to the following equation(20)
- Step 3.
Obtain the propensity score estimate as
In the above procedure, the design wights are used in both the propensity score estimates and the weighting estimator.
We now consider the asymptotic property of in (21). We use an asymptotic framework, where the sample size n indexes a sequence of finite populations and samples (; Section 1.3), such that the population size N increases with n. In addition, we have the following regularity conditions for the sampling mechanism.
(i) The first-order inclusion probabilityis positive and uniformly bounded in the sense that there exist positive constantsandthat do not depend on N, such that, for any i and j; (ii) the sequence of Horvitz-Thompson estimatorssatisfiesand, in distribution, as, whereis the population mean of Y, and the reference distribution is the randomization distribution generated by the sampling mechanism.
Suppose that Assumptions2–6, and the regularity conditions specified in the Appendix hold. Suppose further that the cluster sample sizes, for, satisfy the condition thatand. If the outcome model (1) is a linear mixed effects model or the propensity score model (2) is correctly specified, the proposed propensity score weighting estimator in (21), subject to constraints (17) and (18), satisfies
For variance estimation of , let , where
The variance estimator can be constructed as
6 Simulation studies
We conduct two simulation studies to evaluate the finite-sample performance of the proposed estimator, assessing its robustness against model misspecification in Section 6.1 and the robustness against omitting a unit-level confounder in Section 6.2.
6.1 Robustness against model misspecification
We first generate finite populations and then select a sample from each finite population using a two-stage cluster sampling design. In the first setting, we specify the number of clusters in the population to be , and the size of the ith cluster size to be the integer part of , where . The cluster sizes range from 100 to 500. The potential outcomes are generated according to linear mixed effects models, and , where , , , , , and are independent, for , . The parameter of interest is τ. We consider three propensity score models, , with being the inverse logit, probit and complementary log-log link function, for generating . We set to be (−0.5,1), (−0.25,0.5) and (−0.5,0.1) for the above three propensity score models, respectively. The observed outcome is . From each realized population, m clusters are sampled by Probability-Proportional-to-Size (PPS) sampling with the measure of size . So the first-order inclusion probability of selecting cluster i is equal to , which implicitly depends on the unobserved random effect. Once the clusters are sampled, the units in the ith selected cluster are sampled by Poisson sampling with the corresponding first-order inclusion probability , where if and 1 if . With this sampling design, the units with are sampled with a chance twice as big as the units with . We consider four combinations of m and : (i) ; (ii) , representing a large number of small clusters; (iii) ; and (iv) , representing a small number of large clusters.
In the second setting, all data-generating mechanisms are the same as in the first setting, except that the potential outcomes are generated according to logistic linear mixed effects models, with and with . Moreover, in the 2-stage cluster sampling, , where if and 1 if . With this sampling design, the units with are sampled with a chance twice as big as the units with .
We compare four estimators for τ: (i) , the simple design-weighted estimator without propensity score adjustment; (ii) , the weighting estimator in (3) with the propensity score estimated by a logistic linear fixed effects model with fixed cluster intercepts; (iii) , the weighting estimator in (3) with the propensity score estimated by a logistic linear mixed effects model with random cluster intercepts; (iv) , the proposed estimator with calibrations (17) and (18).
Table 1 shows biases, variances and coverages for confidence intervals from 1,000 simulated data sets. The simple estimator shows large biases across difference scenarios, even adjusting for sampling design. This suggests that the covariate distributions are different between the treatment groups in the finite population, contributing to the bias. works well under Scenario 1 with the linear mixed effects model for the outcome and the logistic linear mixed effects model for the propensity score; however, its performance is not satisfactory in other scenarios. Moreover, shows the largest variance among the four estimators in most of scenarios. This is because for a moderate or large number of clusters, there are too many free parameters, and hence the propensity score estimates may not be stable. For , we assume that the cluster effect is random, which reduces the number of free parameters. As a result, shows less variability than . Nonetheless, both and cannot control the bias well. The proposed estimator shows small bias and good empirical coverage across all scenarios. Notably, to compute , we used a working model, a logistic linear model, to provide an initial set of weights. When the true propensity score is probit or complementary log-log model, still has small biases. This suggests that our proposed estimator achieves improved robustness compared to the existing weighting estimators.
6.2 Robustness against omitting a unit-level (higher moment of) confounder
The data generating mechanisms are the same as in Section 6.1, except that in the potential outcomes and the treatment assignment models, we use a squared covariate instead of the original covariate. Now, for the potential outcomes models, we have in the first setting, and , where , , , , , and are independent, for , ; while in the second setting, with and with . For three propensity score models, we now have , with being the inverse logit, probit and complementary log-log link function, for generating . We set . In the proposed method, the calibration condition (17) is imposed only for the first moment of . This represents the case of omitting a unit-level confounder.
|Scenario 1: Linear outcome & Logistic propensity score|
|Scenario 2: Linear outcome & Probit propensity score|
|Scenario 3: Linear outcome & C-loglog propensity score|
|Scenario 4: Logistic outcome & Logistic propensity score|
|Scenario 5: Logistic outcome & Probit propensity score|
|Scenario 6: Logistic outcome & C-loglog propensity score|
|Linear outcome||Linear outcome||Linear outcome|
|Logistic PS||Probit PS||C-loglog PS|
|Logistic outcome||Logistic outcome||Logistic outcome|
|Logistic PS||Probit PS||C-loglog PS|
Table 2 shows biases, variances and coverages for confidence intervals from 1,000 simulated data sets. The proposed estimator does not control bias well in some scenarios, similar to all other estimators. This is in line with the consensus in the causal inference literature that all unit-level confounders, including higher moments if present, must be properly controlled for in order to obtain a consistent causal effect estimator.
7 An application
Ethical approval: The conducted research uses an existing de-identified dataset and is not considered as human subject research by the authors’ institutional review board. We examine the 2007–2010 BMI surveillance data from Pennsylvania Department of Health to investigate the effect of School Body Mass Index Screening (SBMIS) on the annual overweight and obesity prevalence in elementary schools in Pennsylvania. Early studies have shown that SBMIS has been associated with increased parental awareness of child weight , . However, there have been mixed findings about the effect of screening on reducing prevalence of overweight and obesity , . The data set includes 493 schools in Pennsylvania. The baseline is the school year 2007. Previous studies (e. g. ) have shown that two high-level contextual factors are strongly associated with school policies for SBMIS: type of community (rural, suburban, and urban), and population density (low, median, and high). Therefore, we cluster schools according to these two factors. This results in five clusters: rural-low, rural-median, rural-high, suburban-high, and urban-high, with cluster sample sizes , , , , and , respectively.
Let if the school implemented SBMIS, and if the school did not. In this data set, of schools implemented SBMIS, and the percentages of schools implemented SBMIS across the clusters range from to , indicating cluster-level heterogeneity of treatment. The outcome variable Y is the annual overweight and obesity prevalence for each school in the school year 2010. The prevalence is calculated by dividing the number of students with BMI > 85th by the total number of students screened for each school. Therefore, the outcome was measured for each school. For each school, we obtain school characteristics including the baseline prevalence of overweight and obesity (), and percentage of reduced and free lunch ().
For a direct comparison, the average difference of the prevalence of overweight and obesity for schools that implemented SBMIS and those that did not is . This unadjusted difference in the prevalence of overweight and obesity ignores differences in schools and clusters. To take the cluster-level heterogeneity of treatment into account, we consider three propensity score models: (i) a logistic linear fixed effects model with linear predictors including , , and a fixed intercept for each cluster; (ii) a logistic linear mixed effects model with linear predictors including , , and a random intercept for each cluster; (iii) the proposed calibrated propensity score. Using the estimated propensity score, we estimate the average treatment effect τ by the weighting method.
Table 3 displays the standardized differences of means for and between the treated and control groups for each cluster and the whole population, standardized by the standard errors in the whole population. Without any adjustment, there are large differences in means for and . For this specific data set, the three methods for modeling and estimating the propensity score are similar in balancing the covariate distributions between the treated and control groups. All three propensity score weighting methods improve the balance for and . Table 4 displays point estimates and variance estimates based on 500 bootstrap replicates. The simple estimator shows that the screening has a significant effect in reducing the prevalence of overweight and obesity. However, this may be due to confounders. After adjusting for the confounders, the screening does not have a significant effect. Given the different sets of assumptions for the different methods, this conclusion is reassuring.
We provide a doubly robust construction of inverse propensity score weights by imposing the exact balance of unit- and (observed and unobserved) cluster-level covariate distributions between the treatment groups. When either the treatment assignment is correctly specified or the outcome follows a linear mixed effects model, we show that consistent estimation of the average treatment effect is possible. Our simulation examines the robustness property of the proposed estimator under various data generating mechanisms. The results suggest that if all confounders in the linear predictors of the treatment and outcome models (including all higher moments) achieve a good balance between two treatment groups, the proposed estimator is robust. The balance conditions help to satisfy the underlying latent ignorable treatment assignment assumption, and may be particularly useful in the case where not sufficient cluster-level confounders are available. In this case, misspecification of the propensity score model has little impact on the bias of the causal effect estimator.
Moreover, our simulation results also indicate that robustness may not hold in the case where higher moments of unit-level confounders are present however are omitted in the balance constraints. This is similar to the case when there are unmeasured unit-level confounders. We therefore emphasize that unbiased estimation of the average treatment effect requires that all unit-level confounders be sufficiently controlled for.
It is well known that the IPTW estimator is sensitive to near-zero values of the estimated propensity score (e. g. , , ). Our proposed estimator prevents the occurrence of extreme values of weights through the calibration constraints where the weights within each cluster are positive and sum to the cluster sample sizes. Therefore, it is unlikely that some units receive extremely large weights that dominate.
We have focused on two-level data with a binary treatment and the average treatment effect over the full population in this article. Our proposed method can be easily generalized to many other scenarios not discussed here, such as multi-level data, multiple treatments  or other causal estimands, e. g., the average causal effects over a subset of population , , , including the average causal effect on the treated.
The IPTW estimator is not efficient in general. Semiparametric efficiency bounds for estimating the average treatment effects in the setting with i.i.d. random variables were derived by Hahn . He showed that the efficient influence function for the average treatment effect depends on both the propensity score and the outcome model. An important implication is that combining the propensity score model and the outcome regression model can improve efficiency of the IPTW estimator. For clustered data, because the data are correlated through the random cluster variables, the efficiency theory established for the i.i.d. data is not applicable. It remains an interesting avenue for future research to develop the semiparametric efficiency theory for clustered data.
Funding source: Division of Mathematical Sciences
Award Identifier / Grant number: 1811245
Funding statement: The author acknowledges the support in part by Ralph E. Powe Junior Faculty Enhancement Award from Oak Ridge Associated Universities and NSF grant DMS 1811245.
The author would like to thank Peng Ding and Fan Li for insightful and fruitful discussions. The author is grateful to the Associated Editor and two referees for helpful comments and suggestions that have greatly improved the original submission. Conflict of interest: Authors state no conflict of interest.
Appendix Regularity conditions and proof of Theorem 1
We formulate the proposed estimator as a Z-estimator (e. g., ), which invokes the standard Z-estimation theory to show the asymptotic properties. Write , where is defined in (10). The proposed estimator is , where satisfies , where is defined in (11). Define , and define that satisfy . Denote as , where the reference distribution is the super-population model, as .
We impose the following regularity conditions.
Condition A1.in probability uniformly foras, and there exists a uniquesuch that;
Condition A2.andare continuous atalmost surely;
Condition A3.The matrix
Condition A4.,, and.
The convergence in Condition A1 is uniform convergence. That is, given , there exists such that holds for all and . A sufficient condition for the uniform convergence is that for all and , where M is a constant. Condition A4 specifies moment conditions for the central limit theorem. Conditions A1–A4 are standard regularity conditions on Z-estimation; see, e. g., . However, the regularity conditions may be difficult to check in practice. We give an extreme example where a certain condition is violated. For example, if two covariates in X are perfectly correlated, then Condition A3 fails to hold. Aside from extreme cases, the regularity conditions are often satisfied for the models we are considering and reasonable choices of covariates in practice.
Under Conditions A1–A4, using the standard linearization technique, we obtain
First, consider the case when the initial propensity score model is correctly specified and consistently estimated. We have and . This is because with , . We now evaluate the terms in (A1) further. We express as
where the third line follows by the consistency assumption, the forth line follows by the assumption that the initial propensity score model is correctly specified, and the last line follows by the strong law of large numbers and the condition of . Also, following the similar argument, we obtain
Second, consider the case when the outcome follows a linear mixed effects model. We do not assume that the initial propensity score model is correctly specified, and therefore and in (A1) are not necessarily zero. Conditions A1–A4 ensure that (A1) is consistent for some parameter. We have shown in Section 4 that the proposed estimator is asymptotically unbiased for τ. It follows that .
To derive asymptotic variance formula, continuing (A1), we obtain
Therefore, , denoted as .
To establish the asymptotic normality of , we use the central limit theory for dependent variables , . Let and for . Arrange the ’s in a n-length sequence . To simplify the notation, let the kth random variable in this sequence be denoted by , for . We now consider such sequences are indexed by n. By Condition A4, the absolute central moments are bounded uniformly in k. Moreover, by the assumption of , we then have , uniformly in a, as , where is a positive constant. Following Serfling , these are typical criterion for verifying the Lindeberg condition , and therefore , in distribution, as .
3. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33–8.10.1017/CBO9780511810725.019Search in Google Scholar
6. Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Serv Outcomes Res Methodol. 2001;2:259–78.10.1023/A:1020371312283Search in Google Scholar
8. Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–34.10.1093/biomet/asp033Search in Google Scholar
10. Yang S, Imbens GW, Cui Z, Faries DE, Kadziola Z. Propensity score matching and subclassification in observational studies with multi-level treatments. Biometrics. 2016;72:1055–65.10.1111/biom.12505Search in Google Scholar
12. Hong G, Raudenbush SW. Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. J Am Stat Assoc. 2006;101:901–10.10.1198/016214506000000447Search in Google Scholar
13. Griswold ME, Localio AR, Mulrow C. Propensity score adjustment with multilevel data: setting your sites on decreasing selection bias. Ann Intern Med. 2010;152:393–5.10.7326/0003-4819-152-6-201003160-00010Search in Google Scholar
16. Ross R. An application of the theory of probabilities to the study of a priori pathometry. part i. Proc R Soc Lond, a Contain Pap Math Phys Character. 1916;92:204–30.10.1098/rspa.1916.0007Search in Google Scholar
18. Oakes JM. The (mis) estimation of neighborhood effects: causal inference for a practicable social epidemiology. Soc Sci Med. 2004;58:1929–52.10.1016/j.socscimed.2003.08.004Search in Google Scholar
21. Hong G, Yu B. Effects of kindergarten retention on children’s social-emotional development: An application of propensity score method to multivariate, multilevel data. Dev Psychol. 2008;44:407–21.10.1037/0012-1622.214.171.1247Search in Google Scholar
22. Kim J, Seltzer M. Causal inference in multilevel settings in which selection processes vary across schools. Technical Report Working Paper 708. University of California, Los Angeles, Center for the Study of Evaluation; 2007.10.1037/e644002011-001Search in Google Scholar
23. Kelcey BM. Improving and assessing propensity score based causal inferences in multilevel and nonlinear settings. PhD thesis. University of Michigan; 2009.Search in Google Scholar
26. Kim J-S, Steiner PM. Multilevel propensity score methods for estimating causal effects: A latent class modeling strategy. In: Quantitative Psychology Research. Springer; 2015. p. 293–306.10.1007/978-3-319-19977-1_21Search in Google Scholar
27. Leite WL, Jimenez F, Kaya Y, Stapleton LM, MacInnes JW, Sandbach R. An evaluation of weighting methods based on propensity scores to reduce selection bias in multilevel observational studies. Multivar Behav Res. 2015;50:265–84.10.1080/00273171.2014.991018Search in Google Scholar
28. Schuler MS, Chu W, Coffman D. Propensity score weighting for a continuous exposure with multilevel data. Health Serv Outcomes Res Methodol. 2016;16:271–92.10.1007/s10742-016-0157-5Search in Google Scholar
29. Xiang Y, Tarasawa B. Propensity score stratification using multilevel models to examine charter school achievement effects. J School Choice. 2015;9:179–96.10.1080/15582159.2015.1028862Search in Google Scholar
30. Su Y-S, Cortina J. What do we gain? combining propensity score methods and multilevel modeling. In: Annual Meeting of the American Political Science Association. Toronto, Canada; 2009.Search in Google Scholar
32. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–66.10.1080/01621459.1994.10476818Search in Google Scholar
33. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23:2937–60.10.1002/sim.7231Search in Google Scholar
34. Kang JD, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22:523–39.Search in Google Scholar
39. Baltagi B. Econometric Analysis of Panel Data. New York: John Wiley & Sons, Wiley; 1995.Search in Google Scholar
40. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT press; 2002.Search in Google Scholar
45. Chen J, Sitter R, Wu C. Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika. 2002;89:230–7.10.1093/biomet/89.1.230Search in Google Scholar
47. Kott PS. Using calibration weighting to adjust for nonresponse and coverage errors. Surv Methodol. 2006;32:133–42.Search in Google Scholar
50. Qin J, Zhang B. Empirical-likelihood-based inference in missing response problems and its application in observational studies. J R Stat Soc B. 2007;69:101–22.10.1111/j.1467-9868.2007.00579.xSearch in Google Scholar
51. Hainmueller J. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Polit Anal. 2012;20:25–46.10.1093/pan/mpr025Search in Google Scholar
54. Chan KCG, Yam SCP, Zhang Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J R Stat Soc B. 2015;78:673–700.10.1111/rssb.12129Search in Google Scholar
60. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods. 2004. 403–425.10.1037/1082-989X.9.4.403Search in Google Scholar
61. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf. 2008;17:546–55.10.1002/pds.1555Search in Google Scholar
63. Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators’ robustness to model misspecification using super learner. Am J Epidemiol. 2014;181:108–19.10.1093/aje/kwu253Search in Google Scholar
66. Harris KC, Kuramoto LK, Schulzer M, Retallack JE. Effect of school-based physical activity interventions on body mass index in children: a meta-analysis. Can Med Assoc J. 2009;180:719–26.10.1503/cmaj.080966Search in Google Scholar
67. Ebbeling CB, Feldman HA, Chomitz VR, Antonelli TA, Gortmaker SL, Osganian SK, Ludwig DS. A randomized trial of sugar-sweetened beverages and adolescent body weight. N Engl J Med. 2012;367:1407–16.10.1056/NEJMoa1203388Search in Google Scholar
68. Thompson JW, Card-Higginson P. Arkansas’ experience: statewide surveillance and parental information on the child obesity epidemic. Pediatrics. 2009;124:73–82.10.1542/peds.2008-3586JSearch in Google Scholar
69. Peyer KL, Welk G, Bailey-Davis L, Yang S, Kim J-K. Factors associated with parent concern for child weight and parenting behaviors. Childhood Obesity. 2015;11:269–74.10.1089/chi.2014.0111Search in Google Scholar
70. Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Stat Sci. 2007;22:544–59.10.1214/07-STS227DSearch in Google Scholar
71. Crump R, Hotz VJ, Imbens G, Mitnik O. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical report, 330. Cambridge, MA: National Bureau of Economic Research; 2006.10.3386/t0330Search in Google Scholar
73. Yang S, Ding P. Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika. 2018;105:487–93.10.1093/biomet/asy008Search in Google Scholar
75. van der Vaart. Asymptotic Statistics. vol. 3. Cambridge: Cambridge university press; 2000.Search in Google Scholar
78. Loève M. Probability Theory. 2nd ed. Princeton: Van Nostrand; 1960.Search in Google Scholar
© 2018 Walter de Gruyter GmbH, Berlin/Boston
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.