1.1 Propensity score analysis
Propensity score (PS) analysis is an important statistical tool for controlling confounding in observational studies and has been widely used in clinical research, economics, and social sciences [2–5]. In this article, we consider the population PS analysis framework in which $(Y_i, Z_i, X_i)$, $i = 1, \ldots, n$, represent independent and identically distributed data randomly sampled from a population of research interest, where $Y_i$ denotes the outcome, $Z_i = 1$ or 0 indicates whether the subject was assigned to the treatment or control, and $X_i$ is a vector of observed covariates related to the treatment assignment and the outcome. Informally, the research objective is to estimate the effect of the treatment on the outcome.
It is helpful to conceptualize this problem using a potential outcomes framework. We assume that the observed outcome for subject i would be $Y_i(1)$ if the subject had been assigned to the treatment group and $Y_i(0)$ if assigned to the control. Since $Y_i(1)$ and $Y_i(0)$ cannot be observed simultaneously, they are called the potential outcomes. The treatment assignment determines which of the two potential outcomes can be observed for subject i, i.e. $Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$. Two key assumptions are required for PS methods to produce valid estimates of causal effects. The first is the stable unit treatment value assumption (SUTVA; Rubin), which is commonly invoked for causal inference and stipulates that the potential outcomes for a subject are the same irrespective of the mechanism used to assign the treatment to that subject and irrespective of which treatments the other subjects receive. The second is the assumption of strong ignorability or unconfoundedness [1, 3], expressed in terms of the potential outcomes as $\{Y(1), Y(0)\} \perp Z \mid X$, where $\perp$ denotes conditional independence. This assumption stipulates that $X$ must include a sufficient set of covariates such that conditional on $X$, the potential outcomes are independent of the treatment assignment. We define the treatment effect at the subject level as $\Delta_i = Y_i(1) - Y_i(0)$, that is, the difference in the potential outcomes for subject i. The goal of PS methods is to obtain an average treatment effect, often expressed as the expectation of $\Delta_i$ taken over a population of research interest.
The PS is defined for each subject to be the conditional probability of receiving the treatment given the covariates $X_i$, i.e. $e_i = \Pr(Z_i = 1 \mid X_i)$. Rosenbaum and Rubin proved that SUTVA and unconfoundedness imply that $Z \perp X \mid e(X)$ and $\{Y(1), Y(0)\} \perp Z \mid e(X)$ or, informally, that the treatment may be viewed as being randomly assigned among subjects with the same PS. Therefore, under the SUTVA and unconfoundedness assumptions, one can think of the entire data set as a collection of many tiny randomized experiments, each defined on a distinct value of the PS. Estimators for the average treatment effect may be formed by properly aggregating results from these tiny experiments.
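The "many tiny randomized experiments" heuristic can be sketched numerically. The following illustration (simulated data with assumed coefficients; a simple PS-stratified estimator, not a method proposed in this article) averages within-stratum mean differences across PS strata:

```python
import numpy as np

def stratified_ate(y, z, ps, n_strata=5):
    """Average within-stratum mean differences across PS strata,
    weighting each stratum by its size."""
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover the boundaries
    effects, sizes = [], []
    for k in range(n_strata):
        s = (ps > edges[k]) & (ps <= edges[k + 1])
        if z[s].sum() == 0 or (1 - z[s]).sum() == 0:
            continue                             # no within-stratum comparison
        effects.append(y[s][z[s] == 1].mean() - y[s][z[s] == 0].mean())
        sizes.append(s.sum())
    return np.average(effects, weights=sizes)
```

With a modest number of strata, some residual within-stratum confounding remains, which is one motivation for the finer-grained matching and weighting methods discussed below.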
Recent reviews of PS analysis can be found in Imbens, Imbens and Wooldridge, Guo and Fraser, Stuart, and Luo et al. Popular methods for PS adjustment include stratification or regression [2, 10, 11], matching [8, 12, 13], and weighting [11, 14, 15], as well as combinations of these approaches [4, 7]. Some confusion has been caused by the considerable variation in the estimands of different propensity methods. Commonly used estimands include the population average treatment effect (PATE), $E\{Y(1) - Y(0)\}$, where the average is taken over the population from which the random sample is drawn, and the population average treatment effect for the treated (PATT), $E\{Y(1) - Y(0) \mid Z = 1\}$, where the average is taken over the subpopulation of treated subjects. Imbens and Imbens and Wooldridge also discussed estimands defined on the observed sample, but since those estimands are rarely used in medicine, we do not study them in this article.
1.2 Pair matching
In this article, we propose a weighting analog to 1:1 caliper matching on the PS without replacement (hereafter referred to as “pair matching”). Before introducing the proposed method, we first give a brief overview of pair matching in this subsection. Pair matching is currently the most widely used PS matching method in the medical literature, accounting for 83% of the medical publications that use PS matching methods in a recent survey [13, 16, §3]. Matching with replacement, one-to-many matching, and many-to-many matching are rarely employed in the medical literature [16, §2].
Pair matching involves four steps. First, a model relating Z to X is fit using binary outcome regression, and the PS is estimated for each subject. Second, a positive constant is specified as a caliper such that a treated and a control subject can only be matched if the difference in their PSs is less than the caliper; treated and control subjects are then selected to form matched pairs without replacement. Quite often a treated subject is paired with the unmatched control that has the closest PS. The matching ends when no more matched pairs within the caliper can be identified. Third, the matched data are examined for balance in covariate distributions between the treatment groups. If there is imbalance, the PS model is reformulated until balance is achieved. In the fourth and final step, the matched pairs are compared, often using a paired analysis [9, 13]. The caliper size broadly governs the bias-variance trade-off: a larger caliper often results in more matched pairs and thus less variation in the estimator, but the distance between the PSs within a matched pair may be larger, leading to a higher chance of mismatch and bias. Therefore, a smaller caliper is preferred when the sample size is larger.
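As an illustration of the second step, a minimal greedy implementation of 1:1 caliper matching without replacement might look as follows. This is a simplified sketch, not the algorithm of any particular software package; in particular, the order in which treated subjects are processed is arbitrary here, and real implementations differ in such details:

```python
import numpy as np

def greedy_caliper_match(ps, z, caliper):
    """1:1 nearest-neighbor matching on the PS without replacement.

    Each treated subject, taken in index order, is paired with the
    closest still-unmatched control, but only if the PS difference
    is within the caliper. Returns (treated_index, control_index) pairs.
    """
    treated = np.flatnonzero(z == 1)
    controls = set(np.flatnonzero(z == 0).tolist())
    pairs = []
    for t in treated:
        if not controls:
            break
        c = min(controls, key=lambda j: abs(ps[t] - ps[j]))
        if abs(ps[t] - ps[c]) < caliper:   # match only within the caliper
            pairs.append((int(t), int(c)))
            controls.remove(c)             # without replacement
    return pairs
```

Note how a treated subject whose nearest available control lies outside the caliper is simply dropped, which is the source of the estimand issues discussed next.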
Pair matching is often used to estimate the PATT. However, a sometimes overlooked fact is that this is possible only when sufficient control subjects are available to match with each treated subject across the support of the PS distribution. In general, this occurs only when the reservoir of controls is much larger than the number of treated subjects and there is sufficient overlap in the supports of the sample PS distributions of the treated subjects and the controls. Medical studies often violate one or both of these conditions (e.g. Austin and the data application of this article). In particular, when the number of treated subjects is comparable to or greater than the number of control subjects, a substantial proportion of the PSs typically fall between 0.5 and 1. In this case, within any small PS stratum in that range, there are by definition more treated subjects than controls, and the proportion of treated subjects increases as the PS approaches 1. With 1:1 matching without replacement, some treated subjects in the (0.5, 1) range either cannot find a matching control within the caliper, resulting in their exclusion, or must be matched to controls with systematically smaller PSs, resulting in inaccurate control of confounding. The latter problem can often be seen in nearest neighbor matching with a very large caliper or without any caliper restriction. Therefore, in the following, we only consider pair matching with a fixed caliper, whose chosen value is smaller with larger sample size. If we define the estimand of the pair matching estimator as its asymptotic limit, then in such situations the estimand of pair matching will remain a weighted average of causal effects across the distribution of PSs, but will generally deviate from the PATT.
We take the perspective that deviations of estimands from either the PATT or the PATE, as often occur with pair matching, may be acceptable or even desirable in many medical applications. As pointed out by Imbens and Wooldridge, there is often little motivation for a focus on the PATE or PATT. For example, in the data application of this article, the research objective is to determine the effect of intraoperative blood transfusion on cardiac surgical patients. The PATE would be of interest if we wished to contrast either withholding blood transfusion from all patients or giving blood to all patients; the PATT would be of interest if we wished to consider withholding blood transfusion from all the patients who currently receive blood. From a practical perspective, it is unlikely that these are the treatment policy alternatives. Exceptions will almost always be made for some subpopulations, particularly those with extreme PSs (i.e. scores close to 1 or 0). Subjects with PSs in the middle range exhibit more equipoise, and the treatment policy alternatives of “withholding blood from all” or “giving blood to all” are more applicable to them. A related issue is that the PATE and PATT are in general contingent on both the patient mix and the treatment allocation pattern of the research database/hospital that the investigator uses, but neither of these typically has special standing. Studies using data from other hospitals with different patient mixes or different treatment allocation patterns will typically have different PATEs and PATTs. It is important to clearly describe which population or subpopulation the average treatment effect estimator targets (e.g. Table 3), but the PATE and PATT may not be the only quantities of scientific relevance.
1.3 Rationale for the proposed method and outline
Despite the popularity of pair matching in the medical literature, there are some serious difficulties in its practical implementation, as highlighted by Austin, Hill, Stuart, and Hansen.
First, variance estimation for the pair matching estimator is a challenging task. Complicated correlation structures between and within pairs are introduced by the matching algorithm and by the fact that all the PSs are estimated with error from a binary regression. There is still debate on whether a paired or unpaired analysis is more appropriate for estimating the variance of the pair matching estimator [13, 20, 21]. The source of the problem is the complexity of the asymptotic distribution of the matching estimator [23, 24]. For a specific form of matching with replacement, Abadie and Imbens proved that failing to adjust for the uncertainty in the estimated PS introduces bias in variance estimation. Theoretical results on other types of matching, including pair matching, remain unavailable.
Second, in practice covariate balance is checked via the standardized difference of each covariate after matching. The PS model needs to be reformulated repeatedly until the standardized difference is close to zero. Since the standardized difference can never be exactly zero due to random variation, it remains an open question how small its absolute value needs to be before the data analyst can stop reformulating the model and claim that the observed standardized difference is caused by chance and not by systematic imbalance. Some investigators have proposed that the standardized difference should be within 10% for all covariates. Ho et al. recommended tighter thresholds for covariates that are prognostically more important. These criteria are rules of thumb, and there is room for improvement. For example, it is plausible that the optimal threshold for the standardized difference should decline with increasing sample size as greater precision becomes feasible.
In this article, we propose a weighting approach for PS analysis, which, unlike other weighting approaches, has approximately the same estimand as the pair matching estimator. In contrast to pair matching, theoretical analysis of the weighting method is relatively straightforward. This allows the formulation of a rigorous theoretical foundation which can be used to guide the practical implementation of the weighting estimator and avoid the aforementioned difficulties with pair matching. Since the proposed weight provides a direct analog to pair matching, we call it the matching weight (MW).
Specifically, in Section 2, we show that the MW estimator is in general more efficient than pair matching estimators and apply the approach in Lunceford and Davidian to develop a sandwich-type variance estimator that properly adjusts for the uncertainty in the estimated PSs. In Section 3, we extend the double robust technique [26–28], developed for other weighting methods, to the proposed weighting estimator to protect against model misspecification. Double robust estimation has not been established theoretically for matching estimators, though it has been observed that combining pair matching with direct regression on the outcome variable often leads to more robust estimation [4, 20]. In Section 4, we present a statistical test for balance diagnosis, which provides a new guideline for data analysts to judge whether the propensity model is misspecified. Sections 5 and 6 present the results of a simulation study and a real data application. Section 7 provides further discussion of the estimand of the MW method and other issues.
2 Matching weight
2.1 The matching weight estimator
The PS is often estimated by a parametric regression that models the probability of $Z_i = 1$ as a function of $X_i$:

$$e_i = \Pr(Z_i = 1 \mid X_i) = e(X_i; \beta). \qquad (1)$$
We call eq. (1) the PS model and denote its unknown parameters by $\beta$. Throughout this article, the term “PS” refers to the propensity on its probability scale, i.e. $e_i$, unless otherwise specified. The PS cannot be 0 or 1, so that the support of the conditional distribution of $X$ given $Z = 1$ overlaps completely with that of the conditional distribution of $X$ given $Z = 0$. This was called the overlap assumption by Imbens and Wooldridge [7, §1].
We define the MW for subject i as

$$W_i = \frac{\min(e_i, 1 - e_i)}{Z_i e_i + (1 - Z_i)(1 - e_i)} \qquad (2)$$

and the MW estimator as

$$\hat{\Delta} = \frac{\sum_{i=1}^{n} W_i Z_i Y_i}{\sum_{i=1}^{n} W_i Z_i} - \frac{\sum_{i=1}^{n} W_i (1 - Z_i) Y_i}{\sum_{i=1}^{n} W_i (1 - Z_i)}. \qquad (3)$$
The MW is a variant of the inverse probability weight [11, 15, 27, 29], with $\min(e_i, 1 - e_i)$, instead of 1, placed in the numerator of eq. (2). However, this small change leads to a quite different interpretation and different numerical properties, as described below.
The intuition behind the MWs is best illustrated through the mirror histogram (Figures 2 and 3, description in Section 6). Suppose we focus on a small neighborhood around a PS value $e$, which includes $m$ subjects. We would expect that roughly $me$ subjects are treated ($Z = 1$) and that the other $m(1 - e)$ are controls ($Z = 0$). When $e < 0.5$, there are asymptotically more controls than treated subjects in this neighborhood, and all treated subjects can be selected into the matched data set, that is, their selection probability is 1, but only a subset of the controls can be matched (selection probability $e/(1 - e)$). When $e > 0.5$, there are more treated subjects than controls, and the probability of being selected into the matched data set is $(1 - e)/e$ for the treated subjects and 1 for the controls. Hence, the MW (2) can be thought of as the probability of being selected into the matched data set. Unlike matching, which classifies each subject as either matched or unmatched and discards the unmatched subjects, the MW never leaves out any subject entirely; instead, it only down-weights some of them. When multiple subjects can all potentially be matched, the weight is approximately equally distributed among them so that every subject contributes to the estimation. This improves efficiency and balance, reduces bias, and stabilizes the computation, as will be shown in Sections 5 and 6.
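A minimal sketch of the MW and the resulting weighted-mean-difference estimator, with the PS taken as known for simplicity (in practice it is estimated from the PS model, and the variance estimation of Section 2.3 accounts for that):

```python
import numpy as np

def matching_weight(ps, z):
    """MW of eq. (2): min(e, 1 - e) divided by the probability of the
    treatment actually received, e for treated and 1 - e for controls."""
    return np.minimum(ps, 1 - ps) / np.where(z == 1, ps, 1 - ps)

def mw_estimate(y, z, ps):
    """MW estimator: difference of weighted outcome means, treated minus control."""
    w = matching_weight(ps, z)
    mu1 = np.sum(w * z * y) / np.sum(w * z)
    mu0 = np.sum(w * (1 - z) * y) / np.sum(w * (1 - z))
    return mu1 - mu0
```

For instance, a treated subject with e = 0.2 and a control with e = 0.8 both receive weight 1, mirroring the selection probabilities described above.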
There is an interesting relationship between the heuristic justifications of the inverse probability weight and the MW. The unstandardized inverse probability weight is always larger than 1, because it “creates multiple copies” of each subject to recover the distribution of the potential outcomes, which may be missing due to the selective treatment assignment. The unstandardized MW is always less than or equal to 1, because the pair-matched data set is always a subset of the original data; as an analog, the MW lets each subject “contribute only a fraction of itself”, and that “fraction” is the MW. We emphasize that this comparison reflects only the heuristic motivations of the respective estimators. The weights used by the MW and inverse probability weight methods, as in the MW estimator or in the IPW2 estimator of Lunceford and Davidian, are standardized: they influence the respective estimators only through their sizes relative to the weights of other subjects, and the absolute value or scale of the unstandardized weight is not important.
2.2 The estimand and the relationship to pair matching
MW estimation resembles pair matching in a number of ways. First, the weighted subjects in the treatment and control groups have balanced distributions in both the PS and all the covariates, because $\sum_{i} W_i Z_i g(X_i) / \sum_{i} W_i Z_i$ and $\sum_{i} W_i (1 - Z_i) g(X_i) / \sum_{i} W_i (1 - Z_i)$ have identical asymptotic limits for any measurable function $g(\cdot)$.
Second, the estimands of pair matching and the MW method are similar. Suppose the treated and control groups have an overlapping support in their PS distributions, so that no subjects are excluded prior to matching due to insufficient overlap. The estimand of pair matching, as defined by its asymptotic limit, depends on the specifics of the matching algorithm, most notably the caliper size. To our knowledge, a theoretical expression of this estimand is not yet available. Sometimes a matching algorithm would exclude treated and/or control subjects who fall outside the common support using ad hoc criteria prior to matching, which further complicates the estimand. For the moment, we consider a simplified scenario in which the PS takes on a finite set of discrete values $e^{(1)} < \cdots < e^{(K)}$ with non-zero probabilities, so that we can match on the PS exactly without the complications due to the caliper and the specifics of the matching algorithm. This scenario is actually quite general, as K can be arbitrarily large, as long as it is bounded and does not increase with the sample size n. For example, if we round the PS, a probability between 0 and 1, to 2 digits after the decimal point, then $K \le 99$, excluding 0 and 1. In such a scenario, we can show that the pair matching estimator and the MW estimator have identical estimands (proof in the Appendix):

$$\Delta = \frac{E[\min\{e(X), 1 - e(X)\}\{Y(1) - Y(0)\}]}{E[\min\{e(X), 1 - e(X)\}]}. \qquad (4)$$

The estimand (4) is a weighted average of individual treatment effects, with more weight given to subjects with PSs close to 0.5, that is, those with greater “equipoise” in the sense of having similar probabilities of assignment to either the treatment or control.
Third, when there is a large reservoir of controls so that every treated subject is matched within the caliper, the estimand of pair matching is the PATT. When the sample size is large, so that the caliper is small, all the treated subjects being matched implies that the PS is almost always smaller than 0.5. In this case, the numerator of eq. (2) reduces to $e_i$, the MW becomes the weight developed for estimating the PATT (Hirano and Imbens), and the MW estimand (4) is also the PATT.
Fourth, we define the effective sample size of the weighted subjects in the treatment group as $\sum_{i} W_i Z_i$ and, in the control group, as $\sum_{i} W_i (1 - Z_i)$. They are asymptotically equal, because $E(WZ) = E\{W(1 - Z)\} = E[\min\{e(X), 1 - e(X)\}]$, and they form a natural analog to the number of matched pairs. Proposition 1 of the Appendix proves that, in the discrete scenario at least, the effective sample size of the MW is asymptotically identical to the expected number of matched pairs from matching.
The discussion above suggests that we might use the MW as an alternative to pair matching. We emphasize that the MW results in the rest of this article do not rely on the PS being on a discrete scale; we resort to the heuristic arguments with discrete PSs only in the second and fourth arguments above to show the intuition behind MW estimation as an analog of pair matching.
2.3 Variance estimation
Following the approach in Lunceford and Davidian, we can estimate the PS, MW, and the treatment effect simultaneously in one step by solving the following estimating equations with respect to $\theta = (\mu_1, \mu_0, \beta^{\top})^{\top}$:

$$\sum_{i=1}^{n} \begin{pmatrix} W_i(\beta)\, Z_i (Y_i - \mu_1) \\ W_i(\beta)\, (1 - Z_i)(Y_i - \mu_0) \\ s_{\beta}(Z_i, X_i; \beta) \end{pmatrix} = 0, \qquad (5)$$

where $s_{\beta}$ denotes the score function of the PS model (1).
We express $W_i$ as $W_i(\beta)$ to emphasize its dependence on $\beta$. The first two equations correspond to the MW estimator, with $\hat{\mu}_1 = \sum_i W_i Z_i Y_i / \sum_i W_i Z_i$ and $\hat{\mu}_0 = \sum_i W_i (1 - Z_i) Y_i / \sum_i W_i (1 - Z_i)$. The third is the estimating equation for $\beta$ from the PS model (1). The MW estimator is $\hat{\Delta} = \hat{\mu}_1 - \hat{\mu}_0$. Since these estimating equations are unbiased, the MW estimator is an M-estimator with an asymptotically normal distribution.
The estimating equation (5) is not needed to calculate the point estimator, which can be obtained easily from the estimated PSs of a binary regression fit. Eq. (5) is needed for deriving a sandwich-type variance estimator for the MW estimator, in which the “bread” matrix is the empirical mean of the negative derivative of the estimating function with respect to the parameters and the “meat” matrix is the empirical mean of the outer product of the estimating function. One issue remains to be resolved. Since the MW function (2) does not have a continuous derivative at $e_i = 0.5$, the weight is not everywhere differentiable with respect to $\beta$. The MW equals $\min(e_i, 1 - e_i)/e_i$ when $Z_i = 1$ and $\min(e_i, 1 - e_i)/(1 - e_i)$ when $Z_i = 0$. We solve this problem by replacing the middle piece of each weight function, in a small interval around 0.5, with a cubic polynomial that connects smoothly with the two ends. The result is an approximate MW function with a continuous first derivative everywhere, which satisfies the usual regularity conditions for sandwich variance estimation. Since the middle piece can be made arbitrarily small, the approximation is quite accurate.
We first approximate the weight function for the treated, $w_1(e) = \min(e, 1 - e)/e$. Let $\tilde{w}_1(e) = w_1(e)$ if $|e - 0.5| \ge \epsilon$ and $\tilde{w}_1(e) = c_1(e)$, a cubic polynomial, if $|e - 0.5| < \epsilon$, where $\epsilon$ is a small positive constant. In order for $\tilde{w}_1$ to have a continuous first derivative everywhere and adequately approximate $w_1$, $c_1$ must satisfy four conditions: (1) $c_1(0.5 - \epsilon) = w_1(0.5 - \epsilon)$, (2) $c_1(0.5 + \epsilon) = w_1(0.5 + \epsilon)$, (3) $c_1'(0.5 - \epsilon) = w_1'(0.5 - \epsilon)$, and (4) $c_1'(0.5 + \epsilon) = w_1'(0.5 + \epsilon)$. Here, the notation $c'$ denotes the first derivative of the function. Solving these four equations determines the coefficients of $c_1$.
Similarly, we can define $\tilde{w}_0(e) = w_0(e) = \min(e, 1 - e)/(1 - e)$ if $|e - 0.5| \ge \epsilon$ and $\tilde{w}_0(e) = c_0(e)$, a cubic polynomial, if $|e - 0.5| < \epsilon$, with $c_0$ determined by the analogous four conditions. Then, $Z_i \tilde{w}_1(e_i) + (1 - Z_i) \tilde{w}_0(e_i)$ approximates the MW (2).
In all the numerical studies in this article, the half-width of the middle piece was set to a small fixed value.
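To illustrate the construction, the sketch below smooths min(e, 1 − e) itself rather than the two weight pieces separately; matching the value and first derivative at 0.5 ± ε, the unique interpolating polynomial of degree at most three reduces to a quadratic. The value ε = 0.01 is illustrative only:

```python
import numpy as np

EPS = 0.01  # half-width of the smoothed middle piece (illustrative value)

def smooth_min(e, eps=EPS):
    """min(e, 1 - e) with the kink at 0.5 replaced by a polynomial that
    matches the value and first derivative at 0.5 - eps and 0.5 + eps.
    The unique interpolant of degree <= 3 under these four conditions
    reduces to the quadratic 0.5 - eps/2 - (e - 0.5)**2 / (2 * eps)."""
    e = np.asarray(e, dtype=float)
    inner = 0.5 - eps / 2 - (e - 0.5) ** 2 / (2 * eps)
    return np.where(np.abs(e - 0.5) < eps, inner, np.minimum(e, 1 - e))
```

Outside the interval (0.5 − ε, 0.5 + ε) the function is exactly min(e, 1 − e), so the approximation error is confined to an arbitrarily small neighborhood of 0.5.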
3 Double robust matching weights estimator
In this section, we develop an augmented MW estimator that has the “double robust” property. The idea follows from the inverse probability weighted double robust estimator [11, 27]. In addition to the PS model, the augmented estimator involves two outcome models, one for the regression of $Y$ on $X$ among the treated subjects and one among the controls. Let $\alpha_1$ denote the parameters associated with the outcome model for the treatment group. We write $m_1(X; \alpha_1)$ for the conditional expectation of the outcome given the covariates and $s_{\alpha_1}(Y, X; \alpha_1)$ for the unbiased estimating equation for $\alpha_1$ derived from the likelihood or quasi-likelihood of this outcome model. For the control group, we define $m_0(X; \alpha_0)$ and $s_{\alpha_0}(Y, X; \alpha_0)$ similarly.
The augmented MW estimator can be obtained directly from

$$\hat{\Delta}_{\mathrm{DR}} = \frac{\sum_{i} \hat{\lambda}_i \{m_1(X_i; \hat{\alpha}_1) - m_0(X_i; \hat{\alpha}_0)\}}{\sum_{i} \hat{\lambda}_i} + \frac{\sum_{i} W_i Z_i \{Y_i - m_1(X_i; \hat{\alpha}_1)\}}{\sum_{i} W_i Z_i} - \frac{\sum_{i} W_i (1 - Z_i) \{Y_i - m_0(X_i; \hat{\alpha}_0)\}}{\sum_{i} W_i (1 - Z_i)}, \qquad (6)$$

where $\hat{\lambda}_i = \min(\hat{e}_i, 1 - \hat{e}_i)$ and $m_1$, $m_0$ are the fitted regression functions of the outcome for the treated and control groups.
Theorem 1. The augmented MW estimator is consistent for the MW estimand (4), as long as at least one of the following two models is correctly specified: (1) the PS model (1); (2) the outcome models for the treated and for the controls.
Theorem 2. Assume that the PS model (1) is known and let $\Delta$ denote the MW estimand (4). The class of influence functions of regular asymptotically linear estimators for $\Delta$ is given by (subscript i suppressed)
where $\Lambda$ is the space of functions of the form $\{Z - e(X)\} h(X)$ for an arbitrary function $h(\cdot)$. Among all estimators with influence functions in this class, the augmented MW estimator is the most efficient in the sense that it has the smallest variance.
The proof is in the Appendix. The property described in Theorem 1 is called double robustness. Usually, if a statistical model is misspecified, the result would be biased. With double robustness, even if one part of the model fails, we may still obtain an unbiased estimator if the other part of the model is correctly specified. Therefore, it gives the data analyst two chances, instead of one, to get a correct result. Theorem 2 shows that adding the outcome models improves efficiency.
Double robustness has been established for the inverse probability weighting method [11, 27], but not for PS matching methods. Ho et al. mentioned that running a regression of the outcome on the pair-matched data leads to double robust estimation, but did not provide theoretical justification for that observation. Given the similarity between the MW and pair matching, the results in this article support their claim. However, there is a subtle difference between the method used in their data example and the one here. In Ho et al., matching was used as a preprocessing step, and the outcome model included both X and Z, with the coefficient of Z representing the causal effect. The estimator proposed here uses a pair of outcome regression models, one for each of the two treatment groups separately, and the causal effect is expressed as a mean difference. In the next section, we compare the two approaches in simulation.
The expression (6) involves the unknown parameters of the PS model and of the two outcome models. In practice, they can be replaced by consistent estimators, and Theorem 1 still holds. Denote the asymptotic limits of the three terms in eq. (6) accordingly; the estimating equations for all the unknown quantities are
Again, to obtain the point estimator, we do not need to solve these estimating equations jointly, because the components can be obtained separately from the PS model, the outcome models, and the closed-form formula (6). The joint estimating equations are needed mainly for the sandwich variance estimation of the augmented estimator, similar to that in Section 2.
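For concreteness, the following sketch implements a three-term augmented estimator of the kind described: an outcome-model contrast weighted by min(e, 1 − e), plus MW-weighted residual corrections for each group. It is an illustration consistent with the description above, not necessarily a verbatim transcription of eq. (6); `m1` and `m0` stand for fitted values from the two outcome regressions:

```python
import numpy as np

def augmented_mw_estimate(y, z, ps, m1, m0):
    """Sketch of a three-term augmented (double robust) MW estimator:
    an outcome-model contrast weighted by min(e, 1 - e), plus
    MW-weighted residual corrections. m1, m0 are fitted values from
    the outcome regressions for the treated and control groups."""
    lam = np.minimum(ps, 1 - ps)
    w = lam / np.where(z == 1, ps, 1 - ps)            # matching weight
    reg = np.sum(lam * (m1 - m0)) / np.sum(lam)       # regression contrast
    r1 = np.sum(w * z * (y - m1)) / np.sum(w * z)     # treated residuals
    r0 = np.sum(w * (1 - z) * (y - m0)) / np.sum(w * (1 - z))
    return reg + r1 - r0
```

When both outcome models fit the data exactly, the residual terms vanish and the estimator reduces to the weighted regression contrast; when the PS model is correct, the residual terms correct any bias in the regression contrast.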
4 Balance checking
Balance diagnosis, that is, checking whether $Z \perp X \mid e(X)$ holds, is an attractive feature of PS analysis in comparison with direct regression on the outcome [20, 31]. In the PS matching framework, it is often recommended that balance be assessed by calculating the absolute value of the standardized difference of each covariate after matching [2, 17]:

$$d = \frac{|\bar{X}_1 - \bar{X}_0|}{\sqrt{(s_1^2 + s_0^2)/2}},$$

where $\bar{X}_1$ and $s_1$ are the sample mean and standard deviation of the covariate of interest among the matched treated subjects, respectively, and $\bar{X}_0$ and $s_0$ are those of the matched controls, with all sums taken over the matched subjects.
With MWs, we can develop an analog to the above by replacing the sample means and standard deviations with their MW-weighted counterparts, e.g. $\bar{X}_1 = \sum_i W_i Z_i X_i / \sum_i W_i Z_i$, with the weighted standard deviations and the control-group quantities defined analogously. If the covariates are continuous, we can also compare the weighted cumulative distribution functions of the covariates between the treated and controls to assess balance in the entire distributions.
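A sketch of the MW-weighted standardized difference follows; here the weighted variance is taken as the weighted second central moment, an illustrative choice:

```python
import numpy as np

def weighted_std_diff(x, z, w):
    """Absolute standardized difference of covariate x between treatment
    groups, using weighted means and weighted (second central moment)
    variances, pooled as sqrt((v1 + v0) / 2)."""
    def mean_var(mask):
        m = np.sum(w[mask] * x[mask]) / np.sum(w[mask])
        v = np.sum(w[mask] * (x[mask] - m) ** 2) / np.sum(w[mask])
        return m, v
    m1, v1 = mean_var(z == 1)
    m0, v0 = mean_var(z == 0)
    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2)
```

Setting all weights to 1 recovers the unweighted standardized difference used after matching.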
The PS model should be formulated so that the absolute value of the standardized difference of every covariate is minimized “without limit”. As mentioned in Section 1.3, it is necessary to develop well-justified criteria for whether balance is achieved in the propensity matched or propensity weighted data set. A formal hypothesis test of balance is often viewed as undesirable in the matching framework, because “balance” is a property of the matched sample but not of the underlying population, and because matched data have a reduced sample size, which can inflate the p-value by itself [13, 18]. We agree with this view. However, from another perspective, we can formulate the null and alternative hypotheses as follows:
H0: the PS model is correctly specified
HA: the PS model is misspecified.
Let $V$ be a vector of covariates whose balance we want to examine. $V$ may include some or all of $X$, or their transformations. Let

$$B = \frac{\sum_{i} W_i Z_i f(V_i)}{\sum_{i} W_i Z_i} - \frac{\sum_{i} W_i (1 - Z_i) f(V_i)}{\sum_{i} W_i (1 - Z_i)},$$

where $f(\cdot)$ is a vector of monotone smooth transformations applied elementwise to its input vector. The identity transformation is often used. However, when an element of $V$ is a binary covariate with very low prevalence, defining the corresponding element of $f$ as the logit transformation improves finite-sample performance. Similarly, when an element of $V$ is highly skewed, we may use the log transformation. If the null hypothesis is true, we can prove that the expectation of $B$ is zero, that is, the covariates are balanced after being weighted by the MWs; otherwise, $B$ may deviate from zero, indicating that the covariates are not balanced.
In order to properly adjust for the uncertainty in the estimated PS, we view this statistic as part of the solution to a set of estimating equations that stack, for the two treatment groups, the equations defining the weighted means of the transformed covariates together with the score equation of the PS model (1). We can think of the two weighted mean vectors as parameters, and the balance statistic is obtained by applying to the full parameter vector a contrast matrix whose first two blocks are identity and negative identity matrices of dimension equal to the length of the statistic. Under the null hypothesis, the statistic converges in probability to zero as the sample size grows. An estimator $\hat{\Sigma}$ of the variance matrix of the balance statistic $B$ defined above can be obtained from the estimating equations using the sandwich method, similar to Section 2. The test statistic is proposed as

$$T = B^{\top} \hat{\Sigma}^{-1} B. \qquad (9)$$
Under the null, this statistic has an asymptotically central $\chi^2$ distribution, with degrees of freedom determined by the rank of the variance matrix. The p-value of the test is the upper-tail probability of that $\chi^2$ distribution evaluated at the observed value of the statistic.
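Computationally, once the balance statistic and its sandwich-estimated variance matrix are available, the test reduces to a quadratic form. The sketch below uses a pseudo-inverse and takes the matrix rank as the degrees of freedom; that handling of rank-deficient variance matrices is an assumption of this sketch, not a prescription from the text:

```python
import numpy as np

def balance_statistic(B, Sigma):
    """Quadratic-form balance statistic T = B' Sigma^{-1} B, referred
    under the null to a chi-square distribution. A pseudo-inverse is
    used, with degrees of freedom taken as rank(Sigma), to guard
    against a rank-deficient variance matrix."""
    B = np.asarray(B, dtype=float)
    T = float(B @ np.linalg.pinv(Sigma) @ B)
    df = int(np.linalg.matrix_rank(Sigma))
    return T, df
```

The p-value is then the survival function of the chi-square distribution with `df` degrees of freedom evaluated at `T`.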
In a randomized clinical trial, as long as the randomization protocol is properly designed and closely followed, the baseline covariates are expected to be balanced (subject only to chance imbalance) between the randomized groups, and a test of balance is unnecessary. From another perspective, checking for covariate balance in a randomized trial is still useful, because if many covariates are unbalanced, that raises questions about whether the randomization protocol was followed appropriately during the conduct of the trial. The PS model plays the same role in an observational study that the randomization protocol plays in a randomized clinical trial. As long as the PS model is correctly specified, all covariates should be balanced after weighting by the MWs (subject only to chance imbalances). Therefore, the test above should be interpreted as a test for the misspecification of the PS model, not as a test for covariate balance. However, the test statistic and its associated p-value may serve as an index of covariate balance, which may complement existing indices such as the standardized difference.
The idea of casting balance diagnosis as a test of the misspecification of the PS model has been advocated by some researchers [34–36]. The test statistic (9) was previously proposed by Hansen and Bowers as a metric of balance for general balance-checking problems, not necessarily for PS analysis. No numerical result on this test was presented in that article. We developed an adaptation here by incorporating MWs and adjusting for the uncertainty in the estimated PSs.
5 Simulation study
We conducted simulations to study the numerical performance of the proposed methodology. The PS model was a logistic regression of Z on the covariates. The outcome model was a linear causal model similar to those in Lunceford and Davidian, Freedman and Berk, and Austin. Following a similar approach to Lunceford and Davidian, we created correlated covariates, a subset of which were bivariate normal with a common mean vector and a specified covariance matrix.
The sample size was n = 1,000, and 1,000 simulations were run in each setting. We simulated an incorrectly specified PS model or outcome (Y) model by omitting the confounder X2 from the corresponding expression.
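A stripped-down version of such a simulation can be sketched as follows. All coefficients here are illustrative only (the article's actual values are not reproduced), and the true PS is used in place of a fitted one to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_effect = 20_000, 1.0

x1 = rng.normal(size=n)                      # confounder
x2 = rng.normal(size=n)                      # outcome-only covariate

ps = 1 / (1 + np.exp(-(0.3 + 0.8 * x1)))     # true PS (illustrative coefficients)
z = rng.binomial(1, ps)
y = 1.0 + true_effect * z + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

# Matching weight and the MW estimate of the treatment effect
w = np.minimum(ps, 1 - ps) / np.where(z == 1, ps, 1 - ps)
mw_est = (np.sum(w * z * y) / np.sum(w * z)
          - np.sum(w * (1 - z) * y) / np.sum(w * (1 - z)))
naive = y[z == 1].mean() - y[z == 0].mean()  # ignores confounding by x1
```

Because the simulated treatment effect is constant, the MW estimate recovers it, while the unadjusted group comparison is biased upward by the confounder.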
Table 1 compares estimators from 17 different methods.
Method 1.0: correctly specified multiple linear regression of Y on Z and X.
Method 1.1: incorrectly specified multiple linear regression of Y on Z and X.
Method 2.0: pair matching without replacement, implemented through the R function matchit (http://gking.harvard.edu/matchit) with nearest neighbor matching within the caliper size recommended by Austin, which equals 0.2 times the standard deviation of the PS on its logit scale; a paired analysis was used to estimate the variance.
Method 2.1: pair matching using the incorrect PS model.
Method 3.0: MW estimator.
Method 3.1: MW estimator with the incorrect PS model.
Method 4.0: the method proposed by Ho et al., that is, fitting a correct linear regression of Y using the pair-matched data.
Method 4.1: the same as Method 4.0 except that the incorrect PS model was used.
Method 4.2: the same as Method 4.0 except that the incorrect Y model was used.
Method 4.3: the same as Method 4.0 except that both the PS and Y models were misspecified.
Method 5.0: double robust MW estimator.
Method 5.1: the same as Method 5.0 except that the PS model was incorrect.
Method 5.2: the same as Method 5.0 except that both Y models, for the treatment and control groups, were incorrect.
Method 5.3: the same as Method 5.0 except that the Y models and the PS model were all incorrect.
Method 7.1: inverse probability weighting (IPW) with stabilized weights; this is the IPW3 estimator in Lunceford and Davidian.
Method 8.1: double robust IPW estimator as in Lunceford and Davidian, with both Y models and the PS model correctly specified.
Method 9.1: MW estimator without incorporating the score equation in the estimating eq. (5).
By comparing Methods 3.0 and 9.1, we are able to study whether it is necessary to account for the sampling error in the estimated PS in the MW analysis.
From Table 1, we have the following observations. First, the correctly specified MW estimator is generally unbiased and more efficient than pair matching. A heuristic explanation of the efficiency gain is that when multiple controls (or treated subjects) are within the neighborhood of the caliper and available for matching, pair matching may choose some and leave out others. This is equivalent to assigning weight 1 to the matched subjects and assigning zero weight to the unmatched subjects. The MW method retains all subjects in that neighborhood by giving them roughly equal weights, so that they all contribute roughly equally to the averaging. Compared with dichotomized assignment of weights, this continuous assignment of weights increases numerical stability and efficiency and reduces bias.
Second, the confidence interval from pair matching appears to be quite conservative. This may be partly due to the use of the estimated rather than the true PS without adequate adjustment for its uncertainty, and partly due to the complicated correlation introduced into the matched data by the matching algorithm. This phenomenon was also observed by others . The confidence interval from the MW analysis is shorter and has the correct coverage, suggesting that the sandwich variance estimator incorporating the uncertainty in the estimated PS is accurate for MW methods in large samples. In contrast, if we do not adjust for the estimated PS, as in Method 9.1, the confidence interval from the MW method is also very conservative.
Third, the simulation results support the double robust and efficiency properties of MW estimator as stated in Theorems 1 and 2. The estimator proposed in Ho et al.  appears to be approximately double robust in this simulation, with a small bias and reduced confidence interval coverage observed when the outcome model is misspecified. However, it appears to be less efficient than the double robust MW estimator.
Fourth, the two inverse probability weighting estimators (Methods 7.1 and 8.1) are unbiased, but they are substantially less efficient than the corresponding MW estimators (Methods 3.0 and 5.0) and have some bias in the estimated variances. A similar phenomenon was observed by Freedman and Berk . Note that for Methods 7.1 and 8.1, we already adjusted for the estimated PS by incorporating the score equation of the PS model into the estimating equations Lunceford and Davidian . A likely explanation of the relatively large variance and biased variance estimator is that the inverse probability weights may sometimes become very large due to small estimated probabilities from the PS model Freedman and Berk , Robins and Wang , Kang and Schafer . We want to emphasize that, in general, methods based on inverse probability weights and MWs are not comparable because their estimands are different. The MW method generally gives more weight to subjects in the middle range of the PS, and in a certain sense its estimand is easier to estimate, which partly explains the efficiency gain. The two weighting methods can be compared in this simulation because we generated data with .
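The contrast between the two weight types can be seen directly from their formulas. A small sketch (our own code; the stabilized weight is taken in its standard form, the marginal treatment probability over the probability of the treatment actually received):

```python
import numpy as np

def stabilized_ipw_weights(ps, z):
    """Stabilized IPW weight: P(Z=1)/e for treated, P(Z=0)/(1-e) for control."""
    p1 = np.mean(z)
    return np.where(z == 1, p1 / ps, (1.0 - p1) / (1.0 - ps))

def matching_weights(ps, z):
    """MW: min(e, 1-e) over the probability of the treatment received."""
    return np.minimum(ps, 1.0 - ps) / np.where(z == 1, ps, 1.0 - ps)

z = np.array([1, 1, 0, 0])
ps = np.array([0.02, 0.5, 0.5, 0.98])  # two extreme estimated PSs
ipw = stabilized_ipw_weights(ps, z)
mw = matching_weights(ps, z)
# IPW weights explode as the estimated PS nears 0 or 1; MW weights never exceed 1.
```

This bounded-weight behavior is one mechanistic explanation for the smaller variance of the MW estimators observed in Table 1.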
To study the proposed test of PS model misspecification, we simulated data under three scenarios: (1) the null hypothesis, where the true PS model was the same logistic model as that used for Table 1; (2) alternative hypothesis I, where interactions were added to the true logistic model; (3) alternative hypothesis II, where interactions with larger coefficients were added to the true logistic model. Scenarios (2) and (3) represented situations where the PS was misspecified with increasing severity, because the working PS model never included any interactions. We considered three sample sizes: 500, 1,000, and 2,000. The covariate vector V included , and was the identity function. The rejection probabilities are reported in Table 2. The column under "Null" gives the Type I error, and the other two columns give the statistical power. The proposed test maintained a Type I error close to the target of 5% under the null, and its power increased with the sample size. The higher power under alternative hypothesis II is probably because its interaction terms have larger coefficients than those under alternative hypothesis I. However, in general we cannot expect to observe more imbalance in covariates when the terms omitted from the working PS model have larger variances.
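The mechanics of a balance-based test of this kind can be sketched as a quadratic form in the MW-weighted covariate imbalances. The toy version below plugs in an empirical covariance and ignores the correction for the estimated PS (the score-equation adjustment used in the article), so it is only a schematic sketch; all names are ours:

```python
import numpy as np

def balance_test(v, z, w):
    """Quadratic-form statistic testing whether the MW-weighted covariate
    imbalance is centered at zero.  v: (n, k) covariate matrix,
    z: treatment indicator, w: matching weights."""
    n, k = v.shape
    d = (w * (2 * z - 1))[:, None] * v   # signed weighted contribution per subject
    b = d.mean(axis=0)                   # average imbalance per covariate
    cov_b = np.cov(d, rowvar=False) / n  # empirical covariance of the mean
    stat = float(b @ np.linalg.solve(cov_b, b))
    return stat, k                       # refer stat to a chi-square with k df
```

Under covariate balance the statistic behaves approximately like a chi-square on k degrees of freedom, which is the kind of reference distribution the rejection probabilities in Table 2 are computed against.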
In this section, we apply the proposed method to a cardiac surgery study that includes 12,649 patients, 3,105 of whom received red blood cell transfusion in the operating room (Z = 1), while the other 9,544 did not (Z = 0). The research question is whether transfusion is associated with a longer length of stay in the intensive care unit (outcome Y). The covariates X include 24 demographic factors, preoperative laboratory values, co-morbidities, and intraoperative risk factors (Table 3).
First, we developed a PS model by fitting a logistic regression for intraoperative transfusion. A natural starting point for model building was to include all the covariates, with no interactions or higher order terms for the continuous covariates for now. We call this PS Model I. In Figure 1(A), we plot the standardized differences of all covariates under pair matching and under the MWs (the standardized differences of the original data set were much larger in scale and are not plotted). In the matching literature, it is generally recommended that the absolute standardized differences of all covariates be less than 10% . We can see from Figure 1 that after pair matching, the absolute standardized differences of all covariates are within 10% except for preoperative hematocrit, a continuous covariate known to be a strong predictor of transfusion. The MW produced much better balance in terms of the standardized differences (Figure 1). There are two reasons for this. First, eq.  has expectation 0 for any vector V that is a function of the covariates used in the PS model; second, as shown by the simulation results in Table 1, the MW estimator generally has less variation than pair matching, reducing the amount of imbalance due to random variation.
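A weighted standardized difference like those plotted in Figure 1 can be computed as follows. Conventions vary in the literature; this sketch (our own code) uses weighted means and variances in both groups:

```python
import numpy as np

def standardized_difference(x, z, w=None):
    """Weighted standardized difference (in %) of covariate x between
    the treated (z = 1) and control (z = 0) groups."""
    if w is None:
        w = np.ones_like(x, dtype=float)
    m1 = np.average(x[z == 1], weights=w[z == 1])
    m0 = np.average(x[z == 0], weights=w[z == 0])
    v1 = np.average((x[z == 1] - m1) ** 2, weights=w[z == 1])
    v0 = np.average((x[z == 0] - m0) ** 2, weights=w[z == 0])
    return 100.0 * (m1 - m0) / np.sqrt((v1 + v0) / 2.0)
```

Balance is then judged against the 10% threshold quoted above: all covariates with |d| below 10% are usually considered adequately balanced.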
We performed the test of misspecification proposed in Section 4. Cryoprecipitate was a binary covariate with a prevalence of less than 1%. It was excluded from the vector V in the test statistic because binary covariates with extremely low prevalence require an extremely large sample size to reach asymptotic normality. Fresh frozen plasma, platelets, and emergency surgery were binary covariates with low prevalence between 1% and 5%; they were retained in V, but we let their corresponding functions be the logit function to improve the finite-sample performance. The functions for the remaining covariates were identity functions, that is, no transformations were applied. NYHA (New York Heart Association functional class) was an ordinal variable with four levels; we created three binary dummy variables to represent it. The statistic was 22.3 on 23 degrees of freedom. The p-value was 0.5, indicating no evidence of model misspecification.
Based on the test result above, the search for a suitable PS model could have stopped here. However, while the test of PS model misspecification tells us when we should continue searching for better PS models (i.e. when the test is significant), it does not dictate that we stop once the test result becomes insignificant. The significance threshold of the test is a benchmark: once we pass it, the observed covariate imbalance is indistinguishable from random variation. Any PS model passing the test is suitable, and it does no harm to choose one that yields smaller standardized differences. Preoperative hematocrit raised some concern in this example: although the overall test was not significant and this covariate had an absolute standardized difference of merely 1.6%, that number was the highest among all the covariates (most of the others had absolute standardized differences within 0.5% after weighting by the MWs). We performed a separate test on each covariate (i.e. the V in eq.  comprised a single covariate), and preoperative hematocrit was the only covariate that was statistically significant (). Although we might attribute this nominally significant result to Type I error, because more than 20 tests were performed, we chose to tweak the PS model to see whether we could build another model that would still pass the misspecification test but achieve even better balance, especially for preoperative hematocrit. We reformulated the model by including two simple spline terms of hematocrit (denoted henceforth by HCT), max(HCT−35, 0) and max(HCT−41, 0), and their cross-product interactions with all other covariates. Here, 35 and 41 are the 1/3 and 2/3 quantiles of HCT. The reformulated model is called PS Model II, and its standardized difference plot is in Figure 1(B).
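The spline terms and interactions used to build PS Model II can be generated mechanically. A sketch (knots 35 and 41 as in the text; the column handling and function names are ours):

```python
import numpy as np

def hct_spline_terms(hct):
    """Linear spline basis for hematocrit: max(HCT-35, 0) and max(HCT-41, 0)."""
    return np.column_stack([np.maximum(hct - 35.0, 0.0),
                            np.maximum(hct - 41.0, 0.0)])

def with_interactions(spline, other):
    """Cross-products of each spline column with each of the other covariates."""
    return np.hstack([spline[:, [j]] * other for j in range(spline.shape[1])])
```

The two new columns and their cross-products with the remaining covariate matrix are then appended to the design matrix of the logistic PS regression.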
With PS Model II, the absolute standardized difference of HCT decreased to 0.7% (Table 3), and there was no noticeable increase in the standardized differences of the other covariates (Figure 1); the p-value of the overall test of misspecification for PS Model II was 0.9. The rest of the analysis in this section was based on PS Model II.
Table 3 presents descriptive statistics of the three patient populations: the original data set, the pair-matched data set, and the original data set weighted by the MWs. The descriptive statistics from pair matching and the MWs are very similar, illustrating that the MW is a weighting analog of pair matching. Overall, the MW resulted in smaller covariate mean differences than pair matching, consistent with the standardized difference results in Figure 1(B).
We propose the mirror histogram as a new graphical tool to visualize both pair matching and MWs (Figure 2). Above the horizontal line is the histogram of the PSs of the treated group, and below it is that of the control group. The vertical axis of the histogram must be on the frequency scale, not the probability density scale. For pair matching, the shaded areas are the matched treated subjects (dark green) and matched control subjects (light green). Since in pair matching the matched subjects are a subset of the original sample, the shaded areas always lie within the outer histograms (the histograms of the original sample). The height of a shaded bar is the number of matched (treated or control) subjects within the corresponding PS stratum defined by the histogram. The mirror histogram thus shows, within each stratum, how many treated and control subjects are matched and how many are unmatched. For the MW, everything is the same except that the height of a shaded bar is the sum of the MWs of the (treated or control) subjects within the corresponding stratum, which need not be an integer. Since each MW is at most 1, the shaded areas always lie within the outer histograms, analogous to the case of pair matching.
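The bar heights of a mirror histogram can be computed without any plotting library; the treated-group bars are then drawn upward and the control-group bars downward. A sketch (binning choices and names are ours):

```python
import numpy as np

def mirror_histogram_heights(ps, z, w, bins=20):
    """Outer (frequency) and shaded (MW-sum) bar heights per PS bin,
    for the treated group (drawn up) and the control group (drawn down)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    heights = {}
    for grp in (1, 0):
        m = (z == grp)
        outer, _ = np.histogram(ps[m], bins=edges)             # raw counts
        shaded, _ = np.histogram(ps[m], bins=edges, weights=w[m])  # MW sums
        heights[grp] = (outer, shaded)
    return edges, heights
```

Because each MW is at most 1, the shaded MW bar can never exceed the outer frequency bar in any stratum, which reproduces the containment property described above.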
An important use of the mirror histogram is that any lack of overlap can be seen directly from the plot. Figure 2 shows that, overall, the range of the PSs of the treated subjects coincides with that of the control subjects, with no visibly sparse regions near 0 or 1. In this sense, the overlap is good. However, imbalance between the two groups is obvious. Pair matching (Figure 2A) left some treated subjects near 1 unmatched and many controls near 0 unmatched, whereas almost all treated subjects near 0 and almost all control subjects near 1 were matched. Since not all treated subjects were matched, the number of matched pairs (2,316) is less than the number of treated subjects (3,105). If we want to force the unmatched treated subjects (most of whom have a PS > 0.5) to be matched to a control, we can only increase the caliper, which may produce inaccurate matches and lead to bias. In Figure 3, we present another mirror histogram, which visualizes a simulated data set from Section 5.
Table 4 shows the estimates from four methods studied in the simulation, along with the unadjusted comparison. The outcome regression model, when used, was a linear model with all the covariates and no interactions. The MW (Method 3.0), double robust MW (Method 5.0), and double robust matching (Method 4.0) estimators gave similar point estimates. The estimate from pair matching (Method 2.0) was slightly larger. The estimates from these four methods were all quite different from the unadjusted estimate. All were statistically significant at the 0.05 level, suggesting that intraoperative transfusion is associated with a longer stay in the intensive care unit. Estimators based on the MWs had shorter confidence intervals than the corresponding ones based on pair matching, and incorporating the outcome model (e.g. double robust estimation) reduced the confidence interval length further.
The MW is a PS weighting method. The most widely used weighting approach is inverse probability of treatment weighting (IPW) [14, 29, 40], which provides estimates of the PATE. Since the inverse probability weights may become very large when the estimated PS approaches 0 or 1, this approach sometimes leads to highly variable and biased estimates of the treatment effect [38, 40]. Several modifications of the IPW have been proposed to alleviate this problem [11, 41–43], some of which redefine the PATE estimand to exclude individuals with extreme PSs close to 0 or 1 (e.g. ). The estimand of the MW estimator also de-emphasizes PSs close to 0 or 1, which often limits variance inflation and stabilizes the computation. In addition to controlling variance inflation, the MW estimator and other estimators whose estimands assign greater weight to individual causal effects when the PS is close to 0.5 can be interpreted as emphasizing patients for whom there is greater equipoise, that is, patients whose characteristics make physicians about as likely as not to assign the treatment. As noted in Section 1.2, these patients may often be of most interest when contrasting treatment alternatives. Further, we conjecture that emphasizing patients with PSs closer to 0.5 is likely to mimic enrollment patterns in some randomized clinical trials. In particular, for treatment regimens already used in clinical practice, physicians may be more comfortable randomizing patients with propensities close to 0.5 than patients whose characteristics are associated with propensities close to 0 or 1.
While the proposed MW estimator has the advantages above, the precise form of the estimator arose from a motivation different from that of all other weighting methods: it is a weighting analog of pair matching. Tasks such as accurate variance estimation, testing for misspecification of the PS model, and double robust estimation are easier to accomplish in the MW framework than in the matching framework. Our numerical studies illustrate that the MW estimator is generally more efficient than the pair matching estimator and achieves better balance. In this article, we present both theoretical arguments (Section 2.2) and empirical results (e.g. Table 3) demonstrating the similarity between pair matching and the MW in terms of their estimands and covariate distributions. Thus, the MW method can be used wherever pair matching is applicable.
In this article, we focus only on the pair matching method. There are other types of matching methods, such as one-to-many matching or matching with replacement, though they are used less frequently in the medical literature. Since they often have different estimands from pair matching, it is difficult to compare them with the MW method. It is of interest to study the relationship between weighting methods and other types of matching methods. This is currently being explored by the authors.
The MW estimator and its double robust version are both easy to calculate. One only needs a binary regression model to estimate the PS in a first step, after which the MW estimator can be calculated in closed form. The estimating equations such as eqs  and  are needed only for variance estimation, which involves no iteration and should be fairly easy to program. Since the MW estimator is an M-estimator, its variance can also be calculated through the bootstrap if the data analyst does not want to go through the additional effort of programming the sandwich variance estimation. The methodology developed in this article applies when the outcome Y is a binary variable. However, one should be cautious in interpreting the estimator, because the mean difference may not always be a proper measure of the treatment effect when the variance depends on the mean. Eqs  and  are not directly applicable when Y is a time-to-event outcome subject to censoring. Further research is needed to develop a double robust MW estimator accommodating censoring in Y.
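The bootstrap route mentioned above can be sketched directly: resample subjects with replacement, refit the PS model on each resample so that the uncertainty of the estimated PS is carried along, and recompute the MW estimate. The PS-fitting step is passed in as a function; all names are ours, and the constant-PS fit in the usage example is only a toy stand-in for a logistic regression:

```python
import numpy as np

def mw_estimate(y, z, ps):
    """MW estimator: difference of MW-weighted outcome means."""
    w = np.minimum(ps, 1.0 - ps) / np.where(z == 1, ps, 1.0 - ps)
    return (np.sum(w * z * y) / np.sum(w * z)
            - np.sum(w * (1 - z) * y) / np.sum(w * (1 - z)))

def bootstrap_se(y, z, x, fit_ps, B=200, seed=0):
    """Bootstrap standard error of the MW estimator; fit_ps(x, z) must
    return estimated PSs, and is refit on every resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ests = []
    for _ in range(B):
        idx = rng.integers(0, n, n)          # resample subjects with replacement
        ps_b = fit_ps(x[idx], z[idx])        # refit the PS model
        ests.append(mw_estimate(y[idx], z[idx], ps_b))
    return float(np.std(ests, ddof=1))
```

In practice fit_ps would wrap the logistic regression used for the PS model; the bootstrap then avoids programming the sandwich variance at the cost of extra computation.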
Proposition 1 Suppose that the PS can take only finitely many values. If we perform one-to-one exact matching on the PS without replacement and choose randomly when multiple matched pairs are available, then the matching estimator has the same asymptotic limit as the MW estimator as the sample size tends to infinity. In addition, the effective sample size of the MW estimator is asymptotically equivalent to the expected number of matched pairs.
Proof Denote by , and let () be the set of matched subjects from the treatment (control) group with PS . The matching estimator is
Therefore, as ,
which is the same asymptotic limit as the MW estimator . The effective sample size of the MW estimator is for the treatment group and for the control group. The number of matched pairs is . These quantities are asymptotically equivalent from the derivation above and . □
Proof of Theorem 1: If and are correctly specified, then .
Since and ,
Similarly, . Summarizing the results above and the expression of , we have when and are correctly specified.
Next, we assume that the PS model is correctly specified, that is, s are correct. We can rearrange the terms in eq.  of the article and write as
The first term is , which converges to . The second term equals
Since and , the second term converges to 0. Similarly, the third term in eq.  also converges to 0. Therefore, when the PS model is correctly specified. □
Proof of Theorem 2: When the PS model is known, . The MW estimator is approximately equal to
where . We can define and similarly and view as the new data and as the new potential outcomes. It is obvious that if the SUTVA assumption and unconfoundedness assumption hold, then similar assumptions hold for the new data and new potential outcomes as well: and .
Expression (11) suggests that can be viewed as an inverse probability weighting estimator, if we think of as the outcome variable. Therefore, the semiparametric theory in §13.5 of Tsiatis , originally developed for inverse probability weighting method, can be applied to show that the class of influence functions of regular asymptotically linear estimators of is given by
where for any function . The efficient influence function in this class is uniquely given by
as in eq.  of the article is the estimator corresponding to this efficient influence function. □
We greatly appreciate the helpful comments from the editor, associate editor and referees. This work was carried out while Liang Li was a faculty biostatistician in the Department of Quantitative Health Sciences at Cleveland Clinic.
Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Pol Anal 2007;15:199–236.
Guo S, Fraser M. Propensity score analysis: statistical methods and applications. Thousand Oaks, CA: Sage Publications, 2010.
Rubin DB. Which ifs have causal answers? J Am Stat Assoc 1986;81:961–2.
Rosenbaum P, Rubin D. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985;39:33–8.
Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Serv Outcomes Res Methodol 2001;2:259–78.
Austin PC. Some methods of propensity-score matching had superior performance to others: results of an empirical investigation and Monte Carlo simulations. Biometrical J 2009;51:171–84.
Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009;28:3083–107.
Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat 2010;10:150–61.
Hill J. Discussion of research using propensity-score matching: comments on "a critical appraisal of propensity-score matching in the medical literature between 1996 and 2003" by Peter Austin. Stat Med 2008;27:2055–61.
Stuart E. Developing practical recommendations for the use of propensity scores: discussion of "a critical appraisal of propensity score matching in the medical literature between 1996 and 2003" by Peter Austin. Stat Med 2008;27:2062–5.
Hansen BB. The essential role of balance tests in propensity-matched observational studies: comments on "a critical appraisal of propensity-score matching in the medical literature between 1996 and 2003" by Peter Austin. Stat Med 2008;27:2050–4.
Abadie A, Imbens GW. On the failure of the bootstrap for matching estimators. Mimeo, Kennedy School of Government, Harvard University, 2005.
Abadie A, Imbens GW. Matching on the estimated propensity score. NBER Working Paper Series, w15301, 2009. Available at SSRN: http://ssrn.com/abstract=1463894.
van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.
Shaikh AM, Simonsen M, Vytlacil EJ, Yildiz N. A specification test for the propensity score using its distribution conditional on participation. J Econometrics 2009;1:33–46.
Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with discussions and rejoinder). Stat Sci 2007;22:523–39.
Cao WH, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 2009;96:723–34.
Tsiatis AA. Semiparametric theory and missing data. New York: Springer, 2006.