Energy Balancing of Covariate Distributions

Bias in causal comparisons has a direct correspondence with distributional imbalance of covariates between treatment groups. Weighting strategies such as inverse propensity score weighting attempt to mitigate bias by either modeling the treatment assignment mechanism or balancing specified covariate moments. This paper introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting distributional imbalance, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses, including the estimation of average treatment effects and individualized treatment rules. Our energy balancing weights (EBW) approach has several advantages over existing weighting techniques. First, it offers a model-free and robust approach for obtaining covariate balance that does not require tuning parameters, obviating the need for modeling decisions of secondary nature to the scientific question at hand. Second, since this approach is based on a genuine measure of distributional balance, it provides a means for assessing the balance induced by a given set of weights for a given dataset. Finally, the proposed method is computationally efficient and has desirable theoretical guarantees under mild conditions. We demonstrate the effectiveness of this EBW approach in a suite of simulation experiments, and in studies on the safety of right heart catheterization and the effect of indwelling arterial catheters.


Introduction
Studying the causal effect of a treatment or intervention is a central goal in many scientific disciplines.In randomized controlled trials, estimation of causal effects is possible since randomization ensures that treated units and control units are comparable [1].However, for many pressing questions, it is impossible or impractical to randomize treatment assignments.Researchers are thus left with existing sources of observational data to answer these questions.In these observational studies, it is of prime interest to make unconfounded comparisons between treatment groups, a common example being estimation of the average treatment effect (ATE).Yet in observational settings, natural selection processes into the treatment, resulting in imbalance in the covariate distributions between treatment groups.This may then introduce substantial bias in naive comparisons of the outcome of interest between groups.
In observational studies based on complex electronic health records (EHRs) or administrative health data, the differences between treated and untreated units are typically substantial and difficult to characterize.One motivating application for our work is the study by Connors et al. [2], which explored the impact of right heart catheterization (RHC), a diagnostic procedure designed to guide therapy, on mortality among intensive care unit (ICU) patients.In this study, patients who received RHC are different from those who did not receive RHC in highly complex ways.As an example among many such differences, both younger patients and older patients are less likely to receive RHC; thus, a simple correction for the average age does not characterize the differences between those who received RHC and those who did not.Similar complex differences between treated and untreated patients can also be observed in three other motivating studies arising from clinical care and EHRs using the MIMIC-III critical care database [3]; we will demonstrate how the proposed method tackles these four challenging applications later.
There is a vast literature on adjustment methods for correcting for imbalances between treated and untreated units to reduce estimation bias in treatment effects.Weighting methods are a class of adjustment methods that control for confounding by re-weighting the treatment and control groups to look similar.Inverse-probability weighting (IPW) methods [4][5][6][7][8][9], which have origins in survey sampling, are by far the most commonly used weighting approaches.IPW methods model the treatment assignment mechanism (or propensity score, see Rosenbaum and Rubin [10]), and inverse weight each sample by the probability of receiving its assigned treatment.Inverse weighting by the true underlying propensity score controls for confounding, as it re-weights the covariate distributions of the treatment groups to that of the overall population.
In practice, IPW methods require positing and fitting a model for the propensity score.While model fitting is no unfamiliar task for a statistician, it has been noted in the literature that even mild model misspecifications for the propensity score can result in substantial bias in estimating the treatment effect [11].Hearkening to George Box's maxim, "all models are wrong, but some are useful," it is often quite difficult in practice to obtain a useful propensity score model, especially in the presence of many covariates.Recent work has focused on mitigating this issue, by either (i) including conditions on the estimation of the propensity model which encourage moment balance of the covariates [12] or by (ii) altogether avoiding direct modeling of the propensity score and instead estimating weights that explicitly balance moments of the covariates either exactly [13,14] or approximately [15].A large class of such estimators is explored Chan et al. [14] and further expanded on by Zhao [16] and has a long history [17].Recent works [18,19] have attempted to mitigate this issue by focusing on nonparametric approaches to moment balancing, yet they require a careful choice of kernel function and tuning of multiple hyperparameters.
To make progress in addressing these issues, we show in this article that estimation bias for the ATE has a link to imbalance in the covariate distributions.This shows that balancing the full covariate distributions (and not just lower order moments) of the treatment groups to the full population provides a robust way for mitigating confounding.Although this is understood in the literature [16,20], in this article, we establish and make use of this link in a general manner, by (i) introducing a metric that evaluates how well a set of weights mitigates imbalance for a given dataset and (ii) developing a new method for estimating weights by minimizing this metric, yielding good distributional balance for a given dataset.The distributional balancing property of these weights allows for robust empirical performance; they tend to perform well in practice without relying on modeling assumptions for the propensity score or which moments are imbalanced.Our approach is designed to yield good distributional balance between treatment groups without the need of carefully tuning hyperparameters.Hirshberg et al. [21] study the theoretical properties of use of integral probability metrics for construction of weights; however, our work focuses on a specific choice that is suitable to wide use in practice.Despite the lack of need for tuning, our proposal works well empirically in a wide variety of settings.
We introduce weights that are explicitly constructed to balance the covariate distributions of the treatment groups to a target distribution (usually the full population).We do so by leveraging the energy distance presented in Székely and Rizzo [22].The energy distance is a measure based on powers of the Euclidean distance and was originally introduced as a means to replace standard nonparametric goodness-of-fit tests in high dimensions.The energy distance has an exact duality with a norm on the characteristic functions, enabling its use to compare two (or more) distributions or the distributions of two samples.We show that a weighted energy distance still retains this duality, making it a rigorously justified and reliable metric to compare between multiple sets of weights for a given dataset.From this, we propose the so-called energy balancing weights (EBWs), which are defined as the weights which minimize the weighted energy distance between treatment groups and the full sample, subject to constraints that mitigate variability in the weights.We prove that EBWs asymptotically ensure full distributional balance and result in root-n consistent estimation of the ATE.Our emphasis is on the robust performance of our approach in practice, and our asymptotic analysis serves primarily to justify the use of the proposed weights.We analyze four challenging observational studies and use each of these datasets to conduct realistic and highly challenging simulations, demonstrating the effectiveness of EBWs in practice with minimal user input required.Two of the studies we present in this article and the remaining two are presented in the Supplementary Material.
Although we focus primarily on a simple estimand, namely, the average treatment effect (ATE) the proposed weights can be used for a wide variety of causal estimands since the distributional re-balancing property of EBWs enables flexible control of confounding.We show that they can be used for the estimation of a wide variety of causal quantities, such as the ATE and individualized treatment rules (ITRs) [23,24].With minor modifications, they can also be used for estimation of the average treatment effect on the treated (ATT) and for estimating treatment effects for multi-category treatments.Despite the fact that EBWs are not specifically designed to match low-order moments, in practice, they often result in better marginal mean balance of covariates than propensity score methods even in high dimensions (50-100 covariates), as seen in Section 6. EBWs are also quite stable in practice, rarely resulting in large weights, an issue that plagues standard propensity score methods [11], making it less critical to impose constraints that induce weight variability.EBWs are constructed without using outcome information; however, as for other weighting approaches, variance can be reduced via augmented estimators that make use of outcome regression models, such as the augmented estimators in the studies by Wong and Chan [18], Zhao [16], and Athey et al. [25].However, we do not explore such techniques here.
The remainder of this article is organized as follows.Section 2 motivates the need for distributional balance and introduces the weighted energy distance.Section 3 presents the proposed EBWs and discusses their computation and asymptotic properties.Section 5 compares the performance of EBWs with other weighting methods in simulation studies.Section 6 discusses an application of EBWs in a study of RHC and further uses these data to explore highly challenging simulations that use the natural data-generating processes of the study.Section 7 further demonstrates the utility of EBWs by analyzing data from a complex application involving EHR data.Section 8 concludes with a discussion and future work.
2 Distributional balance and weighted energy distance

Consider a sample
{( ) } of size n from a population, where Y i is the outcome of the ith unit, ∈ A 0, 1 i { } is a binary indicator of receiving a treatment, and ∈ = X i p is a p-dimensional vector of covariates.Further In this article, we are interested in estimating the average causal effect of the treatment on the outcome.A formal definition of a causal effect often involves the use of so-called potential outcomes [26][27][28][29].The potential outcome Y a ( ) is the outcome that would have been observed under level a of the treatment.As each individual only receives one level of the treatment at a given time, only one potential outcome, either Y 0 i ( ) or Y 1 i ( ), for each individual is observable.We assume the standard stable unit treatment value assumption (SUTVA), which posits that the potential outcomes for each unit are unaffected by the potential outcomes of other units and that only one version of the treatment exists.Under SUTVA, the observed outcome is consistent with the potential outcomes in that = Y Y A i i i ( ).We further assume the assignment mechanism is strongly unconfounded in the sense that ⊥ | , which requires that there are no unmeasured confounders.The ⊥ ⊥ notation of Dawid [30] denotes (conditional) independence.We further assume positivity (or probabilistic assignment) in that the propensity score , so that everyone has a chance of receiving the treatment.Positivity, together with no unmeasured confounders, constitute strong ignorability [10].
Let us denote ≡ μ Ya X X as the conditional variance function of the response for ∈ a 0, 1 { }.We consider scenarios where data have been collected from an observational study, and thus, the treatment groups are not comparable due to imbalances Energy balancing of covariate distributions  3 in the distributions of their baseline covariates.These differences can be characterized through π x ( ) ), the cumulative distribution function (CDF) of covariates X conditional on treatment level a.
Using the notation of potential outcomes, the (population) ATE is defined as . This can be rewritten as follows:

Weighted average estimates and distributional balance
We restrict our focus to weighted averages as estimates of the ATE.Given a vector of weights = w w w , …, n 1 ( ), we study estimators of the form: The most commonly used example of (2), inverse propensity score weighting, uses ) .As presented in Imai and Ratkovic [12] and Li et al. [20], the weights in (2) are often normalized by treatment group, i.e., ∑ = ) for ∈ a 0, 1 { }, to improve precision [11,16] at the cost of a small bias.When these weights are constructed as mentioned earlier and normalized by treatment group, the resulting estimator is called the Hájek estimator [31] and may yield reduced mean squared error over its nonnormalized version [32].
Given any weight vector w, we can express the error of τ w  as follows: ∑ ∑ where is the weighted ECDF for treatment level ∈ a 0, 1 { }.In observational studies, F n is not impacted by the weights, so the error term (4) is irreducible.However, this term goes to 0 as long as the sample is representative of the desired population.Since (5) always has mean 0, the bias of τ w  in essence depends on the properties of (3): the difference of integrals with respect to . Thus, the systematic source of error from the weighted estimator τ w  can be completely controlled by reducing the imbalance between the weighted ECDFs F n a w , , and the ECDF F n .Unlike the decomposition in Zhao [16], the decomposition of − τ τ w  above holds even if the treatment effect is not constant over x.Further, none of the methodology or results of this article require such a constant treatment effect to justify the validity of our methods.We note that the systematic source of bias in equation (3) can be zero without F n a w , , being balanced to F n , depending on μ x 0 ( ) and μ x 1 ( ).However, knowledge of such an event a priori is impossible without knowl- edge of the mean potential outcome functions; as such, a measure of distributional imbalance is critical to characterize the degree of bias for a given dataset.The terms in (5) drives the variability of τ ˆw and can be straightforwardly mitigated with a measure of dispersion of the weights [33]; we discuss this more at the end of Section 3; however, this is not the focus of this article.
The notion that the covariate distributions should be balanced to obtain a good estimate of τ is not new (see Imai and Ratkovic [12] and Li et al. [20], among many others).In fact, it is well understood that the weights resulting from correctly specified propensity score models asymptotically balance covariate distributions.However, slight model misspecifications of the propensity score may result in poor performance of (2) [11].Further, two units with the same propensity score do not necessarily have the same covariate values, so in finite samples, propensity score weights may not be optimal.Other approaches that aim to estimate weights by balancing prespecified moments of the covariates [12,14] tend to be more robust than directly modeling π x ( ).Yet, the term (3) makes it explicit that bias is directly related to distributional imbalance and that, holding μ a fixed, weights that yield less distributional imbalance will in general result in less bias.A natural conclusion is that w should be estimated to directly balance each F n a w , , to F n .In the following, we introduce a new distance metric between F n a w , , and F n , which enables one to characterize how well a set of weights re-balances the weighted distribution F n a w , , to F n .

Weighted energy distance
We introduce next a new measure of the distributional balance induced by a set of weights.This measure is based on the energy distance, which is a metric on distributions [34].Due to the link between estimation bias and distributional imbalance, our measure can be used to evaluate the degree of bias one expects from a given set of weights and a given dataset.We will later leverage this measure to construct distributional balance weights that minimize this metric for a given dataset.The energy distance (as surveyed in Székely and Rizzo [34]) is defined as follows.Let G and H be two finite- mean distribution functions on , and let . The energy distance between distributions G and H is defined as follows: where ⋅ 2 ‖ ‖ is the Euclidean norm.When both G and H are ECDFs, i.e., G n is the ECDF of ) can be expressed as follows: The energy distance has been used within a wide variety of statistical methods, e.g., for testing equivalence of distributions, for testing statistical independence [35], and for generating samples from a target distribution [36].The energy distance is simple to compute as it only involves sums of Euclidean distances, it is nonnegative, and it can be shown to be equivalent to a distance between the characteristic functions of G and H , indicating that a value of 0 occurs if and only if G and H are the same distribution and a positive value indicates how similar they are.The name "energy" comes from the fact that the population energy distance (or limit of the empirical energy distance) has an intricate connection with gravitational potential energy (which depends on the distance between two bodies of mass).Two key attractive properties of the energy distance are that it tends to perform well in measuring the distance between two distributions in moderately high-dimensional settings, and it does not require the choice of a kernel or any tuning parameters.
We propose a weighted modification of this energy distance, which measures the distance between a weighted distribution, i.e., the weighted covariate ECDFs F n w ,0, and F n w ,1, for the control and treated, and a target distribution, i.e., the combined covariate ECDF F n .The weighted energy distance between F n a w , , and F n is defined as follows: Energy balancing of covariate distributions  5 In other words, F F , )is the energy distance between the ECDF of a sample { } and a weighted ECDF of a subsample . We show this new weighted energy distance is indeed a distance between F n a w , , and F n .Let t s , ⟨ ⟩ be the inner product of vectors t and s and define the empirical characteristic function (ECHF) of { } and the weighted ECHFs of as follows: exp , and 1 exp , , respectively.The following proposition establishes the distance property of the weighted energy distance.
Proposition 2.1.Let w be a vector of weights such that ∑ is a constant, and ⋅ Γ( ) is the complete gamma function.Thus, ) Thus, the weighted energy distance is a distance between the weighted distribution of interest and the target distribution, and is thus a bona fide measure of distributional balance of covariates induced by a set of weights.This proportion extends the duality results of Proposition 1 from Székely and Rizzo [34] and Theorem 1 from Székely et al. [35] for the weighted energy distance at hand.A subtle point is that Proposition 2.1, combined with the decomposition presented in Section 2.2, makes clear that F F , )more closely aligns with evaluating how well a set of weights estimates the sample average treatment effect [32] than the population ATE τ that is our main focus.Yet, as long as F n is representative of F , the weighted energy distance still aligns well with the population ATE.
We now show that the weighted energy distance converges to the energy distance when the weights yield a well-defined limiting distribution.
. Further assume, for a sequence of weights Theorem 2.2 shows that the weighted energy distance converges to the limiting energy distance.If the limiting distribution implied by a set of weights is

and Theorem 2.2 together imply that weights with smaller values of the sum
) yield better distributional balance of covariates.Due to the link between imbalance and bias (see the following section), this also implies that better balance will yield estimates with smaller values for the term (3).

Bias and distributional imbalance
We now demonstrate this connection between bias and distributional imbalance (as measured by the weighted energy distance) using two illustrative examples.A more formal presentation is provided later in Section 3.2 when proving asymptotic properties of EBWs.In the first illustrative example, we generate a one-dimensional covariate of sample size 250, which impacts treatment assignment via a logistic model under each of three scenarios: , and (3) . In each scenario, the response is generated as ).For each scenario, we construct IPWs based on three logistic regression models, which consider only a linear term in X (denoted as "IPW (1)"), a linear plus quadratic term ("IPW (2)"), and up to the cubic term ("IPW (3)"), respectively.Thus, for Scenarios 2 and 3, at least one of the fitted models is misspecified.For each set of weights w, we compute F F , ), i.e., the sum of the energy distances between each treatment group and the combined sample, and compute the error − τ τ w  of (2) for τ. Figure 1(a) displays the energy distances and biases over 1,000 replications of the experiment.We see that the energy distances are the largest in all scenarios for the unweighted estimator ((2) with all weights equal to 1).For scenarios where the weights are estimated using a misspecified model (IPW (1) in Scenario 2 and IPW (1) and (2) in Scenario 3), the energy distances are much larger than for weights based on correctly (or overspecified) models.Correspondingly, the bias is pronounced for the misspecified models.Thus, the weighted energy distance can be a useful tool to compare between different models, as weights with smaller weighted energy distances tend to yield estimates with smaller error.
In the second example, we consider a two-dimensional example where the true assignment mechanism depends on first and second moments of the covariates.We consider several methods for estimate weights: logistic regression, the method of Imai and Ratkovic [12], and the method of Chan et al. [14], each with different moments included for balancing or estimation.We then compare their weighted energy distances and absolute errors of (2) over 1,000 replications.Figure 1(b) displays the distances and errors for each dataset and method.We see that, in general, weights with lower energy distance have a much smaller magnitude of bias in estimating the ATE.

Other measures of distributional imbalance
There are, of course, many ways of measuring distance between distributions in the literature, including the Kolmogorov-Smirnov statistic, f -divergences (e.g., the Kullback-Leibler divergence and the Hellinger dis- tance), the Wasserstein distance, and the maximum mean discrepancy (MMD).The energy distance has several advantages for characterizing distributional imbalance.First, while in general there is no uniformly most powerful nonparametric test for the difference between two distributions, the energy distance is often sensitive to differences in distributions, unlike the Kolmogorov-Smirnov statistic.Second, unlike the energy distance, f -divergences such as the Kullback-Leibler divergence and Hellinger distance do not metrize weak Energy balancing of covariate distributions  7 convergence; this property ensures that the distance remains stable under perturbations of the support of the distributions being measured [37].Third, the energy distance is easy and efficient to compute, unlike the Wasserstein distance.It also works reasonably well in moderately high dimensions [36], whereas the Wasserstein distance suffers from the curse of dimensionality, in the sense that empirical Wassterstein distances converge to the true Wasserstein distance at rate ) [37,38].Finally, while there is a direct link between MMD and the energy distance [39], the energy distance does not require careful tuning of hyperparameters and tends to work well across a wide variety of scenarios.
3 Energy balancing weights

Definition
We will now use the proposed weighted energy distance to estimate weights which (i) match the distribution of covariates of the treated group to the distribution of covariates of the full population, and (ii) match the distribution of covariates of the control group to the distribution of covariates of the full population.
To achieve this, we define the EBWs to be . 0 Thus, the EBWs w n e minimize the statistical energy between the treatment and control groups to the full population.Due to the duality result of Proposition 2.1, minimizing statistical energy is directly equivalent to balancing covariate distributions.The constraints ) serve several purposes: they (i) preserve the sample size of the weighted pseudo-population to that of the study population, (ii) stabilize the estimated weights, and (iii) ensure that F n w ,0, and F n w ,1, are valid distribution functions.Also note that, due to the bias decomposition in (3) and the duality result in Proposition (2.1), the weights w n e are explicitly designed to minimize the key component of the finite-sample bias of τ w n e  , in a manner agnostic to the func- tional forms of μ x 0 ( ) and μ x 1 ( ).If the functional forms of μ x 0 ( ) and μ x 1 ( ) are known, it is certainly possible to construct a different set of weights to better reduce bias, by emphasizing balance in regions of where either μ X 0 ( ) or μ X 1 ( ) are pronounced.However, this information is rarely (if ever) known, and hence this is difficult to achieve in practice except by good fortune.
To illustrate the effectiveness of EBWs for distributional balance, we consider data generated under Scenario 3 of the toy example in Figure 1(a).Figure 2 shows the difference between the weighted ECDF using EBWs of the covariate in the treatment group and the ECDF of the combined (i.e., treated and untreated) sample, for varying sample sizes n.This is compared with the weighted ECDF using weights from the true data- generating propensity score, the estimated propensity score under the correct model, and the unweighted ECDF of the treatment group.As sample size increases, the difference vanishes for EBWs, but much more slowly for both the true and estimated propensity score weights.This demonstrates the improved distributional balance provided by the proposed EBWs, which should then translate to a greater reduction of bias.
EBWs can be naturally extended to handle a wide variety of scenarios, such as for treatments with more than two levels, estimation of the ATT, and for the estimation of optimal ITRs.In the Supplementary Material, we show how these extensions are manifested and empirically demonstrate the benefit of using EBWs for ITR estimation.
A few key distinguishing features of our proposal from the works of Wong and Chan [18] and Kallus [19] are (1) its direct focus on distributional balance rather than moment balance.Despite the connection between balancing moments of an infinite dimensional class of functions and distributional balance, this feature helps alleviate modeling decisions about what moments to balance of what space of moments to balance and has advantages in terms of interpretability and ( 2) by focusing on a measure that does not require tuning parameters to characterize distributional differences, our approach can be applied broadly by practitioners of varying degrees of statistical sophistication.

Asymptotic properties
Next, we show two desirable properties of the proposed EBWs.We first show that the weighted ECDFs based on EBWs indeed converge to the population CDF F of X.
and that the assumptions presented in Section 2.1 hold.Let w n e be as defined in (10).Then, for ∈ a 0, 1 { },  Energy balancing of covariate distributions  9 Next, we show that the ATE estimator τ w n e  is root-n consistent.To do so, we make use of the following lemma, which provides a connection between bias and distributional balance: Lemma 3.3.Let be the native space induced by the radial kernel ⋅ = − ⋅ Φ 2 ( ) ‖ ‖ on , and suppose ⋅ ∈ μ a ( ) for ∈ a 0, 1 { }.Then, for any weights w satisfying ∑

(
) .Under the mild condition that the conditional mean function μ x a ( ) is found in (this is discussed further at the end of the section), this lemma shows that the weighted energy distance is a key component in an upper bound on the systematic bias in (3).We note that Lemma 3.3 applies to any set of auto-normalized weights, justifying the use of the weighted energy distance to compare between different sets of weights.The proposed EBWs w n e , which minimize F F , ) , may therefore yield lower estimation bias via this upper bound, which is in line with the empirical observations in Section 2.4.With this, we now state the result on root-n consistency: Theorem 3.4.Assume the same conditions in Theorem 3.1.Let be the native space induced by the radial kernel ⋅ = − ⋅ Φ 2 ( ) ‖ ‖ on .Suppose the following mild conditions hold: , where ′ ″ ′′′ F X X X X , , , ĩ Then the proposed EBW estimator τ w n e  is root-n consistent, i.e., We give a brief discussion of Assumptions (A1)-(A5).Assumption (A1) concerns the regularity of the conditional mean functions μ 0 and μ 1 .It can be shown that the Sobolev space ⌈ + ∕ ⌉ W p 1 2 ,2 ( ) ( ) the space of functions with square-integrable < ⌈ + ∕ ⌉ r p 1 2 ( ) -th differentials -is contained within the native space ( ) (Theorem 10.42 of [40]), so (A1) can be viewed as a smoothness assumption on the conditional ATE − μ μ 1 0 (a similar assumption is made in [18]).Assumptions (A2) and (A3) require the conditional mean functions to have finite variance and the conditional variance function to be bounded, respectively.Assumption (A4) require the kernels h 0 and h 1 to have finite second moments; these kernels can be seen as a "modified" Euclidean kernel weighted by the true propensity scores.Assumption (A5) assumes all EBWs to be bounded above by ∕ Cn 1 3 for some positive constant > C 0; this assumption has been used in several weighting-based covariate balancing methods [18,25].In practice, (A5) can always be checked after optimizing for the EBWs in (10).One can change the optimization procedure (10) to explicitly enforce (A5), but we have never encountered "exploding weights," which violate (A5) in practice; EBWs in our simulations and data analysis typically satisfy (A5) with = C 1.As we discuss in Section 3.4, if one is willing to add a penalty on the dispersion of the weights, the explicit assumption on the maximum of the weights is unnecessary and root-n consistency still holds, as shown in the Supplement.
While Theorem 3.4 proves the desired root-n consistency of the proposed EBW estimator τ w n e  , it unfortu- nately does not shed light on its asymptotic variance, which is useful for constructing confidence intervals on τ.We recommend the use of bootstrapped confidence intervals for the EBW estimator.We explore a connec- tion to importance sampling in the Supplementary Material.

Optimization
The optimization problem (10) for computing EBWs (and the later optimization problem ( 16) for obtaining three-way EBWs) can be viewed as quadratic programs with linear (in)equality constraints.There has been much work on efficient algorithms for solving such programs [41], including interior point methods [42], augmented Lagrangian techniques [43], and extensions of the simplex algorithm [44].A recent development is the operator splitting approach in Stellato et al. [45], which provides a reliable alternative for nonpositive definite quadratic programs.
In our implementation, we made use of well-maintained interior-point cone programming solvers in the R package cccp [46] for optimizing the EBW formulations (10) and (16).Such solvers follow a two-stage procedure: finding an initial feasible solution for w, then iteratively refining this solution by traversing the interior of the feasible region.Detailed algorithmic steps can be found in Wright et al. [47].Interior-point algorithms enjoy empirical and theoretical advantages for solving large-scale quadratic programs (see further discussion in Gondzio [48] on its scalability and complexity), and we have observed excellent performance of such algorithms for EBW optimization.In practice, we recommend that the covariates should be normalized so that each covariate has mean zero and unit variance, before being used as inputs for the optimization problem.An efficient implementation of these methods for EBW optimization is provided in our R package ebw, which will be released on comprehensive R archive network in the near future.

Controlling weight variability
In our experience, the weights resulting from the optimization criterion (10) rarely, if ever, result in large weights; however, in the literature there has been an emphasis on methods that afford explicit control on the variability of weights [33].To allow for such within our framework, one can simply add a penalty 2 to the criterion (10) for some fixed > λ 0. This penalty can be thought of as a penalty on the inverse of the effective sample size of the weights [33].In our experience, since the energy distance criterion is not nearly as variable as using a kernelized distance using a universal kernel such as the Gaussian kernel, careful tuning of λ is not at all critical and we advocate for a simple choice such as = λ 1 or not using any penalty at all.On the other hand, if one were to replace the energy distance with a kernelized distance using a universal kernel, the choice of λ becomes critical.In the Supplement, we show analogs of Theorems 3.1 and 3.4 hold when using a penalized energy distance criterion, although the conditions regarding the magnitude of the weights in the latter theorem can be relaxed.

Three-way EBWs
The EBWs in (10) are designed to balance the distributions of covariates between each treatment group to that of the combined sample { } .As such, the treatment group should be asymptotically balanced to the control group.Yet for finite samples, EBWs do not necessarily guarantee good distributional balance between the treatment and control arms, which can be important in practice.Consider the following re-expression of the two terms in (3): Energy balancing of covariate distributions  11 Terms ( 14) and ( 15) shed light on how the choice of w impacts estimation error for the ATE.In particular, this error term depends not only on (i) how close the weighted ECDFs F n w ,0, and F n w ,1, are to that of the combined sample F n but also (ii) how close the weighted ECDF for the control group F n w ,0, is to the weighted ECDF of the treated group F n w ,1, .The EBWs in Section 3 take care of the imbalance in (i), but not the imbalance in (ii) between treatment and control.With finite samples, imbalance in (ii) can result in ( 14) dominating (15), depending on the properties of μ μ x x , 0 1 ( ) ( ), and Indeed, the importance of this three-way balance is recognized in the literature, and is emphasized in Chan et al. [14] as an important component in constructing globally efficient estimators based on moment balancing.In our case, the target is the three-way distributional balance, i.e., balance in (i) and (ii).
We propose an extension of EBWs, the improved EBWs (iEBWs), to help improve covariate balance between the treatment and control groups.These improved weights are defined as follows: where is the energy distance between the weighted ECDFs for treated and control.Thus, the iEBWs w n ei aim to minimize imbalance not only between treatment arms and the full population but also between the treatment arm and the control arm.

Extensions and applications of EBWs 4.1 Estimation of the ATT
A common target of estimation is the (population) ATT, , which is the mean difference in potential outcomes among those who are actually treated.Due to the unconfoundedness assumption, we can write which suggests that a plug-in estimator can be obtained by replacing F x 1 ( ) with a suitable energy-weighted ECDF.To do so, we define the new weights = w w w , …, where ).Thus, w n e1 balances the covariate distribution for the control group to the covariate distribution of the treated group.Similar to Theorem 3.1, we have With the new weights w n e1 , the natural plug-in estimate for the ATT τ 1 ( ) is τ ˆwn e1 (i.e., the weighted estimator in (2) with = w w n e1 ).

Multi-category treatments
EBWs can also be constructed for multi-category treatments.When the treatment A i takes multiple values, i.e.
, one for each level.The unconfoundedness assumption in this setting becomes | .Following Lopez and Gutman [49], the positivity assumption required is now < , with EBWs.We define the EBWs for the multiple treatment case as follows: for all and 0 for 1, …, .
Improved EBWs, which encourage covariate balance between all pairs of treatment options, can be defined similarly as (16), where an additional weighted energy distance between each pair of treatment options is added to the objective.Given any two treatment options ′ ∈ a a , , we can then estimate the causal contrast

Estimation of ITRs
As many treatments exhibit heterogeneous effects for different patients, there is great interest in tailoring treatment decisions to patients.A main line of work in this area is the development of statistical methods aimed at finding an optimal ITR, which maps patient characteristics to treatment decisions.Thus, the immediate goal is to estimate a mapping ↦ d : 0,1 { } which optimizes the expected potential outcomes under the distribution induced by d.Following Qian and Murphy [23], and Zhao et al. [24] and assuming that larger values of the outcome are preferred, the optimal ITR is defined as where and the second equality holds due to the causal assumptions outlined in Section 2.1.The optimal ITR d* has the property that The last term in (19) appears as a weighted classification task due to the weighted 0-1 loss.With observed data, the objective becomes to minimize Due to the nonconvexity of (20), in practice is replaced with a surrogate, such as the hinge function − )) (see, e.g., the outcome weighted learning (OWL) method of Zhao et al. [24]).
Energy balancing of covariate distributions  13 Yet, this objective still requires estimation of the propensity score π X ( ) and involves plugging in this estimate to (20), which subjects it to the same issues of propensity score weighted estimates of τ.To see how EBWs can be used in place of π A X , ( ), we can express (19) as This suggests a plug-in estimator by replacing F x ( ) with energy-weighted ECDFs (either . Thus, we propose to estimate the optimal ITR as follows: where w i e could be replaced with improved weights w i ei and ϕ a d , ( ) is a convex surrogate for ≠ I a d ( ), such as the hinge function or logistic loss.As in Zhao et al. [24], to prevent overfitting, a penalty λ d n 2 ‖ ‖ can be added to (21) to control complexity of the estimated ITR.
To demonstrate the effectiveness of using EBWs in optimal ITR estimation, we provide an illustrative example under two data-generating scenarios.For both scenarios, we generate outcomes as = Y g X ( ) , where g X ( ) are the main effects of X, . . .

(
).Both scenarios are motivated by the simulation studies of Zhao et al. [24], but generate A from a logistic regression model with terms depending on up to third-order polynomials in a subset of the predictors, and g X ( ) contains nonlinear terms in the predictors.Full details of the experiment are given in the Supplementary Material.We utilize the OWL method to obtain estimates d ˆ, which uses inverse weighting via the propensity score and adds a penalty λ d n 2 ‖ ‖ to the objective.For OWL, the propensity score is misspecified to only include linear terms in the covariates.We also estimate d* by minimizing (21) plus λ d n 2 ‖ ‖ .We denote this as OWL (EBW) for weights given by (10) and OWL (iEBW) for weights given by (16).We simulate 1,000 independent datasets and compute the average value function Y d  [ ( )] evaluated on a large independent dataset in addition to the missclassification rate in esti- mating d X *( ) on the independent dataset.The results are given in Table 1, which indicates that EBWs can yield better performing ITRs.

Simulation studies
To evaluate the finite sample performance and operating characteristics of our proposed estimators, we conducted a large-scale simulation study across a wide variety of data-generating scenarios.Since existing techniques, such as empirical calibration balancing and the covariate balancing propensity score, work exceedingly well when the correct moments are specified to be balanced, we primarily consider simulation settings where the relationships between covariates and both the treatment status and outcome regression model are nonlinear.We consider a wide range of scenarios for the propensity score and outcome regression models, several of which are examples taken from the existing works.Outcome models B and E are taken from [18], outcome model D is taken from Wong and Chan [50], and outcome model A is a slight modification of an

(
).To align with the setup of Kang and Schafer [11], we utilize covariate setup 2 for any setting with propensity model III.To align with the setup of Wong and Chan [18], we utilize covariate setup 2 for any setting with outcome model B or E. All other settings use covariate setup 1.
We compare our proposed EBW and the iEBWs ( 16), both with no penalty on the squares of the weights, with several widely used alternatives.First, we compare with inverse propensity score weights (denoted "IPW").In addition, we compare with the covariate balancing propensity score weights (denoted "CBPS"), the empirical calibration balancing weights (denoted "Cal") with exponential tilting weights.For all methods that require specification of a model for the treatment assignment or moments to balance, only first-order terms in X were used, as in Wong and Chan [18].This reflects a setting where the analyst misspecifies that only first moments should be balanced.We also utilize the kernel-based functional covariate balancing approach of Wong and Chan [18], denoted as "KCB" for kernel covariate balancing with the second-order Sobolev kernel.As a baseline for comparison, we investigate a naïve unweighted estimator that simply compares the means between treated groups and denote this as "Unweighted."For all estimators, we use weights normalized by treatment group.Thus, the IPW approach is the Hájek estimator as opposed to the more unstable, nonnormalized Horvitz-Thompson estimator.For each setting, we generate 1,000 independent datasets and evaluate each method based on the square root of the mean-square error (RMSE) and bias in estimating the ATE.Simulations for IPW, Cal, and CBPS are conducted using the WeightItR package [51] and that of KCB using the ATE.ncb package.
For the sake of brevity of presentation, we present a summary of the results across all outcome models.More detailed results are presented in the Supplementary Material.Table 2 contains a summary of the results averaged across outcome models (A-E) and dimension settings ( ∈ p 10, 25 { }).Each entry in the table is the average rank of each method in terms of RMSE and bias for each combination of outcome model and dimension; i.e. the method with the smallest RMSE for a particular setting receives a "1" and the method with the largest RMSE receives a "7."The Supplementary Material contains a similar summary, but averaged There are substantial empirical differences in the distributions of many of these covariates between treatment groups (RHC vs. no RHC).In Section 6.2, we study the effect of RHC on 30-day survival.However, since there is no ground truth available, in Section 6.3, we use the RHC data to conduct a realistic simulation that demonstrates the effectiveness of EBWs.

Analysis of RHC data
We used 65 of the available covariates, as in the analysis of the same dataset in Rosenbaum [53], leaving out date-related covariates.Using the 65 covariates, we applied the weighting methods used in Section 5 (except the method of Wong and Chan [18] as the code returned constant weights of 1 regardless of the tuning parameters used) to estimate weights to balance the treated groups.To first investigate how well each method balances the marginal means of each covariate, we evaluate the absolute standardized mean differences for each covariate and p-values for the difference in weighted means between treated and control groups, which are displayed in Figure 3. Empirical calibration balancing is designed to balance all specified moments exactly, and since we balance the first moments, these have exact balance.Both EBWs still result in tight moment balance despite the fact that this was not an explicit goal.In Figure 4, we investigate how well each method balances the distributions of all 1-and two-dimensional projections of the covariates.To do this, we compare the weighted ECDFs for each projection by evaluating the square root of the integrated mean-square error (RIMSE) of the weighted ECDFs between the two treatment groups.Both EBWs yield the smallest RIMSEs on average.However, it is important to note that lower-order projections of the distribution are not the explicit focus of EBWs and thus, the EBWs also likely balance other aspects of the distributions of covariates between the treatment groups.We display further measures of discrepancy of the covariates between the treatment groups in Table 3.In addition to displaying the average RIMSE values and absolute SMDs, in this table, we display the weighted energy values for each of the weights, including the unweighted energy distances.By definition, EBWs have the minimum weighted energy distances; however, it can be seen that the empirical calibration balancing weights have a relatively small weighted energy distance, indicating its use is likely sensible for the RHC data.The estimates of the ATE of RHC and its standard errors are displayed in Table 3.Standard errors are computed using the nonparametric bootstrap with 1,000 replications.The EBW-based estimates have the smallest standard errors, and, perhaps more interestingly, yield an estimate of the effect of RHC that is slightly less deleterious than estimates in the literature, despite it still being a highly significant effect.

RHC data simulation
In this section, we fix the covariates and treatment assignments from the RHC data, and simulate responses with confounding.This produces a realistic and highly challenging simulation scenario with high dimension.We use the key functional form from outcome model D from Section 5. Since this outcome model involves only 7 covariates, we use the functional form from this outcome model, apply it over 8 separate groups of 7 covariates in the RHC data, and take the sum of all groups of 7 covariates as the mean of the outcome.We then simulate 1,000 independent datasets using this procedure and each time estimate the ATE and record each method's RMSE in estimating the true ATE.Since under this model the ordering of the covariates changes the nature of the confounding, we randomly permute the columns 100 times and repeat the simulation 1,000 times for each permutation and each time record the RMSE over the 1,000 replications.Further details are provided in the Supplementary Material.The RMSEs for each of the 100 column permutations are displayed in Figure 5.Both EBW and iEBW consistently yield among the smallest RMSE across the permutation settings.
Since the outcome model in this simulation involves a constant treatment effect, and thus is not impacted by potential issues of covariate overlap, and in the Supplementary Material, we additionally consider an outcome model with a treatment effect that depends on X; the pattern of the results is consistent with the findings using a constant treatment effect, albeit the advantage of EBWs is slightly more pronounced with the heterogeneous treatment effect scenario.

Analysis of the MIMIC-III critical care database
We analyze the effectiveness of three treatments based on separate subpopulations of the MIMIC-III v1.4 critical care database [3]: a study of the effect of indwelling arterial catheters (IACs) on mortality in patients with respiratory failure [54], a study of the effect of transthoracic echocardiography (echo) on mortality in sepsis patients [55], and a study of mechanical power of ventilation (MPV) on mortality in critically ill patients [56].Each study is based on the existing studies utilizing MIMIC-III.The degree of confounding in each study varies, with the IAC and MPV studies exhibiting a great degree of confounding and the echo study with minimal confounding.For all studies, missing values were imputed using missForest [57].We present the IAC study in this section and the remaining two studies in the Supplementary Material.For each study, we present treatment effect estimates and balance statistics using each method used in the main text.For all studies, we also use the covariates and treatment assignments of the observed data to conduct simulation studies, using the same approach used for the simulation based on the RHC data.In essence, with these simulation studies we preserve the treatment assignment mechanism of the observed data and simulate outcomes that involve a high likelihood of confounding under this real-world treatment assignment mechanism.The simulation studies investigate scenarios with a constant treatment effect over X and with a treatment effect that varies with X but results in a (population and sample) ATE of 0. The outcome models used are the same as described in the Supplementary Material.Each dataset has a differing number of confounders; however, the outcome model across all datasets involves 63 covariates impacting the response for any given scenario.As in the setup for the RHC data simulation, the columns are permuted 100 times, resulting in 100 separate outcome models with different covariates impacting the response in different ways.

IAC data
In this section, we replicate a study originally conducted Hsu et al. [54] based on the MIMIC-III critical care database to study the effect of indwelling arterial catherization on 28 day mortality.The data are based on the queries provided in https://github.com/MIT-LCP/mimic-codeandcontains information on 2,522 mechanicallyventilated patients, 1,298 of whom received the treatment, IAC.The outcome is an indicator of 28 day mortality from time of admission.Pre-treatment covariates likely to be confounders include demographics, lab values, calculated risk scores, missingness indicators, and more, totaling to a design matrix with = p 81.We leveraged the same approach used in the RHC study to control for the confounders via various weighting approaches to estimate the ATE of IAC on 28-day mortality.The KCB approach yielded constant weights of 1 regardless of the tuning parameter.Table 4 presents information on the estimated effect of IAC on mortality based on each weighting method in addition to various balance statistics for the 81 covariates, including marginal mean balance, univariate and bivariate distributional balance, and weighted energy statistics.EBW and iEBW yield the best univariate and bivariate distributional balance, while Cal and CBPS result in the best marginal mean balance.Note that Cal does not achieve exact marginal mean balance due to numerical issues.All approaches except inverse weighting by propensity score (IPW) yield 95% confidence intervals that contain zero, albeit with point estimates that suggest a potential benefit of IAC.IPW results in a significant estimated benefit of IAC on 28 day mortality, with a point estimate of the benefit that is substantially larger than for other methods.
In addition, we conduct a simulation study based on the IAC data with precisely the same responsegenerating mechanism as for the simulation on the RHC data in Section 6.3.Although the dimension of the IAC data is higher, the number of dimensions that impact the response for each data-generating setting is 63, the Table 4: Estimates of the ATE and standard errors for the IAC data.Standard errors were computed for all methods using the nonparametric bootstrap with 1,000 replications.Also displayed are various measures of discrepancy between the distributions of covariates for the IAC and control groups.We also display the mean and max RIMSE statistic for marginal univariate and bivariate CDF differences, as in Figure 4.In addition, we display summary statistics of SMDs for marginal means and SMDs for all polynomials up to order 5 and pairwise interactions (denoted SMD( 2

Figure 1 :
Figure 1: (a, left) Energy distances and biases for IPW estimates based on weights from the three fitted logistic regression models; (b, right) Boxplots of the biases for IPW estimates versus weighted energy distance based on weights estimated by several methods, each with different combinations of moments included for balancing or estimation.

1 : 3 . 2 .
result in the almost sure convergence of the weighted ECDFs of the treated (and untreated) group to the underlying covariate distribution F .The consistency of τ w n e  follows immediately from Theorem 3.Corollary Suppose the conditions of Theorem 3.1 hold, and that the treatment and control mean response functions μ x 0 ( ) and μ x 1 ( ) are bounded and continuous on .Then τ w n e  is a consistent estimator of τ.

Figure 2 :
Figure 2: Displayed are the absolute differences between the ECDF of the combined sample and the weighted ECDF of the covariate in the treatment group based on energy weights.Also displayed the same for the unweighted ECDF of the treated group and the weighted ECDF based on the true and estimated propensity scores.

3
the kernel h a is defined for = a 0, 1 as follows: for some constant > C 0 independent of n.

Figure 5 :
Figure 5: RMSEs for each method across 100 outcome models using the RHC data. ,

Table 1 :
Displayed are the value functions and misclassification rates for the optimal ITR estimation example averaged over 1,000 independent simulated datasets

Table 2 :
[7,20,52,53]e the ranks among all methods tested of each method in terms of RMSE and bias averaged over all response models (I-VI) for = n 250 and over the dimension settings = p 10 and = p 25.The bold values indicate the best performance across all methods for a given setting Energy balancing of covariate distributions  15over propensity models and dimensions.In general, iEBW tends to yield the best rank in terms of performance across the settings, with EBW and KCB yielding similarly low ranks.studybyConnorsetal.[2]was conducted to investigate the effectiveness of RHC, a diagnostic procedure for critically ill patients in ICUs.Since RHC is more relevant for certain forms of intensive care than others, there is substantial imbalance in patient characteristics in those treated with RHC and those who did not receive RHC.The original analysis was based on propensity score matching, and the data have been subsequently reanalyzed in many other works[7,20,52,53].The study consists of data on 5,735 individuals, 2,184 of whom received RHC, and the remaining 3,551 did not receive RHC.The outcome is an indicator of survival at 30 days after admission.A panel of experts convened to discuss which variables contribute to a decision to use RHC, resulting in a large set of covariates to study (72 in total, 21 of which are continuous, 25 binary, and 26 dummy variables originating from 6 categorical covariates).The dataset is publicly available at: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.html. A

Table 5 :
)).The bold values indicate the best performance across all methods for a given setting Displayed are the median, mean, standard deviation, and maximum RMSEs for each method across the 100 simulation settings using the IAC data.The bold values indicate the best performance across all methods for a given setting