Adaptive normalization for IPW estimation

Inverse probability weighting (IPW) is a general tool in survey sampling and causal inference, used both in Horvitz-Thompson estimators, which normalize by the sample size, and Hájek/self-normalized estimators, which normalize by the sum of the inverse probability weights. In this work we study a family of IPW estimators, first proposed by Trotter and Tukey in the context of Monte Carlo problems, that are normalized by an affine combination of these two terms. We show how selecting an estimator from this family in a data-dependent way to minimize asymptotic variance leads to an iterative procedure that converges to an estimator with connections to regression control methods. We refer to this estimator as an adaptively normalized estimator. For mean estimation in survey sampling, this estimator has asymptotic variance that is never worse than the Horvitz-Thompson or Hájek estimators, and is smaller except in edge cases. Going further, we show that adaptive normalization can be used to propose improvements of the augmented IPW (AIPW) estimator, average treatment effect (ATE) estimators, and policy learning objectives. Appealingly, these proposals preserve both the asymptotic efficiency of AIPW and the regret bounds for policy learning with IPW objectives, and deliver consistent finite sample improvements in simulations for all three of mean estimation, ATE estimation, and policy learning.


Introduction
Consider the problem of estimating a mean from samples that are observed with non-uniform probabilities. Formally, we want to estimate the mean µ of a set of responses Y_1, …, Y_n, but only observe Y_1 I_1, …, Y_n I_n, where I_k is an indicator of whether or not unit k was observed. This problem is a fundamental primitive of many problems in survey sampling, causal inference, and beyond. We focus on the case where the I_k are independent and distributed as I_k ∼ Ber(p_k), which can correspond to a randomized experiment under a Bernoulli design with known p_k or to certain observational contexts.
The standard estimators for µ are the Horvitz-Thompson and Hájek estimators [16,2], both of which are based on the idea of inverse probability weighting (IPW). To introduce these estimators, we first define

Ŝ = Σ_{k=1}^n Y_k I_k / p_k   and   n̂ = Σ_{k=1}^n I_k / p_k,

and set μHT = Ŝ/n and μHájek = Ŝ/n̂. These estimators have several desirable properties: μHT is unbiased and admissible in the class of all unbiased estimators [10], while μHájek is approximately unbiased and often has lower variance than μHT [27].
The idea of inverse probability weighting also figures prominently in the Monte Carlo literature on importance sampling, where the Horvitz-Thompson and Hájek estimators are known as the importance sampling (IS) and self-normalized importance sampling (SNIS) estimators, respectively. In 1954, working in this context, Trotter and Tukey [31] briefly entertained the idea of a family of estimators that generalizes μHT and μHájek, namely

μλ = Ŝ / ((1 − λ)n + λn̂),   (1)

for λ ∈ R. This family contains Horvitz-Thompson as the special case λ = 0 and Hájek as the special case λ = 1. Curiously, Trotter and Tukey emphasized that λ need not be constrained to [0, 1], and that values outside that range might sometimes be useful [31].
Although this proposal first appeared nearly 70 years ago, it has not received any significant attention in the literature. In this work, we consider this proposal in detail. When λ is selected as a function of the data to optimize some criterion, we refer to this idea as adaptive normalization. We show in Section 3.2 how choosing a member of the family (1) in a data-dependent way, to iteratively minimize asymptotic variance, leads to estimators that improve on both the Horvitz-Thompson and Hájek estimators in terms of asymptotic variance, despite the additional variance incurred by estimating λ from the data. We also show how the adaptively normalized estimator can be understood as a regression control/control variate method in Section 3.4.
As applications, we deploy adaptive normalization in diverse settings where the Horvitz-Thompson estimator and Hájek estimator are traditionally used. The popular AIPW estimator of [24] essentially uses Horvitz-Thompson as a primitive, and we replace the sample size normalization with adaptive normalization to obtain an estimator that preserves the asymptotic efficiency of the AIPW estimator while improving performance in simulations. Similarly, the problem of estimating average treatment effects (ATEs) can be tackled by separately estimating the mean response in the treated group and the mean response in the control group using adaptive normalization. Finally, we also show that a new objective for policy learning based on adaptive normalization preserves the regret bounds of [18] and learns better policies in simulations. More broadly, our analysis strongly suggests that choosing normalizations other than n or n̂ is worthy of attention in other contexts where inverse probability weighted estimators appear, such as off-policy evaluation or reinforcement learning.
Motivation for adaptive normalization. Why would we want to use values of λ learned from the data in (1) rather than λ = 0 or λ = 1, especially values of λ < 0 or λ > 1? As a starting point, consider λ = 0, the Horvitz-Thompson estimator, and λ = 1, the Hájek estimator. As mentioned above, the Hájek estimator often has lower variance than the Horvitz-Thompson estimator. One reason for this variance reduction is that the numerator and denominator of the Hájek estimator are positively correlated, while the numerator and denominator of the Horvitz-Thompson estimator are uncorrelated.
To see the role this correlation plays, consider a toy example with n = 10 units with responses and response probabilities

Y_k = 1 for all k,   p_1 = 10^{-5},   p_k = 1/2 for k = 2, …, 10.

Then, we claim the Horvitz-Thompson and Hájek estimators have very different behaviors on the event that Y_1 is observed. For our illustration, assume units 2-5 are observed but 6-10 are not.

Figure 1: The MSE of μλ as a function of λ in an instance of the normal model described in (16) with n = 250, µ = 1, and ρ = 0.2. Note that neither the Horvitz-Thompson (λ = 0) nor Hájek (λ = 1) estimator minimizes the MSE. Furthermore, the optimal choice of λ is greater than 1.
For the Horvitz-Thompson estimator, if Y_1 is observed then the numerator of the estimator increases from 8 to 8 + 1/10^{-5} ≈ 10^5, while the denominator, n, remains fixed at 10, so our estimate increases from 8/10 to (8 + 10^5)/10 ≈ 10^4. For the Hájek estimator, if Y_1 is observed, the numerator similarly increases by 10^5, but now the denominator also increases by 10^5, and so our estimate is less affected. The positive correlation between the numerator and denominator of the Hájek estimator provides a kind of shrinkage that serves to reduce variance.
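The contrast above can be checked numerically. A minimal sketch, using the toy values implied by the text (Y_k = 1 for every unit, p_1 = 10^{-5}, p_k = 1/2 otherwise):

```python
import numpy as np

# Toy example implied by the text: Y_k = 1 for all 10 units,
# p_1 = 1e-5, p_k = 1/2 for k >= 2; units 2-5 observed, 6-10 not.
Y = np.ones(10)
p = np.array([1e-5] + [0.5] * 9)

def ht_and_hajek(unit_1_observed):
    """Horvitz-Thompson and Hajek estimates for the toy sample."""
    I = np.array([unit_1_observed] + [True] * 4 + [False] * 5, dtype=float)
    S_hat = np.sum(Y * I / p)   # IPW-weighted sum (the shared numerator)
    n_hat = np.sum(I / p)       # sum of the inverse probability weights
    return S_hat / len(Y), S_hat / n_hat

ht_out, hajek_out = ht_and_hajek(False)  # unit 1 unobserved
ht_in, hajek_in = ht_and_hajek(True)     # unit 1 observed
```

Observing unit 1 moves the Horvitz-Thompson estimate from 0.8 to roughly 10^4, while the Hájek estimate barely moves.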
Since the correlation between numerator and denominator in the Hájek estimator is one way that it reduces variance, we would expect that we can further reduce variance by introducing more correlation, which motivates taking λ > 1 in (1). By increasing the weight on the random factor of n̂ in the denominator, we increase the covariance between the numerator and denominator, and thus reduce variance. Put crudely, if going from λ = 0 to λ = 1 reduces variance, shouldn't going from λ = 1 to λ = 2 reduce it even more?
Of course, there is a bias-variance trade-off. Increasing λ decreases the variance of μλ, but increasing λ to be arbitrarily large also shrinks our estimates towards 0 and increases bias. Thus the problem becomes one of choosing a value of λ that provides the correct amount of shrinkage for a given problem. This trade-off is visualized in Figure 1, which shows the MSE of (1) for a simulated problem at a range of values of λ. Ideally, we would choose λ to minimize the curve shown in this figure; unfortunately, we cannot compute the exact MSE of μλ. In this work, we take the approach of showing that μλ is asymptotically normal and then choosing λ to iteratively minimize estimates of the asymptotic variance. Other procedures for selecting λ based on different criteria would lead to other estimators, suggesting a general class of adaptively normalized estimators that may be worthy of further study.
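This trade-off can also be traced with a small Monte Carlo sketch. The model below is illustrative (it is not the paper's model (16)): Y is built to be negatively correlated with p, so values of λ above 1 should help.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_lambda(Y, p, I, lam):
    """The family (1): IPW numerator over a mix of n and the weight sum n_hat."""
    S_hat = np.sum(Y * I / p)
    n_hat = np.sum(I / p)
    n = len(Y)
    return S_hat / ((1 - lam) * n + lam * n_hat)

def mc_mse(lam, mu_true=0.5, reps=2000, n=250):
    """Monte Carlo MSE of mu_lambda under an illustrative model."""
    errs = np.empty(reps)
    for r in range(reps):
        p = rng.uniform(0.1, 0.9, size=n)
        Y = (1.0 - p) + 0.25 * rng.standard_normal(n)  # E[Y] = 0.5, corr(Y, p) < 0
        I = rng.random(n) < p
        errs[r] = mu_lambda(Y, p, I, lam) - mu_true
    return float(np.mean(errs**2))

grid = np.linspace(-0.5, 2.5, 13)
mse = np.array([mc_mse(lam) for lam in grid])
best_lam = float(grid[np.argmin(mse)])  # lands above 1 for this model
```

Here the MSE-minimizing λ on the grid typically exceeds 1, echoing Figure 1.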

Related work
The present work is most closely related to the literature on variance reduction in importance sampling. Since the proposal of importance sampling in [13], there have been a variety of proposals for variance reduction methods, many of which are surveyed in [22]. The one most relevant to our work is the method of control variates, which is also known as difference estimation in the survey sampling community [27]. We show in Section 3.4 that the estimator we derive in Section 3 is algebraically equivalent to a particular kind of control variate estimator, and a slightly modified form of this estimator has been previously discussed by [9] and studied in Monte Carlo simulations by [15]. For proponents of this choice of control variate, adaptive normalization provides an independent motivation and derivation.
Our proposed modifications of the AIPW estimator stand alongside many other proposals: for example, [5,23] use an empirical likelihood approach, while [26] uses a weighting approach, among others. We do not, however, know of any attempt or alternative derivation that is equivalent to the estimator we derive in this context. Similarly, our policy learning proposals build directly on the work of [18], which is itself closely related to work on counterfactual risk minimization in the bandit setting [29], and was extended by [1] using ideas from AIPW estimation.

Problem formulation and notation
In this section we formally specify our model and introduce notation. We assume that pairs (Y_1, p_1), …, (Y_n, p_n) are drawn i.i.d. from a super-population distribution D on R × [0, 1], and that we observe both p_1, …, p_n and Y_1 I_1, …, Y_n I_n, where the I_k are independent and I_k | p_k ∼ Ber(p_k). Two moments of D appear throughout:

π = E[(1 − p_k)/p_k]   and   T = E[Y_k (1 − p_k)/p_k].   (2)

This model differs from previous design-based work on asymptotics of the Horvitz-Thompson and Hájek estimators in assuming a super-population model for the Y_k rather than modeling Y_k as a sequence of fixed finite populations [8,25]. However, we believe our results are still true in a finite population model, with slightly modified proofs, and we choose to work in a super-population model mainly to avoid making many cumbersome assumptions about the existence of limiting moments of the Y_k. In particular, although we are working in a super-population model, we make no parametric assumptions on D, such as assuming a regression model.
Our assumption that the p_k are random as well is best understood in the context of a more general model that we consider in Sections 4.1 and 4.3. There, we consider pairs (Y_1, X_1), …, (Y_n, X_n) drawn i.i.d. from a super-population distribution D on R × X, where the Y_k are responses and the X_k are known covariates. Then, we can think of the treatment probability p_k as a function p_k = p(X_k) of the observed covariates.
We typically assume that the p_k (or equivalently the map p(·)) are known exactly, rather than estimated from the X_k. This is mainly to facilitate asymptotic comparisons between our estimators that are independent of the model used to estimate p_k, and we believe that it is reasonable to use estimated propensities with our proposed estimators. In Section 4.1, where we discuss AIPW estimation, under appropriate consistency and rate conditions, the choice of model for p_k no longer plays any role in the asymptotics, and so we relax this assumption there and allow the p_k to be estimated propensities.
We now specify our assumptions on D.

Assumption 1 (Boundedness and overlap). There exist constants M < ∞ and δ > 0 such that |Y_k| ≤ M and δ ≤ p_k ≤ 1 − δ almost surely.
Under Assumption 1, all of the moments of D are finite. Throughout the next section, our goal is to estimate µ = E[Y_k] in the model presented here. In Section 4, we consider more diverse models and goals.

Adaptive normalization
In this section we propose and analyze a novel procedure for selecting a data-dependent value of λ and describe properties of the adaptively normalized IPW estimator that follows. We establish that the resulting estimator has asymptotic variance no larger than that of either the Horvitz-Thompson or Hájek estimator, and strictly smaller outside of edge cases.

The optimal choice of λ
Before we can understand how to choose λ from the data, we must first understand the behavior of μλ for a fixed λ, as a function of λ and the problem parameters. To this end, we contribute the following central limit theorem, which is a straightforward generalization of known results on the bias and variance of the Hájek estimator. We defer the proof of this result, as well as the proofs of all other results in this section, to Appendix A.
Theorem 1. Suppose Assumption 1 holds. Then, for any fixed λ ∈ R, we have the CLT

√n (μλ − µ) ⇒ N(0, σ²_λ),   where   σ²_λ = Var(Y_k I_k / p_k) − 2λµT + λ²µ²π.   (3)

With this result in hand, we now choose λ to minimize σ²_λ. Moving forward, we assume that µ ≠ 0, since if µ = 0, then σ²_λ does not actually depend on λ, and minimizing over λ is no longer meaningful.
Under the assumption that µ ≠ 0, then, there is a unique value of λ that minimizes σ²_λ in (3), given by

λ* = T / (µπ),   (4)

where π and T are the moments defined in (2). Assumption 1 precludes the possibility that π = 0. We can interpret (4) to shed light on the role λ plays. If the Y_k and p_k are positively correlated, then Y_k and (1 − p_k)/p_k are negatively correlated, so T < µπ and λ* < 1. Similarly, if the Y_k and p_k are negatively correlated, we obtain λ* > 1. This interpretation of (4) extends the conventional wisdom that Hájek is preferable to Horvitz-Thompson when Y_k and p_k are negatively correlated [27].

Estimating λ * from the data
Based on the above asymptotic analysis, we would seem to prefer μλ* over μHT and μHájek. However, we cannot use μλ* directly as an estimator of µ because the prescribed choice of λ* depends on unknown moments of D, including the very mean µ we are trying to estimate. What happens if we estimate λ* from data?
Our expression for λ* depends on three moments of D, namely T, π, µ. The first two of these can readily be estimated from the data by the IPW-style estimators

T̂ = (1/n) Σ_{k=1}^n Y_k I_k (1 − p_k)/p_k²   and   π̂ = (1/n) Σ_{k=1}^n I_k (1 − p_k)/p_k²,   (5)

and we already have the estimator μHT of µ.
As a brief aside, we might wonder whether we can also use an adaptive normalization when estimating T and π, instead of a traditional IPW estimator. Unfortunately, just as adaptively normalizing an estimator of µ requires estimating the higher-order moments T and π, adaptively normalizing an estimator of T or π would require estimating even further higher-order moments. To avoid these complications, we restrict ourselves to using the Horvitz-Thompson normalization.

Now, the estimators in (5) lead us to estimate λ* and µ by

λ̂ = T̂ / (μHT π̂)   and   μ̂ = Ŝ / ((1 − λ̂)n + λ̂n̂).   (6)

Unfortunately, we can check in simulation that this estimator is sometimes even worse than μHT. Essentially, the additional variance introduced by estimating λ* from data can outpace the variance reduction from using a tuned value of λ*. However, we can make a useful observation: the estimator μ̂ in (6) is often a better estimator of µ than μHT, so should we not use μ̂ rather than μHT when estimating λ*? In fact, every time we obtain a better estimate of µ, we can use it to obtain a better estimate of λ*. On the other hand, a better estimate of λ* will, because λ* is the optimal amount of normalization, lead to a better estimate of µ. Combining these ideas leads to the following alternating scheme. Formally, we construct a sequence of estimators (λ̂(t), μ̂(t)) initialized at (λ̂(0), μ̂(0)) = (0, μHT) and defined for t > 0 by the recursions

λ̂(t) = T̂ / (μ̂(t−1) π̂),   μ̂(t) = Ŝ / ((1 − λ̂(t))n + λ̂(t)n̂).   (7)

The first equation in (7) corresponds to estimating λ* using μ̂(t−1) as an estimate of µ, while the second corresponds to estimating µ using λ̂(t) as an estimate of λ*. There are two possible stable limiting behaviors for this sequence: the first is to trivially have μ̂(t) → 0 and λ̂(t) → ∞, while the second is to converge to a fixed point at a pair (μAN, λAN) satisfying

λAN = T̂ / (μAN π̂)   and   μAN = Ŝ / ((1 − λAN)n + λAN n̂).

This system of equations has the unique solution

μAN = μHT + (T̂/π̂)(1 − n̂/n),   λAN = T̂ / (μAN π̂).   (8)

The following theorem formally establishes that the iterations (7) converge to the non-trivial solution, the fixed point at (8).

Theorem 2. Suppose Assumption 1 holds and µ ≠ 0. Then, with probability tending to one, the iterates (λ̂(t), μ̂(t)) defined by (7) converge to (λAN, μAN) as t → ∞.
Thus, our attempts to develop an estimator of µ by estimating the optimal normalization parameter λ* culminate in the estimator μAN, which we refer to as the adaptively normalized IPW estimator. The role of Theorem 2 and the iterative scheme we have presented is to make explicit the connection between adaptive normalization and the final estimator μAN. The fixed point equations alone do not characterize this connection, since, based on those equations alone, we may be led to take μ̂ = 0 and think of this as using an infinite value of λ. Our theorem shows that the iterations (7), which correspond to iteratively learning better and better normalizations, uniquely pick out μAN with high probability.
Before proceeding, we note a very attractive property of μAN in (8): its simplicity. It is, for practical purposes, the Horvitz-Thompson estimator with a correction term, and can replace the Horvitz-Thompson and Hájek estimators in essentially any application where they are used, with minimal additional computation. Beyond this simplicity, we will see in Section 3.3 that μAN typically has lower asymptotic variance than either the Horvitz-Thompson estimator or the Hájek estimator, and in Section 3.4 that it is closely related to several existing ideas in the literature.
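As a concrete sketch (on illustrative simulated data, not one of the paper's models), the estimator — the Horvitz-Thompson estimate plus the correction term (T̂/π̂)(1 − n̂/n) — takes a few lines, and the iterations (7) can be checked numerically to converge to it:

```python
import numpy as np

def ipw_pieces(Y, p, I):
    """Building blocks: S_hat, n_hat, and the moment estimates T_hat, pi_hat from (5)."""
    n = len(Y)
    S_hat = np.sum(Y * I / p)
    n_hat = np.sum(I / p)
    T_hat = np.mean(Y * I * (1 - p) / p**2)   # IPW estimate of T = E[Y(1-p)/p]
    pi_hat = np.mean(I * (1 - p) / p**2)      # IPW estimate of pi = E[(1-p)/p]
    return S_hat, n_hat, T_hat, pi_hat, n

def mu_an(Y, p, I):
    """Closed-form adaptively normalized estimator: mu_HT plus the correction term."""
    S_hat, n_hat, T_hat, pi_hat, n = ipw_pieces(Y, p, I)
    return S_hat / n + (T_hat / pi_hat) * (1 - n_hat / n)

def mu_an_iterative(Y, p, I, steps=50):
    """The recursions (7), started from (lambda, mu) = (0, mu_HT)."""
    S_hat, n_hat, T_hat, pi_hat, n = ipw_pieces(Y, p, I)
    mu = S_hat / n
    for _ in range(steps):
        lam = T_hat / (mu * pi_hat)                  # re-estimate lambda* at current mu
        mu = S_hat / ((1 - lam) * n + lam * n_hat)   # re-estimate mu at current lambda
    return mu

rng = np.random.default_rng(1)
n = 2000
p = rng.uniform(0.2, 0.8, size=n)
Y = 1.0 + p + 0.5 * rng.standard_normal(n)   # true mean is 1 + E[p] = 1.5
I = (rng.random(n) < p).astype(float)

mu_closed = mu_an(Y, p, I)
mu_iter = mu_an_iterative(Y, p, I)
```

On this data the two computations agree to numerical precision, as Theorem 2 leads us to expect.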
An optimization perspective. It may feel roundabout to minimize the asymptotic variance by way of an iterative scheme. We can also derive μAN more directly by framing the problem of selecting λ as a joint minimization, over µ and λ, of the asymptotic variance. Since the first term of σ²_λ in (3) does not involve λ, minimizing the asymptotic variance is equivalent to minimizing

λ²µ²π − 2λµT.

As before, we use T̂ and π̂ defined in (5) to estimate T and π, but now we estimate µ by μλ directly. This leads to the non-convex optimization problem

min_{λ, μ̃}  λ²μ̃²π̂ − 2λμ̃T̂   subject to   μ̃ = Ŝ / ((1 − λ)n + λn̂).

Rather than directly solving this optimization problem, we consider the equivalent problem

min_{λ, μ̃}  π̂(λμ̃)² − 2T̂(λμ̃)   subject to   μ̃ = Ŝ / ((1 − λ)n + λn̂).   (9)

This problem is also non-convex, but is more amenable to analysis. In particular, we claim that, despite its non-convexity, (9) can be easily solved analytically to find a unique optimum.
To solve (9), note that the objective is a quadratic function of λμ̃, and so checking first-order stationarity conditions shows that the unconstrained minimum is achieved for any pair (λ, μ̃) satisfying λμ̃ = T̂/π̂. Then, direct algebra yields that there is a unique pair (λopt, μopt) satisfying both λopt μopt = T̂/π̂ as well as the constraint in (9), and the resulting μopt is precisely μAN.

Properties of adaptive normalization
We now study the asymptotic behavior of μAN , and show that its asymptotic variance generically improves on the asymptotic variance of the Hájek and Horvitz-Thompson estimators.
Asymptotic variance. To understand the asymptotics of μAN, we use the fact that each of Ŝ/n, T̂, π̂, n̂/n is an average of i.i.d. terms. This structure implies that the vector (Ŝ/n, T̂, π̂, n̂/n) satisfies a CLT, and we can then apply the delta method to obtain a CLT for μAN. This argument produces the following result.
Theorem 3. Under Assumption 1, the estimator μAN satisfies the CLT

√n (μAN − µ) ⇒ N(0, σ²_AN),   where   σ²_AN = Var(Y_k I_k / p_k) − T²/π.   (10)

Furthermore, the asymptotic variance in (10) is never larger than the asymptotic variances of μHT and μHájek, and is strictly smaller except if T = µπ or T = 0.
We defer the proof of the CLT to Appendix A, but give the proof of the variance comparison below.
Proof. We begin by expanding the asymptotic variance in (10) as σ²_AN = Var(Y_k I_k / p_k) − T²/π. Then, we first compare with μHT. For μHT, plugging λ = 0 into (3) gives an asymptotic variance of Var(Y_k I_k / p_k). Since T²/π ≥ 0, this is always at least as large as the asymptotic variance in (10), and is strictly larger whenever T = E[Y_k (1 − p_k)/p_k] ≠ 0. For μHájek, plugging λ = 1 into (3) gives an asymptotic variance of

Var(Y_k I_k / p_k) − 2µT + µ²π,

and the difference between this and (10) is T²/π − 2µT + µ²π = (T − µπ)²/π ≥ 0, so the asymptotic variance of μAN is always smaller than the asymptotic variance of μHájek unless T = µπ, in which case they are equal.
Theorem 3 shows that μAN can only ever improve on μHT and μHájek asymptotically. For a more careful understanding of when we expect to see improvement, note that the condition T = µπ can be rewritten as E[Y_k (1 − p_k)/p_k] = E[Y_k] E[(1 − p_k)/p_k], which is equivalent to Y_k and (1 − p_k)/p_k being uncorrelated. This condition will be satisfied, for example, when Y_k and p_k are independent. So in some sense, μAN is exploiting dependence between Y_k and p_k to improve performance, and cannot improve on Hájek in the absence of such dependence.
Another useful observation is that the condition T = µπ corresponds to λ* = 1, while T = 0 corresponds to λ* = 0. Thus we see that the only cases where we fail to improve on μHájek and μHT are those in which μHájek and μHT were coincidentally using the optimal value of λ already.

Finite sample variance.
Examining the form of μAN directly, we can see that if it has lower variance than μHT in finite samples, it must be because the correction term (T̂/π̂)(1 − n̂/n) is negatively correlated with the first term, Ŝ/n = μHT. However, using the iterative scheme introduced in (7), we can give an alternative explanation for why the finite sample variance of μAN is smaller than that of μHT, one that builds on the result of Theorem 2. This explanation is presented in Appendix A.2, following the proof of Theorem 2, and suggests that under Assumption 1 and assuming µ ≠ 0, every two steps of the iterations in (7) reduce finite-sample variance. This further underscores the importance of adaptive normalization as a variance reduction technique; an experimental demonstration of this phenomenon is shown in Figure 2.

Figure 2: The variance of μ̂(t) as a function of t. Although a single iteration may increase the variance, we observe that, in this simulation, every two iterations reduce variance. The data here are generated from the normal model (16) in Section 5 with n = 100, ρ = 0.1, µ = −1.

Connections to regression/control variate methods
In this section, we show how μAN can be understood as a control variate/regression control method, and also how it is connected to augmented IPW (AIPW) estimation.
A popular technique for variance reduction in survey sampling is the use of regression controls, and this technique is also known in the Monte Carlo literature, where regression controls are referred to as control variates. We will now show that μAN is equivalent to a particular choice of regression control/control variate that is known in the Monte Carlo community, but does not appear to have been widely adopted in the survey sampling and causal inference communities.
We follow the discussion in [22], where the author strongly recommends using the importance weights as control variates whenever they are known, based on the results of [15]. In our setting, the random variables w_k = I_k/p_k play the role of the importance weights, since they have mean 1 and re-weight the observed Y_k I_k to have mean µ. Then, following [22], we define the family of estimators

μβ = (1/n) Σ_{k=1}^n (Y_k w_k − β(w_k − 1)),   β ∈ R.   (11)

Each estimator in this family is unbiased, so we choose β to minimize variance. The variance of μβ is minimized at

β* = Cov(Y_k w_k, w_k) / Var(w_k) = T/π,

connecting the importance sampling problem to the notation of our survey sampling problem given in (2). Thus, estimating the numerator and denominator of β* separately by T̂ and π̂ respectively gives the estimator β̂ = T̂/π̂ of β*, and thus the estimator of µ

μβ̂ = μHT − (T̂/π̂)(n̂/n − 1),

which is algebraically equivalent to the estimator μAN in (8) derived via adaptive normalization. We note that the version of this estimator in [22,15] estimates Cov(Y_k w_k, w_k) and Var(w_k) directly rather than first simplifying, a minor difference with our version.
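The algebraic equivalence can be verified directly; a short sketch on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
p = rng.uniform(0.1, 0.9, size=n)
Y = 2.0 - p + 0.3 * rng.standard_normal(n)
I = (rng.random(n) < p).astype(float)

w = I / p                                # importance weights, mean 1
T_hat = np.mean(Y * I * (1 - p) / p**2)
pi_hat = np.mean(I * (1 - p) / p**2)
beta = T_hat / pi_hat                    # estimated optimal control-variate coefficient

# Control-variate estimator with the weights as the control variate
mu_beta = np.mean(Y * w - beta * (w - 1))

# Adaptively normalized estimator: Horvitz-Thompson plus the correction term
mu_an = np.sum(Y * w) / n + beta * (1 - np.sum(w) / n)
```

The two quantities agree up to floating-point error, as the algebra predicts.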
Value of adaptive normalization as a perspective. While the adaptively normalized estimator we have derived is algebraically equivalent to a known estimator in the Monte Carlo literature and also a kind of AIPW estimator, we feel that our derivation, based on the simple idea of combining the denominators of μHT and μHájek, offers a valuable and unique motivation. Furthermore, our iterative analysis of μAN provides instructive intuition for finite-sample variance reduction. Finally, despite these other guises in which the adaptively normalized estimator has appeared, it has not received significant attention in the survey sampling or causal inference community. This inattention is especially surprising when we consider that, in the Monte Carlo setting, the importance weights are often not computable in closed form, and so they cannot be used as control variates. In contrast, in the causal inference setting, the treatment probabilities are often either known from the experimental design or estimated (as propensities), and so μAN is usually available as an immediate improvement over μHT or μHájek. Thus it seems that μAN is more widely known in the community where it is of less practical value, a gap we hope to remedy. Of course, in an applied setting, practitioners will often use other control variates or variance reduction techniques regardless, and whether or not μAN would be preferable to those varies from application to application. But there are still a variety of template settings in causal inference where μHT and μHájek are used by default, and replacing them with μAN can be advantageous, as discussed in the sequel.

Applications
In this section, we consider various causal inference settings in which the Horvitz-Thompson or Hájek estimators are used "by default," and show how the adaptively normalized estimator acts as a free upgrade.

AIPW estimation
Our first application of adaptive normalization focuses on AIPW estimation. A basic version of AIPW was first introduced for survey sampling in the 1980s [4,19], and then re-discovered and significantly expanded on for ATE estimation by the causal inference community [6,24], where it is also known as the doubly-robust estimator. Our approach follows the one developed in [6], although we concentrate on a single group for now and defer ATE estimation to Section 4.2.
We discuss AIPW estimation in the more general model where we have access to covariate information. Specifically, we assume the pairs (Y_1, X_1), …, (Y_n, X_n) are drawn i.i.d. from a distribution D on R × X, that I_k | X_k ∼ Ber(p_k), and that p_k = p(X_k) is a function of the covariates. Then, as in Section 3.4, the AIPW estimate of µ is

μAIPW = (1/n) Σ_{k=1}^n μ̂(X_k) + (1/n) Σ_{k=1}^n (I_k/p̂(X_k))(Y_k − μ̂(X_k)),   (12)

where μ̂(·) is an estimate of E[Y_k | X_k] and p̂(·) is an estimate of p(·). We assume that μ̂ and p̂ are trained on an external training dataset T_n of size n. This is equivalent to two-fold cross-fitting with a combined sample of size 2n; our results continue to hold for an arbitrary number of folds, but we focus on this case for ease of notation. We will require that μ̂(·) and p̂(·) satisfy the following standard conditions [6]:

Assumption 2 (Consistency). The estimates are L²-consistent, i.e. E[(μ̂(X_k) − E[Y_k | X_k])²] = o(1) and E[(p̂(X_k) − p(X_k))²] = o(1), and p̂ satisfies the overlap condition of Assumption 1.

Assumption 3 (Risk decay). We have that μ̂(·), p̂(·) satisfy

E[(μ̂(X_k) − E[Y_k | X_k])²]^{1/2} · E[(p̂(X_k) − p(X_k))²]^{1/2} = o(n^{-1/2}).

Our focus on μAIPW is motivated by the fact that, given Assumptions 2 and 3, a cross-fitted version of the estimator μAIPW is semi-parametrically efficient [12]. Of course, this efficiency result means that we cannot hope to improve μAIPW asymptotically through adaptive normalization, but we will see that we can use our ideas to preserve its asymptotic efficiency and reduce its finite-sample MSE in simulations.
Consider the estimator in (12). The first term is an estimate of µ based on imputing all of the Y_k by μ̂(X_k), while the second term is an inverse probability weighted estimate of the bias of the μ̂(X_k). In light of our work in Section 3, we naturally propose replacing the second term with an adaptively normalized estimator, yielding the new estimator

μAIPW,AN = (1/n) Σ_{k=1}^n μ̂(X_k) + (1/n) Σ_{k=1}^n (I_k/p̂(X_k))(Y_k − μ̂(X_k)) + (T̂/π̂)(1 − n̂/n),   (13)

where now

T̂ = (1/n) Σ_k (Y_k − μ̂(X_k)) I_k (1 − p̂(X_k))/p̂(X_k)²,   π̂ = (1/n) Σ_k I_k (1 − p̂(X_k))/p̂(X_k)²,   n̂ = Σ_k I_k/p̂(X_k)

are functions of the estimated propensities, rather than the true treatment probabilities as in Section 3.
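A sketch of the AIPW estimator and its adaptively normalized variant on simulated data. The identity propensity map and the linear outcome model below are illustrative stand-ins (in practice μ̂ would be cross-fit as described above):

```python
import numpy as np

def an_correction(R, p, I):
    """Adaptive-normalization correction (T_hat/pi_hat)(1 - n_hat/n) for residuals R."""
    n = len(R)
    n_hat = np.sum(I / p)
    T_hat = np.mean(R * I * (1 - p) / p**2)
    pi_hat = np.mean(I * (1 - p) / p**2)
    return (T_hat / pi_hat) * (1 - n_hat / n)

rng = np.random.default_rng(3)
n = 1000
X = rng.uniform(0.2, 0.8, size=n)            # single covariate
p = X                                        # propensity map is the identity
Y = 1.0 + 2.0 * X + 0.5 * rng.standard_normal(n)
I = (rng.random(n) < p).astype(float)

mu_model = lambda x: 1.0 + 2.0 * x           # stand-in for a cross-fit outcome model
R = Y - mu_model(X)                          # residuals

mu_aipw = np.mean(mu_model(X)) + np.mean(I * R / p)   # the AIPW estimate
mu_aipw_an = mu_aipw + an_correction(R, p, I)         # with the AN-normalized bias term
```

Here the true mean is E[1 + 2X] = 2, and both estimates land close to it; the correction term only shifts the AIPW estimate slightly, consistent with its asymptotic negligibility.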
The following theorem shows that this correction is asymptotically negligible; the proof appears in Appendix A.3.

Theorem 4. Under Assumptions 1-3, √n (μAIPW,AN − μAIPW) converges to 0 in probability.

Crucially, Theorem 4 implies that μAIPW,AN has the same asymptotic variance as μAIPW under the given conditions. These are exactly the conditions required for the efficiency of μAIPW, so we conclude that μAIPW,AN is efficient whenever μAIPW is. On the other hand, in finite samples, the additional correction term in μAIPW,AN is negatively correlated with the other terms, and reduces variance, a fact we demonstrate empirically in Section 5. Taken together, these observations suggest that μAIPW,AN should typically be preferred to μAIPW.
Our proposal is by no means the only attempt to improve on μAIPW. However, our proposal differs from existing work in two important ways. First, many other proposals are meant to improve on AIPW in the case when μ̂ is misspecified, which is not our motivation here, although we do explore this setting in simulation and see that μAIPW,AN handles misspecification better than μAIPW. Second, and more importantly, other proposed estimators are significantly more complex, and sometimes require solving certain estimating equations numerically. In contrast, μAIPW,AN has an explicit, simple, closed form, and computing it requires nothing more than what is required to compute μAIPW.

ATE estimation
Another setting in which IPW estimators play a key role is the problem of average treatment effect (ATE) estimation. Our model for ATE estimation is that there are triplets (Y_1(0), Y_1(1), p_1), …, (Y_n(0), Y_n(1), p_n) drawn i.i.d. from a distribution D. We assume that we observe Y_1(I_1), …, Y_n(I_n) where I_k | p_k ∼ Ber(p_k), and want to estimate the average treatment effect τ = E[Y_k(1) − Y_k(0)] from these observations. There are a variety of general approaches to estimating τ [17]; the one relevant to our work is the approach of using IPW estimators to separately estimate µ_1 = E[Y_k(1)] and µ_0 = E[Y_k(0)], and then subtracting the two estimates.
Since the problems of estimating µ_1 and µ_0 are survey sampling problems, they are often estimated with Horvitz-Thompson or Hájek estimators. Continuing our general theme, why not use μAN instead? Thus, we propose τAN = μ1,AN − μ0,AN, where μ1,AN is the adaptively normalized estimator based on Y_1(1)I_1, …, Y_n(1)I_n and μ0,AN is the same based on Y_1(0)(1 − I_1), …, Y_n(0)(1 − I_n), with observation probabilities 1 − p_k.

This is not the only possible use of adaptive normalization in this setting. Note that μ1,AN and μ0,AN here are designed to minimize variance within each group, even though our target estimand is the difference between groups. It is natural to ask what happens if we instead attempt to directly minimize the variance of the estimated difference between the two groups; we discuss this approach in detail and compare it to τAN in Appendix C. We conclude that estimating the two groups separately and taking the difference is reasonable and generally advised.
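A sketch of τAN on simulated data with a constant treatment effect (the data-generating model here is illustrative):

```python
import numpy as np

def mu_an(Y_obs, q, J):
    """Adaptively normalized IPW mean from Y_obs = Y * J, with indicators J ~ Ber(q)."""
    n = len(q)
    n_hat = np.sum(J / q)
    T_hat = np.mean(Y_obs * (1 - q) / q**2)   # Y_obs already carries the indicator
    pi_hat = np.mean(J * (1 - q) / q**2)
    return np.sum(Y_obs / q) / n + (T_hat / pi_hat) * (1 - n_hat / n)

rng = np.random.default_rng(4)
n = 5000
p = rng.uniform(0.2, 0.8, size=n)
Y1 = 1.0 + p + 0.5 * rng.standard_normal(n)  # treated potential outcomes
Y0 = Y1 - 0.5                                # constant effect tau = 0.5
I = (rng.random(n) < p).astype(float)

mu1_an = mu_an(Y1 * I, p, I)                 # treated arm, observation probabilities p
mu0_an = mu_an(Y0 * (1 - I), 1 - p, 1 - I)   # control arm, probabilities 1 - p
tau_an = mu1_an - mu0_an
```

Each arm is a plain survey sampling problem, so the same `mu_an` routine serves both.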
Similarly, rather than using AIPW estimation in each group for ATE estimation, we can also use adaptively normalized AIPW estimation in each group instead. Indeed, we will show in Section 5 that these estimators improve on both of the usual estimators in simulation.

Policy evaluation
The final setting in which we propose replacing an IPW estimator with an adaptively normalized estimator is policy evaluation. In this context, we wish to learn a statistical rule for assigning treatments to a population that maximizes the total welfare. We follow the regret minimization framework introduced by [20] and build directly on the work of [18], although much of our notation is drawn from [1].
Formally, suppose that individual k has potential outcomes Y_k(1) and Y_k(0) depending on whether or not they receive a treatment, and we wish to learn a policy π that maps known covariates X_k ∈ X to a treatment assignment in {0, 1}. The value of a policy π (note that we are no longer using π to represent moments of the unknown distribution as in Section 3) is the average outcome of an individual treated using the policy, V(π) = E[Y_k(π(X_k))]. Assuming we are restricted to a class of policies Π, the best possible policy is π* = arg max_{π' ∈ Π} V(π'), and we evaluate a policy based on its regret V(π*) − V(π).
Ideally we would learn a policy by maximizing V, but we cannot compute V. Instead we estimate it from the observed data (X_1, Y_1, I_1), …, (X_n, Y_n, I_n), where I_k ∈ {0, 1} is an indicator of whether or not unit k received the treatment, and we assume that Assumption 1 holds. Under the assumption that the propensity map p(·) is known, [18] proposed

V̂IPW(π) = (1/n) Σ_{k=1}^n ( 1{π(X_k) = 1} I_k/p_k + 1{π(X_k) = 0}(1 − I_k)/(1 − p_k) ) Y_k,

as an unbiased estimate of V(π).
Of course, at this point we can sense that a better estimate of V would be the adaptively normalized counterpart V̂AN(π), obtained by replacing the inverse probability weighted average in V̂IPW(π) with its adaptively normalized version. The main result of [18] showed that maximizing V̂IPW(π) yields a policy whose regret decays at a rate of 1/√n. We now give an analogous result for V̂AN.
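The construction below is one natural way to instantiate V̂AN — an assumption on our part, not necessarily the paper's exact definition: treat "the realized assignment agrees with the policy" as the observation event, with probability q_k equal to p_k or 1 − p_k, and apply the adaptively normalized mean estimator to it.

```python
import numpy as np

def mu_an(Y_obs, q, J):
    """Adaptively normalized IPW mean from Y_obs = Y * J, with indicators J ~ Ber(q)."""
    n = len(q)
    n_hat = np.sum(J / q)
    T_hat = np.mean(Y_obs * (1 - q) / q**2)
    pi_hat = np.mean(J * (1 - q) / q**2)
    return np.sum(Y_obs / q) / n + (T_hat / pi_hat) * (1 - n_hat / n)

def v_an(policy, X, Y, I, p):
    """Policy value estimate: the observation event is 'assignment matches
    the policy's choice', which happens with probability q_k."""
    a = policy(X).astype(float)        # policy's assignment in {0, 1}
    J = a * I + (1 - a) * (1 - I)      # observed arm matches the policy's choice
    q = a * p + (1 - a) * (1 - p)      # probability of that event
    return mu_an(Y * J, q, J)

rng = np.random.default_rng(5)
n = 4000
X = rng.uniform(0.2, 0.8, size=n)
p = X                                  # propensity map is the identity (illustrative)
I = (rng.random(n) < p).astype(float)
Y1, Y0 = 1.0 + X, np.full(n, 0.5)      # noiseless potential outcomes, for clarity
Y = I * Y1 + (1 - I) * Y0              # observed outcome

v_treat_all = v_an(lambda x: np.ones_like(x), X, Y, I, p)    # estimates E[Y(1)] = 1.5
v_treat_none = v_an(lambda x: np.zeros_like(x), X, Y, I, p)  # estimates E[Y(0)] = 0.5
```

For the always-treat and never-treat policies, this reduces exactly to the per-arm adaptively normalized mean estimates.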
Theorem 5. Let π̂AN = argmax_{π∈Π} V̂AN(π) and suppose that Π has finite VC-dimension ν. Then the regret satisfies

E[V(π*) − V(π̂AN)] ≤ C √(ν/n)

for a constant C depending only on the constants in Assumption 1. Thus, V̂AN preserves the theoretical guarantees associated with V̂IPW, and we verify in the next section that policies learned with V̂AN are closer to the optimal policy. One drawback of V̂AN, however, is that V̂IPW can be interpreted as a weighted classification objective [1,18], facilitating optimization, while V̂AN unfortunately does not have such an interpretation. Finally, we note that [1] introduced the idea of policy learning based on AIPW estimation instead of IPW estimation; we consider the possibility of extending our ideas from Section 4.1 in this direction an exciting opportunity for future work.

Experiments
In this section, we use a series of experiments to evaluate the empirical performance of adaptive normalization. The goal of our experiments is to validate that applying adaptive normalization for mean estimation, as well as in the various settings of Section 4, actually pays the dividends we expect.
The first portion of our experiments is focused on survey sampling, as discussed in Section 3, and ATE estimation, as discussed in Section 4.2. Each of these serves as an opportunity to compare μAN to μHT and μHájek, and also to compare μAIPW,AN of Section 4.1 to μAIPW, since the AIPW estimator can be used for both survey sampling and ATE estimation. The second portion of our experiments focuses on policy learning, as introduced in Section 4.3, where we show that maximizing VAN learns better policies than maximizing VIPW. Additional simulations, including real-data experiments, appear in Appendix B.

Survey and ATE experiments
Data generating models. The goal of our simulations is to compare the various proposed estimators to each other and to existing estimators in different settings. To this end, we consider two models. The first, which we refer to as our normal model, generates jointly normal pairs that are transformed through the normal CDF Φ. This model produces pairs (Y_k, p_k) for which p_k is marginally uniform on [0, 1] and the correlation between Y_k and p_k is approximately θ. The correlation is not exactly θ, owing to the non-linearity of Φ^{-1}, but the parameter θ still has the desired effect of making Y_k and p_k more or less correlated. Additionally, we truncate both p_k and Y_k so that they satisfy Assumption 1 with M = 50 and δ = 0.01.
The second model we consider is our power law model, in which Y_k has a non-linear, negative power-law relationship with p_k. When α = 0, Y and p are independent, and as α increases, the strength of the negative relationship increases. The parameter σ² is a noise parameter, which we set to σ² = 9. As before, we truncate the p_k and Y_k to satisfy Assumption 1 with M = 10^6 and δ = 10^{-3}. For problems with covariates, we incorporate covariate information X_k into either model by taking X_k = p_k, so that the propensity mapping is the identity. For AIPW estimation, we assume the propensity map is known and estimated exactly, while E[Y_k | X_k] is estimated with linear regression and two-fold cross-fitting. This is nearly well-specified in the normal model, where Y_k and p_k are approximately multivariate normal, but is severely misspecified in the power law model. For ATE estimation problems, we take Y_k(1) = Y_k and Y_k(0) = Y_k − τ for a constant treatment effect of τ = 0.5.

Survey sampling simulations. We simulate data from both the normal model and the power law model, fixing n = 500 but allowing the parameters θ and α to vary so as to explore a range of different possible relationships between Y and p. The results are shown in Figure 3, omitting μHT in the power law model to preserve the scale of the plot. We split our discussion into two portions, focusing separately on the estimators that do and do not use covariate information. Among the estimators that do not use covariate information, namely μHT, μHájek, and μAN, it is clear that μAN is the best choice. In the normal model, the MSE of both μHT and μHájek is significantly affected by the strength of the correlation, as controlled by the parameter θ, while μAN does well regardless of the correlation structure in the problem, and always has the lowest MSE in this group. In the power law model, which represents a more challenging problem, the difference is even more stark: the MSE of μAN is, for larger values of α, an order of magnitude lower than that of the other two estimators.
Among the estimators that do use covariate information, namely μAIPW and μAIPW,AN, there is a small but noticeable difference in the normal model. However, in the power law model, where the estimates of E[Y_k | X_k] are misspecified, there is a more substantial difference. Because of the misspecification, μAIPW offers only the slightest of improvements over μHájek. In contrast, μAIPW,AN is noticeably better than either μAIPW or μHájek, but is essentially identical to μAN (again, the lack of improvement from μAN to μAIPW,AN is caused by model misspecification). This suggests that the additional care taken in μAIPW,AN to estimate the bias of μ(•) is especially important when μ(•) is misspecified and this bias is large.
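As a concrete illustration of the estimators compared here, the sketch below simulates a small survey sampling problem and computes μHT, μHájek, and an adaptively normalized estimate written in its control-variate form (μHT corrected by the empirically optimal multiple of mean(w) − 1, in the spirit of the control-variate equivalence noted in the Discussion). Both the data generating process and this particular implementation are our own simplified stand-ins, not the paper's exact normal or power law model or its exact estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# Stand-in data: p_k uniform but truncated away from 0, and Y_k
# negatively related to p_k (small p_k goes with large Y_k).
p = np.clip(rng.uniform(size=n), 0.05, 1.0)
Y = 1.0 / p + rng.normal(size=n)
mu = Y.mean()                  # target: the mean response
I = rng.binomial(1, p)         # Bernoulli observation indicators

w = I / p                      # inverse probability weights
mu_ht = np.mean(w * Y)                   # normalize by n
mu_hajek = np.sum(w * Y) / np.sum(w)     # normalize by sum of weights

# Adaptive normalization in control-variate form: correct mu_ht by the
# empirically optimal multiple of (mean(w) - 1), whose true mean is 0.
C = np.cov(w * Y, w)
mu_an = mu_ht - (C[0, 1] / C[1, 1]) * (np.mean(w) - 1.0)
```

The correction coefficient is the least-squares regression of the weighted responses on the weights, which is how a classical control variate with known mean would be tuned.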

ATE simulations.
As above, we simulate in both models, fixing n = 500 and τ = 0.5 while varying θ and α. Here, we omit τHT from both plots, since its MSE is so large as to distort the scale of the plot. Our main observations from the survey sampling setting largely carry over: τAN is consistently better than τHájek, and τAIPW,AN is slightly better than τAIPW when the model for Y_k is well-specified and significantly better when the model for Y_k is misspecified.

Policy evaluation experiments
Our second set of simulations compares the two policy learning objectives, VIPW and VAN, of Section 4.3.
Data generating model. Due to the slightly different set-up of the problem, we deviate from the models introduced above and instead use a model similar to the one used in the simulation studies of [1]. For each k, we generate covariates and outcomes according to the model in (18), where X_{k,i} is the i-th entry of X_k.
In general, the problem of maximizing VIPW or VAN is non-convex. To obtain a tractable problem, we restrict the class Π to be the set of all policies of the form π(X_k) = 1{X_{k,2} > T} for T ∈ [−1, 1]. Note that this is a misspecified setting, in the sense that the optimal policy π(X_k) = 1{X_{k,2} + X_{k,3} > 0} is not in the class we are optimizing over.
Policy learning simulations. For each of VIPW and VAN, we generate a sample of size n from (18) and learn a cut-off T that maximizes the corresponding policy objective by grid search over T ∈ [−1, 1]. The average threshold learned, for a range of values of n, is shown in Table 1.
The optimal policy in the class thresholds at T = 0, and so we take this as a point of comparison. We see that the thresholds learned by maximizing VAN are consistently closer to the optimal threshold of zero than those learned by maximizing VIPW, and that the gap between the performance of the two objectives is consistent across the range of values of n we consider. We note that, although the improvements we obtain are small in absolute terms, this may be a consequence of the simplicity of the policy class we consider. In particular, [1] finds little improvement from using AIPW estimators for policy learning of depth-1 trees, but more substantial improvements for depth-2 trees, suggesting that VAN may likewise offer a more substantial improvement over VIPW when learning policies from more complex classes.
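The grid search procedure described above can be sketched as follows, for the IPW objective only. The outcome model here is our own simplified stand-in (not the model (18) of the paper), chosen so that treatment helps exactly when X_{k,2} + X_{k,3} > 0 and the best threshold in the class 1{X_{k,2} > T} is T = 0 by symmetry.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
X = rng.uniform(-1.0, 1.0, size=(n, 3))
# Stand-in outcome model: the treatment effect is +1 when X_2 + X_3 > 0
# and -1 otherwise (columns are 0-indexed, so X_2 is X[:, 1]).
tau = np.where(X[:, 1] + X[:, 2] > 0.0, 1.0, -1.0)
p = np.full(n, 0.5)                     # known Bernoulli(1/2) propensities
W = rng.binomial(1, p)
Y = tau * W + rng.normal(scale=0.5, size=n)

def v_ipw(T):
    """IPW value estimate of the threshold policy 1{X_2 > T}."""
    pi = (X[:, 1] > T).astype(int)
    p_match = np.where(pi == 1, p, 1.0 - p)
    return np.mean((W == pi) * Y / p_match)

grid = np.linspace(-1.0, 1.0, 201)
T_hat = grid[np.argmax([v_ipw(T) for T in grid])]
```

Because the policy class is one-dimensional, exhaustive grid search sidesteps the non-convexity of the objective entirely.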

Discussion
In this paper we study adaptive normalization for IPW estimators: rather than normalizing by the sample size or by the sum of the weights, we propose normalizing by a data-dependent affine combination of the two. For mean estimation in survey sampling, our proposed estimator is algebraically equivalent to a control variate method from the Monte Carlo literature that has not previously been studied in the causal inference literature. We further develop the adaptive normalization idea in causal inference settings by using it to improve on the AIPW estimator, to propose new estimators for the ATE in randomized experiments, and to propose new objectives for policy learning.
There are several possible future directions for this work. One is to relax the assumption that the treatment indicators I_k are independent, which is unrealistic in many observational datasets and directly violated in certain experimental designs. However, if the correlation structure of the I_k is known or can be estimated, analogues of our limit theorems and estimators could be developed. In a different direction, there are many places within and beyond causal inference where inverse probability weighted estimators are used in context-specific ways, such as off-policy evaluation on networks [7], recommender system evaluation [28], learning to rank [21], and inference from bandits [11, 3]. Developing and applying similar ideas in these contexts suggests many promising lines of future work.
Lemma 1. Under Assumption 1, √n(β̂ − β) is asymptotically normal with covariance matrix Σ, whose entries are given below.

Proof. For any vector v ∈ R^4, the quantity √n(v^T β̂ − v^T β) is a sum of i.i.d. random variables. These random variables have finite variance by Assumption 1, so the classical Lindeberg CLT applies, with limiting distribution N(0, v^T Σ v). The lemma then follows from the Cramér-Wold device.
The CLTs for μλ and μAN now follow from applications of the delta method to different functions of β̂. For μλ, we need only the first and last coordinates of β̂, since μλ depends only on these, while μAN is a function of all four coordinates of β̂.

A.2 Proof of Theorem 2
This subsection contains the proof of Theorem 2 and details for the heuristic argument for variance reduction given in the main text.
As before, we combine the two equations in (7) and write the resulting iteration as μ(t) = f(μ(t−1)), where f is defined in terms of the quantities a and b given in (19). In this notation, the fixed point of f is x = a + b = μAN. The proof proceeds in two steps. First, we generalize the problem slightly and consider the discrete dynamical system x(t) initialized at x(0) = a with iterative map x(t) = f(x(t−1)) for arbitrary fixed a and b; for this dynamical system, we show that if |a| > |b|, then x(t) converges to a + b. Second, we show that, with high probability, the particular a and b defined in (19) satisfy this condition.
Analyzing the dynamical system. The function f has two fixed points, at x*_1 = 0 and at x*_2 = a + b. To understand the stability of these fixed points, we compute the derivative of f at each: if |a| > |b|, then x*_1 is unstable and x*_2 is stable, and if |a| < |b|, then x*_1 is stable and x*_2 is unstable. The case |a| = |b| occurs with probability zero, and so we do not consider it.
In light of these stabilities, we should expect x(t) to converge to a + b whenever |a| > |b|. The following lemma confirms this.

Lemma 2. If |a| > |b|, then the dynamical system with initial point x_0 = a and iterative map x_{t+1} = f(x_t) converges to x* = a + b.
Proof. Our analysis relies on the two-step map g = f ∘ f. In particular, we will show that the subsequences x_0, x_2, … and x_1, x_3, … both converge to x*, which proves the lemma. In both cases, the key observation is the bound (21). We first consider the subsequence x_0, x_2, …, which for convenience we denote by y_t = x_{2t}. We claim that, for any t ≥ 0, the bound (22) holds. The proof of the claim is by induction. For the base case, which is t = 0 and y_0 = a, substituting into (21) gives (22). For the inductive step, suppose the result holds for y_1, …, y_{t−1}. We will show that y_t also satisfies (23), and this together with (21) will prove the claim. The inequality (23) can be established by casework on the signs of a and b. We discuss the case a > 0, b > 0 in detail; the other three cases are analogous. If a > 0 and b > 0, then a + b > a, and since, by the inductive hypothesis, y_t is closer to a + b than a is, we must have a ≤ y_t ≤ a + 2b. Since the function we are minimizing is piecewise linear, the minimum must be attained at an endpoint (t = a or t = a + 2b) or where t(a − b) + b² = 0. At t = a, the objective is b − a < 0 < a. Thus we conclude that (23) holds, and this completes the induction for the case a > 0 and b > 0. The other cases are analogous, and combining them establishes (22). Now, applying the claim (22) repeatedly yields a bound on |y_t − x*| for any t > 0; since |b| < |a|, this bound tends to zero, and we thus have |y_t − x*| → 0 as t → ∞, proving the convergence.
Recalling that y_t = x_{2t}, we have shown that the subsequence x_0, x_2, … converges to x*. An analogous argument gives that x_1, x_3, … converges to x* as well, and together these imply that x_t → x*.

High-probability guarantees. Our next lemma carries out the second part of the proof of Theorem 2, showing that a and b as defined in (19) satisfy |a| > |b| with high probability.
Lemma 3. Suppose Assumption 1 holds and that µ ≠ 0. Then |a| > |b| with probability tending to one as n → ∞.

Proof. For any 0 ≤ ε ≤ |µ|, define the events E_1 and E_2 as in the display above. We need to lower bound the probability of E_1, which we do by upper bounding the probability of E_1^c via the union bound. The second inequality follows from a bound implying that the relevant ratio is at most M. The third inequality follows from applying Hoeffding's inequality to each term, using the bounds supplied by Assumption 1. Finally, we choose ε to balance the two resulting terms. The optimal choice is ε = M|µ|/(M + 1), and with this value of ε we conclude the stated bound, finishing the proof.
Combining the lemmas. With these two lemmas, the proof of Theorem 2 is straightforward.
Proof of Theorem 2. Let a, b be as defined in (19).
For (i), if |a| > |b|, then Lemma 2 implies the result with μlim = μfp. If |a| < |b|, then an argument similar to the one in the proof of Lemma 2 establishes that x(t) → 0 as t → ∞, and so the statement holds with μlim = 0.
For (ii), note that by the contrapositive of Lemma 2, the iteration can fail to converge to μfp only if |a| ≤ |b|, and by Lemma 3 the probability of this event is bounded by a quantity going to zero, implying (ii).
Intuition for variance reduction. In this section, we present the intuitive argument for why every two applications of f reduce variance, as alluded to in Section 3.3.
For convenience, let g(x) = f(f(x)). If it were the case that |g′(x)| ≤ 1 for all x, then applying g would certainly reduce variance, and we would conclude that μAN has smaller variance than μHT. But this is not the case: g(x) approaches ∞ as its denominator approaches 0, and so |g′(x)| can be arbitrarily large. However, the actual sequence of iterates μ(2t) lies, with high probability, in an interval on which |g′(x)| is bounded by 1. In what follows, we assume again that µ ≠ 0, and that |a| > 2|b|, which holds with high probability by the same arguments as those in the proof of Lemma 3. We also assume, without loss of generality, that a and b are both positive; this is only for clarity. Now, in the proof of Lemma 2, we showed that under these conditions, the entire sequence of iterates μ(0), μ(2), … lies in the interval [a, a + 2b]. By standard concentration arguments, a is concentrated around µ and b is concentrated around 0, so the interval [a, a + 2b] lies with high probability in a small interval I centered at µ. On the other hand, we can check by direct computation that |g′(x)| ≤ 1 for x outside the interval (−3b, b). Again by concentration arguments, for large n, this interval will be concentrated around zero, and thus be disjoint from the interval I centered at µ.
To summarize, there exists an interval I centered around µ such that, with high probability, x(2t) ∈ I for all t > 0 and |g′(x)| ≤ 1 for all x ∈ I. If the function g were fixed, this would be enough to conclude that each application of g reduces variance, and thus that μAN has smaller variance than μHT. Unfortunately, because the function g is random, this is not a rigorous argument, only an intuitive one.
We note that an argument similar to ours appears in Section 5 of [14]. In that context, when studying a generalized method of moments (GMM) estimator, the authors iterate a random, data-dependent estimate of an underlying true function. The underlying true function is a contraction mapping, and thus reduces variance when applied to a random variable, so they argue heuristically that iterating the data-dependent estimate should reduce variance as well.

A.3 Proof of Theorem 4
Proof. Let T_n be an auxiliary data set of size n on which the estimates μ(•) and p(•) are trained. (This will be the case if, for example, we do cross-fitting as in [6]; we choose not to write out the explicit cross-fitting set-up to simplify the notation.) Then, it is sufficient to show that the correction term we have introduced is o_P(n^{−1/2}). In the bound on that error term, the second inequality follows from the fact that, by the consistency of the estimate p(•), we have δ/2 ≤ p(x) ≤ 1 − δ/2 for all x ∈ X for sufficiently large n; the fourth inequality applies the triangle inequality and the CLT for i.i.d. sums; and the final equality uses Assumption 2. (Note that since the estimate p(•) is bounded away from zero, its consistency implies the consistency of 1/p as well.) Examining (31), the first two terms are o_P(n^{−1/2}) as needed, so it remains to analyze the final term. In the resulting computation, the first inequality is Cauchy-Schwarz, the second equality expands the product and drops the k = ℓ terms, the fourth equality uses the fact that the terms of the sum are equal after conditioning on T_n, and the fifth equality uses the fact that the two errors are independent after conditioning on T_n. By Assumption 3, the product of conditional expectations in (36) is o_P(n^{−1}). Since µ(X_k) and p(X_k) are bounded by Assumption 1, and the estimates μ(•) and p(•) are sup-norm consistent by Assumption 2, that product of conditional expectations is eventually dominated by a constant, and so the expectation in (36) is o(n^{−1}) as well, completing the proof.

A.4 Proof of Theorem 5
Our proof closely follows ideas and tools developed in [18].
Proof. Our goal is to control V(π*) − V(πAN). We do this by first decomposing the regret as in (40). The second term of (40) is bounded in Theorem 2.1 of [18], and so we are interested in bounding the first term. By the definitions of VIPW and VAN, that first term of (40) can be bounded by a calculation analogous to the one in (24). Combining (40) and (42) gives (43). Finally, the expectation in the first term of (43) can be bounded using standard empirical process tools, in particular Lemmas A.1 and A.4 of [18], finishing the proof.

B.1 Semi-synthetic experiments
We now move beyond pure simulations and consider a real data set. Specifically, we work with a data set of Swiss municipalities, provided by the R package sampling [30] under the GPL-2 license. This data set contains demographic and financial data for 2896 municipalities in Switzerland, all aggregated at the municipality level, and does not contain data about any individual. We consider two responses: Y_1, the wooded area of a municipality, and Y_2, the industrial area of a municipality, and we assume a probability-proportional-to-size sampling scheme in which the probability of observing the response is proportional to the total area of the municipality. We scale the probabilities to sum to either 50 or 250, meaning that the average number of observations is either 50 or 250. We perform simulations by resampling the set of municipalities that are observed according to the given probabilities and comparing the estimators to the mean value across all municipalities, making this a fixed-population setting. The RMSE of the Horvitz-Thompson estimator, Hájek estimator, and adaptively normalized estimator for the four specifications are shown in Table 2.
To contextualize the results, we note that Y_1 is strongly positively correlated with the probabilities and Y_2 is only weakly positively correlated with them. Thus, in the first two columns of Table 2, the Horvitz-Thompson estimator is much better than the Hájek estimator, but not as good as the adaptively normalized estimator. In the latter two columns, because the correlation is weaker, the Horvitz-Thompson and Hájek estimators have nearly the same RMSE, but again the adaptively normalized estimator improves on both. For reference, plots of the MSE in each problem as a function of λ are shown in Figure 5. Interestingly, the optimal choices of λ all lie between 0 and 1, suggesting that the Hájek estimator is actually "over-normalizing" in this case.
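The resampling design used in these semi-synthetic experiments can be sketched as follows. Since the municipality data are not reproduced here, the code uses synthetic "size" and response variables as stand-ins; only the probability-proportional-to-size Bernoulli resampling logic mirrors the experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2896                      # population size, matching the Swiss data
n_expected = 250              # expected number of observed units

# Synthetic stand-ins (NOT the real municipality data): a skewed size
# variable and a response roughly proportional to it.
size = rng.lognormal(mean=0.0, sigma=1.0, size=N)
Y = size * rng.lognormal(sigma=0.2, size=N)
mu = Y.mean()                 # fixed-population target

# Probability-proportional-to-size inclusion probabilities, capped at 1.
p = np.minimum(n_expected * size / size.sum(), 1.0)

def one_trial():
    I = rng.binomial(1, p)    # Bernoulli resampling of the fixed population
    w = I / p
    return np.mean(w * Y), np.sum(w * Y) / np.sum(w)  # (HT, Hajek)

trials = np.array([one_trial() for _ in range(2000)])
rmse = np.sqrt(np.mean((trials - mu) ** 2, axis=0))   # RMSE of [HT, Hajek]
```

Because Y and p are fixed across trials and only the indicators are redrawn, the RMSE here is a fixed-population quantity, matching the setting described above.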

B.2 Confidence intervals and coverage
In this section we discuss the problem of constructing confidence intervals for μAN. Since we have computed the asymptotic variance of μAN in Theorem 3, we can use a normal approximation to construct confidence intervals that are asymptotically valid. The coverage of these intervals in a simulation experiment is shown in Figure 6.
Unfortunately, we see that in the superpopulation model, which is the model we have considered throughout, the coverage is quite poor. To better understand this behavior, consider the asymptotic variance of the Hájek estimator, E[(1 − p_k)/p_k (Y_k − µ)²]. Crucially, p ↦ (1 − p)/p is a decreasing function on [0, 1], and so this expectation up-weights small values of p_k. But these are exactly the values we are least likely to observe, hence the difficulty in obtaining an accurate estimate of the asymptotic variance, and thus in obtaining confidence intervals with good coverage. In particular, we do not believe that these results are due to a failure of the normal approximation: the normal approximation should be very nearly exact for values as large as n = 1000, but estimating the parameters of the target normal distribution remains challenging. This is further evidenced by the fact that μHT, for which there is no approximation and the asymptotic variance is exactly correct, actually has slightly lower coverage than μHájek. Rather, the challenge is in estimating the relevant asymptotic variance, which is slightly harder than estimating the asymptotic variance of μHájek because Y_k² is typically larger than (Y_k − µ)². This point is further demonstrated by the significantly improved coverage of the same intervals in the fixed population simulation in the right panel of Figure 6. In this setting, the asymptotic variance of, for example, the Hájek estimator is no longer E[(1 − p_k)/p_k (Y_k − µ)²], but rather the corresponding limit of finite-population averages. The improved coverage is driven by the fact that now, only the small values of p_k actually present in the finite population, rather than all possible values of p_k, contribute to the variance, and so there is less unobserved variance negatively impacting the coverage.
To summarize, intervals constructed using our theory show sub-par coverage in simulation, but this is largely due to unobserved variance in the problem and the difficulty of estimating such variance, as evidenced by the comparison with the finite population setting, rather than a failure of our theory or of the normal approximation.
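A minimal version of the superpopulation coverage experiment, for the Hájek estimator only, can be sketched as follows: each trial redraws (Y_k, p_k, I_k), forms a normal-approximation interval using a plug-in estimate of E[(1 − p_k)/p_k (Y_k − µ)²]/n, and coverage is tallied against the true superpopulation mean. The data model is our own stand-in, chosen so that the true mean has a closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_trials, z = 1000, 1000, 1.96
# E[Y] for Y = 1/p + noise with p = clip(Unif(0,1), 0.05, 1):
# a point mass of 0.05 at p = 0.05 contributes 1, and the
# continuous part contributes log(1/0.05).
mu_true = 1.0 + np.log(20.0)

def one_interval():
    # Superpopulation draw: fresh (Y, p, I) every trial.
    p = np.clip(rng.uniform(size=n), 0.05, 1.0)
    Y = 1.0 / p + rng.normal(size=n)
    I = rng.binomial(1, p)
    w = I / p
    mu_hat = np.sum(w * Y) / np.sum(w)
    # Plug-in estimate of E[(1-p)/p (Y - mu)^2] / n, using only the
    # observed units, reweighted by 1/p.
    var_hat = np.mean(w * (1 - p) / p * (Y - mu_hat) ** 2) / n
    half = z * np.sqrt(var_hat)
    return mu_hat - half <= mu_true <= mu_hat + half

coverage = np.mean([one_interval() for _ in range(n_trials)])
```

Consistent with the discussion above, coverage in this superpopulation setting typically falls short of the nominal 95%, since the variance estimate must extrapolate to small-p units that are rarely observed.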

C Joint adaptive normalization in ATE estimation
This appendix explores some subtleties in how the adaptive normalizations should be chosen for ATE estimation. In particular, the estimator in (14) is equivalent to selecting λ in (1) to separately minimize the asymptotic variance of each of the two mean estimates. However, this "plug-in" use of adaptive normalization ignores the fact that the asymptotic variance of an adaptively normalized ATE estimator depends not just on the variances of the two mean estimators but also on their covariance.
In what follows, we jointly choose the normalizations of the two mean estimators to minimize the asymptotic variance of estimating the combined estimand τ. Specifically, we define adaptively normalized mean estimators μ1,λ1 and μ0,λ0 for the treated and control groups, and combining them leads to the estimator τλ1,λ0 = μ1,λ1 − μ0,λ0 of τ. To follow the program of Section 3, we seek to choose λ1 and λ0 to minimize the resulting asymptotic variance, in which C denotes terms that do not depend on either λ1 or λ0.
As with our other CLTs, this is a routine delta method calculation.
Proof. Define the vector β̂ with mean β = (µ1, µ0, 1, 1). By the same arguments as in Lemma 1, β̂ satisfies the usual CLT for the mean of i.i.d. random variables, and so we can apply the delta method with an appropriate function f(x, y, z, w).
Define Ŝ = Σ_k I_k Y_k/p_k and n̂ = Σ_k I_k/p_k as estimates of the population total and sample size. Then, the Horvitz-Thompson and Hájek estimators are μHT = Ŝ/n and μHájek = Ŝ/n̂.

Figure 2 :
Figure 2: The variance of μ(t) as a function of t. Although a single iteration may increase the variance, we observe that, in this simulation, every two iterations reduce the variance. The data are generated from the normal model (16) in Section 5 with n = 100, ρ = 0.1, µ = −1.

Figure 4 :
Figure 4: Left: estimated MSE for estimators applied to data from the normal model (16) for ATE estimation with n = 500, µ = 1, and τ = 0.5, for different values of θ. Right: estimated MSE of the same estimators applied to data from the power law model (17) for ATE estimation with n = 500 and τ = 0.5, for different values of α. All MSEs are averaged over 20,000 trials. In all cases, adaptive normalization improves MSE.
and since a, b are both positive, this is equal to a² + ab − b² ≥ ab as well. Finally, t(a − b) + b² = 0 is not possible, because it requires t = b²/(b − a), which is negative since a > b, while t lies in [a, a + 2b].

Figure 5 :
Figure 5: MSE of μλ as a function of λ for the four problem specifications of the Swiss municipality data. Interestingly, the optimal choices of λ all lie between 0 and 1, suggesting that the Hájek estimator is actually "over-normalizing" in this case.

Figure 6 :
Figure 6: The coverage of a 95% confidence interval for μHT, μHájek, and μAN based on the asymptotic variances in Theorems 1 (λ = 0, 1) and 3, in a superpopulation model (left) and a fixed population model (right). In the superpopulation model, the Y_k, p_k, I_k are all drawn anew at each trial, while in the fixed population model, the Y_k, p_k are fixed and only the I_k are drawn anew at each trial. The target coverage is indicated with a dashed line. The coverage in the fixed population model is significantly better than in the superpopulation model, for reasons discussed in the main text.

Figure 7 :
Figure 7: Comparison of τAN2 and τAN in the normal model with n = 500, µ = 1 (left) and the power law model with n = 500 (right). In the normal model, τAN2 is sometimes better than τAN, but in the power law model, τAN2 is consistently worse.
Figure 3: Left: the estimated MSE of all discussed estimators on data generated from the normal model (16) with n = 500 and µ = 1, for different values of θ. Right: the estimated MSE of the same estimators on data generated from the power law model (17) with n = 500 and α varying (the value of µ depends on α). All MSEs are averaged across 20,000 trials. Across both models and the full range of parameter values, we see that the adaptively normalized estimators improve on the traditional estimators.
One of our two data generating models represents a fairly benign setting, in which Y_k and p_k are approximately jointly normal and estimating µ is not too difficult. The second model represents a more pathological setting, in which Y_k and p_k have an extremely strong negative relationship, so that estimating µ becomes significantly more challenging.

Table 2 :
RMSE of estimators on the Swiss municipality data; Y_1 is wooded area and Y_2 is industrial area; probabilities are chosen proportional to total municipality area, which is strongly positively correlated with Y_1 and weakly positively correlated with Y_2. RMSEs are averaged over 100,000 trials and standard errors are computed over 10 replications.