Designing Experiments Informed by Observational Studies

The increasing availability of passively observed data has yielded a growing methodological interest in "data fusion." These methods involve merging data from observational and experimental sources to draw causal conclusions -- and they typically require a precarious tradeoff between the unknown bias in the observational dataset and the often-large variance in the experimental dataset. We propose an alternative approach to leveraging observational data, which avoids this tradeoff: rather than using observational data for inference, we use it to design a more efficient experiment. We consider the case of a stratified experiment with a binary outcome, and suppose pilot estimates for the stratum potential outcome variances can be obtained from the observational study. We extend results from Zhao et al. (2019) in order to generate confidence sets for these variances, while accounting for the possibility of unmeasured confounding. Then, we pose the experimental design problem as one of regret minimization, subject to the constraints imposed by our confidence sets. We show that this problem can be converted into a convex minimization and solved using conventional methods. Lastly, we demonstrate the practical utility of our methods using data from the Women's Health Initiative.


1 Introduction
The past half-century of causal inference research has engendered a healthy skepticism toward observational data (Imbens and Rubin, 2015). In observational data sets, researchers do not control whether or not each individual receives a treatment of interest. Hence, they cannot be certain that treated individuals and untreated individuals are otherwise comparable.
This challenge can be overcome only if the covariates measured in the observational data are sufficiently rich to fully explain who receives the treatment and who does not. This is a fundamentally untestable assumption -- and even if it holds, careful modeling is necessary to remove the selection effect. The applied literature includes myriad examples of treatments that showed promise in observational studies only to be overturned by later randomized trials (Hartman et al., 2015). One prominent case, the effect of hormone therapy on the health of postmenopausal women, will be discussed in this manuscript (Writing Group for the Women's Health Initiative Investigators, 2002).
The "virtuous" counterpart to observational data is the well-designed experiment. Data from a randomized trial yield unbiased estimates of a causal effect without the need for problematic statistical assumptions. Yet experiments are frequently expensive and, as a consequence, generally involve fewer units. Especially if one is interested in subgroup causal effects, this means experimental estimates can be imprecise.
In this paper, we discuss an approach that allows us to leverage the availability of observational data, while retaining the attractive unbiasedness properties of randomized experiments: we use the observational data not for inference, but rather to influence the design of the experiment. Our discussion will be limited to settings with binary outcomes, in which computations are tractable. We suppose the experiment has a stratified design, and seek to determine allocations of units to strata and treatment assignments.
Suppose pilot estimates of the stratum potential outcome variances are obtained from the observational study. If the outcomes are binary, we show that recent advances in sensitivity analysis from Zhao et al. (2019) can be extended to generate confidence sets for these variances, while incorporating the possibility of unmeasured confounding. Next, we pose the experimental design problem as one of regret minimization subject to the potential outcome variances lying within their confidence sets. We use a trick from von Neumann to convert the problem into a convex (though non-DCP) minimization, which can be solved using projected gradient descent. This approach can yield modest efficiency gains in the experiment, especially if there is heterogeneity in treatment effects and baseline incidence rates across strata.
The remainder of the paper proceeds as follows. Section 2 defines our notation, assumptions, and loss function. Section 3 gives our main results. These include the derivation of bias-aware confidence sets for the pilot variance estimates; the formulation of the design problem as a regret minimization; and the strategy to convert that problem into a computationally tractable one. We demonstrate the practical utility of our methods on data from the Women's Health Initiative in Section 4. Section 5 discusses future work and concludes.
2 Problem Set-Up

Sources of Randomness
We suppose we have access to an observational study with units i in indexing set O such that |O| = n_o. We associate with each unit i ∈ O a pair of unseen potential outcomes (Yi(0), Yi(1)); an observed covariate vector Xi ∈ R^p; and a propensity score pi ∈ (0, 1) denoting the probability of receiving treatment. We also associate with each i a treatment indicator Wi and an observed outcome defined by Yi = WiYi(1) + (1 − Wi)Yi(0).
There are multiple perspectives on randomness in causal inference. In the setting of Rubin (1974) -- as in much of the early potential outcomes literature -- all quantities are treated as fixed except the treatment assignment Wi. More modern approaches sometimes treat the potential outcomes Yi(0) and Yi(1) and covariates Xi as random variables (see e.g. VanderWeele and Robins, 2012). Similarly, some authors treat all of the data elements (including the treatment assignment Wi) as random draws from a super-population (see e.g. Imbens and Rubin, 2015). Per the discussion in Chin (2019), these subtleties often have little effect on the choice of estimators, but they do affect the population to which results can be generalized.
In our setting, we assume that the RCT data have not yet been collected, so it does not make sense to speak of their fixed potential outcomes. More naturally, we treat the potential outcomes and covariates as random. Thus, we view Xi, Yi(0), Yi(1) as drawn from a joint distribution FO. The RCT data will be denoted (with a slight abuse of notation) as (Yi(0), Yi(1), Xi) for i ∈ R, sampled from a joint distribution FR.
Because we are treating the potential outcomes as random variables, we can reason about their means and variances under the distribution FR.

Stratification and Assumptions
We will make the following assumptions about allocation to treatment.
Assumption 1 (Allocations to Treatment). For i ∈ O, Wi ∼ Bern(pi). For i ∈ R, treatment is allocated within stratum k via a simple random sample of size n_rkt for k = 1, . . ., K.
We suppose we have a fixed stratification scheme based on the covariates Xi. This can be derived from substantive knowledge or from applying a modern machine learning algorithm to the observational study to uncover treatment heterogeneity (e.g. Wager and Athey, 2018; Hill, 2011). The stratification is such that there are k = 1, . . ., K strata, each with an associated population weight w_1, . . ., w_K. Using the stratification on the observational study, we define indexing subsets O_k with cardinalities n_ok to identify units in each stratum. For each stratum, define I_k as the set of covariate values defining the stratum, such that Xi ∈ I_k for all i ∈ O_k.
Suppose we can recruit only n_r total units for the RCT. We need to decide both the number of units n_rk recruited for each stratum, subject to the constraint Σ_k n_rk = n_r, and the count of units we will assign to treatment vs. control in each stratum, such that the associated counts n_rkt and n_rkc sum to n_rk. Hence, our variables of interest will be {(n_rkt, n_rkc)}_{k=1}^{K}.
Define E_R, Var_R, E_O, and Var_O as expectations and variances under the distributions FR and FO, respectively. We will need two further assumptions.
Assumption 2 (Common Potential Outcome Means). Conditional on the stratum, the potential outcome means for the two populations are equal. In other words, E_R[Yi(e) | Xi ∈ I_k] = E_O[Yi(e) | Xi ∈ I_k] for e ∈ {0, 1} and for all k ∈ 1, . . ., K. We denote these shared quantities as μ_k(0) and μ_k(1).
Assumption 3 (Common Potential Outcome Variances). Conditional on the stratum, the potential outcome variances for the two populations are equal. In other words, Var_R[Yi(e) | Xi ∈ I_k] = Var_O[Yi(e) | Xi ∈ I_k] for e ∈ {0, 1} and for all k ∈ 1, . . ., K. We denote these shared quantities as σ²_k(0) and σ²_k(1).

Loss and Problem Statement
Given Assumption 2, we can define a mean effect, τ_k = μ_k(1) − μ_k(0), for each k ∈ 1, . . ., K. We can collect these values into a vector τ. Denote the associated causal estimates derived from the RCT as τ̂_rk for k = 1, . . ., K. We can collect these estimates into a vector τ̂_r. We use a weighted L2 loss when estimating the causal effects across strata,

L(τ̂_r, τ) = Σ_k w_k² (τ̂_rk − τ_k)².

Our goal will be to minimize the risk, defined as an expectation of the loss over both the treatment assignments and the potential outcomes. For simplicity, we suppress the subscript and write R(τ̂_r) = E[L(τ̂_r, τ)].

3 Converting to an Optimization Problem

Naïve Approach
Were {σ²_k(1), σ²_k(0)}_{k=1}^{K} known exactly, it would be straightforward to compute optimal allocations in the RCT. The risk of the stratified difference-in-means estimator is

R = Σ_k w_k² (σ²_k(1)/n_rkt + σ²_k(0)/n_rkc).

The optimal choice from minimizing this quantity is simply

n_rkt = n_r · w_k σ_k(1) / Σ_j w_j (σ_j(1) + σ_j(0)),  n_rkc = n_r · w_k σ_k(0) / Σ_j w_j (σ_j(1) + σ_j(0)),

which yields a risk of

R* = (Σ_k w_k (σ_k(1) + σ_k(0)))² / n_r.

Assumption 3 guarantees shared variances across the observational and RCT datasets. So we might be tempted to obtain pilot estimates of σ²_k(1) and σ²_k(0) from the observational study and then to plug them in to determine the allocation of units in the RCT. However, any estimate of the variances derived from the observational study should be treated with caution. Our assumptions do not preclude the possibility of unmeasured confounding, which can introduce substantial bias into the pilot estimation step. Hence, a framework that exclusively optimizes expected loss is incongruent with what we know about sources of uncertainty.
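This known-variance (Neyman-type) allocation and its risk can be sketched in a few lines of numpy; the function names are ours, and fractional sample sizes are left unrounded:

```python
import numpy as np

def neyman_allocation(w, sigma1, sigma0, nr):
    """Known-variance stratified allocation minimizing the weighted L2 risk.

    w: stratum weights; sigma1, sigma0: potential outcome standard
    deviations per stratum; nr: total RCT size.  Returns (n_t, n_c), the
    (possibly fractional) treated/control sample sizes per stratum."""
    w, s1, s0 = map(np.asarray, (w, sigma1, sigma0))
    denom = np.sum(w * (s1 + s0))
    return nr * w * s1 / denom, nr * w * s0 / denom

def risk(w, var1, var0, n_t, n_c):
    """Weighted L2 risk: sum_k w_k^2 (sigma2_k(1)/n_kt + sigma2_k(0)/n_kc)."""
    return float(np.sum(np.asarray(w)**2 * (var1 / n_t + var0 / n_c)))
```

At the optimum, the risk returned by `risk` matches the closed form (Σ_k w_k (σ_k(1) + σ_k(0)))² / n_r.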

Regret Minimization
Decision theory provides an attractive framework in the form of regret minimization, originally attributed to Bell (1982), as well as Loomes and Sugden (1982). In this framework, a decision-maker chooses between multiple prospects, and cares not only about the received payoff but also about the foregone choice. If the foregone choice would have yielded a higher payoff than the chosen one, the decision-maker experiences regret (Diecidue and Somasundaram, 2017). Decisions are made to minimize the maximum possible regret.
In our case, the decision is how to allocate units in our RCT. One choice is an allocation informed by the observational study. The other is a "default" allocation against which we seek to compare. Denote the default values as ñ_rkt and ñ_rkc, where a common choice would be equal allocation, ñ_rkt = ñ_rkc = n_r/2K for all k; or weighted allocation, ñ_rkt = ñ_rkc = w_k n_r/2 for all k.
Regret is defined as the difference between the risk of our chosen allocation and that of the default allocation,

Regret = Σ_k w_k² [(1/n_rkt − 1/ñ_rkt) σ²_k(1) + (1/n_rkc − 1/ñ_rkc) σ²_k(0)].

Choosing this as our objective, we can now begin to formulate an optimization problem. Suppose we can capture our uncertainty about (σ²_k(1), σ²_k(0)) via a convex constraint, indexed by a user-defined parameter Γ, where (σ²_k(1), σ²_k(0)) ∈ A_k(Γ) for k = 1, . . ., K. We could then obtain the regret-minimizing unit allocations as the solution to

min_{ {(n_rkt, n_rkc)} } max_{ (σ²_k(1), σ²_k(0)) ∈ A_k(Γ) } Σ_k w_k² [(1/n_rkt − 1/ñ_rkt) σ²_k(1) + (1/n_rkc − 1/ñ_rkc) σ²_k(0)], subject to Σ_k (n_rkt + n_rkc) = n_r. (2)

Defining and solving Optimization Problem 2 will be the goal of the remainder of this paper.
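Because the regret is affine in each variance, its inner maximum over any per-coordinate interval bounds is attained at an endpoint, chosen by the sign of the corresponding coefficient. A minimal sketch, using box-shaped constraint sets in place of the ellipsoidal A_k constructed later (all names are ours):

```python
import numpy as np

def worst_case_regret(n_t, n_c, nd_t, nd_c, w,
                      var1_lo, var1_hi, var0_lo, var0_hi):
    """Maximize sum_k w_k^2 [(1/n_kt - 1/nd_kt) s2_k(1) +
    (1/n_kc - 1/nd_kc) s2_k(0)] over box bounds on the variances.
    The objective is affine in each variance, so each coordinate sits at
    the endpoint matching the sign of its coefficient."""
    c1 = w**2 * (1.0 / n_t - 1.0 / nd_t)   # coefficient on s2_k(1)
    c0 = w**2 * (1.0 / n_c - 1.0 / nd_c)   # coefficient on s2_k(0)
    s2_1 = np.where(c1 >= 0, var1_hi, var1_lo)
    s2_0 = np.where(c0 >= 0, var0_hi, var0_lo)
    return float(np.sum(c1 * s2_1 + c0 * s2_0))
```

When the chosen allocation equals the default, every coefficient is zero and the worst-case regret is exactly zero, matching the intuition that the default is always attainable.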

Tractable Case: Binary Outcomes
To construct our confidence regions A_k, k = 1, . . ., K, we will extend recent sensitivity analysis results from Zhao et al. (2019).
The authors consider the case of causal estimation via stabilized inverse probability of treatment weighting (SIPW). Zhao and co-authors focus on observational studies, and consider the case where unmeasured confounding is present. To quantify this confounding, they propose a marginal sensitivity model indexed by a quantity Γ, which bounds the odds ratio between the true treatment probability (a function of the covariates and the potential outcomes) and the treatment probability marginalized over the potential outcomes (a function of the covariates only). Their method extends the widely-used Rosenbaum sensitivity model (Rosenbaum, 1987).
The authors' focus is on developing valid confidence intervals for the average treatment effect even when Γ-level confounding may be present. They offer two key insights. First, they demonstrate that for any choice of Γ, one can efficiently compute upper and lower bounds on the true potential outcome means via linear fractional programming. These bounds, referred to as the "partially identified region," quantify the possible bias in the point estimate of the ATE. Second, the authors show that the bootstrap is valid in this setting. Hence, they propose drawing repeated bootstrap replicates; computing extrema within each replicate using their linear fractional programming approach; and then taking the relevant α-level quantiles of these extrema. This procedure yields a valid α-level confidence region for the ATE.
We adapt this approach to our setting in the case of binary outcomes. Note that if Yi ∈ {0, 1}, then the potential outcome variances can be expressed directly as a function of the potential outcome means, via

σ²_k(e) = μ_k(e) · (1 − μ_k(e)), e ∈ {0, 1}.

As Zhao et al. (2019) provides the necessary machinery to bound mean estimates, we can exploit this relationship between the means and variances to bound variance estimates. In particular, we can show that the bootstrap is also valid if our estimand is μ_k(e) · (1 − μ_k(e)), rather than μ_k(e), for e ∈ {0, 1} and k = 1, . . ., K. Computing the extrema is also straightforward. Note that the function f(x) = x · (1 − x) is monotonically increasing in x if 0 < x < 0.5 and monotonically decreasing in x if 0.5 < x < 1. Hence, if we use the Zhao et al. (2019) method to solve for a partially identified region for μ_k(1) and μ_k(0), we can equivalently compute such intervals for σ²_k(1) and σ²_k(0).

Denote as μ̂U_k(e) the upper bound and μ̂L_k(e) the lower bound computed for a mean, for e ∈ {0, 1}. Denote (σ̂²_k(e))U and (σ̂²_k(e))L as the analogous quantities for the variance. We apply the following logic:
• If μ̂L_k(e) < 0.5 and μ̂U_k(e) > 0.5, set (σ̂²_k(e))U = 0.25; otherwise, set (σ̂²_k(e))U = max{f(μ̂L_k(e)), f(μ̂U_k(e))}.
• In either case, set (σ̂²_k(e))L = min{f(μ̂L_k(e)), f(μ̂U_k(e))}.
Hence, we propose the following procedure for deriving valid confidence regions for (σ²_k(0), σ²_k(1)) for each choice of k:
1. Draw B bootstrap replicates from the units i ∈ O_k.
2. Within each replicate, compute the partially identified regions for μ_k(0) and μ_k(1) via the linear fractional programming approach, and map their endpoints to variance extrema using the logic above.
3. Each replicate can now be represented as a rectangle in [0, 1] × [0, 1], where one axis represents the value of σ²_k(1), the other the value of σ²_k(0), and the vertices correspond to the extrema. Any set such that a 1 − α proportion of the rectangles have all four corners included in the set will asymptotically form a valid α-level confidence set.
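The endpoint-mapping step can be written as a small helper; this sketch simply pushes a mean interval through f(μ) = μ(1 − μ) using the monotonicity argument above (the function name is ours):

```python
def variance_bounds(mu_lo, mu_hi):
    """Map an interval [mu_lo, mu_hi] for a binary-outcome mean to bounds on
    the variance f(mu) = mu * (1 - mu).  f increases on (0, 0.5) and
    decreases on (0.5, 1), so the upper bound is 0.25 whenever the interval
    straddles 0.5; otherwise both bounds come from the interval endpoints."""
    f = lambda m: m * (1.0 - m)
    lo = min(f(mu_lo), f(mu_hi))
    hi = 0.25 if mu_lo < 0.5 < mu_hi else max(f(mu_lo), f(mu_hi))
    return lo, hi
```

For example, a mean interval of [0.2, 0.4] maps to variance bounds of roughly [0.16, 0.24], while [0.4, 0.6] straddles 0.5 and so receives the maximal upper bound of 0.25.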
A full proof of the validity of this method can be found in Appendix B.
Note that the final step does not specify the shape of the confidence set (it need not even be convex). For simplicity, we compute the minimum volume ellipsoid containing all vertices, then shrink the ellipsoid toward its center until only B · (1 − α) of the rectangles have all four of their vertices included. For details on constructing these ellipsoids (sometimes known as Löwner-John ellipsoids), see Boyd et al. (2004). Observe that this is by no means the smallest valid confidence set, but it is convex and easy to work with numerically.
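One standard way to compute the minimum-volume enclosing ellipsoid is Khachiyan's algorithm; a numpy sketch (our own implementation, returning the (1 + ε)-approximate ellipsoid {x : (x − c)ᵀA(x − c) ≤ 1}) is:

```python
import numpy as np

def min_volume_ellipsoid(points, tol=1e-6, max_iter=2000):
    """Khachiyan's algorithm for an approximate Loewner-John (minimum-volume
    enclosing) ellipsoid.  Returns (c, A) describing
    {x : (x - c)^T A (x - c) <= 1}."""
    P = np.asarray(points, dtype=float).T        # d x N
    d, N = P.shape
    Q = np.vstack([P, np.ones(N)])               # lift to homogeneous coords
    u = np.full(N, 1.0 / N)                      # weights on the points
    for _ in range(max_iter):
        X = Q @ np.diag(u) @ Q.T
        M = np.sum(Q * np.linalg.solve(X, Q), axis=0)   # leverage scores
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        if step < tol:
            break
        u = (1.0 - step) * u                     # shift weight to the worst
        u[j] += step                             # covered point
    c = P @ u
    A = np.linalg.inv(P @ np.diag(u) @ P.T - np.outer(c, c)) / d
    return c, A
```

In our application the input points would be the four rectangle vertices from every bootstrap replicate; the shrink-toward-center step then rescales A until only B · (1 − α) rectangles remain fully covered.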
In Figure 1, we demonstrate this procedure on simulated data using Γ = 1.2. We suppose there are four strata, each containing 1,000 observational units. The strata differ in their treatment probabilities, with 263, 421, 564, and 739 treated units in each stratum, respectively. The large black dot at the center of each cluster represents the point estimate (σ̂²_k(0), σ̂²_k(1)). In purple, we plot the rectangles corresponding to the extrema computed in each of 200 bootstrap replicates drawn from the data. The dashed ellipsoids represent 90% confidence sets. In the cases of strata 2 and 4, the ellipsoids extend beyond the upper bound of 0.25 in at least one direction, so we intersect the ellipsoids with the hard boundary at 0.25. The resulting final confidence sets, A1, A2, A3, and A4, are all convex.
The objective is convex in n_rkt, n_rkc and affine (and thus concave) in σ²_k(1), σ²_k(0). Now, having obtained convex constraints, we can invoke von Neumann's minimax theorem (von Neumann, 1928) to switch the order of the minimization and maximization. Hence, the solution to Problem 2 is equivalent to the solution of

max_{ (σ²_k(1), σ²_k(0)) ∈ A_k(Γ) } min_{ {(n_rkt, n_rkc)} } Σ_k w_k² [(1/n_rkt − 1/ñ_rkt) σ²_k(1) + (1/n_rkc − 1/ñ_rkc) σ²_k(0)].

But the inner problem has an explicit solution, given by the optimal allocation derived in the naïve case. Plugging this in yields the simplified problem

max_{ (σ²_k(1), σ²_k(0)) ∈ A_k(Γ) } (1/n_r) (Σ_k w_k (σ_k(1) + σ_k(0)))² − Σ_k w_k² (σ²_k(1)/ñ_rkt + σ²_k(0)/ñ_rkc). (3)

Problem 3 is concave. See Appendix C for a detailed proof. The solution is nontrivial, owing to the fact that the problem is not DCP-compliant. Nonetheless, a simple projected gradient descent algorithm is guaranteed to converge under very mild conditions given the curvature (Iusem, 2003). Hence, we can efficiently solve this problem.
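A sketch of projected gradient ascent for this concave objective follows. For illustration we substitute per-coordinate box bounds for the ellipsoidal sets (so the projection is a clip) and use a finite-difference gradient; all names and these simplifications are ours:

```python
import numpy as np

def solve_problem3(w, nd_t, nd_c, nr, lo, hi, step=1e-3, iters=2000):
    """Projected gradient ascent for the Problem 3 objective, with box
    bounds (lo, hi) on the 2K variances standing in for the ellipsoidal
    confidence sets.  Variance vector ordered as (s2_1..K(1), s2_1..K(0))."""
    K = len(w)

    def obj(s2):
        s1, s0 = np.sqrt(s2[:K]), np.sqrt(s2[K:])
        return (np.sum(w * (s1 + s0)))**2 / nr \
            - np.sum(w**2 * (s2[:K] / nd_t + s2[K:] / nd_c))

    def grad(s2, eps=1e-7):                      # forward differences
        base, g = obj(s2), np.zeros_like(s2)
        for i in range(len(s2)):
            e = np.zeros_like(s2); e[i] = eps
            g[i] = (obj(s2 + e) - base) / eps
        return g

    s2 = (lo + hi) / 2.0                         # start at the box center
    for _ in range(iters):
        s2 = np.clip(s2 + step * grad(s2), lo, hi)   # ascend, then project
    return s2, obj(s2)
```

With a genuine ellipsoidal constraint the projection step would instead solve a small one-dimensional root-finding problem per iteration; the ascent loop is unchanged.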
4 Application to the Data from the Women's Health Initiative

Setup
To evaluate our methods in practice, we make use of data from the Women's Health Initiative, a 1991 study of the effects of hormone therapy on postmenopausal women.
The study included both a randomized controlled trial and an observational study. A total of 16,608 women were included in the trial, with half randomly selected to take 0.625 mg of estrogen and 2.5 mg of progestin, and the remainder receiving a placebo. A corresponding 53,054 women in the observational component of the WHI were deemed clinically comparable to women in the trial. About a third of these women were using estrogen plus progestin, while the remaining women in the observational study were not using hormone therapy (Prentice et al., 2005). We investigate the effect of the treatment on incidence of coronary heart disease. The data is split into two non-overlapping subsets, which we term the "gold" and "silver" datasets. We estimate the probability of treatment for observational units via fitted propensity scores. The data split is the same as the one used in Rosenman et al. (2018). Details on the construction of these data elements can be found in the Appendix, Section A.2, while further details about the WHI can be found in the Supplement, Section A.1.
To choose our subgroups for stratification, we utilize the clinical expertise of researchers in the study's writing group. The trial protocol highlights age as an important subgroup variable to consider (Writing Group for the Women's Health Initiative Investigators, 1998), while subsequent work considered a patient's history of cardiovascular disease (Roehm, 2015). We also consider Langley scatter, a measure of solar irradiance at each woman's enrollment center, which is not plausibly related to baseline incidence or treatment effect. Langley scatter exhibits no association with the outcome in the observational control population: a Pearson's chi-squared test yields a p-value of 0.89. The analogous tests for age and history of cardiovascular disease have p-values below 10^−5.
The age variable has three levels, corresponding to whether a woman was in her fifties, sixties, or seventies. The cardiovascular disease history variable is binary. The Langley scatter variable has five levels, corresponding to strata between 300 and 500 Langleys of irradiance. We provide brief summaries of these variables in Tables 9, 10, and 11 in Appendix Section A.3.
The RCT gold dataset is used to estimate "gold standard" stratum causal effects. We now suppose that the observational study is being used to design an experiment of size n_r = 1,000 units. We compare the estimates from the designed pseudo-experiments against the gold standard estimates under the unweighted L2 loss.
In the design setting, we face the additional challenge of choosing the appropriate value of Γ. The WHI provides a very rich set of covariates, and our propensity model incorporates more than 50 variables spanning the demographic and clinical domains (see details in Appendix Section A.2). Hence, we will run our algorithm at values of Γ = 1.0 (reflecting no residual confounding) as well as Γ = 1.1, 1.5, and 2.0 (reflecting a modest amount).

Detailed Example: Γ = 1.5, Fine Stratification
We show one example in detail, in which we choose Γ = 1.5 and stratify on all three subgroup variables: age, history of cardiovascular disease, and Langley scatter. The cross-product of these variables yields 30 subgroups, which we suppose are weighted equally. We number these groups from 1 through 30.
In the top panel of Figure 2, we show a naïve RCT allocation based purely on the pilot estimates of the stratum potential outcome variances from the observational study. In the bottom panel, we show the regret-minimizing allocations. Visually, it is clear that we have heavily shrunk the allocations toward an equally allocated RCT, but there remain some strata where we recommend over- or under-sampling. Note, too, that the shrinkage is not purely reflective of the magnitude of the pilot estimate, since the number of observational units in each stratum and treatment status also influences the width of our confidence regions for each of the pilot estimates.
To investigate the utility of our regret-minimizing allocations, we sample pseudo-experiments of 1,000 units from the RCT silver dataset 1,000 times with replacement. We do so under three designs: equal allocation by strata; naïve allocation based on the pilot estimates; and regret-minimizing allocation. We then compute the average L2 loss against the gold standard estimates derived from the RCT gold dataset. Results are shown in Figure 3. Our method yields a modest reduction in average loss (3.6%) relative to the naïve design. It also outperforms the equal design, though by a slimmer margin (1.6%). This is encouraging -- especially because the design was intended to guard against worst-case loss, rather than to optimize average loss.
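This resampling comparison can be sketched with synthetic binomial outcomes standing in for the WHI silver data; the stratum rates, allocations, and function name below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_loss(n_t, n_c, p1, p0, tau_gold, w, reps=2000):
    """Average weighted L2 loss of stratum effect estimates across
    synthetic pseudo-experiments with binomial outcomes at rates (p1, p0)."""
    n_t, n_c = np.asarray(n_t), np.asarray(n_c)
    losses = []
    for _ in range(reps):
        tau_hat = (rng.binomial(n_t, p1) / n_t
                   - rng.binomial(n_c, p0) / n_c)
        losses.append(np.sum(w**2 * (tau_hat - tau_gold)**2))
    return float(np.mean(losses))
```

With heterogeneous stratum variances, an allocation that oversamples the high-variance stratum should achieve a visibly lower average loss than equal allocation, mirroring the pattern in Figure 3.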

Performance Over Multiple Conditions
We now simulate with all possible combinations of the stratification variables. For each choice of stratification, we select 1,000 units under equal allocation, naïve allocation, and regret-minimizing allocation with Γ = 1.0, 1.1, 1.5, and 2.0. We then compute the L2 loss versus the "gold standard" estimates derived from the RCT gold dataset.
In Table 1, we summarize the loss relative to equal allocation. We see immediately that the entries are all non-positive. This makes some intuitive sense: the objective in Problem 2 can always be set to 0 by choosing n_rkt = ñ_rkt and n_rkc = ñ_rkc for all k; hence, the algorithm is designed to guarantee that we cannot do worse than allocating equally. By the same token, many of the gains we see are modest, owing to the conservatism of the regret-minimizing approach. Notably, we seem to achieve the greatest gains when we are stratifying only on clinically relevant variables and using a relatively low value of Γ. We achieve a 5-6% risk reduction at low values of Γ in the fourth row of the table, in which we stratify on the clinically relevant age and cardiovascular disease variables. On the other hand, the algorithm quickly defaults to recommending equal allocation when variables are not clinically relevant. In the third row, in which we stratify only on the irrelevant Langley scatter variable, the starred entries correspond to cases in which the regret-minimizing allocation is equal allocation.
In Table 2, we summarize the loss relative to naïve allocation. In this case, our method can underperform a naïve allocation derived from the observational study pilot variance estimates. This can be seen most clearly in the first row of the table, in which we stratify only on the age variable. Such underperformance is a consequence of the fact that our algorithm is defensive against worst-case performance when bias and variance are present in the pilot estimates. However, there is a clear trend in the results: when we stratify on a variable that turns out not to be clinically relevant, like Langley scatter, the naïve allocation is essentially just recommending an allocation based on noise from the data; as a result, our regret-minimizing allocations uniformly outperform the naïve allocations in those settings.
While these simulation results show modest performance gains, they are encouraging. A wise analyst would be extremely cautious about designing an RCT exclusively using observational study pilot estimates of stratum variances. Because such pilot estimates can have both bias and variance, relying too heavily upon them might waste resources. Our framework allows data from the observational study to be incorporated into the RCT design while guarding against the possibility of underperforming a default allocation.

Future Work: General Case
We briefly discuss challenges in the more general case of Yi ∈ R. In keeping with the theme of IPW estimation, we consider estimators of the form

μ̂_k(1) = (Σ_{i∈O_k} Wi Yi / pi) / (Σ_{i∈O_k} Wi / pi),

and analogously for μ̂_k(0), where the pi are the true treatment probabilities. Such estimators are asymptotically unbiased.
In the typical setting in which we use logistic regression to estimate the propensity scores, the fitted logit is ĝ(Xi) = β̂ᵀXi.
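A self-contained sketch of fitting such a logistic propensity model by iteratively reweighted least squares (Newton's method); the function name and ridge term are ours:

```python
import numpy as np

def fit_logistic(X, W, iters=25):
    """Logistic-regression propensity model fit by iteratively reweighted
    least squares.  X: n x p design (include a column of ones for the
    intercept); W: binary treatment indicators.  The fitted propensities
    are expit(X @ beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # current propensities
        S = p * (1.0 - p)                        # Bernoulli variances
        H = X.T @ (X * S[:, None])               # Hessian of the log-lik
        g = X.T @ (W - p)                        # score vector
        beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)), g)
    return beta
```

The tiny ridge term on the Hessian is a numerical safeguard against separation; in practice one would use a vetted implementation rather than this sketch.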
We account for the possibility of Γ-level unmeasured confounding by allowing the true probability pi to satisfy

1/Γ ≤ [pi/(1 − pi)] / [êi/(1 − êi)] ≤ Γ,

where êi is the fitted propensity score. Any affine transformation of our optimization variable will not change the curvature of the problem, so we redefine the problem in terms of the vi = pi^{−1}, an affine function of the zi. We define two vectors vt = (vi)_{i: Wi=1} and vc = (vi)_{i: Wi=0}, and analogously define vectors Yt = (Yi)_{Wi=1} and Yc = (Yi)_{Wi=0}. Now, we can express the equations in 4 as quadratic fractional programs, e.g.

σ̂²_k(1) = (Σ_{i: Wi=1} vi Yi²)/(Σ_{i: Wi=1} vi) − ((Σ_{i: Wi=1} vi Yi)/(Σ_{i: Wi=1} vi))².
We have few guarantees on the curvature of the problem: the numerators will be neither convex nor concave in the v_e terms, e ∈ {t, c}, as long as the vectors 1, Yt, and Yt² are linearly independent. The denominators will be convex in the v_e terms. This poses a major challenge. Quadratic fractional programming problems can be solved efficiently in some special cases but are, in general, NP-hard (Phillips, 2001).
One promising avenue for future work is to apply Dinkelbach's method to transform the quadratic fractional problem into a series of quadratic programming problems (Dinkelbach, 1967). This will not immediately yield a solution because of the indefinite numerator, but it will allow us to make use of considerable recent work on new solution methods in quadratic programming (see e.g. Park and Boyd, 2017).
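To illustrate the mechanics, here is Dinkelbach's iteration on a toy scalar fractional program with a concave numerator (so the parametric subproblem has a closed form; the indefinite-numerator case in our setting would instead require a QP solver at each step). The instance and names are ours:

```python
def dinkelbach(N, D, solve_param, x0, tol=1e-9, max_iter=100):
    """Dinkelbach's method for max_x N(x)/D(x) with D > 0: iterate
    lam = N(x)/D(x), then x = argmax_x N(x) - lam * D(x), stopping when
    the parametric optimum F(lam) = max_x N(x) - lam * D(x) reaches zero."""
    x = x0
    for _ in range(max_iter):
        lam = N(x) / D(x)
        x = solve_param(lam)                 # inner parametric subproblem
        if abs(N(x) - lam * D(x)) < tol:
            break
    return x, N(x) / D(x)

# Toy instance: maximize (2x - x^2) / (1 + x^2) over x >= 0.
N = lambda x: 2 * x - x**2
D = lambda x: 1 + x**2
def solve_param(lam):
    # argmax of (2x - x^2) - lam (1 + x^2); stationarity: 2 - 2(1+lam)x = 0,
    # interior for lam >= 0, so clip at the x >= 0 boundary.
    return max(0.0, 1.0 / (1.0 + lam))

x_star, val = dinkelbach(N, D, solve_param, x0=1.0)
```

For this instance the iteration converges to x* = (√5 − 1)/2, the stationary point of the ratio.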

A.1 Further Details about the Women's Health Initiative
In this section we evaluate our estimators on data from the Women's Health Initiative to estimate the effect of hormone therapy on coronary heart disease. The Women's Health Initiative is a study of postmenopausal women in the United States, consisting of randomized controlled trial and observational study components, with 161,808 total women enrolled (Prentice et al., 2005). Eligibility and recruitment data for the WHI can be found in the early results papers (Hays et al., 2003; Writing Group for the Women's Health Initiative Investigators, 2002). Participants were women between 50 and 79 years old at baseline, who had a predicted survival of at least three years and were unlikely to leave their current geographic area for three years.
Women with a uterus who met various safety, adherence, and retention criteria were eligible for a combined hormone therapy trial. A total of 16,608 women were included in the trial, with 8,506 women randomized to take 0.625 milligrams of estrogen and 2.5 milligrams of progestin, and the remainder receiving a placebo. A corresponding 53,054 women in the observational component of the Women's Health Initiative had an intact uterus and were not using unopposed estrogen at baseline, thus rendering them clinically comparable (Prentice et al., 2005). About a third of these women were using estrogen plus progestin, while the remaining women in the observational study were not using hormone therapy (Prentice et al., 2005).
Participants received semiannual contacts and annual in-clinic visits for the collection of information about outcomes. Disease events, including CHD, were first self-reported and later adjudicated by physicians. We focus on outcomes during the initial phase of the study, which extended for an average of 8.16 years of follow-up in the randomized controlled trial and 7.96 years in the observational study.
The overall rate of coronary heart disease in the trial was 3.7% in the treated group (314 cases among 8,472 women reporting) versus 3.3% (269 cases among 8,065 women reporting) for women not randomized to estrogen and progestin. In the observational study, the corresponding rates were 1.6% among treated women (706 out of 17,457 women reporting) and 3.1% among control women (1,108 out of 35,408 women reporting). Our methodology compares means and not survival curves. In the initial follow-up period, death rates were relatively low in both the observational study (6.4%) and the randomized trial (5.7%). Hence, we do not correct for the possibility of these deaths censoring coronary heart disease events.

A.2 Propensity Score Construction, Covariate Balance, and Gold Standard Effects
The Women's Health Initiative researchers collected a rich set of covariates about the participants in the study. For the purposes of computational speed, we narrow to a set of 684 variables, spanning demographics, medical history, diet, physical measurements, and psychosocial data collected at baseline.
The most meaningful measure of covariate imbalance can be found by looking at clinically relevant factors. Prentice et al. (2005) identified factors that are correlated with CHD. They found that hormone therapy users in the observational study were more likely to be Caucasian or Asian/Pacific Islander, less likely to be overweight, and more likely to have a college degree. These imbalances strongly suggest that applying a naïve differencing estimate to the observational data will yield an unfairly rosy view of the effect of hormone therapy on CHD.
To generate our estimators for this dataset, we need a propensity model e(x) to map the observed covariates to an estimated probability of receiving the treatment in the observational study. We used a logistic regression to generate an expressive model while limiting overfit. A forward stepping algorithm was first applied to the observational dataset to put an ordering on the variables. All 684 baseline covariates were provided as candidates to a logistic regression predicting the treatment indicator, and variables were automatically added, one at a time, based on which addition most reduced Akaike's Information Criterion (Akaike, 1974).
Using this ordering, models containing from one to 120 variables were generated. Model fit was assessed via the area under the receiver operating characteristic (ROC) curve. At each model size, the area under the curve was computed first for the nominal model and then computed again using ten-fold cross-validation. This procedure generated the curves seen in Figure 4. Notably, we observe that the predictive power rises rapidly with the addition of the first twenty variables to the logistic regression model, but slows dramatically thereafter. There is also very little evidence of overfit, as the nominal area under the curve only very slightly outpaces the cross-validated area under the curve, even in models with 100 or more variables. This is likely a consequence of the large number of observations in the observational dataset.
We next applied a heuristic threshold, selecting the largest model such that the most recent variable addition increased the cross-validated area under the curve by at least one basis point (0.01%). This yields a model containing 53 variables, with an area under the ROC curve of 82.49%, or about 1% lower than a model containing all 684 covariates. As our goal is to obtain an association between e(xi) and Wi, and additional variables beyond the 53rd do not materially improve this association, omission of the additional variables seems warranted.
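The area under the ROC curve used above can be computed directly from the Mann-Whitney rank-sum statistic, without tracing the curve; a numpy sketch (our own helper) is:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank-sum statistic:
    the probability that a randomly chosen treated unit receives a higher
    score than a randomly chosen control unit, with ties counting 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):                  # midranks for tied scores
        tied = scores == v
        ranks[tied] = ranks[tied].mean()
    n1 = int(labels.sum())
    n0 = len(labels) - n1
    return (ranks[labels == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
```

For perfectly separating scores the statistic equals 1.0, and for completely uninformative scores it equals 0.5.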
Matching on the propensity score should reduce imbalances on clinically relevant covariates. To evaluate this effect, we use standardized differences (as advocated by Rosenbaum (2009)). Let x̄_tj and x̄_cj be the treated and control group averages for continuous covariate j in the observational study before matching, and let σ̂²_tj and σ̂²_cj be the sample variances within those two groups. Let x̄_tjk and x̄_cjk be those averages taken over subjects i ∈ O_k, and define post-stratification averages as x̃_tj = Σ_k n_ok x̄_tjk / n_o and x̃_cj = Σ_k n_ok x̄_cjk / n_o. These are weighted averages of the x_ij with greater weight put on observations from treatment conditions that are underrepresented in their own strata. Rosenbaum's standardized differences for the original and reweighted data are

d_j = (x̄_tj − x̄_cj) / sqrt((σ̂²_tj + σ̂²_cj)/2) and d̃_j = (x̃_tj − x̃_cj) / sqrt((σ̂²_tj + σ̂²_cj)/2),

respectively. These quantities measure the practical significance of the imbalance between groups, unlike t-statistics, which have a standard error in the denominator. Note that Rosenbaum uses the same denominator in both the weighted and unweighted standardized differences. We considered ten equal-width propensity score strata to evaluate the standardized differences between treated and control on risk factors listed in Prentice et al. (2005), before and after adjusting for the propensity score. With the exception of the physical functioning score, all of these covariates were included in the propensity model. Imbalance measures for the continuous covariates can be found in Table 3. As we can see, the stratification procedure reduces all standardized differences to less than 0.05 in absolute value, representing very good matches between the populations.
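These quantities can be sketched in a short numpy helper; the function name is ours, and for simplicity it weights strata by their overall size, keeping the unweighted pooled standard deviation in the denominator as described above:

```python
import numpy as np

def standardized_difference(x, treated, strata=None):
    """Rosenbaum-style standardized difference for a continuous covariate.
    Without strata: difference in raw group means.  With strata: group
    means are replaced by stratum-size-weighted averages of within-stratum
    means, while the denominator keeps the same unweighted pooled SD."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    s2t, s2c = x[treated].var(ddof=1), x[~treated].var(ddof=1)
    denom = np.sqrt((s2t + s2c) / 2.0)
    if strata is None:
        return (x[treated].mean() - x[~treated].mean()) / denom
    strata = np.asarray(strata)
    levels, counts = np.unique(strata, return_counts=True)
    wts = counts / counts.sum()
    mt = sum(w * x[treated & (strata == s)].mean()
             for s, w in zip(levels, wts))
    mc = sum(w * x[~treated & (strata == s)].mean()
             for s, w in zip(levels, wts))
    return (mt - mc) / denom
```

When treatment is confounded with a stratum-level shift in the covariate, the raw difference is large while the post-stratification difference collapses toward zero, which is exactly the pattern reported in Table 3.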
For categorical variables, the stratification procedure similarly reweights individual women, such that the effective proportion of women in each category changes after stratifying on the propensity score. Standardized differences can also be computed for categorical variables, using the procedure described in Graziano and Raulin (1993). We achieve similar balance on two significant categorical variables, ethnicity and smoking status, in Tables 4 and 5.

Lastly, consider estimation of the "gold standard" causal effect. We randomly partition the randomized trial data into two subsets of equal size, such that each contains the same number of treated and control women. We select one of these subsets and refer to it as our "gold" dataset, to be used for estimating the true causal effect. The remaining subset is referred to as the "silver" dataset and is used for evaluating our estimators.
Because of the randomization, we find that treated and control are already well balanced on the coronary heart disease risk factors in the gold dataset, as summarized in Tables 6, 7, and 8.

B Proof of Validity of Confidence Regions
We hew closely to the proofs provided in Zhao et al. (2019). Their primary proofs consider the missing data problem, which is equivalent to estimating the mean of either of the potential outcomes. We begin by providing details of their proof and then show how it can be extended to our case.

B.1 Review of Proof in Zhao et al. (2019)
The authors define $e(x, y)$ as the probability of treatment given covariates $X = x \in \mathcal{X}$ and outcome $Y = y \in \mathbb{R}$, and compare it against the marginal treatment probability $e(x)$. They use $A$ rather than $W$ to denote a treatment indicator, so in keeping with their notation,
$$e(x) = P(A = 1 \mid X = x), \qquad e(x, y) = P(A = 1 \mid X = x, Y = y).$$
Then, for any choice of $\Lambda > 1$, they define a collection of sensitivity models
$$\mathcal{E}(\Lambda) = \left\{ e(x, y) : \frac{1}{\Lambda} \le \mathrm{OR}\big(e(x, y),\, e(x)\big) \le \Lambda \right\},$$
where $\mathrm{OR}(p_1, p_2) = [p_1/(1 - p_1)] \,/\, [p_2/(1 - p_2)]$ is the odds ratio. This model was originally introduced by Tan (2006). Per their Proposition 7.1, it is related to the widely used Rosenbaum sensitivity model. In keeping with that model, we use $\Gamma$ rather than $\Lambda$ to denote our sensitivity parameter in the text, but retain the notation $\Lambda$ throughout this proof. Via their Remark 3.2, Zhao and co-authors reparameterize the problem such that each model in $\mathcal{E}(\Lambda)$ corresponds to a choice of $h(x, y)$, the logit-scale difference between the observed probability $e(x)$ and the complete-data selection probability $e(x, y)$. So we can alternatively write
$$\mathcal{E}(\Lambda) = \left\{ e^{(h)}(x, y) : \mathrm{logit}\, e^{(h)}(x, y) = \mathrm{logit}\, e(x) - h(x, y),\ h \in \mathcal{H}(\lambda) \right\},$$
where $\lambda = \log(\Lambda)$ and $\mathcal{H}(\lambda) = \{ h : \mathcal{X} \times \mathbb{R} \to \mathbb{R} \mid \|h\|_\infty \le \lambda \}$. In words: every choice of $h \in \mathcal{H}(\lambda)$ defines, at each possible value of $X$ and $Y$, a discrepancy between $e(x)$ and $e(x, y)$. The choice of $\mathcal{H}(\lambda)$ bounds the maximum of those discrepancies, so as $\Lambda$ grows, we allow for greater and greater discrepancies in these probabilities.
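The reparameterization can be checked numerically: shifting $\mathrm{logit}\, e(x)$ by any $h$ with $|h| \le \lambda = \log \Lambda$ yields a probability whose odds ratio against $e(x)$ lies in $[1/\Lambda, \Lambda]$. A minimal sketch (the values of $\Lambda$ and $e(x)$ are illustrative assumptions):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(z):
    return 1 / (1 + math.exp(-z))

def shifted_probability(e_x, h):
    """e^(h)(x, y): shift logit(e(x)) by the sensitivity function h(x, y)."""
    return expit(logit(e_x) - h)

def odds_ratio(p1, p2):
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

Lam = 1.2            # sensitivity parameter Lambda (Gamma in the text)
lam = math.log(Lam)  # lambda = log(Lambda) bounds |h|
e_x = 0.3            # an arbitrary marginal treatment probability
for h in (-lam, -lam / 2, 0.0, lam / 2, lam):
    e_h = shifted_probability(e_x, h)
    # odds(e_h) / odds(e_x) = exp(-h), which stays within [1/Lam, Lam]
    assert 1 / Lam - 1e-12 <= odds_ratio(e_h, e_x) <= Lam + 1e-12
```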
For each choice of $h$, they define a "shifted estimand"
$$\mu(h) = \frac{E\big[A Y / e^{(h)}(X, Y)\big]}{E\big[A / e^{(h)}(X, Y)\big]},$$
where $A$ is the treatment indicator and the expectation is over the joint distribution of $X, Y, A$. The corresponding "shifted estimator" is given by
$$\hat\mu(h) = \frac{\sum_i A_i Y_i / \hat e^{(h)}(X_i, Y_i)}{\sum_i A_i / \hat e^{(h)}(X_i, Y_i)}.$$
The sum is over a sample of points $(X_i, Y_i, A_i)$ drawn i.i.d. from their joint distribution.
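The shifted estimator can be computed directly. A minimal sketch, assuming binary outcomes, a user-supplied propensity estimate, and a user-supplied sensitivity function $h$ (the data and `e_hat` values below are toy assumptions):

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

def shifted_sipw(data, e_hat, h):
    """Stabilized IPW estimate mu_hat(h).

    data  : list of (x, y, a) triples; y is used only when a == 1
    e_hat : function x -> estimated P(A = 1 | X = x)
    h     : sensitivity function (x, y) -> logit-scale shift
    """
    num = den = 0.0
    for x, y, a in data:
        if a == 1:
            e_h = expit(logit(e_hat(x)) - h(x, y))
            num += y / e_h
            den += 1 / e_h
    return num / den

# Toy data: x is a single binary covariate; treatment is assumed more
# likely when x == 1 (the e_hat values are illustrative assumptions).
data = [(0, 1, 1), (0, 0, 1), (1, 1, 1), (1, 1, 0), (0, 0, 0)]
e_hat = lambda x: 0.7 if x == 1 else 0.4
no_confounding = lambda x, y: 0.0  # h = 0 recovers the ordinary SIPW estimate
print(shifted_sipw(data, e_hat, no_confounding))
```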
The quantity in the denominators, $\hat e^{(h)}(X_i, Y_i)$, is obtained by estimating $P(A = 1 \mid X = x)$ and then shifting the estimate by $h(x_i, y_i)$ for all units $i$ such that $X_i = x_i$ and $Y_i = y_i$. Now, the proof of the validity of their approach proceeds in several stages.
1. First, they observe that under the sensitivity model the true estimand equals $\mu(h_0)$ for some unknown $h_0 \in \mathcal{H}(\lambda)$, and hence lies in the interval $[\inf_{h \in \mathcal{H}(\lambda)} \mu(h),\ \sup_{h \in \mathcal{H}(\lambda)} \mu(h)]$.
2. For each choice of $h \in \mathcal{H}(\lambda)$, they establish that the percentile bootstrap is valid (Theorem 4.2).
• First, they use the general theory of Z-estimators to show that $\hat\mu(h)$ and its bootstrap analogue, $\hat\mu^*(h)$, are asymptotically normal with the same mean and variance (Theorem C.1 and Corollary C.2).
• Then, they conclude that the percentile interval $[Q_{\alpha/2}(\hat\mu^*(h)),\ Q_{1-\alpha/2}(\hat\mu^*(h))]$, where $Q_\beta$ denotes the $\beta$ quantile over bootstrap resamples, is an asymptotically valid $1 - \alpha$ confidence interval for $\mu(h)$.
3. They argue that the quantile and infimum/supremum functions can be interchanged, such that
$$\inf_{h \in \mathcal{H}(\lambda)} Q_{\alpha/2}\big(\hat\mu^*(h)\big) = Q_{\alpha/2}\Big(\inf_{h \in \mathcal{H}(\lambda)} \hat\mu^*(h)\Big) \qquad \text{and} \qquad \sup_{h \in \mathcal{H}(\lambda)} Q_{1-\alpha/2}\big(\hat\mu^*(h)\big) = Q_{1-\alpha/2}\Big(\sup_{h \in \mathcal{H}(\lambda)} \hat\mu^*(h)\Big)$$
via Lemma 4.3. The resulting interval covers every $\mu(h)$ with $h \in \mathcal{H}(\lambda)$, and in particular the true estimand, with asymptotic probability at least $1 - \alpha$.
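The interval over extremal sensitivity functions can be illustrated numerically. This is a self-contained toy sketch, not the paper's construction: outcomes are binary, the true propensity is a known constant, and for binary $Y$ the extrema over $\|h\|_\infty \le \lambda$ are attained by functions taking values in $\{-\lambda, +\lambda\}$ on each level of $y$, so four candidate $h$ functions suffice.

```python
import math, random

def expit(z):
    return 1 / (1 + math.exp(-z))

def sipw(ys, trt, e, h):
    """Stabilized IPW estimate with logit shift h(y) on constant propensity e."""
    num = den = 0.0
    for y, a in zip(ys, trt):
        if a == 1:
            e_h = expit(math.log(e / (1 - e)) - h(y))
            num += y / e_h
            den += 1 / e_h
    return num / den

def quantile(xs, q):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]

def union_interval(ys, trt, e, lam, alpha=0.05, B=500, seed=0):
    """Percentile-bootstrap interval for [inf_h mu(h), sup_h mu(h)]."""
    rng = random.Random(seed)
    # Enumerate the four extremal h functions: h(y) in {-lam, +lam}.
    h_fns = [lambda y, s0=s0, s1=s1: (s1 if y == 1 else s0) * lam
             for s0 in (-1, 1) for s1 in (-1, 1)]
    n = len(ys)
    lo_stats, hi_stats = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        by = [ys[i] for i in idx]
        ba = [trt[i] for i in idx]
        if sum(ba) == 0:
            continue  # skip degenerate resamples with no treated units
        vals = [sipw(by, ba, e, h) for h in h_fns]
        lo_stats.append(min(vals))
        hi_stats.append(max(vals))
    return quantile(lo_stats, alpha / 2), quantile(hi_stats, 1 - alpha / 2)

# Toy unconfounded data: Y ~ Bernoulli(0.6), A ~ Bernoulli(0.5).
rng = random.Random(1)
ys = [int(rng.random() < 0.6) for _ in range(200)]
trt = [int(rng.random() < 0.5) for _ in range(200)]
lo, hi = union_interval(ys, trt, e=0.5, lam=math.log(1.2))
print(lo, hi)  # an interval around the treated-group mean outcome
```

As expected, the interval widens as the sensitivity parameter grows, since larger $\lambda$ permits larger discrepancies between $e(x)$ and $e(x, y)$.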

B.2 Extension to Design Case
Our challenge is to extend this argument to the case where our estimand of interest is not a single $\mu$ but rather the pair
$$(\sigma^2_k(0), \sigma^2_k(1)) = \big(\mu_k(0)(1 - \mu_k(0)),\ \mu_k(1)(1 - \mu_k(1))\big).$$
Crucially, we will now have two $h$ functions, $h_0$ and $h_1$, corresponding to the two potential outcomes, but both lie within $\mathcal{H}(\lambda)$. The definition of the shifted estimand under $h$ given above generalizes to the case of two shifted estimands in a straightforward way. We extend Proposition 1 in the following argument.
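One algebraic subtlety in moving from means to Bernoulli variances: $\sigma^2 = \mu(1 - \mu)$ is not monotone in $\mu$, so an interval for $\mu$ does not map endpoint-to-endpoint onto an interval for $\sigma^2$; whenever the $\mu$-interval straddles $1/2$, the variance can reach its global maximum of $1/4$. A small helper illustrating this algebra (an illustration only, not the paper's ellipsoid construction):

```python
def bernoulli_variance_range(mu_lo, mu_hi):
    """Image of the interval [mu_lo, mu_hi] under mu -> mu * (1 - mu).

    The map increases on [0, 1/2] and decreases on [1/2, 1], so the
    maximum is 1/4 whenever the interval contains 1/2; the minimum is
    always attained at an endpoint."""
    assert 0.0 <= mu_lo <= mu_hi <= 1.0
    f = lambda m: m * (1 - m)
    lo = min(f(mu_lo), f(mu_hi))
    hi = 0.25 if mu_lo <= 0.5 <= mu_hi else max(f(mu_lo), f(mu_hi))
    return lo, hi

print(bernoulli_variance_range(0.2, 0.3))  # monotone region: ~ (0.16, 0.21)
print(bernoulli_variance_range(0.4, 0.6))  # straddles 1/2:   ~ (0.24, 0.25)
```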
Proposition 1. Suppose there exists a data-dependent region $B_\alpha(h_0, h_1)$ that covers the pair of shifted variances $(\sigma^2_k(0; h_0),\ \sigma^2_k(1; h_1))$ with asymptotic probability at least $1 - \alpha$, for each fixed $h_0, h_1 \in \mathcal{H}(\lambda)$. Since this holds for every $h_0, h_1 \in \mathcal{H}(\lambda)$, we can take the union on the LHS to observe
$$\liminf_{n \to \infty} P\Big( \big(\sigma^2_k(0), \sigma^2_k(1)\big) \in \bigcup_{h_0, h_1 \in \mathcal{H}(\lambda)} B_\alpha(h_0, h_1) \Big) \le \liminf_{n \to \infty} P\Big( \big(\sigma^2_k(0), \sigma^2_k(1)\big) \in A_k \Big).$$
Observe that the RHS is simply $A_k$, since any ellipse containing the vertices of a rectangle will contain the convex hull of those vertices as well.

On the LHS, we can make use of our bootstrap validity result to observe
$$\liminf_{n \to \infty} P\Big( \big(\sigma^2_k(0), \sigma^2_k(1)\big) \in \bigcup_{h_0, h_1 \in \mathcal{H}(\lambda)} B_\alpha(h_0, h_1) \Big) \ge 1 - \alpha .$$
It follows from Proposition 1 that the LHS is a valid $1 - \alpha$ level confidence region. Hence, the right-hand side must be as well.
To conclude, we observe that the ellipsoid produced by our method necessarily comprises a superset of the convex hull for some choice of $B_\alpha$. Hence, our method will indeed generate valid confidence regions for the potential outcome variances.

C Proof of Concavity of Minimax Problem
We begin with the unweighted case and demonstrate concavity by direct computation of the Hessian of the objective $f$. The Hessian takes the form $H + vv^T$, where $H$ is negative definite and $v$ is a fixed vector. We want to consider the eigenvalues of $H + vv^T$. First, observe that at most one eigenvalue can be non-negative. This follows from the famed Weyl inequalities (Weyl, 1912): $H$ has all strictly negative eigenvalues, while $vv^T$, being an outer product, has one positive eigenvalue, $v^T v$, with all other eigenvalues zero. Denoting by $\lambda_i(G)$ the $i$th largest eigenvalue of matrix $G$, the Weyl inequalities tell us that
$$\lambda_2(H + vv^T) \le \lambda_1(H) + \lambda_2(vv^T) = \lambda_1(H) < 0 .$$
Hence, only one non-negative eigenvalue is possible.
Next, we can use the matrix determinant lemma to observe that $\det(H + vv^T) = (1 + v^T H^{-1} v) \det(H)$, and direct computation tells us that $1 + v^T H^{-1} v = 0$. Hence, the determinant is zero, meaning at least one of our eigenvalues must be zero.
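The eigenvalue argument can be checked numerically in a $2 \times 2$ toy instance. The specific $H$ and $v$ below are assumptions chosen to satisfy $1 + v^T H^{-1} v = 0$, not the paper's Hessian:

```python
import math

# H = diag(-a, -b) is negative definite; choose v so that
# 1 + v^T H^{-1} v = 0, i.e. v1^2 / a + v2^2 / b = 1.
a, b = 2.0, 3.0
v1 = 1.0
v2 = math.sqrt(b * (1 - v1 ** 2 / a))  # solves v1^2/a + v2^2/b = 1

# M = H + v v^T, written out entrywise for the 2x2 case.
m11 = -a + v1 * v1
m12 = v1 * v2
m22 = -b + v2 * v2

# Matrix determinant lemma: det(H + v v^T) = (1 + v^T H^{-1} v) det(H) = 0.
det_M = m11 * m22 - m12 ** 2

# Eigenvalues of a symmetric 2x2 matrix via the quadratic formula.
tr = m11 + m22
disc = math.sqrt(tr ** 2 - 4 * det_M)
eig_hi, eig_lo = (tr + disc) / 2, (tr - disc) / 2

print(round(det_M, 12), round(eig_hi, 12), round(eig_lo, 12))
# det_M ~ 0, largest eigenvalue ~ 0, remaining eigenvalue = trace < 0
```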
Combined with our prior result, this means our maximum eigenvalue must be zero, and we conclude the Hessian is negative semidefinite. Thus, $f$ is indeed concave. Finally, note that the extension to the weighted case is straightforward. We can simply define new variables $\tilde\sigma_k(e) = \sqrt{w_k}\, \sigma_k(e)$ for $e \in \{0, 1\}$, and then repeat the proof above using the $\tilde\sigma_k(e)$ variables. Since $\sigma_k(e)$ is simply an affine transformation of $\tilde\sigma_k(e)$, concavity in the former follows from concavity in the latter.

Figure 1: Simulated example of confidence regions in four strata under Γ = 1.2.

Figure 2: Allocation of units to strata under naïve scheme and regret-minimizing scheme.

Table 1: $L^2$ loss comparisons for regret-minimizing allocations relative to equal allocation. For starred entries, the regret-minimizing allocation defaults to equal allocation. The regret-minimizing allocations tend to outperform the naïve allocations as the number of strata grows; we significantly outperform naïve allocation in the final row, which corresponds to stratification on all three variables and a total of 30 strata.

Table 2: $L^2$ loss comparisons for regret-minimizing allocations relative to naïve allocation.

Table 3: Standardized differences (SD) between treated and control populations in the observational dataset, before and after stratification on the propensity score, for clinical risk factors for coronary heart disease.

Table 4: Standardized differences (SD) between treated and control populations in the observational database, before and after stratification on the propensity score, for ethnicity category.

Table 6: Standardized differences (SD) between treated and control populations in the RCT gold dataset, for clinical risk factors for coronary heart disease.

Table 7: Standardized differences (SD) between treated and control populations in the RCT gold dataset, for ethnicity category.

Table 9: Distribution of age variable values in the observational study, RCT, and RCT "silver" datasets.

Table 10: Distribution of history of cardiovascular disease in the observational study, RCT, and RCT "silver" datasets.

Table 11: Distribution of Langley scatter categories in the observational study, RCT, and RCT "silver" datasets.