The increasing availability of passively observed data has yielded a growing interest in “data fusion” methods, which involve merging data from observational and experimental sources to draw causal conclusions. Such methods often require a precarious tradeoff between the unknown bias in the observational dataset and the often-large variance in the experimental dataset. We propose an alternative approach, which avoids this tradeoff: rather than using observational data for inference, we use it to design a more efficient experiment. We consider the case of a stratified experiment with a binary outcome and suppose pilot estimates for the stratum potential outcome variances can be obtained from the observational study. We extend existing results to generate confidence sets for these variances, while accounting for the possibility of unmeasured confounding. Then, we pose the experimental design problem as a regret minimization problem subject to the constraints imposed by our confidence sets. We show that this problem can be converted into a concave maximization and solved using conventional methods. Finally, we demonstrate the practical utility of our methods using data from the Women’s Health Initiative.
The past half-century of causal inference research has engendered a healthy skepticism toward observational data. In observational datasets, researchers do not control whether each individual receives a treatment of interest. Hence, they cannot be certain that treated individuals and untreated individuals are otherwise comparable.
This challenge can be overcome only if the covariates measured in the observational data are sufficiently rich to fully explain who receives the treatment and who does not. This is a fundamentally untestable assumption – and even if it holds, careful modeling is necessary to remove the selection effect. The applied literature includes myriad examples of treatments that showed promise in observational studies only to be overturned by later randomized trials. One prominent case, the effect of hormone therapy on the health of postmenopausal women, will be discussed in this manuscript.
The “virtuous” counterpart to observational data is the well-designed experiment. Data from a randomized trial can yield unbiased estimates of a causal effect without the need for problematic statistical assumptions. However, experiments are not without their own significant drawbacks. Experiments are frequently expensive, and, as a consequence, often involve fewer units than observational studies. Particularly if one is interested in subgroup causal effects, this means experimental estimates can be imprecise. Moreover, experiments sometimes involve inclusion criteria that can make them dissimilar from target populations of interest. In this way, experiments are often said to have poor “external validity”.
In this article, we use the observational data not for inference, but rather to influence the design of an experiment. Our method seeks to retain the possibility of unbiased estimation from the experiment, while also leveraging the ready availability of observational databases to improve the experiment’s efficiency. Because the observational data are not used to estimate causal effects, we need not make onerous assumptions about the treatment assignment mechanism. However, we do need to make some assumptions to establish comparability between the observational and experimental data – assumptions that will be less likely to hold if the experiment incorporates inclusion criteria. Furthermore, our discussion will be limited to settings with binary outcomes, in which computations are tractable. We suppose the experiment has a stratified design, and seek to determine allocations of units to strata and treatment assignments.
Suppose pilot estimates of the stratum potential outcome variances are obtained from the observational study. If the outcomes are binary, we show that recent advances in sensitivity analysis from Zhao, Small, and Bhattacharya can be extended to generate confidence sets for these variances, while incorporating the possibility of unmeasured confounding. Next, we pose the experimental design problem as a regret minimization problem subject to the potential outcome variances lying within their confidence sets. We use a trick from von Neumann to convert the problem into a concave maximization. The problem is not compliant with disciplined convex programming (DCP), but it can be solved using projected gradient descent. This approach can yield modest efficiency gains in the experiment, especially if there is heterogeneity in treatment effects and baseline incidence rates across strata.
The remainder of the article proceeds as follows. Section 2 briefly reviews related literature, while Section 3 defines our notation, assumptions, and loss function. Section 4 gives our main results. These include the derivation of bias-aware confidence sets for the pilot variance estimates; the formulation of the design problem as a regret minimization; and the strategy to convert that problem into a computationally tractable one. We demonstrate the practical utility of our methods on data from the Women’s Health Initiative in Section 5. Section 6 discusses future work and concludes.
2 Related work
Our focus is on using observational data for experimental design, rather than for inference. We briefly review challenges in using so-called “data fusion” methods  that seek to merge observational and experimental data directly.
A key question is whether researchers can assume unconfoundedness – roughly, that all variables simultaneously affecting treatment probabilities and outcomes are measured – in the observational study. Under unconfoundedness, bias can be finely controlled using statistical adjustments (see e.g. ref. ). Hence, observational and experimental data can be merged without the risk of inducing large biases. This is the approach used in our previous work; similar assumptions are made in ref. . Yet unconfoundedness is a strong and fundamentally untestable assumption, and it is unrealistic in many practical settings.
Some previous studies have attempted to weaken the unconfoundedness assumption, but they frequently introduce alternative assumptions in order to proceed with merged estimation. In ref. , the authors assume that the hidden confounding has a parametric structure, and they suggest fitting a model to correct for the hidden confounding. In ref. , it is assumed the bias preserves unit-level relative rank ordering (as the authors say, “bigger causal effects imply bigger bias”). The authors consider time series data with multiple observations per unit, and they argue that their assumptions are reasonable in this setting. Yet this approach does not easily extend to the case where each unit’s outcome is observed only once.
Observational studies are also frequently included in meta-analyses, which seek to synthesize evidence across multiple studies. In a recent summary of methods, Mueller et al. found that recommendations for the inclusion of observational studies in systematic reviews were largely unchanged from those used for experiments. They also found little consensus on best practices for combining data. Mueller and coauthors highlight a few exceptions. Thompson et al. propose estimating bias reduction based on the subjective judgment of a panel of assessors, and adjusting the observational study results accordingly. Their method requires a high degree of subject matter expertise. Prevost et al. suggest a hierarchical Bayes approach in which the difference between observational and experimental results is modeled explicitly. Their results are sensitive to the choice of prior.
A number of other approaches have been suggested, such as methods that make use of Bayesian networks or structural causal models. Broadly, this remains an area of active research, and there is no consensus best practice for merging observational and experimental causal estimates, especially when unconfoundedness is not a tenable assumption.
We instead focus on the question of experimental design, influenced by the observational data. Many recent papers have considered a closely related problem: adaptive randomization in multi-stage trials (see e.g. [18,19]). In multi-stage trials, the pilot data (or “first-stage data”) emerges not from an observational study, but instead from a randomized controlled trial (RCT). The comparative trustworthiness of these data allows for considerable flexibility in using the data to improve the design of a subsequent experiment.
In ref. , Tabord-Meehan considers the problem of a two-stage RCT. Unlike the setting of this study, Tabord-Meehan does not suppose that the strata are defined ahead of time. He seeks to minimize variance in estimation of the average treatment effect (ATE), rather than a loss across strata. Leveraging the reliability of the first-stage data, he proposes estimating a stratification tree using these data. Then, the choice of stratification variables, stratum delimiters for those variables, and assignment probabilities for each individual stratum in the second stage are all determined using the first-stage data. This procedure achieves a notion of asymptotic optimality among estimators utilizing stratification trees.
Bai also considers randomization procedures that are informed by pilot data. He proposes a procedure in which units are first ranked according to the sum of the expectations of their treated and untreated potential outcomes (conditional on covariates), then matched into pairs with their adjacent units, with treatment randomized to exactly one member of each matched pair. Because the ranking depends on an unknown quantity, a large pilot study is required to implement this method. Bai also discusses the case in which pilot data are unavailable, in which case he proposes using the minimax framework to choose the matched-pair design that is optimal under the most adversarial data-generating process, subject to mild shape constraints on the conditional expectations of potential outcomes given covariates.
These papers share many similar goals and analytic techniques to this manuscript. Crucially, we consider the case of a loss over a fixed stratification, rather than estimation of the ATE. Moreover, our pilot data are assumed to come from an observational study, rather than an experiment. The data are potentially informative, but significantly less reliable than a pilot RCT.
3 Problem set-up
3.1 Sources of randomness
We assume we have access to an observational study with units in indexing set such that . We associate with each unit a pair of unseen potential outcomes ; an observed covariate vector , where ; and a propensity score denoting the probability of receiving treatment. We also associate with each a treatment indicator and an observed outcome defined by .
There are multiple perspectives on randomness in causal inference. In the setting of ref. – as in much of the early potential outcome literature – all quantities are treated as fixed except the treatment assignment. More modern approaches sometimes treat the potential outcomes and covariates as random variables (see e.g. ref. ). Similarly, some authors treat all of the data elements (including the treatment assignment) as random draws from a super-population (see e.g. ref. ). Per the discussion in ref. , these subtleties often have little effect on the choice of estimators, but they do affect the population to which results can be generalized.
In our setting, we assume that the experimental data have not yet been collected, so it does not make sense to talk about fixed potential outcomes. More naturally, we treat the potential outcomes and covariates as random for both the observational and experimental datasets. Thus, for units , we view as drawn from a joint distribution . Similarly, the experimental data will be denoted for , sampled from a joint distribution . Because we are treating the potential outcomes as random variables, we can reason about their means and variances under the distribution .
3.2 Stratification and assumptions
We suppose we have a fixed stratification scheme based on the covariates . This can be derived from substantive knowledge or from applying a modern machine learning algorithm to the observational study to uncover treatment effect heterogeneity (see e.g. ref. [25,26]). The stratification is such that there are strata and each has an associated weight , where for all and . The weights define the relative importance of the strata and thus ordinarily reflect their prevalence in a population of interest.
Using the stratification on the observational study, we define indexing subsets (with cardinalities ) to identify units in each stratum. For each stratum, define as the set of covariate values defining the stratum, such that .
Suppose we have a budget constraint such that we can recruit only total units for the experiment, which we will also refer to as an RCT. One goal of our procedure is to decide the number of units recruited for each stratum, subject to the constraint . Once the experimental units are recruited, we will identically define indexing subsets such that . Within each stratum , a second goal of our procedure will be to decide the count of units we will assign to the treatment vs. control conditions, such that the associated counts and sum to . Hence, our variables of interest will be .
We will make the following assumption about allocation to treatment.
(Allocations to treatment) For each observational unit , treatment is allocated via an independent Bernoulli trial with success probability . For the experimental units, treatment is allocated stratum-wise by drawing a simple random sample of the designated number of treated units from the total units within each stratum.
Define and as expectations and variances under the distributions and , respectively. We will need two further assumptions for our derivations.
(Common potential outcome means) Conditional on the stratum, the potential outcome averages for the two populations are equal. In other words,
for all . We denote these shared quantities as and , respectively.
(Common potential outcome variances) Conditional on the stratum, the potential outcome variances for the two populations are equal. In other words,
for all . We denote these shared quantities as and , respectively.
Assumptions 2 and 3 establish commonality between the observational and experimental datasets. Assumption 3 is needed explicitly to relate the optimal experimental design to quantities estimated from the observational study. These assumptions are not testable, though they need not hold exactly for the proposed methods to generate improved experimental designs. Researchers must apply subject matter knowledge to assess their approximate viability. For example, in cases in which the RCT units are sampled from the same underlying population as the observational units, these assumptions are likelier to hold. However, if the experiment incorporates onerous inclusion criteria such that the covariate distributions within each stratum differ significantly between the experimental and observational datasets, Assumptions 2 and 3 may be less plausible.
3.3 Loss and problem statement
Given Assumption 2, we can define a mean effect – the difference between the shared treated and control potential outcome means – for each stratum. We can collect these values into a vector.
Denote the associated causal estimates derived from the RCT as for . We can collect these estimates into a vector . We use a weighted loss when estimating the causal effects across strata,
Our goal will be to minimize the risk, defined as an expectation of the loss over both the treatment assignments and the potential outcomes. For simplicity, we suppress the subscript and write
4 Converting to an optimization problem
4.1 Decision framework
Were the stratum potential outcome variances known exactly, it would be straightforward to compute optimal allocations in the RCT. The optimal choice, obtained by minimizing (1), is simply:
which yields a risk of
Note that the expressions in (2) are closely related to the well-known Neyman allocation formulas for stratified sampling. In our setting, we allow for arbitrary stratum weights, but we impose a sample size constraint rather than the cost constraint frequently used in Neyman allocations. We will continue using a sample size constraint for the remainder of the article. It is straightforward to extend this work to the setting in which the treated and control arms have different costs, and the constraint is imposed in terms of cost rather than sample size. These formulas are computed explicitly in Appendix D.
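To make the Neyman-style allocation concrete, the sketch below computes the closed-form solution assuming the risk takes the weighted form sum_k w_k (σ²_1k/n_1k + σ²_0k/n_0k) subject to a total sample-size constraint, in which case the optimal arm-by-stratum allocation is proportional to sqrt(w_k)·σ_tk. The function name and the exact objective are our own illustrative assumptions, not the precise expressions in (2):

```python
import numpy as np

def neyman_allocation(w, sigma1, sigma0, N):
    """Allocate N units across strata and arms to minimize the weighted
    risk sum_k w_k * (sigma1_k**2 / n1_k + sigma0_k**2 / n0_k) subject
    to sum_k (n1_k + n0_k) = N.  The Lagrangian conditions give
    n_tk proportional to sqrt(w_k) * sigma_tk."""
    w, sigma1, sigma0 = map(np.asarray, (w, sigma1, sigma0))
    raw1 = np.sqrt(w) * sigma1
    raw0 = np.sqrt(w) * sigma0
    scale = N / (raw1.sum() + raw0.sum())
    return raw1 * scale, raw0 * scale  # continuous; round in practice

# Two equally weighted strata; stratum 2 is twice as variable,
# so it receives twice the sample in each arm.
n1, n0 = neyman_allocation([0.5, 0.5], [0.1, 0.2], [0.1, 0.2], 600)
# n1 = [100, 200], n0 = [100, 200]
```

As in the classical Neyman result, doubling a stratum's standard deviation doubles its share of the budget, holding weights fixed.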
Assumption 3 guarantees shared potential outcome variances across the observational and RCT datasets. One approach would be to obtain pilot estimates of and from the observational study and then plug them into the expressions in (2) to determine the allocation of units in the RCT. We refer to this approach as the “naïve allocation.” However, any estimate of the variances derived from the observational study should be treated with caution. Our assumptions do not preclude the possibility of unmeasured confounding, which can introduce substantial bias into the pilot estimation step. Hence, we would be better served by a framework that explicitly accounts for uncertainty in the pilot estimates.
A number of heuristic approaches are appealing. The experimenter might, for example, take a weighted average between the naïve allocation and a design that allocates units equally across strata and treatment arms. Such an approach would rely on a subjective weighting to account for the possibility of unmeasured confounding, and would be difficult to calibrate in practice. Alternatively, the experimenter might seek to develop confidence regions for the pilot variance estimates and solve for the best possible allocation consistent with these regions. But such an approach would be fundamentally optimistic, ignoring the possibility that the variances could take on more adversarial values.
We argue that the problem is somewhat asymmetric. Were the experimenter to ignore the observational data and use a sensible default allocation – e.g., equal allocation – they might lose some efficiency, but they would likely obtain a fairly good estimate of the stratum effects. Hence, we argue that one should incorporate the observational data somewhat cautiously and seek a strong guarantee that doing so will not make the estimate worse. Decision theory provides an attractive framework in the form of regret minimization [28,29]. In this framework, a decision-maker chooses between multiple prospects and cares about not only the received payoff but also the foregone choice. If the foregone choice would have yielded a higher payoff than the chosen one, the decision-maker experiences regret. Decisions are made to minimize the maximum possible regret.
In our case, the decision is on how to allocate units in our RCT. One choice is an allocation informed by the observational study. The other is a “default” allocation against which we seek to compare. Denote the default values as and , where a common choice would be equal allocation, for all ; or weighted allocation for all . Regret is defined as the difference between the risk of our chosen allocation and the default allocation,
Choosing this as our objective, we can now begin to formulate an optimization problem.
Suppose we can capture our uncertainty about the variances via a convex constraint, indexed by a user-defined sensitivity parameter Γ,
where . We could then obtain the regret-minimizing unit allocations as the solution to
Crucially, observe that the objective in Problem (3) can be set to zero by choosing the default allocation in every stratum, and this allocation must satisfy the sample size constraint by definition. Hence, the problem will only return an allocation other than the default in the case that such an allocation outperforms the default under all values of the variances that satisfy the constraints. This captures our intuition about the asymmetry of the problem.
Defining and solving Optimization Problem (3) will be the goal of the remainder of this article.
4.2 Tractable case: binary outcomes
To construct our confidence regions, we will extend recent sensitivity analysis results from Zhao et al.
The authors consider the case of causal estimation via inverse probability of treatment weighting (IPW). They focus on observational studies and consider the case where unmeasured confounding is present. To quantify this confounding, they rely on the marginal sensitivity model of Tan . In this model, the degree of confounding is summarized by a single researcher-chosen value, Γ, which bounds the odds ratio between the treatment probability conditional on the potential outcomes and covariates and the treatment probability conditional only on covariates. The Tan model extends the widely used Rosenbaum sensitivity model to the setting of IPW.
Zhao and co-authors focus on developing valid confidence intervals for the ATE even when Γ-level confounding may be present. They offer two key insights. First, they demonstrate that for any choice of Γ, one can efficiently compute upper and lower bounds on the true potential outcome means via linear fractional programming. These bounds, referred to as the “partially identified region,” quantify the possible bias in the point estimate of the ATE. Second, the authors show that the bootstrap is valid in this setting. Hence, they propose drawing repeated bootstrap replicates; computing extrema within each replicate using their linear fractional programming approach; and then taking the relevant quantiles of these extrema. This procedure yields a valid confidence region for the ATE at the desired level.
Practically speaking, the choice of Γ is crucial in establishing the appropriate width of the confidence intervals. A common approach is to calibrate the choice of Γ against the disparities in treatment probability caused by omitting any of the observed variables [33,34]. The central logic of this approach is that unobserved covariates are unlikely to have affected the treatment probability more than any of the relevant measured covariates that are available in the dataset. A broader treatment of how to choose sensitivity parameters can be found in the study by Hsu and Small .
We adapt this approach to our setting in the case of binary outcomes. Note that if the outcomes take values in {0, 1}, then each potential outcome variance can be expressed directly as a function of the corresponding potential outcome mean, via σ² = μ(1 − μ).
As the work of Zhao et al. provides the necessary machinery to bound mean estimates, we can exploit this relationship between the means and variances to bound variance estimates. In particular, we can show that the bootstrap is also valid if our estimand is the variance μ(1 − μ) rather than the mean μ itself, for each treatment arm and stratum. Computing the extrema is also straightforward. Note that the function μ(1 − μ) is monotonically increasing in μ when μ < 1/2 and monotonically decreasing when μ > 1/2. Hence, if we use the method of ref.  to solve for a partially identified region for the means, we can equivalently compute such intervals for the variances.
Denote U_μ as the upper bound and L_μ as the lower bound computed for a given mean, and denote U_σ² and L_σ² as the analogous quantities for the corresponding variance. We apply the following logic:
If U_μ ≤ 1/2, set L_σ² = L_μ(1 − L_μ) and U_σ² = U_μ(1 − U_μ). (4)
If L_μ ≥ 1/2, set L_σ² = U_μ(1 − U_μ) and U_σ² = L_μ(1 − L_μ). (5)
If L_μ < 1/2 < U_μ, set L_σ² = min{L_μ(1 − L_μ), U_μ(1 − U_μ)} and U_σ² = 1/4. (6)
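The case logic in (4)–(6) is mechanical and can be sketched directly. Below is a minimal illustration, assuming the inputs are lower and upper bounds on a Bernoulli mean and applying the map μ ↦ μ(1 − μ); the function name is our own:

```python
def variance_bounds(mu_lo, mu_hi):
    """Convert bounds on a binary-outcome mean into bounds on its
    variance mu * (1 - mu).  The map is increasing on [0, 1/2] and
    decreasing on [1/2, 1], so there are three cases depending on
    where the mean interval sits relative to 1/2."""
    f = lambda p: p * (1.0 - p)
    if mu_hi <= 0.5:              # map increasing on the whole interval
        return f(mu_lo), f(mu_hi)
    if mu_lo >= 0.5:              # map decreasing on the whole interval
        return f(mu_hi), f(mu_lo)
    # interval straddles 1/2: the maximum 1/4 is attained at mu = 1/2
    return min(f(mu_lo), f(mu_hi)), 0.25

# variance_bounds(0.1, 0.3) -> approximately (0.09, 0.21)
# variance_bounds(0.4, 0.7) -> approximately (0.21, 0.25)
```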
Hence, we propose the following procedure for deriving valid confidence regions for the pair of stratum variances for each choice of Γ:
Draw bootstrap replicates from the units.
Within each replicate, compute the extrema for the treated and control variances using the linear fractional programming approach together with the monotone transformation above.
Each replicate can now be represented as a rectangle in the plane, where one axis represents the value of the treated-arm variance and the other the control-arm variance, and the vertices correspond to the extrema. Any set such that the desired proportion of the rectangles have all four corners included in the set will asymptotically form a valid confidence set at the corresponding level.
Note that the final step does not specify the shape of the confidence set (it need not even be convex). For simplicity, we compute the minimum volume ellipsoid containing all vertices, then shrink the ellipsoid toward its center until only the desired proportion of the rectangles have all four of their vertices included. For details on constructing the ellipsoids (sometimes known as Löwner–John ellipsoids), see ref. . Observe that this is by no means the smallest valid confidence set, but it is convex and easy to work with numerically. In Appendix E, we briefly discuss the use of rectangular confidence regions, finding that results are substantively similar.
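A simplified version of this shrink-toward-center construction, using the axis-aligned boxes discussed in Appendix E rather than ellipsoids (a true Löwner–John ellipsoid requires a dedicated solver), might look as follows; the function and its interface are illustrative assumptions:

```python
import numpy as np

def shrunken_box(rects, alpha=0.10, steps=200):
    """rects: (B, 4) array with columns (x_lo, x_hi, y_lo, y_hi), one
    rectangle of bootstrap extrema per replicate.  Start from the
    bounding box of all vertices and shrink it toward its center,
    keeping the smallest box that still fully contains at least a
    (1 - alpha) fraction of the rectangles."""
    rects = np.asarray(rects, dtype=float)
    x_lo, x_hi = rects[:, 0].min(), rects[:, 1].max()
    y_lo, y_hi = rects[:, 2].min(), rects[:, 3].max()
    cx, cy = (x_lo + x_hi) / 2, (y_lo + y_hi) / 2

    def box_at(s):  # the bounding box scaled by s about its center
        return (cx + s * (x_lo - cx), cx + s * (x_hi - cx),
                cy + s * (y_lo - cy), cy + s * (y_hi - cy))

    best = box_at(1.0)
    for s in np.linspace(1.0, 0.0, steps):
        bx_lo, bx_hi, by_lo, by_hi = box_at(s)
        inside = ((rects[:, 0] >= bx_lo) & (rects[:, 1] <= bx_hi) &
                  (rects[:, 2] >= by_lo) & (rects[:, 3] <= by_hi))
        if inside.mean() < 1 - alpha:
            break
        best = (bx_lo, bx_hi, by_lo, by_hi)
    return best
```

In practice one would also intersect the returned box with the hard bound of 0.25 on each binary-outcome variance, as done for the ellipsoids in Figure 1.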
In Figure 1, we demonstrate this procedure on simulated data at a fixed value of Γ. We suppose there are four strata, each containing 1,000 observational units. The strata differ in their treatment probabilities, with 263, 421, 564, and 739 treated units in each stratum, respectively. The large black dot at the center of each cluster represents the point estimate. In purple, we plot the rectangles corresponding to the extrema computed in each of 200 bootstrap replicates drawn from the data. The dashed ellipsoids represent 90% confidence sets. In the cases of strata 2 and 3, the ellipsoids extend beyond the upper bound of 0.25 in at least one direction, so we intersect the ellipsoids with the hard boundary at 0.25. The resulting final confidence sets are all convex.
The objective is convex in the allocation variables and affine (and thus concave) in the variances. Now, having obtained convex constraints, we can invoke von Neumann’s minimax theorem to switch the order of the minimization and maximization. Hence, the solution to Problem (3) is equivalent to the solution of
But the inner problem has an explicit solution, given by the expressions in (2). Plugging in these expressions, we arrive at the simplified problem:
Problem (7) is concave. See Appendix C for a detailed proof. The solution is non-trivial, owing to the fact that the problem is not DCP-compliant. Nonetheless, a simple projected gradient descent algorithm is guaranteed to converge under very mild conditions given the curvature of the objective. Similarly, under mild conditions, the convergence rate can be shown to be linear (see e.g. ref. ), meaning that the distance to the optimum declines geometrically in the number of steps taken by the algorithm. Hence, we can efficiently solve this problem.
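The projected gradient iteration referenced here follows a simple template: take a gradient step, then project back onto the feasible set. The sketch below uses a toy concave objective and a box constraint standing in for the confidence-set constraints (the hard bound of 0.25 on a binary-outcome variance motivates the box); all names are illustrative:

```python
import numpy as np

def projected_gradient_ascent(grad, project, x0, lr=0.01, iters=500):
    """Maximize a concave objective over a convex set: gradient step,
    then Euclidean projection back onto the feasible set."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(iters):
        x = project(x + lr * grad(x))
    return x

# Toy example: maximize -||x - c||^2 over the box [0, 0.25]^2.
c = np.array([0.4, 0.1])
grad = lambda x: -2.0 * (x - c)             # gradient of the objective
project = lambda x: np.clip(x, 0.0, 0.25)   # projection onto the box
x_star = projected_gradient_ascent(grad, project, [0.1, 0.1])
# x_star is approximately [0.25, 0.1]: the unconstrained maximizer c
# is clipped to the boundary in its first coordinate.
```

For the actual design problem, the projection would be onto the (convex) confidence sets constructed above rather than a box.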
5 Application to the data from the Women’s Health Initiative
To evaluate our methods in practice, we make use of data from the Women’s Health Initiative (WHI), a 1991 study of the effects of hormone therapy on postmenopausal women. The study included both an RCT and an observational study. A total of 16,608 women were included in the trial, with half randomly selected to take 0.625 mg of estrogen and 2.5 mg of progestin, and the remainder receiving a placebo. A corresponding 53,054 women in the observational component of the WHI were deemed clinically comparable to women in the trial. About a third of these women were using estrogen plus progestin, while the remaining women in the observational study were not using hormone therapy.
We investigate the effect of the treatment on incidence of coronary heart disease. We split the data into two non-overlapping subsets, which we term the “gold” and “silver” datasets. We estimate the probability of treatment for observational units via fitted propensity scores. The data split is the same as the one used in ref. . Details on the construction of these data elements can be found in Section A.2, while further details about the WHI can be found in Section A.1.
To choose our subgroups for stratification, we utilize the clinical expertise of researchers in the study’s writing group. The trial protocol highlights age as an important subgroup variable to consider, while subsequent work considered a patient’s history of cardiovascular disease. To evaluate the impact of a clinically irrelevant variable, we also consider langley scatter, a measure of solar irradiance at each woman’s enrollment center, which is not plausibly related to baseline incidence or treatment effect. Langley scatter exhibits no association with the outcome in the observational control population: a Pearson’s Chi-squared test yields a p-value of 0.89. The analogous tests for age and history of cardiovascular disease have p-values below .
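The association screen above is a standard Pearson chi-squared test of independence on the contingency table of stratification-variable levels against the binary outcome. A minimal numpy sketch (the p-value would come from the upper tail of the chi-squared distribution with the returned degrees of freedom, e.g. via scipy.stats.chi2.sf); the function is our own illustration:

```python
import numpy as np

def pearson_chi2(table):
    """Pearson chi-squared statistic for an R x C contingency table
    (rows: levels of the stratification variable; columns: outcome).
    Returns the statistic and the degrees of freedom (R-1)(C-1)."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()  # counts expected under independence
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof

# A perfectly balanced table shows no evidence of association.
stat, dof = pearson_chi2([[10, 10], [10, 10]])
# stat = 0.0, dof = 1
```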
The age variable has three levels, corresponding to whether a woman was in her 50s, 60s, or 70s. The cardiovascular disease history variable is binary. The langley scatter variable has five levels, corresponding to strata between 300 and 500 langleys of irradiance. We provide brief summaries of these variables in Tables A.7, A.8, and A.9 in Section A.3.
The RCT gold dataset is used to estimate “gold standard” stratum causal effects. We suppose that the observational study is being used to assist the design of an experiment of 1,000 units. In all cases, the default allocation is an equal allocation across strata and treatment statuses.
We face the additional challenge of choosing the appropriate value of Γ. The WHI provides a very rich set of covariates, and our propensity model incorporates more than 50 variables spanning the demographic and clinical domains (see details in Section A.2). Hence, we will run our algorithm at Γ = 1 (reflecting no residual confounding) as well as at larger values up to 2.0 (reflecting a modest amount).
5.2 Detailed example: Γ = 1.5, fine stratification
We show one example in detail, in which we choose Γ = 1.5 and stratify on all three subgroup variables: age, history of cardiovascular disease, and langley scatter. The cross-product of these variables yields 30 strata, which we suppose are weighted equally. We number these groups from 1 through 30.
In the top panel of Figure 2, we show a naïve RCT allocation based purely on the pilot estimates of the stratum potential outcome variances from the observational study. In the bottom panel, we show the regret minimizing allocations. Visually, it is clear that we have heavily shrunk the allocations toward an equally allocated RCT, but there remain some strata where we recommend over- or under-sampling. Note, too, that the shrinkage is not purely reflective of the magnitude of the pilot estimate, since the number of observational units from each stratum and treatment status also influences the width of our confidence region for each of the pilot estimates.
To investigate the utility of our regret-minimizing allocations, we sample pseudo-experiments of 1,000 units from the RCT silver dataset 1,000 times with replacement. We do so under three designs: equal allocation by strata; naïve allocation based on the pilot estimates; and the regret-minimizing allocations under Γ = 1.5. We compute the average loss when compared against the gold standard estimates derived from the RCT gold dataset. Results are shown in Figure 3. Our method yields a modest reduction in average loss (3.6%) relative to the naïve design. It also outperforms the equal design, though by a slimmer margin (1.6%).
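The resampling evaluation just described can be sketched as follows. The data structures and the weighted squared-error loss below are our own illustrative assumptions (the paper’s loss is the weighted loss defined in Section 3.3):

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_experiment_loss(silver, gold_effects, alloc, w, reps=1000):
    """Average loss of a design over resampled pseudo-experiments.
    silver[k]: dict of "treated"/"control" outcome arrays for stratum k;
    gold_effects[k]: gold-standard effect for stratum k;
    alloc[k]: (n1_k, n0_k) units assigned to each arm in stratum k;
    w[k]: stratum weight.  Loss is the weighted squared error of the
    stratum difference-in-means against the gold-standard effects."""
    losses = []
    for _ in range(reps):
        loss = 0.0
        for k, (n1, n0) in alloc.items():
            y1 = rng.choice(silver[k]["treated"], size=n1, replace=True)
            y0 = rng.choice(silver[k]["control"], size=n0, replace=True)
            est = y1.mean() - y0.mean()
            loss += w[k] * (est - gold_effects[k]) ** 2
        losses.append(loss)
    return float(np.mean(losses))
```

The same routine, run under the equal, naïve, and regret-minimizing allocations, yields the comparisons reported in Figure 3 and Tables 1 and 2.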
5.3 Performance over multiple conditions
We now simulate with all possible combinations of the stratification variables. For each choice of a stratification, we sample 1,000 units from the RCT silver dataset with replacement, under equal allocation, naïve allocation, and regret-minimizing allocation with , and 2.0. We then compute the loss versus the “gold standard” estimates derived from the RCT gold dataset.
In Table 1, we summarize the loss of the regret-minimizing allocations relative to equal allocation. We see immediately that the entries are all non-positive. This makes some intuitive sense: the objective in Problem (3) can always be set to 0 by choosing the default allocation in every stratum; hence, the algorithm is designed to guarantee that we cannot do worse than allocating equally. By the same token, many of the gains we see are modest, owing to the conservatism of the regret-minimizing approach. Notably, we seem to achieve the greatest gains when we are stratifying only on clinically relevant variables and using a relatively low value of Γ. We achieve a 5–6% risk reduction at low values of Γ in the fourth row of the table, in which we stratify on the clinically relevant age and cardiovascular disease variables. On the other hand, the algorithm quickly defaults to recommending equal allocation when variables are not clinically relevant. In the third row, in which we stratify only on the irrelevant langley scatter variable, the starred entries correspond to cases in which the regret-minimizing allocation is equal allocation.
Table 1: Loss of the regret-minimizing allocations relative to equal allocation, by stratification variables (columns: subgroup variable(s), equal-allocation loss, and loss relative to equal allocation for each value of Γ; equal-allocation loss for the age, CVD, and langley stratification: 0.008395). For starred entries, the regret-minimizing allocation defaults to equal allocation.
In Table 2, we summarize the loss relative to naïve allocation. Here, our method can underperform a naïve allocation derived from the observational study pilot variance estimates. This can be seen most clearly in the first row of the table, in which we stratify only on the age variable. However, there are two clear trends in the results. First, when we stratify on a variable that turns out not to be clinically relevant, like langley scatter, the naïve allocation is essentially recommending an allocation based on noise from the data; as a result, our regret-minimizing allocations uniformly outperform naïve allocations. Second, the regret-minimizing allocations tend to outperform the naïve allocations as the number of strata grows. We significantly outperform naïve allocation in the final row, which corresponds to stratification on all three variables and a total of 30 strata.
Table 2: Loss of the regret-minimizing allocations relative to naïve allocation, by stratification variables (columns: subgroup variable(s), naïve-allocation loss, and loss relative to naïve allocation for each value of Γ; naïve-allocation loss for the age, CVD, and langley stratification: 0.008574).
Recall that as Γ rises, the feasible set of Optimization Problem (3) grows larger. Hence, we expect the allocation to be closer to the naïve allocation for smaller values of Γ, but to be regularized more toward the default allocation for larger values of Γ. For large Γ, we would thus expect the loss to converge to the equal allocation loss. This is precisely what we see in Table 1: for each possible stratification, the performance is closest to that of the default allocation at the largest value, Γ = 2.0. However, in Table 2, we do not see the inverse pattern – that is, performance is not uniformly closest to that of the naïve allocation at the smallest value, Γ = 1. This is because the confidence set does not collapse to a single point at Γ = 1; rather, it incorporates the possibility of variance but not bias in the pilot estimation. More broadly, we do not expect a monotone relationship between Γ and the average loss. In many cases, the pilot estimates will be somewhat informative, but incorporate some bias. Hence, we may see the lowest average loss at intermediate values of Γ, which encourage the algorithm to extract some relevant information from the pilot data without relying too heavily on these estimates.
While these simulation results show modest performance gains, they are encouraging. A wise analyst would be cautious about designing an RCT exclusively using observational study pilot estimates of potential outcome variances. Because such pilot estimates can have both bias and variance, relying too heavily upon them might waste resources. Our framework allows data from the observational study to be incorporated into the RCT design while guarding against the possibility of underperforming a default allocation.
We briefly discuss potential extensions of this work.
One natural consideration is the case of multiple treatment levels, rather than the binary setting of treatment versus control. The machinery discussed in this manuscript naturally extends to the multilevel case. If we suppose there are k treatment levels, then we instead optimize over the sample sizes and stratum potential outcome variances for each of the k levels. The optimization problem becomes:
The curvature of Problem (8) is unchanged from that of Problem (3), so we can use the same von Neumann trick to obtain a readily solvable concave maximization problem. The only remaining complexity is the construction of the confidence sets. The procedure described in Section 4.2 can be easily generalized to the multilevel case, with the bounds derived from each bootstrap replicate now represented as a box in as many dimensions as there are treatment levels, rather than a rectangle. The proof in Appendix B does not depend on the problem’s dimensionality, so we can again obtain asymptotic (1 − α)-level validity for any confidence set drawn to include a proportion 1 − α of the boxes. The method of drawing Löwner–John ellipsoids also generalizes to dimensions greater than two, so we can use this exact procedure to obtain our confidence sets.
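The minimum-volume (Löwner–John) enclosing ellipsoid of a point cloud can be computed in any dimension with Khachiyan's algorithm. A minimal numpy sketch (our own implementation, not the paper's code):

```python
import numpy as np

def min_volume_ellipsoid(points, tol=1e-5):
    """Khachiyan's algorithm for the minimum-volume (Löwner-John) enclosing
    ellipsoid of a point cloud in any dimension d.

    Returns (c, A) with (x - c)' A (x - c) <= 1 (approximately, up to tol)
    for every input point.
    """
    P = np.asarray(points, dtype=float)
    n, d = P.shape
    Q = np.hstack([P, np.ones((n, 1))]).T  # lifted points, (d+1) x n
    u = np.full(n, 1.0 / n)                # weights on the points
    err = tol + 1.0
    while err > tol:
        X = Q @ (u[:, None] * Q.T)         # (d+1) x (d+1) moment matrix
        M = np.einsum('ij,ji->i', Q.T @ np.linalg.inv(X), Q)
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        err = float(np.linalg.norm(new_u - u))
        u = new_u
    c = P.T @ u                            # ellipsoid center
    A = np.linalg.inv(P.T @ (u[:, None] * P) - np.outer(c, c)) / d
    return c, A
```

Applying this to the corners of the bootstrap boxes yields the enclosing ellipsoids in any number of dimensions.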
Another obvious extension is to the more general case in which the treatment probabilities are not known and must be estimated. In keeping with the theme of IPW estimation, we consider estimators of the form
where the inverse weights are the true treatment probabilities. Such estimators are asymptotically unbiased.
We suppose we estimate these probabilities with fitted propensity scores, defined as
We typically use logistic regression to estimate the propensity scores, so that the fitted values take a logistic form in the covariates.
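A minimal propensity-fitting sketch (plain gradient ascent on the logistic log-likelihood; in practice a standard GLM routine would be used):

```python
import numpy as np

def fit_propensity(X, z, n_iter=2000, lr=0.5):
    """Fit logistic-regression propensity scores by gradient ascent on the
    mean log-likelihood. X: n x p covariate matrix; z: treatment indicator.
    (A sketch; library GLM solvers converge much faster via IRLS.)"""
    Xb = np.hstack([np.ones((len(z), 1)), X])  # prepend an intercept
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        beta += lr * Xb.T @ (z - p) / len(z)   # score of the log-likelihood
    return 1.0 / (1.0 + np.exp(-Xb @ beta))    # fitted propensity scores
```

At the maximum likelihood solution the intercept score equation forces the fitted scores to average to the observed treatment rate.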
We account for the possibility of Γ-level unmeasured confounding by allowing the true probability to satisfy
We redefine the problem in terms of transformed variables, each an affine function of the original decision variables. Collecting the relevant coefficients into vectors for the treated and control observations, we can express the equations in (9) as quadratic fractional programs, e.g.,
We have few guarantees on the curvature of the problem: the numerators will be neither convex nor concave in the transformed variables as long as the coefficient vectors are linearly independent. The denominators will be convex in these variables. This poses a major challenge. Quadratic fractional programming problems can be solved efficiently in some special cases, but are, in general, NP-hard.
One avenue is to apply Dinkelbach’s method to transform the quadratic fractional problem into a series of quadratic programming problems. This would not immediately yield a solution because of the indefinite numerator, but it would potentially allow one to make use of considerable recent work on solution methods in quadratic programming (see, e.g., ref.). This path represents a possible future extension of this work.
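As a toy illustration of Dinkelbach's scheme, applied to a one-dimensional fractional program rather than the design problem itself:

```python
def dinkelbach(num, den, argmin_sub, x0, tol=1e-10, max_iter=100):
    """Dinkelbach's method for min_x num(x) / den(x), assuming den(x) > 0.

    argmin_sub(lam) must solve the parametric subproblem
    min_x num(x) - lam * den(x).
    """
    x = x0
    lam = num(x) / den(x)
    for _ in range(max_iter):
        x = argmin_sub(lam)
        if abs(num(x) - lam * den(x)) < tol:
            break  # F(lam) = 0, so lam is the optimal ratio
        lam = num(x) / den(x)
    return x, lam

# Toy fractional program: minimize (x^2 + 1) / (x + 2) over x >= 0.
num = lambda x: x * x + 1.0
den = lambda x: x + 2.0
sub = lambda lam: max(lam / 2.0, 0.0)  # closed-form subproblem minimizer
x_star, ratio = dinkelbach(num, den, sub, x0=1.0)
```

In the design problem, each `argmin_sub` call would itself be a quadratic program, and an indefinite numerator would make those subproblems nonconvex.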
We thank Mike Baiocchi, Guillaume Basse, and Luke Miratrix for their useful comments and discussion.
Funding information: Evan Rosenman was supported by Google, and by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. This work was also supported by the NSF under grants DMS-1521145, DMS-1407397, and IIS-1837931.
Conflict of interest: The authors have no conflicts of interest to declare.
A.1 Further details about the Women’s Health Initiative
We evaluate our estimators on data from the Women’s Health Initiative to estimate the effect of hormone therapy on coronary heart disease (CHD). The Women’s Health Initiative is a study of postmenopausal women in the United States, consisting of RCT and observational study components with 161,808 total women enrolled. Eligibility and recruitment data for the WHI can be found in the results of previous studies [3,46]. Participants were women between 50 and 79 years old at baseline, who had a predicted survival of at least 3 years and were unlikely to leave their current geographic area for 3 years.
Women with a uterus who met various safety, adherence, and retention criteria were eligible for a combined hormone therapy trial. A total of 16,608 women were included in the trial, with 8,506 women randomized to take 0.625 mg of estrogen and 2.5 mg of progestin daily, and the remainder receiving a placebo. A corresponding 53,054 women in the observational component of the Women’s Health Initiative had an intact uterus and were not using unopposed estrogen at baseline, thus rendering them clinically comparable. About a third of these women were using estrogen plus progestin, while the remaining women in the observational study were not using hormone therapy.
Participants received semiannual contacts and annual in-clinic visits for the collection of information about outcomes. Disease events, including CHD, were first self-reported and later adjudicated by physicians. We focus on outcomes during the initial phase of the study, which extended for an average of 8.16 years of follow-up in the RCT and 7.96 years in the observational study.
The overall rate of coronary heart disease in the trial was 3.7% in the treated group (314 cases among 8,472 women reporting) versus 3.3% (269 cases among 8,065 women reporting) for women not randomized to estrogen and progestin. In the observational study, the corresponding rates were 1.6% among treated women (706 out of 17,457 women reporting) and 3.1% among control women (1,108 out of 35,408 women reporting). Our methodology compares means and not survival curves. In the initial follow-up period, death rates were relatively low in both the observational study (6.4%) and the randomized trial (5.7%). Hence, we do not correct for the possibility of these deaths censoring coronary heart disease events.
A.2 Propensity score construction, covariate balance, and gold standard effects
The Women’s Health Initiative researchers collected a rich set of covariates about the participants in the study. For the purposes of computational speed, we narrow to a set of 684 variables, spanning demographics, medical history, diet, physical measurements, and psychosocial data collected at baseline.
The most meaningful measure of covariate imbalance can be found by looking at clinically relevant factors. Prentice et al.  found that hormone therapy users in the observational study were more likely to be Caucasian or Asian/Pacific Islander, less likely to be overweight, and more likely to have a college degree. These imbalances strongly suggest that applying a naïve differencing estimate to the observational data will yield an unfairly rosy view of the effect of hormone therapy on CHD.
To generate our estimators for this dataset, we need a propensity model to map the observed covariates to an estimated probability of receiving the treatment in the observational study. We used logistic regression to generate an expressive model while limiting overfitting. A heuristic procedure was used for careful construction of the propensity scores. Full details can be found in ref.
Matching on the propensity score should reduce imbalances on clinically relevant covariates. We provide a summary of these diagnostics, with additional details again available in ref. We use standardized differences to measure imbalance, as advocated by Rosenbaum.
We compute the standardized differences between treated and control on risk factors listed in ref., before and after adjusting for the propensity score by grouping the units into ten equal-width propensity score strata. With the exception of the physical functioning score, all evaluated covariates were included in the propensity model. Imbalance measures for the continuous covariates can be found in Table A1. The stratification procedure reduces all standardized differences to less than 0.05 in absolute value, representing very good matches between the populations.
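The diagnostic can be sketched as follows (a simplified version; the equal-width strata and the pooled standard deviation convention mirror common practice but the cited implementation may differ in details):

```python
import numpy as np

def standardized_difference(x, z):
    """Absolute standardized difference of covariate x between arms z=1 and z=0."""
    s = np.sqrt((x[z == 1].var(ddof=1) + x[z == 0].var(ddof=1)) / 2.0)
    return abs(x[z == 1].mean() - x[z == 0].mean()) / s

def stratified_standardized_difference(x, z, e_hat, n_strata=10):
    """Standardized difference after grouping units into equal-width
    propensity-score strata: a size-weighted average of within-stratum
    mean differences, scaled by the pooled standard deviation."""
    edges = np.linspace(e_hat.min(), e_hat.max(), n_strata + 1)
    labels = np.clip(np.digitize(e_hat, edges[1:-1]), 0, n_strata - 1)
    s = np.sqrt((x[z == 1].var(ddof=1) + x[z == 0].var(ddof=1)) / 2.0)
    diff = 0.0
    for k in range(n_strata):
        idx = labels == k
        if idx.sum() == 0 or z[idx].min() == z[idx].max():
            continue  # stratum lacks one of the arms
        diff += idx.mean() * (x[idx & (z == 1)].mean() - x[idx & (z == 0)].mean())
    return abs(diff) / s
```

When the propensity score captures the selection, the stratified difference should shrink well below the raw difference.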
Table A1: Covariate means and standardized differences, before and after stratifying on the propensity score (e.g., age at menopause: means 50.49 vs 50.19 with standardized difference 0.06 before stratifying; 50.35 vs 50.33 with standardized difference 0.02 after).
For categorical variables, the stratification procedure reweights individual women, such that the effective proportion of women in each category changes after stratifying on the propensity score. Standardized differences can also be computed for categorical variables, using the procedure described in Graziano et al. We achieve similar balance on two significant categorical variables – ethnicity and smoking status – in Tables A2 and A3.
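One common multivariate convention for categorical covariates is a Mahalanobis-type statistic on the category-proportion vectors; a sketch of that convention (we have not verified it matches the cited procedure exactly):

```python
import numpy as np

def categorical_standardized_difference(p1, p0):
    """Mahalanobis-type standardized difference between two vectors of
    category proportions (one reference category dropped beforehand)."""
    p1, p0 = np.asarray(p1, dtype=float), np.asarray(p0, dtype=float)
    d = p1 - p0
    if not d.any():
        return 0.0
    # Average of the two multinomial covariance matrices.
    S = (np.diag(p1) - np.outer(p1, p1) + np.diag(p0) - np.outer(p0, p0)) / 2.0
    return float(np.sqrt(d @ np.linalg.solve(S, d)))
```

This reduces to the familiar two-proportion standardized difference when there are only two categories.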
Tables A2 and A3: Effective category proportions and standardized differences for ethnicity (White, Black, Latino, AAPI, Native American, missing/other) and smoking status (never, past, and current smokers).
Finally, we consider estimation of the “gold standard” causal effect. We randomly partition the randomized trial data into two subsets of equal size, such that each contains the same number of treated and control women. We select one of these subsets and refer to it as our “gold” dataset, to be used for estimating the true causal effect. The remaining subset is referred to as the “silver” dataset and is used for evaluating our estimators.
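The partition can be sketched as a split that is random within each arm (indices and names here are our own):

```python
import numpy as np

def gold_silver_split(z, rng):
    """Randomly split units into two halves, preserving (up to rounding)
    the number of treated and control units in each half."""
    gold = np.zeros(len(z), dtype=bool)
    for arm in (0, 1):
        idx = rng.permutation(np.flatnonzero(z == arm))
        gold[idx[: len(idx) // 2]] = True
    return gold  # True -> "gold" half, False -> "silver" half
```

Splitting within each arm guarantees the two halves have matching treated and control counts, so neither subset is systematically richer in one arm.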
Additional balance tables: age at menopause (means 44.97 vs 46.33), ethnicity category proportions, and smoking status, each with standardized differences.
A.3 Stratification variable distributions
| Age | Observational study | RCT | RCT silver dataset |
| --- | --- | --- | --- |
| 50–59 | 17,447 (33.0%) | 5,491 (33.2%) | 2,806 (33.9%) |
| 60–69 | 23,030 (43.6%) | 7,473 (45.2%) | 3,689 (44.6%) |
| 70–79 | 12,388 (23.4%) | 3,573 (21.2%) | 1,774 (21.5%) |

| History of CVD | Observational study | RCT | RCT silver dataset |
| --- | --- | --- | --- |
| Yes | 8,709 (16.5%) | 1,828 (11.1%) | 900 (10.9%) |
| No | 44,156 (83.5%) | 14,709 (88.9%) | 7,369 (89.1%) |

| Langley scatter | Observational study | RCT | RCT silver dataset |
| --- | --- | --- | --- |
| 300–325 | 15,599 (29.5%) | 4,854 (29.4%) | 2,411 (29.2%) |
| 350 | 12,521 (23.7%) | 3,917 (23.7%) | 1,935 (23.4%) |
| 375–380 | 5,841 (11.0%) | 1,858 (11.2%) | 934 (11.3%) |
| 400–430 | 8,216 (15.5%) | 2,585 (15.6%) | 1,310 (15.8%) |
| 475–500 | 10,688 (20.2%) | 3,323 (20.1%) | 1,679 (20.3%) |
B Proof of validity of confidence regions
We hew closely to the proofs provided in ref. The primary proofs consider the missing data problem, which is equivalent to estimating the mean of either of the potential outcomes in our setting. We begin by providing a summary of key proofs. All references to Remarks, Lemmas, etc. in Section B.1 refer to the text of ref.
B.1 Review of proofs in Zhao et al. 
The authors define the probability of treatment given both covariates and outcome, and compare it against the probability of treatment given covariates alone. They use a different symbol for the treatment indicator than we do, so in keeping with their notation:
Then, for any choice of Λ, they define a collection of sensitivity models
where OR(·, ·) denotes the odds ratio. This model was originally introduced by Tan. Per Proposition 7.1, it is related to the widely used Rosenbaum sensitivity model. In keeping with that model, we use Γ rather than Λ to denote our sensitivity parameter in the text, but retain the Λ notation throughout this proof.
Via Remark 3.2, Zhao and co-authors reparameterize the problem such that each model in the collection corresponds to a choice of a shift function h, the logit-scale difference between the observed selection probability and the complete-data selection probability. So, we can alternatively write:
In words: every choice of h defines, at each possible value of the covariates and outcome, a discrepancy between the observed and complete-data selection probabilities. The choice of Λ bounds the maximum of those discrepancies. So, as Λ grows, we are allowing for greater and greater discrepancies in these probabilities.
For each choice of h, they define a “shifted estimand,”
where the expectation is over the joint distribution of the covariates, outcome, and treatment indicator. The corresponding “shifted estimator” is given by
The sum is over a sample of points drawn i.i.d. from their joint distribution. The quantity in the denominators is obtained by estimating the selection probability and then shifting the estimate on the logit scale by h for the relevant units.
Now, the proof of the validity of their approach proceeds in several stages.
First, they consider the case where data-dependent intervals are asymptotically guaranteed to contain each shifted estimand with probability 1 − α. They argue that taking the infimum of the lower endpoints and the supremum of the upper endpoints across the sensitivity model yields an interval with asymptotic 1 − α coverage for every shifted estimand in the model (Proposition 4.1).
For each choice of h, they establish that the bootstrap is valid (Theorem 4.2).
First, they use the general theory of Z-estimators to show that the shifted estimator and its bootstrap analogue are asymptotically normal with the same mean and variance (Theorem C.1 and Corollary C.2).
Then, they conclude that by defining the lower confidence limit as the α/2 bootstrap quantile, they have
where the probability is taken under the joint distribution of the data and the bootstrap resampling. Analogous results hold for the upper limit, the (1 − α/2) bootstrap quantile (Section C.3).
They argue that the quantile and infimum/supremum functions can be interchanged, such that
via Lemma 4.3.
B.2 Extension to design case
Our challenge is to extend this argument to the case where our estimand of interest is not a single mean but rather the pair of stratum potential outcome variances. Crucially, we will now have two shift functions, one corresponding to each of the potential outcomes, but they both lie within the same sensitivity model. The definition of the shifted estimand given above generalizes to the case of two shifted estimands in a straightforward way. We extend Proposition 1 from ref. in the following argument.
Suppose there exists a data-dependent region such that
holds for every pair of shift functions in the sensitivity model, where n is the sample size. Under these conditions, the set
is an asymptotic confidence set for the pair of potential outcome variances, with at least 1 − α coverage.
This follows from the fact that, by assumption, the true data-generating distribution corresponds to some pair of shift functions in the sensitivity model.□
Next, we must show that the bootstrap is valid in our setting. We adopt the same model and regularity conditions as Theorem 4.2 in ref. In their proof of Corollary 5.1, the authors show that the pairs of estimated potential outcome means and their bootstrap analogues are jointly asymptotically normal, with the same limiting distribution. We define the function
We can see that applying this function to the tuple of potential outcome means will yield the potential outcome variances, and the same logic holds for applying it to any estimator of the potential outcome means. Moreover, because the function is continuously differentiable, we can use the Delta method to observe immediately that the estimated variances and their bootstrap analogues have the same asymptotic distribution, and thus the bootstrap is valid.
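For a binary outcome, the map from potential-outcome means to variances plausibly takes the Bernoulli form; a sketch of the Delta-method step under that assumption (the symbols $m_1, m_0, \Sigma$ are our notation):

```latex
g(m_1, m_0) = \bigl(m_1(1 - m_1),\; m_0(1 - m_0)\bigr),
\qquad
\nabla g(m_1, m_0) =
\begin{pmatrix} 1 - 2m_1 & 0 \\ 0 & 1 - 2m_0 \end{pmatrix}.
```

If $\sqrt{n}\,(\hat m - m) \Rightarrow \mathcal{N}(0, \Sigma)$, the Delta method gives $\sqrt{n}\,\bigl(g(\hat m) - g(m)\bigr) \Rightarrow \mathcal{N}\bigl(0, \nabla g\, \Sigma\, \nabla g^{\top}\bigr)$, and the same limit holds for the bootstrap analogue.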
Finally, we generalize Lemma 4.3 to our setting. For each possible bootstrap replicate, define the quartet of points
In words, this quartet contains the vertices of a rectangle that defines the extrema of the potential outcome variances consistent with the replicate.
Denote by conv(·) the standard convex hull operator. Define a related operator,
which takes in a collection of point sets as well as an index set. The function returns the convex hull of the points contained in the indexed entries.
We choose an index set containing at least a proportion 1 − α of the bootstrap replicates, and we define the set
The set is an asymptotically valid confidence set.
For each of the possible bootstrap samples, we have that, for every pair of shift functions in the sensitivity model,
Since this holds entrywise, it follows that any set containing a fixed proportion of the sets on the RHS must contain at least that proportion of points on the LHS, and hence
Since this holds for every pair of shift functions, we can take the union on the LHS to observe
Observe that the RHS is simply the set defined above, since any convex set containing the vertices of a rectangle will contain the convex hull of those vertices as well.
On the LHS, we can make use of our bootstrap validity result to observe
It follows from Proposition 1 that the LHS is a valid (1 − α)-level confidence region. Hence, the right-hand side must be as well.
To conclude, we observe that our ellipsoid method must necessarily comprise a superset of a convex hull for some choice of index set. Hence, our method will indeed generate valid confidence regions for the potential outcome variances.□
C Proof of concavity of minimax problem
We begin with the unweighted case and demonstrate concavity by direct computation of the Hessian. Define
The Hessian is given by
We want to consider the eigenvalues of the Hessian. First, observe that at most one eigenvalue can be nonnegative. This follows from the famed Weyl inequalities. The first term has all strictly negative eigenvalues, while the second, being an outer product, has one positive eigenvalue, with all other eigenvalues 0. Denoting by λ_i(M) the ith largest eigenvalue of a matrix M, the Weyl inequalities tell us that
Hence, only one non-negative eigenvalue is possible.
Next, we can use the matrix determinant lemma to observe that
and direct computation tells us that
Hence, the determinant is 0, meaning at least one of our eigenvalues must be zero. Combined with our prior result, this means our maximum eigenvalue must be zero, and we conclude that the Hessian is negative semidefinite. Thus, the objective is indeed concave.
Finally, note that the extension to the weighted case is straightforward. We can simply define new variables via an affine transformation of the originals, and then repeat the proof above in the new variables. Since the transformation is affine, concavity in the original variables follows from concavity in the transformed variables.
D Extension to cost-constrained case
The existing results can easily be extended to the case in which costs vary by treatment status, and we have a constraint in terms of cost rather than sample size. In this setting, we associate with each stratum and treatment status a constant cost per unit. With a total budget constraint, the regret minimization problem now becomes
Crucially, our constraints are still affine, so we can again invoke von Neumann’s minimax theorem to switch the order of the minimization and maximization, yielding the problem
But the inner problem has an explicit solution, given by
Plugging this in yields the simplified problem
Using the same logic as the final paragraph in Appendix C, we immediately see that Problem (11) is concave. Hence, we can use the same projected gradient descent approach to efficiently solve this problem.
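For intuition, the inner solution takes the familiar cost-weighted Neyman form; a sketch in our own notation (stratum weight $w_s$, stratum-arm standard deviation $\sigma_{s,t}$, per-unit cost $c_{s,t}$, budget $C$; the paper's exact expression may differ):

```latex
n_{s,t} \;=\; C \cdot
\frac{w_s\, \sigma_{s,t} / \sqrt{c_{s,t}}}
     {\sum_{s', t'} w_{s'}\, \sigma_{s', t'}\, \sqrt{c_{s', t'}}},
\qquad \text{so that } \sum_{s,t} c_{s,t}\, n_{s,t} = C.
```

This is the allocation that minimizes the stratified-variance objective $\sum_s w_s^2 \bigl(\sigma_{s,1}^2 / n_{s,1} + \sigma_{s,0}^2 / n_{s,0}\bigr)$ subject to the budget constraint: each cell receives sample in proportion to its weighted standard deviation and in inverse proportion to the square root of its cost.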
E Results with rectangular confidence sets
We consider the case of using rectangular confidence sets, rather than ellipsoids. The implementation is straightforward: per the discussion in Section 4.2, we simply draw bootstrap replicates and compute the associated extremal rectangles; then draw the minimum rectangular set that contains all of them; and then shrink the rectangle proportionally toward its center until a proportion 1 − α of the rectangles have all four corners included in the set. Per the results in Appendix B, this procedure yields asymptotically valid (1 − α)-level confidence sets.
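The shrinking step can be sketched as a bisection on the shrinkage factor (our own implementation; the rectangle encoding `[x_lo, x_hi, y_lo, y_hi]` is an assumption):

```python
import numpy as np

def shrink_to_coverage(rects, alpha):
    """Given per-replicate extremal rectangles (n x 4 array of
    [x_lo, x_hi, y_lo, y_hi]), take the smallest enclosing rectangle and
    shrink it proportionally toward its center until just a proportion
    1 - alpha of the rectangles have all four corners inside."""
    R = np.asarray(rects, dtype=float)
    lo = R[:, [0, 2]].min(axis=0)
    hi = R[:, [1, 3]].max(axis=0)
    center, half = (lo + hi) / 2.0, (hi - lo) / 2.0

    def coverage(scale):
        l, h = center - scale * half, center + scale * half
        inside = (R[:, 0] >= l[0]) & (R[:, 1] <= h[0]) \
               & (R[:, 2] >= l[1]) & (R[:, 3] <= h[1])
        return inside.mean()

    # Coverage is monotone in the scale, so bisect on the shrinkage factor.
    lo_s, hi_s = 0.0, 1.0
    for _ in range(60):
        mid = (lo_s + hi_s) / 2.0
        if coverage(mid) >= 1.0 - alpha:
            hi_s = mid
        else:
            lo_s = mid
    l, h = center - hi_s * half, center + hi_s * half
    return np.array([l[0], h[0], l[1], h[1]])
```

Because coverage is monotone in the scale, the bisection converges to the smallest rectangle still covering the desired proportion of replicates.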
We use this procedure on the WHI data and report the results in Tables A10 and A11. We see that results are substantively similar to those in Tables 1 and 2. In particular, all the entries in Table A10 are non-positive. The rectangular confidence sets actually seem to yield slightly larger performance improvements in the case of stratification on a single variable (yielding fewer total strata). However, the performance is slightly poorer when stratifying on multiple variables.
Table A10: Loss relative to equal allocation under rectangular confidence sets, by stratification variables (columns: subgroup variable(s), equal-allocation loss, and loss relative to equal allocation for each value of Γ).