Asymptotic Inference for Optimal Rerandomization Designs

Abstract
Recently, a computation-based experimental design strategy called rerandomization has been proposed as an alternative or complement to traditional blocked designs. The idea of rerandomization is to remove from consideration those allocations with large imbalances in observed covariates according to a balance criterion, and then to randomize within the set of acceptable allocations. Based on the Mahalanobis distance criterion for balancing the covariates, we show that asymptotic inference to the population from which the units in the sample are randomly drawn is possible using only the set of best, or 'optimal', allocations. Finally, we show that for the optimal and near-optimal designs, the quite complex asymptotic sampling distribution derived by Li et al. (2018) is well approximated by a normal distribution.


Introduction
In randomized experiments, the treatment assignment is unconfounded (Rubin 1978; Imbens and Rubin 2015) or, equivalently, strongly ignorable (Rosenbaum and Rubin 1983), which permits valid inference to a large collection of well-defined estimands. For this reason, among others, randomized experiments are seen as the gold standard for causal inference. However, in a single realized randomized allocation, the imbalance in both observed and unobserved covariates can be substantial. This can lead to poor precision and low efficiency in the inference. To reduce these potential imbalances, blocking (also called stratification) on observed covariates has been used in experimental design, especially with a few discrete covariates. An alternative, or complement, to blocking that has received attention lately is to utilize modern computational capabilities in finding the experimental design (see e.g. Morgan and Rubin (2012); Bertsimas et al. (2015); Kallus (2018); Lauretto et al. (2017); Krieger et al. (2019); Kapelner et al. (2020)). The idea in all of these strategies is to use the computer to discard allocations with imbalance in the observed covariates or, alternatively, to find allocations with balance in observed covariates. Kallus (2018) proposed algorithms for finding 'optimal' designs for the estimation of the population average treatment effect (PATE). Bertsimas et al. (2015) and Lauretto et al. (2017) do not discuss inference in their designs. Morgan and Rubin (2012), Krieger et al. (2019), and Kapelner et al. (2020) all used Fisher randomization tests to test for an average effect among the units in the experiment, while Kallus (2018) suggests using bootstrap inference to the PATE but without any formal proof of its validity. The aim of this paper is to clarify the concept of an 'optimal' design for inference to the PATE based on the mean-difference estimator.
In addition, we suggest a strategy for asymptotic inference to the PATE in an optimal design based on the Mahalanobis-based rerandomization strategy suggested in Morgan and Rubin (2012). Here, the operating characteristics, i.e., the expectation, variance, and asymptotic distribution of the mean-difference estimator, are known (Li et al. 2018); the 'optimal' designs suggested by Kallus (2018) include rerandomization based on the Mahalanobis distance as a special case when the relation between the covariates and the potential outcomes is linear (Kallus 2018, Section 2.3.3).
Under complete randomization the mean-difference estimator is an unbiased estimator of the sample average treatment effect (SATE), that is, the average effect among the units in the experiment. Under random sampling of units from the population to the experiment, the estimator is also an unbiased estimator of the PATE. Standard normal asymptotic inference is possible to both SATE and PATE.
With the computer-based designs, the asymptotic distribution of the mean-difference estimator will in general not be normal. To provide intuition for the problem with (asymptotic) inference in these designs, we need to discuss the assumptions underlying the asymptotic distributions for inference to SATE and PATE, respectively. The normal asymptotic inference to SATE is derived using finite population central limit theorems (see e.g. Li and Ding (2017)). The underlying assumption is that of replicated randomization with the sample size n going to infinity. The only stochastic element is thus the allocation of treatments. In an optimal design the number of allocations is reduced to a minimum. The implication is that we cannot derive the asymptotic distribution for the SATE estimand. For instance, in a deterministic design (i.e. one possible allocation) the resulting distribution has zero variance. Note that with a limited set of allocations a Fisher randomization test cannot be conducted either. The assumption used to derive the asymptotic distribution of the mean-difference estimator as an estimator of PATE is that of random sampling to the experiment from a population of size N (N > n), or from a superpopulation. When interest is in conducting inference to PATE, there is no lower limit on the number of possible allocations, as the asymptotic distribution is derived under the assumption of random sampling to the experiment only. The consequence is that we can, in theory, have a deterministic design and still conduct inference to PATE, although no inference is possible regarding an effect in the experiment. This is an anomaly, but it is a consequence of the idea behind Neyman-Pearson inference.
We show that when the cardinality of the set of allocations fulfilling the Mahalanobis-distance covariate-balance criterion is close to its minimum, the asymptotic distribution of the mean-difference estimator for inference to the PATE is, as is the case with complete randomization, normal with known variance. Furthermore, the difference in efficiency compared to using the 'optimal' set is typically very small, which means that using a slightly larger 'near optimal' set admits non-degenerate inference to both SATE and PATE without substantially decreasing the efficiency of estimation of the PATE. Lastly, the large sample asymptotic distribution of the mean-difference estimator is well approximated by a normal distribution also when a larger 'near optimal' set is used. That is, as long as the criterion is selected such that it is small enough according to the evaluation strategies suggested in this paper, standard z-tests can be applied for inference. The implication of these results is important, as the asymptotic inference after Mahalanobis-based rerandomization is simplified compared to what is suggested in Li et al. (2018).
The next section discusses rerandomization using the Mahalanobis metric. Section 3 provides the main results concerning asymptotic inference and Section 4 concludes.

Rerandomization based on the Mahalanobis criterion
In line with Morgan and Rubin (2012), consider a trial with $n$ units, with $n_1$ assigned to treatment and $n_0$ assigned to control. Let $W_i = 1$ or $W_i = 0$ if unit $i$ is assigned treatment or control, respectively, and define $W = (W_1, ..., W_n)'$. Furthermore, let $x_i$, $i = 1, ..., n$, be $K \times 1$ vectors of fixed covariates in the sample and let $X = (x_1, ..., x_n)'$. In a balanced experiment, i.e., $n_1 = n_0 = n/2$, there are $\binom{n}{n_1} = n_A$ possible treatment allocation (assignment) vectors $W^j$, $j = 1, ..., n_A$, and $\mathcal{W} = \{W^1, ..., W^{n_A}\}$ is the complete set of allocations. Note that this set by construction has the 'mirror property', i.e., it can be enumerated as a set of 'mirror allocations' $W^j$ and $1 - W^j$.

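The Mahalanobis distance for allocation $j$ is

$$M(W^j, X) = \frac{n_1 n_0}{n} (\bar{X}_1 - \bar{X}_0)' \, \mathrm{cov}(X)^{-1} (\bar{X}_1 - \bar{X}_0),$$

where $\bar{X}_1$ and $\bar{X}_0$ are the covariate mean vectors of the treated and control units under $W^j$, and $\mathrm{cov}(X)$ is the sample covariance matrix of $X$ (Morgan and Rubin 2012). Following Li et al. (2018), denote by $\hat{\tau}_X = \bar{X}_1 - \bar{X}_0$ the estimator of the covariate mean-difference vector over complete randomization, so that equivalently $M(W^j, X) = \hat{\tau}_X' \, \mathrm{cov}(\hat{\tau}_X)^{-1} \hat{\tau}_X$. Morgan and Rubin (2012) suggested accepting the treatment assignment vector $W^j$ only when

$$M(W^j, X) \le a,$$

where $a$ is a positive constant. Due to the asymptotic normality of the difference in means implied by the Central Limit Theorem (CLT), $M(W^j, X)$ asymptotically follows a $\chi^2_K$ distribution. This means that a quantile of this distribution can be used as an allocation inclusion/exclusion criterion. Let $P(\chi^2_K \le a) = p_a$; then randomizing within the set of the 0.01% best balanced allocations implies setting $a$ so that $p_a = 0.0001$.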
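The mapping from an acceptance probability $p_a$ to the threshold $a$ is a chi-square quantile. The following is a minimal sketch in standard-library Python, not part of the original paper: the helper names `chi2_cdf` and `chi2_quantile` are hypothetical, the CDF is evaluated through the power series of the regularized incomplete gamma function, and the quantile is found by simple bisection.

```python
import math

def chi2_cdf(x, k, terms=500):
    """P(chi2_k <= x) from the power series of the regularized incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    s, h = k / 2.0, x / 2.0
    log_h = math.log(h)
    # P(s, h) = sum_{j>=0} h^(s+j) * exp(-h) / Gamma(s + j + 1), summed on the log scale
    total = sum(math.exp((s + j) * log_h - h - math.lgamma(s + j + 1.0))
                for j in range(terms))
    return min(1.0, total)

def chi2_quantile(p, k, upper=200.0):
    """Smallest a (up to bisection precision) with P(chi2_k <= a) >= p on [0, upper]."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return hi

# Threshold a for the 0.01% best-balanced allocations with K = 5 covariates.
a = chi2_quantile(0.0001, 5)
print(a, chi2_cdf(a, 5))
```

The series is convenient here because it stays accurate for the very small values of $a$ that matter for near-optimal designs; a statistics library's chi-square quantile function would serve equally well.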
The minimum number of allocations in the set of allocations with the smallest Mahalanobis distance (i.e. $M(W^j, X) \simeq 0$ or $p_a \simeq 0$) in an experiment with $n_1 = n_0$ is two,¹ because the Mahalanobis distance of an allocation $W^j$ is always exactly the same as that of its mirror $1 - W^j$. This minimal set, containing only the allocations with the smallest Mahalanobis distance across all possible allocations, is the optimal set in terms of covariate balance. Thus, by the mirror property, and assuming at least one continuous covariate, the minimal set always contains exactly two allocations. This means that the large sample rerandomization criterion that gives the minimum set can be written as

$$p_{a_{\min}} = P(\chi^2_K \le a_{\min}) = \frac{2}{n_A}.$$

We refer to this inclusion criterion as the 'best allocation inclusion criterion' (BAIC), and the set of allocations fulfilling BAIC is denoted $\mathcal{M}_{BAIC}$. However, with ties in the minimum of the Mahalanobis distance, as could happen with discrete data, $\mathcal{M}_{BAIC}$ would have more than two elements. If $\mathcal{M}_{BAIC}$ is large enough, non-degenerate inference to the SATE is possible. However, because the cardinality of $\mathcal{M}_{BAIC}$ is not restricted by design, BAIC does not generally allow for non-degenerate inference to SATE. If there are too few allowed allocations, probabilistic inference to SATE is not helpful. Note that in a real experiment it will generally not be possible to find $\mathcal{M}_{BAIC}$ within a reasonable time limit. The reason is that the number of allocations increases exponentially in $n$ and finding the minimum is an NP-hard problem; already for $n > 30$ an exhaustive search is impractical.
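For very small experiments the minimal set can be found by brute force, which also makes the mirror property visible. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it uses $n = 8$ units, $K = 2$ simulated covariates (so the covariance matrix can be inverted in closed form), and the hypothetical helper names `mahalanobis` and `mirror`.

```python
import itertools
import random

def mahalanobis(w, X):
    """M(W, X) = (n1 * n0 / n) * d' S^{-1} d for K = 2 covariates."""
    n = len(w)
    n1 = sum(w)
    n0 = n - n1
    # Covariate mean difference between treated and control units.
    d = [sum(x[k] for x, wi in zip(X, w) if wi == 1) / n1
         - sum(x[k] for x, wi in zip(X, w) if wi == 0) / n0
         for k in range(2)]
    # Sample covariance matrix of X (the same for every allocation).
    m = [sum(x[k] for x in X) / n for k in range(2)]
    s = [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / (n - 1)
          for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    quad = (s[1][1] * d[0] ** 2 - 2.0 * s[0][1] * d[0] * d[1]
            + s[0][0] * d[1] ** 2) / det
    return n1 * n0 / n * quad

def mirror(w):
    return tuple(1 - wi for wi in w)

rng = random.Random(1)
n = 8
X = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

# All balanced allocations: choose which n/2 units are treated.
allocations = [tuple(1 if i in treated else 0 for i in range(n))
               for treated in itertools.combinations(range(n), n // 2)]
dist = {w: mahalanobis(w, X) for w in allocations}

# Allocations attaining the minimum distance (small tolerance for rounding).
m_min = min(dist.values())
best = [w for w in allocations if dist[w] <= m_min + 1e-12]
print(len(allocations), len(best), best)
```

With continuous covariates the printed minimal set is expected to consist of one allocation and its mirror; the enumeration grows as $\binom{n}{n/2}$, which is why this approach does not scale.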

Asymptotic Theory
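Let $Y_i(w)$ denote the potential outcome when unit $i$ is exposed to treatment $w$. The sample average treatment effect (SATE) is defined as

$$\tau = \frac{1}{n} \sum_{i=1}^{n} \tau_i, \qquad \tau_i = Y_i(1) - Y_i(0).$$

Let $\bar{Y}(w) = \frac{1}{n} \sum_{i=1}^{n} Y_i(w)$, $w = 0, 1$; then the sample variances of the potential outcomes and the variance of the treatment effect are equal to

$$S^2_{Y(w)} = \frac{1}{n-1} \sum_{i=1}^{n} \left( Y_i(w) - \bar{Y}(w) \right)^2, \quad w = 0, 1,$$

and

$$S^2_{\tau} = \frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - \tau)^2.$$

Note that these three variances are fixed in the sample. Define the sample means of treated and controls

$$\bar{Y}_1 = \frac{1}{n_1} \sum_{i=1}^{n} W_i Y_i(1), \qquad \bar{Y}_0 = \frac{1}{n_0} \sum_{i=1}^{n} (1 - W_i) Y_i(0).$$

Under SUTVA (no interference between individuals and the same treatment; Rubin 1980), $\hat{\tau} = \bar{Y}_1 - \bar{Y}_0$ is an unbiased estimator of $\tau$ (Neyman 1923). Using CLTs where the $n$ units in the sample are embedded into an infinite sequence of finite populations with increasing sizes, one can show (see e.g. Li and Ding (2017)) that under complete randomization

$$\sqrt{n} (\hat{\tau} - \tau) \xrightarrow{d} N\left( 0, \; \frac{n}{n_1} S^2_{Y(1)} + \frac{n}{n_0} S^2_{Y(0)} - S^2_{\tau} \right). \tag{1}$$

The population average treatment effect (PATE) is defined as $\tau_{PATE} = \mu_1 - \mu_0$, where $\mu_w = E(Y(w))$, $w = 0, 1$. Given random sampling from the population, $\hat{\tau}$ is an unbiased estimator of $\tau_{PATE}$ and the third term in Equation (1) vanishes, so that

$$\sqrt{n} (\hat{\tau} - \tau_{PATE}) \xrightarrow{d} N(0, V), \tag{2}$$

where $V = \frac{n}{n_1} S^2_{Y(1)} + \frac{n}{n_0} S^2_{Y(0)}$.

For later use, define $Y(w) = (Y_1(w), ..., Y_n(w))'$ and $\boldsymbol{\tau} = (\tau_1, ..., \tau_n)'$, and let $S^2_{Y(w)|X}$ and $S^2_{\tau|X}$ be the variances of the linear projections of $Y(w)$, $w = 0, 1$, and $\boldsymbol{\tau}$ on $X$, respectively. Using the same conditions as in Li and Ding (2017), Li et al. (2018) showed that the asymptotic distribution of $\hat{\tau}$ after randomly choosing an allocation from the set $\{W^j : M(W^j, X) \le a\}$ is

$$\sqrt{n} (\hat{\tau} - \tau) \mid M(W, X) \le a \;\; \overset{\cdot}{\sim} \;\; \sqrt{V_{\tau}} \, Q, \qquad Q = \sqrt{1 - R^2} \, \varepsilon_0 + \sqrt{R^2} \, L_{K,a}, \tag{3}$$

where $V_{\tau} = V - S^2_{\tau}$ is the variance in Equation (1). Here, $\varepsilon_0$ is a standard normal variable (for $Y$ in the space orthogonal to the covariates), $L_{K,a}$ captures the projection of $Y$ into the space of covariates and is thus affected by the rerandomization, and

$$R^2 = \frac{\frac{n}{n_1} S^2_{Y(1)|X} + \frac{n}{n_0} S^2_{Y(0)|X} - S^2_{\tau|X}}{\frac{n}{n_1} S^2_{Y(1)} + \frac{n}{n_0} S^2_{Y(0)} - S^2_{\tau}}.$$

Li et al. (2018) show that $R^2$ can be consistently estimated. Under homogeneous treatment effects, i.e., $S^2_{\tau} = 0$ and $S^2_{\tau|X} = 0$, it follows that $R^2 = S^2_{Y(0)|X} / S^2_{Y(0)}$. This implies that $R^2$ can be estimated by a linear projection of the outcomes of the treated and control units on $X$.

The second part of $Q$ has the following distribution:

$$L_{K,a} \sim \chi_{K,a} \, S \sqrt{\beta_K}, \tag{4}$$

where $\chi_{K,a}$ is the positive square root of $\chi^2_{K,a} \sim \chi^2_K \mid \chi^2_K \le a$, $S$ is a random variable taking the values $\pm 1$ with probability 1/2 each, and $\beta_K \sim \mathrm{Beta}(1/2, (K-1)/2)$, degenerating to a point mass at 1 when $K = 1$; the three components are mutually independent. The variance of $L_{K,a}$ is $\nu_a = P(\chi^2_{K+2} \le a) / P(\chi^2_K \le a)$ (Li et al. 2018).

¹ In an experiment with $n_1 \ne n_0$ the minimum set could consist of just one allocation. See Morgan and Rubin (2012, p. 9) for an example with $n = 3$ and $n_1 = 2$.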
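The representation of $L_{K,a}$ as a product of a truncated chi variable, a random sign, and the square root of a Beta variable can be simulated directly. The sketch below is only an illustration with hypothetical function names (`draw_L`, `draw_Q`): it draws the truncated $\chi^2_K$ by plain rejection, which is simple but becomes very slow for small $a$, and it uses $K = 2$ and $a = 1$ purely to keep the acceptance rate reasonable.

```python
import math
import random

def draw_L(K, a, rng):
    """One draw of L_{K,a} = chi_{K,a} * S * sqrt(beta_K)."""
    # chi^2_K truncated to [0, a], by rejection from Gamma(shape=K/2, scale=2).
    while True:
        c2 = rng.gammavariate(K / 2.0, 2.0)
        if c2 <= a:
            break
    sign = rng.choice((-1.0, 1.0))
    # Beta(1/2, (K-1)/2) degenerates to a point mass at 1 when K = 1.
    beta = 1.0 if K == 1 else rng.betavariate(0.5, (K - 1) / 2.0)
    return math.sqrt(c2) * sign * math.sqrt(beta)

def draw_Q(R2, K, a, rng):
    """One draw of Q = sqrt(1 - R^2) * eps0 + sqrt(R^2) * L_{K,a}."""
    return (math.sqrt(1.0 - R2) * rng.gauss(0.0, 1.0)
            + math.sqrt(R2) * draw_L(K, a, rng))

rng = random.Random(2018)
draws = [draw_L(2, 1.0, rng) for _ in range(20000)]
# L is symmetric around zero, so the second moment estimates its variance.
sample_var_L = sum(x * x for x in draws) / len(draws)
print(sample_var_L)
```

For $K = 2$ the chi-square probabilities have closed forms, so the simulated variance can be compared with the ratio $P(\chi^2_4 \le 1) / P(\chi^2_2 \le 1)$ as a sanity check.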

Asymptotic theory for the optimal design
Only the second term of Equation (3) depends on the criterion $a$; $R^2$ is a constant that can be estimated consistently. Thus, under the assumption that the $n$ units are randomly sampled from the population, the asymptotic results for $n \to \infty$ in Li et al. (2018) can be used to derive the asymptotic distribution in the situation when $a$ approaches 0.

For fixed $K$, $\mathrm{Var}(L_{K,a}) \to 0$ as $a \to 0$. That is, the distribution of $L_{K,a}$ converges to a point mass at zero when $a$, or equivalently $p_a$, goes to zero. This implies that, for large $n$ and using the minimum criterion $a_{\min}$,

$$\sqrt{n} (\hat{\tau} - \tau_{PATE}) \;\; \overset{\cdot}{\sim} \;\; \sqrt{V (1 - R^2)} \, \varepsilon_0. \tag{5}$$

The intuition of the result is that randomizing the treatment assignment within the set containing only the very best allocations will, in large samples, result in a realized treatment allocation with a Mahalanobis distance close to zero. This means that essentially all variation in $\hat{\tau}$ that is explained by group differences in $X$ is removed, that is, all the variance in the linear projection of $Y(0)$ and $Y(1)$ on $X$ is eliminated. Thus, non-degenerate inference to PATE is in general possible using BAIC given random sampling to the experiment. Under Theorem 1 in Li et al. (2018, p. 8) and using Equation (5), the test statistic is thus

$$\frac{\sqrt{n} (\hat{\tau} - \tau_{PATE})}{\sqrt{V (1 - R^2)}} \;\; \overset{\cdot}{\sim} \;\; Z, \tag{6}$$

where $Z \sim N(0, 1)$ and $V$ is the variance under random sampling. All the well-established asymptotic results for the standard normal distribution apply. The sampling distribution is well defined for $R^2 < 1$. In the situation where $R^2 = 1$, i.e., where the variation in $X$ explains all variation in both $Y(0)$ and $Y(1)$ and all variation in $X$ is removed by rerandomization, the sampling distribution degenerates to a point mass at zero; this situation is impossible in practice.
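The rate at which $\mathrm{Var}(L_{K,a})$ vanishes can be made explicit with a short worked approximation. It relies on the variance formula $\nu_a = P(\chi^2_{K+2} \le a) / P(\chi^2_K \le a)$ from Li et al. (2018) and on the standard leading-order expansion of the chi-square distribution function near zero; it is offered as a sketch of the reasoning, not as a result stated in this form by the authors.

```latex
% Leading-order behaviour of the chi-square CDF near zero:
%   P(\chi^2_K \le a) = \frac{(a/2)^{K/2}}{\Gamma(K/2 + 1)} \bigl( 1 + O(a) \bigr), \quad a \to 0.
\nu_a
  = \frac{P(\chi^2_{K+2} \le a)}{P(\chi^2_K \le a)}
  \approx \frac{(a/2)^{K/2 + 1} / \Gamma(K/2 + 2)}{(a/2)^{K/2} / \Gamma(K/2 + 1)}
  = \frac{a/2}{K/2 + 1}
  = \frac{a}{K + 2}
  \;\longrightarrow\; 0 \quad (a \to 0).
```

For instance, with $K = 4$ and $a = 10^{-4}$ this gives $\nu_a \approx 1.7 \times 10^{-5}$, so the rerandomization-dependent component contributes almost nothing to the variance of $Q$.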
For the inference to PATE the calculation of $R^2$ is simplified, as we can neglect the heterogeneity in the calculation of the variance. We thus separately regress $Y_i$ on $x_i$ for the $W_i = 1$ and $W_i = 0$ samples; then $s^2_{Y(1)|X}$ and $s^2_{Y(0)|X}$ are estimated as the sample variances of the fitted values,

$$s^2_{Y(w)|X} = \frac{1}{n_w - 1} \sum_{i: W_i = w} \left( x_i' \hat{\beta}_w - \bar{x}_w' \hat{\beta}_w \right)^2, \quad w = 0, 1,$$

where $\hat{\beta}_w$, $w = 0, 1$, are the OLS estimates and $\bar{x}_w$ is the covariate mean in group $w$. The Neyman estimator is used to estimate $V$, thus

$$\hat{V}(\hat{\tau}) = \frac{n}{n_1} s^2_{Y_1} + \frac{n}{n_0} s^2_{Y_0}, \qquad \text{where} \quad s^2_{Y_w} = \frac{1}{n_w - 1} \sum_{i: W_i = w} (Y_i - \bar{Y}_w)^2.$$

This means that

$$\hat{R}^2 = \frac{\frac{n}{n_1} s^2_{Y_1|X} + \frac{n}{n_0} s^2_{Y_0|X}}{\hat{V}(\hat{\tau})},$$

and that the estimator of the variance in Equation (6) is $\hat{V}(\hat{\tau}) (1 - \hat{R}^2)$. By Slutsky's theorem, plugging these consistent estimators into Equation (6) yields a statistic that is fully computable from the sample.

To illustrate that the sampling distribution can be approximated by a normal distribution when $p_a$ is sufficiently small, a small Monte Carlo simulation is conducted. We test for the mean difference between treated and controls, with an equal number of treated and controls and a sample size of $n = 12$, using the corresponding BAIC. This implies $p_a = 2 / \binom{12}{6} \approx 0.002$. The sampling distribution of the test statistic given in Equation (6) is compared to the standard normal. We also calculate the sampling distribution of the test statistic in Li et al. (2018), which involves $\nu_a = \Pr(\chi^2_{K+2} \le a) / \Pr(\chi^2_K \le a)$. This distribution is compared to the Q-distribution. Data are generated² as

$$y_i = x_i' \beta_x + \varepsilon_i,$$

where $x_{ki} \sim N(2, 2)$, $k = 1, ..., 5$, and $\varepsilon_i \sim N(0, 6)$. This implies $R^2 = 0.5$. 10,000 independent realizations of size $n = 12$ are generated from each distribution, including the theoretical ones. The true $R^2$ and $V_{\tau}$ are used to exclude the small sample sampling variation of the corresponding estimators, both shown to be consistent in Li et al. (2018). Figure 1 shows that the empirical sampling distributions are well described by their corresponding theoretical distributions. A one-sample Kolmogorov-Smirnov test for the empirical distribution of the statistic given in Equation (6) gives the test statistic D = 0.0096, with p-value = 0.3079 for a two-sided test against a standard normal distribution.

² The outcome can be thought of as generated from $z$, a $\kappa \times 1$ vector of covariates, with a subspace of observed covariates (i.e. $x \subset z$). Let $x^c$ be the complement such that $x \cup x^c = z$; then $\varepsilon_i = x^{c\prime}_i \beta_c$ and $\beta_x = (1, 1, 1, 1, 1)'$ in $y_i = x_i' \beta_x + \varepsilon_i$. When using Mahalanobis-based rerandomization with BAIC, for large $n$, the variation in $y$ that is common with $x$ is removed.
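The plug-in version of the statistic in Equation (6) can be sketched in a few lines. The example below is an illustration only: it assumes a single covariate ($K = 1$) so that the OLS fits reduce to closed-form expressions, and the function names (`sample_var`, `fitted_var`, `rerandomization_z`) and the toy data are hypothetical.

```python
import math

def sample_var(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / (len(v) - 1)

def fitted_var(y, x):
    """Sample variance of the OLS fitted values of y on (1, x): s_xy^2 / s_x^2."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    return s_xy ** 2 / sample_var(x)

def rerandomization_z(y, w, x, tau0=0.0):
    """tau_hat, V_hat, R2_hat and z = sqrt(n)(tau_hat - tau0) / sqrt(V_hat (1 - R2_hat))."""
    n = len(y)
    y1 = [yi for yi, wi in zip(y, w) if wi == 1]
    y0 = [yi for yi, wi in zip(y, w) if wi == 0]
    x1 = [xi for xi, wi in zip(x, w) if wi == 1]
    x0 = [xi for xi, wi in zip(x, w) if wi == 0]
    n1, n0 = len(y1), len(y0)
    tau_hat = sum(y1) / n1 - sum(y0) / n0
    # Neyman variance estimator on the sqrt(n) scale.
    V_hat = n / n1 * sample_var(y1) + n / n0 * sample_var(y0)
    # Share of the variance explained by the covariate, arm by arm.
    R2_hat = (n / n1 * fitted_var(y1, x1) + n / n0 * fitted_var(y0, x0)) / V_hat
    z = math.sqrt(n) * (tau_hat - tau0) / math.sqrt(V_hat * (1.0 - R2_hat))
    return tau_hat, V_hat, R2_hat, z

# Toy data: four treated and four control units, one covariate.
y = [3.0, 5.0, 4.0, 8.0, 1.0, 2.0, 2.0, 5.0]
w = [1, 1, 1, 1, 0, 0, 0, 0]
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
tau_hat, V_hat, R2_hat, z = rerandomization_z(y, w, x)
print(tau_hat, V_hat, R2_hat, z)
```

The resulting `z` would be compared with standard normal quantiles; with several covariates the two `fitted_var` calls would be replaced by multivariate OLS fits within each arm.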
It should be noted that in this case the covariates and the error term were generated as normal so that the asymptotic results are valid already for small sample sizes, which makes the simulation study feasible. For non-normal covariates and error terms, these results require large n to be valid.
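A reduced version of such a simulation can be sketched as follows. This is not the authors' setup: to stay within the standard library it uses a single standard normal covariate, unit-variance noise (so that $R^2 = 0.5$), no treatment effect, and only 400 replications; the names `mean_diff` and `sample_var` are hypothetical. With one covariate, minimizing the Mahalanobis distance is the same as minimizing the absolute covariate mean difference.

```python
import itertools
import random

def sample_var(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / (len(v) - 1)

def mean_diff(v, treated):
    """Mean of v over the units in `treated` minus the mean over the remaining units."""
    s1 = sum(v[i] for i in treated)
    n1 = len(treated)
    return s1 / n1 - (sum(v) - s1) / (len(v) - n1)

rng = random.Random(12)
n, reps = 12, 400
subsets = list(itertools.combinations(range(n), n // 2))  # 924 balanced allocations

tau_baic, tau_cr = [], []
for _ in range(reps):
    # New random sample from the superpopulation in every replication.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [xi + rng.gauss(0.0, 1.0) for xi in x]  # no treatment effect
    # Best-balanced allocation; its mirror has the same balance and flips the sign.
    best = min(subsets, key=lambda t: abs(mean_diff(x, t)))
    sign = rng.choice((-1.0, 1.0))
    tau_baic.append(sign * mean_diff(y, best))
    # Complete randomization for comparison.
    tau_cr.append(mean_diff(y, rng.choice(subsets)))

ratio = sample_var(tau_baic) / sample_var(tau_cr)
print(ratio)
```

Under these assumptions the variance ratio should be in the neighbourhood of $1 - R^2 = 0.5$, up to Monte Carlo error, which is the variance reduction implied by Equation (5).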
To summarize, Table 1 displays the sampling distribution of $\hat{\tau}$ under Mahalanobis-based rerandomization for different regions of $p_a$. Under complete randomization ($p_a = 1$) the distribution is normal; for $p_a$ in the interval (0, 1) the sampling distribution is Q; and for BAIC ($p_a \to 0$) the sampling distribution is again normal, but with a scaled variance.

Table 1
                 Complete rand.   Rerandomization   Optimal design
Reran. crit.     p_a = 1          0 < p_a < 1       p_a → 0
Distribution     Normal           Q                 Normal (scaled variance)

These results have the important implication that the large sample sampling distribution under Mahalanobis-based rerandomization with sufficiently small $p_a$ is simply normal with mean zero but with variance $V(1 - R^2)$ instead of $V$ as under complete randomization. As the inference is simplified when conducting inference to PATE given small enough $a$, a relevant question is whether we can, in the design phase, choose a $p_a$ such that the (scaled) standard normal asymptotic inference can be used in the analysis. The issue is discussed in the next section.

A Mahalanobis-based rerandomization design for simplified asymptotic inference
The variance of $Q$ is equal to

$$\mathrm{Var}(Q) = (1 - R^2) \, \mathrm{Var}(\varepsilon_0) + R^2 \, \mathrm{Var}(L_{K,a}) = (1 - R^2) + R^2 \, \mathrm{Var}(L_{K,a}),$$

where the last equality comes from $\varepsilon_0 \sim N(0, 1)$ and there is no cross term because $\varepsilon_0$ and $L_{K,a}$ are independent. Because the asymptotic variance of $L_{K,a}$ is known (cf. Equation 4), namely $\nu_a = P(\chi^2_{K+2} \le a) / P(\chi^2_K \le a)$, we can calculate the relative importance of the second term, which goes to zero as $a \to 0$, for a specific $R^2$. The variance ratio (VR) of the second term to the overall variance of the estimator under Mahalanobis-based rerandomization, given $R^2$ and $a$, equals

$$VR = \frac{R^2 \nu_a}{(1 - R^2) + R^2 \nu_a}.$$

To illustrate, consider a setting with $R^2 = 0.2$, which is often realistic in practice, four covariates, and a small inclusion criterion, e.g. $a = 10^{-4}$, which corresponds to $p_a \approx 1.25 \times 10^{-9}$. It follows that

$$VR = \frac{0.2 \, \nu_a}{0.8 + 0.2 \, \nu_a} \approx 4.17 \times 10^{-6},$$

which means that only about 0.0004 percent of the variance of $Q$ arises from the variance of the second term. Clearly, including or excluding this term has no practical importance for inference in this case. For a large sample, e.g. $n = 100$, there would still be around $1.26 \times 10^{20}$ ($= 1.25 \times 10^{-9} \times \binom{100}{50}$) allocations fulfilling this rerandomization criterion. Thus, choosing $a$ to be $10^{-4}$ would also enable inference to the units of the sample while simultaneously allowing standard methods to be used for inference to the units of the population.
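The numbers in this illustration can be checked with a few lines of standard-library Python. The sketch assumes the variance formula $\nu_a = P(\chi^2_{K+2} \le a) / P(\chi^2_K \le a)$ from Li et al. (2018); the function names (`chi2_cdf`, `nu`, `variance_ratio`) are hypothetical, and the chi-square CDF is evaluated by a power series that remains accurate for very small $a$.

```python
import math

def chi2_cdf(x, k, terms=500):
    """P(chi2_k <= x) from the power series of the regularized incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    s, h = k / 2.0, x / 2.0
    log_h = math.log(h)
    total = sum(math.exp((s + j) * log_h - h - math.lgamma(s + j + 1.0))
                for j in range(terms))
    return min(1.0, total)

def nu(K, a):
    """Asymptotic variance of L_{K,a}."""
    return chi2_cdf(a, K + 2) / chi2_cdf(a, K)

def variance_ratio(R2, K, a):
    """Share of Var(Q) that is due to the rerandomization-dependent term."""
    v = nu(K, a)
    return R2 * v / ((1.0 - R2) + R2 * v)

K, a, R2, n = 4, 1e-4, 0.2, 100
p_a = chi2_cdf(a, K)                   # acceptance probability
vr = variance_ratio(R2, K, a)          # variance ratio VR
n_alloc = p_a * math.comb(n, n // 2)   # expected number of admissible allocations
print(p_a, vr, n_alloc)
```

The three printed quantities correspond to $p_a$, VR, and the approximate number of admissible allocations discussed in the text.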
In real experiments, $R^2$ is not known in the design phase, which complicates this type of analysis. However, the VR can be calculated for various hypothetical values of $R^2$ that are larger than the expected empirical $R^2$ to create an upper bound for the VR. Figure 2 displays the VR as a function of $a$ for $n = 100$, $K = 5$, and $R^2$ between 0.05 and 0.95. It is clear that, in this case, even if $R^2$ is large and the sample size is only moderately large, there are still enough allocations fulfilling the criterion. For example, if $a$ is set to $10^{-4}$, there are still $7.88 \times 10^{20}$ allocations fulfilling the criterion. That this small sample size allows for the simplification even for very large $R^2$ illustrates that these results can often be applied in typical experimental settings. Based on a limited set of simulations, there seems to be no problem with inference using the normal approximation if the ratio of the variance (VR) in the sampling distribution due to the second term in Equation (3) is smaller than 0.01%. In the example above, using VR < 0.01% would allow for the (scaled) standard asymptotic inference for all $R^2 < 0.95$. These simple calculations can be performed for any experiment to quickly evaluate the possibilities of using the simplified results presented in this paper, or to select $a$ such that, for the largest plausible $R^2$, the VR is smaller than the rule of thumb.
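Selecting $a$ from the rule of thumb amounts to a one-dimensional root search, since the VR increases with $a$. The following sketch is one possible way to do this and is not taken from the paper: `largest_a` is a hypothetical helper that bisects on a logarithmic scale, the default bound of 0.01% is the rule of thumb quoted above, and $R^2 = 0.95$ with $K = 5$ is used only as an example of a conservative upper bound.

```python
import math

def chi2_cdf(x, k, terms=500):
    """P(chi2_k <= x) from the power series of the regularized incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    s, h = k / 2.0, x / 2.0
    log_h = math.log(h)
    total = sum(math.exp((s + j) * log_h - h - math.lgamma(s + j + 1.0))
                for j in range(terms))
    return min(1.0, total)

def variance_ratio(R2, K, a):
    """VR = R^2 nu_a / ((1 - R^2) + R^2 nu_a), with nu_a = P(chi2_{K+2} <= a) / P(chi2_K <= a)."""
    v = chi2_cdf(a, K + 2) / chi2_cdf(a, K)
    return R2 * v / ((1.0 - R2) + R2 * v)

def largest_a(R2_max, K, vr_max=1e-4, lo=1e-12, hi=50.0):
    """Largest a (up to bisection precision) with VR <= vr_max for the assumed R2_max."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection on the log scale
        if variance_ratio(R2_max, K, mid) <= vr_max:
            lo = mid
        else:
            hi = mid
    return lo

a_star = largest_a(0.95, 5)
p_star = chi2_cdf(a_star, 5)
print(a_star, p_star, p_star * math.comb(100, 50))
```

The last printed value is the approximate number of allocations that remain admissible for $n = 100$ under the selected criterion, which indicates whether inference to SATE is still non-degenerate.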

Discussion
The asymptotic sampling distribution of the mean-difference estimator, $\hat{\tau}$, for inference to the PATE under a Mahalanobis-based rerandomization design is investigated; specifically, we study how the sampling distribution is affected by letting the Mahalanobis-based rerandomization inclusion criterion approach the 'optimal' design.
The estimator $\hat{\tau}$ is asymptotically normal under complete randomization. However, removing allocations associated with large covariate differences between treated and controls, as with 'rerandomization designs', affects the properties of this estimator in ways that depend on the balance criterion used to discard unbalanced allocations. Morgan and Rubin (2012) and Li et al. (2018) used the well-known properties of the affinely invariant Mahalanobis distance to derive properties of $\hat{\tau}$ after randomizing in the set of allocations with a Mahalanobis distance smaller than a specified inclusion criterion.
The estimator $\hat{\tau}$ has a non-degenerate sampling distribution for repeated sampling inference to the population. However, deterministically choosing an 'optimal' design can lead to a degenerate sampling distribution for inference to the units in the experiment. Based on the results in Li et al. (2018), we show that the asymptotic sampling distribution under Mahalanobis-based rerandomization simplifies to a normal distribution when the inclusion criterion is small. When the sample size is moderately large, there will be a large number of allocations that fulfill even very restrictive inclusion criteria, which enables inference to the PATE or the sample estimand (SATE).
For these reasons, when using Mahalanobis-based rerandomization, it can be advisable to set the inclusion criterion for admissible allocations slightly smaller than suggested in Li et al. (2018), so that standard asymptotic inference can be used for PATE, and slightly larger than the minimum, so that inference is possible to both SATE and PATE. To this end, we suggest a simple-to-use rule of thumb for when the simplified asymptotics can be used, which makes it possible to choose the Mahalanobis criterion accordingly.