Optimal allocation of sample size for randomization-based inference from $2^K$ factorial designs

Abstract: Optimizing the allocation of units into treatment groups can help researchers improve the precision of causal estimators and decrease costs when running factorial experiments. However, existing optimal allocation results typically assume a super-population model and that the outcome data come from a known family of distributions. Instead, we focus on randomization-based causal inference for the finite-population setting, which does not require model specifications for the data or sampling assumptions. We propose exact theoretical solutions for optimal allocation in $2^K$ factorial experiments under complete randomization with A-, D-, and E-optimality criteria. We then extend this work to factorial designs with block randomization. We also derive results for optimal allocations when using cost-based constraints. To connect our theory to practice, we provide convenient integer-constrained programming solutions using a greedy optimization approach to find integer optimal allocation solutions for both complete and block randomizations. The proposed methods are demonstrated using two real-life factorial experiments conducted by social scientists.


Introduction
Randomized $2^K$ factorial experiments are conducted to assess the marginal causal effects of $K$ factors, each with two levels, along with their interactions, on a response of interest. The two levels are often denoted as the "high level" and "low level" of the factor (Fisher 1935; Yates 1937). With $K$ factors, there are $2^K$ unique treatment combinations to which units can be assigned. In the twentieth century, factorial designs were mostly discussed in an industrial setting, whereas in recent times there has been considerable interest in their application in the social, behavioral, and biomedical sciences, and in randomization-based inference from such designs (e.g., Branson et al. 2016; Egami and Imai 2019).
Randomization-based inference is a useful methodology for drawing inference on causal effects of treatments in a finite-population setting (e.g., Freedman 2006, 2008). A major advantage of randomization-based inference is that it applies even if the experimental units are not randomly sampled from a larger population, which is the case in most social science experiments (Abadie et al. 2020; Olsen et al. 2013). The theory, methods, and applications of randomization-based inference for two-level factorial experiments with a completely randomized treatment assignment mechanism have been developed and discussed (e.g., Dasgupta et al. 2015; Lu 2016). Further, randomization-based inference from experiments with more general factorial structures and complex assignment mechanisms has been discussed in Mukerjee et al. (2018). Connections between regression-based and randomization-based causal inference from factorial experiments have been studied by Zhao and Ding (2021).
Despite the growing literature in this area, most of the recent research on randomization-based inference for factorial experiments has been confined to the analysis side. On the design side, the main focus has been on rerandomization (Branson et al. 2016; Li et al. 2020; Morgan and Rubin 2012), which generalizes the idea of blocking by pre-defining an acceptable criterion for randomization based on covariate balance between treatment groups. There have also been extensions to fractional and incomplete factorial designs (Pashley and Bind 2022) and to the use of screening steps (Shi et al. 2023). However, the distribution of the total number of experimental units into the $2^K$ treatment groups has not received much attention. Balanced designs that assign equal numbers of units to the treatment groups are often the default choice, but it is unclear whether they are the "best" design under different conditions. One work that does discuss how to allocate units to optimize the precision of factorial estimators from the randomization-based perspective is Blackwell et al. (2022). That work explores the advantages of the Neyman allocation (Neyman 1934; Cochran 1977b) by extending the two-stage adaptive design in Hahn et al. (2011) to multiple treatment designs. Dai et al. (2023) similarly explore the Neyman allocation, but within sequential designs.
To motivate this problem, we consider an education experiment from Angrist et al. (2009), conducted to assess the causal effects of two different interventions, a student support program (SSP) and a student fellowship program (SFP), on the academic performance of freshmen. This is a $2^2$ factorial experiment in which each unit (freshman) can receive only one of four treatment combinations: control (neither of the two), SSP only, SFP only, and SFSP (both). The units were divided into two blocks based on their sex. Table 1 shows the allocation of units within each block to the four treatment combinations. Clearly, the design is unbalanced, with the highest number of units assigned to control and the fewest to the treatment SFSP. Such an allocation, among other reasons, could be motivated by the budget for experimentation. The question we investigate in this paper is the following: Under what assumptions, conditions, and requirements will such an allocation be the best possible one (in terms of being able to precisely answer scientific questions of interest)?
The problem of finding optimal designs in the context of model-based inference has been extensively studied in the twentieth century (see, e.g., Atkinson et al. 2007). In such settings, optimal designs depend on a postulated outcome model, which may be linear or non-linear.
For example, for binary responses, optimal designs based on logistic models are likely to differ from those based on probit models, and depend on unknown model parameters (Yang et al. 2012; Yang and Mandal 2015; Yang et al. 2016). We aim to develop optimal designs that are tied to model-free, randomization-based analysis for finite and super populations. In addition to being robust to model assumptions, our approach works for continuous as well as binary outcomes as long as the finite-population estimand is well-defined.
This paper is organized as follows: the next section introduces basic notation and estimands for factorial experiments using the potential outcomes framework. In Section 3, we derive optimal allocations of the $N$ units in a population to the different treatment combinations under three commonly used optimality criteria for a completely randomized design (CRD). In addition to theoretical results for exact optimal allocations, we also provide numerical algorithms for obtaining integer solutions.
In Section 4, we extend our results for CRDs to the setting of randomized block designs (RBDs).
In Section 5, we derive optimality results under cost constraints. Two applications of the proposed methodology, the motivating education experiment and an audit experiment conducted to assess discrimination, are described in Section 6. We conclude with a discussion, including opportunities for future work, in Section 7.

$2^K$ factorial experiments under the potential outcomes framework
Here, we introduce some key definitions and notation from Dasgupta et al. (2015). Consider a $2^K$ experiment with $N$ units, in which the levels of each of the $K$ factors are denoted by 0 and 1. Each treatment combination is of the form $z_j = (z_1, \ldots, z_K)$, where $z_k \in \{0, 1\}$ for $k = 1, \ldots, K$. There are $J = 2^K$ treatment combinations arranged in lexicographic order $1, \ldots, J$, where treatment combination $j$ is given by the $K$-digit binary representation of the integer $j - 1$. Thus, for example, in a $2^2$ experiment, the four treatment combinations 00, 01, 10, and 11 are numbered as $j = 1, 2, 3$, and 4 respectively, and the 8th treatment combination in a $2^4$ experiment is 0111. We will refer to each treatment combination by its number $j$ in the notation below.
For $i = 1, \ldots, N$, under the Stable Unit Treatment Value Assumption or SUTVA (Rubin 1980), the $i$th unit has $J = 2^K$ potential outcomes $Y_i(1), \ldots, Y_i(J)$ corresponding to the $J$ treatment combinations $z_1, \ldots, z_J$. Let $Y_i$ denote the $J \times 1$ vector of potential outcomes for unit $i$. For unit $i$, the unit-level main effect of factor $k = 1, \ldots, K$ is defined as the difference between the averages of the potential outcomes for unit $i$ for which factor $k$ is at level 1 versus level 0. Mathematically, it is a contrast of the form $2^{-(K-1)} \lambda_k^T Y_i$, where $x^T$ denotes the transpose of vector $x$ and $\lambda_k$ is a $J \times 1$ column vector with coefficients $\lambda_{jk}$ such that $\lambda_{jk} = -1$ if the level of factor $k$ in the $j$th treatment combination is 0, and $\lambda_{jk} = 1$ otherwise. For all $k = 1, \ldots, K$, $\lambda_k$ is a contrast vector, i.e., $\sum_{j=1}^J \lambda_{jk} = 0$. Proceeding along the lines of Dasgupta et al. (2015), for unit $i$ we can define $\binom{K}{2}$ two-factor interactions, $\binom{K}{3}$ three-factor interactions, and finally one $K$-factor interaction as contrasts of the form $2^{-(K-1)} \lambda^T Y_i$, where the contrast vector $\lambda$ for any interaction can be derived by element-wise multiplication of the contrast vectors $\lambda_k$ of the main effects of the factors involved in the interaction. Denoting the $J - 1 = 2^K - 1$ contrast vectors for the $J - 1$ unit-level factorial effects $\tau_{1i}, \ldots, \tau_{J-1,i}$ by $\lambda_1, \ldots, \lambda_{J-1}$ respectively, we define the $J \times J$ model matrix as
$$L = (\lambda_0, \lambda_1, \ldots, \lambda_{J-1}), \quad (1)$$
where $\lambda_0$ is the $J \times 1$ vector with all elements equal to one. We note that $L$ is an orthogonal matrix with $LL^T = L^T L = 2^K I_J$, where $I_J$ denotes the identity matrix of order $J$. For $i = 1, \ldots, N$, let $\tau_i = (\tau_{0i}, \tau_{1i}, \ldots, \tau_{J-1,i})^T$, where $\tau_{0i}$ denotes twice the average of all potential outcomes for unit $i$. The linear transformation between the vector of unit-level potential outcomes $Y_i$ and the vector of unit-level factorial effects $\tau_i$ can be expressed as
$$\tau_i = 2^{-(K-1)} L^T Y_i. \quad (2)$$
Having defined unit-level factorial effects, we now move to their population-level counterparts.
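As an illustration of these definitions, the model matrix $L$ can be constructed mechanically from the binary representations of the treatment labels. The following sketch is our own illustration (not code from the paper): it builds $L$ for any $K$, with main-effect columns followed by interaction columns, and checks the orthogonality property $L^T L = 2^K I_J$.

```python
import itertools

import numpy as np

def factorial_contrasts(K):
    """Build the J x J model matrix L for a 2^K factorial: column 0 is
    all ones (lambda_0), the next K columns are main-effect contrasts,
    and the remaining columns are element-wise products of main-effect
    contrasts (two-factor, three-factor, ... interactions)."""
    J = 2 ** K
    # Treatment j (1-based) is the K-digit binary representation of j - 1.
    combos = np.array([[(j >> (K - 1 - k)) & 1 for k in range(K)]
                       for j in range(J)])
    mains = 2 * combos - 1          # lambda_{jk}: -1 at level 0, +1 at level 1
    cols = [np.ones(J, dtype=int)]  # lambda_0
    # All non-empty subsets of factors: mains first, then 2-way, 3-way, ...
    for size in range(1, K + 1):
        for S in itertools.combinations(range(K), size):
            cols.append(np.prod(mains[:, S], axis=1))
    return np.column_stack(cols)

L = factorial_contrasts(2)
# Orthogonality: L^T L = 2^K I_J (here 4 I_4)
assert np.array_equal(L.T @ L, 4 * np.eye(4, dtype=int))
```

For the $2^2$ case this reproduces the columns $\lambda_1 = (-1,-1,1,1)^T$, $\lambda_2 = (-1,1,-1,1)^T$, and their product as the interaction contrast.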
Let $\bar{Y} = N^{-1} \sum_{i=1}^N Y_i$ and $\tau = N^{-1} \sum_{i=1}^N \tau_i$ respectively denote the $J \times 1$ vectors of average potential outcomes and average factorial effects. Then, averaging (2) over $i = 1, \ldots, N$, the vector of population-level factorial effects is given by
$$\tau = 2^{-(K-1)} L^T \bar{Y}. \quad (3)$$
Note that the first element of $\tau$ is twice the average of all potential outcomes.
In a CRD, a pre-assigned number of units, $N_j$, is randomly assigned to treatment $j$. The experiment generates an $N \times 1$ vector of observed outcomes, from which the vector of factorial effects $\tau$ can be unbiasedly estimated. We examine the properties of $\hat{\tau}$, the unbiased estimator of $\tau$, with respect to its randomization distribution, and formulate the problem of optimally allocating the $N$ units to the $J$ treatment combinations in Section 3.
Remark 1 (A super-population perspective). While the finite-population perspective does not depend on any hypothetical data generating process for the outcomes, alternative approaches assume that the potential outcomes are drawn from a, possibly hypothetical, super population. Assuming that $Y_1, \ldots, Y_N$ are independent and identically distributed random vectors with $E[Y_i] = \mu$, the factorial effects at the super-population level are defined as
$$\tau_{SP} = 2^{-(K-1)} L^T \mu.$$
Ding et al. (2017) discussed the conceptual and mathematical connections between finite- and super-population inference, showing that while the estimator commonly used to estimate $\tau$ unbiasedly is also an unbiased estimator of $\tau_{SP}$, its sampling variances under the two perspectives are different.

Optimal designs for completely randomized experiments
In a randomized experiment with $N_j$ units assigned to treatment combination $j \in \{1, \ldots, J\}$, only one of the $J$ potential outcomes is observed for unit $i$. This observed outcome is $y_i = \sum_{j=1}^J Y_i(j)\, \mathbb{1}\{T_i = j\}$, where $T_i$ is the random treatment assignment variable for unit $i$, taking value $j$ if unit $i$ receives treatment $j$. In a CRD, the joint probability distribution of $(T_1, \ldots, T_N)$ is
$$\Pr(T_1 = t_1, \ldots, T_N = t_N) = \frac{\prod_{j=1}^J N_j!}{N!}\; \mathbb{1}\Big\{\textstyle\sum_{i=1}^N \mathbb{1}\{t_i = j\} = N_j, \; j = 1, \ldots, J\Big\},$$
where $\mathbb{1}\{A\}$ denotes the indicator random variable for the event $A$. Let $\bar{y}(j) = N_j^{-1} \sum_{i: T_i = j} y_i$ denote the average observed response for treatment $j$, and let $\bar{y}$ denote the vector $(\bar{y}(1), \bar{y}(2), \ldots, \bar{y}(J))^T$ of observed averages. Substituting $\bar{y}$ in place of $\bar{Y}$ in (3), we can unbiasedly estimate the vector of factorial effects as
$$\hat{\tau} = 2^{-(K-1)} L^T \bar{y}. \quad (4)$$
Lu (2016) derived sampling properties of the estimator $\hat{\tau}$ with respect to its randomization distribution for the general case of unequal $N_1, \ldots, N_J$. Lu showed that $\hat{\tau}$ is an unbiased estimator of $\tau$ and has the following finite-population covariance matrix:
$$V_{\hat{\tau}} = \frac{1}{2^{2(K-1)}} \sum_{j=1}^J \frac{S_j^2}{N_j}\, \lambda_j \lambda_j^T \;-\; \frac{1}{N(N-1)} \sum_{i=1}^N (\tau_i - \tau)(\tau_i - \tau)^T, \quad (5)$$
where $\lambda_j$ represents the transpose of row $j$ of the model matrix $L$ defined in (1), $\tau_i$ denotes the vector of unit-level factorial effects given by (2), $\tau$ is the vector of population-level factorial effects given by (3), and $S_j^2 = (N-1)^{-1} \sum_{i=1}^N \big(Y_i(j) - \bar{Y}(j)\big)^2$ is the variance of all $N$ potential outcomes for treatment $j$ with divisor $N - 1$, where $\bar{Y}(j) = N^{-1} \sum_{i=1}^N Y_i(j)$.

In the spirit of classical optimal designs (Atkinson et al. 2007), we can define a design optimality criterion as a functional of the matrix $V_{\hat{\tau}}$ defined in (5). For example, we can consider the D-optimality criterion, which aims to minimize the determinant of the covariance matrix; the A-optimality criterion, which aims to minimize its trace; or the E-optimality criterion, which aims to minimize its maximum eigenvalue. However, the second term of (5), which is a measure of the heterogeneity of treatment effects, cannot be estimated from observed data, because none of the unit-level treatment effects $\tau_i$ are estimable without additional assumptions, due to the missing potential outcomes. Because $\frac{1}{N(N-1)} \sum_{i=1}^N (\tau_i - \tau)(\tau_i - \tau)^T$ is positive semi-definite, the first term of (5) can be considered an upper bound of $V_{\hat{\tau}}$, which is attained under specific restrictions on the potential outcomes (e.g., treatment effect homogeneity). Thus, we propose optimizing a functional of the first term of (5), which in turn is equivalent to optimizing a functional of the positive definite matrix
$$V = \frac{1}{2^{2(K-1)}}\, L^T A L \quad (6)$$
instead, where $A = \mathrm{diag}(S_1^2/N_1, \ldots, S_J^2/N_J)$. Another justification for using a functional of the matrix $V$ as an optimality criterion comes from the super-population perspective mentioned in Remark 1. Ding et al. (2017) showed that the estimator $\hat{\tau}$ defined earlier is also an unbiased estimator of the super-population estimand $\tau_{SP}$. Further, extending their argument for a single factor with two levels to the case of $2^K$ factorial designs, if $V_j^2 = \mathrm{Var}[Y_i(j)]$, $j = 1, \ldots, J$, then a variance decomposition yields the following sampling variance of $\hat{\tau}$:
$$\mathrm{Var}_{SP}(\hat{\tau}) = \frac{1}{2^{2(K-1)}} \sum_{j=1}^J \frac{V_j^2}{N_j}\, \lambda_j \lambda_j^T,$$
where $\mathrm{Var}_{SP}$ denotes variance over both the random sampling from the super population and the random assignment, and $V_j^2 = E(S_j^2)$ represents the expectation of $S_j^2$ with respect to the distribution of the potential outcomes in the super population. This connection provides further motivation for the form of our optimization, but we focus on the finite-population setting going forward.
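To make the surrogate criterion concrete, the following sketch (our own illustration, with a hard-coded $2^2$ model matrix and hypothetical variances and group sizes) evaluates $V = 2^{-2(K-1)} L^T A L$ and the three criteria for a candidate allocation.

```python
import numpy as np

# Model matrix L for a 2^2 design: rows index treatments 00, 01, 10, 11;
# columns are (lambda_0, main effect 1, main effect 2, interaction).
L = np.array([[1, -1, -1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1,  1,  1,  1]])

def surrogate_V(L, S2, N_j):
    """First (identifiable) term of the covariance in (5):
    V = 2^{-2(K-1)} L^T A L with A = diag(S_j^2 / N_j)."""
    K = int(np.log2(L.shape[0]))
    A = np.diag(np.asarray(S2, float) / np.asarray(N_j, float))
    return (L.T @ A @ L) / 4 ** (K - 1)

def psi(V, criterion):
    """The three optimality criteria as functionals of V."""
    if criterion == "A":
        return np.trace(V)
    if criterion == "D":
        return np.linalg.det(V)
    return np.linalg.eigvalsh(V).max()   # "E": largest eigenvalue

# Balanced allocation with homoscedastic (equal) variances:
V = surrogate_V(L, S2=[1, 1, 1, 1], N_j=[5, 5, 5, 5])
```

With equal variances and a balanced allocation, $V$ reduces to $(S^2/n) I_J$, so all three criteria are minimized simultaneously, in line with Corollary 1.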

Exact optimal designs
The problem of finding an optimal design can be formulated as minimization of an appropriate functional $\psi(V)$ subject to the constraint $\sum_{j=1}^J N_j = N$, or equivalently $\sum_{j=1}^J p_j = 1$ in terms of the proportions $p_j = N_j/N$ of units to be assigned to treatment combination $j$. As discussed earlier, we consider three widely used functionals from the optimal design literature: the D-optimality criterion, where $\psi(V) = |V|$ and $|\cdot|$ denotes the determinant; the A-optimality criterion, where $\psi(V) = \mathrm{tr}(V)$; and the E-optimality criterion, where $\psi(V) = \max\{\nu_1, \ldots, \nu_J\}$ and $\nu_1, \ldots, \nu_J$ are the eigenvalues of $V$. The following theorem, proved in Supplementary Material A, summarizes these optimality results.
Theorem 1. Let $N$ units be allocated to $J$ treatment groups such that a proportion $p_j = N_j/N$ of the units receives treatment $j$. Then, the optimal allocation of the $N$ units to the $J$ treatment groups on the basis of the covariance matrix $V$ under

(a) A-optimality is proportional to the finite-population standard deviations of potential outcomes in the corresponding treatment groups, i.e., $p_j = S_j / \sum_{j'=1}^J S_{j'}$;

(b) D-optimality is the balanced assignment to all $J$ treatment groups, i.e., $p_j = 1/J$;

(c) E-optimality is proportional to the finite-population variances of potential outcomes in the corresponding treatment groups, i.e., $p_j = S_j^2 / \sum_{j'=1}^J S_{j'}^2$.
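The allocations in Theorem 1 are simple functions of guessed variances. A minimal sketch (our own illustration; the variance values below are hypothetical):

```python
import numpy as np

def crd_allocation(S2, criterion="A"):
    """Optimal CRD proportions p_j from Theorem 1, given guessed
    finite-population variances S2 = (S_1^2, ..., S_J^2)."""
    S2 = np.asarray(S2, dtype=float)
    if criterion == "A":        # p_j proportional to S_j
        w = np.sqrt(S2)
    elif criterion == "D":      # balanced: p_j = 1/J
        w = np.ones_like(S2)
    elif criterion == "E":      # p_j proportional to S_j^2
        w = S2
    else:
        raise ValueError(criterion)
    return w / w.sum()

# Hypothetical variances for a 2^2 design:
p_A = crd_allocation([1, 4, 4, 9], "A")   # proportional to (1, 2, 2, 3)
```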
Remark 2. The optimality results in Theorem 1 are similar in spirit to the determination of optimal sample sizes in stratified survey sampling (Cochran 1977a, Ch. 5). The A-optimality result is the same as the Neyman allocation discussed in Blackwell et al. (2022), who motivate its use as reducing the identifiable portion of the finite-population variance of classical factorial estimators.
Derivations of the A- and D-optimal designs are straightforward and use the Lagrange-multiplier-based optimization technique. The proof of the E-optimality result uses the idea of perturbing the eigenvalues of a scaled identity matrix to show that the E-optimal design is indeed characterized by equal eigenvalues of the matrix $V$.
Remark 3. To implement the results of Theorem 1, researchers need to "guess" the variances $S_j^2$, $j = 1, \ldots, J$, and substitute them into the expressions for $p_j$. This is similar to the application of optimal designs in non-linear models, where optimal designs are actually "locally optimal" (Chernoff 1953). It is common practice to conduct pilot studies to obtain preliminary estimates of the $S_j^2$'s, as is done in finite-population survey sampling.
We now introduce two conditions associated with the matrix of potential outcomes under which Theorem 1 can be further simplified.
Condition 1 (Homoscedasticity). We call an $N \times J$ matrix of potential outcomes homoscedastic if each column has the same variance, i.e., $S_j^2 = S^2$ for $j = 1, \ldots, J$.
Condition 2 (Strict additivity). We call an $N \times J$ matrix of potential outcomes strictly additive if the unit-level factorial effects are the same for all units, i.e., $\tau_i = \tau$ for $i = 1, \ldots, N$. Strict additivity implies homoscedasticity, and under it the second term of (5) vanishes.

The following corollary of Theorem 1 is straightforward but useful:

Corollary 1. If the matrix of potential outcomes satisfies Condition 1, then the A-, D-, and E-optimal designs are all balanced designs with $p_j = 1/J$ for $j = 1, \ldots, J$. Further, under Condition 2, optimizations based on $V$ and $V_{\hat{\tau}}$ are equivalent.
While Theorem 1 provides results on exact optimal designs in terms of the proportions $p_j$, experimenters need integer solutions in terms of the $N_j$'s satisfying $\sum_j N_j = N$. For example, while the exact D-optimal design is balanced, $N$ is not necessarily a multiple of $2^K$, and the result does not provide a D-optimal allocation of, say, 69 units to the 8 treatment combinations in a $2^3$ factorial experiment. Thus we need approximate integer solutions to the optimization problem, in which additional constraints on the sample sizes assigned to specific treatment groups can also be introduced. Next, we discuss an integer programming approach to obtain such solutions.
3.2 Computation of exact optimal designs using an integer programming approach

Many sources in the integer programming literature address constrained optimization over integer spaces (see, e.g., Nemhauser and Wolsey 1988; Schrijver 1998; Khan 1995; Sofi et al. 2020). We adopt the methods proposed in Friedrich et al. (2015), which are designed for settings very similar to the ones we consider. Friedrich et al. (2015) consider the following integer programming problem:
$$\min_{N_1, \ldots, N_J} g(N_1, \ldots, N_J) \quad \text{subject to} \quad \sum_{j=1}^J N_j = N, \quad N_j \in \mathbb{Z}^+, \quad l_j \le N_j \le u_j, \quad (7)$$
where $\mathbb{Z}^+$ is the set of positive integers and $(l_j, u_j)$ are lower and upper bound constraints on $N_j$. If the objective function is separable, i.e., can be expressed as $g(N_1, \ldots, N_J) = \sum_{j=1}^J f_j(N_j)$, then the greedy algorithm in Figure 1 finds the globally optimal integer solution of the minimization problem in (7) under some regularity conditions that we verify in Supplementary Material B.
From the proof of Theorem 1 in Supplementary Material A, it follows that the A-optimality and D-optimality criteria can respectively be expressed (up to constants that do not affect the optimization) as $\sum_{j=1}^J (S_j^2/N_j)$ and $\sum_{j=1}^J \log(S_j^2/N_j)$. In Supplementary Material B, we show that the conditions for global convergence of the greedy algorithm are met, and thus convergence to the true integer optimal solution is guaranteed for A- and D-optimality if this algorithm is used. Hence, substituting $S_j^2/N_j$ and $\log(S_j^2/N_j)$ for $f_j(N_j)$ in the algorithm described in Figure 1 leads to optimal integer solutions for the A-optimality and D-optimality criteria, respectively. In our implementation of this algorithm, we take $l_j = 2$ for $j = 1, \ldots, J$ to guarantee that at least two units are assigned to each treatment combination, allowing variance estimation within each treatment group.
We provide a modified greedy algorithm, described in Figure 2, for solving the E-optimality problem, since that optimization problem cannot be written in the separable form used for the A- and D-optimality criteria above. In this algorithm we take $f_j(N_j) = S_j^2/N_j$ and $l_j = 2$. In Section 4.2, the ability of this greedy algorithm to find E-optimal solutions is demonstrated empirically for blocked designs, discussed next.

[Figure 1: Greedy algorithm for separable functions. 1. Set $I \leftarrow \{1, \ldots, J\}$. 2. Let $t = 0$ and initialize $N_j^{(t)} \leftarrow l_j$ for $j = 1, \ldots, J$. 3. While $\sum_j N_j^{(t)} < N$: add one unit to the group in $I$ whose increment most decreases the objective, remove from $I$ any group that reaches its upper bound $u_j$, and set $t \leftarrow t + 1$.]
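A sketch of the two greedy routines follows; it reflects our reading of Figure 1 and, for E-optimality, of the modified step in Figure 2 (the published pseudocode may differ in bookkeeping details). Both start at the lower bounds $l_j = 2$ and allocate one unit at a time, breaking ties by minimum index.

```python
import math

def greedy_separable(S2, N, criterion="A", l=2):
    """Greedy for separable objectives sum_j f_j(N_j) (Figure 1):
    f_j(n) = S_j^2 / n for A-optimality, log(S_j^2 / n) for D-optimality.
    Each step gives the next unit to the group whose f_j decreases most."""
    J = len(S2)
    f = (lambda j, n: S2[j] / n) if criterion == "A" else \
        (lambda j, n: math.log(S2[j] / n))
    N_j = [l] * J
    for _ in range(N - l * J):
        # Largest decrease f_j(N_j) - f_j(N_j + 1); max() keeps the
        # minimum index on ties.
        j_star = max(range(J), key=lambda j: f(j, N_j[j]) - f(j, N_j[j] + 1))
        N_j[j_star] += 1
    return N_j

def greedy_E(S2, N, l=2):
    """Modified greedy for E-optimality (in the spirit of Figure 2):
    give the next unit to the group with the current largest S_j^2 / N_j,
    i.e., the term that binds the maximum-eigenvalue criterion."""
    J = len(S2)
    N_j = [l] * J
    for _ in range(N - l * J):
        j_star = max(range(J), key=lambda j: S2[j] / N_j[j])
        N_j[j_star] += 1
    return N_j
```

For instance, with hypothetical variances $S^2 = (1, 4, 4, 9)$ and $N = 16$, `greedy_separable(..., "A")` returns the allocation $(2, 4, 4, 6)$, matching the exact A-optimal proportions $(1, 2, 2, 3)/8$ of Theorem 1.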

Optimal allocation for factorial experiments with blocks
Consider a block-randomized $2^K$ factorial design with $H$ blocks. That is, units are pre-assigned membership to one of $H$ blocks based on some similarity metric (we do not consider how to form blocks here). Let $M_h$ denote the size of block $h$, $h = 1, \ldots, H$. Also, let $M_{h,j}$ be the number of units in block $h$ assigned to treatment $j$ ($j = 1, \ldots, J$). Finally, let $b_i(h)$, $i = 1, \ldots, N$, be an indicator variable taking value 1 if unit $i$ belongs to block $h$ and 0 otherwise. Treatment assignment under a factorial RBD is equivalent to performing an independent CRD, as described in Section 3, within each block.
The population-level vector of factorial effects $\tau$ can be expressed as $\sum_h M_h \tau_h / N$, where $\tau_h$ is the block-level vector of factorial effects, and its estimator $\hat{\tau} = \sum_h M_h \hat{\tau}_h / N$ is the corresponding weighted average of the $\hat{\tau}_h$, where $\hat{\tau}_h$ is an unbiased estimator of $\tau_h$ defined in the same way as in (4) for block $h$. Extending (5), and noting the independence of the assignment to treatment across blocks, the covariance matrix of $\hat{\tau}_h$ in a factorial RBD can be obtained as
$$V_{\hat{\tau}_h} = \frac{1}{2^{2(K-1)}} \sum_{j=1}^J \frac{S_{h,j}^2}{M_{h,j}}\, \lambda_j \lambda_j^T \;-\; \frac{1}{M_h(M_h - 1)} \sum_{i: b_i(h) = 1} (\tau_i - \tau_h)(\tau_i - \tau_h)^T, \quad (8)$$
where $\lambda_j$ represents the transpose of row $j$ of the model matrix $L$ defined in (1) and $S_{h,j}^2$ denotes the variance of the $M_h$ potential outcomes under treatment $j$ for units in block $h$, with divisor $M_h - 1$.
The covariance matrix of $\hat{\tau}$ can then be expressed as $\mathrm{Cov}(\hat{\tau}) = \mathrm{Cov}\big(\sum_h M_h \hat{\tau}_h / N\big)$. Because the block-level estimators $\hat{\tau}_h$ are independent across blocks, we have $\mathrm{Cov}(\hat{\tau}) = \sum_h (M_h/N)^2 V_{\hat{\tau}_h}$, and, proceeding along similar lines as in Section 3, we formulate a surrogate optimization problem that optimizes only the first term, since the second term in the equation above is not identifiable. Then, choosing the $M_{h,j}$'s to optimize some functional of $\mathrm{Cov}(\hat{\tau})$ is equivalent to optimizing a functional of the matrix
$$V_{blk} = \frac{1}{2^{2(K-1)}}\, L^T A_{blk} L, \quad \text{where } A_{blk} = \mathrm{diag}\big(S_{blk,1}^2, \ldots, S_{blk,J}^2\big), \quad S_{blk,j}^2 = \sum_{h=1}^H \frac{M_h^2}{N^2} \frac{S_{h,j}^2}{M_{h,j}}.$$
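The aggregated diagonal entries $S_{blk,j}^2$ are simple weighted sums over blocks; a small sketch makes the bookkeeping explicit (our own illustration, with made-up variances and a balanced within-block allocation):

```python
import numpy as np

def S2_blk(S2_hj, M_h, M_hj):
    """Diagonal of A_blk: S2_blk[j] = sum_h (M_h/N)^2 * S2[h, j] / M[h, j],
    aggregating the H x J block-by-treatment variances over blocks."""
    S2_hj = np.asarray(S2_hj, float)
    M_h = np.asarray(M_h, float)
    M_hj = np.asarray(M_hj, float)
    N = M_h.sum()
    return ((M_h[:, None] / N) ** 2 * S2_hj / M_hj).sum(axis=0)

# Two blocks of 8 units, balanced within blocks, hypothetical variances:
d = S2_blk([[1, 1, 2, 2], [1, 1, 2, 2]], [8, 8], [[2, 2, 2, 2]] * 2)
```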

Exact optimal designs
The problem of finding an optimal design for RBDs can be formulated as minimization of an appropriate functional $\psi(V_{blk})$ subject to the constraints $\sum_{j=1}^J M_{h,j} = M_h$ for each $h$, or equivalently $\sum_{j=1}^J p_{h,j} = 1$ in terms of the proportions $p_{h,j} = M_{h,j}/M_h$ of units to be assigned to treatment combination $j$ in block $h$. While the A-optimality result is straightforward, finding exact D-optimal and E-optimal solutions in the setting with blocks is difficult without imposing restrictions on the potential outcomes. Before stating the optimality results, we first introduce two such restrictions, which generalize Condition 1 to the block setting.
Condition 3 (Within-block homoscedasticity, WBH). We call an $N \times J$ matrix of potential outcomes in $H$ blocks within-block homoscedastic (WBH) if, within each block, all treatment columns have the same variance, i.e., within block $h$, $S_{h,j}^2 = S_{h\cdot}^2$ for $j = 1, \ldots, J$.
Condition 4 (Between-block homoscedasticity, BBH). We call an $N \times J$ matrix of potential outcomes in $H$ blocks between-block homoscedastic (BBH) if, for each treatment column $j$, the variance of potential outcomes is the same in every block, i.e., $S_{h,j}^2 = S_{\cdot j}^2$ for $h = 1, \ldots, H$ and $j = 1, \ldots, J$.
The following theorem now summarizes the optimality results for blocked designs.
Theorem 2. Let $N$ units be distributed across $H$ blocks such that there are $M_h$ units in block $h$ and $N = \sum_{h=1}^H M_h$. Let each set of $M_h$ units be allocated to the $J$ treatment groups such that a proportion $p_{h,j}$ of the units is allocated to treatment $j$ in block $h$, and let $S_{h,j}^2$ be as defined in (8). Then, the optimal allocation of the $M_h$ units to the $J$ treatment groups on the basis of the covariance matrix $V_{blk}$ under the different optimality criteria can be summarized as follows.
(a) The A-optimal allocation is the same as the A-optimal CRD allocation within each block, i.e., $p_{h,j} = S_{h,j} / \sum_{j'=1}^J S_{h,j'}$ for each $h$.
(b) If either (or both) of Conditions 3 (WBH) and 4 (BBH) holds, the D-optimal allocation is the balanced assignment within each block, i.e., $p_{h,j} = 1/J$ for each $h$.

(c) If Condition 3 (WBH) holds, the E-optimal allocation is the balanced design within each block, i.e., $p_{h,j} = 1/J$ for each $h$.
Remark 4. Condition 2 applied to all $N$ units implies both WBH and BBH. Consequently, by Theorem 2, the D- and E-optimal allocation for strictly additive potential outcomes in randomized block designs is a balanced assignment within each block.
Remark 5. The A-optimal allocation under WBH is the balanced allocation within each block (the same as the D- and E-optimal allocations). Under BBH, however, the A-optimal allocation differs from the D- and E-optimal allocations: it is proportional to the treatment-specific standard deviations $S_{\cdot j}$, which are constant across blocks.
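Theorem 2(a) applies the CRD rule block by block; a minimal sketch (our own illustration, with hypothetical block-by-treatment variances):

```python
import numpy as np

def block_A_allocation(S2_hj):
    """A-optimal within-block proportions p_{h,j} (Theorem 2a): each
    block (row) is allocated proportionally to the standard deviations
    S_{h,j} of its own potential outcomes."""
    S = np.sqrt(np.asarray(S2_hj, float))
    return S / S.sum(axis=1, keepdims=True)

# Hypothetical BBH-style variances (same in both blocks):
P = block_A_allocation([[1, 4, 4, 9], [1, 4, 4, 9]])
```

Under BBH, as in this example, every block receives the same proportions, illustrating Remark 5.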

Computation of exact optimal designs for factorial RBDs using an integer programming approach
As in the case of completely randomized factorial designs, while Theorem 2 provides results on exact optimal designs in terms of the proportions $p_{h,j}$, experimenters need integer solutions in terms of the $M_{h,j}$'s, in which additional constraints on the sample sizes assigned to specific treatment groups can also be introduced. Further, Theorem 2 provides D- and E-optimal solutions only under specific conditions such as WBH and BBH. Thus, we discuss an integer programming approach to obtain integer solutions for settings both covered and not covered by Theorem 2.
We can use the same algorithm as in Figure 1 within each block to obtain the optimal integer solutions for A-optimality under the RBD by replacing the function $f_j(\cdot)$ with $f_{h,j}(M_{h,j}) = (M_h^2/N^2)(S_{h,j}^2/M_{h,j})$. For D- and E-optimality, we extend the greedy idea from the algorithm in Figure 2 with minor changes to the function $f_j(\cdot)$; in particular, we take $f_{h,j}(M_{h,j}) = (M_h^2/N^2)(S_{h,j}^2/M_{h,j})$ for E-optimality. The exact algorithms, which take the structure of the blocks into account, are given in Supplementary Material D. The main difference between the algorithm used in Section 3 and the one proposed here is that we now have to allocate the next best unit at iteration $t$ over an $H \times J$ matrix $((M_{h,j}^{(t)}))$ with upper bounds on the row sums, rather than over a vector $(N_1, \ldots, N_J)$ with an upper bound on the sum of its elements, as in the case of the CRD. Note that, by Friedrich et al. (2015), the greedy algorithm finds the correct solution in the case of A-optimality for the block design. It does not, however, extend to D- and E-optimality in the block case, due to the nature of the objective functions.
We now conduct an empirical exploration of the performance of the greedy algorithm in terms of its ability to find E-optimal solutions. Five different settings of $2^2$ factorial designs in two blocks, each corresponding to a specific type of potential outcome matrix, are considered. Each setting is defined by the block sizes $M_1$ and $M_2$ and the $4 \times 2$ matrix of variances $S_{h,j}^2$, as shown in columns 3 and 4 of Table 2, respectively.
The first setting considers blocks of equal sizes with potential outcomes satisfying Condition 2 (strict additivity), leading to an E-optimal design that is balanced within each block, as per Remark 4. The second setting considers equal block sizes with potential outcomes satisfying Condition 3 (WBH), leading to a balanced design by Theorem 2. In these two settings, the exact optimal designs provide optimal integer solutions. This is not the case in the third setting, which considers unequal block sizes with potential outcomes satisfying Condition 4 (BBH) but not Condition 3 (WBH). Theorem 2 does not apply directly for E-optimality, but the greedy algorithm identifies the unique true E-optimal allocation determined by exhaustive search. The fourth setting is similar to the third, but is one where the exhaustive search provides multiple solutions, identifying four different allocations, each of which is optimal. In this case, Theorem 2 does not apply directly and the greedy algorithm identifies one of these solutions. The fifth setting satisfies neither Condition 3 (WBH) nor Condition 4 (BBH), and consequently Theorem 2 cannot provide an exact E-optimal solution. However, the greedy algorithm identifies one of the two true optimal integer allocations (identified through exhaustive search). A final note on our greedy algorithm: due to the nature of the algorithms in Figures 3 and 4 of Supplementary Material D, ties are broken deterministically using the minimum index when the greedy step returns more than one solution. Thus, our greedy solutions will always return the same allocation for a given set of inputs, regardless of the plurality of optimal solutions (such as those in the fourth and fifth settings above).
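For small settings like these, E-optimal integer allocations can be verified by brute force. The following sketch is our own check (with made-up variances, not the entries of Table 2): it enumerates all within-block integer allocations with $l = 2$ and minimizes the aggregated criterion $\max_j \sum_h (M_h/N)^2\, S_{h,j}^2 / M_{h,j}$.

```python
import itertools

import numpy as np

def exhaustive_E_block(S2_hj, M_h, l=2):
    """Enumerate every integer allocation (M_{h,j} >= l, row h summing
    to M_h) and return the minimum E-type criterion value together with
    all allocations attaining it."""
    S2 = np.asarray(S2_hj, float)
    H, J = S2.shape
    N = sum(M_h)

    def compositions(total, parts, lo):
        # All tuples of `parts` integers >= lo summing to `total`.
        if parts == 1:
            if total >= lo:
                yield (total,)
            return
        for x in range(lo, total - lo * (parts - 1) + 1):
            for rest in compositions(total - x, parts - 1, lo):
                yield (x,) + rest

    best_val, best = None, []
    rows_per_block = [list(compositions(M_h[h], J, l)) for h in range(H)]
    for rows in itertools.product(*rows_per_block):
        A = np.array(rows, float)
        val = ((np.array(M_h)[:, None] / N) ** 2 * S2 / A).sum(axis=0).max()
        if best_val is None or val < best_val - 1e-12:
            best_val, best = val, [rows]
        elif abs(val - best_val) <= 1e-12:
            best.append(rows)
    return best_val, best

# WBH-style check: equal variances, two blocks of 12
val, solutions = exhaustive_E_block([[1, 1, 1, 1]] * 2, [12, 12])
```

In this equal-variance case the balanced within-block allocation is the unique minimizer, consistent with Theorem 2(c).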
A similar exploration performed for the D-optimal allocation (shown in Supplementary Material C) provides evidence that the greedy algorithm can identify the true optimal integer solution when it is unique, and one of the true optimal solutions when multiple optimal integer solutions exist.

Optimal allocation driven by cost constraints
So far, we have considered optimality criteria that are based on the covariance matrix of the estimated factorial effects, implicitly assuming that all treatment combinations are equally expensive (with respect to cost and/or time).However, such assumptions may not be true in many practical situations and cost constraints can play an important role in determining optimal allocation.Thus it is worthwhile to explore solutions to optimal allocation under cost constraints.
We consider the optimal allocation for $2^K$ factorial CRDs. Let the cost of assigning treatment combination $j$ to one unit be $C_j > 0$, and let the total available budget be $C$. In the new optimization problem, we replace the constraint $\sum_j N_j = N$ in the original problem described in Section 3.1 by the cost constraint $\sum_j C_j N_j \le C$. The new optimization problem is therefore:
$$\min \psi(V) \quad \text{subject to} \quad \sum_{j=1}^J C_j N_j \le C, \quad (10)$$
where $V = \sum_{j=1}^J (S_j^2/N_j)\, \lambda_j \lambda_j^T$ and $\psi(V)$ is a functional of $V$. A straightforward approach to incorporate this new constraint into our previous setting is to rewrite the constraint as $\sum_j \tilde{N}_j \le C$, where $\tilde{N}_j = C_j N_j$ is the total cost of the suggested allocation to treatment arm $j$. Under this one-to-one transformation, the optimization problem in (10) is equivalent to minimizing the objective function over the $\tilde{N}_j$ (Boyd and Vandenberghe 2004), and can be written as:
$$\min_{\tilde{N}_1, \ldots, \tilde{N}_J} \psi\Big(\sum_{j=1}^J \frac{C_j S_j^2}{\tilde{N}_j}\, \lambda_j \lambda_j^T\Big) \quad \text{subject to} \quad \sum_{j=1}^J \tilde{N}_j \le C. \quad (11)$$
Because the optimal solution of the above optimization problem is attained at $\sum_j \tilde{N}_j = C$, the inequality constraint can be replaced by the equality constraint. Then, proceeding along the lines of Theorem 1, one can obtain the cost of the optimal allocation to treatment arm $j$ as $\tilde{N}_j \propto S_j \sqrt{C_j}$, $\tilde{N}_j = C/J$, and $\tilde{N}_j \propto S_j^2 C_j$ for the A-, D-, and E-optimal solutions, respectively. These results are formalized in the theorem below in terms of the optimal proportion of the budget allocated to each treatment arm, which can be used to determine the number of units to assign to each arm.
Theorem 3. Let $C$ be the total budget for the whole experiment and let the cost of allocating one experimental unit to treatment $j$ be $C_j > 0$. Let $\pi_j = C_j N_j / C$ denote the proportion of the total budget assigned to treatment $j$, with $\sum_j \pi_j \le 1$. Then, on the basis of the covariance matrix $V$, the

(a) A-optimal cost-based allocation to the $J$ treatment groups is $\pi_j = S_j \sqrt{C_j} \big/ \sum_{j'=1}^J S_{j'} \sqrt{C_{j'}}$;

(b) D-optimal cost-based allocation to the $J$ treatment groups is $\pi_j = 1/J$;

(c) E-optimal cost-based allocation to the $J$ treatment groups is $\pi_j = S_j^2 C_j \big/ \sum_{j'=1}^J S_{j'}^2 C_{j'}$.

Remark 6. Theorem 3 can be extended to the case of block designs along the lines of Theorem 2.
Remark 7. If, for j = 1, …, J, the costs C_j in Theorem 3 are the same and equal to C_0, then the constraint ∑_j C_j N_j ≤ C reduces to ∑_j N_j ≤ N, where N = C/C_0. Also, π_j = C_0 N_j/C ≡ p_j. Thus the optimization problem becomes the same as the one in Theorem 1, making it a special case of Theorem 3.
We use an example to demonstrate the applicability of Theorem 3. Consider a 2² factorial design with C = 100 and per-unit cost vector (C_1, …, C_4) = (0.1, 4, 4, 9). This setup represents a common scenario in which treatment arm 1 is the control group 00 and involves a per-unit cost that is negligible compared to the arms with at least one active treatment, while treatment arm 4 has both treatments at the active level and involves the highest cost. Table 3 shows the A-, D- and E-optimal proportions of total cost, π_j, for two different vectors of variances (S²_1, …, S²_4): in one setting we take the vector to be (1, 1, 1, 1), and in the other we set it to (1, 2, 3, 4).
For the sake of completeness, we also add a column for equal per-unit costs (1, 1, 1, 1), under which the p_j of Theorem 1 and the π_j of Theorem 3 become identical, as explained in Remark 7. Thus, the optimal allocations in the first column of Table 3 can also be derived from Theorem 1 with N = C = 100. One can obtain the A-, D- and E-optimal numbers of units N_j by substituting the optimal π_j from Theorem 3 into N_j = Cπ_j/C_j. However, rounding these optimal N_j to the nearest integers may violate the constraint ∑_j C_j N_j ≤ C. To avoid this possibility, one can take ⌊Cπ_j/C_j⌋ as approximate integer solutions, where ⌊x⌋ denotes the largest integer contained in x.
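To show how entries such as those of Table 3 and the floored integer solutions can be computed, here is a minimal Python sketch (our own illustration; the function names are ours) of the Theorem 3 rules π_j ∝ S_j√C_j (A-optimality), π_j = 1/J (D-optimality) and π_j ∝ S²_j C_j (E-optimality):

```python
import math

def optimal_cost_proportions(S2, C_per_unit):
    """A-, D- and E-optimal budget proportions pi_j from Theorem 3."""
    J = len(S2)
    S = [math.sqrt(v) for v in S2]
    a = [s * math.sqrt(c) for s, c in zip(S, C_per_unit)]  # A: pi_j ∝ S_j sqrt(C_j)
    e = [v * c for v, c in zip(S2, C_per_unit)]            # E: pi_j ∝ S_j^2 C_j
    return ([x / sum(a) for x in a],
            [1.0 / J] * J,                                 # D: equal budget split
            [x / sum(e) for x in e])

def integer_allocation(pi, C_total, C_per_unit):
    """Approximate integer solution N_j = floor(C * pi_j / C_j)."""
    return [math.floor(C_total * p / c) for p, c in zip(pi, C_per_unit)]

C_total = 100.0
costs = [0.1, 4.0, 4.0, 9.0]
pi_A, pi_D, pi_E = optimal_cost_proportions([1, 1, 1, 1], costs)
N_A = integer_allocation(pi_A, C_total, costs)
# The floored allocation never exceeds the budget:
print(sum(n * c for n, c in zip(N_A, costs)) <= C_total)  # True
```

Because the allocations are floored, the realized cost is at most C, matching the discussion above.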
Researchers may decide to impose an additional constraint on optimization problem (11) that forces the sum of the N_j to be exactly equal to a predetermined N. Such a problem would give the optimal allocation under fixed N, unlike Theorem 3. However, imposing this additional constraint may render the set of feasible solutions empty. For example, suppose C_j > C/N for all j. Then ∑_j C_j N_j > ∑_j (C/N) N_j = C whenever the restriction ∑_j N_j = N is imposed, so the allowable cost C is exceeded. Thus, additional conditions are necessary to guarantee that the feasible set is non-empty. Obtaining closed-form solutions under such conditions may not be straightforward, and one may need to rely on numerical methods.
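As a small illustration of the feasibility issue, the following sketch (our own, under the added assumption that every treatment arm must receive at least one unit) checks whether the cheapest allocation of exactly N units fits within the budget:

```python
def fixed_n_feasible(N, C_total, costs):
    """Check whether any allocation with sum_j N_j = N and N_j >= 1
    can satisfy the budget constraint sum_j C_j N_j <= C_total.

    The cheapest such allocation puts one unit in every arm and the
    remaining N - J units in the cheapest arm, so feasibility reduces
    to comparing that minimum cost with the budget.
    """
    J = len(costs)
    if N < J:
        return False
    min_cost = sum(costs) + (N - J) * min(costs)
    return min_cost <= C_total

print(fixed_n_feasible(100, 100.0, [0.1, 4, 4, 9]))  # True: the control arm is cheap
print(fixed_n_feasible(50, 100.0, [2.1, 4, 4, 9]))   # False: every C_j > C/N = 2
```

The second call reproduces the sufficient condition for infeasibility in the text: when C_j > C/N for all j, no allocation with ∑_j N_j = N can respect the budget.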

Applications in real experiments
In this section, we demonstrate applications of the developed results and algorithms to two real-life experiments. First, we revisit the education example from Angrist et al. (2009) described in Section 1. Second, we discuss a pilot audit experiment reported in Libgober (2020), conducted to identify how perceptions of race, gender and affluence affect access to lawyers, and demonstrate how the proposed methodology can be used to design follow-up experiments in similar populations.

Education experiment
In the experiment described in Angrist et al. (2009), the authors use a CRD to allocate N = 1656 units to the 2² treatment combinations. Theorem 1 can directly inform us of the optimal allocation without costs, but there is more structure that we can exploit. For instance, there are potentially two blocks of experimental units, representing female (block 1) and male (block 2) students, with block sizes M_1 = 948 and M_2 = 708, which can be used to improve the design. Theorem 2 gives the optimal designs in this case.
Because there is no prior information about the variances of the potential outcomes (GPAs after year 1), we assume that the variances are equal within and across blocks (Conditions 3 and 4).
Then the optimal allocations under both CRD and RBD, from Theorems 1 and 2 respectively, are shown in Tables 4 and 5. Now let us consider a hypothetical situation in which the number of units N is not prespecified, but there is a budget constraint that depends on the costs associated with the four treatment combinations in this experiment. The treatment combinations 01 (SFP but not SSP) and 10 (SSP but not SFP) each involve the cost associated with one of the two programs. Angrist et al. (2009) report a cost of about $5,000 per student allocated to treatment 10 (SSP). The per-unit cost for treatment combination 01 (SFP) is not mentioned, but if we assume a cost similar to that of SSP, then the cost of allocating a student to treatment combination 11 (SFSP) would be the sum of the individual costs ($10,000). The control, representing treatment combination 00, is probably the cheapest to allocate units to, because it involves only administrative cost, which we assume to be $500. Then, under the original allocation (1106, 250, 250, 150) in the actual experiment, shown in Table 1, the cost of the experiment would be approximately $4.5 million.
Assuming this amount to be our budget constraint C, the A-, D- and E-optimal allocations for two different variance vectors, obtained from Theorem 3, are shown in Table 6. The first row shows the proportions π_j of the total budget allocated to the four treatment arms, and the second shows the corresponding approximate integer solutions N_j = ⌊Cπ_j/C_j⌋.
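As a cross-check, the following sketch (our own, using the dollar figures assumed above) recomputes the cost of the original allocation and a budget-respecting integer allocation under equal variances, via Theorem 3's A-optimal rule π_j ∝ S_j√C_j, which reduces to π_j ∝ √C_j when all S_j are equal:

```python
import math

# Assumed per-unit costs for arms 00, 01, 10, 11 (dollar figures from the text)
costs = [500.0, 5000.0, 5000.0, 10000.0]
original = [1106, 250, 250, 150]          # allocation used in the actual experiment
budget = sum(n * c for n, c in zip(original, costs))
print(budget)  # 4553000.0, i.e. roughly $4.5 million

# A-optimal budget proportions under equal variances: pi_j ∝ sqrt(C_j)
w = [math.sqrt(c) for c in costs]
pi_A = [x / sum(w) for x in w]
N_A = [math.floor(budget * p / c) for p, c in zip(pi_A, costs)]
# The floored integer allocation respects the budget:
print(sum(n * c for n, c in zip(N_A, costs)) <= budget)  # True
```

The same pattern, with the variance vector changed, reproduces the second column of Table 6.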

Audit experiment
Libgober (2020) reported an audit study in which the experimental units were 96 lawyers randomly selected from lawyers in California with a certification in criminal law. Each lawyer in the experiment received an email about a routine 'driving under the influence' (DUI) case (a very common criminal matter). The email template suggested that the person sending the email was (i) either white or black (with a racially distinctive name being used to influence perceived race), (ii) either female or male (again cued via the email sender's name), and (iii) either relatively affluent or relatively low-income (cued via the description of the client's earnings). Thus, this experiment had a 2³ factorial structure. The response was recorded as a binary outcome taking value 1 if there was a response to the email and 0 otherwise. The experiment was replicated with 96 additional lawyers after a certain period of time. The estimated variances s²_j for the j = 1, …, 8 treatment groups, for the individual replicates and their pooled values, are shown in Table 7. If another completely randomized experiment with a sample size of 192 is planned with lawyers selected from a similar pool, then, based on the pooled estimated variances shown in Table 7, we can apply Theorem 1 to obtain the optimal designs given in Table 8. Now assume for illustration that (contrary to fact) instead of two replicates, the original experiment was conducted in two blocks, each block representing one type of lawyer (e.g., criminal and divorce), and suppose we want to obtain optimal allocations within each block for a future experiment. Further suppose that the variance estimates in rows 1 and 2 of Table 7 represent estimates of the variances S²_{h,j} in blocks h = 1 and h = 2, respectively. Then we can directly use part (a) of Theorem 2 to derive the A-optimal design. However, neither WBH nor BBH appears to hold, so we cannot apply parts (b) and (c) of Theorem 2.
Conveniently, we can obtain the D- and E-optimal designs using the greedy search algorithm proposed in Section 4.2.
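As an illustration of how allocations such as those in Table 8 can be produced from Theorem 1, the following sketch (our own; the variance values below are hypothetical placeholders, not the estimates of Table 7) computes A-, D- and E-optimal allocations for a future CRD with N = 192:

```python
import math

def crd_optimal_allocation(N, s2):
    """Theorem 1 allocations for a CRD (before rounding to integers):
    A-optimal N_j ∝ s_j, D-optimal N_j = N/J, E-optimal N_j ∝ s_j^2."""
    J = len(s2)
    s = [math.sqrt(v) for v in s2]
    A = [N * x / sum(s) for x in s]
    D = [N / J] * J
    E = [N * v / sum(s2) for v in s2]
    return A, D, E

# Hypothetical pooled variance estimates for the 8 treatment groups
# (placeholders for illustration only; NOT the values in Table 7).
s2 = [0.25, 0.21, 0.18, 0.24, 0.20, 0.22, 0.19, 0.23]
A, D, E = crd_optimal_allocation(192, s2)
```

The real allocations are then obtained by substituting the pooled estimates from Table 7 for the placeholder vector and rounding, e.g., with the greedy integer procedure of Section 4.2.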

Discussion
In this paper, we consider optimal allocations of a finite population of experimental units to the different treatment combinations of a 2^K factorial experiment under the potential outcomes model. Rather than invoking the standard assumption in the mainstream optimal design literature that outcome data come from a known family of distributions, our work revolves around randomization-based causal inference in the finite-population setting. We find that for 2^K factorial designs with a completely randomized treatment assignment mechanism, D-optimal solutions are always balanced designs, while A- and E-optimal solutions are proportional to the finite-population standard deviations and finite-population variances of the treatment groups, respectively. For blocked designs, our solution does not admit a closed form for D- or E-optimality without imposing specific restrictions on the potential outcomes, but the A-optimal allocation is equivalent to finding the A-optimal solution within each block. We propose convenient integer-constrained programming solutions that use a greedy optimization approach to find integer optimal allocations for both complete and block randomization. Optimal allocations are also derived under cost constraints.
While there is a large literature on model-based optimal designs, to the best of our knowledge, such designs have had very limited development for randomization-based inference in finite populations. The ideas explored and results developed in this paper exploit the connection between finite-population sampling and experimental design. This recondite connection has recently been emphasized, explored, and utilized in various contexts by several researchers in causal inference, as discussed in Mukerjee et al. (2018). This article attempts to further strengthen the bridge between finite-population survey sampling and experimental design by utilizing ideas from proportional and optimal allocation for stratified sampling in the context of optimal designs. While the optimal solutions are derived for a finite-population setting, they are readily applicable to a super-population setting without making any assumptions about the probability distribution of the outcome variable.
A question that practitioners may ask is: which optimal design should be chosen for a given experiment? The answer depends on the research goal of the experimenter. As our results have shown, strong assumptions like strict additivity lead to equivalence of the A-, D- and E-optimal designs.
However, under treatment effect heterogeneity, different criteria will lead to different allocations.
Both the A- and D-optimality criteria are associated with the quality of estimated causal effects: whereas A-optimality minimizes the average variance of the estimators, the D-optimality criterion minimizes the volume of the confidence ellipsoid around the parameters. Some researchers (e.g., Jones et al. 2021) have argued that in model-based settings, A-optimal designs exhibit better performance than D-optimal designs when the objective is screening active effects from inactive ones. On the other hand, when the goal is to draw the most precise inference on the vector of estimated causal effects, a D-optimal design may be the better choice. The goal of the E-optimal design is to minimize the maximum variance among all possible normalized linear combinations of estimated treatment effects.
Thus the E-optimal design is useful when a large number of linear combinations of factorial effects are of interest. The E-optimal allocation, being a minimax strategy, is likely to provide a more conservative solution to the inference problem; but, as shown by some researchers (e.g., Wong 1994) in other contexts, the E-optimal solution may be less sensitive than A- and D-optimal designs to incorrect prior information or assumptions about the potential outcomes. However, more investigation is required along these lines in the randomization-based setting.
The work presented in this paper can be extended in several directions. One limitation of the proposed approach lies in the fact that the correlation among the potential outcomes under different treatment combinations is unidentifiable from the data, forcing us to ignore one term in the covariance matrix of estimated factorial effects when formulating the optimization problem. Basse and Airoldi (2018) proposed a model-based approach to overcome this problem in two-armed experiments in which information on the correlation among the outcomes is available pre-intervention. Such an idea may be extended to the setting of factorial experiments.
Also, a natural extension of the randomization-based framework of causal inference is the Bayesian framework, in which the potential outcomes are assumed to follow a hierarchical probabilistic model containing hyperparameters with assumed prior distributions. The Bayesian framework proposed in Dasgupta et al. (2015) for drawing both super-population and finite-population causal inference from 2^K factorial designs can be utilized to obtain Bayesian optimal designs according to different criteria proposed in the literature (e.g., Chaloner and Verdinelli 1995).
Another setting that has gained a lot of attention in recent times is when SUTVA is violated, for example, in the presence of interference between units.Extending the proposed results to such settings is a challenging, yet rewarding problem.
Finally, in certain situations, it is possible that instead of the traditional factorial effects defined by (3), the experimenter is interested in other contrasts of the treatment means or more general factorial effects. One such natural choice of contrast is one that compares the outcome of the control group with the average of all other groups that have at least one active treatment. Optimal allocations under such a reformulated optimization problem would be an interesting problem to study.

Proof of Theorem 2:
We need the following lemma regarding the eigenvalues of the covariance matrix V_blk defined in (9) under a blocked design.

Lemma 2. The matrix V_blk has J non-zero eigenvalues, J S²_{blk,1}, …, J S²_{blk,J}.
Proof: Along the same lines as Lemma 1, note that V_blk can be written as V_blk = (J^{-1/2} L)ᵀ diag(J S²_{blk,1}, …, J S²_{blk,J}) (J^{-1/2} L). Because the rows of J^{-1/2} L form an orthonormal basis of R^J, the above expression is a spectral decomposition of V_blk. Thus, the eigenvalues of V_blk are the diagonal elements J S²_{blk,1}, …, J S²_{blk,J}. Now we prove the three parts of the main theorem.
Part (a): We can proceed very similarly to the proof of Theorem 1(a). Because the trace of a matrix is the sum of its eigenvalues, Lemma 2 gives tr(V_blk) = J ∑_{j=1}^J S²_{blk,j}. The problem of minimizing tr(V_blk) can therefore be expressed as: minimize ∑_{j=1}^J S²_{blk,j} subject to ∑_{j=1}^J M_{h,j} = M_h for h = 1, …, H. We can again solve this using the method of Lagrange multipliers, minimizing ∑_{j=1}^J S²_{blk,j} + ∑_{h=1}^H λ_h (∑_{j=1}^J M_{h,j} − M_h), where the λ_h are Lagrange multipliers. Recalling that S²_{blk,j} = ∑_{h=1}^H (M_h/N)² (S²_{h,j}/M_{h,j}), taking the partial derivative of the objective function with respect to M_{h,j} and setting it to zero gives −(M_h/N)² (S²_{h,j}/M²_{h,j}) + λ_h = 0, so that M_{h,j} ∝ S_{h,j} within block h. Solving for the constraint ∑_{j=1}^J M_{h,j} = M_h, we get M_{h,j} = M_h S_{h,j} / ∑_{k=1}^J S_{h,k}.

Part (b): Because the determinant of a matrix is the product of its eigenvalues, from Lemma 2 we have det(V_blk) = J^J ∏_{j=1}^J S²_{blk,j}. The problem of minimizing the determinant can thus be equivalently expressed as: minimize ∑_{j=1}^J log(S²_{blk,j}) subject to ∑_{j=1}^J M_{h,j} = M_h for h = 1, …, H. To prove the special cases, we use the same Lagrange-multiplier approach as in (a), minimizing ∑_{j=1}^J log(S²_{blk,j}) + ∑_{h=1}^H λ_h (∑_{j=1}^J M_{h,j} − M_h), where the λ_h are Lagrange multipliers.
Taking the derivative with respect to M_{h,j} and setting it equal to zero yields the first-order conditions. Under special case (i), WBH, the variances of the potential outcomes are the same within each block (S²_{h,j} = S²_{h,·} for all j = 1, …, J). Substituting S²_{h,·} for S²_{h,j} in the first-order conditions and simplifying, we obtain (13). Summing over j in (13) and applying the second constraint ∑_{j=1}^J M_{h,j} = M_h gives (14). Substituting (14) back into (13), and then into the definition of c_j in (13), yields δ_j = c_j / ∑_j c_j = 1/J, so that M_{h,j} = M_h/J and p_{h,j} = 1/J: a balanced design within each block.
Lemma 4. Given integers M_h for h = 1, …, H, let M_{h,j} denote any allocation of M_h into J > 1 groups such that ∑_{j=1}^J M_{h,j} = M_h, and let a_h > 0 for h ∈ {1, …, H} be fixed. Then max_j ∑_{h=1}^H a_h/M_{h,j} ≥ J ∑_{h=1}^H a_h/M_h. That is, the allocation that minimizes max_j ∑_{h=1}^H a_h/M_{h,j} is M_{h,j} = M_h/J.

Proof.
For each h = 1, …, H, using an argument similar to the one in the proof of Theorem 1, suppose there exists an allocation M̃_{h,j} = M_h/J + δ_j^h that attains a strictly smaller value of max_j ∑_{h=1}^H a_h/M_{h,j} than the balanced allocation, where for each h there exists some j with δ_j^h ≠ 0. Then ∑_{j=1}^J M̃_{h,j} = M_h implies ∑_{j=1}^J δ_j^h = 0 for all h = 1, …, H,
which is a contradiction, by taking r = M_h/J and δ_j = δ_j^h for each h in Lemma 3.
Thus, the balanced allocation attains the minimax value. Under the special case WBH, the variances of the potential outcomes are the same within each block (S²_{h,j} = S²_{h,·} for all j = 1, …, J), so that S²_{blk,j} = ∑_{h=1}^H (M_h/N)² (S²_{h,·}/M_{h,j}). We can rewrite this as S²_{blk,j} = ∑_{h=1}^H a_h/M_{h,j}, where a_h = (M_h/N)² S²_{h,·} does not depend on j. By Lemma 4, max_j ∑_{h=1}^H a_h/M_{h,j} ≥ J ∑_h a_h/M_h, and for M*_{h,j} = M_h/J the left-hand side attains this lower bound, which is therefore minimax. Hence p*_{h,j} = 1/J for each treatment j within block h: a balanced design within each block.
Supplementary Material B Conditions for Greedy Algorithm from Friedrich et al. (2015)
Theorem 3.2 of Friedrich et al. (2015) gives conditions that must be satisfied for global convergence of the greedy algorithm to the true optimum. We restate the theorem here for reference.

Supplementary Material D Greedy Algorithm for RBD
Using the methods in Section 3.2, we can use the same algorithm as in Figure 1 to obtain the optimal integer solutions for the A-optimality case under a block design, by taking f_{h,j}(M_{h,j}) = (M²_h/N²)(S²_{h,j}/M_{h,j}). The exact algorithm, taking the structure of the blocks into account, is given in Figure 3.
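A minimal sketch of this greedy idea (our own illustration, not the exact algorithm of Figure 3; the function names are ours) for the blocked A-optimality objective ∑_{h,j} f_{h,j}(M_{h,j}) with f_{h,j}(m) = (M²_h/N²)(S²_{h,j}/m):

```python
import heapq

def greedy_block_allocation(block_sizes, S2, N):
    """Greedy integer allocation for A-optimality under a blocked design.

    Starts with one unit in every (block, treatment) cell (so each M_h is
    assumed to be at least J) and repeatedly gives the next unit to the
    cell whose objective term f_{h,j}(m) = (M_h^2/N^2) * S2[h][j] / m
    decreases the most. Because each term is convex and decreasing in m,
    this unit-step greedy reaches the integer optimum.
    """
    H, J = len(block_sizes), len(S2[0])

    def f(h, j, m):
        return (block_sizes[h] ** 2 / N ** 2) * S2[h][j] / m

    M = [[1] * J for _ in range(H)]  # every cell starts with one unit
    for h in range(H):
        # max-heap (via negated gains) of the marginal decrease from adding a unit
        heap = [(-(f(h, j, 1) - f(h, j, 2)), j) for j in range(J)]
        heapq.heapify(heap)
        for _ in range(block_sizes[h] - J):  # distribute remaining units in block h
            _, j = heapq.heappop(heap)
            M[h][j] += 1
            m = M[h][j]
            heapq.heappush(heap, (-(f(h, j, m) - f(h, j, m + 1)), j))
    return M

# With variances (1, 1, 4, 4) in a single block of 12 units, the allocation
# tracks the closed-form rule M_{h,j} ∝ S_{h,j}, i.e. (2, 2, 4, 4).
print(greedy_block_allocation([12], [[1, 1, 4, 4]], 12))  # [[2, 2, 4, 4]]
```

Each block is handled independently, mirroring the fact that the A-optimal blocked problem decouples across blocks.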

Figure 3: Greedy algorithm for A- & D-optimality

Table 1 :
Allocation of units to treatment combinations

Table 2 :
Summary of Greedy algorithm solutions for E-optimality for H = 2, K = 2

Table 3 :
Optimal π j 's under cost constraints obtained from Theorem 3

Table 7 :
Estimated variances for different treatment groups

Table 8 :
Optimal allocations for future CRD

Table 9 :
Optimal allocations for future RBD

Table 10 :
Summary of Greedy algorithm solutions for D-optimality for H = 2, K = 2