Partitions for stratified sampling

Classical jittered sampling partitions $[0,1]^d$ into $m^d$ cubes for a positive integer $m$ and randomly places a point inside each of them, providing a point set of size $N=m^d$ with small discrepancy. The aim of this note is to provide a construction of partitions that works for arbitrary $N$ and improves on straightforward constructions. We show how to construct equivolume partitions of the $d$-dimensional unit cube with hyperplanes that are orthogonal to the main diagonal of the cube. We investigate the discrepancy of the resulting stratified point sets and optimise the expected discrepancy numerically by relaxing the equivolume constraint using different black-box optimisation techniques.


Introduction
Classical jittered sampling provides $d$-dimensional point sets with small discrepancy in $[0,1]^d$ consisting of $N = m^d$ points for a positive integer $m \ge 2$. The main idea is to partition the unit cube into $N$ axis-parallel cubes of volume $1/m^d$ and to pick a uniform random point from each cube. This construction avoids the local clusters one usually obtains when sampling $N$ i.i.d. uniform random points from $[0,1]^d$. The expected (star) discrepancy of a set $P_N$ of $N$ i.i.d. uniform random points in $[0,1]^d$ is of order $\Theta(\sqrt{d/N})$; see [9] for the first upper bound, [1] for the first upper bound with explicit constant, [5] for the first lower bound, and [8] for the current state of the art in this context. Recently, Doerr [6] determined the precise asymptotic order of the expected star discrepancy of a point set obtained from jittered sampling. This result is our motivation to look for other constructions of partitions. In particular, we aim to find a construction that works for arbitrary $N$ while improving on the expected (star) discrepancy of a set of $N$ i.i.d. uniform random points. In fact, for $1 < p < \infty$, a stratified set derived from a partition into $N > 2$ equivolume sets always has a smaller expected $L_p$-discrepancy than a set consisting of $N$ i.i.d. random points. This strong partition principle was proven in [10, Theorem 1] (see also [12] for a weaker form) and raises the question which partition yields the stratified sample with the smallest mean $L_p$-discrepancy, if such a partition exists. Or, as a more modest goal: which construction works well in general?
1.1. A construction. There are several straightforward ways to partition the $d$-dimensional unit cube $[0,1]^d$ into $N$ sets of equal volume. We could, for example, partition the interval $[0,1]$ into $N$ subintervals of equal length and use this to define $N$ slices of equal volume. In fact, we can partition an arbitrary number of the $d$ generating unit intervals into subintervals of equal length to generate grid-type partitions of the $d$-dimensional unit cube. In every such example it is straightforward to characterise the sets in the partition (since they are axis-parallel rectangles) and to prove that these sets indeed all have the same volume. However, any such construction only works for a restricted number of points and not for arbitrary $N$. The aim of this note is to characterise another family of equivolume partitions that was recently studied in the context of jittered sampling [10]. Given the unit square $[0,1]^2$ and $N-1$ parallel lines $H_i$ with $i = 1, \ldots, N-1$, which are orthogonal to the main diagonal $D$ of the square, we would like to arrange the lines such that we obtain an equivolume partition of the unit square; see Figure 1. We denote the intersection $H_i \cap D$ of a line with the diagonal by $p_i$. It is straightforward to calculate all points $p_i$ for arbitrary $N$. In fact, note that $H_i$ splits the unit square into two sets of volume $i/N$ and $1 - i/N$. If $i \le N/2$, we just need to look at the isosceles right triangle that $H_i$ forms with $(0,0)$. We know that this triangle has volume $i/N$. Denoting the length of the two equal sides by $a_i > 0$, we have $a_i^2/2 = i/N$, and $p_i$ lies at distance $a_i/\sqrt{2}$ from the origin along the diagonal. Therefore, we get that
(1) $p_i = \sqrt{i/N}$
for all $i \le N/2$. By symmetry, we also get the points $p_i$ with $i > N/2$. The first question of this note is how to generalise this simple characterisation of equivolume partitions of the unit square to dimensions $d > 2$.
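This two-dimensional characterisation is easy to check numerically. The following sketch (function names are ours, not the paper's) recomputes the points $p_i$ as distances along the diagonal and verifies that consecutive cutting lines enclose slices of area exactly $1/N$:

```python
import math

def diagonal_points(N):
    # p_i: Euclidean distance from the origin to the point where the
    # i-th cutting line meets the main diagonal of [0, 1]^2.
    pts = []
    for i in range(1, N):
        if 2 * i <= N:
            pts.append(math.sqrt(i / N))                        # triangle at (0, 0)
        else:
            pts.append(math.sqrt(2) - math.sqrt((N - i) / N))   # symmetric triangle at (1, 1)
    return pts

def area_below(p):
    # Area of the part of [0, 1]^2 on the origin side of the line
    # orthogonal to the diagonal at distance p from the origin.
    r = p * math.sqrt(2)            # the line is x_1 + x_2 = r
    if r <= 1:
        return r * r / 2
    return 1 - (2 - r) ** 2 / 2
```

For every $N$, the $i$-th cutting line leaves area exactly $i/N$ below it, so the $N$ slices are equivolume.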
To state this problem formally, we introduce further notation. Let $H_r^+$ be the positive half space defined as the set of all $x \in \mathbb{R}^d$ satisfying
(2) $x_1 + x_2 + \ldots + x_d \ge r$;
accordingly, let $H_r^-$ be the corresponding negative half space and let $H_r$ be the hyperplane of all points $x \in \mathbb{R}^d$ satisfying $x_1 + x_2 + \ldots + x_d = r$. For a given $N$, we would like to find the points $0 < r_1 < \ldots < r_{N-1} < d$ such that the corresponding hyperplanes $H_{r_1}, \ldots, H_{r_{N-1}}$ define a partition of $[0,1]^d$ into $N$ equivolume sets. We call the set $S(N,d) := \{r_i : i = 1, \ldots, N-1\}$ the generating set of the partition. Using (1) we see that $S(N,2) = \{\sqrt{2i/N} : i \le N/2\} \cup \{2 - \sqrt{2(N-i)/N} : i > N/2\}$. Note that for even $N$ the point $\{1\}$ is in the set, while it is not for odd $N$.
1.3. Results. We fully answer Problem 1, providing two different characterisations of the set $S(N,d)$. First, we use a result from [4] to derive an algebraic characterisation.
Theorem 1. For arbitrary $r \in [0,d]$ and corresponding positive halfspace $H_r^+$, let $V_d^+(r) := \mathrm{vol}([0,1]^d \cap H_r^+)$. The parameter $r$ of the positive half space $H_r^+$ such that $V_d^+(r) = V$ is a solution of the polynomial equation
$\frac{1}{d!} \sum_{j=0}^{\lfloor d-r \rfloor} (-1)^j \binom{d}{j} (d - j - r)^d = V,$
which is a polynomial of degree $d$ in $r$ on each interval $\lfloor r \rfloor \le r < \lfloor r \rfloor + 1$. Solving for $r_i$ with $V_d^+(r_i) = 1 - i/N$ and $1 \le i \le N-1$ gives an algebraic characterisation of the set $S(N,d)$. However, this description is not very practical and does not yield much insight into the structure of the set $S(N,d)$, as explained in the subsequent sections. Therefore, we derive a second result which gives a probabilistic, more practical, albeit approximate characterisation of $S(N,d)$.
Theorem 2 states that $V_d^-$, suitably rescaled, converges pointwise to the standard normal CDF $\Phi$ as $d \to \infty$ for all $r > 0$. To get the desired characterisation of the set $S(N,d)$ containing the values $r_i$ such that $V_d^-(r_i) = i/N$, we apply the well-known symmetry of the quantile function, $\Phi^{-1}(1-p) = -\Phi^{-1}(p)$. Therefore, with the above observations, Theorem 2 gives the following probabilistic characterisation:
(5) $r_i \approx \frac{d}{2} + \sqrt{\frac{d}{12}}\,\Phi^{-1}\!\left(\frac{i}{N}\right), \qquad i = 1, \ldots, N-1.$
As a third result, in order to tackle Problem 2, we provide extensive numerical results on the set $S^*(N,d)$ in Section 6 for $d = 2, 3$. Using black-box optimisers, we are able to approximate the sets $S^*(N,2)$ and $S^*(N,3)$ for low values of $N$. We observe that in two dimensions the equivolume stratification $S(N,2)$ improves on uniformly distributed random point sets by a factor of 2 for the $L_2$-discrepancy, in line with results in [10], and that approximations of $S^*(N,2)$ improve on $S(N,2)$ by up to 8% for the smallest values of $N$. In three dimensions, approximations of $S^*(N,3)$ provide a slightly larger improvement of up to 10% over the equivolume stratification.
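Reading (5) as $r_i \approx d/2 + \sqrt{d/12}\,\Phi^{-1}(i/N)$ (our transcription of the characterisation), the approximate generating set can be computed with the Python standard library alone:

```python
from statistics import NormalDist

def approx_generating_set(N, d):
    # Probabilistic approximation of S(N, d): place the i-th hyperplane
    # offset at the i/N-quantile of a normal distribution with mean d/2
    # and variance d/12 (the mean and variance of X_1 + ... + X_d for
    # i.i.d. X_j ~ U[0, 1]).
    phi_inv = NormalDist().inv_cdf
    return [d / 2 + (d / 12) ** 0.5 * phi_inv(i / N) for i in range(1, N)]

S = approx_generating_set(10, 3)
```

Since $\Phi^{-1}(1-p) = -\Phi^{-1}(p)$, the approximation automatically satisfies the symmetry $r_i + r_{N-i} = d$ of the exact generating set.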
1.4. Outline. Section 2 recalls important results and introduces further notation. In Section 3 we illustrate the algebraic method in the special case $d = 3$, outlining the difficulties that occur already in very small dimensions. In Section 4 we prove Theorem 1, while Theorem 2 is proved in Section 5. The final section contains numerical results on the minimal sets $S^*(N,d)$.

Preliminaries
It is a computationally very hard problem to calculate the volume of a convex polytope [7], which has inspired much research on approximation methods as well as on exact calculations. In [4], the authors study the special case in which a polytope is obtained from a hypercube by clipping it with hyperplanes. The resulting formulas do not require heavy machinery but are as technically complicated as they are intriguing.
For our note, the simple case of a hypercube clipped by only one hyperplane is of interest. Our starting point is an interesting formula whose main idea can be traced back to a paper of Pólya [13] and which seems to have appeared for the first time in [2]. We refer to the introduction of [4] for more information.
Theorem 3. Let $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$ with $\prod_{t=1}^d a_t \neq 0$ and let $H_r^+ = \{x \in \mathbb{R}^d : \langle a, x \rangle \ge r\}$. Then we have
$\mathrm{vol}\left([0,1]^d \cap H_r^+\right) = \frac{1}{d!\,\prod_{t=1}^d a_t} \sum_{v \in F_0 \cap H_r^+} (-1)^{|v_0|} \left(\langle a, v \rangle - r\right)^d,$
in which $F_0$ corresponds to the set of vertices of the cube, and $|v_0|$ counts the number of zeros in the vector $v$.
Our problem is in a way the inverse of the main problem studied in [4]. We know the volume of the intersection of a particular half space $H_r^+$ with the unit cube (and we know the normal vector of the hyperplane), and we are interested in the offset $r$. Thus, we can use Theorem 3 to obtain an expression for the parameter $r$ given the volume of the intersection of the cube with the half space. To illustrate the theorem, we look again at the two-dimensional case. For $i \le N/2$ we have that $\mathrm{vol}([0,1]^2 \cap H_{r_i}^+) = 1 - i/N$. Moreover, we know that the three vertices $(1,0)$, $(0,1)$ and $(1,1)$ of the unit square are contained in $[0,1]^2 \cap H_{r_i}^+$. Hence, we obtain
$\frac{1}{2}\left((2 - r_i)^2 - 2(1 - r_i)^2\right) = 1 - \frac{i}{N},$
which can be simplified to $r_i = \pm\sqrt{2i/N}$. This coincides, up to the choice of the correct sign, with the result we already derived in the introduction. Note that even in the two-dimensional case, solving the problem algebraically requires us to pick the desired solution from a range of possibilities.
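The two-dimensional clipped volume $\mathrm{vol}([0,1]^2 \cap H_r^+) = 1 - r^2/2$ for $0 \le r \le 1$ can be sanity-checked by Monte Carlo (an illustration of ours, not part of the paper's derivation):

```python
import random

def clipped_area_mc(r, n=200_000, seed=1):
    # Monte Carlo estimate of vol([0,1]^2 ∩ H_r^+), i.e. the area of
    # the part of the unit square with x_1 + x_2 >= r.
    rng = random.Random(seed)
    hits = sum(rng.random() + rng.random() >= r for _ in range(n))
    return hits / n

# exact value for 0 <= r <= 1: the removed triangle below x_1 + x_2 = r
# has area r^2/2, so the clipped area is 1 - r^2/2
r = 0.6
exact = 1 - r * r / 2
est = clipped_area_mc(r)
```

With 200000 samples, the estimate agrees with the closed form to well within one percent.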

The special case d = 3
To further illustrate the algebraic method, we explicitly solve the special case $d = 3$. This case is still easy to visualise and already shows the difficulties we face beyond the two-dimensional case. Let $[0,1]^3$ be the three-dimensional unit cube with vertices labelled $v_1 = (0,0,0)$, $v_2 = (1,0,0)$, $v_3 = (0,1,0)$, $v_4 = (0,0,1)$, $v_5 = (1,1,0)$, $v_6 = (1,0,1)$, $v_7 = (0,1,1)$ and $v_8 = (1,1,1)$. We ask for a characterisation of the set $S(N,3)$. To be more precise, every point on a hyperplane $H_{r_i}$ satisfies $x_1 + x_2 + x_3 = r_i$, and we are interested in the points $r_i$ for which $\mathrm{vol}([0,1]^3 \cap H_{r_i}^+) = 1 - i/N$. The following lemma characterises the points $r_i$ as roots of polynomials of degree 3.
Lemma 4. For arbitrary $r \in [0,3]$ and corresponding positive halfspace $H_r^+$, the parameter value $r$ such that $V_3^+(r) = V$ can be obtained as a solution of the corresponding polynomial equation:
$(3-r)^3/6 = V$ for $0 \le V < 1/6$,
$\left((3-r)^3 - 3(2-r)^3\right)/6 = V$ for $1/6 \le V < 5/6$,
$1 - r^3/6 = V$ for $5/6 \le V < 1$.
Proof. We use Theorem 3 and consider the three cases separately. We must first remark several important facts that will be useful in this derivation. It can be seen with some small calculations (via Theorem 3) that for $r = 1$ and $r = 2$, the volume of the intersection $[0,1]^3 \cap H_r^+$ is $5/6$ and $1/6$, respectively. Hence, for example, the parameter range $0 < r \le 1$ corresponds to the range $5/6 \le V < 1$ of the volume of intersection. In the first instance, let $0 \le V < 1/6$; the only vertex in $F_0 \cap H_r^+$ is $v_8$, and we see that $V_3^+(r) = (3-r)^3/6$. For the second case, if $1/6 \le V < 5/6$, the vertices $v_5$, $v_6$, $v_7$ and $v_8$ lie in the intersection. Since the vectors $v_5$, $v_6$ and $v_7$ all contain one zero, we have that $g(v_i) = 2 - r$ for $i = 5, 6, 7$. Hence we get $V_3^+(r) = \left((3-r)^3 - 3(2-r)^3\right)/6$. Finally, for $5/6 \le V < 1$ we have all the vertices in the intersection $F_0 \cap H_r^+$ except $v_1$. Note that the vectors $v_2$, $v_3$ and $v_4$ all contain two zero components, hence $g(v_i) = 1 - r$ for $i = 2, 3, 4$. Therefore, $V_3^+(r) = \left((3-r)^3 - 3(2-r)^3 + 3(1-r)^3\right)/6$, which can be rewritten as $V_3^+(r) = 1 - r^3/6$.
Using the last lemma, we obtain the following full characterisation of $r$ for a given $V$: in the outer segments we can solve directly, obtaining $r = 3 - (6V)^{1/3}$ for $0 \le V < 1/6$ and $r = (6(1-V))^{1/3}$ for $5/6 \le V < 1$, while in the middle segment the root of the cubic lying in $[1,2]$ has to be selected. The resulting function is illustrated in Figure 3.
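The three branches of Lemma 4, as we read them, can be spelled out and sanity-checked directly; the critical values $V_3^+(1) = 5/6$ and $V_3^+(2) = 1/6$ and the symmetry $V_3^+(3/2) = 1/2$ fall out immediately:

```python
def vol3_plus(r):
    # Volume of [0,1]^3 ∩ H_r^+ (points with x_1 + x_2 + x_3 >= r),
    # piecewise cubic in r, one polynomial per unit segment.
    if r <= 0:
        return 1.0
    if r <= 1:
        return 1 - r ** 3 / 6
    if r <= 2:
        return ((3 - r) ** 3 - 3 * (2 - r) ** 3) / 6
    if r <= 3:
        return (3 - r) ** 3 / 6
    return 0.0
```

Inverting the outer branches only needs a cube root; the middle branch requires selecting the root of a cubic that lies in $[1, 2]$.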

The general case
We now proceed to the general $d$-dimensional case, emphasising that one of the key ingredients of the formula in Theorem 3 is to determine which vertices lie inside the half space and subsequently to count how many of these vertices contain $k$ zeros, for $1 \le k \le d$. In the context of our problem, i.e. when considering hyperplanes orthogonal to the main diagonal, this can be done in a simple manner, as summarised in Lemma 5 below.
Lemma 5. For $d \ge 2$, let $H_r^+$ be the positive half space for an integer $r = d - k$ with $0 \le k \le d$. Then $F_0 \cap H_{d-k}^+$ consists exactly of the vertices of $[0,1]^d$ with at most $k$ zero components, so that $|F_0 \cap H_{d-k}^+| = \sum_{j=0}^{k} \binom{d}{j}$. Finally, there are $\binom{d}{k-1}$ vertices containing exactly $k-1$ ones, and hence by definition these are precisely the elements in the difference $(F_0 \cap H_{k-1}^+) \setminus (F_0 \cap H_k^+)$.
Using the previous result, we now give a closed-form expression for the volume of the intersection of $[0,1]^d$ and a positive halfspace $H_r^+$ for an integer $r$ with $0 \le r \le d$, which will be used to calculate the critical values of the volume of intersection, $V_d^+(r)$, determining the form of our algebraic equation in the variable $r$.
Lemma 6. For $d \ge 2$, let $r = d-k$ be an integer with $1 \le k \le d$ and let $H_r^+$ be the corresponding positive half space. Define $V_k := F_0 \cap H_{d-k}^+$, the set of vertices with at most $k$ zero components. Then
$V_d^+(d-k) = \frac{1}{d!} \sum_{j=0}^{k} (-1)^j \binom{d}{j} (k - j)^d.$
Proof. By definition, every vertex $v$ of $[0,1]^d$ has $d - |v_0|$ components equal to one, with the rest equal to zero. Hence, applying the formula in Theorem 3 with $a = (1, \ldots, 1)$ and grouping the vertices of $V_k$ by their number of zeros yields the claim. Therefore we obtain a closed formula for the volume of intersection for each integer $0 \le r \le d$. With this, we can finally prove Theorem 1.
Proof of Theorem 1. For a given fixed volume $V$, let $r$ be such that $V_d^+(r) = V$, and let $k$ be the integer with $d - k \le r < d - k + 1$; then we know from Lemmas 5 and 6 that $V_k \subset H_r^+$. Applying a similar argument as in Lemma 6, recalling that $|v_0|$ counts the number of zeros in the vector $v$, the formula from Theorem 3 expresses $V_d^+(r)$ as a polynomial of degree $d$ in $r$, and this can be rearranged to yield the equation in the statement of the theorem.
Remark 7. Commenting on the work completed so far, we can now provide a full characterisation of the generating set $S(N,d)$ for arbitrary $N$ and $d$ via the solutions to degree-$d$ polynomial equations of the form (8). It is worth noting that the Abel-Ruffini theorem tells us that for a general polynomial of degree (or, in our case, dimension) greater than 4, we are not guaranteed an exact closed-form solution. However, we can of course solve for the variable $r$ numerically to an arbitrary degree of accuracy using, for example, Newton's method. As a second remark, the algebraic characterisation unfortunately does not give insight into any underlying structure that the parameter values might hold. Due to these drawbacks, in the next section we derive a second result from which a probabilistic approximation of $S(N,d)$ can be stated, as in Theorem 2.
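As suggested in the remark, the equations can be solved numerically. A minimal sketch of ours, using bisection rather than Newton's method for robustness, and writing the volume via the classical Irwin-Hall CDF (equivalent to the piecewise polynomial equation above):

```python
import math

def vol_plus(r, d):
    # vol([0,1]^d ∩ H_r^+) = 1 - P(X_1 + ... + X_d <= r) for i.i.d.
    # X_j ~ U[0,1]; the probability is the Irwin-Hall CDF.
    cdf = sum((-1) ** k * math.comb(d, k) * (r - k) ** d
              for k in range(int(math.floor(r)) + 1)) / math.factorial(d)
    return 1 - cdf

def solve_offset(V, d, tol=1e-12):
    # Bisection for the r with vol_plus(r, d) = V.  vol_plus is strictly
    # decreasing on [0, d], so bisection is safe, and no closed form is
    # guaranteed for d >= 5 by Abel-Ruffini.
    lo, hi = 0.0, float(d)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if vol_plus(mid, d) > V:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Solving `vol_plus(r, d) = 1 - i/N` for each `i` then produces the generating set $S(N, d)$ to arbitrary accuracy.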

Approximation by normal distribution
5.1. Probabilistic characterisation. The formula in Theorem 1 is a precise characterisation of $r$ and lets us plot the graph of $V_d^+(r)$ for increasing $r$; i.e. knowing that $r \in [0,d]$, we can explicitly calculate the volume $V_d^+(r)$. As pointed out in Remark 7, the inverse calculation is, however, limited by fundamental algebraic obstacles.
This motivates the search for a different route to a neat characterisation of the sets $S(N,d)$ for arbitrary parameters $d$ and $N$. We start from the explicit solutions that we have for $d = 3$; see Figure 3. This plot looks very similar to the quantile function of a normal distribution. In fact, if we interpret the set $S(N,3)$ as a point sample in $[0,3]$, some numerical exploration leads us to conjecture Theorem 2, which we prove in the following.
Since the length of the diagonal grows with $\sqrt{d}$, we re-scale and consider the average $R = \frac{1}{d}(X_1 + \ldots + X_d)$. The value $R$ is introduced in order to allow the application of the central limit theorem later; note, however, that we are actually interested in the sum $X_1 + \ldots + X_d$. The interpretation of the cumulative distribution function $F_R$ of $R$ is as follows: $F_R(t) = \mathbb{P}(R \le t)$ for $0 \le t \le 1$, and $V_d^-(dt) = F_R(t)$. The variables $X_1, \ldots, X_d$ are i.i.d. uniform in $[0,1]$ with mean $\mu = \mathbb{E}X_1 = 1/2$ and variance $\sigma^2 = \mathrm{Var}(X_1) = 1/12$. The central limit theorem states that the normalised variables $Z_d = \sqrt{d}(R - \mu)/\sigma$ converge in distribution to a standard normal as $d \to \infty$. One way to express this convergence is in terms of cumulative distribution functions (CDFs): if $\Phi$ is the CDF of a standard normal, we have $F_{Z_d}(t) \to \Phi(t)$ as $d \to \infty$ for all $t \in \mathbb{R}$ (i.e. $F_{Z_d}$ converges to $\Phi$ pointwise). Combining (9) and (10), we obtain the convergence claimed in Theorem 2, pointwise for all $r > 0$ as $d \to \infty$.
As outlined in the introduction, we can now derive our probabilistic characterisation via the symmetry of the quantile function. The rate of the convergence described by the central limit theorem is customarily measured via the Berry-Esseen theorem [3]; see Appendix A.
5.2. Numerical experiments. As a first numerical experiment, we investigate the convergence in Theorem 2. For $d = 3, 5$ and $10$ and $N = 10000$, we numerically calculate the parameter values $r_i$ ($1 \le i \le 9999$) from Theorem 1 and subsequently use these values to calculate probabilities from the cumulative distribution function in (3). Comparing these probability values at $r_i$ with the desired cumulative volume of intersection, i.e. $i/10000$ for $1 \le i \le 9999$, the error is plotted in Figure 4 for increasing dimension. As expected, the error becomes smaller on average as the dimension increases, showing the desired convergence. Note that the values we obtain in our experiments are much smaller than what is guaranteed by the general bound in (11).
Next, we would like to know how close the partitions generated by the probabilistic characterisation of $S(N,d)$ in (5) are to being equivolume. We fix $d = 5$ and set $N = 100$, $1000$ and $10000$. Using two consecutive parameter values from the normal approximation (5), the volume of a single partitioning set, generated by the probabilistic values for the parameter $r$, can be calculated using (8). The resulting approximate volume for each partitioning set is then compared with the desired volume of $1/N$; see Figure 5. We observe that the error is bounded inside the main body of the cube but much larger at the extreme ends, which indicates that our deterministic distribution is not well approximated by the tails of the normal distribution. However, the number of sets in the partition affected by this large deviation is relatively small. Our numerical results indicate that the relative error is at most about 7% for most of the sets; i.e. if a set is expected to have volume $10^{-k}$, the volume of the approximated set is in the interval $10^{-k} \pm 7 \cdot 10^{-k-2}$. To be more precise, note for example that in the case $d = 5$ and $N = 10000$ we leave the initial segment of the cube, i.e. the range of parameter values $0 \le r \le 1$, at partition set number $84 = \lceil 10000/5! \rceil$.
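The first experiment above can be reproduced in spirit with a short self-contained script (a sketch of ours, not the authors' experimental code), comparing the exact CDF with its normal approximation on a grid:

```python
import math
from statistics import NormalDist

def irwin_hall_cdf(r, d):
    # P(X_1 + ... + X_d <= r) for i.i.d. uniform X_j, 0 <= r <= d;
    # this equals the volume of [0,1]^d below the hyperplane H_r.
    return sum((-1) ** k * math.comb(d, k) * (r - k) ** d
               for k in range(int(math.floor(r)) + 1)) / math.factorial(d)

def max_clt_error(d, grid=2000):
    # Sup-distance between the exact CDF and the normal approximation
    # with mean d/2 and variance d/12, evaluated on an equispaced grid.
    phi = NormalDist(d / 2, math.sqrt(d / 12)).cdf
    return max(abs(irwin_hall_cdf(d * t / grid, d) - phi(d * t / grid))
               for t in range(grid + 1))
```

Since the uniform distribution is symmetric, the error decays noticeably faster than the general Berry-Esseen rate of (11), matching the observation above.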

Conclusions.
For a given $N$, if one is required to partition the $d$-dimensional unit cube into $N$ equivolume slices with partitioning hyperplanes orthogonal to the main diagonal, we suggest proceeding as follows. In the first case, all partitioning hyperplanes $H_r$ lie in the range $1 \le r \le d-1$. As our numerical experiments above confirm, in the main body of the hypercube the normal approximation (5) generates a quasi-equivolume partition. Therefore, the recommendation is to generate the quasi-equivolume partition using Theorem 2. We note that the small errors made in this region should play a limited role in the overall picture, particularly with respect to applications in uniform distribution theory.
In the second case, the partitioning hyperplanes sit in every segment of the cube and we are required to consider the full range $0 \le r \le d$ of the parameter. As in the first case, the normal approximation yields an accurate equivolume partition in the main body of the cube. Our numerical experiments above show that in the extreme segments, i.e. for $0 \le r \le 1$ and $d-1 \le r \le d$, the normal approximation deviates significantly from the desired volumes of the sets. To deal with this inaccuracy, we recall the algebraic formula in Theorem 1. In the case $k = d$ (i.e. in the initial segment $0 \le r \le 1$), the formula reduces to $V_d^+(r) = 1 - r^d/d!$. We suggest solving this simplified equation for $r$ to generate the hyperplanes contained in the initial segment. By symmetry, we can also easily obtain the values of $r$ in the uppermost segment.

Looking for minimal sets
In this section, we use several different black-box optimisers to look for approximations of the minimal sets $S^*(N,2)$ and $S^*(N,3)$, which we then compare to our equivolume stratification. Rather than working with the hyperplane parameters $r_i \in [0,d]$ for $i = 1, \ldots, N-1$ as in the previous sections, we now consider values $0 \le p_1 \le \ldots \le p_{N-1} \le \sqrt{d}$. These values correspond to the Euclidean distances between the origin and the intersection points of the chosen hyperplanes with the diagonal and are more convenient to use with the optimisers. Black-box optimisers generate a number of samples, evaluate the target function for these samples (in this case, the expected $L_2$-discrepancy) and then update some information on the problem. What is updated depends heavily on the type of optimiser: it can be only the best solution found so far, or an underlying distribution which can be used to find better solutions.
Black-box optimisers present two advantages. The first is that directly computing the expected $L_2$-discrepancy of a general stratification is a very difficult task. Recent approaches have always studied square boxes aligned with the axes [11] or used numerical experiments [10]; note that finding the best positions for 3 points was already an involved process, see [10]. Black-box optimisers, in contrast, do not require us to obtain new results on the structure of the function but directly try to find good intersection points for the given setting through a systematic trial-and-error process. The second advantage is that for each intersection point we only need to keep a single parameter, since we know the hyperplanes are orthogonal to the diagonal, regardless of the dimension $d$. We can model the problem as an $(N-1)$-dimensional optimisation problem over $(p_1, \ldots, p_{N-1})$, where each element satisfies the bounds $p_i \in [0, \sqrt{d}]$, together with the ordering constraints $p_1 \le p_2 \le \ldots \le p_{N-1}$. The optimiser is given a budget of 1000 evaluations, where each evaluation consists in approximating the expected $L_2$-discrepancy of the corresponding stratification. To approximate this function, we first sort the parameters to obtain a valid stratification. We then sample a random point in each stratum and compute the $L_2$-discrepancy of the resulting point set. This is repeated 1500 times and averaged to obtain an approximation of the expected $L_2$-discrepancy.
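The text does not spell out how the $L_2$-discrepancy of a sampled point set is evaluated; a standard choice is Warnock's exact formula for the squared $L_2$ star discrepancy, which we sketch here together with a Monte Carlo estimate of the expectation for i.i.d. points (all function names are ours):

```python
import math, random

def l2_star_sq(points):
    # Warnock's formula: exact squared L2 star discrepancy of a point
    # set in [0,1]^d, computed in O(N^2 d) time.
    n, d = len(points), len(points[0])
    s1 = sum(math.prod((1 - x * x) / 2 for x in p) for p in points)
    s2 = sum(math.prod(1 - max(a, b) for a, b in zip(p, q))
             for p in points for q in points)
    return (1 / 3) ** d - 2 * s1 / n + s2 / n / n

def expected_l2_sq_iid(N, d, reps=500, seed=0):
    # Monte Carlo estimate of E[L2^2] for N i.i.d. uniform points; the
    # exact value is (2^-d - 3^-d)/N, a convenient sanity check.
    rng = random.Random(seed)
    tot = 0.0
    for _ in range(reps):
        pts = [[rng.random() for _ in range(d)] for _ in range(N)]
        tot += l2_star_sq(pts)
    return tot / reps
```

Replacing the i.i.d. sampling with one point per stratum gives the estimator of the expected $L_2$-discrepancy of a stratification that the optimiser evaluates.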

Table 1 (column headings): N, R(N, 2), S(N, 2), CMA-ES, ...
All experiments were run in dimensions 2 and 3 in a low-fidelity setting, with a high-fidelity correction at the end. Indeed, since the optimisers track only the best value found so far, it is possible that the returned point sets have a slightly higher discrepancy and that the calculation during the optimiser run was one with high variance. To correct for this phenomenon, we recompute the expected $L_2$-discrepancies of the corresponding point sets, this time with 10000 repetitions, to guarantee better precision. No initialisation is provided to the optimisers and 10 runs are made for each instance.
Note that we could extend our experiments beyond dimension 3. The main difficulty is sampling uniformly from given hypercube slices. In our experiments, we define a slice based on the distance between two successive points $p_i$ and $p_{i+1}$, rotate and translate this slice into the right position, and then sample a point in it. Note that re-sampling may be necessary, as usually not all of the slice is contained in the unit cube; however, this is in general not a problem. For $d = 3$ we use quaternions to define the rotations; in higher dimensions this gets more involved.
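For moderate slab volumes, plain rejection sampling from the cube is a simpler alternative to the rotate-and-translate procedure described above, and it works in any dimension; a sketch (ours, not the implementation used in the experiments):

```python
import random

def sample_slab(r_lo, r_hi, d, rng=None):
    # Uniform point in {x in [0,1]^d : r_lo <= x_1 + ... + x_d <= r_hi}
    # via rejection from the cube.  Correct in any dimension, but the
    # acceptance rate equals the slab volume, so very thin slices are
    # expensive; that is what the rotate-and-translate approach avoids.
    rng = rng or random.Random()
    while True:
        x = [rng.random() for _ in range(d)]
        if r_lo <= sum(x) <= r_hi:
            return x
```

For an equivolume partition into $N$ slices, the acceptance rate is $1/N$ per slice, so the total expected cost of one stratified sample is $O(Nd)$ draws times $N$ slices.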
Both the ioh [24] and Nevergrad [20] Python packages were used for the optimisers; the implementations are taken from the Nevergrad package. Experiments were run in Python, using the random module to generate the points. The density estimations were obtained using the sklearn package, in particular KernelDensity.

Experimental results
Table 1 gives a percentage comparison of the expected $L_2$-discrepancy of the equivolume partition $S(N,2)$, of uniformly distributed random points $R(N,2)$, and of the best distributions found by the different optimisers, using $S(N,2)$ as a baseline. The discrepancy estimation is obtained with 10000 repetitions.
As already empirically observed in [10], the expected $L_2$-discrepancy of the equivolume partition is smaller by a factor of approximately 2 than the discrepancy of uniformly distributed random points for all set sizes tested. All three black-box optimisers return very similar values, which are about 7% better for the lowest values of $N$ after re-evaluation of the expected $L_2$-discrepancy. Since all three optimisers return similar values, this suggests that the point ordering in the problem structure was not an issue, at least for these low values of $N$. The only outliers occur for very low values of $N$, where the (1+1)-ES struggles: this seems to be largely due to the low-fidelity optimisation, as the best discrepancy values found were around 0.025 for $N = 3$ before correction. We also note that for $N = 4$, both the discrepancy values and the point sets returned are quite similar to the improved construction given in [10] (see Table 2 for the point sets). The equivolume partition performs better than the optimisers for $N = 20$, and its relative performance generally improves as $N$ increases.
As mentioned previously, discrepancy values for the point sets returned by the optimisers were recomputed with higher precision. This implies that possibly better point sets were overlooked during the optimisation because of the imprecision in the expected $L_2$-discrepancy calculations. The gradually decreasing gap, both for the returned discrepancy value and the corrected one, also suggests that our optimisers may be struggling with the higher-dimensional problem.
Table 2 gives examples of the obtained point sets (see also Appendix C). Interestingly, all the sets returned by the three optimisers have a similar structure once $N$ is big enough ($N \ge 5$). This structure can be visualised using a kernel density estimate with a Gaussian kernel, as illustrated in Figure 6. The point sets seem to have two clusters of points on the diagonal. The spread of the points also seems bigger: the first point has a smaller value than the first point of the equivolume set, while the last point has a larger value than the corresponding point of the equivolume set.
Turning to dimension 3, Table 3 gives the percentage comparison of the expected $L_2$-discrepancy of the different methods with $S(N,3)$ as the baseline. The results are quite similar to the $d = 2$ case, with all three black-box optimisers returning similar values, all outperforming the equivolume stratification, and with the (1+1)-ES giving the worst results. Once again, we notice that the relative performance of the equivolume strategy improves as $N$ increases. The plots in Figure 7 are similar to the corresponding plots in dimension 2: there are clusters of strata and the spread of the points seems larger. However, the last diagonal point $p_{N-1}$ in the optimisers' sets is not always larger than the last diagonal point of $S(N,3)$. We visualise the point sets again with a kernel density estimate in Figure 8. Overall, while the equivolume stratification provides a much better expected $L_2$-discrepancy than random uniformly distributed points, our results suggest that it can still be improved. Finally, it is important to note that our black-box optimisers seem to return sets with a similar structure, as illustrated in Figure 6 and Figure 8, while the individual points in the respective sets are quite different; see the tables in Appendix C. This suggests a plateau-like structure of the underlying space, which is a blessing and a curse at the same time. On the one hand, we can safely assume that we found almost optimal solutions; on the other hand, it leaves little hope of actually determining the minimal set.
Open questions. This paper presents results on the optimal stratification of $[0,1]^2$ for the class of partitions whose partitioning hyperplanes are orthogonal to the main diagonal of the unit cube. Hence, two questions remain:

Table 3 (column headings): N, R(N, 3), S(N, 3), CMA-ES, ...
(1) how much can one improve upon the expected discrepancy values of the stratified point sets obtained in this text by relaxing some, or all of the constraints on the partitioning lines?
(2) what is the optimal partition (with respect to the expected discrepancy of the resulting stratified point set) of [0, 1] d for d ≥ 2?
A. The Berry-Esseen Theorem.
Theorem 8 (Berry-Esseen Theorem). Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. with mean $\mu$, variance $\sigma^2 > 0$ and third absolute central moment $\rho = \mathbb{E}|X_1 - \mu|^3 < \infty$, and let $Z_n$ denote the normalised sum. Then
$\sup_{t \in \mathbb{R}} \left| F_{Z_n}(t) - \Phi(t) \right| \le \frac{C\rho}{\sigma^3 \sqrt{n}},$
where $\Phi$ is the CDF of the standard normal distribution and $C$ is an absolute constant.
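With the uniform-distribution constants used below ($\mu = 1/2$, $\sigma = 1/(2\sqrt{3})$, $\rho = \mathbb{E}|X_1 - \mu|^3 = 1/32$) and the best known value of $C$, the bound is easy to evaluate; a small sketch:

```python
import math

def berry_esseen_bound(d, C=0.4748):
    # Berry-Esseen bound sup_t |F_{Z_d}(t) - Phi(t)| <= C*rho/(sigma^3*sqrt(d))
    # with the constants for X_j ~ U[0,1]:
    #   mu = 1/2, sigma = 1/(2*sqrt(3)), rho = E|X_1 - mu|^3 = 1/32.
    sigma = 1 / (2 * math.sqrt(3))
    rho = 1 / 32
    return C * rho / (sigma ** 3 * math.sqrt(d))
```

Since $\rho/\sigma^3 = 3\sqrt{3}/4$, the bound simplifies to roughly $0.617/\sqrt{d}$, which is what the experiments in Section 5 compare against.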
The currently best bound for the constant is $C < 0.4748$; see [14]. Hence, noting that for our i.i.d. random variables $X_1, \ldots, X_d$ we have $\mu = 1/2$, $\sigma = 1/(2\sqrt{3})$ and $\rho = 1/32$, the Berry-Esseen theorem shows that the rate of convergence noted in Theorem 2 satisfies
(11) $\sup_{t \in \mathbb{R}} \left| F_{Z_d}(t) - \Phi(t) \right| \le \frac{0.4748\,\rho}{\sigma^3 \sqrt{d}} = \frac{0.4748 \cdot 3\sqrt{3}}{4\sqrt{d}} < \frac{0.617}{\sqrt{d}}.$
B. Different Optimisers. CMA-ES: The Covariance Matrix Adaptation Evolution Strategy, also known as CMA-ES, is a family of heuristic techniques for numerical optimisation. It was initially introduced in [18] (with an extension in [19]), but modifications have been suggested over the years, leading to many different variants (see for example [23] for some popular modifications). We give here a brief description of the general principle; more details can be found in the tutorial paper by Hansen [16] as well as in the more recent presentation by Akimoto and Hansen [15]. In our experiments, we use the Diagonal-CMA version described in [17].
The general principle behind CMA-ES is to introduce a covariance matrix associated with a multivariate Gaussian distribution and to learn second-order information about this distribution. The mean of this distribution represents the current best guess, while the covariance matrix determines the shape of the distribution ellipsoid. At each step, this covariance matrix is used to generate a number of solution candidates. These candidates are evaluated on the function we are trying to optimise and then used to update the mean and covariance matrix of the multivariate Gaussian distribution. The step size for each update is also adapted during the algorithm, by comparing the current path length (how much we change the mean of the distribution) with the expected one if steps were independent. If the path length is shorter, the steps are anti-correlated and the step size should be decreased; if the path length is longer, it should be increased. To update the mean, a weighted sum of part (or all) of the new samples is taken, where weights are higher for better samples and the chosen step size is taken into account. For the covariance, the new covariance matrix depends both on a weighted sum of the new samples and on the evolution path, the sum of past steps for the mean.
NGOpt: The second black-box optimiser we used is NGOpt. An older version was presented in [21] (in which NGOpt8 is called ABBO); more recent modifications have not been published but were incorporated in the NGOpt optimiser available in the Nevergrad Python module. NGOpt is an algorithm selection technique, which selects the best algorithm to run on a problem. It selects the algorithm(s) to run based on features known in advance (dimension of the problem, type of variables, bounds for the variables or the budget, for example) as well as on information obtained during the execution of the algorithms. It can run multiple algorithms in parallel before continuing with only the best one, or run multiple algorithms sequentially. CMA-ES is one of the algorithms that NGOpt can select but, given our problem setting (dimension below 10, budget 1000), it does not call the same version of CMA as our CMA experiment. Given the lack of information on the function we are minimising as well as our lack of budget for hyperparameter tuning and algorithm configuration, using NGOpt's intrinsic algorithm choice is a valuable help in guaranteeing that our chosen optimisers should be good.
In our problem formulation, potential solutions are of the shape $(p_i)_{i \in \{1, \ldots, N-1\}}$, where $i < j$ implies $p_i \le p_j$. CMA-ES and a large part of the algorithms in NGOpt try to learn some relation between the variables. For this, they use several evaluated points and update the sampling distribution. Most of these optimisers do not work well with constraints (they typically sample points until the constraints are satisfied, potentially wasting a lot of time); our approach was therefore to reorder the solution parameters so as to always have a sorted point set. After evaluating the fitness of a sorted candidate point, the optimiser considers this to be the fitness of the non-sorted original point. The distribution update can then be misled by this sorting.
(1+1)-Evolutionary Strategy: To limit the impact of the above problem, we also ran our experiments with a more traditional (1+1)-Evolutionary Strategy (ES). At each step, the current solution is modified by adding a random variable (in our case a Gaussian random variable) to each element of the solution, scaled by a varying step size. The better of the new and old solutions is then kept for the next step. The step size is modified depending on the mutation success, according to the 1/5-th rule [22]: if too many steps are successful, the step size is increased, and it is decreased otherwise. This iterative process continues until we reach the given budget. There are many different versions of (1+1)-Evolutionary Strategies, generally differing in their mutation rates. We use here the Parametrized(1+1) implementation from the Nevergrad package with a Gaussian mutation.
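The (1+1)-ES with the 1/5-th rule described above fits in a few lines; this is an illustrative sketch of ours, not the Nevergrad Parametrized(1+1) implementation used in the experiments:

```python
import random

def one_plus_one_es(f, x0, bounds, budget=1000, step=0.3, seed=0):
    # Minimal (1+1)-ES with the 1/5-th success rule: mutate every
    # coordinate with scaled Gaussian noise, clip to the box bounds,
    # keep the better of parent and offspring, and adapt the step size
    # (factor 1.5 on success, 1.5**(-1/4) on failure, so the step is
    # stable at a 1/5 success rate).
    rng = random.Random(seed)
    lo, hi = bounds
    x, fx = list(x0), f(x0)
    for _ in range(budget):
        y = [min(hi, max(lo, xi + step * rng.gauss(0, 1))) for xi in x]
        fy = f(y)
        if fy <= fx:
            x, fx = y, fy
            step *= 1.5
        else:
            step *= 1.5 ** -0.25
    return x, fx
```

For our problem, `f` would sort the candidate parameters, build the stratification, and return the Monte Carlo estimate of the expected $L_2$-discrepancy; the elitist selection guarantees the returned solution is never worse than the initial one.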
C. Additional Tables. In this last appendix, we collect several additional tables which give further exact numerical results relating to the work in Section 6.

Optimal Point Sets: Table 2 in the main text gives the approximations to the optimal point set $S^*(N, 2)$ for $3 \le N \le 10$ returned by CMA-ES. The tables below show the analogous information returned by the two other optimisers, as well as the results in dimension 3.

Exact Discrepancy Values: Figure 1 in the main text gives a percentage comparison of the expected $L_2$-discrepancies of the various point sets. For completeness' sake, and as a final attachment, Table 9 gives the exact expected discrepancies of the same point sets.

Figure 1. Partition of the unit cube into N = 6 equivolume slices that are orthogonal to the diagonal.

Problem 1. For given dimension $d$ and number of sets $N$, characterise the set $S(N, d)$.

1.2. Optimisation. Our construction is motivated by [10, Examples 2 and 3]. It was shown that, among all partitions of the unit square into two sets of equal volume, the dividing hyperplane orthogonal to the diagonal and passing through the center of the square gives the smallest discrepancy. Furthermore, it was observed that moving the dividing hyperplane along the diagonal, and thus relaxing the equivolume constraint, can further improve the expected discrepancy; see Figure 2. This motivates our second question. For a given $N$, we would like to find points $0 < r_1 < \dots < r_{N-1} < d$ such that the corresponding hyperplanes $H_{r_1}, \dots, H_{r_{N-1}}$ define a partition of $[0, 1]^d$ into $N$ sets minimizing the expected discrepancy of the corresponding stratified sample. We call the set $S^*(N, d) := \{r_i : i = 1, \dots, N - 1\}$ the minimal set of the parameter pair $(N, d)$.

Problem 2. For given dimension $d$ and number of sets $N$, determine the minimal set $S^*(N, d)$.
Figure 2. Left: Best partition into two sets of equal volume, and illustration of the one-parameter family of partitions obtained from moving the hyperplane along the diagonal. Right: The partition of this family with the smallest expected discrepancy.

Figure 3. The x-axis contains the values of $V$, while the y-axis shows the corresponding $r$ for a given volume $V$. The three different colors illustrate the three different cases.

Figure 4. Illustration of convergence to the cumulative distribution function as proved in Theorem 2. The x-axis represents the index $i$ of the set in the partition, with $1 \le i \le 10000$, while the y-axis contains the absolute difference $\left| V_d^-(r_i) - \Phi\!\left(2\sqrt{3d}\left(\tfrac{r_i}{d} - \tfrac{1}{2}\right)\right) \right|$.
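The limit in the caption is the central limit theorem applied to a sum of $d$ independent uniforms: the volume below the hyperplane $x_1 + \dots + x_d = r$ is $P(S_d \le r)$ with $S_d$ of mean $d/2$ and variance $d/12$, and standardising gives $\Phi\!\left(2\sqrt{3d}(r/d - 1/2)\right)$. A Monte Carlo sanity check of this approximation can be sketched as follows (function names are ours):

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def slice_volume_mc(d, r, samples=100_000, seed=1):
    """Monte Carlo estimate of the volume of {x in [0,1]^d : x_1+...+x_d <= r},
    i.e. the probability that a sum of d independent uniforms stays below r."""
    rng = random.Random(seed)
    hits = sum(sum(rng.random() for _ in range(d)) <= r for _ in range(samples))
    return hits / samples

# Compare the exact-in-the-limit CLT value with the Monte Carlo volume.
d, r = 5, 2.8
mc = slice_volume_mc(d, r)
clt = normal_cdf(2 * math.sqrt(3 * d) * (r / d - 0.5))
```

Already at d = 5 the two values agree to roughly two decimal places, which is the convergence Figure 4 quantifies.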

Figure 5. Deviation from $1/N$ of the volume of each partition set generated by the probabilistic values for $r$ in dimension 5. From left to right: $N = 100, 1000, 10000$. The x-axis represents the index $i$ of the set in the partition, while the y-axis shows the deviation of the volume of the $i$-th set in the approximate partition from the desired volume $1/N$.
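One natural way to produce such "probabilistic" values, sketched here as our reading of the construction rather than the paper's exact recipe: invert the limit CDF $\Phi\!\left(2\sqrt{3d}(r/d - 1/2)\right)$ from Theorem 2 at the levels $i/N$, so that each slice receives asymptotic volume $1/N$:

```python
import math
from statistics import NormalDist

def probabilistic_breakpoints(N, d):
    """Breakpoints r_1 < ... < r_{N-1} obtained by inverting the limit CDF
    Phi(2*sqrt(3d)*(r/d - 1/2)) at the levels i/N, so that slice i has
    asymptotic (not exact) volume 1/N -- hence the deviations in Figure 5."""
    inv = NormalDist().inv_cdf
    return [d * (0.5 + inv(i / N) / (2 * math.sqrt(3 * d))) for i in range(1, N)]
```

By construction the breakpoints are symmetric about $d/2$, and applying the limit CDF to $r_i$ returns exactly $i/N$; the actual volumes only match $1/N$ up to the CLT error, which is what the figure plots.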

6.1. Experiment setup. Regardless of the chosen optimiser (see Appendix B for a description of the different algorithms: CMA-ES, NGOpt and (1+1)-ES), the general model is identical. In dimension $d$, we have an $(N-1)$-dimensional problem, where each parameter lies in the interval $[0, \sqrt{d}]$.
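A sketch of what one fitness evaluation could look like under this model. We assume here that a parameter is the distance along the diagonal, so a point $x$ lies in slice $i$ when $r_{i-1} \le (x_1 + \dots + x_d)/\sqrt{d} < r_i$, that slices are sampled by rejection, and that the expected squared $L_2$ star discrepancy is estimated by averaging Warnock's formula over repeated stratified samples; all function names are ours, not the paper's code:

```python
import math
import random

def l2_star_discrepancy_sq(pts):
    """Warnock's formula for the squared L2 star discrepancy of a point set."""
    n, d = len(pts), len(pts[0])
    a = 1.0 / 3 ** d
    b = sum(math.prod((1 - x * x) / 2 for x in p) for p in pts)
    c = sum(math.prod(1 - max(x, y) for x, y in zip(p, q))
            for p in pts for q in pts)
    return a - 2 * b / n + c / n ** 2

def stratified_sample(breaks, d, rng):
    """One uniform point per diagonal slice, via rejection sampling from the cube."""
    edges = [0.0] + list(breaks) + [math.sqrt(d)]
    pts = []
    for lo, hi in zip(edges, edges[1:]):
        while True:
            x = [rng.random() for _ in range(d)]
            if lo <= sum(x) / math.sqrt(d) < hi:
                pts.append(x)
                break
    return pts

def fitness(breaks, d=2, reps=50, seed=0):
    """Monte Carlo estimate of the expected squared L2 discrepancy of the
    stratified sample induced by the sorted breakpoints."""
    rng = random.Random(seed)
    return sum(l2_star_discrepancy_sq(stratified_sample(breaks, d, rng))
               for _ in range(reps)) / reps
```

This is the quantity a candidate $(p_i)$ would be scored on; rejection sampling is only viable for the small $N$ and $d$ of these experiments, since thin slices make it slow.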

Figure 6. Kernel density estimate plots with Gaussian kernels for the best partitioning points obtained by the optimisers in dimension 2, as well as the equivolume sets, for $N = 10, 15, 20$.

Figure 8. Kernel density estimate plots with Gaussian kernels for the best partitioning points obtained by the optimisers in dimension 3, as well as the equivolume sets, for $N = 5, 7, 9, 10, 15, 20$.

Table 1. A percentage change comparison of the expected $L_2$-discrepancies of different point sets with respect to the equivolume stratification: uniformly distributed random points $R(N, 2)$, the equivolume partition corresponding to $S(N, 2)$, and the best sets obtained by the three black-box optimisers. Expected $L_2$-discrepancy values are computed with 10000 repetitions to guarantee better precision. We refer to Table 9 in Appendix C for the exact expected discrepancy values.

Table 2. Optimal partitioning points returned by CMA-ES. Note that we need $N - 1$ hyperplanes for a set of $N$ points.

Table 3. A percentage change comparison of the expected $L_2$-discrepancy values of different point sets, using $S(N, 3)$ as a baseline. Discrepancy values are computed with 10000 repetitions; we refer to Table 10 in Appendix C for the exact expected discrepancy values.