Sampling and interpolation of cumulative distribution functions of Cantor sets in [0, 1]


Cantor sets are constructed by iteratively removing sections of intervals. This process yields a cumulative distribution function (CDF), constructed from the invariant Borel probability measure associated with their iterated function systems. Under appropriate assumptions, we identify sampling schemes of such CDFs, meaning that the underlying Cantor set can be reconstructed from sufficiently many samples of its CDF. To this end, we prove that two Cantor sets have almost-nowhere intersection with respect to their corresponding invariant measures.


Introduction
A Cantor set is the result of an infinite process of removing sections of an interval ($[0, 1]$ in this paper) in an iterative fashion. The set itself consists of the points remaining after the removal of intervals specified by two parameters: the scale factor $N$ and the digit set $D$. The positive integer $N$ determines how many equal intervals each extant segment is divided into per iteration, while $D \subset \{0, \dots, N-1\}$ enumerates which of the $N$ intervals of each segment will be preserved in each iteration. Equivalently, a Cantor set is the subset of $[0, 1]$ consisting of numbers that have a base-$N$ expansion using only digits from $D$. Yet another description of Cantor sets is as the invariant set of an iterated function system (IFS), which will be our view in this paper.
Each Cantor set yields a cumulative distribution function (CDF), which we define formally in Definition 1.2. We denote the class of all such CDFs by $\mathcal{F}$. We consider the problems of sampling and interpolation of functions in $\mathcal{F}$. By sampling, we mean the reconstruction of an unknown function $F \in \mathcal{F}$ from its samples $\{F(x_i)\}_{i \in I}$ at known points $\{x_i\}_{i \in I}$ in its domain (for an introduction to sampling theory, see [1,2]). By interpolation, we mean the construction of a function $F \in \mathcal{F}$ that satisfies the constraints $F(x_i) = y_i$ for a priori given data $\{(x_i, y_i)\}_{i \in I}$. Note that the premise of the sampling problem is that there is a unique $F \in \mathcal{F}$ that satisfies the available data, whereas the interpolation problem may not have the uniqueness property. Depending on the context, $I$ can be either finite or infinite.
To be more precise regarding sampling CDFs, we formulate the problem as follows: Fix $\mathcal{G} \subset \mathcal{F}$. For which sets of sampling points $\{x_i\}_{i \in I}$ does the following implication hold:
$$F, G \in \mathcal{G}, \quad F(x_i) = G(x_i) \ \text{for all } i \in I \quad \Longrightarrow \quad F = G. \tag{1}$$
In the case where (1) holds, we call $\{x_i\}$ a set of uniqueness for $\mathcal{G}$. Our main results in the paper concerning sampling include the following. In Theorem 2.5, we prove that if $\mathcal{G}$ consists of all CDFs for Cantor sets with unknown scale factor $N$, but the scale factor is known to be bounded by $K$, then there exists a set of uniqueness of size $O(K^3)$. We show that when the scale factor $N$ is known, there exists a set of uniqueness of size $N - 1$ that satisfies the implication in (1). We conjecture that there is a minimal set of uniqueness of size $\lfloor N/2 \rfloor$ and prove in Proposition 2.5 that a set of uniqueness cannot be smaller. We also provide evidence for our conjecture in Theorem 2.2 by considering a conditional sampling procedure (meaning that the sampling points are data dependent) that can uniquely identify the CDF from $\lfloor N/2 \rfloor$ samples. Additionally, in Section 2.2, we include an interpolation procedure as an imperfect reconstruction of a CDF from samples, and provide an upper bound on the error of the reconstruction via interpolation.
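To make the notion of a set of uniqueness concrete, the following Python sketch checks by brute force whether a finite set of points separates all CDFs with a fixed scale factor $N$. It is an illustration under stated assumptions, not the method of the paper: the helper names (`cdf`, `valid_vectors`, `is_set_of_uniqueness`) are hypothetical, each Cantor set is encoded by a 0/1 vector `B` of length $N$ marking the retained digits, the invariant measure is assumed to give equal weight to every retained subinterval, and the CDF is evaluated through its self-similarity, truncated after a fixed depth.

```python
from fractions import Fraction
from itertools import product

def cdf(B, x, depth=40):
    """Approximate the CDF of the Cantor set encoded by B at x in [0, 1].

    Assumes equal weights on retained subintervals; the truncation error
    is at most sum(B) ** (-depth).
    """
    N, weight = len(B), sum(B)
    x, value, scale = Fraction(x), Fraction(0), Fraction(1)
    for _ in range(depth):
        k = min(int(x * N), N - 1)   # subinterval [k/N, (k+1)/N] containing x
        scale /= weight
        value += scale * sum(B[:k])  # mass of retained subintervals left of x
        if not B[k]:                 # subinterval k was removed: CDF is flat here
            return value
        x = x * N - k                # zoom into the retained subinterval
    return value + scale * x         # replace the unknown tail value by x

def valid_vectors(N):
    """All 0/1 vectors of length N with between 2 and N - 1 ones."""
    return [B for B in product((0, 1), repeat=N) if 2 <= sum(B) <= N - 1]

def is_set_of_uniqueness(points, N, tol=1e-9):
    """Return True if the samples at `points` differ for every pair of vectors."""
    seen = []
    for B in valid_vectors(N):
        sample = [cdf(B, x) for x in points]
        for other in seen:
            if all(abs(s - t) < tol for s, t in zip(sample, other)):
                return False
        seen.append(sample)
    return True

print(is_set_of_uniqueness([Fraction(1, 2)], 3))
print(is_set_of_uniqueness([Fraction(1, 3)], 3))
```

Under these assumptions, the single point $1/2$ separates the three admissible vectors for $N = 3$, whereas the point $1/3$ does not, since the CDFs for $(1,1,0)$ and $(1,0,1)$ both equal $1/2$ there.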

Cantor sets and their CDFs
There are many ways to construct Cantor sets, and consequently many ways to denote a Cantor set. The Cantor sets we consider in this paper are those that correspond to restricted digit sets. Thus, our set is
$$C_{N,D} = \left\{ \sum_{k=1}^{\infty} \frac{a_k}{N^k} : a_k \in D \right\}.$$
Equivalently, the digit set can be encoded by a vector $\vec{B} = (b_0, \dots, b_{N-1}) \in \{0,1\}^N$ with $b_n = 1$ if $n \in D$ and $b_n = 0$ otherwise. The vector $\vec{B}$ is referred to as the binary digit vector, and we denote the Cantor set determined by $\vec{B}$ as $C_{\vec{B}}$. In this sense, both $C_{N,D}$ and $C_{\vec{B}}$ can be used to describe a Cantor set, and we naturally associate $(N, D)$ with its corresponding $\vec{B}$. Note that in this work, all indexing starts with zero, so that $b_0$ is the first entry of the vector $\vec{B}$. In addition, certain Cantor sets are considered degenerate. In particular, $C_{\vec{B}}$ is not considered when the set is empty, a one-point set, or $[0, 1]$. Under this definition, there does not exist a Cantor set with $N < 3$ or with $\|\vec{B}\|$ equal to 0, 1, or $N$. For an example of a legitimate Cantor set, $C_{(1,0,1)}$ is the well-known ternary Cantor set (Figure 1). We also provide an illustration of the iterative construction of the Cantor set corresponding to $\vec{B} = (1, 1, 0, 1)$ (Figure 2). Another description of the Cantor sets we consider is as the invariant set for an (affine) IFS.
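The iterative construction can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `cantor_intervals` is hypothetical, and the binary digit vector is passed as a tuple of 0s and 1s whose nonzero positions are the retained digits. Exact rational arithmetic is used so that interval endpoints are not affected by rounding.

```python
from fractions import Fraction

def cantor_intervals(B, n):
    """Closed intervals that remain after n iterations for the 0/1 vector B."""
    N = len(B)
    digits = [k for k, b in enumerate(B) if b == 1]  # retained positions
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        refined = []
        for left, right in intervals:
            width = (right - left) / N
            # keep only the subintervals whose index is a retained digit
            for k in digits:
                refined.append((left + k * width, left + (k + 1) * width))
        intervals = refined
    return intervals

# First two iterations for the vector (1, 1, 0, 1)
for level in (1, 2):
    print(level, cantor_intervals((1, 1, 0, 1), level))
```

After $n$ iterations there are $\|\vec{B}\|^n$ intervals of length $N^{-n}$ each, so the total length $(\|\vec{B}\|/N)^n$ tends to zero.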
Note that the CDF of any of our Cantor sets is continuous. When convenient, we will extend $F_{\vec{B}}$ to all of $\mathbb{R}$ by , where $m$ is Lebesgue measure.
The Cantor ternary set is the invariant set for the IFS consisting of the maps $x \mapsto x/3$ and $x \mapsto (x+2)/3$. The corresponding CDF is often referred to as the "Devil's staircase," and the invariant measure on the Cantor ternary set is the pullback of Lebesgue measure onto the Cantor set under the CDF.
The Cantor sets we consider in this paper are sometimes referred to as "thin" Cantor sets [4]. They have Lebesgue measure 0; indeed, the Hausdorff dimension of $C_{\vec{B}}$ is $\log \|\vec{B}\| / \log N < 1$.
Next, we describe an algorithm for approximating the CDF of a Cantor set. To be precise, we recursively define a sequence of piecewise linear functions $\{f_n\}$ which converges uniformly to the desired CDF. For this, we need the following definition.
We can define a sequence of piecewise linear functions that approximate a CDF in the following manner. For the Cantor set $C_{\vec{B}}$ with cumulative digit function $g_{\vec{B}}$, we define $F_{\vec{B}}^{(1)}$ as the linear interpolation of the where the limit converges uniformly on $[0, 1]$.
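A sequence of piecewise linear approximations of this kind can be sketched as follows. Since the precise definitions are not reproduced here, the sketch rests on labeled assumptions: the cumulative digit function is taken to be $g(k) = b_0 + \dots + b_{k-1}$, the zeroth iterate is the identity on $[0,1]$, and each refinement applies the self-similarity $f_{n+1}(x) = \bigl(g(k) + b_k f_n(Nx - k)\bigr)/\|\vec{B}\|$ for $x \in [k/N, (k+1)/N]$, which holds for the CDF of the equal-weight invariant measure. The function names are hypothetical.

```python
from fractions import Fraction

def cumulative_digits(B):
    """g(k) = b_0 + ... + b_{k-1} for k = 0, ..., N (assumed definition)."""
    g = [0]
    for b in B:
        g.append(g[-1] + b)
    return g

def approx_cdf(B, x, n):
    """n-th piecewise linear iterate at x in [0, 1], starting from the identity."""
    x = Fraction(x)
    if n == 0:
        return x
    N, g = len(B), cumulative_digits(B)
    weight = g[-1]                       # number of retained digits
    k = min(int(x * N), N - 1)           # subinterval [k/N, (k+1)/N] containing x
    # On a removed subinterval the iterate is constant; otherwise recurse.
    inner = approx_cdf(B, x * N - k, n - 1) if B[k] else 0
    return Fraction(g[k] + inner, weight)

# Values of the second iterate for the ternary vector (1, 0, 1)
for x in (Fraction(1, 9), Fraction(1, 3), Fraction(1, 2), Fraction(1)):
    print(x, approx_cdf((1, 0, 1), x, 2))
```

With these choices the first iterate is the linear interpolation of the points $(k/N, g(k)/\|\vec{B}\|)$, and the sup-norm distance between consecutive iterates shrinks by the factor $1/\|\vec{B}\|$ at each step, which gives uniform convergence.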

Operations on Cantor sets and IFSs
For convenience, we define several operations on Cantor sets and their associated CDFs and IFSs. We recall the Kronecker product of two vectors: let , is the CDF whose binary digit vector is We can define a Kronecker product on digit sets to retain the association of 2 .
Definition 1.5. (Kronecker product of digit sets) The Kronecker product of two digit sets $D_1$ and $D_2$, denoted $D_1 \otimes D_2$, is defined to be the Kronecker product of their associated binary digit vectors. That is, the scale factor of
Some assorted definitions and notations. We let and we will write
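As a small illustration of the Kronecker product on binary digit vectors, the following Python sketch computes the product with the standard convention in which each entry of the first vector multiplies a full copy of the second. The helper names `kron` and `digit_set` are hypothetical and not taken from the paper.

```python
def kron(B1, B2):
    """Kronecker product of two 0/1 vectors; the result has length len(B1) * len(B2)."""
    return tuple(b1 * b2 for b1 in B1 for b2 in B2)

def digit_set(B):
    """Positions of the nonzero entries of a binary digit vector."""
    return {k for k, b in enumerate(B) if b}

B = (1, 0, 1)
BB = kron(B, B)
print(BB)
print(sorted(digit_set(BB)))
```

With this convention, a digit $d_1$ of the first vector and a digit $d_2$ of the second produce the digit $d_1 N_2 + d_2$ at scale factor $N_1 N_2$. For the ternary vector, the product has digits $\{0, 2, 6, 8\}$ in base 9, which are exactly the two-digit base-3 strings using only 0 and 2, so it describes the same ternary Cantor set.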

Related results
The results we obtain in this paper are the first of their kind, as far as we are aware. However, sampling of functions that are associated with fractals has been considered previously in various ways. Sampling of functions with fractal spectrum was first investigated in [6,7]. In those papers, the authors consider the class of functions $F$ which are the Fourier transforms of functions $f \in L^2(\mu)$. Here, the measure $\mu$ is a fractal measure that is spectral, meaning that the Hilbert space $L^2(\mu)$ possesses an orthonormal basis of exponential functions. Similar sampling theorems are obtained in [8] without the assumption that the measure is spectral. In higher dimensions, graph approximations of fractals (such as the Sierpinski gasket) are often considered; sampling of functions on such graphs has been considered in [9,10].
Sampling of CDFs appears in [11,12] in the context of the cumulative distribution transform (CDT). The CDT is nonlinear and can provide better separation for classification problems. Sampling of CDFs occurs in the discretization of the CDT. Related results on interpolation of data using fractal functions and IFSs can be found in [13,14]. Approximating the moments of the Cantor function is investigated in [15].
A much more general construction of CDFs, and approximations thereof, can be found in [16]. Sampling of probability distributions on Cantor-like sets is considered in [17,18].

Preliminary theorems
The first lemma of this section is a useful invariance identity for the CDF.
Proof. This follows nearly immediately from Theorem 1.1; however, we present the proof anyway. Observe, Hence under a change-of-variables Then by the invariance equation (2),
0, , is a cumulative digit function for some valid CDF if and only if the following criteria are met: . By induction, it follows that $g$ is the cumulative digit function of $\vec{B}$ by definition. □

Kronecker product results
We define
Proposition 2.2. Consider Cantor sets $C_{\vec{B}_1}$ and $C_{\vec{B}_2}$ such that the scale factor and binary digit vector for

. This occurs if and only if
. This is the IFS for scale factor $N_1 N_2$ and binary digit vector
Proof. Since $F_{\vec{B}}$ is uniquely determined by $C_{\vec{B}}$, and $C_{\vec{B}}$ is uniquely determined by the property that
Since $D^{\otimes n}$ was defined to retain its association with
be a binary digit vector with cumulative digit function $g_{\vec{C}}$, and $g_{\vec{B} \otimes \vec{C}}$ be the cumulative digit function for
Proof. The proof follows by induction on $j$. It follows that the identity holds for all $k$ when $j = 0$. This serves as the base case for induction on $j$. Now assume the identity for $j$. Then,
Proof. Fix the sequence $\{n_i\} \subset \mathbb{N}$. We have, by Corollary 2.2 and Lemma 2.2, for all $j$
For an inductive base case, by Proposition 2.2, Thus, by induction, for all $j$ and $n_i$ . Further, by considering equivalent fractions, we may assume for all $n$, that $x_n = a_n / N$ and $y_n =$ . We construct the binary digit vector $\vec{B}$ of length $N$ as follows: Then, we observe the recurrence relation,
, be a finite sampling set of rational pairs in the unit cube satisfying the hypotheses of Proposition 2.4. We note from the proof of the proposition that interpolation by a CDF is not unique.
Proof. We may assume without loss of generality that < < <⋯< <

Sampling
We first show that if we know the scale factor $N$, then $N - 1$ well-chosen sample points are enough to determine the CDF uniquely; that is, there is a set of $N - 1$ points that is a set of uniqueness for $\mathcal{G}_N$.
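The idea that $N - 1$ samples suffice for a known scale factor can be illustrated with a short Python sketch. It assumes, as a hypothetical choice not confirmed by the text above, that the sample points are $k/N$ for $k = 1, \dots, N-1$ and that the CDF satisfies $F(k/N) = g(k)/\|\vec{B}\|$, where $g(k) = b_0 + \dots + b_{k-1}$. Under that assumption, each increment $F((k+1)/N) - F(k/N)$ equals $b_k / \|\vec{B}\|$, so a positive increment identifies a retained digit. The function name `digits_from_samples` is illustrative.

```python
from fractions import Fraction

def digits_from_samples(samples):
    """Recover the 0/1 vector B from the values F(1/N), ..., F((N-1)/N).

    Assumes F(0) = 0, F(1) = 1 and F(k/N) = g(k) / ||B||.
    """
    values = [Fraction(0)] + [Fraction(s) for s in samples] + [Fraction(1)]
    # A positive jump between consecutive grid points means the digit is retained.
    return tuple(1 if values[k + 1] > values[k] else 0
                 for k in range(len(values) - 1))

# Samples of the CDF for the vector (1, 1, 0, 1) at 1/4, 2/4, 3/4
print(digits_from_samples([Fraction(1, 3), Fraction(2, 3), Fraction(2, 3)]))
```

The value at $x = 1$ does not need to be sampled because every CDF in the class equals 1 there, which is why $N - 1$ interior points are enough in this sketch.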
We will now consider the case when we do not know the scale factor.

Motivating a bound on scale factor
Remark 2.1 and Corollary 2.4 together establish that finite samples will never suffice without some sort of constraint. We contrast this with Proposition 2.6, which shows that a lower bound of $O(N)$ points is necessary, where $N$ is the scale factor. The following proposition shows that to be able to uniquely determine a CDF with a finite number of points, there must be a bound on the scale factor.
Proof. First, we note that there exists an integer $i$ such that
Case 2. $j < i$. The argument is analogous to the one given for Case 1, and we omit the details.
Proof. Since each CDF involved is continuous, it suffices to show the equality on a dense subset of the unit interval. Specifically, we show the identity on the set of $N$-adic numbers, that is, numbers of the form $a/N^k$ with integers $k \ge 1$ and $0 \le a \le N^k$, where $N$ is the length of $\vec{B}$. We first observe that the simplest case, when $k = 1$, holds. We proceed by induction on the power of the $N$-adic number, assuming the identity is true for $k$. Then, by Lemma 2.1, the identity also holds for $k + 1$. Thus, the identity holds on the $N$-adic numbers, and the proof is done. □
We say that a sampling algorithm is conditional if previously attained samples inform the selection of the next sample. For the remainder of this section, we describe a conditional sampling algorithm that completely determines a binary digit vector $\vec{B}$ given its scale factor $N$. The algorithm as stated below requires at most $\lfloor N/2 \rfloor$ samples to execute successfully, which we note is the minimum number of samples required under non-conditional sampling to discern binary digit vectors of equal scale factor. We first state the result.
Then there is a conditional sampling algorithm with at most $\lfloor N/2 \rfloor$ points that completely determines $F_{\vec{B}}$.
The conditional sampling algorithm that answers Theorem 2.2 is located in the Appendix and split into two parts. Each part considers pairs of digits from $\vec{B}$ at a time, e.g. $(b_0, b_1)$, $(b_2, b_3)$, etc. The role of Algorithm 1 is to find the first nonzero digit of $\vec{B}$. As a consequence of the method, we can also find $\|\vec{B}\|$ from the sampling in Algorithm 1. Then the algorithm terminates if the first nonzero digit occurred in the last pair, i.e.,
The case when $\ell = 0$ immediately follows from Lemma 2.2 since
To prove identity (3) in general, we proceed by induction, so assume that the identity holds for $\ell$. Then, by Lemma 2.1, we find ; else
Using some basic algebra, we note that for $\ell \ge 1$, Thus, it suffices to find an integer $L$ such that for $\ell \ge L$, The simplest way to find such an $L$ is to take the smallest positive integer $L$ such that > − It follows that we can then determine the parameters ( .
In the validation of Algorithm 2, it is equivalent to consider the three situations: As for situation 1, we just showed that we may solve for Rearranging and combining terms, we find We conclude that cases 2 and 3 are distinct from dividing through by
As the base case, when $L = 1$, This proves Now we will induct on $M$. For the base case, when $M = 1$, can be represented as N LM long row vectors. This gives an equivalent definition of the Kronecker product on matrices. Since , which is a contradiction. Therefore, $c = 1$, and when $x \in S$. Suppose that L M , and g LM be the cumulative digit function. Therefore, . By Theorem 2.4, $b_k = 1$, so by Lemma 2.3, for $j < N^M$, and by Proposition 2.2, Further, by switching $L$ and $M$ above,

Almost nowhere intersection of Cantor sets
We will use the fact that different Cantor sets have almost no intersection, i.e., the intersection has measure 0 under either of the invariant measures, to design sampling schemes. Intersections of Cantor sets have been extensively studied; see, e.g., [20,21]. We prove here the property of the intersection of Cantor sets that we need.
Proof. Adapted from Lemma 3 of [24]. Note that $e(\lambda x)$ is defined and continuous on the interval $0 \le x \le 1$. Let $Z_t$ be the set of non-negative integers less than $N^t$ containing only digits in $D$ in their base-$N$ expansion. Therefore, by the invariance equation as applied to the push-forward measure ( ) where $m$ is Lebesgue measure, we can calculate The details of the above calculation are given in [24].
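The set $Z_t$ used in the proof can be enumerated directly. The following Python sketch is an illustration only; it assumes that each integer is written with exactly $t$ base-$N$ digits (padding with leading zeros, which matters when $0 \notin D$), and the function name `Z` is chosen here for convenience.

```python
from itertools import product

def Z(t, N, D):
    """Integers below N**t whose t base-N digits (with padding) all lie in D."""
    values = []
    for digits in product(sorted(D), repeat=t):
        z = 0
        for d in digits:      # Horner evaluation of the base-N expansion
            z = z * N + d
        values.append(z)
    return values

# Ternary Cantor set: N = 3, D = {0, 2}
print(Z(2, 3, {0, 2}))
```

Under this convention $Z_t$ has $|D|^t$ elements, and the points $z / N^t$ for $z \in Z_t$ are the left endpoints of the intervals that remain after $t$ iterations of the construction.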
Our proof of the following theorem is adapted from [24], and a much stronger result has already been proven in [25]. See also [26] for a related result regarding the entropy of multiplication by an integer on $\mathbb{R}/\mathbb{Z}$. Note that normality is a stronger condition than necessary to show that an element is not in any Cantor set with scale factor $N$. In fact, it is enough that every digit appears at least once in its base-$N$ expansion.
Remark 2.3. CDFs equivalent to $F_{\vec{B}}$ will not be eliminated by the algorithm described in Theorem 2.5, which only eliminates CDFs that do not pass through all the points. Thus, the algorithm will produce all equivalent CDFs with scale factor less than $K$, which includes the CDF with the smallest possible scale factor, and so the smallest possible scale factor can be determined.