A simple proof of Pitman-Yor's Chinese restaurant process from its stick-breaking representation

For a long time, the Dirichlet process has been the gold standard discrete random measure in Bayesian nonparametrics. The Pitman--Yor process provides a simple and mathematically tractable generalization, allowing for very flexible control of the clustering behaviour. Two commonly used representations of the Pitman--Yor process are the stick-breaking process and the Chinese restaurant process. The former is a constructive representation of the process which turns out to be very handy for practical implementation, while the latter describes the partition distribution it induces. However, the usual proof of the connection between them is indirect and involves measure theory. We provide here an elementary proof of Pitman--Yor's Chinese restaurant process from its stick-breaking representation.

The most prominent role of the Pitman--Yor process is perhaps in Bayesian nonparametric statistics, where it is used as a prior distribution following the work of Ishwaran and James (2001). Applications in this setting embrace a variety of inferential problems, including species sampling (Favaro et al., 2009; Navarrete et al., 2008; Arbel et al., 2017), survival analysis and graphical models in genetics (Jara et al., 2010; Ni et al., 2018), image segmentation (Sudderth and Jordan, 2009), curve estimation (Canale et al., 2017), exchangeable feature allocations (Battiston et al., 2018), and time series and econometrics (Caron et al., 2017; Bassetti et al., 2014). Last but not least, the Pitman--Yor process is also employed in the context of nonparametric mixture modeling, thus generalizing the celebrated Dirichlet process mixture model of Lo (1984). Nonparametric mixture models based on the Pitman--Yor process are characterized by a more flexible parameterization than the Dirichlet process mixture model, thus allowing for better control of the clustering behaviour (De Blasi et al., 2015). In addition, see Ishwaran and James (2001), Favaro and Walker (2013), and Arbel et al. (2018) for posterior sampling algorithms, Scricciolo et al. (2014) and Miller and Harrison (2014) for asymptotic properties, and Scarpa and Dunson (2009) and Canale et al. (2017) for spike-and-slab extensions.
The Pitman--Yor process has the following stick-breaking representation: P is distributed according to the Pitman--Yor process, PY(\alpha, d, H), with concentration parameter \alpha > -d, discount parameter d \in [0, 1), and base distribution H, if

P = \sum_{j=1}^{\infty} \pi_j \delta_{\phi_j}, \qquad \pi_j = v_j \prod_{l=1}^{j-1} (1 - v_l), \qquad v_j \stackrel{ind}{\sim} \mathrm{Beta}(1-d, \alpha+jd), \qquad \phi_j \stackrel{iid}{\sim} H, \qquad (1)

where the sticks (v_j)_{j\ge1} are independent of the atoms (\phi_j)_{j\ge1}. The Pitman--Yor process induces the following partition distribution: if P \sim PY(\alpha, d, H) for some nonatomic probability distribution H, we observe data x_1, \ldots, x_n \mid P \stackrel{iid}{\sim} P, and C is the partition of the first n integers \{1, \ldots, n\} induced by the data, then

P(\mathbf{C} = C) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha+n)} \, \frac{d^{|C|-1}\,\Gamma(\alpha/d+|C|)}{\Gamma(\alpha/d+1)} \prod_{c \in C} (1-d)_{(|c|-1)}, \qquad (2)

where (x)_{(m)} = x(x+1)\cdots(x+m-1) denotes the rising factorial and the multiplicative factor before the product in (2) is also commonly (and equivalently) written as \bigl(\prod_{i=1}^{|C|-1} (\alpha + id)\bigr)/(\alpha+1)_{(n-1)} in the literature. When the discount parameter d is set to zero, the Pitman--Yor process reduces to the Dirichlet process and the partition distribution (2) boils down to the celebrated Chinese restaurant process (CRP, see Antoniak, 1974). By abuse of language, we call the partition distribution (2) Pitman--Yor's CRP. Under the latter partition distribution, the number of parts in a partition C of n elements, k_n = |C|, grows to infinity as a power law of the sample size, n^d (see Pitman, 2003, for details). This Pitman--Yor power-law growth is more in tune with most empirical data (Clauset et al., 2009) than the logarithmic growth, \alpha \log n, induced by the Dirichlet process CRP.
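As a sanity check of the partition distribution (2), its probabilities must sum to one over all set partitions of [n]. The sketch below (function names are ours) verifies this exactly with rational arithmetic, using the equivalent product form \prod_{i=1}^{|C|-1}(\alpha+id)/(\alpha+1)_{(n-1)} \times \prod_{c \in C}(1-d)_{(|c|-1)}:

```python
from fractions import Fraction

def set_partitions(elements):
    """Enumerate all set partitions of a list, recursively."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # Put `first` in its own new part...
        yield [[first]] + partition
        # ...or insert `first` into each existing part.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i+1:]

def rising(x, m):
    """Rising factorial (x)_(m) = x (x+1) ... (x+m-1)."""
    out = Fraction(1)
    for i in range(m):
        out *= x + i
    return out

def py_eppf(parts, alpha, d, n):
    """Probability (2) of a given set partition under PY(alpha, d, H)."""
    k = len(parts)
    prob = Fraction(1)
    for i in range(1, k):
        prob *= alpha + i * d
    prob /= rising(alpha + 1, n - 1)
    for c in parts:
        prob *= rising(1 - d, len(c) - 1)
    return prob

alpha, d, n = Fraction(3, 2), Fraction(1, 4), 5
total = sum(py_eppf(p, alpha, d, n) for p in set_partitions(list(range(n))))
print(total)  # 1: the probabilities sum to one over the 52 partitions of [5]
```

Exact `Fraction` arithmetic is used deliberately: floating point would only show the sum is close to one, not that (2) is a bona fide probability distribution.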
The purpose of this note is to provide a simple proof of Pitman--Yor's CRP (2) from its stick-breaking representation (1) (Theorem 2.1). This generalizes the derivation by Miller (2018), who obtained the Dirichlet process CRP (Antoniak, 1974) from Sethuraman's stick-breaking representation (Sethuraman, 1994). In doing so, we also provide the marginal distribution of the allocation variables vector (3) in Proposition 2.2.

2 Partition distribution from stick-breaking
Suppose we make n observations, z_1, \ldots, z_n. We denote the set \{1, \ldots, n\} by [n]. Our observations induce a partition of [n], denoted C = \{c_1, \ldots, c_{k_n}\}, where c_1, \ldots, c_{k_n} are disjoint sets with \cup_{i=1}^{k_n} c_i = [n], in such a way that i and j belong to the same part if and only if z_i = z_j. We denote the number of parts in the partition C by k_n = |C|, and the number of elements in part c_j by |c_j|. We use bold font to represent random variables.
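The map from observations to the induced partition can be made concrete; a minimal helper (name is ours), grouping indices by equal values:

```python
def induced_partition(z):
    """Partition the indices 1..n of the observation sequence z by equal values:
    i and j fall in the same part iff z[i-1] == z[j-1]."""
    parts = {}
    for i, value in enumerate(z, start=1):
        parts.setdefault(value, []).append(i)
    # Order parts by first appearance so the output is deterministic.
    return sorted(parts.values(), key=lambda part: part[0])

print(induced_partition([7, 3, 7, 1, 3]))  # [[1, 3], [2, 5], [4]]
```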

Theorem 2.1. Let the allocation variables z = (z_1, \ldots, z_n) be defined by

z_1, \ldots, z_n \mid (\pi_j)_{j\ge1} \stackrel{iid}{\sim} \sum_{j=1}^{\infty} \pi_j \delta_j, \qquad (3)

where (\pi_j)_{j\ge1} are the stick-breaking weights of (1), so that P(z_i = j \mid (\pi_l)_{l\ge1}) = \pi_j, and let \mathbf{C} denote the random partition of [n] induced by z_1, \ldots, z_n. Then \mathbf{C} is distributed according to Pitman--Yor's CRP (2).

The proof of Theorem 2.1 follows the lines of Miller (2018)'s derivation. We need the next two technical results, which we will prove in Section 3. Let C_z denote the partition of [n] induced by z, for any z \in \mathbb{N}^n, and let k_n be the number of parts in the partition. We define m(z) = \max\{z_1, \ldots, z_n\} and g_j(z) = \#\{i : z_i \ge j\}.
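Given the sticks, the allocation variables can be sampled lazily: observation i walks down the sticks, accepting stick j with probability v_j, so that P(z_i = j \mid v) = v_j \prod_{l<j}(1-v_l) as in the stick-breaking construction. A minimal sketch (function name is ours), drawing each stick v_j \sim Beta(1-d, \alpha+jd) only when first needed:

```python
import random

def sample_allocations(n, alpha, d, rng):
    """Draw z_1,...,z_n with P(z_i = j | v) = v_j * prod_{l<j} (1 - v_l),
    sampling sticks v_j ~ Beta(1-d, alpha + j*d) lazily as they are needed."""
    sticks = []   # v_1, v_2, ... sampled so far
    z = []
    for _ in range(n):
        j = 0
        while True:
            j += 1
            if j > len(sticks):
                sticks.append(rng.betavariate(1 - d, alpha + j * d))
            if rng.random() < sticks[j - 1]:   # take stick j w.p. v_j, else walk past it
                break
        z.append(j)
    return z

rng = random.Random(0)
print(sample_allocations(10, alpha=1.0, d=0.25, rng=rng))  # ten positive integer labels
```

The sequential accept/reject walk terminates with probability one since the stick weights sum to one almost surely.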
Proposition 2.2. For any z \in \mathbb{N}^n, the marginal distribution of the allocation variables vector z = (z_1, \ldots, z_n) is given by

P(\mathbf{z} = z) = \frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} \prod_{j=1}^{m(z)} (1-d)_{(e_j)} \, \frac{\alpha+(j-1)d}{g_j(z)+\alpha+(j-1)d},

where e_j = \#\{i : z_i = j\}.

Lemma 2.3. For any partition C of [n], summing the marginal distribution of Proposition 2.2 over \{z \in \mathbb{N}^n : C_z = C\} yields the right-hand side of (2).

Proof of Theorem 2.1. By definition, \mathbf{C} = C_{\mathbf{z}}, so that P(\mathbf{C} = C) = \sum_{z \in \mathbb{N}^n : C_z = C} P(\mathbf{z} = z), which equals (2) by Lemma 2.3.
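The marginal of the allocation variables can be checked numerically. The sketch below (all function names are ours) computes P(z = z) in two ways with exact rational arithmetic: as the product of Beta moments \prod_j (1-d)_{(e_j)} (\alpha+jd)_{(g_{j+1})} / (\alpha+(j-1)d+1)_{(g_j)} obtained from the stick expectations, and as the telescoped closed form \Gamma(\alpha)/\Gamma(\alpha+n) \prod_j (1-d)_{(e_j)} (\alpha+(j-1)d)/(g_j+\alpha+(j-1)d) as reconstructed in Proposition 2.2, and asserts that they agree:

```python
from fractions import Fraction

def rising(x, m):
    """Rising factorial (x)_(m) = x (x+1) ... (x+m-1)."""
    out = Fraction(1)
    for i in range(m):
        out *= x + i
    return out

def counts(z, j):
    """Return e_j = #{i: z_i = j}, g_j = #{i: z_i >= j}, f_j = #{i: z_i > j}."""
    e = sum(1 for zi in z if zi == j)
    g = sum(1 for zi in z if zi >= j)
    return e, g, g - e

def marginal_moment_form(z, alpha, d):
    """P(z = z) as the product over j of the Beta moments
    E[v_j^{e_j} (1-v_j)^{f_j}] with v_j ~ Beta(1-d, alpha + j*d)."""
    prob = Fraction(1)
    for j in range(1, max(z) + 1):
        e, g, f = counts(z, j)
        prob *= rising(1 - d, e) * rising(alpha + j * d, f) / rising(alpha + (j - 1) * d + 1, g)
    return prob

def marginal_closed_form(z, alpha, d):
    """The telescoped closed form of the marginal."""
    prob = Fraction(1) / rising(alpha, len(z))   # Gamma(alpha) / Gamma(alpha + n)
    for j in range(1, max(z) + 1):
        e, g, _ = counts(z, j)
        prob *= rising(1 - d, e) * (alpha + (j - 1) * d) / (g + alpha + (j - 1) * d)
    return prob

alpha, d = Fraction(3, 2), Fraction(1, 4)
for z in [(1, 1, 1), (1, 2, 1, 3), (2, 5, 2, 1, 1)]:
    assert marginal_moment_form(z, alpha, d) == marginal_closed_form(z, alpha, d)
print("both forms agree")
```

Again, exact fractions make the agreement an identity check rather than a floating-point approximation.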
3 Proofs of the technical results

3.1 Additional lemmas
We require the following additional lemmas.

Lemma 3.1. Let X \sim \mathrm{Beta}(a, b) and let e, f be nonnegative integers. Then

E[X^e (1-X)^f] = \frac{B(a+e, b+f)}{B(a, b)},

where B denotes the beta function.

Proof. Writing the expectation as an integral against the \mathrm{Beta}(a, b) density,

E[X^e (1-X)^f] = \frac{1}{B(a,b)} \int_0^1 x^{a+e-1} (1-x)^{b+f-1} \, dx = \frac{B(a+e, b+f)}{B(a, b)}.
Let S_{k_n} denote the set of k_n! permutations of [k_n]. The following lemma is key for proving Lemma 2.3.

Lemma 3.2. Let d \in [0, 1), let n_1, \ldots, n_{k_n} be positive integers, and for \sigma \in S_{k_n} let a_i(\sigma) = n_{\sigma_i} + \cdots + n_{\sigma_{k_n}}. Then

\sum_{\sigma \in S_{k_n}} \prod_{i=1}^{k_n} \frac{1}{a_i(\sigma) - (k_n - i + 1)d} = \prod_{i=1}^{k_n} \frac{1}{n_i - d}. \qquad (4)

Proof. Consider the process of sampling without replacement k_n times from an urn containing k_n balls. The balls have sizes n_1 - d, \ldots, n_{k_n} - d, and the probability of drawing ball i is proportional to its size n_i - d. Thus, for any permutation \sigma \in S_{k_n}, the probability that the balls are drawn in the order \sigma_1, \ldots, \sigma_{k_n} is

p(\sigma) = \prod_{i=1}^{k_n} \frac{n_{\sigma_i} - d}{a_i(\sigma) - (k_n - i + 1)d},

since the total size of the balls remaining before the i-th draw is \sum_{l=i}^{k_n} (n_{\sigma_l} - d) = a_i(\sigma) - (k_n - i + 1)d. This constructs a probability distribution on S_{k_n}, so \sum_{\sigma \in S_{k_n}} p(\sigma) = 1. Dividing each term by (n_{\sigma_1} - d) \cdots (n_{\sigma_{k_n}} - d) = (n_1 - d) \cdots (n_{k_n} - d), which does not depend on \sigma, gives (4).
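The urn argument yields the identity \sum_{\sigma} \prod_{i=1}^{k_n} 1/(a_i(\sigma) - (k_n - i + 1)d) = \prod_i 1/(n_i - d). A minimal exact check with rational arithmetic (the part sizes and d below are chosen arbitrarily; the function name is ours):

```python
from fractions import Fraction
from itertools import permutations
from math import prod

def lemma_3_2_lhs(sizes, d):
    """Sum over permutations sigma of prod_i 1/(a_i(sigma) - (k-i+1)*d),
    where a_i(sigma) is the tail sum n_{sigma_i} + ... + n_{sigma_k}."""
    k = len(sizes)
    total = Fraction(0)
    for sigma in permutations(range(k)):
        term = Fraction(1)
        tail = sum(sizes)                 # a_1(sigma) = n_1 + ... + n_k
        for i, idx in enumerate(sigma):   # i = 0 corresponds to the first draw
            term /= tail - (k - i) * d    # (k - i) = k_n - i + 1 in 1-based indexing
            tail -= sizes[idx]
        total += term
    return total

sizes = [3, 1, 4, 2]                      # part sizes n_1, ..., n_k
d = Fraction(1, 3)
lhs = lemma_3_2_lhs(sizes, d)
rhs = Fraction(1) / prod(Fraction(n) - d for n in sizes)
print(lhs == rhs)  # True
```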

Proof. We show, by induction decreasing from j = k_n down to j = 0, that the identity (5) holds. When j = k_n, the claimed identity follows from Lemma 3.1, which proves the initialization of (5).

We now consider the case of an arbitrary j with 0 \le j < k_n. By the induction hypothesis, Equation (5) holds for j + 1. Rearranging the rising factorials in the numerator, we can factorize the terms that do not depend on b_j and take them out of the sum over b_j. The remaining sum over b_j can be rewritten as an expectation with respect to a Beta random variable X with first parameter \alpha/d + b_{j-1} + (k_n - j), which can then be evaluated in closed form by Lemma 3.1.

Putting this all together proves the desired result for j. By induction, the result holds for all j \in \{0, 1, \ldots, k_n\}.

3.2 Proof of Proposition 2.2 and Lemma 2.3
Proof of Proposition 2.2. For simplicity, we fix the allocation variables vector to a value z and denote m(z) by m and g_j(z) by g_j. Writing e_j = \#\{i : z_i = j\} and f_j = \#\{i : z_i > j\}, we have, by the independence of the sticks,

P(\mathbf{z} = z) = E\Bigl[\prod_{i=1}^{n} \pi_{z_i}\Bigr] = E\Bigl[\prod_{j=1}^{m} v_j^{e_j} (1-v_j)^{f_j}\Bigr] = \prod_{j=1}^{m} E\bigl[v_j^{e_j} (1-v_j)^{f_j}\bigr].

Thus

P(\mathbf{z} = z) \stackrel{(a)}{=} \prod_{j=1}^{m} \frac{B(1-d+e_j, \alpha+jd+f_j)}{B(1-d, \alpha+jd)}
\stackrel{(b)}{=} \prod_{j=1}^{m} (1-d)_{(e_j)} \frac{\Gamma(\alpha+jd+g_{j+1}) \, \Gamma(\alpha+(j-1)d+1)}{\Gamma(\alpha+jd) \, \Gamma(\alpha+(j-1)d+1+g_j)}
\stackrel{(c)}{=} \prod_{j=1}^{m} (1-d)_{(e_j)} \frac{\alpha+(j-1)d}{g_j+\alpha+(j-1)d} \cdot \frac{\Gamma(\alpha+jd+g_{j+1}) \, \Gamma(\alpha+(j-1)d)}{\Gamma(\alpha+jd) \, \Gamma(\alpha+(j-1)d+g_j)}
\stackrel{(d)}{=} \frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} \prod_{j=1}^{m} (1-d)_{(e_j)} \frac{\alpha+(j-1)d}{g_j+\alpha+(j-1)d},

where step (a) follows from Lemma 3.1, step (b) since f_j = g_{j+1} and g_j = e_j + f_j, step (c) since \Gamma(x + 1) = x\Gamma(x), and step (d) since the product of Gamma ratios telescopes, using g_1 = n and g_{m+1} = 0.
Proof of Lemma 2.3. As before, we denote the parts of C by c_1, \ldots, c_{k_n}, and we let k_n = |C| and n_i = |c_i|. We denote the distinct values taken on by z_1, \ldots, z_n by j_1 < \cdots < j_{k_n}. We define b_i = j_i - j_{i-1} for i \in \{1, \ldots, k_n\}, with the convention j_0 = 0, and write \bar{b}_i = b_1 + \cdots + b_i = j_i. We use the notation a_i(\sigma) = n_{\sigma_i} + \cdots + n_{\sigma_{k_n}}, where \sigma is the permutation of [k_n] such that c_{\sigma_i} = \{\ell : z_\ell = j_i\}. Then, for any z \in \mathbb{N}^n such that C_z = C, the marginal probability P(\mathbf{z} = z) of Proposition 2.2 depends on z only through (\sigma, b), because g_j(z) = a_i(\sigma) for \bar{b}_{i-1} < j \le \bar{b}_i. It follows from the definition of b = (b_1, \ldots, b_{k_n}) and \sigma that there is a one-to-one correspondence between \{z \in \mathbb{N}^n : C_z = C\} and \{(\sigma, b) : \sigma \in S_{k_n}, b \in \mathbb{N}^{k_n}\}. Therefore, the sum of P(\mathbf{z} = z) over \{z : C_z = C\} can be computed as a double sum over \sigma \in S_{k_n} and b \in \mathbb{N}^{k_n}: the inner sum over b is evaluated in closed form in step (a), which follows from Lemma 3.3, and the resulting sum over permutations \sigma is evaluated in step (b), which follows from Lemma 3.2, yielding the right-hand side of (2).