A classification method for binary predictors combining similarity measures and mixture models

: In this paper, a new supervised classification method dedicated to binary predictors is proposed. Its originality is to combine a model-based classification rule with similarity measures thanks to the introduction of a new family of exponential kernels. Some links are established between existing similarity measures when applied to binary predictors. A new family of measures is also introduced to unify some of the existing literature. The performance of the new classification method is illustrated on two real datasets (verbal autopsy data and handwritten digit data) using 76 similarity measures.


Introduction
Supervised classification aims to build a decision rule able to assign an observation x in an arbitrary space E with unknown class membership to one of L known classes C_1, . . . , C_L. For building this classifier, a learning dataset {(x_1, y_1), . . . , (x_n, y_n)} is used, where an observation is denoted by x_i ∈ E and y_i ∈ {1, . . . , L} indicates the class membership of x_i, i = 1, . . . , n.
Model-based classification assumes that the predictors {x_1, . . . , x_n} are independent realizations of a random vector X on E and that the class conditional distribution of X is parametric. When E = R^p, among the possible parametric distributions, the Gaussian is often preferred and, in this case, the marginal distribution of X is therefore a mixture of Gaussians. Estimation of model parameters can be achieved by maximum likelihood, see [29]. Some extensions dedicated to high-dimensional data include [6,8,9,30,31,33,34]. Although model-based classification is usually appreciated for its multiple advantages, it is often limited to quantitative data. Numerous recent works have focused on non-Gaussian distributions such as the skew normal [43], asymmetric Laplace [16], t [1,15] or skew t [27,28,45] distributions.
Only a few works handle categorical data, using multinomial [12] or Dirichlet [5] distributions for instance. Recently, a new classification method, referred to as 'parsimonious Gaussian process Discriminant Analysis' (pgpDA), has been proposed [7] to tackle the case of data of arbitrary nature. See for instance [14] for an application to the classification of hyperspectral data. The basic idea is to introduce a kernel function into the Gaussian classification rule.
In this paper, we focus on the application of the pgpDA method to binary predictors. To this end, we show how new kernels can be built from similarity or dissimilarity measures. In particular, 76 such measures are considered. Some links are established between these measures when they are applied to binary predictors. A new family of measures is also introduced to unify the existing literature. As a result, we end up with a new supervised classification method dedicated to binary predictors combining similarity measures and mixture models. Its performance is illustrated on two real datasets (verbal autopsy data and handwritten digit data). It is shown that the proposed kernels can lead to good classification results even in challenging problems.
The paper is organized as follows. The principle of pgpDA applied to binary predictors is explained in Section 2. A brief review of similarity and dissimilarity measures is proposed in Section 3 together with some unification efforts. The construction of new kernels starting from similarity measures is presented in Section 4. The method is illustrated on real data in Section 5 and some concluding remarks are provided in Section 6. Proofs are postponed to the Appendix.

Classification with binary predictors using a kernel function
Conventional classification algorithms can be turned into kernel ones as long as the original method depends on the data only through dot products. The dot product is simply replaced by a kernel evaluation, leading to a transformation of linear algorithms into non-linear ones. Additionally, a nice property of kernel learning algorithms is the possibility to deal with any kind of data. The only condition is to be able to define a positive definite function over pairs of elements to be classified [23]. Here, we focus on binary predictors. Let us consider a learning set {(x_1, y_1), . . . , (x_n, y_n)} where {x_1, . . . , x_n} are assumed to be independent realizations of a random binary vector X ∈ {0, 1}^p. The class labels {y_1, . . . , y_n} are supposed to be realizations of a discrete random variable Y ∈ {1, . . . , L}. They indicate the memberships of the learning data to the L classes denoted by C_1, . . . , C_L, i.e. y_i = k means that x_i belongs to the kth class C_k, for all i ∈ {1, . . . , n} and k ∈ {1, . . . , L}.
The principle of pgpDA is as follows. Let K be a symmetric non-negative bivariate function K : {0, 1}^p × {0, 1}^p → R_+. In the following, K is referred to as a kernel function and additional conditions will be assumed on K. The basic idea is to measure the proximity between individuals with K, considering that close individuals are likely to belong to the same class. To this end, the kernel K computes inner products between pairs of data in some non-linear space (often referred to as a feature space). For all k = 1, . . . , L, let us denote by n_k the cardinality of the class C_k, i.e. n_k = Σ_{i=1}^{n} I{y_i = k} where I{.} is the indicator function. We also introduce r_k, the dimension of the class C_k once mapped into a non-linear space with the kernel K. In practice, one has r_k = min(n_k, p) for a linear kernel and r_k = n_k for the non-linear kernels considered in Section 4. See [7], Table 2 for further examples.
For all k = 1, . . . , L, the function ρ_k : {0, 1}^p × {0, 1}^p → R_+ is obtained by centering the kernel K with respect to the class C_k:

ρ_k(x, x′) = K(x, x′) − (1/n_k) Σ_{i: y_i = k} K(x, x_i) − (1/n_k) Σ_{j: y_j = k} K(x_j, x′) + (1/n_k²) Σ_{i: y_i = k} Σ_{j: y_j = k} K(x_i, x_j).   (1)

Besides, for all k = 1, . . . , L, let M_k be the n_k × n_k symmetric matrix defined by (M_k)_{i,j} = ρ_k(x_i, x_j)/n_k for all (i, j) ∈ {1, . . . , n_k}², the indices running over the observations of the class C_k. The sorted eigenvalues of M_k are denoted by λ_{k1} ≥ · · · ≥ λ_{k n_k} while the associated (normed) eigenvectors are denoted by β_{k1}, . . . , β_{k n_k}. In the following, β_{kji} represents the ith coordinate of β_{kj}, for (i, j) ∈ {1, . . . , n_k}². The main assumption of the method is that the data of each class C_k live in a specific subspace of dimension d_k of the feature space (of dimension r_k). The variance of the signal in the kth group is modeled by λ_{k1}, . . . , λ_{k d_k} and the variance of the noise is modeled by λ. This amounts to supposing that the noise is homoscedastic and that its variance is common to all the classes.
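As an illustration, the construction of M_k and of its eigendecomposition translates directly into code. Below is a minimal Python sketch (using numpy; all function names are ours, not taken from any reference implementation) of the centering (1) applied to an arbitrary kernel function.

```python
import numpy as np

def class_gram(K_fun, Xk):
    """Gram matrix K(x_i, x_j) on the n_k observations of class C_k."""
    return np.array([[K_fun(xi, xj) for xj in Xk] for xi in Xk])

def centered_class_matrix(K_fun, Xk):
    """M_k with (M_k)_{i,j} = rho_k(x_i, x_j)/n_k, rho_k being the kernel
    centered with respect to the class, as in equation (1)."""
    G = class_gram(K_fun, Xk)
    nk = G.shape[0]
    J = np.full((nk, nk), 1.0 / nk)        # averaging operator
    rho = G - J @ G - G @ J + J @ G @ J    # double centering
    return rho / nk

def class_eigendecomposition(Mk):
    """Eigenvalues sorted in decreasing order and associated eigenvectors."""
    vals, vecs = np.linalg.eigh(Mk)        # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```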
The classification rule introduced in [7], Proposition 2 affects x ∈ {0, 1}^p to the class C_ℓ if and only if ℓ = arg min_{k=1,...,L} D_k(x) with

D_k(x) = Σ_{j=1}^{d_k} (1/λ_{kj} − 1/λ) (1/(n_k λ_{kj})) ( Σ_{i: y_i = k} β_{kji} ρ_k(x, x_i) )² + (1/λ) ρ_k(x, x) + Σ_{j=1}^{d_k} log λ_{kj} + (d_max − d_k) log λ − 2 log π_k,   (2)

where d_max = max{d_1, . . . , d_L} and π_k = n_k/n denotes the proportion of the class C_k. Let us highlight that only the eigenvectors associated with the d_k largest eigenvalues of M_k have to be estimated. This property is a consequence of the above assumption; it allows to circumvent the unstable inversion of the matrices M_k, k = 1, . . . , L, which is usually necessary in kernelized versions of Gaussian mixture models, see for instance [13,32,35,44,46]. In practice, d_k is estimated thanks to the scree test of Cattell [11] which looks for a break in the scree of eigenvalues. The selected dimension is the one for which the subsequent eigenvalue differences are smaller than a threshold t. The threshold t can be provided by the user or selected by cross-validation, see Section 5 for implementation details.
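The scree test can be sketched as follows; note that the normalization of the eigenvalue differences by the largest one is an assumption on our part, chosen so that t can be taken in (0, 1).

```python
import numpy as np

def cattell_scree(eigvals, t):
    """Select the intrinsic dimension d_k: keep the last index where the
    eigenvalue difference is still above the threshold t (differences are
    normalized by the largest one, an assumed convention)."""
    diffs = -np.diff(eigvals)            # lambda_j - lambda_{j+1} >= 0
    if diffs.max() <= 0:
        return 1
    diffs = diffs / diffs.max()          # normalization so that t is in (0, 1)
    above = np.flatnonzero(diffs >= t)
    return int(above[-1]) + 1 if above.size > 0 else 1
```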
The implementation of this method requires the selection of a kernel function K which measures the similarity between two binary vectors. The following invariance remark can be made:

Lemma 1. Let K be a kernel function. Then, for all η > 0 and µ ∈ R, the classification rules associated with K and K̃ := ηK + µ through (2) are the same.

As a consequence, to define a proper kernel method [23], it suffices to find a shifted version of K which is a positive definite function, i.e. to find µ ∈ R such that, for all N ≥ 1, all x_1, . . . , x_N ∈ {0, 1}^p and all c_1, . . . , c_N ∈ R,

Σ_{i=1}^{N} Σ_{j=1}^{N} c_i c_j (K(x_i, x_j) + µ) ≥ 0.   (3)
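Condition (3) can be checked empirically on a finite sample. The following sketch (the function name is ours) tests whether a given shift µ makes the Gram matrix positive semi-definite on the available points; this is only a necessary condition, since (3) must hold for every finite sample.

```python
import numpy as np

def satisfies_condition_3(K_fun, X, mu=0.0, tol=1e-10):
    """Empirical check of (3): the shifted Gram matrix (K(x_i, x_j) + mu)
    must have no negative eigenvalue on the sample X."""
    G = np.array([[K_fun(xi, xj) for xj in X] for xi in X]) + mu
    return bool(np.linalg.eigvalsh(G).min() >= -tol)
```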
The construction of kernel functions adapted to binary vectors and satisfying (3) is addressed in Section 4. Let us highlight that pgpDA is not the only kernel-based classification method. In Section 5, pgpDA is compared to Support Vector Machine (SVM) classification [20,21,36] and k-nearest neighbours (kNN), see [22], Chapter 13, on two real datasets. From the theoretical point of view, pgpDA offers a number of advantages compared to SVM: it is naturally a multi-class method; as a model-based classifier, it provides classification probabilities; and finally, its computational cost is lower than that of SVM [7].

Similarity and dissimilarity measures
Binary similarity and dissimilarity measures play a critical role in pattern analysis problems such as classification or clustering. Since the performance of these methods relies on the choice of an appropriate measure, many efforts have been made over the past hundred years to find the most meaningful similarity measures, see [2,37] for examples. The review article [37] lists 76 examples of such measures. Here, we focus on their application to binary predictors. One of the earliest measures is Jaccard's coefficient [26]. It was proposed in 1901 and is still widely used in various fields such as ecology and biology. Let ⟨·, ·⟩ be the usual scalar product on R^p and 1 = (1, . . . , 1)^T ∈ R^p. For any pair of binary vectors (x, x′) ∈ {0, 1}^p × {0, 1}^p, introduce

a := ⟨x, x′⟩, b := ⟨x, 1 − x′⟩, c := ⟨1 − x, x′⟩ and d := ⟨1 − x, 1 − x′⟩.

The integer a is often referred to as the intersection of x and x′, (b + c) is the difference and d is the complement intersection. Note that one always has a + b + c + d = p. For instance, Jaccard's coefficient can be written as S_Jaccard(x, x′) = a/(a + b + c). Most measures of [37] can be written as a ratio of affine combinations of a, (b + c) and d:

S(x, x′) = (α₂ a + θ₂ (b + c) + β₂ d) / (α₁ a + θ₁ (b + c) + β₁ d),   (4)

where α₁, α₂, β₁, β₂, θ₁, θ₂ are non-negative weights.
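These counts are straightforward to compute. Here is a minimal sketch (helper names are ours) for two binary vectors, together with Jaccard's coefficient as a special case of the formalism (4).

```python
import numpy as np

def abcd(x, xp):
    """Counts a, b, c, d for two binary vectors of {0,1}^p."""
    x, xp = np.asarray(x), np.asarray(xp)
    a = int(x @ xp)                  # intersection (positive matches)
    b = int(x @ (1 - xp))            # x = 1 where x' = 0
    c = int((1 - x) @ xp)            # x = 0 where x' = 1
    d = int((1 - x) @ (1 - xp))      # complement intersection (negative matches)
    return a, b, c, d                # a + b + c + d == p

def jaccard(x, xp):
    """Jaccard's coefficient a/(a+b+c); undefined when x = x' = 0."""
    a, b, c, _ = abcd(x, xp)
    return a / (a + b + c)
```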
The inclusion of the negative matches d in similarity measures is discussed for instance in [17,18,40]. It may prove useful, for instance, when the classification rule depends on the coding of the data, see also Lemma 2 below. In this spirit, we introduce the new family of measures

S_Sylla & Girard(x, x′) = (α a + (1 − α) d)/p, α ∈ [0, 1].   (5)

The new measure S_Sylla & Girard can also be seen as an extension of the Sokal & Michener ([37], eq. (7)) and Innerproduct ([37], eq. (13)) measures, which both correspond to the special case α = 1/2. Thus, the parameter α in S_Sylla & Girard permits to balance the relative weights of the positive and negative matches. More generally, Table 1 displays 28 similarity measures from [37] which can be rewritten using our formalism (4). It appears that, on binary predictors, many similarity measures are equivalent. For instance, the Hamming similarity ([37], eq. (15)) is equivalent to the measures ([37], eq. (17)–(23)). Finally, some measures of [37] do not enter our framework (4).
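A sketch of the new measure (5), reusing the abcd helper above; the final comment illustrates the link with the Sokal & Michener measure (the normalization by p follows the form of (5) given above, an assumption of this sketch).

```python
def sylla_girard(x, xp, alpha):
    """Sylla & Girard measure (5): alpha balances positive (a) and
    negative (d) matches."""
    a, b, c, d = abcd(x, xp)
    p = a + b + c + d
    return (alpha * a + (1 - alpha) * d) / p

def sokal_michener(x, xp):
    """Sokal & Michener measure (a + d)/p ([37], eq. (7))."""
    a, b, c, d = abcd(x, xp)
    return (a + d) / (a + b + c + d)

# With alpha = 1/2, sylla_girard equals sokal_michener / 2: by Lemma 4 below,
# both measures yield the same classification rule through (2) and (6).
```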

Kernels for binary predictors
The goal of this section is to build kernels adapted to binary predictors starting from the similarity and dissimilarity measures presented in Section 3. These kernels can then be plugged into the classification rule (2) to build new classification methods designed for binary predictors. We first consider the case of linear and Radial Basis Function (RBF) kernels. We then show how the RBF kernel can be extended to a wider class of exponential kernels.
Linear kernel. The function K_linear(x, x′) := ⟨x, x′⟩ = a is the simplest kernel function. In the considered binary framework, K_linear counts the number of positive matches between x and x′. It is shown (see [7], Proposition 3) that the associated classification rule (2) is quadratic and can thus be interpreted as a particular case of the HDDA (High Dimensional Discriminant Analysis) method [4]. Let us recall that the basic principle of HDDA is to assume that the original data of each class live in a linear subspace of low dimension. The next lemma shows that the classification rule associated with the linear kernel is independent of the coding of the data.

Lemma 2. The classification rules associated through (2) with the kernels K_linear(x, x′) := ⟨x, x′⟩ and K̃_linear(x, x′) := ⟨1 − x, 1 − x′⟩ are the same.

Exponential kernels.
The best-known exponential kernel is the RBF kernel

K_RBF(x, x′) = exp(−‖x − x′‖²/(2σ²)),

where σ is a positive parameter. In the binary framework, ‖x − x′‖² = b + c = p(1 − S_Hamming(x, x′)), so that the RBF kernel can be built from the Hamming similarity measure (see Table 1 or [37], eq. (15)):

K_RBF(x, x′) = exp(−p (1 − S_Hamming(x, x′))/(2σ²)).

We thus propose to extend this construction principle to any similarity measure S by introducing

K_S(x, x′) := exp(−(1 − S(x, x′))/(2σ²)).   (6)

In practice, S may be chosen to be (4), (5), or more generally among the 76 measures described in [37].
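The construction (6) translates directly into code; a minimal sketch, assuming a similarity S taking values in [0, 1] and reusing the helpers above.

```python
import numpy as np

def exponential_kernel(S_fun, sigma):
    """Exponential kernel (6) built from an arbitrary similarity measure S."""
    def K(x, xp):
        return np.exp(-(1.0 - S_fun(x, xp)) / (2.0 * sigma ** 2))
    return K

# For instance, the kernel built from the Sylla & Girard measure (5):
# K = exponential_kernel(lambda x, xp: sylla_girard(x, xp, alpha=0.3), sigma=1.0)
```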

The next result is the analogue of Lemma 1 for similarity measures.

Lemma 4. Let S be a similarity measure. Then, for all η > 0 and µ ∈ R, the classification rules associated with S and S̃ := ηS + µ through (2) and (6) are the same.
The next result shows that any kernel defined from (4) and (6) verifies condition (3).

Experiments
The performance of the proposed method is illustrated on two real datasets described in Paragraph 5.1. Some implementation details are provided in Paragraph 5.2. Finally, the results are presented in Paragraphs 5.3, 5.4 and 5.5.

Verbal autopsy data
The goal of verbal autopsy is to collect information from the family about the circumstances of a death when medical certification is incomplete or absent [24]. In such a situation, verbal autopsy can be used as a routine death registration tool. A list of p possible symptoms is established and the collected data X = (X_1, . . . , X_p) consist of the absence or presence (encoded as 0 or 1) of each symptom on the deceased person. The probable cause of death is assigned by a physician and is encoded as a qualitative random variable Y. We refer to [39] for a review of automatic methods for assigning causes of death Y from verbal autopsy data X. In particular, classification methods based on Bayes' rule have been proposed, see [10] for instance.
Here, we focus on data measured on deceased persons during the period from 1985 to 2010 in three IRD (Research Institute for Development) sites (Niakhar, Bandafassi and Mlomp) in Senegal. The dataset includes n individuals (deceased persons) distributed in L classes (causes of death) and characterized by p binary variables (symptoms).

Binary handwritten digit data
Handwritten digit and character recognition are popular real-world tasks for testing and benchmarking classifiers, with obvious applications, e.g. in postal services. Here, we focus on the US Postal Service (USPS) database of handwritten digits which consists of n = 9298 segmented 16 × 16 greyscale images [25]. The dataset is available online at http://yann.lecun.com/exdb/mnist. The random vector X is the binarized image and is represented as a p-dimensional vector with p = 256. The class to predict Y is the digit, so that L = 10. A sample extracted from the dataset is depicted in Figure 1.

Experimental design
The implementation of the classification method requires the selection of the hyper-parameter ω = (t, σ), where t is the threshold (see Section 2) and σ is the kernel parameter, see (6). In the following, ω is selected by a double cross-validation procedure: the dataset is randomly split into a learning set, containing a proportion τ of the observations, and a test set; the hyper-parameter ω is selected by cross-validation on the learning set, and the correct classification rate (CCR) of the resulting classifier is computed on the test set.
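A sketch of this double cross-validation procedure in Python with scikit-learn; make_classifier is a hypothetical factory assumed to return a scikit-learn-compatible estimator (e.g. a pgpDA wrapper), not the authors' code.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import train_test_split, cross_val_score

def double_cross_validation(make_classifier, X, y, t_grid, sigma_grid,
                            tau=0.5, seed=0):
    """Split the data into a learning set (proportion tau) and a test set,
    select omega = (t, sigma) by cross-validation on the learning set, and
    report the CCR on the test set."""
    X_learn, X_test, y_learn, y_test = train_test_split(
        X, y, train_size=tau, stratify=y, random_state=seed)
    best_omega, best_score = None, -np.inf
    for t, sigma in product(t_grid, sigma_grid):
        score = cross_val_score(make_classifier(t, sigma),
                                X_learn, y_learn, cv=5).mean()
        if score > best_score:
            best_omega, best_score = (t, sigma), score
    final = make_classifier(*best_omega).fit(X_learn, y_learn)
    return best_omega, final.score(X_test, y_test)   # CCR on the test set
```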

Results obtained with the Sylla & Girard kernel
We first investigate the use of the Sylla & Girard similarity measure (5) when plugged into (6). The CCR are computed for α ∈ {0, 0.1, . . . , 1} and for several proportions τ thanks to the double cross-validation procedure described in the previous paragraph. It first appears in Figure 2 that the graphs are not symmetric with respect to α = 0.5. This means that the coding of the observations does affect the classification. This is different from the linear case, see Lemma 2. It is also apparent that the optimal value of α does depend on the dataset. However, in both considered cases, well-chosen values of α permit to outperform the RBF kernel associated with α = 0.5. Thus, the selection of an optimal value of α is of interest. It can easily be done by introducing α as an additional hyper-parameter in ω and thus selecting it by double cross-validation, see Paragraph 5.5 below. Finally, let us highlight that a large panel of values of α gives rise to high CCR on the test set. In particular, a high CCR can be reached on the challenging example of the verbal autopsy data even when a small proportion τ of the dataset is used to train the classifier. As a comparison, a classification based on a multinomial mixture model under a conditional independence assumption yields a CCR of about 50% only [41].

Results obtained with the 76 kernels from [37]
The goal of this paragraph is to compare the performance of the classification methods obtained by combining the 76 similarity and dissimilarity measures presented in [37] with the exponential kernel (6). For the sake of completeness, the results obtained with the Sylla & Girard kernel presented above are also included. The classification results are summarized in Table 2 for a fixed proportion τ of the dataset used to train the classifier. Only the results associated with the 18 best kernels (in terms of CCR computed on the test set) are reported. It appears that these kernels achieve good classification results on both datasets. It is also interesting to note that 8 kernels out of the 76 of [37] appear in the top 18 on both test datasets, namely: Euclid, Hellinger, Dice, 3w-Jaccard, Ochiai, Gower & Legendre, Rogers & Tanimoto and RBF. Let us also highlight that the Sylla & Girard kernel should be added to this list, leading to a list of 9 kernels with good results on both datasets.

Comparison with other classification methods
The proposed classification method is compared to the Random Forest method (randomForest package, version 4.6-10, R software), the kNN method (fitcknn function from the Statistics and Machine Learning Toolbox of Matlab) and the SVM method (libsvm library, version 3.2, Matlab). The "one-against-all" implementation of the SVM classification method is used. The SVM and Random Forest methods were used with their default parameters. In particular, for the Random Forest method, the number of trees to grow is set to ntree=500 and the minimum size of terminal nodes is set to nodesize=1. Some additional experiments reported in Table 5 and Table 6 showed that the obtained classifications were not very sensitive to these parameters: the CCR computed on the test set remains approximately constant over wide ranges of nodesize and ntree. The number k of neighbours in the kNN method is selected using the double cross-validation procedure. The Sylla & Girard kernel is plugged into the pgpDA, kNN and SVM methods with α ∈ {0.1, 0.2, . . . , 0.9}. The selection of α by double cross-validation has also been implemented; the resulting value is denoted by α* in the following.

It appears in Table 3 and Table 4 that, on the verbal autopsy dataset, the pgpDA method yields better results than the SVM, kNN and Random Forest methods on the test set. Since, on the learning set, the CCR obtained with Random Forest is larger than the CCR associated with the pgpDA, kNN and SVM methods for all values of α, one can suspect that Random Forest overfits this dataset. One can also observe that the CCR associated with pgpDA only slightly depends on α, whereas the CCR associated with SVM and kNN are very sensitive to α. Conversely, SVM, kNN and Random Forest yield better results than pgpDA on the handwritten digit dataset. The CCR associated with pgpDA is however satisfying and remains high whatever the value of α. This may be due to the small number of classes (L = 10 here, versus a larger number of classes in the previous situation) which makes the classification problem less difficult.
The selection by double cross-validation of the parameter α in the Sylla & Girard measure achieves good results for all the considered classification methods. The selected value α* remains stable across the experiments for pgpDA, SVM and kNN. It is a first step towards an automatic choice of the similarity measure in the classification framework. Finally, let us mention that the experiments were conducted on a two-processor computer (8 cores at 2.6 GHz). The computations on one learning set from the handwritten digit dataset took respectively 35 minutes (pgpDA), 40 minutes (SVM), 48 minutes (kNN) and 11 minutes (Random Forest).
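This comparison can be mimicked with off-the-shelf tools and precomputed kernels; a sketch reusing the helpers of Sections 3 and 4. Note that scikit-learn's SVC handles multi-class problems internally (one-vs-one), a simplification with respect to the one-against-all libsvm implementation used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def gram(K_fun, A, B):
    """Rectangular Gram matrix K(a_i, b_j) between two samples."""
    return np.array([[K_fun(a, b) for b in B] for a in A])

# SVM on a precomputed Sylla & Girard exponential kernel.
K = exponential_kernel(lambda x, xp: sylla_girard(x, xp, alpha=0.3), sigma=1.0)
# svm = SVC(kernel="precomputed").fit(gram(K, X_learn, X_learn), y_learn)
# ccr_svm = svm.score(gram(K, X_test, X_learn), y_test)

# kNN on the precomputed dissimilarity 1 - S.
# D_learn = 1.0 - gram(lambda x, xp: sylla_girard(x, xp, 0.3), X_learn, X_learn)
# knn = KNeighborsClassifier(n_neighbors=5,
#                            metric="precomputed").fit(D_learn, y_learn)
```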

Conclusion
This work was motivated by two facts: first, numerous binary similarity measures have been used in various scientific fields; second, model-based mixtures offer a coherent response to the classification problem by providing classification probabilities and natural multi-class support. Based on these remarks, our main contribution is the proposal of a new classification method combining mixture models and binary similarity measures. The method provides good classification performance on challenging datasets (high number of variables and classes). We believe that this method can prove useful in a wide variety of classification problems with binary predictors. As a by-product of this work, some new similarity measures are proposed to unify the existing literature.
This work could be extended to the classification of mixed quantitative and binary predictors. As suggested in [7], to deal with such data, one can build a combined kernel by mixing a kernel based on a similarity measure (as proposed here) for the binary predictors with an RBF kernel for the quantitative ones. The combined kernel could for instance be a weighted sum or a product of the two kernels, see [19] for further details on multiple kernel learning.
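A weighted-sum combined kernel can be sketched as follows; the splitting of each observation into its binary and quantitative parts is assumed to be done beforehand, and the helper name is ours.

```python
def combined_kernel(K_bin, K_rbf, w=0.5):
    """Weighted sum of a similarity-based kernel on the binary predictors and
    an RBF kernel on the quantitative ones; each observation x is assumed to
    be a pair (x_bin, x_quant)."""
    def K(x, xp):
        (xb, xq), (xpb, xpq) = x, xp
        return w * K_bin(xb, xpb) + (1.0 - w) * K_rbf(xq, xpq)
    return K
```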

Proof of Lemma 2.
To simplify the notations, let K(x, x′) := ⟨x, x′⟩ and K̃(x, x′) := ⟨1 − x, 1 − x′⟩. Expanding the scalar product yields

K̃(x, x′) = p − ⟨x, 1⟩ − ⟨x′, 1⟩ + K(x, x′).

For all k = 1, . . . , L, replacing K by K̃ in the centering (1), the constant term and the terms depending only on x or only on x′ cancel out, which yields ρ̃_k(x, x′) = ρ_k(x, x′) in view of (1), and thus the two classification rules are equivalent.
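This invariance can be verified numerically on synthetic data, reusing the centered_class_matrix helper sketched in Section 2 (our own sketch, not part of the proof).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 8)).astype(float)   # 20 binary vectors, p = 8

K_lin = lambda x, xp: float(x @ xp)                   # <x, x'> = a
K_rec = lambda x, xp: float((1 - x) @ (1 - xp))       # <1 - x, 1 - x'> = d

M_lin = centered_class_matrix(K_lin, X)               # helper from Section 2
M_rec = centered_class_matrix(K_rec, X)
assert np.allclose(M_lin, M_rec)                      # rho_k unchanged by recoding
```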
- The second step consists in showing that 1/S₁ defines a proper kernel classification method, where S₁(x, x′) := α₁ a + θ₁ (b + c) + β₁ d denotes the denominator of (4). Let us focus on the case where 0 ≤ α₁, β₁ < θ₁, the other cases being similar. Introduce u := (1 − α₁/θ₁)/p > 0 and v := (1 − β₁/θ₁)/p > 0 such that

S₁(x, x′) = θ₁ p (1 − u a − v d),

with u ∈ [0, 1) and v ∈ [0, 1). Since 0 ≤ u a + v d < 1, the following expansion holds:

1/(1 − u a − v d) = lim_{N→∞} S₁,N(x, x′) with S₁,N(x, x′) := Σ_{n=0}^{N} (u a + v d)^n.

Since S₁,N is obtained from sums and products of K_linear and K̃_linear, it follows from [38], Proposition 3.22 (i) and (iii) that S₁,N defines a proper kernel classification method for all N > 0. As a consequence, S₁,N verifies condition (3) for all N > 0. Letting N → ∞, one can conclude that 1/S₁ defines a proper kernel classification method.
- Finally, in view of [38], Proposition 3.22 (ii), (iii) and Proposition 3.24 (ii), it follows that K defines a proper kernel classification method.