Data Anonymization through Collaborative Multi-view Microaggregation

Interest in data anonymization is growing rapidly, motivated by the will of governments to open their data. The main challenge of data anonymization is to find a balance between data utility and disclosure risk. One of the best-known frameworks for data anonymization is k-anonymity: a dataset is considered anonymous if and only if, for each element of the dataset, there exist at least k − 1 elements identical to it. In this paper, we propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM. Both use topological collaborative clustering to obtain k-anonymous data. The first one determines the k level automatically and the second defines it by exploration. We also improve the results of these two approaches by using pLVQ2 as a weighted vector quantization method. The four proposed methods are evaluated using two data utility measures, the separability utility and the structural utility, and the experimental results show very promising performance.


Introduction
Nowadays, data is used in every aspect of human life. Data is collected by sensors, social networks, mobile applications and connected objects in order to process it, explore it, transform it and learn from it. To mine collected data without breaching security, rules related in particular to the privacy of the people in the dataset have to be respected. The process of preserving data privacy is called data anonymization and has long been used for statistical purposes.
Aware that valuable analysis depends on good-quality data, researchers have studied data anonymization methods with the purpose of proposing a good trade-off between identity disclosure and information loss. Data anonymization is the process of de-identifying sensitive data while preserving its format and data type [40] [33]. This is generally achieved by masking one or multiple values in order to hide some aspects of the data. The growing interest in data anonymization has mainly been motivated by the desire of governments and institutions to open their data as a proof of democracy and good practices. Open data is a very promising field of study, and a very challenging one, because the released data must remain anonymized forever with a very low re-identification rate while ensuring sufficient quality for analytics [7,31]. Given the importance of the balance between privacy and utility, many approaches were introduced to tackle this problem. The first approaches were mainly based on randomization, which consists of adding noise to the data [1]. This method was proven to be inefficient since data reconstruction was feasible [20].
The risk of privacy breaches under randomization was addressed by the emergence of the k-anonymization method [38]. This group-based anonymization method outputs a dataset in which each record is identical to at least k − 1 other records. The anonymization is achieved first by removing the key identifiers, such as the name and the address, and second by generalizing and/or suppressing the pseudo-identifiers, for example the date of birth, the ZIP code, the gender and the age. The value of k should be chosen so as to preserve the information provided by the database. The method has been widely studied [3,26,27,30], which gave a strong basis to further work on anonymization. Since k-anonymity is a group-based method, clustering was considered one of its strongest assets [21,32]. Microaggregating k elements and replacing the data by the group representatives gives a good trade-off between information loss and the potential re-identification risk [5]. However, the clustering methods presented are based on the k-means algorithm, which is prone to local optima and may give biased results.
In this paper, we propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM. Both use topological collaborative clustering to obtain k-anonymous data. The first one determines the k level automatically and the second defines it by exploration. To do so, we take advantage of the topological structure of Self Organizing Maps (SOM) [22] and of their lower sensitivity to local optima [2]. We use SOM as a clustering model since it has been shown to give good results in practical applications where the aim is visualization and dimensionality reduction. The results of the clustering are enhanced using the collaborative learning process [17]. At the end of the topological learning, "similar" data are collected in clusters, corresponding to sets of similar patterns. These clusters can be represented by more concise information, such as their gravity center or different statistical moments, which we believe is easier to manipulate than the original data.
In the second part of the paper, we introduce discriminative information to handle the cases where labels are given and to study how they may affect the anonymization process. The supervision is provided by the Learning Vector Quantization (LVQ) method [22]. We use a particular version that assigns a weight to each feature, which results in better preservation of the utility of the anonymized dataset; the approach is called pLVQ2 and was detailed in [4].
Ultimately, in this paper, we tackle the following points:
-Multi-view collaborative Self Organizing Maps to achieve data anonymization.
-Constrained collaborative Self Organizing Maps to attain a predetermined k-anonymity level.
-The introduction of discriminative information and the use of pLVQ2 to achieve the highest anonymity levels with a good utility trade-off.
The remainder of this paper is organized as follows: Section 2 discusses the theoretical background, Section 3 presents the different algorithms proposed for anonymization, Section 4 illustrates the experimental results, and the conclusions and future directions are given in Section 5.

Fundamental background of the proposed approaches
In this section we lay out the theoretical foundations of the methods proposed in the remainder of the paper. In subsection 2.1, we present the foundations of k-anonymity through microaggregation; in subsection 2.2, we give an overview of multi-view collaborative learning; and finally, in Section 3, we list the notations and definitions needed in the rest of the paper.

k-anonymity through Microaggregation
The most widely studied privacy-preserving method is k-anonymity [38]. The model assumes that person-specific data is stored in a table of attributes and records. To anonymize a dataset, Sweeney [38] proposed a method that consists of suppressing and/or generalizing the quasi-identifiers in such a way that any record is indistinguishable from at least k − 1 other records. Quasi-identifiers are variables that, taken alone, do not disclose much information about the individuals but, if combined, might leak the identity of their holder. This approach promoted the idea of grouping similar elements to anonymize them.
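The definition above can be checked mechanically. The following is a minimal sketch, assuming records are stored as dictionaries; the function name `anonymity_level` and the toy table are illustrative and not taken from the paper. It returns the largest k for which a table is k-anonymous with respect to a chosen set of quasi-identifiers.

```python
from collections import Counter


def anonymity_level(records, quasi_identifiers):
    """Return the largest k such that the table is k-anonymous.

    Two records fall in the same equivalence class when they agree on
    every quasi-identifier; k is the size of the smallest class.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Hypothetical, already generalized table (values are made up).
table = [
    {"zip": "750**", "age": "30-39", "disease": "flu"},
    {"zip": "750**", "age": "30-39", "disease": "asthma"},
    {"zip": "930**", "age": "40-49", "disease": "flu"},
    {"zip": "930**", "age": "40-49", "disease": "diabetes"},
    {"zip": "930**", "age": "40-49", "disease": "flu"},
]
k = anonymity_level(table, ["zip", "age"])  # smallest class has 2 records
```

The sensitive attribute (`disease` here) is deliberately left out of the grouping key, since k-anonymity only constrains the quasi-identifiers.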
The objective of classical k-anonymity is to reduce information loss, since data can be hidden in multiple ways depending on the method used. Minimal generalizations and fewer suppressions are preferred. In fact, heuristics to tackle k-anonymity are motivated by preference criteria or user policies [8]. In data mining, the k-anonymous data should hold enough information about the respondents to be useful for subsequent operations related to pattern detection.
Venkatasubramanian [41] classifies anonymization methods into three classes. The first class contains the statistical methods, which measure privacy in terms of variance: the larger the variance, the greater the privacy of the perturbed data. The second class contains the probabilistic methods, which attempt to quantify the background information that a third party might possess; researchers deployed tools from information theory and Bayesian analysis, and more precisely notions of information transfer. The third class is secure multi-party computation. These methods were inspired by cryptography, and information leakage is measured in terms of the amount of information accessible to the adversary. One of the most illustrative examples of these methods is Yao's Millionaires' Problem [43], where two millionaires wish to know who is richer without revealing any information about each other's wealth.
Grouping, as in the probabilistic approaches, recalls classification in the case of supervised learning and clustering in the case of unsupervised learning. Li et al. [28] introduced the first algorithm that combines clustering and anonymization. The algorithm forms equivalence classes from the database by finding an equivalence class with fewer than k records. It measures the distance between this equivalence class and the others, and merges it with the nearest one in order to form a cluster of at least k elements with minimum information distortion. This method gives good results but is very time consuming.
The k-member clustering algorithm was detailed in [5]; it forms clusters of at least k records in such a way that the records within each cluster are similar. This approach fixes the value of k, looks for the record and the cluster with the minimal information loss, adds the record to the cluster and iterates until all clusters have at least k members. Another approach is the clustering-based greedy algorithm. First introduced by Loukides et al. [29], it focuses on capturing the usefulness of the data and protecting its privacy by presenting quality measures that take into account the attributes, the diversity of the tuples and a clustering algorithm. This algorithm is similar to the previous k-member clustering algorithm [5], but with the constraint of maximizing the dissimilarity of sensitive data values (privacy) and maximizing the similarity of the quasi-identifiers (usefulness). Those algorithms opened the way to further studies on anonymization using clustering [21].
k-anonymity is a global framework to evaluate the amount of privacy in a dataset. Since the elimination of key identifiers was proven to be insufficient, microdata has been released using the microaggregation technique [14]. Microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. It is used as an alternative to generalization and suppression to generate k-anonymous datasets, where the identity of each subject is hidden within a group of k elements. Unlike generalization, microaggregation perturbs the data in a way that improves data utility in several respects, such as increasing data granularity, reducing the impact of outliers and avoiding the discretization of numerical data.
In microaggregation, records are clustered into small aggregates or groups of size at least k. Rather than publishing the original value of a variable V_i for a given record, the average of the values over the group to which the record belongs is published. In order to minimize information loss, the groups should be as homogeneous as possible.
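To make the publishing rule concrete, here is a minimal sketch of microaggregation on numeric tuples. The grouping heuristic (sort the records, cut the ordering into runs of k) is an assumption chosen for brevity; it is not the SOM-based grouping proposed in this paper, and the name `microaggregate` is illustrative.

```python
def microaggregate(records, k):
    """Replace each numeric record by the mean of its group (size >= k)."""
    n = len(records)
    if n < k:
        raise ValueError("need at least k records")
    # Assumption: a plain lexicographic sort stands in for a real
    # homogeneity-driven grouping.
    order = sorted(range(n), key=lambda i: records[i])
    n_groups = n // k
    dim = len(records[0])
    out = [None] * n
    for g in range(n_groups):
        start = g * k
        # The last group absorbs the remainder so no group is smaller than k.
        end = n if g == n_groups - 1 else start + k
        members = order[start:end]
        centroid = tuple(
            sum(records[i][d] for i in members) / len(members) for d in range(dim)
        )
        for i in members:
            out[i] = centroid
    return out


data = [(1.0, 10.0), (2.0, 12.0), (9.0, 50.0), (10.0, 52.0), (11.0, 54.0)]
published = microaggregate(data, k=2)
```

Every published tuple is shared by at least k records, so the output is k-anonymous on the aggregated attributes by construction; the quality of the result depends entirely on how homogeneous the groups are.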
The approach presented in the following consists of anonymizing microdata using multi-view topological collaborative microaggregation [16]. To anonymize the dataset, we start by choosing the number of views to explore, then we randomly split the data vertically and build a SOM for each view to obtain the corresponding prototypes.
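The random vertical split can be sketched as follows. This is an illustration under stated assumptions: the helper names `split_views` and `project` are hypothetical, and the split is a simple shuffled round-robin over feature indices, which the paper does not specify.

```python
import random


def split_views(n_features, n_views, seed=0):
    """Randomly partition feature indices into n_views disjoint views."""
    if not 1 <= n_views <= n_features:
        raise ValueError("n_views must be between 1 and n_features")
    rng = random.Random(seed)
    idx = list(range(n_features))
    rng.shuffle(idx)
    # Round-robin over the shuffled indices keeps view sizes balanced.
    return [sorted(idx[v::n_views]) for v in range(n_views)]


def project(records, view):
    """Restrict every record to the attributes of one view."""
    return [tuple(r[f] for f in view) for r in records]


views = split_views(n_features=7, n_views=3)
sample = project([(1, 2, 3, 4, 5, 6, 7)], views[0])
```

Each view keeps all individuals but only a subset of the attributes, so one SOM can then be trained per view.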

Multi-view Collaborative Learning
Learning and detecting patterns in data is the ultimate aim of machine learning. Suppose we have a collection of datasets described by different sets of attributes; extracting information about these elements amounts to extracting information about each family of descriptors alone. This is what we call multi-view decomposition [16]: each view of the dataset allows specific patterns of the studied data to be extracted. Collaborative learning, on the other hand, aims to develop methods grounded in statistics to recover the topological invariants from the observed data points [9]. The models of interest in this paper are those that both reduce dimension and achieve clustering. SOM models [22] allow projection onto low-dimensional spaces, generally two dimensional, and are often used for visualization and unsupervised topological clustering. In order to improve the clustering quality of the SOM, the collaboration approach is used and the outputs of several self-organizing maps are compared. Each dataset is clustered through the SOM approach. The main idea of the collaboration between different SOM maps is that if an observation from the ii-th dataset is projected on the j-th neuron of the ii-th SOM map, then that same observation in the jj-th dataset will be projected on the same j-th neuron of the jj-th map or on one of its neighboring neurons. In other words, corresponding neurons of different maps should capture similar observations.
Algorithm 1: Collaboration algorithm (only part of the listing is recoverable)
Step 1: Local learning (for each view ii = 1 to P):
  Compute the DB index for SOM[ii], where DB[ii] is the Davies-Bouldin index computed using w[ii]
  BeforeCollab ← DB[ii]
Step 2: Collaborative learning:
  for ii = 1 to P do
    for jj = 1 to P, jj ≠ ii do
      [collaboration update, followed by a comparison of the new DB index with BeforeCollab]
    end for
  end for

Therefore, the classical SOM objective function was modified by adding a collaboration term. Based on the works of [17,18], we add a new collaboration step to estimate the importance of the collaboration during the collaborative learning process. Formally, the objective function is composed of two terms, the classical SOM term and the collaboration term, where P represents the number of views, N the number of observations, and |w| the number of prototype vectors of the ii-th SOM (the number of neurons). χ(x_i) is the assignment function, which finds the Best Matching Unit (BMU): it selects the neuron whose prototype is closest to the observation x_i according to the Euclidean distance.
The value of the collaboration link λ is determined during learning. This parameter sets the importance of the collaboration between each pair of SOMs, i.e. the collaboration link is learned between all datasets and maps. Its value lies in the interval [1, 10]: 1 for a neutral link, when no importance is given to the collaboration, and 10 for maximal collaboration with a map. Its value changes at each iteration of the collaboration step. In collaborative learning, as shown in Algorithm 1, this value depends on the topological similarity between the two collaborating maps.
The neighborhood function depends on the distance between two neurons: σ(i, j) represents the distance between two neurons i and j of the map, defined as the length of the shortest path linking cells i and j on the SOM. K^{[cc]}_{σ(i,j)} is the neighborhood function on SOM[cc] between two cells i and j. T is the temperature, which controls the size of the neighborhood of influence of a cell on the map: the neighborhood shrinks as T decreases. The value of T is decreased from T_max to T_min.
The nature of the neighborhood function K^{[cc]}_{σ(i,j)} is identical for all the maps, but its value changes from one map to another: it depends on the closest prototype to the observation, which is not necessarily the same for all the SOM maps. Indeed, during the collaboration with a SOM map, the algorithm takes into account the prototypes of the map and its topology (the neighborhood function).
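The exact expression of the neighborhood function is not reproduced in this text, so the following sketch rests on explicit assumptions: a rectangular, 4-connected map (so that the shortest path σ is the Manhattan distance between cells), a Gaussian-type kernel exp(−σ²/T²), and a geometric decay of T from T_max to T_min. All three are common choices for SOMs, not confirmed details of this paper.

```python
import math


def grid_distance(i, j, n_cols):
    """Shortest-path length between cells i and j on a 4-connected grid."""
    ri, ci = divmod(i, n_cols)
    rj, cj = divmod(j, n_cols)
    return abs(ri - rj) + abs(ci - cj)


def neighborhood(i, j, n_cols, T):
    """Assumed kernel K = exp(-sigma(i, j)^2 / T^2)."""
    s = grid_distance(i, j, n_cols)
    return math.exp(-(s * s) / (T * T))


def temperature(t, n_iter, t_max, t_min):
    """Assumed geometric decay from t_max (t = 0) to t_min (t = n_iter - 1)."""
    return t_max * (t_min / t_max) ** (t / (n_iter - 1))


# Influence of cell 0 on its direct neighbor at the start and end of training.
k_start = neighborhood(0, 1, n_cols=3, T=temperature(0, 10, 5.0, 0.5))
k_end = neighborhood(0, 1, n_cols=3, T=temperature(9, 10, 5.0, 0.5))
```

With these choices a cell always has influence 1 on itself, and its influence on other cells shrinks as the temperature drops, which matches the qualitative behavior described above.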

Proposed Anonymization Approaches

Notations
We use the k-anonymity notation: data is organized as a table of rows (records) and columns (attributes), where each row is a tuple; tuples are not necessarily unique, but attributes are. Each row is an ordered m-tuple of values <a_1, a_2, ..., a_j, ..., a_m>.

Definition 3.2. The Davies-Bouldin Index
The DB index [11] is based on a similarity measure of clusters R_ij that is a ratio between the dispersion measures s_i, s_j and the cluster dissimilarity d_ij [25]. R_ij should satisfy the following conditions:
-R_ij ≥ 0
-R_ij = R_ji
-if s_i = 0 and s_j = 0 then R_ij = 0
-if s_j > s_k and d_ij = d_ik then R_ij > R_ik
-if s_j = s_k and d_ij < d_ik then R_ij > R_ik
In its standard form, with R_ij = (s_i + s_j) / d_ij, the index reads
$$DB = \frac{1}{n_c}\sum_{i=1}^{n_c}\max_{j \neq i}\frac{s_i + s_j}{d_{ij}}, \qquad s_i = \frac{1}{|c_i|}\sum_{x \in c_i}\lVert x - w_i\rVert, \qquad d_{ij} = \lVert w_i - w_j\rVert,$$
where w_i are the prototypes of the neurons, n_c is the number of cells and c_i is the i-th cell. Davies-Bouldin is a cluster validity index used to measure the "goodness" of a clustering result [11]. It takes into account the compactness and the separability of clusters and works best with hard clustering (when the clusters have no overlapping partitions).
Since the objective is to obtain clusters with minimum intra-cluster distances, small values of DB are desirable. The use of this validity index is justified by our willingness to evaluate how similar the elements of the same cluster are. Therefore, this index is minimized when looking for the best number of clusters [37].
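As an illustration of how the index behaves, the sketch below computes the standard Davies-Bouldin index with the SOM prototypes used as cluster centers. The function name and the convention of ignoring empty cells are assumptions made for this example.

```python
import math


def davies_bouldin(data, labels, prototypes):
    """Standard DB index; prototypes[c] is the center of cell c."""
    cells = sorted(set(labels))  # empty cells are ignored (assumption)
    if len(cells) < 2:
        raise ValueError("need at least two non-empty cells")
    # s_i: mean distance of the members of cell i to its prototype.
    scatter = {}
    for c in cells:
        members = [x for x, lab in zip(data, labels) if lab == c]
        scatter[c] = sum(math.dist(x, prototypes[c]) for x in members) / len(members)
    # DB: average over cells of the worst (largest) similarity R_ij.
    total = 0.0
    for i in cells:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(prototypes[i], prototypes[j])
            for j in cells
            if j != i
        )
    return total / len(cells)


labels = [0, 0, 1, 1]
far = davies_bouldin([(0, 0), (0, 2), (10, 0), (10, 2)], labels, [(0, 1), (10, 1)])
near = davies_bouldin([(0, 0), (0, 2), (2, 0), (2, 2)], labels, [(0, 1), (2, 1)])
```

Both toy clusterings have the same within-cluster scatter, but the well-separated one (`far`) gets the smaller index, which is why a decreasing DB value is read as an improvement.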

k-CMVM
In this work, we propose to use a pre-anonymization step that offers a choice between two different levels of anonymization. The first uses the prototypes of the BMUs (k-CMVM) and the second uses the linear mixture of models (Constrained-CMVM).
Self Organizing Maps, when introduced by Kohonen [39], seemed like a simple yet powerful algorithm to produce "order out of disorder" [19] by building a one- or two-dimensional lattice of neurons that captures the important features contained in an input space. SOMs are based on competitive learning; in other words, the output neurons of the map compete among themselves to be activated or fired. In the course of this competition, the neurons are selectively tuned to the various input patterns, and their locations become ordered with respect to each other in a meaningful way. The best tuned neuron is called the winner neuron or the Best Matching Unit. In our case, we chose to encode each input vector by its corresponding prototype, i.e. that of its Best Matching Unit. The idea is in line with group anonymization methods: the SOM creates a map of neurons, i.e. clusters, each cluster is defined by its prototype, and the closest representative of an element is the prototype of the cluster it belongs to.
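The BMU coding described above can be sketched in a few lines. The prototypes are assumed to come from an already trained map; the function names are illustrative.

```python
import math


def bmu(x, prototypes):
    """Index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda c: math.dist(x, prototypes[c]))


def encode_with_bmu(data, prototypes):
    """Replace every record by the prototype of its Best Matching Unit."""
    return [tuple(prototypes[bmu(x, prototypes)]) for x in data]


# Hypothetical two-neuron map and three records.
prototypes = [(0.0, 0.0), (5.0, 5.0)]
coded = encode_with_bmu([(0.2, 0.1), (4.0, 6.0), (-1.0, 0.5)], prototypes)
```

All records that share a BMU become identical after coding, which is what makes the size of the smallest neuron the relevant quantity for the anonymity level.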

Constrained CMVM
In [23], Kohonen extended the use of the SOM by showing that, instead of representing inputs by the "Best Matching Unit", i.e. the "winning neuron", they can be described using a linear mixture of the reference vectors [24]. This method analyzes the input data and approximates it by a set of models that describes the item more accurately. Compared to the classical SOM learning process, where only the BMUs are used, the linear mixture of models better preserves the information.
Let us consider each input as a Euclidean vector x of dimensionality n. The SOM matrix of prototypes is denoted M, of size (p × n), where p is the number of nodes in the SOM. To get the coefficients of the models we minimize
$$\min_{\alpha}\ \lVert M^{\top}\alpha - x \rVert^{2} \quad \text{subject to } \alpha_i \geq 0,$$
where α is a vector of non-negative scalars α_i. The non-negativity constraint is important when dealing with inputs consisting of statistical indicators, because their negatives have no meaning. There exist several ways to solve this problem. The most used and straightforward is gradient-descent optimization, an iterative algorithm that can take the non-negativity constraint into account.
The present fitting problem belongs to quadratic optimization, for which numerous methods have been developed over the years. A one-pass solution is based on the Kuhn-Tucker theorem [15].
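A minimal sketch of the gradient-descent route is given below, assuming the objective is the squared reconstruction error of x by a non-negative combination of the prototypes. It uses projected gradient descent (clip negative coefficients to zero after each step); the step size, the iteration count and the function name are choices made for this illustration, not values from the paper.

```python
def linear_mixture(x, models, n_iter=2000):
    """Non-negative alpha minimizing ||x - sum_i alpha_i * models[i]||^2."""
    p, n = len(models), len(x)
    # 2 * (sum of squared entries) upper-bounds the Lipschitz constant of the
    # gradient, so its reciprocal is a safe (if conservative) step size.
    step = 1.0 / (2.0 * sum(v * v for m in models for v in m))
    alpha = [0.0] * p
    for _ in range(n_iter):
        approx = [sum(alpha[i] * models[i][d] for i in range(p)) for d in range(n)]
        resid = [x[d] - approx[d] for d in range(n)]
        for i in range(p):
            grad = -2.0 * sum(models[i][d] * resid[d] for d in range(n))
            # Gradient step followed by projection onto alpha_i >= 0.
            alpha[i] = max(0.0, alpha[i] - step * grad)
    return alpha


def reconstruct(alpha, models):
    """Coded version of x: the mixture sum_i alpha_i * models[i]."""
    return [sum(a * m[d] for a, m in zip(alpha, models)) for d in range(len(models[0]))]


models = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alpha = linear_mixture((2.0, 3.0), models)
coded = reconstruct(alpha, models)
```

Unlike BMU coding, two records that share a BMU generally receive different mixtures, so the coded data stays closer to the original; this is consistent with the observation that the mixture preserves more information.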

Fine tuning
One of the challenges in data mining is to mine multi-view data distributed over different sites. Here, we propose to use collaborative clustering to address this problem. This method can deal with multi-source data, i.e. several sets describing the same individuals in different attribute spaces, where even the data type can differ. In other words, each database is a view of a global dataset about the same individuals. This way, the curse of dimensionality is implicitly dealt with, as the algorithm treats each part alone, and the results are more accurate.
The k-CMVM and Constrained-CMVM methods (Algorithms 2 and 3) build a classical SOM for each view of the dataset and use the collaborative paradigm to exchange topological information between collaborators, as described in Algorithm 1. The Davies-Bouldin index [11], a clustering evaluation indicator that reflects the quality of the clustering, is used as a stopping criterion. If DB decreases, the collaboration is positive; if it increases, we stop the collaboration and use the initial map. By a positive collaboration we mean that the collaborators improve the clustering quality of one another; a negative collaboration, on the contrary, describes the case where a collaborator deteriorates the clustering quality of another. By using the DB index as a stopping criterion, we control how the views collaborate and only collaborate if the collaboration improves the clustering. Therefore, the collaboration allows us to obtain more homogeneous clusters by using the topological information from all the views.
After the clustering and collaboration step comes the pre-anonymization step, where the elements of each of the collaborating maps are coded using the BMUs for k-CMVM and using the linear mixture of the map's prototypes for Constrained-CMVM. We found that the linear mixture of models gave better results than anonymizing the data with BMUs because it preserves most of the information contained in each element. The pre-anonymized parts are then reorganized in the same way as the original dataset.
The second part is where the two algorithms differ most. On the one hand, k-CMVM outputs a pre-anonymized dataset that is fine-tuned using a SOM model whose map size is determined by the Kohonen heuristic [22]. The resulting dataset is recoded using the prototypes of the closest object to the BMU, and we examine the anonymity level of the dataset. The k level is not a predefined value; it is given automatically by the model.
On the other hand, for the Constrained-CMVM algorithm, the fine-tuning step works as follows: we use a constrained SOM on the pre-anonymized dataset. To obtain a constrained map, we first create a SOM that is learned on the outputs of the pre-anonymization step, as stated before. A level of anonymity k is predefined, and the elements of the neurons that do not respect the cardinality constraint k are redistributed to the closest neurons. This process modifies the topology of the map, but helps to form groups of at least k elements in each neuron. We code the objects of each neuron using the best matching unit to get a k-anonymized dataset. We then explore different values of k to determine the one that satisfies our requirements.
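The redistribution step can be sketched as follows. The order in which undersized neurons are emptied (smallest first) and the rule that elements move to the closest remaining non-empty neuron are assumptions made for this illustration; the paper only states that the elements are redistributed to the closest neurons.

```python
import math


def enforce_min_cardinality(data, prototypes, assignment, k):
    """Empty every neuron holding fewer than k elements into closer neurons."""
    assignment = list(assignment)
    while True:
        counts = {}
        for c in assignment:
            counts[c] = counts.get(c, 0) + 1
        small = [c for c, n in counts.items() if n < k]
        if not small or len(counts) == 1:
            return assignment
        # Assumption: dissolve the smallest offending neuron first.
        victim = min(small, key=lambda c: counts[c])
        targets = [c for c in counts if c != victim]
        for idx, c in enumerate(assignment):
            if c == victim:
                assignment[idx] = min(
                    targets, key=lambda t: math.dist(data[idx], prototypes[t])
                )


data = [(0, 0), (0, 1), (1, 0), (5, 5), (9, 9), (9, 10), (10, 9)]
prototypes = [(0.3, 0.3), (5.0, 5.0), (9.3, 9.3)]
constrained = enforce_min_cardinality(data, prototypes, [0, 0, 0, 1, 2, 2, 2], k=3)
```

Each pass removes one neuron from the set of non-empty neurons, so the loop terminates; if fewer than k records exist in total, a single undersized group is returned and the requested level cannot be met.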
Algorithm 2: k-CMVM
Collaboration:
-Compute w[ii] using the collaboration Algorithm 1 with all V[ii].
Pre-anonymization:
-For each V[ii], ii = 1 to P: for each element j, find the prototype $w^{[ii]}_{jc}$, where c is the matching neuron.
-Code each element j of OT with its corresponding vector: $X'_j \leftarrow [w^{[1]}_{jc^{(1)}}, w^{[2]}_{jc^{(2)}}, \ldots, w^{[P]}_{jc^{(P)}}]$, where $c^{(q)}$ is the index of the cell associated with element j in view q.
Fine-tuning and anonymization:
-Build a global SOM using the pre-anonymized dataset OT′.
-For each c in cells 1 to nc: recode the elements of c.
-Output the level of anonymity.

To sum up, the proposed anonymization methods use the multi-view approach with the purpose of treating complex and multi-source data. This approach is also used to preserve the quality of the dataset to recode and to prevent the curse of dimensionality. The number of subsets used for collaboration is fixed by the user and depends on the size of the data. Algorithm 1 uses classical SOMs and the collaborative paradigm to form the maps by exchanging topological information between the collaborating maps. In the pre-anonymization step of k-CMVM and Constrained-CMVM, the dataset is coded using the prototypes of the best matching units of each data point or the linear mixture of the SOM models. The pre-anonymized data is then fine-tuned using BMUs or clustered under the constraint of k elements per neuron.

Incorporating Discriminative Power
After evaluating the results of data anonymization using k-CMVM (Algorithm 2) and Constrained-CMVM (Algorithm 3), we wanted to explore the case where the data is labelled, and to what extent supervision might influence the quality of the anonymized results. To tackle this question we experimented with the Learning Vector Quantization (LVQ) approach. This choice was motivated by its ability to improve the clustering results by taking into account the class of each object. The algorithm learns from a subset of patterns that best represent the training set.
Algorithm 3: Constrained-CMVM
Collaboration:
-Compute w[ii] using the collaboration Algorithm 1 with all V[ii].
Pre-anonymization:
-For each V[ii], ii = 1 to P: find the linear mixture of SOM models for each object j in V[ii],
$$c^{[ii]}(j) = \sum_{l} \delta_l\, w^{[ii]}_l,$$
where $c^{[ii]}(j)$ is the coding of the j-th element of the ii-th view and $\delta_l$ are the coefficients of the linear mixture of models.
-Code each element j of OT with its corresponding vector: $X'_j \leftarrow [c^{[1]}(j), c^{[2]}(j), \ldots, c^{[P]}(j)]$.
Constrained clustering and anonymization:
-Learn a constrained SOM on the pre-anonymized dataset and code the objects of each neuron using the best matching unit.

The LVQ method is best known for its simplicity and the speed of its convergence, since it is based on Hebbian learning. It is a prototype-based method that prepares a set of codebook vectors in the domain of the observed input data samples and uses them to classify unseen examples.
LVQ was designed for classification problems in which existing datasets can be used to supervise the learning of the system. LVQ is non-parametric, meaning that it does not rely on assumptions about the structure of the function that it is approximating. Euclidean distance is commonly used to measure the distance between real-valued vectors, although other distance measures may be used (such as the Mahalanobis distance), and data-specific distance measures may be required for non-scalar attributes. There should be sufficient training iterations to expose all the training data to the model multiple times. The learning rate is typically linearly decayed over the training period from an initial value until it is close to zero. The more complex the class distribution, the more codebook vectors are required; some problems may need thousands. Multiple passes of the LVQ training algorithm are suggested for more robust usage, where the first pass has a large learning rate to prepare the codebook vectors and the second pass has a low learning rate and runs for a long time (perhaps 10 times more iterations).
In the LVQ model, each class contains a set of fixed prototypes with the same dimension as the data to be classified. LVQ adaptively modifies the prototypes. In the learning algorithm, data is first clustered using a clustering method and the cluster prototypes are then moved using LVQ to perform classification. We chose to supervise the results of the clustering by moving the cluster centers using pLVQ2, described in Algorithm 4, for each of the approaches. We use pLVQ2 [4] since this upgraded version of LVQ respects the characteristics of each feature and adapts the weighting of each feature according to its contribution to the discrimination.

Algorithm 4: Adaptive Weighting of Pattern Features During Learning
Initialization:
-Initialize the matrix of weights P [4].
-The codewords m are chosen for each class using the k-means algorithm.
Learning phase:
1. Present a learning example x.
2. Let w_i ∈ C_i be the nearest codeword vector to x.
-if x ∈ C_i, then go to 1
-else
  -let w_j ∈ C_j be the second nearest codeword vector
  -if x ∈ C_j then
    * a symmetrical window win is set around the mid-point of w_i and w_j.
    * if x falls within win, then
      Codewords adaptation:
      * w_i is moved away from x, following the codeword update rule of pLVQ2 [4]
      * w_j is moved closer to x, following the codeword update rule of pLVQ2 [4]
      * for the rest of the codewords, w_k(t + 1) = w_k(t)
      Weighting pattern features:
      * adapt p_kk following the weight update rule of pLVQ2 [4]
    * go to 1.
Here α(t) and β(t) are the learning rates. The system learns using two layers: the first layer calculates the weights of the features, and its output is then presented to the LVQ2 algorithm. The cost function of this approach, detailed in [4], is written in terms of the following quantities: C_k is the class k, x ∈ C_k is a training example, P is the weighting coefficient matrix, w_i is the nearest codeword vector to Px and w_j is the second nearest codeword vector to Px. Combined with the collaborative paradigm, pLVQ2 enhances the utility of the data anonymized by the k-CMVM and Constrained-CMVM models. pLVQ2 is applied to the cluster centers after the collaboration, in order to improve the results of the collaboration at the pre-anonymization and anonymization steps.
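The codeword adaptation of Algorithm 4 can be illustrated with a single LVQ2-style step. This sketch is not pLVQ2 itself: the update formulas and the window test below are the classical LVQ2 ones (an assumption, since the corresponding formulas are not reproduced in this text), the feature weights are treated as a fixed diagonal of P, and the adaptation of p_kk is left out.

```python
import math


def lvq2_step(x, y, codewords, classes, alpha=0.05, window=0.3, weights=None):
    """One LVQ2-style update; returns True when two codewords were moved."""
    # Px with a diagonal weight matrix; identity weights when none are given.
    px = list(x) if weights is None else [p * v for p, v in zip(weights, x)]
    order = sorted(range(len(codewords)), key=lambda c: math.dist(px, codewords[c]))
    i, j = order[0], order[1]
    if classes[i] == y or classes[j] != y:
        return False  # nearest is correct, or second nearest is not the true class
    di, dj = math.dist(px, codewords[i]), math.dist(px, codewords[j])
    # Assumed window test: relative distances must be balanced enough.
    if di == 0.0 or min(di / dj, dj / di) <= (1.0 - window) / (1.0 + window):
        return False
    codewords[i] = [w - alpha * (v - w) for w, v in zip(codewords[i], px)]  # push away
    codewords[j] = [w + alpha * (v - w) for w, v in zip(codewords[j], px)]  # pull closer
    return True


codewords = [[0.0, 0.0], [1.0, 0.0]]
classes = ["a", "b"]
moved = lvq2_step((0.45, 0.0), "b", codewords, classes)
```

In the example, the sample of class "b" lies slightly closer to the class "a" codeword and inside the window, so the wrong codeword is pushed away and the correct one is pulled closer; all other codewords would stay unchanged.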

Datasets
The four methods presented earlier, k-CMVM, k-CMVM++, Constrained-CMVM and Constrained-CMVM++, were tested on several datasets provided by the UCI Machine Learning Repository [13]:
-The DrivFace database contains image sequences of subjects while driving in real scenarios. It is composed of 606 samples of 640 × 480 pixels each, acquired over different days from 4 drivers (2 women and 2 men) with several facial features like glasses and beard.
-The Ecoli and Yeast datasets contain protein localization sites. Each of the attributes used to classify the localization site of a protein is a score (between 0 and 1) corresponding to a certain feature of the protein sequence. The higher the score, the more likely it is that the protein sequence has that feature.
-The Glass dataset represents the oxide content of glass samples, used to determine their type. The study of the classification of types of glass was motivated by criminological investigation, since the glass left at the scene of a crime can be used as evidence if it is correctly identified.
-The Spam base dataset consists of 57 attributes giving information about the frequency of usage of some words, the frequency of capital letters and other indicators used to detect whether an e-mail is spam or not.
-Waveform describes 3 types of waves with added noise. Each class is generated from a combination of 2 of 3 "base" waves, and each instance is generated with added noise (mean 0, variance 1) in each attribute.
-The Wine data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Utility Measures and Statistical Analysis
The impact of microaggregation on the utility of anonymized data is quantified as the resulting accuracy of a machine learning model [34]. To measure the utility of the provided anonymized datasets we designed a decision tree model and used it to see how the anonymized data was classified by this model. We then compared the separability utility of the results of both approaches before and after introducing the discriminant information to get more insights on how much data quality have we traded for the sake of anonymization. The pre-anonymization step was crucial to create anonymized elements by views i.e. we didn't code the whole example by one model, instead, we coded each part of the example, depending on the view it belongs to, by the BMU in the case of k-CMVM, and by the linear mixture of the neighboring models in case of the Constrained-CMVM, we then used fine tuning to add another layer of anonymization. In table 2 we illustrate the results of the four algorithms after the anonymization. We would like to call the accuracy of a dataset, the separability utility since a good utility refers to good separability between the clusters. In table 2, the titles refer to the following: -Original: The initial separability utility of the raw data using the decision tree model with 10 folds cross-validation. k-CMVM: The separability utility of the dataset using the multi-view clustering with collaboration between the views and using the Kohonen Heuristic to determine the size of the maps to use. The examples during pre-anonymization were coded using the BMUs. -Constrained-CMVM: The separability utility of the dataset using the multi-view clustering with collaboration between the views and using the Kohonen Heuristic to determine the size of the maps to use. The examples during pre-anonymization were coded using the Linear Mixture of Models. -The ++ in the name of the methods refers to discriminant version. 
In table 2, the separability utility of the four data anonymization methods is compared to the original separability utility of the datasets. The separability utility of the anonymized datasets is better than the initial separability utility obtained on the raw data. This can be explained by the way we anonymized the initial dataset: the process relies on clustering, which implies that the different patterns of the datasets were discovered and the noise was omitted. In other words, microaggregation tends to remove non-decisive attributes from the dataset in order to gather similar elements together.
For each dataset, we proceed by splitting the original data into 3 views and clustering each view using the SOM clustering model; the size of each map is determined automatically by the Kohonen heuristic. The collaboration between the different views is done two by two, using the Davies Bouldin index as a stopping criterion: if the index increases, the collaboration goes further, and if it decreases, the collaboration stops. The views are anonymized by replacing each element of a cluster by the cluster representative. We then add a fine-tuning microaggregation step to reach a higher level of data anonymity.
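The view-wise coding step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the prototypes are hard-coded here, whereas in the paper they come from trained and collaborating SOMs, and the helper names (`code_by_bmu`, `anonymity_level`) as well as the toy data are assumptions made for the example.

```python
from collections import Counter


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def code_by_bmu(example, views, prototypes_per_view):
    """Replace each view of `example` by its closest prototype (its BMU).

    views: list of column-index lists, one list per view.
    prototypes_per_view: for each view, a list of prototype vectors
    (assumed given here; in practice they would come from a trained SOM).
    """
    coded = list(example)
    for columns, prototypes in zip(views, prototypes_per_view):
        part = [example[c] for c in columns]
        bmu = min(prototypes, key=lambda p: squared_distance(part, p))
        for c, value in zip(columns, bmu):
            coded[c] = value
    return tuple(coded)


def anonymity_level(records):
    """Size of the smallest group of identical records (the achieved k)."""
    return min(Counter(records).values())


# Toy data (made up): 6 examples, 4 attributes, 2 views of 2 columns each.
data = [
    (0.10, 0.20, 5.0, 5.1),
    (0.15, 0.25, 5.2, 4.9),
    (0.12, 0.18, 5.1, 5.0),
    (0.90, 0.80, 1.0, 1.2),
    (0.85, 0.95, 0.9, 1.1),
    (0.88, 0.82, 1.1, 1.0),
]
views = [[0, 1], [2, 3]]
prototypes_per_view = [
    [(0.12, 0.21), (0.88, 0.86)],  # assumed prototypes for view 1
    [(5.1, 5.0), (1.0, 1.1)],      # assumed prototypes for view 2
]

anonymized = [code_by_bmu(x, views, prototypes_per_view) for x in data]
print(anonymized)
print("achieved k:", anonymity_level(anonymized))
```

Because every example is rebuilt from per-view prototypes, identical records appear as soon as several examples share the same BMU in every view; counting the smallest group of identical records gives the anonymity level actually reached.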
To incorporate the discriminant information, we use multi-view clustering with SOM, relying on the SOM toolbox [42]. For each class, we use 10 prototypes to run pLVQ2.
Let us take the Waveform data as an illustrative example. The Waveform dataset used is noisy, which explains why the separability utility was equal to 76.88% at the start of the experiments. It increased by 6.1% after applying k-CMVM and by 4.6% after applying Constrained-CMVM; for the discriminant versions, we obtained an increase of 11.5% with each of the two ++ variants (table 2). The same holds across the datasets (DrivFace, Glass, Spam base, Waveform, Wine, Yeast), where the separability utility obtained after incorporating the discriminant information increased significantly compared to the separability utility at the start of the experiments.
A well-known method in the literature on data anonymization through microaggregation is the Maximum Distance to Average algorithm (MDAV), introduced in [14]. MDAV represents the key attributes of a dataset as points in Euclidean space, where k-anonymous microaggregation amounts to partitioning the points into cells of size k. The perturbed attributes are then characterized with a representative point at maximum distance from the average. In table 2, we report the results of MDAV compared to the k-CMVM and Constrained-CMVM algorithms. Both proposed algorithms outperform the MDAV method, as shown in figure 2. In table 2, the graphics show a comparison between the separability utility levels of the methods. In all cases, k-CMVM and Constrained-CMVM outperform MDAV. This can be explained by the fact that MDAV microaggregates the whole data and then represents each cluster by the element farthest from the cluster center, whereas in the methods that we propose, the multi-view clustering and the two-level microaggregation help preserve the characteristics inherent to each element, and the coding occurs on a local dimension.
To evaluate the performance of our proposed approaches, we use the Friedman test and the Nemenyi test recommended in [12]. The Friedman test is conducted to verify the null hypothesis that all approaches are equivalent with respect to accuracy. If the null hypothesis is rejected, the Nemenyi test is then performed: if the average ranks of two approaches differ by at least the critical difference (CD), it can be concluded that their performances are significantly different. In the Friedman test, we set the significance level to α = 0.05. Figure 1 shows a critical difference diagram, which projects the average ranks of the classifiers onto a numbered axis. The classifiers are ordered from left (the best) to right (the worst), and a thick line connects the classifiers whose average ranks are not significantly different (at the 5% significance level). As shown in figure 1, Constrained-CMVM++ achieves a significant improvement over the other proposed techniques, since it incorporates discriminating information from the labels to better position the prototypes in the data space. As a result, the coding of the data is of better quality because it takes into account intra-class and inter-class variability.
Figure 2 shows the projections obtained with Principal Component Analysis (PCA) on the Ecoli, Waveform and Yeast datasets. The figures illustrate how the data behaves after anonymization: the shape of the data does not change, but fewer data points are visible since they lie on top of each other. The elements of each cluster are represented in the same manner, which implies that the number of distinct points is reduced while the methods respect the initial data structure.
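The Friedman and Nemenyi procedure described above can be sketched in a few lines. The sketch below is a minimal version under stated assumptions: the score table is made up for illustration and is not taken from table 2, the method columns are anonymous, the Friedman statistic is the basic chi-square form without tie correction, and the critical values are the standard tabulated ones for α = 0.05.

```python
import math

# Chi-square critical values at alpha = 0.05, indexed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}
# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05,
# indexed by the number of compared methods.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def rank_row(scores):
    """Rank the scores of one dataset: rank 1 is the highest score,
    tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def friedman_nemenyi(table):
    """table[d][m] is the score of method m on dataset d.
    Returns (average ranks, Friedman chi-square statistic, critical difference)."""
    n, k = len(table), len(table[0])
    ranks = [rank_row(row) for row in table]
    avg_ranks = [sum(r[m] for r in ranks) / n for m in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    cd = Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6 * n))
    return avg_ranks, chi2, cd


# Hypothetical accuracies: 7 datasets (rows) x 4 methods (columns).
scores = [
    [0.81, 0.79, 0.86, 0.90],
    [0.70, 0.68, 0.74, 0.77],
    [0.92, 0.91, 0.94, 0.96],
    [0.65, 0.60, 0.71, 0.73],
    [0.88, 0.85, 0.90, 0.93],
    [0.77, 0.75, 0.80, 0.84],
    [0.59, 0.55, 0.63, 0.66],
]

avg_ranks, chi2, cd = friedman_nemenyi(scores)
reject_null = chi2 > CHI2_CRIT_05[len(scores[0]) - 1]
significant_pairs = [
    (i, j)
    for i in range(len(avg_ranks))
    for j in range(i + 1, len(avg_ranks))
    if abs(avg_ranks[i] - avg_ranks[j]) >= cd
]
print(avg_ranks, round(chi2, 3), round(cd, 3), reject_null, significant_pairs)
```

Two methods whose average ranks differ by less than `cd` would be joined by the thick line in a critical difference diagram.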

Cluster validity indices

Davies Bouldin Index
The score is defined as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters which are farther apart and less dispersed result in a better score. The Davies-Bouldin (DB) index [10] is defined as:
$$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\Delta_n(c_k) + \Delta_n(c_{k'})}{\Delta(c_k, c_{k'})}$$
where K is the number of clusters, $\Delta(c_k, c_{k'})$ is the distance between the cluster centres $c_k$ and $c_{k'}$, and $\Delta_n(c_k)$ is the average distance of all elements of the cluster $C_k$ to their cluster centre $c_k$. This index evaluates the quality of unsupervised clustering based on the compactness of the clusters and a measure of separation between clusters. It is based on the ratio of the sum of within-cluster scatter to between-cluster separation. The lower the value of the DB index, the better the quality of the clustering. In table 3, the index decreased after adding the discriminant information for almost 70% of the tests.
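As a concrete reading of the definition above, the following sketch computes the Davies-Bouldin index on toy clusters. It assumes Euclidean distance and centroids as cluster centres; the function names and the data are illustrative only.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (tuples)."""
    centers = [centroid(c) for c in clusters]
    # Average distance of the members of each cluster to its centre.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for k in range(len(clusters)):
        # Worst (largest) scatter-to-separation ratio for cluster k.
        total += max(
            (scatter[k] + scatter[j]) / math.dist(centers[k], centers[j])
            for j in range(len(clusters))
            if j != k
        )
    return total / len(clusters)


close = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
far = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
db_close = davies_bouldin(close)
db_far = davies_bouldin(far)
print(db_close, db_far)
```

With the same within-cluster scatter, moving the second cluster further away lowers the index, which matches the rule that a lower value indicates a better clustering.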

Silhouette Index
The silhouette score is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample [35]. For an instance $x_i$, let $a_i$ be the average distance between $x_i$ and the instances belonging to the same cluster, and $b_i$ the average distance between $x_i$ and the instances of the nearest other cluster. The silhouette value of $x_i$ is the difference between the two, normalized by the larger one, $s_i = (b_i - a_i) / \max(a_i, b_i)$. The closer the silhouette value is to 1, the better the instance is assigned to its cluster.
It is generally used to find the number of clusters that produces a subdivision of the dataset into dense blocks that are well separated from each other. The score is closer to one when clusters are dense and well separated, which relates to a standard concept of a cluster. In table 4, the only dataset that shows a different behavior after incorporating the discriminant information is DrivFace, which is explained by the unbalanced nature of this dataset.
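The silhouette computation can be illustrated with a short sketch. It assumes Euclidean distance and the usual convention that a point alone in its cluster gets a silhouette of 0; the toy partitions are made up for the example.

```python
import math


def mean_silhouette(clusters):
    """Mean silhouette value over all points.
    clusters: list of clusters, each a list of points (tuples)."""
    values = []
    for k, cluster in enumerate(clusters):
        for idx, x in enumerate(cluster):
            if len(cluster) == 1:
                values.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(
                math.dist(x, y) for j, y in enumerate(cluster) if j != idx
            ) / (len(cluster) - 1)
            # b: smallest mean distance to the members of another cluster.
            b = min(
                sum(math.dist(x, y) for y in other) / len(other)
                for m, other in enumerate(clusters)
                if m != k
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)


good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]  # compact, well separated
bad = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]   # same points, mixed up
s_good = mean_silhouette(good)
s_bad = mean_silhouette(bad)
print(round(s_good, 3), round(s_bad, 3))
```

The same four points give a value close to 1 when grouped into dense, separated clusters and a negative value when the groups are mixed.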

Calinski Harabasz Index
Also known as the Variance Ratio Criterion, the Calinski-Harabasz score [6] is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters (where dispersion is defined as the sum of squared distances). The Calinski-Harabasz index is defined as:
$$CH = \frac{\operatorname{tr}(S_B)}{\operatorname{tr}(S_W)} \times \frac{N - K}{K - 1}$$
where $S_B$ is the between-cluster dispersion matrix, $S_W$ is the within-cluster dispersion matrix, N is the number of examples and K is the number of clusters. The Calinski-Harabasz index ranges from 0 (worst classification) to +∞ (best classification). It is highly dependent on N: all other things being equal, it grows linearly with N. Therefore, its order of magnitude can vary considerably from one dataset to another. Since we look for a low intra-cluster dispersion (dense agglomerates) and a high inter-cluster dispersion (well-separated agglomerates), the greater the index, the better the clustering. In table 5, the only dataset where the index did not increase is DrivFace, which is due to its unbalanced nature.
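The sketch below computes the Calinski-Harabasz index from the traces of the two dispersion matrices, using squared Euclidean distances. The toy data and function names are assumptions for the example; duplicating every point illustrates the dependence on N mentioned above.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of points (tuples)."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    centers = [centroid(c) for c in clusters]
    # Trace of S_B: size-weighted squared distances of centres to the overall mean.
    between = sum(len(c) * sq_dist(ctr, overall) for c, ctr in zip(clusters, centers))
    # Trace of S_W: squared distances of the points to their own centre.
    within = sum(sq_dist(p, ctr) for c, ctr in zip(clusters, centers) for p in c)
    return (between / (k - 1)) / (within / (n - k))


clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
doubled = [c + c for c in clusters]  # same geometry, twice as many points
ch = calinski_harabasz(clusters)
ch_doubled = calinski_harabasz(doubled)
print(ch, ch_doubled)
```

The geometry of the clusters is unchanged in the second call, yet the index is larger, so values should only be compared on the same dataset.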

Structural Utility using the Earth Mover's Distance
We believe that measuring the distance between two distributions is a natural way to evaluate the difference between two datasets. The amount of utility lost in the process of anonymization can be seen as the distance between the anonymized dataset and the original one. The Earth Mover's distance (EMD), also known as the Wasserstein distance [36], extends the notion of distance between two single elements to that of a distance between sets or distributions of elements. It compares the probability distributions P and Q on a measurable space (Ω, Ψ) and is defined as follows (we use the distance of order 1):
$$W_1(P, Q) = \inf_{\mu} \int_{\Omega \times \Omega} d(x, y) \, d\mu(x, y)$$
where µ ranges over the probability measures on (Ω × Ω, Ψ ⊗ Ψ) with marginals P and Q, Ω × Ω is the product probability space and d(x, y) is the distance between the elements x and y. Notice that we may extend the definition so that P is a measure on a space (Ω, Ψ) and Q is a measure on a space (Ω′, Ψ′).
Let us examine how the above is applied in the case of discrete sample spaces. For generality, we assume that P is a measure on (Ω, Ψ) where $\Omega = \{x_i\}_{i=1}^{n}$ and Q is a measure on (Ω′, Ψ′) where $\Omega' = \{y_j\}_{j=1}^{n'}$; the two spaces are not required to have the same cardinality.
Then, the distance between P and Q becomes:
$$W_1(P, Q) = \min_{f_{ij} \geq 0} \sum_{i=1}^{n} \sum_{j=1}^{n'} f_{ij} \, d(x_i, y_j) \quad \text{subject to} \quad \sum_{j=1}^{n'} f_{ij} = P(x_i), \quad \sum_{i=1}^{n} f_{ij} = Q(y_j)$$
where $f_{ij}$ is the mass moved from $x_i$ to $y_j$. EMD is thus the minimum amount of work needed to transform one distribution into another. In our case, we measure the EMD between the anonymized and the original datasets, attribute by attribute, to get an idea of the distortion of the anonymized datasets. We then normalize all distances between 0 and 1 and define the utility as $1 - W_1(P, Q)$. The smaller the distance $W_1$ is, the more the data utility is preserved.
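For a single numeric attribute, the order-1 distance reduces to the area between the two empirical cumulative distribution functions, which makes the attribute-by-attribute computation easy to sketch. The normalization used below (rescaling each attribute with the range of the original data) and the averaging over attributes are assumptions, since the text does not detail them; the toy data is made up.

```python
def wasserstein_1d(xs, ys):
    """Order-1 Wasserstein distance between two 1-D empirical distributions,
    computed as the area between their cumulative distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    dist = 0.0
    i = j = 0
    for left, right in zip(grid, grid[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        dist += abs(i / len(xs) - j / len(ys)) * (right - left)
    return dist


def structural_utility(original, anonymized):
    """1 minus the mean per-attribute W1 distance, each attribute being
    rescaled to [0, 1] with the range of the original data (an assumption)."""
    distances = []
    for a in range(len(original[0])):
        col_o = [row[a] for row in original]
        col_a = [row[a] for row in anonymized]
        lo, hi = min(col_o), max(col_o)
        span = (hi - lo) or 1.0  # constant attribute: avoid dividing by zero
        distances.append(
            wasserstein_1d(
                [(v - lo) / span for v in col_o],
                [(v - lo) / span for v in col_a],
            )
        )
    return 1.0 - sum(distances) / len(distances)


original = [(1, 10), (2, 20), (3, 30), (4, 40)]
# Toy microaggregation with k = 2: each pair is replaced by its mean.
anonymized = [(1.5, 15), (1.5, 15), (3.5, 35), (3.5, 35)]
w_first_attribute = wasserstein_1d([r[0] for r in original], [r[0] for r in anonymized])
utility = structural_utility(original, anonymized)
print(w_first_attribute, round(utility, 4))
```

An unchanged dataset yields a utility of exactly 1, and the utility decreases as microaggregation moves more mass over larger distances.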

Preserving combined utility
To choose the anonymization method which best addresses the trade-off between separability utility and structural utility, we propose to combine the two types of utility in a single measure, with α = 1/2:
$$\text{Comb\_Utility} = \alpha \cdot \text{Separability\_Utility} + (1 - \alpha) \cdot \text{Structural\_Utility}$$
Table 7 summarizes the results of the proposed approaches in terms of combined utility (Comb_Utility). As can be seen, our approach Attribute-oriented generally performs best across the datasets. To further evaluate the performance, we compute a measurement score following [?], based on Comb_Utility($A_i$, $D_j$), the combined utility value of method $A_i$ on dataset $D_j$. This score gives an overall evaluation over all the datasets and shows that our approach Attribute-oriented outperforms the other methods substantially in most cases. As shown in table 7, the introduction of the discriminant information improves the utility of the anonymized datasets for all of the proposed methods.
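A minimal sketch of how such a combined utility could be used to select a method is given below. It assumes the combination is a convex mixture of the two utilities weighted by α, with both utilities already expressed in [0, 1]; the method names and utility values are hypothetical and do not come from table 7.

```python
def combined_utility(separability, structural, alpha=0.5):
    """Convex combination of the two utilities (assumed form):
    alpha weights separability, 1 - alpha weights structural utility."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * separability + (1.0 - alpha) * structural


# Hypothetical (separability, structural) utilities for two methods.
candidates = {
    "method_A": (0.83, 0.90),
    "method_B": (0.88, 0.80),
}
scores = {name: combined_utility(sep, struct) for name, (sep, struct) in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With α = 1/2 both utilities weigh equally, so a method with slightly lower separability can still be preferred when it distorts the data distribution less.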

Conclusion
In this paper, we covered in detail four approaches to data anonymization through microaggregation: k-CMVM and Constrained-CMVM, which use the Collaborative Multi-View paradigm, and k-CMVM++ and Constrained-CMVM++, which we proposed to improve the quality of the anonymized dataset using the ground truth labels. The results shown above demonstrate the efficiency of the methods and illustrate their importance.
Our process started by experimenting with multi-view clustering, since we believe it is an efficient way to deal with multi-source data and high-dimensional elements. Second, we have shown that collaborative topological clustering improves the quality of the clustering, which makes the model more accurate. Third, the pre-anonymization using the Linear Mixture of SOM gives better results in terms of separability utility than using BMUs. Fourth, we found a good trade-off between the separability utility and the anonymity levels. Finally, we evaluated the limits and possibilities of incorporating the discriminant information when the ground truth labels are known, and compared its performance to the literature and to k-CMVM and Constrained-CMVM.
We are looking for other ways to anonymize data. We are currently experimenting with 1D clustering as a way to anonymize data without losing the information it contains, and we want to explore new methods to anonymize unbalanced datasets.