Unsupervised collaborative learning based on Optimal Transport theory

Abstract: Collaborative learning has recently achieved very significant results. It still suffers, however, from several issues, including the type of information that needs to be exchanged, the criteria for stopping, and how to choose the right collaborators. In this paper we aim to improve the quality of the collaboration and to resolve these issues via a novel approach inspired by Optimal Transport theory. More specifically, the objective function for the exchange of information is based on the Wasserstein distance, with a bidirectional transport of information between collaborators. This formulation allows learning a stopping criterion and provides a criterion to choose the best collaborators. Extensive experiments are conducted on multiple data-sets to evaluate the proposed approach.


Introduction
Data clustering is one of the main interests in unsupervised Machine Learning research [1]. A large number of clustering algorithms have been proposed in the literature [2], divided into different families based on the cost function to optimize [1,3].
The clustering task is known to be difficult and to suffer from several issues. Most of the problems come from the fact that unsupervised algorithms work with very little information about the expected result [4]. Therefore, the choice of the cost function to optimize, the algorithm to use and the values of the parameters require a lot of expertise to obtain the desired output [5]. In addition, modern data-sets are often very large (both in size and dimension) and distributed over several sites [6], which limits the efficiency of most classical clustering algorithms [7].
In an attempt to solve these issues, the scientific community has suggested several ways of combining the results of different algorithms [8]. Several approaches have been proposed in that direction, based on the idea of several algorithms working on the data: either each algorithm optimizes a different cost function or works with different parameter values on the same data-set, or each algorithm works on a subset of the data, usually trying to optimize the same cost function. These approaches can be classified into two main categories. In Ensemble Learning approaches, several algorithms are trained on the data and the set of results is merged into a global consensus [9]. In Collaborative Clustering, several models are trained simultaneously on the data-set, usually with each algorithm working on a sub-set of the data, and exchange information during the learning process [10]. In this paper we focus on the latter approach.
Generally speaking, the problem of Collaborative Clustering can be defined as follows: given a finite number of disjoint data sites, collaborative clustering is a scheme of collective development and reconciliation of fundamental cluster structures across these sites [11]. The general framework for collaborative clustering is based on two principal steps:
-Local step: each algorithm trains on the data it has access to and produces a clustering result, e.g. a model of the local data subset.
-Collaborative step: the algorithms share their outputs in order to confirm or improve their models, with the goal of finding better clustering solutions.
In this paper, we propose to study the unsupervised collaboration framework through Optimal Transport theory, thus benefiting from this mathematical formalism to analyze and describe the process of collaboration between the different algorithms. In this case, the collaboration, which consists of exchanges of information between algorithms, will be modeled in the form of bi-directional or even multi-directional transports.
The rest of the paper is organized as follows. Section 2 reviews the prototype-based methods proposed for the collaborative learning task, and Section 3 develops the background of Optimal Transport theory. In Section 4 we introduce the novel framework of collaborative clustering using Optimal Transport theory. In Section 5 we provide an experimental validation and discuss the quality of the proposed approach. Finally, in Section 6 a conclusion and some perspectives are given.

Related work
Collaborative clustering was first introduced by [11] under the name "Collaborative Fuzzy Clustering" (CoFC). This approach was based on an extended version of Fuzzy C-Means adapted to distributed data. The algorithm is based on two steps: the first step aims to find c clusters for each collaborator, where each object is assigned to some cluster with a certain membership degree stored in a matrix S. The second step consists in exchanging the information stored in the matrix S, or the prototypes of each cluster. The Fuzzy C-Means algorithm is then trained again for each collaborator, taking the shared information into account.
Several studies have been conducted to develop algorithms and approaches within this framework, such as CoEM [12], CoFKM [13] and a collaborative EM-like algorithm (EM for Expectation-Maximization) based on Markov Random Fields [14]. All these approaches follow the same principle as Collaborative Fuzzy C-Means.
However, these algorithms display similar limitations: they require the same number of clusters in each site and the same model trained in each site, and the collaboration can only happen between instances of the same algorithm.
Collaborative clustering was also developed based on Self-Organizing Maps (SOM) [15] by adapting the original objective function to distributed data. The main idea was to add a term inspired by the classical SOM neighborhood function to the original SOM objective function, where this term aims to compare the neighborhoods of each prototype in each site. This neighborhood term is adaptable to either horizontal or vertical collaboration. The same principle can also be adapted to Generative Topographic Maps (GTM) [16] with a modification in the M-step of the EM algorithm, which consists in adding a collaborative term inspired by penalized likelihood estimation [17].
Another approach proposed in this framework is the SAMARAH algorithm [10,18], which has the advantage of not requiring a smoothness function or the same number of clusters or prototypes. However, it is restricted to horizontal collaboration only, and its principle of solving conflicts based on a pairwise criterion can make the process volatile.
Recent works have aimed to develop collaborative clustering further and make it more flexible [19,20], ensuring a collaboration between different algorithms without fixing a unique number of clusters for all of the collaborators. The advantage of this approach is that different families of clustering algorithms can exchange information in a collaborative framework. Nevertheless, one of the most important issues in collaborative clustering is the control over the quality of the information exchanged between collaborators, and the right time to stop the collaboration. In [21,22], the authors developed a new criterion to select the optimal collaborator. They showed that the diversity between collaborators can have an important impact on the collaboration. Furthermore, a recent study of the influence of diversity on the collaboration, based on entropy [23], showed a trade-off between the quality gain and the diversity between the collaborators.

Fundamental background of the proposed approach
In this section we present the mathematical formalization of the Optimal Transport problem and how it can be solved using the Sinkhorn algorithm [24].

Optimal Transport
Optimal Transport is a well-established theory introduced by Monge [25] to solve the problem of resource allocation. The basis of this theory was to compute the optimal path of a massive particle from one point to another, by minimizing the cost of this move, or transportation. Later, the Monge problem was relaxed by Kantorovich [26], where the problem is transposed to a distribution problem using linear programming connecting a pair of distributions.
More formally, let $\Omega \subseteq \mathbb{R}^n$ be a measurable space of dimension $n$, and let $\mathcal{P}(\Omega)$ denote the set of probability measures on $\Omega$. Given two data sets $X_s$ and $X_t$ in $\Omega$, let $\mu_s$ and $\mu_t$ be their respective distributions over $\Omega_s$ and $\Omega_t$.
The transport map $T$ from $\mu_s$ to $\mu_t$ is defined as the pushforward $T\#\mu_s = \mu_t$, where $T$ transforms the probability measure $\mu_s$ into its image measure $T\#\mu_s$, another probability measure defined over $\Omega_t$ and satisfying:
$$T\#\mu_s(B) = \mu_s(T^{-1}(B)), \quad \forall B \subseteq \Omega_t.$$
The Monge-Kantorovich formulation of this problem is a convex relaxation which aims to find a coupling $\gamma$, defined as a joint probability measure over $\Omega_s \times \Omega_t$ with marginals $\mu_s$ and $\mu_t$, that minimizes the cost of transport w.r.t. $c : \Omega_s \times \Omega_t \to \mathbb{R}^+$:
$$\gamma^* = \underset{\gamma \in \Pi}{\arg\min} \int_{\Omega_s \times \Omega_t} c(x_s, x_t)\, d\gamma(x_s, x_t),$$
where $\Pi$ is the set of all probabilistic couplings in $\mathcal{P}(\Omega_s \times \Omega_t)$ with marginals $\mu_s$ and $\mu_t$, and $\gamma^*$ designates the optimal transportation plan. This problem admits a unique solution $\gamma^*$, which allows to define the Wasserstein distance of order $p \in [1, +\infty[$ between $\mu_s$ and $\mu_t$:
$$W_p(\mu_s, \mu_t) = \left( \inf_{\gamma \in \Pi} \int_{\Omega_s \times \Omega_t} d^p(x_s, x_t)\, d\gamma(x_s, x_t) \right)^{1/p},$$
where $d$ is a distance corresponding to the cost function $c(x_s, x_t) = d^p(x_s, x_t)$, $c : \Omega_s \times \Omega_t \to \mathbb{R}^+$, of transporting the unit mass $x_s$ to $x_t$.
In this work, we focus on the discrete case of the Optimal Transport problem. However, we refer to [27] for more details on the continuous case and the mathematics involved.
We consider the discrete setting of the Optimal Transport problem. This case arises when $\mu_s$ and $\mu_t$ are only accessible through discrete samples. The empirical measures can be defined as:
$$\mu_s = \sum_{i=1}^{n_s} p_i^s \delta_{x_i^s}, \qquad \mu_t = \sum_{i=1}^{n_t} p_i^t \delta_{x_i^t},$$
with $\delta_{x_i}$ the Dirac function at $x_i \in \mathbb{R}^n$, and $p_i^s$ and $p_i^t$ the probability masses associated to the $i$-th samples. The Monge-Kantorovich problem consists in finding an optimal coupling (or transportation plan) $\gamma^*$ as a joint probability between $\mu_s$ and $\mu_t$ over $\Omega_s \times \Omega_t$, by minimizing the cost of the transport w.r.t. $X_s \in \mathbb{R}^{n_s \times n}$ and $X_t \in \mathbb{R}^{n_t \times n}$, by solving:
$$\gamma^* = \underset{\gamma \in \Pi(\mu_s, \mu_t)}{\arg\min} \; \langle \gamma, C \rangle_F,$$
with $\langle \cdot, \cdot \rangle_F$ the Frobenius dot product, $C \in \mathbb{R}^{n_s \times n_t}_+$ the transport cost matrix with $C_{ij} = c(x_i^s, x_j^t)$, and $\Pi(\mu_s, \mu_t) = \{ \gamma \in \mathbb{R}^{n_s \times n_t}_+ \mid \gamma \mathbf{1}_{n_t} = \mu_s, \; \gamma^T \mathbf{1}_{n_s} = \mu_t \}$ the transportation polytope, where $\mathbf{1}_n$ is an $n$-dimensional vector of ones.
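As a concrete illustration of this discrete formulation, the sketch below computes an exact optimal coupling and the associated 2-Wasserstein distance between two small empirical measures. It relies on the POT library (`pip install pot`), which is not used in the paper and is assumed here purely for illustration; the sample sizes and the squared Euclidean cost are arbitrary choices.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed dependency)

rng = np.random.RandomState(0)
Xs = rng.randn(5, 2)           # n_s = 5 source samples in R^2
Xt = rng.randn(7, 2) + 2.0     # n_t = 7 target samples, shifted

mu_s = np.full(5, 1 / 5)       # uniform probability masses p_i^s
mu_t = np.full(7, 1 / 7)       # uniform probability masses p_i^t

C = ot.dist(Xs, Xt, metric='sqeuclidean')  # cost matrix C_ij = d^2(x_i^s, x_j^t)
gamma = ot.emd(mu_s, mu_t, C)              # optimal coupling (exact linear program)

W2 = np.sum(gamma * C) ** 0.5              # W_2 = (<gamma*, C>_F)^(1/2)
print(gamma.shape, W2)                     # (5, 7) coupling and a scalar distance
```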
This problem admits a unique solution $\gamma^*$ and defines a metric called the Wasserstein distance on the space of discrete probability measures, as follows:
$$W_p(\mu_s, \mu_t) = \left( \langle \gamma^*, C \rangle_F \right)^{1/p}.$$
The Wasserstein distance has recently proven very useful in machine learning, for instance in domain adaptation [28], metric learning [29], clustering [30] and multi-level clustering [31,32]. The particularity of this distance is that it takes into account the geometry of the data through the distances between samples, which explains its efficiency. In terms of computation, the success of this distance also comes from the work of Cuturi [24], who introduced an algorithm based on entropic regularization, as presented in the next section.

Regularized Optimal Transport
Even though the Wasserstein distance has known very significant successes, its computation has always suffered from a very slow convergence, especially in high dimension. This led to the idea of smoothing the objective function by adding an entropic regularization term, introduced in [33] and applied to the Optimal Transport problem in [24], in order to speed up the convergence and improve the stability [34]. This is represented formally by the following minimization problem:
$$\gamma^*_\lambda = \underset{\gamma \in \Pi(\mu_s, \mu_t)}{\arg\min} \; \langle \gamma, C \rangle_F - \frac{1}{\lambda} E(\gamma), \qquad (8)$$
where $E(\gamma) = -\sum_{i,j}^{n_s, n_t} \gamma_{ij} \log(\gamma_{ij})$ is the entropy of the coupling, $\lambda > 0$ the entropy regularization parameter and $C$ the cost matrix. Thanks to the strong convexity of the entropy, the objective function becomes strictly convex. Consequently, the minimization problem (8) admits a unique solution and can be solved by Sinkhorn's fixed-point algorithm, based on the following theorem.

Sinkhorn theorem (1967):
For any positive matrix $A \in \mathcal{M}_{n \times m}(\mathbb{R}_+)$, $a \in \Sigma_n$ and $b \in \Sigma_m$, there is a unique pair of vectors $(u, v) \in \mathbb{R}^n_+ \times \mathbb{R}^m_+$ such that $\mathrm{diag}(u)\, A\, \mathrm{diag}(v) \in U(a, b)$, and it constitutes a fixed point of the application:
$$(u, v) \mapsto \left( \frac{a}{A v}, \; \frac{b}{A^T u} \right),$$
where the divisions are element-wise. Thanks to the regularized version of Optimal Transport, we obtain a less sparse, smoother and more stable solution than for the original problem. Another important advantage is that this formulation allows the scaling matrix approach of Sinkhorn-Knopp [35].
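A minimal NumPy sketch of this fixed-point iteration is given below; it follows the convention $\gamma^*_\lambda = \mathrm{diag}(u)\, e^{-\lambda C}\, \mathrm{diag}(v)$ induced by problem (8). The tolerance and iteration cap are our own arbitrary choices, not the authors'.

```python
import numpy as np

def sinkhorn(mu_s, mu_t, C, lam=10.0, n_iter=1000, tol=1e-9):
    """Entropy-regularized OT plan via Sinkhorn's fixed-point iteration.

    mu_s, mu_t : marginal weights (each summing to 1).
    C : cost matrix of shape (len(mu_s), len(mu_t)).
    lam : entropic regularization parameter (lambda in the text).
    """
    K = np.exp(-lam * C)               # Gibbs kernel A = e^{-lambda C}
    u = np.ones_like(mu_s)
    for _ in range(n_iter):
        u_prev = u
        v = mu_t / (K.T @ u)           # fixed point: v <- b / (A^T u)
        u = mu_s / (K @ v)             # fixed point: u <- a / (A v)
        if np.max(np.abs(u - u_prev)) < tol:
            break                      # scaling vectors have converged
    return u[:, None] * K * v[None, :]  # gamma = diag(u) K diag(v)
```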
The regularized Optimal Transport plan is then found by iteratively computing the two scaling vectors $u$ and $v$, such that $\gamma^*_\lambda = \mathrm{diag}(u)\, e^{-\lambda C}\, \mathrm{diag}(v)$.

Proposed approach

Motivation and potential applications
With the development of hardware technology, a huge amount of data represented in different views and with different structures has been generated in real-world applications. This kind of data raises a new challenge: extending existing clustering algorithms, designed for single-view data, to be more adaptable to multi-view data.
To clarify the motivation behind the proposed approach, we present some potential industrial applications, where several organizations or companies use a collection of data sets that could concern either the same or different customers. This could be data describing customers of banking institutions, state organizations, hospitals with medical records, etc. Imagine that all these organizations deal with the same individuals, but every organization may have different characteristics and descriptors for these individuals, linked to its activities. All these organizations may want to apply data mining algorithms on their own data set. On the other hand, they also recognize that, since there are other data sets containing information about the same individuals, it would be advantageous to learn about the dependencies between them, so as to reveal a macro-picture. However, due to ethical considerations and privacy issues, these organizations are forbidden to share their data sets, which prevents the experts from combining all these data sets into a single view and running classical clustering algorithms. For example, the confidentiality requirements on patients' medical records can deny access to their personal information, and security constraints in banking organizations forbid sharing customer information. In addition, experts may hesitate to add more information and characteristics for fear of losing the real structure of the data. In this situation, the exchange of information through the proposed approach guarantees the privacy of the information of each organization, and the control of the collaboration avoids affecting the real structure of the data.
One of the most difficult challenges in collaborative learning is how to choose the right collaborator to collaborate with, which determines the order of the collaboration, not only to increase the local quality of each model, but also to ensure convergence and avoid over-fitting (Figure 1).
Classical collaborative algorithms are based on two steps: the first consists in clustering the data locally, the second in sending and receiving information between the local models. Despite the quantity of work in this framework, it still requires many restrictions to ensure convergence: usually each algorithm must work on the same representation space and must compute the same number of clusters. These restrictions limit the flexibility of collaborative clustering approaches for the analysis of real data.
On the other hand, Optimal Transport theory has shown very significant results, especially in transfer learning [28] and for the comparison of distributions. Based on this idea, our intuition is to model collaborative learning as a bi-directional knowledge transfer, and to improve the optimization of the cost function based on the comparison of the distributions of the local subsets, in order to weight the mutual confidence of the collaborators and to use a transport plan to transfer the information between them. In the next section we detail the proposed approach based on Optimal Transport theory, for both vertical and horizontal collaboration.

Collaborative Learning algorithms
The main goal of the proposed approach is to improve the quality and stability of the collaboration and to guarantee convergence without over-fitting. In collaborative clustering, we distinguish two principal settings: vertical and horizontal collaboration. In vertical collaboration, the collaborators learn from different instances represented in the same space, while in horizontal collaboration the collaborators work on the same instances in different representation spaces.
In general, different frameworks must be used for vertical and horizontal collaboration. Here we propose a unified framework adapted to both approaches.

Local step
Let us consider $r$ collaborators, where the data of each collaborator $v$ is denoted $X^v$ and represented by the empirical distribution $\mu^v$. In the local step, we seek to find the centroids $M^v$, corresponding to a distribution $\nu^v$, that represent the local clusters of each collaborator $v$, so as to minimize the Optimal Transport cost between the local data $X^v$ and the centroids $M^v$.
To achieve this, we solve the following minimization problem (11), where the inner minimization over $L^v$ consists in finding the Optimal Transport plan between the data and the centroids, and the outer minimization over $M^v$ aims to update the distribution of the centroids so that the transport plan is optimal between the data and the centroids:
$$\underset{M^v}{\arg\min} \; \underset{L^v \in \Pi(\mu^v, \nu^v)}{\min} \; \langle L^v, C^v \rangle_F - \frac{1}{\lambda} E(L^v), \qquad (11)$$
where $C^v_{ij} = d^p(x^v_i, m^v_j)$ is the cost of transport between the instances $x^v_i$ and the centroids $m^v_j$. It should be noticed that solving (11) is equivalent to Lloyd's problem, i.e. the Expectation-Minimization algorithm, when $d = 1$ and $p = 2$, without any constraints on the weights. This is why, to solve this problem, we alternate between computing the Sinkhorn matrix $L^v$, to assign instances to the closest cluster, and updating the centroids to decrease the transportation cost.

Algorithm 1: Sinkhorn-Means local algorithm
Algorithm 1 details the computation of the local objective function (11), proceeding similarly to k-means but with the advantage of using the Wasserstein distance. This allows to get a soft assignment of the data, contrary to k-means, which means that the components of the assignment matrix satisfy $l_{ij} \in [0, \frac{1}{n}]$. Besides, the penalty term based on the entropic regularization guarantees a solution with higher entropy, which increases the stability of the algorithm and ensures a more uniform assignment of the instances.
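The following sketch is our reading of Algorithm 1, under the assumptions of a squared Euclidean cost and uniform weights on instances and centroids; it reuses the `sinkhorn` routine sketched above, and all names and defaults are ours, not the authors'.

```python
import numpy as np

def sinkhorn_means(X, k, lam=10.0, n_iter=50, seed=0):
    """Sinkhorn-Means sketch: alternate OT soft assignment and centroid updates.

    Requires the sinkhorn() function defined in the previous section.
    """
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    M = X[rng.choice(n, k, replace=False)]   # initial centroids drawn from the data
    mu = np.full(n, 1 / n)                   # uniform instance masses (mu^v)
    nu = np.full(k, 1 / k)                   # uniform centroid masses (nu^v)
    for _ in range(n_iter):
        # Cost matrix C^v_ij = ||x_i - m_j||^2
        C = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
        # Soft assignment: entries l_ij of the plan lie in [0, 1/n]
        L = sinkhorn(mu, nu, C, lam=lam)
        # Move each centroid to the barycenter of its softly assigned mass
        M = (L.T @ X) / L.sum(axis=0)[:, None]
    return M, L
```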

Global step
The global step aims to compute the collaboration between the models, where each collaborator can update its local clustering based on information exchanged with the other collaborators, until stabilization of the clusters with improved quality. In the proposed approach, the collaboration step can be seen as two simultaneous phases.
The first phase aims to create an interaction plan based on the Sinkhorn matrix distance, which compares the local distribution of each collaborator to the others. The idea behind this phase is to allow each model to select the best collaborator to exchange information with; in other words, the algorithm also learns the best order of the collaborations at each iteration. The heuristic work in [21] proved that a collaboration with a model proposing a very different data distribution decreases the local quality, while a collaboration between very similar models is ineffective. Thus, the most beneficial collaboration is the one with models of median diversity. Hence, after the construction of the transport plan using the Sinkhorn algorithm, which compares the local structures, the proposed algorithm learns to choose for each model the collaborator with the median distribution similarity.
The second phase consists in exchanging information between collaborators to improve the local quality of each model. More precisely, we seek to transport the distant prototypes so as to influence the location of the local prototypes, in order to get a higher local quality for each collaborator.
Considering the same notation as above, we seek to minimize the following objective function:
$$\underset{M^v}{\arg\min} \; \left[ \underset{L^v}{\min} \; \langle L^v, C^v \rangle_F - \frac{1}{\lambda} E(L^v) \right] + \sum_{v' \neq v} \alpha_{v',v} \left[ \underset{L^{v,v'}}{\min} \; \langle L^{v,v'}, C^{v,v'} \rangle_F - \frac{1}{\lambda} E(L^{v,v'}) \right], \qquad (12)$$
where the first term deals with the local clustering, while the remaining term is the collaboration term and represents the influence of the distant centroids' distributions on the local centroids' distribution. The $\alpha_{v',v}$ are non-negative coefficients proportional to the diversity between the collaborators and the difference of local quality, and $L^{v,v'}$ is the Optimal Transport plan between the centroids of the $v$-th and $v'$-th collaborators. Algorithm 2 details the computation steps of the proposed approach: it shows how the algorithm learns to select the best collaborator to learn from at each iteration, based on Sinkhorn comparisons between the distributions, and how it alternates between influencing the local centroids, based on the confidence coefficient relative to the chosen collaborator and its local centroids' distribution, and updating the centroids relative to the local instances in order to improve the clusters' quality.
It should be pointed out that in each iteration, each collaborator chooses successively the collaborators to exchange information with, based on the Sinkhorn matrix distance. More accurately, in each iteration, each model exchanges information with the collaborator having the median similarity between the two modelled distributions, computed with the Wasserstein metric. If this exchange increases the quality of the model (here we use the Davies-Bouldin index [36]), the centroids of the model are updated. Otherwise, the selected collaborator is removed from the list of possible collaborators and the process is repeated with the remaining collaborators, until the quality of the clusters stops increasing.
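A sketch of one such collaboration round for a collaborator $v$ is given below. The median-similarity selection and the Davies-Bouldin acceptance test follow the description above; the exact centroid update (a barycentric pull of the local centroids toward the transport-weighted distant centroids, with strength `alpha`) is our interpretation, and all names are hypothetical. It reuses the `sinkhorn` routine sketched earlier.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def collaboration_round(X_v, M_v, labels_v, distant_centroids, alpha=0.5, lam=10.0):
    """One global-step round for collaborator v (hypothetical sketch)."""
    def plan_and_cost(Ma, Mb):
        # Sinkhorn plan and cost between two centroid sets (uniform masses)
        C = ((Ma[:, None, :] - Mb[None, :, :]) ** 2).sum(-1)
        a = np.full(len(Ma), 1 / len(Ma))
        b = np.full(len(Mb), 1 / len(Mb))
        gamma = sinkhorn(a, b, C, lam=lam)
        return gamma, np.sum(gamma * C)

    # Wasserstein distances to every distant collaborator's centroid distribution
    dists = [plan_and_cost(M_v, M_d)[1] for M_d in distant_centroids]
    candidates = list(np.argsort(dists))            # sorted by similarity
    quality = davies_bouldin_score(X_v, labels_v)   # lower is better

    while candidates:
        j = candidates.pop(len(candidates) // 2)    # median-similarity collaborator
        gamma, _ = plan_and_cost(M_v, distant_centroids[j])
        # Pull local centroids toward transport-weighted distant centroids
        M_new = (1 - alpha) * M_v + alpha * len(M_v) * (gamma @ distant_centroids[j])
        new_labels = np.argmin(
            ((X_v[:, None, :] - M_new[None, :, :]) ** 2).sum(-1), axis=1)
        if davies_bouldin_score(X_v, new_labels) < quality:
            return M_new, new_labels                # accept the beneficial exchange
    return M_v, labels_v                            # no beneficial collaborator found
```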
It must be highlighted that the proposed algorithm can be adapted to both horizontal and vertical collaboration, since it only requires as inputs the distributions that represent the local structure of each collaborator: these can either share the same space with different samples (vertical collaboration) or describe the same samples in different representation spaces (horizontal collaboration). Another important advantage of the proposed algorithm is its compatibility with all prototype-based algorithms: instead of using Sinkhorn-Means as the local algorithm to get the centroids, we can use other prototype models like k-means, SOM, EM, etc. The proposed algorithm therefore has the capability to work with hybrid models. This will be detailed in Section 5.2.3.

Algorithm 2: Collaborative clustering based on Optimal Transport

Input: the local data distributions µ^v, the numbers of clusters, λ the entropic constant
for v = 1, ..., r do
    Update the centroids M^v and the partition matrix (L^v)* using a local algorithm (e.g. Sinkhorn-Means, SOM, GTM, k-Means...)
    Update the centroids' distribution ν^v
while the local qualities improve do
    for v = 1, ..., r do
        Compute the Sinkhorn matrix distance between the centroids of collaborators v and v', for each v' ≠ v
        Choose the median collaborator
        Update the local centroids based on the collaborator's information, if the internal quality is increased (see below)

Experimental validation

Data-sets
We consider the following data-sets provided by the UCI Machine Learning Repository [37], described in Table 1.
Each data-set is split between several collaborators.
Table 1. Description of the data-sets (instances, features, clusters):
Glass: 214 instances, 10 features, 7 clusters
Spambase: 4601 instances, 57 features, 6 clusters
Waveform-noise: 5000 instances, 40 features, 3 clusters
Wdbc: 569 instances, 33 features, 2 clusters
Wine: 178 instances, 13 features, 3 clusters

-Glass: represents the oxide content of glass samples, used to determine the type of glass. The study of the classification of types of glass was motivated by criminological investigation: the glass left at the scene of the crime can be used as evidence... if it is correctly identified!
-Spambase: consists of 57 attributes giving information about the frequency of usage of some words, the frequency of capital letters and other insights to detect whether an e-mail is spam or not.
-Waveform: describes 3 types of waves. Each class is generated from a combination of 2 of 3 "base" waves, and each instance is generated with added noise (mean 0, variance 1) in each attribute.
-Wine: the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Data-set splitting
In order to test the proposed algorithm experimentally, we first proceeded with a data pre-processing step in order to create the local subsets.
For vertical collaboration, we aim to create samples from the original data, i.e. different instances represented with the same characteristics. We split the data horizontally into 10 random subsets $X^v$, each represented by a distribution $\mu^v$; we train Algorithm 1 to get the local centroid distributions $\nu^v$, and then apply the collaborative Algorithm 2 between the subsets in order to increase their local quality. To do so, we split the data as shown in Figure 2, where the data base is separated into v samples that share the same features. For horizontal collaboration, the main idea is to split each chosen data set into 10 subsets (see Figure 3) that share the same instances but are represented with different features in each subset, selected randomly with replacement. Considering the notation above, each subset $X^v$ is represented by the distribution $\mu^v$, which is the input of Algorithm 1 to get the distribution of the local centroids $\nu^v$. Algorithm 2 is then applied to influence the location of the local centroids by the centroids of the distant learners, without having access to their local data.
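A sketch of these two splitting schemes, under our own choices of random seed and default feature-subset size (neither is specified in the paper), could look as follows:

```python
import numpy as np

def vertical_split(X, r=10, seed=0):
    """Vertical collaboration: r subsets of different instances, same features."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))                    # shuffle the instances
    return [X[chunk] for chunk in np.array_split(idx, r)]

def horizontal_split(X, r=10, n_feats=None, seed=0):
    """Horizontal collaboration: same instances, random feature subsets
    (features drawn with replacement, as described above)."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    n_feats = n_feats or d // 2                      # assumed default subset size
    return [X[:, rng.choice(d, n_feats, replace=True)] for _ in range(r)]
```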

Quality measures
The proposed approach was evaluated with two internal quality indexes, the Davies-Bouldin (DB) and Silhouette indexes, as well as an external criterion, the Adjusted Rand Index (ARI). The Davies-Bouldin index [36] is defined as follows:
$$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\Delta_n(C_k) + \Delta_n(C_{k'})}{\Delta(c_k, c_{k'})}, \qquad (13)$$
where $K$ is the number of clusters, $\Delta(c_k, c_{k'})$ is the distance between the cluster centers $c_k$ and $c_{k'}$, and $\Delta_n(C_k)$ is the average distance of all elements of the cluster $C_k$ to their cluster center $c_k$. This index evaluates the quality of an unsupervised clustering based on the compactness of the clusters and a separation measure between the clusters: it relies on the ratio of the within-cluster scatter to the between-cluster separation. The lower the value of the DB index, the better the quality of the clustering. The Silhouette index [38] is based on the difference between $a_i$, the average distance between the instance $x_i$ and the instances belonging to the same cluster, and $b_i$, the average distance between the instance $x_i$ and the instances belonging to the other clusters:
$$s(x_i) = \frac{b_i - a_i}{\max(a_i, b_i)}. \qquad (14)$$
The closer the Silhouette value is to 1, the better the instances are assigned to the right clusters.
Moreover, since the data-sets used in the experiments provide labels, we chose to add an external quality index, the Adjusted Rand Index (ARI) [39].
The Adjusted Rand Index [2] is defined as follows:
$$ARI = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}, \qquad (15)$$
where $n_{ij} = |C_i \cap Y_j|$, with $C_i$ the $i$-th cluster and $Y_j$ the $j$-th real class provided by the real labels of the data-sets, and where $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$ are the marginal counts over clusters and classes respectively. The ARI index measures the agreement between two partitions: one provided by the proposed algorithm and the second provided by the labeled data-sets. The values of ARI are between 0 and 1, and the quality is better when the value of ARI is close to 1.
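All three indexes are available off the shelf in scikit-learn, which we use here for a minimal, self-contained illustration on the Wine data-set (the clustering itself is a plain k-means, not the proposed algorithm):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             silhouette_score)

X, y = load_wine(return_X_y=True)            # one of the paper's data-sets
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(davies_bouldin_score(X, labels))       # internal index (13), lower is better
print(silhouette_score(X, labels))           # internal index (14), closer to 1 is better
print(adjusted_rand_score(y, labels))        # external index (15), closer to 1 is better
```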
We therefore applied Algorithm 1 on the local data; then the coefficient matrix α is computed based on a diversity index between the collaborators [21]. This coefficient is used to control the importance of the terms of the collaboration. Algorithm 2 is trained 20 times in order to estimate the mean quality of the collaboration and a 95% confidence interval over the 20 experiments. The experimental results of horizontal collaboration were compared with collaborative SOM [16]. Both approaches were trained on the same subsets and with the same local model, a 3 × 5 map, with the parameters suggested by the authors of the algorithm [16]. The last part of the experiments consists in comparing the proposed algorithm with the collaborative algorithms proposed in the state of the art, where the algorithms are trained on only two collaborators; we followed the same split as mentioned in [16] to compare the quality gain brought by the collaboration, based on the Davies-Bouldin (DB) index.

Computation tools
A nice feature of the Wasserstein distance is that its computation is vectorized, which means that the computation of n distances, whether from one histogram to many, or many to many, can be carried out simultaneously using elementary linear algebra operations. To do so, we use a PyTorch version of Sinkhorn-Means, running on GPGPUs. Moreover, the collaborators were parallelized in order to compute the local algorithms at the same time. For the experimental results, we used an Alienware Area-51m with a GeForce RTX 2080/PCIe/SSE2 NVIDIA graphics card.
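As an illustration of this vectorized computation, the following PyTorch sketch evaluates a whole batch of regularized Wasserstein distances in a single set of tensor operations; the shapes, λ and the iteration count are illustrative choices, not the paper's settings.

```python
import torch

def batched_sinkhorn(mu_s, mu_t, C, lam=10.0, n_iter=200):
    """Sinkhorn distances for a batch of histogram pairs at once.

    mu_s: (B, n), mu_t: (B, m), C: (B, n, m). Runs on GPU if the tensors do.
    """
    K = torch.exp(-lam * C)                      # (B, n, m) Gibbs kernels
    u = torch.ones_like(mu_s)
    for _ in range(n_iter):
        v = mu_t / torch.einsum('bnm,bn->bm', K, u)  # batched K^T u
        u = mu_s / torch.einsum('bnm,bm->bn', K, v)  # batched K v
    gamma = u.unsqueeze(2) * K * v.unsqueeze(1)  # (B, n, m) transport plans
    return torch.einsum('bnm,bnm->b', gamma, C)  # one <gamma, C>_F per pair
```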

Results and discussion
In this section we evaluate the approach on several data-sets for both vertical and horizontal collaboration, based on different quality indexes, both internal and external. We also compare the proposed algorithm with state-of-the-art approaches of collaborative clustering based on prototype exchanges: Self-Organizing Maps collaboration (Co-SOM) and Generative Topographic Maps collaboration (Co-GTM).

Vertical Collaboration case
To evaluate the proposed approach in the vertical collaboration case, we ran Algorithm 2 on several sub-sets that share the same features but have different sizes and complexities.
As one can see, the proposed approach shows in general an acceptable capacity at improving the DB index of the clustering between before and after a vertical collaboration (Table 3). This is not surprising, considering that the proposed algorithm evaluates the quality gain based on this index: the DB index is computed at each iteration in order to learn whether or not the collaborator can benefit from a given collaboration. To check the validity of the algorithm, we also used the Silhouette internal index. As shown in Table 3, the value of the Silhouette index increases after collaboration, which confirms that the proposed approach increases the quality of each collaborator. However, the quality gain resulting from the collaboration is not always very high for some data-sets. This is due to the structure of the database and its horizontal splitting: if the data is very sparse (notably Spambase), the collaboration increases the quality more than for non-sparse data (for example the Waveform data-set). Table 3 shows the results achieved on this index, highlights the performance of our algorithm and confirms that the quality of each collaborator increases after the collaboration.
As one can see, the results are generally positive, but the difference between the values before and after collaboration, either for the internal indexes (Silhouette and DB) or for the external index (ARI), is not very impressive. This is explained by the horizontal splitting, which gives small subsets that have practically the same structure; the collaboration can thus be seen as a bidirectional exchange of information between subsets of the same given database.
As we will see later on, this is not the case in horizontal collaboration, in which the impact of the collaboration is more important since the data are represented with different features for each collaborator. In addition, we chose one data set (due to page limitation) to detail the effect of the proposed algorithm on each collaborator. Table 2 shows the values of the different quality indexes for each collaborator built from the Spambase data set, and confirms that the process does increase the quality of most collaborators.
Sensitivity Box-Whiskers plots (Figure 4) are drawn for the 20 experiment scores for each dataset, before and after the collaboration process. They enable us to study the distributional characteristics of the scores as well as their level. To begin with, the scores are sorted over the 20 tests. Then four equal-sized groups are made from the ordered scores, i.e. 25% of all scores are placed in each group. The lines dividing the groups are called quartiles, and the groups are referred to as quartile groups; usually, we label these groups 1 to 4, starting at the bottom. The median (middle quartile) marks the mid-point of the scores and is shown by the line that divides the box into two parts: half the scores are greater than or equal to this value and half are less. The middle "box" represents the middle 50% of the scores for the group. The range of scores from the lower to the upper quartile is referred to as the inter-quartile range; the middle 50% of the scores fall within it. As can be seen from these graphs, the overall performance behavior shows a clear improvement as a result of the collaboration process. For example, for the DB index, we can see a decrease of the index values for all databases due to the contribution of the collaboration. For the other two quality indexes, we rather observe an increase of the values, showing an improvement of the quality of the solutions found.

Horizontal collaboration case
In this section we validate the effectiveness of the proposed approach on different data-sets for horizontal collaboration, where each collaborator represents the instances with different features (in a different representation space), see Figure 3.
We show how the exchange of information between the collaborators can improve the local results of each collaborator. Moreover, we show that the quality gain is more important compared to classical collaboration (SOM and GTM collaboration). Besides the Davies-Bouldin index (13), which is used during training, we validated the proposed approach with the Silhouette index (14) and the Adjusted Rand Index (15). Table 5 shows that the collaboration step in the proposed approach increases the local quality of the models with regard to the internal indexes DB and Silhouette, in a horizontal collaboration framework, for the different datasets. Similarly, the ARI values show that the clusters computed by the models are closer to the expected output after the collaborations (Table 5). One can notice that horizontal collaboration, between models that do not share the same representation space, is much more beneficial compared to vertical collaboration, where the models are computed in the same space. This is due to the fact that in the vertical framework, the random splitting of the data-sets produces sub-sets of different instances represented in the same space (i.e., same features) with similar distributions, due to the random process of the split. Therefore, each local model should be quite similar to the others and little exploitable information is exchanged in the collaborative step. This is confirmed by the comparison between the index values of the Spambase data set in vertical collaboration (Table 2) and in horizontal collaboration (Table 4), where the difference between the index scores is much more important for each collaborator compared to vertical collaboration.
Sensitivity Box-Whiskers plots (Figure 5) represent a synthesis of the scores into crucial pieces of information identifiable at a glance: position, dispersion, asymmetry and the length of the Whiskers. The position is characterized by the dividing line at the median (as well as the middle of the box). The dispersion is given by the length of the box (as well as the distance between the ends of the Whiskers). The asymmetry is given by the deviation of the median line from the center of the box relative to the length of the box (as well as by the length of the upper Whisker relative to the length of the lower Whisker, and by the number of scores on each side). The length of the Whiskers is read relative to the length of the box (together with the number of scores specifically marked). These graphs show the same overall performance behavior as observed in the case of vertical collaboration: a clear improvement as a result of the collaboration process, for all the quality indexes used.

Comparison with other collaborative approaches
In this section, the proposed collaborative Algorithm 2 is based on the Sinkhorn-Means local algorithm described in Algorithm 1 (this framework is thereafter called Co-Sin-OT), and we illustrate the adaptability of the proposed collaborative approach by alternatively using Self-Organizing Maps (SOM) as local algorithms (Co-SOM-OT). Both are compared to popular state-of-the-art collaborative algorithms based on Self-Organizing Maps (Co-SOM) [15] and Generative Topographic Maps (Co-GTM) [16]. We focus here on the horizontal collaboration case, as in [15] and [16]. Indeed, horizontal collaboration is usually more useful and applicable to real problems compared to vertical collaboration; it is also more difficult. In the first part of the experiments, we test the quality of the collaboration for 10 collaborators. As the Co-GTM algorithm is designed for only two collaborators, it is not included in these comparisons. In the second part, only two collaborators are trained and the Co-GTM algorithm is included in the protocol.
The first set of experiments is thus restricted to Co-SOM, Co-SOM-OT and Co-Sin-OT, in order to be able to work with several collaborators. All collaborative approaches are applied on the same subsets. In SOM-based approaches, each local collaborator starts with the same 5 × 3 SOM. The approaches are compared using the Silhouette index. As shown in Tables 6 to 10, the results obtained with the proposed approach are globally better for this index. One can note that, for some collaborators, the collaboration leads to very similar results in both cases, despite very different qualities before collaboration. The OT-based approach (Co-SOM-OT) provides a much more stable quality improvement over the set of collaborators. In addition, the use of Sinkhorn-Means as the local algorithm (Co-Sin-OT) provides the best results compared to a SOM-based local clustering (Co-SOM and Co-SOM-OT). This can be explained by the fact that the mechanism of the SOM-based collaborative algorithms is constrained by the neighborhood functions. Moreover, it was built for a collaboration between two collaborators, then extended to allow multiple collaborations, unlike the proposed approach, where each learner exchanges information with all of the others at each step of the collaboration.
In the second set of experiments, we compare the proposed approach to classical collaborative algorithms based on Self-Organizing Maps (Co-SOM) [15] and Generative Topographic Maps (Co-GTM) [16]. The three approaches are compared using the DB index, as in [15,16]. As shown in Table 11, the results obtained with the proposed approach (Co-Sin-OT) are generally better than those of the classical approaches. The lowest qualities are obtained by the older approach, SOM-based collaborative clustering (Co-SOM), followed by the GTM-based approach (Co-GTM). Unlike Co-SOM and Co-GTM, the proposed approach aims to find a local optimum for each collaborator. More precisely, at the end of the local training, each collaborator exchanges information based on a stopping criterion that ends the collaboration with a given collaborator as soon as the quality of the collaboration starts decreasing, which is not the case in the other approaches. Furthermore, Table 11 compares the quality gain brought by the collaboration for each approach. The proposed approach increases the quality of each collaborator on all of the data-sets, which implies a positive quality gain. On the contrary, in SOM-based collaboration the gain can be negative for some datasets. Finally, in order to evaluate the general performance of the approaches, we define the following score:
$$Score(M_i) = \sum_{j} G(M_i, D_j),$$
where $G(M_i, D_j)$ indicates the quality gain of approach $M_i$ on data-set $D_j$. This score gives an overall vision of the best approach over all the data-sets. As shown in Table 11, the best score belongs to the proposed collaboration based on Optimal Transport theory, followed by the GTM-based collaborative approach (Co-GTM) and the SOM-based collaboration (Co-SOM). These results highlight the performance of the proposed algorithm, due to the strong theoretical background of Optimal Transport theory.
In order to assess the performance of our approaches, we use the Friedman test and the Nemenyi test recommended in [40]. The Friedman test is conducted to test the null hypothesis that all approaches are equivalent with respect to accuracy. If the null hypothesis is rejected, the Nemenyi test is performed. If the average ranks of two approaches differ by at least the critical difference (CD), then it can be concluded that their performances are significantly different. In the Friedman test, we set the significance level to α = 0.05. Figure 6 shows a critical diagram representing a projection of the approaches' average ranks on an enumerated axis. The approaches are ordered from left (the best) to right (the worst), and a thick line connects the approaches whose average ranks are not significantly different (at the 5% significance level). As shown in Figure 6, Co-Sin-OT achieves a significant improvement over the other techniques (Co-GTM and Co-SOM), since during the collaboration phase it is stable and the process stops the collaboration for some learners when their local quality starts to decrease, which prevents a common issue of collaborative approaches.
Compared to the most cited approaches of the state of the art, the positive impact of using collaborative learning based on this theory is the following:
-The proposed algorithm is based on a strong, well-established theory that is becoming increasingly popular in the field of machine learning. Its strength is highlighted by an experimental validation on both artificial and real data-sets.
-The stopping criterion that we propose is based on the measure of the quality gain brought by each collaboration; it guarantees the convergence once the quality gain tends towards zero.
-The choice of the distant collaborator is very important and gives an optimal order of the collaboration. In the proposed algorithm, we solved this problem based on the Optimal Transport plan, which compares the distributions of the centroids of each site. In this way, each collaborator is able to choose the best one.
-The proposed algorithm stops negative collaborations, based on the measure of the quality gain of each collaboration: it updates the centroids if the quality gain is positive; otherwise, it moves to the next distant collaborator.
Finally, the proposed approach ensures the adaptability of working with different local models. This leads us to introduce some managerial applications of our work, such as learning management systems, where collaborative learning could offer an interaction between learners to make them work cooperatively rather than competitively, help to create sub-networks of collaboration where the diversity is decreased, and manage conflicts by using one-to-one collaboration. Besides, the exchange of information using the proposed algorithm preserves the privacy of each collaborator, ensures the control of the information shared with each collaborator, and filters the received information to avoid affecting the real structure of the local data. Thus, all the collaborators can explore the distributed data, which may contain some mutual information, while keeping control over the received and transmitted information.
However, the proposed algorithm still suffers from some limitations, in particular the requirement of the same dimension in every site, and also the curse of high dimensionality that Optimal Transport still suffers from, which leads us to increase the penalty coefficient of the regularization in order to avoid over-fitting.

Conclusion
In this paper, we proposed a new framework of collaborative learning inspired by Optimal Transport theory, where the collaborators aim to increase their local quality based on the information exchanged with other learners. We explained the motivation and the intuition behind our approach and we proposed a new algorithm of collaborative clustering based on the Wasserstein distance. The proposed approach allows to exchange information between collaborators in both vertical and horizontal collaboration. The results are stable, and the process stops the collaboration for some learners when their local quality starts to decrease, which prevents a common issue of collaborative approaches.
The approach proposed in this paper is a first step into a new family of algorithms for the collaborative learning task. We plan to develop further collaborative clustering algorithms based on the Gromov-Wasserstein distance, which allows the comparison of distributions coming from heterogeneous spaces, in order to make collaborative algorithms more flexible and to improve the quality and the stability of the collaborations.
There are several perspectives to this work. In the short term, we are working to improve the approach in order to learn the confidence coefficients at each iteration, according to the diversity and the quality of the collaborators. This could be based on comparisons between the sub-sets' distributions using the Wasserstein distance. This would lead us to another extension, where the interactions between collaborators could be modeled as a graph in a Wasserstein space, which would allow the construction of a theoretical proof of convergence.

Conflict of interest:
Authors state no conflict of interest.