An Improved Robust Fuzzy Algorithm for Unsupervised Learning

Abstract This paper presents a robust, dynamic, and unsupervised fuzzy learning algorithm (RDUFL) that aims to cluster a set of data samples with the ability to detect outliers and to determine the number of clusters automatically. It consists of three main stages. The first (1) stage is a pre-processing method in which possible outliers are determined and quarantined using a concept of proximity degree. The second (2) stage is a learning method, which consists in auto-detecting the number of classes with their prototypes using a dynamic threshold. This threshold is automatically determined based on the similarity among the detected prototypes, which are updated at the exploration of each new data sample. The last (3) stage treats the quarantined samples detected in the first stage to determine whether they belong to some class defined in the second phase. The effectiveness of this method is assessed on eight real medical benchmark datasets in comparison to known unsupervised learning methods, namely, the fuzzy c-means (FCM), possibilistic c-means (PCM), and noise clustering (NC). The obtained accuracy of our scheme is very promising for unsupervised learning problems.


Introduction
Clustering is one of the most relevant data-mining tasks [42]. It is the process of organizing objects into a set of classes. The classes provided by the classical methods are considered hard, and each object is assigned to a single and unique class. This assumes that the boundaries between classes are well defined, whereas, in fact, class boundaries are often fuzzy and uncertain. This uncertainty is shown by the fact that an object possesses features that make it more likely to belong to more than a single class. Thus, in the fuzzy classification, an object does not belong exclusively to a single class but possesses a degree of membership to all existing classes [36]. The degree of membership is in the interval [0, 1], and the obtained classes are not necessarily disjoint. Clustering becomes very difficult in unsupervised contexts where no prior information on the experimental objects is provided. This difficulty increases when these objects contain outliers [39]. An outlier refers to a value that appears to be suspicious because it is significantly inconsistent with the rest of that set of data. According to Han and Kamber [27], outliers are the set of objects that are considerably dissimilar from the remainder of the data.
In clustering, giving an outlier the same importance that is given to other objects destabilizes the analysis and distorts the results [3]. Hence, outlier detection is important [44,47]. However, these outliers are not necessarily erroneous and may contain a meaningful indication [45], such as the case during fraud detection [9] or computer network intrusion [35]. Therefore, outliers ought not to be systematically rejected [5].
Handling clustering and outlier detection at the same time is a highly desirable task [33,38,40]. Many strong methods that have emerged in this direction take all the data into account but minimize the influence of outliers [20,21]. The best-known algorithm is noise clustering (NC), also called the robust fuzzy c-means (robust-FCM) [19]. In this algorithm, the notion of a noise class is introduced. The class of these outliers is characterized by a fictitious prototype that has a constant distance δ to all objects. Hence, it is important to determine the distance δ, which is a critical parameter of the algorithm [17].
In this paper, we propose a robust approach, which allows clustering data by auto-detecting the classes they form and providing the existing outliers without requiring any parameter. The proposed approach consists of three stages:
-A pre-processing stage using similarity to detect objects likely to be outliers, which will be considered as possible outliers. These objects are quarantined and excluded from the second stage.
-A second stage in which classes are determined based on a dynamic threshold. This threshold is based on the minimum similarity among the detected prototypes, which are updated at the exploration of each new object. This minimum similarity is considered as the condition for adding a new cluster.
-A final stage, which processes the possible outliers in order to determine whether they belong to one of the classes detected in the second phase. To this end, each possible outlier is compared to its neighbors to confirm whether it is really an outlier.
A more formal description of this method is presented in Section 3 following a brief related work in Section 2.
The results obtained from experiments on real and artificial data are presented in Section 4. Section 5 presents the main conclusions of this paper.

Related Work
Clustering is commonly used in real-world problems encountered in a variety of applications [13-15, 51, 54, 58]. It is an exploratory data analysis tool, which aims to find structure in a dataset according to the measured characteristics or similarities [12,48,52]. It consists in grouping a set of n data points into homogeneous groups, called clusters, without any prior information on the structure or the nature of the clusters. Clustering can be classified as hard or fuzzy. Hard clustering assigns each data point to a unique cluster with a degree of membership equal to one. Conversely, fuzzy clustering assigns each data point to every cluster with different membership degrees.
In mathematical terms, partitioning a learning base X = {x_1, x_2, . . . , x_n} ⊂ ℜ^p into c clusters can be represented by a (c × n) partition matrix U = [u_ik], which satisfies the following conditions:

u_ik ∈ {0, 1}, (1)

Σ_{k=1}^{c} u_ik = 1, for every object i, (2)

0 < Σ_{i=1}^{n} u_ik < n, for every cluster k. (3)

The space of hard partitions is, thus, defined by Bezdek [6] as the set of all (c × n) matrices U satisfying these conditions. Hard clustering assumes that clusters are disjoint and their boundaries well defined. However, the boundaries between clusters are not always definite in real-world datasets. Fuzzy clustering was proposed to deal with overlapping clusters.
Partitioning X into c fuzzy clusters can be defined by c fuzzy sets E_1, . . . , E_c and a membership function [57] assuming values in the interval [0, 1], where u_ik is interpreted as the membership degree to which the object i belongs to the k-th cluster (1 ≤ k ≤ c and 1 ≤ i ≤ n) [6,19]. Therefore, a (c × n) fuzzy membership matrix U = [u_ik] can be used to represent the fuzzy partition of X. The k-th row of this matrix contains the values of the k-th membership function µ_k of the subset E_k. The elements u_ik satisfy the following conditions:

u_ik ∈ [0, 1], (4)

Σ_{k=1}^{c} u_ik = 1, for every object i, (5)

0 < Σ_{i=1}^{n} u_ik < n, for every cluster k. (6)

Thereby, fuzzy clustering is considered as a generalization of hard clustering that can be used to describe imprecise or fuzzy information [30,50,53]. The most widely used clustering algorithm is FCM [15], which is highlighted in what follows.

The FCM Algorithm
FCM generalizes the hard c-means (k-means) algorithm to allow a point to partially belong to all existing clusters [6,7]. FCM is an iterative process, which optimizes an objective function J_m defined by:

J_m(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m d²(x_k, v_i) (8)

where: u_ik is the degree to which the element x_k belongs to the i-th class (1 ≤ i ≤ c, 1 ≤ k ≤ n).
m (1 < m < ∞) is a weighting exponent used to monitor the relative contribution of each object x_k and the fuzziness degree of the final partition. -V = (v_1, v_2, . . . , v_c) represents a c-tuple of prototypes, in which each prototype characterizes a class.
d(x_k, v_i) is the distance between the i-th prototype v_i and the k-th object x_k.
Bezdek demonstrated that FCM converges to an approximate solution when the memberships and the prototypes satisfy the two following conditions [10]:

u_ik = 1 / Σ_{j=1}^{c} [ d(x_k, v_i) / d(x_k, v_j) ]^{2/(m−1)} (9)

v_i = Σ_{k=1}^{n} (u_ik)^m x_k / Σ_{k=1}^{n} (u_ik)^m (10)

The pseudo-code of the FCM algorithm is given in Figure 1.
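As an illustration, the alternating update of memberships and prototypes can be sketched in a short NumPy implementation. The function name, the random initialization, and the stopping rule below are our own choices for the sketch, not prescriptions from the paper:

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal FCM sketch: alternate the membership and prototype
    updates until the partition matrix stabilizes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        # prototype update: fuzzy-weighted means of the data
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # distances d(x_k, v_i), small epsilon guards division by zero
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        # membership update: u_ik proportional to d_ik^(-2/(m-1))
        w = d ** (-2.0 / (m - 1))
        U_new = w / w.sum(axis=0)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

On two well-separated blobs, the hardened partition (argmax of each membership column) recovers the two groups.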
The idea of clustering data is natural. Indeed, we tend to group a large number of data points into a small number of groups in order to facilitate further analysis. The search for these groups is not a simple task when the data are affected by outliers. Generally, outliers lie far from all the other items and have few or no neighbors. Outliers may significantly affect the estimation of the centers of the detected clusters, which is the case for the FCM algorithm. Two methods were proposed to handle this problem: the possibilistic c-means algorithm (PCM) [34] and the robust-FCM [19]. These are summarized in what follows.

The PCM Algorithm
The PCM introduces a possibilistic type of membership function to describe the degree of belonging [56] and relaxes the objective function J_m [Eq. (8)] by dropping the constraint that memberships sum to 1 [Eq. (2)]. Hence, the membership degrees become independent [41].
The PCM optimizes the objective J_m defined as follows:

J_m(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m d²(x_k, v_i) + Σ_{i=1}^{c} η_i Σ_{k=1}^{n} (1 − u_ik)^m (11)

and u_ik is defined by:

u_ik = 1 / ( 1 + [ d²(x_k, v_i) / η_i ]^{1/(m−1)} ) (12)

The parameter η_i is a positive weight defined for modulating the opposing effects of the two terms in J_m. It is set by the user and chosen for each class. Unfortunately, a suitable value is not always available.
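The possibilistic membership update above can be illustrated with a one-line NumPy sketch. The function name and argument layout are our own; the formula is the standard PCM update u_ik = 1 / (1 + (d²_ik/η_i)^{1/(m−1)}):

```python
import numpy as np

def pcm_memberships(d2, eta, m=2.0):
    """Possibilistic membership update.
    d2:  (c, n) array of squared distances d^2(x_k, v_i)
    eta: (c,) array of per-class weights eta_i (user-supplied)
    Memberships decay with distance independently per class:
    no column-sum-to-1 constraint, unlike FCM."""
    return 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1)))
```

For m = 2, a point at squared distance η_i from the prototype gets membership exactly 0.5, which is the usual reading of η_i as a class "bandwidth".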

The Robust-FCM Algorithm
Dave [19] proposed a new method known as the «robust-FCM». It consists in introducing an additional noise class that contains all the outliers. The fictitious prototype of the noise cluster is set such that it is always at the same distance from all the considered data points. This distance is called the noise distance δ. Thus, an object is not an outlier if its distance from one of the prototypes is smaller than δ.
The presence of the noise cluster modifies the objective function defined by Eq. (8) as follows:

J = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m d²(x_k, v_i) + Σ_{k=1}^{n} δ² (u_*k)^m (14)

where u_*k is the membership degree of the object x_k to the noise cluster, and δ is the noise distance, defined, respectively, by:

u_*k = 1 − Σ_{i=1}^{c} u_ik (15)

δ² = λ [ Σ_{i=1}^{c} Σ_{k=1}^{n} d²(x_k, v_i) / (n · c) ] (16)

By minimizing the objective function defined by Eq. (14), we obtain:

u_ik = 1 / ( Σ_{j=1}^{c} [ d(x_k, v_i)/d(x_k, v_j) ]^{2/(m−1)} + [ d(x_k, v_i)/δ ]^{2/(m−1)} ) (17)

The variable λ is a multiplying factor for obtaining δ. Thus, the choice of δ depends on λ. Dave proposed a heuristic to select this parameter. However, this choice does not always give satisfactory results.
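The effect of the fictitious noise prototype can be sketched directly from the membership formula: the noise class acts as one extra term at constant distance δ in the denominator, so distant points drain their membership into u_*k. The function below is our own illustration under that reading:

```python
import numpy as np

def nc_memberships(d, delta, m=2.0):
    """Noise-clustering membership sketch.
    d: (c, n) distances to the c real prototypes; delta: noise distance.
    Returns memberships U to the real clusters and u_noise = 1 - sum_i u_ik."""
    p = 2.0 / (m - 1)
    inv = d ** (-p)                            # d_ik^(-2/(m-1)), shape (c, n)
    denom = inv.sum(axis=0) + delta ** (-p)    # real clusters + noise prototype
    U = inv / denom                            # memberships to real clusters
    u_noise = (delta ** (-p)) / denom          # membership to the noise class
    return U, u_noise
```

A point close to a real prototype gets almost no noise membership, while a point much farther than δ from every prototype is absorbed almost entirely by the noise class.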

Proposed Approach (RDUFL)
As mentioned earlier, simultaneously clustering data and detecting outliers is a highly desirable task. The intuitive approach consists in applying a clustering algorithm and considering objects that are distant from their nearest prototype as outliers. However, the clustering algorithm may, itself, be extremely sensitive to the outliers, which may have a disproportionate impact on the prototypes [26]. Hence, detecting outliers is important in clustering tasks.
RDUFL clusters the considered data and detects eventual outliers. It consists of the following three phases:

Detection of Possible Outliers
The first phase of our approach consists in detecting objects that are likely to be outliers, which we will refer to as «possible» outliers. It originates from the fact that a normal object has many neighbors with which it shares similar characteristics [11,27]. It is based on the proximity degree of an object in relation to the other objects. This concept consists in calculating the sum of the similarities of an object to all other objects [22], and not just to its neighbors [1]:

D(x_i) = Σ_{k=1}^{n} Sim(x_i, x_k) (18)

where Sim(x_i, x_k) is the similarity between the two objects x_i and x_k [25]:

Sim(x_i, x_k) = 1 − sqrt( (x_i − x_k)^T A (x_i − x_k) ) (19)

p is the dimension of the object space, and A is the p × p diagonal matrix defined by Bouroumi [10]:

A = diag( 1/(p r_1²), . . . , 1/(p r_p²) ) (20)

The factor r_j stands for the difference between the maximum and minimum values of the j-th attribute. It is defined by:

r_j = max_{1 ≤ i ≤ n} x_ij − min_{1 ≤ i ≤ n} x_ij (21)

The choice of this similarity measure is motivated by the following properties [10,25,26]: Sim(x_i, x_k) ∈ [0, 1], Sim(x_i, x_i) = 1, and Sim(x_i, x_k) = 0 if and only if |x_ij − x_kj| = r_j for every attribute j, which means that the two objects present a maximum difference on each of their p components. Therefore, an object has a high proximity degree when it has many close neighbors, and an object with a low value of D is more likely to be an outlier. It is considered as a «possible» outlier.
This phase does not require any notion of clusters or an expected number of outliers [29]. It determines the top objects with the smallest proximity degree; we denote their number by M. Once these objects are detected, they are quarantined, and we proceed to the learning of the set X without taking these M possible outliers into account.
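The detection phase above can be sketched in a few lines of NumPy. The range-normalized similarity below follows the form described above; the function names and the zero-range guard are our own additions:

```python
import numpy as np

def proximity_degrees(X):
    """Proximity degree sketch: D(x_i) = sum_k Sim(x_i, x_k), with
    Sim(x_i, x_k) = 1 - sqrt((1/p) * sum_j ((x_ij - x_kj) / r_j)^2),
    where r_j is the range of attribute j (assumed reading of the
    range-normalized similarity)."""
    n, p = X.shape
    r = X.max(axis=0) - X.min(axis=0)
    r[r == 0] = 1.0                      # guard against constant attributes
    diff = (X[:, None, :] - X[None, :, :]) / r
    sim = 1.0 - np.sqrt((diff ** 2).sum(axis=2) / p)
    return sim.sum(axis=1)

def quarantine(X, M):
    """Indices of the M objects with the smallest proximity degree."""
    return np.argsort(proximity_degrees(X))[:M]
```

On a tight cluster plus one distant point, the distant point receives the smallest proximity degree and is the one quarantined.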

Learning Phase
Assuming that the object vectors forming the training base X belong to at least two distinct classes, and given an inter-point similarity measure, the learning algorithm of this phase starts by creating two classes around the first two objects x_1 and x_2 [4].
RDUFL then sequentially explores the remaining «n − 2 − M» objects of the training base X and analyzes their resemblances using the similarity measure given by Eq. (19). A dynamic threshold ξ is used to detect when a new object is dissimilar to all existing prototypes. This threshold represents the minimum similarity that each object must have with its nearest prototype [4,10,23]. When this threshold is not attained, a new class is created, and its prototype is initialized with the current object.
In this paper, ξ is dynamic and depends on the current object. It is automatically recalculated at each iteration as follows: if x_i is the current object and v_k its nearest prototype, ξ is defined as the minimum similarity between v_k and the other detected prototypes:

ξ = min_{j ≠ k} Sim(v_k, v_j) (22)

The algorithm utilizes the similarity measure and its associated threshold in order to build classes. At each iteration, the similarity of the current object to the existing prototypes is calculated. Following the maximum of this similarity, two decisions are conceivable [10]:

(a) A new class is created when:

max_{1 ≤ k ≤ c} Sim(x_i, v_k) < ξ (23)

This means that the current item x_i is not sufficiently similar to the prototypes of the previously detected classes. It is supposed to come from a class that has not been detected yet and must, therefore, seed a new class [10]. Thus, we put:

v_{c+1} = x_i and c = c + 1 (24)

(b) The prototypes are updated when:

max_{1 ≤ k ≤ c} Sim(x_i, v_k) ≥ ξ (25)

In this case, x_i is considered to have the minimal similarity required with the prototypes of the previously detected classes. Therefore, no new class is created.
The prototypes of the previously created classes are then updated according to the following learning rule:

v_k(i) = v_k(i − 1) + [ u_ik / n_i(k) ] ( x_i − v_k(i − 1) ) (26)

where v_k(i − 1) and v_k(i) are, respectively, the prototypes of the class k before and after the addition of x_i, and n_i(k) designates the fuzzy cardinal of the class k after the addition of x_i, defined by:

n_i(k) = Σ_{l=1}^{i} u_lk (27)

Excluding the possible outliers from this phase stabilizes the computation of the prototypes and prevents the automatic detection of the number of classes from being distorted.
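The learning phase can be sketched as a single sequential pass. This is our own illustration under two stated assumptions: the dynamic threshold is read as the minimum similarity between the nearest prototype and the other prototypes, and crisp counts stand in for the fuzzy cardinals n_i(k):

```python
import numpy as np

def rdufl_learn(X, sim):
    """Learning-phase sketch: prototypes start from the first two objects;
    each new object either joins (and shifts) its most similar prototype
    or seeds a new class when its best similarity falls below the dynamic
    threshold xi."""
    V = [X[0].astype(float), X[1].astype(float)]
    counts = [1.0, 1.0]                   # crisp stand-in for fuzzy cardinals
    for x in X[2:]:
        s = np.array([sim(x, v) for v in V])
        k = int(s.argmax())               # nearest prototype
        xi = min(sim(V[k], V[j]) for j in range(len(V)) if j != k)
        if s[k] < xi:                     # too dissimilar: create a new class
            V.append(x.astype(float))
            counts.append(1.0)
        else:                             # update the nearest prototype
            counts[k] += 1.0
            V[k] += (x - V[k]) / counts[k]
    return np.array(V)
```

With a simple decreasing similarity such as 1/(1 + distance), three well-separated groups yield three prototypes even though the algorithm is seeded with only two.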

Treatment of Possible Outliers
In this phase, we deal with the M possible outliers O_i (1 ≤ i ≤ M) that were detected and quarantined during the first phase. For each possible outlier O_i, we look for its nearest prototype v_k and its corresponding class, noted C_k. For this class C_k, the farthest element x_l is determined:

x_l = arg max_{x ∈ C_k} d(x, v_k)

If the point x_l is closer to the possible outlier O_i than to its nearest prototype v_k, then O_i shares neighbors with the neighbors of x_l and is not, therefore, an outlier. Hence, two cases are possible: either O_i is assigned to the class C_k, or it is confirmed as a true outlier. The pseudo-code of RDUFL is presented in Figure 2.
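The treatment rule above can be sketched as follows. The function name and argument layout are our own; the decision is exactly the comparison described in the text:

```python
import numpy as np

def treat_outliers(O, V, labels, Xc):
    """Treatment-phase sketch: a quarantined point O_i joins class C_k when
    the farthest member x_l of C_k lies closer to O_i than to its own
    prototype v_k; otherwise O_i is confirmed as an outlier.
    O: (M, p) possible outliers; V: (c, p) prototypes;
    labels: (n,) class index of each clustered point; Xc: (n, p) points."""
    assigned, confirmed = [], []
    for i, o in enumerate(O):
        k = int(np.linalg.norm(V - o, axis=1).argmin())   # nearest prototype
        members = Xc[labels == k]
        dk = np.linalg.norm(members - V[k], axis=1)
        xl = members[dk.argmax()]                         # farthest member of C_k
        if np.linalg.norm(xl - o) < np.linalg.norm(xl - V[k]):
            assigned.append((i, k))                       # not a true outlier
        else:
            confirmed.append(i)                           # confirmed outlier
    return assigned, confirmed
```

A point just beyond the border of a cluster is reassigned to it, while a point far beyond the farthest member stays confirmed as an outlier.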


Results and Discussion
To assess the performance of our approach, some experiments were conducted on an artificial dataset X_1 and on eight real-world databases available in the UCI repository [8]: Lymphography, Diabetes, Indian, Haberman's Survival, BCW, Post-operative Patient, Parkinsons, and EEG Eye State. A first comparison is based on the recognition rate defined by:

Recognition rate = 100 × (number of correctly identified objects) / (total number of objects) (28)

A second comparison is based on the accuracy:

Accuracy = 100 × (TP + TN) / (TP + FP + TN + FN) (29)

where:
-TP (true positive) is the number of objects correctly identified
-FP (false positive) is the number of objects incorrectly identified
-TN (true negative) is the number of objects correctly rejected
-FN (false negative) is the number of objects incorrectly rejected.
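Both evaluation measures are straightforward to compute; the following short sketch (function names are ours) mirrors Eqs. (28) and (29):

```python
def recognition_rate(y_true, y_pred):
    """Recognition rate, Eq. (28): percentage of correctly identified objects."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def accuracy(tp, fp, tn, fn):
    """Accuracy percentage, Eq. (29), from the confusion counts."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)
```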

Artificial Dataset
To illustrate the usefulness of the proposed algorithm, we consider the dataset X_1, which is a two-dimensional artificial example derived from Ref. [10]. It is divided into three classes with 58 points in the plane and seven outliers (Figure 3). This two-dimensional dataset is important due to the possibility it presents in terms of visualization.
For the dataset X_1, RDUFL detected seven possible outliers. The learning and treatment phases demonstrated that these are true outliers. The number of detected classes is three, and the recognition rate is 100% (Figure 4).
On the other hand, we applied the following algorithms on X_1: FCM, PCM, and robust-FCM. First, the FCM algorithm failed to detect the existing outliers for c = 3 (Figure 5A). It detected two outliers for c = 4 (Figure 5B) and three outliers for c = 5 (Figure 5C). Only for c = 6 (Figure 5D) did the FCM detect all seven outliers, by considering them as points that belong to two clusters with a weak cardinal.
As for the robust-FCM algorithm, which is intended to detect the outliers, it only detects two of them (Figure 6). In addition, the recognition rate obtained for the robust-FCM equals 92.31%, whereas our approach recognized all the objects.
For the tests we carried out, the value of λ ranged between 0.01 and 0.9, and the best result was obtained for λ = 0.7. Table 1 presents the recognition rates of the considered algorithms compared to RDUFL. These results show the sensitivity of these algorithms to outliers and their difficulty in correctly extracting the classes [31,32].

Real-World Dataset
The first considered real dataset is «Lymphography», which has 148 objects with 18 attributes. These are observations made on patients with cancer of the lymphatic system. It comprises four classes: normal (two objects), metastases (81 objects), malignant lymph (61 objects), and fibrosis (four objects). The second dataset is «Diabetes», which is composed of 768 objects with four attributes. The data fall into two classes: class 0 with 500 instances and class 1, interpreted as "tested positive for diabetes", with 268 instances.
The third dataset is the «Indian» dataset, which comprises 583 objects with 10 attributes. There are two classes: the first with 416 objects and the second with 167 objects.
The fourth dataset is Haberman's Survival dataset that is the result of a measure of 306 cases on the survival of patients who had undergone surgery for breast cancer. It is a three-dimensional pattern classification problem from two classes.
The Breast Cancer Wisconsin (BCW) dataset is a nine-dimensional pattern classification problem with 699 samples from a benign (non-cancerous) class and a malignant (cancerous) class, which contain, respectively, 458 and 241 points.
The Parkinsons Disease dataset is composed of a range of biomedical voice measurements. There are 22 attributes and 195 samples from the two classes corresponding to healthy people and those with Parkinson's disease. The two classes contain, respectively, 48 and 147 points.
The Postoperative Patient dataset aims to determine where to send patients in a postoperative recovery area. The number of instances is 90 distributed over three classes: class I (patient sent to Intensive Care Unit) with two items, class S (patient prepared to go home) with 24 items, and class A (patient sent to general hospital floor) with 64 items.
The EEG Eye State dataset consists of EEG values and a value indicating the eye state. This eye state was detected via a camera during the EEG measurement and added manually to the file later. It is a 14-dimensional pattern classification problem with 14,980 samples. The two classes contain, respectively, 8257 and 6723 points. Table 2 describes the data and provides information on the attributes, size, and number of classes. We initially checked whether there are possible outliers in the considered datasets. To this end, we calculated the proximity degree of the objects and looked for the smallest values.
Once the possible outliers were determined and isolated, we applied RDUFL to the remaining objects, processing one object at each iteration. At the end of the learning phase, we obtained the detected classes with their prototypes. The treatment phase of the possible outliers O_i led to the results described in Table 3.
The first finding is that RDUFL allows detecting the exact number of classes for all the examples of the considered data based on the defined dynamic threshold and the learning rule.
For the Lymphography dataset, there really exist four clusters, of which two classes are considered rare (classes 1 and 4) given their small size [28,29]. Class 1 contains two items, and class 4 contains four. RDUFL detects these six outliers, whereas the robust-FCM detects five outliers, of which only two items belong to the rare classes. The PCM does not recognize any rare class of this set.
For the BCW dataset, the concept of proximity degree allows detecting 18 possible outliers, which were isolated and not considered in the learning phase. The algorithm clusters the dataset into two clusters. The comparison of the 18 possible outliers with the prototypes of detected clusters demonstrates that these items had enough characteristics in common with these detected prototypes. Therefore, they are not true outliers.
RDUFL detects anomalies for this dataset and identifies the outliers (small class) for each considered number of possible outliers. We report the results in Table 4. Moreover, RDUFL can considerably improve the clustering performance and allows an increase of 30.61% on the BCW dataset. These results show that the adopted approach can increase the accuracy of class discovery even in the absence of outliers. Indeed, RDUFL improves learning and yields a good recognition rate, as depicted in Table 5. The recognition rates of each of the studied learning algorithms are also reported, for each dataset, in Table 5.
The accuracy percentage is also determined for each algorithm, based on Eq. (29). The results of the considered algorithms are shown in Table 6. According to this table, RDUFL achieves the highest accuracy percentage.

Conclusion
In this paper, we introduced an adapted approach in order to partition a dataset and detect outliers. This approach consists of three stages. The first stage is a pre-treatment method, which identifies objects likely to be possible outliers by utilizing the concept of proximity degree. The second stage is an unsupervised fuzzy learning algorithm, which detects the existing classes formed by the data without the possible outliers. In this stage, the algorithm also provides the prototypes of the detected classes and the membership degrees of each object to these classes. The creation of classes is carried out according to a dynamic threshold, which is recalculated at each iteration of the algorithm. This threshold is based on the similarity among the prototypes, updated at the exploration of each new object. As for the last stage, it consists in comparing the similarity of each possible outlier to the farthest object belonging to the class that corresponds to its nearest prototype. The experimental results demonstrated the effectiveness of the proposed approach, especially as it does not require any user-specified parameter.
Future work will introduce the notion of granular computing [18,24,46,55] to quantify the imprecision and the tolerance of uncertainty in the given large attribute dataset [2,37,49]. Indeed, granularity allows simplification, clarity, low cost, and tolerance of uncertainty [36]. It "underlies the remarkable human ability