BY-NC-ND 3.0 license Open Access Published by De Gruyter July 8, 2016

Performance Evaluation of Line Symmetry-Based Validity Indices on Clustering Algorithms

Vijay Kumar, Jitender Kumar Chhabra and Dinesh Kumar

Abstract

Finding the optimal number of clusters and the corresponding best partitioning of a given dataset are two major challenges in clustering, and cluster validity indices are used for both. In this paper, line symmetry-based versions of seven widely used cluster validity indices, namely the DB, PS, I, XB, FS, K, and SV indices, have been developed using line symmetry distance measures. These indices quantify the amount of line symmetry present in a partitioning of the dataset and are able to detect clusters of any shape or size, as long as the clusters possess the property of line symmetry. The performance of these indices is evaluated on three clustering algorithms: K-means, fuzzy C-means, and modified harmony search-based clustering (MHSC). The efficacy of the symmetry-based validity indices on these clustering algorithms is demonstrated on six artificial and six real-life datasets, with the number of clusters varying from 2 to n, where n is the total number of data points in the dataset. The experimental results reveal that the incorporation of the line symmetry-based distance improves the ability of the existing validity indices to find the appropriate number of clusters. These indices are also compared with the point symmetric and original versions of the same seven validity indices. The results further demonstrate that the MHSC technique performs better than the other well-known clustering techniques. For the real-life datasets, an analysis of variance (ANOVA) statistical test is also performed.

1 Introduction

Clustering is an unsupervised classification technique in which no label or structural information about the data is available. It partitions the input data objects into a certain number of clusters based on some similarity or dissimilarity metric, where the number of clusters may or may not be known in advance [13, 27]. The three main challenges in clustering are (a) identifying the best clustering technique for a given dataset, (b) finding the number of clusters present in the dataset, and (c) finding the corresponding best partitioning. Cluster validity indices are used for determining the number of clusters and the partitioning. A large number of cluster validity indices have been proposed in the literature [2, 10, 12, 19, 20, 22, 24, 25]. Most of them rely on the Euclidean distance [27], and with this distance measure they are able to identify only compact (hyperspherical) clusters.

It has been observed that clusters often possess some kind of symmetry. In this direction, Bandyopadhyay and Saha [4] proposed the point symmetry (PS)-based distance measure, which can detect clusters of any shape and size as long as they possess the PS property. They also incorporated the PS distance, in place of the Euclidean distance, to develop PS-based versions of some existing cluster validity indices [21]. However, these PS-based validity indices perform poorly when the clusters possess the line symmetry property.

To improve on this, line symmetry-based validity indices are developed in this paper. We incorporate the recently developed line symmetry distance, in place of the PS and Euclidean distances, to derive line symmetric versions of the DB index, PS index, I index, XB index, FS index, K index, and SV index. This enables the validity indices to detect both convex and non-convex clusters of any shape and size. Three different clustering techniques, viz. K-means (KM), fuzzy C-means (FCM), and the recently developed modified harmony search-based clustering (MHSC), are used for the performance evaluation of the validity indices. The indices are compared in terms of the number of clusters they identify, and the corresponding partitionings are also obtained. Analysis of variance (ANOVA) is used to compare the clustering algorithms. The experimental results reveal that the incorporation of the concept of line symmetry improves the capabilities of the validity indices, so that the number of clusters found is as close to the actual number as possible. The remainder of this paper is organized as follows. Section 2 gives a brief description of the clustering algorithms. Section 3 describes the newly developed cluster validity indices. Section 4 covers the experimental results and discussion, followed by the conclusions in Section 5.

2 Clustering Algorithms

A clustering algorithm seeks a K×n partition matrix U(X) of a dataset X = {x1, x2, …, xn}, representing its partitioning into a number of clusters, say K, (C1, C2, …, CK). The partition matrix U(X) may be represented as U = [uij]K×n, i = 1, 2, …, K, and j = 1, 2, …, n, where uij is the membership of data point xj to cluster Ci. The three most commonly used partitional clustering algorithms have been considered for comparison: the KM, FCM, and MHSC techniques.

The well-known KM clustering algorithm seeks an optimal partition of the data by minimizing the sum-of-squared-error criterion with an iterative optimization procedure [13, 19]:

$$J = \sum_{j=1}^{n}\sum_{i=1}^{K} u_{ij}\, \lVert x_j - v_i \rVert^2, \qquad (1)$$

where vi is the center of cluster Ci. Here, the K cluster centers are initialized to K randomly chosen data points from the given dataset. The data points are assigned to the nearest cluster center based on the minimum squared distance criterion. The cluster centers are subsequently updated to the mean of data points belonging to them. This process is repeated until there is no significant change in the value of J in two consecutive iterations.
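For concreteness, a minimal NumPy sketch of this procedure is given below. It is only an illustration of the iterative scheme described above; the function name k_means and all variable names are our own, not taken from the references.

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Minimize the sum-of-squared-error criterion J of Eq. (1)."""
    rng = np.random.default_rng(seed)
    # Initialize the K cluster centers to K randomly chosen data points.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    prev_J = np.inf
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign each point to the nearest center (minimum squared distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        J = d2[np.arange(len(X)), labels].sum()
        # Update every center to the mean of the points assigned to it.
        for i in range(K):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
        # Stop when J no longer changes significantly between iterations.
        if abs(prev_J - J) < 1e-9:
            break
        prev_J = J
    return centers, labels
```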

Bezdek [6] proposed a fuzzy version of the KM algorithm, known as FCM. FCM attempts to find partitions, represented as K fuzzy clusters for n data points, while minimizing the objective function:

$$J_m = \sum_{j=1}^{n}\sum_{i=1}^{K} u_{ij}^{m}\, \lVert x_j - v_i \rVert^2, \qquad (2)$$

where m is the fuzzification parameter, which influences the performance of the clustering algorithm. U=[uij]K × n is the fuzzy partition matrix and uij ∈ [0,1] is the membership coefficient of the jth data point in the ith cluster.
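The usual alternating membership and center updates that minimize Eq. (2) can be sketched as follows. This is a generic FCM illustration under the standard update rules, not code from Ref. [6]; the function name fcm and the default m = 1.5 (the value used later in Section 4.2) are assumptions.

```python
import numpy as np

def fcm(X, K, m=1.5, max_iter=100, seed=0, eps=1e-12):
    """Minimize the fuzzy objective J_m of Eq. (2) by alternating updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Distances of every point x_j to every center v_i (eps avoids division by zero).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)), so u_ij lies in [0, 1].
        ratio = d[:, :, None] / d[:, None, :]
        u = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        # Center update: centers are the means of the data weighted by u_ij^m.
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u
```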

Kumar et al. [16, 17] developed the MHSC algorithm, which is used as the third clustering algorithm for comparison. Here, a cluster center-based encoding scheme is used. Each harmony vector contains K cluster centers, which are initialized to K randomly chosen data points from the given dataset. This initialization is repeated for each of the HMS vectors in the harmony memory, where HMS is the harmony memory size. The data points are assigned to the different cluster centers based on the minimum Euclidean distance criterion, and the cluster centers encoded in a harmony vector are then replaced by the means of the data points assigned to them. The fitness of each harmony vector is computed using the sum-of-squared-error criterion and is minimized using the modified harmony search algorithm, whose improvisation process generates new harmony vectors. In MHSC, fitness computation and improvisation are repeated for a maximum number of iterations. The best harmony vector at the end of the last iteration provides the solution to the clustering problem.
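The sketch below shows only a generic harmony search clustering loop of the kind MHSC builds on: cluster-center encoding, harmony memory consideration, pitch adjustment, a K-means-style center refinement, and replacement of the worst harmony vector. All parameter names, default values, and improvisation details here are illustrative assumptions; the specific modifications of Refs. [16, 17] are not reproduced.

```python
import numpy as np

def sse(X, centers):
    """Sum-of-squared-error fitness: each point contributes its squared distance to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def harmony_clustering(X, K, hms=15, hmcr=0.9, par=0.3, bw=0.05, max_iter=100, seed=0):
    """Generic harmony-search clustering loop (illustrative; not the exact MHSC of Refs. [16, 17])."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Each harmony vector encodes K cluster centers, initialized to random data points.
    memory = np.stack([X[rng.choice(n, size=K, replace=False)] for _ in range(hms)])
    fitness = np.array([sse(X, h) for h in memory])
    lo, hi = X.min(axis=0), X.max(axis=0)
    for _ in range(max_iter):
        new = np.empty((K, d))
        for k in range(K):
            if rng.random() < hmcr:                  # harmony memory consideration
                new[k] = memory[rng.integers(hms), k]
                if rng.random() < par:               # pitch adjustment within bandwidth bw
                    new[k] += bw * rng.uniform(-1.0, 1.0, size=d) * (hi - lo)
            else:                                    # random selection from the data
                new[k] = X[rng.integers(n)]
        # Replace the encoded centers by the means of the points currently assigned to them.
        labels = ((X[:, None, :] - new[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                new[k] = X[labels == k].mean(axis=0)
        f = sse(X, new)
        worst = fitness.argmax()
        if f < fitness[worst]:                       # keep the new vector if it beats the worst one
            memory[worst], fitness[worst] = new, f
    best = fitness.argmin()
    return memory[best], fitness[best]
```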

3 Newly Developed Line Symmetry-Based Cluster Validity Indices

In this section, seven new cluster validity indices based on the concept of line symmetry are developed. These validity indices use the definitions of existing cluster validity indices. These indices can further be applied to find out the number of clusters irrespective of their shape and size, provided the clusters possess line symmetry property. In this section, we first describe line symmetry distance followed by the newly developed line symmetry-based cluster validity indices.

3.1 Line Symmetry Distance

Motivated by the property of line symmetry that clusters often exhibit, a line symmetry-based distance was proposed by Saha and Maulik [23]. It is defined as follows:

Consider a dataset X having n data points. First, the principal axis of X is computed using principal component analysis [14]. Let the eigenvector of the covariance matrix of X corresponding to the largest eigenvalue be [ev1, ev2, …, evd], where d is the dimension of X. Then, the first principal axis of X is given by

$$\frac{x_1 - c_1}{ev_1} = \frac{x_2 - c_2}{ev_2} = \cdots = \frac{x_d - c_d}{ev_d}, \qquad (3)$$

where C=[c1, c2, …, cd] is the center of the dataset X. The obtained principal axis is treated as the symmetrical line of the relevant cluster. This line is used to compute the line symmetry of a particular point in that cluster. To compute the amount of line symmetry of a point x with respect to a symmetrical line i, dLS(x, i), the following steps are followed [23].

  1. For a particular data point x, calculate the projected point pi on the relevant symmetrical line i.

  2. Compute dsym(x, pi) as

    $$d_{sym}(x, p_i) = \frac{\sum_{i=1}^{K_{near}} d_i}{K_{near}}, \qquad (4)$$

    where the Knear unique nearest neighbors of x* = 2 × pi − x are at Euclidean distances di, i = 1, 2, …, Knear.

  3. The amount of line symmetry of a particular cluster i is computed as

    $$d_{LS}(x, i) = d_{sym}(x, p_i) \times d_e(x, v_i), \qquad (5)$$

    where vi is the center of the particular cluster i and de(x, vi) is the Euclidean distance between the point x and the cluster center vi.

From Eq. (4), it follows that Knear cannot be chosen as 1, because if x* exists in the dataset, then dsym(x, pi) = 0, and hence the Euclidean distance would have no impact. Large values of Knear may also not be suitable, as they may overestimate the amount of symmetry of a point with respect to the first principal axis. In this paper, Knear is chosen as 2.
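A compact sketch of this computation is given below, assuming NumPy; the helper name line_symmetry_distance is ours. As in the indices of Section 3.2, the Knear nearest neighbors of the reflected point are searched among the points of the cluster itself.

```python
import numpy as np

def line_symmetry_distance(x, cluster_points, k_near=2):
    """Compute d_LS(x, i) of Eq. (5) for a point x and the points of cluster i."""
    v = cluster_points.mean(axis=0)                      # cluster center v_i
    # First principal axis: eigenvector of the covariance matrix with the largest eigenvalue (Eq. (3)).
    cov = np.cov(cluster_points, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)                     # eigenvalues in ascending order
    axis = eigvecs[:, -1]                                # unit direction of the symmetrical line
    # Project x onto the symmetrical line through v to obtain p_i, then reflect: x* = 2*p_i - x.
    p = v + np.dot(x - v, axis) * axis
    x_star = 2.0 * p - x
    # d_sym of Eq. (4): average distance of the k_near nearest cluster points to x*.
    dists = np.sort(np.linalg.norm(cluster_points - x_star, axis=1))
    d_sym = dists[:k_near].mean()
    # Eq. (5): weight the symmetry measure by the Euclidean distance to the cluster center.
    return d_sym * np.linalg.norm(x - v)
```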

3.2 Line Symmetry-Based Cluster Validity Indices

The main contribution of this paper is the development of seven new line symmetry-based validity indices, which improve the ability of the existing indices to identify the appropriate number of clusters. We have tested these line symmetry-based validity indices on three different well-known clustering techniques to avoid bias toward any one technique. We have compared the clustering algorithms on real-life datasets in terms of the Minkowski score, and have applied the ANOVA test on the Minkowski scores to determine statistically which of the three techniques is the best. The results further indicate that these indices outperform the PS-based validity indices on real-life datasets.

3.2.1 Definitions of Line Symmetry-Based Cluster Validity Indices

In this subsection, the seven newly developed line symmetry-based cluster validity indices are described.

3.2.1.1 Line Symmetry-Based DB Index (LSym-DB Index)

This index is developed on the basis of the existing DB index proposed in Ref. [9]. This is a function of the ratio of the sum of within-cluster line symmetry to between-cluster separation. The scatter within the ith cluster, Si, is defined as

$$S_i = \frac{1}{|C_i|}\sum_{x \in C_i} d_{LS}(x, i), \qquad (6)$$

where |Ci| represents the number of data points in cluster i and dLS(x, i) is computed using Eq. (5). Here, the Knear nearest neighbors of x* = 2 × pi − x are searched among the data points in cluster i. The distance, dij, between clusters Ci and Cj is computed as

$$d_{ij} = d_e(v_i, v_j), \qquad (7)$$

where vi and vj represent the centers of clusters i and j, and de denotes the Euclidean distance. The line symmetry-based DB index (LSym-DB index) is defined as

$$\text{LSym-DB}(K) = \frac{\sum_{i=1}^{K} R_i}{K}, \qquad (8)$$

where $R_i = \max_{j=1,\ldots,K,\, j \ne i}\left\{\frac{S_i + S_j}{d_{ij}}\right\}$. The optimal number of clusters is obtained by solving $\min_{2 \le K \le n-1}\{\text{LSym-DB}(K)\}$.
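As an illustration, the LSym-DB index can be computed roughly as follows, reusing the hypothetical line_symmetry_distance helper sketched in Section 3.1. The names and structure are our own assumptions, not the authors' implementation.

```python
import numpy as np

def lsym_db(X, labels, centers):
    """LSym-DB index of Eq. (8); lower values indicate a better partitioning."""
    K = len(centers)
    # Within-cluster line-symmetry scatter S_i of Eq. (6).
    S = np.array([
        np.mean([line_symmetry_distance(x, X[labels == i]) for x in X[labels == i]])
        for i in range(K)
    ])
    # R_i = max_{j != i} (S_i + S_j) / d_ij, with d_ij the center separation of Eq. (7).
    R = np.empty(K)
    for i in range(K):
        R[i] = max((S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                   for j in range(K) if j != i)
    return R.mean()
```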

3.2.1.2 Line Symmetry-Based PS-Index (LSym-PS Index)

This index is developed on the basis of the existing PS index proposed in Ref. [8]. The line symmetry-based PS index (LSym-PS index) is defined as

$$\text{LSym-PS}(K) = \frac{1}{K}\sum_{i=1}^{K}\frac{1}{|C_i|}\,\frac{\sum_{x \in C_i} d_{LS}(x, i)}{\min_{p,q=1,\ldots,K,\, p \ne q}\{d_e(v_p, v_q)\}}, \qquad (9)$$

where |Ci| represents the number of data points in cluster i and dLS(x, i) is computed using Eq. (5). Here, the Knear nearest neighbors of the reflected point x* of a point x are searched with respect to the symmetrical line of cluster i, where x belongs to the ith cluster. The optimal number of clusters is obtained by solving $\min_{2 \le K \le n-1}\{\text{LSym-PS}(K)\}$.

3.2.1.3 Line Symmetry-Based I-Index (LSym-I Index)

This index is developed on the basis of the existing I index proposed in Ref. [19]. The new cluster validity index, the LSym-I index, is defined as

$$\text{LSym-I}(K) = \left(\frac{1}{K} \times \frac{1}{\varepsilon_K} \times D_K\right). \qquad (10)$$

Here, $\varepsilon_K = \sum_{i=1}^{K}\sum_{j=1}^{|C_i|} d_{LS}(x_j^i, i)$ and $D_K = \max_{i,j=1,\ldots,K}\{\lVert v_i - v_j \rVert\}$, the maximum Euclidean distance between two cluster centers over all pairs of clusters. $d_{LS}(x_j^i, i)$ is computed using Eq. (5), and $x_j^i$ denotes the jth data point of the ith cluster. The optimal number of clusters is obtained by solving $\max_{2 \le K \le n-1}\{\text{LSym-I}(K)\}$.

3.2.1.4 Line Symmetry-Based Xie-Beni Index (LSym-XB Index)

This index is developed on the basis of the existing XB index proposed in Ref. [26]. The cluster validity index is defined as

$$\text{LSym-XB}(K) = \frac{\sum_{i=1}^{K}\sum_{j=1}^{|C_i|} d_{LS}^2(x_j^i, i)}{n\left(\min_{i,j=1,\ldots,K,\, i \ne j} d_e^2(v_i, v_j)\right)}. \qquad (11)$$

$d_{LS}(x_j^i, i)$ is computed using Eq. (5). Here, Knear represents the nearest neighbors of the reflected point x* of the point x with respect to the symmetrical line of cluster i, where x belongs to the ith cluster. The optimal number of clusters is obtained by solving $\min_{2 \le K \le n-1}\{\text{LSym-XB}(K)\}$.

3.2.1.5 Line Symmetry-Based FS Index (LSym-FS Index)

This index is developed on the basis of the existing FS index proposed in Ref. [11]. The line symmetry-based FS index, called the LSym-FS index, is defined as

$$\text{LSym-FS}(K) = \sum_{i=1}^{K}\sum_{x \in C_i} d_{LS}^2(x, i) - \sum_{i=1}^{K}\sum_{x \in C_i} d_e^2(v_i, \bar{v}), \qquad (12)$$

where v̄ is the center of the whole dataset and dLS(x, i) is computed using Eq. (5). Here, Knear represents the nearest neighbors of the reflected point x* of the point x with respect to the symmetrical line of cluster i, such that x belongs to the ith cluster. The optimal number of clusters is obtained by solving $\min_{2 \le K \le n-1}\{\text{LSym-FS}(K)\}$.

3.2.1.6 Line Symmetry-Based K Index (LSym-K Index)

This index is developed on the basis of the existing K index proposed in Ref. [18]. The newly developed LSym-K index is defined as follows

$$\text{LSym-K}(K) = \frac{\sum_{i=1}^{K}\sum_{x \in C_i} d_{LS}^2(x, i) + \frac{1}{K}\sum_{i=1}^{K} d_e^2(v_i, \bar{v})}{\min_{i \ne j}\left(d_e^2(v_i, v_j)\right)}, \qquad (13)$$

where v̄ is the center of the whole dataset and dLS(x, i) is computed using Eq. (5). Here, Knear represents the nearest neighbors of the reflected point x* of the point x with respect to the symmetrical line of cluster i, such that x belongs to the ith cluster. The optimal number of clusters is obtained by solving $\min_{2 \le K \le n-1}\{\text{LSym-K}(K)\}$.

3.2.1.7 Line Symmetry-Based SV Index (LSym-SV Index)

Kim et al. [15] proposed the SV index for determining the optimal number of clusters. On the basis of the existing SV index, a new LSym-SV index is developed. The LSym-SV index is defined as follows:

$$\text{LSym-SV}(K) = \frac{1}{K}\sum_{i=1}^{K}\frac{\sum_{x \in C_i} d_{LS}(x, i)}{|C_i|} + \frac{K}{\min_{i \ne j}\{d_e(v_i, v_j)\}}. \qquad (14)$$

Here, vi and vj represent the centroids of clusters i and j, respectively, and dLS(x, i) is computed using Eq. (5). The optimal number of clusters is obtained by solving $\min_{2 \le K \le n-1}\{\text{LSym-SV}(K)\}$.

4 Experimental Results

In this section, we describe the datasets used and the number of clusters identified by different cluster validity indices after the application of three well-known clustering algorithms: KM, FCM, and MHSC.

4.1 Datasets Used

A total of 12 datasets have been used for the experiments. Table 1 presents the details of these datasets, which are divided into three categories.

  1. Category 1: There are two datasets in this category (Data 1 and Data 2). These are artificially generated and have concave-shaped clusters. Data 1 consists of 800 three-dimensional data points distributed over two clusters. This is shown in Figure 1A. Some clusters existing in the dataset are overlapping in nature. Data 2 consists of four clusters, as shown in Figure 1B, and each cluster has 400 data points. The total number of two-dimensional data points is 1600.

  2. Category 2: The four artificial datasets in this category are those used in Ref. [3]. The clusters present in these datasets are convex in nature. These are Sph_4_3, Sph_5_2, Sph_6_2, and Mixed_9_2. The Sph_4_3 dataset consists of 400 data points distributed over four hyperspherical-shaped clusters. This is shown in Figure 1C. The Sph_5_2 dataset consists of 250 two-dimensional data points distributed over five different clusters. Some clusters are overlapping in nature, as shown in Figure 1D. The Sph_6_2 dataset consists of 300 two-dimensional data points distributed over six clusters, as shown in Figure 1E. The Mixed_9_2 dataset consists of 900 two-dimensional data points distributed over nine clusters, as shown in Figure 1F.

  3. Category 3: This category consists of six real-life datasets, namely Iris, Wine, Glass, Haberman, Breast Cancer, and Contraceptive Method Choice (CMC), obtained from the University of California Irvine Machine Learning Repository [7]. The Iris dataset consists of 150 data points distributed over three clusters; each data point has four features corresponding to sepal width, sepal length, petal length, and petal width. The Wine dataset consists of 178 data points, each with 13 features, distributed over three clusters. The nine-dimensional Glass dataset contains 214 data points distributed over six clusters. The Haberman survival dataset consists of 306 data points with three attributes and two categories (patient survived or patient died). The Wisconsin Breast Cancer dataset consists of 683 data points distributed over two clusters; each data point has nine features. The CMC dataset consists of 1473 instances with nine attributes and three subcategories: no use, long term, and short term.

Table 1:

Characteristics of the datasets used.

Dataset name | No. of data points | No. of attributes | Actual no. of clusters | Type
Data 1 | 800 | 3 | 2 | Artificial
Data 2 | 1600 | 2 | 4 | Artificial
Sph_4_3 | 400 | 3 | 4 | Artificial
Sph_5_2 | 250 | 2 | 5 | Artificial
Sph_6_2 | 300 | 2 | 6 | Artificial
Mixed_9_2 | 900 | 2 | 9 | Artificial
Iris | 150 | 4 | 3 | Real
Wine | 178 | 13 | 3 | Real
Glass | 214 | 9 | 6 | Real
Haberman | 306 | 3 | 2 | Real
Breast cancer | 683 | 9 | 2 | Real
CMC | 1473 | 9 | 3 | Real

Figure 1: Distribution of Data Points over Different Clusters: (A) Data 1; (B) Data 2; (C) Spherical_4_3; (D) Spherical_5_2; (E) Spherical_6_2; and (F) Mixed_9_2.

4.2 Parameter Setting

KM and FCM were executed for 100 iterations. For FCM, the fuzzification parameter m is set to 1.5. The maximum number of iterations and the harmony memory size for MHSC are fixed at 100 and 15, respectively. The pitch adjustment rate is chosen in the range [0.01, 0.99], the harmony memory consideration rate in [0.5, 0.95], and the bandwidth in [0.01, 0.1], as mentioned in Ref. [16]. Each algorithm is used to compute a set of partitions with the number of clusters ranging from Kmin to Kmax. Here, Kmin is 2 and Kmax is n, where n is the number of data points in the dataset; in this paper, Kmax was limited to 15 for the real-life datasets to avoid computational problems. For every dataset, each algorithm was executed 10 times, each time with randomly generated initial cluster centers. The results depicted in the tables are the average values obtained over the 10 runs of each algorithm.

4.3 Results and Discussion

To demonstrate the usefulness of the newly developed line symmetry-based validity indices in finding the appropriate number of clusters, the indices are computed on the above-mentioned 12 datasets. KM, FCM, and MHSC are used as the partitioning techniques. As the range assigned to the number of clusters is [2, n], each clustering technique gives a total of n − 1 partitions for each dataset. Each partition has a particular cluster validity index value, $CV_2, CV_3, \ldots, CV_n$. The optimal number of clusters is identified as $K_{opt} = \mathrm{Opt}_{i=2,3,\ldots,n}(CV_i)$, as mentioned in Ref. [21].
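A sketch of this model-selection loop is shown below; select_k, cluster_fn, and index_fn are illustrative names, and the usage example pairs the hypothetical k_means and lsym_db sketches from the earlier sections. For an index that is maximized, such as LSym-I, minimize would be set to False.

```python
def select_k(X, index_fn, cluster_fn, k_min=2, k_max=15, minimize=True):
    """Scan K over [k_min, k_max], cluster at each K, and return the K that optimizes the index."""
    scores = {}
    for K in range(k_min, k_max + 1):
        centers, labels = cluster_fn(X, K)           # e.g. the k_means sketch from Section 2
        scores[K] = index_fn(X, labels, centers)     # e.g. the lsym_db sketch from Section 3.2.1.1
    k_opt = min(scores, key=scores.get) if minimize else max(scores, key=scores.get)
    return k_opt, scores

# Illustrative usage: k_opt, _ = select_k(X, lsym_db, k_means)
```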

Tables 2 through 4 show the optimal number of clusters obtained by the seven newly developed indices for the above-mentioned datasets over K=2 to n after the application of the KM, FCM, and MHSC clustering techniques, respectively. Figures 2 through 7 show the partitions obtained after the application of the above-mentioned three clustering techniques on the six artificial datasets, respectively.

Table 2:

Optimal Number of Clusters Obtained from the KM Clustering Algorithm Using the Newly Developed Line Symmetry, PS, and Original Versions of Validity Indices over Different Datasets.

Each cell lists the number of clusters identified by the line symmetry (LSm), point symmetry (Sm), and original (Org) versions of the index, in the form LSm/Sm/Org.

Dataset | DB | PS | I | XB | FS | K | SV
Data 1 | 7/3/3 | 7/3/3 | 2/3/3 | 2/3/5 | 7/10/4 | 7/3/5 | 7/3/5
Data 2 | 4/9/9 | 4/9/6 | 10/8/5 | 4/5/10 | 4/6/8 | 4/5/10 | 2/5/2
Sph_4_3 | 2/4/3 | 2/4/7 | 2/4/4 | 2/4/4 | 3/10/5 | 2/4/4 | 3/10/2
Sph_5_2 | 5/10/5 | 5/5/5 | 5/5/5 | 5/5/5 | 5/10/5 | 2/5/5 | 7/5/2
Sph_6_2 | 6/10/5 | 6/10/4 | 6/8/5 | 6/4/10 | 6/10/5 | 6/8/5 | 6/10/5
Mix_9_2 | 9/9/9 | 9/9/3 | 9/2/2 | 9/2/9 | 9/7/9 | 9/3/9 | 9/8/4
Iris | 3/10/2 | 2/10/2 | 2/3/2 | 2/2/2 | 3/7/3 | 3/2/2 | 3/3/2
Wine | 4/10/5 | 3/8/4 | 3/4/3 | 3/4/9 | 4/10/6 | 3/2/6 | 3/10/10
Glass | 5/10/4 | 3/10/2 | 2/6/2 | 3/2/6 | 5/10/9 | 3/6/3 | 5/6/2
Cancer | 2/9/2 | 2/2/2 | 2/2/2 | 2/2/2 | 2/9/4 | 2/2/2 | 2/10/2
CMC | 3/3/10 | 3/3/2 | 3/3/3 | 3/3/9 | 3/3/4 | 3/3/10 | 3/3/9
Haberman | 8/10/10 | 2/10/3 | 2/2/2 | 2/2/10 | 2/10/10 | 2/5/10 | 9/10/10
Success rate | 0.58 (7/12) / 0.25 (3/12) / 0.25 (3/12) | 0.67 (8/12) / 0.33 (4/12) / 0.17 (2/12) | 0.67 (8/12) / 0.58 (7/12) / 0.50 (6/12) | 0.75 (9/12) / 0.42 (5/12) / 0.42 (5/12) | 0.67 (8/12) / 0.08 (1/12) / 0.25 (3/12) | 0.67 (8/12) / 0.42 (5/12) / 0.33 (4/12) | 0.58 (7/12) / 0.33 (4/12) / 0.08 (1/12)

Table 3:

Optimal Number of Clusters Obtained from the FCM Clustering Algorithm Using the Newly Developed Line Symmetry, PS, and Original Versions of Validity Indices over Different Datasets.

Each cell lists the number of clusters identified by the line symmetry (LSm), point symmetry (Sm), and original (Org) versions of the index, in the form LSm/Sm/Org.

Dataset | DB | PS | I | XB | FS | K | SV
Data 1 | 2/10/3 | 4/7/3 | 6/3/10 | 2/3/10 | 4/10/5 | 4/3/10 | 6/7/5
Data 2 | 4/9/4 | 4/9/7 | 4/9/5 | 4/9/10 | 9/9/8 | 4/8/10 | 4/5/2
Sph_4_3 | 2/9/4 | 2/4/4 | 2/2/4 | 2/4/4 | 3/9/4 | 3/4/4 | 3/10/4
Sph_5_2 | 5/5/5 | 5/5/5 | 5/5/5 | 5/5/5 | 5/10/5 | 5/5/5 | 5/5/5
Sph_6_2 | 6/10/5 | 6/10/4 | 6/9/4 | 4/4/4 | 6/10/5 | 6/4/4 | 6/10/6
Mix_9_2 | 9/9/9 | 9/9/9 | 4/8/2 | 4/8/9 | 4/10/9 | 9/8/9 | 10/8/2
Iris | 3/10/2 | 2/3/2 | 2/3/2 | 2/2/2 | 3/3/3 | 2/3/2 | 3/3/2
Wine | 4/9/9 | 4/9/4 | 3/4/4 | 4/2/9 | 3/9/9 | 4/3/6 | 7/9/10
Glass | 5/10/3 | 3/9/3 | 2/10/2 | 6/3/6 | 3/10/7 | 3/3/3 | 5/9/2
Cancer | 2/2/2 | 2/2/2 | 2/2/2 | 2/2/2 | 2/10/4 | 2/2/2 | 2/10/10
CMC | 3/10/2 | 3/9/2 | 3/3/3 | 3/3/9 | 3/10/3 | 3/3/9 | 3/10/10
Haberman | 6/10/4 | 8/10/10 | 5/2/3 | 8/2/10 | 8/10/10 | 9/5/10 | 4/3/10
Success rate | 0.67 (8/12) / 0.17 (2/12) / 0.41 (5/12) | 0.50 (6/12) / 0.41 (5/12) / 0.33 (4/12) | 0.50 (6/12) / 0.41 (5/12) / 0.33 (4/12) | 0.41 (5/12) / 0.41 (5/12) / 0.41 (5/12) | 0.50 (6/12) / 0.08 (1/12) / 0.41 (5/12) | 0.50 (6/12) / 0.50 (6/12) / 0.33 (4/12) | 0.50 (6/12) / 0.08 (1/12) / 0.25 (3/12)

Table 4:

Optimal Number of Clusters Obtained from the MHSC Algorithm Using the Newly Developed Line Symmetry, PS, and Original Versions of Validity Indices over Different Datasets.

Each cell lists the number of clusters identified by the line symmetry (LSm), point symmetry (Sm), and original (Org) versions of the index, in the form LSm/Sm/Org.

Dataset | DB | PS | I | XB | FS | K | SV
Data 1 | 7/3/5 | 7/4/3 | 2/2/10 | 2/2/5 | 7/8/5 | 2/2/5 | 7/3/5
Data 2 | 4/5/5 | 4/5/6 | 4/5/10 | 4/5/7 | 4/8/5 | 4/5/4 | 4/3/2
Sph_4_3 | 2/4/3 | 3/4/3 | 2/4/4 | 2/4/3 | 4/10/5 | 2/2/3 | 4/4/5
Sph_5_2 | 5/6/4 | 5/5/5 | 6/5/4 | 2/4/5 | 2/9/3 | 5/2/10 | 5/4/5
Sph_6_2 | 6/6/5 | 6/10/6 | 6/6/6 | 4/4/4 | 6/10/6 | 6/2/4 | 7/4/6
Mix_9_2 | 10/4/10 | 9/4/4 | 9/2/2 | 9/3/9 | 9/9/3 | 9/2/9 | 9/2/4
Iris | 3/4/2 | 3/2/2 | 3/2/2 | 3/2/2 | 3/8/3 | 3/2/2 | 3/2/2
Wine | 8/10/7 | 10/10/2 | 10/2/8 | 2/2/9 | 10/10/2 | 4/2/5 | 10/2/10
Glass | 6/10/3 | 6/2/3 | 3/2/3 | 6/2/2 | 6/10/9 | 5/2/3 | 6/2/3
Cancer | 2/2/2 | 2/2/2 | 2/2/2 | 2/2/2 | 2/8/3 | 2/2/2 | 2/2/9
CMC | 3/2/2 | 3/2/2 | 3/2/3 | 2/2/10 | 3/10/4 | 3/2/10 | 3/2/10
Haberman | 2/5/4 | 8/3/3 | 2/2/2 | 2/2/10 | 8/9/3 | 8/2/10 | 8/9/10
Success rate | 0.67 (8/12) / 0.25 (3/12) / 0.08 (1/12) | 0.67 (8/12) / 0.25 (3/12) / 0.25 (3/12) | 0.67 (8/12) / 0.50 (6/12) / 0.41 (5/12) | 0.58 (7/12) / 0.33 (4/12) / 0.25 (3/12) | 0.67 (8/12) / 0.08 (1/12) / 0.17 (2/12) | 0.67 (8/12) / 0.25 (3/12) / 0.25 (3/12) | 0.67 (8/12) / 0.17 (2/12) / 0.17 (2/12)

Figure 2: (A) Clustered Data 1 after the Application of KM for K=2. (B) Clustered Data 1 after the Application of FCM for K=2. (C) Clustered Data 1 after the Application of MHSC for K=2. (D) Clustered Data 1 after the Application of MHSC for K=7.

Figure 3: (A) Clustered Data 2 after the Application of KM for K=4. (B) Clustered Data 2 after the Application of FCM for K=4. (C) Clustered Data 2 after the Application of FCM for K=9. (D) Clustered Data 2 after the Application of MHSC for K=4.

Figure 4: (A) Clustered Spherical_4_3 after the Application of KM for K=2. (B) Clustered Spherical_4_3 after the Application of KM for K=3. (C) Clustered Spherical_4_3 after the Application of FCM for K=2. (D) Clustered Spherical_4_3 after the Application of FCM for K=3. (E) Clustered Spherical_4_3 after the Application of MHSC for K=3. (F) Clustered Spherical_4_3 after the Application of MHSC for K=4.

Figure 5: (A) Clustered Spherical_5_2 after the Application of KM for K=5. (B) Clustered Spherical_5_2 after the Application of KM for K=7. (C) Clustered Spherical_5_2 after the Application of FCM for K=5. (D) Clustered Spherical_5_2 after the Application of MHSC for K=5.

Figure 6: (A) Clustered Spherical_6_2 after the Application of KM for K=6. (B) Clustered Spherical_6_2 after the Application of FCM for K=6. (C) Clustered Spherical_6_2 after the Application of MHSC for K=6. (D) Clustered Spherical_6_2 after the Application of MHSC for K=7.

Figure 7: (A) Clustered Mixed_9_2 after the Application of KM for K=9. (B) Clustered Mixed_9_2 after the Application of FCM for K=4. (C) Clustered Mixed_9_2 after the Application of FCM for K=9. (D) Clustered Mixed_9_2 after the Application of MHSC for K=9.

For the Data 1 dataset, LSym-I and LSym-XB are able to detect the appropriate number of clusters with the KM clustering technique (the corresponding partitioning is shown in Figure 2A). LSym-DB and LSym-XB are able to identify the appropriate partitioning and number of clusters obtained through the FCM clustering technique. The corresponding partitioning is shown in Figure 2B. The LSym-I, LSym-XB, and LSym-K indices are able to detect the appropriate number of clusters and partitioning after the application of the MHSC technique. The partitionings corresponding to K=2 and K=7 are shown in Figure 2C and D.

For the Data 2 dataset, all the line symmetry-based cluster validity indices except LSym-I and LSym-SV are able to find the appropriate number of clusters using the KM clustering technique (the corresponding partitioning is shown in Figure 3A). The optimal values of all the line symmetry-based validity indices except LSym-FS indicate K=4 as the correct number of clusters using the FCM clustering technique. The partitionings corresponding to K=4 and K=9 are shown in Figure 3B and C, respectively. All the line symmetry-based cluster validity indices are able to detect the appropriate partitioning and the proper number of partitions after the application of the MHSC technique (the corresponding partitioning is shown in Figure 3D).

For the Spherical_4_3 dataset, none of the line symmetry-based cluster validity indices is able to detect the appropriate number of clusters and proper partitioning after the application of the KM and FCM clustering techniques (the corresponding partitionings are shown in Figure 4A,B and C,D, respectively). The LSym-FS and LSym-SV indices are able to detect the appropriate number of clusters and partitioning obtained through the MHSC technique. The corresponding optimal partitionings obtained are shown in Figure 4E and F.

For the Spherical_5_2 dataset, all the line symmetry-based cluster validity indices except LSym-K and LSym-SV are able to detect the appropriate number of clusters after the application of the KM clustering technique. The partitionings corresponding to K=5 and K=7 are shown in Figure 5A and B, respectively. The optimal value of all the line symmetry-based validity indices indicates K=5 number of clusters after the application of FCM (the corresponding partitioning is shown in Figure 5C). The LSym-DB, LSym-PS, LSym-K, and LSym-SV indices are able to detect the appropriate partitioning and number of clusters after the application of the MHSC technique (the corresponding partitioning is shown in Figure 5D).

Again, for the Spherical_6_2 dataset, all the indices are able to detect the proper partitioning and proper number of partitions after the application of KM and FCM clustering. The corresponding partitionings obtained after the application of KM and FCM are shown in Figure 6A and B, respectively. The MHSC technique is able to detect the appropriate partitioning for K=6, and all the indices except LSym-XB and LSym-SV are able to identify this. The partitionings corresponding to K=6 and K=7 are shown in Figure 6C and D.

For the Mixed_9_2 dataset, all the indices are able to identify the appropriate partitioning and number of clusters after the application of the KM clustering technique (the corresponding partitioning is shown in Figure 7A). The LSym-DB, LSym-PS, and LSym-K indices are capable of identifying the appropriate partitioning and number of clusters using the FCM technique, whereas LSym-I, LSym-XB, LSym-FS, and LSym-SV perform worst for this dataset. The partitionings corresponding to K=4 and K=9 are shown in Figure 7B and C. The LSym-PS, LSym-I, LSym-XB, LSym-FS, LSym-K, and LSym-SV indices are able to detect the appropriate partitioning and number of clusters after the application of the MHSC technique (the corresponding partitioning is shown in Figure 7D). However, the optimal value of the LSym-DB index wrongly points out K=10 as the number of clusters for this dataset.

For real-life datasets, pictorial representation of partitioning is not possible as they are high dimensional. For the Iris dataset, the LSym-DB, LSym-FS, LSym-K and LSym-SV indices are able to detect the appropriate partitioning after the application of the KM and FCM clustering techniques. The optimum values of the LSym-PS, LSym-I, and LSym-XB indices indicate two clusters. The MHSC technique is able to detect the appropriate partitioning for K=3, and all the indices are able to identify this.

For the Wine dataset, all the line symmetry-based cluster validity indices except LSym-DB and LSym-FS are able to find the proper partitioning of this dataset; these indices indicate K=3 as the correct number of clusters after the application of the KM technique. The optimal values of the LSym-I and LSym-FS indices specify K=3 after the application of FCM, whereas LSym-SV wrongly points out K=7 as the number of clusters. None of the line symmetry-based validity indices is capable of detecting the appropriate number of clusters using the MHSC technique.

For the Glass dataset, the line symmetry-based validity indices fail to detect the proper partitioning and number of clusters obtained through the KM clustering technique. The optimum value of LSym-XB specifies K=6 as the correct number of clusters obtained through the FCM technique. Except for LSym-I and LSym-K, all the line symmetry-based validity indices are able to detect the proper number of clusters after the application of the MHSC technique.

For the Breast Cancer dataset, the above-mentioned clustering techniques are able to identify the proper partitioning for K=2, and all the line symmetry-based validity indices are able to identify this.

For the CMC dataset, all the line symmetry-based validity indices are able to identify the appropriate number of clusters after the application of the KM and FCM clustering techniques. All the line symmetry-based validity indices except LSym-XB indicate K=3 as the number of clusters using the MHSC technique.

It can be seen that for the Haberman dataset, all the line symmetry-based validity indices except LSym-DB and LSym-SV are able to detect the appropriate number of clusters after the application of the KM technique. None of the validity indices is capable of finding the proper partitioning and number of clusters using the FCM technique. The optimum values of LSym-DB, LSym-I, and LSym-XB point out K=2 as the number of clusters obtained through the MHSC technique. The performances of LSym-PS, LSym-K, and LSym-SV are the worst for this dataset.

The above-mentioned results show that the LSym-XB index is able to identify the appropriate partitioning in 9 out of 12 datasets after the application of the KM technique, and the LSym-DB index in 8 out of 12 datasets after the application of the FCM technique. Similarly, LSym-DB, LSym-PS, LSym-I, LSym-FS, LSym-K, and LSym-SV are able to detect the appropriate partitioning in 7, 8, 8, 8, 8, and 7 out of 12 datasets, respectively, using the KM clustering technique, while LSym-PS, LSym-I, LSym-XB, LSym-FS, LSym-K, and LSym-SV determine the appropriate partitioning in 6, 6, 5, 6, 6, and 6 out of 12 datasets using the FCM clustering technique. All the line symmetry-based validity indices perform better with the MHSC technique: each of them detects the proper partitioning in 8 out of 12 datasets, except LSym-XB, which does so in 7 out of 12 datasets.

4.4 Comparison with Original Versions of Cluster Validity Indices

The values of the original seven cluster validity indices, the DB, PS, I, XB, FS, K, and SV indices, are computed after the application of the KM, FCM, and MHSC algorithms on the datasets described in Section 4.1. The numbers of clusters identified by these indices using the above-mentioned three clustering techniques are depicted in Tables 2, 3, and 4, respectively. It is common practice to use the success rate for comparing the line symmetric versions with the point symmetric and original versions of the validity indices. It is defined as the ratio of the number of datasets for which the validity index succeeds in determining the correct number of clusters to the total number of datasets, and is mathematically formulated as [21]

$$SR_i = \frac{S_i}{\text{Total number of datasets}}, \qquad (15)$$

where Si denotes the number of datasets for which validity index i succeeds in finding the correct number of clusters.

The results in Table 2 indicate that the line symmetry versions of the validity indices outperform the PS and original versions when the KM clustering technique is used. It has also been found that the PS versions of the I index and the K index perform better than the other point symmetric validity indices. Table 3 shows that the line symmetry versions of the validity indices perform better than the original versions as well as the PS versions using FCM; the line symmetry versions of the XB and K indices give performances similar to the PS versions of these indices. It is also noticeable from Table 4 that the MHSC clustering technique combined with the line symmetry versions of the validity indices is able to find the appropriate number of clusters in the given datasets. From these tables, we can conclude that the newly developed line symmetry-based cluster validity indices are able to find the number of clusters irrespective of the shape and size of the clusters, provided the clusters possess the line symmetry property, whereas most of the original and point symmetric versions of the validity indices fail to detect the appropriate number of clusters in real-life datasets.

4.5 Comparison of Clustering Techniques

To evaluate the quality of partitions generated from clustering techniques, the Minkowski score (MS) [5] is used. The MS is mathematically defined as [21, 22]

$$MS(X_{True}, X_{Est.}) = \sqrt{\frac{SD + DS}{SS + SD}}, \qquad (16)$$

where XTrue represents the true solution and XEst. is the solution obtained after the application of a clustering technique. SS denotes the number of pairs of data points belonging to the same cluster in both XEst. and XTrue, DS is the number of pairs of data points that are in the same cluster only in XEst., and SD is the number of pairs of points that are in the same cluster only in XTrue. Zero is the optimal value of the MS: the lower the score, the better the partitioning. The above-mentioned three clustering algorithms were executed 10 times for each of the datasets. The estimated mean MS values and their standard deviations for the real-life datasets after the application of the three clustering algorithms are tabulated in Table 5. For all the datasets, the MHSC technique provides the lowest MS values; hence, the MHSC technique is the best among the three clustering techniques.
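A straightforward, if quadratic-time, sketch of this computation is given below. The pair counts SS, SD, and DS follow the definitions above; the function name is an assumption.

```python
import numpy as np
from itertools import combinations

def minkowski_score(true_labels, est_labels):
    """Minkowski score of Eq. (16); 0 means perfect agreement, and lower is better."""
    ss = sd = ds = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_true = true_labels[a] == true_labels[b]
        same_est = est_labels[a] == est_labels[b]
        if same_true and same_est:
            ss += 1      # SS: pair in the same cluster in both solutions
        elif same_true:
            sd += 1      # SD: pair in the same cluster only in the true solution
        elif same_est:
            ds += 1      # DS: pair in the same cluster only in the estimated solution
    return np.sqrt((sd + ds) / (ss + sd))
```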

Table 5:

Minkowski Scores Obtained by Three Clustering Algorithms for Real-Life Datasets.

Dataset | KM clustering | FCM clustering | MHSC technique
Iris | 0.64951 ± 0.0815 | 0.63636 ± 0.0814 | 0.56482 ± 0.0269
Wine | 0.89529 ± 0.0216 | 0.91807 ± 0.0255 | 0.87768 ± 0.0124
Glass | 0.98828 ± 0.1053 | 0.99094 ± 0.0827 | 0.93911 ± 0.0419
Cancer | 0.36858 ± 0.0031 | 0.37660 ± 0.0051 | 0.35380 ± 0.0214
CMC | 1.13219 ± 0.0027 | 1.13038 ± 0.0010 | 1.11070 ± 0.0159
Haberman | 0.99548 ± 0.0033 | 0.99506 ± 0.0031 | 0.98319 ± 0.0059

In addition to the basic statistical analysis (i.e. mean and standard deviation), the ANOVA test [1] has been performed to compare the KM, FCM, and MHSC techniques. ANOVA is performed on the combined MS values of the three algorithms. The ANOVA results for all the real-life datasets are tabulated in Tables 6 through 11, and the boxplots of the MS values obtained by the three above-mentioned clustering techniques on the real-life datasets are shown in Figure 8 for illustration. As seen from the tables, the difference between the mean MSs of FCM and KM is not significant, indicating their similar performance on the Iris, Wine, Glass, CMC, and Haberman datasets. The difference between the mean MS values obtained by the MHSC technique and those obtained by the KM and FCM clustering techniques is statistically significant, with a significance value of <0.05, for all real-life datasets except the Glass dataset.
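For illustration, a one-way ANOVA on the per-run Minkowski scores can be computed with SciPy as sketched below. The three score arrays are hypothetical (the individual run values are not reported in the paper), and this sketch does not reproduce the pairwise post-hoc comparisons listed in Tables 6 through 11.

```python
from scipy import stats

def anova_on_ms(ms_km, ms_fcm, ms_mhsc, alpha=0.05):
    """One-way ANOVA on the Minkowski scores (e.g. 10 runs per algorithm for one dataset)."""
    f_stat, p_value = stats.f_oneway(ms_km, ms_fcm, ms_mhsc)
    # The difference in means is considered statistically significant when p < alpha.
    return f_stat, p_value, p_value < alpha
```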

Table 6:

Estimated Means and Pairwise Comparisons of Different Algorithms on the Minkowski Score Obtained by ANOVA Testing for the Iris Dataset.

Algorithm name (A) | Comparing algorithm (B) | Mean difference (A–B) | Significance value
MHSC | FCM | –0.0715 ± 0.0419 | 0.00
MHSC | KM | –0.0847 ± 0.0545 | 0.00
FCM | MHSC | 0.0715 ± 0.0419 | 0.00
FCM | KM | –0.0131 ± 0.0126 | 0.702
KM | MHSC | 0.0847 ± 0.0545 | 0.00
KM | FCM | 0.0131 ± 0.0126 | 0.702
Table 7:

Estimated Means and Pairwise Comparisons of Different Algorithms on the Minkowski Score Obtained by ANOVA Testing for the Wine Dataset.

Algorithm name (A) | Comparing algorithm (B) | Mean difference (A–B) | Significance value
MHSC | FCM | –0.0404 ± 0.0131 | 0.00
MHSC | KM | –0.0176 ± 0.0092 | 0.03
FCM | MHSC | 0.0404 ± 0.0131 | 0.00
FCM | KM | 0.0228 ± 0.0039 | 0.05
KM | MHSC | 0.0176 ± 0.0092 | 0.03
KM | FCM | –0.0228 ± 0.0039 | 0.05
Table 8:

Estimated Means and Pairwise Comparisons of Different Algorithms on the Minkowski Score Obtained by ANOVA Testing for the Glass Dataset.

Algorithm name (A) | Comparing algorithm (B) | Mean difference (A–B) | Significance value
MHSC | FCM | –0.0518 ± 0.0409 | 0.01
MHSC | KM | –0.0492 ± 0.0634 | 0.189
FCM | MHSC | 0.0518 ± 0.0409 | 0.01
FCM | KM | 0.0027 ± 0.0225 | 0.951
KM | MHSC | 0.0492 ± 0.0634 | 0.189
KM | FCM | –0.0027 ± 0.0225 | 0.951
Table 9:

Estimated Means and Pairwise Comparisons of Different Algorithms on the Minkowski Score Obtained by ANOVA Testing for the Cancer Dataset.

Algorithm name (A) | Comparing algorithm (B) | Mean difference (A–B) | Significance value
MHSC | FCM | –0.0228 ± 0.0163 | 0.00
MHSC | KM | –0.0148 ± 0.0184 | 0.02
FCM | MHSC | 0.0228 ± 0.0163 | 0.00
FCM | KM | 0.0080 ± 0.0021 | 0.00
KM | MHSC | 0.0148 ± 0.0184 | 0.02
KM | FCM | –0.0080 ± 0.0021 | 0.00
Table 10:

Estimated Means and Pairwise Comparisons of Different Algorithms on the Minkowski Score Obtained by ANOVA Testing for the CMC Dataset.

Algorithm name (A) | Comparing algorithm (B) | Mean difference (A–B) | Significance value
MHSC | FCM | –0.0214 ± 0.0132 | 0.00
MHSC | KM | –0.0196 ± 0.0149 | 0.00
FCM | MHSC | 0.0214 ± 0.0132 | 0.00
FCM | KM | –0.0018 ± 0.0017 | 0.067
KM | MHSC | 0.0196 ± 0.0149 | 0.00
KM | FCM | 0.0018 ± 0.0017 | 0.067
Table 11:

Estimated Means and Pairwise Comparisons of Different Algorithms on the Minkowski Score Obtained by ANOVA Testing for the Haberman Dataset.

Algorithm name (A) | Comparing algorithm (B) | Mean difference (A–B) | Significance value
MHSC | FCM | –0.0119 ± 0.0029 | 0.00
MHSC | KM | –0.0123 ± 0.0037 | 0.00
FCM | MHSC | 0.0119 ± 0.0029 | 0.00
FCM | KM | –4.2E-04 ± 8.2E-04 | 0.733
KM | MHSC | 0.0123 ± 0.0037 | 0.00
KM | FCM | 4.2E-04 ± 8.2E-04 | 0.733
Figure 8: Boxplots of the Minkowski scores obtained by the clustering techniques for datasets (A) Iris, (B) Wine, (C) Glass, (D) Cancer, (E) CMC, and (F) Haberman.

5 Conclusions

In this paper, seven new line symmetry-based cluster validity indices have been developed. Incorporation of the line symmetry distance enables these cluster validity indices to detect symmetric clusters, both convex and non-convex, provided the clusters possess the line symmetry property. This is in contrast to the point symmetric and original versions of the cluster validity indices, which fail in such situations. The newly developed line symmetric cluster validity indices have been evaluated on six artificial and six real-life datasets with the KM, FCM, and MHSC techniques. The experimental results revealed that the inclusion of the line symmetry-based distance in the definitions of the existing cluster validity indices makes them more efficient in determining the appropriate number of clusters. The results also show that, among the three techniques, MHSC is well suited to detecting the appropriate partitioning of the datasets.

Bibliography

[1] T. W. Anderson and S. L. Sclove, An introduction to the statistical analysis of data, Houghton Mifflin Harcourt, Boston, 1978.

[2] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Perez and I. Perona, An extensive comparative study of cluster validity indices, Pattern Recogn. 46 (2013), 243–256. doi:10.1016/j.patcog.2012.07.021.

[3] S. Bandyopadhyay and U. Maulik, Genetic clustering for automatic evolution of clusters and application to image classification, Pattern Recogn. 35 (2002), 1197–1208. doi:10.1016/S0031-3203(01)00108-X.

[4] S. Bandyopadhyay and S. Saha, GAPS: a clustering method using a new point symmetry based distance measure, Pattern Recogn. 40 (2007), 3430–3451. doi:10.1016/j.patcog.2007.03.026.

[5] A. Ben-Hur and I. Guyon, Detecting stable clusters using principal component analysis, in: M. J. Brownstein and A. B. Khodursky, eds., Methods in Molecular Biology, pp. 159–182, Humana Press, New York, 2003. doi:10.1385/1-59259-364-X:159.

[6] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms, Plenum Press, New York, 1981. doi:10.1007/978-1-4757-0450-1.

[7] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/databases/. Accessed 11 March, 2013.

[8] C. H. Chou, M. C. Su and E. Lai, Symmetry as a new measure for cluster validity, in: Int. Conf. on Scientific Computation and Soft Computing, pp. 209–213, 2002.

[9] D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979), 224–227. doi:10.1109/TPAMI.1979.4766909.

[10] E. Dimitriadou, S. Dolnicar and A. Weingassel, An examination of indexes for determining the number of clusters in binary datasets, Psychometrika 67 (2002), 137–160. doi:10.1007/BF02294713.

[11] Y. Fukuyama and M. Sugeno, A new method for choosing the number of cluster for fuzzy C-means method, in: Proc. 5th Fuzzy Syst. Symp., pp. 247–250, 1989.

[12] E. Hruschka, R. J. G. B. Campello, A. A. Freitas, A. C. Ponce and F. D. Carvalho, A survey of evolutionary algorithms for clustering, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 39 (2009), 133–155. doi:10.1109/TSMCC.2008.2007252.

[13] A. K. Jain and R. C. Dubes, Algorithms for clustering data, Prentice-Hall, Englewood Cliffs, NJ, 1998.

[14] I. Jolliffe, Principal component analysis, Springer Series in Statistics, England, 1986. doi:10.1007/978-1-4757-1904-8.

[15] D. J. Kim, Y. W. Park and D. J. Park, A novel validity index for determination of the optimal number of clusters, IEICE Trans. Inform. Syst. D-E84 (2001), 281–285.

[16] V. Kumar, J. K. Chhabra and D. Kumar, Effect of harmony search parameters' variation in clustering, Proc. Tech. 6 (2012), 265–274. doi:10.1016/j.protcy.2012.10.032.

[17] V. Kumar, J. K. Chhabra and D. Kumar, Parameter adaptive harmony search algorithm for unimodal and multimodal optimization problems, J. Comput. Sci. 5 (2014), 144–155. doi:10.1016/j.jocs.2013.12.001.

[18] S. H. Kwon, Cluster validity index for fuzzy clustering, Electron. Lett. 34 (1998), 2176–2177. doi:10.1049/el:19981523.

[19] U. Maulik and S. Bandyopadhyay, Performance evaluation of some clustering algorithms and validity indices, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002), 1650–1654. doi:10.1109/TPAMI.2002.1114856.

[20] G. W. Milligan and M. C. Cooper, An examination of procedures for determining the number of clusters in a dataset, Psychometrika 50 (1985), 159–179. doi:10.1007/BF02294245.

[21] S. Saha and S. Bandyopadhyay, Performance evaluation of some symmetry-based cluster validity indices, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 39 (2009), 420–425. doi:10.1109/TSMCC.2009.2013335.

[22] S. Saha and S. Bandyopadhyay, Some connectivity based cluster validity indices, Appl. Soft Comput. 12 (2012), 1555–1565. doi:10.1016/j.asoc.2011.12.013.

[23] S. Saha and U. Maulik, A new line symmetry distance based automatic clustering technique: application to image segmentation, Int. J. Imaging Syst. Tech. 21 (2011), 86–100. doi:10.1002/ima.20243.

[24] M. C. Su and C. H. Chou, A modified version of the k-means algorithm with a distance based on cluster symmetry, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001), 674–680. doi:10.1109/34.927466.

[25] J. Wu, J. Chen, H. Xiong and M. Xie, External validation measures for K-means clustering: a data distribution perspective, Expert Syst. Appl. 36 (2009), 6050–6061. doi:10.1016/j.eswa.2008.06.093.

[26] X. L. Xie and G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991), 841–847. doi:10.1109/34.85677.

[27] R. Xu and D. C. Wunsch II, Clustering, John Wiley, Hoboken, NJ, 2009.

Received: 2016-1-27
Published Online: 2016-7-8
Published in Print: 2017-7-26

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
