
Grey Wolf Algorithm-Based Clustering Technique

  • Vijay Kumar, Jitender Kumar Chhabra and Dinesh Kumar

Abstract

The main problem of classical clustering techniques is that they are easily trapped in local optima. This paper attempts to solve this problem by proposing a grey wolf algorithm (GWA)-based clustering technique, called GWA clustering (GWAC). The search capability of GWA is used to find the optimal cluster centers in the given feature space, and an agent representation is used to encode the centers of the clusters. The proposed GWAC technique is tested on both artificial and real-life data sets and compared to six well-known metaheuristic-based clustering techniques. The computational results are encouraging and demonstrate that GWAC provides better values in terms of precision, recall, G-measure, and intracluster distances. GWAC is further applied to gene expression data sets and its performance is compared to other techniques. Experimental results reveal the efficiency of GWAC over the other techniques.

1 Introduction

Clustering is the process of partitioning a set of data points into a finite number of groups (clusters) in such a way that the between-group variability, known as the intercluster distance, is maximized and the within-group variability, known as the intracluster distance, is minimized. It has been used in many engineering and scientific fields, including image segmentation, data forecasting, information retrieval, and bioinformatics [1, 35]. Due to this wide applicability, researchers have put considerable effort into designing new clustering algorithms as well as improving the performance of existing algorithms using newly developed metaheuristic approaches. Existing clustering algorithms fall into two broad categories: classical and metaheuristic [13]. Classical clustering algorithms can be broadly divided into five categories: hierarchical clustering, partitional clustering, density-based clustering, grid-based clustering, and model-based clustering [4, 23]. K-means (KM) is a widely used classical clustering algorithm due to its simplicity and efficiency [8, 19]. However, KM has the shortcoming that it depends on the initial state and hence may converge towards local optima [17, 31].

In the last few decades, many metaheuristic algorithms have been used to overcome the above-mentioned shortcomings. Metaheuristic algorithms are believed to be able to solve combinatorial problems with satisfactory near-optimal solutions in less computational time than classical methods. Although many metaheuristic algorithms for solving clustering problems have been proposed, the results are still unsatisfactory [27]. Hence, an improvement to metaheuristic algorithms for solving clustering problems is still required.

Grey wolves are considered apex predators (i.e. they are at the top of the food chain). Their social hierarchy and their prey tracking, encircling, and attacking behaviors have recently attracted research interest. Recently, Mirjalili et al. [25] described a grey wolf algorithm (GWA), based on the behavior of grey wolves, for numerical optimization problems. The main contribution of our paper is to propose a novel clustering approach using GWA. This approach takes advantage of the search capabilities of GWA to escape local optima through its ability to explore the search space. The performance of the proposed GWAC has been tested on a variety of data sets and compared to several existing clustering algorithms.

The rest of this paper is structured as follows. Section 2 defines the clustering problem and gives a brief overview of previous work on metaheuristic-based clustering techniques. Section 3 gives a brief description of GWA. The GWA adapted for solving clustering problems is introduced in Section 4. The complexity analysis is described in Section 5. Section 6 presents the data sets, parameter settings, and experimental results. Section 7 applies the proposed technique to gene expression data. Finally, Section 8 summarizes the contribution of this paper.

2 Scientific Background

This section describes the basic concepts of cluster analysis and reviews related work on metaheuristic-based clustering.

2.1 Cluster Analysis

Clustering in d-dimensional space (R^d) is the process of partitioning a set of n data points into a number of clusters, K, based on some similarity measure. Let the set of n data points be represented by X (i.e. X={x1, x2, …, xn}). The K clusters are represented by C={C1, C2, …, CK}, such that data points that belong to different clusters are as dissimilar as possible, whereas data points that belong to the same cluster are as similar to each other as possible. The clusters should maintain the following three properties [6].

Each cluster should consist of at least one data point, i.e.

(1) $C_i \neq \emptyset,\ \forall i \in \{1, 2, \ldots, K\}$

Two different clusters should have no data point in common, i.e.

(2) $C_i \cap C_j = \emptyset,\ i \neq j \text{ and } i, j \in \{1, 2, \ldots, K\}$

Each data point should definitely be attached to a cluster, i.e.

(3) $\bigcup_{i=1}^{K} C_i = X$

For this, a similarity/dissimilarity measure must be defined for adequate partitioning. The Euclidean distance is the most commonly used similarity measure in clustering.

The clustering problem is to find the optimal clustering C* among all feasible solutions C={C1, C2, …, C_N(n,K)} (i.e. Ci ≠ Cj, i ≠ j). N(n, K) is the number of feasible clusterings, which is given by the following formula:

(4) $N(n, K) = \frac{1}{K!} \sum_{i=1}^{K} (-1)^{K-i} \binom{K}{i} i^{n}$
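To give a sense of scale, the short sketch below (Python; the function name is ours) evaluates Equation (4). Even for modest n and K the count explodes, which is why exhaustive enumeration is infeasible and metaheuristic search is attractive.

```python
from math import comb, factorial

def num_clusterings(n: int, k: int) -> int:
    """Number of ways to partition n data points into k non-empty clusters, Equation (4)."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1)) // factorial(k)

print(num_clusterings(10, 3))    # 9330
print(num_clusterings(150, 3))   # on the order of 10^71 for a data set the size of Iris
```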

Equivalently, the clustering problem can be stated as the following optimization problem:

(5) $\operatorname{Optimize}_{C}\ f(X, C)$

where f(·) is a statistical function that judges the quality of the partitions generated by clustering.

2.2 Related Work

Researchers have used metaheuristic algorithms to overcome the shortcomings of classical clustering algorithms. Most of them are either evolutionary or population-based algorithms. Selim and Al-Sultan [30] proposed a simulated annealing algorithm for the clustering problem. They proved theoretically that the global solution of a clustering problem can be reached. The main weakness of this method is the parameter setting. Krishna and Murty [20] developed a novel approach called the genetic KM algorithm for cluster analysis, which defines a mutation operator specific to clustering. The main strength of this approach is that it is faster than KM; its disadvantage is that it is unable to check the boundary of data points. Sung and Jin [33] proposed a tabu search-based heuristic for clustering, combining packing and releasing procedures with tabu search. This approach provided better cluster solutions than KM and the simulated annealing-based clustering approach. Maulik and Bandyopadhyay [24] proposed a genetic algorithm (GA)-based method for the data clustering problem, in which the clustering solutions improve towards better solutions via selection, crossover, and mutation. This approach generated better cluster centers than existing classical clustering techniques. An ant colony optimization (ACO) algorithm for clustering was presented by Shelokar et al. [32]. This algorithm mimics the behavior of ants finding the shortest path from their nest to a food source and back. Its performance was compared to GA, simulated annealing, and tabu search, and the simulation results showed that it was better than the GA, simulated annealing, and tabu search-based clustering techniques. Fathian et al. [7] developed an application of the honeybee mating optimization algorithm for data clustering. This approach provided better cluster quality than the GA and ACO-based clustering techniques. Particle swarm optimization (PSO), which simulates the social behavior of bird flocking, was used for clustering by Kao et al. [17]. Its performance was further enhanced by hybridizing KM with PSO, and it was compared to GA [26] and KGA [2]. The main strengths of this method are a better convergence rate and fewer function evaluations; however, it fails on data sets having overlapping data points.

Karaboga and Basturk [18] described an artificial bee colony (ABC) algorithm, based on the foraging behavior of honeybees, for the clustering problem. Zhang et al. [38] extended the ABC for data clustering. Its performance was compared to other heuristic-based clustering techniques. The pros of this approach were better cluster quality and processing time than the PSO-based clustering technique; the parameter setting was its main drawback. Satapathy and Naik [29] used the teaching learning-based optimization technique for data clustering. They optimized the cluster centers for a user-specified number of clusters. This approach provided better intracluster distance than the GA, PSO, and ACO-based clustering techniques. Hatamlou et al. [13] presented a gravitational search algorithm-based data clustering technique. Hatamlou et al. [14] presented the Big Bang-Big Crunch (BB-BC) algorithm for the clustering problem. In the Big Bang phase, some candidate solutions are randomly generated and spread all over the search space. In the Big Crunch phase, the randomly distributed candidate solutions are drawn towards a single representative point via a center of mass of the population. It avoids premature convergence while moving towards the global optimum. The parameter setting is the main problem of this approach. Hassanzadeh and Meybodi [10] presented a firefly algorithm for data clustering. They optimized the cluster centers and extended the approach with KM clustering to further refine the cluster centers. This approach has better efficiency than the KM and PSO-based clustering techniques; however, it has higher time complexity. Hatamlou [11] introduced a new algorithm, named the black hole (BH) algorithm, and applied it to the clustering problem. The main advantages of this approach were ease of implementation and simplicity of structure. Hatamlou and Hatamlou [12] investigated the hybridization of the gravitational search algorithm (GSA) and the BB-BC algorithm for data clustering. They used GSA for exploring the search space to find the optimal locations of cluster centers, and BB-BC was used to diversify the search. This approach needs fewer function evaluations than KM. The main drawback of this approach was the parameter setting. Kumar et al. [21] used a gravitational search-based clustering technique for magnetic resonance imaging (MRI) brain image segmentation. Kumar et al. [22] also proposed four new variants of harmony search clustering algorithms and used these variants for solving the clustering problem, exploiting the search capability of harmony search to optimize the within-cluster variation. This approach has better efficiency than the GA, PSO, and ACO-based clustering techniques. Saida et al. [28] presented a cuckoo search for solving the data clustering problem. This approach has better computational efficiency and stable convergence.

In this paper, a novel approach for GWA-based data clustering (GWAC) is proposed. There are two main reasons for adopting GWA as the metaheuristic technique for clustering. First, it is simple and easy to implement. Second, it has only one main control coefficient to tune.

3 GWA

GWA is an efficient optimization algorithm that is inspired by the behavior of grey wolves [25]. It mimics the leadership hierarchy and hunting mechanism of grey wolves in nature. The grey wolves are classified into four main groups: the α, β, δ, and ω wolves. The α wolves are the leading wolves; they take the important decisions during the hunting process and also keep track of the other wolves in the group to maintain the social hierarchy. The second level of the hierarchy consists of the β wolves, which act as consultants of the α wolves and provide guidance under different circumstances. When the α wolves die or become old, the β wolves are promoted to α wolves. The δ wolves control the ω wolves and provide information to the α and β wolves. The lowest level of the hierarchy is the ω wolves, which may be the children of the group. The hunting process consists of three main steps. The first step is searching for and tracking the prey. Thereafter, the grey wolves encircle and harass the prey until it stops moving. The last step is attacking the prey. The pseudo code of the GWA is described in Figure 1.

Figure 1: Pseudo Code of the GWA.

The GWA is mathematically modeled as follows [25].

Community hierarchy: The fittest solution is considered as the α wolf. The second and third best solutions are depicted by the β and δ wolves, respectively. The remaining candidate solutions are depicted as the ω wolves. The optimization process is guided by the α, β, and δ wolves. The ω wolves follow them.

Encircling the prey: The grey wolves encircle the prey. This process is articulated using the following equations:

(6) $X(t+1) = X_P(t) - A \cdot D$

Here,

(7) $D = |C \cdot X_P(t) - X(t)|$

where D represents the distance between the position of the prey (X_P) and a grey wolf (X), and t is the current iteration. X_P(t) is the position of the prey at iteration t. A and C are control coefficients. The coefficient A is used to balance exploration and exploitation, while C is used for exploration and for the avoidance of local optima. These are computed as follows:

(8) $A = 2a \cdot r_1 - a$
(9) $C = 2r_2$

where a is linearly decreased from 2 to 0 during the iteration process, and r1 and r2 are random numbers in the range [0, 1].

Hunting the prey: The best candidate solutions (the α, β, and δ wolves) have adequate awareness of the prey position. The other wolves update their positions based on the positions of these best search agents. The hunting behavior is described by the following equations.

(10) $D_\alpha = |C_1 \cdot X_\alpha - X|,\quad D_\beta = |C_2 \cdot X_\beta - X|,\quad D_\delta = |C_3 \cdot X_\delta - X|$
(11) $X_1 = X_\alpha - A_1 \cdot D_\alpha,\quad X_2 = X_\beta - A_2 \cdot D_\beta,\quad X_3 = X_\delta - A_3 \cdot D_\delta$
(12) $X(t+1) = \frac{X_1 + X_2 + X_3}{3}$

Attacking the prey: This step represents the exploitation power of the algorithm. It is performed by linearly decreasing the value of a from 2 to 0. A value of |A| < 1 forces the wolves to attack the prey.

Searching for the prey: This step represents the exploration power of the algorithm. The grey wolves diverge from each other to search for better prey. To keep the algorithm from getting stuck in local optima, a value of |A| > 1 compels the wolves to explore for better prey.
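The following is a minimal Python sketch of the GWA loop built from Equations (8)–(12); it is our own illustration rather than the authors' implementation, and the function and variable names are ours.

```python
import numpy as np

def gwa(fitness, dim, n_agents=30, max_iter=500, lb=0.0, ub=1.0):
    """Minimal grey wolf algorithm loop following Equations (8)-(12) (minimization)."""
    X = np.random.uniform(lb, ub, (n_agents, dim))           # initial wolf positions
    for t in range(max_iter):
        scores = np.array([fitness(x) for x in X])
        order = np.argsort(scores)                            # best (smallest) first
        x_alpha, x_beta, x_delta = X[order[0]], X[order[1]], X[order[2]]
        a = 2 - 2 * t / max_iter                              # a decreases linearly from 2 to 0
        for i in range(n_agents):
            new_pos = np.zeros(dim)
            for leader in (x_alpha, x_beta, x_delta):
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                A = 2 * a * r1 - a                            # Equation (8)
                C = 2 * r2                                    # Equation (9)
                D = np.abs(C * leader - X[i])                 # Equation (10)
                new_pos += leader - A * D                     # Equation (11)
            X[i] = np.clip(new_pos / 3, lb, ub)               # Equation (12)
    scores = np.array([fitness(x) for x in X])
    return X[np.argmin(scores)]                               # the alpha wolf
```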

4 GWAC

The GWAC algorithm follows the basic steps of GWA. The pseudo code of the GWAC technique is shown in Figure 2, and its steps are described in detail in the following subsections. Here, a wolf is designated as a search agent.

Figure 2: Pseudo Code of the GWAC.

4.1 Agent Representation

Each agent is a sequence of real numbers representing the K cluster centers. For a d-dimensional space, the length of an agent is K×d. The first d positions represent the d dimensions of the first cluster center, the next d positions represent the second cluster center, and so on. Let us consider the following example.

Example 1. Let K=4 and d=3 (i.e. the number of clusters being considered is four and the space is three-dimensional). Then, the agent is a vector of length 4×3=12, in which the first three positions encode the first cluster center, the next three positions the second cluster center, and so on.
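A small sketch of this encoding (Python): the first center is the one reused in Example 2, while the remaining three centers are hypothetical values added purely for illustration.

```python
import numpy as np

K, d = 4, 3
# A hypothetical agent for Example 1: a flat vector of K*d = 12 reals.
# The first center (4.9, 3.6, 1.6) is the one reused in Example 2;
# the other three centers are made-up values for illustration only.
agent = np.array([4.9, 3.6, 1.6,
                  6.4, 2.9, 4.3,
                  5.8, 2.7, 5.1,
                  7.0, 3.2, 4.7])
centers = agent.reshape(K, d)   # row j is the center of cluster j
print(centers[0])               # -> [4.9 3.6 1.6]
```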

4.2 Population Initialization

The K cluster centers encoded in each agent are initialized to K randomly chosen data points from the given data set. This process is repeated for each of the N agents in the population, where N is the size of the population.
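A minimal sketch of this initialization step, assuming the data set is held in a NumPy array; the function and variable names are illustrative.

```python
import numpy as np

def init_population(data: np.ndarray, K: int, n_agents: int) -> np.ndarray:
    """Each agent encodes K cluster centers chosen as K distinct random data points."""
    n, d = data.shape
    population = np.empty((n_agents, K * d))
    for i in range(n_agents):
        idx = np.random.choice(n, size=K, replace=False)   # K random data points
        population[i] = data[idx].reshape(-1)
    return population
```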

4.3 Fitness Function Computation

The fitness function computation process consists of two main stages. In the first stage, the clusters are generated based on the cluster centers encoded in the agent. For this, each data point (say xi) is assigned to the cluster (say Cj) whose center has the smallest distance to it. Thereafter, each cluster center encoded in the agent is replaced by the mean of the data points of the respective cluster.

Example 2. The first cluster center in the agent considered in Example 1 is (4.9, 3.6, 1.6). Let us further consider that this cluster has four more data points besides itself [i.e. (5.3, 4.0, 1.1), (5.0, 3.9, 0.9), (4.7, 3.5, 1.2), and (5.5, 3.9, 0.7)]. Therefore, the new cluster center becomes [(4.9+5.3+5.0+4.7+5.5)/5, (3.6+4.0+3.9+3.5+3.9)/5, (1.6+1.1+0.9+1.2+0.7)/5]=(5.1, 3.8, 1.1).

The second stage is the computation of the fitness function, which is defined as the sum of the squared Euclidean distances between each data point and the center of the cluster to which it is allocated:

(13) $f(X, C) = \sum_{i=1}^{n} \min\{\lVert x_i - C_l \rVert^{2} \mid l = 1, 2, \ldots, K\}$
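The two stages of Section 4.3 can be sketched as follows (Python, our own illustrative implementation): assign points to the nearest encoded center, replace each center by the mean of its members, and then evaluate Equation (13).

```python
import numpy as np

def fitness(agent: np.ndarray, data: np.ndarray, K: int) -> float:
    """Sum of squared Euclidean distances (Equation (13)) after one
    assignment/center-update pass, as described in Section 4.3."""
    n, d = data.shape
    centers = agent.reshape(K, d).copy()
    # Stage 1a: assign each point to its nearest center.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # Stage 1b: replace each center by the mean of its assigned points.
    for j in range(K):
        members = data[labels == j]
        if len(members) > 0:
            centers[j] = members.mean(axis=0)
    agent[:] = centers.reshape(-1)     # write the updated centers back into the agent
    # Stage 2: sum of squared distances to the nearest (updated) center.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(np.min(dists, axis=1) ** 2))
```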

4.4 Position Updation

In this process, the current position of search agents is updated based on the location of the first three best search agents (α, β, and δ). The control parameter A is used for exploration and exploitation. Another parameter C is used to explore the search space and avoid the premature convergence. These parameters are computed using Equations (8) and (9). These control parameters are used to compute the new position of search agents using Equations (10)–(12).

Example 3. Let us assume that A1=–1.732, A2=–0.253, A3=1.623, C1=0.503, C2=1.081, C3=0.004. The current position of the first cluster center in the agent considered in Example 2 is (5.1, 3.8, 1.1). Let the first, second, and third best positions of first cluster center are (5.3, 3.9, 0.9), (5.2, 4.1, 1.2), and (5.0, 4.0, 1.0), respectively. Hence, the new Dα, Dβ, and Dδ become (2.434, 1.838, 0.647), (0.521, 0.632, 0.197), and (5.080, 3.784, 1.096), respectively. Based on these values, X1, X2, and X3 are (9.516, 7.083, 2.021), (5.332, 4.259, 1.249), and (–3.245, –2.141, –0.778), respectively. Therefore, the new position of first cluster becomes [(9.516+5.332–3.245)/3, (7.083+4.259–2.141)/3, (2.021+1.249–0.778)/3]=(3.867, 3.067, 0.831).
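The arithmetic of Example 3 can be checked with a few lines of NumPy (values copied from the example):

```python
import numpy as np

# Reproduces the arithmetic of Example 3 (values taken from the example).
A = np.array([-1.732, -0.253, 1.623])
C = np.array([0.503, 1.081, 0.004])
x = np.array([5.1, 3.8, 1.1])                    # current first cluster center
leaders = np.array([[5.3, 3.9, 0.9],             # alpha
                    [5.2, 4.1, 1.2],             # beta
                    [5.0, 4.0, 1.0]])            # delta

D = np.abs(C[:, None] * leaders - x)             # Equation (10)
X_new = leaders - A[:, None] * D                 # Equation (11)
# Equation (12); prints roughly [3.868 3.067 0.831], matching Example 3 up to rounding.
print(X_new.mean(axis=0).round(3))
```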

4.5 Termination Criterion

The fitness function computation, parameter update, and position update are executed for a maximum number of iterations. At the last iteration, the α agent provides the best solution to the clustering problem.

5 Complexity Analysis

In this section, the complexity analysis of GWAC is presented. Both the time and space complexities of the intermediate states of the proposed clustering technique are discussed.

5.1 Time Complexity

  1. The initialization of GWAC requires O(number_of_agents×string_len) time. Here, number_of_agents and string_len denote the number of grey wolves and the length of each encoded agent, respectively. The string_len is O(K×d), where d is the dimension of the data set and K is the number of clusters.

  2. The fitness computation is composed of three basic substeps.

    1. The assignment of data points to different clusters requires O(n²×K×d) time for each agent.

    2. The cluster center updation requires O(K×d) time.

    3. The time complexity of the fitness function is O(n×K×d).

    Step 2 is repeated for all agents [i.e. number_of_agents (say N) times], and the above-mentioned three substeps are performed in sequence one after the other. Hence, the total complexity of Step 2 is N×(n²×K×d + K×d + n×K×d) = O(N×n²×K×d).

  3. The position and cluster center updation steps of GWAC require O(string_len×N) each.

Therefore, summing up the complexities of all the above steps, and noting that Step 2 dominates, the total time complexity becomes O(N×K×n²×d) per generation. The total time complexity of GWAC for the maximum number of generations is O(N×K×n²×d×Max_gen), where Max_gen indicates the maximum number of generations.

5.2 Space Complexity

The space complexity of the GWAC technique is dominated by the storage of its search agents. Thus, the total space complexity of the GWAC technique is O(string_len×number_of_agents).

6 Experimental Results

This section describes the experimentation performed to evaluate the performance of the GWAC technique. The data sets used are described in Section 6.1. The results are evaluated and compared to well-known clustering techniques.

6.1 Data Sets Used

Two artificial and six real-life data sets with a variety of complexity are used to evaluate the performance of the proposed GWAC technique. The artificial data sets are Sph_4_3 and Sph_5_2, which are taken from [2]. The real-life data sets are Iris, Wine, Glass, Haberman, Bupa, and Contraceptive Method Choice (CMC), which are available in the UCI machine learning repository [3]. Table 1 summarizes the main characteristics of the used data sets.

Table 1

Main Characteristics of Data Sets Used.

Data set name | Number of instances | Number of features | Number of classes | Type
Sph_4_3 | 400 | 3 | 4 | Artificial
Sph_5_2 | 250 | 2 | 5 | Artificial
Iris | 150 | 4 | 3 | Real
Wine | 178 | 13 | 3 | Real
Glass | 214 | 9 | 6 | Real
Haberman | 306 | 3 | 2 | Real
Bupa | 345 | 6 | 2 | Real
CMC | 1473 | 9 | 3 | Real

6.2 Algorithms and Parameter Setting Used for Comparisons

The performance of the GWAC is compared to well-known and recent algorithms reported in the literature, including KM [16], GA-based clustering (GAC), harmony search-based clustering (HSC), modified HSC (MHSC), PSO-based clustering (PSOC), flower pollination algorithm-based clustering (FPAC), and bat algorithm-based clustering (BATC) [36]. The performance of the algorithms is evaluated and compared using four cluster quality measures.

The value of K, the number of clusters, for each data set equals the number of classes of the corresponding data set, as mentioned in Table 1. The population size and the maximum number of iterations for all algorithms are set to 30 and 500, respectively. The parameter settings of all the above-mentioned algorithms are as follows (they are also summarized in the sketch below). For GAC, the crossover and mutation probabilities are set to 0.8 and 0.01, respectively, as recommended by Maulik and Bandyopadhyay [24]. For HSC, the harmony memory consideration rate (HMCR) is taken as 0.9, the pitch adjustment rate (PAR) as 0.5, and the bandwidth (BW) as 0.01, as recommended by Geem et al. [9]. For MHSC, the HMCR is assigned the range [0.5, 0.95], the PAR the range [0.01, 0.99], and the BW the range [0.001, 0.1], as recommended by Kumar et al. [22]. For PSOC, the inertia weight (w) is taken as 0.72, the acceleration constants (ac1 and ac2) are set to 1.49, and the maximum velocity (Vmax) is set to 255, as mentioned by Omran et al. [27]. For FPAC, the switch probability is set to 0.8 as mentioned by Yang [37]. For BATC, the loudness and pulse emission rate are assigned the ranges [1, 2] and [0, 1], respectively, and the constant coefficients α and γ are set to 0.9, as mentioned by Yang [36].
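For reference, these settings can be collected in a single configuration sketch; the dictionary layout and key names are ours, not from the original implementation.

```python
# Parameter settings used for the comparison (from Section 6.2); values for the
# algorithms other than GWAC follow the cited recommendations.
params = {
    "common": {"population_size": 30, "max_iterations": 500},
    "GAC":  {"crossover_prob": 0.8, "mutation_prob": 0.01},
    "HSC":  {"HMCR": 0.9, "PAR": 0.5, "BW": 0.01},
    "MHSC": {"HMCR": (0.5, 0.95), "PAR": (0.01, 0.99), "BW": (0.001, 0.1)},
    "PSOC": {"inertia_weight": 0.72, "ac1": 1.49, "ac2": 1.49, "v_max": 255},
    "FPAC": {"switch_prob": 0.8},
    "BATC": {"loudness": (1, 2), "pulse_rate": (0, 1), "alpha": 0.9, "gamma": 0.9},
}
```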

6.3 Cluster Quality Measures

The performance of the algorithms is evaluated and compared using five criteria.

Sum of intracluster distances: The distance between each data point and the center of the corresponding cluster is calculated and summed up as defined in Equation (13). A smaller sum of intracluster distances indicates a higher quality of clustering. This sum is also used as the fitness function for all the algorithms.

The other four cluster quality measures are precision, recall, weighted average, and G-measure. These are mathematically defined for cluster j with respect to class i as follows:

(14) $\operatorname{Precision}(i, j) = \frac{N_{ij}}{N_j}$
(15) $\operatorname{Recall}(i, j) = \frac{N_{ij}}{N_i}$
(16) $\operatorname{WA}(i, j) = \frac{N_{ij}}{N_T}$
(17) $\operatorname{GM}(i, j) = \sqrt{\operatorname{Precision}(i, j) \times \operatorname{Recall}(i, j)}$

where Nij is the number of data points of class i in cluster j, Nj is the number of data points in cluster j, Ni is the number of data points of class i, and NT is the total number of data points. Larger values of these measures indicate better clustering. The results reported in the tables are the averages (with SDs) of more than 20 independent simulations.
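A possible implementation of Equations (14)–(17) for all (class, cluster) pairs is sketched below; the function name and output layout are ours.

```python
import numpy as np

def cluster_quality(true_labels: np.ndarray, cluster_labels: np.ndarray):
    """Precision, recall, weighted average, and G-measure (Equations (14)-(17))
    for every (class i, cluster j) pair."""
    classes, clusters = np.unique(true_labels), np.unique(cluster_labels)
    n_total = len(true_labels)
    out = {}
    for i in classes:
        for j in clusters:
            n_ij = np.sum((true_labels == i) & (cluster_labels == j))
            n_j = np.sum(cluster_labels == j)
            n_i = np.sum(true_labels == i)
            precision = n_ij / n_j if n_j else 0.0
            recall = n_ij / n_i if n_i else 0.0
            out[(i, j)] = {
                "precision": precision,                      # Equation (14)
                "recall": recall,                            # Equation (15)
                "weighted_average": n_ij / n_total,          # Equation (16)
                "g_measure": np.sqrt(precision * recall),    # Equation (17)
            }
    return out
```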

6.4 Results and Discussion

As shown in Table 2, the GWAC attained the best results among all the algorithms, except for the Glass data set. For the Sph_4_3 data set, the intracluster distance obtained from GWAC is 749.005, which is better than that of the other algorithms. For the Sph_5_2 data set, the proposed GWAC produced an intracluster distance of 326.882, which the other algorithms were unable to reach. For the Iris data set, the GWAC provided the best solution with a smaller SD compared to the other algorithms. For the Wine data set, the GWAC attained an intracluster distance of 16302.600, which is better than that of the other algorithms. For the Glass data set, the BATC outperformed the other algorithms. For the Haberman data set, the GWAC outperformed KM, GAC, HSC, MHSC, and BATC, while the results of PSOC and FPAC are comparable to the proposed GWAC. For the Bupa data set, the intracluster distance obtained from GWAC is 9894.950, which is better than that of the other existing algorithms. As seen from the results for the CMC data set, the GWAC is far superior to the other algorithms.

Table 2

Mean (SD) of Sum of Intracluster Distances Obtained by Algorithms on Different Data Sets.

Algorithm | Sph_4_3 | Sph_5_2 | Iris | Wine | Glass | Haberman | Bupa | CMC
KM | 1098.750 (16.8243) | 428.863 (21.1250) | 107.346 (33.684) | 18959.400 (4.3500) | 347.211 (8.8355) | 3404.222 (1.3017) | 10876.22 (54.1682) | 10733.44 (31.7927)
GAC | 1007.133 (40.7119) | 418.559 (14.6405) | 109.477 (3.9319) | 17294.670 (594.9260) | 244.003 (6.1477) | 2728.233 (51.2775) | 10850.110 (163.8875) | 6609.178 (182.978)
HSC | 994.812 (49.1828) | 405.034 (12.9078) | 109.897 (5.7997) | 17247.440 (468.8827) | 248.939 (2.7734) | 2736.633 (51.9819) | 10736.690 (186.1807) | 6472.578 (137.065)
MHSC | 866.123 (25.2086) | 412.004 (16.5759) | 107.449 (4.4011) | 16885.560 (103.0826) | 244.439 (5.3618) | 2680.189 (31.3732) | 10668.440 (275.9619) | 6339.056 (54.5441)
PSOC | 749.924 (0.0468) | 327.383 (0.6039) | 97.136 (0.1864) | 16387.600 (41.7272) | 217.468 (2.7698) | 2590.260 (15.2523) | 9975.190 (162.2825) | 5540.530 (2.9311)
FPAC | 750.914 (0.4182) | 327.571 (0.6672) | 97.041 (0.1972) | 16318.440 (7.9704) | 229.277 (2.9689) | 2590.222 (14.0841) | 9980.144 (32.2590) | 5600.856 (12.9108)
BATC | 749.728 (0.4375) | 333.199 (12.1413) | 96.658 (0.0055) | 16815.600 (421.8718) | 216.916 (2.3657) | 2592.290 (10.5465) | 10313.270 (228.2650) | 5634.910 (85.6383)
GWAC | 749.005 (0.6956) | 326.882 (0.0939) | 78.942 (0.0019) | 16302.600 (5.6999) | 249.089 (3.4333) | 2589.910 (8.4076) | 9894.950 (18.4049) | 5530.160 (4.6212)

Tables 3–10 show the comparison between the proposed GWAC approach and the above-mentioned techniques in terms of cluster quality measures for the Sph_4_3, Sph_5_2, Iris, Wine, Glass, Haberman, Bupa, and CMC data sets, respectively. For the artificial data sets, Sph_4_3 and Sph_5_2, the GWAC provides better cluster quality measures than the other clustering algorithms. For the real-life data sets, GWAC provides much better cluster quality than the other competitive algorithms for Iris, Wine, Glass, Haberman, and CMC. FPAC and BATC perform better than the proposed GWAC for the Bupa data set.

Table 3

Mean (SD) of Cluster Quality Measures for Sph_4_3 Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.4577 (0.1511) | 0.3088 (0.1600) | 0.3788 (0.2667) | 0.3688 (0.1131) | 0.4500 (0.3073) | 0.3611 (0.1816) | 0.3750 (0.1317) | 0.5495 (0.3284)
Recall | 0.4985 (0.1529) | 0.2988 (0.1710) | 0.3825 (0.2679) | 0.3510 (0.1103) | 0.4500 (0.3073) | 0.3611 (0.1816) | 0.3750 (0.1317) | 0.5495 (0.3284)
Weighted average | 0.4585 (0.1529) | 0.2988 (0.1710) | 0.3825 (0.2679) | 0.3513 (0.1102) | 0.4500 (0.3073) | 0.3611 (0.1816) | 0.3750 (0.1317) | 0.5495 (0.3284)
G-measure | 0.4518 (0.1453) | 0.2988 (0.1649) | 0.3785 (0.2649) | 0.3545 (0.1100) | 0.4500 (0.3073) | 0.3611 (0.1816) | 0.3750 (0.1317) | 0.5495 (0.3284)
Table 4

Mean (SD) of Cluster Quality Measures for Sph_5_2 Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.3043 (0.2422) | 0.1629 (0.1258) | 0.2894 (0.2123) | 0.3911 (0.1119) | 0.2396 (0.1698) | 0.2727 (0.1939) | 0.3102 (0.1477) | 0.3939 (0.0951)
Recall | 0.2964 (0.2423) | 0.1720 (0.1301) | 0.2776 (0.2035) | 0.3826 (0.1105) | 0.2480 (0.1681) | 0.2685 (0.2025) | 0.2880 (0.1570) | 0.3904 (0.0960)
Weighted average | 0.2964 (0.2423) | 0.1720 (0.1301) | 0.2776 (0.2035) | 0.3826 (0.1105) | 0.2480 (0.1681) | 0.2685 (0.2025) | 0.2880 (0.1570) | 0.3904 (0.0960)
G-measure | 0.3002 (0.2415) | 0.1677 (0.1149) | 0.2822 (0.2066) | 0.3812 (0.1064) | 0.2435 (0.1685) | 0.2703 (0.1979) | 0.2886 (0.1607) | 0.3891 (0.0958)
Table 5

Mean (SD) of Cluster Quality Measures for Iris Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.3018 (0.2584) | 0.3534 (0.3193) | 0.2428 (0.2782) | 0.4586 (0.3178) | 0.4299 (0.3482) | 0.4266 (0.1936) | 0.4396 (0.4159) | 0.5688 (0.2323)
Recall | 0.3020 (0.2569) | 0.3733 (0.3011) | 0.2393 (0.2669) | 0.4427 (0.3126) | 0.4460 (0.3316) | 0.4385 (0.1870) | 0.4360 (0.4090) | 0.5813 (0.2179)
Weighted average | 0.3020 (0.2569) | 0.3733 (0.3011) | 0.2393 (0.2669) | 0.4427 (0.3126) | 0.4460 (0.3316) | 0.4385 (0.1870) | 0.4360 (0.4090) | 0.5813 (0.2179)
G-measure | 0.1144 (0.2828) | 0.2046 (0.3735) | 0.1102 (0.2870) | 0.4082 (0.3454) | 0.4364 (0.3385) | 0.4319 (0.1893) | 0.4359 (0.4107) | 0.5682 (0.2208)
Table 6

Mean (SD) of Cluster Quality Measures for Wine Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.2718 (0.1365) | 0.2925 (0.2191) | 0.3569 (0.2554) | 0.4343 (0.2108) | 0.3851 (0.2002) | 0.4074 (0.1290) | 0.3872 (0.2381) | 0.4994 (0.1808)
Recall | 0.2742 (0.1409) | 0.3105 (0.2323) | 0.3579 (0.2312) | 0.4087 (0.1897) | 0.3757 (0.1805) | 0.3900 (0.1045) | 0.3773 (0.2233) | 0.4778 (0.1694)
Weighted average | 0.2691 (0.1376) | 0.3073 (0.2272) | 0.3629 (0.2337) | 0.4147 (0.1952) | 0.3781 (0.1812) | 0.3960 (0.0994) | 0.3905 (0.2263) | 0.4955 (0.1525)
G-measure | 0.0777 (0.2457) | 0.1082 (0.2445) | 0.2110 (0.3153) | 0.2519 (0.2940) | 0.3795 (0.1913) | 0.3629 (0.1286) | 0.3807 (0.2296) | 0.4878 (0.1679)
Table 7

Mean (SD) of Cluster Quality Measures for Glass Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.1573 (0.1146) | 0.1086 (0.0466) | 0.1976 (0.1035) | 0.2411 (0.1266) | 0.1441 (0.0930) | 0.1577 (0.0755) | 0.2043 (0.1321) | 0.2921 (0.0937)
Recall | 0.1363 (0.1083) | 0.1158 (0.0661) | 0.1889 (0.0741) | 0.2510 (0.1176) | 0.1629 (0.1005) | 0.1538 (0.0645) | 0.2016 (0.1252) | 0.2621 (0.0818)
Weighted average | 0.1579 (0.1443) | 0.1139 (0.0599) | 0.2019 (0.0987) | 0.3152 (0.0882) | 0.1594 (0.1359) | 0.1667 (0.0931) | 0.1882 (0.1137) | 0.3483 (0.1116)
G-measure | 0.1413 (0.0736) | 0.1624 (0.0959) | 0.1319 (0.0987) | 0.1927 (0.1044) | 0.1408 (0.0902) | 0.1409 (0.0596) | 0.1919 (0.1201) | 0.2511 (0.0779)
Table 8

Mean (SD) of Cluster Quality Measures for Haberman Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.4972 (0.0141) | 0.4995 (0.0181) | 0.4955 (0.0226) | 0.4996 (0.0157) | 0.4981 (0.0144) | 0.4983 (0.0158) | 0.4941 (0.0106) | 0.5042 (0.0121)
Recall | 0.4964 (0.0179) | 0.4994 (0.0229) | 0.4953 (0.0279) | 0.4994 (0.0189) | 0.4977 (0.0184) | 0.4978 (0.0206) | 0.4925 (0.0134) | 0.5054 (0.0154)
Weighted average | 0.4836 (0.0085) | 0.4996 (0.0397) | 0.5088 (0.0501) | 0.5079 (0.0259) | 0.4977 (0.0171) | 0.4978 (0.0206) | 0.4994 (0.0216) | 0.5090 (0.0208)
G-measure | 0.4556 (0.0161) | 0.4647 (0.0226) | 0.4596 (0.0340) | 0.4602 (0.0209) | 0.4756 (0.0105) | 0.4764 (0.0241) | 0.4703 (0.0090) | 0.4768 (0.0148)
Table 9

Mean (SD) of Cluster Quality Measures for Bupa Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.5000 (0.0132) | 0.4787 (0.0463) | 0.5078 (0.0490) | 0.5119 (0.0465) | 0.5138 (0.0421) | 0.5411 (0.0672) | 0.5243 (0.0643) | 0.4944 (0.0256)
Recall | 0.5000 (0.0082) | 0.4892 (0.0271) | 0.5015 (0.0266) | 0.5054 (0.0195) | 0.5078 (0.0221) | 0.5253 (0.0412) | 0.5242 (0.0570) | 0.4978 (0.0102)
Weighted average | 0.5000 (0.0443) | 0.5088 (0.0372) | 0.4757 (0.0316) | 0.5023 (0.0308) | 0.5082 (0.0342) | 0.5027 (0.0077) | 0.5232 (0.0301) | 0.5101 (0.0548)
G-measure | 0.4302 (0.0119) | 0.3839 (0.0514) | 0.4108 (0.0362) | 0.4324 (0.0491) | 0.4684 (0.0124) | 0.5090 (0.0018) | 0.5032 (0.0601) | 0.4100 (0.0191)
Table 10

Mean (SD) of Cluster Quality Measures for CMC Data Set.

Measure | KM | GAC | HSC | MHSC | PSOC | FPAC | BATC | GWAC
Precision | 0.3432 (0.0417) | 0.3378 (0.0599) | 0.3559 (0.0136) | 0.3615 (0.0185) | 0.3502 (0.0589) | 0.3090 (0.0806) | 0.3166 (0.0863) | 0.3620 (0.0417)
Recall | 0.3320 (0.0444) | 0.3372 (0.0570) | 0.3520 (0.0144) | 0.3569 (0.0184) | 0.3472 (0.0514) | 0.3124 (0.0748) | 0.3186 (0.0827) | 0.3614 (0.0347)
Weighted average | 0.3470 (0.0463) | 0.3294 (0.0559) | 0.3671 (0.0064) | 0.3673 (0.0089) | 0.3471 (0.0519) | 0.3095 (0.0735) | 0.3166 (0.0833) | 0.3678 (0.0311)
G-measure | 0.3139 (0.0569) | 0.3196 (0.0588) | 0.3352 (0.0218) | 0.3459 (0.0223) | 0.3440 (0.0535) | 0.3061 (0.0759) | 0.3134 (0.0842) | 0.3538 (0.0345)

6.5 Test for Statistical Significance

Unpaired t tests are used to test whether the results obtained from the best performing algorithm differ from those of the second best performing algorithm in a statistically significant way. We have taken 20 as the sample size for the unpaired t tests. Table 11 depicts the results of the unpaired t tests based on the sum of intracluster distances. For all the data sets except Haberman, the differences are statistically significant.
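The test itself is the standard unpaired (two-sample) t test; a sketch using SciPy follows. The run values below are randomly generated placeholders, not the paper's actual per-run results.

```python
import numpy as np
from scipy import stats

# Hypothetical example: 20 intracluster-distance values per algorithm
# (one per independent run); illustrative placeholders only.
rng = np.random.default_rng(0)
gwac_runs = rng.normal(749.005, 0.6956, size=20)
batc_runs = rng.normal(749.728, 0.4375, size=20)

res = stats.ttest_ind(gwac_runs, batc_runs)   # unpaired t test
print(res.statistic, res.pvalue)              # t value and two-tailed P
```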

Table 11

Unpaired t Test Between the Best and the Second Best Performing Algorithms for Each Data Set Based on the Sum of Intracluster Distances.

Data set | Standard error | t | 95% Confidence interval | Two-tailed P | Significance
Sph_4_3 | 0.184 | 3.9347 | –1.094978 to –0.351022 | 0.0003 | Extremely significant
Sph_5_2 | 0.137 | 3.6661 | –0.777651 to –0.224349 | 0.0008 | Extremely significant
Iris | 0.001 | 13615.61 | –17.718634 to –17.713366 | <0.0001 | Extremely significant
Wine | 2.191 | 7.2293 | –20.275598 to –11.404402 | <0.0001 | Extremely significant
Glass | 0.932 | 34.5089 | –34.060366 to –30.285634 | <0.0001 | Extremely significant
Haberman | 3.668 | 0.0851 | –7.736997 to 7.112997 | 0.9327 | Not significant
Bupa | 36.520 | 2.1971 | –154.171074 to –6.308926 | 0.0342 | Significant
CMC | 1.224 | 8.4746 | –12.847169 to –7.892831 | <0.0001 | Extremely significant

7 Application to Gene Expression Data

The proposed GWAC technique has been applied to gene expression data. Three real-life data sets, namely Yeast Sporulation [5], Human Fibroblast Serum [15], and Rat CNS [34], are used for experimentation. The performance of the proposed GWAC is compared to two well-known gene expression data clustering methods, namely iterative fuzzy C-means (IFCM) and average linkage. The silhouette index (SC) is used for evaluating the performance of the clustering algorithms.
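The silhouette index can be computed, for example, with scikit-learn as sketched below; the data and labels here are random placeholders, not the actual gene expression results.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical usage: `expression` is a genes-by-conditions matrix and
# `labels` the cluster assignment produced by a clustering run (e.g. GWAC).
expression = np.random.rand(200, 7)              # placeholder data
labels = np.random.randint(0, 6, size=200)       # placeholder labels, K = 6
sc = silhouette_score(expression, labels, metric="euclidean")
print(f"Silhouette index (SC): {sc:.4f}")
```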

Table 12 shows the SC values for the IFCM, average linkage, and GWAC algorithms. For the Yeast Sporulation data set, the proposed GWAC provides the best value of SC compared to the other clustering algorithms. The proposed GWAC outperforms all the other algorithms for the Human Fibroblast Serum and Rat CNS data sets in terms of SC value.

Table 12

SC Values for Real-Life Gene Expression Data Sets.

Algorithm | Yeast Sporulation (K / SC) | Human Fibroblast Serum (K / SC) | Rat CNS (K / SC)
IFCM | 7 / 0.4755 | 8 / 0.2995 | 5 / 0.4050
Average linkage | 6 / 0.5007 | 6 / 0.3092 | 6 / 0.3684
GWAC | 6 / 0.5162 | 6 / 0.3919 | 6 / 0.4165

For the purpose of illustration, the cluster profile plots for the clustering solution found using GWAC on the Yeast Sporulation data set are shown in Figure 3. The quality of clustering can be evaluated by observing the shapes of the patterns and the tightness of the bundles. As Figure 3 shows, the patterns of the six clusters are very different. Figures 4 and 5 show the cluster profile plots of the clustering solutions found by GWAC on the Human Fibroblast Serum and Rat CNS data sets, respectively.

Figure 3: Cluster Profile Plots of Yeast Sporulation Data Clustered Using GWAC.

Figure 4: Cluster Profile Plots of Human Fibroblast Serum Data Clustered Using GWAC.

Figure 5: Cluster Profile Plots of Rat CNS Data Clustered Using GWAC.

7.1 Statistical Significance on Gene Expression Data Sets

To establish the statistical significance of the proposed approach on the gene expression data sets, unpaired t tests are used. The sample size for the unpaired t tests is set to 20. Table 13 depicts the results of the unpaired t tests based on the SC values mentioned in Table 12. For all the data sets except Yeast Sporulation, the improvement of the proposed GWAC over the second best algorithm is statistically significant.

Table 13

Unpaired t Test Between the Best and the Second Best Performing Algorithms for Each Gene Expression Data Set Based on SC.

Data set | Standard error | t | 95% Confidence interval | Two-tailed P | Significance
Yeast Sporulation | 0.011 | 1.4197 | –0.037602 to 0.006602 | 0.1639 | Not significant
Human Fibroblast Serum | 0.009 | 9.006 | –0.101300 to –0.064099 | <0.0001 | Extremely significant
Rat CNS | 0.015 | 3.2834 | –0.077756 to –0.018444 | 0.0022 | Statistically significant

8 Conclusions

The hunting and search behavior of social animals such as grey wolves has become an emerging area of swarm intelligence. Grey wolves follow a strict leadership hierarchy. In this paper, a GWA-based technique was developed to solve the clustering problem. The GWAC was applied to data clustering when the number of clusters is known a priori. It was compared to other algorithms, namely KM, GAC, HSC, MHSC, PSOC, FPAC, and BATC, and tested on eight data sets. The experimental results revealed that GWAC outperformed the other algorithms on most of the data sets. Statistical analysis also demonstrated that the improvement of the proposed GWAC is statistically significant. The GWAC has further been applied to gene expression data, where it outperforms the other well-known and recently developed techniques.


Corresponding author: Vijay Kumar, Computer Science and Engineering Department, Thapar University, Patiala, Punjab, India, e-mail:

Bibliography

[1] B. Amiri, L. Hossain and S. E. Mosavi, Applications of harmony search algorithm on clustering, in: Proceedings of the World Congress on Engineering and Computer Science, pp. 460–465, 2010.

[2] S. Bandyopadhyay and U. Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in RN, Inform. Sci. 146 (2002), 221–237. doi: 10.1016/S0020-0255(02)00208-6.

[3] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning, 1998. http://www.ics.uci.edu/_mlearn/databases/.

[4] P. Brucker, On the complexity of clustering problems, in: Optimization and Operations Research, Lecture Notes in Economics and Mathematical Systems, M. Beckmenn and H. P. Kunzi, eds., vol. 157, pp. 45–54, Springer-Verlag, Berlin, Germany, 1978. doi: 10.1007/978-3-642-95322-4_5.

[5] S. Chu and I. Herskowitz, The transcriptional program of sporulation in budding yeast, Science 282 (1998), 699–705. doi: 10.1126/science.282.5389.699.

[6] S. Das, A. Abraham and A. Konar, Automatic clustering using an improved differential evolution algorithm, IEEE T Syst Man Cy A 38 (2008), 218–237. doi: 10.1109/TSMCA.2007.909595.

[7] M. Fathian, B. Amiri and A. Maroosi, Application of honey bee mating optimization algorithm on clustering, Appl. Math. Comput. 190 (2007), 1502–1513. doi: 10.1016/j.amc.2007.02.029.

[8] E. W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21 (1965), 768–769.

[9] Z. W. Geem, J.-H. Kim and G. V. Loganathan, A new heuristic optimization algorithm: harmony search, Simulation 76 (2001), 60–68. doi: 10.1177/003754970107600201.

[10] T. Hassanzadeh and M. R. Meybodi, A new hybrid approach for data clustering using firefly algorithm and K-means, in: CSI International Symposium on Artificial Intelligence and Signal Processing, pp. 7–11, Shiraz, Fars, Iran, 2012. doi: 10.1109/AISP.2012.6313708.

[11] A. Hatamlou, Black hole: a new heuristic optimization approach for data clustering, Inform. Sci. 222 (2013), 175–184. doi: 10.1016/j.ins.2012.08.023.

[12] A. Hatamlou and M. Hatamlou, Hybridization of the gravitational search algorithm and Big Bang-Big Crunch algorithm for data clustering, Fund. Inform. 126 (2013), 319–333. doi: 10.3233/FI-2013-884.

[13] A. Hatamlou, S. Abdullah and H. Nezamabadi-pour, Application of gravitational search algorithm on data clustering, in: Rough Set and Knowledge Technology, Lecture Notes in Computer Science, J. Yao, S. Ramanna, G. Wang, and Z. Suraj, eds., vol. 6954, pp. 337–346, Springer, 2011. doi: 10.1007/978-3-642-24425-4_44.

[14] A. Hatamlou, S. Abdullah and M. Hatamlou, Data clustering using Big Bang-Big Crunch algorithm, in: Innovative Computing Technology, Communications in Computer and Information Science, P. Pichappan, H. Ahmadi, and E. Ariwa, eds., pp. 383–388, Springer, 2011. doi: 10.1007/978-3-642-27337-7_36.

[15] V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler and J. C. F. Lee, The transcriptional program in the response of human fibroblasts to serum, Science 283 (1999), 83–87. doi: 10.1126/science.283.5398.83.

[16] A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recogn. Lett. 31 (2010), 651–666. doi: 10.1007/978-3-540-87479-9_3.

[17] Y.-T. Kao, E. Zahara and I. W. Kao, A hybridized approach to data clustering, Expert Syst. Appl. 34 (2008), 1754–1762. doi: 10.1016/j.eswa.2007.01.028.

[18] D. Karaboga and B. Basturk, On the performance of artificial bee colony algorithm, Appl. Soft Comput. 8 (2008), 687–697. doi: 10.1016/j.asoc.2007.05.007.

[19] L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John Wiley & Sons, New York, 1990. doi: 10.1002/9780470316801.

[20] K. Krishna and M. N. Murty, Genetic K-means algorithm, IEEE T Syst Man Cy B 29 (1999), 433–439. doi: 10.1109/3477.764879.

[21] V. Kumar, J. K. Chhabra and D. Kumar, Automatic MRI brain image segmentation using gravitational search based clustering technique, in: Recent Advances in Computer Vision and Image Processing: Methodologies and Applications, R. Srivastava, S. K. Singh and K. K. Shukla, eds., pp. 313–326, IGI Global, USA, 2014. doi: 10.4018/978-1-4666-4558-5.ch015.

[22] V. Kumar, J. K. Chhabra and D. Kumar, Parameter adaptive harmony search algorithm for unimodal and multimodal optimization problems, J. Comput. Sci. 5 (2014), 144–155. doi: 10.1016/j.jocs.2013.12.001.

[23] V. Kumar, J. K. Chhabra and D. Kumar, Variance based harmony search algorithm for unimodal and multimodal optimization problems with application to clustering, Cybernet. Syst. 5 (2014), 486–511. doi: 10.1080/01969722.2014.929349.

[24] U. Maulik and S. Bandyopadhyay, Genetic algorithm based clustering technique, Pattern Recogn. 33 (2000), 1455–1465. doi: 10.1016/S0031-3203(99)00137-5.

[25] S. Mirjalili, S. M. Mirjalili and A. Lewis, Grey wolf optimizer, Adv. Eng. Softw. 69 (2014), 46–61. doi: 10.1016/j.advengsoft.2013.12.007.

[26] C. A. Murthy and N. Chowdhury, In search of optimal clusters using genetic algorithms, Pattern Recogn. Lett. 17 (1996), 825–832. doi: 10.1016/0167-8655(96)00043-8.

[27] M. G. H. Omran, A. P. Engelbrecht and A. Salman, Dynamic clustering using particle swarm optimization with application in image segmentation, Pattern Anal. Appl. 8 (2006), 332–344. doi: 10.1007/s10044-005-0015-5.

[28] I. B. Saida, K. Nadjet and B. Omar, A new algorithm for data clustering based on cuckoo search optimization, in: Genetic and Evolutionary Computing, Advances in Intelligent Systems and Computing, J.-S. Pan, P. Kromer and V. Snasel, eds., vol. 238, pp. 55–64, Springer, 2014. doi: 10.1007/978-3-319-01796-9_6.

[29] S. C. Satapathy and A. Naik, Data clustering based on teaching learning-based optimization, in: Swarm, Evolutionary, and Memetic Computing, Lecture Notes in Computer Science, B. K. Panigrahi, P. N. Suganthan, S. Das and S. C. Satapathy, eds., vol. 7077, pp. 148–156, Springer, 2011. doi: 10.1007/978-3-642-27242-4_18.

[30] S. Z. Selim and K. Al-Sultan, A simulated annealing algorithm for the clustering problem, Pattern Recogn. 24 (1991), 1003–1008. doi: 10.1016/0031-3203(91)90097-O.

[31] S. Z. Selim and M. A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE T Pattern Anal. 6 (1984), 81–87. doi: 10.1109/TPAMI.1984.4767478.

[32] P. S. Shelokar, V. K. Jayaraman and B. D. Kulkarni, An ant colony approach for clustering, Anal. Chim. Acta 509 (2004), 187–195. doi: 10.1016/j.aca.2003.12.032.

[33] C. S. Sung and H. W. Jin, A tabu search-based heuristic for clustering, Pattern Recogn. 33 (2000), 849–858. doi: 10.1016/S0031-3203(99)00090-4.

[34] X. Wen, Large-scale temporal gene expression mapping of central nervous system development, Proceedings of the National Academy of Sciences of the United States of America 95 (1998), 334–339. doi: 10.1073/pnas.95.1.334.

[35] R. Xu and D. C. Wunsch II, Clustering, John Wiley and Sons, USA, 2009. doi: 10.1002/9780470382776.

[36] X.-S. Yang, A new metaheuristic bat-inspired algorithm, in: Nature Inspired Cooperative Strategies for Optimization, Studies in Computational Intelligence, J. R. Gonzalez et al., eds., vol. 284, pp. 65–74, Springer, Berlin, 2010. doi: 10.1007/978-3-642-12538-6_6.

[37] X.-S. Yang, Flower pollination algorithm for global optimization, in: Unconventional Computation and Natural Computation, Lecture Notes in Computer Science, J. Durand-Lose and N. Jonoska, eds., vol. 7445, pp. 240–249, Springer, 2012. doi: 10.1007/978-3-642-32894-7_27.

[38] C. Zhang, D. Ouyang and J. Ning, An artificial bee colony approach for clustering, Expert Syst. Appl. 37 (2010), 4761–4767. doi: 10.1016/j.eswa.2009.11.003.

Received: 2014-9-23
Published Online: 2016-2-4
Published in Print: 2017-1-1

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
