Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm

Abstract: Clustering, a primary data mining method, has many applications, including gene analysis. In a clustering study, which is an unsupervised learning problem, a set of unlabeled data is divided into clusters using the data features. Data in a cluster are more similar to one another than to those in other clusters. However, the number of clusters has a direct impact on how well the K-means algorithm performs. To find the best solutions for such real-world optimization problems, techniques that properly explore the search space are necessary. In this research, an enhancement of K-means clustering is proposed by applying an equilibrium optimization approach. The suggested approach adjusts the number of clusters while simultaneously choosing the best attributes to find the optimal solution. The findings establish the usefulness of the suggested method in comparison with existing algorithms in terms of intra-cluster distances and Rand index on five datasets. The comparison of the proposed method with the traditional methods shows that the proposal is better both in the distances of the elements within the same cluster and in the Rand index. In conclusion, the suggested technique can be successfully employed for data clustering and can offer significant support.


Introduction
Data clustering, which is the classification of an unlabeled dataset into clusters of comparable items, is one of the most crucial and extensively used data analysis techniques [1,2]. One cluster is made up of objects that are comparable to each other but not to objects from other clusters [3,4]. Clustering has been applied in various fields of science and engineering, including web mining, text mining, image processing, stock prediction, signal processing, biology, and others [5,6]. In general, clustering techniques can be categorized into hierarchical algorithms and partitional algorithms. A well-known class of partitional clustering algorithms is the K-means method, which is simple to construct and efficient in most applications [7,8]. However, the effectiveness of the K-means algorithm hinges on determining K, the number of clusters, and on the initial centroids, which may cause it to become trapped in local optima [9-11].
In recent years, the problem of grouping high-dimensional data has become apparent. High-dimensional data clustering is the process of performing cluster analysis on datasets whose dimensionality ranges from a few dozen to thousands of dimensions. Such data presents unique obstacles for clustering algorithms that necessitate specific solutions [12]. Standard similarity measures, as used in traditional clustering techniques, are frequently not meaningful for high-dimensional data [13,14].
In order to solve classification problems, one of the key strategies is feature selection, which relies on selecting a subset of the initial data that significantly affects the clustering process. Many datasets contain misleading and redundant characteristics, which take a long time to evaluate during an exhaustive search of the solution space and confuse the classification process. Keeping only a portion of the essential characteristics while eliminating unnecessary features can both enhance clustering accuracy and increase computational efficiency [15,16]. A number of nature-inspired and swarm intelligence-based algorithms, such as the genetic algorithm [17], bat algorithm [18], particle swarm optimization [19], gray wolf optimization [20], and grasshopper algorithm [21], have been introduced over the past few years. Studies on these optimization techniques have shown encouraging results, and they can be used to handle a number of optimization issues, including clustering problems. In 2019, Wu et al. presented a standardized dimensionality reduction model, which combines the K-means clustering algorithm with linear trace ratio analysis to find the effective projection direction. The proposed model is suitable for supervised, semi-supervised, and unsupervised applications, in contrast to existing dimensionality reduction strategies. An effective and detailed optimization technique is presented for determining the best projection matrix W [22]. In 2020, Chen et al. proposed a hybrid algorithm that combines the K-means clustering algorithm with the quantum-inspired ant lion optimization algorithm, combining the advantages of quantum computing and swarm intelligence algorithms to improve the clustering algorithm. The proposed algorithm was tested on many standard datasets, and it was concluded that it can be used efficiently for data clustering and intrusion detection [23]. In 2021, Al-Thanoon et al. improved the binary crow search algorithm (BCSA) for selecting features in big data. The proposed improvement uses the concept of opposition-based learning to define the flight-length parameter. The experimental results show that the features selected under the proposed modification are more representative and improve classification accuracy and computation time. In general, the proposed algorithm outperformed the traditional and other well-known algorithms on the two datasets [15]. Also in 2021, Al-kababchee et al. proposed a binary bat algorithm with penalized regression-based clustering. The experimental results and statistical analysis on three chemical datasets demonstrated that the proposed BPRBC outperforms PRBC and K-means in terms of purity and computational time [24].
The most important problem in cluster analysis is finding the number of clusters. Therefore, in our proposal, the equilibrium optimizer algorithm is used to find the appropriate number of clusters. Since high-dimensional data are used, the proposed algorithm also selects the most informative features: a clustering algorithm dealing with a large number of dimensions makes the distance computation challenging, so our aim was to adapt the clustering algorithm to overcome these challenges.
In this article, we discuss the use of a hybrid algorithm for cluster analysis that is based on the equilibrium optimizer algorithm and the K-means algorithm [16,18,19]. The suggested method's effectiveness has been evaluated on a number of genuine, standard datasets from the UCI repository [24], and the outcomes have been contrasted with those of alternative methods. In short, the main contributions can be listed as:
• An effective strategy is suggested based on the improvement of K-means clustering through the use of an equilibrium optimization strategy.
• A thorough methodology for feature selection is given that fully exploits feature interaction and prior information while capturing high-order connectivity between features using K-means clustering.
• K-means clustering is improved by optimally adjusting the number of clusters.
The remainder of this article is structured as follows: a brief introduction to clustering problems and the K-means algorithm is given in Section 2. Our suggested algorithm for solving data clustering problems using the equilibrium optimizer algorithm and the K-means algorithm is presented in Section 3. Section 4 discusses the experimental findings and comparisons with other available techniques. Finally, Section 5 summarizes the study's findings and offers suggestions for further research.

K-means clustering
The process of grouping a set of data objects into clusters is known as data clustering, and it is one of the most essential and common data analysis approaches [25]: objects in the same cluster are comparable to each other but different from objects in other clusters [4,26]. Let $X = \{X_1, X_2, \ldots, X_K\}$ be a collection of clusters and $R = \{Y_1, Y_2, \ldots, Y_N\}$ be a set of data items that need to be clustered, where $Y_i \in \mathbb{R}^D$. During clustering, each data point in set $R$ is assigned to one of the $K$ clusters so as to minimize the intra-cluster variance objective function, the sum of the squared Euclidean distances between each object $Y_i$ and the center $x_j$ of its cluster $X_j$ [27]:

$$J = \sum_{j=1}^{K} \sum_{Y_i \in X_j} \| Y_i - x_j \|^2. \tag{1}$$

The partitional clustering algorithm K-means divides real-valued data vectors for continuous data into a preset number of clusters [28]. Consider a partition $P_c$ of a dataset with $N$ data patterns, each represented by a vector $x_j$. Clusters are produced in K-means using a dissimilarity measure, the Euclidean distance

$$d(x_j, g_c) = \| x_j - g_c \|. \tag{2}$$

For each iteration (until a maximum number of K-means iterations $t_{\max}$ is reached or another halting condition is met), a new cluster centroid vector is created for each cluster as the mean of its current data vectors (i.e., the data patterns currently assigned to the cluster). The new partition is then created, with each pattern being assigned to the cluster with the closest centroid [29].
The number of patterns associated with cluster $c$ is $N_c$, and its centroid is the mean of its patterns,

$$g_c = \frac{1}{N_c} \sum_{x_j \in c} x_j. \tag{3}$$

The criterion function for K-means is given by the within-cluster sum of squares in equation (4) [29]:

$$J_c = \sum_{c=1}^{C} \sum_{x_j \in c} \| x_j - g_c \|^2. \tag{4}$$
Algorithm 1 shows the pseudocode of the K-means algorithm [29]:
1. For $t \leftarrow 0$, put $C$ patterns randomly as the initial cluster centroids.
2. Assign each pattern $x_j$ to its closest cluster.
3. While $t < t_{\max}$ and the halting condition is not met:
4. $t \leftarrow t + 1$.
5. New centroids are determined for each cluster $c$; centroid $g_c$ is updated using equation (3).
6. Determine the new partition by assigning each data pattern $x_j$ to the cluster with the nearest centroid $g_c$.

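To make Algorithm 1 concrete, the following is a minimal Python (NumPy) sketch of the procedure; the function name, the convergence test, and the empty-cluster handling are illustrative choices, not taken from the original paper.

```python
import numpy as np

def k_means(X, C, t_max=100, seed=0):
    """Minimal K-means: X is an (N, D) data matrix, C the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: put C patterns randomly as the initial cluster centroids.
    centroids = X[rng.choice(len(X), size=C, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for t in range(t_max):
        # Steps 2 and 6: assign each pattern to the cluster with the nearest
        # centroid, using the Euclidean distance of equation (2).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # halting condition: the partition no longer changes
        labels = new_labels
        # Step 5: update each centroid as the mean of its current patterns
        # (equation (3)); an empty cluster keeps its previous centroid.
        for c in range(C):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    # Criterion function: within-cluster sum of squares, equation (4).
    wcss = sum(((X[labels == c] - centroids[c]) ** 2).sum() for c in range(C))
    return labels, centroids, wcss
```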
Equilibrium optimizer (EO) algorithm
The inspiration for the EO algorithm came from a well-mixed dynamic mass balance on a control volume, which employs a mass balance equation to describe the concentration of a nonreactive constituent in a control volume as a function of its various source and sink processes. The mass balance equation gives the fundamental physics of mass conservation, accounting for mass coming into, going out of, and being generated in a control volume. The universal mass-balance equation is represented by a first-order ordinary differential equation [30], which states that the change in mass over time is equal to the mass entering the system plus the mass generated inside the system minus the mass exiting the system [31]:

$$V \frac{dC}{dt} = Q C_{\mathrm{eq}} - Q C + G, \tag{5}$$

where $C$ is the concentration inside the control volume $V$, $V \frac{dC}{dt}$ is the rate of change of mass in the control volume, $Q$ is the volumetric flow rate into and out of the control volume, $G$ is the rate of mass generation inside the control volume, and $C_{\mathrm{eq}}$ is the concentration at an equilibrium state where there is no generation. A stable equilibrium is attained when $\frac{dC}{dt} = 0$. The quantity $\lambda = Q/V$ is the inverse of the residence time, also known as the turnover rate. Equation (5) can be rearranged to solve for the concentration in the control volume $C$ as a function of time $t$:

$$\frac{dC}{\lambda C_{\mathrm{eq}} - \lambda C + G/V} = dt. \tag{6}$$
The integration of equation (6) over time is shown in equation (7):

$$\int_{C_0}^{C} \frac{dC}{\lambda C_{\mathrm{eq}} - \lambda C + G/V} = \int_{t_0}^{t} dt. \tag{7}$$
This results in

$$C = C_{\mathrm{eq}} + (C_0 - C_{\mathrm{eq}}) F + \frac{G}{\lambda V} (1 - F), \tag{8}$$

where $F$ is calculated as

$$F = e^{-\lambda (t - t_0)}, \tag{9}$$

and $t_0$ and $C_0$, which depend on the integration interval, are the initial start time and concentration, respectively. Equation (8) can be used either to estimate the concentration in the control volume with a known turnover rate or to determine the average turnover rate by a simple linear regression with a known generation rate and other factors. Like most meta-heuristic algorithms, EO starts the optimization process from an initial population. The search space is initialized with uniform random values, with the starting concentrations constructed according to the dimensions and number of particles:

$$C_i^{\mathrm{initial}} = C_{\min} + \mathrm{rand}_i \, (C_{\max} - C_{\min}), \quad i = 1, 2, \ldots, n, \tag{10}$$
where $C_{\min}$ and $C_{\max}$ stand for the minimum and maximum values of the dimensions, $\mathrm{rand}_i$ is a random vector in the range [0, 1], $n$ is the population's total number of particles, and $C_i^{\mathrm{initial}}$ is the starting concentration vector of the $i$-th particle. To identify the candidates for equilibrium, particles are sorted after their fitness function is evaluated.
The algorithm finally converges to its equilibrium state, which is intended to represent the globally optimal state. Because the equilibrium state is unknown at the beginning of the optimization process, only equilibrium candidates are chosen to create a search pattern for the particles. With fewer than four candidates, the approach performs worse on multimodal and composition functions but better on unimodal functions; with more than four candidates, the reverse occurs. The equilibrium pool is therefore created from five particles, which are proposed as the equilibrium candidates:
$$C_{\mathrm{eq,pool}} = \{C_{\mathrm{eq}1}, C_{\mathrm{eq}2}, C_{\mathrm{eq}3}, C_{\mathrm{eq}4}, C_{\mathrm{eq(ave)}}\}. \tag{11}$$

In each iteration, each particle updates its concentration using a candidate selected at random from the pool, each candidate with the same probability.
The exponential term $F$ comes next, contributing to the main concentration updating rule. A precise definition of this term helps EO strike a fair balance between exploration and exploitation. Because the turnover rate in a real control volume can change with time, $\lambda$ is assumed to be a random vector in the range [0, 1]:

$$F = e^{-\lambda (t - t_0)}, \tag{12}$$
where time $t$ is a function of the iteration counter (Iter) and therefore decreases as the number of iterations grows:

$$t = \left(1 - \frac{\mathrm{Iter}}{\mathrm{Max\_iter}}\right)^{\left(a_2 \frac{\mathrm{Iter}}{\mathrm{Max\_iter}}\right)}, \tag{13}$$
where $a_2$ is a constant used to control the exploitation ability, and Iter and Max_iter represent the current and maximum number of iterations, respectively. To ensure convergence by slowing down the search speed while enhancing the algorithm's capability for exploration and exploitation, $t_0$ is defined as

$$t_0 = \frac{1}{\lambda} \ln\!\left(-a_1 \,\mathrm{sign}(r - 0.5)\left[1 - e^{-\lambda t}\right]\right) + t, \tag{14}$$

where $a_1$ is a fixed value that governs the capacity for exploration: the greater $a_1$, the higher the exploration capability and the lower the exploitation performance.
Equation (15) shows the revised version of equation (12) after substituting equation (14) into equation (12):

$$F = a_1 \,\mathrm{sign}(r - 0.5)\left[e^{-\lambda t} - 1\right]. \tag{15}$$

One of the most crucial components of the algorithm, which leads to correct solutions by boosting the exploitation phase, is the generation rate. As an illustration, the following general model defines the generation rate as a first-order exponential decay process:

$$G = G_0 \, e^{-k (t - t_0)}, \tag{16}$$

where $G_0$ denotes the initial value and $k$ denotes the decay constant. Assuming $k = \lambda$ and reusing the previously derived exponential term, the final set of generation rate formulae is

$$G = G_0 F, \tag{17}$$
$$G_0 = \mathrm{GCP}\left(C_{\mathrm{eq}} - \lambda C\right), \tag{18}$$
$$\mathrm{GCP} = \begin{cases} 0.5\, r_1, & r_2 \geq \mathrm{GP} \\ 0, & r_2 < \mathrm{GP}, \end{cases} \tag{19}$$

where the random numbers $r_1$ and $r_2$ are in the range [0, 1] and the GCP vector is created by repeating the same value obtained from equation (19). GCP, the Generation Rate Control Parameter, incorporates the possible contribution of the generation term to the updating process. Another parameter, the Generation Probability (GP), determines how many particles employ the generation term to update their states. The mechanism of this contribution is determined by equation (19), which affects each particle: for instance, if GCP = 0, then G = 0, and all of the particle's dimensions are updated without a generation rate term. An equitable balance between exploration and exploitation is achieved with GP = 0.5. The final updating rule of EO is as follows:

$$C = C_{\mathrm{eq}} + \left(C - C_{\mathrm{eq}}\right) F + \frac{G}{\lambda V}\left(1 - F\right), \tag{20}$$

where $V$ is regarded as a unit and $F$ is specified in equation (15).
In equation (20), the first term is the equilibrium concentration, and the second and third terms represent the variations in concentration. The second term is responsible for a global search of the domain to locate an optimal point; it contributes more to exploration and therefore makes use of large differences in concentration. The third term contributes to increasing the accuracy of the solution around a point; it is more exploitative and benefits from equation (17), because the generation rate controls the concentration variations. The second and third terms may have the same or opposite signs depending on factors such as the concentrations of the particle and the equilibrium candidate as well as the turnover rate $\lambda$. The same sign makes the variation large, which improves searching over the whole domain, while opposite signs make it small, which improves local search.
These variations are controlled by the generation rate terms (equations (17)-(19)). Because $\lambda$ differs in each dimension, the large variance affects only the dimensions with small values. It is worth noting that this feature functions similarly to an evolutionary algorithm's mutation operator and significantly aids EO in exploiting the solutions.
The detailed pseudocode of EO is as follows:
1. Initialize the population of particles $C_i$, $i = 1, \ldots, n$, using equation (10).
2. Assign the worst possible fitness value to the equilibrium candidates; set the free parameters $a_1 = 2$, $a_2 = 1$, and GP = 0.5.
3. While Iter < Max_iter:
4. For $i = 1$ to $n$: calculate the fitness of particle $i$ and update the equilibrium candidates $C_{\mathrm{eq}1}, C_{\mathrm{eq}2}, C_{\mathrm{eq}3}, C_{\mathrm{eq}4}$, and $C_{\mathrm{eq(ave)}}$, forming $C_{\mathrm{eq,pool}}$ (equation (11)).
5. Accomplish memory saving (if Iter > 1).
6. Compute $t$ using equation (13).
7. For $i = 1$ to $n$: select one candidate at random from the equilibrium pool; compute $F$ (equation (15)) and $G$ (equations (17)-(19)); update the concentration $C_i$ using equation (20).
8. Iter = Iter + 1.
9. End while.
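As a compact, hedged illustration of equations (10)-(20) and the pseudocode above, the sketch below implements the EO loop in Python for a generic minimization problem. It follows one plausible reading of the candidate-update and memory-saving details, with the constants $a_1 = 2$, $a_2 = 1$, and GP = 0.5 taken from the text; all function and variable names are ours, and bound clipping is an added safeguard.

```python
import numpy as np

def equilibrium_optimizer(fitness, dim, c_min, c_max, n=30, max_iter=150, seed=0):
    """Minimize `fitness` over [c_min, c_max]^dim with the EO update rule."""
    rng = np.random.default_rng(seed)
    a1, a2, GP, V = 2.0, 1.0, 0.5, 1.0  # free parameters from the pseudocode
    # Equation (10): uniform random initialization of the n concentrations.
    C = c_min + rng.random((n, dim)) * (c_max - c_min)
    eq, eq_fit = [np.zeros(dim)] * 4, [np.inf] * 4  # worst-possible start
    for it in range(max_iter):
        for i in range(n):
            f = fitness(C[i])
            # Keep the four best solutions seen so far as equilibrium candidates
            # (this also stands in for the memory-saving step).
            for j in range(4):
                if f < eq_fit[j]:
                    eq.insert(j, C[i].copy()); eq_fit.insert(j, f)
                    eq.pop(); eq_fit.pop()
                    break
        # Equation (11): the pool holds the four candidates plus their average.
        pool = eq + [np.mean(eq, axis=0)]
        # Equation (13): "time" shrinks as the iteration counter grows.
        t = (1 - it / max_iter) ** (a2 * it / max_iter)
        for i in range(n):
            Ceq = pool[rng.integers(len(pool))]  # random pool member
            lam = rng.random(dim) + 1e-12        # turnover rate; avoid div by 0
            r = rng.random(dim)
            # Equation (15): exponential term F (eqs (12) and (14) folded in).
            F = a1 * np.sign(r - 0.5) * (np.exp(-lam * t) - 1)
            # Equations (17)-(19): generation rate; GCP repeats a single value.
            GCP = 0.5 * rng.random() if rng.random() >= GP else 0.0
            G = GCP * (Ceq - lam * C[i]) * F
            # Equation (20): the concentration updating rule (V is a unit).
            C[i] = Ceq + (C[i] - Ceq) * F + (G / (lam * V)) * (1 - F)
            C[i] = np.clip(C[i], c_min, c_max)   # keep particles inside bounds
    return pool[int(np.argmin([fitness(p) for p in pool]))]
```

For example, `equilibrium_optimizer(lambda x: float((x ** 2).sum()), dim=10, c_min=-5.0, c_max=5.0)` drives the particles toward the origin of a sphere function.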

The proposed enhancement
In K-means clustering, one element must be fixed in advance: the number of clusters K. The effectiveness of K-means clustering is strongly influenced by the choice of K, and there have been numerous attempts in the literature to improve K-means performance through appropriate K selection. Various techniques, including algorithms inspired by nature, have been used to choose K [32]. However, none of these existing procedures for choosing K attempts to simultaneously select features. Our approach in this research aims to enhance K-means clustering by optimizing the number of clusters while simultaneously incorporating feature selection. The equilibrium optimizer algorithm (EOA) is a novel meta-heuristic physics-based algorithm and is considered one of the most powerful, fast, and best-performing population-based optimization algorithms. In this study, we propose to handle feature selection and select the number of clusters in K-means by EOA simultaneously. Figure 1 is an illustration of the solution representation.
Each member of the population holds a position made up of both quantitative and binary values: the quantitative part encodes the number of clusters, while the binary part indicates the selected attributes. A position entry of 1 means the corresponding feature is selected, while 0 means it is excluded. The steps of our suggested algorithm are as follows.
Step 1: The maximum number of iterations is $t_{\max} = 150$, and the EOA population size is $n = 30$.
Step 2: The number of clusters, $K$, is randomly generated from a uniform distribution as $K \sim U(0, 10)$.
Step 3: The remaining positions, which represent the features, are generated as $U(0, 1)$.
Step 4: The fitness function is defined as the total within-cluster variance in equation (21).
Step 5: The positions are changed using equation (20). A binary EOA is applied to deal with feature selection. In this situation, a p-bit binary string serves as the representation of each member, and the continuous position is mapped into the binary space using a transfer function. The transfer function produces the binary vector, so that the final solution contains only binary values [33].
Step 6: Steps 4 and 5 are repeated until $t_{\max}$ is reached.
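Putting Steps 1-6 together, the following hedged sketch shows how a single EOA position could be decoded and scored; it reuses the `k_means` and `equilibrium_optimizer` sketches given earlier, and the sigmoid transfer function with a 0.5 threshold is a common binarization choice [33] that we assume here rather than take from the paper.

```python
import numpy as np

def eoak_fitness(position, X):
    """Fitness of one EOA member on data X: the first entry encodes K,
    the remaining entries encode the binary feature-selection mask."""
    # Step 2: the first position encodes the number of clusters, K ~ U(0, 10);
    # at least two clusters are assumed for a meaningful partition.
    K = max(2, int(round(position[0])))
    # Step 5: a sigmoid transfer function maps each remaining continuous
    # position into [0, 1]; values above 0.5 select the feature (assumed rule).
    mask = 1.0 / (1.0 + np.exp(-position[1:])) > 0.5
    if not mask.any():
        return np.inf  # no feature selected: invalid solution
    # Step 4: the fitness is the total within-cluster variance (equation (21))
    # of K-means run on the selected feature subset.
    _, _, wcss = k_means(X[:, mask], K)
    return wcss

# Steps 1-6 in one call: n = 30 particles, t_max = 150 iterations, and a
# (1 + d)-dimensional position, where d is the number of features of X.
# X = ...  # an (N, d) data matrix loaded beforehand
# best = equilibrium_optimizer(lambda p: eoak_fitness(p, X),
#                              dim=1 + X.shape[1], c_min=0.0, c_max=10.0,
#                              n=30, max_iter=150)
```

Scalar bounds are used above for brevity; since the paper draws $K$ from $U(0, 10)$ but the feature positions from $U(0, 1)$, per-dimension bounds (NumPy arrays for `c_min` and `c_max`) would be passed in practice.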

Results and discussion
The performance of the proposed algorithm, EOAK-means, is investigated by applying it to five different publicly available datasets. Further, the performance of the EOAK-means algorithm is compared with (1) a standard version of the K-means algorithm that employs EOA only for selecting the optimum number of clusters (denoted K-means), and (2) the K-means algorithm using cross-validation (CV) for selecting the optimum number of clusters (denoted CVK-means). Table 1 shows a brief description of the used datasets. The selected datasets vary in the number of clusters, C, dimensions, d, and number of observations, N.
To evaluate the effectiveness of the used algorithms, two criteria are used: (1) the sum of intra-cluster distances (ICD), an internal quality criterion, which is defined in equation (21).
Obviously, the smaller the sum of ICD, the higher the quality of the clustering algorithm [37,38]. (2) The Rand index (RI), a well-known external clustering criterion, defined as

$$\mathrm{RI} = \frac{f_1 + f_4}{f_1 + f_2 + f_3 + f_4},$$

where $f_1$ is the number of pairs of data points assigned to the same cluster in both partitions, $f_2$ is the number of pairs of dissimilar data points assigned to a single cluster, $f_3$ is the number of pairs of related data points assigned to different clusters, and $f_4$ is the number of pairs of dissimilar data points assigned to different clusters. The RI value lies between 0 and 1, where 1 represents perfect clustering [39,40]. Table 2 provides a summary of the ICD that the clustering techniques produced on the used datasets. The outcomes represent the best, worst, average, and standard deviation (SD) of the 20 achieved solutions. The RI and its related SD are listed in Table 3. The best outcomes are in bold font in Tables 2 and 3.
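For reference, both criteria are easy to compute directly from a clustering result. The sketch below uses plain (non-squared) Euclidean distances for ICD, one common reading of equation (21) (the squared variant is a one-line change), and the pair-counting RI definition given above.

```python
import numpy as np
from itertools import combinations

def intra_cluster_distance(X, labels, centroids):
    """Sum of intra-cluster distances (ICD): distance of every point to the
    centroid of its own cluster; smaller values mean tighter clusters."""
    return sum(np.linalg.norm(X[i] - centroids[labels[i]]) for i in range(len(X)))

def rand_index(true_labels, pred_labels):
    """Rand index: (f1 + f4) / (f1 + f2 + f3 + f4) over all point pairs."""
    agree, total = 0, 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        agree += same_true == same_pred  # counts the f1 and f4 pairs
        total += 1
    return agree / total  # 1.0 means the two partitions agree perfectly
```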
As can be observed from Table 2, the proposed algorithm, EOAK-means, obtained the best, average, and worst values for all five datasets. For the H1N1 dataset, the best, average, and worst solutions obtained by EOAK-means are 59.544, 65.895, and 74.204, respectively, which are better than those of the other algorithms. Further, it can be observed that K-means is in second place for all used datasets. Additionally, the proposed EOAK-means algorithm has the most consistent results, with the lowest SD values for all datasets compared with the other algorithms. From Table 3, we can see that the EOAK-means algorithm achieves the highest average RI on all datasets, followed by K-means and CVK-means. This indicates that EOAK-means is able to form clusters that are close to the target clusters on average. At the same time, K-means obtains excellent performance compared with CVK-means.
A non-parametric statistical test, the Wilcoxon signed-rank test, is used to further demonstrate the effectiveness of the suggested algorithm. The p-values of the ICD and RI comparisons between EOAK-means and the other clustering algorithms on the five datasets are shown in Tables 4 and 5, respectively. Tables 4 and 5 show that the EOAK-means clustering algorithm outperforms the K-means and CVK-means clustering algorithms at a significant level: the Wilcoxon signed-rank test rejects the null hypothesis that EOAK-means and each compared clustering algorithm have equivalent performance, confirming significant differences in the performance of the clustering algorithms.

Conclusion
In this study, an enhancement of K-means clustering is proposed by employing an equilibrium optimization algorithm. The suggested approach adjusts the number of clusters while simultaneously choosing the best attributes to find the optimal solution. Using five datasets, the proposed algorithm, EOAK-means, is contrasted with K-means and CVK-means in terms of intra-cluster distances and Rand index. The outcomes demonstrate that EOAK-means handles the Rand index and intra-cluster distances better than the other methods. Additionally, the Wilcoxon signed-rank test-based statistical analysis has shown that our suggested algorithm performs better than previous algorithms. As a limitation, the performance of EOAK-means depends on the choice of the algorithm's parameters. In the future, it will be possible to combine various feature selection techniques simultaneously in order to take advantage of the strengths of each technique; ensemble methods offer greater accuracy and stability than relying on a single feature selection technique.

Conflict of interest:
Authors state no conflict of interest.