Improving binary crow search algorithm for feature selection

Abstract: The feature selection (FS) process has an essential effect in solving many problems, such as prediction, regression, and classification, to get the optimal solution. For solving classification problems, selecting the most relevant features of a dataset leads to better classification accuracy with low training time. In this work, a hybrid binary crow search algorithm (BCSA) based on the quasi-oppositional (QO) method is proposed as a wrapper-mode FS method for solving classification problems. The QO method is employed in tuning the value of flight length in the BCSA, which controls the ability of the crows to find the optimal solution. To evaluate the performance of the proposed method, four benchmark datasets have been used: human intestinal absorption, HDAC8 inhibitory activity (IC50), P-glycoproteins, and antimicrobial. The experimental results are discussed and compared against other standard algorithms in terms of accuracy rate, average number of selected features, and running time. The results have proven the robustness of the proposed method through the high obtained values of accuracy (84.93–95.92%) and G-mean (0.853–0.971), and the low average number of selected features (4.36–11.8) with relatively low computational time. Moreover, to investigate the effectiveness of the proposed method, the Friedman test was used, which showed that the performance supremacy of the proposed BCSA-QO over BCSA and CSA on the four datasets was very evident, selecting the minimum number of relevant features and producing the highest classification accuracy rate. The obtained results verify the usefulness of the proposed method (BCSA-QO) in FS with classification in terms of high classification accuracy, a small number of selected features, and low computational time.


Introduction
The feature selection (FS) process has an essential effect in solving prediction, regression, and classification problems. It consists of detecting and selecting the most appropriate features of each dataset to obtain the most accurate results [1]. The main aim of the FS process is to maximize accuracy while reducing the number of selected features and the running time [2]. There are three types of FS algorithms: wrappers, filters, and hybrids [3,4]. In the filter method, the classifier is not included in the FS process, so it is less time-consuming than the hybrid and wrapper methods. In the wrapper method, the selection of features depends on the accuracy of the classifier, which is a good evaluation criterion for choosing the most relevant features. The hybrid method combines the techniques of the filter and wrapper methods and determines the selected feature subset based on the classifier design [3,4].
There are many variants of swarm intelligence algorithms that have been applied to FS problems, such as binary particle swarm optimization (BPSO) [5], a binary differential evolution algorithm based on taper-shaped transfer functions (T-NBDE) [6], binary grey wolf optimization (BGWO) [7], the binary whale optimization algorithm [8], and the binary crow search algorithm (BCSA) [9]. The following studies describe in more detail the use of swarm intelligence algorithms for solving FS problems. In ref. [10], the authors proposed a novel FS method based on a binary multi-verse optimizer algorithm for solving text clustering problems. In ref. [11], the authors proposed a new FS method based on a BGWO algorithm, also for text clustering. In ref. [12], the authors proposed a novel multi-objective binary version of the cuckoo search algorithm to select the optimal EEG channels for person identification. In ref. [13], the authors used a BGWO algorithm to select the optimal EEG channels, which are then used as inputs to a hybrid support vector machine with a radial basis function classifier for person identification. In ref. [14], the authors proposed a new hybrid FS method based on the TRIZ inventive solution and the GWO algorithm for gene selection in microarray data classification, with classification performed by a support vector machine.
In addition, FS is one of the problems that has been solved using the CSA. CSA has been applied in many fields, such as medical diagnosis problems [15], usability feature extraction for software quality measurement [16], and home energy management in smart grids [17]. In ref. [18], the authors proposed an opposition-based learning strategy based on BCSA to select the most relevant features when classifying a big dataset. Adamu et al. [3] suggested a method called ECCSPSOA, a novel hybrid of PSO with a binary chaotic CSA, for selecting the most relevant features in classification problems. Arora et al. [19] suggested a hybrid method for solving FS problems using the GWO and CSA algorithms. Rodrigo et al. [1] presented a model called V-shaped BCSA to solve the FS problem. In ref. [20], the authors proposed a method to enhance the CSA by adapting the awareness probability; they also proposed a new approach called BCSA-TVFL, based on BCSA and a time-varying flight length, to solve FS problems. In ref. [21], the authors enhanced the efficiency of CSA with a cellular automata model to control the diversity of the search process.
From all the aforementioned studies, the following research question is raised: given the many FS methods that have been developed and improved in previous studies, is there still a need to develop and enhance a new FS method [22]? The answer is yes, because not all existing FS techniques produce the best performance when solving different problems with various datasets [20]. Consequently, a robust method for selecting the most relevant features is still needed, and this is the motivation of this study.
Accordingly, it is clear that the CSA plays an important role in solving FS problems, but its main weakness is trapping in local minima [20]. Thus, the objective of this study is to develop and enhance the efficiency of the CSA to overcome this problem.
In this study, the CSA has been selected as the FS method among other optimization algorithms because it is easy to implement, requires few tuning parameters compared to other swarm optimization algorithms, converges quickly, and has high efficiency [1,18]. Its parameters are the awareness probability (AP) and the flight length (fl). The AP parameter balances the trade-off between exploration and exploitation, and the fl parameter controls the search capability of the crows [20]. In the basic CSA, fl is a constant value, which may cause inappropriate searching of the solution space by the crows and result in trapping in local optima [23]. A constant fl also does not resemble the natural behavior of crows [20]. Thus, the primary goal of this study is to improve the exploitation and exploration abilities of CSA, and the main problem of this work is tuning the value of fl. In addition, FS is a discrete problem, and converting continuous values to discrete values requires a transfer function; the sigmoid function is used for this conversion as part of the binary crow search algorithm improvement for solving FS problems. The contributions of this work are as follows:
1. Propose a new FS technique based on the CSA that tunes the value of fl of the crows using the quasi-oppositional (QO) method to obtain the most significant features, making the algorithm more suitable for solving various problems with different datasets.
2. Test the performance of the proposed BCSA-QO on four popular datasets using the k-nearest neighbors (KNN) classifier in terms of classification accuracy, minimum number of selected features, G-mean, and minimum required CPU time.
3. Use the Jaccard index to measure the consistency of the features selected by the proposed method.
4. Apply the Friedman and Bonferroni tests to show the significance of the difference between the results of the proposed method and other existing methods.
The rest of this article is arranged as follows: Section 2 includes the CSA. Section 3 explains the suggested method. The results are explained in Section 4. Finally, Section 5 includes the conclusion.

CSA
CSA [24] is a swarm intelligence algorithm that simulates the natural foraging process of crows. In summary, CSA relies on the following behaviors of crows: living in a swarm, memorizing the hiding places of food, following other crows to steal their food, and preventing other crows from finding their own food locations. For an optimization problem, let n_c be the number of crows in the swarm and x_i^t the position of crow i at iteration t. With n crows in a d-dimensional search space, the swarm is represented as an n × d matrix in which row i holds the position x_i^t = (x_i^1, ..., x_i^d), so each row represents one possible solution. The crows travel through the search space and memorize the best solution reached so far in a matrix M of the same size, in which row m_j^t is the best position found by crow j up to iteration t. In standard CSA, the searching process follows one of two cases. In the first case, the owner crow j of food source m_j^t does not realize that the thief crow i is following it; crow i therefore reaches the hiding place of crow j and updates its position based on equation (1):

x_i^{t+1} = x_i^t + r_i × fl_i^t × (m_j^t − x_i^t),  (1)
where fl_i^t represents the flight length of crow i at iteration t, and r_i denotes a random value within [0, 1].
In the second case, crow j realizes that it is being followed by crow i; crow j then tries to distract crow i by moving to a different place, and crow i updates its location randomly. The whole searching process is summarized by equation (2):

x_i^{t+1} = x_i^t + r_i × fl_i^t × (m_j^t − x_i^t) if θ_j ≥ AP; otherwise, a random position,  (2)
where θ denotes a random value within the interval [0, 1], and AP denotes the awareness probability, which balances the exploration and exploitation phases during the search. To perform FS, the BCSA is used [20,22]. It is a binary version of the standard crow search algorithm [24] that operates in a binary search space instead of a continuous one: the value 0 in the search space means a feature has not been chosen, and the value 1 means it has been chosen. In BCSA, transforming the search space from continuous values to binary values {0, 1} requires a suitable transfer function; in this study, the sigmoid function is used. Figure 1 demonstrates the BCSA solution implementation.
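As an illustration, the continuous CSA update of equations (1) and (2) can be sketched in Python as follows; the array shapes, the bounds of the random restart, and all function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def csa_update(positions, memory, fl, AP, rng=None):
    """One CSA iteration: each crow i follows a random crow j and either
    moves toward j's memorized food source (equation (1)) or, if j is
    aware of the pursuit, jumps to a random position (equation (2))."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = positions.shape
    new_positions = positions.copy()
    for i in range(n):
        j = rng.integers(n)                  # crow i picks a random crow j
        if rng.random() >= AP:               # crow j is unaware of crow i
            r = rng.random()
            new_positions[i] = positions[i] + r * fl * (memory[j] - positions[i])
        else:                                # crow j is aware: random restart
            new_positions[i] = rng.uniform(-1.0, 1.0, size=d)
    return new_positions
```

In a full run, the memory matrix would be refreshed after each update by keeping, for every crow, whichever of its old memory and new position has the better fitness.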

The proposed method: BCSA-QO
As mentioned before, CSA depends on only two parameters, fl and AP, to reach the best solution [20]. To control exploration and exploitation, find the optimal solution, and better imitate the nature of crows, the value of the fl parameter should not be constant [20]. Therefore, in this study, the QO method is suggested to tune the value of fl and obtain the optimal solution. The QO method [20] builds on the opposition-based learning (O-BL) technique [25]. O-BL relies on the opposite of any number located within a specific interval, which can be represented mathematically as follows. Suppose x ∈ R is a real number within the interval [a, b]; then the opposite number x̄ is defined in equation (3):

x̄ = a + b − x.  (3)
If a = 0 and b = 1, then equation (3) reduces to equation (4):

x̄ = 1 − x.  (4)

For a multi-dimensional point x = (x_1, ..., x_d) with x_i ∈ [a_i, b_i], the opposite point is given by equation (5):

x̄_i = a_i + b_i − x_i, i = 1, ..., d.  (5)

The quasi-opposite number q̄x_i of x̄_i is a random value between the interval centre c_i = (a_i + b_i)/2 and x̄_i, as shown in equation (6):

q̄x_i = rand(c_i, x̄_i).  (6)

Accordingly, two values are obtained: the initial random value x within the interval [a, b], and its quasi-opposite q̄x. The technique uses two functions, a primary function f(x) and an assessment function g(·) (i.e., a fitness function). In every iteration of BCSA-QO, f(x) and f(q̄x) are evaluated; if g(f(x)) ≥ g(f(q̄x)), then x is selected, otherwise q̄x is selected. As a result, fl is kept within the interval [fl_min, fl_max], and the quasi-opposite number q̄fl is defined using equation (7):

q̄fl = rand((fl_min + fl_max)/2, fl_min + fl_max − fl).  (7)

The fitness values of both the initial fl and q̄fl are then calculated in each iteration; if fitness(fl) ≥ fitness(q̄fl), fl is selected, otherwise q̄fl is selected. Algorithm 1 shows the pseudo-code of the proposed method.
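A minimal sketch of the quasi-oppositional tuning of fl (equations (3), (6), and (7)) might look like the following; the `fitness` callable and all names are illustrative assumptions, and the greedy selection direction follows the rule stated in the text.

```python
import numpy as np

def quasi_opposite(x, lo, hi, rng=None):
    """Quasi-opposite of x in [lo, hi]: a random value between the
    interval centre and the opposite number lo + hi - x (eqs. (3)-(7))."""
    rng = np.random.default_rng() if rng is None else rng
    opposite = lo + hi - x                   # opposite number, equation (3)
    centre = (lo + hi) / 2.0
    return rng.uniform(min(centre, opposite), max(centre, opposite))

def tune_fl(fl, fl_min, fl_max, fitness, rng=None):
    """Keep fl if fitness(fl) >= fitness(q_fl); otherwise switch to the
    quasi-opposite flight length, as described in the text."""
    q_fl = quasi_opposite(fl, fl_min, fl_max, rng)
    return fl if fitness(fl) >= fitness(q_fl) else q_fl
```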
The following steps explain the proposed approach in detail:
Step 1: Set the number of crows, AP, fl_min, fl_max, and the maximum number of iterations.
Step 2: Convert the position values that represent the features to binary values using the sigmoid function based on equation (8) [20]:

x_ij^{t+1} = 1 if S(f(x_ij^{t+1})) ≥ T, and 0 otherwise, where S(x) = 1/(1 + e^{−x}),  (8)
where f(x_ij^{t+1}) represents the continuous value of x_ij calculated at iteration t + 1 using equation (9), and T represents a threshold value within the interval [0, 1].
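A minimal sketch of the sigmoid binarization of equation (8) could look like this; the fixed comparison against the threshold T follows the text, while the function name is an assumption.

```python
import numpy as np

def binarize(x, T):
    """Map continuous positions to {0, 1}: a feature is selected (1)
    when its sigmoid value S(x) = 1 / (1 + e^-x) reaches the threshold T."""
    s = 1.0 / (1.0 + np.exp(-x))             # sigmoid transfer, equation (8)
    return (s >= T).astype(int)
```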
Step 3: Calculate the fitness value based on the fitness function shown in equation (10):

fitness = α × C_r + β × (X / K),  (10)
where C_r is the classification error rate, X is the number of features selected by crow x_i^t, K is the total number of features, and α and β are real numbers within [0, 1].
Step 4: Update the crows' positions based on equation (2).
Step 5: Repeat steps 3 and 4 until the maximum number of iterations is reached.
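The wrapper fitness of equation (10) can be sketched as follows; the minimal 1-nearest-neighbour classifier is only a stand-in for the paper's KNN, and the weights alpha = 0.99 and beta = 0.01 are assumed values common in wrapper FS, not taken from the source.

```python
import numpy as np

def knn_error(Xtr, ytr, Xte, yte):
    """Error rate of a minimal 1-nearest-neighbour classifier
    (a stand-in for the KNN classifier used in the paper)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    pred = ytr[d.argmin(axis=1)]
    return float((pred != yte).mean())

def fitness(mask, Xtr, ytr, Xte, yte, alpha=0.99, beta=0.01):
    """Wrapper fitness of equation (10): alpha * error rate C_r plus
    beta * (selected features X / all features K); lower is better."""
    if mask.sum() == 0:                      # empty subset: worst fitness
        return 1.0
    cols = mask.astype(bool)
    err = knn_error(Xtr[:, cols], ytr, Xte[:, cols], yte)
    return alpha * err + beta * mask.sum() / mask.size
```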
The dataset is split into training and testing groups. The training group includes 70% of the instances of each dataset and is used for the training phase; the remaining 30% forms the testing group used for the testing phase.
In addition, three classification criteria are utilized to evaluate the efficiency of the proposed method and compare it with other methods in the literature: classification accuracy (CA), G-mean, and F-measure. The classification accuracy is the percentage of instances that are classified correctly, calculated based on equation (11):
CA = (True positive + True negative) / (True positive + True negative + False positive + False negative).  (11)
In addition, G-mean is utilized to emphasize the combined efficiency of specificity (SP) and sensitivity (SN), according to equation (12):

G-mean = √(SN × SP),  (12)
where SN represents the probability of positive patterns which are classified as truly positive patterns. SP denotes the probability of negative patterns which are classified as truly negative patterns.
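From the confusion-matrix counts, equations (11) and (12) amount to the following; function and argument names are illustrative.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy (equation (11)) and G-mean (equation (12)) from
    true/false positive and negative counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    return accuracy, math.sqrt(sn * sp)
```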
The F-measure is another criterion used in the evaluation; it is defined as shown in equation (13):

F-measure = (2 × Precision × SN) / (Precision + SN).  (13)

Each experiment is run repeatedly and the average of the runs is reported; the KNN algorithm is used as the classifier. Table 2 shows the average number of features selected by each method. The smaller the number of selected features, the better the performance of the method. As shown in Table 2, BCSA-QO selects fewer features than the other methods; for example, BCSA-QO selected around 12 features compared to 14 and 18 features for the modified binary CSA (M-CSA) method and the standard binary CSA (S-CSA) method [9], respectively. Meanwhile, the standard CSA selected the largest number of features among the compared methods. Hence, in terms of selected features, the proposed BCSA-QO obtains the minimum number of selected features on all datasets and is considered the best method compared with M-CSA and S-CSA.
Moreover, to examine the effectiveness of the suggested technique (BCSA-QO), the obtained results are compared against those obtained by the S-CSA method [9] and M-CSA in terms of accuracy and G-mean on the same datasets. As revealed in Table 3, the performance of the BCSA-QO method is higher than that of S-CSA and M-CSA in terms of CA and G-mean for both the training and testing phases (in the tables, the performance of the proposed method in terms of the minimum obtained features is highlighted in bold). Figure 3 shows the CPU time in seconds of the BCSA-QO, M-CSA, and S-CSA methods; as shown there, the BCSA-QO method requires less training time than the M-CSA and S-CSA methods on all datasets.
Moreover, to investigate the stability of the proposed BCSA-QO method, the Jaccard index is used to measure the consistency of the selected features. The Jaccard index is the proportion of the intersection between two sets to the union of the same sets. For two subsets J_1, J_2 ⊆ J of the selected features, the Jaccard index is defined as shown in equation (14):

J(J_1, J_2) = |J_1 ∩ J_2| / |J_1 ∪ J_2|.  (14)

The stability test (S.test) value for J is then calculated as shown in equation (15) by averaging the Jaccard index over the feature subsets obtained in repeated runs; the higher the S.test value, the more stable the FS method. Figure 4 depicts the stability of the proposed method against the other methods on the four datasets; as shown in Figure 4, BCSA-QO displays a higher rate of stability than M-CSA and S-CSA.

For more convincing evidence regarding the performance of the proposed method (BCSA-QO) in choosing the features that produce the highest classification accuracy, the Friedman test has been used, applied to the AUC values of the training phase. When the alternative hypothesis is accepted, the post hoc Bonferroni test is calculated under various critical values (0.01, 0.05, and 0.1). Table 4 shows the outcomes of the applied statistical tests.
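The Jaccard index of equation (14), together with one common way of aggregating it into a stability score (the paper's exact equation (15) may differ), can be sketched as:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature subsets (eq. (14))."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def stability(runs):
    """Average pairwise Jaccard index over the feature subsets selected
    in independent runs; higher means more consistent selection."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```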
Based on these outcomes, the alternative hypothesis is accepted at α = 0.05. As a result, there are significant differences between BCSA-QO, M-CSA, and S-CSA over the four datasets based on the AUC criterion. Moreover, BCSA-QO has the lowest average rank (3.278) when compared with M-CSA and S-CSA. According to the Bonferroni test results, the average ranks of M-CSA and S-CSA exceed the critical values at α = 0.05, α = 0.01, and α = 0.10. These results suggest that both M-CSA and S-CSA are significantly worse than BCSA-QO over the four datasets studied. Moreover, the difference between the average AUC values of M-CSA and S-CSA is not significant at α = 0.05.
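For reference, the Friedman statistic over k methods and N datasets can be computed from per-dataset ranks as sketched below; this is the textbook formula (treating higher AUC as better), not the authors' code, and the function name is an assumption.

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic for k methods over N datasets.
    scores[i][j] is the AUC of method j on dataset i (higher is better)."""
    N, k = len(scores), len(scores[0])
    avg_rank = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:                         # average the ranks of tied scores
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            for m in range(i, j + 1):
                ranks[order[m]] = (i + j) / 2.0 + 1.0
            i = j + 1
        for j in range(k):
            avg_rank[j] += ranks[j] / N
    chi2 = 12.0 * N / (k * (k + 1)) * (
        sum(R * R for R in avg_rank) - k * (k + 1) ** 2 / 4.0)
    return chi2, avg_rank
```

A post hoc test such as Bonferroni then compares the differences between the average ranks against a critical difference at the chosen α.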

Conclusion
This study improved the BCSA by using the QO method to tune the value of the flight length, with the aim of overcoming trapping in local minima. The main purpose of this work is to demonstrate that not all the features of a dataset are relevant to the solution of a problem, so discarding irrelevant ones yields the highest accuracy rate. The method has been applied to solving classification problems with FS. The BCSA-QO method combines the advantages of both the BCSA and the QO method. The experimental results have shown that the proposed method produced the highest classification accuracy, the minimum number of selected features, and the minimum required CPU time with high stability on all datasets compared to the other methods. As a limitation, the performance of the quasi-oppositional CSA depends on choosing the lower and upper bounds. For future work, the following points are suggested:
1. Propose new FS techniques by hybridizing the QO method with other binary optimization algorithms such as binary PSO, binary GWO, and binary TLBO to enhance their performance.
2. Apply the BCSA-QO technique to different applications such as prediction and clustering with different datasets.
3. Utilize other classifiers such as a support vector machine or naïve Bayes instead of the KNN.