Circular convolution-based feature extraction algorithm for classification of high-dimensional datasets

Abstract: High-dimensional data analysis has become one of the most challenging tasks today, and dimensionality reduction plays an important role in it. Dimensionality reduction focuses on the data features, which directly affect accuracy, execution time, and space requirements. In this study, a dimensionality reduction method based on the convolution of input features is proposed. Experiments are carried out on nine minimally preprocessed benchmark datasets. Results show that the proposed method gives an average 38% reduction in the original dimensions. The algorithm's accuracy is tested using the decision tree (DT), support vector machine (SVM), and K-nearest neighbor (KNN) classifiers and compared with the existing principal component analysis (PCA) algorithm. The average increase in accuracy (Δ) is 8.06 for DT, 5.80 for SVM, and 18.80 for KNN. The most significant characteristic of the proposed model is that it reduces attributes, leading to less computation time without loss in classifier accuracy.


Overview
The data generated every day by Internet-based applications are multidimensional. Data originate from various sources in different forms [46], and analyzing such extensive data is a cumbersome task. The accuracy and speed of data analysis depend on the features of the data involved. Features describe instances of data; they can be constructed, discretized, transformed, or selected from massive data to classify it more accurately. There are three main kinds of features: categorical, ordinal, and quantitative. Feature transformation improves the usefulness of features by changing, removing, or adding information, and it may map the original feature set onto another set containing fewer features. Selecting a subset of a given set of features speeds up learning and helps protect against overfitting [1].

Dimensionality reduction
Different features of data carry information about the target. If more features characterize the data, they carry more information, and data analytics might therefore be expected to give better results. But this is not always the case: irrelevant or redundant features inflate computation and storage and can even degrade accuracy, which is why dimensionality reduction is needed.

Motivation
Feature selection, feature extraction, and feature optimization are different forms of dimensionality reduction. Dimensionality reduction is an active research topic, as it plays a vital role in analyzing data in various important fields such as sentiment analysis [8]. Considerable research on sentiment analysis uses machine learning algorithms and different word embedding techniques in a wide range of applications, such as massive open online course (MOOC) assessment, product reviews, and probable question topic extraction [9][10][11]. Turkish and bibliometric data analysis has also been performed using sentiment analysis with feature selection techniques [12,13]. Humanoid robots have a wide application area in today's world, and feature selection plays a vital role in robotics too. The real challenge here is to achieve push recovery for the humanoid robot and handle the non-linearity associated with its motion. Human motion study is very important for understanding neurological disorders and identifying gait abnormality [49,51]. Human gait is unique to every person, and the human walk is described with different joint trajectories [14][15][16]. Push recovery data classification has also been done in the literature using deep learning techniques. Features play an essential role in developing a computing module for push and gait recovery [17,18].
Feature reduction is also necessary for natural language processing. Machines cannot understand language, with its proper meaning, the way humans do. In natural language processing, research has been carried out along different dimensions by many authors: sentiment analysis of comments and reviews on social media or websites [19], context-based queries [20], teacher assessment reviews [21], and text classification using keyword extraction [22][23][24]. Thus, feature reduction has become an important part of widely applicable data analytics; it directly impacts classifier accuracy, computational time, and space requirements.

Research contribution
This study contributes a feature extraction method that, for the first time in the literature, uses the convolution technique for feature reduction.
The highlights of this research are as follows:
1. The proposed method shows a significant reduction in features.
2. The convolution technique is modeled to reduce the features.
3. Performance analysis: testing on benchmark datasets of different dimensionality is done with different classifiers. All combinations give better results than the existing method in terms of accuracy and time.
4. Owing to the notable reduction in features, a reduction in computation time and space is observed.

Organization of the article
The rest of the article is organized as follows. The second section presents the literature survey in the dimensionality reduction domain and describes related work in different domains. The third section presents the proposed methodology, including the detailed algorithms. The fourth section discusses the experimental setup. The fifth section elaborates on the results obtained with different classifiers and their analysis. The last section presents concluding remarks and future directions of the research.

Literature survey
Heart disease is one of the major killers among diseases, having caused 7.4 million deaths worldwide in 2015 [25]. Other diseases such as cancer, kidney disease, and hepatitis are also significant killers worldwide. In the medical field, disease detection is challenging, as the features under consideration may be irrelevant or redundant, reducing classification accuracy. Eliminating such irrelevant and redundant features is essential to reduce classification effort, reduce the risk of overfitting, and improve classification accuracy. In general, classification time grows with the size of the feature set, so reducing the feature set cuts computation; the challenge is to do so without sacrificing accuracy. In the literature, preprocessing steps applied before the actual data classification include feature reduction, feature selection, feature extraction, and feature optimization; these are popular techniques used for dimensionality reduction [26].
Vipin Kumar et al. discussed different feature selection approaches, such as filter, wrapper, and embedded methods, and their real-world applications. Features may be highly dependent on one another, or there may simply be too many of them; hence, different feature reduction techniques are frequently used by researchers in different areas. The authors described application areas of feature selection such as remote sensing, text categorization, intrusion detection, and image retrieval, and mentioned challenges in feature selection such as large-dimensional data, scalability, and stability [27]. A database of 50 subjects was created for human gait analysis [49] by Vijay et al. The data are represented using 24 attributes per instance and preprocessed with the Kalman filter. Different combinations of deep learning and hybrid deep learning classifiers were used in the experiments, showing a significant increase in accuracy (99.34%). The authors also worked on a biped robot, collecting data on different walking styles through inertial measurement unit sensors and analyzing the walking patterns [50].
Some nature-inspired algorithms for feature selection, such as the binary bat algorithm, particle swarm optimization, and modified cuckoo search, are discussed by Shrivastava et al. [28,29]. An integrated filter and wrapper method with a sequential search procedure improves classifier performance while tackling the overfitting problem and the chance of getting stuck in a locally optimal solution [30]. Xie and Wu proposed a feature selection algorithm based on association rules, which discovers class-relevant features according to association analysis theory; however, its time complexity is relatively high because of the Apriori algorithm [31]. Alessio Ferone proposed a novel feature selection approach based on rough set theory [32]. Jinghua Liu et al. proposed a feature selection method based on the distinguishing ability of features, using the maximum nearest neighbor concept to discriminate the nearest neighbors of samples and evaluate feature quality [33]. Tajanpure and Jena [38] put forth a multistage classifier system in which features are decimated according to their level of processing need: first-level features are processed first, and the decision on further processing is taken based on the first classifier's output. Dua et al. worked on human activity recognition with a proposed deep neural network combining a CNN and a gated recurrent unit, which performs both feature extraction and classification [51].
Saul Solorio-Fernandez et al. proposed a new unsupervised spectral feature selection method. In many practical problems the dataset under study is described by both numerical and nonnumerical features, that is, it is a mixed dataset. The proposed spectral feature selection method uses a kernel and a new spectrum-based feature evaluation measure to decide the relevance of features. K-nearest neighbor (KNN), Naïve Bayes, and support vector machine (SVM) classifiers are used to evaluate the proposed algorithm's performance, and their accuracies are compared [39]. Much research has been done on disease diagnosis systems based on different approaches, such as learning vector quantization and artificial neural networks [40] and classification algorithms [41,42]. Many dimensionality reduction algorithms have been developed for different applications, each with advantages and limitations such as overfitting or higher time complexity. The common basis for evaluating these techniques is classifier accuracy along with parameters such as specificity, sensitivity, and F-measure [43].
Wei Song proposed an effective content-based feature selection approach to improve the clustering performance of genetic algorithms. Conventional genetic algorithms suffer from slow learning and local minima due to a high-dimensional exploration space. The proposed approach provides parametric and nonparametric procedures to adjust the genetic algorithm operators properly [44].
In the brute-force feature selection method, all possible combinations of the input features are evaluated to find the best subset. Here the computational cost is high, with the considerable danger of overfitting. An important aspect of feature selection techniques is the evaluation of a candidate feature subset and searching through the feature space. If there exist at least two instances with the same feature values but with different class labels, the feature subset is classified as inconsistent [45].

Proposed feature reduction system design
Convolution is a way of combining two sequences into another sequence. In digital signal processing, each value in an input sequence is viewed as a scaled and shifted unit impulse or delta function δ(n), and the output of convolution is simply a sum of shifted and scaled versions of the impulse response. It follows that dimensions that are dominant in the input remain dominant in the output, so accuracy is preserved despite feature reduction [52]. This research proposes a feature extraction method based on convolution. Convolution is one of the basic signal operations and has two forms: linear convolution (LC) and circular convolution (CC). LC gives the linear overlapping of two sequences, which is nothing but the output of a system with impulse response x_2(n) when triggered with input x_1(n). LC relates an output sequence to a given input sequence and impulse response, as shown in equation (1); it is computed over all relevant values of n, that is, from −∞ to +∞. In mathematical form, LC is expressed [35] as

    y(n) = x_1(n) * x_2(n) = \sum_{k=-\infty}^{+\infty} x_1(k)\, x_2(n-k),    (1)

whereas circular convolution gives the output when the two sequences are circularly overlapped. Table 1 shows the comparison between LC and CC. The output of CC is the same as that of LC [34] if both sequences are zero-padded to the length

    N = m + n - 1,    (2)

where m and n are the lengths of x_1(n) and x_2(n). Here we focus more on CC, since more features can be reduced with it, as shown in the second row of Table 1: only one feature can be reduced per LC application, whereas for CC the output length can be set to the nearest power of two (2^p) with respect to the maximum length of the first or second sequence [34]. To obtain the same number of values, N, in both input sequences, both sequences are zero-padded as needed.
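As an illustration, the following minimal Matlab sketch (arbitrary example values; cconv requires the Signal Processing Toolbox) shows that circular convolution padded to length m + n − 1 reproduces linear convolution:

    x1 = [1 2 3];            % sequence of length m = 3
    x2 = [4 5 6 7];          % sequence of length n = 4
    yl = conv(x1, x2);       % linear convolution, length m + n - 1 = 6
    yc = cconv(x1, x2, numel(x1) + numel(x2) - 1);   % circular convolution padded to m + n - 1
    % max(abs(yl - yc)) is zero up to rounding: CC equals LC when N = m + n - 1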
CC is one of the important properties of the discrete Fourier transform (DFT). If y(L) is the output sequence, x_1(n) and x_2(n) are the input sequences, and N is the nearest power of two with respect to max(length(x_1(n)), length(x_2(n))), then CC is mathematically expressed [35] as

    y(L) = \sum_{n=0}^{N-1} x_1(n)\, x_2\big((L-n)\big)_N, \qquad L = 0, 1, \ldots, N-1.    (3)

The index ((L−n))_N in equation (3) denotes the modulo-N (circular) shift that characterizes CC. This is an important property of the DFT: the multiplication of the DFTs of two sequences corresponds to the CC of those sequences in the time domain [17].
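A minimal Matlab sketch of this DFT property, assuming the same arbitrary example sequences, is:

    x1 = [1 2 3];  x2 = [4 5 6 7];
    N  = 2^nextpow2(max(numel(x1), numel(x2)));      % nearest power of two, here N = 4
    Y  = real(ifft(fft(x1, N) .* fft(x2, N)));       % circular convolution via DFT/IDFT
    % Y matches cconv(x1, x2, N): multiplying DFTs convolves the sequences circularly in time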
As the proposed algorithm uses CC, we consider different methods to find it. Let x_1(n) and x_2(n) be the two sequences to be convolved, with "m" and "n" samples, respectively, producing the output sequence y(l) containing "l" samples.

Method 1: CC by linear convolution equivalence (LCE) method
One can get the result of CC equivalent to that of LC by applying zero padding so that each input sequence contains (m + n − 1) elements. By the principle of LC, the output sequence contains one value fewer than the sum of the numbers of values in the two input sequences. This is the basis of the proposed attribute reduction concept.

Method 2: CC by DFT/inverse discrete Fourier transform (IDFT) method
To find the CC of two sequences of lengths "m" and "n," the condition m = n must be met, and the common length should equal the nearest upper power of two (2^p). To meet this condition, one applies zero padding to the input sequences and then finds their CC.
The architecture of the proposed system based on CC is as follows. As shown in Figure 1, the input dataset first undergoes data preprocessing. The normalized data are then divided into two groups of features, each containing a number of elements close to the nearest power of two. These two sets of features act as inputs to CC, which extracts a reduced set of features at the output. Afterwards, this reduced set of features is tested with a classifier to judge the performance of the proposed feature reduction algorithm.
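A minimal Matlab sketch of this pipeline, under assumed details (an even split of the feature columns, min-max normalization, and placeholder data; the classifier step is shown later), is:

    X  = rand(100, 300);                             % placeholder: 100 instances, 300 features
    Xn = bsxfun(@rdivide, bsxfun(@minus, X, min(X)), max(X) - min(X) + eps);  % min-max normalization
    d  = size(Xn, 2);
    h  = floor(d/2);
    A  = Xn(:, 1:h);                                 % first group of features
    B  = Xn(:, h+1:end);                             % second group of features
    N  = 2^nextpow2(max(h, d - h));                  % CC length: nearest power of two (here 256 < 300)
    Xr = zeros(size(Xn, 1), N);                      % reduced feature matrix
    for i = 1:size(Xn, 1)
        Xr(i, :) = real(ifft(fft(A(i, :), N) .* fft(B(i, :), N)));   % CC per instance (DFT/IDFT)
    end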
According to the above-mentioned Method 1 and Method 2 for finding CC, there are two variants of the proposed feature reduction method: FrbyLCE and FrbyCC.

LCE method (FrbyLCE)
In this method, zero padding is applied to x_1(n) and x_2(n) so that each sequence contains L = m + n − 1 elements. One feature is reduced per application; to reduce several attributes, FrbyLCE has to be applied repeatedly to the input.
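As a minimal sketch (placeholder values; the half-split of the features is an assumption), cascading five FrbyLCE stages removes five features from a single instance:

    x = rand(1, 13);                       % single instance with 13 features
    for s = 1:5                            % five cascaded FrbyLCE stages
        h = floor(numel(x) / 2);
        x = conv(x(1:h), x(h+1:end));      % linear convolution: one feature fewer per stage
    end
    % numel(x) is now 8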

DFT IDFT circular convolution method (FrbyCC)
Here, zero padding is applied according to the number of points/samples, that is, the N expected in the DFT.

Experimental setup
The experiments are carried out on a laptop with an Intel Core i5-7200U CPU @ 2.50 GHz, 8 GB RAM, and a 64-bit Windows 10 operating system. The feature reduction and classification algorithms are implemented in Matlab 2015b.

Dataset description and preprocessing
Nine datasets from the UC Irvine (UCI) machine learning repository are used to evaluate the FrbyCC algorithm [36]. They form a mix of datasets with fewer than 50 features, around 200 features, and more than 700 features. The datasets are preprocessed first: missing values are replaced by the average of the column values, and normalization is applied. FrbyLCE is assessed on the Cleveland heart disease dataset, whereas FrbyCC is evaluated on the large-attribute datasets; large-attribute datasets are selected to show the reduction in features more clearly. The output of FrbyCC is given to different classifiers for evaluation, with tenfold cross-validation applied to assess classifier performance. Table 2 shows the basic information of the datasets.
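A minimal sketch of the missing-value treatment described above, assuming missing entries are encoded as NaN and using toy data, is:

    X = [1 NaN 3; 4 5 NaN; 7 8 9];                 % toy data with missing values
    for j = 1:size(X, 2)
        col = X(:, j);
        col(isnan(col)) = mean(col(~isnan(col)));  % replace missing values by the column average
        X(:, j) = col;
    end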

Experimental evaluation
The proposed algorithm is applied to each dataset for feature reduction. The percent feature reduction measures the effectiveness of the proposed algorithm; as given by equation (4), it is the ratio of the number of dimensions removed by the algorithm to the number of original dimensions in the dataset:

    \%\,\text{feature reduction} = \frac{D_{\text{original}} - D_{\text{reduced}}}{D_{\text{original}}} \times 100.    (4)
The reduced dataset is classified using SVM, K-nearest neighbor (KNN), and decision tree (DT) classifiers. A summary of the effect of the proposed feature reduction algorithm on different datasets is given in Table 3, which shows that the FrbyCC algorithm achieves a >30% reduction for large-attribute datasets. Feature reduction also reduces the storage space and execution time of an algorithm, which is especially useful when processing a dataset containing a large number of attributes. Metrics such as accuracy, specificity, sensitivity, and F-measure are used to assess imbalanced datasets [37]; here, the proposed system is evaluated using the most frequently used metric, accuracy, on the different datasets.
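A minimal Matlab sketch of this evaluation step (Statistics and Machine Learning Toolbox), assuming Xr is the reduced feature matrix from the earlier sketch, placeholder binary labels y, and fitcsvm for a two-class problem (fitcecoc would be needed for multiclass data), is:

    y = randi([0 1], size(Xr, 1), 1);                           % placeholder binary class labels
    mdlTree = fitctree(Xr, y);                                  % decision tree
    mdlKnn  = fitcknn(Xr, y);                                   % K-nearest neighbor
    mdlSvm  = fitcsvm(Xr, y);                                   % SVM (binary labels assumed)
    accTree = 1 - kfoldLoss(crossval(mdlTree, 'KFold', 10));    % tenfold cross-validated accuracy
    accKnn  = 1 - kfoldLoss(crossval(mdlKnn,  'KFold', 10));
    accSvm  = 1 - kfoldLoss(crossval(mdlSvm,  'KFold', 10));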
The FrbyLCE method reduces one attribute per execution of the routine. To reduce five features, FrbyLCE must be applied five times, with the output of each stage fed as input to the next. Figure 2 shows a two-stage FrbyLCE cascade.
After the desired reduction in attributes, a classifier evaluates the results. Two classifiers, DT and SVM, are used to evaluate the proposed method. Table 4 shows the results of the FrbyLCE method with the DT classifier.
Comparing Tables 4 and 5, it is clear that the FrbyLCE method improves the accuracy of the DT classifier. The LCE method combines the two sets of input attributes into a new set of attributes on which the DT classifier classifies more correctly. Hence, as features are reduced one by one, the DT classifier's accuracy increases, and from stage III onward the accuracy remains constant for further reductions, as observed in Table 4. Table 5 also shows that the SVM classifier cannot select good decision boundaries, because the features generated by FrbyLCE are combinations of the two sets of input attributes; here the accuracy decreases as the number of attributes is reduced. Table 6 shows FrbyCC accuracy evaluated with the DT, KNN, and SVM classifiers on the nine benchmark datasets from the UCI repository, with accuracy as the performance measure and the existing PCA feature reduction algorithm as the baseline. The assessment of Table 6 shows that classification performance improves by an average of 8% with the DT classifier and 5% with the SVM classifier, while an 18% average improvement is obtained with the KNN classifier. The accuracy improvement is due purely to the feature extraction done by the FrbyCC algorithm: FrbyCC multiplies and adds the results of the multiplications, so the importance of the original features is maintained in the extracted features. The extracted features are the convolved output of the input features, with the important features still influencing them; despite the feature reduction, the impact of the important features remains in the extracted output. The differences in improvement across classifiers arise from the behavior of each classifier on the respective dataset. This shows that the FrbyCC method is very effective in terms of accuracy, and it reduces many features in a single application on the dataset under consideration. Table 6 reports results with tenfold cross-validation. No drop in classifier accuracy is observed despite the feature reduction, and in comparison with PCA, the FrbyCC method gives improved accuracy.
When FrbyLCE is compared with FrbyCC, FrbyCC turns out to be the more effective method: it reduces a significant number of features in one application, whereas FrbyLCE must be applied in cascade n times to reduce "n" features. To verify accuracy across datasets, FrbyCC is applied to different datasets, such as the Parkinson's, Arrhythmia, Internet ads, QSAR, and SCADI datasets, and compared with the PCA algorithm. As seen in Table 3, the first five datasets are high dimensional, and a reduction in features of up to 60% is obtained with the FrbyCC algorithm. The remaining datasets have fewer dimensions than the first five. Our focus is on high-dimensional datasets, in which the reduction in features leads to decreases in execution time and storage space. The best reduction in features is observed when half of the total number of features is near a power of two (2^p).
The following graphs show each dataset's accuracy with DT, KNN, and SVM, indicating that the classifiers behave differently on each dataset while giving improved accuracy.
It is observed from Figure 3 that DT and KNN work well, showing increased accuracy with the FrbyCC method compared to PCA, whereas SVM gives accuracy in line with the PCA algorithm. Since PCA, like FrbyCC, is a feature extraction technique based on a linear combination of input attributes, both algorithms yield similar accuracy with SVM.
In Matlab, execution time is measured using the tic and toc functions, which report elapsed time in seconds. Table 7 shows that the execution time of the FrbyCC algorithm is less than that of the PCA algorithm, since FrbyCC is implemented using the DFT algorithm. Figures 4-6 show graphical comparisons of the execution times of the FrbyCC and PCA feature reduction algorithms for the different classifiers.
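For reference, a minimal sketch of this measurement, where frbycc is a hypothetical handle to the proposed reduction routine and Xn is the normalized data from the earlier sketch:

    tic;
    Xr = frbycc(Xn);           % hypothetical call to the FrbyCC feature extraction routine
    tFrbyCC = toc;             % elapsed time in seconds

    tic;
    [~, scorePCA] = pca(Xn);   % PCA baseline (Statistics and Machine Learning Toolbox)
    tPCA = toc;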
FrbyCC is also tested with the Naïve Bayes classifier, a probabilistic classifier that relies on the assumption of independence between the input features. Because the FrbyCC algorithm extracts the reduced features by overlapping the original features, this independence assumption is violated, and Naïve Bayes shows poor accuracy on the FrbyCC results.

Conclusion
Feature reduction and accuracy are crucial concerns in data classification, and high data dimensionality is the main issue. This research addresses the problem of handling high-dimensional data. We proposed a feature extraction method based on convolution to reduce features without loss of classification accuracy. The first technique described in this article, feature reduction by LCE, reduces one feature per application and works well with the DT algorithm. The second method, FrbyCC, proves very effective for dimensionality reduction. Experiments show that it works well with DT and KNN; for SVM, the accuracy is in line with PCA for most of the datasets. The feature reduction obtained on each dataset depends on the proximity of its dimensionality to a power of two (2^p). The average increase in accuracy (Δ) achieved with DT, SVM, and KNN is 8.06, 5.80, and 18.80, respectively, on the benchmark datasets. The proposed algorithm reduces execution time through the use of the DFT/IDFT [52]. Overall, with the FrbyCC algorithm, feature reduction delivers good accuracy together with less storage space and execution time.