Identification of Biomarker on Biological and Gene Expression data using Fuzzy Preference Based Rough Set

Abstract: Cancer is fast becoming an alarming cause of human death. However, it has been reported that if the disease is detected at an early stage, diagnosed, and treated appropriately, the patient has a much better chance of long-term survival. Machine learning combined with feature selection contributes greatly to the detection of cancer, because an efficient feature-selection method can remove redundant features. In this paper, a Fuzzy Preference-Based Rough Set (FPRS) blended with a Support Vector Machine (SVM) is applied to predict cancer biomarkers from biological and gene expression datasets. Biomarkers are determined by deploying three models of FPRS, namely Fuzzy Upward Consistency (FUC), Fuzzy Downward Consistency (FLC), and Fuzzy Global Consistency (FGC). The efficiency of the three models with SVM on five datasets is exhibited, and the biomarkers identified by the FUC model are reported.


Introduction
Classification of cancer based on gene expression data has become an attractive research area in the field of bioinformatics. Currently, diagnosis through the recent technique of Fine Needle Aspiration Cytology (FNAC) [1] is not up to the mark, because it has been reported not to possess high-quality diagnostic capability. Numerous studies on cancer classification exist. The methods cover Principal Component Analysis (PCA) [2,3], relief, mutual information and information gain [4], FPRS, and the like. Among these, FPRS is a commonly used feature-selection (FS) method, because its reasoning is simpler than that of a computationally precise system. This conserves computational power, which is an attractive property in real-time systems. Classification is performed by choosing the most compelling genes in order to construct a good classification model. In addition, identifying significant genes greatly reduces the run time of designing a good classification model. However, extracting significant genes from among thousands of genes is a critical issue.
A complete review of FS methods is given in [5]. Depending on how the features interact with the classification model, FS methods can be divided into three classes: filter, wrapper, and embedded methods. Filter methods [6] evaluate the relevance of features by observing the intrinsic properties of the data; they are fast and computationally less expensive than wrapper methods. Wrapper methods [7] select a subset of features by observing the performance of a classification model; however, they have a great chance of overfitting and are computationally more expensive than filter methods. In contrast, embedded methods [8] make use of the internal parameters of the classification model, which greatly reduces the computational cost of FS. In this context, the FS technique and the supervised classifier play an important role in selecting significant gene markers for cancer diagnosis. These techniques avoid the errors reported for FNAC, because they examine the dataset in less time.
Several studies to identify genes have been carried out. In [9], researchers introduced a distributed FS method using symmetrical uncertainty and the Multi-Layer Perceptron (MLP) classifier: features are distributed across multiple clusters, the MLP classifier is applied to each cluster, and the cluster with the highest classification accuracy and lowest root mean square error is selected. In [10], local search methods, sequential backward selection, and maximization of mutual information are applied to gene expression datasets to obtain the most relevant and non-redundant genes. Thereafter, global optimization techniques such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) [11] are used to improve the classification accuracy and computational time. In [12], wavelet least squares support vector regression combined with the imperialist competition algorithm is applied to assess incipient faults of transformer polymer insulation. In [13], Binary Differential Evolution (BDE) algorithms with a rank-based filter FS method, BDEX Rank and BDEX Rankf, are applied to gene expression datasets; both algorithms join the filter with the wrapper in a two-stage algorithm. In [14], Binary Particle Swarm Optimization (BPSO) with gene-to-class sensitivity encoding is used to extract the important features from five microarray datasets, with the Extreme Learning Machine (ELM) functioning as the classifier. Lately, bi-clustering has become very popular for identifying relevant genes in gene expression data. In [15], a bi-clustering algorithm based on formal concept analysis is shown to be an efficient methodology for bi-clustering binary data.
The said algorithm is applied to recognize groups of genes that are relevant under a subset of samples. The article [16] demonstrated a deterministic initialization algorithm for the k-means algorithm that identifies a set of clusters through a bi-partitioning approach; the results are promising in terms of computational time and stable convergence. In this paper, SVM [17] coupled with FPRS [18] is employed for informative gene selection from the datasets and for diagnosing cancer. The kernel in SVM implicitly performs a nonlinear transformation, with no assumptions about its functional form, that makes the data linearly separable; in this way, SVM extends itself to classify linearly inseparable data. With an appropriate choice of parameter values, SVMs can be robust even when the training sample has some bias. The rough set model is designed on the basis of equivalence relations, which is one of its main drawbacks when the model is applied to complex decision-making problems. The fuzzy preference relation, in contrast, captures the degree of preference quantitatively, making it more powerful than the equivalence relation for collecting information from fuzzy data. This inspires us to use the FPRS technique for gene selection. The details of the three models of the FPRS method are explained in [19]. In this work, we have considered the Wisconsin Breast Cancer Dataset (WBCD), leukaemia, prostate, Diffuse Large B-cell Lymphoma (DLBCL), and Mixed-Lineage Leukaemia (MLL) datasets to conduct the experiment. The observed results show that our proposed method selects significant genes for cancer classification. The statistical analysis of the datasets further enlightens cancer research. Research on the application of data mining techniques to diagnose the outcome of the disease has recently been encouraging [20].

Support Vector Machine
SVM is a well-known, high-performance supervised learning algorithm [21] that follows the concepts of statistical learning theory [22]. Currently, it is applied in many fields, including remote sensing [23], biological data analysis [24], natural language processing, marine appliances, and so on. In the linearly separable binary case, the goal is to design a hyperplane that classifies all training vectors into two classes. The best selection is the hyperplane that maintains the maximum margin from both classes. When the data are not linearly separable [25], SVM performs a mapping to transfer the data from the input space to a higher-dimensional feature space. The upper bound of the generalization error can be reduced by providing the largest possible distance between the separating hyperplane and the samples on either side of it. The decision function of SVM is f(x) = sgn(ω^T φ(x) + b), where φ(x) denotes the mapping of the sample x from the input space to the higher-dimensional feature space. The SVM finds an optimal separating hyperplane by maximizing the separating margin between the two classes of data.
The optimal values of ω and b are obtained by solving the following optimization problem:

Minimize (1/2)‖ω‖^2 + C Σ_i ξ_i, subject to y_i(ω^T φ(x_i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

Here, C is the regularization parameter, and ξ_i is the i-th slack variable. The details of SVM are elaborated in [25].
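As a runnable illustration of this formulation, the sketch below trains a soft-margin SVM with an RBF kernel on synthetic two-class data using scikit-learn's SVC (which wraps the LIBSVM library also used in our experiments); the toy data, kernel choice, and value of C are illustrative assumptions, not the paper's actual setup.

```python
# Minimal soft-margin SVM sketch (assumed setup, not the paper's datasets).
# C below is the regularization parameter from the objective above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two Gaussian clouds standing in for a bi-class gene-expression dataset.
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(3.0, 1.0, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

# The RBF kernel plays the role of the implicit mapping phi(x).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
train_acc = clf.score(X, y)  # fraction of training vectors classified correctly
```

Because the two clouds are well separated, the fitted margin classifies nearly all training vectors correctly; increasing C narrows the margin, while decreasing it tolerates more slack.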

Feature-selection
Feature-selection methods can be applied to maximize the performance of the classifier in cancer data analysis [26]. Feature selection is a useful technique for dimensionality reduction. In classification, it is used to find an optimal subset of relevant features so that the overall accuracy is increased while the size of the dataset is reduced. When a classification problem is defined by features, the number of features can be quite large, and many of them can be irrelevant. A relevant feature can increase the performance of a classifier, while an irrelevant feature can deteriorate it. Therefore, in order to select the relevant features, it is necessary to measure the appropriateness of selected features by using a feature-selection criterion. For example, the clinical classification accuracy of a dermatologist in the diagnosis of malignant melanomas is between 65% and 85%, whereas, after the application of a feature-selection algorithm, the accuracy increases to more than 95% in automated skin tumor identification systems [27]. Normally, the feature values are real-valued or categorical. Hence, feature-selection algorithms such as the traditional Rough Set (RS) theory encounter problems. This problem can be efficiently tackled by using fuzzy rough set theory, in which the membership values lie in the range [0, 1]. This allows for a higher degree of flexibility than the crisp rough set theory, which deals only with membership values of one and zero. Another kind of feature-selection algorithm deals with continuous decision output to handle regression problems. FPRS can be applied to classification as well as to regression problems.
Hence, in our proposed methodology, we have exploited FPRS for feature (gene) selection. Finding informative genes by using FS [28] algorithms is an important aspect of diagnosing cancer. The dimensionality is first reduced in order to reduce the computational cost [29]. In addition, noise is reduced to improve prediction accuracy.
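To make the filter idea above concrete, the following sketch plants two informative features among pure noise and shows that a simple univariate filter recovers them. This is a generic stand-in for illustration only, not the FPRS selectors used in our method, and all data here are synthetic.

```python
# Generic filter-style feature selection: score features independently,
# keep the top k. Synthetic data; features 0 and 1 carry the class signal.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
signal = y[:, None].astype(float) * 2.0          # class-dependent mean shift
informative = signal + rng.normal(0.0, 0.5, (60, 2))
noise = rng.normal(0.0, 1.0, (60, 98))           # 98 irrelevant "genes"
X = np.hstack([informative, noise])

selector = SelectKBest(f_classif, k=2).fit(X, y)
picked = np.flatnonzero(selector.get_support())  # indices of selected features
```

Because each feature is scored in isolation, this filter is fast but cannot detect redundancy between features, which is exactly the weakness that rough-set-based selectors address.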

Fuzzy Preference Based Rough Set
Analyzing preferences is a challenging task in the decision-making process. When the fuzzy preference relation is blended with the rough set, the result is known as a fuzzy preference-based rough set. In this paper, we have used the FPRS method to identify the preference relation in order to aggregate the features [30]. To evaluate the robustness of our proposed methodology, we have conducted our experiments on the WBCD, leukaemia, prostate, DLBCL, and MLL datasets. Noteworthy genes are selected by the three selectors (FUC, FLC, and FGC) of the FPRS feature-selection method. Initially, each dataset is partitioned into 50% for training and 50% for testing, and gene selection with the three models of FPRS is performed on the 50% training set. Subsequently, the whole datasets are split into three training-test partitions: 80-20%, 70-30%, and 50-50%. SVM blended with the three models of FPRS is employed on each partition of the five datasets to justify the performance of our proposed technique.
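The partitioning protocol above can be sketched as follows; synthetic data and default SVM settings stand in for the real gene-expression matrices and the FPRS-selected genes.

```python
# Evaluate an SVM under the three training-test partitions used in the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Placeholder two-class data; in the paper this would be FPRS-selected genes.
X = np.vstack([rng.normal(0.0, 1.0, (60, 20)), rng.normal(2.0, 1.0, (60, 20))])
y = np.repeat([0, 1], 60)

accs = {}
for test_frac in (0.2, 0.3, 0.5):              # 80-20%, 70-30%, 50-50%
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=0)
    accs[test_frac] = SVC().fit(X_tr, y_tr).score(X_te, y_te)
```

Stratified splitting keeps the class proportions equal in the training and test sets, which matters for the small-sample, imbalanced datasets typical of gene expression studies.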
Our proposed methodology is depicted in Figure 1. A fuzzy preference relation can be explained as follows: it is a fuzzy set on the product set v × v, described by a membership function µ_r : v × v → [0, 1]. The fuzzy preference relation can be expressed by an m × m matrix (r_ij)_{m×m}, in which r_ij represents the preference of a_i over a_j. r_ij = 1/2 indicates that a_i and a_j are indifferent, r_ij > 1/2 indicates that a_i is preferred to a_j, r_ij = 1 indicates that a_i is absolutely preferred to a_j, and r_ij < 1/2 indicates that a_j is preferred to a_i. The preference matrix satisfies r_ij + r_ji = 1 for all i, j ∈ {1 . . . m}, where the cardinality of v is finite. We consider v to be a universe of a finite number of objects, v = {a_1, a_2, . . . , a_m}. The feature value of any object a is denoted by f(a, I), where I is a feature of the object. The upward and downward fuzzy preference relations over v are

r_ij^> = 1 / (1 + exp(−β(f(a_i, I) − f(a_j, I)))) and r_ij^< = 1 / (1 + exp(β(f(a_i, I) − f(a_j, I)))),

where β is a positive constant. In FPRS, we are required to know the decision or conclusion with respect to the available criteria. We can also learn which criteria are important to decision-making and which criteria are unwanted or redundant.
Currently, researchers mostly use the RS theory as a numerical tool to cope with the problems of uncertainty and incompleteness. In [31], the RS model is defined by the concept of fuzzy relations and fuzzy operators (max and min).
The RS theory is superior to multiple regression because it does not require any previous knowledge about the data under consideration. The fuzzy preference relation is a special case of fuzzy relations. The rough set FS method collects information from both the distance metric and the lower-approximation dependency value; it also considers the total number of objects in the boundary region and the distance of those objects from the lower approximation. Though it is a very efficient mining technique, the rough set suffers from the heavy computation of either the discernibility function or the positive region needed to find the attribute reduction. To overcome this problem, the rough set needs to be blended with fuzzy sets for feature selection. The fuzzy preference relation, when blended with the rough set, is known as FPRS, and this relation has been modeled to measure fuzzy preference. Let an information system be (U, F), where U = {a_1 . . . a_n} is a nonempty finite set of objects, and F = {F_1 . . . F_N} is a finite set of features used to classify the objects.
A decision table is defined by DT = (U, C, D), where the set of features is grouped into conditions (C) and decisions (D). The condition features are given, and the task is to classify the objects. The decision about each object of U is then predicted approximately.
The N decision class labels can be represented as D = {d_1, d_2, . . . , d_N}. In the RS theory, the fundamental operations are the lower approximation and the upper approximation. It is assumed that R> and R< are the fuzzy preference relations generated by P ⊆ C. The fuzzy preference approximation quality [32] of the decision D in terms of P is the ratio of the fuzzy positive region to the universe, γ_P(D) = |POS_P(D)| / |U|.
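A minimal sketch of the upward and downward fuzzy preference relations defined above, r_ij^> = 1 / (1 + exp(−β(f(a_i, I) − f(a_j, I)))); the feature values and β here are illustrative choices.

```python
# Fuzzy preference matrices for a single feature, following the
# sigmoid-style definition above. beta and the values are illustrative.
import numpy as np

def fuzzy_preference(values, beta=1.0, upward=True):
    """Return the m x m fuzzy preference matrix for one feature."""
    v = np.asarray(values, dtype=float)
    diff = v[:, None] - v[None, :]      # f(a_i, I) - f(a_j, I)
    if not upward:
        diff = -diff                    # downward relation flips the sign
    return 1.0 / (1.0 + np.exp(-beta * diff))

r_up = fuzzy_preference([0.1, 0.5, 0.9])
# The diagonal is exactly 1/2 (an object is indifferent to itself),
# and r_ij + r_ji = 1 for every pair, as required of a preference matrix.
```

Larger β makes the relation approach a crisp ordering, while small β keeps the preference degrees close to 1/2; this is the parameter choice flagged as a limitation in the conclusion.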

Experimental results and discussion
In our experiments, we have first selected relevant features by using the FUC, FLC, and FGC models of the FPRS method. Then the SVM classifier is applied to each partition of all the datasets, using the genes obtained from each model of the FPRS method. The characteristic of cancer datasets is a small sample size with a huge number of attributes (often hundreds or thousands of dimensions). Hence, the datasets contain considerable inconsistency due to redundancy and noise in the data. The rough set provides an effective tool to deal with inconsistent data.

Dataset Description
Here, we have used three publicly available bi-class datasets, one multiclass dataset, and one biological dataset to conduct the experiment. The datasets are available in [33]. The details of each dataset are described below and presented in Table 1.

Results and Discussion
The results obtained from the proposed approaches are presented in this section. The three models of the FPRS method are compared to assess the effects of the FUC model along with the FLC and FGC models. To identify the best approaches, all the approaches are compared in Table 2 in terms of classification accuracy. Results are compared across different partitions, 80-20%, 70-30%, and 50-50%, on five datasets. In Table 3, we have presented the accuracy on the whole datasets (without FS) using the SVM classifier with a 50-50% training-test partition; the observed results are very poor in terms of accuracy. The significant genes obtained from our proposed methods are listed in Table 4. The effectiveness of our proposed method is demonstrated on five datasets. It is also clear from the table that some biomarkers (marked in bold face) identified by our method are reported in other literature (described in the supplementary copy); hence, those biomarkers have a great impact in causing cancer. The rest are informative genes, as our method achieves good classification accuracy as well as good statistical measurements with them. It can be observed from Table 5 that there are some overlapping genes between the different models of the FPRS method; for example, the three FPRS models FUC, FLC, and FGC all selected two common genes from the leukaemia dataset. An experimental study on the five datasets is performed using statistical measurements, and the results are reported in Figure 2. In Figure 2(a-d), the WBCD dataset shows equal values (100%) for specificity and sensitivity, while the leukaemia, prostate, and MLL datasets show statistical measurements of ≥ 95%. FUC obtains an Area Under the Curve (AUC) value of 1 for the WBCD dataset; for the other datasets, the AUC value almost reaches 1.
It can be concluded from the figure that the FUC model provides the best statistical measurements compared with the FLC and FGC models on four of the five datasets. Moreover, the boxplot results for the three models are depicted in Figure 3(a-e). The training and test sets are randomly partitioned, and classification using SVM is performed to obtain the boxplot results. From the figure, it is clear that FUC surpasses the other two models, because it is placed in the topmost position. The lists of genes selected by our proposed FUC model are reported in Table 4. The experiment is implemented in MATLAB using the LibSVM software developed by Chang and Lin [34], and is performed on an Intel Core i5 2430M CPU (2.46 GHz) with 4 GB of RAM.
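The statistical measurements reported in Figure 2 (sensitivity, specificity, and AUC) can be computed as below; the labels, predictions, and scores here are invented for illustration, not taken from our experiments.

```python
# Compute sensitivity, specificity, and AUC from invented predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # ground-truth classes
y_pred  = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # hard classifier decisions
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3])  # soft scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
auc = roc_auc_score(y_true, y_score)
```

Sensitivity and specificity are computed from hard decisions, whereas AUC ranks the soft decision values; an AUC of 1, as FUC obtains on WBCD, means every positive sample is scored above every negative one.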

Comparative study
The investigation of the performance of our proposed approach in terms of classification accuracy is summarized in Table 6. Our proposed method produces good results for all the datasets, with good classification accuracy. However, it is difficult to compare our method further with those listed in the said table, as the results are promising in most of the published works. For the MLL dataset, it is difficult to find references for comparison; with our proposed method, the classification accuracy is 95.67% with nine genes. It can be stated that the proposed method cannot outperform all the existing methods; however, it surpasses some of the published articles. The classification accuracies (%) reported by the compared methods are:

J4, MLP (PCA) [35]: 97.56
Decision Tree [36]: 96.14
SVM, Artificial Neural Network [37]: 97.00
Decision tree with feature-selection [38]: 97.85
SVM [39]: 94.54
LEM2, Rough set reduction method [40]: 96.40
F_score with SVM [41]: 99.51
PCA + MLP [42]: 94.42, 91.60
GA-PCA & CCA [43]: 88.23
VVRKFA [44]: 94.81, 93, 88.97
DRF0+SVM/KNN [45]: 94.12, 97.06, 94.67
DRF+IG+SVM [45]: 97.06

Conclusion
The FUC model is employed as a gene-selection method. The proposed model evaluates the significance of genes with respect to certain criteria. By using the RS theory, the proposed method yields information about the criteria that are important for the conclusion as well as about those that are redundant or unwanted. Four gene-expression datasets and one biological dataset are investigated by the proposed method, and the biomarkers and informative genes are identified. A comparative study establishes the superiority of the proposed model over some of the published articles. The final goals of our proposed method are accurate tumor identification and cancer diagnosis, which are made possible by finding biomarkers; this is plausible, as our proposed method is capable of extracting some biomarkers that are reported in the literature. However, the FPRS method has some limitations in the fuzzy preference-based approximation. Firstly, a function is used to compute the preference degrees of objects, but there are many candidate functions (and choices of their parameters) for measuring preferences. Secondly, the user should provide information on whether monotonic relations persist between the attributes and the decision; however, it is difficult for the user to provide such information in applications, and an algorithm is required to provide it. Finally, the membership of a consistent sample in the upper approximation may be less than that in the lower approximation, which is not compatible with the definition of fuzzy approximation. Moreover, a limitation of SVM is the lack of transparency of its results.
Hence, the method can be used as an alternative means for the diagnosis of cancer. For future work, it might be interesting to exploit the imperialist competition algorithm to measure the efficiency of our model.