Analyzing high dimensional correlated data using feature ranking and classifiers

Abstract: The Illumina Infinium HumanMethylation27 (Illumina 27K) BeadChip assay is a relatively recent high-throughput technology that allows over 27,000 CpGs to be assayed. Illumina 27K methylation data are less commonly used in bioinformatics than gene expression data, so there is a critical need to find the optimal feature ranking (FR) method for handling such high dimensional data. The FR method that works best with a given classifier is not well known, and choosing the best performing FR method becomes more challenging in the high dimensional setting. Therefore, identifying the statistical methods that boost inference is of crucial importance in this context. This paper describes in detail the performance of FR methods such as Fisher score, information gain, chi-square, and minimum redundancy maximum relevance on different classification methods such as AdaBoost, Random Forest, Naive Bayes, and Support Vector Machines. Through a simulation study and real data applications, we show that the Fisher score as an FR method, when applied with all the classifiers, achieved the best prediction accuracy with a significantly smaller number of ranked features.


Introduction
DNA methylation (DNAm) is an important epigenetic mechanism [1] involving direct modification of DNA, and abnormal DNAm has been implicated in the formation of diseases [2,3]. DNAm is also involved in the regulation of genes, genetic re-programming [4,5], cell differentiation [6] and gene expression [7]. DNAm is a process in which a methyl group is added to the fifth carbon of the cytosine ring of a DNA molecule; the presence of 5-methylcytosine may change the activity around the DNA sequence. The methyl group can alter the transcription of genes [8] and may lead to tumor progression [9]. DNAm changes are linked with different types of diseases, including neurological disorders, cardiovascular disease, and cancer [10,11,12]. Hence, studies on DNAm can also help in biomarker identification and disease classification [13,14].
With the advent of high-throughput technology [15], a large number of Illumina Infinium HumanMethylation27k (Illumina 27k) data sets have been amassed and made publicly available [16,17]. Most of these data can be found in the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). The Infinium data sets assay over 27,000 CpGs, which are called CpG sites [18]. Illumina Infinium HumanMethylation450k data sets, which assay over 450,000 CpGs and are ultra-high dimensional in nature, are also being generated.
In this study, we focus on Illumina 27k data, which are high dimensional DNAm data. In the results of the BeadChip arrays, the methylation status, indicating methylated and unmethylated DNA sequence, is measured using β-values. For each specific CpG site, the β-value is calculated from the levels of the methylated and unmethylated alleles as the ratio of fluorescent signals [15,19],
$$\beta = \frac{M}{U + M},$$
where the β-value lies between 0 (no methylation) and 1 (fully methylated), U is the fluorescent signal from an unmethylated allele, and M is that of a methylated allele.
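As a minimal sketch of the β-value computation, assuming β is the methylated share of the total signal: the function name and the `offset` parameter are illustrative assumptions (they do not come from the text above). `offset` stands in for the small constant that some pipelines add to the denominator to stabilize low-intensity probes; the default of 0.0 gives the plain ratio.

```python
def beta_value(m: float, u: float, offset: float = 0.0) -> float:
    """Methylation beta-value from methylated (m) and unmethylated (u) signals.

    `offset` is a hypothetical stabilizing constant; 0.0 gives the plain ratio.
    """
    m = max(m, 0.0)  # negative intensities are clipped to zero
    u = max(u, 0.0)
    total = m + u + offset
    if total == 0.0:
        raise ValueError("both signals are zero; beta is undefined")
    return m / total


print(beta_value(3000.0, 1000.0))  # 0.75
```

A fully unmethylated site (m = 0) gives 0 and a fully methylated site (u = 0, offset = 0) gives 1, matching the range described above.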
There are relatively few past studies on the statistical analysis of DNAm data sets generated from the Illumina 27k platform. Existing studies include unsupervised clustering and normalization [20,21] and supervised feature ranking and classification methods [22,23,19,24,25]. Feature ranking and classification have been extensively utilized in microarray gene expression studies [26,27]. The key difference between DNAm data and gene expression data is that DNAm values are continuous variables ranging between 0 and 1. The other difference is that there is a group structure within a gene: in DNAm data, there are about 1-22 CpG sites per gene, and the methylation levels of CpG sites within a gene are highly correlated.
In this article, we apply the most widely utilized feature ranking methods and classifiers to DNAm data. Feature ranking is a method employed to reduce the original dimension of the data to a lower dimension by removing noisy or unimportant features. Data in the fields of data mining and bioinformatics [28,29], such as microarray analysis [30,31,32], RNA-seq analysis [33], and mass spectra analysis, are often high-dimensional in nature, with the number of features exceeding the number of samples. Dealing with high dimensional data is a challenging task, and there is a need to reduce the dimension of the data using feature ranking techniques [34,35,36,37,38,39,40]. The two traditional ways to deal with high-dimensional data in biological applications are dimensionality reduction and feature ranking. The goal of dimensionality reduction is to reduce the dimension of the original feature space by transforming it into a new feature space. Some of the most commonly used dimensionality reduction methods are principal component analysis [41] and linear discriminant analysis [42]. The aim of feature ranking is to select the significantly important features without modifying the original features. Popular feature ranking methods in the literature include information gain [43], Fisher score [44], and minimum redundancy maximum relevance [45]. Both the dimensionality reduction and the feature ranking approaches are computationally efficient and are capable of improving model performance. In this paper, we focus on feature ranking methods, as they preserve the original structure of the variables, whereas with dimensionality reduction the new features are transformations with no direct meaning and cannot be used for further analysis.
In the context of DNAm, when feature ranking methods are employed, the subset of the most important CpG sites having an effect on the response variable is chosen and the original CpG sites are left unaltered. These important CpG sites can be used by biomedical researchers for further interpretation.
The overall purpose of this study is to compare the performance of popularly used feature ranking and classification methods on DNAm data using various performance metrics. To address these goals, we make use of a simulation study and two experimental Illumina 27k DNAm datasets obtained from GEO.
The remainder of the article is organized as follows. In Section 2, the simulation and experimental data setup, the feature ranking and classification methods, and the performance metrics are explained in detail. In Section 3, we show the results of comparing the FR methods and classifiers through simulation and experimental studies. We discuss our findings briefly in Section 4, and Section 5 concludes the paper.

Materials and Methods
In this section, we describe the construction of the datasets, both synthetic and real. This is followed by a description of the feature ranking and classification methods and, finally, the definitions of the performance metrics applied.

2.1. Data Collection
We demonstrate, by performing a simulation study and using real datasets, that the proposed procedure on DNA methylation can lead to better prediction than existing methods that ignore the association of these feature ranking methods.
DNAm data generated using the Illumina 27K array have a large number of features representing the CpG sites and a much smaller number of samples representing patients. This type of data is defined as a high dimensional problem, p >> n, where p is the total number of CpG sites and n is the total number of samples.
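DNAm data with p CpG sites and n samples can be represented by the following matrix:
$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix},$$
where $X_i$ is a row vector that represents the DNAm levels of sample i, $X_{ij}$ is the methylation level of CpG site j of sample i, and all the values of $X_i$ lie in [0,1].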

Simulation data
In order to corroborate the results from the real datasets, we include synthetic datasets generated by simulation models that mimic the real data. To do so, we have to generate group-correlated variables which lie between 0 and 1 inclusive. In our case, we have 475 genes in total, each consisting of 1-12 CpG sites, such that the first group has 100 genes with one CpG site each, the second group has 75 genes with two CpG sites each, and each of the subsequent ten groups has 30 genes with 3-12 CpG sites, respectively. We therefore end up with a total of 2500 CpG sites representing 12 different groups. These CpG sites are formed using the inverse logit transformation of multivariate normal random variables. Thus, the methylation β-values of the g-th gene for the i-th individual are calculated as
$$\beta_{i,g} = \frac{\exp(t_{i,g})}{1 + \exp(t_{i,g})},$$
where $t_{i,g} \sim \sqrt{p}\, N_{f_g}(\mu, \Sigma)$, p is a scale parameter and $f_g$ is the size of the g-th gene. For example, a gene belonging to group 5 has $f_g = 5$ CpG sites. We set µ = (− . , ....., − . ) and p = such that the methylation values have 0 as no methylation and 1 as completely methylated.
In order to generate the response variable, we assigned some of the variables as true important variables, that is, those that are related to the cases (disease related sites). We did so by first selecting one gene from each of the 12 groups, leading to 78 CpG sites, with each group contributing its corresponding number of CpG sites; these are stored as the CpG set. Mathematically, the true regression coefficient for the gene [19,46] are as follows; CpG set is defined as follows, , for all k = , ....., pg; g = , ....., and δ = .
Finally, the resulting response variable was simulated according to a Bernoulli distribution with the following model-based probabilities, where ) T and x i = (θ T ( ) , ....θ T ( ) ) T The above procedure for generating the response variable was repeated until we obtained balanced class labels, i.e., an equal number of binary class labels '0' and '1'.
DNAm data are known for having strong correlation structures. The covariance matrix within genes is defined by $\Sigma_{uv} = \rho^{|u-v|}$, i.e., the correlation structure of the models is AR(1). We simulated three different types of data based on different correlation structures: simulation scenario S1 has low correlation ρ = 0.2, S2 has medium correlation ρ = 0.5, and S3 has high correlation ρ = 0.8. For each simulation scenario, samples were generated until we had 100 controls and 100 cases. Therefore, in each scenario, the total number of samples was 200 and the total number of CpG sites, referred to as features, was 2500. For each scenario, we repeated the simulation 100 times.
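The within-gene part of this simulation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the values `mu = -1.0` and `scale = 4.0` are placeholders, since the settings used in the study are not recoverable from the text. The AR(1) covariance ρ^|u−v| is obtained with the standard recursion z_k = ρ z_{k−1} + sqrt(1 − ρ²) e_k.

```python
import math
import random


def ar1_normal(size, rho, rng):
    """Draw one zero-mean vector whose covariance is rho ** |u - v| (AR(1))."""
    z = [rng.gauss(0.0, 1.0)]
    for _ in range(size - 1):
        z.append(rho * z[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return z


def simulate_gene(size, rho, mu, scale, rng):
    """Beta-values for one gene: inverse logit of sqrt(scale) * N(mu, Sigma)."""
    t = [math.sqrt(scale) * (mu + z) for z in ar1_normal(size, rho, rng)]
    return [1.0 / (1.0 + math.exp(-v)) for v in t]


rng = random.Random(0)
# mu = -1.0 and scale = 4.0 are placeholder values, not the paper's settings.
betas = simulate_gene(size=5, rho=0.8, mu=-1.0, scale=4.0, rng=rng)
print(betas)
```

The inverse logit keeps every simulated value strictly inside (0, 1), and neighbouring CpG sites within the gene are more strongly correlated than distant ones.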

Experimental data
For the real datasets, we downloaded two DNAm datasets generated with the Illumina 27K platform from the NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/). To compare the performance of the FR methods and classifiers, we used the data sets with accession numbers GSE26126 and GSE32393, designated as D1 and D2, respectively, in the remainder of the article. D1 is a genome-wide DNAm profiling of normal and tumor prostate samples, as well as cultured primary prostate cells overexpressing DNA methyltransferases and EZH2. It involved the quantitative profiling of 95 primary prostate tumors and 98 healthy prostate tissue samples for their DNA methylation levels at 27,578 CpGs representing 14,104 gene promoters using the Illumina 27K platform. That is, it has 193 samples in total and the data are balanced, with an almost equal number of cases and controls.
D2 is an epigenome analysis of breast tissue from women with and without cancer. The DNA methylation profiles across 27,578 CpGs in breast tissues from women with and without breast cancer were obtained using the Illumina 27K platform. Breast tissue samples were drawn from 114 breast cancer and 23 non-neoplastic breast tissues. That is, it has 137 samples in total and the data are imbalanced, with almost five times as many cases as controls.
The real datasets were normalized using the Linear Models for Microarray and RNA-seq Data (limma) package. The "normalizeBetweenArrays" function in R performs quantile and cyclic loess normalization, which was originally proposed by [47] for Affymetrix-style arrays. It forces the entire empirical distribution of each column to be identical to achieve consistency between arrays. Pre-processing procedures were then performed to handle these high-dimensional DNAm data. As a first filtering step to remove redundant noisy CpG sites, marginal maximum likelihood estimation (MMLE) was performed, and the top 5000 CpG sites were ranked based on their level of significance. These ranked significant CpG sites were then passed to the feature ranking methods.
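To illustrate what "forcing the empirical distribution of each column to be identical" means, here is a minimal pure-Python sketch of quantile normalization. It is not limma's implementation (ties and missing values are ignored here), and the function name is hypothetical.

```python
def quantile_normalize(columns):
    """Force every column (array) to share the same empirical distribution."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across arrays of the r-th smallest value.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(arrays))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization each array contains the same set of values, and only the order within an array (which probe is high, which is low) is preserved.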

2.2. Feature Ranking methods
Feature ranking (FR) methods are classified as supervised [35] or unsupervised [36] based on the presence of class labels. The simulation and DNAm datasets used in this paper have labels and hence fall under supervised FR. We used filter-based FR methods, namely Fisher Score [37], Information Gain [38], MRMR [39], and Chi-Square [40]. These methods measure different characteristics of the data, such as information, correlation structure, and the distance and dependency between the variables and labels.

Fisher Score
Fisher Score (F-score) [37] is one of the most popular FR methods used to select important features, which can then be used for classification. To achieve this, discriminative and statistical models are used. The general principle for selecting features is that the distance between data points in the same class should be small and the distance between data points in different classes should be large. With this intuition, the Fisher score for each feature $f_i$ is calculated as
$$F_i = \frac{\sum_{j} n_j (\mu_{ij} - \mu_i)^2}{\sum_{j} n_j \rho_{ij}^2}, \qquad (6)$$
where the sums run over the classes j, $F_i$ is the calculated score for the feature $f_i$, $n_j$ is the number of instances in class j, $\mu_i$ is the mean of the feature $f_i$, and $\mu_{ij}$ and $\rho_{ij}^2$ are the mean and variance of the feature $f_i$ in class j, respectively.
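A minimal sketch of this score for a single feature is given below. The function name is hypothetical, and the use of the population variance within each class is an assumption; the text does not say which variance estimator was used.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature: between-class scatter over within-class scatter."""
    overall_mean = fmean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        members = [v for v, y in zip(values, labels) if y == cls]
        n_j = len(members)
        between += n_j * (fmean(members) - overall_mean) ** 2
        within += n_j * pvariance(members)  # assumes the feature varies within a class
    return between / within


labels = [0, 0, 0, 1, 1, 1]
separated = [0.10, 0.12, 0.14, 0.80, 0.82, 0.84]    # classes far apart
overlapping = [0.10, 0.80, 0.45, 0.12, 0.82, 0.47]  # classes mixed together
print(fisher_score(separated, labels) > fisher_score(overlapping, labels))  # True
```

Ranking all CpG sites by this score in decreasing order and keeping the top k gives the ranked feature subsets that are passed to the classifiers.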

Information Gain
Information Gain (IG) measures the amount of information a feature gives us about the class [38]. It is one of the most widely used FR methods because of its computational efficiency and simple interpretation. Unrelated features give no information, and features that perfectly divide the classes give maximal information. This measure of purity is called information, and the measure of impurity is called entropy. The disadvantage of information gain is that it does not work well when there are redundant features.
The IG between the i-th feature $X_i$ and the response labels Y is given as
$$IG(X_i, Y) = H(X_i) - H(X_i \mid Y),$$
where $H(X_i)$ is the entropy of $X_i$ and $H(X_i \mid Y)$ is the entropy of $X_i$ given Y.
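The definition can be sketched directly in code. The example below assumes the feature has already been discretized into bins (continuous β-values would need such a step, which is an assumption here and not described in the text); the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def information_gain(feature_bins, labels):
    """IG(X, Y) = H(X) - H(X | Y) for a discretized feature X and labels Y."""
    n = len(labels)
    conditional = 0.0
    for cls, count in Counter(labels).items():
        subset = [x for x, y in zip(feature_bins, labels) if y == cls]
        conditional += (count / n) * entropy(subset)
    return entropy(feature_bins) - conditional


labels = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]    # bin tracks the class exactly
uninformative = ["low", "high", "low", "high"]  # bin is unrelated to the class
print(information_gain(informative, labels), information_gain(uninformative, labels))
```

The feature that separates the classes perfectly attains the maximum gain of one bit, while the unrelated feature gains nothing.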

Minimum Redundancy Maximum Relevance
Minimum Redundancy Maximum Relevance (MRMR) is an FR method in which features are selected by ranking them based on the maximal statistical dependency criterion [39]. The true important features are selected such that there is minimum redundancy among the features and maximum relevance to the target variable. MRMR is also called a mutual information based method, as mutual information is used to determine both redundancy and relevance. The mutual information is defined as
$$I(X, Y) = \iint P(X, Y) \log \frac{P(X, Y)}{P(X)\,P(Y)} \, dX \, dY,$$
where X and Y are vectors, P(X, Y) is the joint probability density, and P(X) and P(Y) are the marginal probability densities.
The MRMR criterion for discrete variables is given by
$$\max_{i \notin S} \left[ I(i, h) - \frac{1}{|S|} \sum_{j \in S} I(i, j) \right],$$
and the MRMR criterion for continuous variables is given by
$$\max_{i \notin S} \left[ F(i, h) - \frac{1}{|S|} \sum_{j \in S} |C(i, j)| \right],$$
where S is the set of selected features, I(i, j) and C(i, j) are the mutual information and the correlation between features $f_i$ and $f_j$, respectively, h is the target class and F(i, h) is the F-statistic.
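A greedy ranking for the continuous case can be sketched as follows, assuming the difference form of the criterion (F-statistic relevance minus mean absolute correlation with the already selected features). Whether the study used the difference or the quotient form is not stated, and all function names are hypothetical.

```python
from statistics import fmean


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a feature against the class labels."""
    classes = sorted(set(labels))
    grand = fmean(values)
    between = within = 0.0
    for cls in classes:
        grp = [v for v, y in zip(values, labels) if y == cls]
        m = fmean(grp)
        between += len(grp) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in grp)
    return (between / (len(classes) - 1)) / (within / (len(values) - len(classes)))


def abs_correlation(a, b):
    """Absolute Pearson correlation between two features."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return abs(cov) / (var_a * var_b) ** 0.5


def mrmr_rank(features, labels, k):
    """Greedy ranking: relevance F(i, h) minus mean |C(i, j)| over selected j."""
    relevance = {name: f_statistic(col, labels) for name, col in features.items()}
    selected = []
    while len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = fmean(
                [abs_correlation(features[name], features[s]) for s in selected]
            )
            return relevance[name] - redundancy
        remaining = [n for n in features if n not in selected]
        selected.append(max(remaining, key=score))
    return selected


labels = [0, 0, 1, 1]
features = {
    "a": [0.0, 1.0, 1.0, 2.0],
    "a_copy": [0.0, 1.0, 1.0, 2.0],  # same relevance as "a", fully redundant with it
    "b": [1.0, 0.0, 2.0, 1.0],       # same relevance as "a", uncorrelated with it
}
print(mrmr_rank(features, labels, k=2))  # ['a', 'b']
```

All three toy features are equally relevant, so the second pick is decided purely by redundancy: the exact copy is penalized and the uncorrelated feature is chosen.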

Chi-square
Chi-square test (Chi2) is another popular FR method. It is a statistical test of independence which is used to determine the dependency of two variables [40]. Each sample in the data has several features and a response variable (or class label). The relationship between these features and the response variable is observed and recorded using Chi2 statistics. The basic rule is that if there is a dependency between the response variable and the feature variable, then the feature is considered a very important variable, whereas if the response variable is independent of the feature variable, then the feature is ignored. Two events X and Y are said to be independent of each other if
$$P(X \cap Y) = P(X)\,P(Y).$$
In the context of feature selection, the two events correspond to the occurrence of the feature and of the response class. Based on the following equation, we can rank the terms:
$$\chi^2(D, f, y) = \sum_{e_f \in \{0,1\}} \sum_{e_y \in \{0,1\}} \frac{(N_{e_f e_y} - E_{e_f e_y})^2}{E_{e_f e_y}},$$
where $e_f$ and $e_y$ are the feature term and the response class, respectively, N is the observed frequency in D and E is the expected frequency.
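For a binary feature indicator and a binary class label, the statistic can be sketched as below. The binarization of continuous β-values (for example, by thresholding) is an assumption of this illustration, and the function name is hypothetical.

```python
def chi_square(feature_present, labels):
    """Chi-square statistic for a binary feature against binary class labels."""
    n = len(labels)
    stat = 0.0
    for ef in (0, 1):
        for ey in (0, 1):
            observed = sum(
                1 for f, y in zip(feature_present, labels) if f == ef and y == ey
            )
            # Expected count under independence; assumes no empty row or column.
            expected = feature_present.count(ef) * labels.count(ey) / n
            stat += (observed - expected) ** 2 / expected
    return stat


labels = [0, 0, 0, 0, 1, 1, 1, 1]
dependent = [0, 0, 0, 0, 1, 1, 1, 1]    # feature occurs exactly in the cases
independent = [0, 0, 1, 1, 0, 0, 1, 1]  # feature occurs equally in both classes
print(chi_square(dependent, labels), chi_square(independent, labels))  # 8.0 0.0
```

Features are then ranked by decreasing statistic, so the dependent feature is placed ahead of the independent one.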

2.3. Classifiers
Many classification techniques are available in statistics and machine learning, and they can be applied to a broad range of problems from business applications to biological research [48]. Depending on the type of data, i.e., with or without labels, the algorithms are classified as supervised (such as classification and regression) or unsupervised (such as clustering). We focus on supervised classification techniques, as our data sets are labelled. Commonly used algorithms in the field of biomedical research, namely Random Forests [49], Naive Bayes [50], Support Vector Machines [51], logistic regression [48] and AdaBoost [52], are used here to compare the results of the FR methods on different classifiers.

Random Forests
Decision trees are popular among biomedical researchers because they are easy to interpret when dealing with binary outcomes. A decision tree processes a given sample by passing it down the tree through a series of yes/no questions, which eventually lead to a terminal node carrying the predicted class. Random forests (RF) [49] combine not one but multiple decision trees, where each tree is trained on a different bootstrap sample of the training data (random sampling with replacement) and the nodes in each tree are split using a limited number of features. The number of classifications equals the number of trees. A majority voting scheme is applied by aggregating all the trees to obtain one final classification. The basic idea is to rely not on an individual source but on multiple sources. Through the concept of bagging, or averaging, the accuracy of unstable individual decision trees can be improved.
An interesting quality of random forests is their out-of-bag technique. All the samples that were left out of the bootstrap sample during training are considered Out Of Bag (OOB) data, and error estimates are calculated by running these OOB data through the random forest. The average error estimate is obtained by considering the error rate over all the trees and is called the misclassification error estimate.
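The bootstrap and out-of-bag split for a single tree can be sketched as follows; the function name is hypothetical, and only the sampling step is shown, not tree fitting.

```python
import random


def bootstrap_split(n_samples, rng):
    """Return (in_bag, out_of_bag) index lists for one tree."""
    # Sampling with replacement: some indices repeat, others are never drawn.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(200, rng)
print(len(in_bag), len(out_of_bag))
```

For large samples roughly 1 − 1/e (about 37%) of the observations end up out of bag for any given tree, which is why the OOB error can serve as a built-in validation estimate.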

Naive Bayes
Naive Bayes (NB) is a popular probabilistic classification algorithm that takes advantage of probability theory and Bayes' theorem [50]. It is based on the simple assumption that all the features being used are independent of each other, which turns out to be a major disadvantage because real-world data almost always have some correlation.
Bayes' theorem is used to find the probability of an event occurring given that another event has already occurred. It is stated mathematically as
$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)},$$
where C and X represent the response variable (class) and the feature vector, respectively.
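As an illustration of how the theorem turns into a classifier, the sketch below assumes Gaussian class-conditional densities for each feature, which is an assumption of this example and not something the text specifies. The denominator P(X) is the same for every class and is therefore dropped. Function names are hypothetical.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian_nb(rows, labels):
    """Per-class prior plus per-feature mean and variance."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*members))
        model[cls] = (
            len(members) / len(rows),
            [fmean(c) for c in columns],
            [pvariance(c) + 1e-9 for c in columns],  # small floor avoids zero variance
        )
    return model


def predict(model, row):
    """Class with the largest log posterior, log P(C) + sum_j log P(x_j | C)."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)


rows = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.30), (0.80, 0.70), (0.85, 0.75), (0.90, 0.80)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict(model, (0.12, 0.22)), predict(model, (0.88, 0.78)))  # 0 1
```

The sum over features in `log_posterior` is exactly where the independence assumption enters: each CpG site contributes its own term, with no account taken of correlation between sites.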

Support Vector Machines
Support vector machines (SVM) are commonly used among medical researchers in classification problems [51]. The main idea of SVM is to find a hyperplane that best divides the samples into two classes with the largest possible margin. The margin is the distance between the hyperplane and the closest data points. In the linear case, the goal of SVM is to find the hyperplane that maximizes the margin such that both classes are well separated and classified correctly. In classification problems based on real-world datasets, however, the classes are rarely distinctly separated, so the condition of classifying them perfectly with a hyperplane cannot be met.
To overcome this problem, SVM introduces two extensions: soft margins and the kernel trick. In the first approach, SVM tries to find the hyperplane that maximizes the margin but allows a few misclassified points. A threshold is chosen which acts as a penalty term, such that the total distance of all misclassified points from the margin remains smaller than this threshold. The latter approach is more commonly used when the classes cannot be separated linearly in the original feature space: the kernel trick takes the existing features, maps them into a higher dimensional space, and thereby creates a new set of features. These new features help SVM to find non-linear decision boundaries. The dot product in the transformed feature space can be computed with a kernel function, $K(X^{(i)}, X^{(j)}) = \phi(X^{(i)})^T \phi(X^{(j)})$; here the inner product of the input data, $X^{(i)T} X^{(j)}$, is mapped into the feature space as $\phi(X^{(i)})^T \phi(X^{(j)})$. Different types of kernel can be used: 'linear', 'polynomial', 'radial basis function', and 'sigmoid'. The most popular kernel functions are the k-degree polynomial kernel, given by
$$K(X^{(i)}, X^{(j)}) = \left(X^{(i)T} X^{(j)} + c\right)^{k},$$
the radial basis kernel, given by
$$K(X^{(i)}, X^{(j)}) = \exp\left(-k_1 \lVert X^{(i)} - X^{(j)} \rVert^2\right),$$
and finally the sigmoid kernel, which is defined as
$$K(X^{(i)}, X^{(j)}) = \tanh\left(k_2\, X^{(i)T} X^{(j)} + c\right),$$
where k, $k_1$, $k_2$ and c are parameters which need to be specified. Boundaries which are nonlinear in the original space are mapped to linear ones in the transformed high dimensional space through the use of the kernel function, and this gives SVM its flexibility over other linear classifiers.
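The kernel functions and the idea behind the kernel trick can be sketched in a few lines. The parameter names (`degree`, `c`, `gamma`, `scale`) are illustrative assumptions, and `phi_quadratic` is a hypothetical helper that spells out the feature map of the degree-2 polynomial kernel with c = 0 for two-dimensional inputs.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, c=1.0):
    return (dot(x, y) + c) ** degree


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, scale=1.0, c=0.0):
    return math.tanh(scale * dot(x, y) + c)


def phi_quadratic(x):
    """Explicit feature map whose inner product equals (x . y) ** 2 in 2-D."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


x, y = (1.0, 2.0), (3.0, 4.0)
implicit = polynomial_kernel(x, y, degree=2, c=0.0)   # computed in the input space
explicit = dot(phi_quadratic(x), phi_quadratic(y))    # computed in the mapped space
print(implicit, explicit)
```

Both routes give the same value (121 here, up to floating-point rounding), which is the point of the kernel trick: the inner product in the higher dimensional space is obtained without ever constructing the mapped features.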

AdaBoost
AdaBoost (AB), also known as adaptive boosting, is a popular boosting technique [52]. It is based on the idea of combining multiple "weak classifiers" or "inaccurate classifiers" into a single "strong classifier". The AB algorithm can be applied to any classification problem. A random subset of the training set is used to train each weak classifier, but the subsets are not chosen completely at random. The AB algorithm assigns a weight to each training item. A higher weight is assigned to an incorrectly classified item so that it appears with higher probability in the training subset of the next classifier. Once each classifier is trained, it is also assigned a weight based on its accuracy. A higher weight is assigned to more accurate classifiers so that they have more influence on the final outcome. The rule of thumb for assigning weights to the classifiers is that a weight of zero is assigned to a classifier with fifty percent accuracy and a negative weight is assigned to a classifier with less than fifty percent accuracy.
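The weighting rule described above matches the standard discrete AdaBoost formulation, in which a classifier with weighted error rate ε receives the vote weight 0.5 ln((1 − ε)/ε). The text does not spell out the formula, so the sketch below is an illustration under that assumption, with hypothetical function names.

```python
import math


def classifier_weight(error_rate):
    """AdaBoost vote weight: 0.5 * ln((1 - error) / error)."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)


def update_sample_weights(weights, correct, alpha):
    """Up-weight misclassified items, down-weight correct ones, then renormalize."""
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw]


print(classifier_weight(0.5))  # 0.0, a coin-flip classifier gets no vote
alpha = classifier_weight(0.25)
new_weights = update_sample_weights([0.25] * 4, [True, True, True, False], alpha)
print(new_weights)
```

In the toy update, the single misclassified item ends up carrying half of the total weight, so the next weak classifier is pushed to get it right.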

Results
We compare the performance of all FR methods and classifiers through extensive simulation studies with three different scenarios, and, to further support the results, we use the real DNAm datasets described earlier in Section 2.

3.1. Simulation Data
The simulation studies are performed with three different scenarios, S1, S2 and S3, corresponding to low, medium and high correlation structures, which are generated using the procedure explained in detail in Section 2.1. Each simulated dataset was randomly partitioned into training and testing sets, where 70% of the samples were selected for training and the remaining 30% for testing. The FR methods were applied to the training set (70% of the samples), and the ranked features were given to the classification methods. For a fair comparison, all of the classification methods were evaluated on the same classification performance metrics, averaged over 100 repetitions.
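The evaluation protocol can be sketched as a random 70/30 split followed by the three reported metrics. This is an illustration with hypothetical function names; it assumes binary labels coded as 0 (control) and 1 (case) and at least one sample of each class in the test set.

```python
import random


def train_test_split(n_samples, train_fraction, rng):
    """Randomly partition sample indices into training and testing index lists."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(train_fraction * n_samples)
    return indices[:cut], indices[cut:]


def evaluate(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


rng = random.Random(0)
train_idx, test_idx = train_test_split(200, 0.7, rng)
print(len(train_idx), len(test_idx))        # 140 60
print(evaluate([1, 1, 0, 0], [1, 0, 0, 0]))
```

Ranking the features on the training indices only, as described above, keeps the test samples unseen until the final evaluation; repeating the split with different seeds and averaging the metrics gives the reported means and standard deviations.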
In this section, we show only the results for the highly correlated simulation scenario S3. Because of its high correlation structure, this scenario matches real DNAm data, which are usually known for their high correlation. The results for the low and medium correlation data (S1 and S2, respectively), along with the figures and tables depicting the accuracy, sensitivity and specificity, are described in the appendix. The average accuracies obtained for all the FR methods and classifiers on simulation scenario S3 are shown in Figure 1. Table 1 shows the evaluation metrics accuracy, sensitivity and specificity along with the corresponding standard deviations. Here, we see that F-score outperforms all the other FR methods by attaining the highest average prediction accuracies with all the classifiers compared. Also, F-score attained its best prediction accuracy within the top 30 features and is distinctly separated from the other FR methods. The order of performance for the rest of FR methods within top 30 features with all classifiers is as follows: F-score, Chi2 and MRMR. These results show that the F-score method selects a larger number of true important variables for a given range of ranked features and hence attains the best accuracies. F-score with Naive Bayes, with an accuracy of more than 72%, showed marginally better performance than F-score with Random Forest and is the best performing combination among all applications of FR methods and classifiers in the high correlation data. We can also note that the FR methods Chi2 and IG have overlapping performance between 60 and 120 ranked features with all the classifiers.
In Table 1 we show the details of accuracy, sensitivity and specificity for small, intermediate and large numbers of ranked features, i.e., at 30, 90 and 200 ranked features. The accuracies for the F-score method applied with the Naive Bayes, Random Forest and AdaBoost classifiers at the 90 ranked feature mark are 72.68%, 71.14% and 69.96%, respectively. This shows that the optimal number of true important features was attained by F-score within 90 ranked features, and we see a drop in the performance of all these classifiers at the 200 ranked feature mark. This can be explained by the fact that adding more features to the model, few of which are true variables, adds more noise. Hence the drop in performance is inevitable in tree-based methods such as Random Forests and AdaBoost. In contrast, all the FR methods when applied with SVM showed an increase in accuracy as the number of features increased; this is one of the important characteristics of support vector machines. Given a certain type of data, support vector machines show a rise in performance when more features are added to the model. From Table 1 we see that, for all the FR methods, SVM has its highest accuracies at 200 ranked features in comparison to the 30 and 90 marks. Also, across all FR methods with SVM, F-score with SVM showed the best performance, with an accuracy of 69.74%.
Although MRMR and IG do not show the best performance, to analyze the behavior of these FR methods with all the classifiers we look at the corresponding accuracies, and we note that MRMR and IG behave similarly to the SVM classifier with all FR methods, i.e., as the number of ranked features increases from 30 to 300, the accuracy of all the classifiers increases. This shows that neither MRMR nor IG selects many true important variables within the top 30 features. As the number of features is increased, more true important variables are added, and the accuracy increases with them. Hence, we can say that F-score, which selects a large number of true important variables within a small range of ranked features, is the best FR method compared to the others.

3.2. Experimental Data
The experimental studies are performed with two DNAm datasets, D1 and D2, corresponding to balanced and imbalanced samples. Both datasets were obtained from GEO, and the complete description of the data along with the pre-processing steps is given in Section 2.1. Each DNAm dataset was randomly partitioned into training and testing sets, where 70% of the samples were selected for training and the remaining 30% for testing. For a fair comparison, all of the classification methods were evaluated on the same classification performance metrics, averaged over 100 repetitions.

3.2.1. Experimental Data D1: GSE26126
Figure 2 shows the average accuracies of all the classifiers with the CpG sites ranked using the FR methods on real data D1. The detailed evaluation metrics are shown in Table 2. Here, we see that F-score, when applied with all the classifiers, outperforms the other FR methods by attaining the highest average prediction accuracies within the top range of ranked CpG sites. These results are very similar to what we saw in the simulation data S3, which were highly correlated. From this matching behavior in terms of accuracy, we can say that the experimental DNAm dataset D1 also has very high correlation. To further illustrate the correlation structure of the experimental data D1, we added the boxplot and histogram shown in Figure 3, where we see that the median correlation for the data D1 is 0.76, which is close to the correlation of 0.8 in simulation scenario S3.
From the plots in Figure 2 we also see that there is overlapping performance between Chi2 and IG, which was also seen in simulation scenario S3. Although MRMR performed well with the tree-based AdaBoost and Random Forest classifiers, it failed to show the same consistency with Naive Bayes and support vector machines, where it had the lowest accuracies among all FR methods.
In Table 2, the evaluation metrics of all the FR methods with the classifiers on the dataset D1 are shown. As in the simulation studies, we again show the performance of all the FR methods with the classifiers at small, intermediate and large numbers of ranked CpG sites, i.e., when p is 30, 90 and 200.
With an accuracy of 82.39% and a standard deviation (sd) of 5.59%, MRMR with SVM achieved the lowest performance among all the applications of FR methods on classifiers. However, we also see that F-score with SVM on the experimental data performs better with a small number of ranked CpG sites, with accuracies of 88.58%, 86.05% and 83.55% for p = 30, 90 and 200, respectively. Compared to the other FR methods, F-score has remarkably good performance in selecting the true important variables within the top 30 ranked CpG sites. We also see that F-score with Random Forests has an accuracy of 92.32%, which is the highest among all applications of FR methods on classifiers. This shows that F-score attains the optimal number of true important CpG sites required for predicting unknown samples within a small range of ranked CpG sites.

3.2.2. Experimental Data D2: GSE32393
The average accuracies of all the classifiers with the CpG sites ranked using the FR methods on real data D2 are shown in Figure 4. The detailed results with the different evaluation measures for the different numbers of ranked features can be seen in Table 3. Here, we see that the F-score method, when applied with all the classifiers, outperforms the other FR methods by attaining relatively better prediction accuracies within the top 30 ranked CpG sites. The results show that the performances of all the classifiers are very close across all the FR methods. The correlation of the DNAm data D2 was evaluated to assess the performance of the different FR methods on the classification methods used in this study. In Figure 5, we see that this dataset has a correlation close to 0.5. The results for data D2 are found to be very similar to what we see in the low and medium correlation simulation scenarios S1 and S2, where the correlations were 0.2 and 0.5, respectively.
The results for scenario S1 and S2 are explained in appendix section. The matching results in these scenarios and the experimental data D2 suggests that when there is a intermediate correlation in the DNAm data, the FR methods and classi ers tend to perform well. The F-score and MRMR methods on the naive bayes classi er are seen to have an overlapping performance with the F-score attaining the best score with the 200 ranked CpG sites by a very close margin. In the Table 3, the evaluation metrics of FR methods with di erent classi ers on the dataset D2 are shown. The F-score compared to other FR methods has remarkably good performance in selecting the true important variables within the top 30 ranked CpG sites. We also see that the F-score with SVM has an accuracy of 96.57% when the ranked CpG sites were 30. This is the highest accuracy considering all other FR methods and classi ers the top 30 number of ranked CpG sites. This shows that F-score attains the optimal number true important CpG sites required required for predicting the unknown samples within the small range of ranked CpG sites.
Figure 5 shows the correlation structure of the experimental data D2. The data have an intermediate correlation structure, with a correlation close to 0.5.

Discussion
High dimensional data is very complex to deal with, and DNAm data, with its high correlation structure, adds further complexity. Many approaches have been proposed to address this problem. However, the best performing FR method for different classifiers on this type of data is not yet known. For this purpose, in this paper we use extensive simulation studies and experimental data to identify the best performing FR method with different classifiers for high-dimensional DNAm data. In both the simulated and experimental data, the F-score clearly performed better than the other FR methods by selecting the truly important CpG sites within a small range of ranked features. On the other hand, MRMR had the lowest accuracies when applied with the classifiers compared here. IG and Chi2 showed similar performance across all numbers of ranked features. Different types of classifiers and FR methods tend to perform differently on certain types of data. It would be somewhat biased to declare a single best performing combination in a high dimensional, highly correlated setting such as DNAm data, as this is an extremely challenging problem. However, among the classifiers in our simulated data, Naive Bayes showed the best performance, followed by Random Forest and AdaBoost, although the variation in accuracy among them was marginal. In the experimental data D1, the tree-based ensemble methods Random Forests and AdaBoost showed marginally better performance than the other classifiers. The fall in accuracy with the SVM classifier on experimental data D1 illustrates how the inclusion of noise can affect performance and even lead to a high false positive rate. In the experimental data D2, all combinations of FR methods and classifiers had accuracies between 95% and 96%. This shows that all combinations of the methods perform well, with narrow margins between them.
However, among the FR methods, the best performing method can be identified easily. The F-score showed the best performance on all the classifiers with a small number of ranked important CpG sites. When dealing with high-dimensional data, FR is an important approach for reducing the number of features by eliminating noise in the data and retaining the truly significant features. It also helps in reducing the computational time. The performances of all the experiments carried out in this research showed that a small number of significant features can be sufficient to attain good performance.
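To illustrate how a score-based ranking reduces the feature set, the following is a minimal sketch of a Fisher-score ranking. It assumes the commonly used definition (between-class scatter of the feature means divided by the pooled within-class variance); the exact variant used in this study is not restated in this section, so the formula, the function names `fisher_scores` and `top_k_features`, and the toy data are illustrative assumptions only.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature (column of X) for class labels y.

    Assumed definition: sum_k n_k * (mean_k - mean)^2 / sum_k n_k * var_k,
    where k runs over the classes.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        # A constant feature carries no information; give it a score of 0.
        scores.append(between / within if within > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the largest Fisher score."""
    scores = fisher_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy methylation-like values (hypothetical): feature 0 separates the two
# classes, feature 1 has identical class means, feature 2 is mostly noise.
X = [
    [0.10, 0.50, 0.90],
    [0.20, 0.40, 0.10],
    [0.15, 0.60, 0.50],
    [0.80, 0.50, 0.20],
    [0.90, 0.45, 0.80],
    [0.85, 0.55, 0.40],
]
y = [0, 0, 0, 1, 1, 1]

scores = fisher_scores(X, y)
ranked = top_k_features(X, y, k=2)
```

Only the top-ranked columns would then be passed to a classifier, which is what keeps the training cost low when the number of CpG sites is large.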

Conclusion
The performances of the FR methods are similar in both the synthetic and real DNAm data. Across all the simulation scenarios and DNAm data sets, the F-score as an FR method, when applied to all the classifiers, achieved the best prediction accuracy with a significantly small number of ranked features. The second best performing FR method was IG, followed by Chi2, which has the lowest false positive rate, especially when used with the Random Forest classifier. That is, it has the lowest rate of predicting that a patient without a disease has the disease. The FR methods on all the classifiers compared in this paper performed competitively, with little variation in the average accuracies. Also, the overall performances achieved by all the classifiers were closer when only a very small set of significant ranked features (CpG sites) was considered. This shows that a small set of significant features can classify the DNAm samples more efficiently, not only by reducing the computational time but also by boosting the overall performance. As future research, we plan to focus on ultra-high dimensional DNAm data from Illumina 450K bead arrays, which have over 450K CpG sites.

A Appendix
The simulation scenarios S1 and S2, which correspond to correlations of 0.2 and 0.5 respectively, were also carried out in this article. In this section, we show the results of these scenarios. Figure A.1 shows the average accuracies for the low correlation simulation scenario S1, and Figure A.2 shows the average accuracies for the intermediate correlation simulation scenario S2. Each simulated data set was randomly partitioned into training and testing data. The FR methods were applied to the training data (70% of the samples), and the features were ranked based on their importance. The reduced set of top-ranked features was then used for training the classification models. For a fair comparison, all of the methods were evaluated with accuracy, sensitivity and specificity, averaged over 100 repetitions; the results are shown in Tables A.1 and A.2.
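The evaluation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the ranker (absolute gap between class means) and the nearest-centroid classifier are simple stand-ins for the FR methods and classifiers compared in the paper, the data are synthetic, and all function names are hypothetical.

```python
import random
from statistics import fmean


def split_indices(n, train_fraction, rng):
    """Randomly partition range(n) into training and testing indices."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]


def rank_by_mean_gap(X, y):
    """Stand-in ranker: order features by the absolute gap in class means."""
    gaps = []
    for j in range(len(X[0])):
        m0 = fmean(r[j] for r, t in zip(X, y) if t == 0)
        m1 = fmean(r[j] for r, t in zip(X, y) if t == 1)
        gaps.append(abs(m1 - m0))
    return sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)


def fit_centroids(X, y, features):
    """Stand-in classifier: one centroid per class on the selected features."""
    return {c: [fmean(r[j] for r, t in zip(X, y) if t == c) for j in features]
            for c in (0, 1)}


def predict(centroids, row, features):
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))
    return 0 if dist(0) <= dist(1) else 1


def metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (tp + tn) / len(pairs), sensitivity, specificity


def repeated_evaluation(X, y, k, repeats=100, train_fraction=0.7, seed=0):
    """Average the three metrics over repeated random 70/30 partitions.

    Ranking uses the training samples only, so the test samples never
    influence which features are selected. Assumes both classes appear
    in every training split.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        train, test = split_indices(len(X), train_fraction, rng)
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        top = rank_by_mean_gap(X_train, y_train)[:k]
        centroids = fit_centroids(X_train, y_train, top)
        y_pred = [predict(centroids, X[i], top) for i in test]
        results.append(metrics([y[i] for i in test], y_pred))
    return tuple(fmean(col) for col in zip(*results))


# Synthetic data (hypothetical): 40 samples, 50 features, the first 5 of
# which differ between the two classes.
data_rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(0.7 if (label == 1 and j < 5) else 0.3, 0.1)
      for j in range(50)] for label in y]

accuracy, sensitivity, specificity = repeated_evaluation(X, y, k=5)
```

Specificity is one minus the false positive rate, so a method with the lowest false positive rate is the one with the highest specificity in this output.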