Metalearning approach for leukemia informative genes prioritization

Abstract The discovery of diagnostic or prognostic biomarkers is fundamental to optimizing therapeutics for patients. By enhancing the interpretability of the prediction model, this work aims to optimize leukemia diagnosis while retaining high classification performance in the identification of informative genes. For this purpose, we used an optimal parameterization of the Kernel Logistic Regression method to classify leukemia microarray gene expression data, applying metalearners to select attributes and thereby reduce the dimensionality of the data before passing it to the classifier. Pearson correlation and the chi-squared statistic were the attribute evaluators applied in the metalearners, with information gain as the single-attribute evaluator. The implemented models relied on 10-fold cross-validation. The metalearner approach identified 12 common genes, with the highest average merit of 0.999. The practical work was developed using the public data mining software WEKA.


Introduction
The type of leukemia is determined by the stage of development of the cell when it becomes malignant or cancerous. Acute lymphoblastic leukemia (ALL) is the most common type of leukemia in childhood, targeting the lymphoid line of blood cells [1]. Acute myeloid leukemia (AML) affects the myeloid line of blood cells and is a fast-growing form of cancer of the blood and bone marrow.
The occurrence of a cancer or cancer subtype can be determined through informative genes, by considering their expression patterns and the correlation of those patterns with the cancer type. For this purpose, statistical methods and machine learning techniques can be employed for feature selection and, in this way, for prioritizing informative genes.
The objective of this work was to identify an optimal subset of genes to serve as the best diagnostic markers for leukemia, inferred from the best performance evaluation results obtained in classification with Kernel Logistic Regression (KLR). The KLR model is a statistical classifier [2] that fits a model by minimizing the negative log-likelihood with a quadratic penalty using Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization [3].
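As a hedged illustration of the fitting criterion just described, one common way of writing a penalized kernel logistic regression objective for binary labels $y_i \in \{0, 1\}$ is given below. The symbols $K$ (kernel function), $\alpha$ (coefficient vector), $b$ (intercept) and $\lambda$ (penalty weight), as well as the exact form of the quadratic penalty, are assumptions made for illustration; the formulation used by the software may differ in detail.

$$
\min_{\alpha,\, b}\; -\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big] + \frac{\lambda}{2}\, \alpha^{\top} K \alpha,
\qquad
p_i = \frac{1}{1 + \exp\!\Big(-\sum_{j=1}^{n} \alpha_j K(x_i, x_j) - b\Big)}
$$

The first term is the negative log-likelihood of the training labels, and the second is the quadratic penalty that discourages overly complex decision functions. BFGS is then one possible quasi-Newton method for minimizing this smooth objective.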
Machine learning tools and techniques allow the implementation of metalearners. Metalearning algorithms take classifiers and turn them into more powerful learners. An attribute selection classifier is one example of a metalearner. Its parameters include a filter (attribute evaluator) and a search method, which allow the dimensionality of the data to be reduced by attribute selection without loss of information [4].
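The attribute-selection step of such a metalearner can be sketched as follows. This is a minimal illustration in plain Python, not the WEKA implementation: the function names (`pearson`, `select_top_k`, `reduce_rows`) and the toy data are hypothetical, and ranking by absolute Pearson correlation with a 0/1 class label is assumed as the evaluator.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no class information
    return cov / math.sqrt(var_x * var_y)


def select_top_k(rows, labels, k):
    """Rank columns by |correlation with the class|; return the k best indices."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by column index
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


def reduce_rows(rows, selected):
    """Keep only the selected columns of every sample."""
    return [[row[j] for j in selected] for row in rows]


# Hypothetical toy data: 6 samples x 3 "genes"; class 0 = ALL, class 1 = AML
train_rows = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.4],
    [3.1, 5.5, 0.3],
    [2.9, 4.5, 0.8],
    [3.3, 5.0, 0.5],
]
train_labels = [0, 0, 0, 1, 1, 1]

selected = select_top_k(train_rows, train_labels, k=1)
reduced = reduce_rows(train_rows, selected)
```

In a metalearner, the selected indices would be computed on the training portion of each cross-validation fold only and then applied unchanged to the test portion, so that the test samples do not influence which attributes are kept.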
Filter methods are one of the three general classes of feature selection algorithms. They apply a statistical measure to assign a score to each feature. The features are ranked by their scores and accordingly kept in or removed from the dataset. These methods are often univariate and consider each feature independently, or with regard to the dependent variable. Examples include chi-square [4], correlation coefficient [5], and information gain [6].
This paper is structured as follows. After this brief introduction, Section 2 explains the methodology and procedures followed in this study, concluding with the performance assessment of the classification methods. Details of the experimental work using the WEKA data mining workbench and the results obtained are discussed in Section 3. The conclusions are presented in Section 4.

Experimental procedures
The experimental work was based on WEKA version 3.8.3, a data mining workbench publicly accessible at www.cs.waikato.ac.nz/ml/weka/. In this work, two metalearners were applied to reduce the dimensionality of the data by attribute selection. The workflow of the procedures is shown in Figure 1. The correlation attribute evaluator and the chi-squared attribute evaluator were chosen as supervised filter methods, applied before the data were passed to KLR. The optimal parameterizations of KLR are described in Ref. [7]. Each scheme was run 10 times with 10-fold cross-validation, and the schemes were compared using the Paired T-Tester (corrected). The number of attributes to retain was chosen after several tests, validating the performance evaluation results by comparison with those obtained when the classifier was applied to the original number of attributes. Afterwards, information gain was applied to the attributes retained by the two metalearners, and the attributes were ranked according to their evaluation. Moreover, a biological interpretation of the selected subset of genes was extracted from the literature. This set of experiments was conducted on a computer with an Intel Core i7-5500U CPU at 2.40 GHz and 8.00 GB of RAM.
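The corrected paired test mentioned above can be illustrated with a short sketch. It assumes that the correction is the corrected resampled t-statistic of Nadeau and Bengio, in which the variance term is inflated by the ratio of test-set size to training-set size; the function name and the example differences are hypothetical, and only the test statistic (not the p-value) is computed.

```python
import math


def corrected_resampled_t(differences, test_train_ratio):
    """Corrected resampled t-statistic for paired performance differences.

    differences: per-fold differences in a metric between two schemes
    test_train_ratio: (test set size) / (training set size) in each fold
    """
    n = len(differences)
    mean_d = sum(differences) / n
    # Unbiased sample variance of the differences
    var_d = sum((d - mean_d) ** 2 for d in differences) / (n - 1)
    # With test_train_ratio = 0 this reduces to the standard paired t-statistic
    return mean_d / math.sqrt((1.0 / n + test_train_ratio) * var_d)


# Hypothetical accuracy differences between two schemes over four folds
diffs = [0.1, -0.1, 0.2, 0.0]

# In 10-fold cross-validation, each test fold is 1/9 the size of the training data
t_corrected = corrected_resampled_t(diffs, test_train_ratio=1.0 / 9.0)
t_standard = corrected_resampled_t(diffs, test_train_ratio=0.0)
```

The correction makes the statistic more conservative than the standard paired t-test, which compensates for the overlap between training sets across folds. For 10 repetitions of 10-fold cross-validation there would be 100 paired differences and the statistic would be compared against a t-distribution with 99 degrees of freedom.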

Datasets
The leukemia dataset was obtained online from http://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=63, and was published as part of the experimental work in Ref. [8]. It includes two types of leukemia: ALL and AML. The dataset was analyzed in a reduced version, composed of 28 samples and keeping the same features (12,582 genes). The goal of this subdivision was to identify informative genes in balanced data.

Performance evaluation
We trained the classifiers to predict the outcomes of cancer microarray datasets containing positive samples and control samples, as described in Ref. [7]. The measures used to evaluate the classifiers [9,10] include classification accuracy (ACC), i.e., the ratio of the true positives (TP) and true negatives (TN) obtained by the classifier over the total number of instances in the test dataset, defined as:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

where FP and FN denote the false positives and false negatives, respectively.
Mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction [12]. It is given by:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| p_i - a_i \right|$$

where $p_i$ is the predicted value and $a_i$ the actual value for instance $i$, and $n$ is the number of instances.
Precision (PRE), also called positive predictive value (PPV), is the proportion of true positives among the true positives and false positives, as given by the equation:

$$\mathrm{PRE} = \frac{TP}{TP + FP}$$

Recall (REC), also called sensitivity or hit rate, is the proportion of true positives among the true positives and false negatives, as given by the equation:

$$\mathrm{REC} = \frac{TP}{TP + FN}$$

F-measure, also called F score, is the harmonic mean of precision and recall, given by the equation:

$$F = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$$

ROC stands for receiver operating characteristic. The ROC curve is created by plotting the true positive rate against the false positive rate. The area under the ROC curve is also used to evaluate the performance of classifiers.
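The evaluation measures defined above can be computed directly from confusion-matrix counts. The following is a minimal sketch in plain Python; the function names and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"ACC": acc, "PRE": pre, "REC": rec, "F": f_measure}


def mean_absolute_error(predicted, actual):
    """Average magnitude of the prediction errors, ignoring their direction."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical counts for a 28-sample test set
metrics = classification_metrics(tp=12, tn=13, fp=1, fn=2)

# Hypothetical predicted class probabilities versus true 0/1 labels
mae = mean_absolute_error([0.9, 0.2, 0.6], [1, 0, 1])
```

Because the F-measure is a harmonic mean, it always lies between precision and recall and is pulled toward the smaller of the two.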

Results and discussion
The dimensionality of the dataset was reduced by applying attribute selection before the data were passed on to KLR. The two evaluators selected were correlation and chi-squared. For comparison, Table 1 presents the performance evaluation results of KLR applied to the original data. These results are expressed as averages over the 10 repetitions of each test.
The results of metalearner correlation-KLR and metalearner chi-squared-KLR presented in Table 1 were achieved with 71 features. The obtained results validate the reduction procedure, as they do not present statistically significant differences from those obtained with the original data.
After the number of features had been reduced without affecting the performance of the implemented classifier, the features retained by the two metalearners, correlation-KLR and chi-squared-KLR, were subjected to the information gain attribute evaluator. This evaluator determines the goodness of an attribute by measuring the class information gained as a result of adding it to the list of input attributes. The average merit of information gain attribute selection after applying metalearner correlation-KLR is presented in Figure 2, and the average merit after applying metalearner chi-squared-KLR is presented in Figure 3.
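The information gain of a single attribute, as used in the ranking step above, can be sketched as the reduction in class entropy obtained by splitting the samples on that attribute. The example below is a simplified illustration in plain Python: it assumes a single, hand-picked threshold for a numeric expression value, whereas the actual evaluator may discretize attributes differently, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Class entropy minus the weighted entropy after a binary split at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - conditional


# Hypothetical balanced toy data: 4 ALL and 4 AML samples
labels = ["ALL"] * 4 + ["AML"] * 4

# A gene whose expression separates the two classes perfectly
separating_gene = [1.0, 2.0, 1.5, 1.2, 5.0, 6.0, 5.5, 5.2]
# A gene whose expression is unrelated to the class
uninformative_gene = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 5.0]

gain_separating = information_gain(separating_gene, labels, threshold=3.0)
gain_uninformative = information_gain(uninformative_gene, labels, threshold=3.0)
```

For a two-class problem the information gain is bounded above by the class entropy, which is at most 1 bit; an attribute that separates the classes perfectly attains that bound, while an attribute unrelated to the class scores close to zero.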
Table 2 presents the features with the highest score obtained (0.999) and the respective gene name/protein reported in the literature. The respective p-values are also presented.
As shown below, the genes that emerged from the information gain evaluator are related to the studied disease. TCL1A encodes T-cell leukemia/lymphoma protein 1A. This protein enhances the phosphorylation and activation of AKT1, AKT2 and AKT3. It enhances cell proliferation, promotes cell survival and stabilizes the mitochondrial membrane potential [13][14][15]. Its expression is deregulated in chronic lymphocytic leukemia and most lymphomas [16]. According to the UniProt database, MME encodes the protein neprilysin, which is an important cell surface marker in the diagnosis of human ALL (Table 3).
TBPL1 encodes TATA box-binding protein-like protein 1. It is part of a specialized transcription system that mediates the transcription of most ribosomal proteins [17]. A recent study [18] demonstrated that the expression of IFI16, a member of the PYHIN protein family involved in apoptosis regulation and proliferation inhibition, is associated with clinical outcome in chronic lymphocytic leukemia. According to the UniProt database, FUBP3 may play a role in the activation of gene expression and may interact with single-stranded DNA from the far-upstream element (FUSE). Also according to the UniProt database, CD79B encodes the B-cell antigen receptor complex-associated protein beta chain. It is required, in cooperation with CD79A, for initiation of the signal transduction cascade activated by the B-cell antigen receptor complex (BCR) [19]. A study [20] reports that CD79B is found in mature B blasts (B-ALL) that express membrane Ig, as it is in normal and leukemic B lymphocytes. BLNK, also known as SLP65, functions as a central linker protein downstream of the BCR, regulating biological outcomes of B-cell function and development [21,22]. PPP3CC plays an essential role in the transduction of intracellular Ca2+-mediated signals [23]. According to the Atlas of Genetics and Cytogenetics in Oncology and Haematology database, DNTT/BLNK is related to ALL [24]. RPS24 is required for the maturation of 40S ribosomal subunits and the processing of pre-rRNA [25]. This gene was identified in the top list of 20 genes as a precursor of B-ALL [26]. It has also been associated with an increased risk of developing leukemia [27]. CD24 modulates B-cell activation responses and may have a pivotal role in the differentiation of different cell types [28]. USP13 is involved in various processes, such as autophagy and endoplasmic reticulum-associated degradation [29,30].

Conclusions
In this work, we applied metalearners to reduce the number of features in order to optimize the prioritization of informative genes. Metalearner correlation-KLR and metalearner chi-squared-KLR made it possible to reduce the number of features to 71, the minimal number that conserved the optimal classification performance. Using the information gain attribute evaluator, we were able to identify the most promising biomarkers for leukemia, based on the highest average merit score. In this way, it was possible to gather 12 genes common to the two metalearner reduction results. Furthermore, based on the literature and protein databases, we were able to confirm that the metalearner results mostly coincide with laboratory studies that identify the same genes as involved in leukemia. In conclusion, the metalearners used proved to be effective methods for optimizing the discovery of informative genes and can therefore be relevant in supporting the diagnosis and prognosis of time-critical diseases such as cancer.