Machine learning models can predict the presence of variants in hemoglobin: artificial neural network-based recognition of human hemoglobin variants by HPLC

Objectives: This article presents the use of machine learning techniques such as artificial neural networks, K-nearest neighbors (KNN), naive Bayes, and decision trees in the prediction of hemoglobin variants. To the best of our knowledge, this is the first study using machine learning models to predict suspected HbS or HbD Los Angeles carrier states. Methods: Our dataset contained 238 observations, of which 128 were HbD carriers and 110 were HbS carriers. The features were age, sex, RBC, Hb, HTC, MCV, MCH, RDW, serum iron, TIBC, ferritin, HbA2, HbF, HbA0, retention time (RT) of the abnormal peak, and the area under the abnormal peak. KNN, naive Bayes, decision tree, and artificial neural network models were trained. Model performances were estimated using 7-fold cross-validation. Results: When RT, the key point of differentiation used in high-performance liquid chromatography (HPLC), was included as a feature, all models performed well. When RT was excluded, the deep learning model performed the best (Accuracy: 0.99; Specificity: 0.99; Sensitivity: 0.99; F1 score: 0.99), while the naive Bayes model performed the worst (Accuracy: 0.94; Specificity: 0.97; Sensitivity: 0.90; F1 score: 0.93). Conclusions: Deep learning and decision tree models have demonstrated high performance and have the potential to be integrated into medical laboratory work practices as an aid for hemoglobinopathy detection. These outcomes suggest that when machine learning models are fed enough data, they may detect a wide range of hemoglobin variants. However, more comprehensive studies with data from a larger number of patients and hemoglobinopathies will be useful for validating our models.


Introduction
Machine learning algorithms are increasingly important because of their success and benefit in biomedical applications. Machine learning has been used successfully in many such applications, including the prediction of the 3D structure and function of proteins from DNA sequences, the decoding of motor cortex activity to send signals to neuroprosthetic devices, the prognosis estimation of diseases, and the determination of the life expectancy of cancer patients. Big data and machine learning have opened new horizons in health and many other fields [1,2].
Machine learning is a branch of artificial intelligence that uses data to improve predictions and decisions. It builds models based on training data, which are typically collected from various sources. Through machine learning, a computer can perform tasks without being explicitly programmed to do them. A machine learning algorithm tries to estimate the pattern in the data. An error function measures how far the model's predictions deviate from the known outcomes. As long as the model can fit the training data points better, its weights are adjusted to improve its accuracy. The algorithm repeats these evaluation and optimization steps to obtain the best model under the constraints of the learning phase. Artificial neural networks (ANN), a type of machine learning, are built by modeling human neural networks structurally and functionally. Signals (data) transmitted to neurons can be passed on via connections between neurons, much like in a biological network of neurons. Artificial neural networks can learn through experience and achieve more specific results as the neural network deepens. ANN and other machine learning algorithms can outperform humans in detecting patterns in data with complex and nonlinear relationships [1,3].
Machine learning could be promising, particularly for hemoglobinopathies, which affect more than 300 million people worldwide each year and are frequently misdiagnosed due to clinical, complete blood count (CBC), and electrophoretic similarities [4,5]. Every year, hundreds of thousands of babies are born with sickle cell trait worldwide [6]. The sickle cell trait, which affects five percent of the world's population, and the Hb D-Los Angeles (HBB: c.364G>C) carrier state have clinical, CBC, chromatographic, and electrophoretic similarities [6,7]. The Hb D-Los Angeles variant has alkaline electrophoretic mobility, concentration, and high-performance liquid chromatography (HPLC) profiles similar to those of Hb S (HBB: c.20A>T) [7]. As a result, differential diagnosis of these hemoglobinopathies is difficult. Furthermore, genotype determination is not usually performed because carrier states are considered harmless. However, recent research has shown that the Hb S carrier state is not completely harmless and can cause various clinical complications, including acute pain crisis, venous thromboembolism, shock, chronic kidney disease, spleen infarcts, and stroke [8-11].
In addition, different co-inherited mutations that may accompany carrier status can significantly worsen the clinical phenotype. Misdiagnosis may also have an impact on future generations. Whether heterozygous or homozygous, Hb D inheritance does not result in a clinically significant phenotype. On the other hand, its association with beta-thalassemia or Hb S results in a variable clinical phenotype ranging from mild to severe hemolytic anemia [12].
HPLC, classical electrophoresis, and capillary electrophoresis are frequently used methods for identifying hemoglobin variants. Molecular diagnostic methods such as multiplex polymerase chain reaction (PCR) tests and DNA sequencing are used for definitive diagnosis. However, these tests are not available in small centers, are more expensive, and must be interpreted by trained professionals [7]. When DNA sequencing is not possible or HPLC is insufficient, neural networks or other machine learning models can be used as fast and accurate aids in interpreting hemoglobinopathy results and can assist laboratory professionals in the detection of hemoglobinopathies.
Generally, HPLC is more effective at separating Hb S and Hb D-Los Angeles, but this is not always possible. Both conditions, which have similar CBC findings, can be distinguished using HPLC retention time (RT). The gradient program, which allows for elution patterns, makes HPLC superior. It alters hemoglobin's total charge and conformation by constantly changing the pH during analysis. In this way, it is possible to determine the solubility of hemoglobin and the RT [13,14]. However, due to differences in gradient program software and discrepancies between HPLC models, Hb S and Hb D cannot always be separated from each other reliably [7,14].
Red blood cell indices, Hb A, Hb A2, Hb F, abnormal hemoglobin values, and RT values can be used to train machine learning models to predict hemoglobin variants based on these data. The effects of RT on the machine learning model prediction performance can be evaluated by training machine learning models without this data. The aim of the study is to create machine learning models that will assist medical laboratories in differentiating between sickle cell and Hb D Los Angeles carriers using artificial neural networks and other machine learning methods. Here, we report the performance metrics of our trained models.

Materials and methods
In this retrospective study, 90 (38%) women and 148 (62%) men between the ages of 11 and 67 who presented to Mugla Sitki Kocman University Training and Research Hospital Thalassemia Diagnosis, Treatment, and Research Center between 01.01.2015 and 01.06.2021 were included. Patients with a history of surgery or chemotherapy, or with any infectious disease, cholestasis, thyroid dysfunction, acute or chronic hepatitis, or liver or other organ failure, were excluded from the study. The Ethics Committee of Mugla Sitki Kocman University Training and Research Hospital granted ethics approval with decision no. 111 on 01.06.2021. The study was conducted in accordance with the principles of the Declaration of Helsinki.
The patients' data were obtained from the database of Mugla Sitki Kocman University Thalassemia Diagnosis, Treatment, and Research Center. Sysmex XN 1000 (Sysmex Diagnostics, Japan) was used to measure the red blood cell index parameters. Primus Ultra II (Trinity Biotech Diagnostic, Ireland) was used to analyze hemoglobin variants using HPLC. Serum iron, total iron binding capacity (TIBC), and ferritin levels were measured using spectrophotometric or ECLIA immunoassay methods in a Cobas 601 (Roche Diagnostics, Germany) analyzer.
Our dataset contained 238 observations, of which 128 were suspected HbD carriers and 110 were suspected HbS carriers. The dataset was approximately balanced, and there were no missing data. The features were age, sex, RBC, Hb, HTC, MCV, MCH, RDW, serum iron, TIBC, ferritin, HbA2, HbF, HbA0, RT of the abnormal peak, and the area under the abnormal peak. Sex was coded as 0 (female) and 1 (male).
To estimate the relevance of the features, we used Boruta feature selection, which applies random forest classification at its core. First, the algorithm extends the original dataset with random probes (shadow features) whose values are obtained by shuffling the original features' values across instances. The importance of each feature is then compared with a threshold derived from the shadow features using z scores. Features that are statistically less relevant than the shadow features are removed iteratively and are not considered in subsequent iterations. The algorithm classifies the features into three types: confirmed, tentative, and rejected.
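The shadow-feature step described above can be illustrated with a minimal sketch. It uses only the Python standard library and covers just the construction of shuffled probes and the comparison against the best shadow; the random forest importance estimation and the statistical test over repeated iterations are omitted. The function names, the toy values, and the importance scores are hypothetical and are not taken from the study.

```python
import random


def make_shadow_features(rows, seed=0):
    """Return rows extended with one shuffled ("shadow") copy of every column."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    shadow_columns = []
    for j in range(n_features):
        # Shuffling a column across instances keeps its value distribution
        # but breaks any relationship with the class label.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shadow_columns.append(column)
    return [
        list(row) + [shadow_columns[j][i] for j in range(n_features)]
        for i, row in enumerate(rows)
    ]


def beats_best_shadow(importances, n_real):
    """Flag real features whose importance exceeds the best shadow importance."""
    best_shadow = max(importances[n_real:])
    return [importance > best_shadow for importance in importances[:n_real]]


# Hypothetical toy data: three features for five samples (not study data).
toy_rows = [
    [4.8, 13.9, 82.0],
    [5.1, 14.6, 85.5],
    [4.5, 12.8, 79.3],
    [5.4, 15.2, 88.1],
    [4.9, 13.5, 81.7],
]
extended = make_shadow_features(toy_rows)

# Hypothetical importances for 3 real features followed by their 3 shadows,
# as any importance-producing model might return them.
hits = beats_best_shadow([0.40, 0.05, 0.30, 0.10, 0.02, 0.08], n_real=3)
```

In the full algorithm, this comparison is repeated over many iterations with freshly shuffled shadows, and a feature is confirmed or rejected only when its number of hits differs significantly from chance.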
The features were standardized by removing the mean and scaling to unit variance. K-fold cross-validation was performed with k = 7. We used the same training and testing data for all the models. We trained our models with and without RT data to understand the impact of RT on the models. Euclidean distance was used for the k-nearest neighbors (KNN) models. The number of neighbors was 75 for the model trained on data with RT and 61 for the model trained on data without RT. We used the Gaussian naive Bayes algorithm for the naive Bayes models. The classification and regression tree (CART) algorithm was used for the decision tree models, and information gain was used to measure the quality of a split in decision tree training [12]. For the deep learning models, we constructed fully connected multilayer perceptrons (MLP) with two hidden layers. The input layers had 16 nodes for the entire dataset and 15 nodes for the data without the RT feature. Hidden layers had ten nodes with the ReLU (Rectified Linear Unit) activation function. The output layer had one node with a sigmoid activation function. The Adam optimizer, the binary cross-entropy loss function, a batch size of 32, and 100 epochs were used to train the deep learning models [15,16]. Performances of the models were evaluated according to the accuracy, sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and F1 scores.
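The standardization and 7-fold splitting described above can be sketched as follows, using only the Python standard library. The sketch makes one assumption that the text does not state: the scaling statistics are fitted on the training folds only and then applied to the held-out fold, a common way to keep test information out of training. The function names and the synthetic data are hypothetical.

```python
import random
from statistics import mean, pstdev


def kfold_indices(n_samples, k=7, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_scaler(rows):
    """Per-column mean and population standard deviation of the given rows."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows: subtract the mean, then divide by the standard deviation."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def cross_validation_splits(rows, k=7, seed=0):
    """Yield (train, test) pairs scaled with training-fold statistics only."""
    folds = kfold_indices(len(rows), k, seed)
    for held_out in range(k):
        train = [rows[i] for f, fold in enumerate(folds) if f != held_out for i in fold]
        test = [rows[i] for i in folds[held_out]]
        means, stds = fit_scaler(train)
        yield transform(train, means, stds), transform(test, means, stds)


# Synthetic stand-in for the 238 observations (two made-up features).
rng = random.Random(1)
data = [[rng.gauss(80, 5), rng.gauss(14, 1)] for _ in range(238)]
splits = list(cross_validation_splits(data))
```

With 238 observations and k = 7, every fold holds exactly 34 samples, so each model is trained on 204 observations and tested on 34.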
We used TensorFlow 2.5.0, scikit-learn 0.22.2, and Python 3.7.10 to create the machine learning models trained on the data [17,18]. Table 1 shows the five-number summaries of the features. Figure 3 shows the Spearman correlation coefficients of the features.
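The multilayer perceptron described in the methods can be sketched without a framework to make its shape concrete. This is not the TensorFlow implementation used in the study: it shows only the forward pass and the binary cross-entropy loss in plain Python, with random untrained weights, and leaves out training with the Adam optimizer. It assumes that each of the two hidden layers has ten nodes; all function names and the weight initialization are hypothetical.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, activation):
    """Apply one fully connected layer followed by its activation."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def build_mlp(n_inputs, rng):
    """Input -> 10 ReLU -> 10 ReLU -> 1 sigmoid (ten nodes per layer is assumed)."""
    return [
        init_layer(n_inputs, 10, rng),
        init_layer(10, 10, rng),
        init_layer(10, 1, rng),
    ]


def predict_proba(model, features):
    """Forward pass returning the sigmoid output for one sample."""
    hidden1 = dense(features, model[0], relu)
    hidden2 = dense(hidden1, model[1], relu)
    return dense(hidden2, model[2], sigmoid)[0]


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def count_parameters(model):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in model)


rng = random.Random(0)
model_with_rt = build_mlp(16, rng)     # all 16 features
model_without_rt = build_mlp(15, rng)  # RT removed
sample = [rng.gauss(0, 1) for _ in range(16)]  # one standardized, made-up sample
probability = predict_proba(model_with_rt, sample)
```

Under these assumptions the network has 291 trainable parameters with RT and 281 without it, which is small relative to typical deep learning models and consistent with the size of the dataset.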

Results
In our study, naive Bayes models showed the worst performances, although their performance metrics were similar to those of the KNN models. Deep learning models were the best performers among all models. The performance differences between deep learning models and decision tree models were minor (Table 2).
When RT, the key point of differentiation used in HPLC, was included for model training, all methods performed well. When RT was excluded, deep learning performed the best, while the naive Bayes model performed the worst (Figure 1, Table 2).
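For reference, the performance metrics used in this study all follow from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and do not reproduce any result of the study, and the text does not state which carrier class was treated as positive.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts chosen only to illustrate the formulas.
metrics = classification_metrics(tp=45, fp=10, tn=40, fn=5)
```

Because the F1 score is the harmonic mean of PPV and sensitivity, it ignores true negatives, which is why it is reported alongside specificity and NPV rather than instead of them.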

Discussion
In this study, we trained four different types of machine learning models (k-nearest neighbors, naive Bayes, decision tree, and deep learning models) as binary classifiers of Hb S and Hb D carriers. This was one of the strengths of our approach.
Well-prepared data are of utmost importance for machine learning model training. Our dataset, although small, had no missing data. We performed 7-fold cross-validation to understand how generalizable our models were. K-fold cross-validation was also a necessity due to the small size of our dataset. Well-prepared, feature-engineered data and hyperparameter tuning contributed to the success of our machine learning models.
The worst-performing models were the naive Bayes models (Figure 3). The naive Bayes algorithm is called naive because it assumes that all variables are independent of each other, which is atypical of real-world data. The features in our dataset were not independent of each other either. In our study, decision tree and deep learning models showed similar performances. Deep neural networks require considerable effort to select the right combination of hyperparameters. For this reason, tree-based models are advisable for tabular data such as ours. Tree ensemble models (such as XGBoost) have been reported to outperform deep learning models on prediction and regression problems with tabular data [17]. Essentially, neural networks are continuous models that can approximate discontinuities via non-linear activation functions like ReLU. By contrast, tree-based models are fundamentally discontinuous, which gives them some advantage for tabular data. Nevertheless, a new deep learning method has been claimed to outperform gradient boosting methods, including XGBoost, CatBoost, and LightGBM, on average over various benchmark tasks [17].
In addition to hematological data, the effect of RT was investigated in our study. All of the machine learning models performed well when RT, the main distinction point used in HPLC, was included in training, and their performance declined when RT was excluded. However, the deep learning model (Accuracy: 0.99; Specificity: 0.99; Sensitivity: 0.99; F1 score: 0.99) and the decision tree model (Accuracy: 0.99; Specificity: 0.99; Sensitivity: 0.98; F1 score: 0.99) maintained their high performance. The deep learning and decision tree models produced the best results, while the naive Bayes models produced the worst. In this way, the learning and performance of models based solely on hematology-related data were evaluated.
These results indicate that Hb S and Hb D Los Angeles carriers can be distinguished using machine learning on the relationships within hematological data, without the need for RT.
Machine learning methods have been shown to be successful in the differential diagnosis of beta-thalassemia minor and iron deficiency anemia, breast cancer metastasis prediction, heart diseases, and the life expectancy prediction of intensive care patients [1,19,20].
Setsirichok et al. used decision trees, naive Bayes, and MLP machine learning approaches to predict alpha and beta thalassemia trait, HbE, HbH, hereditary persistence of fetal hemoglobin (HPFH), and beta thalassemia major. They concluded that naive Bayes and MLP are excellent screening tools for thalassemia [21]. Borah et al. attempted to differentiate patients with beta-thalassemia, thalassemia major, Hb E, and sickle cell anemia and reported successful results with their machine learning methods [22].
Piroonratana et al. attempted to predict thalassemia types by analyzing HPLC chromatograms with decision trees and artificial neural networks. They concluded that machine learning algorithms could be used to guide thalassemia typing [23].
Chy et al. used machine learning methods to detect sickle cell anemia from hematological images and reported high performance [24].
In the study by Magen et al., an ANN was used in the differential diagnosis of beta-thalassemia and iron deficiency. Both diseases were accurately predicted with a specificity of 0.96 and a sensitivity of 0.99, and the authors stated that population screenings could be performed easily using ANN [25].
To the best of our knowledge, this is the first study to use machine learning models to distinguish between Hb S carriers and Hb D-Los Angeles carriers.
The deep learning and decision tree models we trained were able to distinguish two hemoglobin types, very similar hematologically, electrophoretically, and chromatographically, with high performance.
These results suggest that, when fed enough data, machine learning models may detect a wide range of hemoglobin variants.

Conclusions
Deep learning and decision tree models have demonstrated high performance and have the potential to be integrated into medical laboratory work practices as an aid for hemoglobinopathy detection. These outcomes suggest that, when fed enough data, machine learning models may predict a wide range of hemoglobin variants. However, more comprehensive studies with data from a larger number of patients and hemoglobinopathies will be useful for validating our models.