Tree-based machine learning algorithms in the Internet of Things environment for multivariate flood status prediction

Floods are one of the most common natural disasters in the world and affect all aspects of life, including human beings, agriculture, industry, and education. Research on developing flood prediction models has been ongoing for the past few years. These models are proposed and built to support risk reduction and policy proposition and to limit the loss of human lives and the property damage associated with floods. However, flood status prediction is a complex process and demands extensive analyses of the factors leading to the occurrence of flooding. Consequently, this research proposes an Internet of Things-based flood status prediction (IoT-FSP) model that is used to facilitate the prediction of river flood situations. The IoT-FSP model applies the Internet of Things architecture to facilitate the flood data acquisition process and three machine learning (ML) algorithms, which are Decision Tree (DT), Decision Jungle, and Random Forest, for the flood prediction process. The IoT-FSP model is implemented with MATLAB and Simulink as development platforms. The results show that the IoT-FSP model successfully performs the data acquisition and prediction tasks and achieves an average accuracy of 85.72% for the three-fold cross-validation results. The research finding shows that the DT scores the highest accuracy of 93.22%, precision of 92.85, and recall of 92.81 among the three ML algorithms. The ability of the ML algorithm to handle multivariate outputs of 13 different flood textual statuses provides the means of manifesting explainable artificial intelligence and enables the IoT-FSP model to act as an early warning and flood monitoring system.


Introduction
Natural disasters have caused extensive damage to mankind, with huge material and moral losses that affected the lives of 200 million people and cost the economy about $95 billion. Their impact also extends to other aspects of life, including transportation, electricity, water, and ecosystems [1]. The main factors that cause floods are intense or extreme rainfall events [2], hurricanes, and sewage blockages. Other factors are related to human activity, such as land-use changes, urbanization, and mineral resource exploitation [3]. The risk of natural disasters increases especially with rapid urban growth, where the rising density of human structures leads to inefficient management of water resources [4] and sanitation networks and to poor management of solid waste. This may result in health problems, floods, and landslides. According to ref. [5], the percentage of human losses in the Asian continent as a result of natural disasters is about 90%, and these losses are often caused by floods. For these reasons, managing floods and mitigating the damage they cause are essential strategies to consider [2,6]. One of the main solutions in managing flood disasters and mitigating their future severity is the identification of flood and torrential risk areas using effective and highly accurate methods [7]. Hydrological models are used to determine areas at risk of flooding; hence, forecasting the severity of floods and assessing anthropogenic mitigation measures will be required in the future [8].
Studies on rainfall and floods help establish the correct procedures in natural disaster alert and response systems and improve preparedness to face these situations [4]. Because floods involve many complex variables, it is important to apply Internet of Things (IoT) technology along with data mining techniques to assess flood sensitivity accurately [9]. IoT technology provides an advanced data acquisition infrastructure that satisfies the needs of the distributed and dynamic environments of flooding areas. Data mining and machine learning (ML) techniques are effective tools to investigate and develop various models for flood prediction. Data mining techniques can be used to illustrate the mechanism between specific events and related variables [10]. Several studies have been performed in the field of floods, specifically on the creation of models that predict floods occurring in different regions of the world [11,12]. In general, supervised ML or statistical data mining methods have recently been used for flood prediction, such as Logistic Regression (LR) [13], Artificial Neural Networks (ANNs) [14,15], Naïve Bayes (NB) [16], Random Forest (RF) [17], Support Vector Machine (SVM) [18], Decision Trees (DT) [2], and Decision Jungle (DJ). Among these methods, DT is an effective method for mapping susceptibility to floods and has shown high predictive performance [19]. SVM is another effective tool for a range of hydrological modeling applications across many continents [20]. However, explainable artificial intelligence requires a solution model whose results can be understood by humans. This issue can be addressed by enabling the supervised ML algorithm to predict a larger set of class labels that have a contextual form.
This article presents the various methods, techniques, and models that are used to achieve this work. It provides an extensive overview of the various machine learning and data mining approaches in the flood prediction field. The contributions of this article are as follows:
1. An Internet of Things-based flood status prediction (IoT-FSP) model that facilitates the prediction of the river's flood situation based on three ML algorithms: DT, DJ, and RF.
2. A complete design of a flood alert system that has the feature of explainable artificial intelligence.
3. An evaluation of the three ML algorithms, DT, DJ, and RF, in terms of accuracy, precision, and recall of flood prediction.
The structure of the article is arranged in six sections as follows. Section 2 reviews the works related to flood data acquisition and prediction. Section 3 presents the research framework and the methods used to perform the data mining task, along with the dataset and the evaluation metrics. Section 4 illustrates the main components of the IoT-FSP model. Section 5 presents the results and discusses the outcomes, and Section 6 states the concluding remarks and future work.

Related works
The related works are divided into two sections based on the research scope. The first section focuses on flood data acquisition using IoT technology. The second section focuses on flood status prediction using ML. The review aims to identify the state-of-the-art models and methods that are used to solve the problems of multivariate classification for pre-disaster flood early warning, as presented in the last section of this review.

Flood data acquisition
The IoT technology has been recently implemented in a wide range of projects as a smart solution for data collection and processing, especially in dynamic and complex systems. Ghapar et al. [9] propose the use of IoT architecture for flood data management. They suggest the usage of different sensors for collecting flood-related data, such as sensors for measuring hydrological, geological, and meteorological data. The project confirms the ability of the IoT architecture to facilitate data collection, transmission, and management. However, this architecture remains conceptual and has not been implemented.
Noymanee et al. [21] propose a conceptual framework based on an IoT platform for flood early-warning systems in an urban environment. A context-aware module supports the framework in characterizing the flood conditions. The IoT architecture for the context-aware system consists of five layers: application, storage management, processing or reasoning, raw data retrieval, and sensors. The framework is intended to predict flood situations from observations without using an explicit prediction method. However, this framework also remains conceptual and has not been implemented. Similarly, Chen and Chen [22] propose the use of a context-oriented IoT platform for data acquisition and integration. The integration process entails converting the raw data to a semantic context for easy storage, understanding, and sharing.
Balakrishna et al. [23] propose a sensor data acquisition and analysis framework based on an IoT platform. The framework is implemented in a traffic monitoring system by using the ThingSpeak IoT Cloud platform that provides data analysis and visualization services. A median filter is used to improve the context of the data. The data processing is performed by the Gaussian mixture model and context construction method. The test results show that the framework achieves an accuracy of 84.56% on average for traffic condition description.
Fang et al. [24] propose an IoT-based integrated information system for snowmelt flood early warning. The IoT architecture is used for data acquisition, sharing, and management of multi-source information. The integrated information system has been developed as a web application. The test case study is the Quergou River Basin, which is located in Hutubi County, Xinjiang, China. The resulting warning recommendations are assessed by comparing the actual and presented water levels.

Flood status prediction
Various research articles investigate the use of ML algorithms in the prediction of river flooding situations, and some of them are presented in this section. Chen et al. [2] use three techniques for spatial forecasting of floods in the Quannan region of China. They evaluate and compare the Naive Bayes Tree (NBTree), the Alternating Decision Tree (ADTree), and RF methods to determine their ability to predict floods. Their approach is based on producing a flood inventory with 363 flood sites and dividing them into training and verification datasets with a 70/30 random selection. Thirteen factors are used to create the spatial flood database to explain and understand the floods. Their results show that RF is an effective and reliable model for assessing flood vulnerability.
Khosravi et al. [5] examine four DT-based ML models, namely Logistic Model Trees (LMT), Reduced Error Pruning Trees (REPT), Naive Bayes Trees (NBT), and Alternating Decision Trees (ADT), for flash flood susceptibility mapping in Iran. They construct a spatial database with 201 present and past flood locations and 11 flood-influencing factors. The capability of these models for flood prediction is evaluated and compared using statistical evaluation measures, the receiver operating characteristic curve, and the Friedman and Wilcoxon signed-rank tests. The results show that the ADT model has the highest prediction capability for flash flood susceptibility assessment, followed by the NBT, the LMT, and the REPT, respectively. These strategies have proven to be effective in the rapid determination of flood-prone areas.
Hong et al. [6] introduce a new approach to building a flood susceptibility map in China by applying fuzzy weight of evidence (fuzzy-WofE) and data mining methods. What distinguishes the proposed approach is the use of fuzzy-WofE, which creates a preliminary flood sensitivity map and determines the variables associated with floods. LR, RF, and SVMs are implemented, taking into account the 11 flood-related variables. They evaluate the efficiency of their approach using the area under the curve (AUC). Their results show that the fuzzy WofE-SVM model produces the highest predictive performance (AUC value, 0.9865), which also appears to differ from the other predictive models in a statistically significant way.
Tehrany et al. [19] examine and validate the hypothesis that the accuracy of the final susceptibility mapping result improves when more conditioning variables are added to the dataset used in river flood modeling. In addition, this research assesses the effect of individual conditioning factors on flood susceptibility mapping and their significance in the construction of accurate maps of possible flood regions. They use DT and SVM to test spatial correlations between flood conditioning factors and rate their degree of importance for flood-prone mapping. They assess the accuracy of the two ML approaches using the AUC method. The results show that SVM and DT provide the highest predictive accuracy levels of 85.52 and 88.47%, respectively, using DS1 (LiDAR dataset). Finally, it is concluded that using additional variables in the simulation does not necessarily lead to higher accuracy. Suliman et al. [20] review the related works of flood forecasting. The review focuses on the two most popular ML algorithms, which are ANN and SVM.
Choubin et al. [25] use two new algorithms, namely Multivariate Discriminant Analysis (MDA) and Classification and Regression Trees (CART), combined with the SVM algorithm in flood susceptibility analysis. They use these models with a flood inventory map and many flood conditioning factors to develop a flood susceptibility map. A new framework for flood susceptibility assessment is proposed to ensure a more accurate ensemble model where only those models with an accuracy of 80% are permissible for use in ensemble modeling. The results show that the MDA model produces the highest predictive accuracy (89%), followed by the SVM (88%) and CART (83%) models. The ensemble modeling approach indicates that areas with a high population density are more vulnerable to floods, and therefore these areas should be given priority in flood prevention and treatment.
Liu et al. [26] combine Stacked Autoencoders (SAE) and Backpropagation Neural Networks (BPNN) to implement a new deep learning approach for flood prediction. To further develop the ability to model nonlinearity, their architecture performs two processes. First, K-means clustering is applied to classify the data into various categories. Then, the data categories are represented by multiple SAE-BP modules. The comparison between their approach and other approaches, which are SVM, BPNN, Radial Basis Function, and Extreme Learning Machine, shows that their approach performs much better.
Widiasari et al. [27] define the main ANN model that is useful in time series forecasting and give a basic procedure for its practical implementation in this kind of task. The model analyzed is the Multilayer Perceptron (MLP). The Mean Absolute Percentage Error (MAPE) is used to assess the precision of the flood prediction: the lower the MAPE value, the more accurate the results. MLP achieves a MAPE value of 3.64%, which means that the predictions of the built system deviate by 3.64% from the actual values used for testing. MLP also performs better on the expected water level than multiple linear regression (MLR).
Widiasari et al. [28] use the Long Short-Term Memory (LSTM) algorithm, which is common and powerful at handling long-term temporal dependencies in complex time-series data, such as precipitation and water elevation measured by sensors. The LSTM model produces a MAPE value of 3.6%, which indicates that the error in predicting the water level in the downstream river is 3.6% compared with the real water elevation value. LSTM produces more accurate predictions of the downstream water elevation level than multiple linear regression (MLR) models, which produce a MAPE value of 10.55%.
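The MAPE values quoted in the two studies above follow the usual definition, the mean of the absolute errors relative to the actual values, expressed in percent. A minimal sketch is given below; the water-level readings are hypothetical and are not taken from either study.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent.

    Assumes that no actual value is zero, since each error is
    divided by the corresponding actual value.
    """
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical water levels in metres, for illustration only.
actual_levels = [2.0, 4.0, 5.0]
predicted_levels = [1.9, 4.2, 5.0]
print(f"MAPE = {mape(actual_levels, predicted_levels):.2f}%")
```

A lower value means that the predictions are, on average, closer to the measured levels, which is how the 3.6% and 10.55% figures are compared.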
From the previous contributions, we observe that ML algorithms are mainly used to predict floods. In Chen et al. [2], the RF model achieves the highest accuracy of 91.5% in predicting the flood status of five classes. The work of Khosravi et al. [5] achieves the highest accuracy of 94.3% in predicting the flood status of four classes through the ADT model. The work of Hong et al. [6] achieves the highest accuracy of 92.2% in predicting the flood status of five classes through the Fuzzy WofE-SVM model. However, none of these models introduces a complete architectural design of the flood alert system. They only focus on evaluating the ML ability to predict flood status from a maximum of five classes and neglect flood data acquisition.

Anglian river basin district (RBD) dataset
A dataset that is obtained from a specific source, such as the Internet, might be incomplete and inconsistent. Hence, selecting appropriate data is an important research issue [29,30]. The Anglian RBD covers 27,900 km² and is home to a total of 7.1 million people. It includes the cities of Northampton, Lincoln, Chelmsford, and Milton Keynes, as shown in Figure 1. The RBD dataset that is used in this research for flood prediction is an open dataset taken from the environmental agency [31]. It comprises a collection of datasets related to flood measurement, classification, environmental effect, and protection. The RBD dataset was last updated on September 17, 2020. It has 19 attributes and 149,676 instances, and its multivariate target classes are 13 river states, as displayed in Table 1.

ML algorithms
Data mining is a process of classification, audit, and semi-automatic analysis of very large amounts of data to obtain useful information and explore patterns and links [32]. Data mining is used in several different fields as a method in predicting data appropriately [33]. This section provides a summary of some data mining algorithms, which are DT, RF, and DJ algorithms.
DT is a branching tree structure that is used to determine a course of action or show the possible solutions. Each internal node represents a test on an attribute, and each branch represents a possible outcome of that test [34]. Usually, an entropy function is deployed to control how the DT splits the data. The entropy affects the decision boundaries that the DT draws. The entropy formula is given in (1).
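As an illustration of the entropy function mentioned above, the sketch below computes the standard Shannon entropy, H(S) = -Σ p_i log2(p_i), where p_i is the share of class i among the samples at a node. This standard form is assumed here to correspond to (1); the labels are hypothetical and not taken from the RBD dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a tree node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical statuses at one node: an even mix is maximally uncertain.
mixed_node = ["flood", "flood", "no flood", "no flood"]
pure_node = ["flood", "flood", "flood"]
print(entropy(mixed_node))       # evenly mixed node
print(abs(entropy(pure_node)))   # pure node carries no uncertainty
```

A node whose samples all share one class has zero entropy, so a split is preferred when it moves the child nodes toward that state.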
The DT provides a transparent tree-like structure with effective rules that are easily interpreted and understood [35]. DT is used in flood forecasting of high-water levels and water flow [36]. Compared with other classification methods, DT can be built quickly [34]. It has also been widely used for both continuous and discrete datasets. DT performs variable screening and feature selection adequately [37]. Regarding its performance, nonlinearity does not impact any of the DT parameters [38]. Hence, it is well suited to problems with multiple alternatives or situations that involve risks and uncertainties. The basic rule in building a DT is to find the best question for each branch of the tree so that each question divides the data into two sections: the first section satisfies the question, and the second does not. Through a series of such questions, the DT is built with its chain of branches. Although the DT is used for exploration and data preparation for statistical operations, it is more often used to predict the values of cases not found in the training set [39].
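The idea of finding the best question for a branch can be sketched as a search over thresholds on a single numeric attribute, keeping the threshold with the largest information gain. This is one common criterion and is an assumption here, since the text does not state which criterion the implementation uses; the readings and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best question 'value <= threshold?'."""
    parent = entropy(labels)
    best = (None, 0.0)
    # The largest value is skipped because it would leave the "no" side empty.
    for t in sorted(set(values))[:-1]:
        yes = [l for v, l in zip(values, labels) if v <= t]
        no = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical water levels (metres) and statuses, for illustration only.
levels = [0.4, 0.6, 0.9, 1.8, 2.1, 2.5]
status = ["normal", "normal", "normal", "flood", "flood", "flood"]
threshold, gain = best_threshold(levels, status)
print(threshold, gain)
```

Applying the same search recursively to each of the two resulting sections yields the chain of branches described above.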
RF is an ensemble learning algorithm. It is utilized mainly for classification and regression tasks [40]. The RF architecture is a multitude of DTs in which the output of the RF is the class selected by the majority of the DTs or the average of their outputs [41]. The prediction of a new case x after the training phase by using the average function is represented in (2). For many classification problems, both linear and nonlinear, RF usually outperforms the DT with an efficient mapping of input into forecasted spaces [42]. The design of the RF allows it to overcome the main problem of the DT, which is its tendency to overfit the training data [43]. However, the RF performance is affected by the characteristics of the data [44]. RF is currently a popular classifier and has been integrated into many applications due to its stable performance in a wide range of classification problems [45].
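The two aggregation rules described above can be sketched as follows. The averaging rule is assumed to take the usual form for (2), that is, the mean of the B individual tree outputs for the new case x; the tree outputs below are hypothetical.

```python
from collections import Counter


def forest_average(tree_outputs):
    """Averaging rule: mean of the individual tree outputs for one case."""
    return sum(tree_outputs) / len(tree_outputs)


def forest_majority(tree_votes):
    """Voting rule: the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs of five trees for one new case x.
print(forest_average([1.0, 2.0, 2.0, 3.0, 2.0]))
print(forest_majority(["flood", "no flood", "flood", "flood", "no flood"]))
```

Because each tree is trained on a different sample of the data, the errors of individual trees tend to cancel in the aggregate, which is the mechanism behind the reduced overfitting noted above.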
The DJ algorithm is an ensemble learning approach for classification. The algorithm works by building different decision trees and then voting on the most popular output class [46]. The trees with high prediction performance have a greater weight in the final decision of the ensemble. A large number of data science applications have been developed using decision forests and trees, although these approaches have certain drawbacks; for example, given a large amount of data, the number of nodes in DTs may grow exponentially [47]. The DJ differs from DTs in that it uses directed acyclic graphs (DAGs), which permit multiple paths from the root to each leaf, whereas traditional DTs allow only one path to each node [48].
The DJ method thus contrasts with the other two algorithms because it merges nodes, thereby improving the function and structure of the DAG.

Evaluation metrics
To compare the performance of the classifiers, preprocessing is carried out on all the values of the 19 attributes, and no feature selection is performed. A comparative study of the DT, DJ, and RF classification results is performed with a three-fold cross-validation method for training and testing [49-51].
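• Accuracy: It is the most commonly used metric to judge a model, but it is not always a clear indicator of performance, particularly when classes are imbalanced.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (3)$$
• Precision: It is the ratio of correctly classified positive instances (TP) to all instances classified as positive (TP + FP).
$$\text{Precision} = \frac{TP}{TP + FP} \quad (4)$$
• Recall or sensitivity: It is the ratio of correctly classified positive instances (TP) to all actual positive instances (TP + FN).
$$\text{Recall} = \frac{TP}{TP + FN} \quad (5)$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.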
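For a multiclass target such as the 13 river states, TP, FP, and FN are counted per class and the per-class scores are then combined. The sketch below uses macro averaging, which is an assumption because the text does not state the averaging scheme; the labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        # A class that is never predicted (or never occurs) scores 0 here.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Hypothetical three-class example, for illustration only.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
print(accuracy(y_true, y_pred))
print(macro_precision_recall(y_true, y_pred))
```

Macro averaging gives every class the same weight, so a rare river state that is often misclassified lowers the score as much as a frequent one would.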

IoT-FSP model
The IoT-FSP model is proposed to provide suitable flood data acquisition and prediction mechanisms. It employs an IoT architecture to manage the data flow and processing from the source to the destination, and it uses three tree-based ML algorithms of multivariate classification to analyze the extracted data and perform flood status prediction. The basic IoT architecture consists of three core layers: the application layer, the network layer, and the data acquisition or sensor layer. This architecture can play an important role in facilitating early flood detection; hence, it is selected to advance this research. Figure 2 shows a general view of data acquisition in the IoT environment. Starting from the top of this architecture, a set of sensors and devices is used to collect data related to the RBD, operational and catchment management, waterbody, water level, and water quality assessment items. These data are transmitted to the network layer using a client-server local area network. Because this is an outdoor environment, cellular connectivity is applied with Wi-Fi and Ethernet as subnetworks. These components act as a gateway or edge computing node that connects the sensors and devices with the cloud computing system, as shown in Figure 2. Subsequently, the network layer reads the required data and sends them through the Wi-Fi and cellular connectors that allow access to the cloud services. Hence, the main task of the network layer is to connect the IoT acquisition layer with the application platform. The application platform works on the data collected by the previous layers. In the real world, data are generally incomplete: they lack certain behaviors or trends, contain only aggregated values without meaning, or contain many errors. Data preprocessing is a proven method for solving such problems and obtaining consistent, meaningful, and clearly arranged data.
The application layer starts with the data preparation that includes three sub-phases: data selection, data preprocessing, and data partitioning. In the data selection sub-phase, the RBD dataset is retrieved from the database and selected for processing. Data transformation is the process of converting data from one format or structure into another. It is performed after the data selection process to ensure that the chosen data are complete and verified. Data partitioning forms part of preprocessing. It splits the data using three-fold cross-validation so that the resulting mining process is more effective and the patterns found are robust. Figure 3 shows the IoT-FSP model based on the IoT architecture.
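The three-fold partitioning step can be sketched as follows: the instance indices are shuffled once, dealt into three folds, and each fold serves as the test set exactly once while the other two form the training set. The function name and the fixed seed are illustrative assumptions and are not part of the MATLAB implementation.

```python
import random


def three_fold_splits(n_instances, seed=0):
    """Yield (train_indices, test_indices) for each of the three folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::3] for i in range(3)]     # deal indices into 3 folds
    for k in range(3):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Nine hypothetical instances: each fold tests on 3 and trains on 6.
for train, test in three_fold_splits(9):
    print(len(train), len(test))
```

Every instance is tested exactly once across the three rounds, so the averaged score does not depend on a single lucky or unlucky split.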
The data mining paradigm applies ML technology to extract knowledge from huge amounts of data, converting incomprehensible raw data into data that people can read and understand. The proposed ML algorithms, which are DT, DJ, and RF, are applied to the datasets using the MATLAB ML Toolbox, a platform that facilitates the work with ML algorithms. It provides a complete reference and modules that support the experiments and the scoring workflow. ML algorithms require a parameter setting phase, and the parameters are set based on the needs of each algorithm to provide the best performance. The fused results from the ML algorithms first identify the initial status of the river (flood or no flood) and then determine the final status among the 13 target river statuses. When the river status implies a flood, the model triggers an alarm and provides a notification through the user interface regarding the flood status conditions.
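The alarm step described above can be sketched as a simple rule applied to the predicted textual status. The status names and the `handle_prediction` helper below are hypothetical placeholders; the 13 actual labels are those listed in Table 1, and the notification callback stands in for the user interface.

```python
# Hypothetical subset of statuses that imply flooding (not the Table 1 labels).
FLOOD_STATUSES = {"flood warning", "severe flood warning"}


def handle_prediction(predicted_status, notify):
    """Send a notification when the predicted status implies a flood.

    Returns True if an alarm was raised, False otherwise.
    """
    is_flood = predicted_status in FLOOD_STATUSES
    if is_flood:
        notify(f"ALERT: river status is '{predicted_status}'")
    return is_flood


# Collect notifications in a list instead of a real user interface.
messages = []
handle_prediction("normal", messages.append)
handle_prediction("flood warning", messages.append)
print(messages)
```

Because the classifier outputs a textual status rather than a bare number, the notification can quote the status directly, which is the sense in which the output is readable by humans.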

Results and discussion
The implementation of the IoT-FSP model uses MATLAB and Simulink as development platforms: MATLAB is used for the soft computing process, and Simulink is used to simulate the online data transmission process through the cloud computing services. The ML algorithms implemented to predict the flood status are DT, DJ, and RF. The algorithms are tested using the three-fold cross-validation approach to check their ability to predict the 13 flood statuses. Every algorithm has its separate process at the data simulation phase to perform data splitting, data preprocessing, and training. The ML algorithms of the IoT-FSP model are trained after the training parameters have been set correctly. In the test phase, the test set is used to calculate and evaluate the performance of every algorithm's prediction. Subsequently, the verified prediction results on the test data are reported for the three algorithms.
Finally, the results and analysis phase consists of testing and evaluating the performance of the model with the selected algorithms. During the evaluation, the results from the testing phase are compared across the algorithms to see which algorithm produces lower error rates and higher accuracy. This phase then shows which algorithm is the most efficient in terms of accuracy, precision, and recall of the flood status results. Table 2 shows the results of the three tested algorithms of the IoT-FSP model in terms of accuracy, precision, and recall.
The results in Table 2 show that the IoT-FSP model successfully performs the data acquisition and prediction tasks. The three algorithms achieve an average accuracy of 85.72% for the three-fold cross-validation results. The statistical analysis of the accuracy results shows that there is a small variance between the three-fold cross-validation results of 0.004409 for the DT, 0.0334622 for the DJ, and 0.0018556 for the RF. The highest accuracy is achieved by the DT at 93.22%, followed by the RF at 86.57%, and the DJ achieves the lowest accuracy of 77.38%. Moreover, the DT has the lowest time complexity, followed by the RF, and the DJ has the highest time complexity among the three [52]. Figure 4 shows the three-fold test results of the three ML algorithms.
The literature review presents several research examples that are conducted to handle flood disasters, including flood data acquisition, data visualization, flood prediction, flood detection, early warning, and flood monitoring. The scope of this article covers the pre-disaster phase for early warning comprising flood data acquisition using IoT technology and multivariate classification for flood status prediction using ML. The related works in flood research are summarized in comparison to our work in Table 3.

Table 3
| Ref. | Technology | Description |
| | IoT architecture | For flood data management. It is never implemented |
| [19] | ML | For assessing the effect of individual conditioning influences on flood susceptibility mapping. Multivariate classification of three classes. LR model achieves the highest accuracy of 90.6% |
| [21] | IoT, context-aware | For flood early-warning system, only conceptual. It is never implemented |
| [22] | IoT, context-oriented | For wireless sensor networks to be used in ambient systems. It is never implemented |
| [23] | IoT, Gaussian mixture | For traffic flow analysis, implemented ThingSpeak IoT Cloud platform. It is never implemented |
| [24] | IoT | For snowmelt flood early warning. It is only used for data acquisition |
| IoT-FSP | ML, IoT | For forecasting of floods of RBD dataset. Multivariate classification of 13 classes. DT model achieves the highest accuracy of 93.22% |
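As a quick arithmetic check of the reported results, the 85.72% average accuracy is the mean of the three per-algorithm accuracies quoted in the results (93.22% for DT, 86.57% for RF, and 77.38% for DJ). The snippet below only reproduces that mean; the per-fold values needed to recompute the variances are not given in the text.

```python
# Per-algorithm accuracies (percent) as reported in the results.
accuracies = {"DT": 93.22, "RF": 86.57, "DJ": 77.38}

average = sum(accuracies.values()) / len(accuracies)
print(round(average, 2))
```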
Based on Table 3, the IoT-FSP model outperforms the alternative models in terms of implementation, as it predicts a complex multivariate output of 13 classes and achieves higher prediction accuracy. Moreover, the model utilizes both IoT and ML technologies in its design to form a flood alert system that has the feature of explainable artificial intelligence. The limitation of this model is that it still needs to be tested in a real-world multivariate flood status prediction environment.

Conclusion
Flooding is a dangerous event whose risks should be controlled properly. It is mainly attributed to heavy rain, melting snow, or events that emerge frequently as a consequence of climate change. The impacts of floods are not limited to individuals but extend to whole societies, affecting many aspects, most notably the economic, environmental, and social aspects. However, the negative effects of a flood differ in intensity and impact depending on several natural and contingency factors. Therefore, there are many special measures to mitigate its impact, including early flood detection, warning, and monitoring systems. These systems should be implemented in the wider management of floodplains. Flood prediction or forecasting is carried out to reduce the risk of flooding. The variables and methodology used in this study apply to flood susceptibility mapping for the different study areas. Subsequently, this article proposes an IoT-FSP model that facilitates the prediction of the river's flood situation.
The IoT-FSP model encompasses an IoT architecture for data collection and three different ML algorithms of DT, DJ, and RF for flood prediction. The performance of each ML algorithm is assessed and compared in terms of accuracy, precision, and recall. The experimental RBD dataset is used in this project to evaluate the performance of the IoT-FSP model and determine the best ML algorithm for flood prediction. The results of the three-fold cross-validation method show that the DT achieves the highest accuracy score of 93.22%, precision score of 92.85, and recall score of 92.81 among the three ML algorithms. The DT algorithm is found to be better than RF and DJ and is more reliable in predicting the flood status. The main contribution of this article is proposing a model that is able to handle multivariate flood statuses, providing the means of manifesting explainable artificial intelligence and enabling it to act as an early warning and flood monitoring system.
Future work will closely study the IoT aspects of the project, including the management of data from heterogeneous sensors, energy consumption, and performance efficiency. Additionally, contextual description modules will be integrated into the IoT-FSP model to improve its explainable artificial intelligence ability in this domain and to present the results in report form.