Open Access (CC BY 4.0 license). Published by De Gruyter Open Access, September 1, 2022.

Machine learning-based forecasting of potability of drinking water through adaptive boosting model

Surjeet Dalal, Edeh Michael Onyema, Carlos Andrés Tavera Romero, Lauritta Chinazaekpere Ndufeiya-Kumasi, Didiugwu Chizoba Maryann, Ajima Judith Nnedimkpa and Tarandeep Kaur Bhatia
From the journal Open Chemistry


Water is indispensable for life, health, and many other purposes, but not all water is safe for consumption. Various biological, chemical, and physical metrics can therefore be used to determine the quality of potable water. This study presents a machine learning-based model that uses the adaptive boosting technique to categorize and evaluate the quality of drinking water. The dataset for the study was obtained from Kaggle. An experimental analysis of different machine learning techniques, including ensembles, was carried out to create a generic water quality classifier. The results show that the forecast accuracy of the logistic regression model (88.6%), Chi-square Automatic Interaction Detector (93.1%), XGBoost tree (94.3%), and multi-layered perceptron (95.3%) was exceeded by that of the presented ensemble model (96.4%). The study demonstrates that the ensemble model predicts water quality more accurately than the other algorithms considered. The model presented in this study could help to enhance the regulation of water quality and safety and address gaps in conventional prediction approaches.

1 Introduction

The term “potable water” refers to water that has been purified and treated so that it is free from toxins and hazardous microorganisms. After these cleansing operations, the filtered water, also called “drinking water,” is safe for both drinking and cooking. Water can be purified in different ways, including UV purification and reverse osmosis [1]. At a national, regional, and local level, access to clean drinking water is a health and development concern. Because they reduce the risk of disease and lower healthcare expenditures, investments in water supply and sanitation have proven economically beneficial in some places. This is true for large-scale water supply infrastructure and for domestic water treatment. As part of poverty reduction measures, interventions that increase access to safe water favor the poor in particular, whether they live in rural or urban settings [2].

Safe drinking water poses no significant risk to health over a lifetime of consumption, even though sensitivity to contaminants may change over the course of a person’s life. Infants and small children, the elderly, and those who are debilitated or living in unsanitary conditions are the groups most vulnerable to waterborne disease [3]. Safe drinking water may be used for a wide variety of household tasks, including bathing and shaving. Packaged water and ice meant for human consumption are covered by the guidelines [4]. However, some special applications, such as renal dialysis and the cleaning of contact lenses, may require water of a higher quality [5]. Severely immunocompromised individuals may need to take extra precautions, such as boiling their water before consuming it, to protect themselves against microorganisms that are generally not a problem in tap water. When it comes to assessing and managing drinking water supply risks, a comprehensive strategy promotes public trust in the water’s safety. Under this strategy, the risks associated with drinking water are systematically assessed at every step along the supply chain, from catchment to consumer, and solutions are devised to mitigate those risks, including measures to ensure that control mechanisms are effective [6]. The strategy covers the day-to-day management of water quality and includes procedures for dealing with disruptions and failures. Drinking water safety is ensured by deploying multiple barriers, from catchment to consumer, to prevent contamination or reduce it to levels that are safe for human consumption [7]. Multiple safeguards, such as protecting water supplies, selecting and properly operating a sequence of treatment stages, and managing distribution networks (whether piped or not), help to preserve and protect the quality of treated water. The preferred strategies are those that prevent or reduce the entry of pathogens into water sources and reduce reliance on treatment processes to remove pathogens [8].

In broad terms, the greatest microbiological risks are associated with water polluted with human or animal (including bird) feces. Feces can be a source of pathogenic bacteria, viruses, protozoa, and helminths [9]. Pathogens derived from excreta are the main concern when setting health-based targets for microbiological safety. Microbial water quality can vary rapidly and widely. When pathogen levels spike suddenly, the risk of illness increases considerably, which may trigger outbreaks of waterborne disease. In addition, many people may already have been exposed by the time microbial contamination is detected [10].

In this article, we introduce an artificial intelligence technique based on previously acquired data from a large number of water samples. The main objectives of this study are as follows:

  • O1: Develop a novel ensemble model of machine learning algorithms for water quality prediction.

  • O2: Predict water quality using machine learning, which plays a significant role in accurately determining whether the water is safe or not.

  • O3: Apply the adaptive boosting (AdaBoost) ensemble model to water quality prediction and compare it with existing models.

Objective O1 concerns the design of a model for predicting water quality. Objective O2 concerns the accurate prediction of whether the water is safe or not. Objective O3 is fulfilled by applying the AdaBoost ensemble model to predict whether the water is safe or not and comparing the results with those of existing machine learning algorithms. The main outcome of these objectives is a more accurate prediction of whether water is safe.

The remainder of this article is organized as follows: Section 2 reviews the related work in the body of literature. Section 3 discusses the data and materials used followed by Section 3.2 explaining the theory behind the algorithms. Section 4 presents and analyses the experimental results. Finally, the article concludes and proposes future work in Section 5.

2 Related work

Mounce et al. [11] proposed a fully AI-based framework to automate water meter data collection, consisting of a recognition system (RS) and a web service. The system offered several benefits for both consumers and water service providers, for example, monitoring consumption, detecting water leakage, and visualizing water use and drinking water coverage on a geographic map. It also provided a reliable tool to support sound decision-making, with various reporting features. The core component of the RS is a convolutional neural network (CNN) model trained on the proposed MR-AMR (Moroccan Automatic Meter Reading) dataset. In the standard test phase, an accuracy of 98.70% was obtained. The approach was tested and validated through experiments.

Roelich et al. [12] adopted a state-of-the-art AI algorithm to detect and predict the deterioration of water quality accurately and in time. Identifying fish diseases at an early stage is important for farmers, because it allows preventive measures that can limit a potential outbreak and ultimately prevent financial losses, so that the national economy is not harmed. The experimental results showed high accuracy in detecting fish diseases from the given water quality data when the algorithm was applied to real datasets.

Schmitz et al. [13] developed and evaluated an AI approach for identifying unmapped dams using a combination of topographic data and geodata. A random forest classification algorithm was trained to detect unmapped dams using digitally engineered predictor variables, with known dam sites used for validation. The algorithm was applied to two subbasins of the Hudson River watershed, USA, where connectivity impacts were quantified, and a range of predictor sets was evaluated to examine the tradeoffs between model parameterization effort and classification accuracy. The random forest classifier predicted dam sites with high accuracy (true positive rate = 89%, false positive rate = 1.2%) using a subset of variables related to stream slope and the presence of upstream lentic habitats. Undetected dams were common throughout the two test watersheds and can bias data on stream connectivity. Nevertheless, the study shows that AI approaches can identify unknown dams accurately and at scale, which can help in building dam inventories and thereby support better management.

Van Summeren et al. [14] computed five WQIs for two temporal periods, including (I) June to December 2019, with data acquired continuously. The computed WQIs classified the collected datasets as “very poor,” largely because of the uneven distribution of the water samples, which caused class imbalance in the data. The study also focused on classifying water quality using AI, based on parameters such as pH, conductivity, dissolved oxygen, turbidity, temperature, and fecal coliform. The experimental results showed that the decision tree (DT) algorithm outperformed the other models, achieving a classification accuracy of around 99%.

Yudenkova and Savina [15] predicted the potential occurrence of future abrupt pollution events from historical and real-time water quality monitoring data. Twelve basic water quality monitoring stations and two heavy metal monitoring stations were selected in the study. Support vector machines (SVM) and random forest (RF) methods were then used to predict whether water quality might exceed the normal standard in the near future. The results showed that both methods were highly reliable in predicting standard-exceeding conditions in irrigation water. In addition, the study included water level and precipitation as model inputs to improve the prediction of standard-exceeding concentrations of the heavy metal copper in the irrigation water of the study area. Prediction ability increased after the water level factor was added, but not after the precipitation factor was added. Furthermore, so that the data resemble actual conditions, the data should be split by time series rather than by random selection. The accuracy of the SVM model could be increased to 99.7% and 85.18% on the validation and test datasets, respectively.

Gusmini et al. [16] offered an overview of existing AI applications in the water domain and some tentative insights into what “responsible AI” could mean there. Across the publications analyzed, four categories of application were identified: modeling, forecasting, decision support and operational management, and optimization. These insights indicated that the development and use of responsible AI techniques for the water sector should not be left to data scientists alone but requires a concerted effort by water professionals and data scientists working together, complemented by expertise from the social sciences and humanities.

Howell et al. [17] contributed to the research and development of a soft sensor for water quality detection based on chlorine. A general comparison between SVM and extreme learning machine approaches was presented in terms of training time and various metrics for regression and classification. A key concern in water treatment plants is the persistent difficulty of online measurement, through dedicated measuring equipment and laboratory analysis, of specific variables related to water composition. The main aim was to establish a methodology based on a soft sensor for water quality that supports informed decisions for managing and monitoring water quality problems.

Sanz et al. [18] addressed the development of supervised AI models that could automatically classify the quality of river water. A deeper analysis revealed that correlation-based features are useful in RF and CART. At the same time, the PCA-based input achieved a higher level of accuracy (0.989) with the neural network approach.

Shabani et al. [19] focused on three models, namely probabilistic neural network, k-nearest neighbor, and SVM, as alternatives to the NSFWQI for classifying the water quality of the Karoon River, Iran, as a case study, using the minimum feasible number of parameters. The results showed that, when no parameter is omitted, all three models delivered identical results.

Villarroel Walker et al. [20] stated that, with a simple structure, fast convergence rate, and robust generalization performance, the proposed hybrid model could effectively overcome the randomness and nonlinearity of water quality parameter data, making it worthy of promotion and application. The findings of the study could be used to determine the current state of basin water quality objectively and reliably, and to provide a scientific basis for basin water environment protection and management planning.

Wang et al. [21] used AI approaches with topo-hydrological and geo-environmental parameters to characterize their spatial relationship with groundwater productivity data. The model was validated through receiver operating characteristic (ROC) curve analysis, with the area under the ROC curve computed to validate the model.

Yousefi et al. [22] proposed a desktop application to support the management of water quality monitoring instruments. The development of tools for water quality analysis and assessment is necessary to avoid adverse situations. The model was evaluated using an inquiry method and task assignment. The results showed rapid adaptability and straightforward navigation by end users. In addition, the application of a machine learning technique provided initial results for water quality prediction.

Alizadeh et al. [23] recommended several AI methods, namely SVM, RF, boosted trees, and artificial neural network (ANN), to predict the PO4 concentration. Monthly measured data from 1986 to 2014 were used to train the methods and test their accuracy. The performance of the methods was examined using different statistical indices.

Amali et al. [24] proposed an inverse calibration process implemented with an AI method. They used an inverse model based on a feed-forward artificial neural network to extend the calibration lifetime of sensors. The model was evaluated using the root-mean-square error and the cross-validation error. The proposed model was also compared with the standard statistical method and found to be superior to the classical one. The results show the best performance with a small error rate. Based on these results, ANN appears to be better suited to data analysis in environmental monitoring applications.

Barzegar et al. [25] examined the effects of river flow on the performance of AI models used for forecasting water quality parameters in nearshore water. The performance of the different AI models was close and showed similar behavior with respect to the accuracy and uncertainty of the predictions. The results showed that river discharge affected the salinity and turbidity of the bay, and that the models including river flow as an input variable performed better than those without the flow time series.

Birch et al. [26] examined the durability of geopolymer self-compacting concrete produced with the addition of minerals, modeled with both gene expression programming (GEP) and artificial neural network (ANN) approaches. Model development used raw materials and fresh mix properties as predictors and durability as the response. Both GEP and ANN methods predicted the experimental data well, with very small errors. However, GEP models may be preferred because simple equations are produced by the method, whereas the ANN is only a predictor.

Chhabra [27] combined an AI method, the WQI, and remote-sensing spectral indices through fractional derivative methods and thereby derived a model for estimating and assessing the WQI. The results showed that the estimated WQI values range between 56.61 and 2886.51. The study also examined the relationship between reflectance data and the WQI.

Dezfooli et al. [28] built a classification model for water quality. Two classification models were constructed with the data mining tool WEKA: the RF approach and the Random Tree approach. Water quality depends on various factors, for instance population growth, rapid economic development, and environmental pollution. The performance of the models was evaluated based on accuracy, precision, and recall.

Haghiabi et al. [29] examined the potential of one machine learning tool, ANN, for prediction and analysis in the Kelantan River basin. Water quality data collected at 14 stations in the river basin were used for modeling and predicting water quality parameters (WQP). The results of this analysis show that the best prediction was obtained for pH. The low kurtosis values of pH indicate that outliers had an adverse influence on performance. In the WQP analysis for each station, the WQP predictions at stations 1, 2, and 3 gave promising results. This was attributed to the data at those stations being better than the data available at the other stations, except station 8.

3 Materials and methods

3.1 Dataset

The dataset used in this research was obtained from Kaggle. The waterQuality.csv file contains water quality metrics for 8,000 records. All attributes are numeric variables and are listed below:

  • aluminum – dangerous if greater than 2.8;

  • ammonia – dangerous if greater than 32.5;

  • arsenic – dangerous if greater than 0.01;

  • barium – dangerous if greater than 2;

  • cadmium – dangerous if greater than 0.005;

  • chloramine – dangerous if greater than 4;

  • chromium – dangerous if greater than 0.1;

  • copper – dangerous if greater than 1.3;

  • flouride – dangerous if greater than 1.5;

  • bacteria – dangerous if greater than 0;

  • viruses – dangerous if greater than 0;

  • lead – dangerous if greater than 0.015;

  • nitrates – dangerous if greater than 10;

  • nitrites – dangerous if greater than 1;

  • mercury – dangerous if greater than 0.002;

  • perchlorate – dangerous if greater than 56;

  • radium – dangerous if greater than 5;

  • selenium – dangerous if greater than 0.5;

  • silver – dangerous if greater than 0.1;

  • uranium – dangerous if greater than 0.3; and

  • is_safe – class attribute {0 – not safe, 1 – safe}.
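The thresholds listed above can be kept in a lookup table and used to flag which attributes of a sample exceed their stated limit. The following is a minimal sketch, not the authors' code: the function name `exceeded_thresholds` and the sample values are hypothetical, and the `is_safe` label in the dataset is not assumed to be derived from these thresholds.

```python
# Danger thresholds copied from the attribute list above.
# Key names follow the list (including the dataset's spelling "flouride").
DANGER_THRESHOLDS = {
    "aluminum": 2.8, "ammonia": 32.5, "arsenic": 0.01, "barium": 2.0,
    "cadmium": 0.005, "chloramine": 4.0, "chromium": 0.1, "copper": 1.3,
    "flouride": 1.5, "bacteria": 0.0, "viruses": 0.0, "lead": 0.015,
    "nitrates": 10.0, "nitrites": 1.0, "mercury": 0.002, "perchlorate": 56.0,
    "radium": 5.0, "selenium": 0.5, "silver": 0.1, "uranium": 0.3,
}


def exceeded_thresholds(sample):
    """Return the sorted names of attributes whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in DANGER_THRESHOLDS.items()
        if name in sample and sample[name] > limit
    )


# Hypothetical sample for illustration only (not taken from the dataset).
sample = {"aluminum": 1.65, "arsenic": 0.04, "cadmium": 0.007, "lead": 0.01}
print(exceeded_thresholds(sample))  # ['arsenic', 'cadmium']
```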

3.2 Methods

3.2.1 Data processing

To transform the raw data into a useful and efficient format, several data pre-processing techniques are employed [30]. At this stage, different functions are applied to find missing values, outliers, and redundant or skewed features [11].

3.2.2 Missing values

Once the data are loaded, a function is employed to find the missing values for each feature [12].
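A per-feature missing-value count of this kind can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the set of tokens treated as missing (`MISSING_TOKENS`) and the three in-memory rows are hypothetical and are not taken from the study.

```python
import csv
import io

# Assumed markers for a missing entry; adjust to the actual file.
MISSING_TOKENS = {"", "NA", "NaN", "#NUM!"}


def count_missing(rows, columns):
    """Count, for each column, how many rows hold a missing marker."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() in MISSING_TOKENS:
                counts[column] += 1
    return counts


# Hypothetical excerpt standing in for the CSV file.
raw = io.StringIO(
    "aluminum,ammonia,is_safe\n"
    "1.65,9.08,1\n"
    "2.32,,0\n"
    "0.50,#NUM!,#NUM!\n"
)
reader = csv.DictReader(raw)
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
print(missing)  # {'aluminum': 0, 'ammonia': 2, 'is_safe': 1}
```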

3.2.3 Outliers data

Data are referred to as noisy if they are corrupted, distorted, or cannot be interpreted. Such data may originate from improper procedures or faulty data collection. They can be handled with methods such as regression, clustering, and binning [31].
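Of the methods named above, binning is the simplest to illustrate. The sketch below shows smoothing by bin means with equal-frequency bins; it is one common variant, chosen here as an assumption, and the readings are hypothetical values rather than data from the study.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, split them into bins of
    `bin_size`, and replace each value by the mean of its bin.
    The result is returned in the original order of `values`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        bin_indices = order[start:start + bin_size]
        mean = sum(values[i] for i in bin_indices) / len(bin_indices)
        for i in bin_indices:
            smoothed[i] = mean
    return smoothed


# Hypothetical readings for illustration only.
readings = [4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0]
print(smooth_by_bin_means(readings, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```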

3.2.4 Logistic regression

Logistic regression is a supervised learning approach used to predict the probability of a target variable. The target, or dependent, variable is dichotomous, meaning that there are only two possible classes [32]. In simple terms, the dependent variable is binary, with data coded as either 1 (standing for success/yes) or 0 (standing for failure/no). Mathematically, a logistic regression model expresses P(Y = 1) as a function of X. It is one of the simplest ML techniques and can be used for various classification problems, for example, spam detection, diabetes prediction, and cancer detection. Logistic regression models may be divided into the following categories [33] (Table 1):

  • Binary logistic regression model and

  • Multinomial logistic regression model.
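For the binary case, P(Y = 1) as a function of X is usually written with the logistic (sigmoid) function applied to a linear combination of the inputs. The following minimal sketch illustrates that form; the weights, bias, and the 0.5 decision threshold are hypothetical choices for illustration and are not the fitted values of this study.

```python
import math


def predict_proba(weights, bias, x):
    """P(Y = 1 | x) for binary logistic regression: sigmoid of a linear score."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, x, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


# Hypothetical two-feature model.
weights, bias = [0.8, -1.2], 0.0
print(predict_proba(weights, bias, [0.0, 0.0]))  # 0.5
print(predict_label(weights, bias, [2.0, 0.5]))  # 1
```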

Table 1

Logistic regression

Model information
Target field | is_safe
Category percentages | 1 (a): 11.4%; 0 (b): 88.6%
Scale parameter handling | Fixed at 1.0
Probability distribution (c) | Binomial
Link function (c) | Logit
Model type | Logistic regression
Model building method | Forced entry
Number of predictors input | 20
Number of predictors in final model | 20
Area under ROC curve | 0.864
Log likelihood (d) | −1975.624
Akaike information criterion | 3993.249
Bayesian information criterion | 4139.969
Finite sample corrected AIC (AICc) | 3993.364
Consistent AIC | 4160.969

(a) Reference category.

(b) Modeled category.

(c) The probability distribution and link function were automatically detected based on the model with minimum ASE in the testing data (ASE = 0.075).

(d) The kernel of the log-likelihood function is displayed and used in computing the information criteria.
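The information criteria in Table 1 are related to the log likelihood by standard formulas, which the sketch below applies. Two inputs are assumptions rather than values stated in the source: k = 21 estimated parameters (20 predictors plus the intercept) and n = 7,996 usable records. With these assumed values the standard formulas reproduce the tabulated figures to within rounding.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Standard information criteria from a log likelihood.
    k = number of estimated parameters, n = number of observations."""
    deviance = -2.0 * log_likelihood
    aic = deviance + 2 * k
    return {
        "AIC": aic,
        "BIC": deviance + k * math.log(n),
        "AICc": aic + (2 * k * (k + 1)) / (n - k - 1),
        "CAIC": deviance + k * (math.log(n) + 1),
    }


# Log likelihood from Table 1; k and n are assumptions (see the text above).
crit = information_criteria(-1975.624, k=21, n=7996)
for name, value in crit.items():
    print(f"{name}: {value:.3f}")
```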

Figure 1 shows the residual deviance graphically.

Figure 1: Residual deviance.

3.2.5 XGBoost algorithm

XGBoost, or extreme gradient boosting, is a popular ensemble approach that offers improved performance and speed among tree-based (sequential decision tree) machine learning techniques [34]. XGBoost is maintained by the Distributed (Deep) Machine Learning Community (DMLC) group. It is widely used in machine learning competitions and has gained popularity through winning solutions on structured and tabular data. Initially, only Python and R packages were available for XGBoost, but it has since been extended to Java, Scala, Julia, and other programming languages [35] (Table 2).

Table 2

Top decision rules for “is_safe”

Decision rule | Most frequent category | Rule accuracy | Ensemble accuracy | Interestingness index
(chloramine > 0.53) and (uranium > 0.03) and (cadmium > 0.01) and (ammonia > 10.98) and (chromium > 0.09) | 0 | 1.000 | 1.000 | 1.000
(silver > 0.1) and (cadmium > 0.01) and (ammonia ≤ 10.98) and (chromium > 0.09) | 0 | 1.000 | 1.000 | 1.000
(copper > 0.08) and (copper > 0.04) and (uranium > 0.03) and (nitrates > 9.93) and (cadmium > 0.01) | 0 | 1.000 | 1.000 | 1.000
(silver > 0.1) and (nitrites > 1.12) and (nitrates ≤ 9.93) and (cadmium > 0.01) | 0 | 1.000 | 1.000 | 1.000
(viruses > 0.54) and (nitrites ≤ 1.12) and (nitrates ≤ 9.93) and (cadmium > 0.01) | 0 | 1.000 | 1.000 | 1.000
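Each row of Table 2 is a conjunction of threshold tests, so a rule can be checked against a sample with a few lines of code. The sketch below transcribes the second rule of the table; the triple representation, the function name `rule_fires`, and the sample values are hypothetical and serve only as illustration.

```python
import operator

# Comparison symbols used in Table 2 mapped to Python operators.
OPS = {">": operator.gt, "<=": operator.le}

# Second rule of Table 2 as (attribute, comparison, threshold) triples.
# The most frequent category for this rule is 0 (not safe).
RULE = [
    ("silver", ">", 0.1),
    ("cadmium", ">", 0.01),
    ("ammonia", "<=", 10.98),
    ("chromium", ">", 0.09),
]


def rule_fires(rule, sample):
    """Return True when every condition of the rule holds for the sample."""
    return all(OPS[op](sample[name], limit) for name, op, limit in rule)


# Hypothetical sample for illustration only.
sample = {"silver": 0.2, "cadmium": 0.02, "ammonia": 5.0, "chromium": 0.5}
print(rule_fires(RULE, sample))  # True
```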

3.2.6 Multilayer perceptron (MLP)

An MLP is a type of feed-forward ANN. MLP models are the most basic deep neural networks [36] and consist of a sequence of fully connected layers. MLP approaches can be used to avoid the high computing requirements of modern deep learning architectures [37]. Each layer applies a set of non-linear functions to a weighted sum of all outputs of the previous layer, to which it is fully connected [38] (Figures 2 and 3).
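The layer-by-layer computation described above can be sketched as a forward pass. The layer sizes below (20 inputs, one hidden layer of seven neurons, one output) follow the configuration reported in Section 4.1; the tanh and sigmoid activations and the random initial weights are assumptions made for illustration, not details confirmed by the source.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation of a weighted sum of all inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_params(n_in, n_hidden, seed=0):
    """Random placeholder weights (untrained), for illustration only."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]],
        "b2": [0.0],
    }


def mlp_forward(x, params):
    """Forward pass: input -> hidden layer (tanh) -> single sigmoid output."""
    hidden = dense(x, params["W1"], params["b1"], math.tanh)
    return dense(hidden, params["W2"], params["b2"], sigmoid)[0]


params = init_params(n_in=20, n_hidden=7)
probability = mlp_forward([0.1] * 20, params)
print(0.0 < probability < 1.0)  # True
```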

Figure 2: MLP for water quality prediction.

Figure 3: MLP details for water quality prediction.

MLPs were formerly used in computer vision but have since been superseded by CNNs. The MLP is now considered insufficient for modern computer vision tasks. It consists of fully connected layers, in which each perceptron is connected to every perceptron in the adjacent layers [39].

The disadvantage is that the total number of parameters can grow very large (the number of perceptrons in layer 1 multiplied by the number in layer 2, plus the number in layer 2 multiplied by the number in layer 3, and so on). This is inefficient because there is redundancy in such high dimensions. Another drawback is that the MLP ignores spatial information, since it takes flattened vectors as inputs [39]. A lightweight MLP (2–3 layers) can nevertheless achieve high accuracy on the MNIST dataset.
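The growth in parameters can be made concrete by counting the weights and biases between each pair of consecutive layers. In this sketch the first layer list uses the sizes reported for this study (20 inputs, 7 hidden neurons, 1 output); the second list is a hypothetical MNIST-sized network included only for comparison.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.
    Each pair of consecutive layers contributes n_in * n_out weights
    and n_out biases."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


print(count_parameters([20, 7, 1]))      # 155
print(count_parameters([784, 128, 10]))  # 101770
```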

3.2.7 Chi-square automatic interaction detector (CHAID)

CHAID is a technique developed by Gordon V. Kass in 1980. It is a tool used to discover relationships between variables. CHAID analysis builds a predictive model, or tree, to help determine how variables best combine to explain the outcome in the given dependent variable [40]. In CHAID analysis, nominal, ordinal, and continuous data can be used, where continuous predictors are split into categories with approximately equal numbers of observations. CHAID creates all possible cross-tabulations for each categorical predictor until the best outcome is achieved and no further splitting can be performed [40] (Tables 3 and 4).

Table 3

CHAID model information for water quality prediction

Model information
Target field | is_safe
Model type | Binary decision tree
Algorithm name | CHAID
Number of features | 13
Tree depth | 6
Number of nodes | 73

Table 4

CHAID top decision rules for water quality prediction

Top decision rules for “is_safe”
Rule ID | Rule | Mode category | Record count | Record percentage | Rule confidence
13 | chloramine < 2.000 and viruses ≥ 4.000 and aluminum < 3.000 | 0.0 | 1,051 | 13.1 | 99.9
32 | chloramine < 2.000 and nitrites ≥ 3.000 and viruses < 3.000 and aluminum < 3.000 | 0.0 | 730 | 9.1 | 99.9
28 | silver ≥ 4.000 and arsenic ≥ 4.000 and aluminum ≥ 5.000 | 0.0 | 577 | 7.2 | 99.8
64 | viruses ≥ 2.000 and viruses < 3.000 and nitrates ≥ 4.000 and nitrites < 2.000 and viruses < 3.000 and aluminum < 3.000 | 0.0 | 480 | 6.0 | 98.5
47 | nitrites ≥ 3.000 and uranium ≥ 3.000 and cadmium ≥ 3.000 and aluminum < 4.000 | 0.0 | 421 | 5.3 | 99.8

In the CHAID method, the relationships between the split variables and the associated factors can be seen within the tree. Building the decision, or classification, tree begins with identifying the target, or dependent, variable, which is taken as the root [41]. CHAID analysis splits the target into two or more categories, called the initial, or parent, nodes, and these nodes are then split into child nodes using statistical techniques. Unlike regression analysis, the CHAID method does not require the data to be normally distributed [42].
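The statistical test behind a CHAID split is the chi-square test of independence on a cross-tabulation of a predictor against the target. The sketch below computes only the Pearson chi-square statistic; a full CHAID implementation also merges categories and compares adjusted p-values, which is omitted here. The counts are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list
    of rows: sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two categories of a binned predictor,
# columns are is_safe = 0 and is_safe = 1.
table = [[90, 10], [60, 40]]
print(chi_square_statistic(table))  # 24.0
```

A larger statistic indicates a stronger association between the candidate predictor and the target, which is why the predictor with the most significant test is chosen for the split.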

3.2.8 Proposed ensemble learning model

For a machine learning ensemble, the models must be independent of each other (or as independent of each other as possible). One way to build an ensemble is to use models of the same machine learning technique and train them on different data sets [43]. For example, one can construct an ensemble of 12 linear regression models, each trained on a subset of the training data.

There are two basic approaches for sampling data from the training set. “Bootstrap aggregation,” known as “bagging,” draws random samples from the training set “with replacement.” The other approach, “pasting,” draws samples “without replacement.” Another popular ensemble approach is “boosting.” In contrast to traditional ensemble techniques, where machine learning models are trained in parallel, boosting strategies train them sequentially, with each new model building on the previous one and correcting its shortcomings [44].
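The difference between the two sampling schemes can be shown directly on row indices. This is a minimal sketch with hypothetical sizes and a fixed seed; each sampled index list would then be used to train one member of the ensemble.

```python
import random


def bagging_sample(indices, size, rng):
    """Bagging: draw with replacement, so an index may appear more than once."""
    return [rng.choice(indices) for _ in range(size)]


def pasting_sample(indices, size, rng):
    """Pasting: draw without replacement, so every index appears at most once."""
    return rng.sample(indices, size)


rng = random.Random(42)
indices = list(range(10))  # row indices of a hypothetical training set
bag = bagging_sample(indices, 10, rng)
paste = pasting_sample(indices, 6, rng)
print(len(set(paste)) == len(paste))  # True: pasting yields no duplicates
```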

AdaBoost, one of the most popular boosting techniques, improves the accuracy of ensemble models by fitting new models to the errors of earlier ones. After the first machine learning model is trained, the training samples that it misclassified are singled out. When the next model is trained, greater weight is placed on these samples. The result is a model that performs better where the earlier one failed. The cycle repeats for as many models as are to be added to the ensemble. The final ensemble contains several machine learning models of different accuracies, which together can deliver better accuracy [45–50]. In boosting algorithms, the output of each model is given a weight that is proportional to its accuracy [51,52,54].

3.2.9 Making water quality predictions with AdaBoost

Weak models are added sequentially, each trained on the weighted training data. The process continues until a pre-set number of weak learners has been created or no further improvement can be made on the training dataset. Once finished, the result is a pool of weak learners, each with a stage value (its weight in the ensemble).

Predictions are made by computing the weighted average of the weak classifiers. For a new input instance, each weak learner outputs a predicted value of either +1.0 or −1.0. The prediction of the ensemble model is taken as the sum of the weighted predictions. If the sum is positive, the first class is predicted; if negative, the second class is predicted.
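The training and prediction steps described above can be sketched as discrete AdaBoost with decision stumps as the weak learners. This is a textbook-style illustration, not the configuration used in the study: the choice of stumps, the function names, and the one-dimensional toy data are all assumptions. Labels are coded as +1 and −1, matching the prediction rule above.

```python
import math


def train_stump(X, y, w):
    """Find the one-feature threshold rule with the lowest weighted error.
    Returns (error, feature index, threshold, polarity)."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > threshold else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, threshold, polarity)
    return best


def stump_predict(stump, row):
    _, j, threshold, polarity = stump
    return polarity if row[j] > threshold else -polarity


def adaboost_fit(X, y, n_rounds):
    """Train up to n_rounds stumps; return a list of (stage value, stump)."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # stage value
        ensemble.append((alpha, stump))
        if stump[0] <= 1e-10:  # perfect learner: no further improvement
            break
        # Raise the weight of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the weighted sum of the weak learners' +1/-1 votes."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


# Toy data that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(ensemble, row) for row in X])  # [1, 1, -1, -1, 1]
```

On this toy set a single stump misclassifies at least one point, while three weighted stumps together classify all five points correctly.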

4 Results

4.1 Predictive performance of different machine learning methods

We implemented four machine learning algorithms to predict water quality, that is, whether the water is safe for drinking or not. First, we implemented the logistic regression model, which achieved an accuracy of 88.6%. The performance of this model is shown in Table 5.

Table 5

Evaluation of linear regression

| Parameter | B | Std. error | 95% Wald CI, lower | 95% Wald CI, upper | Wald chi-square | df | Sig. |
|---|---|---|---|---|---|---|---|
| (Intercept) | −0.660 | 0.234 | −1.119 | −0.201 | 7.952 | 1.000 | 0.005 |
| Aluminum | −0.715 | 0.032 | −0.778 | −0.651 | 489.113 | 1.000 | 0.0001 |
| Ammonia | 0.025 | 0.005 | 0.015 | 0.034 | 25.664 | 1.000 | 0.0001 |
| Arsenic | 3.046 | 0.309 | 2.441 | 3.651 | 97.265 | 1.000 | 0.0001 |
| Barium | −0.121 | 0.040 | −0.198 | −0.043 | 9.352 | 1.000 | 0.002 |
| Cadmium | 20.641 | 1.783 | 17.146 | 24.136 | 133.974 | 1.000 | 0.0001 |
| Chloramine | −0.179 | 0.020 | −0.219 | −0.139 | 77.218 | 1.000 | 0.0001 |
| Chromium | −1.226 | 0.179 | −1.578 | −0.875 | 46.800 | 1.000 | 0.0001 |
| Copper | 0.377 | 0.071 | 0.239 | 0.516 | 28.661 | 1.000 | 0.0001 |
| Fluoride | −0.129 | 0.097 | −0.320 | 0.061 | 1.766 | 1.000 | 0.184 |
| Bacteria | −0.772 | 0.214 | −1.191 | −0.353 | 13.046 | 1.000 | 0.001 |
| Viruses | 1.255 | 0.183 | 0.897 | 1.614 | 47.030 | 1.000 | 0.0001 |
| Lead | 1.605 | 0.746 | 0.143 | 3.066 | 4.630 | 1.000 | 0.031 |
| Nitrates | 0.050 | 0.008 | 0.035 | 0.065 | 43.412 | 1.000 | 0.0001 |
| Nitrites | 0.312 | 0.098 | 0.120 | 0.505 | 10.119 | 1.000 | 0.001 |
| Mercury | 36.236 | 13.949 | 8.896 | 63.576 | 6.748 | 1.000 | 0.009 |
| Perchlorate | 0.026 | 0.003 | 0.020 | 0.032 | 71.442 | 1.000 | 0.0001 |
| Radium | 0.056 | 0.020 | 0.017 | 0.094 | 8.028 | 1.000 | 0.005 |
| Selenium | 4.839 | 1.469 | 1.959 | 7.718 | 10.847 | 1.000 | 0.001 |
| Silver | 1.320 | 0.345 | 0.644 | 1.995 | 14.665 | 1.000 | 0.001 |
| Uranium | 12.926 | 1.605 | 9.780 | 16.071 | 64.851 | 1.000 | 0.0001 |
| (Scale) | 1a | | | | | | |
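The columns of Table 5 are related by simple formulas, which the following sketch recomputes from a coefficient B and its standard error. It assumes the usual Wald test with one degree of freedom and a normal-approximation 95% interval; the function name `wald_summary` is hypothetical. Because the published B and standard error are rounded to three decimals, the recomputed values are close to, but not identical with, the tabulated ones (for the intercept, about 7.955 versus 7.952).

```python
import math

def wald_summary(b, se, z_crit=1.959964):
    """Return (Wald chi-square, p-value, CI lower, CI upper) for one coefficient.

    Assumes a Wald test with 1 degree of freedom and a 95% normal interval.
    """
    z = b / se
    chi2 = z * z                                   # Wald chi-square = (B / SE)^2
    p = math.erfc(abs(z) / math.sqrt(2.0))         # upper tail of chi-square, 1 df
    return chi2, p, b - z_crit * se, b + z_crit * se

# Intercept row of Table 5: B = -0.660, Std. error = 0.234
chi2, p, lower, upper = wald_summary(-0.660, 0.234)
print(round(chi2, 3), round(lower, 3), round(upper, 3))
```

A parameter whose interval contains zero, such as fluoride (−0.320 to 0.061), is the one whose significance value (0.184) exceeds the conventional 0.05 threshold.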

In the second phase, we implemented the multi-layered perceptron (MLP). This model has one hidden layer with seven neurons and achieved an accuracy of 95.3%. Its performance is shown in Figure 4.

Figure 4

MLP results.

In the third phase, we implemented the Chi-square Automatic Interaction Detector (CHAID). This model achieved an AUC of 0.954 and an accuracy of 93.11%. Its performance is shown in Figure 5.

Figure 5

CHAID results.

In the fourth phase, we implemented XGBoost. This model achieved an AUC of 0.814 and an accuracy of 94.42%. Its performance is shown in Figure 6.

In the final phase, we implemented the proposed ensemble learning model. This model achieved an AUC of 0.996 and an accuracy of 96.42%. Its performance is shown in Figures 6 and 7.
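Accuracy and AUC measure different things: accuracy depends on a decision threshold, whereas AUC depends only on how well the scores rank positive samples above negative ones. The sketch below computes both on a small made-up sample; the data, the 0.5 threshold, and the function names are assumptions for illustration and are unrelated to the study's dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = potable, 0 = not potable) and classifier scores
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 decision threshold

print(accuracy(y_true, y_pred))   # 4 of 6 correct
print(roc_auc(y_true, scores))    # 7 of 9 positive-negative pairs ranked correctly
```

Because the two metrics answer different questions, a model can rank higher on one and lower on the other, so it is useful to report both.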

Figure 6

XGBoost tree results.

Figure 7

Accuracy level in proposed ensemble model.

This accuracy is higher than that of each individual machine learning algorithm, as shown in Table 6.

Table 6

Accuracy comparison

| S. No. | Name of algorithm | Accuracy (%) |
|---|---|---|
| 1 | Logistic regression | 88.6 |
| 2 | CHAID | 93.1 |
| 3 | XGBoost tree | 94.3 |
| 4 | Multi-layered perceptron | 95.3 |
| 5 | Ensemble model | 96.4 |
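The size of the improvement in Table 6 can be read in two ways: as a gain in percentage points, or as the share of the baseline's errors that the ensemble removes. The following sketch computes both from the tabulated accuracies; the function name `gains` is hypothetical, and the relative error reduction is an additional derived quantity, not a figure reported in the article.

```python
def gains(baselines, ensemble_acc):
    """Map each baseline to (gain in percentage points, relative error reduction)."""
    out = {}
    for name, acc in baselines.items():
        gain = ensemble_acc - acc                 # percentage-point improvement
        out[name] = (gain, gain / (100.0 - acc))  # share of baseline error removed
    return out

# Accuracies (%) taken from Table 6
baselines = {
    "Logistic regression": 88.6,
    "CHAID": 93.1,
    "XGBoost tree": 94.3,
    "Multi-layered perceptron": 95.3,
}
for name, (gain, rel) in gains(baselines, 96.4).items():
    print(f"{name}: +{gain:.1f} points, {rel:.1%} of errors removed")
```

Against the strongest baseline, the multi-layered perceptron, the gain is 1.1 percentage points, which corresponds to roughly a quarter of that model's remaining errors.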

5 Discussion

This article proposed an ensemble learning model for water quality prediction based on adaptive boosting (AdaBoost). Experiments on the water quality dataset indicated that the proposed ensemble learning model performed better than the logistic regression model, CHAID, XGBoost tree, and multi-layered perceptron. On this dataset, the accuracy of the individual algorithms was 88.6% (logistic regression model), 93.1% (CHAID), 94.3% (XGBoost tree), and 95.3% (multi-layered perceptron), which the proposed ensemble model improved to 96.4%. As future work, we intend to examine additional attributes along with different classification techniques on the water quality dataset. Transfer learning may help to adapt these trained models to different water sources (rivers, sea, and underground). We hope that the outcomes of this article can help local government agencies to safeguard the health of their citizens. The study supports similar works (Shariq et al. [49]; Onyema et al. [50]; Edeh et al. [51]; Pejin et al. [52]) that affirmed the power of technology, and of machine learning in particular, for accurate prediction and detection.

  1. Funding information: This research was funded by Dirección General de Investigaciones of Universidad Santiago de Cali under call no. 01-2022.

  2. Author contributions: All authors have contributed equally.

  3. Conflict of interest: The authors declare that there is no conflict of interest.

  4. Ethics approval: The study does not require ethics approval along with evidence.

  5. Consent to participate: The authors declare that they express their consent to participate.

  6. Consent for publication: The authors declare that they express their consent for publication.

  7. Data availability statement: The dataset is openly accessible and may be provided on request.


[1] Agudelo-Vera C, Blokker M, Vreeburg J, Bongard T, Hillegers S, Van Der Hoek JP. Robustness of the drinking water distribution network under changing future demand. Proc Eng. 2014;89:339–46. 10.1016/j.proeng.2014.11.197.

[2] Candelieri A, Archetti F. Identifying typical urban water demand patterns for a reliable short-term forecasting – the icewater project approach. Proc Eng. 2014;89:1004–12. 10.1016/j.proeng.2014.11.218.

[3] Clark RM, Hakim S. Securing water and wastewater systems: global experiences. Secur. Water Wastewater Syst Glob Exp. 2014;2:1–398. 10.1007/978-3-319-01092-2.

[4] Fracasso PT, Barnes FS, Costa AHR. Optimized control for water utilities. Proc Eng. 2014;70:678–87. 10.1016/j.proeng.2014.02.074.

[5] Steingrimsson JG, Seliger G. Conceptual framework for near-to-site waste cycle design. Proc CIRP. 2014;15:272–7. 10.1016/j.procir.2014.06.014.

[6] Ajbar AH, Ali EM. Prediction of municipal water production in touristic Mecca City in Saudi Arabia using neural networks. J King Saud Univ – Eng Sci. 2015;1:83–91. 10.1016/j.jksues.2013.01.001.

[7] Candelieri A, Soldi D, Archetti F. Short-term forecasting of hourly water consumption by using automatic metering readers data. Proc Eng. 2015;119(1):844–53. 10.1016/j.proeng.2015.08.948.

[8] Ellis K, Mounce SR, Edwards J, Speight V, Jakomis N, Boxall J. Interpreting and estimating the risk of iron failures. Proc Eng. 2015;119(1):299–308. 10.1016/j.proeng.2015.08.889.

[9] Eriksson R, Nenonen S, Junghans A, Nielsen SB, Lindahl G. Nordic campus retrofitting concepts – scalable practices. Proc Econ Financ. 2015;21(15):329–36. 10.1016/s2212-5671(15)00184-7.

[10] Jones D, Fischer JE, Rodden T, Reece S, Ramchurn SD, Allen S. Augmenting the bird table: Developing technological support for disaster response. Proc Eng. 2015;107:54–8. 10.1016/j.proeng.2015.06.058.

[11] Mounce SR, Pedraza C, Jackson T, Linford P, Boxall JB. Cloud based machine learning approaches for leakage assessment and management in smart water networks. Proc Eng. 2015;119(1):43–52. 10.1016/j.proeng.2015.08.851.

[12] Roelich K, Knoeri C, Steinberger JK, Varga L, Blythe PT, Butler D, et al. Towards resource-efficient and service-oriented integrated infrastructure operation. Technol Forecast Soc Change. 2015;92:40–52. 10.1016/j.techfore.2014.11.008.

[13] Schmitz QT, Macedo M, Hatakeyama K. Contributions of workflow for knowledge generation process. Proc Manuf. 2015;3:904–11. 10.1016/j.promfg.2015.07.136.

[14] Van Summeren J, Raterman B, Vonk E, Blokker M, Van Erp J, Vries D. Influence of temperature, network diagnostics, and demographic factors on discoloration-related customer reports. Proc Eng. 2015;119(1):416–25. 10.1016/j.proeng.2015.08.903.

[15] Yudenkova O, Savina E. Moscow higher education institutions: Eco-ergonomic aspects of operation and environmental initiatives. Proc Eng. 2015;117(1):382–8. 10.1016/j.proeng.2015.08.182.

[16] Gusmini M, Jabeur N, Karam R, Melchiori M, Renso C. Reputation evaluation of georeferenced data for crowd-sensed applications. Proc Comput Sci. 2017;109:656–63. 10.1016/j.procs.2017.05.372.

[17] Howell S, Rezgui Y, Beach T. Integrating building and urban semantics to empower smart water solutions. Autom Constr. 2017;81:434–48. 10.1016/j.autcon.2017.02.004.

[18] Sanz R, Peris JA, Escámez J. Higher education in the fight against poverty from the capabilities approach: The case of Spain. J Innov Knowl. 2017;2(2):53–66. 10.1016/j.jik.2017.03.002.

[19] Shabani S, Yousefi P, Naser G. Support vector machines in urban water demand forecasting using phase space reconstruction. Proc Eng. 2017;186:537–43. 10.1016/j.proeng.2017.03.267.

[20] Villarroel Walker R, Beck MB, Hall JW, Dawson RJ, Heidrich O. Identifying key technology and policy strategies for sustainable cities: A case study of London. Env Dev. 2017;21:1–18. 10.1016/j.envdev.2016.11.006.

[21] Wang X, Zhang F, Ding J. Evaluation of water quality based on a machine learning algorithm and water quality index for the Ebinur Lake Watershed, China. Sci Rep. 2017;7(1):1–18. 10.1038/s41598-017-12853-y.

[22] Yousefi P, Shabani S, Mohammadi H, Naser G. Gene expression programing in long term water demand forecasts using wavelet decomposition. Proc Eng. 2017;186:544–50. 10.1016/j.proeng.2017.03.268.

[23] Alizadeh MJ, Kavianpour MR, Danesh M, Adolf J, Shamshirband S, Chau KW. Effect of river flow on the quality of estuarine and coastal waters using machine learning models. Eng Appl Comput Fluid Mech. 2018;12(1):810–23. 10.1080/19942060.1528480.

[24] Amali S, Faddouli NEE, Boutoulout A. Machine learning and graph theory to optimize drinking water. Proc Comput Sci. 2018;127:310–9. 10.1016/j.procs.01.127.

[25] Barzegar R, AsghariMoghaddam A, Adamowski J, Ozga-Zielinski B. Multi-step water quality forecasting using a boosting ensemble multi-wavelet extreme learning machine model. Stoch Env Res Risk Assess. 2018;32(3):799–813. 10.1007/s00477-017-1394-z.

[26] Birch D, Simondetti A, keGuo Y. Crowdsourcing with online quantitative design analysis. Adv Eng Inform. 2017;38:242–51. 10.1016/j.aei.2018.07.004.

[27] Chhabra R. Irrigation water: Quality criteria. Soil Salin Water Qual. 2018;68(1):156–82. 10.1201/9780203739242-6.

[28] Dezfooli D, Hosseini-Moghari SM, Ebrahimi K, Araghinejad S. Classification of water quality status based on minimum quality parameters: Application of machine learning techniques. Model Earth Syst Env. 2018;4(1):311–24. 10.1007/s40808-017-0406-9.

[29] Haghiabi AH, Nasrolahi AH, Parsaie A. Water quality prediction using machine learning methods. Water Qual Res J. 2018;53(1):3–13. 10.2166/wqrj.2018.025.

[30] Raux-Defossez P, Wegerer N, Pétillon D, Bialecki A, Bailey AG, Belhomme R. Grid services provided by the interactions of energy sectors in multi-energy systems: Three international case studies. Energy Proc. 2018;155:209–27. 10.1016/j.egypro.2018.11.055.

[31] Triantafyllidis CP, Koppelaar RHEM, Wang X, van Dam KH, Shah N. An integrated optimisation platform for sustainable resource and infrastructure planning. Env Model Softw. 2018;101:146–68. 10.1016/j.envsoft.2017.11.034.

[32] Awolusi TF, Oke OL, Akinkurolere OO, Sojobi AO, Aluko OG. Performance comparison of neural network training algorithms in the modeling properties of steel fiber reinforced concrete. Heliyon. 2019;5(1):e01115. 10.1016/j.heliyon.2018.e01115.

[33] Brown R, Mcdonald R. The future of urban clean water and sanitation. One Earth. 2019;1(1):10–2. 10.1016/j.oneear.2019.08.010.

[34] Browne AL, Jack T, Hitchings R. ‘Already existing’ sustainability experiments: Lessons on water demand, cleanliness practices and climate adaptation from the UK camping music festival. Geoforum. 2018;103:16–25. 10.1016/j.geoforum.2019.01.021.

[35] Djerioui M, Bouamar M, Ladjal M, Zerguine A. Chlorine soft sensor based on extreme learning machine for water quality monitoring. Arab J Sci Eng. 2019;44(3):2033–44. 10.1007/s13369-018-3253-8.

[36] EL-Nwsany RI, Maarouf I, Abd el-Aal W. Water management as a vital factor for a sustainable school. Alex Eng J. 2019;58(1):303–13. 10.1016/j.aej.2018.12.012.

[37] Eugene EA, Phillip WA, Dowling AW. Data science-enabled molecular-to-systems engineering for sustainable water treatment. Curr Opin Chem Eng. 2019;26:122–30. 10.1016/j.coche.2019.10.002.

[38] Kondo Y, Miyata A, Ikeuchi U, Nakahara S, Nakashima K, Ōnishi H, et al. Interlinking open science and community-based participatory research for socio-environmental issues. Curr Opin Env Sustain. 2019;39:54–61. 10.1016/j.cosust.2019.07.001.

[39] Muharemi F, Logofătu D, Leon F. Machine learning approaches for anomaly detection of water quality on a real-world data set. J Inf Telecommun. 2019;3(3):294–307. 10.1080/24751839.2019.1565653.

[40] Abba SI, Pham QB, Saini G, Linh N, Ahmed AN, Mohajane M, et al. Implementation of data intelligence models coupled with ensemble machine learning for prediction of water quality index. Env Sci Pollut Res. 2020;27(33):41524–39. 10.1007/s11356-020-09689-x.

[41] Awoyera PO, Kirgiz MS, Viloria A, Ovallos-Gazabon D. Estimating strength properties of geopolymer self-compacting concrete using machine learning techniques. J Mater Res Technol. 2020;9(4):9016–28. 10.1016/j.jmrt.2020.06.008.

[42] Chakwizira J, Mashiri M, Mpondo B. Local level travel and travail narratives: A review of the King SabathaDalindyebo (KSD) integrated rapid transport plan household surveys. Transp Res Proc. 2020;48(2019):3070–89. 10.1016/j.trpro.2020.08.180.

[43] French A, Mechler R, Arestegui M, MacClune K, Cisneros A. Root causes of recurrent catastrophe: The political ecology of El Niño-related disasters in Peru. Int J Disaster Risk Reduct. 2020;47:101539. 10.1016/j.ijdrr.2020.101539.

[44] Goel S, Hawi S, Goel G, Thakur VK, Agrawal A, Hoskins C, et al. Resilient and agile engineering solutions to address societal challenges such as coronavirus pandemic. Mater Today Chem. 2020;17:100300. 10.1016/j.mtchem.2020.100300.

[45] Ha NT, Nguyen HQ, Truong NCQ, Le TL, Thai VN TL, Pham TL. Estimation of nitrogen and phosphorus concentrations from water quality surrogates using machine learning in the Tri An Reservoir, Vietnam. Env Monit Assess. 2020;192(12):789. 10.1007/s10661-020-08731-2.

[46] Halkia M, Ferri S, Schellens MK, Papazoglou M, Thomakos D. The global conflict risk index: A quantitative tool for policy support on conflict prevention. Prog Disaster Sci. 2020;6(6):100069. 10.1016/j.pdisas.2020.100069.

[47] Hashim H, Ryan P, Clifford E. A statistically based fault detection and diagnosis approach for non-residential building water distribution systems. Adv Eng Inform. 2020;46(September):101187. 10.1016/j.aei.2020.101187.

[48] Krishnaraj A, Deka PC. Spatial and temporal variations in river water quality of the Middle Ganga Basin using unsupervised machine learning techniques. Env Monit Assess. 2020;192(12):744. 10.1007/s10661-020-08624-4.

[49] Shariq AB, Muhammad WA, Syed AH, Arindam G, Onyema EM. Smart health application for remote tracking of ambulatory patients. In: Hafizul Islam SK, Samanta D, editors. Smart healthcare system design: security and privacy aspects. Wiley; 2021. p. 33–55. Chapter 2. 10.1002/9781119792253.ch2.

[50] Onyema EM, Elhaj MAE, Bashir SG, Abdullahi I, Hauwa AA, Hayatu AS. Evaluation of the performance of K-Nearest neighbor algorithm in determining student learning styles. Int J Innovative Sci Eng Techn. 2020;7(1):2348–7968.

[51] Edeh MO, Khalaf OI, Tavera CA, Tayeb S, Ghouali S, Abdulsahib GM, et al. A Classification algorithm-based hybrid diabetes prediction model. Front Public Health. 2022;10:829519. 10.3389/fpubh.2022.829519.

[52] Pejin B, Kien-Thai Y, Stanimirovic B, Vuckovic G, Belic D, Sabovljevic M. Heavy metal content of a medicinal moss tea for hypertension. Nat Product Res. 2012;26(23):2239–42. 10.1080/14786419.2011.648190.

Received: 2022-05-06
Revised: 2022-06-16
Accepted: 2022-06-26
Published Online: 2022-09-01

© 2022 Surjeet Dalal et al., published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
