Model-based imputation of sound level data at thoroughfare using computational intelligence

: The aim of the paper was to present the methodology of imputation of the missing sound level data, for a period of several months, in many noise monitoring sta-tionslocatedatthoroughfaresbyapplyingonemodelwhich describes variability of sound level within the tested period. To build the model, at first the proper set of input attributes was elaborated, and training dataset was prepared using recorded equivalent sound levels at one of thoroughfares. Sound level values in the training data were calculated separately for the following 24-hour sub-intervals: day (6-18), evening (18-22) and night (22-6). Next, a computational intelligence approach, called Random Forest was applied to build the model with the aid of Weka software. Later, the scaling functions were elaborated, and the obtained Random Forest model was used to impute data at two other locations in the same city, using these scaling functions. The statistical analysis of the sound levels at the above-mentioned locations during the whole year, before and after imputation, was carried out.


Introduction
Missing values in measurement data always hamper interpretation of results, regardless of the area of research [1]. The reasons for the lack of data can be analyzed using three models: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random). In the last two models, the missingness of data is related to the data observed, or caused by a malfunction of measurement path components, wrong decisions or ethical considerations. In consequence, missing data may lead to bias as *Corresponding Author: Michał Kekez: Faculty of Mechatronics and Mechanical Engineering, Kielce University of Technology, al. Tysiąclecia Państwa Polskiego 7, 25-314 Kielce, Poland; Email: mkekez@tu.kielce.pl they do not appear in the sample completely randomly [1]. These factors have led to the development of various computational methods helping to overcome problems related to missingness of data [1]. In [2], the authors proposed classification of these methods into a weighting approach [3] and an imputation-based approach [1,4]. Both approaches use additional information on the phenomena under study [1]. Weighting methods make adjustments due to missing data by modifying the base weights. Imputation methods use additional information to build the imputation model, on the basis of which the missing data is imputed [1,5]. Imputation methods are divided into deductive and statistical [6]. Deductive methods use rules and relationships between variables for determining the missing data. Statistical methods use remaining part of dataset for reconstruction of the missing values. These methods can be divided into deterministic (imputation by the mean, and regression imputation) and stochastic (hot-deck, and stochastic regression imputation) [6]. However, they often do not give satisfactory results [7].
More sophisticated methods of imputation require building a model. When we consider time series data imputation only, autoregressive and computational intelligence (CI) methods [8] can be applied to building models. Such models can be often used also for time series forecasting [9]. Among autoregressive methods [10], autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA) [10], autoregressive conditional heteroscedasticity (ARCH), and generalized autoregressive conditional heteroscedasticity (GARCH) [11] are used. Examples of machine learning and computational intelligence methods used for missing data imputation are [12]: K-nearest neighbor (KNN), fuzzy K-means (FKM), singular value decomposition (SVD), and Bayesian principal component analysis (BPCA) as well as regression trees [1] like classification and regression trees (CART) [13] or Cubist [14]. CI methods for modeling, e.g. neural networks, or fuzzy systems are often used together with optimization algorithms like [15]: genetic algorithms, particle swarm optimization (PSO), ant colony optimization, and memetic algorithms [16]. Hybrid connections of various methods are also used for imputation [8]: hybrid simulated annealing and genetic algorithms (HSAGA), hy-brid of autoassociative neural network, simulated annealing and genetic algorithm (AANN-HSAGA), hybrid of principal component analysis and artificial neural networks (PCA-ANN), hybrid of principal component analysis, neural network, simulated annealing and genetic algorithm (PCA-NN-HSAGA). Ensembles of models, like committees of neural networks [8] or committees of regression trees [1] are also used in imputation.
In various transportation-related problems, computational intelligence methods as well as autoregressive methods like ARIMA are used for data imputation [17,18]. Neural networks were used for imputation in [19]. Machine learning was used for cleaning data collected in intelligent transportation systems [20]. Problems regarding road safety and modeling of transport processes were analyzed in [21] and the proposed tool for imputation was Random Forest. In [1], two kinds of CI methods, namely regression trees and Random Forest, were used for analysis of road traffic noise. However, imputation of road traffic noise at various locations using only one model required the development of a new methodology, as discussed in Section 2, 3 and 4. This allows building the initial model for the first location near thoroughfare and the immediate extension of this model for any new thoroughfare in the same city.

Measurement data used for building the model
Road traffic and noise monitoring stations, located near thoroughfares in many cities, constantly record sound level, traffic volume, and vehicle speed and type. Recorded data can be used for various purposes, including calculation of long-term noise indicators L DEN and L N [22], environmental monitoring, and creation of acoustic maps. However, if monitoring stations cease to function partially or completely, missing values of sound level need to be imputed. When traffic data are present, imputation can be carried out by using deductive method, e.g. using CNOSSOS-EU [23] or Nordic prediction method [24]; otherwise the possible way of sound level imputation is to produce models for traffic volume [18,[25][26][27][28][29][30][31], and vehicle speed and type, impute missing traffic data using these models, and finally use previously mentioned deductive method. However, such multi-stage imputation decreases the quality of imputed data. The better solution is to create the model of sound level variability and use it for imputation. To present the proposed methodology, the data recorded in a noise monitoring station will be used.
Sound level values were recorded in a noise monitoring station, situated at the location number 1 (thoroughfare, namely Krakowska Street in Kielce, Poland), consisting of class-1 sound level meter, a road radar, and weather station [1,12]. Measurements were made continuously and the RMS (root mean square) of the A-weighted sound level was saved in the buffer in 1 second intervals with a resolution of 0.1 dB. This allowed to calculate the most common indicator of noise annoyance [32], namely A-weighted equivalent sound level L Aeq , expressed in dB(A), defined as [32,33]: where T represents the total time of measurement (expressed in s), p A (t) -A-weighted sound pressure (in Pa), and p 0 -reference sound pressure of 20 µPa. Based on the previously mentioned measurements, L Aeq values were calculated for the three 24-hour subintervals: day (6-18), evening (18)(19)(20)(21)(22), and night , separately for each 24-hour period in the year, as shown in [1] and in Figure 1.

Elaborated model
The training data for the model contains equivalent sound level (L Aeq ) values of six previous days (for the same sub-  (18)(19)(20)(21)(22), and (c) night  in year 2013 interval of the 24-hour period) marked l 1 , l 2 , . . . , l 6 , where l i is the L Aeq recorded i days earlier. The authors in [1] created 3 separate training datasets, one for each sub-interval of 24-hour period (or time of the day, in other words). Each training set consisted of records containing the values of one output attribute dB_A (equivalent sound A-level, expressed in dB) and 8 input attributes: day_of_the_week (taking values from 1 -Monday to 7 -Sunday), day_of_the_year (values in the range from 1 to 365), l 1 , l 2 , l 3 , l 4 , l 5 , and l 6 . There was no time_of_day attribute (taking values 0 for night, 1 for evening, and 2 for day) in the created sets, because it was held constant in the entire set. The training sets included all records (301 for days, 302 for evenings, and again 302 for nights). Testing was conducted by 10fold cross validation [1]. Certain records in the training sets showed the missing values of some of l 1 , l 2 , . . . , l 6 of the input attributes, and contrary to model 1 in [1], these records were not removed from the training dataset.
The model was constructed using Random Forest algorithm without random selection of attributes, implemented in Weka software [34]. The obtained model consists of 300 trees (100 for each sub-interval of the 24-hour period). In this method, the size of the trees may be large due to no pruning [35]. To assess the accuracy of prediction made by the model, mean absolute error (MAE), which is the arithmetic average of absolute values of differences between the predicted and real value, can be used [36]: where y i denotes real db_A value, y ′ i denotes db_A value calculated by the model, and n is the number of records. Accuracy of the model defined by MAE estimator (eq. 2) is very good because MAE on the training set does not exceed 0.27 dB, and MAE on the test set during 10-fold cross validation is not higher than 0.72 dB (   The L Aeq values calculated by the model, shown in Figure 2, have smaller variance than L Aeq calculated from measurements ( Figure 1). One can observe that minimum value of modeled L Aeq at night, 60.76 dB, shown in Figure 2c is higher than corresponding measurement value of 59.12 dB in Figure 1. Similarly, maximum value of modeled L Aeq at night, 66.73 dB is lower than corresponding value of 67.69 dB in Figure 1. However, the overall accuracy of the model, shown in Table 1 is quite good.
The  (Table 2), the median of L Aeq did not change significantly (at most ±0.09 dB). The quartiles Q 1 and Q 3 did not change more than ±0.18 dB (Q 1 ) and ±0.15 dB (Q 3 ).

Generalization of the model by using scaling functions
The idea of imputing traffic data at one location by the model built on data from nearby location was presented in [5]. In this section, the idea of imputing sound level data at given location by the model built for another location will be presented. The elaborated model described in Section 3 can be adjusted to predict sound level values at another place in the same city, located close to any road of the same class as at location no. 1. For this purpose, the l 1 , l 2 , . . . , l 6 inputs of the model are modified by input scaling function (eq. 5), while y output of the model is modified by the output scaling function (eq. 6).
One can assume that at all thoroughfares in a given city, for a given 24-hour sub-interval and for a given day of week, traffic volume can be expressed as a product of a constant and a coefficient having the value specific to this road. When Nordic prediction model [24] is applied to calculate sound level at any of these thoroughfares (and when percentage of heavy vehicles is similar at all thoroughfares), we obtain the sound level expressed as a sum of a constant and a parameter having the value specific to this road. This led to the idea of output scaling function (eq. 6) in the form of the sum of a constant (obtained for location no. 1) and a parameter specific to given thoroughfare (calculated separately for each day of week and each of three 24-hour sub-intervals).
In order to obtain parameters of both scaling functions, at first a (d, t) values, which are equivalent sound levels [33] for a given day of the week d, and for a given 24-hour subinterval t, are calculated separately for each d=1,2,. . . ,7, and for each t=0,1,2, with use of L Aeq measurement data records from location no.1 (described as training data in Section 3): for all n records fulfilling the condition day_of _the_week = d and time_of _day = t, and where day_of_the_week is input attribute (1 -Monday, 2 -Tuesday, . . . , 7 -Sunday), time_of_day is also input attribute (0 -night, 1-evening, 2 -day), and y i denotes db_A value in measurement data for given location. Values of a (d, t) for location no. 1 are shown in Table 3.
The a (d, t) values for location no. 1 (calculated according to eq. 3) are denoted as a 1 (d, t). Then, the a (d, t) values for a new location are calculated according to eq. 3 (using measurement data from this new location), and denoted as a 2 (d, t). Later, the parameters of scaling functions, namely The extended model, proposed in this section, for the given data record replaces the value of each input attribute l i with the corresponding l ′ i value, calculated using the so-called input scaling function: where d is day_of _the_week attribute value, and t is time_of _day attribute value in the given data record.
Then, the extended model produces its output y by using the so-called output scaling function: where y ′ is the value of the output of elaborated model shown in Section 3, d is day_of _the_week attribute value, and t is time_of _day attribute value, in the given data record.

Application of the model for location no. 2
The values of L Aeq at location no. 2 (thoroughfare, namely Jesionowa Street in Kielce, Poland) in year 2013 calculated from measurements are shown in Figure 3. For over 130 days, the L Aeq values are missing (Figures 3a, 3b, 3c). However, the L Aeq data for the first and for the last 10 days of the year are present (contrary to data at location no. 1), with L Aeq taking values often close to the year's minimum. The lowest value of L Aeq for day subinterval ( Figure 3a) was 65.9 dB at 1 st Jan, and highest was 76.6 dB at 20 th Nov. The lowest values for evening and night sub-intervals were usually observed at national holidays.
In order to adjust the model (presented in Section 3) for location no. 2, the values of a (d, t) (eq. 3) for this location were calculated, based on the measurement values. Next, the p (d, t) values (eq. 4) for scaling functions were calculated. Then, the model with output scaling function  (18)(19)(20)(21)(22), and (c) night  in year 2013 was used to predict the values of L Aeq for location no. 2 ( Figure 4).
Absence of L Aeq data for the first and for the last week in the learning set of the model resulted in lower accuracy of the model for these two weeks at location no. 2 ( Figure 4). As a result, minimum value of modeled L Aeq for evening, 68.92 dB, shown in Figure 4b is higher than corresponding measurement value of 65.97 dB in Figure 3b. Similarly, maximum value of modeled L Aeq for evening, 71.98 dB, is lower than corresponding value of 74.48 dB in Figure 3b. However, the overall accuracy of the model, shown in Table 5, is fairly good.
The model was used for imputation of data at location no.  (Table 4), the median of L Aeq did not change significantly (less than ±0.06 dB). The quartiles Q 1 and Q 3 did not change more than +0.18 dB (Q 1 ) and −0.39 dB (Q 3 ).
To assess the accuracy of the model with scaling functions, parameters of that function were computed again, based on training data containing only about 2/3 of the

Application of the model for location no. 3
The values of L Aeq at location no. 3 (thoroughfare, namely Lodzka Street in Kielce, Poland) in year 2013 calculated from measurements are shown in Figure 5. For about 120 days, mainly from May to August and in November and December, the L Aeq values are missing ( Figure 5). The lowest values of L Aeq for every 24-hour sub- interval were recorded at national holidays and in January and December.
In order to adjust the model (presented in Section 3) for location no. 3, the values of a (d, t) (eq. 3) and p (d, t) (eq. 4) for this location were calculated. The values of L Aeq at location no. 3 in year 2013 calculated by the model with output scaling functions are shown in Figure 6.
The minimum values of modeled L Aeq for day, evening, and night (70.56, 66.38, and 61.92 dB, respectively), shown in Figure 6, are higher than the corresponding measure-  Table 7, is fairly good.
The model was used for imputation of data at location no. 3, for the whole year. Median values of L Aeq after imputation are: 72.6 dB (day), 70.6 dB (evening), 67.1 dB (night). After the imputation of missing L Aeq values by the model for the whole year 2013 at location no. 3 (Table 6), the median of L Aeq did not change significantly (at most −0.1 dB). The quartiles Q 1 and Q 3 did not change more than +0.4 dB (Q 1 ) and −0.3 dB (Q 3 ).
To assess the accuracy of the model with scaling functions at location no. 3, the same procedure as for location no. 2 was applied. The MAE was in the range from 0.7 to 1.1 dB (Table 7).

Conclusions
The presented model with scaling functions was successfully applied to imputation of missing L Aeq values at three various locations at thoroughfares in the same city. After imputation by the model with scaling functions, the Q 1 and Q 3 quartiles slightly changed (no more than ±0.4 dB), and the median values did not change more than ±0.1 dB. To evaluate the quality of the model, 10-fold cross validation (for location no. 1) and train and test sets (for locations no. 2 and 3) were applied. The accuracy of the model at location no. 1 was not worse than 0.72 dB (MAE value), while at locations no. 2 and 3 the MAE did not exceed 1.11 dB.
The accuracy of the model presented in Section 3 and 4 is sufficient for many practical purposes.