Spatial mixture modeling for analyzing a rainfall pattern: A case study in Ireland

Abstract: This study investigates the spatial heterogeneity in the maximum monthly rainfall amounts reported by stations in Ireland from January 2018 to December 2020. The heterogeneity is modeled by the Bayesian normal mixture model with different ranks. The selection of the best model or the degree of heterogeneity is implemented using four criteria, which are the modified Akaike information criterion, the modified Bayesian information criterion, the deviance information criterion, and the widely applicable information criterion. The estimation and model selection process is implemented using Gibbs sampling. The results show that the maximum monthly rainfall amounts are accommodated in two and three components. The goodness of fit for the selected models is checked using graphical plots including the probability density function and cumulative distribution function. This article also contributes via the spatial determination of return level or rainfall amounts at risk with different return periods using the prediction intervals constructed from the posterior predictive distribution.


Introduction
Rainwater, also called precipitation, is a natural feature of the earth's weather system. Air currents in the atmosphere bring evaporated water from the ocean and the earth's surface into the sky. The evaporated liquid condenses in the cold air, forming moisture-filled rain clouds [1]. The most well-known and most important effect of rainwater is providing water to drink. According to the United States Geological Survey, rainwater seeps into the ground in a process called infiltration. Some of the water seeps deep beneath the top layers of soil, where it fills up the space between subsurface rocks and becomes groundwater, also called the water table. Less than 2% of the earth's water is groundwater, but it provides 30% of our freshwater. Without the continued replenishment of the water table by rainwater, potable water would become scarcer than it already is [2]. Furthermore, many researchers have demonstrated the impact of heavy rainfall on floods [3][4][5]. Therefore, the analysis of precipitation quantities over a certain area aims to give an overall visualization or prior information to evaluate the risk of natural disasters such as droughts, floods, landslides, and so on [6].
Many researchers have used different methods to analyze rainfall trends. Meneghini et al. [7] used statistical methods such as the area-time integral (ATI) method to estimate the average rainfall over a large space. Arvind et al. [8] applied different analysis methods to the annual and monthly rainfall for the Musiri region; they concluded that the Gumbel distribution gives the best fit. Panda and Sahu [9] used the Mann-Kendall test and Sen's slope estimator to examine and analyze the seasonal rainfall over the state of Odisha in India. Their results showed a relatively maximum amount of rainfall in monsoonal months. Nyatuame et al. [10] used linear regression analysis for annual and monthly rainfall. They reported an insignificant increasing trend in annual mean rainfall in the Volta Region and a significant trend in mean monthly rainfall. Asfaw et al. [11] inspected the change of rainfall and temperature in north-central Ethiopia using gridded monthly precipitation data. The Mann-Kendall test was used to detect the time series trend. Praveen et al. [12] analyzed and forecasted rainfall changes in India using the Pettitt and Mann-Kendall tests. However, this group of studies did not take into account the variation in the data.
One might assume that the maximum rainfall quantities follow a normal distribution, that is, that rain is distributed equally throughout Ireland. However, this assumption is invalid because of the variation or fluctuation in rainfall amounts, a phenomenon called heterogeneity of the data. Identifying the spatial heterogeneity of rainfall can give a valuable indicator for analytical studies and government planning to detect the factors linked to low or high rainfall. For this reason, the statistical modeling method called the mixture model with a finite number of components was proposed to accommodate the heterogeneity in data. The task of finite mixture models is to capture unobserved heterogeneity in the population by assuming that the population consists of K homogeneous subgroups [13]. However, identifying the number of homogeneous subgroups K, or the rank of the model, is more challenging. Several criteria have been addressed in the literature to determine the model's rank under both the frequentist and the Bayesian settings. Because this article estimates the model parameters within the Bayesian framework, four well-known model selection criteria derived under the Bayesian principle are used. These criteria are the modified Akaike information criterion (AIC) [14], the modified Bayesian information criterion (BIC) [15], the deviance information criterion (DIC) [16], and the widely applicable information criterion (WAIC) [17]. The approach followed is to fit a set of candidate models to the data and select the best one. In other words, the number of model components is assumed to be fixed but unknown, and the best model is determined by the four criteria after fitting several candidate mixture models with different numbers of components.
Alternatively, reversible jump Markov chain Monte Carlo sampling [18] can be applied to select the appropriate number of components. However, this approach has drawbacks when the Markov chain moves between mixture models with different numbers of components [13].
This article also determines the spatial return level or rainfall amounts at risk with different return periods using the prediction intervals constructed from the posterior predictive distribution. The latter can be considered a newly developed alternative to the confidence intervals adopted in several studies to identify the rainfall amounts at risk [4,6].
This article is organized as follows. Section 2 introduces the methodology, including the model's construction, estimation, and selection. Section 3 describes the data under study. The results and discussion are presented in Section 4. Finally, Section 5 summarizes the important conclusions of this article.

Methodology
This section introduces the construction of the model and the estimation of its parameters under the Bayesian principle. The best model for the maximum monthly rainfall quantity reported by each station under study is then selected using the model selection criteria AIC, BIC, DIC, and WAIC. In addition, the goodness of fit is checked.

Model construction and Bayesian analysis
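Let us assume that the study region is divided into n stations, and let $y_i$ represent the maximum monthly rainfall quantity reported by the ith station, $i = 1, 2, \ldots, n$. To take into account the heterogeneity in the rainfall quantities, the data are assumed to follow a mixture of univariate Gaussian distributions:
$$p(y_i \mid \mu, \sigma^2, w) = \sum_{j=1}^{K} w_j \, f(y_i \mid \mu_j, \sigma_j^2), \qquad (1)$$
where K is the number of components of the model (which can be viewed as levels of monthly rainfall quantity), $w_j$ are the mixture weights with $w_j \ge 0$ and $\sum_{j=1}^{K} w_j = 1$, and $f$ represents the Gaussian probability density function (PDF), which is defined as
$$f(y_i \mid \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left\{ -\frac{(y_i - \mu_j)^2}{2\sigma_j^2} \right\}, \qquad (2)$$
where $\mu_j$ and $\sigma_j^2$ are the mean and variance of the jth component of the Gaussian mixture, respectively.
Mixture models become analytically easier to handle when latent variables are included in their formulation, which also makes them more useful for interpretation and numerical computation. For a mixture with a given number of components, the model can be described by introducing n independent discrete variables, $z_1, z_2, \ldots, z_n$, with distribution $\Pr(z_i = j) = w_j$, $j = 1, 2, \ldots, K$, so that equation (1) can be written as
$$p(y, z \mid \mu, \sigma^2, w) = \prod_{i=1}^{n} \prod_{j=1}^{K} \left[ w_j \, f(y_i \mid \mu_j, \sigma_j^2) \right]^{I(z_i = j)}, \qquad (3)$$
which is called the complete-data likelihood function, where $I(\cdot)$ is the indicator function. The role of the latent variable, $z_i$, is to assign the observation $y_i$ to one of the mixture components. By taking the logarithm of equation (3), we obtain:
$$\log p(y, z \mid \mu, \sigma^2, w) = \sum_{i=1}^{n} \sum_{j=1}^{K} I(z_i = j) \left[ \log w_j + \log f(y_i \mid \mu_j, \sigma_j^2) \right]. \qquad (4)$$
The log-likelihood function in equation (4) can be approximated over the posterior distribution. For example, given the draws $(\mu^{(m)}, \sigma^{2(m)}, w^{(m)}, z^{(m)})$, $m = 1, 2, \ldots, M$, computed over a full Markov chain Monte Carlo (MCMC) run, the estimated log-likelihood is obtained by post-processing the posterior outcome:
$$\hat{\ell} = \frac{1}{M} \sum_{m=1}^{M} \log p(y, z^{(m)} \mid \mu^{(m)}, \sigma^{2(m)}, w^{(m)}). \qquad (5)$$
To complete the Bayesian analysis for the model, we have to define the prior and posterior distributions for all the model parameters. For this purpose, Algorithm 1, given by ref. [19], is used to implement the sampling process using one of the MCMC approaches, the Gibbs sampler.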
Algorithm 1: Gibbs sampler for K-component normal mixture model
In Algorithm 1, $\eta_j$, $\zeta_j$, $a_j$, $b_j$, and $\delta_j$ are known hyper-parameters, $j = 1, 2, \ldots, K$, and they are commonly given non-informative or flat values [20]. For instance, an inverse Gamma prior on the variance, that is, a Gamma prior on the precision $1/\sigma_j^2$ with parameters $a = 0.001$ and $b = 0.001$ and thus a mean of $a/b = 1$ and a variance of $a/b^2 = 1{,}000$, gives a diffuse prior of this form. The prior of the mean parameter can be made flat by using a Normal distribution with a location parameter, $\eta = 0$, and a precision parameter, $\zeta = 0.001$, which has a large variance equal to $1/\zeta = 1{,}000$. The weight parameter, w, is given a Dirichlet prior with non-
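A minimal sketch of a Gibbs sampler for a K-component normal mixture, under conjugate priors of the kind described above, could look as follows. It is not a transcription of Algorithm 1 from ref. [19]: the full conditionals are the standard textbook ones, and the function name `gibbs_normal_mixture`, its argument names, the Dirichlet value `delta = 1.0`, the quantile-based initialisation, and the numerical floor on the precision are all assumptions made for illustration. Label switching is not handled.

```python
import math
import random


def gibbs_normal_mixture(ys, K, n_iter=2000, burn_in=500,
                         eta=0.0, zeta=0.001, a=0.001, b=0.001,
                         delta=1.0, seed=0):
    """Gibbs sampler sketch for a K-component normal mixture.

    Assumed priors (standard conjugate choices, not copied from ref. [19]):
      mu_j ~ Normal(eta, variance 1/zeta)
      1/sigma_j^2 ~ Gamma(shape a, rate b)
      w ~ Dirichlet(delta, ..., delta)
    Returns a list of (z, weights, means, variances) tuples after burn-in.
    """
    rng = random.Random(seed)
    n = len(ys)
    ordered = sorted(ys)
    # Start the means at evenly spaced sample quantiles (assumed initialisation).
    means = [ordered[(2 * j + 1) * n // (2 * K)] for j in range(K)]
    centre = sum(ys) / n
    variances = [sum((y - centre) ** 2 for y in ys) / n] * K
    weights = [1.0 / K] * K
    draws = []
    for it in range(n_iter):
        # Step 1: allocate each observation, P(z_i = j) prop. to w_j N(y_i | mu_j, sigma_j^2).
        zs = []
        for y in ys:
            logp = [math.log(weights[j]) - 0.5 * math.log(variances[j])
                    - 0.5 * (y - means[j]) ** 2 / variances[j]
                    for j in range(K)]
            top = max(logp)
            probs = [math.exp(v - top) for v in logp]
            zs.append(rng.choices(range(K), weights=probs)[0])
        counts = [zs.count(j) for j in range(K)]
        # Step 2: weights ~ Dirichlet(delta + n_j), drawn as normalised Gamma variates.
        g = [rng.gammavariate(delta + c, 1.0) for c in counts]
        total = sum(g)
        weights = [x / total for x in g]
        # Step 3: component means and variances from their full conditionals.
        for j in range(K):
            yj = [y for y, z in zip(ys, zs) if z == j]
            tau = 1.0 / variances[j]
            prec = zeta + len(yj) * tau
            means[j] = rng.gauss((zeta * eta + tau * sum(yj)) / prec,
                                 math.sqrt(1.0 / prec))
            shape = a + 0.5 * len(yj)
            rate = b + 0.5 * sum((y - means[j]) ** 2 for y in yj)
            # random.gammavariate takes (shape, scale); the floor guards
            # against underflow to zero when a component is empty.
            new_tau = max(rng.gammavariate(shape, 1.0 / rate), 1e-12)
            variances[j] = 1.0 / new_tau
        if it >= burn_in:
            draws.append((list(zs), list(weights), list(means), list(variances)))
    return draws
```

In use, one would call the function once per candidate K for each station and pass the retained draws to the model selection criteria.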

Model selection criteria
The first three subsections introduce four criteria for choosing the number of components in Gaussian mixture models in the Bayesian setting. The last subsection presents a graphical display method to evaluate the goodness of fit of the model.

Akaike information and Bayesian information criteria
Two well-known criteria, the AIC and BIC, were modified under the Bayesian principle by Kadhem et al. [21]. These criteria depend on the deviance and a penalty term. From equation (5), the deviance can be defined as twice the negative log-likelihood:
$$D = -2\,\hat{\ell},$$
where $\hat{\ell}$ denotes the estimated log-likelihood of equation (5), so that the deviance is approximated over the MCMC samples. The penalty term is computed from the number of free parameters of the model, $h = 3K - 1$ [22]. The AIC and BIC take the deviance as a measure of model fit and penalize it for the number of parameters in the model. They are given as follows:
$$\mathrm{AIC} = D + 2h, \qquad \mathrm{BIC} = D + h \log n,$$
where n is the sample size.
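As an illustration, the deviance-based criteria can be computed from stored MCMC draws with a few lines of code. This is a sketch under stated assumptions: it uses the textbook forms AIC = D + 2h and BIC = D + h log n with h = 3K - 1, which may differ in detail from the modified versions of ref. [21], and the function names and the `(z, weights, means, variances)` layout of a draw are hypothetical.

```python
import math


def normal_logpdf(y, mu, var):
    """Log density of a Normal(mu, var) distribution at y."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mu) ** 2 / var


def complete_data_loglik(ys, zs, weights, means, variances):
    """Sum over i of log w_{z_i} + log N(y_i | mu_{z_i}, sigma^2_{z_i})."""
    return sum(math.log(weights[z]) + normal_logpdf(y, means[z], variances[z])
               for y, z in zip(ys, zs))


def estimated_loglik(ys, draws):
    """Average the complete-data log-likelihood over the MCMC draws.

    `draws` is a list of (zs, weights, means, variances) tuples; this
    layout is an assumption of the sketch.
    """
    return sum(complete_data_loglik(ys, *d) for d in draws) / len(draws)


def free_parameters(K):
    """K means + K variances + (K - 1) free weights = 3K - 1."""
    return 3 * K - 1


def aic(loglik, K):
    """Deviance plus twice the number of free parameters."""
    return -2.0 * loglik + 2.0 * free_parameters(K)


def bic(loglik, K, n):
    """Deviance plus (number of free parameters) * log(sample size)."""
    return -2.0 * loglik + free_parameters(K) * math.log(n)


def select_rank(loglik_by_K, n, criterion="bic"):
    """Return the K with the smallest criterion value, plus all scores."""
    if criterion == "bic":
        scores = {K: bic(ll, K, n) for K, ll in loglik_by_K.items()}
    else:
        scores = {K: aic(ll, K) for K, ll in loglik_by_K.items()}
    return min(scores, key=scores.get), scores
```

For a station with n monthly maxima, `select_rank({1: ll1, 2: ll2, 3: ll3}, n)` would return the rank with the smallest BIC among the fitted candidates.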

DIC
Another criterion used in this article is the DIC. Eight versions of this criterion were introduced by Celeux et al. [23], who recommended the version that is based on the complete-data likelihood. In this article, we apply this version together with its effective number of parameters, $p_{\mathrm{DIC}}$. In their definitions, $\hat{\mu}(z)$, $\hat{\sigma}^2(z)$, and $\hat{w}(z)$ are the complete-data posterior modes of the parameters $\mu$, $\sigma^2$, and $w$, respectively, which are computed for each sample from the posterior $p(z \mid y, \mu, \sigma^2, w)$.

WAIC
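The last criterion used in this article is the WAIC. This criterion is fully Bayesian and is computed from the so-called integrated log pointwise predictive density (ilppd). For a Gaussian mixture distribution, the ilppd can be defined as follows:
$$\mathrm{ilppd} = \sum_{i=1}^{n} \log\left( \frac{1}{M} \sum_{m=1}^{M} \sum_{j=1}^{K} w_j^{(m)} f(y_i \mid \mu_j^{(m)}, \sigma_j^{2(m)}) \right),$$
where the superscript (m) indexes the M MCMC draws of the weights, means, and variances, and $f$ is the Gaussian PDF. Ref. [20] proposed adding a correction term, the so-called effective number of parameters $p_{\mathrm{WAIC}}$, to avoid the bias. This number is based on computing the variance of the individual terms in the ilppd, which is defined as follows:
$$p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{Var}_{m}\left( \log \sum_{j=1}^{K} w_j^{(m)} f(y_i \mid \mu_j^{(m)}, \sigma_j^{2(m)}) \right).$$
The WAIC is then constructed as follows:
$$\mathrm{WAIC} = -2\left( \mathrm{ilppd} - p_{\mathrm{WAIC}} \right).$$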
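The WAIC computation described above can be sketched directly from posterior draws. The example assumes the standard definitions (log of the posterior-averaged pointwise density, and the sample variance over draws of the pointwise log density as the correction); the function names and the `(weights, means, variances)` layout of a draw are hypothetical, and at least two draws are needed for the variance.

```python
import math


def normal_pdf(y, mu, var):
    """Density of a Normal(mu, var) distribution at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(y, weights, means, variances):
    """Mixture density: sum over j of w_j * N(y | mu_j, sigma_j^2)."""
    return sum(w * normal_pdf(y, m, v)
               for w, m, v in zip(weights, means, variances))


def waic(ys, draws):
    """Return (WAIC, ilppd, p_waic) from a list of posterior draws.

    `draws` is a list of (weights, means, variances) tuples (assumed layout).
    """
    M = len(draws)
    ilppd = 0.0
    p_waic = 0.0
    for y in ys:
        logs = [math.log(mixture_pdf(y, w, m, v)) for (w, m, v) in draws]
        # log of the posterior mean density, computed stably (log-sum-exp).
        top = max(logs)
        ilppd += top + math.log(sum(math.exp(v - top) for v in logs) / M)
        # sample variance over draws of the pointwise log density.
        mean_log = sum(logs) / M
        p_waic += sum((v - mean_log) ** 2 for v in logs) / (M - 1)
    return -2.0 * (ilppd - p_waic), ilppd, p_waic
```

As with the other criteria, the candidate with the smallest WAIC value would be preferred.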

Graphical display method
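In this research, the cumulative distribution function (CDF) is used as one of the graphical display methods to support the model chosen by the model selection criteria above. The CDF plot is used to visualize the fit of the model distributions; the CDF increases monotonically between the limits 0 and 1. The CDF of a Gaussian mixture with K components can be given as
$$F(y) = \sum_{j=1}^{K} w_j \, \Phi\left( \frac{y - \mu_j}{\sigma_j} \right),$$
where $w_j$, $\mu_j$, and $\sigma_j$ are the weight, mean, and standard deviation of the jth component and $\Phi$ is the standard normal CDF.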
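A short sketch of how the fitted mixture CDF can be evaluated and compared with the empirical CDF of a station's data is given below. The function names are hypothetical, and `max_cdf_gap` is only a rough numeric companion to the plot, not a test used in this article.

```python
import math


def normal_cdf(y, mu, var):
    """CDF of a Normal(mu, var) distribution at y, via the error function."""
    return 0.5 * (1.0 + math.erf((y - mu) / math.sqrt(2.0 * var)))


def mixture_cdf(y, weights, means, variances):
    """Mixture CDF: sum over j of w_j * Phi((y - mu_j) / sigma_j)."""
    return sum(w * normal_cdf(y, m, v)
               for w, m, v in zip(weights, means, variances))


def empirical_cdf(ys):
    """Sorted observations and the step heights (i / n) of the empirical CDF."""
    ordered = sorted(ys)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def max_cdf_gap(ys, weights, means, variances):
    """Largest absolute gap between empirical and fitted CDF at the data points."""
    xs, heights = empirical_cdf(ys)
    return max(abs(h - mixture_cdf(x, weights, means, variances))
               for x, h in zip(xs, heights))
```

Plotting the pairs from `empirical_cdf` against `mixture_cdf` evaluated on a grid gives the kind of CDF comparison described in this subsection.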

Analysis of rainfall amounts at risk
In this section, the so-called prediction intervals constructed from the posterior predictive distribution [20] are introduced to analyze the extreme amounts of rainfall. The predicted values can also be used as a goodness-of-fit check on the prediction accuracy of a statistical model. A prediction interval is defined by the lower prediction limit, LPI($y^*$), and the upper prediction limit, UPI($y^*$), where $y^*$ represents the predicted data. The interval [LPI($y^*$), UPI($y^*$)] is termed the prediction interval and has a prediction coefficient of (1 − p)100%. By introducing the prediction interval, the range of the T-year rainfall can be estimated theoretically, and it also becomes possible to estimate the variation of T-year hydrological quantities used in flood control measures designed for a T-year probability scale. This makes it possible to interpret record-breaking heavy rainfall as a phenomenon within the prediction interval. In other words, the prediction interval can be a tool to evaluate the return period of heavy rainfall.
In this research, the prediction interval is constructed as follows. The model parameters are sampled from $p(\mu, \sigma^2, w, z \mid y)$, which represents the joint complete posterior distribution. Given samples of the model parameters, $(\mu^{(m)}, \sigma^{2(m)}, w^{(m)})$, and latent variables, $z^{(m)}$, $m = 1, 2, \ldots, M$, drawn from an MCMC run, the predictive distribution of the normal mixture model can be approximated as
$$p(y^* \mid y) \approx \frac{1}{M} \sum_{m=1}^{M} p(y^* \mid \mu^{(m)}, \sigma^{2(m)}, w^{(m)}, z^{(m)}).$$
Hence, the prediction interval can be formulated as follows:
$$\hat{y}^* \pm c\,\sigma_h,$$
where $\hat{y}^*$ is the point prediction, $\sigma_h$ is an estimate of the standard deviation of the h-step forecast distribution, and c is the multiplier determined by the required coverage probability, assuming a normal forecast distribution (for example, c = 1.96 for 95% coverage).
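The construction above can be illustrated with a small simulation-based sketch. It assumes that each posterior draw is stored as a hypothetical `(weights, means, variances)` tuple, draws one predictive value per posterior draw, and then forms either a percentile interval or the normal-approximation interval (centre plus or minus c standard deviations). The T-year return level is taken as the 1 - 1/T quantile of the predictive draws, which is the usual convention and an assumption here; none of the function names come from the article.

```python
import math
import random


def posterior_predictive_draws(draws, rng):
    """One predictive value per posterior draw: pick a component, then sample it."""
    out = []
    for weights, means, variances in draws:
        j = rng.choices(range(len(weights)), weights=weights)[0]
        out.append(rng.gauss(means[j], math.sqrt(variances[j])))
    return out


def quantile(values, q):
    """Empirical quantile with linear interpolation, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def percentile_interval(values, p=0.05):
    """(1 - p) * 100% prediction interval from the predictive quantiles."""
    return quantile(values, p / 2.0), quantile(values, 1.0 - p / 2.0)


def normal_interval(values, c=1.96):
    """Centre +/- c * standard deviation, assuming a normal forecast distribution."""
    n = len(values)
    centre = sum(values) / n
    sd = math.sqrt(sum((v - centre) ** 2 for v in values) / (n - 1))
    return centre - c * sd, centre + c * sd


def return_level(values, T):
    """Predictive quantile exceeded with probability 1 / T (T-year return level)."""
    return quantile(values, 1.0 - 1.0 / T)
```

With `T = 50` and `T = 100`, `return_level` reads off the 98th and 99th percentiles of the simulated predictive values, respectively.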
The return period can be calculated as follows. Let X be the variable of interest and let $x_T$ be the magnitude that is equalled or exceeded once in T years. In a given year, the probability of occurrence of the incident, $\Pr(X \ge x_T)$, is expressed as:
$$\Pr(X \ge x_T) = \frac{1}{T}.$$
Wilks [6] pointed out that the amounts of maximum monthly rainfall with the 50-year or 100-year return periods cannot be directly calculated from the data set used here, but have to be extrapolated from the 98th and 99th percentiles of the fitted distribution, respectively, i.e., $T = 1/(1 - 0.98) = 50$ and $T = 1/(1 - 0.99) = 100$ years. On this basis, we can compute the return periods based on the prediction intervals.

Data description
(19.5% of the total area) were devoted to growing crops; 6 and 1.5% of the agricultural area were used to grow cereals, and root and green crops, respectively. Over half of the agricultural production is exported. The income rose from £314 million in 1972 to £1,920 million in 1995, and was £1.1 billion in 2001. Ireland's climate is mild, moist, and changeable, with abundant rainfall and a lack of temperature extremes. It is defined as a temperate oceanic climate. In general, Ireland has warm summers and pleasant winters, and it is significantly warmer than areas at the same latitude on the other side of the Atlantic, such as Newfoundland, because it lies downwind of the Atlantic Ocean. It is also warmer than other marine climates around the same latitude, such as the Pacific Northwest, due to the heat released by the Atlantic overturning circulation, which includes the North Atlantic Current and Gulf Stream. For comparison, Dublin is 9 °C warmer in the winter than St. John's, Newfoundland, and 4 °C warmer than Seattle, Washington [23]. Extreme weather phenomena such as tornadoes are uncommon in Ireland. Throughout the winter, the North Atlantic Current keeps the Irish coast clear of ice. Ireland is, on the other hand, exposed to storms heading eastward from the North Atlantic.
The prevailing wind is from the southwest, and it breaks on the west coast's steep mountains. As a result, Valentia Island, off the west coast of County Kerry, receives nearly twice as much rain as Dublin on the east coast (1,400 mm vs 762 mm), demonstrating the importance of rainfall in Western Ireland. The coldest months are January and February, with average daily air temperatures ranging from 4 to 7 °C. July and August are the hottest months, with average daily air temperatures of 14 to 16 °C. In July and August, daily maximum temperatures range from 17 to 18 °C along the coast to 19 to 20 °C inland. May and June are the sunniest months, with an average of 5-7 h of sunshine per day. Extreme weather events do occur, although they are rare in comparison with other European countries. Atlantic depressions can bring gusts of up to 160 km/h to western coastal regions in December, January, and February. Thunderstorms are common during the summer months, especially in late July and early August. This article investigates the rainfall pattern by analyzing the monthly maximum rainfall quantities reported by the stations on the island. The data on the rainfall

Results and discussion
For every station, the aim is to select the best normal mixture fitted to the monthly maximum rainfall amounts and to express the rainfall levels for return periods of 50 and 100 years with prediction intervals derived from the posterior predictive distribution. Table 2 shows the results of model selection, where the maximum rainfall amounts reported by the stations follow models with either two or three components, as indicated by the proposed criteria. Note that the model with K = 1 (a single normal distribution) is not selected by any of the model selection criteria, suggesting that the data are heterogeneous. The estimation results for the models selected by the criteria are shown in Table 3, which reports the estimated weights, means, and variances of the selected models. In addition, we provide the CDF and PDF plots for the selected models in Figures 2 and 3, respectively, which indicate an adequate goodness of fit. It can be noted from Table 2 that most of the rainfall data reported by stations (18 stations) are modeled with K = 2, while seven stations, namely Dublin Airport, Phoenix Park, Athenry, Mace Head, Newport Furnace, Mullingar, and Johnstown Castle, are modeled by a mixture model with K = 3. For the determination of the return periods, Figures 2 and 3 show the estimated risk levels, represented by the rainfall levels for the 50- and 100-year return periods with 95% prediction intervals, together with the actual maximum monthly rainfall amounts plotted in red (as a threshold). It can be seen from Figure 4 that the rainfall amounts are stable over the 50-year return period (all mid-points of the prediction intervals lie below the actual data), but there is a notable increase in the rainfall

Conclusion
In this study, we used a finite mixture of Gaussian distributions to model the heterogeneity in monthly rainfall quantities in Ireland based on data from 25 stations. Several tools were used to assess the Bayesian normal mixture models fitted to the data under study. The model selection criteria AIC, BIC, DIC, and WAIC were used to select the best model for the data. In addition, we used graphical approaches such as the CDF and PDF to assess the selected models. Most of the data reported by stations were modeled by a normal mixture model with two components, while the remaining stations were modeled with three components. We conclude that data modeled by two components are more homogeneous than those modeled by three components. This article also shows the computation of return periods, for 50 and 100 years, for each station using the prediction intervals derived from the posterior predictive distribution of the models selected by the model selection criteria. Diagnosing high rainfall rates in the long term can help the relevant authorities to put in place plans that save lives and property. The advantage of the methodology used in this article is that it reveals the heterogeneity in the data by modeling it over several homogeneous groups. On the other hand, its disadvantage is that it does not take into account the hidden trend of increases and decreases in the maximum rates of rainfall. The latter can be modeled by the so-called hidden Markov mixture model, which we intend to study in future work.

Figure 4: The maximum monthly rainfall amounts (measured in millimeters (mm)) for each station vs the predicted values represented by 95% prediction intervals for a return period of 50 years.