Optimal bandwidth selection for recursive Gumbel kernel density estimators

Abstract In this paper, we propose a data-driven bandwidth selection procedure for recursive Gumbel kernel estimators of a probability density function based on a stochastic approximation algorithm. The bandwidth is chosen by a second-generation plug-in method. Convergence properties of the proposed recursive Gumbel kernel estimators are established, and their uniform strong consistency is derived. The new recursive Gumbel kernel estimators are compared to the non-recursive Gumbel kernel estimator, and the performance of the two estimators is illustrated via simulations as well as a real application.


1 Introduction
In probability theory and statistics, the Gumbel distribution (also called the Generalized Extreme Value distribution of Type I), introduced by [11,12], is one of the most widely used parametric models in extreme value theory. This distribution can be used, for example, to represent the distribution of the maximum level of a watercourse. It can also be used to predict the probability of an extreme earthquake occurring. See [16] for a variety of applications of the Gumbel model.
Many approaches have been considered in the parametric framework. We can cite the work of [17], in which the model parameters were estimated using a likelihood-based inferential procedure, and the work of [9], in which the parameters of the Gumbel function were estimated using the mode and the moments of the distribution. It should be noted that all these methods involve the estimation of unknown parameters. However, a parametric model may be too restrictive for the analysis of the complex distribution of extreme values, and may give poor results for small and medium-sized samples. As an alternative to the parametric Gumbel model, we propose to use a non-parametric kernel method.
In the literature, some attention has been paid to non-parametric procedures for tail density estimation. [6] proposes an estimator using the Champernowne transformation distribution and the classical kernel density estimator. [18] investigates kernel density estimation and its application to the Gumbel probability distribution. [4] considers an estimator based on classical transformation methods; the book includes real-life examples. [35] proposes two kernel density estimators based on a bias reduction technique. More recently, [2] develops a transformation kernel density estimator which is able to handle heavy-tailed and bounded data.
The first objective of this article is to propose recursive non-parametric Gumbel kernel estimators. The second objective is to establish the statistical properties of the new estimators and to compare them with the non-recursive version of the Gumbel kernel estimator and with the conventional kernel density estimators developed by [26] and [23] for the non-recursive case and by [21] and [28,29] for the recursive case, using a Gaussian kernel and a plug-in bandwidth selection.
*Corresponding Author: Yousri Slaoui: Université de Poitiers, E-mail: Yousri.Slaoui@math.univ-poitiers.fr
The layout of the present paper is as follows. Section 2 introduces our proposed recursive Gumbel estimators. We establish the statistical properties of the proposed estimators and develop a plug-in bandwidth selection approach in Section 3. Section 4 is dedicated to our application results, first by simulation (Subsection 4.1) and second using a real dataset (Subsection 4.2). We conclude the article in Section 5. To avoid interrupting the flow of the paper, all mathematical developments are relegated to Appendix A.

2 Recursive Gumbel kernel estimators
Let X_1, . . . , X_n be independent, identically distributed random variables, and let f be the probability density of X. In order to construct a stochastic algorithm which recursively estimates the unknown probability density function f at a point x, we first consider an algorithm searching for the zero of the function h : y → f(x) − y. Following Robbins-Monro's procedure, the proposed algorithm is defined by setting f_0^Gum(x) ∈ R and, for all n ≥ 1,

f_n^Gum(x) = f_{n−1}^Gum(x) + β_n T_n^Gum(x),

where T_n^Gum(x) is considered as an observation of the function h at the point f_{n−1}^Gum(x), and the stepsize (β_n) is a sequence of positive real numbers going to zero. To construct T_n^Gum(x), we follow the approach of [24,25], [37] and [28,30,34], which is based on the classical property of stochastic algorithms that E[T_n^Gum(x) | F_{n−1}] = h(f_{n−1}^Gum(x)) + o(1), where F_{n−1} stands for the σ-algebra of the events occurring up to time n − 1. Using the Gumbel kernel K_Gum(z) = exp(−z − exp(−z)) and a bandwidth (h_n) (that is, a sequence of positive real numbers going to zero), we set

T_n^Gum(x) = h_n^{−1} K_Gum((x − X_n)/h_n) − f_{n−1}^Gum(x).

The estimator f_n^Gum which recursively estimates the probability density function with the Gumbel kernel at x can then be written as

f_n^Gum(x) = (1 − β_n) f_{n−1}^Gum(x) + β_n h_n^{−1} K_Gum((x − X_n)/h_n).   (1)

The recursive property (1) is very useful for large sample sizes, since f_n^Gum can be updated with each additional observation: if X_n is a new observation, the estimator f_n^Gum(x) is updated through relation (1). From a computational point of view, this provides important savings in computation time and storage memory, because updating the estimate does not require the history of the data; the main drawback of non-recursive kernel estimators is precisely that they use all the data at each estimation step. One can check that, if we set f_0^Gum(x) = 0 and let Q_n = ∏_{j=1}^n (1 − β_j), then equation (1) can be rewritten as

f_n^Gum(x) = Q_n ∑_{k=1}^n Q_k^{−1} β_k h_k^{−1} K_Gum((x − X_k)/h_k).   (2)

This last form will be very useful to construct a data-driven bandwidth selection.
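As a numerical illustration of update (1), the recursive estimator can be sketched as follows. This is a minimal sketch: the function names and the illustrative choices β_n = 1/n and h_n = n^(−1/3) are ours, not the paper's tuned plug-in selections.

```python
import numpy as np

def gumbel_kernel(z):
    """Gumbel kernel K_Gum(z) = exp(-z - exp(-z))."""
    return np.exp(-z - np.exp(-z))

def recursive_gumbel_kde(sample, x_grid):
    """Recursive Gumbel kernel density estimate, evaluated on a grid.

    Implements f_n(x) = (1 - beta_n) f_{n-1}(x)
                        + beta_n * h_n^{-1} K_Gum((x - X_n)/h_n),
    starting from f_0 = 0; each observation updates the estimate
    without revisiting earlier data.
    """
    n = len(sample)
    beta = 1.0 / np.arange(1, n + 1)          # stepsize beta_n = 1/n (assumption)
    h = np.arange(1, n + 1) ** (-1.0 / 3.0)   # bandwidth h_n = n^(-1/3) (illustrative)
    f = np.zeros_like(x_grid, dtype=float)
    for k in range(n):
        kernel_term = gumbel_kernel((x_grid - sample[k]) / h[k]) / h[k]
        f = (1.0 - beta[k]) * f + beta[k] * kernel_term
    return f

rng = np.random.default_rng(0)
data = rng.gumbel(loc=0.0, scale=1.0, size=2000)
grid = np.linspace(-3.0, 6.0, 200)
est = recursive_gumbel_kde(data, grid)
```

With β_n = 1/n, the recursion reduces to an equal-weight average of the kernel terms, so the estimate remains an approximately normalized density.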
Moreover, we show that the optimal bandwidth, which minimizes E ∫_R [f_n^Gum(x) − f(x)]^2 dx, depends on the choice of the stepsize (β_n); more precisely, we show that, under some regularity conditions on the unknown probability density f and using the special stepsize (β_n) = n^{−1}, the optimal bandwidth (h_n) can be written explicitly in terms of the Euler-Mascheroni constant γ (γ ≈ 0.5772). We then propose an automatic selection of this bandwidth through a plug-in method, and give the conditions under which the recursive estimator f_n^Gum is approximately similar to the non-recursive Gumbel density estimator defined as

f̃_n^Gum(x) = (n h_n)^{−1} ∑_{i=1}^n K_Gum((x − X_i)/h_n).   (3)

The results presented in Section 4 corroborate the theoretical results given in the next section.
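The non-recursive counterpart (3) admits an equally short sketch (again with an illustrative bandwidth; the paper selects it by the plug-in rule of Section 3):

```python
import numpy as np

def gumbel_kernel(z):
    """Gumbel kernel K_Gum(z) = exp(-z - exp(-z))."""
    return np.exp(-z - np.exp(-z))

def nonrecursive_gumbel_kde(sample, x_grid, h):
    """Non-recursive estimator f_n(x) = (n h)^{-1} sum_i K_Gum((x - X_i)/h).

    Unlike the recursive version, every evaluation uses the whole sample.
    """
    z = (x_grid[:, None] - sample[None, :]) / h
    return gumbel_kernel(z).mean(axis=1) / h

rng = np.random.default_rng(1)
data = rng.gumbel(size=1000)
grid = np.linspace(-3.0, 6.0, 200)
bandwidth = len(data) ** (-1.0 / 3.0)   # n^(-1/3) rate, constant 1 (assumption)
est_nr = nonrecursive_gumbel_kde(data, grid, bandwidth)
```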

3 Assumptions and main results
We define the following class of regularly varying sequences.
Definition 1. Let γ ∈ R and let (v_n)_{n≥1} be a nonrandom positive sequence. We say that (v_n) ∈ GS(γ) if

lim_{n→∞} n [1 − v_{n−1}/v_n] = γ.   (4)

Condition (4) was introduced by [10] to define regularly varying sequences (see also [3]) and by [20] in the context of stochastic approximation algorithms. Note that the acronym GS stands for Galambos and Seneta. Typical sequences in GS(γ) are, for b ∈ R, n^γ (log n)^b, n^γ (log log n)^b, and so on.
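For intuition, the defining limit in (4) can be checked numerically for a typical member of GS(γ) (a small sketch with an illustrative exponent):

```python
import math

def gs_limit_term(v, n):
    """n * (1 - v(n-1)/v(n)); for (v_n) in GS(gamma) this tends to gamma."""
    return n * (1.0 - v(n - 1) / v(n))

# v_n = n^{0.4} lies in GS(0.4); sequences with extra (log n)^b factors
# have the same limit but converge more slowly.
v = lambda n: n ** 0.4
term = gs_limit_term(v, 10**6)   # close to 0.4
```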
In this section, we investigate asymptotic properties of the proposed estimators (1). The assumptions to which we shall refer throughout this paper are the following: Assumption (A)(iii) on the limit of (nβ_n) as n goes to infinity is standard in the framework of stochastic approximation algorithms. It implies in particular that the limit of (nβ_n)^{−1} is finite. For simplicity, we introduce the following notations: where L^{(1)}(x) is the first derivative of the function L at a point x. In this section, we make explicit the choice of the optimal bandwidth (h_n) through a plug-in method, based on the minimization of the Mean Weighted Integrated Squared Error (MWISE) of our proposed recursive Gumbel kernel density estimators f_n^Gum defined in (1). Moreover, we provide a comparison of our proposed estimators with the non-recursive Gumbel kernel density estimator (3) and with the conventional Gaussian kernel density estimators (see [29]). In addition, it has been demonstrated in [21] and considered in [28] that, to minimize the Mean Integrated Squared Error (MISE), the stepsize (β_n) must be chosen in GS(−1) and must satisfy lim_{n→∞} nβ_n = 1. We consider here the case (β_n) = n^{−1}.
Our first result is the following proposition, which gives the bias and the variance of f_n^Gum.
Proposition 1 (Bias and variance of f_n^Gum). Let Assumptions (A)-(A) hold.
Remark 1. Let us now state the following theorem, which gives the weak convergence rate of the estimators f_n^Gum defined in (1).
1. If there exists c ≥ 0 such that β_n^{−1} h_n^3 → c, then where D→ denotes convergence in distribution, N the Gaussian distribution, and P→ convergence in probability. Now, in order to measure the quality of our recursive estimators f_n^Gum defined in (1), we consider the following quantity:

MWISE[f_n^Gum] = E ∫_R [f_n^Gum(x) − f(x)]^2 f(x) dx.

The next proposition gives the MWISE of our recursive Gumbel kernel density estimators f_n^Gum defined in (1).
Theorem 2 (Uniform convergence). Let Assumptions (A)-(A) hold, assume f is uniformly continuous, and suppose there exists η > 0 such that z → z^η |f(z)| is a bounded function. Let C be a compact set of R. Then, we have

3.1 Bandwidth selection
The bandwidth selection methods studied in the literature for non-parametric kernel estimators can be divided into three main classes: cross-validation techniques, plug-in methods, and the bootstrap idea. [7] proposed a detailed comparison of these three practical bandwidth selection approaches, reaching the following conclusions: chosen appropriately, the plug-in and bootstrap selectors both outperform the cross-validation bandwidth, and neither of the two is best in all cases. More recently, [33] compared a plug-in bandwidth selector for a semi-recursive kernel hazard function to a cross-validation bandwidth selector, and concluded that the plug-in method outperforms the cross-validation method. In this section, we develop a plug-in bandwidth selector based on the minimization of the MWISE of the proposed recursive Gumbel kernel density estimator, using the function f(x) as a weight function.
The following corollary indicates that the bandwidth which minimizes the MWISE of f_n^Gum depends on the stepsize (β_n), and consequently the corresponding MWISE also depends on (β_n).
Then, the corresponding MWISE is equal to The following corollary shows that, for the special choice of stepsize (β_n) = β_0 n^{−1}, which fulfils lim_{n→∞} nβ_n = β_0 and (β_n) ∈ GS(−1), the optimal bandwidth (h_n) depends on β_0, and the corresponding MWISE then also depends on β_0.
1. Clearly, the optimal bandwidth (h_n) belongs to GS(−1/3), so that a = 1/3, which explains why the bandwidth (h_n) and the corresponding MWISE no longer depend on the quantity a. 2. Since the minimum of the β_0-dependent factor is reached at β_0 = 1, the optimal bandwidth (h_n) follows, and therefore we obtain the corresponding MWISE. 3. We consider in the rest of this paper only the stepsize (β_n) = n^{−1}.
At this stage, in order to estimate the optimal bandwidth (h_n) given in (8), we must estimate I_1 and I_2. We follow the approach of [1] and [29,30,33,34], called plug-in estimation, and we consider the following kernel estimators of I_1 and I_2: where b_n and b'_n are two pilot bandwidths, β_n and β'_n are two pilot stepsizes for the estimation of I_1 and I_2 respectively, Q_n = ∏_{i=1}^n (1 − β_i) and Q'_n = ∏_{i=1}^n (1 − β'_i). In practice, in the first step of our method we let (see [27]) where s is the sample standard deviation, Q_1 and Q_3 denote respectively the first and third quartiles of the distribution, and Q represents the quantile function of the standard normal distribution. We followed the same steps as in [29,34] and showed that, in order to minimize the MISE of the estimator of I_1, the pilot bandwidth (b_n) must belong to a suitable class GS with a negative exponent and the pilot stepsize (β_n) must be proportional to n^{−1}; similarly, to minimize the MISE of the estimator of I_2, the pilot bandwidth (b'_n) must belong to a suitable class GS and the pilot stepsize (β'_n) must be proportional to n^{−1}. Consequently, the proposed plug-in estimator of the bandwidth (h_n), using the recursive estimators defined in (1) with the choice (β_n) = n^{−1}, is equal to and the corresponding MWISE is equal to Further, using the choice which minimizes the variance (7) of the proposed recursive Gumbel density estimators, which is given by (β_n) = (1 − a) n^{−1}, the plug-in estimator of the bandwidth (h_n) must be equal to and the corresponding MWISE must be equal to One can easily check that, by considering the well-known non-recursive kernel density estimator introduced by [26] and [23] together with the Gumbel kernel, the bias and the variance of the non-recursive Gumbel density estimator f̃_n^Gum are given by the following expressions (see the next proposition).
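The plug-in idea of this section, namely estimating the unknown integrals entering the optimal bandwidth with a pilot kernel estimate and then plugging them in, can be sketched as follows. The target functional ∫ f′(x)² dx, the Silverman-type pilot bandwidth, and all names below are illustrative assumptions, not the paper's exact quantities (9)-(11):

```python
import numpy as np

def gumbel_kernel(z):
    return np.exp(-z - np.exp(-z))

def gumbel_kernel_deriv(z):
    # d/dz exp(-z - exp(-z)) = (exp(-z) - 1) * K_Gum(z)
    return (np.exp(-z) - 1.0) * gumbel_kernel(z)

def plugin_derivative_functional(sample, pilot_h, grid):
    """Pilot estimate of I = \\int f'(x)^2 dx using a Gumbel-kernel
    estimate of f' (a generic plug-in step, not the paper's recursive one)."""
    z = (grid[:, None] - sample[None, :]) / pilot_h
    f_prime = gumbel_kernel_deriv(z).mean(axis=1) / pilot_h**2
    dx = grid[1] - grid[0]
    return float(np.sum(0.5 * (f_prime[1:]**2 + f_prime[:-1]**2)) * dx)

rng = np.random.default_rng(2)
data = rng.gumbel(size=2000)
grid = np.linspace(-4.0, 8.0, 400)
s = data.std(ddof=1)
iqr = float(np.subtract(*np.percentile(data, [75, 25])))
pilot_h = 0.9 * min(s, iqr / 1.349) * len(data) ** (-0.2)  # Silverman-type pilot (assumption)
I_hat = plugin_derivative_functional(data, pilot_h, grid)
```

The resulting estimate then replaces the unknown integral in the bandwidth formula; the paper carries this step out recursively for both I_1 and I_2.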

Proposition 3 (Bias and variance of f̃_n^Gum).
It follows that, to minimize the MWISE of f̃_n^Gum, the bandwidth (h_n) must be chosen accordingly, and the corresponding MWISE follows. Now, in order to estimate the optimal bandwidth (15), we must estimate I_1 and I_2. For this purpose, we consider the following kernel estimators of I_1 and I_2 respectively: where b_n and b'_n are the two pilot bandwidths defined by (11). We show that, in order to minimize the MISE of the estimators of I_1 and I_2, the pilot bandwidths (b_n) and (b'_n) must each belong to a suitable class GS with a negative exponent.
Then the plug-in estimator of the bandwidth (h_n) using the non-recursive estimator (3) is given by and the plug-in version of the MWISE of the non-recursive Gumbel density estimator (3) is given by An additional motivation for using Gumbel kernel density estimators rather than Gaussian kernel density estimators is given in the following subsection.

3.1.1 Large and moderate deviations
The theory of large and moderate deviations focuses on rare events and the asymptotic computation of their probabilities on an exponential scale. We recall that the rate function obtained for the Moderate Deviation Principle (MDP) of the non-recursive kernel estimator ([26] and [23]) with a normal kernel involves the density f(x) explicitly (see [19]). We underline that the rate function obtained for the MDP of the recursive estimators proposed in [21] and [29], using the stepsize which minimizes the asymptotic variance and a normal kernel, is equal to J_x(t) (see [28]). Using the Gumbel kernel, the rate functions for the MDP of the Gumbel kernel density estimators again depend on t and f(x). The rate function obtained for the Gaussian kernel density estimators is smaller than the one obtained for the Gumbel kernel density estimators. We conclude that the Gumbel kernel density estimators are more concentrated around the density f(x) than the Gaussian kernel density estimators.

4 Applications
The aim of our applications is to compare the performance of the proposed recursive Gumbel kernel density estimator (R-Gumbel) defined in (1), using the stepsize (β_n) = n^{−1} together with the proposed plug-in bandwidth selection procedure given in (12), to three other methods. The first one (NR-Gaussian) uses the non-recursive kernel density estimator proposed by [26] and [23] with a Gaussian kernel, together with the plug-in bandwidth selection method proposed in [29]; the second one (R-Gaussian) uses the recursive kernel density estimator proposed by [21] with a Gaussian kernel, together with the plug-in bandwidth selection method proposed in [29]; the third one (NR-Gumbel) uses the non-recursive Gumbel kernel density estimator defined in (3) with the plug-in bandwidth selection procedure given in (18).

4.1 Simulations
In order to compare the four approaches, we consider three sample sizes. We denote by f*_i the reference density and by f_i the estimated density, and then compute the following measures: the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Mean Relative Absolute Error (MRAE). We recall that the extreme value type I distribution is referred to as the Gumbel distribution. The general formula for the probability density function of the Gumbel distribution with parameters μ and σ is

f(x; μ, σ) = (1/σ) exp( −(x − μ)/σ − exp(−(x − μ)/σ) ),

where μ is the mode (location parameter) and σ is the scale parameter (see [11,12]). As μ increases, the distribution shifts to the right, whereas when μ decreases, it shifts to the left. Probability distributions with thinner tails than the exponential distribution are called light-tailed distributions: their tails go to zero much faster than the exponential, and therefore carry less mass. In contrast, a heavy-tailed distribution has a tail heavier than that of the exponential distribution (see [5]). Formally, a distribution F is called heavy-tailed if and only if ∫_R exp(λx) F(dx) = ∞ for all λ > 0.
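The Gumbel density formula and the light-tail property above can be verified numerically (a sketch; the grid limits are our choice):

```python
import math
import numpy as np

def gumbel_pdf(x, mu=0.0, sigma=1.0):
    """Gumbel(mu, sigma) density: (1/sigma) exp(-u - exp(-u)), u = (x-mu)/sigma."""
    u = (x - mu) / sigma
    return np.exp(-u - np.exp(-u)) / sigma

x = np.linspace(-10.0, 60.0, 20001)
dx = x[1] - x[0]

def integrate(y):
    # trapezoid rule on the uniform grid
    return float(np.sum(0.5 * (y[1:] + y[:-1])) * dx)

total = integrate(gumbel_pdf(x))                  # ~ 1: a valid density

# Light tail: E[exp(lam X)] = Gamma(1 - lam) is finite for lam < 1
# (standard Gumbel), whereas a heavy-tailed F would make this integral
# infinite for every lam > 0.
lam = 0.5
mgf = integrate(np.exp(lam * x) * gumbel_pdf(x))  # ~ Gamma(0.5) = sqrt(pi)
```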
Many distributions are heavy-tailed, including the Fréchet, lognormal, and Pareto distributions. We recall that the probability density of the Fréchet distribution with parameters μ, σ and τ is

f(x; μ, σ, τ) = (τ/σ) ((x − μ)/σ)^{−1−τ} exp( −((x − μ)/σ)^{−τ} ),

for x ∈ (μ, ∞), where μ, σ and τ are the location, scale and shape parameters (see [8]). Performing all the considered methods, we report the MAE, MSE and MRAE values for each considered distribution and each sample size in Tables 1 and 2. From Tables 1 and 2, we observe that • using Model 1, the NR-Gaussian kernel estimator gives better results for the small and large sample sizes, while the R-Gumbel kernel estimator gives better results for the moderate sample size; • using Model 2 and Model 3, the NR-Gumbel kernel estimator gives better results for the small sample size.
However, the R-Gumbel kernel estimator gives better results for the moderate and large sample sizes with respect to the MAE, MSE and MRAE, with an exception for the second model for the large sample size with respect to the MRAE. • Using Model 4 and Model 5 (mixtures of two Gumbel distributions), the Gumbel kernel estimators outperform the Gaussian kernel estimators for all sample sizes with respect to the MAE and MSE. However, the NR-Gaussian kernel estimator gives better results with respect to the MRAE for the moderate and large sample sizes for the fourth model, and for the large sample size for the last model. • Using Model 6, the NR-Gaussian kernel estimator gives better results for the small sample size, while the R-Gumbel kernel estimator gives better results for the moderate and large sample sizes. • Using Models 7 and 8, the NR-Gumbel kernel estimator gives better results for the small sample size, while the R-Gumbel kernel estimator gives better results for the moderate and large sample sizes.
• Using all the considered models, the R-Gumbel kernel estimator outperforms the NR-Gumbel kernel estimator for the moderate and large sample sizes, while the NR-Gumbel kernel estimator gives better results for the small sample size. • Using all the considered models and the four approaches, the MAE, MSE and MRAE decrease as the sample size increases. • Considering the models with light-tailed distributions (Model 1, Model 2, Model 3), we observe that the further the location parameter moves away from zero and the further the scale parameter moves away from one, the more the MAE, MSE and MRAE decrease. • Considering the models with heavy-tailed distributions (Model 6, Model 7, Model 8), we observe that the further the scale and shape parameters move away from one, the more the MAE, MSE and MRAE decrease. Figure 1 shows that the Gumbel kernel estimators are closer to the true density than the Gaussian kernel estimators. Moreover, Figure 2 shows clearly that the recursive Gumbel kernel estimator is closer to the true density than the three other methods.
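The error criteria reported in the tables can be computed as below (a sketch; the grid-average convention and the MRAE guard term are our assumptions):

```python
import numpy as np

def error_metrics(f_hat, f_ref, eps=1e-12):
    """MAE, MSE and MRAE between an estimated density and a reference
    density evaluated on a common grid of points."""
    diff = f_hat - f_ref
    mae = float(np.mean(np.abs(diff)))
    mse = float(np.mean(diff ** 2))
    mrae = float(np.mean(np.abs(diff) / (f_ref + eps)))  # guard against f_ref = 0
    return mae, mse, mrae

# Quick usage: a small perturbation of a reference curve.
f_ref = np.array([0.1, 0.4, 0.3, 0.2])
f_hat = f_ref + np.array([0.01, -0.02, 0.0, 0.01])
mae, mse, mrae = error_metrics(f_hat, f_ref)
```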

4.2 Caniapiscau River Daily Flows
We consider the caniapiscau dataset, which appears in the R package FlowScreen. These data contain the mean daily streamflow of the Caniapiscau River over the recorded period. The Caniapiscau River is located in Nunavik, Quebec, Canada, and flows northward. The headwaters were dammed to create the Caniapiscau Reservoir, which started filling in 1981. In 1985, the reservoir was diverted westward into the La Grande hydroelectric complex. This flow time series is used as an example of a river with a known change point to demonstrate the package's screening capabilities.
We are interested in estimating the probability density of the mean daily streamflow, measured in m3/s. 1. The results given in Table 3 are based on the plug-in estimators of I_1 and I_2 (see (9), (10), (16) and (17) for the estimation of I_1 and I_2, and (14) and (19) for the plug-in MWISE). These estimators are very useful with real data, since we do not need to know the density of our sample. 2. Table 3 provides the estimates of the two unknown quantities I_1 and I_2 before giving the optimal bandwidth of each of the four considered methods; we then use the MWISE to compare the four approaches, and we infer that the Gumbel kernel estimators outperform the Gaussian kernel estimators, with a particular preference for the recursive version. 3. A qualitative comparison between the Gumbel and Gaussian kernel estimators is given in Figures 3 and 4. 4. Figure 4 illustrates that the Gumbel kernel estimator provides globally the best fit, whilst the Gaussian kernel estimator tends to underestimate the density function.

5 Conclusion
This paper proposes an automatic bandwidth selection for two proposed Gumbel kernel density estimators, the first recursive and the second non-recursive. The two proposed estimators asymptotically follow normal distributions. The estimators are compared to two classical Gaussian density estimators: the first is non-recursive and was introduced by [26] and [23]; the second is recursive, proposed by [21] and investigated by [29]. We showed that, using suitably selected bandwidths and particular stepsizes, the two proposed Gumbel estimators are very competitive with the two classical Gaussian density estimators. The simulation study confirms the benefits of our two proposed Gumbel estimators.
In conclusion, the proposed estimators are very competitive with the two classical Gaussian density estimators. We plan to extend this work to the multi-dimensional context, with a comparison to the work of [2]. We also plan to consider bias correction techniques with application to heavy-tailed data.
We also plan to consider Bernstein polynomials rather than kernels and to propose an adaptation of the estimators developed in [15] and [36], with application to heavy-tailed data. Moreover, we plan to extend our proposed plug-in method to extreme value applications, and to consider the case of the averaged Révész regression estimators (see [22] and [31,32]), the semi-recursive kernel regression estimators proposed by [34], and the case of time series, as in [13], in a recursive way (see [14]).

A Proofs
Throughout this section, we use the following notation: Let us first state the following technical lemma. Remark 5.
1. The application of Lemma 1 requires Assumption (A)(iii) on the limit of (nβ_n) as n goes to infinity. 2. The application of Lemma 1 ensures that the bias and the variance depend only on h_n, and not on h_1, . . . , h_{n−1}.

A.1 Proof of Proposition 1
First, it follows from (1) and (20) that Moreover, we have Further, since ∫_R K_Gum(z) dz = 1, Taylor's expansion with integral remainder ensures that and, since f^{(1)} is bounded and continuous at x, we have lim_{k→∞} δ_k(x) = 0. Moreover, since where H_z is the fractional harmonic number and γ is the Euler-Mascheroni constant (γ ≈ 0.5772). When a ≤ β/3, we have lim_{n→∞} (nβ_n) > a, and we apply Lemma 1 to infer that Moreover, in the case a > β/3, we have h_n^3 = o(β_n). Whence, since lim_{n→∞} (nβ_n) > (β − a)/2, the application of Lemma 1 gives which proves (6). Now, we have In view of (A), we have lim_{k→∞} ν_k(x) = 0 and lim_{k→∞} h_k ν_k(x) = 0. In the case a ≥ β/3, we have lim_{n→∞} (nβ_n) > (β − a)/2, and we make use of Lemma 1 to infer that Moreover, using the change of variable t = exp(−z), we have

A.2 Proof of Theorem 1
Let us first assume that, if a ≥ β/3, then (23) holds. In the case a > β/3, Part 1 of Theorem 1 follows from the combination of (6) and (23). In the case a = β/3, Parts 1 and 2 of Theorem 1 follow from the combination of (6) and (23). In the case a < β/3, (7) implies that h_n^{−1} [f_n^Gum(x) − E(f_n^Gum(x))] P→ 0, and the application of (6) gives Part 2 of Theorem 1.
Let us now prove (23). In view of (22), we have Further, we have, for all p > 0, We let V_n = Then, the combination of (21), (24) and (25) ensures that The convergence in (23) then follows from the application of Lyapunov's Theorem.

A.3 Proof of Theorem 2
First, using the compactness of the set C, we infer that, for some points (x_k)_{1≤k≤γ_n}, C ⊂ ⋃_{k=1}^{γ_n} B(x_k, a_n), where γ_n ∼ a_n^{−1} and a_n is a suitable power of h_n. Now, for any x ∈ C, we set k(x) = arg min_k |x_k − x|. Then, for any x ∈ C, we have