
Mathematica Slovaca

Editor-in-Chief: Pulmannová, Sylvia

Volume 68, Issue 5


Approximation of Information Divergences for Statistical Learning with Applications

Milan Stehlík
  • Institute of Applied Statistics and Linz Institute of Technology, Johannes Kepler University, Altenberger Strasse 69, A–4040 Linz, Austria; Institute of Statistics, University of Valparaíso, Gran Bretana 1111, Valparaíso, Chile
/ Ján Somorčík
  • Department of Applied Mathematics and Statistics Comenius University Mlynská Dolina SK–842 48 Bratislava Slovakia
/ Luboš Střelec
  • Department of Statistics and Operation Analysis Mendel University in Brno Zemědělská 1 CZ–613 00 Brno Czech Republic
/ Jaromír Antoch
  • Department of Probability and Mathematical Statistics Charles University Sokolovská 83 CZ–186 75 Praha Czech Republic
Published Online: 2018-10-20 | DOI: https://doi.org/10.1515/ms-2017-0177


In this paper we give a partial response to one of the most important statistical questions, namely, what optimal statistical decisions are and how they are related to (statistical) information theory. We exemplify the necessity of understanding the structure of information divergences and their approximations, which may in particular be understood through deconvolution. Deconvolution of information divergences is illustrated in the exponential family of distributions, leading to the optimal tests in the Bahadur sense. We provide a new approximation of I-divergences using the Fourier transformation, saddle point approximation, and uniform convergence of the Euler polygons. Uniform approximation of deconvoluted parts of I-divergences is also discussed. Our approach is illustrated on a real data example.

MSC 2010: 62E17; 62F03; 65L20; 33E30

Keywords: deconvolution; information divergence; likelihood; change in intensity of Poisson process

We would like to extend our gratitude for the support from Fondecyt Proyecto Regular No. 1151441 and LIT-2016-1-SEE-023. This work was also supported by Grants P403/15/09663S and GA16-07089S of the Czech Science Foundation, and grant VEGA No. 2/0047/15. Support from the BELSPO IAP P7/06 StUDyS network is also prominently acknowledged. The authors are very grateful to the Editor, Associate Editor and anonymous referees for their valuable comments and extremely careful reading.

1 Introduction

It is well known that one of the most important statistical applications of information theory is testing of statistical hypotheses, and that deconvolution of information divergences can lead to the optimal statistical inference. We illustrate this fact by the deconvolution of information divergence in the exponential family (see [12] for details), which results in tests optimal in the Bahadur sense.

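Let us consider a statistical model with N independent observations $y_1,\dots,y_N$, which are distributed according to the gamma densities

$$f(y_i\mid\vartheta)=\begin{cases}\dfrac{\gamma_i(\vartheta)^{v_i}}{\Gamma(v_i)}\,y_i^{v_i-1}e^{-\gamma_i(\vartheta)y_i}, & y_i>0,\\ 0, & y_i\le 0,\end{cases}\qquad i=1,\dots,N, \tag{1.1}$$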


where $\Gamma(t)=\int_0^\infty x^{t-1}e^{-x}\,dx$ denotes the Gamma function, $\vartheta=(\vartheta_1,\dots,\vartheta_p)^T\in\Theta$ is a vector of unknown scale parameters, which are the parameters of interest, and $v=(v_1,\dots,v_N)^T$ is a vector of known shape parameters. The parameter space Θ is an open subset of $\mathbb{R}^p$, $\gamma_i\in C^2(\Theta)$, and the matrix of the first order derivatives of the mapping $\gamma=(\gamma_1,\dots,\gamma_N)^T$ has full rank on Θ. This type of model is motivated, e.g., by modeling time intervals between N+1 successive random events in a Poisson process and testing its homogeneity; see [1] for details.

For $v_1=v_2=\dots=v_N=:v$ the model (1.1) belongs to a regular exponential family of N-variate densities

$$f(y\mid\gamma)=\exp\{-\psi(y)+t(y)^T\gamma-\kappa(\gamma)\}, \tag{1.2}$$


where $y=(y_1,\dots,y_N)^T$, $\psi(y)=(1-v)\sum_{i=1}^N\ln(y_i)$, $\kappa(\gamma)=N\ln(\Gamma(v))-v\sum_{i=1}^N\ln(\gamma_i)$, and $t(y)=-y$ is the sufficient statistic for the canonical parameter $\gamma\in\mathcal{G}$. The “covering property” (see [5] for details)


together with the relation


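enable us to associate with each value of $t(y)$ a value $\hat\gamma_y\in\mathcal{G}$, which satisfies

$$\frac{\partial\kappa(\gamma)}{\partial\gamma}\Big|_{\gamma=\hat\gamma_y}=t(y). \tag{1.3}$$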


It follows from (1.3) that $\hat\gamma_y$ is the MLE of the canonical parameter γ in the family (1.2). This allows us to define the I-divergence of the observed vector y in the sense of [5] as

$$I_N(y,\gamma):=I(\hat\gamma_y,\gamma).$$


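Recall that $I(\gamma^\star,\gamma)$ is the Kullback-Leibler divergence between the densities with parameters $\gamma^\star$ and γ. The I-divergence has many nice geometrical properties. For our purposes, let us just mention the Pythagorean relation, i.e., for all $\gamma,\bar\gamma,\gamma^\star\in\operatorname{int}(\mathcal{G})$ such that $(E_{\bar\gamma}[t(y)]-E_{\gamma^\star}[t(y)])^T(\gamma^\star-\gamma)=0$, it holds that

$$I(\bar\gamma,\gamma)=I(\bar\gamma,\gamma^\star)+I(\gamma^\star,\gamma),$$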


where int(𝒢) denotes the interior of the set 𝒢. Recall that the Pythagorean relation can be used for constructing the density of the MLE in a regular exponential family, see [6] for details.

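The use of I-divergence(s) has nice statistical consequences. Let us consider, e.g., the test statistic $\lambda_1$ for the likelihood ratio test (LRT) of the hypothesis $H_0\colon\gamma=\gamma_0$ against $H_1\colon\gamma\ne\gamma_0$, and the test statistic $\lambda_2$ for the LRT of the homogeneity hypothesis $\tilde H_0\colon\gamma_1=\dots=\gamma_N$ in the family of densities (1.1). Then we have the following interesting relation for every vector of canonical parameters $(\gamma_0,\dots,\gamma_0)\in\mathcal{G}$,

$$I_N(y,\gamma_0)=-\ln\lambda_1+\left(-\ln\lambda_2\mid\gamma_1=\dots=\gamma_N\right), \tag{1.4}$$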


where the variables $-\ln\lambda_1$ and $-\ln\lambda_2\mid\gamma_1=\dots=\gamma_N$ (the latter being the test statistic $-\ln\lambda_2$ under $\tilde H_0$) are independent. Notice that the deconvolution (1.4) of $I_N$ is a consequence of Theorem 4 in [11]. Both tests are asymptotically optimal in the Bahadur sense ([8, 9]). For details on homogeneity testing for selected members of the exponential family see, e.g., [14] and [15]. A general deconvolution of $I_N$ is explained, using geometric integration, in [13].
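The deconvolution (1.4) can be checked numerically in the simplest special case v = 1, i.e., for exponentially distributed observations. The following is a minimal sketch under that assumption. The three closed forms were worked out for this illustration from the likelihood of independent exponential observations and are not quoted from [11], and all function names are hypothetical.

```python
import math

def i_divergence(y, gamma0):
    """I_N(y, gamma0) for exponential data: sum of KL(Exp(1/y_i) || Exp(gamma0))."""
    return sum(gamma0 * yi - math.log(gamma0 * yi) - 1.0 for yi in y)

def neg_log_lambda1(y, gamma0):
    """-ln(lambda_1): LRT of H0: gamma = gamma0 when all rates are equal."""
    n, s = len(y), sum(y)
    return -n + gamma0 * s - n * math.log(gamma0 * s / n)

def neg_log_lambda2(y):
    """-ln(lambda_2): LRT of the homogeneity hypothesis gamma_1 = ... = gamma_N."""
    n, s = len(y), sum(y)
    return n * math.log(s / n) - sum(math.log(yi) for yi in y)

# Arbitrary illustrative data (not taken from the paper).
y = [0.3, 1.7, 0.9, 2.4, 0.05]
gamma0 = 1.5
lhs = i_divergence(y, gamma0)
rhs = neg_log_lambda1(y, gamma0) + neg_log_lambda2(y)
```

In this special case `lhs` and `rhs` agree up to rounding for any positive data, because the term N ln(S/N) cancels between the two likelihood ratio statistics.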

This paper is organized as follows. In Section 2 we study the approximation of the density of I1 by means of the Fourier transformation. In Section 3 we derive the corresponding saddle point approximation. In Section 4 we derive approximation of I1 by numerical methods applicable to differential equations. Section 5 about likelihood ratio tests follows. Finally, Section 6 illustrates our approach on a simple example based on a real problem and data. A concise discussion concludes the paper.

2 Approximation of the density of I1 by Fourier transformation

Suppose $X\sim\operatorname{Exp}(\gamma)$, i.e., X is exponentially distributed. The basic building block of the I-divergence $I_N$ is the random variable $I_1(X,\gamma)$. For γ = 1 it can be shown easily that $I_1(X,1)=X-\ln(X)-1$. Before studying the properties of $I_1$ and $I_N$, we will focus on the random variable $Y=X-\ln(X)$. To find the exact cumulative distribution function of Y, we need the Lambert W-function, studied in detail in [11], for example. Here we outline an approach enabling us to use the Fourier transformation for approximating the density $f_Y$ of the random variable Y.
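The identity $I_1(X,1)=X-\ln(X)-1$ can be illustrated by reading $I_1(x,\gamma)$ as the Kullback-Leibler divergence between Exp(1/x), the exponential density fitted to the single observation x, and Exp(γ). The sketch below compares a plain trapezoid-rule evaluation of that divergence with the closed form. The helper names, the integration cut-off and the step count are assumptions of this illustration.

```python
import math

def kl_exponential(rate_p, rate_q, steps=200_000):
    """KL divergence between Exp(rate_p) and Exp(rate_q) by the trapezoid rule."""
    upper = 60.0 / rate_p            # the Exp(rate_p) mass beyond this is about e^-60
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        u = j * h
        p = rate_p * math.exp(-rate_p * u)
        log_ratio = math.log(rate_p / rate_q) - (rate_p - rate_q) * u
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * p * log_ratio
    return total * h

def i1(x, gamma=1.0):
    """Closed form I_1(x, gamma) = gamma*x - ln(gamma*x) - 1."""
    return gamma * x - math.log(gamma * x) - 1.0
```

For example, `kl_exponential(1 / 0.7, 1.0)` and `i1(0.7)` should agree to several decimal places.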

Before deriving the characteristic function of Y, we start with an approximation of the characteristic function $\varphi_X(t)=E\,e^{itX}$ of $X\sim\operatorname{Exp}(1)$. Suppose that we can use an expansion $\varphi_X(t)=\sum_{n=0}^{\infty}a_nt^n$. Then we can approximate the density $f_X(x)$ by $\frac{1}{2\pi}\int_{-\infty}^{+\infty}e^{-itx}\left(\sum_{n=0}^{n_0}a_nt^n\right)dt$.

We must distinguish between two basic cases, i.e.,

$$\varphi_X(t)=\begin{cases}\sum_{n=0}^{\infty}(it)^n & \text{for } |t|<1,\\[2pt] \sum_{n=1}^{\infty}\dfrac{-1}{(it)^n} & \text{for } |t|>1.\end{cases} \tag{2.1}$$
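Both expansions in (2.1) are geometric series for the closed form $\varphi_X(t)=1/(1-it)$ of the characteristic function of Exp(1). A small sketch with hypothetical function names shows that each truncated series is accurate only on its own side of |t| = 1.

```python
def phi_exact(t):
    """Characteristic function of Exp(1)."""
    return 1.0 / (1.0 - 1j * t)

def phi_inner(t, n0):
    """Truncated series of (2.1) intended for |t| < 1."""
    return sum((1j * t) ** n for n in range(n0 + 1))

def phi_outer(t, n0):
    """Truncated series of (2.1) intended for |t| > 1."""
    return sum(-1.0 / (1j * t) ** n for n in range(1, n0 + 1))
```

Evaluating `phi_inner` at a point with |t| > 1 gives partial sums that grow with `n0`, which is why the backward transformation has to be split at |t| = 1.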

Let us take only finitely many terms from the series $\varphi_X(t)$, say $n_0$, which is even, and approximate the right-hand side of the so-called “backward transformation”

$$f_X(x)=\frac{1}{2\pi}\int_{-\infty}^{+\infty}e^{-itx}\varphi_X(t)\,dt. \tag{2.2}$$


By straightforward algebra for all t ∈ (-1, 1) we get an approximation


and analogously for all t ∈ (-∞, -1) ∪ (1, +∞) we get


Hence, for n0 large enough, (2.2) is approximately equal to the sum of (2.3) and (2.4). A graphical representation of this approximation was prepared using Mathematica v. 11.1 and is plotted in Figure 1. The integrals were calculated numerically. The exact density is marked with a thick line.

Figure 1. Density of the random variable X ~ Exp(1) and its approximation (2.3)+(2.4) for n0 = 4, 10, 16.

Let us now turn back to the random variable $Y=X-\ln(X)$ and its characteristic function

$$\varphi_Y(t)=E\,e^{it(X-\ln X)}=\int_0^{+\infty}e^{it(x-\ln x)}e^{-x}\,dx,$$


and notice that an attempt to express $\varphi_Y(t)$ in series form is more complicated than in the case of $\varphi_X(t)$. Therefore, we do not compute $\varphi_Y(t)$ directly, but concentrate on its nth derivative, for whose existence we only need to prove that the random variable $Y=X-\ln(X)$ has the nth initial moment, i.e., that $\int_0^{+\infty}(x-\ln(x))^ne^{-x}\,dx$ is a real number. It holds that


Since there exists $u_0>0$ such that for all $u>u_0$ it holds that $(k+1)u-e^u<-u$, then

$$e^{(k+1)u-e^u}<e^{-u}\quad\text{for all } u>u_0,$$

and, simultaneously,

$$\int_{u_0}^{+\infty}u^{n-k}e^{-u}\,du<+\infty\quad\text{for all } k=0,\dots,n. \tag{2.6}$$

Finally, from (2.5) and (2.6) we get the desired outcome in the form


Similarly, according to [7, Chapter VI, Theorem 7], we have


where the last equality follows from the identity


Thus, φY(t) can be formally written in the form


Using the first n0 terms of the series (2.8) and applying the backward transformation (2.2), we obtain


However, the integrals in (2.9) are divergent, and a flaw of this procedure is that (2.8) is not valid for every $t\in\mathbb{R}$. Therefore, we are interested in the limit behavior of the sequence $\sqrt[n]{|a(n)|}$. Using a computer, it is possible to calculate $\sqrt[n]{|a(n)|}$ for at least a few n. From the analysis of the obtained results we formulate the following hypothesis,

$$\lim_{n\to\infty}\sqrt[n]{|a(n)|}=1. \tag{2.10}$$


The well-known Cauchy-Hadamard theorem from complex analysis would then guarantee the absolute convergence of the series (2.8) for |t| < 1. However, we could neither confirm the relationship (2.10) nor express $\varphi_Y(t)$ in series form for |t| > 1. As suggested by an anonymous referee, the relation (2.10) is quite difficult to confirm, but one can easily find a bound for $|a(n)|=|\varphi_Y^{(n)}(0)|/n!$ and show that $\limsup_{n\to\infty}\sqrt[n]{|a(n)|}\le 1$. Hence, the radius of convergence is at least 1.
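The computer experiment behind (2.10) can be imitated with a few lines of code. Since $|\varphi_Y^{(n)}(0)|=E\,Y^n$ for the positive random variable Y, the sketch below evaluates the moments $E(X-\ln X)^n$ by the trapezoid rule after the substitution x = exp(u) and then takes the nth root of $E\,Y^n/n!$. The substitution, the integration limits, the step size and the function names are choices of this illustration, not the procedure used by the authors.

```python
import math

def moment(n, lo=-200.0, hi=8.0, step=0.01):
    """E[(X - ln X)^n] for X ~ Exp(1), using x = exp(u) and the trapezoid rule."""
    steps = int(round((hi - lo) / step))
    total = 0.0
    for j in range(steps + 1):
        u = lo + j * step
        x = math.exp(u)
        # integrand (x - ln x)^n * exp(-x) * dx/du with ln x = u and dx/du = x
        g = (x - u) ** n * math.exp(u - x)
        total += g if 0 < j < steps else 0.5 * g
    return total * step

def root_of_coefficient(n):
    """n-th root of |a(n)| = E[Y^n] / n!."""
    return (moment(n) / math.factorial(n)) ** (1.0 / n)
```

Printing `root_of_coefficient(n)` for n = 1, ..., 20 gives a sequence that decreases towards 1, which is in line with the hypothesis (2.10) but of course does not prove it.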

3 Saddle-point approximations

In this section we use the saddle point approximation and the framework developed in [5, Chapter 3] to derive an approximation of the density of the random variable Y = X - ln (X), where X ~ Exp(1).

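Let us first consider a family of densities of the random variable X:

$$f(x\mid\gamma)=\exp\{-\ln x+(\ln x-x)\gamma-\kappa(\gamma)\},\qquad x>0,\ \gamma>0,$$

where using the identity $1=\int_0^\infty f(x\mid\gamma)\,dx$ we get

$$e^{\kappa(\gamma)}=\int_0^\infty x^{\gamma-1}e^{-\gamma x}\,dx=\frac{\Gamma(\gamma)}{\gamma^{\gamma}},$$

which means that

$$\kappa(\gamma)=\ln\Gamma(\gamma)-\gamma\ln\gamma.$$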


Note that $e^{-x}=e^{-\ln x+(\ln x-x)\gamma-\kappa(\gamma)}\big|_{\gamma=1}$. Hence, the saddle-point approximation $q_T(t\mid\gamma)$ of the density of the random variable $T=\ln(X)-X$, where X has the density $f(x\mid\gamma)$, can help us to approximate the density of the random variable $Y=X-\ln(X)$, where X ∼ Exp(1). The required approximation has the form $q_T(-t\mid1)$.

Let us check the regularity assumptions a) – f) from [5, Chapter 3]. Assumptions a) – c) are obviously satisfied because the sample space of the random variable X is (0, ∞), the parametric space is Θ = (0, ∞), and $\gamma\colon(0,\infty)\to(0,\infty)$ is the identity. Since $t(x)=\ln(x)-x$, we have $\frac{d}{dx}t(x)=\frac{1}{x}-1$, so that d) follows. Moreover, $\kappa(\gamma)<\infty$ for all $\gamma\in(0,\infty)=\mathcal{G}$, and Assumption e) follows. Finally,


which implies the validity of Assumption f), since t(x) = ln (x) - x ≤ -1 for x > 0.

The random variable $T=\ln(X)-X$, where X has the density $f(x\mid\gamma)$, can be shown to have a density of the form $e^{-\phi(t)+t\gamma-\kappa(\gamma)}$, where the function ϕ(.) is usually hard to determine explicitly. For the MLE $\hat\gamma(t)$ of γ define


Then for the I-divergence defined as


it can be shown easily by means of (3.1) that


from which we finally get


Now, it remains to determine $\hat\gamma(t)$. From the equation $\frac{\partial}{\partial\gamma}\kappa(\gamma)\big|_{\gamma=\hat\gamma(t)}=t$ (cf. (1.3)) we have


Notice that the solution of (3.5) cannot be found in a closed form and must be evaluated numerically.

Substituting (3.2) and (3.4) into the saddle-point approximation formula (see [5, Chapter 3])


we get


Note 1.

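In mathematical software, the derivatives of the gamma function are usually not implemented, unlike the digamma and polygamma functions

$$\psi(t)=\frac{d}{dt}\ln\Gamma(t)=\frac{\Gamma'(t)}{\Gamma(t)}\quad\text{and}\quad\psi_n(t)=\frac{d^n}{dt^n}\psi(t).$$

Note that equation (3.5) now has the form

$$\psi(\hat\gamma(t))-\ln\hat\gamma(t)-1=t,$$

and finally (3.2) turns into $\Sigma_{\hat\gamma(t)}=\psi_1(\hat\gamma(t))-\frac{1}{\hat\gamma(t)}$, which enables us to conclude that

$$q_T(t\mid 1)=\frac{1}{\sqrt{2\pi}}\,\frac{1}{\sqrt{\psi_1(\hat\gamma(t))-\frac{1}{\hat\gamma(t)}}}\;e^{(1-\hat\gamma(t))t+\ln\Gamma(\hat\gamma(t))-\hat\gamma(t)\ln\hat\gamma(t)}. \tag{3.6}$$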
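Formula (3.6) can be evaluated with the standard library alone. The sketch below makes several assumptions of its own: the digamma and trigamma functions are computed by the usual upward recurrence followed by an asymptotic series, the equation $\psi(\hat\gamma)-\ln\hat\gamma-1=t$ is solved by bisection on a logarithmic scale, and all function names are hypothetical. The argument t must be smaller than -1, because $\ln(x)-x\le-1$.

```python
import math

def digamma(x):
    """psi(x) for x > 0: recurrence psi(x) = psi(x+1) - 1/x, then asymptotics."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """psi_1(x) for x > 0: recurrence psi_1(x) = psi_1(x+1) + 1/x^2, then asymptotics."""
    acc = 0.0
    while x < 10.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def gamma_hat(t):
    """Solve psi(g) - ln(g) - 1 = t for g > 0 (left side is increasing in g)."""
    lo, hi = -30.0, 30.0                      # bracket for ln(g)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        g = math.exp(mid)
        if digamma(g) - math.log(g) - 1.0 < t:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

def q_T(t):
    """Saddle-point approximation (3.6) of the density of T = ln X - X at gamma = 1."""
    g = gamma_hat(t)
    sigma = trigamma(g) - 1.0 / g
    exponent = (1.0 - g) * t + math.lgamma(g) - g * math.log(g)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * sigma)
```

The approximate density of $Y=X-\ln(X)$ at a point y > 1 is then `q_T(-y)`. At t = -1 - 0.5772..., the mean of T, the solver returns $\hat\gamma(t)=1$ and the exponent in (3.6) vanishes.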

The exact density of the random variable Y = X - ln (X), marked by a solid line in Figure 2, has the form (see Note 6 in the Appendix)


for t ∈ [1, +∞), where LW(k, t) is the k-th branch of the complex multifunction called the Lambert W-function. Recall that LW(z) is defined as the solution of the equation

$$LW(z)\,e^{LW(z)}=z, \tag{3.8}$$


and notice that (3.7) is a special case of a general expression for the density derived in [11].

Figure 2. Density of the random variable Y = X - ln(X) (X ~ Exp(1)) given by (3.7) and its approximation qT(-t | 1) given by (3.6).

Since equation (3.8) has infinitely many solutions, LW is a multifunction. Real values are contained in the branches $LW(0,t)$, $t\in[-e^{-1},\infty)$, and $LW(-1,t)$, $t\in[-e^{-1},0)$. The ranges of the corresponding functional values are [-1, +∞) for LW(0, t) and (-∞, -1] for LW(-1, t). For more details on the Lambert W-function, see [2] and [11].

The following two Lemmas, describing this function’s basic properties, will be needed below.

Lemma 3.1.

The equation $x-\ln(x)=t$ in the variable x has for all t ∈ [1, +∞) two real solutions $x_1,x_2$ such that $0<x_1\le x_2$, and it holds that

$$x_1=-LW(0,-e^{-t})\quad\text{and}\quad x_2=-LW(-1,-e^{-t}).$$


See Appendix.

Lemma 3.2.

For k = 0, -1, the function $LW(k,-e^{-x})$ is continuously differentiable in the variable x ∈ (1, +∞), and it holds that

$$\frac{d}{dx}LW(k,-e^{-x})=-\frac{LW(k,-e^{-x})}{1+LW(k,-e^{-x})}.$$



See Appendix.

4 Approximation by Euler polygons

To calculate the values of the density $f_Y$ of the random variable $Y=X-\ln(X)$, where X ~ Exp(1), the exact form (3.7) can be used. However, an approximation can also be useful, for example when the values of the Lambert W-function cannot be determined because the function is not implemented in some common statistical software such as Microsoft Excel, SPSS or Minitab.

The basic idea of our approximation is to replace the values $LW(-1,-e^{-t})$ and $LW(0,-e^{-t})$ in (3.7) by their approximations $y_{-1}(t)$ and $y_0(t)$, which can be obtained by numerically solving the differential equation

$$\frac{d}{dt}y(t)=-\frac{y(t)}{1+y(t)},$$


which is satisfied by both LW (-1, -e-t) and LW (0, -e-t), see Lemma 3.2.

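To find the required solution of this equation, Euler’s forward method (also called Euler’s polygon method) can be used, providing us with a solution in the form of a continuous piecewise linear function. More precisely, we first divide the interval [a, b] into n subintervals $[t_i,t_{i+1}]$, $i=0,1,\dots,n-1$, with $t_0=a$ and $t_{i+1}-t_i=(b-a)/n=h$. We put

$$y_k(t_0)=LW(k,-e^{-a}),\qquad k=-1,0. \tag{4.1}$$

Consequently, we proceed from the point $t_0$ to the right by step h. At every point $t_i$, $i=0,1,\dots,n-1$, it holds that

$$\frac{d}{dt}y_k(t_i)=-\frac{y_k(t_i)}{1+y_k(t_i)}. \tag{4.2}$$

However, since the function $y_k(t)$ is linear on the interval $[t_i,t_{i+1}]$, we have

$$\frac{d}{dt}y_k(t_i)=\frac{y_k(t_{i+1})-y_k(t_i)}{h}. \tag{4.3}$$

Notice that the right-hand side of the last equality is sometimes called “the forward difference”.

Consequently, if we substitute (4.3) into (4.2), we get

$$\frac{y_k(t_{i+1})-y_k(t_i)}{h}=-\frac{y_k(t_i)}{1+y_k(t_i)},$$

from which

$$y_k(t_{i+1})=y_k(t_i)-h\,\frac{y_k(t_i)}{1+y_k(t_i)}. \tag{4.4}$$

Equations (4.1) and (4.4) give us the values $y_k(t_0)$, $y_k(t_1)$, $y_k(t_2)$, etc. Finally, from the linearity of $y_k(t)$ on each interval $[t_i,t_{i+1}]$, we have

$$y_k(t)=y_k(t_i)+\frac{t-t_i}{h}\bigl(y_k(t_{i+1})-y_k(t_i)\bigr),\qquad t\in[t_i,t_{i+1}]. \tag{4.5}$$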


The above-described idea for computing yk (t) can be easily implemented; see Appendix. In Figure 3 we illustrate our approach on the interval [a, b] = [1.1, 5] for different numbers n of the grid points ti, n ∈ {10, 13, 25}. This enables us to compare the suggested approximation with the exact values of LW (0, -e-t). Parallel to that, in Figure 4 we can compare the quality of the approximations of LW (-1, -e-t) with the exact values for n ∈ {10, 13, 25} on [a, b] = [1.1, 5].
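The Euler polygon construction can be sketched as follows. This is not the implementation from the Appendix. It assumes that the polygon starts from the exact branch value at t = a, and it obtains the exact values used for comparison by bisection on $x-\ln(x)=t$ (Lemma 3.1). All function names are hypothetical.

```python
import math

def branch_value(k, t, iters=200):
    """LW(k, -exp(-t)) for k in {0, -1}, t >= 1, from the roots of x - ln(x) = t."""
    if k == 0:
        lo, hi = math.exp(-t), 1.0            # root x1 in (0, 1]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid - math.log(mid) > t:
                lo = mid
            else:
                hi = mid
    else:
        lo, hi = 1.0, 2.0 * t + 1.0           # root x2 in [1, 2t + 1]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid - math.log(mid) < t:
                lo = mid
            else:
                hi = mid
    return -0.5 * (lo + hi)

def euler_polygon(k, a, b, n):
    """Grid points t_i and Euler values y_k(t_i) for y' = -y / (1 + y)."""
    h = (b - a) / n
    ts = [a + i * h for i in range(n + 1)]
    ys = [branch_value(k, a)]                 # exact starting value at t0 = a
    for _ in range(n):
        y = ys[-1]
        ys.append(y - h * y / (1.0 + y))      # one forward Euler step
    return ts, ys

def max_grid_error(k, a, b, n):
    """Largest deviation from the exact branch over the grid points."""
    ts, ys = euler_polygon(k, a, b, n)
    return max(abs(y - branch_value(k, t)) for t, y in zip(ts, ys))
```

On [a, b] = [1.1, 5] the grid error shrinks roughly in proportion to the step h, and for a fine grid the polygon stays above LW(0, ·) and below LW(-1, ·) at the grid points, in agreement with Lemmas 4.1 and 4.3 stated later in this section.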

Figure 3. Plot of LW(0, -e^{-t}) and its approximation (4.5) for n ∈ {10, 13, 25}.

Figure 4. Plot of LW(-1, -e^{-t}) and its approximation (4.5) for n ∈ {10, 13, 25}.

Substituting the approximations $y_{-1}(t)$ and $y_0(t)$ of $LW(-1,-e^{-t})$ and $LW(0,-e^{-t})$ into (3.7), we get the approximation $ff_n(t)$ of the exact density $f_Y(t)$ on the interval [a, b] in the form


For an illustration of the quality of this approximation on [a, b] = [1.1, 5], see Figure 5.

Figure 5. Exact density fY(t) of the random variable Y = X - ln(X) (X ~ Exp(1)) given by (3.7), and its approximation (4.6) for n ∈ {10, 13, 25}.

To examine the approximation (4.6) in more depth, we first focus on LW (0, -e-t). To simplify the notation, let us, for the moment, denote

$$u(t):=LW(0,-e^{-t})\quad\text{and}\quad y(t):=y_0(t),$$

and, without proof, use the fact that -1 < u(t) < 0 holds for all t ∈ [a, b]. From Lemma 3.2 it follows that $\frac{d}{dt}u(t)=-\frac{u(t)}{1+u(t)}>0$ for all t ∈ [a, b], i.e., u(.) is an increasing function on [a, b]. Moreover, $\frac{d^2}{dt^2}u(t)=\frac{u(t)}{(1+u(t))^3}<0$ for all t ∈ [a, b], i.e., u(.) is strictly concave on [a, b]. We will need the following two Lemmas for the subsequent proofs.

Lemma 4.1.

If the number n of grid points is “large enough”, then:

  1. y(·) is increasing on [a, b] and, moreover, y(t) < 0 for all t ∈ [a, b].

  2. y(t) approximates u(t) from above, i.e., for all t ∈ (a, b] it holds that y(t) > u(t).


See Appendix. ∎

Lemma 4.2.

There is a constant c > 0 such that for all x, y ∈ [a, b] it holds that |u(x) - u(y)| < c |x-y|, i.e., u (.) is Lipschitz continuous.


See Appendix. ∎

The following is our main result about the accuracy of the approximation $y_0(t)$:

Theorem 4.1.

Let 1 < a < b < + ∞. Then y0 (t) uniformly converges to LW (0, -e-t) on the interval [a, b] as the number of grid points n → + ∞.


See Appendix. ∎

Note 2. Recall that Picard’s existence theorem, one of the basic results in the theory of ordinary differential equations, gives conditions that guarantee the existence and uniqueness of solutions of the Cauchy initial problem

$$x'(t)=f(t,x(t)),\qquad x(t_0)=x_0.$$

It states that there exist both an interval I such that $t_0\in I$ and a solution x: I → ℝ. For the existence of the solution it is sufficient that f: ℝ × ℝ → ℝ is Lipschitz continuous in the second variable.

The proof of Picard’s existence theorem is usually based on the uniform convergence of the Euler polygons. Unfortunately, we cannot rely on this approach in the proof of Theorem 4.1, despite the fact that the Lipschitz continuity of the function $f(t,x)=-\frac{x}{1+x}$ can be easily proven, because Theorem 4.1 guarantees global convergence of the Euler polygons on the entire interval [a, b], not just on a small neighborhood [a, a + ε] of the point a.

Now we focus on LW (-1, -e-t) and show convergence analogous to that in Theorem 4.1. We will use procedures and Lemmas similar to those used in the proof of Theorem 4.1. Therefore, it is advantageous, for the moment, to use analogous notation as well, i.e.,

$$u(t):=LW(-1,-e^{-t})\quad\text{and}\quad y(t):=y_{-1}(t).$$

Without proof, we will use the following properties of the Lambert W-function, namely, for all t ∈ [a, b] it holds that u(t) < -1, $\frac{d}{dt}u(t)=-\frac{u(t)}{1+u(t)}<0$ and $\frac{d^2}{dt^2}u(t)=\frac{u(t)}{(1+u(t))^3}>0$, i.e., u(.) is a decreasing and strictly convex function. We will also need the following two Lemmas.

Lemma 4.3.

If the number n of the grid points is “large enough”, then:

  1. y(·) is decreasing on [a, b].

  2. y(t) approximates u(t) from below, i.e., for all t ∈ (a, b] it holds that y(t) < u(t).


See Appendix. ∎

Lemma 4.4.

There is a constant c > 0 such that for all x, y ∈ [a, b] it holds that |u(x) - u(y)| < c|x-y|, i.e., u(.) is Lipschitz continuous.


See Appendix. ∎

This is our main result about the accuracy of the approximation y-1 (t):

Theorem 4.2.

Let 1 < a < b < + ∞. Then y-1 (t) uniformly converges to LW (-1, -e-t) on the interval [a, b] as the number of the grid points n → +∞.


See Appendix. ∎

Finally, Theorems 4.1 and 4.2 allow us to make an explicit statement about the quality of the approximation $ff_n(t)$ (see (4.6)) of the exact density $f_Y(t)$ given by (3.7).

Theorem 4.3.

Let 1 < a < b < +∞. Then $ff_n(t)$ converges uniformly on the interval [a, b] to $f_Y(t)$ as the number of grid points n → +∞.


The assertion follows from the uniform convergence of $y_0(t)$ and $y_{-1}(t)$ to $LW(0,-e^{-t})$ and $LW(-1,-e^{-t})$, from the continuity of the function $z\mapsto\frac{z}{1+z}e^{z}$, and from the fact that there exist a lower bound of $LW(0,-e^{-t})$ on the interval [a, b] that is greater than -1 and an upper bound of $LW(-1,-e^{-t})$ on the interval [a, b] that is smaller than -1. ∎

5 Likelihood ratio tests

First recall that if the random variables $X_1,\dots,X_N\sim\operatorname{Exp}(\gamma)$ are independent, then the maximum likelihood estimator of the parameter γ has the form

$$\hat\gamma=\frac{N}{\sum_{i=1}^N X_i}.$$


Moreover, $Z_N=\gamma\sum_{i=1}^N X_i$ is governed by the gamma distribution Γ(N, 1) with the distribution function

$$D_N(x)=\frac{1}{(N-1)!}\int_0^x t^{N-1}e^{-t}\,dt,\qquad x>0.$$


Using integration by parts, we can calculate DN(x) using

$$D_N(x)=1-e^{-x}\left(1+\frac{x}{1!}+\dots+\frac{x^{N-1}}{(N-1)!}\right)\quad\text{for } x>0. \tag{5.2}$$
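The finite sum (5.2) translates directly into code. The following sketch, with a hypothetical function name, accumulates the terms $x^j/j!$ iteratively so that no large factorial is formed explicitly.

```python
import math

def erlang_cdf(n, x):
    """D_N(x) from (5.2): distribution function of Gamma(N, 1) for integer N >= 1."""
    if x <= 0.0:
        return 0.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= x / j                 # term is now x^j / j!
        total += term
    return 1.0 - math.exp(-x) * total
```

For very small x and large N the result is a difference of two numbers close to 1 and loses relative accuracy, and for very large x the partial sums can overflow, so a production implementation would likely use a regularized incomplete gamma routine instead.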

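If we test the hypothesis $H_0\colon\gamma=\gamma_0$ against $H_1\colon\gamma\ne\gamma_0$, and λ(X) denotes the corresponding likelihood ratio, the test statistic of the associated LRT equals

$$-\ln\lambda(X)=-N+\gamma_0\sum_{i=1}^N X_i-N\ln\left(\frac{\gamma_0}{N}\sum_{i=1}^N X_i\right). \tag{5.3}$$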


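Applying the same procedure as used for the derivation of (3.7) and slightly generalizing Lemma 3.1, we can show that the distribution function of $-\ln\lambda(X)$, i.e., $\bar F_N(x)=P(-\ln\lambda(X)\le x)$, has, under the hypothesis $H_0$, the form

$$\bar F_N(x)=D_N\!\left(-N\,LW(-1,-e^{-1-x/N})\right)-D_N\!\left(-N\,LW(0,-e^{-1-x/N})\right),\qquad x>0.$$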


Note 3. If N = 1 and $\gamma_0=1$, then $-\ln\lambda(X)=X-\ln(X)-1$, where X ~ Exp(1), and the corresponding distribution function has the form

$$\bar F_1(x)=e^{LW(0,-e^{-1-x})}-e^{LW(-1,-e^{-1-x})},\qquad x>0.$$

Hence, the random variable $X-\ln X$ has the distribution function of the form $e^{LW(0,-e^{-x})}-e^{LW(-1,-e^{-x})}$ on the interval [1, +∞) and zero otherwise, which is the same result as (.2) in Note 6.

For testing $H_0\colon\gamma=\gamma_0$ against $H_1\colon\gamma\ne\gamma_0$, most practitioners use the LRT and the Wilks’ test statistic $-2\ln\lambda(X)$, which is well known to be asymptotically distributed as $\chi_1^2$, see [16], with the corresponding asymptotic $\chi_1^2$ critical values. The exact critical values are usually not used.

A natural question arises as to how good the asymptotic approach is, i.e., how much the asymptotic critical values differ from the exact ones. Therefore, we will focus in more detail on the exact distribution of the Wilks’ test statistic under the given conditions. To that purpose, first recall that for x > 0:

$$F_N(x)=P(-2\ln\lambda(X)\le x)=D_N\!\left(-N\,LW(-1,-e^{-1-\frac{x}{2N}})\right)-D_N\!\left(-N\,LW(0,-e^{-1-\frac{x}{2N}})\right). \tag{5.4}$$


Note that the corresponding density can be found following the approach outlined in Note 6.
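The exact distribution function of the Wilks’ statistic can be evaluated without a Lambert W routine, because by Lemma 3.1 the values $-LW(0,\cdot)$ and $-LW(-1,\cdot)$ are the two roots of $x-\ln(x)=1+x_0/(2N)$ for a given argument $x_0$. The sketch below rests on that observation. The formula coded in `wilks_cdf` is this illustration's reading of (5.4), and the function names and bisection brackets are hypothetical. The two values 8.853 and 3.968 mentioned in the usage note are the largest and smallest exact critical values quoted in the text for Table 1.

```python
import math

def real_roots(t, iters=200):
    """Roots 0 < x1 <= 1 <= x2 of x - ln(x) = t (t >= 1), by bisection."""
    lo, hi = math.exp(-t), 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) > t:
            lo = mid
        else:
            hi = mid
    x1 = 0.5 * (lo + hi)
    lo, hi = 1.0, 2.0 * t + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) < t:
            lo = mid
        else:
            hi = mid
    return x1, 0.5 * (lo + hi)

def erlang_cdf(n, x):
    """D_N(x) from (5.2)."""
    if x <= 0.0:
        return 0.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total

def wilks_cdf(n, x):
    """F_N(x) = P(-2 ln lambda <= x) under H0 (assumed form of (5.4))."""
    if x <= 0.0:
        return 0.0
    x1, x2 = real_roots(1.0 + x / (2.0 * n))
    return erlang_cdf(n, n * x2) - erlang_cdf(n, n * x1)

def chi2_1_cdf(x):
    """CHI1(x): distribution function of the chi-square law with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0.0 else 0.0

def critical_value(cdf, alpha, lo=1e-9, hi=60.0, iters=100):
    """Solve cdf(x) = 1 - alpha by bisection (cdf assumed increasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With these helpers, `critical_value(lambda x: wilks_cdf(1, x), 0.005)` is close to 8.853 and `critical_value(lambda x: wilks_cdf(5, x), 0.05)` is close to 3.968.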

Moreover, let us denote the distribution function of $\chi_1^2$ by CHI1(x). Since $F_N(x)$ converges to CHI1(x) from below as N→+∞, the corresponding asymptotic critical values based on CHI1(x) are smaller than the exact ones based on $F_N(x)$. Consequently, we will try to approximate the function $F_N(x)$ from above to get closer to its graph than CHI1(x). To approximate the function $LW(k,-e^{-t})$ for k = -1, 0, we will use the method of the Euler polygons described in Section 4. There are two options:

  a) To derive an approximation of the exact density and integrate it to obtain the approximation of the distribution function (5.4).

  b) To directly approximate the individual terms appearing in (5.4).

Ad a). This approach is not suitable. Lemmas 4.1 and 4.3 show that $y_{-1}(x)$ and $y_0(x)$ are approximations of $LW(-1,-e^{-x})$ and $LW(0,-e^{-x})$ from below and from above, respectively. However, we need the approximation of the exact density from above, because the approximating distribution function must be enclosed between $F_N(x)$ and CHI1(x). It can be shown that, to ensure this, we would need, for N odd, the function $z\mapsto\frac{z^N}{1+z}e^{Nz}$ to be decreasing on the interval (-∞, 0) and, on the contrary, for N even, the same function to be increasing on the same interval. Unfortunately, the opposite is true, since

$$\frac{d}{dz}\left(\frac{z^N}{1+z}e^{Nz}\right)=\frac{z^{N-1}e^{Nz}\left(N(1+z)^2-z\right)}{(1+z)^2},$$


which is (for z<0) positive for odd N and negative for even N.

Ad b). The direct approximation of the individual terms in (5.4) appears to be a better idea. According to Lemmas 4.1 and 4.3, we have


from which


Moreover, the distribution function DN(.) is non-decreasing, so that


where $FF_N^n(x)$ denotes an approximation of $F_N(x)$ obtained when n grid points are used for the evaluation of the corresponding Lambert W-functions on the interval [a, b]. The required inequality $F_N(x)\le FF_N^n(x)$ holds, but only for a sufficiently large number n of grid points, because we have used Lemmas 4.1 and 4.3.

Note that for each fixed N, the sequence of functions $\{FF_N^n(x)\}_{n=1}^{+\infty}$ converges uniformly to $F_N(x)$. This follows directly from Theorems 4.1 and 4.2 and from the continuity of $D_N(x)$. Recall that if we want to calculate $FF_N^n(x)$ on the interval [c, d], then we need to choose the appropriate a and b needed to determine $y_{-1}(x)$ and $y_0(x)$. It is evident that we can simply take

$$a=1+\frac{c}{2N}\quad\text{and}\quad b=1+\frac{d}{2N}. \tag{5.5}$$


In [11] one can look up the exact α critical values based on $F_N(x)$ for α ∈ {0.005, 0.01, 0.02, 0.05} and N ∈ {1, 2, 3, 4, 5} (see Table 1). To obtain approximate critical values, it is sufficient to approximate the distribution functions $F_N(x)$ just on a certain subinterval to be able to determine the critical values for the most common α values. Note that the smallest exact critical value in Table 1 is 3.968 and the largest is 8.853; therefore we will compute the approximations $FF_N^n(x)$ just on the interval [c, d] = [3.8, 8.85] for N ∈ {1, 2, 3, 4, 5}. It means that it is sufficient to choose the same a, b for all five values of N, i.e., to set $a=1+\frac{3.8}{2\times5}=1.38$ and $b=1+\frac{8.85}{2\times1}\approx5.43$ (cf. (5.5)).

Since $FF_N^n(x)>F_N(x)$ for all x ∈ [c, d], the approximate critical values obtained from $FF_N^n(x)$ are smaller than the exact critical values. Recall that the critical values based on $FF_N^n(x)$ can be obtained as a numerical solution of the equation $1-FF_N^n(x)=\alpha$. The approximate critical values calculated for n = 60 are presented in Table 1.
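The whole chain, from the Euler polygons on [a, b] = [1.38, 5.43] through the approximate distribution function to the solution of $1-FF_N^n(x)=\alpha$, can be sketched as follows. The sketch is a reconstruction under stated assumptions, not the authors' code: the polygons start from the exact branch values at t = a, $FF_N^n$ is taken to be (5.4) with the Lambert W values replaced by the polygon values at $t=1+x/(2N)$, the equation is solved by bisection on [c, d] = [3.8, 8.85], and all names are hypothetical.

```python
import math

def real_roots(t, iters=100):
    """Roots 0 < x1 <= 1 <= x2 of x - ln(x) = t (t >= 1), by bisection."""
    lo, hi = math.exp(-t), 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) > t:
            lo = mid
        else:
            hi = mid
    x1 = 0.5 * (lo + hi)
    lo, hi = 1.0, 2.0 * t + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) < t:
            lo = mid
        else:
            hi = mid
    return x1, 0.5 * (lo + hi)

def erlang_cdf(n, x):
    """D_N(x) from (5.2)."""
    if x <= 0.0:
        return 0.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total

def euler_values(k, a, b, n):
    """Euler values y_k(t_i), i = 0..n, started at the exact LW(k, -exp(-a))."""
    h = (b - a) / n
    x1, x2 = real_roots(a)
    y = -x1 if k == 0 else -x2
    ys = [y]
    for _ in range(n):
        y -= h * y / (1.0 + y)
        ys.append(y)
    return ys

def polygon(ys, a, b, t):
    """Piecewise linear interpolation of the Euler values at t in [a, b]."""
    n = len(ys) - 1
    h = (b - a) / n
    i = min(max(int((t - a) / h), 0), n - 1)
    return ys[i] + (t - (a + i * h)) * (ys[i + 1] - ys[i]) / h

def make_ff(big_n, n, a=1.38, b=5.43):
    """Return x -> FF_N^n(x), valid for 1 + x/(2N) in [a, b]."""
    y0 = euler_values(0, a, b, n)
    ym = euler_values(-1, a, b, n)
    def ff(x):
        t = 1.0 + x / (2.0 * big_n)
        return (erlang_cdf(big_n, -big_n * polygon(ym, a, b, t))
                - erlang_cdf(big_n, -big_n * polygon(y0, a, b, t)))
    return ff

def exact_f(big_n, x):
    """Exact F_N(x) via the roots of x - ln(x) = 1 + x/(2N)."""
    x1, x2 = real_roots(1.0 + x / (2.0 * big_n))
    return erlang_cdf(big_n, big_n * x2) - erlang_cdf(big_n, big_n * x1)

def critical_value(cdf, alpha, lo=3.8, hi=8.85, iters=60):
    """Solve 1 - cdf(x) = alpha on [c, d] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, `critical_value(make_ff(1, 60), 0.05)` gives an approximate critical value for N = 1 that can be compared with the exact one, `critical_value(lambda x: exact_f(1, x), 0.05)`, and with the asymptotic value 3.841.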

Table 1

Critical values: exact (based on $F_N(x)$), approximate (based on $FF_N^{60}(x)$), and asymptotic (based on CHI1(x))

Finally, the asymptotic critical values based on CHI1(x) are also shown in Table 1 for the purpose of comparison. We can see that the approximate critical values obtained from $FF_N^n(x)$ are always between the exact and the asymptotic critical values.

Using a computer, it is also possible to “experimentally” determine the smallest $n_0$ for which $FF_N^{n_0}(x)$ is enclosed between $F_N(x)$ and CHI1(x) for all x ∈ [c, d] (see Table 2). This means that for all $n>n_0$ the approximation based on $FF_N^{60}(x)$ is more accurate than the asymptotic approximation based on CHI1(x). The values of $n_0$ reported in Table 2 were the motivation to set n = 60 when constructing Table 1.

Table 2

Smallest $n_0$ for which $FF_N^{n_0}(x)$ is enclosed between $F_N(x)$ and CHI1(x) for all x ∈ [c, d]

When calculating the approximations FFNn(x) on the interval [c, d] for a particular N, we can reach good accuracy for even smaller n, provided a and b have been set exactly according to (5.5). However, notice that for Tables 1 and 2, we chose the same a = 1.38 and b = 5.43 for all N. This is useful, e.g., for the construction of Table 1, because we need to know y-1 (t) and y0 (t) only on a fixed interval [a, b] for n = 60, regardless of the value of N.

Note 4. If we want to determine $n_0$ exactly, it is evident that we should find the maximal error of the approximation of $LW(-1,-e^{-t})$ and $LW(0,-e^{-t})$ when using $y_{-1}(t)$ and $y_0(t)$, i.e.,

$$\sup_{t\in[a,b]}|y_{-1}(t)-LW(-1,-e^{-t})|\quad\text{and}\quad\sup_{t\in[a,b]}|y_0(t)-LW(0,-e^{-t})|.$$

Upper bounds of these errors were derived in the proofs of Theorems 4.1 and 4.2. According to (.8) and (.10), the upper bound of the latter maximal error is


where B=c1+LW(0,ea)2,d1=LW(0,ea)1+LW(0,ea)3ban2, and, according to Lemma 4.2, $c=-\frac{LW(0,-e^{-a})}{1+LW(0,-e^{-a})}$. Taking a = 1.38, b = 5.43 (as in Table 1) and $n_0=20$, the upper bound is 1.191, which is a very rough estimate. Thus, the resulting estimate of the smallest $n_0$ would be too high, no matter how accurately we are able to estimate the change of the value of $D_N(x)$ when changing its argument only negligibly.

Note 5. When practically using the exact distribution function FN (x) or its approximation FFNn(x), it is not surprising that the main numerical problem is not the computation of the functional values of the Lambert W-function or its approximation, but the computation of the functional values of DN(x), especially for large values of N.

6 Example

Imagine a company producing different products. At any given moment, the company concentrates on just one product, and it starts production of another one only after the completion of the previous one.

Assume that the production times of individual pieces of a given product can be described by independent random variables with small dispersion. Assume moreover that the overall production time of one product can be described by the Erlang distribution Erl(k, γ), which is a special case of the gamma distribution with an integer parameter k. It is well known that the corresponding cumulative distribution function is $D_k(\gamma x)$ (see (5.2)). Different estimators of the parameters and their properties have been thoroughly studied in the literature; for more details see, e.g., [3], [4] or [10].

Suppose the company, which currently has m back-orders, receives a new order. The company is now interested in estimating the time in which it is able to process the new order with a prescribed reliability δ, say δ = 0.99. We will call this time T “the δ due time”. If we denote the production time of the ith product as $X_i$, we are interested in such a time T for which

$$P(X_1+X_2+\dots+X_{m+1}\le T)=\delta.$$


Since $X_i\sim\operatorname{Erl}(k,\gamma)$ are independent, $X_1+X_2+\dots+X_{m+1}\sim\operatorname{Erl}(k(m+1),\gamma)$, and the required T can be calculated from the equation (cf. (5.2))

$$D_{(m+1)k}(\gamma T)=\delta.$$


It is evident that, provided the values of the parameters k and γ are known, T is the δ-quantile of Erl((m+1)k, γ). It can be shown easily that an equivalent expression is

$$T=\frac{d_{(m+1)k}(\delta)}{\gamma}, \tag{6.1}$$


where $d_{(m+1)k}(\delta)$ is the δ-quantile of Erl((m+1)k, 1). When the parameters are not known, we have to replace them by some kind of estimates $\hat k$ and $\hat\gamma$.

As an illustration of our approach, we will use real data representing production times in hours presented in Table 3. These data form a subset of a larger data set studied in [10]. The MLEs of k and γ are $\hat k=18$ and $\hat\gamma=12.4042$, respectively (see [4] for details). Moreover, we assume m = 4. Replacing k and γ in (6.1) by $\hat k$ and $\hat\gamma$, respectively, and putting δ = 0.99, we obtain $\hat T=9.1524$.
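The due time follows from (6.1) once the δ-quantile of Erl((m+1)k, 1) is available. The sketch below finds the quantile by bisection on the finite sum (5.2). The function names and the bracketing strategy are assumptions of this illustration, while the numerical inputs are the estimates quoted above.

```python
import math

def erlang_cdf(n, x):
    """D_N(x) from (5.2)."""
    if x <= 0.0:
        return 0.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total

def erlang_quantile(n, delta, iters=200):
    """delta-quantile d_n(delta) of Erl(n, 1), by bisection."""
    lo, hi = 0.0, float(n)
    while erlang_cdf(n, hi) < delta:   # enlarge the bracket until it covers delta
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def due_time(m, k, gamma, delta):
    """T from (6.1): quantile of Erl((m+1)k, 1) divided by the rate gamma."""
    return erlang_quantile((m + 1) * k, delta) / gamma

T_hat = due_time(m=4, k=18, gamma=12.4042, delta=0.99)
```

With the estimates above, `T_hat` should reproduce the reported value 9.1524 up to rounding.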

Table 3

Observed data.

The company might also be interested in testing the hypothesis $H_0\colon T=T_0$ against the alternative $H_1\colon T>T_0$, where $T_0$ is the δ due time of delivery of the (m+1)st product as required by the customer. If we can assume that the parameter k is known, say from past experience, these hypotheses might be replaced by hypotheses on the parameter γ, more precisely, $H_0\colon\gamma=\gamma(T_0)$ against the alternative $H_1\colon\gamma<\gamma(T_0)$, where $\gamma(T_0)=d_{(m+1)k}(\delta)/T_0$ (cf. (6.1)). Put, for example, $T_0=8$ hours. Then the corresponding γ(8) will be 14.1910. By straightforward algebra, the Wilks’ test statistic of the appropriate LRT turns out to be of the form


i.e., very similar to (5.3). Since the sum of independent exponentially distributed random variables is governed by the Erlang distribution, it is easy to show that the corresponding distribution function of $-2\ln\lambda(X)$ under $H_0$ is $F_{Nk}(x)$ (cf. (5.4)), i.e., $F_{10\cdot18}(x)=F_{180}(x)$ in our case. For our data we obtain $-2\ln\lambda(X)=3.4110$, which is smaller than the exact, the approximate, and also the asymptotic critical value corresponding to α = 0.05 (see Table 4, which was obtained in a similar manner as Table 1). Hence, neither the exact, nor the approximate, nor the asymptotic version of the LRT rejects $H_0$ at the significance level 5%.

Table 4

Critical values: exact (based on $F_{180}(x)$), approximate (based on $FF_{180}^{100}(x)$), and asymptotic (based on CHI1(x)).

Another problem of the company is the possible liability to pay a penalty for a delay in delivery. It is therefore appropriate to determine the power of the considered LRT if the real δ due time of delivery is $T_0+1$, i.e., the real γ is $\gamma(T_0+1)=\gamma(8+1)=12.6142$ in our case. Similarly to Note 6 (see Appendix) it can be shown that the distribution function of $-2\ln\lambda(X)$ under a general value of γ has for all x > 0 the form


so that the power of the considered LRT is


where c_{α,Nk} is the critical value of the test derived from F_{Nk}(x), FF_{Nk}^{n}(x), or CHI_1(x) (see Table 4). For α = 0.05 we obtain the powers 0.3596, 0.3597, and 0.3600 for the exact, the approximate, and the asymptotic versions of the LRT, respectively, at the nominal significance level 5%.
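The power of the asymptotic version can be recomputed with a short script. This is only a sketch under stated assumptions: it takes the statistic to have the standard form for the scale of a gamma sample with known shape, −2 ln λ(X) = 2Nk(y − ln y − 1) with y = γ0 ΣX_i/(Nk), and it uses the fact that γ ΣX_i follows the Erlang distribution with shape Nk = 180 and unit rate when γ is the true parameter. The displayed formulas of this section are not available here, so this form is assumed rather than quoted, and the function names are illustrative. The critical value is the 0.95-quantile of the chi-square distribution with one degree of freedom.

```python
import math

def erlang_cdf(x, n):
    """P(G <= x) for G ~ Erlang(n, rate 1)."""
    if x <= 0.0:
        return 0.0
    term = math.exp(-x)
    total = term
    for j in range(1, n):
        term *= x / j
        total += term
    return 1.0 - total

def bisect_root(f, lo, hi, iters=200):
    """Bisection; f(lo) and f(hi) must have opposite signs."""
    lo_positive = f(lo) > 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0.0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lrt_power(gamma0, gamma_true, nk, crit):
    """P(2*nk*(y - ln y - 1) > crit) under the assumed form of the statistic."""
    t = 1.0 + crit / (2.0 * nk)
    g = lambda y: y - math.log(y) - t
    y_lo = bisect_root(g, math.exp(-t - 1.0), 1.0)   # root below 1
    y_hi = bisect_root(g, 1.0, 2.0 * t + 2.0)        # root above 1
    scale = nk * gamma_true / gamma0                 # maps y to G = gamma_true * sum(X)
    return erlang_cdf(scale * y_lo, nk) + 1.0 - erlang_cdf(scale * y_hi, nk)

# 0.95-quantile of chi-square(1); its distribution function is erf(sqrt(x/2)).
crit = bisect_root(lambda x: math.erf(math.sqrt(x / 2.0)) - 0.95, 0.0, 50.0)
power = lrt_power(14.1910, 12.6142, 180, crit)   # true gamma = gamma(9)
size = lrt_power(14.1910, 14.1910, 180, crit)    # rejection probability under H0
print(round(crit, 4), round(power, 3), round(size, 3))
```

The two roots of y − ln y = t are exactly the quantities x1 and x2 of Lemma 3.1, which is where the Lambert W function enters the power computation.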

7 Conclusions

Approximations of information divergences and their decompositions can be a very useful tool for both theory and practice. In particular, they are related to the likelihood ratio tests. In this paper we provide several approximations and discuss their properties. Applications to precision of likelihood ratio tests of production times are given.


[1] ANTOCH, J.–JARUŠKOVÁ, D.: Testing a homogeneity of stochastic processes, Kybernetika 41 (2007), 415–430.

[2] CORLESS, R. M.–GONNET, G. H.–HARE, D. E. G.–JEFFREY, D. J.: On Lambert’s W Function. Technical Report CS-93-03, Department of Computer Science, University of Waterloo.

[3] JOHNSON, N. L.–KOTZ, S.–BALAKRISHNAN, N.: Continuous Univariate Distributions, Volume 1, 2nd ed., J. Wiley, New York, 1994.

[4] MILLER, G. K.: Maximum likelihood estimation for the Erlang integer parameter, Statist. Probab. Lett. 43 (1999), 335–341.

[5] PÁZMAN, A.: Nonlinear Statistical Models, Dordrecht, Kluwer Academic Publishers, 1993.

[6] PÁZMAN, A.: The density of the parameter estimators when the observations are distributed exponentially, Metrika 44 (1996), 9–26.

[7] RÉNYI, A.: Wahrscheinlichkeitsrechnung mit einem Anhang über Informationstheorie. VEB Deutscher Verlag der Wissenschaften, Berlin, 1962.

[8] RUBLÍK, F.: On optimality of the likelihood ratio tests in the sense of exact slopes. Part 1, General case, Kybernetika 25 (1989), 13–25.

[9] RUBLÍK, F.: On optimality of the likelihood ratio tests in the sense of exact slopes. Part 2, Application to individual distributions, Kybernetika 25 (1989), 117–135.

[10] STEHLÍK, M.: Exact likelihood ratio tests of the scale in the Gamma family, Tatra Mt. Math. Publ. 26 (2003), 381–390.

[11] STEHLÍK, M.: Distributions of exact tests in the exponential family, Metrika 57 (2003), 145–164.

[12] STEHLÍK, M.: Decompositions of information divergences: Recent development, open problems and applications. In: AIP Conf. Proc. 1493, 2012, pp. 972–976.

[13] STEHLÍK, M.–ECONOMOU, P.–KISEL’ÁK, J.–RICHTER, W. D.: Kullback-Leibler life time testing, Appl. Math. Comput. 240 (2014), 122–139.

[14] STEHLÍK, M.–OSOSKOV, G. A.: Efficient testing of the homogeneity, scale parameters and number of components in the Rayleigh mixture. JINR Rapid Communications E-11-2003-116, 2003, 1493.

[15] STEHLÍK, M.–WAGNER, H.: Exact likelihood ratio testing for homogeneity of the exponential distribution, Comm. Statist. Simulation Comput. 40 (2011), 663–684.

[16] WILKS, S. S.: Mathematical Statistics, New York – London, J. Wiley and Sons, 1962.


A Maple implementation of the approximation (4.4) and (4.5) of the function LW(0, −e^{−t})

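# Euler polygon on [a, b] with n steps, evaluated at x; approximates LW(0, -exp(-x)).
# The argument list is reconstructed from the variables used in the body.
Euler_W0 := proc(a, b, n, x)
  local knot, f_knot;
  knot := a;
  # exact starting value; LambertW was called W in older Maple releases
  f_knot := evalf(LambertW(0, -exp(-a)));
  # advance one knot at a time while x lies more than one step ahead
  while (b-a)/n < x-knot do
    knot := knot + (b-a)/n;
    f_knot := f_knot + (b-a)/n/(1+f_knot) - (b-a)/n
  od;
  # linear piece from the last knot with slope -f_knot/(1+f_knot)
  RETURN( f_knot - f_knot/(1+f_knot)*(x-knot) )
end;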
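For readers without Maple, the same Euler polygon can be sketched in Python. This is a transcription under assumptions, not the authors' code: the starting value is obtained by plain bisection on w e^w = −e^{−a} instead of Maple's Lambert W evaluator, and the function names are illustrative.

```python
import math

def lambert_w0_neg_exp(t, iters=200):
    """LW(0, -e^{-t}) for t >= 1, by bisecting w*e^w = -e^{-t} on [-1, 0]."""
    z = -math.exp(-t)
    lo, hi = -1.0, 0.0            # w*e^w increases from -1/e to 0 on this interval
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def euler_w0(a, b, n, x):
    """Euler polygon with n steps on [a, b], evaluated at x in [a, b]."""
    h = (b - a) / n
    knot = a
    f_knot = lambert_w0_neg_exp(a)          # exact value at the first knot
    while h < x - knot:                     # same loop condition as the Maple routine
        knot += h
        f_knot += h * (-f_knot / (1.0 + f_knot))
    # final linear piece from the last knot reached
    return f_knot - f_knot / (1.0 + f_knot) * (x - knot)

a, b, x = 1.5, 4.0, 3.0
exact = lambert_w0_neg_exp(x)
err_100 = euler_w0(a, b, 100, x) - exact
err_1000 = euler_w0(a, b, 1000, x) - exact
print(exact, err_100, err_1000)
```

In this example the polygon lies above the exact value and the error shrinks as n grows, in line with Lemma 4.1 b) and Theorem 4.1.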


Proof of Lemma 3.1

It holds that


so that x-ln x attains its global minimum 1 at x= 1, and


It is evident that for all t ≥ 1 there exist solutions x1(t) ∈ (0, 1] and x2(t) ∈ [1, ∞) of the equation x-ln x= t. For the sake of simplicity, we will use the notation x1 and x2 instead of x1 (t) and x2 (t) below.

Moreover, the equation x − ln x = t is equivalent to the equation (−x)e^{−x} = −e^{−t}, so that, according to the definition of LW(t), we have −x = LW(−e^{−t}), cf. (3.8). Now it is sufficient to determine which branch of the multifunction LW(t) contains each solution. Since x1 ∈ (0, 1] and x2 ∈ [1, +∞), it holds that

$$-x_1 = LW(0, -e^{-t}) \in [-1, 0), \qquad -x_2 = LW(-1, -e^{-t}) \in (-\infty, -1],$$

and the proof is complete. □
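The correspondence between the two roots and the two real branches can be verified numerically. The sketch below solves x − ln x = t by bisection for one value of t and checks that both roots satisfy (−x)e^{−x} = −e^{−t}; the function name and the choice t = 2 are illustrative.

```python
import math

def solve_x_minus_log(t, lo, hi, iters=200):
    """Bisection for x - ln(x) = t on [lo, hi]; the ends must give opposite signs."""
    def f(x):
        return x - math.log(x) - t
    lo_positive = f(lo) > 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0.0) == lo_positive:
            lo = mid              # mid lies on the same side of the root as lo
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = 2.0
x1 = solve_x_minus_log(t, math.exp(-t - 1.0), 1.0)   # root in (0, 1]
x2 = solve_x_minus_log(t, 1.0, 2.0 * t + 2.0)        # root in [1, +inf)

# -x1 and -x2 are then the two real values of the Lambert W multifunction at -e^{-t}.
residuals = [(-x) * math.exp(-x) + math.exp(-t) for x in (x1, x2)]
print(x1, x2, residuals)
```

The value −x1 lies in [−1, 0) and so belongs to the branch with index 0, while −x2 ≤ −1 belongs to the branch with index −1.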

Proof of Lemma 3.2

The real function F(a, b) = b e^b − a is continuously differentiable in both variables. From the properties of the Lambert W-function we know that, for any a0 ∈ (−e^{−1}, 0), the terms b01 = LW(0, a0) and b02 = LW(−1, a0) are real, and (a0, b01) and (a0, b02) are solutions of the equation F(a, b) = 0. Moreover, F_b(a, b) = e^b + b e^b, so that F_b(a0, b01) ≠ 0 ≠ F_b(a0, b02), because b01 ≠ −1 ≠ b02 (⇔ a0 ≠ −e^{−1}).

Applying the implicit function theorem at the points (a0, b01) and (a0, b02), a0 ∈ (−e^{−1}, 0), we find that there exist continuously differentiable functions b_i: (−e^{−1}, 0) → ℝ, i = 1, 2, satisfying the equality F(a, b_i(a)) = 0, i = 1, 2, for a ∈ (−e^{−1}, 0). However, b_1(a) = LW(0, a) and b_2(a) = LW(−1, a). Since −e^{−x} ∈ (−e^{−1}, 0) exactly when x > 1, the functions LW(0, −e^{−x}) and LW(−1, −e^{−x}) are continuously differentiable on the interval (1, +∞).

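Differentiation of the equality (3.8) leads to

$$e^{LW(k,z)}\,\bigl(1 + LW(k,z)\bigr)\,\frac{d}{dz}LW(k,z) = 1,$$

from which

$$\frac{d}{dz}LW(k,z) = \frac{1}{e^{LW(k,z)}\,\bigl(1 + LW(k,z)\bigr)}.$$

Multiplying the right-hand side by LW(k,z)/LW(k,z) we get

$$\frac{d}{dz}LW(k,z) = \frac{LW(k,z)}{LW(k,z)\,e^{LW(k,z)}\,\bigl(1 + LW(k,z)\bigr)},$$

so that

$$\frac{d}{dz}LW(k,z) = \frac{LW(k,z)}{z\,\bigl(1 + LW(k,z)\bigr)}. \tag{7.1}$$

Substituting z = −e^{−x} and −z = e^{−x} = dz/dx into (7.1) we finally obtain

$$\frac{d}{dz}LW(k,z)\,\frac{dz}{dx} = -\frac{LW(k,-e^{-x})}{1 + LW(k,-e^{-x})},$$

with the left-hand side being (d/dx) LW(k, −e^{−x}), which finishes the proof. □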
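The derivative obtained in this proof, (d/dx) LW(k, −e^{−x}) = −LW(k, −e^{−x})/(1 + LW(k, −e^{−x})), is the slope used in the Maple routine above. It can be compared with a central finite difference on both real branches. The sketch below evaluates LW(k, −e^{−x}) as minus a root of s − ln s = x, following Lemma 3.1; all names and the test point x = 2 are illustrative.

```python
import math

def lw_neg_exp(k, x, iters=200):
    """LW(k, -e^{-x}) for k in {0, -1} and x > 1, as minus a root of s - ln(s) = x."""
    if k == 0:
        lo, hi = math.exp(-x - 1.0), 1.0     # bracket for the root in (0, 1]
    else:
        lo, hi = 1.0, 2.0 * x + 2.0          # bracket for the root in [1, +inf)
    lo_positive = (lo - math.log(lo) - x) > 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ((mid - math.log(mid) - x) > 0.0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return -0.5 * (lo + hi)

def closed_form_slope(k, x):
    """Right-hand side of the derivative formula: -LW / (1 + LW)."""
    w = lw_neg_exp(k, x)
    return -w / (1.0 + w)

def numeric_slope(k, x, h=1e-5):
    """Central finite difference of x -> LW(k, -e^{-x})."""
    return (lw_neg_exp(k, x + h) - lw_neg_exp(k, x - h)) / (2.0 * h)

for k in (0, -1):
    print(k, closed_form_slope(k, 2.0), numeric_slope(k, 2.0))
```

The slope is positive on the branch with index 0 and negative on the branch with index −1, which matches the monotonicity used in Lemmas 4.1 and 4.3.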

Note 6. By Lemmas 3.1 and 3.2 it is easy to derive the distribution function F_Y(t) and the density f_Y(t) of the random variable Y = X − ln X, where X ∼ Exp(1) and P(X < x) = 1 − e^{−x}. Indeed,


where x1 and x2 are the real numbers guaranteed by Lemma 3.1. We can therefore write


from which for all t ∈ [1, +∞)


being the expression (3.7).
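The distribution function in Note 6 can be checked by simulation. Since Y ≤ t holds exactly when x1 ≤ X ≤ x2, one gets F_Y(t) = e^{−x1} − e^{−x2} for t ≥ 1; this closed form is written out here as a reading of the argument and should be compared with (3.7). The sketch below evaluates it for t = 2 and compares it with an empirical frequency; the seed, sample size, and names are arbitrary choices.

```python
import math
import random

def roots(t, iters=200):
    """The two solutions x1 <= 1 <= x2 of x - ln(x) = t, for t >= 1."""
    def bisect(lo, hi):
        lo_positive = (lo - math.log(lo) - t) > 0.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if ((mid - math.log(mid) - t) > 0.0) == lo_positive:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return bisect(math.exp(-t - 1.0), 1.0), bisect(1.0, 2.0 * t + 2.0)

def cdf_y(t):
    """F_Y(t) = P(x1 <= X <= x2) for X ~ Exp(1); zero below the minimum value 1."""
    if t < 1.0:
        return 0.0
    x1, x2 = roots(t)
    return math.exp(-x1) - math.exp(-x2)

rng = random.Random(12345)       # fixed seed so the run is reproducible
n, t = 200_000, 2.0
hits = 0
for _ in range(n):
    x = rng.expovariate(1.0)
    # x == 0 would give Y = +inf, which never satisfies Y <= t
    if x > 0.0 and x - math.log(x) <= t:
        hits += 1
empirical = hits / n
print(cdf_y(t), empirical)
```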

Proof of Lemma 4.1

a) We use mathematical induction to show that y(t_i) < 0 for all i = 0, 1, …, n, and that y(.) is increasing on the intervals [t_i, t_{i+1}]. As y(.) is piecewise linear, the assertion will follow.

1° Let i = 0. Then y(t0) = LW(0, −e^{−a}) < 0 and for all t ∈ [t0, t1]


i.e., y(.) is increasing on the interval [t0, t1].

2° Let i > 0 and h = (b − a)/n. Then y(t_i) = y(t_{i−1}) + [−y(t_{i−1})/(1 + y(t_{i−1}))] h, and it is evident that y(t_i) < 0 if 1 + y(t_{i−1}) > h. According to the induction assumption, y(t_{i−1}) ≥ y(t0) > −1, so that it is sufficient to take


According to the induction assumption, y(t_i) ≥ y(t0) > −1. Moreover, y(t_{i+1}) = y(t_i) + [−y(t_i)/(1 + y(t_i))] h and we have just shown that y(t_i) < 0, from which y(t_{i+1}) > y(t_i). Therefore, y(.) is increasing on the interval [t_i, t_{i+1}].

b) This proof also uses mathematical induction. We will show that y(t) > u(t) for all t ∈ (t_i, t_{i+1}], i = 0, 1, …, n−1.

1° Let i = 0. Then, from the definition of y(t), we have y(t0) = u(t0) and dy(t)/dt|_{t=t0} = −u(t0)/(1 + u(t0)) = du(t)/dt|_{t=t0}. From the strict concavity of u(.) and the linearity of y(.) on the interval [t0, t1] we get y(t) > u(t) for all t ∈ (t0, t1].

2° Let i > 0 and assume that y(t) ≤ u(t) for a certain t ∈ (t_i, t_{i+1}]. From the induction assumption, we have y(t_i) > u(t_i). From the continuity of y(.) and u(.) and from the nonlinearity of u(.) there exists T ∈ (t_i, t_{i+1}] such that (T, u(T)) = (T, y(T)), i.e., the graphs of the functions u(.) and y(.) intersect, and


Note that y(.) is increasing, u(T) = y(T), and the induction assumption implies u(t_i) < y(t_i). These facts yield that there exists T0 ∈ (t_i, T) such that u(T0) = y(t_i). Further,


and together with (7.3) we obtain


which contradicts the fact that the first derivative is decreasing, because d²u(t)/dt² < 0 on [a, b]. □

Proof of Lemma 4.2

Since u(.) is increasing, it is sufficient to take x > y and prove the inequality without the absolute value. Lagrange’s mean value theorem assures that there exists t0 ∈ [y, x] such that


Since d²u(t)/dt² < 0 for all t ∈ [a, b], we have


so that, using Lemma 3.2, we get


Finally, from (7.4), (7.5) and (7.6), we get the assertion of Lemma 4.2. □

Proof of Theorem 4.1

Let d_i denote the difference between the two functions at the grid points (knots), i.e., d_i = y(t_i) − u(t_i). Using Lagrange’s mean value theorem, we have, for i ≥ 0,


For i = 0 we use the facts that y(t0) = u(t0), d0 = 0, u(.) is increasing, and u(t0) > −1; hence


Now, again using Lagrange’s mean value theorem, we get


Let 0 < i < n and u(T) − y(t_i) > 0. Then, using the fact that both functions u(.) and y(.) are increasing, that y(t) ≥ u(t) > −1 for all t ∈ [a, b] (this follows from Lemma 4.1), and formula (7.7), we obtain that


and by the inequality


which follows directly from Lemma 4.2, we have


where B = c/(1 + u(t0))² is a positive constant. If 0 < i < n and u(T) − y(t_i) ≤ 0, then we obtain (7.9) immediately, since by (7.7): d_{i+1} ≤ d_i < d_i + B h².

An iterative use of (7.9) leads to


From (7.8) it follows that lim_{n→+∞} d1 = 0, which ensures that the last term in (7.10) converges to 0 for n → +∞. Since it does not depend on i, we have


The function y(.) is linear on each interval [t_i, t_{i+1}], i = 0, 1, …, n−1, and u(.) is concave. Therefore, the function y(t) − u(t) is convex on each [t_i, t_{i+1}], and since it is continuous, it can easily be shown that its supremum is attained at one of the grid points t_i or t_{i+1}. Lemma 4.1 ensures that y(t) ≥ u(t) for all t ∈ [a, b]. It follows that


Finally, using (7.11) we get


which proves the uniform convergence. □

Proof of Lemma 4.3

a) The proof follows the approach used in the proof of Lemma 4.1 a). In step 1° it is sufficient to use the fact that


and in step 2° the fact that


b) We proceed analogously to the proof of Lemma 4.1 b). In step 1°, the only difference is that we use the strict convexity of u(.). In step 2° we can show easily that du(t)/dt|_{t=T0} ≥ du(t)/dt|_{t=T}, which contradicts the growth of the first derivative, because d²u(t)/dt² > 0 for all t ∈ [a, b]. □

Proof of Lemma 4.4

Since u(.) is decreasing, it is enough to prove that u(y) − u(x) < c(x − y) for all x > y. As in the proof of Lemma 4.2, we get

$$\exists\, t_0 \in [y, x]:\quad u(y) - u(x) = \left.\frac{du(t)}{dt}\right|_{t=t_0} (y - x). \tag{7.12}$$

Since d²u(t)/dt² > 0 for all t ∈ [a, b], it is true that


Moreover, according to Lemma 3.2 we have


Substituting (7.13) and (7.14) into (7.12), we finally get


which completes the proof. □

Proof of Theorem 4.2

In this proof we again denote by d_i the difference at the grid points (knots); however, now d_i = u(t_i) − y(t_i). Then, analogously to the proof of Theorem 4.1, we have


If i = 0, then y(t0) = u(t0) and d0 = 0. Further, u(.) is decreasing and u(t) < -1 for all t ∈ [a, b]. Similarly to the proof of Theorem 4.1 we have


Let 0 < i < n and y(t_i) − u(T) > 0. Then, using the fact that both functions u(.) and y(.) are decreasing, that y(t) < u(t) < −1 for all t ∈ [a, b] (see Lemma 4.3), and (7.15), we get, similarly to the proof of Theorem 4.1, that


Finally, we use the inequality u(t_i) − u(t_{i+1}) < c(t_{i+1} − t_i) = ch, which follows directly from Lemma 4.4, to get


where B = c/(1 + u(t0))² > 0 is a positive constant. Further, we follow the same approach as in the proof of Theorem 4.1. □

About the article

E-mail Milan.Stehlik@uv.cl

Received: 2016-05-12

Accepted: 2017-09-22

Published Online: 2018-10-20

Published in Print: 2018-10-25

Communicated by Gejza Wimmer

Citation Information: Mathematica Slovaca, Volume 68, Issue 5, Pages 1149–1172, ISSN (Online) 1337-2211, ISSN (Print) 0139-9918, DOI: https://doi.org/10.1515/ms-2017-0177.

© 2018 Mathematical Institute Slovak Academy of Sciences.