
Approximation of Information Divergences for Statistical Learning with Applications

Milan Stehlík, Ján Somorčík, Luboš Střelec and Jaromír Antoch
From the journal Mathematica Slovaca

Abstract

In this paper we give a partial response to one of the most important statistical questions, namely, what optimal statistical decisions are and how they are related to (statistical) information theory. We exemplify the necessity of understanding the structure of information divergences and their approximations, which may in particular be understood through deconvolution. Deconvolution of information divergences is illustrated in the exponential family of distributions, leading to the optimal tests in the Bahadur sense. We provide a new approximation of I-divergences using the Fourier transformation, saddle point approximation, and uniform convergence of the Euler polygons. Uniform approximation of deconvoluted parts of I-divergences is also discussed. Our approach is illustrated on a real data example.

MSC 2010: 62E17; 62F03; 65L20; 33E30

We would like to extend our gratitude for the support from Fondecyt Proyecto Regular No. 1151441 and LIT-2016-1-SEE-023. This work was also supported by Grants P403/15/09663S and GA16-07089S of the Czech Science Foundation, and grant VEGA No. 2/0047/15. Support from the BELSPO IAP P7/06 StUDyS network is also gratefully acknowledged. The authors are very grateful to the Editor, Associate Editor and anonymous referees for their valuable comments and extremely careful reading.




Communicated by Gejza Wimmer


Appendix

A Maple implementation of the approximations (4.4) and (4.5) of the function LW(0, -e^{-t})

Euler_W0 := proc(x)
    local knot, f_knot;
    # the interval endpoints a, b and the number of knots n are global parameters;
    # LambertW is Maple's built-in Lambert W function (denoted LW in the paper)
    knot := a;
    f_knot := evalf(LambertW(0, -exp(-a)));    # exact value at the left endpoint
    while (b-a)/n < x - knot do
        # Euler step for y' = -y/(1+y): y + h/(1+y) - h, with step h = (b-a)/n
        knot := knot + (b-a)/n;
        f_knot := f_knot + (b-a)/n/(1+f_knot) - (b-a)/n
    od;
    # linear extrapolation from the last knot to x
    RETURN( f_knot - f_knot/(1+f_knot)*(x-knot) )
end proc;
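For illustration, the procedure might be called as follows (a minimal sketch; the endpoints a = 1.1, b = 4.0, the number of knots n = 1000 and the evaluation point x = 2.5 are arbitrary choices, and a, b, n must be assigned globally before the call):

a := 1.1:  b := 4.0:  n := 1000:           # grid on [a, b] with step (b-a)/n
approx := Euler_W0(2.5);                   # Euler polygon value at x = 2.5
exact := evalf(LambertW(0, -exp(-2.5)));   # exact value of LW(0, -e^{-2.5})
abs(approx - exact);                       # error of order (b-a)/n, cf. Theorem 4.1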

Proof of Lemma 3.1

It holds that

\[
\frac{d}{dx}\,[x - \ln x] = 1 - \frac{1}{x}\;
\begin{cases}
> 0, & x \in (1, +\infty), \\
< 0, & x \in (0, 1),
\end{cases}
\]

so that x - ln x attains its global minimum 1 at x = 1, and

\[
\lim_{x \to +\infty} [x - \ln x] = \lim_{x \to 0^{+}} [x - \ln x] = +\infty.
\]

It is evident that for all t ≥ 1 there exist solutions x_1(t) ∈ (0, 1] and x_2(t) ∈ [1, +∞) of the equation x - ln x = t. For the sake of simplicity, we will use the notation x_1 and x_2 instead of x_1(t) and x_2(t) below.

Moreover, the equation x - ln x = t is equivalent to the equation (-x)e^{-x} = -e^{-t}, so that, according to the definition of LW, we have -x = LW(-e^{-t}), cf. (3.8). Now it is sufficient to determine in which branch of the multifunction LW the solutions are contained. Since x_1 ∈ (0, 1] and x_2 ∈ [1, +∞), it holds that

\[
x_1 = -LW(0, -e^{-t}) \qquad \text{and} \qquad x_2 = -LW(-1, -e^{-t}),
\]

and the proof is complete. □
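As a quick numerical illustration (a sketch in Maple; LambertW denotes Maple's built-in Lambert W function, and t = 2 is an arbitrary value exceeding 1):

x1 := evalf(-LambertW(0, -exp(-2.0)));    # the solution of x - ln x = 2 lying in (0, 1]
x2 := evalf(-LambertW(-1, -exp(-2.0)));   # the solution of x - ln x = 2 lying in [1, +infinity)
x1 - ln(x1), x2 - ln(x2);                 # both evaluate to 2.0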

Proof of Lemma 3.2

The real function F(a, b) = b e^{b} - a is continuously differentiable in both variables. From the properties of the Lambert W-function we know that, for any a_0 ∈ (-e^{-1}, 0), the values b_{01} = LW(0, a_0) and b_{02} = LW(-1, a_0) are real, and (a_0, b_{01}) and (a_0, b_{02}) are solutions of the equation F(a, b) = 0. Moreover, $\frac{\partial F}{\partial b}(a, b) = e^{b} + b e^{b}$, so that $\frac{\partial F}{\partial b}(a_0, b_{01}) \neq 0 \neq \frac{\partial F}{\partial b}(a_0, b_{02})$, because b_{01} ≠ -1 ≠ b_{02} (⇔ a_0 ≠ -e^{-1}).

Applying the implicit function theorem at the points (a_0, b_{01}) and (a_0, b_{02}), a_0 ∈ (-e^{-1}, 0), we find that there exist continuously differentiable functions b_i: (-e^{-1}, 0) → ℝ, i = 1, 2, satisfying the equality F(a, b_i(a)) = 0, i = 1, 2, for a ∈ (-e^{-1}, 0). However, b_1(a) = LW(0, a) and b_2(a) = LW(-1, a), i.e., the functions LW(0, -e^{-x}) and LW(-1, -e^{-x}) are continuously differentiable on the interval (1, +∞), since -e^{-x} runs through (-e^{-1}, 0) exactly when x ∈ (1, +∞).

Differentiating the identity (3.8) with respect to z leads to

\[
\Big[\frac{d}{dz}LW(k,z)\Big] e^{LW(k,z)} + LW(k,z)\, e^{LW(k,z)} \Big[\frac{d}{dz}LW(k,z)\Big] = 1,
\]

from which

\[
\frac{d}{dz}LW(k,z) = \frac{1}{e^{LW(k,z)} + \underbrace{LW(k,z)\, e^{LW(k,z)}}_{=\,z}}.
\]

Multiplying both sides by $-\frac{LW(k,z)}{LW(k,z)}$ we get

\[
-\frac{d}{dz}LW(k,z) = \frac{-LW(k,z)}{\underbrace{LW(k,z)\, e^{LW(k,z)}}_{=\,z} + z\, LW(k,z)},
\]

so that

\[
\Big[\frac{d}{dz}LW(k,z)\Big](-z) = \frac{-LW(k,z)}{1 + LW(k,z)}. \tag{7.1}
\]

Substituting z = -e^{-x} and -z = e^{-x} = dz/dx into (7.1) we finally obtain

\[
\Big[\frac{d}{dz}LW(k,z)\Big]\frac{dz}{dx} = \frac{-LW(k,-e^{-x})}{1 + LW(k,-e^{-x})},
\]

with the left-hand side being, by the chain rule, $\frac{d}{dx}LW(k,-e^{-x})$, which finishes the proof. □
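The derivative formula can also be confirmed symbolically (a sketch in Maple; the same check works for the branch k = -1):

W := LambertW(0, -exp(-x)):           # the branch k = 0 at z = -e^{-x}
simplify( diff(W, x) + W/(1 + W) );   # returns 0, i.e. dW/dx = -W/(1 + W)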

Note 6. By Lemmas 3.1 and 3.2 it is easy to derive the distribution function F_Y(t) and the density f_Y(t) of the random variable Y = X - ln X, where X ∼ Exp(1), i.e., P(X < x) = 1 - e^{-x}. Indeed,

\[
F_Y(t) = P(X - \ln X < t) = P(x_1 < X < x_2),
\]

where x1 and x2 are the real numbers guaranteed by Lemma 3.1. We can therefore write

\[
F_Y(t) = P\big({-LW(0,-e^{-t})} < X < -LW(-1,-e^{-t})\big) = e^{LW(0,-e^{-t})} - e^{LW(-1,-e^{-t})}, \tag{7.2}
\]

from which for all t ∈ (1, +∞)

\[
f_Y(t) = \frac{d}{dt}F_Y(t) = \frac{d}{dt}\Big[e^{LW(0,-e^{-t})} - e^{LW(-1,-e^{-t})}\Big]
= \frac{LW(-1,-e^{-t})}{1 + LW(-1,-e^{-t})}\, e^{LW(-1,-e^{-t})} - \frac{LW(0,-e^{-t})}{1 + LW(0,-e^{-t})}\, e^{LW(0,-e^{-t})},
\]

being the expression (3.7).
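A numerical sanity check of (7.2) and (3.7) (a sketch in Maple; the evaluation point t = 2 is arbitrary):

FY := t -> exp(LambertW(0, -exp(-t))) - exp(LambertW(-1, -exp(-t))):    # cf. (7.2)
fY := t -> LambertW(-1, -exp(-t))/(1 + LambertW(-1, -exp(-t)))*exp(LambertW(-1, -exp(-t)))
         - LambertW(0, -exp(-t))/(1 + LambertW(0, -exp(-t)))*exp(LambertW(0, -exp(-t))):
evalf(D(FY)(2.0)), evalf(fY(2.0));    # both yield approx. 0.224, i.e. f_Y = F_Y'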

Proof of Lemma 4.1

a) We use mathematical induction to show that y(t_i) < 0 for all i = 0, 1, …, n, and that y(.) is increasing on the intervals [t_i, t_{i+1}]. As y(.) is piecewise linear, the assertion will follow.

1° Let i = 0, then y(t_0) = LW(0, -e^{-a}) < 0 and for all t ∈ [t_0, t_1]

\[
\frac{d}{dt}y(t) = \frac{-LW(0,-e^{-a})}{1 + LW(0,-e^{-a})} > 0,
\]

i.e., y(.) is increasing on the interval [t_0, t_1].

2° Let i > 0 and h = (b - a)/n. Then $y(t_i) = y(t_{i-1}) + \frac{-y(t_{i-1})}{1 + y(t_{i-1})}\,h$ and it is evident that y(t_i) < 0 if 1 + y(t_{i-1}) > h. According to the induction assumption, y(t_{i-1}) ≥ y(t_0) > -1, so that it is sufficient to take

\[
n > \frac{b-a}{1 + y(t_0)} = \frac{b-a}{1 + u(t_0)}.
\]

According to the induction assumption, y(t_i) ≥ y(t_0) > -1. Moreover, $y(t_{i+1}) = y(t_i) + \frac{-y(t_i)}{1 + y(t_i)}\,h$, and we have just shown that -1 < y(t_i) < 0, from which y(t_{i+1}) > y(t_i). Therefore, y(.) is increasing on the interval [t_i, t_{i+1}].

b) This proof also uses mathematical induction. We will show that for all t ∈ (t_i, t_{i+1}], i = 0, 1, …, n - 1, it holds that y(t) > u(t).

1° Let i = 0. Then, from the definition of y(t), we have y(t_0) = u(t_0) and $\frac{dy(t)}{dt}\big|_{t=t_0} = \frac{-u(t_0)}{1 + u(t_0)} = \frac{du(t)}{dt}\big|_{t=t_0}$. From the strict concavity of u(.) and the linearity of y(.) on the interval [t_0, t_1] we get for all t ∈ (t_0, t_1] that y(t) > u(t).

2° Let i > 0 and assume that y(t) ≤ u(t) for a certain t ∈ (t_i, t_{i+1}]. Then, from the induction assumption, we have y(t_i) > u(t_i). From the continuity of y(.) and u(.) and from the nonlinearity of u(.) there exists T ∈ (t_i, t_{i+1}] such that (T, u(T)) = (T, y(T)), i.e., the graphs of the functions u(.) and y(.) intersect, and

\[
\frac{-y(t_i)}{1 + y(t_i)} = \frac{dy(t)}{dt}\Big|_{t=T} \leq \frac{du(t)}{dt}\Big|_{t=T}. \tag{7.3}
\]

Note that y(.) is increasing, u(T) = y(T), and the induction assumption implies u(t_i) < y(t_i). These facts yield that there exists T_0 ∈ (t_i, T) such that u(T_0) = y(t_i). Further,

\[
\frac{-y(t_i)}{1 + y(t_i)} = \frac{-u(T_0)}{1 + u(T_0)} = \frac{du(t)}{dt}\Big|_{t=T_0}
\]

and together with (7.3) we obtain

\[
\frac{du(t)}{dt}\Big|_{t=T_0} \leq \frac{du(t)}{dt}\Big|_{t=T},
\]

being a contradiction with the decrease of the first derivative, because $\frac{d^2}{dt^2}u(t) < 0$ on [a, b]. □
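The concavity invoked in part b) can be made explicit: differentiating the relation du/dt = -u/(1+u) of Lemma 3.2 once more gives d²u/dt² = -(du/dt)/(1+u)² < 0, since du/dt > 0 on this branch. A numerical spot check (a sketch in Maple; the test points are arbitrary):

u := LambertW(0, -exp(-t)):
u2 := unapply(diff(u, t, t), t):                    # second derivative of u
evalf(u2(1.5)), evalf(u2(2.0)), evalf(u2(3.0));     # all negative: u is strictly concave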

Proof of Lemma 4.2

Since u(.) is increasing, it is sufficient to take x > y and prove the inequality without the absolute value. Lagrange's mean value theorem assures that there exists t_0 ∈ [y, x] such that

\[
u(x) - u(y) = \frac{du(t)}{dt}\Big|_{t=t_0}(x - y). \tag{7.4}
\]

Since $\frac{d^2}{dt^2}u(t) < 0$ for all t ∈ [a, b], we have

\[
\frac{du(t)}{dt}\Big|_{t=a} > \frac{du(t)}{dt}\Big|_{t=t_0}, \tag{7.5}
\]

so that, using Lemma 3.2, we get

\[
\frac{du(t)}{dt}\Big|_{t=a} = \frac{-u(a)}{1 + u(a)} =: c. \tag{7.6}
\]

Finally, from (7.4), (7.5) and (7.6), we get the assertion of Lemma 4.2. □
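The role of the constant c from (7.6) can be checked numerically (a sketch in Maple; the left endpoint a = 1.5 and the test points 2.0, 3.0 are arbitrary):

u := t -> LambertW(0, -exp(-t)):
c := evalf(-u(1.5)/(1 + u(1.5)));   # the constant c of (7.6) for a = 1.5, approx. 0.43
evalf(abs(u(3.0) - u(2.0)));        # approx. 0.11 < c*(3.0 - 2.0), as Lemma 4.2 asserts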

Proof of Theorem 4.1

Let us denote by d_i the difference between y and u at the grid points (knots), i.e., d_i = y(t_i) - u(t_i). Using Lagrange's mean value theorem, we have, for i ≥ 0,

\[
d_{i+1} = y(t_{i+1}) - u(t_{i+1}) = y(t_i) + \frac{-y(t_i)}{1 + y(t_i)}\,h - u(t_i) - \frac{du(t)}{dt}\Big|_{t=T}\,h
= d_i + \Big[\frac{u(T)}{1 + u(T)} - \frac{y(t_i)}{1 + y(t_i)}\Big]\,h
= d_i + \frac{u(T) - y(t_i)}{(1 + y(t_i))(1 + u(T))}\,h, \qquad T \in [t_i, t_{i+1}]. \tag{7.7}
\]

For i = 0 we use the facts that y(t_0) = u(t_0), d_0 = 0, u(.) is increasing, and u(t_0) > -1; hence

\[
d_1 = \frac{u(T) - u(t_0)}{(1 + u(t_0))(1 + u(T))}\,h < \frac{u(T) - u(t_0)}{(1 + u(t_0))^2}\,h. \tag{7.8}
\]

Now, again using Lagrange’s mean value theorem, we get

\[
d_1 < \frac{\frac{d}{ds}u(s)\,(T - t_0)}{(1 + u(t_0))^2}\,h = \frac{-u(s)}{1 + u(s)}\cdot\frac{T - t_0}{(1 + u(t_0))^2}\,h
\leq \frac{-u(t_0)}{(1 + u(t_0))^3}\,h^2 = \frac{-u(t_0)}{(1 + u(t_0))^3}\Big(\frac{b-a}{n}\Big)^2, \qquad s \in [t_0, T].
\]

Let 0 < i < n and u(T) - y(t_i) > 0. Then, using the fact that both functions u(.) and y(.) are increasing, that y(t) ≥ u(t) > -1 for all t ∈ [a, b] (this follows from Lemma 4.1), and formula (7.7), we obtain that

\[
d_{i+1} \leq d_i + \frac{u(T) - y(t_i)}{(1 + y(t_0))(1 + u(t_0))}\,h = d_i + \frac{u(T) - y(t_i)}{(1 + u(t_0))^2}\,h < d_i + \frac{u(t_{i+1}) - u(t_i)}{(1 + u(t_0))^2}\,h
\]

and by the inequality

\[
u(t_{i+1}) - u(t_i) < c\,(t_{i+1} - t_i) = c\,h,
\]

which follows directly from Lemma 4.2, we have

\[
d_{i+1} < d_i + B\,h^2, \tag{7.9}
\]

where $B = \frac{c}{(1 + u(t_0))^2}$ is a positive constant. If 0 < i < n and u(T) - y(t_i) ≤ 0, then we obtain (7.9) immediately, since by (7.7), d_{i+1} ≤ d_i < d_i + B h^2.

An iterative use of (7.9) leads to

\[
d_{i+1} < [d_{i-1} + Bh^2] + Bh^2 < \cdots < d_1 + iBh^2 < d_1 + nBh^2 = d_1 + nB\Big(\frac{b-a}{n}\Big)^2 = d_1 + \frac{B(b-a)^2}{n}. \tag{7.10}
\]

From (7.8) it follows that $\lim_{n \to +\infty} d_1 = 0$, which ensures that the last term in (7.10) converges to 0 as n → +∞. Since it does not depend on i, we have

\[
\forall\, \varepsilon > 0\ \exists\, n_0\ \forall\, n > n_0\ \forall\, i \in \{0, 1, \dots, n\}:\quad d_i < \varepsilon. \tag{7.11}
\]

The function y(.) is linear on each interval [t_i, t_{i+1}], i = 0, 1, …, n - 1, and u(.) is concave. Therefore, the function y(t) - u(t) is convex on each [t_i, t_{i+1}], and since it is continuous, it can be shown easily that its supremum is attained at one of the grid points t_i or t_{i+1}. Lemma 4.1 ensures that y(t) ≥ u(t) for all t ∈ [a, b]. It follows that

\[
\sup_{t \in [a,b]} |y(t) - u(t)| = \max_{t \in [a,b]}\, [y(t) - u(t)]
= \max_{i \in \{0,1,\dots,n-1\}} \Big[\max_{t \in [t_i, t_{i+1}]} \big(y(t) - u(t)\big)\Big]
= \max_{i \in \{0,1,\dots,n-1\}} \max\{d_i, d_{i+1}\} = \max_{i \in \{0,1,\dots,n\}} d_i.
\]

Finally, using (7.11) we get

\[
\forall\, \varepsilon > 0\ \exists\, n_0\ \forall\, n > n_0:\quad \sup_{t \in [a,b]} |y(t) - u(t)| < \varepsilon,
\]

which proves the uniform convergence. □
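The O(1/n) rate implicit in (7.10) can be observed empirically (a sketch in Maple, reusing the procedure Euler_W0 from the Appendix; the interval [1.1, 4.0] and the evaluation grid of 201 points are arbitrary choices):

a := 1.1:  b := 4.0:
for n in [100, 200, 400] do
    err := max(seq( abs(Euler_W0(a + j*(b-a)/200) - evalf(LambertW(0, -exp(-(a + j*(b-a)/200))))),
                    j = 0 .. 200 )):
    print(n, err)       # the maximal error roughly halves as n doubles
od: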

Proof of Lemma 4.3

a) The proof follows the approach used in the proof of Lemma 4.1 a). In step 1° it is sufficient to use the fact that

\[
\frac{-LW(-1, -e^{-a})}{1 + LW(-1, -e^{-a})} < 0,
\]

and in step 2° the fact that

\[
\frac{-y(t_i)}{1 + y(t_i)} < 0.
\]

b) We proceed analogously to the proof of Lemma 4.1 b). In step 1°, the only difference is that we use the strict convexity of u(.). In step 2° we can show easily that

\[
\frac{du(t)}{dt}\Big|_{t=T_0} \geq \frac{du(t)}{dt}\Big|_{t=T},
\]

which is a contradiction with the growth of the first derivative, because $\frac{d^2}{dt^2}u(t) > 0$ for all t ∈ [a, b]. □
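As in the concave case, the convexity can be seen directly: differentiating du/dt = -u/(1+u) gives d²u/dt² = -(du/dt)/(1+u)², which is positive on this branch because du/dt < 0 there. A numerical spot check (a sketch in Maple; the test points are arbitrary):

u := LambertW(-1, -exp(-t)):
u2 := unapply(diff(u, t, t), t):
evalf(u2(1.5)), evalf(u2(2.0)), evalf(u2(3.0));     # all positive: u is strictly convex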

Proof of Lemma 4.4

Since u(.) is decreasing, it is enough to take x > y and prove that u(y) - u(x) < c(x - y). Similarly as in the proof of Lemma 4.2 we get

\[
\exists\, t_0 \in [y, x]:\quad u(y) - u(x) = \frac{du(t)}{dt}\Big|_{t=t_0}(y - x). \tag{7.12}
\]

Since $\frac{d^2}{dt^2}u(t) > 0$ for all t ∈ [a, b], it is true that

\[
\frac{du(t)}{dt}\Big|_{t=a} < \frac{du(t)}{dt}\Big|_{t=t_0}. \tag{7.13}
\]

Moreover, according to Lemma 3.2 we have

\[
\frac{du(t)}{dt}\Big|_{t=a} = \frac{-u(a)}{1 + u(a)} =: -c. \tag{7.14}
\]

Substituting (7.13) and (7.14) into (7.12), we finally get

\[
u(y) - u(x) = -\frac{du(t)}{dt}\Big|_{t=t_0}(x - y) < -\frac{du(t)}{dt}\Big|_{t=a}(x - y) = c\,(x - y),
\]

which completes the proof. □

Proof of Theorem 4.2

In this proof we again denote by d_i the difference between y and u at the grid points (knots); however, now d_i = u(t_i) - y(t_i). Then, analogously to the proof of Theorem 4.1, we have

\[
d_{i+1} = d_i + \frac{y(t_i) - u(T)}{(1 + y(t_i))(1 + u(T))}\,h, \qquad T \in [t_i, t_{i+1}],\ i \geq 0. \tag{7.15}
\]

If i = 0, then y(t_0) = u(t_0) and d_0 = 0. Further, u(.) is decreasing and u(t) < -1 for all t ∈ [a, b]. Similarly to the proof of Theorem 4.1 we have

\[
d_1 < \frac{u(t_0)}{(1 + u(t_0))^3}\Big(\frac{b-a}{n}\Big)^2. \tag{7.16}
\]

Let 0 < i < n and y(t_i) - u(T) > 0. Then, using the fact that both functions u(.) and y(.) are decreasing, that y(t) < u(t) < -1 for all t ∈ [a, b] (see Lemma 4.3), and (7.15), similarly to the proof of Theorem 4.1 we get that the following inequality holds:

\[
d_{i+1} \leq d_i + \frac{y(t_i) - u(t_{i+1})}{(1 + u(t_0))^2}\,h.
\]

Finally, we use the inequality u(t_i) - u(t_{i+1}) < c(t_{i+1} - t_i) = ch, which follows directly from Lemma 4.4, to get

\[
d_{i+1} < d_i + B\,h^2,
\]

where $B = \frac{c}{(1 + u(t_0))^2}$ is a positive constant. Further, we follow the same approach as in the proof of Theorem 4.1. □
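The Euler polygon for the lower branch can be implemented exactly as Euler_W0 in the Appendix, only the initial value is taken on the branch k = -1 (a sketch; the name Euler_W1 is ours, not the paper's, and a, b, n are again global parameters):

Euler_W1 := proc(x)
    local knot, f_knot;
    knot := a;
    f_knot := evalf(LambertW(-1, -exp(-a)));   # start below -1, on the branch k = -1
    while (b-a)/n < x - knot do
        # the same Euler step: both branches solve y' = -y/(1+y)
        knot := knot + (b-a)/n;
        f_knot := f_knot + (b-a)/n/(1+f_knot) - (b-a)/n
    od;
    RETURN( f_knot - f_knot/(1+f_knot)*(x-knot) )
end proc;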

Received: 2016-05-12
Accepted: 2017-09-22
Published Online: 2018-10-20
Published in Print: 2018-10-25

© 2018 Mathematical Institute Slovak Academy of Sciences
