Exponential inequalities for nonstationary Markov Chains

Exponential inequalities are key tools in machine learning theory. Proving exponential inequalities for non-i.i.d. random variables makes it possible to extend many learning techniques to these variables. Indeed, much work has been done on both inequalities and learning theory for time series in the past 15 years. However, in the non-independent case, almost all the results concern stationary time series. This excludes many important applications: for example, any series with periodic behavior is nonstationary. In this paper, we extend the basic tools of Dedecker and Fan (2015) to nonstationary Markov chains. As an application, we provide a Bernstein-type inequality, and we deduce risk bounds for the prediction of periodic autoregressive processes with an unknown period.


Introduction
Exponential inequalities are cornerstones of machine learning theory. For example, distribution-free generalization bounds were proven by Vapnik and Chervonenkis based on Hoeffding's inequality, see Vapnik (1998). Model selection bounds in Massart (2007) also rely on exponential moment inequalities.
Proving such inequalities in the non-i.i.d. setting is thus crucial to study the generalization ability of machine learning algorithms on time series. As an example, a Bernstein-type inequality for α-mixing time series is proven in Modha and Masry (2002). This result is used by Steinwart and Christmann (2009) to prove generalization bounds for general learning problems with α-mixing observations. Exponential inequalities and machine learning with non-i.i.d. observations have become an important research direction; a more detailed list of references is given below. However, most of these references assume stationarity. That is, only the independence assumption is removed: the observations are still assumed to be identically distributed, or at least ergodic. This excludes many applications. In addition to trends, data related to human activity, such as in industry or economics, exhibit some periodicity (hourly, daily, yearly, ...) and some regime switching. The same remark applies to data with a physical origin, such as in geology or astrophysics.
In this paper, we extend the inequalities proven by Dedecker and Fan (2015) for time-homogeneous Markov chains to non-homogeneous chains. This allows us to study a large set of nonstationary processes. We obtain Bernstein and McDiarmid inequalities as well as moment inequalities. As an application, we study periodic autoregressive processes of the form $X_t = f_t(X_{t-1}) + \varepsilon_t$, where $f_{t+T} = f_t$ for any $t$, for some period $T$. Thanks to our version of Bernstein's inequality, we show that the Empirical Risk Minimizer (ERM) leads to consistent predictions in this setting. We also show that a penalized version of the ERM enjoys the same property even when $T$ is unknown.
The paper is organized as follows. The rest of this introduction is dedicated to the state of the art on exponential inequalities for time series. Section 2 introduces the notation and assumptions used throughout the paper. In Section 3, we state an extension of Proposition 2.1 of Dedecker and Fan (2015): this is Lemma 3.1. As a proof of concept, we use this lemma to prove a version of Bernstein's inequality for nonstationary Markov chains. We also provide Cramér and McDiarmid inequalities based on this lemma. We study periodic autoregressive series in Section 4. Finally, Section 5 contains the proof of Lemma 3.1 and of the results in Section 4.

State of the art
We refer the reader to Boucheron et al. (2013) for an overview on exponential and concentration inequalities in the i.i.d case. This book also provides references for applications of these results to machine learning theory.
Exponential inequalities have been proven for time series under a wide range of assumptions. We refer the reader to Doukhan (2018) for various approaches to modelling time series.
Inequalities for Markov chains $(X_t)_{t\ge 1}$ are proven in Catoni (2003); Adamczak (2008); Bertail and Clémençon (2010); Joulin and Ollivier (2010); Wintenberger (2017); Bertail and Portier (2018); Paulin (2018); Bertail and Ciolek (2019). Note that most of these inequalities require the chain to be time homogeneous. While this does not imply that the chain is stationary, in some sense the $X_t$'s are asymptotically identically distributed in these papers. For example, consider the powerful renewal technique used in Bertail and Ciolek (2019) to prove a version of Bernstein's inequality. The proof is based on the fact that the blocks $(X_{\tau_i}, \dots, X_{\tau_{i+1}-1})$ between two renewal times $\tau_i$ and $\tau_{i+1}$ are actually i.i.d. It is thus possible to apply the i.i.d. version of Bernstein's inequality to these blocks. The spectral technique used in Paulin (2018) still relies on the ergodicity of the Markov chain (we thank the anonymous Referee for pointing out some of these references). Exponential inequalities for hidden Markov chains are given in Kontorovich and Ramanan (2008).
It is a well-known fact that Hoeffding's inequality is valid not only for independent observations, but also for martingale increments (it is sometimes referred to as the Hoeffding-Azuma inequality in this case). Decomposing a function of the process as a sum of martingale increments is actually one of the most powerful techniques to prove exponential inequalities, see Chapter 3 in Boucheron et al. (2013). More exponential inequalities for martingales can be found in Seldin et al. (2012); Rio (2013a); Bercu et al. (2015). This technique is used by Dedecker and Fan (2015); Fan et al. (2018) to prove exponential inequalities for Markov chains.
Markov chains are extremely useful in modelling and simulation. However, many time series have a very different dependence structure. Mixing coefficients make it possible to quantify the dependence between observations without imposing an explicit structure on this dependence. We refer the reader to Rio (2017) for a comprehensive introduction. Exponential inequalities for mixing processes are proven in Samson (2000); Merlevède et al. (2009); Rio (2017); Hang and Steinwart (2017). Mixing series, however, exclude many stochastic processes, as discussed in the monograph Dedecker et al. (2007). Weak dependence coefficients cover a wider range of processes, for which Bernstein-type inequalities are proven for example in Collet et al. (2002); Doukhan and Neumann (2007); Wintenberger (2010); Merlevède et al. (2011); Blanchard and Zadorozhnyi (2017). Dynamical systems are examples of processes where only $X_1$ is random, each $X_t$ then being a deterministic function of $X_{t-1}$. Based on weak dependence arguments, it is possible to prove exponential inequalities for such processes, see Collet et al. (2002).
In this paper we provide tools to prove exponential inequalities for nonstationary, non-homogeneous Markov chains. Rather than the renewal or spectral techniques discussed above, we extend the martingale approach of Dedecker and Fan (2015) to non-homogeneous chains.

Notations
From now on, all the random variables are defined on a probability space $(\Omega, \mathcal{A}, P)$. Let $(\mathcal{X}, d)$ and $(\mathcal{Y}, \delta)$ be two complete separable metric spaces. Let $(\varepsilon_t)_{t\ge 2}$ be a sequence of i.i.d. $\mathcal{Y}$-valued random variables. Let $X_1$ be an $\mathcal{X}$-valued random variable independent of $(\varepsilon_t)_{t\ge 2}$. Let $(X_t)_{t\ge 1}$ be the Markov chain given by
$$X_t = F_t(X_{t-1}, \varepsilon_t), \quad t \ge 2, \tag{1}$$
where the functions $F_t : \mathcal{X} \times \mathcal{Y} \to \mathcal{X}$ are such that
$$\mathbb{E}\big[d\big(F_t(x, \varepsilon_1), F_t(x', \varepsilon_1)\big)\big] \le \rho\, d(x, x') \tag{2}$$
for some constant $\rho \in [0, 1)$, and
$$d\big(F_t(x, y), F_t(x, y')\big) \le C\, \delta(y, y') \tag{3}$$
for some constant $C > 0$. In particular, when $F_t \equiv F$, this is the model studied by Dedecker and Fan (2015). This class of Markov chains, which we call "one-step contracting", contains many relevant examples. The classical AR(1) process is given by $X_t = F(X_{t-1}, \varepsilon_t)$ where $F(x, y) = ax + y$. Condition (3) is satisfied, and Condition (2) is satisfied as soon as $|a| < 1$. Now, consider a time-varying AR(1) process: $X_t = a_t X_{t-1} + \varepsilon_t$. This process may be nonstationary. Condition (3) is still satisfied, and $|F_t(x, y) - F_t(x', y)| \le |a_t| |x - x'|$, so Condition (2) is satisfied as soon as $\sup_t |a_t| < 1$. This process is studied by Bardet and Doukhan (2018) under various assumptions: local stationarity, which means a slow variation of $a_t$ as a function of $t$, see Dahlhaus (1996), and periodicity, that is, $a_{t+T} = a_t$ for any $t$, for some (known) period $T$. If $T$ is unknown, a cross-validation procedure to estimate $T$ is proposed there (Remark 2.4), without a consistency result. Below we will propose a penalized procedure with some statistical guarantees.
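To make the contraction property concrete, the following minimal Python sketch simulates the time-varying AR(1) process from two different starting points driven by the same noise, and checks that the gap between the two paths shrinks at least as fast as $\rho^{t-1}$ with $\rho = \sup_t |a_t|$. The function name `simulate_tv_ar1`, the starting points, the seed and the sample size are illustrative assumptions; the periodic coefficients are borrowed from the simulation study later in the paper.

```python
import random


def simulate_tv_ar1(a, x1, noise):
    """Return [X_1, ..., X_n] with X_t = a_t * X_{t-1} + eps_t for t >= 2.

    a[t - 2] plays the role of a_t and noise[t - 2] the role of eps_t.
    """
    path = [x1]
    for a_t, eps_t in zip(a, noise):
        path.append(a_t * path[-1] + eps_t)
    return path


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
n = 50
period = (0.8, 0.5, 0.9, -0.7)  # coefficients of the simulation study
a = [period[t % 4] for t in range(n - 1)]
rho = max(abs(c) for c in a)  # contraction constant sup_t |a_t|
noise = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]

# Two copies of the chain sharing the same noise but started at different points.
x = simulate_tv_ar1(a, 5.0, noise)
y = simulate_tv_ar1(a, -3.0, noise)

# Since X_t - Y_t = a_t (X_{t-1} - Y_{t-1}), the gap at time t is at most
# rho^(t-1) times the initial gap.
gaps = [abs(u - v) for u, v in zip(x, y)]
bounds = [rho ** t * gaps[0] for t in range(n)]
```

Because the noise is shared, it cancels in the difference of the two paths, so the check only involves the coefficients $a_t$; this is the pointwise counterpart of the one-step contraction assumption for this particular model.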
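As a much more general example, consider the following functional autoregressive model. Let $\mathcal{X}$ be a separable Banach space with norm $|\cdot|$. The functional autoregressive model is defined by $X_t = f_t(X_{t-1}) + \varepsilon_t$, where $f_t : \mathcal{X} \to \mathcal{X}$ is such that $|f_t(x) - f_t(x')| \le \rho |x - x'|$. Clearly (1) and (2) are satisfied once $\rho \in [0, 1)$; see Diaconis and Freedman (1999) for more examples.
We introduce the natural filtration of the chain: $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and, for $t \in \mathbb{N}$, $\mathcal{F}_t = \sigma(X_1, \dots, X_t)$. Consider a separately Lipschitz function $f : \mathcal{X}^n \to \mathbb{R}$, that is, $|f(x_1, \dots, x_n) - f(x_1', \dots, x_n')| \le \sum_{t=1}^n d(x_t, x_t')$. We define $S_n = f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)]$. The objective of what follows is to derive bounds on the tail probabilities $P(|S_n| > x)$.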

Main results
Dedecker and Fan (2015) proved several exponential and moment inequalities when $F_t \equiv F$, by using a martingale decomposition. We will first extend this martingale decomposition to the general case. As an example, we will use it to prove a Bernstein-type inequality. Other inequalities are given in the appendix.

Main lemma: martingale decomposition
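Let $P_{X_1}$ denote the distribution of $X_1$ and $P_\varepsilon$ the (common) distribution of the $\varepsilon_t$'s. Let $G_{X_1}$, $G_\varepsilon$ and $H_{t,\varepsilon}$ be defined by
$$G_{X_1}(x) = \int d(x, x') P_{X_1}(dx'), \quad G_\varepsilon(y) = \int C \delta(y, y') P_\varepsilon(dy'), \quad H_{t,\varepsilon}(x, y) = \int d\big(F_t(x, y), F_t(x, y')\big) P_\varepsilon(dy').$$
We are now in a position to state our main lemma.
Lemma 3.1. Assume (1) and (2). Then: 1. The function $g_t$ is separately Lipschitz and 3. Assume moreover that the $F_t$'s satisfy (3). Then $H_{t,\varepsilon}(x, y) \le G_\varepsilon(y)$, and consequently, for $t \in [2, n]$,
The proof of this lemma is given in Section 5. First, we want to show that the inequalities in this lemma can be used to prove exponential inequalities on $S_n$.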

Application: Bernstein inequality
Note that van de Geer (1995) and de la Peña (1999) obtained some tight Bernstein-type inequalities for martingales. Here, we can use the martingale decomposition and apply Lemma 3.1 to obtain the following result.
Theorem 3.1. Assume that there exist some constants $M > 0$, $V_1 \ge 0$ and $V_2 \ge 0$ such that, for any integer $k \ge 2$, Let $\delta = M K_{n-1}(\rho)$ and Then, for any $s \in [0, \delta^{-1})$, Consequently, for any $x > 0$, The quantity $V_{(n)}$ can be computed explicitly from the definition for each $n$ but note that
Proof. For any $s \in [0, \delta^{-1})$, We use Lemma 3.1 for the second inequality, the moment assumption for the third one, and the inequality $1 + s \le e^s$ for the final inequality. Similarly, for any $k \in [2, n]$, By the tower property of conditional expectation, it follows that which gives inequality (6). Using the exponential Markov inequality, we deduce that, for any $x \ge 0$, Minimizing the right-hand side with respect to $s$ leads to the result.

McDiarmid and Cramer inequalities
Here, we state other consequences of Lemma 3.1. However, as our applications are based on Bernstein's inequality, we postpone the proofs of these results to Section 5. When the Laplace transforms of the dominating random variables $G_{X_1}(X_1)$ and $G_\varepsilon(\varepsilon_k)$ satisfy the Cramér condition, we obtain the following proposition.
Proposition 3.1. Assume that there exist some constants a > 0, K 1 ≥ 1 and Consequently, for any x > 0, Now, consider the case where the increments d k are bounded. We shall use an improved version of the well known inequality by McDiarmid, stated by Rio (2013b). For this inequality, we do not assume that (3) holds. Thus, Proposition 3.2 applies to any Markov chain be the Young transform of (t). As quoted by Rio (2013b), the following in- Let also (X 1 , (ε i ) i≥2 ) be an independent copy of (X 1 , (ε i ) i≥2 ).

Application to periodic autoregressive models
In this section, we apply Theorem 3.1 to predict a nonstationary Markov chain. We will use periodic autoregressive predictors. Of course, these predictors will work well when the Markov chain is indeed periodic autoregressive. However, we will state the results in a more general context: when the model is wrong, we simply estimate its best prediction by a periodic autoregression.

Context
Let $(X_t)_{t\ge 1}$ be an $\mathbb{R}^d$-valued process defined by the distribution of $X_1$ and, for $t > 0$, where the $\varepsilon_t$ are i.i.d. and centered, and each $f_t^*$ belongs to a fixed family of We are interested in periodic predictors: $f_{t+T} = f_t$, defined by a sequence $(f_1, \dots, f_T) \in \mathcal{F}^T$. Of course, if the series $(X_t)$ actually satisfies $f^*_{t+T} = f^*_t$, then this family of predictors can give optimal predictions. But they might also perform well when this equality is not exact (for example, when there is a very small drift).
Prediction is assessed with respect to a non-negative loss function $\ell(\cdot)$. We assume that $\ell$ is $L$-Lipschitz. Note that this includes the absolute loss, the Huber loss and all the quantile losses. This also includes the quadratic loss if we assume that $X_t$, and hence $\varepsilon_t$, is bounded. Given a sample $X_1, \dots, X_n$ we define the empirical risk, for any $f_{1:T} = (f_1, \dots, f_T) \in \mathcal{F}^T$: Note that when the process actually has a $T$-periodic distribution, in the sense that the distribution of the vectors $(X_{kT+1}, \dots, X_{(k+1)T})$ is the same for any $k$, then almost surely $f^*_t = f^*_{t+T}$ for any $t$ and the prediction averaged over one period, which appears to be equal to $R_{T+1}(f_{1:T})$. We can actually give a more accurate statement.
Proposition 4.1. When the distribution of (X kT +1 , . . . , X (k+1)T ) does not depend on k, (All the proofs are postponed to Section 5 for the clarity of exposition). The simplest use of Bernstein's inequality is to control the deviation between r n (f 1:T ) and R n (f 1:T ) for a fixed predictor f 1:T .

Estimation with a fixed period
In this subsection we assume that $T$ is known (we will later show how to deal with the case where it is unknown). Thus, we define the ERM estimator $\hat f_{1:T} \in \arg\min_{f_{1:T} \in \mathcal{F}^T} r_n(f_{1:T})$.
In order to study the statistical performance of $\hat f_{1:T}$, a few definitions are in order. For any function $f : \mathbb{R}^d \to \mathbb{R}^d$ we will use the notation $\|f\|_{\sup} = \sup_{x \neq 0} \|f(x)\| / \|x\|$.
When considering linear functions, this actually coincides with the operator norm.
Definition 4.1. Define the covering number $N(\mathcal{F}, \epsilon)$ as the cardinality of the smallest set $\mathcal{F}_\epsilon$ such that $\forall f \in \mathcal{F}$, $\exists f_\epsilon \in \mathcal{F}_\epsilon$ such that $\|f - f_\epsilon\|_{\sup} \le \epsilon$. Define the entropy of $\mathcal{F}$ by $H(\mathcal{F}, \epsilon) = 1 \vee \log N(\mathcal{F}, \epsilon)$.
Covering numbers are standard tools to measure the complexity of sets of predictors in machine learning.
Theorem 4.1. As soon as $n \ge 1 + 4\delta^2 T H(\mathcal{F}, \frac{1}{Ln})/V$ we have, for any $\eta > 0$, The theorem states that the predictor $\hat f_{1:T}$ predicts as well as the best possible one, up to an estimation error that vanishes at rate $1/\sqrt{n}$. For example, using (periodic) VAR(1) predictors in dimension $d$, we get a bound in
Remark 4.2. When the series is indeed stationary for a known $T$, it is to be noted that $(X_{iT+1}, \dots, X_{(i+1)T})_{i\ge 0}$ is a time-homogeneous Markov chain. In this case, our technique is not really necessary: it would be possible to apply the inequality from Dedecker and Fan (2015). However, when $T$ is not known, this becomes impossible. One then has to compare the empirical risks of $\hat f_{1:T}$ for the various possible $T$'s, and for most of them, $(X_{iT+1}, \dots, X_{(i+1)T})_{i\ge 0}$ is not homogeneous, so vectorization cannot help. On the other hand, our inequality can be used for period selection, as detailed in the next subsection.

Period and model selection
We define a penalized estimator in the spirit of Massart (2007). Fix a maximal period $T_{\max}$, for example $T_{\max} = \lfloor n/2 \rfloor$. We propose the following penalized estimator for $T$: Using this estimator, we have the following result.

Note that $\hat T$ depends on the constant $C_1$. While $L$ depends only on the loss that is chosen by the statistician, in many applications $\rho$, $V_1$ and $V_2$ are unknown. We recommend using an empirical criterion like the slope heuristic, introduced by Birgé and Massart (2006), to calibrate $C_1$. This procedure is as follows:
1. define, for any $c > 0$, $\hat T(c) = \arg\min_{1 \le T \le T_{\max}} [r_n(\hat f_{1:T}) + c\sqrt{T}]$;
2. fix a small step $\epsilon > 0$ and define $\hat c$ as the maximiser of the jump $J(c) = |\hat T(c + \epsilon) - \hat T(c)|$;
3. select $\hat T(2\hat c)$.
Many variants, details on fast implementations and references for theoretical results (in the i.i.d. case) can be found in Baudry et al. (2012). A theoretical study of the slope heuristic in this context could be the object of future work.
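The three steps of the slope heuristic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `select_period` and `slope_heuristic`, the grid of values of $c$ and the toy list `risks` are hypothetical and not taken from the paper. Since $\hat T(c)$ cannot increase with $c$, the jump is measured as the size of the drop $\hat T(c) - \hat T(c + \epsilon)$.

```python
import math


def select_period(risks, c):
    """Return T(c), the minimiser over T of risks[T - 1] + c * sqrt(T).

    risks[T - 1] stands for the empirical risk of the ERM with period T.
    """
    return min(range(1, len(risks) + 1),
               key=lambda T: risks[T - 1] + c * math.sqrt(T))


def slope_heuristic(risks, c_grid):
    """Locate the largest jump of c -> T(c) on the grid and return (T(2 c_hat), c_hat)."""
    selected = [select_period(risks, c) for c in c_grid]
    # T(c) is non-increasing in c, so each jump below is non-negative.
    jumps = [selected[i] - selected[i + 1] for i in range(len(c_grid) - 1)]
    i_hat = max(range(len(jumps)), key=jumps.__getitem__)
    c_hat = c_grid[i_hat]
    return select_period(risks, 2 * c_hat), c_hat


# Toy empirical risks for T = 1, ..., 12 (made-up numbers): a sharp drop at
# T = 4, then a slow decrease along the multiples of 4.
risks = [1.90, 1.85, 1.88, 1.00, 1.80, 1.75, 1.78, 0.97, 1.70, 1.65, 1.68, 0.94]
c_grid = [0.01 * i for i in range(201)]  # step 0.01 on [0, 2]

T_hat, c_hat = slope_heuristic(risks, c_grid)
```

On these toy values the unpenalised minimiser is the largest multiple of 4, and the penalty calibrated at $2\hat c$ brings the selection back to the period at which the risk drops sharply. The grid step plays the role of $\epsilon$ and has to be chosen by the user.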

Simulation study
As an illustration we simulate $X_{t+1} = a_t X_t + \varepsilon_t$ for $t = 1, \dots, 400$, where $a_{t+4} = a_t$, $(a_1, a_2, a_3, a_4) = (0.8, 0.5, 0.9, -0.7)$ and $\varepsilon_t \sim \mathcal{N}(0, 1)$. The data are shown in Figure 1 and the autocorrelation function in Figure 2. It is clear that a statistician trying to estimate an AR(1) model with a fixed coefficient would be puzzled by this situation.

Figure 1: Simulated data.
The dependence of $r_n(\hat a_{1:T})$ on $T \in \{1, \dots, T_{\max}\}$, with $T_{\max} = 20$, is shown in Figure 3. The choice $T = 4$ leads to an improvement with respect to $T < 4$. On the other hand, we observe a slow linear decrease of $r_n(\hat a_{1:4})$, $r_n(\hat a_{1:8})$, $r_n(\hat a_{1:12})$, ...: this is a sign of overfitting. And indeed,
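A minimal Python sketch of this kind of experiment is given below. Several choices are assumptions made for illustration and are not stated in the text: the initial value $X_1 \sim \mathcal{N}(0, 1)$, the seed, the use of the squared loss (chosen only because the minimiser has a closed form, whereas the theory above assumes a Lipschitz loss), and the form of the empirical risk as an average of one-step prediction errors. The function names are hypothetical.

```python
import random


def simulate_periodic_ar1(coeffs, n, rng):
    """Simulate X_{t+1} = a_t X_t + eps_t with periodic a_t and eps_t ~ N(0, 1)."""
    x = [rng.gauss(0.0, 1.0)]  # assumption: X_1 ~ N(0, 1)
    for t in range(1, n):
        a_t = coeffs[(t - 1) % len(coeffs)]
        x.append(a_t * x[-1] + rng.gauss(0.0, 1.0))
    return x


def periodic_least_squares_risk(x, T):
    """In-sample mean squared error of the best T-periodic linear predictor.

    One coefficient is fitted per phase (time index modulo T) by least squares.
    """
    sxy = [0.0] * T
    sxx = [0.0] * T
    for t in range(len(x) - 1):
        sxy[t % T] += x[t] * x[t + 1]
        sxx[t % T] += x[t] * x[t]
    a_hat = [sxy[j] / sxx[j] if sxx[j] > 0 else 0.0 for j in range(T)]
    sq_err = sum((x[t + 1] - a_hat[t % T] * x[t]) ** 2
                 for t in range(len(x) - 1))
    return sq_err / (len(x) - 1), a_hat


rng = random.Random(1)  # fixed seed, an arbitrary choice
x = simulate_periodic_ar1((0.8, 0.5, 0.9, -0.7), 400, rng)
risks = {T: periodic_least_squares_risk(x, T)[0] for T in range(1, 21)}
```

With an in-sample least squares fit, the risk for a period that is a multiple of 4 can never exceed the risk for $T = 4$, because the corresponding models are nested. This is why a penalty on $T$ is needed to avoid selecting an unnecessarily long period.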

Acknowledgements
We thank the anonymous Referees for their very constructive comments that helped to improve the clarity of the paper.

Proofs
Proof of Lemma 3.1. The first point will be proved by backward induction. The result is obvious for $t = n$, since $g_n = f$. Assume that it is true at step $t$, and let us prove it at step $t - 1$. By definition, $g_{t-1}(X_1, \dots, X_{t-1}) = \mathbb{E}[g_t(X_1, \dots, X_t) | \mathcal{F}_{t-1}] = \int g_t(X_1, \dots, X_{t-1}, F_t(X_{t-1}, y)) P_\varepsilon(dy)$.

Point 1 follows from this last inequality and (12).
Let us now prove Point 2. First note that where the inequality comes from the first point of Lemma 3.1. In the same way, for $t \ge 2$, Finally, the proof of Point 3 is direct: if (3) is true, then $H_{t,\varepsilon}(x, y) = \int d(F_t(x, y), F_t(x, y')) P_\varepsilon(dy') \le \int C \delta(y, y') P_\varepsilon(dy') = G_\varepsilon(y)$.
The proof of the lemma is now complete.
We state a lemma that will be used in the following proofs.
From the proof of Lemma 3.1, it is easy to see that By Lemma 3.1 and the hypothesis of the proposition, it follows that v k−1 (X 1 , . . . , X k−1 ) − u k−1 (X 1 , . . . , X k−1 ) Following exactly the proof of Theorem 3.1 of Rio (2013b) with ∆ k = K n−k (ρ)M k , we get (9) and (10).
Proof of Proposition 4.1. Put k = n/T , then where we used Lemma 5.1 and ρ n−1 < 1 for the last inequality. In the same way, Combining both inequalities leads to the result.
Proof of Corollary 4.1. By definition, we have (2) is satisfied, and $F_t(x, y) - F_t(x, y') = y - y'$ so that (3) is satisfied with $C = 1$. We consider the random variable S n = f (X 1 , . . . , So the assumptions of Theorem 3.1 are satisfied and Recall that $S_n = \frac{n-1}{L(1+\rho)} [r_n(f_{1:T}) - \mathbb{E}[r_n(f_{1:T})]]$, and $R_n(f_{1:T}) = \mathbb{E}[r_n(f_{1:T})]$, so that by setting $s = t(n-1)/(L(1+\rho))$ we end the proof.
Proof of Theorem 4.1. Fix $\epsilon > 0$. We have, for any f 1:T ∈ F , the deviation inequality from Corollary 4.1. A union bound on f 1:T ∈ F leads to, for any s ∈ 0, n−1 L(1+ρ)δ , Now, for any f 1:T ∈ F T we construct f 1:T = (f 1 , . . . , f T ) by choosing, for any t ∈ {1, . . . , T }, a function f t such that f t − f t sup ≤ , as allowed from the definition of F . Obviously and as a consequence, and Using Theorem 3.1 with f (X 1 , . . . , X n ) = n−1 t=1 X t we have, for any y > 0, Lemma 5.1 leads to where we introduce the last notation for short. Now let us consider the "favorable" event The previous inequalities show that On E, we have: R n (f 1:T ) ≤ R n (f 1:T ) + L z ρ,n n − 1 ≤ r n (f 1:T ) + x + L z ρ,n n − 1 ≤ r n (f 1:T ) + x + L 2 z ρ,n n − 1 + y n − 1 = min R n (f 1:T ) + 2x + L 3z ρ,n + y n − 1 .