Prediction of time series by statistical learning: general losses and fast rates

We establish rates of convergence in time series forecasting using the statistical learning approach based on oracle inequalities. A series of papers extends the oracle inequalities obtained for iid observations to time series under weak dependence conditions. Given a family of predictors and $n$ observations, oracle inequalities state that a predictor forecasts the series as well as the best predictor in the family up to a remainder term $\Delta_n$. Using the PAC-Bayesian approach, we establish under weak dependence conditions oracle inequalities with optimal rates of convergence. We extend previous results for the absolute loss function to any Lipschitz loss function with rates $\Delta_n\sim\sqrt{c(\Theta)/ n}$ where $c(\Theta)$ measures the complexity of the model. We apply the method for quantile loss functions to forecast the French GDP. Under additional conditions on the loss functions (satisfied by the quadratic loss function) and on the time series, we refine the rates of convergence to $\Delta_n \sim c(\Theta)/n$. We achieve for the first time these fast rates for uniformly mixing processes. These rates are known to be optimal in the iid case and for individual sequences. In particular, we generalize the results of Dalalyan and Tsybakov on sparse regression estimation to the case of autoregression.

(4) CREST-LFA, 15 boulevard Gabriel Péri, 92245 Malakoff CEDEX, France

Introduction
Time series forecasting is a fundamental subject in the mathematical statistics literature. The parametric approach contains a wide range of models associated with efficient estimation and prediction methods, see e.g. [Ham94]. Classical parametric models include linear processes such as ARMA models [BD09]. More recently, non-linear processes such as stochastic volatility and ARCH models have received a lot of attention in financial applications; see, e.g., the seminal paper by Nobel prize winner [Eng82], and [FZ10] for a more recent introduction. However, parametric assumptions rarely hold on data. Assuming that the data satisfy a model can bias the prediction and lead to an underestimation of the risks; see, among others, the polemical but highly informative discussion in [Tal07].
In the last few years, several universal approaches have emerged from various fields such as non-parametric statistics, machine learning, computer science and game theory. These approaches share some common features: the aim is to build a procedure that predicts the time series as well as the best predictor in a given set of initial predictors Θ, without any parametric assumption on the distribution of the observed time series. However, the set of predictors can be inspired by different parametric or non-parametric statistical models. We can distinguish two classes among these approaches, with different quantifications of the objective and different terminologies:
• In the "prediction of individual sequences" approach, predictors are usually called "experts". The objective is online prediction: at each date t, a prediction of the future realization $x_{t+1}$ is based on the previous observations $x_1, \dots, x_t$, the objective being to minimize the cumulative prediction loss. See for example [CBL06,Sto10] for an introduction.
• In the statistical learning approach, the given predictors are sometimes referred to as "models" or "concepts". The batch setting is more classical in this approach. A prediction procedure is built on a complete sample $X_1, \dots, X_n$. The performance of the procedure is compared, in terms of the expected loss (called the risk), with that of the best predictor, called the "oracle". The environment is not deterministic and some hypotheses like mixing or weak dependence are required: see [Mei00,MM98,AW12].
In both settings, one is usually able to predict a time series as well as the best model or expert, up to an error term that decreases with the number of observations n. This type of result is referred to in statistical theory as an oracle inequality. In other words, one builds on the basis of the observations a predictor $\hat\theta$ such that
$$R(\hat\theta) \leq \inf_{\theta\in\Theta} R(\theta) + \Delta(n,\Theta) \quad (1.1)$$
where R(θ) is a measure of the prediction risk of the predictor θ ∈ Θ. In general, the remainder term is of the order $\Delta(n,\Theta)\sim\sqrt{c(\Theta)/n}$ in both approaches, where c(Θ) measures the complexity of Θ. See, e.g., [CBL06] for the "individual sequences" approach; for the statistical learning approach the rate $\sqrt{c(\Theta)/n}$ is reached in [AW12] with the absolute loss function and under a weak dependence assumption. Different procedures are used to reach these rates. Let us mention empirical risk minimization [Vap99] and aggregation procedures with exponential weights, usually referred to as EWA [DT08,Ger11] or the Gibbs estimator [Cat04,Cat07] in the batch approach, linked to the weighted majority algorithm of the online approach [LW94], see also [Vov90]. Note that results from the "individual sequences" approach can sometimes be extended to the batch setting, see e.g. [Ger11] for the iid case, and [AD11,DAJJ12] for mixing time series.
In this paper, we extend the results of [AW12] to the case of a general loss function. Another improvement with respect to [AW12] is to study both the ERM and the Gibbs estimator under various hypotheses. We achieve here inequalities of the form (1.1) that hold with large probability (1 − ε for any arbitrarily small confidence level ε > 0) with $\Delta(n,\Theta)\sim\sqrt{c(\Theta)/n}$. To do so, we assume that the observations are drawn from a bounded stationary process $(X_t)$ (see [AW12] however for some possible extensions to unbounded observations). We also assume weak dependence conditions on the process $(X_t)$. Then we prove that the fast rate $\Delta(n,\Theta)\sim c(\Theta)/n$ can be reached for some loss functions, including the quadratic loss. Note that [Mei00,MM98] deal with the quadratic loss; their rate can be better than $\sqrt{c(\Theta)/n}$ but cannot reach $c(\Theta)/n$.
Our main results are based on PAC-Bayesian oracle inequalities. The PAC-Bayesian point of view emerged in statistical learning in supervised classification using the 0/1-loss, see the seminal papers [STW97,McA99]. These results were then extended to general loss functions and more accurate bounds were given, see for example [Cat04, Cat07, Alq08, Aud10, AL11, SLCB+12, DS12]. In PAC-Bayesian inequalities the complexity term c(Θ) is defined thanks to a prior distribution on the set Θ.
The paper is organized as follows: Section 2 provides the notations used throughout the paper. We give a definition of the Gibbs estimator and of the ERM in Section 3. The main hypotheses necessary to prove theoretical results on these estimators are provided in Section 4. We give examples of inequalities of the form (1.1) for classical sets of predictors Θ in Section 5. When possible, we also prove some results on the ERM in these settings. These results only require a general weak-dependence type assumption on the time series to forecast. We then study fast rates under the stronger φ-mixing assumption of [Ibr62] in Section 6. Note that the φ-mixing setting coincides with that of [AD11,DAJJ12] when $(X_t)$ is stationary. In particular, we are able to generalize the results of [DT08, Ger11, AL11] on sparse regression estimation to the case of autoregression. In Section 7 we provide an application to French GDP forecasting. A short simulation study is provided in Section 8. Finally, the proofs of all the theorems are given in Appendices A and B.

Notations
Let $X_1, \dots, X_n$ denote the observations at times $t \in \{1, \dots, n\}$ of a time series $X = (X_t)_{t\in\mathbb{Z}}$ defined on $(\Omega, \mathcal{A}, \mathbb{P})$. We assume that this series is stationary and takes values in $\mathbb{R}^p$ equipped with the Euclidean norm $\|\cdot\|$. We fix an integer $k$, which may depend on $n$, $k = k(n)$, and assume that a family of predictors is available: $\{f_\theta : (\mathbb{R}^p)^k \to \mathbb{R}^p,\ \theta \in \Theta\}$. For any parameter $\theta$ and any time $t$, $f_\theta(X_{t-1}, \dots, X_{t-k})$ is the prediction of $X_t$ returned by the predictor $\theta$ when given $(X_{t-1}, \dots, X_{t-k})$. We assume that $\theta \mapsto f_\theta$ is a linear map. Let us fix a loss function ℓ that measures the distance between the forecast and the actual realization of the series. Assumptions on ℓ will be given in Section 4.
Definition 1. For any θ ∈ Θ we define the prediction risk as $R(\theta) = \mathbb{E}\left[\ell\bigl(f_\theta(X_{t-1}, \dots, X_{t-k}),\, X_t\bigr)\right]$ (which does not depend on $t$ thanks to the stationarity assumption).
Using statistical terminology, note that we may want to include parametric as well as non-parametric sets of predictors (i.e. respectively finite-dimensional and infinite-dimensional Θ). Let us mention classical parametric and non-parametric families of predictors:
Example 1. Define the set of linear autoregressive predictors as $f_\theta(x_1, \dots, x_k) = \theta_0 + \sum_{j=1}^{k} \theta_j x_j$ for $\theta = (\theta_0, \dots, \theta_k)$.
In order to deal with non-parametric settings, we will also use a model-selection type notation: $\Theta = \cup_{j=1}^M \Theta_j$.
Example 2. Consider non-parametric auto-regressive predictors of the form $f_\theta = \sum_i \theta_i \varphi_i$, where $(\varphi_i)_i$ is a dictionary of functions $(\mathbb{R}^p)^k \to \mathbb{R}^p$ (e.g. Fourier basis, wavelets, splines...).
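For concreteness, the linear autoregressive predictor of Example 1 can be sketched as follows (our own illustration, assuming p = 1 so that observations are scalar):

```python
def linear_ar_predict(theta, past):
    """Linear AR prediction f_theta(x_1, ..., x_k) = theta_0 + sum_j theta_j * x_j.

    theta : sequence of length k + 1; theta[0] is the intercept theta_0.
    past  : the k most recent observations (X_{t-1}, ..., X_{t-k}), most recent first.
    """
    k = len(past)
    assert len(theta) == k + 1, "one coefficient per lag plus an intercept"
    return theta[0] + sum(theta[j + 1] * past[j] for j in range(k))
```

For instance, with theta = (0, 1) the predictor simply repeats the last observation.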

The estimators
As the objective is to minimize the risk R(·), we use the empirical risk r n (·) as an estimator of R(·).
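The formal definition of $r_n(\cdot)$ is not displayed above; the following sketch (our own, assuming p = 1 and the convention that the average runs over the n − k predictions the sample allows) shows the intended computation:

```python
def empirical_risk(series, k, f_theta, loss):
    """r_n(theta): average one-step-ahead loss of predictor f_theta on the sample.

    series  : observed values X_1, ..., X_n (a list of floats)
    f_theta : maps the k most recent observations (most recent first) to a forecast
    loss    : loss(prediction, realization)
    """
    n = len(series)
    total = 0.0
    for t in range(k, n):
        past = [series[t - j] for j in range(1, k + 1)]  # (X_{t-1}, ..., X_{t-k})
        total += loss(f_theta(past), series[t])
    return total / (n - k)
```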
Let $\mathcal{T}$ be a σ-algebra on Θ and $\mathcal{M}^1_+(\Theta)$ denote the set of all probability measures on $(\Theta, \mathcal{T})$. The Gibbs estimator depends on a fixed probability measure $\pi \in \mathcal{M}^1_+(\Theta)$, called the prior, that will be involved when measuring the complexity of Θ.
Definition 4 (Gibbs estimator, or EWA). Define the Gibbs estimator with inverse temperature λ > 0 as $\hat\theta_\lambda = \int_\Theta \theta\,\hat\rho_\lambda(\mathrm{d}\theta)$, where the Gibbs distribution $\hat\rho_\lambda$ is given by $\hat\rho_\lambda(\mathrm{d}\theta) \propto \exp(-\lambda r_n(\theta))\,\pi(\mathrm{d}\theta)$. The choice of π and λ in practice is discussed in Section 5.
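On a finite parameter set, the Gibbs distribution and estimator of Definition 4 can be sketched as follows (an illustration of ours, not the paper's code; the risk shift is for numerical stability):

```python
import math

def gibbs_weights(risks, lam, prior=None):
    """rho_lambda(theta_i) proportional to exp(-lam * r_n(theta_i)) * pi(theta_i)."""
    M = len(risks)
    if prior is None:
        prior = [1.0 / M] * M  # uniform prior pi
    m = min(risks)  # shift risks before exponentiating, for numerical stability
    w = [p * math.exp(-lam * (r - m)) for r, p in zip(risks, prior)]
    s = sum(w)
    return [x / s for x in w]

def gibbs_estimator(thetas, risks, lam):
    """hat_theta_lambda: posterior mean of theta under the Gibbs distribution."""
    return sum(w * t for w, t in zip(gibbs_weights(risks, lam), thetas))
```

With λ = 0 the estimator is the prior mean; as λ → ∞ it concentrates on the empirical risk minimizer.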

Overview of the results
Our results assert that the risk of the ERM or Gibbs estimator is close to $\inf_{\theta\in\Theta} R(\theta)$ up to a remainder term ∆(n, Θ) called the rate of convergence. For the sake of simplicity, let $\overline\theta \in \Theta$ be such that $R(\overline\theta) = \inf_{\theta\in\Theta} R(\theta)$. If such a $\overline\theta$ does not exist, it is replaced by an approximate minimizer $\overline\theta_\alpha$ satisfying $R(\overline\theta_\alpha) \le \inf_\theta R(\theta) + \alpha$, where α is negligible w.r.t. ∆(n, Θ) (e.g. α < 1/n²). We want to prove that the ERM satisfies, for any ε > 0, with probability at least 1 − ε,
$$R(\hat\theta^{\mathrm{ERM}}) \le R(\overline\theta) + \Delta(n, \Theta, \varepsilon) \quad (3.1)$$
where ∆(n, Θ, ε) → 0 as n → ∞, and that the Gibbs estimator satisfies, for any ε > 0, with probability at least 1 − ε,
$$R(\hat\theta_\lambda) \le R(\overline\theta) + \Delta(n, \lambda, \pi, \varepsilon) \quad (3.2)$$
where ∆(n, λ, π, ε) → 0 as n → ∞ for some λ = λ(n). To obtain such results, called oracle inequalities, we require some assumptions discussed in the next section.

Main assumptions
We prove oracle inequalities under assumptions of two different types. On the one hand, assumptions LipLoss(K) and Lip(L) bear respectively on the loss function ℓ and the set of predictors Θ. To some extent, we choose the loss function and the predictors, so these assumptions can always be satisfied. Assumption Margin(K) also bears on ℓ.
On the other hand, assumptions Bound(B), WeakDep(C) and PhiMix(C) bear on the dependence and boundedness of the time series. In practice, we cannot know whether these assumptions are satisfied by the data. However, remark that these assumptions are non-parametric and are satisfied by many classical models, see [Dou94,DDL+07].

Example 4. The class of quantile loss functions introduced in [KB78] is given by
$$\ell_\tau(u) = u\,\bigl(\tau - \mathbf{1}_{u<0}\bigr),$$
where τ ∈ (0, 1) and, for x, y ∈ ℝ, the loss of forecast x against realization y is $\ell_\tau(x-y)$. The minimizer of $t \mapsto \mathbb{E}(\ell_\tau(V - t))$ is the quantile of order τ of the random variable V. Choosing this loss function, one can deal with rare events and build confidence intervals, see [Koe05,BC11,BP11]. In this case, LipLoss(K) is satisfied with K = max(τ, 1 − τ) ≤ 1.
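A sketch (ours) of the quantile loss in check-function form:

```python
def pinball(tau, u):
    """Check function ell_tau(u) = u * (tau - 1{u < 0}); u = V - t is the prediction error."""
    return u * (tau - (1.0 if u < 0 else 0.0))
```

Minimizing t ↦ E ℓ_τ(V − t) over t returns the τ-quantile of V, which is how the GDP confidence intervals of Section 7 are built; note that the slope of the loss is τ on one side and τ − 1 on the other, hence the Lipschitz constant max(τ, 1 − τ).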
Remark that under Assumptions LipLoss(K), Lip(L) and Bound(B), the empirical risk is a bounded random variable. Such a condition is also required in the individual-sequences approach. We assume it here for simplicity, but it is possible to extend the slow-rates oracle inequalities to unbounded cases, see [AW12].
In order to prove the fast-rates oracle inequalities, a more restrictive dependence condition is assumed. It bears on the uniform mixing coefficients introduced by [Ibr62].
Definition 6. The φ-mixing coefficients of the stationary sequence $(X_t)$ with distribution P are defined as
$$\phi_r = \sup_{\substack{A \in \sigma(X_s,\ s \le 0),\ \mathbb{P}(A) > 0 \\ B \in \sigma(X_s,\ s \ge r)}} \bigl|\mathbb{P}(B \mid A) - \mathbb{P}(B)\bigr|.$$
This assumption appears to be more restrictive than WeakDep(C) for bounded time series:

Bound(B) and PhiMix(C) ⇒ Bound(B) and WeakDep(CB).
(This result is not stated in [Rio00] but it is a direct consequence of the last inequality in the proof of Corollaire 1, p. 907 in [Rio00]).
Finally, for fast-rates oracle inequalities, an additional assumption on the loss function ℓ is required. Such a condition is also required in the iid case. It is called the margin assumption, e.g. in [MT99,Alq08], or the Bernstein hypothesis [Lec11].
As assumptions Margin(K) and PhiMix(C) are used only to obtain fast rates, we postpone examples to Section 6.

Slow rates oracle inequalities
In this section, we give oracle inequalities (3.1) and/or (3.2) with slow rates of convergence $\Delta(n,\Theta) \sim \sqrt{c(\Theta)/n}$. The proofs of these results are given in Section B. Note that the results concerning the Gibbs estimator are actually corollaries of a general result, Theorem 9, stated in Section A. We introduce the following notation for the sake of brevity.

Finite classes of predictors
Consider first the toy example where Θ is finite with |Θ| = M, M ≥ 1. In this case, the optimal slow rate in the iid case is known to be $\sqrt{\log(M)/n}$, see e.g. [Vap99].
Theorem 1. Assume that |Θ| = M and that SlowRates(κ) is satisfied for some κ > 0. Let π be the uniform probability distribution on Θ. Then the oracle inequality (3.2) is satisfied for any λ > 0, ε > 0 with
$$\Delta(n, \lambda, \pi, \varepsilon) = \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\bigl(\log M + \log(2/\varepsilon)\bigr)}{\lambda}.$$
The choice of λ in practice is already non-trivial in this toy example. The choice $\lambda = \sqrt{n\log(M)}$ yields an oracle inequality with rate of order $\sqrt{\log(M)/n}$. However, this choice is not optimal, and one would like to choose λ as the minimizer of the upper bound above. However κ = κ(K, L, B, C), and the constants B and C are usually unknown. In this context we will prefer the ERM predictor, which performs as well as the Gibbs estimator with optimal λ:
Theorem 2. Assume that |Θ| = M and that SlowRates(κ) is satisfied for some κ > 0. Then the oracle inequality (3.1) is satisfied for any ε > 0, with a remainder term of the same optimal order $\sqrt{\log(M)/n}$.

Linear autoregressive predictors
We focus on the linear predictors given in Example 1.
Theorem 3. Consider the set of linear AR(k) predictors of Example 1, with $\theta \in \mathbb{R}^{k+1}$, $\|\theta\| \le L$. Assume that Bound(B), LipLoss(K) and WeakDep(C) are satisfied. Let π be the uniform probability distribution on the extended parameter set $\{\theta \in \mathbb{R}^{k+1},\ \|\theta\| \le L+1\}$. Then the oracle inequality (3.2) is satisfied for any λ > 0, ε > 0.
In theory, λ can be chosen of the order $\sqrt{(k+1)n}$ to achieve the optimal rate $\sqrt{(k+1)/n}$ up to a logarithmic factor. But the choice of the optimal λ in practice remains a problem. The ERM predictor again performs as well as the Gibbs predictor with optimal λ.
Theorem 4. Under the assumptions of Theorem 3, the oracle inequality (3.1) is satisfied for any ε > 0.
The additional constraint on λ does not depend on n. It is restrictive only when k + 1, the complexity of the autoregressive model, is of the same order as n. For n sufficiently large and $\lambda = \frac{1-k/n}{\kappa}\sqrt{\frac{(k+1)n}{2}}$ satisfying the constraint λ ≥ 2KB/(k + 1), we obtain the desired oracle inequality. Theorems 3 and 4 are both direct consequences of the following results about general classes of predictors.

General parametric classes of predictors
We state a general result about finite-dimensional families of predictors. The complexity k + 1 of the autoregressive model is replaced by a more general measure of the dimension d(Θ, π). We also introduce a general measure D(Θ, π) of the diameter that will, for most compact models, be linked to the diameter of the model.
Theorem 6. Assume that Bound(B), LipLoss(K) and WeakDep(C) are satisfied, and that Lip(L) holds on the extended model. This result yields nearly optimal rates of convergence for the ERM predictor. Indeed, for n sufficiently large and $\lambda = \frac{1-k/n}{\kappa}\sqrt{\frac{dn}{2}} \ge 2K\psi/d$, we obtain the desired oracle inequality. Thus, the ERM procedure yields predictions that are close to the oracle with an optimal rate of convergence up to a logarithmic factor.
Example 6. Consider the set of linear AR(k) predictors studied in Theorems 3 and 4. Then Lip(L) is automatically satisfied with L = D + 1. The assumptions of Theorem 6 are satisfied with d = k + 1 and ψ = B. Moreover, thanks to Remark 1, the assumptions of Theorem 5 are satisfied with D(Θ, π) = (KB ∨ K²B²)(D + 1). Then Theorems 3 and 4 are actually direct consequences of Theorems 5 and 6.
Note that the context of Theorem 6 is less general than that of Theorem 5:
Remark 1. Under the assumptions of Theorem 6, a bound holds for any θ ∈ Θ. Define π as the uniform distribution on $\Theta' = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le D+1\}$. A simple computation then shows that the assumptions of Theorem 5 are satisfied for d(Θ, π) = d and D(Θ, π) = (Kψ ∨ K²ψ²)(D + 1).
As a conclusion, for some predictors set with a non classical structure, the Gibbs estimator might be preferred to the ERM.

Aggregation in the model-selection setting
Consider now several models of predictors Θ_1, ..., Θ_M and let $\Theta = \cup_{j=1}^M \Theta_j$. Our aim is to predict as well as the best predictor among all the Θ_j's, while paying only the price for learning in the Θ_j that contains the oracle. In order to get such a result, let us choose M priors π_j on the models, such that π_j(Θ_j) = 1 for all j ∈ {1, ..., M}. Let $\pi = \sum_{j=1}^M p_j \pi_j$ be a mixture of these priors with prior weights p_j ≥ 0 satisfying $\sum_{j=1}^M p_j = 1$. Denote by $\overline\theta_j$ the oracle of the model Θ_j for any 1 ≤ j ≤ M. For any λ > 0, denote by $\hat\rho_{\lambda,j}$ the Gibbs distribution on Θ_j and by $\hat\theta_{\lambda,j} = \int_{\Theta_j} \theta\,\hat\rho_{\lambda,j}(\mathrm{d}\theta)$ the corresponding Gibbs estimator. A Gibbs predictor based on a model-selection procedure satisfies an oracle inequality with slow rate of convergence:
Theorem 7. Assume that SlowRates(κ_j) is satisfied on each model Θ_j, j ∈ {1, ..., M}. Then, with probability at least 1 − ε, the following oracle inequality holds.
The proof is given in Appendix B. A similar result can be obtained if we replace the Gibbs predictor in each model by the ERM predictor in each model. The resulting procedure is known in the iid case under the name SRM (Structural Risk Minimization), see [Vap99], or penalized risk minimization, see [BM01]. However, as was already the case for a fixed model, additional assumptions are required to deal with ERM predictors. In the model-selection context, the procedure used to choose among all the ERM predictors also depends on the unknown κ_j's. Thus the model-selection procedure based on Gibbs predictors outperforms the one based on the ERM predictors.

Discussion on the assumptions
In this section, we study conditions under which the rate 1/n can be achieved. These conditions are restrictive:
• now p = 1, i.e. the process $(X_t)_{t\in\mathbb{Z}}$ is real-valued;
• the dependence condition WeakDep(C) is replaced by PhiMix(C);
• we additionally assume Margin(K) for some K > 0.
Let us provide some examples of processes satisfying the uniform mixing assumption PhiMix(C). In the following examples, (ε_t) denotes an iid sequence (called the innovations).
Example 7 (AR(p) process). Consider the stationary solution $(X_t)$ of an AR(p) model: $\forall t \in \mathbb{Z},\ X_t = \sum_{j=1}^p a_j X_{t-j} + \varepsilon_t$. Assume that (ε_t) is bounded with a distribution possessing an absolutely continuous component. If $A(z) = \sum_{j=1}^p a_j z^j$ has no root inside the unit disk in ℂ, then $(X_t)$ is a geometrically φ-mixing process, see [AP86], and PhiMix(C) is satisfied for some C.
Example 8 (MA(p) process). Consider the stationary process $(X_t)$ such that $X_t = \sum_{j=1}^p b_j \varepsilon_{t-j}$ for all t ∈ ℤ. By definition, the process $(X_t)$ is stationary and φ-mixing; it is even p-dependent, in the sense that φ_r = 0 for r > p. Thus PhiMix(C) is satisfied for some C > 0.
We now provide an example of a predictive model satisfying all the assumptions required to obtain fast-rates oracle inequalities, in particular Margin(K), when the loss function ℓ is quadratic, i.e. ℓ(x, y) = (x − y)².
Example 10. Consider a dictionary of functions $(\varphi_i)_{i=1}^N$ from $(\mathbb{R}^p)^k$ to $\mathbb{R}^p$ and predictors $f_\theta = \sum_{i=1}^N \theta_i \varphi_i$ with θ = (θ_1, ..., θ_N) ∈ ℝ^N. Assume the φ_i upper-bounded by 1 and $\Theta = \{\theta \in \mathbb{R}^N,\ \|\theta\|_1 \le L\}$, so that Lip(L) holds. Moreover, LipLoss(K) is satisfied with K = 2B. Assume that $\overline\theta = \arg\min_{\theta\in\mathbb{R}^N} R(\theta) \in \Theta$; then Assumption Margin(K) is satisfied with K = 4B²(1 + L)². According to Theorem 8 below, the oracle inequality with fast rates holds as soon as Assumption PhiMix(C) is satisfied.

General result
We only give oracle inequalities for the Gibbs predictor in the model-selection setting. In the case of a single model, the result can be extended to the ERM predictor. For several models, the approach based on ERM predictors requires a penalized risk minimization procedure, as in the slow-rates case. In the fast-rates case, the Gibbs predictor itself directly has the desired properties. Let $\Theta = \cup_{i=1}^M \Theta_i$ (disjoint union), choose $\pi = \sum_{j=1}^M p_j \pi_j$ and denote $\overline\theta_j \in \arg\min_{\theta\in\Theta_j} R(\theta)$ as previously.
Theorem 8. Assume that:
1. Margin(K) and LipLoss(K) are satisfied for some K, K > 0;
2. Bound(B) is satisfied for some B > 0;
3. PhiMix(C) is satisfied for some C > 0;
4. Lip(L) is satisfied for some L > 0;
5. for any j ∈ {1, ..., M}, there exist d_j = d(Θ_j, π_j) and D_j = D(Θ_j, π_j) satisfying the required relation.
Compared with the slow-rates case, we do not have to optimize with respect to λ, as the optimal order for λ is independent of j. In practice, the value of λ provided by Theorem 8 is too conservative. In the iid case, it is shown in [DT08] that the value λ = n/(4σ²), where σ² is the variance of the noise of the regression, yields good results. In our simulation results, we will use $\lambda = n/\widehat{\mathrm{var}}(X)$, where $\widehat{\mathrm{var}}(X)$ is the empirical variance of the observed time series.
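The heuristic $\lambda = n/\widehat{\mathrm{var}}(X)$ used in our simulations can be computed as in this sketch (ours; $\widehat{\mathrm{var}}$ is the plain empirical variance):

```python
def default_lambda(series):
    """Inverse temperature lambda = n / var_hat(X), var_hat the empirical variance."""
    n = len(series)
    mean = sum(series) / n
    var_hat = sum((x - mean) ** 2 for x in series) / n
    return n / var_hat
```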
Notice that for the index j_0 such that $R(\overline\theta_{j_0}) = R(\overline\theta)$ we obtain an oracle inequality achieving the fast rate $(d_{j_0}/n)\log(n/d_{j_0})$, where j_0 is the model of the oracle. However, note that the choice j = j_0 does not necessarily reach the infimum in Theorem 8.
Let us compare the rates in Theorem 8 to those in [Mei00, MM98, AD11, DAJJ12]. In [Mei00,MM98], the optimal rate 1/n is never obtained. The paper [AD11] proves fast rates for online algorithms that are also computationally efficient, see also [DAJJ12]. There, the fast rate 1/n is reached when the coefficients (φ_r) are geometrically decreasing; in other cases, the rate is slower. Note that we do not suffer from such a restriction. The Gibbs estimator of Theorem 8 can also be computed efficiently thanks to MCMC procedures, see [AL11, DT08].
Corollary 1. Assume that $\overline\theta = \arg\min_{\theta\in\mathbb{R}^N} R(\theta) \in \Theta$ and that PhiMix(C) is satisfied for some C > 0, as well as Bound(B). Then the oracle inequality (3.2) is satisfied for any ε > 0. This extends the results of [AL11, DT08, Ger11] to the case of autoregression.
Proof. The proof follows the computations of Example 10, which we do not reproduce here: we check the conditions LipLoss(K) with K = 2B, Lip(L), and Margin(K) with K = 4B²(1 + L)². We can then apply Theorem 8 with d_J = |J| and D_J = L.

Uncertainty in GDP forecasting
Every quarter t ≥ 1, the French national bureau of statistics, INSEE, publishes the growth rate of the French GDP (Gross Domestic Product). Since it involves a huge amount of data that takes months to be collected and processed, the computation of the GDP growth rate log(GDP_t/GDP_{t−1}) takes a long time (two years). This means that at time t, the value log(GDP_t/GDP_{t−1}) is actually not known. However, a preliminary value of the growth rate is published only 45 days after the end of the current quarter t. This value is called a flash estimate; it is the quantity that INSEE forecasters actually try to predict, at least initially. As we want to work under the same constraints as the INSEE, we will now focus on the prediction of the flash estimate and let ∆GDP_t denote this quantity. To forecast at time t, we will use: 1. the past values ∆GDP_j, 0 < j < t; 2. past climate indicators I_j, 0 < j < t, based on business surveys.
Business surveys are questionnaires of about ten questions sent monthly to a representative panel of French companies (see [Dev84] for more details). These surveys thus provide information directly from economic decision makers. Moreover, they are available at the end of each month and can therefore be used to forecast the French GDP. INSEE publishes a composite indicator, the French business climate indicator, that summarizes the information of the whole business survey, see [CM09,DM06]. Following [Cor10], let I_t be the mean of the last three (monthly) climate indicators available for each quarter t > 0 at the date of publication of ∆GDP_t. All these values (GDP, climate indicator) are available from the INSEE website. Note that a similar approach is used in other countries, see e.g. [BBR08] on forecasting the European Union GDP growth using EUROSTAT data. In order to quantify the uncertainty of the forecasts, associated confidence intervals are usually provided. The ASA and the NBER started using density forecasts in 1968, while the Central Bank of England and INSEE provide their predictions with a fan chart, see [DTW97,TW00] for surveys on density forecasting and [BFW98] for fan charts. However, the statistical methodology used is often crude and, until 2012, the fan charts provided by the INSEE were based on the homoscedasticity of Gaussian forecasting errors, see [Cor10,Dow04]. However, empirical evidence shows that: 1. GDP forecasting is more uncertain in a period of crisis or recession; 2. the forecasting errors are not symmetrically distributed.
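The quarterly indicator I_t of [Cor10] is a plain average of the three monthly values available at publication time; a minimal sketch (ours, with a hypothetical data layout):

```python
def quarterly_climate(last_three_monthly):
    """I_t: mean of the last three monthly climate indicators available for quarter t."""
    assert len(last_three_monthly) == 3
    return sum(last_three_monthly) / 3.0
```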
We apply Theorem 6: Lip(L) is satisfied on Θ′ with L = D + 1, and LipLoss(K) with K = 1. If the observations are bounded and stationary and WeakDep(C) holds for some C > 0, the assumptions of Theorem 6 are satisfied with ψ = B and d = 4:
Corollary 2. Let us fix τ ∈ (0, 1). If the observations are bounded and stationary and WeakDep(C) holds for some C > 0, then for any ε > 0 and n large enough the oracle inequality holds.
In practice, the choice of D has little importance as long as D is large enough (only the theoretical bound is affected). As a consequence, we take D = 100 in our experiments.

Results
The results are shown in Figure 1 for the forecasting corresponding to τ = 0.5. Figure 2 represents the confidence intervals of order 50%, i.e. τ = 0.25 and τ = 0.75 (left), and of order 90%, i.e. τ = 0.05 and τ = 0.95 (right). We report only the results for the period 2000-Q1 to 2011-Q3 (using the period 1988-Q1 to 1999-Q4 for learning). We denote by $\hat\theta_{\mathrm{ERM},\tau}[t]$ the estimator computed at time t − 1, based on the observations X_j, j < t. We report the online performance:
$$\text{mean abs. pred. error} = \frac1n\sum_{t=1}^n \Bigl|\Delta GDP_t - f_{\hat\theta_{\mathrm{ERM},0.5}[t]}(X_{t-1}, X_{t-2})\Bigr|,$$
$$\text{mean quad. pred. error} = \frac1n\sum_{t=1}^n \Bigl(\Delta GDP_t - f_{\hat\theta_{\mathrm{ERM},0.5}[t]}(X_{t-1}, X_{t-2})\Bigr)^2,$$
and compare it to the INSEE performance, see Table 1. We also report, for each τ, the empirical frequency with which the GDP falls under the predicted τ-quantile, see Table 2; this quantity should be close to τ.
Table 2: Empirical frequencies of the event: GDP falls under the predicted τ-quantile.
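The online performance measures and the coverage frequencies of Tables 1 and 2 can be computed as in this sketch (ours; `actual` and `forecast` are the aligned series of realizations and one-step-ahead predictions):

```python
def mean_errors(actual, forecast):
    """Mean absolute and mean quadratic one-step-ahead prediction errors."""
    n = len(actual)
    abs_err = sum(abs(a - f) for a, f in zip(actual, forecast)) / n
    quad_err = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n
    return abs_err, quad_err

def coverage_frequency(actual, quantile_forecast):
    """Empirical frequency of {realization <= predicted tau-quantile};
    close to tau for a well-calibrated quantile forecaster."""
    n = len(actual)
    return sum(1 for a, q in zip(actual, quantile_forecast) if a <= q) / n
```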
The methodology fails to forecast the magnitude of the 2008 subprime crisis, as was the case for the INSEE forecasts, see [Cor10]. However, it is interesting to note that the confidence interval is larger at that date: the forecast is less reliable, but thanks to our adaptive confidence interval, it would have been possible to know at the time that the prediction was not reliable. Another interesting point is that the lower bounds of the confidence intervals vary over time, while the upper bound is almost constant for τ = 0.95. This supports the idea of asymmetric forecasting errors. A parametric model with Gaussian innovations would lead to an underestimation of the recession risk.

Simulation study
In this section, we finally compare the ERM and Gibbs estimators to the Quasi Maximum Likelihood Estimator (QMLE) based method used by the R function ARMA [R D08]. The idea is not to claim any superiority of one method over another; it is rather to check that the ERM and Gibbs estimators can be safely used in various contexts, as their performances are close to the standard QMLE even when the series is generated from an ARMA model. It is also an opportunity to check the robustness of our estimators in case of misspecification.

Table 3
Performances of the ERM estimators and ARMA on the simulations. The first row "ERM abs." is for the ERM estimator with absolute loss, the second row "ERM quad." for the ERM with quadratic loss. The standard deviations are given in parentheses.

Parametric family of predictors
Here, we compare the ERM to the QMLE. We draw simulations from an AR(1) model (8.1) and a non-linear model (8.2), where the ε_t are iid innovations. We consider two distributions for ε_t: the uniform case, ε_t ∼ U[−a, a], and the Gaussian case, ε_t ∼ N(0, σ²). Note that, in the first case, both models satisfy the assumptions of Theorem 8: there exists a stationary solution $(X_t)$ that is φ-mixing when the innovations are uniformly distributed, and WeakDep(C) is satisfied for some C > 0. This paper does not provide any theoretical results for the Gaussian case, as it is unbounded. However, we refer the reader to [AW12] for truncation techniques that allow one to deal with this case too. We fix σ = 0.4 and a = 0.70 so that Var(ε_t) ≃ 0.16 in both cases. For each model, we first simulate a sequence of length n and then predict X_n using the observations (X_1, ..., X_{n−1}). Each simulation is repeated 100 times and we report the mean quadratic prediction errors in Table 3. It is interesting to note that the ERM estimator with absolute loss performs better on model (8.1), while the ERM with quadratic loss performs slightly better on model (8.2). The differences tend to be too small to be significant; however, the numerical results indicate that both methods are robust to model misspecification. Also, both estimators seem to perform better than the R QMLE procedure when n = 100, but the differences become less perceptible as n grows.
Table 4
Performances of the Gibbs, AIC and "full model" predictors on simulations.
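The uniform-innovation AR(1) setting can be sketched as follows (ours; the coefficient `a_coef` is a placeholder, not the value used in model (8.1), and the burn-in makes the draw approximately stationary):

```python
import random

def simulate_ar1(a_coef, n, innov=0.70, burn=200, seed=0):
    """X_t = a_coef * X_{t-1} + eps_t with eps_t ~ U[-innov, innov] (bounded innovations)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for t in range(burn + n):
        x = a_coef * x + rng.uniform(-innov, innov)
        if t >= burn:
            out.append(x)
    return out
```

Since |a_coef| < 1 and the innovations are bounded, the series is bounded by innov / (1 − |a_coef|), so Bound(B) holds.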

Sparse autoregression
To illustrate Corollary 1, we compare the Gibbs predictor to the model selection approach of the ARMA procedure in the R software. This procedure computes the QMLE estimator in each AR(p) model, 1 ≤ p ≤ q, and then selects the order p by Akaike's AIC criterion [Aka73]. The Gibbs estimator is computed using a Reversible Jump MCMC algorithm as in [AL11]. The parameter λ is taken as $\lambda = n/\widehat{\mathrm{var}}(X)$, where $\widehat{\mathrm{var}}(X)$ is the empirical variance of the observed time series. We draw the data according to models (8.3)-(8.5), where the ε_t are iid innovations. We still consider the uniform (ε_t ∼ U[−a, a]) and Gaussian (ε_t ∼ N(0, σ²)) cases with σ = 0.4 and a = 0.70. We compare the performances of the Gibbs predictor to those of the estimator based on the AIC criterion and to the QMLE in the AR(q) model, the so-called "full model". For each model, we first simulate a time series of length 2n, use observations 1 to n as a learning set and n + 1 to 2n as a test set, for n = 100 and n = 1000. Each simulation is repeated 20 times and we report in Table 4 the mean and the standard deviation of the empirical quadratic errors for each method and each model. Note that the Gibbs predictor performs better on Models (8.4) and (8.5), while the AIC predictor performs slightly better on Model (8.3). The differences tend to be negligible as n grows; this is coherent with the fact that we develop here a non-asymptotic theory. Note that the Gibbs predictor also performs well in the case of Gaussian noise, where the boundedness assumption is not satisfied.

Appendix A: A general PAC-Bayesian inequality
Theorems 1 and 5 are actually both corollaries of a more general result that we state here for the sake of completeness. This result is the analogue of the PAC-Bayesian bounds proved by Catoni in the case of iid data [Cat07].
Let us evaluate the upper bound at the probability distribution ρ_δ defined as follows. Then, under the assumptions of Theorem 5, the infimum is reached for δ = d/λ.

Appendix B: Proofs
Other exponential inequalities can be used to obtain PAC-bounds in the context of time series: the inequalities of [Dou94, Sam00] for mixing time series, of [DDL+07, Win10] under weaker "weak dependence" assumptions, and of [SLCB+12] for martingales. Lemma 1 is very general and yields optimal low rates of convergence. For fast rates of convergence, we use Samson's inequality, which extends Bernstein's inequality to a dependent context.
Lemma 2 (Samson [Sam00]). Let N ≥ 1 and let (Z_i)_{i∈Z} be a stationary process on R^k, with φ^Z_r denoting its φ-mixing coefficients. Then the stated inequality holds for any measurable function f. Proof of Lemma 2. This result can easily be deduced from the proof of Theorem 3 in [Sam00], which states a more general result on empirical processes. On page 457 of [Sam00], replace the definition of f_N(x_1, . . . , x_n) by $f_N(x_1,\dots,x_n) = \sum_{i=1}^n g(x_i)$ (following the notations of [Sam00]). Then check that all the arguments of the proof remain valid; the claim of Lemma 2 is obtained on page 460, line 7.
We also recall the variational formula for the Kullback–Leibler divergence.
Actually, it seems that, in the case of discrete probabilities, this result was already known to Kullback (Problem 8.28 of Chapter 2 in [Kul59]). For a complete proof of this variational formula, even in non-integrable cases, we refer the reader to [DV76, Cat, Cat07].
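In the discrete case, the variational formula reads $\log \int e^h \, d\pi = \sup_{\rho} \{ \int h \, d\rho - \mathcal{K}(\rho,\pi) \}$, with the supremum attained at the Gibbs measure ρ* ∝ e^h·π. The following short numerical check (a sketch on a finite space, with arbitrary randomly drawn π and h) verifies both the equality at ρ* and the inequality for an arbitrary ρ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete check of the variational (Donsker-Varadhan) formula:
#   log E_pi[e^h] = sup_rho { E_rho[h] - K(rho, pi) },
# the supremum being attained at the Gibbs measure rho* ∝ e^h * pi.
k = 6
pi = rng.dirichlet(np.ones(k))     # prior pi on a finite space
h = rng.normal(size=k)             # a bounded function h

lhs = np.log(np.sum(pi * np.exp(h)))

rho_star = pi * np.exp(h)
rho_star /= rho_star.sum()         # Gibbs measure attaining the supremum
kl = np.sum(rho_star * np.log(rho_star / pi))
rhs = np.sum(rho_star * h) - kl
assert np.isclose(lhs, rhs)        # equality at the Gibbs measure

# Any other rho gives a smaller value:
rho = rng.dirichlet(np.ones(k))
assert np.sum(rho * h) - np.sum(rho * np.log(rho / pi)) <= lhs + 1e-12
print(round(lhs, 6))
```

The equality at ρ* follows by direct computation: log(ρ*/π) = h − log Z, so E_{ρ*}[h] − K(ρ*, π) = log Z.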
We are now ready to state the following key Lemma.
Lemma 5. Assume that LowRates(κ) is satisfied for some κ > 0. Then for any λ > 0, with probability at least 1 − ε, simultaneously for all ρ,
$$ \int R \, d\rho \le \int r_n \, d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi)+\log(2/\varepsilon)}{\lambda} \quad\text{and}\quad \int r_n \, d\rho \le \int R \, d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi)+\log(2/\varepsilon)}{\lambda}. $$
Proof of Lemma 5. Fix θ ∈ Θ and λ > 0, and apply the first inequality of Lemma 4. We multiply the resulting bound by ε/2 and integrate it with respect to π(dθ); an application of Fubini's theorem then yields the bound in expectation. We apply Lemma 3 and, since e^x ≥ 1_{R+}(x), we obtain the first inequality. Using the same arguments as above, but starting from the second inequality of Lemma 4, we obtain the second one. A union bound ends the proof.
The following variant of Lemma 5 will also be useful.

B.3. Proof of Theorems 9 and 7
In this subsection we prove the general result on the Gibbs predictor. Proof of Theorem 9. We apply Lemma 5: with probability at least 1 − ε we are on the event given by (B.3), and from now on we work on that event. The first inequality of (B.3), applied to ρ̂_λ(dθ), gives a bound on the integrated risk. According to Lemma 3, we obtain (B.4). We then bound r(θ) from above by R(θ): applying the second inequality of (B.3) and plugging it into (B.4) gives the desired bound. We end the proof by noting that θ ↦ R(θ) is convex, so by Jensen's inequality $\int R(\theta)\,\hat\rho_\lambda(d\theta) \ge R\left(\int \theta\,\hat\rho_\lambda(d\theta)\right) = R(\hat\theta_\lambda)$.

B.4. Proof of Theorems 2 and 6
Let us now prove the results about the ERM.
In particular, we apply the first inequality to θ̂_ERM. Recall that θ minimizes R on Θ and that θ̂_ERM minimizes r_n on Θ, and so we obtain the stated bound. The result still holds if we choose λ as a minimizer of
$$ \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda}. $$
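The optimization over λ is elementary: the remainder above has the form $a\lambda + b/\lambda$ with $a, b > 0$, and
$$ \min_{\lambda > 0}\left\{ a\lambda + \frac{b}{\lambda} \right\} = 2\sqrt{ab}, \qquad \text{attained at } \lambda^* = \sqrt{b/a}. $$
With $a = 2\kappa^2/(n(1-k/n)^2)$ and $b = 2\log(2M/\varepsilon)$, the optimized remainder is
$$ \frac{4\kappa\sqrt{\log(2M/\varepsilon)}}{\sqrt{n}\,(1-k/n)}, $$
which is of the announced order $\sqrt{c(\Theta)/n}$ when $\log M$ plays the role of the complexity $c(\Theta)$.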
We just choose λ as the minimizer of the r.h.s., subject to t ≥ 2Kψ/d, to end the proof.
Notice finally that Margin(K) leads to $V(\theta, \overline\theta) \le K\left(R(\theta) - R(\overline\theta)\right)$. This proves the first inequality of Lemma 7. The second inequality is proved exactly in the same way, replacing f by −f.