We design and analyze an algorithm for estimating the mean of a function of a conditional expectation when the outer expectation is related to a rare event. The outer expectation is evaluated through the average along the path of an ergodic Markov chain generated by a Markov chain Monte Carlo sampler. The inner conditional expectation is computed as a non-parametric regression, using a least-squares method with a general function basis and a design given by the sampled Markov chain. We establish non-asymptotic bounds for the $L^2$-empirical risks associated with this least-squares regression; this generalizes the error bounds usually obtained in the case of i.i.d. observations. Global error bounds are also derived for the nested expectation problem. Numerical results in the context of financial risk computations illustrate the performance of the algorithms.
Statement of the problem.
We consider the problem of estimating the mean of a function of a conditional expectation in a rare-event regime, using Monte Carlo simulations. More precisely, the quantity of interest is
$$\mathcal{I} := \mathbb{E}\big[ f\big(Y, \mathbb{E}[R \mid Y]\big) \,\big|\, Y \in \mathcal{A} \big], \tag{1.1}$$
where R and Y are vector-valued random variables, and $\mathcal{A}$ is a so-called rare subset, i.e. $\mathbb{P}(Y \in \mathcal{A})$ is small. This is a problem of nested Monte Carlo computations with a special emphasis on the distribution tails. In the evaluation of (1.1), which is equivalent to
$$\mathcal{I} = \mathbb{E}\big[ f\big(X, \mathbb{E}[R \mid X]\big) \big],$$
where the distribution of X is the conditional distribution of Y given $\{Y \in \mathcal{A}\}$, there are two intertwined issues, which we now explain to emphasize our contributions.
The outer Monte Carlo stage samples distributions restricted to . A naive acceptance-rejection on Y is inefficient because most simulations of Y are wasted. Therefore, specific rare-event techniques have to be used. Importance sampling is one of these methods (see e.g. [23, 3]), which can be efficient in small dimension (10 to 100) but fails to deal with larger dimensions. In addition, this approach relies heavily on particular types of models for Y and on suitable information about the problem at hand.
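Another option consists in using Markov chain Monte Carlo (MCMC) methods. Such methods amount to constructing a Markov chain $(X^{(m)})_{m \geq 0}$ that possesses a unique stationary distribution π equal to the conditional distribution of Y given the event $\{Y \in \mathcal{A}\}$. In such a case, for π-almost every initial condition $X^{(0)} = x$, the Birkhoff ergodic theorem shows that
$$\lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \varphi\big(X^{(m)}\big) = \int \varphi \, \mathrm{d}\pi = \mathbb{E}\big[\varphi(Y) \mid Y \in \mathcal{A}\big] \quad \text{almost surely}$$
for any (say) bounded function φ. This approach has been developed, analyzed and tested in  in quite general and complex situations, demonstrating its efficiency over alternative methods. Therefore, a natural idea for the estimation of (1.1) would be the computation of
$$\frac{1}{M} \sum_{m=1}^{M} f\big(X^{(m)}, \mathbb{E}[R \mid X = X^{(m)}]\big),$$
emphasizing the need for approximating the quantity $\mathbb{E}[R \mid X = X^{(m)}]$.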
The inner Monte Carlo stage is used to approximate these conditional expectations at any previously sampled. A first idea is to replace by a Crude Monte Carlo sum computed with N draws:
This approach is referred to as the nested simulation method in  (with the difference that their are i.i.d. and not given by a Markov chain). This algorithm based on (1.2) is briefly presented and studied in Appendix A. Having a large N reduces the variance of this approximation (and thus ensures convergence as proved in Theorem 5) but it yields a prohibitive computational cost. Furthermore, this naive idea does not take into account cross-information between the different approximations at the points . Instead, we follow a non-parametric regression approach for the approximation of the function satisfying (almost surely): given L basis functions , we regress against the variables , where is sampled from the conditional distribution of R given . Note that this inner Monte Carlo stage only requires a single draw for each sample of the outer stage. Our discussion in Section 2.4 shows that the regression Monte Carlo method for the inner stage outperforms the crude Monte Carlo method as soon as the regression function can be well approximated by the basis functions (which is especially true when is smooth, with a degree of smoothness qualitatively higher than the dimension d, see details in Section 2.4).
The major difference with the standard setting for non-parametric regression  comes from the design, which is not an i.i.d. sample: the independence fails because is a Markov chain path, which is ergodic but not stationary in general.
A precise description of the algorithm is given in Section 2, with a discussion on implementation issues. We also provide some error estimates, in terms of the size M of the sample, and of the function space used for approximating the inner conditional expectation. Proofs are postponed to Section 4. Section 3 gathers some numerical experiments, in the field of financial and actuarial risks. We conclude in Section 5. Appendix A presents the analysis of a Monte Carlo scheme for computing (1.1), by using an MCMC scheme for the outer stage and a crude Monte Carlo scheme for the inner stage.
Numerical evaluation of nested conditional expectations arises in several fields. It appears naturally when solving dynamic programming equations for stochastic control and optimal stopping problems, see [24, 18, 8, 16, 2]; however, coupling these problems with rare events is usually not required by the problem at hand.
In financial and actuarial management , we often encounter nested conditional expectations, with an additional emphasis on estimation in the tails (as in (1.1)). A major application is the risk management of portfolios written with derivative options : regarding (1.1), R stands for the aggregated cashflows of derivatives at time , and Y for the underlying asset or financial variables at time . Then represents the portfolio value at T given a scenario Y, and the aim is to compute the extreme exposure (Value at Risk, Conditional VaR) of the portfolio. These computations are an essential concern for the Solvency Capital Requirement in insurance .
Literature background and our contributions.
In view of the aforementioned applications, it is natural to find most of the background results in relation to risk management in finance and insurance. As an alternative to the crude nested Monte Carlo methods (i.e. with an inner and an outer stage, both including sample Monte Carlo averages), several works have tried to speed up the algorithms, notably by using spatial approximation of the inner conditional expectation: we refer to  for kernel estimators, to  for kriging techniques, to  for least-squares regression methods. However, these works do not account for the outside conditional expectation given , i.e. the learning design is sampled from the distribution of Y and not from the conditional distribution of Y given . While the latter distribution distortion is presumably inessential in the computation of (1.1) in the case that is not rare, it certainly becomes a major flaw when because the estimator of is built using quite irrelevant data. We mention that the weighted regression method of  better accounts for extreme values of Y in the resolution of the least-squares regression, but still, the design remains sampled from the distribution of Y instead of the conditional distribution of Y given and therefore most of the samples are wasted.
In this work, we use least-squares regression methods to compute the function . Our results are derived under weaker conditions than what is usually assumed: contrary to , the basis functions are not necessarily orthonormalized and the design matrix is not necessarily invertible. Therefore we allow general basis functions and we avoid conditions on the underlying distribution. Furthermore, we do not restrict our convergence analysis to (large sample) but we also account for the approximation error (due to the function space). This allows a fine tuning of all parameters to achieve a tolerance on the global error. Finally, as a difference with the usual literature on non-parametric regression [14, 8], the learning sample is not an i.i.d. sample of the conditional distribution of Y given : the error analysis is significantly modified. Among the most relevant references in the case of non-i.i.d. learning sample, we refer to [1, 21, 5]. Namely, in , is autoregressive or β-mixing: as a difference with our setting, they assume that the learning sample is stationary and that the noise sequence (i.e. , ) is essentially i.i.d. (and independent of the learning sample). In , the authors relax the condition on the noise but they impose R to be bounded; the learning sample is still assumed to be stationary and β-mixing. In  the authors study kernel estimators for (instead of least-squares like we do), under the assumption that the noise is a martingale with uniform exponential moments (we only impose finite variance).
2 Algorithm and convergence results
Let be a -random vector; the distribution of X is the conditional distribution of Y given , with density μ with respect to a positive σ-finite measure λ on . For any Borel set A, we denote
is a Markov kernel, it is the conditional distribution of R given X. Let be the function from to , defined by
It satisfies, -almost surely,
For the regression step, choose L measurable functions , , such that
Denote by the vector space spanned by the functions , , and by the function from to collecting the basis functions :
By convention, vectors are column vectors. For a matrix A, denotes its transpose. We denote by the scalar product in , and we will use to denote both the Euclidean norm in and the absolute value. The identity matrix of size N is denoted by .
We adopt the short notation for the sequence .
In Algorithm 1, we provide a description of a Monte Carlo approximation of the unknown quantity (1.1). Note that as a byproduct, this algorithm also provides an approximation of the function given by (2.1).
Let be a Markov transition kernel on with unique invariant distribution .
Algorithm 1 (Full algorithm with M data, .)
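The overall procedure (MCMC design, one inner draw per design point, least-squares regression, final empirical average) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy model (a one-dimensional standard Gaussian Y, the rare set {y ≤ -3}, the conditional law R | X = x ~ N(x², 1)), the function f(x, r) = max(r - 9, 0), the polynomial basis and every name such as `algorithm1_sketch` are hypothetical choices made here. The outer stage uses a Gaussian autoregressive proposal that is reversible with respect to the standard Gaussian law.

```python
import numpy as np

def algorithm1_sketch(f, sample_r_given_x, basis, in_rare_set, x0, M, rho, rng):
    """Sketch: MCMC design + least-squares regression + empirical average."""
    # Outer stage: Markov chain whose invariant law is N(0, 1) restricted to the rare set.
    x = float(x0)
    X = np.empty(M)
    for m in range(M):
        z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        if in_rare_set(z):  # a candidate outside the rare set is rejected
            x = z
        X[m] = x
    # Inner stage: a single draw of R for each design point.
    R = sample_r_given_x(X, rng)
    # Regression: least-squares coefficients on the L basis functions.
    Phi = basis(X)  # design matrix of shape (M, L)
    alpha, *_ = np.linalg.lstsq(Phi, R, rcond=None)
    phi_hat = Phi @ alpha  # fitted regression function along the design
    # Final estimate: average of f(X, phi_hat(X)) along the chain.
    return float(np.mean(f(X, phi_hat))), alpha

# Hypothetical toy model: R | X = x ~ N(x^2, 1), so the regression function is x^2.
rng = np.random.default_rng(0)
estimate, alpha = algorithm1_sketch(
    f=lambda x, r: np.maximum(r - 9.0, 0.0),
    sample_r_given_x=lambda x, rng: x**2 + rng.standard_normal(x.size),
    basis=lambda x: np.column_stack([np.ones_like(x), x, x**2]),
    in_rare_set=lambda y: y <= -3.0,
    x0=-3.5, M=20000, rho=0.9, rng=rng,
)
print(estimate)
```

In this toy model the exact value is E[X² - 9 | X ≤ -3] = 3φ(3)/Φ(-3) - 8 ≈ 1.85 (with φ and Φ the standard Gaussian density and cdf), so the printed estimate should land close to that number.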
The optimization problem in Line 7 of Algorithm 1 is equivalent to finding a vector solving
There exists at least one solution, and the solution with minimal (Euclidean) norm is given by
where denotes the Moore–Penrose pseudo-inverse matrix; when the rank of is L, and in that case, equation (2.3) possesses a unique solution.
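The role of the pseudo-inverse can be checked numerically on a deliberately rank-deficient design. The sketch below is an illustration with hypothetical data and names (`Phi`, `alpha_pinv`, `alpha_other`): the third basis function duplicates the second, so the normal equations have infinitely many solutions; `numpy.linalg.pinv` returns the one with minimal Euclidean norm, and every solution produces the same fitted values along the design.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 200
x = rng.standard_normal(M)

# Redundant basis: the third function duplicates the second, so rank(Phi) = 2 < L = 3.
Phi = np.column_stack([np.ones(M), x, x])           # design matrix, shape (M, L)
R = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(M)    # noisy responses

# Minimal-norm least-squares solution via the Moore-Penrose pseudo-inverse.
alpha_pinv = np.linalg.pinv(Phi, rcond=1e-10) @ R
# lstsq also returns the minimal-norm solution when the design is rank deficient.
alpha_lstsq, *_ = np.linalg.lstsq(Phi, R, rcond=1e-10)

# Another solution of the normal equations: shift along the null space of Phi.
alpha_other = alpha_pinv + np.array([0.0, 1.0, -1.0])

print(np.linalg.norm(alpha_pinv), np.linalg.norm(alpha_other))
print(np.allclose(Phi @ alpha_pinv, Phi @ alpha_other))
```

The explicit `rcond=1e-10` is a design choice made here so that the numerically zero singular value is discarded regardless of the NumPy version's default cutoff.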
An example of efficient transition kernel is proposed in : this kernel, hereafter denoted by , can be read as a Hastings–Metropolis transition kernel targeting and with a proposal kernel with transition density q which is reversible with respect to μ, i.e. for all ,
An algorithmic description for sampling a path of length M of a Markov chain with transition kernel and with initial distribution ξ is given in Algorithm 2.
Algorithm 2 (MCMC for rare event: A Markov chain with kernel .)
When is a Gaussian distribution on restricted to , is a candidate with distribution satisfying (2.6); here, is a design parameter chosen by the user (see [11, Section 4] for a discussion on the choice of ρ). Other proposal kernels q satisfying (2.6) are given in [11, Section 3] in the non-Gaussian case.
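For the Gaussian case, a sampler of this type can be sketched as follows. The sketch assumes the autoregressive proposal z = ρx + sqrt(1 - ρ²)ξ with ξ standard Gaussian, which is a standard example of a proposal reversible with respect to the standard Gaussian law; with such a proposal the Hastings ratio reduces to the indicator that the candidate lies in the rare set. The function name, the rare set {y ≤ -3} and the parameter values are hypothetical.

```python
import numpy as np

def sample_restricted_gaussian(in_rare_set, x0, n_steps, rho, rng):
    """Metropolis chain targeting N(0, I_d) restricted to a rare set."""
    x = np.asarray(x0, dtype=float)
    if not in_rare_set(x):
        raise ValueError("the initial point must lie in the rare set")
    path = np.empty((n_steps, x.size))
    n_accepted = 0
    for m in range(n_steps):
        # Candidate z ~ N(rho * x, (1 - rho^2) I), reversible w.r.t. N(0, I).
        z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(x.size)
        # Accept if and only if the candidate stays in the rare set.
        if in_rare_set(z):
            x = z
            n_accepted += 1
        path[m] = x
    return path, n_accepted / n_steps

rng = np.random.default_rng(0)
threshold = -3.0  # hypothetical rare set {y <= -3}
path, acc_rate = sample_restricted_gaussian(
    lambda y: y[0] <= threshold, [-3.5], 20000, 0.9, rng
)
print(path.mean(), acc_rate)
```

The empirical mean of the path can be compared with the exact conditional mean -φ(3)/Φ(-3) ≈ -3.28, and `acc_rate` gives the mean acceptance rate used to tune ρ.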
More generally, building a transition kernel with invariant distribution is well known using Hastings–Metropolis schemes. Actually, there is no need to impose condition (2.6) about reversibility of q with respect to μ. Indeed, given an arbitrary transition density , it is sufficient to replace Lines 5–6 of Algorithm 2 by the following acceptance rule: if , accept with probability
In the subsequent numerical tests with Gaussian distribution restricted to , , we will make use of as a candidate for the transition density , where is a well-chosen point in . In that case, we easily check that the acceptance probability is given by
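When the proposal is not reversible with respect to μ, the acceptance rule is the usual Hastings–Metropolis ratio μ(z)q(z, x) / (μ(x)q(x, z)), truncated at 1. The sketch below implements one such step in log scale. The specific proposal, a Gaussian centred at ρx + (1 - ρ)x_A with variance 1 - ρ², is an assumption made here for illustration, as are the anchor point `x_A`, the rare set and all function names.

```python
import numpy as np

def mh_step(x, log_mu, sample_q, log_q, rng):
    """One Hastings-Metropolis step with a general (non-reversible) proposal."""
    z = sample_q(x, rng)
    # log of mu(z) q(z, x) / (mu(x) q(x, z)); equals -inf when z is outside the support.
    log_ratio = log_mu(z) + log_q(z, x) - log_mu(x) - log_q(x, z)
    if np.log(rng.uniform()) < log_ratio:
        return z, True
    return x, False

# Hypothetical example: N(0, 1) restricted to (-inf, -3], with a proposal centred
# between the current point and an anchor x_A chosen inside the rare set.
x_A, rho = -3.0, 0.5
log_mu = lambda y: -0.5 * y**2 if y <= -3.0 else -np.inf
sample_q = lambda x, rng: (rho * x + (1 - rho) * x_A
                           + np.sqrt(1 - rho**2) * rng.standard_normal())
# Log proposal density q(x, z) up to an additive constant that cancels in the ratio.
log_q = lambda x, z: -0.5 * (z - rho * x - (1 - rho) * x_A) ** 2 / (1 - rho**2)

rng = np.random.default_rng(0)
x, path, n_acc = -3.5, [], 0
for _ in range(20000):
    x, accepted = mh_step(x, log_mu, sample_q, log_q, rng)
    path.append(x)
    n_acc += accepted
path = np.array(path)
print(path.mean(), n_acc / path.size)
```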
2.2 Convergence results for the estimation of
Let be the set of measurable functions such that ; and define the norm
Let be the projection of on the linear span of the functions , with respect to the norm given by (2.9): , where solves
Assume that the following conditions hold:
the transition kernel and the initial distribution ξ satisfy: there exists a constant and a rate sequence such that for any ,(2.10)
the conditional distribution satisfies(2.11)
Let and be given by Algorithm 1. Then
See Section 4.1. ∎
Note that measures the mean squared error along the design sequence . The proof consists in decomposing this error into a variance term and a squared bias term:
on the right-hand side is the statistical error, decreasing as the size of the design M gets larger and increasing as the size of the approximation space L gets larger.
The quantity is the residual error under the best approximation of by the basis functions with respect to the -norm: it is naturally expected as the limit of when .
The term with describes how rapidly the Markov chain converges to its stationary distribution .
This theorem extends known results in the case of i.i.d. design , which is the major novelty of our contribution. The i.i.d. case is a special case of this general setting: it is retrieved by setting . Note that in that case, assumption (i) is satisfied with , and the upper bound in (2.12) coincides with classic results (see e.g. [14, Theorem 11.1]). The theorem covers the situation when the outer Monte Carlo stage relies on a Markov chain Monte Carlo sampler; we will discuss below how to check assumption (i) in practice.
The assumptions on the basis functions are weaker than what is usually assumed in the literature on nested simulation. Namely, as a difference with [4, Assumption A2] in the i.i.d. case, Theorem 1 holds even when the functions are not orthonormal in , and it holds without assuming that almost-surely, the rank of the matrix is L.
Assumption (ii) says that the conditional variance of R given X is uniformly bounded. This condition could be weakened and replaced by an ergodic condition on the Markov kernel implying that
We conclude this subsection by conditions on and implying the ergodicity assumption (2.10) with a geometric rate sequence
Assume that is phi-irreducible and there exists a measurable function such that
there exist and such that for any , ,
there exists such that the level set is 1-small: there exist and a probability distribution ν on (with ) such that for any , .
Then there exist and a finite constant such that for any measurable function , any and any ,
In addition, there exists a finite constant such that for any measurable function and any ,
Assume the following conditions:
For all , implies that .
There exists such that
There exist , a measurable function and a set such that
For some , the level set is such that
Then the assumptions of Proposition 2 are satisfied for the kernel .
See Section 4.2. ∎
When μ is a Gaussian density on restricted to and the proposal density is a Gaussian random variable with mean and covariance (with ), it is easily seen that conditions (i), (iii) and (iv) of Corollary 3 are satisfied (choose, e.g., , with ). Condition (ii) is problem specific since it depends on the geometry of .
2.3 Convergence results for the estimation of
Assume that the following conditions hold:
is globally Lipschitz in the second variable: there exists a finite constant such that for any ,
There exists a finite constant C such that for any M
See Section 4.3. ∎
2.4 Asymptotic optimal tuning of parameters
In this subsection, we discuss how to tune the parameters of the algorithm (i.e. M, and L), given a Markov kernel . To simplify the discussion, we assume from now on that
The above condition on ρ is not very demanding: see Proposition 2 where the convergence is geometric. Regarding the condition on , although not trivial, this assumption seems reasonable since is the best approximation of on the function basis with respect to the target measure : it means that first
and second is expected to converge to 0 in as the number L of basis functions increases. Besides, in the context of Proposition 2, the control of would follow from the control of , which is a delicate task because of the lack of knowledge on .
uniformly in the function basis. In other words, the mean empirical squared error is bounded by
as in the case of i.i.d. design (see [14, Theorem 11.1]).
There are many choices of function basis , but due to the lack of knowledge of the target measure and with a view to discussing convergence rates, it is relevant to adopt local approximation techniques, like piecewise polynomial partitioning estimates (i.e. local polynomials defined on a tensored grid); for a detailed presentation, see [12, Section 4.4]. Assume that the conditional expectation is smooth on , namely is continuously differentiable, with bounded derivatives, and the -th derivative is -Hölder continuous. Set . If is bounded, it is well known [14, Corollary 11.1 for ] that taking local polynomials of order on a tensored grid with edge length equal to ensures that both the statistical error and the approximation error have the same magnitude and we get
If is no longer bounded, under the additional assumption that has tails with exponential decay, it is enough to consider similar local polynomials but on a tensored grid truncated at distance ; this choice maintains the validity of estimate (2.13), up to logarithmic factors [12, Section 4.4], which we omit for the sake of simplicity.
Regarding the complexity (computational cost), the simulation cost (for ) is proportional to M, the computation of needs operations (taking advantage of the tensored grid), as well as the final evaluation of . Thus we have , with another constant. Finally, in view of Theorem 4, we derive
This is similar to the rate we would obtain in an i.i.d. setting. For very smooth (), we retrieve asymptotically the order of convergence. This global error may be compared to the situation where the inner conditional expectation is computed using a crude Monte Carlo method (using N samples of for each of the M samples ); this scheme is described and analyzed in Appendix A. Its computational cost is and its global error is if f is Lipschitz (resp. if f is smoother); thus we have (by taking resp. )
In the standard case of Lipschitz f, the regression-based Algorithm 1 converges faster than Algorithm 3 under the condition . In low dimension, this condition is easy to satisfy, but it becomes problematic as the dimension increases: this is the usual curse of dimensionality.
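The comparison of the two schemes can be made concrete with a back-of-the-envelope calculation. It rests on assumptions stated here rather than on the exact constants of the theorems: the regression scheme is taken to have cost of order M and global error of order $M^{-p/(2p+d)}$ (p being the smoothness index and d the dimension), and the crude nested scheme is taken to have cost MN with error of order $M^{-1/2} + N^{-1/2}$ for Lipschitz f and $M^{-1/2} + N^{-1}$ for smoother f. The symbols $\mathcal{C}$ (cost) and $\mathrm{Err}$ (global error) are notation introduced for this sketch, and logarithmic factors are ignored.

```latex
\begin{align*}
\text{regression:}\quad
  & \mathcal{C} \asymp M,
  && \mathrm{Err} \asymp M^{-p/(2p+d)} = \mathcal{C}^{-p/(2p+d)}, \\
\text{crude, $f$ Lipschitz, } N = M:\quad
  & \mathcal{C} = MN = M^{2},
  && \mathrm{Err} \asymp M^{-1/2} + N^{-1/2} \asymp \mathcal{C}^{-1/4}, \\
\text{crude, $f$ smoother, } N = \sqrt{M}:\quad
  & \mathcal{C} = MN = M^{3/2},
  && \mathrm{Err} \asymp M^{-1/2} + N^{-1} \asymp \mathcal{C}^{-1/3}.
\end{align*}
\[
\frac{p}{2p+d} \ge \frac{1}{4} \iff 4p \ge 2p + d \iff p \ge \frac{d}{2},
\qquad
\frac{p}{2p+d} \ge \frac{1}{3} \iff 3p \ge 2p + d \iff p \ge d .
\]
```

Under these assumptions, the regression scheme matches or beats the crude scheme for Lipschitz f once the smoothness index is at least half the dimension, and as p grows the exponent p/(2p+d) approaches 1/2.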
3 Application: Put options in a rare-event regime
The goal is to approximate the quantity
for various choices of h, where is a d-dimensional geometric Brownian motion, and is a rare event.
3.1 A toy example in dimension 1
We start with a toy example: in dimension , when and so that
is the Put payoff written on one stock with price , with strike K and maturity : this is a standard financial product used by asset managers to insure their portfolio against the decrease of stock price. We take the point of view of the seller of the contract, who is mostly concerned by large values of the Put price, i.e. he aims at valuing the excess of the Put price at time beyond the threshold , for stock value smaller than . We assume that evolves like a geometric Brownian motion, with volatility and zero drift. For the sake of simplicity, we assume that the interest rate is 0; extension to non-zero interest rate is obvious.
Upon noting that and , where are independent standard Gaussian variables and
and . In this example, and are explicit. We have indeed , where Φ denotes the cumulative distribution function (cdf) of a standard Gaussian distribution. Furthermore, , where
note that . The parameter values for the numerical tests are given in Table 1.
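Since the stock follows a geometric Brownian motion with zero drift and zero interest rate, the explicit inner conditional expectation is a put price of Black–Scholes type. The sketch below evaluates the standard zero-rate formula K Φ(-d₂) - s Φ(-d₁), with d₁ = (ln(s/K) + σ²τ/2)/(σ√τ) and d₂ = d₁ - σ√τ, where τ is the remaining time to maturity. The function names and the numerical values are hypothetical and are not the parameters of Table 1.

```python
from math import erf, log, sqrt

def norm_cdf(x):
    """Cumulative distribution function of a standard Gaussian variable."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def put_price_zero_rate(s, strike, sigma, tau):
    """Put price under geometric Brownian motion with zero rate and zero drift."""
    if tau <= 0.0:
        return max(strike - s, 0.0)  # payoff at maturity
    d1 = (log(s / strike) + 0.5 * sigma**2 * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return strike * norm_cdf(-d2) - s * norm_cdf(-d1)

# Hypothetical values: at-the-money put, 20% volatility, one year to maturity.
price = put_price_zero_rate(s=100.0, strike=100.0, sigma=0.2, tau=1.0)
print(price)
```

Evaluating this function at the stock value implied by a sampled design point gives the exact regression function against which a fitted approximation can be compared.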
We first illustrate the behavior of the kernel described by Algorithm 2. Since Y is a standard Gaussian random variable, we design as a Hastings–Metropolis sampler, with invariant distribution equal to a standard restricted to and with proposal distribution . Observe that this proposal kernel is reversible with respect to μ, see (2.6). Note that condition (ii) in Corollary 3 gets into
which holds true since . In the following, the performance of the kernel is compared to that of the kernel defined as a Hastings–Metropolis kernel with proposal and with invariant distribution a standard Gaussian random variable restricted to . As a main difference with , this proposal transition density q is not reversible with respect to μ (whence the notation for the kernel); therefore, the acceptance-rejection ratio of the new point z is given by (see equality (2.8))
In Figure 1 (bottom right), the true cdf of Y given (which is a density on ) is displayed on together with three empirical cdfs : the first one is computed from i.i.d. samples with distribution and the second one (resp. the third one) is computed from a Markov chain path of length M with kernel (resp. ) and started at . The two kernels provide a similar approximation of the true cdf. Here , and for both kernels. We also display the normalized histograms of the points sampled respectively from (top left), (top right) and the crude rejection algorithm with Gaussian proposal (bottom left). In the latter plot, the histogram is built with only around 50–60 points which correspond to the accepted points among proposal points.
To assess the speed of convergence of the samplers and to their stationary distributions, we additionally plot in Figure 2 the autocorrelation function for both chains. For the choice of ρ is quite significant, as observed in ; values of ρ around 0.9 give usually good results. For , in this example the choice of ρ is less significant because we are able to define a proposal which takes advantage of the knowledge on the rare set. A comparison of acceptance rates is provided below (see Figure 3 (left)).
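An empirical autocorrelation function of the kind plotted for the two chains can be computed along the following lines. The sketch is generic: the helper name `autocorrelation` is hypothetical, and the AR(1) sequence used to exercise it is a stand-in for an MCMC path, chosen because its theoretical autocorrelation at lag k is known to be a^k.

```python
import numpy as np

def autocorrelation(path, max_lag):
    """Empirical autocorrelation of a scalar path for lags 0..max_lag."""
    x = np.asarray(path, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    acf = np.ones(max_lag + 1)
    for k in range(1, max_lag + 1):
        acf[k] = np.dot(x[:-k], x[k:]) / denom
    return acf

# Stand-in for an MCMC path: stationary AR(1) sequence with coefficient a.
rng = np.random.default_rng(0)
a = 0.8
noise = rng.standard_normal(50000)
chain = np.empty(noise.size)
chain[0] = noise[0]
for t in range(1, chain.size):
    chain[t] = a * chain[t - 1] + np.sqrt(1.0 - a**2) * noise[t]

acf = autocorrelation(chain, 5)
print(acf)
```

A fast decay of this function indicates that the chain forgets its past quickly, which is the behavior one looks for when comparing values of ρ.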
We also illustrate the behavior of these two MCMC samplers for the estimation of the rare-event probability . Following the approach of , we use the decomposition
where , and is a Markov chain with kernel or having a standard Gaussian restricted to as invariant distribution. The J intermediate levels are chosen such that . Figure 3 (right) displays the boxplot of 100 independent realizations of the estimator for different values of ; the horizontal dotted line indicates the true value . Here , and . Figure 3 (left) displays the boxplot of 100 mean acceptance rates computed along 100 independent chains , for different values of ρ; the horizontal dotted line is set to 0.234 which is usually chosen as the target rate when fixing some design parameters in a Hastings–Metropolis algorithm (see e.g. ). We observe that the use of non-reversible proposal kernel yields more accurate results than ; this is intuitively easy to understand since better accounts for the point around which one should sample.
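The product decomposition of the rare-event probability over nested levels can be illustrated on a one-dimensional standard Gaussian. The sketch below assumes rare sets of the form {y ≤ level}, the first factor estimated from i.i.d. draws and each subsequent conditional probability estimated along a chain restricted to the previous level, with the reversible Gaussian proposal. The levels (chosen so that each conditional probability is close to 0.1), the starting point of each chain and the function name are hypothetical choices.

```python
import numpy as np
from math import erf, sqrt

def splitting_estimate(levels, M, rho, rng):
    """Estimate P(Y <= levels[-1]) for Y ~ N(0, 1), with levels[0] > ... > levels[-1]."""
    # First factor P(Y <= levels[0]) from plain i.i.d. draws.
    estimate = float(np.mean(rng.standard_normal(M) <= levels[0]))
    # Remaining factors P(Y <= nxt | Y <= prev), each from a chain restricted to {y <= prev}.
    for prev, nxt in zip(levels[:-1], levels[1:]):
        x = prev - 0.1  # arbitrary starting point inside the current level set
        hits = 0
        for _ in range(M):
            z = rho * x + sqrt(1.0 - rho**2) * rng.standard_normal()
            if z <= prev:
                x = z
            hits += int(x <= nxt)
        estimate *= hits / M
    return estimate

rng = np.random.default_rng(0)
levels = [-1.2816, -2.3263, -3.0902]  # approximate 10%, 1% and 0.1% Gaussian quantiles
estimate = splitting_estimate(levels, M=20000, rho=0.9, rng=rng)
true_value = 0.5 * (1.0 + erf(levels[-1] / sqrt(2.0)))
print(estimate, true_value)
```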
We now run Algorithm 1 for the estimation of the conditional expectation on . The algorithm is run with , successively with and both with ; the L basis functions are and we consider successively . In Figure 4 (right), the error function is displayed for different values of L when computing . It is displayed on the interval , which is an interval with probability larger than under the distribution of Y given (see Figure 1). Note that the errors may be quite large for x close to -5; however these values are very unlikely (see Figure 1), and therefore these large errors are not representative of the global quadratic error. In Figure 4 (left), we display 1000 sampled points of . These points are taken from the sampler , every twenty iterations, in order to obtain nearly uncorrelated design points. Observe that the regression function looks affine, which explains why the results with only are quite accurate.
We finally illustrate Algorithm 1 for the estimation of (see (3.1)). In Figure 5 (right), the boxplot of 100 independent outputs of Algorithm 1 is displayed when run with (top) and (bottom); different values of ρ and M are considered, namely and ; the regression step is performed with basis functions. Figure 5 (right) illustrates well the benefit of using MCMC sampler for the current regression problems: when , compare the distribution for (i.i.d. samples) and : observe the bias when which does not disappear even when and note that the variance is very significantly reduced (when respectively, the standard deviation is reduced by a factor 1.11, 6.58 and 11.96).
Figure 5 (left) is an empirical verification of the statement of Theorem 1. One hundred independent runs of Algorithm 1 are performed, and for different values of M, the quantities are collected; here is computed with basis functions. The mean value over these 100 points is displayed as a function of M; it is a Monte Carlo approximation of (see (2.12)). We compare two implementations of Algorithm 1: first, with and then with . Theorem 1 establishes that is upper bounded by a quantity of the form ; such a curve is fitted by a least-squares technique (we obtain for both kernels, which is consistent with the theorem since this term does not depend on the Monte Carlo stages). The fitted curves are shown in Figure 5 (left) and they demonstrate a good match between the theory and the numerical studies.
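A curve fit of this kind can be done with ordinary least squares. The sketch below assumes that the fitted family is a + b/M, with a playing the role of the M-independent approximation error and b/M the statistical error; this functional form, the helper name and the synthetic data are assumptions made here for illustration and are not the values reported in the experiments.

```python
import numpy as np

def fit_a_plus_b_over_m(m_values, errors):
    """Least-squares fit of errors ~ a + b / M; returns (a, b)."""
    m_values = np.asarray(m_values, dtype=float)
    design = np.column_stack([np.ones_like(m_values), 1.0 / m_values])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(errors, dtype=float), rcond=None)
    return float(coeffs[0]), float(coeffs[1])

# Synthetic check: data generated exactly from a = 0.02 and b = 5.
m_grid = np.array([500.0, 1000.0, 2000.0, 5000.0, 10000.0])
synthetic_errors = 0.02 + 5.0 / m_grid
a_hat, b_hat = fit_a_plus_b_over_m(m_grid, synthetic_errors)
print(a_hat, b_hat)
```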
3.2 Correlated geometric Brownian motions in dimension 2
We adapt the one-dimensional example, taking a Put on the geometric average of two correlated assets . In this example, , and . We denote by , and ϱ, respectively, each volatility and the correlation; the drift of is zero. Set
We have , where . Furthermore, it is easy to verify that is still a geometric Brownian motion, with volatility and drift given by
where is independent of Y, and .
For the outer Monte Carlo stage, is defined as the Hastings–Metropolis kernel with proposal distribution (with ) and with invariant distribution, a bi-dimensional Gaussian distribution restricted to the set . We compare this Markov kernel to the kernel with non-reversible proposal, defined as a Hastings–Metropolis with proposal distribution and with invariant distribution, a bi-dimensional Gaussian distribution restricted to the set . The acceptance-rejection ratio for this algorithm is given by (2.8) with and .
In this example, the inner conditional expectation is explicit: with
For the basis functions, we take
The parameter values for the numerical tests are given in Table 2.
Figure 6 depicts the rare event : on the left (resp. on the right), some level curves of the distribution of (resp. distribution of ) are displayed, together with the rare event in the bottom left corner.
We run two Markov chains respectively with kernel and and compute the mean acceptance-rejection rate after iterations. For different values of ρ, this experiment is repeated 100 times, independently; Figure 7 reports the boxplot of these mean acceptance rates. It shows that a rate close to 0.234 is reached with for and for . In all the experiments below involving these kernels, we will use these values of the design parameter ρ.
In Figure 8 (left), the normalized histogram of the errors is displayed when and the samples are sampled from (left) or (right). Figure 8 (right) shows the case . Here, . This clearly shows an improvement by choosing more basis functions. Especially, the sixth basis function brings much accuracy, as expected, since the regression function depends directly on it.
In Figure 9 (left), the errors are displayed on when and the outer samples used in the computation of are sampled from (left) and (right). Figure 9 (right) shows the case . Here, . This is complementary to Figure 8 since it shows the prediction error everywhere in the space, and not only along the design points.
In Figure 10 (left), a Monte Carlo approximation of (see (2.12)) computed from 100 independent estimators is displayed as a function of M for M in the range ; where is computed with . We also fit a curve of the form to illustrate the sharpness of the upper bound in (2.12). In Figure 10 (right), the boxplot of 100 independent outputs of Algorithm 1 is displayed, for and different values of (resp ) – the design parameter in (resp. ). Here again, we observe the advantage of using MCMC samplers to reduce the variance in this regression problem coupled with rare-event regime: when respectively, the standard deviation is reduced by a factor 6.89, 7.27 and 7.74.
4 Proofs of the results of Section 2.2
4.1 Proof of Theorem 1
By the construction of the random variables and (see Algorithm 1), for any bounded and positive measurable functions , it holds
If , then . In other words, any coefficient solution α of the least-squares problem (2.3) yields the same values for the approximated regression function along the design .
Denote by r the rank of and write for the singular value decomposition of . It holds
by using and . This implies that the first r components of and are equal and thus . This concludes the proof. ∎
The next result justifies a possible interchange between least-squares projection and conditional expectation.
Set , where is any solution to . Then the function
is a solution to the least-squares problem
It is sufficient to prove that
where . The solution of the above least-squares problem is of the form
where satisfies . By (4.1) and the definition of , this yields
We then conclude by Lemma 1 that . We are done. ∎
Proof of Theorem 1.
Using Lemma 2 and the Pythagoras theorem, it holds
This concludes the control of .
Using again Lemma 2,
4.2 Proof of Corollary 3
Note that is a Hastings–Metropolis kernel; hence, for any and any measurable set A in ,
Let A be a measurable subset of such that (which implies that ). Then
and the right-hand side is positive since, owing to assumption (i),
This implies that is phi-irreducible with as irreducibility measure.
which implies by (iii) that
Small set assumption.
Let be given by assumption (iv). We have ; thus define the probability measure
Then for any and any measurable subset A of , it readily follows from (4.3) that
Thanks to the lower bounds of (iv), we complete the proof.
4.3 Proof of Theorem 4
We write with
For the first term, we have
The second term is controlled by assumption (ii). We then conclude by the Minkowski inequality.
5 Conclusion
We have designed a new methodology to compute nested expectations in the rare-event regime. The outer expectation is evaluated using an ergodic Markov chain restricted to the rare set, whereas the inner expectation is approximated using a linear regression method with general basis functions. We quantified the error bounds as a function of the number of outer samples and of the size of the basis. This highlights that, in the regression scheme, replacing the usual i.i.d. design by an ergodic Markov chain design does not significantly alter the statistical errors.
When the inner expectation is alternatively computed pointwise with i.i.d. samples, we also provided error bounds, which show that this approach for the inner expectation is more suitable than the regression method in the case of high-dimensional problems (curse of dimensionality).
In our numerical tests, we illustrated how to choose appropriately the parameters of the ergodic Markov chain so that the mean acceptance rate for staying in the rare set is about 20–30 %. It usually ensures low variance in the full scheme.
Funding source: Agence Nationale de la Recherche
Award Identifier / Grant number: ANR-15-CE05-0024
Funding statement: The second author’s research is part of the Chair Financial Risks of the Risk Foundation, the Finance for Energy Market Research Centre and the ANR project CAESARS (ANR-15-CE05-0024).
A Algorithm where the inner stage uses a crude Monte Carlo method and the outer stage uses MCMC sampling
Algorithm 3 (Full algorithm with M outer samples, and N inner samples for each outer one.)
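A scheme of this type, with MCMC for the outer stage and a plain average of N inner draws at each outer sample, can be sketched as follows. As before, this is an illustration under assumptions made here, not the authors' code: the toy model (standard Gaussian Y, rare set {y ≤ -3}, R | X = x ~ N(x², 1)), the function f(x, r) = max(r - 9, 0) and the name `algorithm3_sketch` are hypothetical.

```python
import numpy as np

def algorithm3_sketch(f, sample_r_given_x, in_rare_set, x0, M, N, rho, rng):
    """Sketch: MCMC outer stage, crude Monte Carlo inner stage with N draws each."""
    # Outer stage: chain with invariant law N(0, 1) restricted to the rare set.
    x = float(x0)
    X = np.empty(M)
    for m in range(M):
        z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        if in_rare_set(z):
            x = z
        X[m] = x
    # Inner stage: N conditionally independent draws of R at every outer sample.
    inner = sample_r_given_x(X, N, rng)   # shape (M, N)
    inner_mean = inner.mean(axis=1)       # crude estimate of E[R | X = X^(m)]
    # Final estimate; the total simulation cost is of order M * N.
    return float(np.mean(f(X, inner_mean)))

rng = np.random.default_rng(0)
estimate = algorithm3_sketch(
    f=lambda x, r: np.maximum(r - 9.0, 0.0),
    sample_r_given_x=lambda x, N, rng: x[:, None] ** 2 + rng.standard_normal((x.size, N)),
    in_rare_set=lambda y: y <= -3.0,
    x0=-3.5, M=8000, N=50, rho=0.9, rng=rng,
)
print(estimate)
```

In this toy model the exact value is again approximately 1.85. Unlike the regression scheme, each outer point is treated in isolation, so no information is shared between neighbouring design points.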
A.2 Convergence results for the estimation of
We extend Theorem 1 to this new setting. Actually when the function f in (1.1) is smoother than Lipschitz continuous, we can prove that the impact of N on the quadratic error is instead of the usual . This kind of improvement has been derived by  in the i.i.d. setting (for both the inner and outer stages).
Assume that the following conditions hold:
The (second and) fourth conditional moments of are bounded: for and , we have
There exists a finite constant C such that for any M
If is globally -Lipschitz in the second variable, then
where and are respectively given by (1.1) and Algorithm 3. If f is continuously differentiable in the second variable, with a derivative in the second variable which is bounded and globally -Lipschitz, then
A.3 Proof of Theorem 5
First case: f Lipschitz.
We write with
For the first term, since f is globally Lipschitz with constant , we have
Since are independent conditionally on , we have
The second term is controlled by assumption (ii). We are done.
Second case: f smooth.
A Taylor expansion gives
Invoking that are independent conditionally on leads to
Moreover, upon noting that for ,
This concludes the proof. ∎
 Baraud Y., Comte F. and Viennet G., Adaptive estimation in autoregression or β-mixing regression via model selection, Ann. Statist. 29 (2001), no. 3, 839–875. doi:10.1214/aos/1009210692
 Belomestny D., Kolodko A. and Schoenmakers J., Regression methods for stochastic control problems and their convergence analysis, SIAM J. Control Optim. 48 (2010), no. 5, 3562–3588. doi:10.1137/090752651
 Blanchet J. and Lam H., State-dependent importance sampling for rare event simulation: An overview and recent advances, Surv. Oper. Res. Manag. Sci. 17 (2012), 38–59. doi:10.1016/j.sorms.2011.09.002
 Devineau L. and Loisel S., Construction d’un algorithme d’accélération de la méthode des “simulations dans les simulations” pour le calcul du capital économique solvabilité ii, Bull. Français d’Actuariat 10 (2009), no. 17, 188–221.
 Douc R., Fort G., Moulines E. and Soulier P., Practical drift conditions for subgeometric rates of convergence, Ann. Appl. Probab. 14 (2004), no. 3, 1353–1377. doi:10.1214/105051604000000323
 Fort G. and Moulines E., Convergence of the Monte Carlo expectation maximization for curved exponential families, Ann. Statist. 31 (2003), no. 4, 1220–1259. doi:10.1214/aos/1059655912
 Gobet E. and Turkedjiev P., Linear regression MDP scheme for discrete backward stochastic differential equations under general conditions, Math. Comp. 299 (2016), no. 85, 1359–1391. doi:10.1090/mcom/3013
 Hong L. J. and Juneja S., Estimating the mean of a non-linear function of conditional expectation, Proceedings of the 2009 Winter Simulation Conference (WSC), IEEE Press, Piscataway (2009), 1223–1236. doi:10.1109/WSC.2009.5429428
 Lemor J-P., Gobet E. and Warin X., Rate of convergence of an empirical regression method for solving generalized backward stochastic differential equations, Bernoulli 12 (2006), no. 5, 889–916. doi:10.3150/bj/1161614951
 McNeil A. J., Frey R. and Embrechts P., Quantitative Risk Management, Princeton Ser. Finance, Princeton University Press, Princeton, 2005.
 Ren Q. and Mojirsheibani M., A note on nonparametric regression with β-mixing sequences, Comm. Statist. Theory Methods 39 (2010), no. 12, 2280–2287. doi:10.1080/03610920903039480
 Rosenthal J. S., Optimal Proposal Distributions and Adaptive MCMC, Chapman & Hall/CRC Handb. Mod. Stat. Methods, CRC Press, Boca Raton, 2008.
 Tsitsiklis J. N. and Van Roy B., Regression methods for pricing complex American-style options, IEEE Trans. Neural Netw. 12 (2001), no. 4, 694–703. doi:10.1109/72.935083
© 2017 by De Gruyter