# MCMC design-based non-parametric regression for rare event. Application to nested risk computations

Gersende Fort, Emmanuel Gobet and Eric Moulines

## Abstract

We design and analyze an algorithm for estimating the mean of a function of a conditional expectation when the outer expectation is related to a rare event. The outer expectation is evaluated through the average along the path of an ergodic Markov chain generated by a Markov chain Monte Carlo sampler. The inner conditional expectation is computed as a non-parametric regression, using a least-squares method with a general function basis and a design given by the sampled Markov chain. We establish non-asymptotic bounds for the L2-empirical risks associated to this least-squares regression; this generalizes the error bounds usually obtained in the case of i.i.d. observations. Global error bounds are also derived for the nested expectation problem. Numerical results in the context of financial risk computations illustrate the performance of the algorithms.

MSC 2010: 65C40; 62G08; 37M25

## 1 Introduction

### Statement of the problem.

We consider the problem of estimating the mean of a function of a conditional expectation in a rare-event regime, using Monte Carlo simulations. More precisely, the quantity of interest is

$$\mathcal{I} := \mathbb{E}\big[f(Y,\mathbb{E}[R\mid Y])\,\big|\,Y\in\mathcal{A}\big], \tag{1.1}$$

where $R$ and $Y$ are vector-valued random variables, and $\mathcal{A}$ is a so-called rare subset, i.e. $\mathbb{P}(Y\in\mathcal{A})$ is small. This is a problem of nested Monte Carlo computations with a special emphasis on the distribution tails. In the evaluation of (1.1), which is equivalent to

$$\mathbb{E}\big[f(X,\mathbb{E}[R\mid X])\big],$$

where the distribution of $X$ is the conditional distribution of $Y$ given $\{Y\in\mathcal{A}\}$, there are two intertwined issues, which we now explain to emphasize our contributions.

The outer Monte Carlo stage samples distributions restricted to $\{Y\in\mathcal{A}\}$. A naive acceptance-rejection on $Y$ is inefficient because most simulations of $Y$ are wasted. Therefore, specific rare-event techniques have to be used. Importance sampling is one of these methods (see e.g. [23, 3]), which can be efficient in small dimension (10 to 100) but fails to deal with larger dimensions. In addition, this approach relies heavily on particular types of models for $Y$ and on suitable information about the problem at hand.

Another option consists in using Markov Chain Monte Carlo (MCMC) methods. Such methods amount to constructing a Markov chain $(X^{(m)})_{m\geq 0}$ that possesses a unique stationary distribution $\pi$ equal to the conditional distribution of $Y$ given the event $\{Y\in\mathcal{A}\}$. In such a case, for $\pi$-almost every initial condition $X^{(0)}=x$, the Birkhoff ergodic theorem shows that

$$\lim_{M\to+\infty}\frac{1}{M}\sum_{m=1}^{M}\varphi(X^{(m)}) = \mathbb{E}[\varphi(Y)\mid Y\in\mathcal{A}]\quad\text{a.s.}$$

for any (say) bounded function $\varphi$. This approach has been developed, analyzed and tested in [11] in quite general and complex situations, demonstrating its efficiency over alternative methods. Therefore, a natural idea for the estimation of (1.1) would be the computation of

$$\frac{1}{M}\sum_{m=1}^{M} f\big(X^{(m)},\mathbb{E}[R\mid X^{(m)}]\big),$$

emphasizing the need for approximating the quantity $\mathbb{E}[R\mid X^{(m)}]$.

The inner Monte Carlo stage is used to approximate these conditional expectations at any $X^{(m)}$ previously sampled. A first idea is to replace $\mathbb{E}[R\mid X^{(m)}]$ by a crude Monte Carlo sum computed with $N$ draws:

$$\mathbb{E}[R\mid X^{(m)}]\approx\frac{1}{N}\sum_{k=1}^{N}R^{(m,k)}. \tag{1.2}$$

This approach is referred to as the nested simulation method in [4] (with the difference that their $X^{(m)}$ are i.i.d. and not given by a Markov chain). The algorithm based on (1.2) is briefly presented and studied in Appendix A. Having a large $N$ reduces the variance of this approximation (and thus ensures convergence, as proved in Theorem 5), but it yields a prohibitive computational cost. Furthermore, this naive idea does not take into account cross-information between the different approximations at the points $\{X^{(m)}: m=1,\dots,M\}$. Instead, we follow a non-parametric regression approach for the approximation of the function $\phi_\star$ satisfying $\phi_\star(X)=\mathbb{E}[R\mid X]$ (almost surely): given $L$ basis functions $\phi_1,\dots,\phi_L$, we regress $\{R^{(m)}: m=1,\dots,M\}$ against the variables $\{\phi_1(X^{(m)}),\dots,\phi_L(X^{(m)}): m=1,\dots,M\}$, where $R^{(m)}$ is sampled from the conditional distribution of $R$ given $\{X=X^{(m)}\}$. Note that this inner Monte Carlo stage only requires a single draw $R^{(m)}$ for each sample $X^{(m)}$ of the outer stage. Our discussion in Section 2.4 shows that the regression Monte Carlo method for the inner stage outperforms the crude Monte Carlo method as soon as the regression function can be well approximated by the basis functions (which is especially true when $\phi_\star$ is smooth, with a degree of smoothness qualitatively higher than the dimension $d$; see details in Section 2.4).

The major difference with the standard setting for non-parametric regression [14] comes from the design $\{X^{(m)}: m=1,\dots,M\}$, which is not an i.i.d. sample: the independence fails because $\{X^{(m)}: m=1,\dots,M\}$ is a Markov chain path, which is ergodic but not stationary in general.

A precise description of the algorithm is given in Section 2, with a discussion on implementation issues. We also provide some error estimates, in terms of the size M of the sample, and of the function space used for approximating the inner conditional expectation. Proofs are postponed to Section 4. Section 3 gathers some numerical experiments, in the field of financial and actuarial risks. We conclude in Section 5. Appendix A presents the analysis of a Monte Carlo scheme for computing (1.1), by using an MCMC scheme for the outer stage and a crude Monte Carlo scheme for the inner stage.

### Applications.

Numerical evaluation of nested conditional expectations arises in several fields. It appears naturally when solving dynamic programming equations for stochastic control and optimal stopping problems, see [24, 18, 8, 16, 2]; however, these problems usually do not need to be coupled with rare events.

In financial and actuarial management [19], nested conditional expectations arise frequently, with the additional need to estimate them in the tails (as in (1.1)). A major application is the risk management of portfolios of derivative options [13]: regarding (1.1), $R$ stands for the aggregated cashflows of derivatives at time $T'$, and $Y$ for the underlying asset or financial variables at time $T<T'$. Then $\mathbb{E}[R\mid Y]$ represents the portfolio value at $T$ given a scenario $Y$, and the aim is to compute the extreme exposure (Value at Risk, Conditional VaR) of the portfolio. These computations are an essential concern for the Solvency Capital Requirement in insurance [6].

### Literature background and our contributions.

In view of the aforementioned applications, it is natural to find most of the background results in relation to risk management in finance and insurance. As alternatives to the crude nested Monte Carlo methods (i.e. with an inner and an outer stage, both including sample Monte Carlo averages), several works have tried to speed up the algorithms, notably by using spatial approximation of the inner conditional expectation: we refer to [15] for kernel estimators, to [17] for kriging techniques, and to [4] for least-squares regression methods. However, these works do not account for the outer conditioning on $Y\in\mathcal{A}$, i.e. the learning design is sampled from the distribution of $Y$ and not from the conditional distribution of $Y$ given $\{Y\in\mathcal{A}\}$. While this distortion of the distribution is presumably inessential in the computation of (1.1) when $\mathcal{A}$ is not rare, it certainly becomes a major flaw when $\mathbb{P}(Y\in\mathcal{A})\ll 1$, because the estimator of $\mathbb{E}[R\mid Y]$ is then built from largely irrelevant data. We mention that the weighted regression method of [4] better accounts for extreme values of $Y$ in the resolution of the least-squares regression, but still, the design remains sampled from the distribution of $Y$ instead of the conditional distribution of $Y$ given $\{Y\in\mathcal{A}\}$, and therefore most of the samples are wasted.

In this work, we use least-squares regression methods to compute the function $\phi_\star$. Our results are derived under weaker conditions than what is usually assumed: contrary to [4], the basis functions $\phi_1,\dots,\phi_L$ are not necessarily orthonormalized and the design matrix is not necessarily invertible. Therefore we allow general basis functions and we avoid conditions on the underlying distribution. Furthermore, we do not restrict our convergence analysis to $M\to\infty$ (large sample), but we also account for the approximation error (due to the function space). This allows a fine tuning of all parameters to achieve a tolerance on the global error. Finally, in contrast with the usual literature on non-parametric regression [14, 8], the learning sample $(X^{(m)})_{1\leq m\leq M}$ is not an i.i.d. sample of the conditional distribution of $Y$ given $\{Y\in\mathcal{A}\}$: the error analysis is significantly modified. Among the most relevant references in the case of a non-i.i.d. learning sample, we refer to [1, 21, 5]. Namely, in [1], $(X^{(m)})_{1\leq m\leq M}$ is autoregressive or $\beta$-mixing: in contrast with our setting, they assume that the learning sample $(X^{(1)},\dots,X^{(M)})$ is stationary and that the noise sequence (i.e. $R^{(m)}-\phi_\star(X^{(m)})$, $m\geq 1$) is essentially i.i.d. (and independent of the learning sample). In [21], the authors relax the condition on the noise but they impose $R$ to be bounded; the learning sample is still assumed to be stationary and $\beta$-mixing. In [5], the authors study kernel estimators for $\phi_\star$ (instead of least-squares estimators as we do), under the assumption that the noise is a martingale with uniform exponential moments (we only impose finite variance).

## 2 Algorithm and convergence results

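Let $(X,R)$ be an $\mathbb{R}^d\times\mathbb{R}$-valued random vector; the distribution of $X$ is the conditional distribution of $Y$ given $\{Y\in\mathcal{A}\}$, with density $\mu$ with respect to a positive $\sigma$-finite measure $\lambda$ on $\mathbb{R}^d$. For any Borel set $A$, we denote

$$\mathsf{Q}(x,A) := \mathbb{E}[\mathbf{1}_A(R)\mid X=x];$$

$\mathsf{Q}$ is a Markov kernel: it is the conditional distribution of $R$ given $X$. Let $\phi_\star$ be the function from $\mathbb{R}^d$ to $\mathbb{R}$ defined by

$$\phi_\star(x) := \int r\,\mathsf{Q}(x,\mathrm{d}r). \tag{2.1}$$

It satisfies, $\mu\,\mathrm{d}\lambda$-almost surely,

$$\phi_\star(X)=\mathbb{E}[R\mid X]$$

when $X\sim\mu\,\mathrm{d}\lambda$.

For the regression step, choose $L$ measurable functions $\phi_\ell:\mathbb{R}^d\to\mathbb{R}$, $\ell\in\{1,\dots,L\}$, such that

$$\int\phi_\ell^2(x)\,\mu(x)\,\mathrm{d}\lambda(x)<\infty.$$

Denote by $\mathcal{F}$ the vector space spanned by the functions $\phi_\ell$, $\ell\in\{1,\dots,L\}$, and by $\underline{\phi}$ the function from $\mathbb{R}^d$ to $\mathbb{R}^L$ collecting the basis functions $\phi_\ell$:

$$\underline{\phi}(x) := \begin{bmatrix}\phi_1(x)\\ \vdots\\ \phi_L(x)\end{bmatrix}.$$

By convention, vectors are column vectors. For a matrix $A$, $A^{\top}$ denotes its transpose. We denote by $\langle\cdot\,;\cdot\rangle$ the scalar product in $\mathbb{R}^p$, and we will use $|\cdot|$ to denote both the Euclidean norm in $\mathbb{R}^p$ and the absolute value. The identity matrix of size $N$ is denoted by $I_N$.

We adopt the short notation $X^{(1:M)}$ for the sequence $(X^{(1)},\dots,X^{(M)})$.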

### 2.1 Algorithm

In Algorithm 1, we provide a description of a Monte Carlo approximation of the unknown quantity (1.1). Note that, as a byproduct, this algorithm also provides an approximation $\widehat{\phi}_M$ of the function $\phi_\star$ given by (2.1).

Let $\mathsf{P}$ be a Markov transition kernel on $\mathcal{A}$ with unique invariant distribution $\mu\,\mathrm{d}\lambda$.


### Algorithm 1 (Full algorithm with M data, M≥L.)

(2.2)

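The optimization problem in Line 7 of Algorithm 1 is equivalent to finding a vector $\alpha\in\mathbb{R}^L$ solving

$$\mathbf{A}^{\top}\mathbf{A}\,\alpha=\mathbf{A}^{\top}\underline{\mathbf{R}}, \tag{2.3}$$

where

$$\underline{\mathbf{R}} := \begin{bmatrix}R^{(1)}\\ \vdots\\ R^{(M)}\end{bmatrix},\qquad \mathbf{A} := \begin{bmatrix}\phi_1(X^{(1)}) & \cdots & \phi_L(X^{(1)})\\ \vdots & & \vdots\\ \phi_1(X^{(M)}) & \cdots & \phi_L(X^{(M)})\end{bmatrix}. \tag{2.4}$$

There exists at least one solution, and the solution with minimal (Euclidean) norm is given by

$$\widehat{\alpha}_M := (\mathbf{A}^{\top}\mathbf{A})^{\#}\mathbf{A}^{\top}\underline{\mathbf{R}}, \tag{2.5}$$

where $(\mathbf{A}^{\top}\mathbf{A})^{\#}$ denotes the Moore–Penrose pseudo-inverse matrix; $(\mathbf{A}^{\top}\mathbf{A})^{\#}=(\mathbf{A}^{\top}\mathbf{A})^{-1}$ when the rank of $\mathbf{A}$ is $L$, and in that case equation (2.3) possesses a unique solution.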
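The regression step (2.3)–(2.5) and the final averaging can be illustrated by the following minimal NumPy sketch; it is not the authors' implementation. The function name `regress_and_average`, its arguments and the tolerance `rcond=1e-10` are assumptions introduced here. The design `X` is assumed to have been sampled beforehand (for instance by an MCMC sampler), together with a single draw `R[m]` per design point, and the output average is assumed to take the form $M^{-1}\sum_{m} f(X^{(m)},\widehat{\phi}_M(X^{(m)}))$, in line with the estimator discussed in the introduction. `numpy.linalg.lstsq` returns the minimum-norm least-squares solution, so it agrees with (2.5) even when $\mathbf{A}$ is rank deficient.

```python
import numpy as np

def regress_and_average(X, R, basis, f, rcond=1e-10):
    """Least-squares regression on a given design, then outer average.

    X     : array of shape (M,) or (M, d), design points (e.g. an MCMC path)
    R     : array of shape (M,), one inner draw R[m] per design point X[m]
    basis : list of L callables, each mapping the design array to shape (M,)
    f     : vectorized callable f(X, r)
    """
    X = np.asarray(X, dtype=float)
    R = np.asarray(R, dtype=float)
    # Design matrix A of (2.4): row m holds the basis functions at X[m].
    A = np.column_stack([b(X) for b in basis])
    # Minimum-norm solution of the normal equations (2.3), as in (2.5).
    alpha, *_ = np.linalg.lstsq(A, R, rcond=rcond)
    phi_hat = A @ alpha                     # fitted values at the design points
    I_hat = float(np.mean(f(X, phi_hat)))   # outer empirical average
    return alpha, I_hat

# Toy check: noiseless affine response, with a deliberately redundant basis.
X = np.linspace(-5.0, -4.0, 50)
R = 2.0 + 3.0 * X
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x]
alpha, I_hat = regress_and_average(X, R, basis, lambda x, r: r)
```

In this toy check the second and third basis functions coincide, so $\mathbf{A}$ has rank 2; the minimum-norm solution splits the slope equally between the two redundant coefficients.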

An example of an efficient transition kernel $\mathsf{P}$ is proposed in [11]: this kernel, hereafter denoted by $\mathsf{P}_{\mathrm{GL}}$, can be read as a Hastings–Metropolis transition kernel targeting $\mu\,\mathrm{d}\lambda$, with a proposal kernel with transition density $q$ which is reversible with respect to $\mu$, i.e. for all $x,z\in\mathcal{A}$,

$$\mu(x)\,q(x,z)=q(z,x)\,\mu(z). \tag{2.6}$$

An algorithmic description for sampling a path of length $M$ of a Markov chain with transition kernel $\mathsf{P}_{\mathrm{GL}}$ and with initial distribution $\xi$ is given in Algorithm 2.


### Algorithm 2 (MCMC for rare event: a Markov chain with kernel $\mathsf{P}_{\mathrm{GL}}$.)

(2.7)

When $\mu\,\mathrm{d}\lambda$ is a Gaussian distribution $\mathcal{N}_d(0,\Sigma)$ on $\mathbb{R}^d$ restricted to $\mathcal{A}$, $\widetilde{X}\sim\mathcal{N}_d(\rho x,(1-\rho^2)\Sigma)$ is a candidate with distribution $z\mapsto q(x,z)$ satisfying (2.6); here, $\rho\in[0,1)$ is a design parameter chosen by the user (see [11, Section 4] for a discussion on the choice of $\rho$). Other proposal kernels $q$ satisfying (2.6) are given in [11, Section 3] in the non-Gaussian case.
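A sampler of this type can be sketched in a few lines for a one-dimensional standard Gaussian target. This is an illustrative sketch under stated assumptions, not a transcription of Algorithm 2: the function name `sample_chain_gl`, its signature and the example rare set are hypothetical, and the acceptance rule (accept the candidate whenever it falls in the rare set, otherwise repeat the current state) is the one implied by the reversibility condition (2.6).

```python
import math
import random

def sample_chain_gl(M, rho, in_rare_set, x0, rng):
    """Path of length M for a 1-D standard Gaussian restricted to a rare set.

    Proposal: x_tilde ~ N(rho * x, 1 - rho**2), reversible w.r.t. N(0, 1),
    so a candidate is accepted as soon as it falls in the rare set.
    """
    path = []
    x = x0
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(M):
        x_tilde = rho * x + scale * rng.gauss(0.0, 1.0)
        if in_rare_set(x_tilde):   # accept: the candidate lies in the rare set
            x = x_tilde
        path.append(x)             # on rejection the chain repeats its state
    return path

# Hypothetical usage: rare set {y <= -3}, chain started on its boundary.
rng = random.Random(0)
path = sample_chain_gl(10_000, 0.85, lambda y: y <= -3.0, -3.0, rng)
```

The rare set is passed as a predicate so that the same loop can be reused for other geometries; only the proposal would need to change in the non-Gaussian case.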

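More generally, building a transition kernel $\mathsf{P}$ with invariant distribution $\mu\,\mathrm{d}\lambda$ is well known using Hastings–Metropolis schemes. Actually, there is no need to impose condition (2.6) about reversibility of $q$ with respect to $\mu$. Indeed, given an arbitrary transition density $q(\cdot,\cdot)$, it is sufficient to replace Lines 5–6 of Algorithm 2 by the following acceptance rule: if $\widetilde{X}^{(m)}\in\mathcal{A}$, accept $\widetilde{X}^{(m)}$ with probability

$$\alpha_{\mathrm{accept}}\big(X^{(m-1)},\widetilde{X}^{(m)}\big) := 1\wedge\left[\frac{\mu(\widetilde{X}^{(m)})\,q(\widetilde{X}^{(m)},X^{(m-1)})}{\mu(X^{(m-1)})\,q(X^{(m-1)},\widetilde{X}^{(m)})}\right].$$

In the subsequent numerical tests with a Gaussian distribution restricted to $\mathcal{A}$, $\mu\,\mathrm{d}\lambda\propto\mathcal{N}_d(0,\Sigma)\mathbf{1}_{\mathcal{A}}$, we will make use of $\widetilde{X}\sim\mathcal{N}_d(\rho x+(1-\rho)x_{\mathcal{A}},(1-\rho^2)\Sigma)$ as a candidate for the transition density $z\mapsto q(x,z)$, where $x_{\mathcal{A}}$ is a well-chosen point in $\mathcal{A}$. In that case, we easily check that the acceptance probability is given by

$$\alpha_{\mathrm{accept}}(x,z)=1\wedge\exp\big(x_{\mathcal{A}}^{\top}\Sigma^{-1}(x-z)\big). \tag{2.8}$$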
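For completeness, the check behind (2.8) can be written out as follows; this is a worked verification added here, with the shorthand $\|u\|_{\Sigma}^{2}:=u^{\top}\Sigma^{-1}u$ and $c:=(1-\rho)x_{\mathcal{A}}$ introduced only for this computation. For $x,z\in\mathcal{A}$,

$$\ln\frac{\mu(z)}{\mu(x)}=-\tfrac12\big(\|z\|_{\Sigma}^{2}-\|x\|_{\Sigma}^{2}\big),\qquad \ln\frac{q(z,x)}{q(x,z)}=-\frac{\|x-\rho z-c\|_{\Sigma}^{2}-\|z-\rho x-c\|_{\Sigma}^{2}}{2(1-\rho^{2})}.$$

Expanding the squares gives

$$\|x-\rho z-c\|_{\Sigma}^{2}-\|z-\rho x-c\|_{\Sigma}^{2}=(1-\rho^{2})\big(\|x\|_{\Sigma}^{2}-\|z\|_{\Sigma}^{2}\big)-2(1+\rho)\,c^{\top}\Sigma^{-1}(x-z),$$

so that the quadratic terms cancel with those of $\ln(\mu(z)/\mu(x))$ and

$$\ln\frac{\mu(z)\,q(z,x)}{\mu(x)\,q(x,z)}=\frac{(1+\rho)\,c^{\top}\Sigma^{-1}(x-z)}{1-\rho^{2}}=x_{\mathcal{A}}^{\top}\Sigma^{-1}(x-z).$$

Taking the minimum with $1$ after exponentiation yields the stated acceptance probability; for $x_{\mathcal{A}}=0$ the ratio equals $1$, which is the reversible case (2.6).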

### 2.2 Convergence results for the estimation of ϕ⋆

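Let $L_2(\mu)$ be the set of measurable functions $\varphi:\mathbb{R}^d\to\mathbb{R}$ such that $\int\varphi^2\,\mu\,\mathrm{d}\lambda<\infty$, and define the norm

$$|\varphi|_{L_2(\mu)} := \Big(\int\varphi^2\,\mu\,\mathrm{d}\lambda\Big)^{1/2}. \tag{2.9}$$

Let $\psi_\star$ be the projection of $\phi_\star$ on the linear span of the functions $\phi_1,\dots,\phi_L$, with respect to the norm given by (2.9): $\psi_\star=\langle\alpha_\star;\underline{\phi}\rangle$, where $\alpha_\star\in\mathbb{R}^L$ solves

$$\Big(\int\underline{\phi}\,\underline{\phi}^{\top}\,\mu\,\mathrm{d}\lambda\Big)\alpha_\star=\int\phi_\star\,\underline{\phi}\,\mu\,\mathrm{d}\lambda.$$

Theorem 1

Assume that the following conditions hold:

1. the transition kernel $\mathsf{P}$ and the initial distribution $\xi$ satisfy: there exist a constant $C_{\mathsf{P}}$ and a rate sequence $\{\rho(m): m\geq 1\}$ such that for any $m\geq 1$,

$$\Big|\xi\mathsf{P}^m\big[(\psi_\star-\phi_\star)^2\big]-\int(\psi_\star-\phi_\star)^2\,\mu\,\mathrm{d}\lambda\Big|\leq C_{\mathsf{P}}\,\rho(m). \tag{2.10}$$

2. the conditional distribution $\mathsf{Q}$ satisfies

$$\sigma^2 := \sup_{x\in\mathcal{A}}\Big\{\int r^2\,\mathsf{Q}(x,\mathrm{d}r)-\Big(\int r\,\mathsf{Q}(x,\mathrm{d}r)\Big)^2\Big\}<\infty. \tag{2.11}$$

Let $X^{(1:M)}$ and $\widehat{\phi}_M$ be given by Algorithm 1. Then

$$\Delta_M := \mathbb{E}\Big[\frac{1}{M}\sum_{m=1}^{M}\big(\widehat{\phi}_M(X^{(m)})-\phi_\star(X^{(m)})\big)^2\Big]\leq\frac{\sigma^2L}{M}+|\psi_\star-\phi_\star|_{L_2(\mu)}^2+\frac{C_{\mathsf{P}}}{M}\sum_{m=1}^{M}\rho(m). \tag{2.12}$$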

### Proof.

See Section 4.1. ∎

Note that $\Delta_M$ measures the mean squared error $\widehat{\phi}_M-\phi_\star$ along the design sequence $X^{(1:M)}$. The proof consists in decomposing this error into a variance term and a squared bias term:

1. $\sigma^2L/M$ on the right-hand side is the statistical error, decreasing as the size of the design $M$ gets larger and increasing as the size of the approximation space $L$ gets larger.

2. The quantity $|\psi_\star-\phi_\star|_{L_2(\mu)}^2$ is the residual error under the best approximation of $\phi_\star$ by the basis functions $\phi_1,\dots,\phi_L$ with respect to the $L_2(\mu)$-norm: it is naturally expected as the limit of $\Delta_M$ when $M\to\infty$.

3. The term with $\{\rho(m): m\geq 1\}$ describes how rapidly the Markov chain $\{X^{(m)}: m\geq 1\}$ converges to its stationary distribution $\mu\,\mathrm{d}\lambda$.

This theorem extends known results in the case of i.i.d. design $X^{(1:M)}$, which is the major novelty of our contribution. The i.i.d. case is a special case of this general setting: it is retrieved by setting $\mathsf{P}(x,\mathrm{d}z)=\mu(z)\,\mathrm{d}\lambda(z)$. Note that in that case, assumption (i) is satisfied with $C_{\mathsf{P}}=0$, and the upper bound in (2.12) coincides with classical results (see e.g. [14, Theorem 11.1]). The theorem covers the situation when the outer Monte Carlo stage relies on a Markov chain Monte Carlo sampler; we will discuss below how to check assumption (i) in practice.

The assumptions on the basis functions $\phi_1,\dots,\phi_L$ are weaker than what is usually assumed in the literature on nested simulation. Namely, in contrast with [4, Assumption A2] in the i.i.d. case, Theorem 1 holds even when the functions $\phi_1,\dots,\phi_L$ are not orthonormal in $L_2(\mu)$, and it holds without assuming that, almost surely, the rank of the matrix $\mathbf{A}$ is $L$.

Assumption (ii) says that the conditional variance of $R$ given $X$ is uniformly bounded. This condition could be weakened and replaced by an ergodic condition on the Markov kernel $\mathsf{P}$ implying that

$$\widetilde{\sigma}_L^2 := \sup_{M\geq L}\mathbb{E}\Big[\big|\mathbf{A}(\mathbf{A}^{\top}\mathbf{A})^{\#}\mathbf{A}^{\top}\big(\underline{\mathbf{R}}-\mathbb{E}[\underline{\mathbf{R}}\mid X^{(1:M)}]\big)\big|^2\Big]<\infty;$$

$\mathbf{A}$ and $\underline{\mathbf{R}}$ are given by (2.4) and depend on $X^{(1:M)}$. In that case, the upper bound (2.12) holds with $\sigma^2L$ replaced by $\widetilde{\sigma}_L^2$ (see inequality (4.2) in the proof of Theorem 1).

We conclude this subsection with conditions on $\mathsf{P}$ and $\mathcal{A}$ implying the ergodicity assumption (2.10) with a geometric rate sequence

$$\rho(m)=\kappa^m$$

for some $\kappa\in(0,1)$. Sufficient conditions for sub-geometric rate sequences can be found, e.g., in [10, 7].


### Proposition 2 ([20, Theorem 15.0.1], [9, Proposition 2])

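Assume that $\mathsf{P}$ is phi-irreducible and there exists a measurable function $V:\mathcal{A}\to[1,+\infty)$ such that

1. there exist $\delta\in(0,1)$ and $b<\infty$ such that for any $x\in\mathcal{A}$, $\mathsf{P}V(x)\leq\delta V(x)+b$,

2. there exists $\upsilon_\star\in(b/(1-\delta),+\infty)$ such that the level set $\mathcal{C}_\star:=\{V\leq\upsilon_\star\}$ is $1$-small: there exist $\epsilon>0$ and a probability distribution $\nu$ on $\mathcal{A}$ (with $\nu(\mathcal{C}_\star)=1$) such that for any $x\in\mathcal{C}_\star$, $\mathsf{P}(x,\mathrm{d}z)\geq\epsilon\,\nu(\mathrm{d}z)$.

Then there exist $\kappa\in(0,1)$ and a finite constant $C_1$ such that for any measurable function $g:\mathcal{A}\to\mathbb{R}$, any $m\geq 1$ and any $x\in\mathcal{A}$,

$$\Big|\mathsf{P}^mg(x)-\int g\,\mu\,\mathrm{d}\lambda\Big|\leq C_1\Big(\sup_{\mathcal{A}}\frac{|g|}{V}\Big)\kappa^m\,V(x).$$

In addition, there exists a finite constant $C_2$ such that for any measurable function $g:\mathcal{A}\to\mathbb{R}$ and any $M\geq 1$,

$$\mathbb{E}\Big[\Big|\sum_{m=1}^{M}\Big\{g(X^{(m)})-\int g\,\mu\,\mathrm{d}\lambda\Big\}\Big|^2\Big]\leq C_2\Big(\sup_{\mathcal{A}}\frac{|g|}{\sqrt{V}}\Big)^2\,\mathbb{E}\big[V(X^{(0)})\big]\,M.$$

An explicit expression of the constant $C_2$ is given in [9, Proposition 2]. When $\mathsf{P}=\mathsf{P}_{\mathrm{GL}}$ as described in Algorithm 2, we have the following corollary.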

Corollary 3

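Assume the following conditions:

1. For all $x\in\mathcal{A}$, $\mu(z)>0$ implies that $q(x,z)>0$.

2. There exists $\delta_1\in(0,1)$ such that $\sup_{x\in\mathcal{A}}\int_{\mathcal{A}^c}q(x,z)\,\mathrm{d}\lambda(z)\leq\delta_1$.

3. There exist $\delta_2\in(\delta_1,1)$, a measurable function $V:\mathcal{A}\to[1,+\infty)$ and a set $\mathcal{B}\subset\mathcal{A}$ such that

$$b:=\sup_{x\in\mathcal{B}}\int_{\mathcal{A}}V(z)\,q(x,z)\,\mathrm{d}\lambda(z)<\infty,\qquad \sup_{x\in\mathcal{B}^c}V^{-1}(x)\int_{\mathcal{A}}V(z)\,q(x,z)\,\mathrm{d}\lambda(z)\leq\delta_2-\delta_1.$$

4. For some $\upsilon_\star>b/(1-\delta_2)$, the level set $\mathcal{C}_\star:=\{V\leq\upsilon_\star\}$ is such that

$$\inf_{(x,z)\in\mathcal{C}_\star^2}\Big(\frac{q(x,z)\,\mathbf{1}_{\mu(z)\neq 0}}{\mu(z)}\Big)>0,\qquad \int_{\mathcal{C}_\star}\mu\,\mathrm{d}\lambda>0.$$

Then the assumptions of Proposition 2 are satisfied for the kernel $\mathsf{P}=\mathsf{P}_{\mathrm{GL}}$.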

### Proof.

See Section 4.2. ∎

When $\mu$ is a Gaussian density $\mathcal{N}_d(0,I_d)$ on $\mathbb{R}^d$ restricted to $\mathcal{A}$ and the proposal density $q(x,\cdot)$ is a Gaussian density with mean $\rho x$ and covariance $(1-\rho^2)I_d$ (with $\rho\in(0,1)$), it is easily seen that conditions (i), (iii) and (iv) of Corollary 3 are satisfied (choose, e.g., $V(x)=\exp(s|x|)$, with $s>0$). Condition (ii) is problem specific since it depends on the geometry of $\mathcal{A}$.

### 2.3 Convergence results for the estimation of ℐ

When problem (1.1) is of the form $\mathbb{E}[f(Y,\mathbb{E}[R\mid Y])\mid Y\in\mathcal{A}]$ for a globally Lipschitz function $f$ (in the second variable), we have the following control on the Monte Carlo error $\widehat{\mathcal{I}}_M-\mathcal{I}$ from Algorithm 1.

Theorem 4

Assume that the following conditions hold:

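1. $f:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}$ is globally Lipschitz in the second variable: there exists a finite constant $C_f$ such that for any $(r_1,r_2,y)\in\mathbb{R}\times\mathbb{R}\times\mathbb{R}^d$,

$$|f(y,r_1)-f(y,r_2)|\leq C_f\,|r_1-r_2|.$$

2. There exists a finite constant $C$ such that for any $M$,

$$\mathbb{E}\Big[\Big(M^{-1}\sum_{m=1}^{M}f\big(X^{(m)},\phi_\star(X^{(m)})\big)-\int f\big(x,\phi_\star(x)\big)\,\mu(x)\,\mathrm{d}\lambda(x)\Big)^2\Big]\leq\frac{C}{M}.$$

Then

$$\Big(\mathbb{E}\big[|\widehat{\mathcal{I}}_M-\mathcal{I}|^2\big]\Big)^{1/2}\leq C_f\sqrt{\Delta_M}+\sqrt{\frac{C}{M}},$$

where $\mathcal{I}$, $\widehat{\mathcal{I}}_M$ and $\Delta_M$ are respectively given by (1.1), Algorithm 1 and (2.12).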

### Proof.

See Section 4.3. ∎

Sufficient conditions for assumption (ii) to hold are given in Proposition 2 when $\{X^{(m)}: m\geq 1\}$ is a Markov chain. When the draws $\{X^{(m)}: m\geq 1\}$ are i.i.d. with distribution $\mu\,\mathrm{d}\lambda$, condition (ii) is verified with $C=\mathbb{V}\mathrm{ar}(f(X,\phi_\star(X)))$, where $X\sim\mu\,\mathrm{d}\lambda$.

### 2.4 Asymptotic optimal tuning of parameters

In this subsection, we discuss how to tune the parameters of the algorithm (i.e. $M$, $\phi_1,\dots,\phi_L$ and $L$), given a Markov kernel $\mathsf{P}$. To simplify the discussion, we assume from now on that

Hyp(2.10)

The constant $C_{\mathsf{P}}$ of (2.10) can be chosen independently of $\psi_\star$; furthermore, the series $\sum_{m\geq 1}\rho(m)$ built from the rate sequence defined in (2.10) is convergent.

The above condition on $\rho$ is not very demanding: see Proposition 2, where the convergence is geometric. Regarding the condition on $C_{\mathsf{P}}$, although not trivial, this assumption seems reasonable since $\psi_\star$ is the best approximation of $\phi_\star$ on the function basis with respect to the target measure $\mu\,\mathrm{d}\lambda$: it means that, first,

$$|\psi_\star-\phi_\star|_{L_2(\mu)}\leq|\phi_\star|_{L_2(\mu)},$$

and second, $\psi_\star-\phi_\star$ is expected to converge to $0$ in $L_2(\mu)$ as the number $L$ of basis functions increases. Besides, in the context of Proposition 2, the control of $C_{\mathsf{P}}$ would follow from the control of $\sup_{\mathcal{A}}|\psi_\star-\phi_\star|/\sqrt{V}$, which is a delicate task because of the lack of knowledge on $\psi_\star$.

A direct consequence of Hyp(2.10) is that the last term in (2.12) is such that

$$\frac{C_{\mathsf{P}}}{M}\sum_{m=1}^{M}\rho(m)=O\Big(\frac{1}{M}\Big),$$

uniformly in the function basis. In other words, the mean empirical squared error $\Delta_M$ is bounded by

$$\mathrm{Cst}\times\Big(\frac{L}{M}+|\psi_\star-\phi_\star|_{L_2(\mu)}^2\Big),$$

as in the case of i.i.d. design (see [14, Theorem 11.1]).

There are many choices of function basis [14], but due to the lack of knowledge on the target measure and in the perspective of discussing convergence rates, it is relevant to adopt local approximation techniques, like piecewise polynomial partitioning estimates (i.e. local polynomials defined on a tensored grid); for a detailed presentation, see [12, Section 4.4.]. Assume that the conditional expectation $\phi_\star$ is smooth on $\mathcal{A}$, namely $\phi_\star$ is $p_0$ times continuously differentiable, with bounded derivatives, and the $p_0$-th derivatives are $p_1$-Hölder continuous. Set $p:=p_0+p_1$. If $\mathcal{A}$ is bounded, it is well known [14, Corollary 11.1 for d=1] that taking local polynomials of order $p_0$ on a tensored grid with edge length equal to $\mathrm{Cst}\times M^{-\frac{1}{2p+d}}$ ensures that the statistical error $L/M$ and the approximation error $|\psi_\star-\phi_\star|_{L_2(\mu)}^2$ have the same magnitude, and we get

$$\Delta_M=O\big(M^{-\frac{2p}{2p+d}}\big). \tag{2.13}$$

If $\mathcal{A}$ is no longer bounded, under the additional assumption that $\mu\,\mathrm{d}\lambda$ has tails with exponential decay, it is enough to consider similar local polynomials but on a tensored grid truncated at distance $\mathrm{Cst}\times\log(M)$; this choice maintains the validity of estimate (2.13), up to logarithmic factors [12, Section 4.4.], which we omit for the sake of simplicity.

Regarding the complexity $\mathsf{Cost}$ (computational cost), the simulation cost (for $X^{(1:M)},R^{(1:M)}$) is proportional to $M$, and the computation of $\widehat{\phi}_M$ needs $\mathrm{Cst}\times M$ operations (taking advantage of the tensored grid), as does the final evaluation of $\widehat{\mathcal{I}}_M$. Thus we have $\mathsf{Cost}\leq\mathrm{Cst}\times M$, with another constant. Finally, in view of Theorem 4, we derive

$$\mathsf{Error}_{\text{Regression Alg. 1}}=O\big(\mathsf{Cost}^{-\frac{p}{2p+d}}\big).$$

This is similar to the rate we would obtain in an i.i.d. setting. For very smooth $\phi_\star$ ($p\to+\infty$), we retrieve asymptotically the order $\frac12$ of convergence. This global error may be compared to the situation where the inner conditional expectation is computed using a crude Monte Carlo method (using $N$ samples $R^{(m,k)}$ for each of the $M$ samples $X^{(m)}$); this scheme is described and analyzed in Appendix A. Its computational cost is $\mathrm{Cst}\times MN$ and its global error is $O(1/\sqrt{N}+1/\sqrt{M})$ if $f$ is Lipschitz (resp. $O(1/N+1/\sqrt{M})$ if $f$ is smoother); thus we have (by taking $M=N$, resp. $N=\sqrt{M}$)

$$\mathsf{Error}_{\text{Crude MC Alg. 3}}=\begin{cases}O\big(\mathsf{Cost}^{-\frac14}\big) & \text{if } f \text{ is Lipschitz},\\ O\big(\mathsf{Cost}^{-\frac13}\big) & \text{if } f \text{ is smoother}.\end{cases}$$

In the standard case of Lipschitz $f$, the regression-based Algorithm 1 converges faster than Algorithm 3 under the condition $p\geq\frac{d}{2}$. In low dimension, this condition is easy to satisfy, but it becomes problematic as the dimension increases: this is the usual curse of dimensionality.
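The exponents in this comparison follow from a short computation, spelled out here as a worked restatement of the rates above (constants are ignored). For the crude Monte Carlo scheme, the cost is proportional to $MN$, hence

$$f\text{ Lipschitz, }M=N:\quad \mathsf{Cost}\propto M^{2},\qquad \mathsf{Error}=O\big(M^{-1/2}\big)=O\big(\mathsf{Cost}^{-1/4}\big),$$

$$f\text{ smoother, }N=\sqrt{M}:\quad \mathsf{Cost}\propto M^{3/2},\qquad \mathsf{Error}=O\big(M^{-1/2}\big)=O\big(\mathsf{Cost}^{-1/3}\big).$$

Comparing with the regression exponent in the Lipschitz case gives

$$\frac{p}{2p+d}\geq\frac14\iff 4p\geq 2p+d\iff p\geq\frac{d}{2}.$$

For instance, with $d=2$ the two exponents coincide at $p=1$, while $p=2$ gives $\frac{p}{2p+d}=\frac13$ for the regression-based scheme.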

## 3 Application: Put options in a rare-event regime

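The goal is to approximate the quantity

$$\mathcal{I} := \mathbb{E}\Big[\big(\mathbb{E}\big[(K-h(S_{T'}))_+\mid S_T\big]-p_\star\big)_+\,\Big|\,S_T\in\mathcal{S}\Big] \tag{3.1}$$

for various choices of $h$, where $\{S_t: t\geq 0\}$ is a $d$-dimensional geometric Brownian motion, $T<T'$ and $\{S_T\in\mathcal{S}\}$ is a rare event.

### 3.1 A toy example in dimension 1

We start with a toy example: in dimension $d=1$, when $h(y)=y$ and $\mathcal{S}=\{s\in\mathbb{R}_+: s\leq s_\star\}$, so that

$$\mathcal{I}=\mathbb{E}\Big[\big(\mathbb{E}\big[(K-S_{T'})_+\mid S_T\big]-p_\star\big)_+\,\Big|\,S_T\leq s_\star\Big];$$

$(K-S_{T'})_+$ is the Put payoff written on one stock with price $(S_t)_{t\geq 0}$, with strike $K$ and maturity $T'$: this is a standard financial product used by asset managers to insure their portfolio against a decrease of the stock price. We take the point of view of the seller of the contract, who is mostly concerned by large values of the Put price, i.e. the seller aims at valuing the excess of the Put price at time $T\in(0,T')$ beyond the threshold $p_\star>0$, for a stock value $S_T$ smaller than $s_\star>0$. We assume that $\{S_t: t\geq 0\}$ evolves like a geometric Brownian motion, with volatility $\sigma>0$ and zero drift. For the sake of simplicity, we assume that the interest rate is $0$; the extension to a non-zero interest rate is straightforward.

Upon noting that $S_T=\xi(Y)$ and $S_{T'}=\xi(Y)\exp\big(-\tfrac12\sigma^2\tau+\sigma\sqrt{\tau}\,Z\big)$, where $Y,Z$ are independent standard Gaussian variables and

$$\xi(y) := S_0\exp\big(-\tfrac12\sigma^2T+\sigma\sqrt{T}\,y\big),\qquad \tau := T'-T,$$

we have

$$\mathcal{I}=\mathbb{E}\Big[\Big(\mathbb{E}\Big[\big(K-\xi(Y)\exp\big(-\tfrac12\sigma^2\tau+\sigma\sqrt{\tau}\,Z\big)\big)_+\,\Big|\,Y\Big]-p_\star\Big)_+\,\Big|\,Y\leq y_\star\Big],$$

where

$$y_\star := \frac{1}{\sigma\sqrt{T}}\ln\Big(\frac{s_\star}{S_0}\Big)+\frac12\sigma\sqrt{T}.$$

Therefore, problem (3.1) is of the form (1.1) with

$$R=\big(K-\xi(Y)\exp\big(-\tfrac12\sigma^2\tau+\sigma\sqrt{\tau}\,Z\big)\big)_+,\qquad f(y,r)=(r-p_\star)_+,\qquad \mathcal{A}=\{y\in\mathbb{R}: y\leq y_\star\},$$

and $[Y,Z]^{\top}\sim\mathcal{N}_2(0,I_2)$. In this example, $\mathbb{P}(Y\in\mathcal{A})$ and $\mathbb{E}[R\mid Y]$ are explicit. We have indeed $\mathbb{P}(Y\in\mathcal{A})=\Phi(y_\star)$, where $\Phi$ denotes the cumulative distribution function (cdf) of a standard Gaussian distribution. Furthermore, $\mathbb{E}[R\mid Y]=\Phi_\star(\xi(Y))$, where

$$\Phi_\star(s) := K\,\Phi(d_+(s))-s\,\Phi(d_-(s)),\qquad\text{with } d_\pm(s) := \frac{1}{\sigma\sqrt{\tau}}\ln\Big(\frac{K}{s}\Big)\pm\frac12\sigma\sqrt{\tau};$$

note that $\phi_\star=\Phi_\star\circ\xi$. The parameter values for the numerical tests are given in Table 1.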
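The closed-form quantities of this toy example are easy to evaluate numerically. The sketch below uses only the Python standard library; the function and variable names are chosen here for illustration, and the parameter values are those listed in Table 1.

```python
import math

def norm_cdf(x):
    """Standard Gaussian cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Parameter values of the one-dimensional example (Table 1).
T, T_prime, S0, K, sigma, s_star, p_star = 1.0, 2.0, 100.0, 100.0, 0.30, 30.0, 10.0
tau = T_prime - T

def xi(y):
    """Stock price at time T as a function of the Gaussian variable y."""
    return S0 * math.exp(-0.5 * sigma**2 * T + sigma * math.sqrt(T) * y)

def put_price(s):
    """Put price at time T for stock value s (zero interest rate, zero drift)."""
    d_plus = math.log(K / s) / (sigma * math.sqrt(tau)) + 0.5 * sigma * math.sqrt(tau)
    d_minus = d_plus - sigma * math.sqrt(tau)
    return K * norm_cdf(d_plus) - s * norm_cdf(d_minus)

def phi_star(y):
    """Inner conditional expectation E[R | Y = y]."""
    return put_price(xi(y))

# Threshold y_star and rare-event probability P(Y <= y_star).
y_star = math.log(s_star / S0) / (sigma * math.sqrt(T)) + 0.5 * sigma * math.sqrt(T)
prob_rare = norm_cdf(y_star)
```

With these values, `y_star` is close to $-3.86$ and `prob_rare` is of order $5.6\times10^{-5}$, in line with the true value of the rare-event probability reported in the text.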

Table 1

Parameter values for the one-dimensional example.

| $T$ | $T'$ | $S_0$ | $K$ | $\sigma$ | $s_\star$ | $p_\star$ |
|---|---|---|---|---|---|---|
| 1 | 2 | 100 | 100 | 30% | 30 | 10 |

Figure 1

Normalized histograms of the $M$ points from the Markov chains GL (top left), NR (top right) and from the i.i.d. sampler with rejection (bottom left). Bottom right: restricted to $[-6,y_\star]$, the cdf of $Y$ given $\{Y\in\mathcal{A}\}$, two MCMC approximations (with $\mathsf{P}_{\mathrm{GL}}$ and $\mathsf{P}_{\mathrm{NR}}$) and an i.i.d. approximation.

We first illustrate the behavior of the kernel $\mathsf{P}_{\mathrm{GL}}$ described by Algorithm 2. Since $Y$ is a standard Gaussian random variable, we design $\mathsf{P}_{\mathrm{GL}}$ as a Hastings–Metropolis sampler, with invariant distribution $\mu\,\mathrm{d}\lambda$ equal to a standard $\mathcal{N}(0,1)$ restricted to $\mathcal{A}$ and with proposal distribution $q(x,\cdot)\,\mathrm{d}\lambda\equiv\mathcal{N}(\rho x,1-\rho^2)$. Observe that this proposal kernel is reversible with respect to $\mu$, see (2.6). Note that condition (ii) in Corollary 3 becomes

$$\sup_{y\leq y_\star}\Phi\Big(\frac{\rho y-y_\star}{\sqrt{1-\rho^2}}\Big)<1,$$

which holds true since $\rho>0$. In the following, the performance of the kernel $\mathsf{P}_{\mathrm{GL}}$ is compared to that of the kernel $\mathsf{P}_{\mathrm{NR}}$, defined as a Hastings–Metropolis kernel with proposal $q(x,\cdot)\,\mathrm{d}\lambda\equiv\mathcal{N}((1-\rho)y_\star+\rho x,1-\rho^2)$ and with invariant distribution a standard Gaussian distribution restricted to $\mathcal{A}$. The main difference with $\mathsf{P}_{\mathrm{GL}}$ is that this proposal transition density $q$ is not reversible with respect to $\mu$ (whence the notation $\mathsf{P}_{\mathrm{NR}}$ for the kernel); therefore, the acceptance-rejection ratio of the new point $z$ is given by (see equality (2.8))

$$\big(1\wedge\exp(y_\star(x-z))\big)\mathbf{1}_{z\leq y_\star}.$$

In Figure 1 (bottom right), the true cdf of $Y$ given $\{Y\in\mathcal{A}\}$ (a distribution supported on $(-\infty,y_\star]$) is displayed on $[-6,y_\star]$ together with three empirical cdfs $x\mapsto M^{-1}\sum_{m=1}^{M}\mathbf{1}_{\{X^{(m)}\leq x\}}$: the first one is computed from i.i.d. samples with distribution $\mathcal{N}(0,1)$, and the second one (resp. the third one) is computed from a Markov chain path $X^{(1:M)}$ of length $M$ with kernel $\mathsf{P}_{\mathrm{GL}}$ (resp. $\mathsf{P}_{\mathrm{NR}}$) and started at $X^{(0)}=y_\star$. The two kernels provide a similar approximation of the true cdf. Here $M=1\mathrm{e}6$, and $\rho=0.85$ for both kernels. We also display the normalized histograms of the points $X^{(m)}$ sampled respectively from $\mathsf{P}_{\mathrm{GL}}$ (top left), $\mathsf{P}_{\mathrm{NR}}$ (top right) and the crude rejection algorithm with Gaussian proposal (bottom left). In the latter plot, the histogram is built with only around 50–60 points, which correspond to the accepted points among $M=1\mathrm{e}6$ proposal points.
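The non-reversible variant can be sketched in the same spirit. The code below is an illustrative sketch using only the standard library: the function name `sample_chain_nr`, its signature and the returned acceptance rate are assumptions made here, and the numerical value used for `y_star` is an approximation of the threshold defined above for the parameters of Table 1.

```python
import math
import random

def sample_chain_nr(M, rho, y_star, x0, rng):
    """1-D chain targeting N(0, 1) restricted to (-inf, y_star].

    Proposal: z ~ N((1 - rho) * y_star + rho * x, 1 - rho**2).
    Acceptance probability: min(1, exp(y_star * (x - z))) if z <= y_star, else 0.
    """
    path, accepted = [], 0
    x = x0
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(M):
        z = (1.0 - rho) * y_star + rho * x + scale * rng.gauss(0.0, 1.0)
        if z <= y_star:
            log_ratio = y_star * (x - z)
            # Accept with probability min(1, exp(log_ratio)).
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                x = z
                accepted += 1
        path.append(x)
    return path, accepted / M

# Hypothetical usage, chain started at the boundary point y_star.
y_star_approx = -3.8632
nr_path, nr_rate = sample_chain_nr(20_000, 0.85, y_star_approx, y_star_approx,
                                   random.Random(1))
```

Working with the logarithm of the ratio avoids evaluating `math.exp` on large positive arguments when the candidate moves towards the boundary.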

Figure 2

For different values of $\rho$, estimation of the autocorrelation function (over 100 independent runs) of the chain $\mathsf{P}_{\mathrm{GL}}$ (left) and $\mathsf{P}_{\mathrm{NR}}$ (right). Each curve is computed using 1e6 sampled points.

To assess the speed of convergence of the samplers $\mathsf{P}_{\mathrm{GL}}$ and $\mathsf{P}_{\mathrm{NR}}$ to their stationary distributions, we additionally plot in Figure 2 the autocorrelation function for both chains. For $\mathsf{P}_{\mathrm{GL}}$ the choice of $\rho$ is quite significant, as observed in [11]; values of $\rho$ around 0.9 usually give good results. For $\mathsf{P}_{\mathrm{NR}}$, in this example the choice of $\rho$ is less significant because we are able to define a proposal which takes advantage of the knowledge on the rare set. A comparison of acceptance rates is provided below (see Figure 3 (left)).

Figure 3

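Comparison of the MCMC samplers $\mathsf{P}_{\mathrm{GL}}$ (top) and $\mathsf{P}_{\mathrm{NR}}$ (bottom), for different values of $\rho\in\{0.1,\dots,0.9,0.99\}$. Left: Mean acceptance rate when computing $\mathbb{P}(Y\leq y_\star\mid Y\leq w_{J-1})$ after $M$ iterations of the chain. Right: Estimation of $\mathbb{P}(Y\in\mathcal{A})$ by combining splitting and MCMC.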

We also illustrate the behavior of these two MCMC samplers for the estimation of the rare-event probability $\mathbb{P}(Y\in\mathcal{A})$. Following the approach of [11], we use the decomposition

$$\mathbb{P}(Y\in\mathcal{A})=\prod_{j=1}^{J}\mathbb{P}(Y\leq w_j\mid Y\leq w_{j-1})\approx\widehat{\pi} := \prod_{j=1}^{J}\Big(\frac{1}{M}\sum_{m=1}^{M}\mathbf{1}_{X^{(m,j)}\leq w_j}\Big),$$

where $w_0=+\infty>w_1>\dots>w_J=y_\star$, and $\{X^{(m,j)}: m\geq 0\}$ is a Markov chain with kernel $\mathsf{P}_{\mathrm{GL}}^{(j)}$ or $\mathsf{P}_{\mathrm{NR}}^{(j)}$ having a standard Gaussian restricted to $(-\infty,w_{j-1}]$ as invariant distribution. The $J$ intermediate levels are chosen such that $\mathbb{P}(Y\leq w_j\mid Y\leq w_{j-1})\approx 0.1$. Figure 3 (right) displays the boxplot of 100 independent realizations of the estimator $\widehat{\pi}$ for different values of $\rho\in\{0.1,\dots,0.9\}$; the horizontal dotted line indicates the true value $\mathbb{P}(Y\in\mathcal{A})=5.6\mathrm{e}{-5}$. Here $J=5$, $(w_1,\dots,w_4)=(0,-1.6,-2.5,-3.2)$ and $M=1\mathrm{e}4$. Figure 3 (left) displays the boxplot of 100 mean acceptance rates $M^{-1}\sum_{m=1}^{M}\mathbf{1}_{\{X^{(m,J)}=\widetilde{X}^{(m,J)}\}}$ computed along 100 independent chains $\{X^{(m,J)}: m\leq M\}$, for different values of $\rho$; the horizontal dotted line is set to 0.234, which is usually chosen as the target rate when fixing some design parameters in a Hastings–Metropolis algorithm (see e.g. [22]). We observe that the kernel $\mathsf{P}_{\mathrm{NR}}$, built on a non-reversible proposal, yields more accurate results than $\mathsf{P}_{\mathrm{GL}}$; this is intuitively easy to understand since $\mathsf{P}_{\mathrm{NR}}$ better accounts for the point $y_\star$ around which one should sample.
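The splitting decomposition can be illustrated with a short standard-library sketch based on the reversible Gaussian proposal. The function names, the choice of starting each chain at $\min(w_{j-1},0)$ and the absence of burn-in are simplifying assumptions made here; the last level is an approximate value of $y_\star$ for the parameters of Table 1.

```python
import math
import random

def conditional_prob(w_prev, w_next, M, rho, rng):
    """Estimate P(Y <= w_next | Y <= w_prev) for Y ~ N(0, 1) by an MCMC average."""
    x = min(w_prev, 0.0)            # start inside (-inf, w_prev]
    scale = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(M):
        z = rho * x + scale * rng.gauss(0.0, 1.0)
        if z <= w_prev:             # reversible proposal: accept if in the set
            x = z
        hits += (x <= w_next)
    return hits / M

def splitting_estimate(levels, M, rho, rng):
    """Product of the conditional-probability estimates over successive levels."""
    estimate = 1.0
    w_prev = math.inf
    for w in levels:
        estimate *= conditional_prob(w_prev, w, M, rho, rng)
        w_prev = w
    return estimate

levels = [0.0, -1.6, -2.5, -3.2, -3.8632]   # last level: approximate y_star
pi_hat = splitting_estimate(levels, 10_000, 0.85, random.Random(0))
```

Each factor is estimated from its own chain, so the levels can be processed sequentially and the per-level sample size `M` tuned independently if needed.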

Figure 4

Left: 1000 sampled points $(X^{(m)},R^{(m)})$ (using the sampler $\mathsf{P}_{\mathrm{GL}}$), together with $\phi_\star$. Right: A realization of the error function $x\mapsto\widehat{\phi}_M(x)-\phi_\star(x)$ on $[-5,y_\star]$, for different values of $L\in\{2,3,4\}$ and two different kernels when sampling $X^{(1:M)}$.

We now run Algorithm 1 for the estimation of the conditional expectation $x\mapsto\phi_\star(x)$ on $(-\infty,y_\star]$. The algorithm is run with $M=1\mathrm{e}6$, successively with $\mathsf{P}=\mathsf{P}_{\mathrm{GL}}$ and $\mathsf{P}=\mathsf{P}_{\mathrm{NR}}$, both with $\rho=0.85$; the $L$ basis functions are $\{x\mapsto\phi_\ell(x)=(\xi(x))^{\ell-1}: \ell=1,\dots,L\}$ and we consider successively $L\in\{2,3,4\}$. In Figure 4 (right), the error function $x\mapsto\widehat{\phi}_M(x)-\phi_\star(x)$ is displayed for different values of $L$ when computing $\widehat{\phi}_M$. It is displayed on the interval $[-5,y_\star]$, which is an interval with probability larger than $1-5\mathrm{e}{-3}$ under the distribution of $Y$ given $\{Y\in\mathcal{A}\}$ (see Figure 1). Note that the errors may be quite large for $x$ close to $-5$; however, these values are very unlikely (see Figure 1), and therefore these large errors are not representative of the global quadratic error. In Figure 4 (left), we display 1000 sampled points of $(X^{(m)},R^{(m)})$. These points are taken from the sampler $\mathsf{P}_{\mathrm{GL}}$, every twenty iterations, in order to obtain nearly uncorrelated design points. Observe that the regression function $\phi_\star$ looks nearly affine, which explains why the results with only $L=2$ are quite accurate.

Figure 5

Left: Monte Carlo approximations of $M\mapsto\Delta_M$, and fitted curves of the form $M\mapsto\alpha+\beta/M$. Right: For different values of $\rho$, and for three different values of $M$, boxplot of 100 independent estimates $\widehat{\mathcal{I}}_M$ when $X^{(1:M)}$ is sampled from a chain with kernel $\mathsf{P}_{\mathrm{GL}}$ (top) and $\mathsf{P}_{\mathrm{NR}}$ (bottom).

We finally illustrate Algorithm 1 for the estimation of $\mathcal{I}$ (see (3.1)). In Figure 5 (right), the boxplot of 100 independent outputs $\widehat{\mathcal{I}}_M$ of Algorithm 1 is displayed when run with $\mathsf{P}=\mathsf{P}_{\mathrm{GL}}$ (top) and $\mathsf{P}=\mathsf{P}_{\mathrm{NR}}$ (bottom); different values of $\rho$ and $M$ are considered, namely $\rho\in\{0,0.1,0.5,0.85\}$ and $M\in\{5\mathrm{e}2,5\mathrm{e}3,1\mathrm{e}4\}$; the regression step is performed with $L=2$ basis functions. Figure 5 (right) illustrates well the benefit of using an MCMC sampler for the current regression problem. When $\mathsf{P}=\mathsf{P}_{\mathrm{GL}}$, compare the distributions for $\rho=0$ (i.i.d. samples) and $\rho=0.85$: the bias observed when $\rho=0$ does not disappear even when $M=1\mathrm{e}4$, and with $\rho=0.85$ the variance is very significantly reduced (when $M=5\mathrm{e}2,5\mathrm{e}3,1\mathrm{e}4$ respectively, the standard deviation is reduced by a factor 1.11, 6.58 and 11.96).

Figure 5 (left) is an empirical verification of the statement of Theorem 1. One hundred independent runs of Algorithm 1 are performed, and for different values of $M$, the quantities $M^{-1}\sum_{m=1}^{M}\big(\widehat{\phi}_M(X^{(m)})-\phi_\star(X^{(m)})\big)^2$ are collected; here $\widehat{\phi}_M$ is computed with $L=2$ basis functions. The mean value over these 100 points is displayed as a function of $M$; it is a Monte Carlo approximation of $\Delta_M$ (see (2.12)). We compare two implementations of Algorithm 1: first, $\mathsf{P}=\mathsf{P}_{\mathrm{GL}}$ with $\rho=0.85$, and then $\mathsf{P}=\mathsf{P}_{\mathrm{NR}}$ with $\rho=0.85$. Theorem 1 establishes that $\Delta_M$ is upper bounded by a quantity of the form $\alpha+\beta/M$; such a curve is fitted by a least-squares technique (we obtain $\alpha=0.001$ for both kernels, which is consistent with the theorem since this term does not depend on the Monte Carlo stages). The fitted curves are shown in Figure 5 (left) and they demonstrate a good match between the theory and the numerical studies.

### 3.2 Correlated geometric Brownian motions in dimension 2

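We adapt the one-dimensional example, taking a Put on the geometric average of two correlated assets $\{S_t=(S_{t,1},S_{t,2}) : t\ge 0\}$. In this example, $d=2$, $h(s_1,s_2)=\sqrt{s_1s_2}$ and $\mathcal{S}=\{(s_1,s_2)\in\mathbb{R}_+\times\mathbb{R}_+ : s_1\le s_\star,\ s_2\le s_\star\}$. We denote by $\sigma_1$, $\sigma_2$ and $\varrho$, respectively, each volatility and the correlation; the drift of $\{S_t : t\ge 0\}$ is zero. Set

$$\Gamma:=\begin{bmatrix}1&\varrho\\ \varrho&1\end{bmatrix},\qquad \xi(y_1,y_2):=\begin{bmatrix}S_{0,1}\exp\big(-\tfrac12\sigma_1^2T+\sqrt{T}\,\sigma_1y_1\big)\\[2pt] S_{0,2}\exp\big(-\tfrac12\sigma_2^2T+\sqrt{T}\,\sigma_2y_2\big)\end{bmatrix}.$$

We have $S_T=\xi(Y)$, where $Y\sim\mathcal{N}_2(0,\Gamma)$. Furthermore, it is easy to verify that $\{\sqrt{S_{t,1}S_{t,2}} : t\ge 0\}$ is still a geometric Brownian motion, with volatility $\sigma'$ and drift $\mu'$ given by

$$\sigma':=\tfrac12\sqrt{\sigma_1^2+\sigma_2^2+2\varrho\sigma_1\sigma_2},\qquad \mu':=-\tfrac18\big(\sigma_1^2+\sigma_2^2-2\varrho\sigma_1\sigma_2\big).$$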

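Hence, problem (3.1) is of the form (1.1) with

$$\begin{aligned}
f(y,r)&:=(r-p_\star)_+,\\
\mathcal{A}&:=\{y\in\mathbb{R}^2 : \xi(y)\in(-\infty,s_\star]\times(-\infty,s_\star]\}\\
&=\{y\in\mathbb{R}^2 : y_i\le y_{\star,i},\ i=1,2\},\qquad y_\star:=\Big[\frac{1}{\sigma_i\sqrt{T}}\ln\Big(\frac{s_\star}{S_{0,i}}\Big)+\frac12\sigma_i\sqrt{T}\Big]_{i=1,2},\\
R&:=\Big(K-\Psi(Y)\exp\Big\{\big(\mu'-\tfrac12(\sigma')^2\big)(T'-T)+\sqrt{T'-T}\,\sigma'Z\Big\}\Big)_+,
\end{aligned}$$

where $Z\sim\mathcal{N}(0,1)$ is independent of $Y$, and $\Psi(y):=\sqrt{(\xi(y))_1(\xi(y))_2}$.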

For the outer Monte Carlo stage, $\mathsf{P}_{GL}$ is defined as the Hastings–Metropolis kernel with proposal distribution $q(x,\cdot)\,\mathrm{d}\lambda\equiv\mathcal{N}_2(\rho x,(1-\rho^2)\Gamma)$ (with $\rho\in(0,1)$) and with invariant distribution the bi-dimensional Gaussian distribution $\mathcal{N}_2(0,\Gamma)$ restricted to the set $\mathcal{A}$. We compare this Markov kernel to the kernel $\mathsf{P}_{NR}$ with non-reversible proposal, defined as a Hastings–Metropolis kernel with proposal distribution $\mathcal{N}_2(\rho x+(1-\rho)y_\star,(1-\rho^2)\Gamma)$ and with the same invariant distribution, namely $\mathcal{N}_2(0,\Gamma)$ restricted to the set $\mathcal{A}$. The acceptance-rejection ratio for this algorithm is given by (2.8) with $x_{\mathcal{A}}\leftarrow y_\star$ and $\Sigma\leftarrow\Gamma$.
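A minimal sketch of one way to simulate the kernel $\mathsf{P}_{GL}$ of this example is given below. It relies on the fact that the proposal $\mathcal{N}_2(\rho x,(1-\rho^2)\Gamma)$ is reversible with respect to $\mathcal{N}_2(0,\Gamma)$, so that the acceptance step reduces to checking whether the proposed point lies in $\mathcal{A}$ (the structure also visible in (4.3)). The function name `sample_pgl`, the starting point and the seed are assumptions made for the illustration; the numerical values are those of Table 2.

```python
import numpy as np

def sample_pgl(x0, y_star, gamma, rho, n_steps, rng):
    """Chain with proposal N(rho*x, (1-rho^2)*Gamma); a move is accepted iff it stays in A."""
    chol = np.linalg.cholesky(gamma)
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_steps, x.size))
    n_accepted = 0
    for m in range(n_steps):
        noise = chol @ rng.standard_normal(x.size)
        proposal = rho * x + np.sqrt(1.0 - rho**2) * noise
        if np.all(proposal <= y_star):  # proposal belongs to the rare set A
            x = proposal
            n_accepted += 1
        chain[m] = x                     # on rejection the chain stays put
    return chain, n_accepted / n_steps

# Values of Table 2; y_star follows its definition in the text.
sigma = np.array([0.25, 0.35])
T, S0, s_star, corr = 1.0, 100.0, 50.0, 0.5
y_star = np.log(s_star / S0) / (sigma * np.sqrt(T)) + 0.5 * sigma * np.sqrt(T)
gamma = np.array([[1.0, corr], [corr, 1.0]])

# Hypothetical starting point, chosen inside A.
chain, acc_rate = sample_pgl(y_star - 0.1, y_star, gamma, rho=0.8,
                             n_steps=5000, rng=np.random.default_rng(0))
```

The returned `acc_rate` is the empirical mean acceptance rate of the run, the quantity used in the text to tune $\rho$.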

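In this example, the inner conditional expectation is explicit: $\phi(x)=\Phi_\star(\Psi(x))$ with

$$\begin{aligned}
\Phi_\star(u)&:=K\,\Phi(d_+(u))-u\,e^{\mu'(T'-T)}\,\Phi(d_-(u)),\qquad u>0,\\
d_\pm(u)&:=\frac{1}{\sigma'\sqrt{T'-T}}\ln\Big(\frac{K}{u\,e^{\mu'(T'-T)}}\Big)\pm\frac12\sigma'\sqrt{T'-T},
\end{aligned}$$

where $\Phi$ denotes the cumulative distribution function of $\mathcal{N}(0,1)$.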
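As a sanity check of the closed-form expression, the sketch below evaluates it at one point and compares it with a crude Monte Carlo average of the payoff defining $R$. The function names, the test point $u=40$, the sample size and the seed are assumptions made for the illustration; the model parameters are those of Table 2, with $T'-T=1$.

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_closed_form(u, K, mu_p, sigma_p, tau):
    """K*Phi(d_+) - u*exp(mu'*tau)*Phi(d_-), with tau = T' - T."""
    fwd = u * math.exp(mu_p * tau)
    vol = sigma_p * math.sqrt(tau)
    d_plus = math.log(K / fwd) / vol + 0.5 * vol
    d_minus = d_plus - vol
    return K * norm_cdf(d_plus) - fwd * norm_cdf(d_minus)

# Table 2 values and the derived volatility and drift of the geometric average.
s1, s2, corr = 0.25, 0.35, 0.5
sigma_p = 0.5 * math.sqrt(s1**2 + s2**2 + 2 * corr * s1 * s2)
mu_p = -(s1**2 + s2**2 - 2 * corr * s1 * s2) / 8.0
K, tau, u = 100.0, 1.0, 40.0   # u is a hypothetical value of Psi(x)

exact = phi_closed_form(u, K, mu_p, sigma_p, tau)

# Crude Monte Carlo estimate of E[(K - u*exp((mu' - sigma'^2/2)*tau + sqrt(tau)*sigma'*Z))_+].
rng = np.random.default_rng(1)
z = rng.standard_normal(400_000)
payoff = np.maximum(K - u * np.exp((mu_p - 0.5 * sigma_p**2) * tau
                                   + math.sqrt(tau) * sigma_p * z), 0.0)
mc = payoff.mean()
```

The two values `exact` and `mc` should agree up to the Monte Carlo error.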

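For the basis functions, we take

$$\begin{aligned}
\varphi_1(x)&=1, &\varphi_2(x)&=\sqrt{(\xi(x))_1}, &\varphi_3(x)&=\sqrt{(\xi(x))_2},\\
\varphi_4(x)&=(\xi(x))_1, &\varphi_5(x)&=(\xi(x))_2, &\varphi_6(x)&=\sqrt{(\xi(x))_1(\xi(x))_2}.
\end{aligned}$$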

The parameter values for the numerical tests are given in Table 2.

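Table 2

Parameter values for the two-dimensional example.

| $T$ | $T'$ | $S_{0,1}$ | $S_{0,2}$ | $K$ | $\sigma_1$ | $\sigma_2$ | $\varrho$ | $s_\star$ | $p_\star$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 100 | 100 | 100 | 25% | 35% | 50% | 50 | 5 |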
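With these values, the derived quantities of the example can be computed directly. The sketch below evaluates $y_\star$, $\sigma'$ and $\mu'$ from their definitions and gives a crude Monte Carlo estimate of $\mathbb{P}(Y\in\mathcal{A})$ to indicate how rare the event is; the variable names, the sample size and the seed are assumptions made for the illustration.

```python
import numpy as np

# Parameter values of Table 2.
T, T_prime = 1.0, 2.0
S0 = np.array([100.0, 100.0])
sigma = np.array([0.25, 0.35])
corr, s_star = 0.5, 50.0

def xi(y):
    """Map the Gaussian vector y to the asset values at time T."""
    return S0 * np.exp(-0.5 * sigma**2 * T + np.sqrt(T) * sigma * y)

# Threshold vector y_star: the rare set is A = {y : y_i <= y_star_i, i = 1, 2}.
y_star = np.log(s_star / S0) / (sigma * np.sqrt(T)) + 0.5 * sigma * np.sqrt(T)

# Volatility and drift of the geometric average sqrt(S_{t,1} * S_{t,2}).
sigma_prime = 0.5 * np.sqrt(sigma[0]**2 + sigma[1]**2 + 2 * corr * sigma[0] * sigma[1])
mu_prime = -(sigma[0]**2 + sigma[1]**2 - 2 * corr * sigma[0] * sigma[1]) / 8.0

# Crude Monte Carlo estimate of P(Y in A) with Y ~ N_2(0, Gamma).
rng = np.random.default_rng(0)
gamma = np.array([[1.0, corr], [corr, 1.0]])
y = rng.standard_normal((1_000_000, 2)) @ np.linalg.cholesky(gamma).T
p_rare = np.mean(np.all(y <= y_star, axis=1))
```

By construction `xi(y_star)` returns the corner $(s_\star,s_\star)$ of the rare set in the asset coordinates.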

Figure 6 depicts the rare event $\mathcal{A}$: on the left (resp. on the right), some level curves of the density of $\mathcal{N}_2(0,\Gamma)$ (resp. of the distribution of $(S_{T,1},S_{T,2})$) are displayed, together with the rare event in the bottom left corner.

Figure 6

Left: Level curves of $\mathcal{N}_2(0,\Gamma)$ and the rare set in the lower left area delimited by the two hyperplanes. Right: Level curves of the density function of $(S_{T,1},S_{T,2})$ and the rare set in the lower left area delimited by the two hyperplanes.

We run two Markov chains respectively with kernel 𝖯GL and 𝖯NR and compute the mean acceptance-rejection rate after M=1e4 iterations. For different values of ρ, this experiment is repeated 100 times, independently; Figure 7 reports the boxplot of these mean acceptance rates. It shows that a rate close to 0.234 is reached with ρ=0.8 for 𝖯=𝖯GL and ρ=0.7 for 𝖯=𝖯NR. In all the experiments below involving these kernels, we will use these values of the design parameter ρ.

Figure 7

Boxplot over 100 independent runs, of the mean acceptance rate after M=1e4 iterations for the kernel 𝖯=𝖯GL (top) and the kernel 𝖯=𝖯NR (bottom). Different values of ρ are considered.

Figure 8

Left: Normalized histograms of the error $\{\widehat{\phi}_M(X^{(m)})-\phi(X^{(m)}) : m=1,\dots,M\}$, when $L=3$, with design points sampled with $\mathsf{P}_{GL}$ (left) and $\mathsf{P}_{NR}$ (right). Right: The same case with $L=6$.

In Figure 8 (left), the normalized histogram of the errors $\{\widehat{\phi}_M(X^{(m)})-\phi(X^{(m)}) : m=1,\dots,M\}$ is displayed when $L=3$ and the samples $X^{(1:M)}$ are sampled from $\mathsf{P}=\mathsf{P}_{GL}$ (left) or $\mathsf{P}=\mathsf{P}_{NR}$ (right). Figure 8 (right) shows the case $L=6$. Here, $M=\mathrm{1e6}$. This clearly shows the improvement obtained by choosing more basis functions. In particular, the sixth basis function brings much accuracy, as expected, since the regression function $\phi$ depends directly on it.

Figure 9

Left: Error function $s\mapsto\widehat{\phi}_M(\xi^{-1}(s))-\phi(\xi^{-1}(s))$, with $L=3$, with design points sampled with $\mathsf{P}_{GL}$ (left) and $\mathsf{P}_{NR}$ (right). Right: The same case with $L=6$.

In Figure 9 (left), the errors $s\mapsto\widehat{\phi}_M(\xi^{-1}(s))-\phi(\xi^{-1}(s))$ are displayed on $[15,s_\star]\times[15,s_\star]$ when $L=3$ and the outer samples $X^{(1:M)}$ used in the computation of $\widehat{\phi}_M$ are sampled from $\mathsf{P}=\mathsf{P}_{GL}$ (left) and $\mathsf{P}=\mathsf{P}_{NR}$ (right). Figure 9 (right) shows the case $L=6$. Here, $M=\mathrm{1e6}$. This is complementary to Figure 8 since it shows the prediction error everywhere in the space, and not only along the design points.

Figure 10

Left: A Monte Carlo approximation of $M\mapsto\Delta_M$, and a fitted curve of the form $M\mapsto\alpha+\beta/M$. Right: For different values of $\rho_{GL}$ and $\rho_{NR}$, and for four different values of $M$ (namely $M\in\{\mathrm{1e2},\mathrm{5e3},\mathrm{1e4},\mathrm{2e4}\}$), boxplot of 100 independent estimates $\widehat{\mathcal{I}}_M$.

In Figure 10 (left), a Monte Carlo approximation of $\Delta_M$ (see (2.12)) computed from 100 independent estimators $\widehat{\phi}_M$ is displayed as a function of $M$ for $M$ in the range $[\mathrm{3e3},\mathrm{5e4}]$, where $\widehat{\phi}_M$ is computed with $L=6$. We also fit a curve of the form $M\mapsto\alpha+\beta/M$ to illustrate the sharpness of the upper bound in (2.12). In Figure 10 (right), the boxplot of 100 independent outputs $\widehat{\mathcal{I}}_M$ of Algorithm 1 is displayed, for $M\in\{\mathrm{1e2},\mathrm{5e3},\mathrm{1e4},\mathrm{2e4}\}$ and different values of $\rho_{GL}$ (resp. $\rho_{NR}$), the design parameter in $\mathsf{P}_{GL}$ (resp. $\mathsf{P}_{NR}$). Here again, we observe the advantage of using MCMC samplers to reduce the variance in this regression problem coupled with a rare-event regime: when $M=\mathrm{5e3},\mathrm{1e4},\mathrm{2e4}$ respectively, the standard deviation is reduced by a factor 6.89, 7.27 and 7.74.

## 4 Proofs of the results of Section 2.2

### 4.1 Proof of Theorem 1

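By the construction of the random variables $\bar{\mathbf{R}}$ and $X^{(1:M)}$ (see Algorithm 1), for any bounded and positive measurable functions $g_1,\dots,g_M$, it holds

$$\mathbb{E}\Big[\prod_{m=1}^{M}g_m(R^{(m)})\,\Big|\,X^{(1:M)}\Big]=\prod_{m=1}^{M}\mathbb{E}\big[g_m(R^{(m)})\mid X^{(m)}\big]=\prod_{m=1}^{M}\int g_m(r)\,\mathsf{Q}(X^{(m)},\mathrm{d}r).\tag{4.1}$$

Lemma 1

If $\mathbf{A}^\top\mathbf{A}\alpha=\mathbf{A}^\top\mathbf{A}\tilde{\alpha}$, then $\mathbf{A}\alpha=\mathbf{A}\tilde{\alpha}$. In other words, any coefficient solution $\alpha$ of the least-squares problem (2.3) yields the same values for the approximated regression function along the design $X^{(1:M)}$.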

### Proof.

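Denote by $r$ the rank of $\mathbf{A}$ and write $\mathbf{A}=UDV^\top$ for the singular value decomposition of $\mathbf{A}$. It holds

$$\mathbf{A}^\top\mathbf{A}\alpha=\mathbf{A}^\top\mathbf{A}\tilde{\alpha}\iff D^\top DV^\top\alpha=D^\top DV^\top\tilde{\alpha}$$

by using $V^\top V=I_L$ and $U^\top U=I_M$. This implies that the first $r$ components of $V^\top\alpha$ and $V^\top\tilde{\alpha}$ are equal and thus $DV^\top\alpha=DV^\top\tilde{\alpha}$. Multiplying by $U$ gives $\mathbf{A}\alpha=\mathbf{A}\tilde{\alpha}$, which concludes the proof. ∎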
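Lemma 1 can be checked numerically on a rank-deficient design. In the sketch below, the matrix `A`, the vector `R` and the null-space direction are hypothetical choices made for the illustration; the third column of `A` is the sum of the first two, so the normal equations have infinitely many solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((50, 2))
A = np.column_stack([base, base.sum(axis=1)])  # M = 50, L = 3, rank 2
R = rng.standard_normal(50)

# Minimal-norm solution of the normal equations (explicit cutoff for the zero singular value).
alpha = np.linalg.pinv(A, rcond=1e-10) @ R

# Another solution: shift along the null space of A, since A @ (1, 1, -1) = 0.
alpha_tilde = alpha + 2.5 * np.array([1.0, 1.0, -1.0])

solves_normal_eq = np.allclose(A.T @ A @ alpha_tilde, A.T @ R)
same_fit = np.allclose(A @ alpha, A @ alpha_tilde)
```

The two coefficient vectors differ, yet both solve the normal equations and give the same fitted values along the design, as the lemma states.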

The next result justifies a possible interchange between least-squares projection and conditional expectation.

Lemma 2

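Set $\widehat{\phi}_M=\langle\widehat{\alpha}_M;\bar{\phi}\rangle$, where $\widehat{\alpha}_M\in\mathbb{R}^L$ is any solution to $\mathbf{A}^\top\mathbf{A}\alpha=\mathbf{A}^\top\bar{\mathbf{R}}$. Then the function

$$x\mapsto\mathbb{E}\big[\widehat{\phi}_M(x)\mid X^{(1:M)}\big]$$

is a solution to the least-squares problem

$$\min_{\varphi\in\mathcal{F}}\frac1M\sum_{m=1}^{M}\big(\phi(X^{(m)})-\varphi(X^{(m)})\big)^2,$$

where $\mathcal{F}:=\{\varphi=\langle\alpha;\bar{\phi}\rangle : \alpha\in\mathbb{R}^L\}$.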

### Proof.

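It is sufficient to prove that

$$\min_{\varphi\in\mathcal{F}}\frac1M\sum_{m=1}^{M}\big(\phi(X^{(m)})-\varphi(X^{(m)})\big)^2=\frac1M\Big|\bar{\phi}_\star-\mathbf{A}\,\mathbb{E}\big[\widehat{\alpha}_M\mid X^{(1:M)}\big]\Big|^2,$$

where $\bar{\phi}_\star:=(\phi(X^{(1)}),\dots,\phi(X^{(M)}))^\top$ is the vector of the values of the regression function along the design. The solution of the above least-squares problem is of the form

$$x\mapsto\langle\widehat{\alpha}_{M,\star};\bar{\phi}(x)\rangle,$$

where $\widehat{\alpha}_{M,\star}\in\mathbb{R}^L$ satisfies $\mathbf{A}^\top\mathbf{A}\widehat{\alpha}_{M,\star}=\mathbf{A}^\top\bar{\phi}_\star$. By (4.1) and the definition of $\widehat{\alpha}_M$, this yields

$$\mathbf{A}^\top\bar{\phi}_\star=\mathbf{A}^\top\begin{bmatrix}\mathbb{E}[R^{(1)}\mid X^{(1)}]\\ \vdots\\ \mathbb{E}[R^{(M)}\mid X^{(M)}]\end{bmatrix}=\mathbb{E}\big[\mathbf{A}^\top\bar{\mathbf{R}}\mid X^{(1:M)}\big]=\mathbf{A}^\top\mathbf{A}\,\mathbb{E}\big[\widehat{\alpha}_M\mid X^{(1:M)}\big].$$

We then conclude by Lemma 1 that $\mathbf{A}\widehat{\alpha}_{M,\star}=\mathbf{A}\,\mathbb{E}[\widehat{\alpha}_M\mid X^{(1:M)}]$. We are done. ∎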

### Proof of Theorem 1.

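Using Lemma 2 and the Pythagorean theorem, it holds

$$\frac1M\sum_{m=1}^{M}\big(\widehat{\phi}_M(X^{(m)})-\phi(X^{(m)})\big)^2=\mathcal{T}_1+\mathcal{T}_2$$

with

$$\begin{aligned}
\mathcal{T}_1&:=\frac1M\sum_{m=1}^{M}\Big(\widehat{\phi}_M(X^{(m)})-\mathbb{E}\big[\widehat{\phi}_M(X^{(m)})\mid X^{(1:M)}\big]\Big)^2=\frac1M\Big|\mathbf{A}\big(\widehat{\alpha}_M-\mathbb{E}[\widehat{\alpha}_M\mid X^{(1:M)}]\big)\Big|^2,\\
\mathcal{T}_2&:=\frac1M\sum_{m=1}^{M}\Big(\mathbb{E}\big[\widehat{\phi}_M(X^{(m)})\mid X^{(1:M)}\big]-\phi(X^{(m)})\Big)^2.
\end{aligned}$$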

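By Lemma 1, we can take $\widehat{\alpha}_M=(\mathbf{A}^\top\mathbf{A})^{\#}\mathbf{A}^\top\bar{\mathbf{R}}$, which is the coefficient with minimal norm among the solutions of the least-squares problem of Algorithm 1. Let us consider $\mathcal{T}_1$. Set $\mathbf{B}:=\mathbf{A}(\mathbf{A}^\top\mathbf{A})^{\#}\mathbf{A}^\top$, an $M\times M$ matrix. By (2.5) and Lemma 2,

$$M\mathcal{T}_1=|\mathbf{B}\Upsilon|^2=\mathrm{Trace}(\mathbf{B}\Upsilon\Upsilon^\top\mathbf{B}),\qquad\text{with}\quad\Upsilon:=\begin{bmatrix}R^{(1)}-\phi(X^{(1)})\\ \vdots\\ R^{(M)}-\phi(X^{(M)})\end{bmatrix},$$

so $M\,\mathbb{E}[\mathcal{T}_1\mid X^{(1:M)}]$ is equal to $\mathrm{Trace}(\mathbf{B}\,\mathbb{E}[\Upsilon\Upsilon^\top\mid X^{(1:M)}]\,\mathbf{B})$. Under (4.1) and (2.11), the matrix $\mathbb{E}[\Upsilon\Upsilon^\top\mid X^{(1:M)}]$ is diagonal with diagonal entries upper bounded by $\sigma^2$. Therefore, since $\mathbf{B}$ is the orthogonal projection onto the range of $\mathbf{A}$,

$$M\,\mathbb{E}\big[\mathcal{T}_1\mid X^{(1:M)}\big]\le\sigma^2\,\mathrm{Trace}(\mathbf{B}^2)=\sigma^2\,\mathrm{rank}(\mathbf{A})\le\sigma^2L.\tag{4.2}$$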

This concludes the control of 𝔼[𝒯1].
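The trace identity used in (4.2) can be illustrated numerically. The design matrix below is a hypothetical rank-deficient example (its last column is a linear combination of two others); the sketch checks that the matrix built from the pseudo-inverse is a symmetric idempotent matrix whose squared trace equals the rank of the design.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.standard_normal((40, 2))
# M = 40 design points, L = 4 basis functions, but only rank 3.
A = np.column_stack([np.ones(40), base, base[:, 0] - base[:, 1]])

# B = A (A^T A)^# A^T, with an explicit cutoff so the zero singular value is discarded.
B = A @ np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T

rank_A = np.linalg.matrix_rank(A)
trace_B2 = np.trace(B @ B)
```

Here `trace_B2` coincides with `rank_A`, which is at most the number $L$ of columns; this is the step that turns the conditional variance bound into $\sigma^2L/M$.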

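Using again Lemma 2,

$$\mathcal{T}_2=\min_{\varphi\in\mathcal{F}}\frac1M\sum_{m=1}^{M}\big(\varphi(X^{(m)})-\phi(X^{(m)})\big)^2\le\frac1M\sum_{m=1}^{M}\big(\psi(X^{(m)})-\phi(X^{(m)})\big)^2.$$

Hence,

$$\begin{aligned}
\mathbb{E}[\mathcal{T}_2]&\le\frac1M\sum_{m=1}^{M}\mathbb{E}\Big[\big(\psi(X^{(m)})-\phi(X^{(m)})\big)^2\Big]\\
&=|\psi-\phi|_{L_2(\mu)}^2+\frac1M\sum_{m=1}^{M}\Big\{\mathbb{E}\Big[\big(\psi(X^{(m)})-\phi(X^{(m)})\big)^2\Big]-\int(\psi-\phi)^2\,\mu\,\mathrm{d}\lambda\Big\}.
\end{aligned}$$

By (2.10), the right-hand side is upper bounded by $|\psi-\phi|_{L_2(\mu)}^2+C_{\mathsf{P}}\sum_{m=1}^{M}\rho(m)/M$. This concludes the proof of (2.12). ∎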

### 4.2 Proof of Corollary 3

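Note that $\mathsf{P}_{GL}$ is a Hastings–Metropolis kernel; hence, for any $x\in\mathcal{A}$ and any measurable set $A$ in $\mathcal{A}$,

$$\mathsf{P}_{GL}(x,A)=\int_{A\cap\mathcal{A}}q(x,z)\,\mathrm{d}\lambda(z)+\delta_x(A)\int_{\mathcal{A}^c}q(x,z)\,\mathrm{d}\lambda(z).\tag{4.3}$$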

#### Irreducibility.

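Let $A$ be a measurable subset of $\mathcal{A}$ such that $\int_A\mu\,\mathrm{d}\lambda>0$ (which implies that $\int_A\mathrm{d}\lambda>0$). Then

$$\mathsf{P}_{GL}(x,A)\ge\int_{A\cap\mathcal{A}\cap\{z:\mu(z)>0\}}q(x,z)\,\mathrm{d}\lambda(z)$$

and the right-hand side is positive since, owing to assumption (i),

$$q(x,z)>0\quad\text{for all }x\in\mathcal{A},\ z\in A\cap\mathcal{A}\cap\{z:\mu(z)>0\}.$$

This implies that $\mathsf{P}_{GL}$ is phi-irreducible with $\mu\,\mathrm{d}\lambda$ as irreducibility measure.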

#### Drift assumption.

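By assumption (ii) and from (4.3), we have

$$\mathsf{P}_{GL}(x,A)\le\delta_1\,\delta_x(A)+\int_{A\cap\mathcal{A}}q(x,z)\,\mathrm{d}\lambda(z),$$

which implies by (iii), writing $\mathcal{B}$ for the set introduced in that assumption, that

$$\begin{aligned}
\mathsf{P}_{GL}V(x)&\le\delta_1V(x)+\int_{\mathcal{A}}V(z)\,q(x,z)\,\mathrm{d}\lambda(z)\\
&\le\delta_1V(x)+\mathbf{1}_{\mathcal{B}}(x)\sup_{x\in\mathcal{B}}\int_{\mathcal{A}}V(z)\,q(x,z)\,\mathrm{d}\lambda(z)+\mathbf{1}_{\mathcal{B}^c}(x)(\delta_2-\delta_1)V(x)\\
&\le\delta_2V(x)+\sup_{x\in\mathcal{B}}\int_{\mathcal{A}}V(z)\,q(x,z)\,\mathrm{d}\lambda(z).
\end{aligned}$$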

#### Small set assumption.

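Let $\mathcal{C}$ be given by assumption (iv). We have $\int_{\mathcal{C}}\mu\,\mathrm{d}\lambda>0$; thus define the probability measure

$$\mathrm{d}\nu:=\frac{\mathbf{1}_{\mathcal{C}}\,\mu\,\mathrm{d}\lambda}{\int_{\mathcal{C}}\mu\,\mathrm{d}\lambda}.$$

Then for any $x\in\mathcal{C}$ and any measurable subset $A$ of $\mathcal{C}$, it readily follows from (4.3) that

$$\begin{aligned}
\mathsf{P}_{GL}(x,A)&\ge\int_{A\cap\mathcal{A}}q(x,z)\,\mathbf{1}_{\mu(z)\neq0}\,\mathrm{d}\lambda(z)\\
&\ge\inf_{(x,z)\in\mathcal{C}^2}\Big(\frac{q(x,z)\,\mathbf{1}_{\mu(z)\neq0}}{\mu(z)}\Big)\int_{A\cap\mathcal{A}}\mu(z)\,\mathrm{d}\lambda(z)\\
&=\inf_{(x,z)\in\mathcal{C}^2}\Big(\frac{q(x,z)\,\mathbf{1}_{\mu(z)\neq0}}{\mu(z)}\Big)\Big(\int_{\mathcal{C}}\mu\,\mathrm{d}\lambda\Big)\,\nu(A\cap\mathcal{A}).
\end{aligned}$$

Thanks to the lower bounds of (iv), we complete the proof.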

### 4.3 Proof of Theorem 4

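We write $\widehat{\mathcal{I}}_M-\mathcal{I}=\mathcal{T}_1+\mathcal{T}_2$ with

$$\begin{aligned}
\mathcal{T}_1&:=\frac1M\sum_{m=1}^{M}f\big(X^{(m)},\widehat{\phi}_M(X^{(m)})\big)-\frac1M\sum_{m=1}^{M}f\big(X^{(m)},\phi(X^{(m)})\big),\\
\mathcal{T}_2&:=\frac1M\sum_{m=1}^{M}f\big(X^{(m)},\phi(X^{(m)})\big)-\int f(x,\phi(x))\,\mu(x)\,\mathrm{d}\lambda(x).
\end{aligned}$$

For the first term, we have

$$\begin{aligned}
\mathbb{E}\big[|\mathcal{T}_1|^2\big]&\le\mathbb{E}\Big[\frac1M\sum_{m=1}^{M}\big|f\big(X^{(m)},\widehat{\phi}_M(X^{(m)})\big)-f\big(X^{(m)},\phi(X^{(m)})\big)\big|^2\Big]\\
&\le C_f^2\,\mathbb{E}\Big[\frac1M\sum_{m=1}^{M}\big|\widehat{\phi}_M(X^{(m)})-\phi(X^{(m)})\big|^2\Big]\\
&=C_f^2\,\Delta_M.
\end{aligned}$$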

The second term is controlled by assumption (ii). We then conclude by the Minkowski inequality.

## 5 Conclusion

We have designed a new methodology to compute nested expectations in the rare-event regime. The outer expectation is evaluated using an ergodic Markov chain restricted to the rare set, whereas the inner expectation is approximated using a linear regression method with general basis functions. We quantified the error bounds as a function of the number of outer samples and of the size of the basis. This highlights that, in the regression scheme, replacing the usual i.i.d. design by an ergodic Markov chain design does not significantly alter the statistical errors.

When the inner expectation is instead computed pointwise with i.i.d. samples, we also provided error bounds, which show that this approach for the inner expectation is more suitable than the regression method for high-dimensional problems (curse of dimensionality).

In our numerical tests, we illustrated how to choose appropriately the parameters of the ergodic Markov chain so that the mean acceptance rate for staying in the rare set is about 20–30%. This usually ensures low variance in the full scheme.

Award Identifier / Grant number: ANR-15-CE05-0024

Funding statement: The second author’s research is part of the Chair Financial Risks of the Risk Foundation, the Finance for Energy Market Research Centre and the ANR project CAESARS (ANR-15-CE05-0024).

## A Algorithm where the inner stage uses a crude Monte Carlo method and the outer stage uses MCMC sampling

Here, the regression function $\phi$ is approximated by an empirical mean using $N$ (conditionally) independent samples $R^{(m,k)}$, as in (1.2). We keep the same notation as in Section 2.

Algorithm 3

### A.2 Convergence results for the estimation of ℐ~M

We extend Theorem 1 to this new setting. Actually, when the function $f$ in (1.1) is smoother than Lipschitz continuous, we can prove that the impact of $N$ on the quadratic error is $1/N$ instead of the usual $1/\sqrt{N}$. This kind of improvement has been derived by [13] in the i.i.d. setting (for both the inner and outer stages).

Theorem 5

Assume that the following conditions hold:

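1. The (second and) fourth conditional moments of $\mathsf{Q}$ are bounded: for $p=2$ and $p=4$, we have

   $$\sigma_p:=\Big(\sup_{x\in\mathcal{A}}\int\Big|r-\int r\,\mathsf{Q}(x,\mathrm{d}r)\Big|^p\,\mathsf{Q}(x,\mathrm{d}r)\Big)^{1/p}<\infty.$$
2. There exists a finite constant $C$ such that for any $M$,

   $$\mathbb{E}\Big[\Big(M^{-1}\sum_{m=1}^{M}f\big(X^{(m)},\phi(X^{(m)})\big)-\int f(x,\phi(x))\,\mu(x)\,\mathrm{d}\lambda(x)\Big)^2\Big]\le\frac{C}{M}.$$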

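If $f:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}$ is globally $C_f$-Lipschitz in the second variable, then

$$\Big(\mathbb{E}\big[|\widetilde{\mathcal{I}}_M-\mathcal{I}|^2\big]\Big)^{1/2}\le C_f\frac{\sigma_2}{\sqrt{N}}+\sqrt{\frac{C}{M}},$$

where $\mathcal{I}$ and $\widetilde{\mathcal{I}}_M$ are respectively given by (1.1) and Algorithm 3. If $f$ is continuously differentiable in the second variable, with a derivative in the second variable which is bounded and globally $C_{\partial_rf}$-Lipschitz, then

$$\Big(\mathbb{E}\big[|\widetilde{\mathcal{I}}_M-\mathcal{I}|^2\big]\Big)^{1/2}\le\frac{C_{\partial_rf}}{2}\,\frac1N\sqrt{3\sigma_2^4+\frac{\sigma_4^4}{N}}+\frac{\sigma_2}{\sqrt{NM}}\sup_x\big|\partial_rf(x,\phi(x))\big|+\sqrt{\frac{C}{M}}.$$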

### A.3 Proof of Theorem 5

#### First case: f Lipschitz.

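We write $\widetilde{\mathcal{I}}_M-\mathcal{I}=\mathcal{T}_1+\mathcal{T}_2$ with

$$\begin{aligned}
\mathcal{T}_1&:=\frac1M\sum_{m=1}^{M}f\big(X^{(m)},\overline{R}_N^{(m)}\big)-\frac1M\sum_{m=1}^{M}f\big(X^{(m)},\phi(X^{(m)})\big),\\
\mathcal{T}_2&:=\frac1M\sum_{m=1}^{M}f\big(X^{(m)},\phi(X^{(m)})\big)-\int f(x,\phi(x))\,\mu(x)\,\mathrm{d}\lambda(x).
\end{aligned}$$

For the first term, since $f$ is globally Lipschitz with constant $C_f$, we have

$$\mathbb{E}\big[|\mathcal{T}_1|^2\big]\le C_f^2\,\mathbb{E}\Big[\frac1M\sum_{m=1}^{M}\big|\overline{R}_N^{(m)}-\phi(X^{(m)})\big|^2\Big].$$

Since $(R^{(m,k)}:1\le k\le N)$ are independent conditionally on $X^{(1:M)}$, we have

$$\mathbb{E}\big[\overline{R}_N^{(m)}\,\big|\,X^{(1:M)}\big]=\phi(X^{(m)})$$

and

$$\mathrm{Var}\big[\overline{R}_N^{(m)}\mid X^{(1:M)}\big]\le\frac{\sigma_2^2}{N}.$$

Thus,

$$\mathbb{E}\big[|\mathcal{T}_1|^2\big]\le\frac{C_f^2}{N}\,\sigma_2^2.$$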

The second term is controlled by assumption (ii). We are done.

#### Second case: f smooth.

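Set $\mathcal{T}_1=\mathcal{T}_{1,a}+\mathcal{T}_{1,b}$ with

$$\begin{aligned}
\mathcal{T}_{1,a}&:=\frac1M\sum_{m=1}^{M}\Big(f\big(X^{(m)},\overline{R}_N^{(m)}\big)-f\big(X^{(m)},\phi(X^{(m)})\big)-\partial_rf\big(X^{(m)},\phi(X^{(m)})\big)\big(\overline{R}_N^{(m)}-\phi(X^{(m)})\big)\Big),\\
\mathcal{T}_{1,b}&:=\frac1M\sum_{m=1}^{M}\partial_rf\big(X^{(m)},\phi(X^{(m)})\big)\big(\overline{R}_N^{(m)}-\phi(X^{(m)})\big).
\end{aligned}$$

A Taylor expansion gives

$$\begin{aligned}
|\mathcal{T}_{1,a}|&\le\frac12C_{\partial_rf}\,\frac1M\sum_{m=1}^{M}\big|\overline{R}_N^{(m)}-\phi(X^{(m)})\big|^2,\\
\mathbb{E}\big[|\mathcal{T}_{1,a}|^2\big]&\le\Big(\frac12C_{\partial_rf}\Big)^2\frac1M\sum_{m=1}^{M}\mathbb{E}\Big[\big|\overline{R}_N^{(m)}-\phi(X^{(m)})\big|^4\Big].
\end{aligned}$$

Invoking that $(R^{(m,k)}:1\le k\le N)$ are independent conditionally on $X^{(1:M)}$ leads to

$$\mathbb{E}\Big[\big|\overline{R}_N^{(m)}-\phi(X^{(m)})\big|^4\,\Big|\,X^{(1:M)}\Big]\le3\sigma_2^4\,\frac{N-1}{N^3}+\sigma_4^4\,\frac{1}{N^3}.$$

Moreover, upon noting that for $m\neq m'$,

$$\mathbb{E}\Big[\Big(\partial_rf\big(X^{(m)},\phi(X^{(m)})\big)\big(\overline{R}_N^{(m)}-\phi(X^{(m)})\big)\Big)\Big(\partial_rf\big(X^{(m')},\phi(X^{(m')})\big)\big(\overline{R}_N^{(m')}-\phi(X^{(m')})\big)\Big)\Big]=0,$$

we have

$$\begin{aligned}
\mathbb{E}\big[|\mathcal{T}_{1,b}|^2\big]&=\mathbb{E}\Big[\frac{1}{M^2}\sum_{m=1}^{M}\big|\partial_rf\big(X^{(m)},\phi(X^{(m)})\big)\big|^2\,\mathbb{E}\Big[\big|\overline{R}_N^{(m)}-\phi(X^{(m)})\big|^2\,\Big|\,X^{(1:M)}\Big]\Big]\\
&\le\frac{\sup_x|\partial_rf(x,\phi(x))|^2}{M}\,\frac{\sigma_2^2}{N}.
\end{aligned}$$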

This concludes the proof. ∎
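The fourth-moment bound used for $\mathcal{T}_{1,a}$ can be checked exactly on a toy distribution. In the sketch below, the inner samples are replaced by $N$ i.i.d. centered draws from a hypothetical two-point law (an assumption made for the illustration), in which case the right-hand side $3\sigma_2^4(N-1)/N^3+\sigma_4^4/N^3$ is attained with equality; the expectation is computed by exhaustive enumeration with rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def fourth_moment_of_mean(values, probs, n):
    """Exact E[(mean of n i.i.d. draws)^4], by enumerating all outcomes."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]
        mean = sum(values[i] for i in idx) / Fraction(n)
        total += weight * mean**4
    return total

# Hypothetical centered two-point law: -1 with probability 2/3, 2 with probability 1/3.
values = [Fraction(-1), Fraction(2)]
probs = [Fraction(2, 3), Fraction(1, 3)]
n = 3

m2 = sum(p * v**2 for v, p in zip(values, probs))  # plays the role of sigma_2^2
m4 = sum(p * v**4 for v, p in zip(values, probs))  # plays the role of sigma_4^4

lhs = fourth_moment_of_mean(values, probs, n)
rhs = 3 * m2**2 * Fraction(n - 1, n**3) + m4 * Fraction(1, n**3)
```

In the proof the inequality appears because $\sigma_2$ and $\sigma_4$ are suprema over $x$ of the conditional moments; for a single fixed law, as here, both sides coincide.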

## References

[1] Baraud Y., Comte F. and Viennet G., Adaptive estimation in autoregression or β-mixing regression via model selection, Ann. Statist. 29 (2001), no. 3, 839–875. doi:10.1214/aos/1009210692

[2] Belomestny D., Kolodko A. and Schoenmakers J., Regression methods for stochastic control problems and their convergence analysis, SIAM J. Control Optim. 48 (2010), no. 5, 3562–3588. doi:10.1137/090752651

[3] Blanchet J. and Lam H., State-dependent importance sampling for rare event simulation: An overview and recent advances, Surv. Oper. Res. Manag. Sci. 17 (2012), 38–59. doi:10.1016/j.sorms.2011.09.002

[4] Broadie M., Du Y. and Moallemi C. C., Risk Estimation via regression, Oper. Res. 63 (2015), no. 5, 1077–1097. doi:10.1287/opre.2015.1419

[5] Delattre S. and Gaïffas S., Nonparametric regression with martingale increment errors, Stochastic Process. Appl. 121 (2011), 2899–2924. doi:10.1016/j.spa.2011.08.002

[6] Devineau L. and Loisel S., Construction d’un algorithme d’accélération de la méthode des “simulations dans les simulations” pour le calcul du capital économique solvabilité ii, Bull. Français d’Actuariat 10 (2009), no. 17, 188–221.

[7] Douc R., Fort G., Moulines E. and Soulier P., Practical drift conditions for subgeometric rates of convergence, Ann. Appl. Probab. 14 (2004), no. 3, 1353–1377. doi:10.1214/105051604000000323

[8] Egloff D., Monte Carlo algorithms for optimal stopping and statistical learning, Ann. Appl. Probab. 15 (2005), 1396–1432. doi:10.1214/105051605000000043

[9] Fort G. and Moulines E., Convergence of the Monte Carlo expectation maximization for curved exponential families, Ann. Statist. 31 (2003), no. 4, 1220–1259. doi:10.1214/aos/1059655912

[10] Fort G. and Moulines E., Polynomial ergodicity of Markov transition kernels, Stochastic Process. Appl. 103 (2003), no. 1, 57–99. doi:10.1016/S0304-4149(02)00182-5

[11] Gobet E. and Liu G., Rare event simulation using reversible shaking transformations, SIAM J. Sci. Comput. 37 (2015), no. 5, A2295–A2316. doi:10.1137/14098418X

[12] Gobet E. and Turkedjiev P., Linear regression MDP scheme for discrete backward stochastic differential equations under general conditions, Math. Comp. 85 (2016), no. 299, 1359–1391. doi:10.1090/mcom/3013

[13] Gordy M. B. and Juneja S., Nested simulation in portfolio risk measurement, Manag. Sci. 56 (2010), no. 10, 1833–1848. doi:10.1287/mnsc.1100.1213

[14] Gyorfi L., Kohler M., Krzyzak A. and Walk H., A Distribution-Free Theory of Nonparametric Regression, Springer Ser. Statist., Springer, New York, 2002. doi:10.1007/b97848

[15] Hong L. J. and Juneja S., Estimating the mean of a non-linear function of conditional expectation, Proceedings of the 2009 Winter Simulation Conference (WSC), IEEE Press, Piscataway (2009), 1223–1236. doi:10.1109/WSC.2009.5429428

[16] Lemor J-P., Gobet E. and Warin X., Rate of convergence of an empirical regression method for solving generalized backward stochastic differential equations, Bernoulli 12 (2006), no. 5, 889–916. doi:10.3150/bj/1161614951

[17] Liu M. and Staum J., Stochastic kriging for efficient nested simulation of expected shortfall, J. Risk 12 (2010), no. 3, 3–27. doi:10.21314/JOR.2010.211

[18] Longstaff F. and Schwartz E. S., Valuing American options by simulation: A simple least squares approach, Rev. Financ. Stud. 14 (2001), 113–147. doi:10.1093/rfs/14.1.113

[19] McNeil A. J., Frey R. and Embrechts P., Quantitative Risk Management, Princeton Ser. Finance, Princeton University Press, Princeton, 2005.

[20] Meyn S. P. and Tweedie R. L., Markov Chains and Stochastic Stability, Springer, Berlin, 1993. doi:10.1007/978-1-4471-3267-7

[21] Ren Q. and Mojirsheibani M., A note on nonparametric regression with β-mixing sequences, Comm. Statist. Theory Methods 39 (2010), no. 12, 2280–2287. doi:10.1080/03610920903039480

[22] Rosenthal J. S., Optimal Proposal Distributions and Adaptive MCMC, Chapman & Hall/CRC Handb. Mod. Stat. Methods, CRC Press, Boca Raton, 2008.

[23] Rubinstein R. Y. and Kroese D. P., Simulation and the Monte-Carlo Method, 2nd ed., Wiley Ser. Probab. Stat., John Wiley & Sons, Hoboken, 2008. doi:10.1002/9780470230381

[24] Tsitsiklis J. N. and Van Roy B., Regression methods for pricing complex American-style options, IEEE Trans. Neural Netw. 12 (2001), no. 4, 694–703. doi:10.1109/72.935083