
# Monte Carlo Methods and Applications


Online ISSN: 1569-3961
Volume 23, Issue 1

# MCMC design-based non-parametric regression for rare event. Application to nested risk computations

Gersende Fort

Emmanuel Gobet (corresponding author)
Centre de Mathématiques Appliquées (CMAP), Ecole Polytechnique and CNRS, Université Paris-Saclay, Route de Saclay, 91128 Palaiseau Cedex, France

Eric Moulines
Centre de Mathématiques Appliquées (CMAP), Ecole Polytechnique and CNRS, Université Paris-Saclay, Route de Saclay, 91128 Palaiseau Cedex, France
Published Online: 2017-02-03 | DOI: https://doi.org/10.1515/mcma-2017-0101

## Abstract

We design and analyze an algorithm for estimating the mean of a function of a conditional expectation when the outer expectation is related to a rare event. The outer expectation is evaluated through the average along the path of an ergodic Markov chain generated by a Markov chain Monte Carlo sampler. The inner conditional expectation is computed as a non-parametric regression, using a least-squares method with a general function basis and a design given by the sampled Markov chain. We establish non-asymptotic bounds for the ${L}_{2}$-empirical risks associated with this least-squares regression; this generalizes the error bounds usually obtained in the case of i.i.d. observations. Global error bounds are also derived for the nested expectation problem. Numerical results in the context of financial risk computations illustrate the performance of the algorithms.

MSC 2010: 65C40; 62G08; 37M25

## Statement of the problem.

We consider the problem of estimating the mean of a function of a conditional expectation in a rare-event regime, using Monte Carlo simulations. More precisely, the quantity of interest is

$\mathcal{ℐ}:=𝔼\left[f\left(Y,𝔼\left[R\mid Y\right]\right)\mid Y\in \mathcal{𝒜}\right],$(1.1)

where R and Y are vector-valued random variables, and $\mathcal{𝒜}$ is a so-called rare subset, i.e. $ℙ\left(Y\in \mathcal{𝒜}\right)$ is small. This is a problem of nested Monte Carlo computations with a special emphasis on the distribution tails. In the evaluation of (1.1), which is equivalent to

$𝔼\left[f\left(X,𝔼\left[R\mid X\right]\right)\right],$

where the distribution of X is the conditional distribution of Y given $\left\{Y\in \mathcal{𝒜}\right\}$, there are two intertwined issues, which we now explain to emphasize our contributions.

The outer Monte Carlo stage samples distributions restricted to $\left\{Y\in \mathcal{𝒜}\right\}$. A naive acceptance-rejection scheme on Y fails to be efficient because most of the simulations of Y are wasted. Therefore, specific rare-event techniques have to be used. Importance sampling is one such method (see e.g. [23, 3]); it can be efficient in small dimensions (10 to 100) but fails to cope with larger dimensions. In addition, this approach relies heavily on particular types of models for Y and on suitable information about the problem at hand.

Another option consists in using Markov chain Monte Carlo (MCMC) methods. Such methods amount to constructing a Markov chain ${\left({X}^{\left(m\right)}\right)}_{m\ge 0}$ whose unique stationary distribution π equals the conditional distribution of Y given the event $\left\{Y\in \mathcal{𝒜}\right\}$. In that case, for π-almost every initial condition ${X}_{0}=x$, the Birkhoff ergodic theorem shows that

$\underset{M\to +\mathrm{\infty }}{lim}\frac{1}{M}\sum _{m=1}^{M}\phi \left({X}^{\left(m\right)}\right)=𝔼\left[\phi \left(Y\right)\mid Y\in \mathcal{𝒜}\right]\mathit{ }\text{a.s.}$

for any (say) bounded function φ. This approach has been developed, analyzed and tested in [11] in quite general and complex situations, demonstrating its efficiency over alternative methods. Therefore, a natural idea for the estimation of (1.1) would be the computation of

$\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},𝔼\left[R\mid {X}^{\left(m\right)}\right]\right),$

emphasizing the need for approximating the quantity $𝔼\left[R\mid {X}^{\left(m\right)}\right]$.

The inner Monte Carlo stage is used to approximate these conditional expectations at each ${X}^{\left(m\right)}$ previously sampled. A first idea is to replace $𝔼\left[R\mid {X}^{\left(m\right)}\right]$ by a crude Monte Carlo average computed with N draws:

$𝔼\left[R\mid {X}^{\left(m\right)}\right]\approx \frac{1}{N}\sum _{k=1}^{N}{R}^{\left(m,k\right)}.$(1.2)

This approach is referred to as the nested simulation method in [4] (with the difference that their ${X}^{\left(m\right)}$ are i.i.d. and not given by a Markov chain). The algorithm based on (1.2) is briefly presented and studied in Appendix A. Taking a large N reduces the variance of this approximation (and thus ensures convergence, as proved in Theorem 5) but yields a prohibitive computational cost. Furthermore, this naive idea does not take into account cross-information between the different approximations at the points $\left\{{X}^{\left(m\right)}:m=1,\mathrm{\dots },M\right\}$. Instead, we follow a non-parametric regression approach for the approximation of the function ${\varphi }_{\star }$ satisfying ${\varphi }_{\star }\left(X\right)=𝔼\left[R\mid X\right]$ (almost surely): given L basis functions ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$, we regress $\left\{{R}^{\left(m\right)}:m=1,\mathrm{\dots },M\right\}$ against the variables $\left\{{\varphi }_{1}\left({X}^{\left(m\right)}\right),\mathrm{\dots },{\varphi }_{L}\left({X}^{\left(m\right)}\right):m=1,\mathrm{\dots },M\right\}$, where ${R}^{\left(m\right)}$ is sampled from the conditional distribution of R given $\left\{X={X}^{\left(m\right)}\right\}$. Note that this inner Monte Carlo stage requires only a single draw ${R}^{\left(m\right)}$ for each sample ${X}^{\left(m\right)}$ of the outer stage. Our discussion in Section 2.4 shows that the regression Monte Carlo method for the inner stage outperforms the crude Monte Carlo method as soon as the regression function is well approximated by the basis functions (which is especially true when ${\varphi }_{\star }$ is smooth, with a degree of smoothness qualitatively higher than the dimension d; see details in Section 2.4).
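The regression-based inner stage just described can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: for simplicity the design is an i.i.d. Gaussian sample (a special case of the ergodic setting), and the toy model $R={X}_{1}^{2}+\text{noise}$, the quadratic basis and the choice $f\left(y,r\right)=r$ are our own placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def nested_estimate(X, R, basis, f):
    """Regression-based estimator of E[f(X, E[R|X])].

    X     : (M, d) design (here i.i.d.; in the paper, an MCMC path)
    R     : (M,)   one draw R^(m) ~ law(R | X = X^(m)) per design point
    basis : callable mapping (M, d) -> (M, L) matrix of basis functions
    f     : callable f(x, r), applied to the fitted conditional expectations
    """
    A = basis(X)                                  # design matrix as in (2.4)
    # Minimal-norm least-squares solution; lstsq handles rank-deficient A.
    alpha, *_ = np.linalg.lstsq(A, R, rcond=None)
    phi_hat = A @ alpha                           # fitted E[R | X^(m)]
    return np.mean(f(X, phi_hat))

# Toy check: R = X1^2 + noise, so phi_star(x) = x1^2 and, with f(y, r) = r,
# the target is E[X1^2] = 1 for a standard Gaussian design.
M = 20000
X = rng.normal(size=(M, 2))
R = X[:, 0] ** 2 + 0.1 * rng.normal(size=M)
quad_basis = lambda X: np.column_stack([np.ones(len(X)), X, X ** 2])
I_hat = nested_estimate(X, R, quad_basis, lambda x, r: r)
print(I_hat)   # close to 1
```

Note that each design point consumes a single draw of R, in contrast with the N inner draws per point of the crude scheme (1.2).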

The major difference with the standard setting for non-parametric regression [14] comes from the design $\left\{{X}^{\left(m\right)}:m=1,\mathrm{\dots },M\right\}$, which is not an i.i.d. sample: the independence fails because $\left\{{X}^{\left(m\right)}:m=1,\mathrm{\dots },M\right\}$ is a Markov chain path, which is ergodic but in general not stationary.

A precise description of the algorithm is given in Section 2, with a discussion on implementation issues. We also provide some error estimates, in terms of the size M of the sample, and of the function space used for approximating the inner conditional expectation. Proofs are postponed to Section 4. Section 3 gathers some numerical experiments, in the field of financial and actuarial risks. We conclude in Section 5. Appendix A presents the analysis of a Monte Carlo scheme for computing (1.1), by using an MCMC scheme for the outer stage and a crude Monte Carlo scheme for the inner stage.

## Applications.

Numerical evaluation of nested conditional expectations arises in several fields. It appears naturally in solving dynamic programming equations for stochastic control and optimal stopping problems, see [24, 18, 8, 16, 2]; however, coupling these problems with rare events is usually not required by the problem at hand.

In financial and actuarial management [19], nested conditional expectations occur frequently, with the additional requirement of estimating them in the tails (as in (1.1)). A major application is the risk management of portfolios of derivative options [13]: in (1.1), R stands for the aggregated cashflows of the derivatives at time ${T}^{\prime }$, and Y for the underlying asset or financial variables at time $T<{T}^{\prime }$. Then $𝔼\left[R\mid Y\right]$ represents the portfolio value at T given a scenario Y, and the aim is to compute the extreme exposure (Value at Risk, Conditional VaR) of the portfolio. These computations are an essential concern for the Solvency Capital Requirement in insurance [6].

## Literature background and our contributions.

In view of the aforementioned applications, it is natural that most of the background results relate to risk management in finance and insurance. As alternatives to crude nested Monte Carlo methods (i.e. with an inner and an outer stage, both consisting of Monte Carlo sample averages), several works have sought to speed up the algorithms, notably by using spatial approximations of the inner conditional expectation: we refer to [15] for kernel estimators, to [17] for kriging techniques, and to [4] for least-squares regression methods. However, these works do not account for the outer conditioning on $Y\in \mathcal{𝒜}$, i.e. the learning design is sampled from the distribution of Y and not from the conditional distribution of Y given $\left\{Y\in \mathcal{𝒜}\right\}$. While this distribution distortion is presumably inessential for the computation of (1.1) when $\mathcal{𝒜}$ is not rare, it certainly becomes a major flaw when $ℙ\left(Y\in \mathcal{𝒜}\right)\ll 1$, because the estimator of $𝔼\left[R\mid Y\right]$ is then built from largely irrelevant data. We mention that the weighted regression method of [4] better accounts for extreme values of Y in the resolution of the least-squares regression; still, the design remains sampled from the distribution of Y instead of the conditional distribution of Y given $\left\{Y\in \mathcal{𝒜}\right\}$, and therefore most of the samples are wasted.

In this work, we use least-squares regression methods to compute the function ${\varphi }_{\star }$. Our results are derived under weaker conditions than usually assumed: contrary to [4], the basis functions ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$ are not necessarily orthonormalized and the design matrix is not necessarily invertible. Therefore we allow general basis functions and avoid conditions on the underlying distribution. Furthermore, we do not restrict our convergence analysis to $M\to \mathrm{\infty }$ (large sample) but also account for the approximation error (due to the function space). This allows a fine tuning of all parameters to achieve a given tolerance on the global error. Finally, in contrast to the usual literature on non-parametric regression [14, 8], the learning sample ${\left({X}^{\left(m\right)}\right)}_{1\le m\le M}$ is not an i.i.d. sample from the conditional distribution of Y given $\left\{Y\in \mathcal{𝒜}\right\}$: the error analysis is significantly modified. Among the most relevant references for non-i.i.d. learning samples, we refer to [1, 21, 5]. Namely, in [1], ${\left({X}^{\left(m\right)}\right)}_{1\le m\le M}$ is autoregressive or β-mixing: in contrast to our setting, they assume that the learning sample $\left({X}^{\left(1\right)},\mathrm{\dots },{X}^{\left(M\right)}\right)$ is stationary and that the noise sequence (i.e. ${X}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)$, $m\ge 1$) is essentially i.i.d. (and independent of the learning sample). In [21], the authors relax the condition on the noise but impose R to be bounded; the learning sample is still assumed to be stationary and β-mixing. In [5], the authors study kernel estimators of ${\varphi }_{\star }$ (instead of least squares as we do), under the assumption that the noise is a martingale with uniform exponential moments (we only impose finite variance).

## 2 Algorithm and convergence results

Let $\left(X,R\right)$ be a ${ℝ}^{d}×ℝ$-random vector; the distribution of X is the conditional distribution of Y given $\left\{Y\in \mathcal{𝒜}\right\}$, with density μ with respect to a positive σ-finite measure λ on ${ℝ}^{d}$. For any Borel set A, we denote

$𝖰\left(x,A\right):=𝔼\left[{𝟏}_{A}\left(R\right)\mid X=x\right];$

$𝖰$ is a Markov kernel; it is the conditional distribution of R given X. Let ${\varphi }_{\star }$ be the function from ${ℝ}^{d}$ to $ℝ$ defined by

${\varphi }_{\star }\left(x\right):={\int }_{ℝ}r𝖰\left(x,\mathrm{d}r\right).$(2.1)

It satisfies, $\mu \mathrm{d}\lambda$-almost surely,

${\varphi }_{\star }\left(X\right)=𝔼\left[R\mid X\right]$

when $X\sim \mu \mathrm{d}\lambda$.

For the regression step, choose L measurable functions ${\varphi }_{\mathrm{\ell }}:{ℝ}^{d}\to ℝ$, $\mathrm{\ell }\in \left\{1,\mathrm{\dots },L\right\}$, such that

$\int {\varphi }_{\mathrm{\ell }}^{2}\left(x\right)\mu \left(x\right)d\lambda \left(x\right)<\mathrm{\infty }.$

Denote by $\mathcal{ℱ}$ the vector space spanned by the functions ${\varphi }_{\mathrm{\ell }}$, $\mathrm{\ell }\in \left\{1,\mathrm{\dots },L\right\}$, and by $\underset{¯}{\mathbit{\varphi }}$ the function from ${ℝ}^{d}$ to ${ℝ}^{L}$ collecting the basis functions ${\varphi }_{\mathrm{\ell }}$:

$\underset{¯}{\mathbit{\varphi }}\left(x\right):=\left[\begin{array}{c}\hfill {\varphi }_{1}\left(x\right)\hfill \\ \hfill \mathrm{⋮}\hfill \\ \hfill {\varphi }_{L}\left(x\right)\hfill \end{array}\right].$

By convention, vectors are column vectors. For a matrix A, ${A}^{\prime }$ denotes its transpose. We denote by $〈\cdot ;\cdot 〉$ the scalar product in ${ℝ}^{p}$, and we will use $|\cdot |$ to denote both the Euclidean norm in ${ℝ}^{p}$ and the absolute value. The identity matrix of size N is denoted by ${I}_{N}$.

We adopt the short notation ${X}^{\left(1:M\right)}$ for the sequence $\left({X}^{\left(1\right)},\mathrm{\dots },{X}^{\left(M\right)}\right)$.

## 2.1 Algorithm

In Algorithm 1, we provide a description of a Monte Carlo approximation of the unknown quantity (1.1). Note that as a byproduct, this algorithm also provides an approximation ${\stackrel{^}{\varphi }}_{M}$ of the function ${\varphi }_{\star }$ given by (2.1).

Let $𝖯$ be a Markov transition kernel on $\mathcal{𝒜}$ with unique invariant distribution $\mu \mathrm{d}\lambda$.

#### (Full algorithm with M data, $M\mathrm{\ge }L$.)

The optimization problem in Line 7 of Algorithm 1 is equivalent to finding a vector $\alpha \in {ℝ}^{L}$ solving

${𝐀}^{\prime }𝐀\alpha ={𝐀}^{\prime }\underset{¯}{𝐑},$(2.3)

where

$\underset{¯}{𝐑}:=\left[\begin{array}{c}\hfill {R}^{\left(1\right)}\hfill \\ \hfill \mathrm{⋮}\hfill \\ \hfill {R}^{\left(M\right)}\hfill \end{array}\right],𝐀:=\left[\begin{array}{ccc}\hfill {\varphi }_{1}\left({X}^{\left(1\right)}\right)\hfill & \hfill \mathrm{\cdots }\hfill & \hfill {\varphi }_{L}\left({X}^{\left(1\right)}\right)\hfill \\ \hfill \mathrm{⋮}\hfill & \hfill \mathrm{\ddots }\hfill & \hfill \mathrm{⋮}\hfill \\ \hfill {\varphi }_{1}\left({X}^{\left(M\right)}\right)\hfill & \hfill \mathrm{\cdots }\hfill & \hfill {\varphi }_{L}\left({X}^{\left(M\right)}\right)\hfill \end{array}\right].$(2.4)

There exists at least one solution, and the solution with minimal (Euclidean) norm is given by

${\stackrel{^}{\alpha }}_{M}:={\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}{𝐀}^{\prime }\underset{¯}{𝐑},$(2.5)

where ${\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}$ denotes the Moore–Penrose pseudo-inverse matrix; ${\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}={\left({𝐀}^{\prime }𝐀\right)}^{-1}$ when the rank of $𝐀$ is L, and in that case equation (2.3) possesses a unique solution.
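As a quick numerical illustration of (2.5) (our own sketch, with randomly generated data): even when $𝐀$ is rank-deficient, the pseudo-inverse of the normal equations returns the minimal-norm solution, which coincides with the SVD-based least-squares solution computed by `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
M, L = 50, 4
A = rng.normal(size=(M, L - 1))
A = np.column_stack([A, A[:, 0]])    # duplicate a column: rank(A) = 3 < L = 4
R = rng.normal(size=M)

# Minimal-norm solution of A'A alpha = A'R via the Moore-Penrose
# pseudo-inverse, as in (2.5).
alpha_pinv = np.linalg.pinv(A.T @ A) @ (A.T @ R)
# SVD-based least squares also returns the minimal-norm solution.
alpha_lstsq, *_ = np.linalg.lstsq(A, R, rcond=None)

print(np.allclose(alpha_pinv, alpha_lstsq))          # same coefficients
print(np.allclose(A @ alpha_pinv, A @ alpha_lstsq))  # same fitted values
```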

An example of efficient transition kernel $𝖯$ is proposed in [11]: this kernel, hereafter denoted by ${𝖯}_{\mathrm{GL}}$, can be read as a Hastings–Metropolis transition kernel targeting $\mu \mathrm{d}\lambda$ and with a proposal kernel with transition density q which is reversible with respect to μ, i.e. for all $x,z\in \mathcal{𝒜}$,

$\mu \left(x\right)q\left(x,z\right)=q\left(z,x\right)\mu \left(z\right).$(2.6)

An algorithmic description for sampling a path of length M of a Markov chain with transition kernel ${𝖯}_{\mathrm{GL}}$ and with initial distribution ξ is given in Algorithm 2.

#### (MCMC for rare event: A Markov chain with kernel ${\mathrm{P}}_{\mathrm{GL}}$.)

When $\mu \mathrm{d}\lambda$ is a Gaussian distribution ${\mathcal{𝒩}}_{d}\left(0,\mathrm{\Sigma }\right)$ on ${ℝ}^{d}$ restricted to $\mathcal{𝒜}$, $\stackrel{~}{X}\sim {\mathcal{𝒩}}_{d}\left(\rho x,\left(1-{\rho }^{2}\right)\mathrm{\Sigma }\right)$ is a candidate with distribution $z↦q\left(x,z\right)$ satisfying (2.6); here, $\rho \in \left[0,1\right)$ is a design parameter chosen by the user (see [11, Section 4] for a discussion on the choice of ρ). Other proposal kernels q satisfying (2.6) are given in [11, Section 3] in the non-Gaussian case.

More generally, building a transition kernel $𝖯$ with invariant distribution $\mu \mathrm{d}\lambda$ via Hastings–Metropolis schemes is standard. Actually, there is no need to impose the reversibility condition (2.6) of q with respect to μ. Indeed, given an arbitrary transition density $q\left(\cdot ,\cdot \right)$, it is sufficient to replace Lines 5–6 of Algorithm 2 by the following acceptance rule: if ${\stackrel{~}{X}}^{\left(m\right)}\in \mathcal{𝒜}$, accept ${\stackrel{~}{X}}^{\left(m\right)}$ with probability

${\alpha }_{\mathrm{accept}}\left({X}^{\left(m-1\right)},{\stackrel{~}{X}}^{\left(m\right)}\right):=1\wedge \left[\frac{\mu \left({\stackrel{~}{X}}^{\left(m\right)}\right)q\left({\stackrel{~}{X}}^{\left(m\right)},{X}^{\left(m-1\right)}\right)}{\mu \left({X}^{\left(m-1\right)}\right)q\left({X}^{\left(m-1\right)},{\stackrel{~}{X}}^{\left(m\right)}\right)}\right].$

In the subsequent numerical tests with Gaussian distribution restricted to $\mathcal{𝒜}$, $\mu \mathrm{d}\lambda \propto {\mathcal{𝒩}}_{d}\left(0,\mathrm{\Sigma }\right){𝟏}_{\mathcal{𝒜}}$, we will make use of $\stackrel{~}{X}\sim {\mathcal{𝒩}}_{d}\left(\rho x+\left(1-\rho \right){x}_{\mathcal{𝒜}},\left(1-{\rho }^{2}\right)\mathrm{\Sigma }\right)$ as a candidate for the transition density $z↦q\left(x,z\right)$, where ${x}_{\mathcal{𝒜}}$ is a well-chosen point in $\mathcal{𝒜}$. In that case, we easily check that the acceptance probability is given by

${\alpha }_{\mathrm{accept}}\left(x,z\right)=1\wedge \mathrm{exp}\left({x}_{\mathcal{𝒜}}^{\prime }{\mathrm{\Sigma }}^{-1}\left(x-z\right)\right).$(2.8)
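A minimal sketch of this shifted-proposal sampler follows; it is illustrative code, not the paper's. We take $\mathrm{\Sigma }={I}_{d}$, and the rare set $\mathcal{𝒜}=\left\{y:{y}_{1}\ge a\right\}$ and the anchor point ${x}_{\mathcal{𝒜}}=\left(a,0\right)$ are our own placeholder choices; the acceptance probability is (2.8) with ${\mathrm{\Sigma }}^{-1}={I}_{d}$.

```python
import numpy as np

def mcmc_rare_event(x0, in_A, x_A, rho=0.9, M=10000, seed=0):
    """Metropolis-Hastings chain targeting N(0, I_d) restricted to A, with
    proposal N(rho*x + (1-rho)*x_A, (1-rho^2) I_d) and acceptance probability
    1 ^ exp(x_A'(x - z)), i.e. (2.8) with Sigma = I_d."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    x = np.array(x0, dtype=float)
    chain = np.empty((M, d))
    for m in range(M):
        z = rho * x + (1 - rho) * x_A + np.sqrt(1 - rho ** 2) * rng.normal(size=d)
        # Reject immediately if the candidate leaves A; otherwise apply (2.8).
        if in_A(z) and rng.uniform() < min(1.0, np.exp(x_A @ (x - z))):
            x = z                      # accept; else keep X^(m) = X^(m-1)
        chain[m] = x
    return chain

# Rare half-space A = {y : y_1 >= a}, a = 3 (P(Y in A) ~ 1.3e-3).
a = 3.0
in_A = lambda y: y[0] >= a
x_A = np.array([a, 0.0])
chain = mcmc_rare_event(np.array([a + 0.5, 0.0]), in_A, x_A)
print(chain.shape)
```

By construction every state of the chain lies in $\mathcal{𝒜}$, so no draw is wasted, in contrast with naive acceptance-rejection on Y.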

## 2.2 Convergence results for the estimation of ${\varphi }_{\star }$

Let ${L}_{2}\left(\mu \right)$ be the set of measurable functions $\phi :{ℝ}^{d}\to ℝ$ such that $\int {\phi }^{2}\mu d\lambda <\mathrm{\infty }$; and define the norm

${|\phi |}_{{L}_{2}\left(\mu \right)}:={\left(\int {\phi }^{2}\mu d\lambda \right)}^{1/2}.$(2.9)

Let ${\psi }_{\star }$ be the projection of ${\varphi }_{\star }$ on the linear span of the functions ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$, with respect to the norm given by (2.9): ${\psi }_{\star }=〈{\alpha }_{\star };\underset{¯}{\mathbit{\varphi }}〉$, where ${\alpha }_{\star }\in {ℝ}^{L}$ solves

$\left(\int \underset{¯}{\mathbit{\varphi }}{\underset{¯}{\mathbit{\varphi }}}^{\prime }\mu d\lambda \right){\alpha }_{\star }=\int {\varphi }_{\star }\underset{¯}{\mathbit{\varphi }}\mu d\lambda .$

Assume that the following conditions hold:

• (i)

the transition kernel $𝖯$ and the initial distribution ξ satisfy: there exists a constant ${C}_{𝖯}$ and a rate sequence $\left\{\rho \left(m\right):m\ge 1\right\}$ such that for any $m\ge 1$,

$|\xi {𝖯}^{m}\left[{\left({\psi }_{\star }-{\varphi }_{\star }\right)}^{2}\right]-\int {\left({\psi }_{\star }-{\varphi }_{\star }\right)}^{2}\mu d\lambda |\le {C}_{𝖯}\rho \left(m\right).$(2.10)

• (ii)

the conditional distribution $𝖰$ satisfies

${\sigma }^{2}:=\underset{x\in \mathcal{𝒜}}{sup}\left\{\int {r}^{2}𝖰\left(x,\mathrm{d}r\right)-{\left(\int r𝖰\left(x,\mathrm{d}r\right)\right)}^{2}\right\}<\mathrm{\infty }.$(2.11)

Let ${X}^{\mathrm{\left(}\mathrm{1}\mathrm{:}M\mathrm{\right)}}$ and ${\stackrel{\mathrm{^}}{\varphi }}_{M}$ be given by Algorithm 1. Then

${\mathrm{\Delta }}_{M}:=𝔼\left[\frac{1}{M}\sum _{m=1}^{M}{\left({\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}\right]\le \frac{{\sigma }^{2}L}{M}+{|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}^{2}+\frac{{C}_{𝖯}}{M}\sum _{m=1}^{M}\rho \left(m\right).$(2.12)

#### Proof.

See Section 4.1. ∎

Note that ${\mathrm{\Delta }}_{M}$ measures the mean squared error of ${\stackrel{^}{\varphi }}_{M}-{\varphi }_{\star }$ along the design sequence ${X}^{\left(1:M\right)}$. The proof consists in decomposing this error into a variance term and a squared bias term:

• (a)

${\sigma }^{2}L/M$ on the right-hand side is the statistical error, decreasing as the size of the design M gets larger and increasing as the size of the approximation space L gets larger.

• (b)

The quantity ${|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}^{2}$ is the residual error under the best approximation of ${\varphi }_{\star }$ by the basis functions ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$ with respect to the ${L}_{2}\left(\mu \right)$-norm: it is naturally expected as the limit of ${\mathrm{\Delta }}_{M}$ when $M\to \mathrm{\infty }$.

• (c)

The term with $\left\{\rho \left(m\right):m\ge 1\right\}$ describes how rapidly the Markov chain $\left\{{X}^{\left(m\right)}:m\ge 1\right\}$ converges to its stationary distribution $\mu \mathrm{d}\lambda$.

This theorem extends known results in the case of an i.i.d. design ${X}^{\left(1:M\right)}$, which is the major novelty of our contribution. The i.i.d. case is a special case of this general setting: it is retrieved by setting $𝖯\left(x,\mathrm{d}z\right)=\mu \left(z\right)\mathrm{d}\lambda \left(z\right)$; then assumption (i) is satisfied with ${C}_{𝖯}=0$, and the upper bound in (2.12) coincides with classical results (see e.g. [14, Theorem 11.1]). The theorem covers the situation where the outer Monte Carlo stage relies on a Markov chain Monte Carlo sampler; we discuss below how to check assumption (i) in practice.

The assumptions on the basis functions ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$ are weaker than what is usually assumed in the literature on nested simulation. Namely, as a difference with [4, Assumption A2] in the i.i.d. case, Theorem 1 holds even when the functions ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$ are not orthonormal in ${L}_{2}\left(\mu \right)$, and it holds without assuming that almost-surely, the rank of the matrix $𝐀$ is L.

Assumption (ii) says that the conditional variance of R given X is uniformly bounded. This condition could be weakened and replaced by an ergodic condition on the Markov kernel $𝖯$ implying that

${\stackrel{~}{\sigma }}_{L}^{2}:=\underset{M\ge L}{sup}𝔼\left[|𝐀{\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}{𝐀}^{\prime }\left(\underset{¯}{𝐑}-𝔼\left[\underset{¯}{𝐑}\mid {X}^{\left(1:M\right)}\right]\right){|}^{2}\right]<\mathrm{\infty };$

$𝐀$ and $\underset{¯}{𝐑}$ are given by (2.4) and depend on ${X}^{\left(1:M\right)}$. In that case, the upper bound (2.12) holds with ${\sigma }^{2}L$ replaced by ${\stackrel{~}{\sigma }}_{L}^{2}$ (see inequality (4.2) in the proof of Theorem 1).

We conclude this subsection by conditions on $𝖯$ and $\mathcal{𝒜}$ implying the ergodicity assumption (2.10) with a geometric rate sequence

$\rho \left(m\right)={\kappa }^{m}$

for some $\kappa \in \left(0,1\right)$. Sufficient conditions for sub-geometric rate sequences can be found, e.g., in [10, 7].

#### ([20, Theorem 15.0.1], [9, Proposition 2])

Assume that $\mathrm{P}$ is phi-irreducible and there exists a measurable function $V\mathrm{:}\mathcal{A}\mathrm{\to }\mathrm{\left[}\mathrm{1}\mathrm{,}\mathrm{+}\mathrm{\infty }\mathrm{\right)}$ such that

• (i)

there exist $\delta \in \left(0,1\right)$ and $b<\mathrm{\infty }$ such that for any $x\in \mathcal{𝒜}$, $𝖯V\left(x\right)\le \delta V\left(x\right)+b$,

• (ii)

there exists ${\upsilon }_{\star }\in \left(b/\left(1-\delta \right),+\mathrm{\infty }\right)$ such that the level set ${\mathcal{𝒞}}_{\star }:=\left\{V\le {\upsilon }_{\star }\right\}$ is 1-small: there exist $ϵ>0$ and a probability distribution ν on $\mathcal{𝒜}$ (with $\nu \left({\mathcal{𝒞}}_{\star }\right)=1$) such that for any $x\in {\mathcal{𝒞}}_{\star }$, $𝖯\left(x,\mathrm{d}z\right)\ge ϵ\nu \left(\mathrm{d}z\right)$.

Then there exist $\kappa \mathrm{\in }\mathrm{\left(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{\right)}$ and a finite constant ${C}_{\mathrm{1}}$ such that for any measurable function $g\mathrm{:}\mathcal{A}\mathrm{\to }\mathrm{R}$, any $m\mathrm{\ge }\mathrm{1}$ and any $x\mathrm{\in }\mathcal{A}$,

$|{𝖯}^{m}g\left(x\right)-\int g\mu d\lambda |\le {C}_{1}\left(\underset{\mathcal{𝒜}}{sup}\frac{|g|}{V}\right){\kappa }^{m}V\left(x\right).$

In addition, there exists a finite constant ${C}_{\mathrm{2}}$ such that for any measurable function $g\mathrm{:}\mathcal{A}\mathrm{\to }\mathrm{R}$ and any $M\mathrm{\ge }\mathrm{1}$,

$𝔼\left[{|\sum _{m=1}^{M}\left\{g\left({X}^{\left(m\right)}\right)-\int g\mu d\lambda \right\}|}^{2}\right]\le {C}_{2}{\left(\underset{\mathcal{𝒜}}{sup}\frac{|g|}{\sqrt{V}}\right)}^{2}𝔼\left[V\left({X}^{\left(0\right)}\right)\right]M.$

An explicit expression of the constant ${C}_{2}$ is given in [9, Proposition 2]. When $𝖯={𝖯}_{\mathrm{GL}}$ as described in Algorithm 2, we have the following corollary.

Assume the following conditions:

• (i)

For all $x\in \mathcal{𝒜}$, $\mu \left(z\right)>0$ implies that $q\left(x,z\right)>0$.

• (ii)

There exists ${\delta }_{1}\in \left(0,1\right)$ such that ${sup}_{x\in \mathcal{𝒜}}{\int }_{{\mathcal{𝒜}}^{c}}q\left(x,z\right)d\lambda \left(z\right)\le {\delta }_{1}.$

• (iii)

There exist ${\delta }_{2}\in \left({\delta }_{1},1\right)$ , a measurable function $V:\mathcal{𝒜}\to \left[1,+\mathrm{\infty }\right)$ and a set $\mathcal{ℬ}\subset \mathcal{𝒜}$ such that

$b:=\underset{x\in \mathcal{ℬ}}{sup}{\int }_{\mathcal{𝒜}}V\left(z\right)q\left(x,z\right)d\lambda \left(z\right)<\mathrm{\infty },\underset{x\in {\mathcal{ℬ}}^{c}}{sup}{V}^{-1}\left(x\right){\int }_{\mathcal{𝒜}}V\left(z\right)q\left(x,z\right)d\lambda \left(z\right)\le {\delta }_{2}-{\delta }_{1}.$

• (iv)

For some ${\upsilon }_{\star }>b/\left(1-{\delta }_{2}\right)$, the level set ${\mathcal{𝒞}}_{\star }:=\left\{V\le {\upsilon }_{\star }\right\}$ is such that

$\underset{\left(x,z\right)\in {\mathcal{𝒞}}_{\star }^{2}}{inf}\left(\frac{q\left(x,z\right){𝟏}_{\mu \left(z\right)\ne 0}}{\mu \left(z\right)}\right)>0,{\int }_{{\mathcal{𝒞}}_{\star }}\mu d\lambda >0.$

Then the assumptions of Proposition 2 are satisfied for the kernel $\mathrm{P}\mathrm{=}{\mathrm{P}}_{\mathrm{GL}}$.

#### Proof.

See Section 4.2. ∎

When μ is a Gaussian density ${\mathcal{𝒩}}_{d}\left(0,{I}_{d}\right)$ on ${ℝ}^{d}$ restricted to $\mathcal{𝒜}$ and the proposal $q\left(x,\cdot \right)$ is the density of a Gaussian random variable with mean $\rho x$ and covariance $\left(1-{\rho }^{2}\right){I}_{d}$ (with $\rho \in \left(0,1\right)$), it is easily seen that conditions (i), (iii) and (iv) of Corollary 3 are satisfied (choose, e.g., $V\left(x\right)=\mathrm{exp}\left(s|x|\right)$ with $s>0$). Condition (ii) is problem specific since it depends on the geometry of $\mathcal{𝒜}$.

## 2.3 Convergence results for the estimation of $\mathcal{ℐ}$

When problem (1.1) is of the form $𝔼\left[f\left(Y,𝔼\left[R\mid Y\right]\right)\mid Y\in \mathcal{𝒜}\right]$ for a globally Lipschitz function f (in the second variable), we have the following control on the Monte Carlo error ${\stackrel{^}{\mathcal{ℐ}}}_{M}-\mathcal{ℐ}$ from Algorithm 1.

Assume that the following conditions hold:

• (i)

$f:{ℝ}^{d}×ℝ\to ℝ$ is globally Lipschitz in the second variable: there exists a finite constant ${C}_{f}$ such that for any $\left({r}_{1},{r}_{2},y\right)\in ℝ×ℝ×{ℝ}^{d}$,

$|f\left(y,{r}_{1}\right)-f\left(y,{r}_{2}\right)|\le {C}_{f}|{r}_{1}-{r}_{2}|.$

• (ii)

There exists a finite constant C such that for any M

$𝔼\left[{\left({M}^{-1}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)-\int f\left(x,{\varphi }_{\star }\left(x\right)\right)\mu \left(x\right)d\lambda \left(x\right)\right)}^{2}\right]\le \frac{C}{M}.$

Then

${\left(𝔼\left[{|{\stackrel{^}{\mathcal{ℐ}}}_{M}-\mathcal{ℐ}|}^{2}\right]\right)}^{1/2}\le {C}_{f}\sqrt{{\mathrm{\Delta }}_{M}}+\sqrt{\frac{C}{M}},$

where $\mathcal{I}$, ${\stackrel{\mathrm{^}}{\mathcal{I}}}_{M}$ and ${\mathrm{\Delta }}_{M}$ are respectively given by (1.1), Algorithm 1 and (2.12).

#### Proof.

See Section 4.3. ∎

Sufficient conditions for assumption (ii) to hold are given in Proposition 2 when $\left\{{X}^{\left(m\right)}:m\ge 1\right\}$ is a Markov chain. When the draws $\left\{{X}^{\left(m\right)}:m\ge 1\right\}$ are i.i.d. with distribution $\mu \mathrm{d}\lambda$, then condition (ii) is verified with $C=𝕍\mathrm{ar}\left(f\left(X,{\varphi }_{\star }\left(X\right)\right)\right)$ with $X\sim \mu \mathrm{d}\lambda$.

## 2.4 Asymptotic optimal tuning of parameters

In this subsection, we discuss how to tune the parameters of the algorithm (i.e. M, ${\varphi }_{1},\mathrm{\dots },{\varphi }_{L}$ and L), given a Markov kernel $𝖯$. To simplify the discussion, we assume from now on that

The constant ${C}_{𝖯}$ of (2.10) can be chosen independently of ${\psi }_{\star }$; furthermore the series ${\left(\rho \left(m\right)\right)}_{m\ge 1}$ defined in (2.10) is convergent.

The above condition on ρ is rather mild: see Proposition 2, where the convergence is geometric. Regarding the condition on ${C}_{𝖯}$, although not trivial, this assumption seems reasonable since ${\psi }_{\star }$ is the best approximation of ${\varphi }_{\star }$ on the function basis with respect to the target measure $\mu \mathrm{d}\lambda$: it means first that

${|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}\le {|{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)},$

and second ${\psi }_{\star }-{\varphi }_{\star }$ is expected to converge to 0 in ${L}_{2}\left(\mu \right)$ as the number L of basis functions increases. Besides, in the context of Proposition 2, the control of ${C}_{𝖯}$ would follow from the control of ${sup}_{\mathcal{𝒜}}\frac{|{\psi }_{\star }-{\varphi }_{\star }|}{V}$, which is a delicate task because of the lack of knowledge on ${\psi }_{\star }$.

A direct consequence of Hyp(2.10) is that the last term in (2.12) is such that

$\frac{{C}_{𝖯}}{M}\sum _{m=1}^{M}\rho \left(m\right)=O\left(\frac{1}{M}\right),$

uniformly in the function basis. In other words, the mean empirical squared error ${\mathrm{\Delta }}_{M}$ is bounded by

$\mathrm{Cst}×\left(\frac{L}{M}+{|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}^{2}\right),$

as in the case of i.i.d. design (see [14, Theorem 11.1]).

There are many choices of function basis [14], but given the lack of knowledge on the target measure, and with a view to discussing convergence rates, it is relevant to adopt local approximation techniques, such as piecewise polynomial partitioning estimates (i.e. local polynomials defined on a tensored grid); for a detailed presentation, see [12, Section 4.4]. Assume that the conditional expectation ${\varphi }_{\star }$ is smooth on $\mathcal{𝒜}$, namely ${\varphi }_{\star }$ is ${p}_{0}$ times continuously differentiable with bounded derivatives, and its ${p}_{0}$-th derivative is ${p}_{1}$-Hölder continuous. Set $p:={p}_{0}+{p}_{1}$. If $\mathcal{𝒜}$ is bounded, it is well known [14, Corollary 11.1 for $d=1$] that taking local polynomials of order ${p}_{0}$ on a tensored grid with edge length equal to $\mathrm{Cst}×{M}^{-\frac{1}{2p+d}}$ ensures that the statistical error $L/M$ and the approximation error ${|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}^{2}$ have the same magnitude, and we get

${\mathrm{\Delta }}_{M}=O\left({M}^{-\frac{2p}{2p+d}}\right).$(2.13)

If $\mathcal{𝒜}$ is unbounded, under the additional assumption that $\mu \mathrm{d}\lambda$ has exponentially decaying tails, it is enough to consider similar local polynomials on a tensored grid truncated at distance $\mathrm{Cst}×\mathrm{log}\left(M\right)$; this choice preserves the estimate (2.13) up to logarithmic factors [12, Section 4.4], which we omit for the sake of simplicity.
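
The partitioning estimate just described can be sketched in a few lines. The following is a minimal illustration (not the authors' implementation), in dimension $d=1$, with a synthetic smooth target and hypothetical parameter choices: it fits a polynomial of order ${p}_{0}$ separately on each cell of a uniform grid whose edge length scales as ${M}^{-\frac{1}{2p+d}}$.

```python
import numpy as np

def partition_regression(x, y, p0, n_cells, lo, hi):
    """Piecewise-polynomial least squares: fit a polynomial of order p0
    independently on each cell of a uniform grid over [lo, hi]."""
    edges = np.linspace(lo, hi, n_cells + 1)
    coefs = []
    for i in range(n_cells):
        mask = (x >= edges[i]) & (x < edges[i + 1])
        if mask.sum() > p0:                 # enough design points to fit
            coefs.append(np.polyfit(x[mask], y[mask], p0))
        else:
            coefs.append(None)              # empty cell: predict 0
    def predict(xq):
        idx = np.clip(np.searchsorted(edges, xq, side="right") - 1, 0, n_cells - 1)
        out = np.zeros_like(xq, dtype=float)
        for i in range(n_cells):
            m = idx == i
            if m.any() and coefs[i] is not None:
                out[m] = np.polyval(coefs[i], xq[m])
        return out
    return predict

# toy check: smooth target observed with noise (all values hypothetical)
rng = np.random.default_rng(0)
M = 4000
x = rng.uniform(-1.0, 1.0, M)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(M)
p0, p, d = 1, 2.0, 1                            # local affine fits, p = p0 + p1
n_cells = int(2.0 * M ** (1.0 / (2 * p + d)))   # edge length ~ Cst * M^(-1/(2p+d))
phi_hat = partition_regression(x, y, p0, n_cells, -1.0, 1.0)
err = np.mean((phi_hat(x) - np.sin(3 * x)) ** 2)  # empirical squared error
```

The grid size is the only tuning parameter; taking it as above balances the statistical term $L/M$ against the squared approximation error, as in the rate (2.13).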

Regarding the complexity $\mathrm{𝖢𝗈𝗌𝗍}$ (computational cost), the simulation cost (for ${X}^{\left(1:M\right)},{R}^{\left(1:M\right)}$) is proportional to M; the computation of ${\stackrel{^}{\varphi }}_{M}$ requires $\mathrm{Cst}×M$ operations (taking advantage of the tensored grid), and so does the final evaluation of ${\stackrel{^}{\mathcal{ℐ}}}_{M}$. Thus we have $\mathrm{𝖢𝗈𝗌𝗍}\sim \mathrm{Cst}×M$, with another constant. Finally, in view of Theorem 4, we derive

${\mathrm{𝖤𝗋𝗋𝗈𝗋}}_{\text{Regression Alg. 1}}=O\left({\mathrm{𝖢𝗈𝗌𝗍}}^{-\frac{p}{2p+d}}\right).$

This is similar to the rate we would obtain in an i.i.d. setting. For very smooth ${\varphi }_{\star }$ ($p\to +\mathrm{\infty }$), we asymptotically recover the convergence order $\frac{1}{2}$. This global error may be compared to the situation where the inner conditional expectation is computed with a crude Monte Carlo method (using N samples of ${R}^{\left(m,k\right)}$ for each of the M samples ${X}^{\left(m\right)}$); this scheme is described and analyzed in Appendix A. Its computational cost is $\mathrm{Cst}×MN$ and its global error is $O\left(1/\sqrt{N}+1/\sqrt{M}\right)$ if f is Lipschitz (resp. $O\left(1/N+1/\sqrt{M}\right)$ if f is smoother); thus, taking $M=N$ (resp. $N=\sqrt{M}$), we have

${\mathrm{𝖤𝗋𝗋𝗈𝗋}}_{\text{Crude MC Alg. 3}}=\left\{\begin{array}{cc}O\left({\mathrm{𝖢𝗈𝗌𝗍}}^{-\frac{1}{4}}\right)\hfill & \text{if}f\text{Lipschitz},\hfill \\ O\left({\mathrm{𝖢𝗈𝗌𝗍}}^{-\frac{1}{3}}\right)\hfill & \text{if}f\text{smoother}.\hfill \end{array}$

In the standard case of Lipschitz f, the regression-based Algorithm 1 converges faster than Algorithm 3 under the condition $p\ge \frac{d}{2}$. In low dimension, this condition is easy to satisfy, but it becomes problematic as the dimension increases: this is the usual curse of dimensionality.
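
The $O\left(1/N\right)$ inner bias of the crude nested scheme can be seen on a toy problem. The sketch below is illustrative only (not Algorithm 3 itself): take $Y\sim \mathcal{𝒩}\left(0,1\right)$, $R=Y+Z$ with $Z\sim \mathcal{𝒩}\left(0,1\right)$, and the smooth function $f\left(r\right)={r}^{2}$, so that $𝔼\left[R\mid Y\right]=Y$ and the target value is $𝔼\left[{Y}^{2}\right]=1$; the inner average over N samples then carries a bias of exactly $1/N$.

```python
import random

def crude_nested(M, N, seed=0):
    """Crude nested Monte Carlo: for each of M outer draws of Y, average N
    inner draws of R = Y + Z, then apply f(r) = r^2 and average again.
    The estimator targets E[Y^2] = 1 but carries an O(1/N) inner bias."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        y = rng.gauss(0.0, 1.0)
        inner = sum(y + rng.gauss(0.0, 1.0) for _ in range(N)) / N
        total += inner ** 2
    return total / M

est = crude_nested(M=20_000, N=50)   # expectation of the estimator is 1 + 1/N
```

Balancing this $1/N$ bias against the $1/\sqrt{M}$ outer error is what leads to the choice $N=\sqrt{M}$ in the smooth case above.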

## 3 Application: Put options in a rare-event regime

The goal is to approximate the quantity

$\mathcal{ℐ}:=𝔼\left[{\left(𝔼\left[{\left(K-h\left({S}_{{T}^{\prime }}\right)\right)}_{+}\mid {S}_{T}\right]-{p}_{\star }\right)}_{+}\mid {S}_{T}\in \mathcal{𝒮}\right]$(3.1)

for various choices of h, where $\left\{{S}_{t}:t\ge 0\right\}$ is a d-dimensional geometric Brownian motion, $T<{T}^{\prime }$ and $\left\{{S}_{T}\in \mathcal{𝒮}\right\}$ is a rare event.

## 3.1 A toy example in dimension 1

We start with a toy example in dimension $d=1$, with $h\left(y\right)=y$ and $\mathcal{𝒮}=\left\{s\in {ℝ}_{+}:s\le {s}_{\star }\right\}$, so that

$\mathcal{ℐ}=𝔼\left[{\left(𝔼\left[{\left(K-{S}_{{T}^{\prime }}\right)}_{+}\mid {S}_{T}\right]-{p}_{\star }\right)}_{+}\mid {S}_{T}\le {s}_{\star }\right];$

${\left(K-{S}_{{T}^{\prime }}\right)}_{+}$ is the Put payoff written on one stock with price ${\left({S}_{t}\right)}_{t\ge 0}$, with strike K and maturity ${T}^{\prime }$: this is a standard financial product used by asset managers to insure their portfolio against a decrease of the stock price. We take the point of view of the seller of the contract, who is mostly concerned with large values of the Put price, i.e. he aims at valuing the excess of the Put price at time $T\in \left(0,{T}^{\prime }\right)$ beyond the threshold ${p}_{\star }>0$, for stock values ${S}_{T}$ smaller than ${s}_{\star }>0$. We assume that $\left\{{S}_{t}:t\ge 0\right\}$ evolves like a geometric Brownian motion, with volatility $\sigma >0$ and zero drift. For the sake of simplicity, we assume that the interest rate is 0; the extension to a non-zero interest rate is straightforward.

Upon noting that ${S}_{T}=\xi \left(Y\right)$ and ${S}_{{T}^{\prime }}=\xi \left(Y\right)\mathrm{exp}\left(-\frac{1}{2}{\sigma }^{2}\tau +\sigma \sqrt{\tau }Z\right)$, where $Y,Z$ are independent standard Gaussian variables and

$\xi \left(y\right):={S}_{0}\mathrm{exp}\left(-\frac{1}{2}{\sigma }^{2}T+\sigma \sqrt{T}y\right),\tau :={T}^{\prime }-T$

we have

$\mathcal{ℐ}=𝔼\left[{\left(𝔼\left[{\left(K-\xi \left(Y\right)\mathrm{exp}\left(-\frac{1}{2}{\sigma }^{2}\tau +\sigma \sqrt{\tau }Z\right)\right)}_{+}|Y\right]-{p}_{\star }\right)}_{+}|Y\le {y}_{\star }\right],$

where

${y}_{\star }:=\frac{1}{\sigma \sqrt{T}}\mathrm{ln}\left(\frac{{s}_{\star }}{{S}_{0}}\right)+\frac{1}{2}\sigma \sqrt{T}.$

Therefore, problem (3.1) is of the form (1.1) with

$R={\left(K-\xi \left(Y\right)\mathrm{exp}\left(-\frac{1}{2}{\sigma }^{2}\tau +\sigma \sqrt{\tau }Z\right)\right)}_{+},f\left(y,r\right)={\left(r-{p}_{\star }\right)}_{+},\mathcal{𝒜}=\left\{y\in ℝ:y\le {y}_{\star }\right\},$

and ${\left[Y,Z\right]}^{\prime }\sim {\mathcal{𝒩}}_{2}\left(0,{I}_{2}\right)$. In this example, $ℙ\left(Y\in \mathcal{𝒜}\right)$ and $𝔼\left[R\mid Y\right]$ are explicit. Indeed, we have $ℙ\left(Y\in \mathcal{𝒜}\right)=\mathrm{\Phi }\left({y}_{\star }\right)$, where Φ denotes the cumulative distribution function (cdf) of the standard Gaussian distribution. Furthermore, $𝔼\left[R\mid Y\right]={\mathrm{\Phi }}_{\star }\left(\xi \left(Y\right)\right)$, where

${\mathrm{\Phi }}_{\star }\left(s\right):=K\mathrm{\Phi }\left({d}_{+}\left(s\right)\right)-s\mathrm{\Phi }\left({d}_{-}\left(s\right)\right),\text{with}{d}_{±}\left(s\right):=\frac{1}{\sigma \sqrt{\tau }}\mathrm{ln}\left(\frac{K}{s}\right)±\frac{1}{2}\sigma \sqrt{\tau };$

note that ${\varphi }_{\star }={\mathrm{\Phi }}_{\star }\circ \xi$. The parameter values for the numerical tests are given in Table 1.
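
The closed form ${\mathrm{\Phi }}_{\star }$ above is the zero-rate Black–Scholes put price, and it can be cross-checked against the inner expectation by plain Monte Carlo. The sketch below uses hypothetical parameter values (Table 1 is not reproduced here) and only the standard library.

```python
import math, random

def norm_cdf(t):
    """Standard Gaussian cdf via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def put_price(s, K, sigma, tau):
    """Phi_star(s) = K Phi(d_plus(s)) - s Phi(d_minus(s)), with
    d_plusminus(s) = ln(K/s)/(sigma sqrt(tau)) +- sigma sqrt(tau)/2."""
    v = sigma * math.sqrt(tau)
    d_plus = math.log(K / s) / v + 0.5 * v
    return K * norm_cdf(d_plus) - s * norm_cdf(d_plus - v)

# Monte Carlo cross-check of E[(K - s exp(-sigma^2 tau/2 + sigma sqrt(tau) Z))_+]
rng = random.Random(0)
s, K, sigma, tau = 90.0, 100.0, 0.25, 0.5      # hypothetical parameter values
n = 200_000
mc = sum(
    max(K - s * math.exp(-0.5 * sigma**2 * tau
                         + sigma * math.sqrt(tau) * rng.gauss(0.0, 1.0)), 0.0)
    for _ in range(n)
) / n
exact = put_price(s, K, sigma, tau)
```

Since the drift is zero, Jensen's inequality gives $\mathrm{put\_price}\left(s\right)\ge {\left(K-s\right)}_{+}$, a quick sanity check on the implementation.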

Table 1

Parameter values for the one-dimensional example.

Figure 1

Normalized histograms of the M points from the Markov chains $\mathrm{GL}$ (top left), $\mathrm{NR}$ (top right) and from the i.i.d. sampler with rejection (bottom left). Bottom right: Restricted to $\left[-6,{y}_{\star }\right]$, the cdf of Y given $\left\{Y\in \mathcal{𝒜}\right\}$, two MCMC approximations (with ${𝖯}_{\mathrm{GL}}$ and ${𝖯}_{\mathrm{NR}}$) and an i.i.d. approximation.

We first illustrate the behavior of the kernel ${𝖯}_{\mathrm{GL}}$ described by Algorithm 2. Since Y is a standard Gaussian random variable, we design ${𝖯}_{\mathrm{GL}}$ as a Hastings–Metropolis sampler, with invariant distribution $\mu \mathrm{d}\lambda$ equal to a standard $\mathcal{𝒩}\left(0,1\right)$ restricted to $\mathcal{𝒜}$ and with proposal distribution $q\left(x,\cdot \right)\mathrm{d}\lambda \equiv \mathcal{𝒩}\left(\rho x,1-{\rho }^{2}\right)$. Observe that this proposal kernel is reversible with respect to μ, see (2.6). Note that condition (ii) in Corollary 3 becomes

$\underset{y\le {y}_{\star }}{sup}\mathrm{\Phi }\left(\frac{\rho y-{y}_{\star }}{\sqrt{1-{\rho }^{2}}}\right)<1,$

which holds true since $\rho >0$. In the following, the performance of the kernel ${𝖯}_{\mathrm{GL}}$ is compared to that of the kernel ${𝖯}_{\mathrm{NR}}$, defined as a Hastings–Metropolis kernel with proposal $q\left(x,\cdot \right)\mathrm{d}\lambda \equiv \mathcal{𝒩}\left(\left(1-\rho \right){y}_{\star }+\rho x,1-{\rho }^{2}\right)$ and with invariant distribution a standard Gaussian distribution restricted to $\mathcal{𝒜}$. In contrast to ${𝖯}_{\mathrm{GL}}$, this proposal transition density q is not reversible with respect to μ (whence the notation ${𝖯}_{\mathrm{NR}}$ for the kernel); therefore, the acceptance-rejection ratio of the new point z is given by (see equality (2.8))

$\left(1\wedge \mathrm{exp}\left({y}_{\star }\left(x-z\right)\right)\right){𝟏}_{z\le {y}_{\star }}.$

In Figure 1 (bottom right), the true cdf of Y given $\left\{Y\in \mathcal{𝒜}\right\}$ (a distribution supported on $\left(-\mathrm{\infty },{y}_{\star }\right]$) is displayed on $\left[-6,{y}_{\star }\right]$ together with three empirical cdfs $x↦{M}^{-1}{\sum }_{m=1}^{M}{𝟏}_{\left\{{X}^{\left(m\right)}\le x\right\}}$: the first one is computed from i.i.d. samples with distribution $\mathcal{𝒩}\left(0,1\right)$, and the second one (resp. the third one) is computed from a Markov chain path ${X}^{\left(1:M\right)}$ of length M with kernel ${𝖯}_{\mathrm{GL}}$ (resp. ${𝖯}_{\mathrm{NR}}$) started at ${X}^{\left(0\right)}={y}_{\star }$. The two kernels provide a similar approximation of the true cdf. Here $M=1\mathrm{e}6$, and $\rho =0.85$ for both kernels. We also display the normalized histograms of the points ${X}^{\left(m\right)}$ sampled respectively from ${𝖯}_{\mathrm{GL}}$ (top left), ${𝖯}_{\mathrm{NR}}$ (top right) and the crude rejection algorithm with Gaussian proposal (bottom left). In the latter plot, the histogram is built from only around 50–60 points, corresponding to the accepted points among the $M=1\mathrm{e}6$ proposed points.
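
Since the proposal of ${𝖯}_{\mathrm{GL}}$ is reversible with respect to the unrestricted Gaussian density, its acceptance ratio reduces to the indicator of $\mathcal{𝒜}$. A minimal self-contained sketch of this chain (illustrative values for ${y}_{\star }$ and ρ, not those of Table 1):

```python
import math, random

def sample_P_GL(y_star, rho, M, x0, seed=0):
    """Metropolis chain with invariant law N(0,1) restricted to (-inf, y_star]:
    propose z ~ N(rho * x, 1 - rho^2); the proposal is reversible w.r.t. the
    N(0,1) density, so the acceptance ratio reduces to 1_{z <= y_star}."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    x, path = x0, []
    for _ in range(M):
        z = rho * x + s * rng.gauss(0.0, 1.0)
        if z <= y_star:          # accept; otherwise the chain stays at x
            x = z
        path.append(x)
    return path

y_star = -2.0                     # hypothetical threshold, for illustration
path = sample_P_GL(y_star, rho=0.85, M=200_000, x0=y_star)
emp_mean = sum(path) / len(path)  # approximates E[Y | Y <= y_star] (about -2.373 here)
```

Every rejection leaves the chain in place, so the whole path stays inside the rare set, unlike the crude rejection sampler which discards almost all of its draws.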

Figure 2

For different values of ρ, estimation of the autocorrelation function (over 100 independent runs) of the chain ${𝖯}_{\mathrm{GL}}$ (left) and ${𝖯}_{\mathrm{NR}}$ (right). Each curve is computed using $1\mathrm{e}6$ sampled points.

To assess the speed of convergence of the samplers ${𝖯}_{\mathrm{GL}}$ and ${𝖯}_{\mathrm{NR}}$ to their stationary distributions, we additionally plot in Figure 2 the autocorrelation function of both chains. For ${𝖯}_{\mathrm{GL}}$ the choice of ρ is quite significant, as observed in [11]; values of ρ around 0.9 usually give good results. For ${𝖯}_{\mathrm{NR}}$, the choice of ρ is less significant in this example, because we are able to define a proposal which takes advantage of the knowledge of the rare set. A comparison of acceptance rates is provided below (see Figure 3 (left)).

Figure 3

Comparison of the MCMC sampler ${𝖯}_{\mathrm{GL}}$ (top) and ${𝖯}_{\mathrm{NR}}$ (bottom), for different values of $\rho \in \left\{0.1,\mathrm{\dots },0.9,0.99\right\}$.Left: Mean acceptance rate when computing $ℙ\left(Y\le {y}_{\star }\mid Y\le {w}_{J-1}\right)$ after M iterations of the chain. Right: Estimation of $ℙ\left(Y\in \mathcal{𝒜}\right)$ by combining splitting and MCMC.

We also illustrate the behavior of these two MCMC samplers for the estimation of the rare-event probability $ℙ\left(Y\in \mathcal{𝒜}\right)$. Following the approach of [11], we use the decomposition

$ℙ\left(Y\in \mathcal{𝒜}\right)=\prod _{j=1}^{J}ℙ\left(Y\le {w}_{j}\mid Y\le {w}_{j-1}\right)\approx \stackrel{^}{\pi }:=\prod _{j=1}^{J}\left(\frac{1}{M}\sum _{m=1}^{M}{𝟏}_{{X}^{\left(m,j\right)}\le {w}_{j}}\right),$

where ${w}_{0}=+\mathrm{\infty }>{w}_{1}>\mathrm{\cdots }>{w}_{J}={y}_{\star }$, and $\left\{{X}^{\left(m,j\right)}:m\ge 0\right\}$ is a Markov chain with kernel ${𝖯}_{\mathrm{GL}}^{\left(j\right)}$ or ${𝖯}_{\mathrm{NR}}^{\left(j\right)}$ having a standard Gaussian restricted to $\left(-\mathrm{\infty },{w}_{j-1}\right]$ as invariant distribution. The J intermediate levels are chosen such that $ℙ\left(Y\le {w}_{j}\mid Y\le {w}_{j-1}\right)\approx 0.1$. Figure 3 (right) displays the boxplot of 100 independent realizations of the estimator $\stackrel{^}{\pi }$ for different values of $\rho \in \left\{0.1,\mathrm{\dots },0.9\right\}$; the horizontal dotted line indicates the true value $ℙ\left(Y\in \mathcal{𝒜}\right)=5.6\mathrm{e}-5$. Here $J=5$, $\left({w}_{1},\mathrm{\dots },{w}_{4}\right)=\left(0,-1.6,-2.5,-3.2\right)$ and $M=1\mathrm{e}4$. Figure 3 (left) displays the boxplot of 100 mean acceptance rates ${M}^{-1}{\sum }_{m=1}^{M}{𝟏}_{\left\{{X}^{\left(m,J\right)}={\stackrel{~}{X}}^{\left(m,J\right)}\right\}}$ computed along 100 independent chains $\left\{{X}^{\left(m,J\right)}:m\le M\right\}$, for different values of ρ; the horizontal dotted line is set to 0.234, which is the target rate usually chosen when fixing design parameters in a Hastings–Metropolis algorithm (see e.g. [22]). We observe that the non-reversible proposal kernel ${𝖯}_{\mathrm{NR}}$ yields more accurate results than ${𝖯}_{\mathrm{GL}}$; this is intuitively easy to understand, since ${𝖯}_{\mathrm{NR}}$ better accounts for the point ${y}_{\star }$ around which one should sample.
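
The splitting decomposition above can be sketched as follows. This is a simplified illustration rather than the authors' code: the intermediate levels are those reported above, the final level is taken as ${y}_{\star }\approx -3.86$ (so that $\mathrm{\Phi }\left({y}_{\star }\right)\approx 5.6\mathrm{e}-5$, an assumption made to match the reported probability), and each conditional probability is estimated along a single ${𝖯}_{\mathrm{GL}}$-type chain.

```python
import math, random

def splitting_estimate(levels, M, rho, seed=0):
    """Estimate P(Y <= levels[-1]) for Y ~ N(0,1) as the product of the
    conditional probabilities P(Y <= w_j | Y <= w_{j-1}), each evaluated
    along a Metropolis chain restricted to (-inf, w_{j-1}]."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    prod, x, prev = 1.0, 0.0, float("inf")
    for w in levels:
        hits = 0
        for _ in range(M):
            z = rho * x + s * rng.gauss(0.0, 1.0)
            if z <= prev:            # stay inside the current level set
                x = z
            hits += (x <= w)
        prod *= hits / M
        prev = w
        x = min(x, w)                # restart inside the next level set
    return prod

levels = [0.0, -1.6, -2.5, -3.2, -3.86]   # (w_1, ..., w_4) from above, w_J = y_star
pi_hat = splitting_estimate(levels, M=20_000, rho=0.85)
```

Each factor is of order 0.1 to 0.5, so none of the M-sample chains ever has to estimate a probability as small as the target $5.6\mathrm{e}-5$ directly.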

Figure 4

Left: 1000 sampled points $\left({X}^{\left(m\right)},{R}^{\left(m\right)}\right)$ (using the sampler ${𝖯}_{\mathrm{GL}}$), together with ${\varphi }_{\star }$. Right: A realization of the error function $x↦{\stackrel{^}{\varphi }}_{M}\left(x\right)-{\varphi }_{\star }\left(x\right)$ on $\left[-5,{y}_{\star }\right]$, for different values of $L\in \left\{2,3,4\right\}$ and two different kernels when sampling ${X}^{\left(1:M\right)}$.

We now run Algorithm 1 for the estimation of the conditional expectation $x↦{\varphi }_{\star }\left(x\right)$ on $\left(-\mathrm{\infty },{y}_{\star }\right]$. The algorithm is run with $M=1\mathrm{e}6$, successively with $𝖯={𝖯}_{\mathrm{GL}}$ and $𝖯={𝖯}_{\mathrm{NR}}$, both with $\rho =0.85$; the L basis functions are $\left\{x↦{\varphi }_{\mathrm{\ell }}\left(x\right)={\left(\xi \left(x\right)\right)}^{\mathrm{\ell }-1}:\mathrm{\ell }=1,\mathrm{\dots },L\right\}$ and we consider successively $L\in \left\{2,3,4\right\}$. In Figure 4 (right), the error function $x↦{\stackrel{^}{\varphi }}_{M}\left(x\right)-{\varphi }_{\star }\left(x\right)$ is displayed for different values of L when computing ${\stackrel{^}{\varphi }}_{M}$. It is displayed on the interval $\left[-5,{y}_{\star }\right]$, which has probability larger than $1-5\mathrm{e}-3$ under the distribution of Y given $\left\{Y\in \mathcal{𝒜}\right\}$ (see Figure 1). Note that the errors may be quite large for x close to -5; however, these values are very unlikely (see Figure 1), so these large errors are not representative of the global quadratic error. In Figure 4 (left), we display 1000 sampled points $\left({X}^{\left(m\right)},{R}^{\left(m\right)}\right)$. These points are taken from the sampler ${𝖯}_{\mathrm{GL}}$, every twentieth iteration, in order to obtain nearly uncorrelated design points. Observe that the regression function ${\varphi }_{\star }$ is nearly affine, which explains why the results with only $L=2$ are already quite accurate.
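
The regression step itself is an ordinary least-squares fit in the basis $\left\{{\left(\xi \left(x\right)\right)}^{\mathrm{\ell }-1}\right\}$. The following sketch is illustrative only: the parameter values are hypothetical, and the design and response are synthetic, with a response affine in $\xi \left(x\right)$ so that $L=2$ recovers the regression function exactly up to noise.

```python
import numpy as np

def fit_phi(X, R, S0, sigma, T, L):
    """Least-squares estimate phi_hat = <alpha; basis> with the basis
    x -> xi(x)^(l-1), l = 1, ..., L; lstsq returns a minimal-norm solution,
    so rank-deficient design matrices are handled as in Lemma 1."""
    def basis(x):
        xi = S0 * np.exp(-0.5 * sigma**2 * T + sigma * np.sqrt(T) * x)
        return np.vander(xi, N=L, increasing=True)  # columns xi^0, ..., xi^(L-1)
    alpha, *_ = np.linalg.lstsq(basis(X), R, rcond=None)
    return lambda x: basis(x) @ alpha

# synthetic check with a response affine in xi (hypothetical values throughout)
rng = np.random.default_rng(1)
S0, sigma, T = 100.0, 0.25, 0.5
X = rng.normal(size=5000) - 3.0          # stand-in for the Markov-chain design
xi = S0 * np.exp(-0.5 * sigma**2 * T + sigma * np.sqrt(T) * X)
R = 80.0 - 0.7 * xi + rng.normal(scale=2.0, size=5000)
phi_hat = fit_phi(X, R, S0, sigma, T, L=2)
err = np.mean((phi_hat(X) - (80.0 - 0.7 * xi)) ** 2)  # empirical squared error
```

On this toy response the empirical squared error is of order ${\sigma }^{2}L/M$, the statistical term of Theorem 1.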

Figure 5

Left: Monte Carlo approximations of $M↦{\mathrm{\Delta }}_{M}$, and fitted curves of the form $M↦\alpha +\beta /M$. Right: For different values of ρ, and for three different values of M, boxplot of 100 independent estimates ${\stackrel{^}{\mathcal{ℐ}}}_{M}$ when ${X}^{\left(1:M\right)}$ is sampled from a chain with kernel ${𝖯}_{\mathrm{GL}}$ (top) and ${𝖯}_{\mathrm{NR}}$ (bottom).

We finally illustrate Algorithm 1 for the estimation of $\mathcal{ℐ}$ (see (3.1)). In Figure 5 (right), the boxplot of 100 independent outputs ${\stackrel{^}{\mathcal{ℐ}}}_{M}$ of Algorithm 1 is displayed when run with $𝖯={𝖯}_{\mathrm{GL}}$ (top) and $𝖯={𝖯}_{\mathrm{NR}}$ (bottom); different values of ρ and M are considered, namely $\rho \in \left\{0,0.1,0.5,0.85\right\}$ and $M\in \left\{5\mathrm{e}2,5\mathrm{e}3,1\mathrm{e}4\right\}$; the regression step is performed with $L=2$ basis functions. Figure 5 (right) clearly illustrates the benefit of using an MCMC sampler for the present regression problems. When $𝖯={𝖯}_{\mathrm{GL}}$, compare the distributions for $\rho =0$ (i.i.d. samples) and $\rho =0.85$: the bias at $\rho =0$ does not disappear even when $M=1\mathrm{e}4$, and the variance is very significantly reduced (for $M=5\mathrm{e}2,5\mathrm{e}3,1\mathrm{e}4$ respectively, the standard deviation is reduced by factors of 1.11, 6.58 and 11.96).

Figure 5 (left) is an empirical verification of the statement of Theorem 1. One hundred independent runs of Algorithm 1 are performed and, for different values of M, the quantities ${M}^{-1}{\sum }_{m=1}^{M}{\left({\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}$ are collected; here ${\stackrel{^}{\varphi }}_{M}$ is computed with $L=2$ basis functions. The mean value over these 100 points is displayed as a function of M; it is a Monte Carlo approximation of ${\mathrm{\Delta }}_{M}$ (see (2.12)). We compare two implementations of Algorithm 1: first $𝖯={𝖯}_{\mathrm{GL}}$ with $\rho =0.85$, then $𝖯={𝖯}_{\mathrm{NR}}$ with $\rho =0.85$. Theorem 1 establishes that ${\mathrm{\Delta }}_{M}$ is upper bounded by a quantity of the form $\alpha +\beta /M$; such a curve is fitted by a mean-square technique (we obtain $\alpha =0.001$ for both kernels, which agrees with the theorem since this term does not depend on the Monte Carlo stages). The fitted curves are shown in Figure 5 (left) and demonstrate a good match between the theory and the numerical studies.

## 3.2 Correlated geometric Brownian motions in dimension 2

We adapt the one-dimensional example, taking a Put on the geometric average of two correlated assets $\left\{{S}_{t}=\left({S}_{t,1},{S}_{t,2}\right):t\ge 0\right\}$. In this example, $d=2$, $h\left({s}_{1},{s}_{2}\right)=\sqrt{{s}_{1}{s}_{2}}$ and $\mathcal{𝒮}=\left\{\left({s}_{1},{s}_{2}\right)\in {ℝ}_{+}×{ℝ}_{+}:{s}_{1}\le {s}_{\star },{s}_{2}\le {s}_{\star }\right\}$. We denote by ${\sigma }_{1}$ and ${\sigma }_{2}$ the volatilities and by ϱ the correlation; the drift of $\left\{{S}_{t}:t\ge 0\right\}$ is zero. Set

$\mathrm{\Gamma }:=\left[\begin{array}{cc}\hfill 1\hfill & \hfill \varrho \hfill \\ \hfill \varrho \hfill & \hfill 1\hfill \end{array}\right],\xi \left({y}_{1},{y}_{2}\right):=\left[\begin{array}{c}\hfill {S}_{0,1}\mathrm{exp}\left(-\frac{1}{2}{\sigma }_{1}^{2}T+\sqrt{T}{\sigma }_{1}{y}_{1}\right)\hfill \\ \hfill {S}_{0,2}\mathrm{exp}\left(-\frac{1}{2}{\sigma }_{2}^{2}T+\sqrt{T}{\sigma }_{2}{y}_{2}\right)\hfill \end{array}\right].$

We have ${S}_{T}=\xi \left(Y\right)$, where $Y\sim {\mathcal{𝒩}}_{2}\left(0,\mathrm{\Gamma }\right)$. Furthermore, it is easy to verify that $\left\{\sqrt{{S}_{t,1}{S}_{t,2}}:t\ge 0\right\}$ is still a geometric Brownian motion, with volatility ${\sigma }^{\prime }$ and drift ${\mu }^{\prime }$ given by

${\sigma }^{\prime }:=\frac{1}{2}\sqrt{{\sigma }_{1}^{2}+{\sigma }_{2}^{2}+2\varrho {\sigma }_{1}{\sigma }_{2}},{\mu }^{\prime }:=-\frac{1}{8}\left({\sigma }_{1}^{2}+{\sigma }_{2}^{2}-2\varrho {\sigma }_{1}{\sigma }_{2}\right).$

Hence, problem (3.1) is of the form (1.1) with

$f\left(y,r\right):={\left(r-{p}_{\star }\right)}_{+},$$\mathcal{𝒜}:=\left\{y\in {ℝ}^{2}:\xi \left(y\right)\in \left(-\mathrm{\infty },{s}_{\star }\right]×\left(-\mathrm{\infty },{s}_{\star }\right]\right\}$$=\left\{y\in {ℝ}^{2}:{y}_{i}\le {y}_{\star ,i}\right\},{y}_{\star }:={\left[\frac{1}{{\sigma }_{i}\sqrt{T}}\mathrm{ln}\left(\frac{{s}_{\star }}{{S}_{0,i}}\right)+\frac{1}{2}{\sigma }_{i}\sqrt{T}\right]}_{i=1,2},$$R:={\left(K-\mathrm{\Psi }\left(Y\right)\mathrm{exp}\left\{\left({\mu }^{\prime }-\frac{1}{2}{\left({\sigma }^{\prime }\right)}^{2}\right)\left({T}^{\prime }-T\right)+\sqrt{{T}^{\prime }-T}{\sigma }^{\prime }Z\right\}\right)}_{+},$

where $Z\sim \mathcal{𝒩}\left(0,1\right)$ is independent of Y, and $\mathrm{\Psi }\left(y\right):=\sqrt{{\left(\xi \left(y\right)\right)}_{1}{\left(\xi \left(y\right)\right)}_{2}}$.

For the outer Monte Carlo stage, ${𝖯}_{\mathrm{GL}}$ is defined as the Hastings–Metropolis kernel with proposal distribution $q\left(x,\cdot \right)\mathrm{d}\lambda \equiv {\mathcal{𝒩}}_{2}\left(\rho x,\left(1-{\rho }^{2}\right)\mathrm{\Gamma }\right)$ (with $\rho \in \left(0,1\right)$) and with invariant distribution the bi-dimensional Gaussian distribution ${\mathcal{𝒩}}_{2}\left(0,\mathrm{\Gamma }\right)$ restricted to the set $\mathcal{𝒜}$. We compare this Markov kernel to the kernel ${𝖯}_{\mathrm{NR}}$ with non-reversible proposal, defined as a Hastings–Metropolis kernel with proposal distribution ${\mathcal{𝒩}}_{2}\left(\rho x+\left(1-\rho \right){y}_{\star },\left(1-{\rho }^{2}\right)\mathrm{\Gamma }\right)$ and with the same invariant distribution. The acceptance-rejection ratio for this algorithm is given by (2.8) with ${x}_{\mathcal{𝒜}}←{y}_{\star }$ and $\mathrm{\Sigma }←\mathrm{\Gamma }$.

In this example, the inner conditional expectation is explicit: ${\varphi }_{\star }\left(x\right)={\mathrm{\Phi }}_{\star }\left(\mathrm{\Psi }\left(x\right)\right)$ with

${\mathrm{\Phi }}_{\star }\left(u\right):=K\mathrm{\Phi }\left({d}_{+}\left(u\right)\right)-u{e}^{{\mu }^{\prime }\left({T}^{\prime }-T\right)}\mathrm{\Phi }\left({d}_{-}\left(u\right)\right),u>0,$${d}_{±}\left(u\right):=\frac{1}{{\sigma }^{\prime }\sqrt{{T}^{\prime }-T}}\mathrm{ln}\left(\frac{K}{u{e}^{{\mu }^{\prime }\left({T}^{\prime }-T\right)}}\right)±\frac{1}{2}{\sigma }^{\prime }\sqrt{{T}^{\prime }-T}.$

For the basis functions, we take

${\phi }_{1}\left(x\right)=1,$${\phi }_{2}\left(x\right)=\sqrt{{\left(\xi \left(x\right)\right)}_{1}},$${\phi }_{3}\left(x\right)\mathit{ }=\sqrt{{\left(\xi \left(x\right)\right)}_{2}},$${\phi }_{4}\left(x\right)={\left(\xi \left(x\right)\right)}_{1},$${\phi }_{5}\left(x\right)={\left(\xi \left(x\right)\right)}_{2},$${\phi }_{6}\left(x\right)\mathit{ }=\sqrt{{\left(\xi \left(x\right)\right)}_{1}{\left(\xi \left(x\right)\right)}_{2}}.$

The parameter values for the numerical tests are given in Table 2.

Table 2

Parameter values for the two-dimensional example.

Figure 6 depicts the rare event $\mathcal{𝒜}$: on the left (resp. on the right), some level curves of the density of ${\mathcal{𝒩}}_{2}\left(0,\mathrm{\Gamma }\right)$ (resp. of the distribution of $\left({S}_{T,1},{S}_{T,2}\right)$) are displayed, together with the rare event in the bottom left corner.

Figure 6

Left: Level curves of ${\mathcal{𝒩}}_{2}\left(0,\mathrm{\Gamma }\right)$ and the rare set in the lower left area delimited by the two hyperplanes. Right: Level curves of the density function of $\left({S}_{T,1},{S}_{T,2}\right)$ and the rare set in the lower left area delimited by the two hyperplanes.

We run two Markov chains, with kernels ${𝖯}_{\mathrm{GL}}$ and ${𝖯}_{\mathrm{NR}}$ respectively, and compute the mean acceptance rate after $M=1\mathrm{e}4$ iterations. For different values of ρ, this experiment is repeated 100 times independently; Figure 7 reports the boxplot of these mean acceptance rates. It shows that a rate close to 0.234 is reached with $\rho =0.8$ for $𝖯={𝖯}_{\mathrm{GL}}$ and $\rho =0.7$ for $𝖯={𝖯}_{\mathrm{NR}}$. In all the experiments below involving these kernels, we use these values of the design parameter ρ.

Figure 7

Boxplot over 100 independent runs, of the mean acceptance rate after $M=1\mathrm{e}4$ iterations for the kernel $𝖯={𝖯}_{\mathrm{GL}}$ (top) and the kernel $𝖯={𝖯}_{\mathrm{NR}}$ (bottom). Different values of ρ are considered.

Figure 8

Left: Normalized histograms of the error $\left\{{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right):m=1,\mathrm{\dots },M\right\}$, when $L=3$, with design points sampled with ${𝖯}_{\mathrm{GL}}$ (left) and ${𝖯}_{\mathrm{NR}}$ (right). Right: The same case with $L=6$.

In Figure 8 (left), the normalized histogram of the errors $\left\{{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right):m=1,\mathrm{\dots },M\right\}$ is displayed when $L=3$ and the design points ${X}^{\left(1:M\right)}$ are sampled from $𝖯={𝖯}_{\mathrm{GL}}$ (left) or $𝖯={𝖯}_{\mathrm{NR}}$ (right). Figure 8 (right) shows the case $L=6$. Here, $M=1\mathrm{e}6$. This clearly shows the improvement brought by choosing more basis functions. In particular, the sixth basis function brings a substantial gain in accuracy, as expected, since the regression function ${\varphi }_{\star }$ depends directly on it.

Figure 9

Left: Error function $s↦{\stackrel{^}{\varphi }}_{M}\left({\xi }^{-1}\left(s\right)\right)-{\varphi }_{\star }\left({\xi }^{-1}\left(s\right)\right)$, with $L=3$, with design points sampled with ${𝖯}_{\mathrm{GL}}$ (left) and ${𝖯}_{\mathrm{NR}}$ (right). Right: The same case with $L=6$.

In Figure 9 (left), the errors $s↦{\stackrel{^}{\varphi }}_{M}\left({\xi }^{-1}\left(s\right)\right)-{\varphi }_{\star }\left({\xi }^{-1}\left(s\right)\right)$ are displayed on $\left[15,{s}_{\star }\right]×\left[15,{s}_{\star }\right]$ when $L=3$ and the outer samples ${X}^{\left(1:M\right)}$ used in the computation of ${\stackrel{^}{\varphi }}_{M}$ are sampled from $𝖯={𝖯}_{\mathrm{GL}}$ (left) and $𝖯={𝖯}_{\mathrm{NR}}$ (right). Figure 9 (right) shows the case $L=6$. Here, $M=1\mathrm{e}6$. This is complementary to Figure 8 since it shows the prediction error everywhere in the space, and not only along the design points.

Figure 10

Left: A Monte Carlo approximation of $M↦{\mathrm{\Delta }}_{M}$, and a fitted curve of the form $M↦\alpha +\beta /M$. Right: For different values of ${\rho }_{\mathrm{GL}}$ and ${\rho }_{\mathrm{NR}}$, and for four different values of M – namely $M\in \left\{1\mathrm{e}2,5\mathrm{e}3,1\mathrm{e}4,2\mathrm{e}4\right\}$ –, boxplot of 100 independent estimates ${\stackrel{^}{\mathcal{ℐ}}}_{M}$.

In Figure 10 (left), a Monte Carlo approximation of ${\mathrm{\Delta }}_{M}$ (see (2.12)) computed from 100 independent estimators ${\stackrel{^}{\varphi }}_{M}$ is displayed as a function of M, for M in the range $\left[3\mathrm{e}3,5\mathrm{e}4\right]$; here ${\stackrel{^}{\varphi }}_{M}$ is computed with $L=6$. We also fit a curve of the form $M↦\alpha +\beta /M$ to illustrate the sharpness of the upper bound in (2.12). In Figure 10 (right), the boxplot of 100 independent outputs ${\stackrel{^}{\mathcal{ℐ}}}_{M}$ of Algorithm 1 is displayed, for $M\in \left\{1\mathrm{e}2,5\mathrm{e}3,1\mathrm{e}4,2\mathrm{e}4\right\}$ and different values of ${\rho }_{\mathrm{GL}}$ (resp. ${\rho }_{\mathrm{NR}}$) – the design parameter in ${𝖯}_{\mathrm{GL}}$ (resp. ${𝖯}_{\mathrm{NR}}$). Here again, we observe the advantage of using MCMC samplers to reduce the variance in this regression problem in the rare-event regime: for $M=5\mathrm{e}3,1\mathrm{e}4,2\mathrm{e}4$ respectively, the standard deviation is reduced by factors of 6.89, 7.27 and 7.74.

## 4.1 Proof of Theorem 1

By the construction of the random variables $\underset{¯}{𝐑}$ and ${X}^{\left(1:M\right)}$ (see Algorithm 1), for any bounded and positive measurable functions ${g}_{1},\mathrm{\dots },{g}_{M}$, it holds

$𝔼\left[\prod _{m=1}^{M}{g}_{m}\left({R}^{\left(m\right)}\right)|{X}^{\left(1:M\right)}\right]=\prod _{m=1}^{M}𝔼\left[{g}_{m}\left({R}^{\left(m\right)}\right)\mid {X}^{\left(m\right)}\right]=\prod _{m=1}^{M}\int {g}_{m}\left(r\right)𝖰\left({X}^{\left(m\right)},\mathrm{d}r\right).$(4.1)

#### Lemma 1.

If ${\mathrm{A}}^{\mathrm{\prime }}\mathit{}\mathrm{A}\mathit{}\alpha \mathrm{=}{\mathrm{A}}^{\mathrm{\prime }}\mathit{}\mathrm{A}\mathit{}\stackrel{\mathrm{~}}{\alpha }$, then $\mathrm{A}\mathit{}\alpha \mathrm{=}\mathrm{A}\mathit{}\stackrel{\mathrm{~}}{\alpha }$. In other words, any coefficient solution α of the least-squares problem (2.3) yields the same values for the approximated regression function along the design ${X}^{\mathrm{\left(}\mathrm{1}\mathrm{:}M\mathrm{\right)}}$.

#### Proof.

Denote by r the rank of $𝐀$ and write $𝐀=UD{V}^{\prime }$ for the singular value decomposition of $𝐀$. It holds

${𝐀}^{\prime }𝐀\alpha ={𝐀}^{\prime }𝐀\stackrel{~}{\alpha }⇔{D}^{\prime }D{V}^{\prime }\alpha ={D}^{\prime }D{V}^{\prime }\stackrel{~}{\alpha }$

by using ${V}^{\prime }V={I}_{L}$ and ${U}^{\prime }U={I}_{M}$. This implies that the first r components of ${V}^{\prime }\alpha$ and ${V}^{\prime }\stackrel{~}{\alpha }$ are equal and thus $D{V}^{\prime }\alpha =D{V}^{\prime }\stackrel{~}{\alpha }$. This concludes the proof. ∎
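
Lemma 1 can be checked numerically on a deliberately rank-deficient design. The sketch below is illustrative: two distinct solutions of the normal equations, differing by a vector in the null space of A, give identical fitted values along the design.

```python
import numpy as np

rng = np.random.default_rng(0)
# rank-deficient design: L = 4 columns, the last duplicates the first
A = rng.standard_normal((50, 3))
A = np.column_stack([A, A[:, 0]])            # rank(A) = 3 < L = 4
R = rng.standard_normal(50)

# two distinct solutions of the normal equations A'A alpha = A'R:
alpha_pinv = np.linalg.pinv(A) @ R           # minimal-norm solution
null_dir = np.array([1.0, 0.0, 0.0, -1.0])   # A @ null_dir = 0
alpha_other = alpha_pinv + 2.5 * null_dir    # another solution

# both satisfy the normal equations ...
assert np.allclose(A.T @ A @ alpha_pinv, A.T @ A @ alpha_other)
# ... and, as Lemma 1 asserts, give the same fitted values along the design
assert np.allclose(A @ alpha_pinv, A @ alpha_other)
```
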

The next result justifies a possible interchange between least-squares projection and conditional expectation.

#### Lemma 2.

Set ${\stackrel{\mathrm{^}}{\varphi }}_{M}\mathrm{=}\mathrm{〈}{\stackrel{\mathrm{^}}{\alpha }}_{M}\mathrm{;}\underset{\mathrm{¯}}{\mathbf{\varphi }}\mathrm{〉}$, where ${\stackrel{\mathrm{^}}{\alpha }}_{M}\mathrm{\in }{\mathrm{R}}^{L}$ is any solution to ${\mathrm{A}}^{\mathrm{\prime }}\mathit{}\mathrm{A}\mathit{}\alpha \mathrm{=}{\mathrm{A}}^{\mathrm{\prime }}\mathit{}\underset{\mathrm{¯}}{\mathrm{R}}$. Then the function

$x↦𝔼\left[{\stackrel{^}{\varphi }}_{M}\left(x\right)\mid {X}^{\left(1:M\right)}\right]$

is a solution to the least-squares problem

$\underset{\phi \in \mathcal{ℱ}}{\mathrm{min}}\frac{1}{M}\sum _{m=1}^{M}{\left({\varphi }_{\star }\left({X}^{\left(m\right)}\right)-\phi \left({X}^{\left(m\right)}\right)\right)}^{2},$

where $\mathcal{F}\mathrm{:=}\mathrm{\left\{}\phi \mathrm{=}\mathrm{〈}\alpha \mathrm{;}\underset{\mathrm{¯}}{\mathbf{\varphi }}\mathrm{〉}\mathrm{:}\alpha \mathrm{\in }{\mathrm{R}}^{L}\mathrm{\right\}}$.

#### Proof.

It is sufficient to prove that

$\underset{\phi \in \mathcal{ℱ}}{\mathrm{min}}\frac{1}{M}\sum _{m=1}^{M}{\left({\varphi }_{\star }\left({X}^{\left(m\right)}\right)-\phi \left({X}^{\left(m\right)}\right)\right)}^{2}=\frac{1}{M}{|{\underset{¯}{\mathbit{\varphi }}}_{\star }-𝐀𝔼\left[{\stackrel{^}{\alpha }}_{M}\mid {X}^{\left(1:M\right)}\right]|}^{2}$

where ${\underset{¯}{\mathbit{\varphi }}}_{\star }:={\left({\varphi }_{\star }\left({X}^{\left(1\right)}\right),\mathrm{\dots },{\varphi }_{\star }\left({X}^{\left(M\right)}\right)\right)}^{\prime }$. A solution of the above least-squares problem is of the form

$x↦〈{\stackrel{^}{\alpha }}_{M,\star };\underset{¯}{\mathbit{\varphi }}\left(x\right)〉,$

where ${\stackrel{^}{\alpha }}_{M,\star }\in {ℝ}^{L}$ satisfies ${𝐀}^{\prime }𝐀{\stackrel{^}{\alpha }}_{M,\star }={𝐀}^{\prime }{\underset{¯}{\mathbit{\varphi }}}_{\star }$. By (4.1) and the definition of ${\stackrel{^}{\alpha }}_{M}$, this yields

${𝐀}^{\prime }{\underset{¯}{\mathbit{\varphi }}}_{\star }={𝐀}^{\prime }\left[\begin{array}{c}\hfill 𝔼\left[{R}^{\left(1\right)}\mid {X}^{\left(1\right)}\right]\hfill \\ \hfill \mathrm{⋮}\hfill \\ \hfill 𝔼\left[{R}^{\left(M\right)}\mid {X}^{\left(M\right)}\right]\hfill \end{array}\right]=𝔼\left[{𝐀}^{\prime }\underset{¯}{𝐑}\mid {X}^{\left(1:M\right)}\right]={𝐀}^{\prime }𝐀𝔼\left[{\stackrel{^}{\alpha }}_{M}\mid {X}^{\left(1:M\right)}\right].$

By Lemma 1, we then conclude that $𝐀{\stackrel{^}{\alpha }}_{M,\star }=𝐀𝔼\left[{\stackrel{^}{\alpha }}_{M}\mid {X}^{\left(1:M\right)}\right]$, which completes the proof. ∎

#### Proof of Theorem 1.

Using Lemma 2 and Pythagoras' theorem, it holds

$\frac{1}{M}\sum _{m=1}^{M}{\left({\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}={\mathcal{𝒯}}_{1}+{\mathcal{𝒯}}_{2}$

with

${\mathcal{𝒯}}_{1}:=\frac{1}{M}\sum _{m=1}^{M}{\left({\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-𝔼\left[{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)\mid {X}^{\left(1:M\right)}\right]\right)}^{2}=\frac{1}{M}{|𝐀\left({\stackrel{^}{\alpha }}_{M}-𝔼\left[{\stackrel{^}{\alpha }}_{M}\mid {X}^{\left(1:M\right)}\right]\right)|}^{2},$${\mathcal{𝒯}}_{2}:=\frac{1}{M}\sum _{m=1}^{M}{\left(𝔼\left[{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)\mid {X}^{\left(1:M\right)}\right]-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}.$

By Lemma 1, we can take ${\stackrel{^}{\alpha }}_{M}={\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}{𝐀}^{\prime }\underset{¯}{𝐑}$, which is the coefficient with minimal norm among the solutions of the least-squares problem of Algorithm 1. Let us consider ${\mathcal{𝒯}}_{1}$. Set $𝐁:=𝐀{\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}{𝐀}^{\prime }$, an $M×M$ matrix. By (2.5) and Lemma 2,

$M{\mathcal{𝒯}}_{1}={|𝐁\mathsf{Υ}|}^{2}=\mathrm{Trace}\left(𝐁\mathsf{Υ}{\mathsf{Υ}}^{\prime }𝐁\right),\text{with}\mathsf{Υ}:=\left[\begin{array}{c}\hfill {R}^{\left(1\right)}-{\varphi }_{\star }\left({X}^{\left(1\right)}\right)\hfill \\ \hfill \mathrm{⋮}\hfill \\ \hfill {R}^{\left(M\right)}-{\varphi }_{\star }\left({X}^{\left(M\right)}\right)\hfill \end{array}\right],$

so $M𝔼\left[{\mathcal{𝒯}}_{1}\mid {X}^{\left(1:M\right)}\right]$ is equal to $\mathrm{Trace}\left(𝐁𝔼\left[\mathsf{Υ}{\mathsf{Υ}}^{\prime }\mid {X}^{\left(1:M\right)}\right]𝐁\right)$. Under (4.1) and (2.11), the matrix $𝔼\left[\mathsf{Υ}{\mathsf{Υ}}^{\prime }\mid {X}^{\left(1:M\right)}\right]$ is diagonal with diagonal entries upper bounded by ${\sigma }^{2}$. Therefore,

$M𝔼\left[{\mathcal{𝒯}}_{1}\mid {X}^{\left(1:M\right)}\right]\le {\sigma }^{2}\mathrm{Trace}\left({𝐁}^{2}\right)={\sigma }^{2}\mathrm{rank}\left(𝐀\right)\le {\sigma }^{2}L.$(4.2)

This concludes the control of $𝔼\left[{\mathcal{𝒯}}_{1}\right]$.

Using again Lemma 2,

${\mathcal{𝒯}}_{2}=\underset{\phi \in \mathcal{ℱ}}{\mathrm{min}}\frac{1}{M}\sum _{m=1}^{M}{\left(\phi \left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}\le \frac{1}{M}\sum _{m=1}^{M}{\left({\psi }_{\star }\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}.$

Hence,

$𝔼\left[{\mathcal{𝒯}}_{2}\right]\le \frac{1}{M}\sum _{m=1}^{M}𝔼\left[{\left({\psi }_{\star }\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}\right]$$={|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}^{2}+\frac{1}{M}\sum _{m=1}^{M}\left\{𝔼\left[{\left({\psi }_{\star }\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)}^{2}\right]-\int {\left({\psi }_{\star }-{\varphi }_{\star }\right)}^{2}\mu d\lambda \right\}.$

By (2.10), the right-hand side is upper bounded by ${|{\psi }_{\star }-{\varphi }_{\star }|}_{{L}_{2}\left(\mu \right)}^{2}+{C}_{𝖯}{\sum }_{m=1}^{M}\rho \left(m\right)/M$. This concludes the proof of (2.12). ∎
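The two algebraic ingredients of this proof — the minimal-norm coefficient ${\stackrel{^}{\alpha }}_{M}={\left({𝐀}^{\prime }𝐀\right)}^{\mathrm{#}}{𝐀}^{\prime }\underset{¯}{𝐑}$ and the projection identity $\mathrm{Trace}\left({𝐁}^{2}\right)=\mathrm{rank}\left(𝐀\right)$ behind (4.2) — can be checked numerically. The sketch below uses synthetic Gaussian data and NumPy's pseudo-inverse; the sizes and data are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 200, 5                        # illustrative sizes: M design points, L basis functions
A = rng.standard_normal((M, L))      # design matrix with rows (phi_l(X^(m)))_{l<=L}
R = rng.standard_normal(M)           # noisy responses R^(m)

# Minimal-norm least-squares coefficient alpha = (A'A)^# A' R,
# where ^# denotes the Moore-Penrose pseudo-inverse.
alpha = np.linalg.pinv(A.T @ A) @ A.T @ R

# B = A (A'A)^# A' is the orthogonal projector onto the column space of A:
# B = B' = B^2, hence Trace(B^2) = Trace(B) = rank(A) <= L, which gives (4.2).
B = A @ np.linalg.pinv(A.T @ A) @ A.T
print(round(np.trace(B @ B)), np.linalg.matrix_rank(A))  # prints: 5 5
```

The bound (4.2) then follows because each diagonal entry of the conditional covariance of the noise is at most ${\sigma }^{2}$.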

## 4.2 Proof of Corollary 3

Note that ${𝖯}_{\mathrm{GL}}$ is a Hastings–Metropolis kernel; hence, for any $x\in \mathcal{𝒜}$ and any measurable set A in $\mathcal{𝒜}$,

${𝖯}_{\mathrm{GL}}\left(x,A\right)={\int }_{A\cap \mathcal{𝒜}}q\left(x,z\right)d\lambda \left(z\right)+{\delta }_{x}\left(A\right){\int }_{{\mathcal{𝒜}}^{c}}q\left(x,z\right)d\lambda \left(z\right).$(4.3)

## Irreducibility.

Let A be a measurable subset of $\mathcal{𝒜}$ such that ${\int }_{A}\mu d\lambda >0$ (which implies that ${\int }_{A}d\lambda >0$). Then

${𝖯}_{\mathrm{GL}}\left(x,A\right)\ge {\int }_{A\cap \mathcal{𝒜}\cap \left\{z:\mu \left(z\right)>0\right\}}q\left(x,z\right)d\lambda \left(z\right)$

and the right-hand side is positive since, owing to assumption (i),

$q\left(x,z\right)>0\mathit{ }\text{for all}x\in \mathcal{𝒜}\text{,}z\in A\cap \mathcal{𝒜}\cap \left\{z:\mu \left(z\right)>0\right\}.$

This implies that ${𝖯}_{\mathrm{GL}}$ is phi-irreducible, with $\mu \mathrm{d}\lambda$ as an irreducibility measure.

## Drift assumption.

By assumption (ii) and from (4.3), we have

${𝖯}_{\mathrm{GL}}\left(x,A\right)\le {\delta }_{1}{\delta }_{x}\left(A\right)+{\int }_{A\cap \mathcal{𝒜}}q\left(x,z\right)d\lambda \left(z\right),$

which implies by (iii) that

${𝖯}_{\mathrm{GL}}V\left(x\right)\le {\delta }_{1}V\left(x\right)+{\int }_{\mathcal{𝒜}}V\left(z\right)q\left(x,z\right)d\lambda \left(z\right)$$\le {\delta }_{1}V\left(x\right)+{𝟏}_{\mathcal{ℬ}}\left(x\right)\underset{x\in \mathcal{ℬ}}{sup}{\int }_{\mathcal{𝒜}}V\left(z\right)q\left(x,z\right)d\lambda \left(z\right)+{𝟏}_{{\mathcal{ℬ}}^{c}}\left(x\right)\left({\delta }_{2}-{\delta }_{1}\right)V\left(x\right)$$\le {\delta }_{2}V\left(x\right)+\underset{x\in \mathcal{ℬ}}{sup}{\int }_{\mathcal{𝒜}}V\left(z\right)q\left(x,z\right)d\lambda \left(z\right).$

## Small set assumption.

Let ${\mathcal{𝒞}}_{\star }$ be given by assumption (iv). We have ${\int }_{{\mathcal{𝒞}}_{\star }}\mu d\lambda >0$; thus define the probability measure

$\mathrm{d}\nu :=\frac{{𝟏}_{{\mathcal{𝒞}}_{\star }}\mu \mathrm{d}\lambda }{{\int }_{{\mathcal{𝒞}}_{\star }}\mu d\lambda }.$

Then for any $x\in {\mathcal{𝒞}}_{\star }$ and any measurable subset A of ${\mathcal{𝒞}}_{\star }$, it readily follows from (4.3) that

${𝖯}_{\mathrm{GL}}\left(x,A\right)\ge {\int }_{A\cap \mathcal{𝒜}}q\left(x,z\right){𝟏}_{\mu \left(z\right)\ne 0}d\lambda \left(z\right)$$\ge \underset{\left(x,z\right)\in {\mathcal{𝒞}}_{\star }^{2}}{inf}\left(\frac{q\left(x,z\right){𝟏}_{\mu \left(z\right)\ne 0}}{\mu \left(z\right)}\right){\int }_{A\cap \mathcal{𝒜}}\mu \left(z\right)d\lambda \left(z\right)$$=\underset{\left(x,z\right)\in {\mathcal{𝒞}}_{\star }^{2}}{inf}\left(\frac{q\left(x,z\right){𝟏}_{\mu \left(z\right)\ne 0}}{\mu \left(z\right)}\right)\left({\int }_{{\mathcal{𝒞}}_{\star }}\mu d\lambda \right)\nu \left(A\cap \mathcal{𝒜}\right).$

Thanks to the lower bounds in assumption (iv), this completes the proof. ∎
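Assembled in code, the kernel (4.3) is simply a Hastings–Metropolis step in which a proposal falling outside $\mathcal{𝒜}$ is rejected, so the chain stays put (the ${\delta }_{x}$ part of the kernel). The sketch below is a hypothetical one-dimensional illustration — a standard Gaussian density restricted to the rare set $\{x>2\}$ — and not the paper's setting.

```python
import numpy as np

def mh_restricted(mu, in_set, x0, n_steps, step=0.5, seed=0):
    """Hastings-Metropolis kernel of the form (4.3): a proposal leaving
    the set is rejected and the chain stays where it is."""
    rng = np.random.default_rng(seed)
    x, path = x0, []
    for _ in range(n_steps):
        z = x + step * rng.standard_normal()          # symmetric proposal q(x, .)
        if in_set(z) and rng.random() < min(1.0, mu(z) / mu(x)):
            x = z                                     # accepted move inside the set
        path.append(x)                                # otherwise delta_x: stay at x
    return np.array(path)

# Hypothetical example: standard Gaussian target restricted to {x > 2}.
mu = lambda x: np.exp(-0.5 * x * x)
chain = mh_restricted(mu, lambda x: x > 2.0, x0=2.5, n_steps=5000)
print(chain.min() > 2.0)  # the chain never leaves the rare set
```

Note that the chain is started inside the set; by construction every state of the path remains there, which is exactly the restriction property used in the irreducibility and small-set arguments above.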

## 4.3 Proof of Theorem 4

We write ${\stackrel{^}{\mathcal{ℐ}}}_{M}-\mathcal{ℐ}={\mathcal{𝒯}}_{1}+{\mathcal{𝒯}}_{2}$ with

${\mathcal{𝒯}}_{1}:=\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)\right)-\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right),$${\mathcal{𝒯}}_{2}:=\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)-\int f\left(x,{\varphi }_{\star }\left(x\right)\right)\mu \left(x\right)d\lambda \left(x\right).$

For the first term, we have

$𝔼\left[{|{\mathcal{𝒯}}_{1}|}^{2}\right]\le 𝔼\left[\frac{1}{M}\sum _{m=1}^{M}{|f\left({X}^{\left(m\right)},{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)\right)-f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)|}^{2}\right]$$\le {C}_{f}^{2}𝔼\left[\frac{1}{M}\sum _{m=1}^{M}{|{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)|}^{2}\right]$$={C}_{f}^{2}{\mathrm{\Delta }}_{M}.$

The second term is controlled by assumption (ii). We then conclude by the Minkowski inequality. ∎
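For concreteness, the plug-in estimator ${\stackrel{^}{\mathcal{ℐ}}}_{M}$ of Theorem 4 can be sketched end to end: fit ${\stackrel{^}{\varphi }}_{M}$ by least squares on the design points, then average $f\left({X}^{\left(m\right)},{\stackrel{^}{\varphi }}_{M}\left({X}^{\left(m\right)}\right)\right)$. Everything below is an illustrative assumption, not the paper's model: a uniform design standing in for the Markov chain, ${\varphi }_{\star }=\mathrm{sin}$, and the Lipschitz choice $f\left(x,r\right)=\mathrm{max}\left(r,0\right)$ (so ${C}_{f}=1$).

```python
import numpy as np

rng = np.random.default_rng(1)
M = 1000
X = rng.uniform(2.0, 4.0, size=M)                # stand-in for the MCMC design X^(1:M)
phi_star = np.sin                                # "unknown" regression function
R = phi_star(X) + 0.1 * rng.standard_normal(M)   # noisy responses R^(m)

# Least-squares regression on a cubic polynomial basis evaluated at the
# design points, with the minimal-norm coefficient via the pseudo-inverse.
A = np.vander(X, 4)                              # columns x^3, x^2, x, 1
alpha = np.linalg.pinv(A.T @ A) @ A.T @ R
phi_hat = A @ alpha                              # fitted values phi_hat_M(X^(m))

# Plug-in estimator with a 1-Lipschitz f in the second variable.
f = lambda x, r: np.maximum(r, 0.0)
I_hat = np.mean(f(X, phi_hat))
print(abs(I_hat - np.mean(f(X, phi_star(X)))) < 0.05)  # plug-in error is small
```

The gap between the two averages is exactly the term ${\mathcal{𝒯}}_{1}$ above, bounded by ${C}_{f}\sqrt{{\mathrm{\Delta }}_{M}}$.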

## 5 Conclusion

We have designed a new methodology to compute nested expectations in the rare-event regime. The outer expectation is evaluated using an ergodic Markov chain restricted to the rare set, whereas the inner expectation is approximated using a linear regression method with general basis functions. We quantified the error bounds as a function of the number of outer samples and of the size of the basis. This highlights that, in the regression scheme, replacing the usual i.i.d. design by an ergodic Markov chain design does not significantly alter the statistical errors.

When the inner expectation is instead computed pointwise with i.i.d. samples, we also provided error bounds, which show that this approach to the inner expectation is more suitable than the regression method for high-dimensional problems (curse of dimensionality).

In our numerical tests, we illustrated how to choose the parameters of the ergodic Markov chain appropriately so that the mean acceptance rate for staying in the rare set is about 20–30 %, which usually ensures low variance in the full scheme.
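One simple way to reach such an acceptance rate is a pilot-run search over the proposal step size. The sketch below is a hedged illustration on a hypothetical target (standard Gaussian restricted to $\{x>2\}$, as in the proof of Corollary 3); the multiplicative update rule and the target rate of 0.25 are our own illustrative choices, not a procedure from the paper.

```python
import numpy as np

def pilot_rate(step, n_pilot=2000, seed=3):
    """Empirical acceptance rate of the restricted random-walk sampler
    for a given step size (Gaussian target restricted to {x > 2})."""
    rng = np.random.default_rng(seed)
    mu = lambda x: np.exp(-0.5 * x * x)
    x, accepts = 2.5, 0
    for _ in range(n_pilot):
        z = x + step * rng.standard_normal()
        if z > 2.0 and rng.random() < min(1.0, mu(z) / mu(x)):
            x, accepts = z, accepts + 1
    return accepts / n_pilot

# Multiplicative step-size search: enlarge the step when accepting too
# often, shrink it when accepting too rarely, until near the target.
step, target = 1.0, 0.25
for _ in range(20):
    rate = pilot_rate(step)
    if abs(rate - target) < 0.02:
        break
    step *= np.exp(rate - target)
print(f"step={step:.2f}, acceptance rate={rate:.2f}")
```

Since the acceptance rate decreases with the step size, this damped multiplicative iteration settles near the target within a few pilot rounds.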

## A Algorithm where the inner stage uses a crude Monte Carlo method and the outer stage uses MCMC sampling

Here, the regression function ${\varphi }_{\star }$ is approximated by an empirical mean over N (conditionally) independent samples ${R}^{\left(m,k\right)}$, as in (1.2). We keep the same notation as in Section 2.

## A.2 Convergence results for the estimation of ${\stackrel{~}{\mathcal{ℐ}}}_{M}$

We extend Theorem 1 to this new setting. In fact, when the function f in (1.1) is smoother than Lipschitz continuous, we can show that the impact of N on the quadratic error is of order $1/N$ instead of the usual $1/\sqrt{N}$. This kind of improvement was derived in [13] in the i.i.d. setting (for both the inner and outer stages).

Assume that the following conditions hold:

• (i)

The (second and) fourth conditional moments of $𝖰$ are bounded: for $p=2$ and $p=4$, we have

${\sigma }_{p}:={\left(\underset{x\in \mathcal{𝒜}}{sup}\int {|r-\int r𝖰\left(x,\mathrm{d}r\right)|}^{p}𝖰\left(x,\mathrm{d}r\right)\right)}^{1/p}<\mathrm{\infty }.$

• (ii)

There exists a finite constant C such that for any M,

$𝔼\left[{\left({M}^{-1}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)-\int f\left(x,{\varphi }_{\star }\left(x\right)\right)\mu \left(x\right)d\lambda \left(x\right)\right)}^{2}\right]\le \frac{C}{M}.$

If $f\mathrm{:}{\mathrm{R}}^{d}\mathrm{×}\mathrm{R}\mathrm{\to }\mathrm{R}$ is globally ${C}_{f}$-Lipschitz in the second variable, then

${\left(𝔼\left[{|{\stackrel{~}{\mathcal{ℐ}}}_{M}-\mathcal{ℐ}|}^{2}\right]\right)}^{1/2}\le \frac{{C}_{f}{\sigma }_{2}}{\sqrt{N}}+\sqrt{\frac{C}{M}},$

where $\mathcal{I}$ and ${\stackrel{\mathrm{~}}{\mathcal{I}}}_{M}$ are given by (1.1) and Algorithm 3, respectively. If f is continuously differentiable in the second variable, with a derivative which is bounded and globally ${C}_{{\mathrm{\partial }}_{r}\mathit{}f}$-Lipschitz, then

${\left(𝔼\left[{|{\stackrel{~}{\mathcal{ℐ}}}_{M}-\mathcal{ℐ}|}^{2}\right]\right)}^{1/2}\le \frac{{C}_{{\partial }_{r}f}}{2}\frac{1}{N}\sqrt{3{\sigma }_{2}^{4}+\frac{{\sigma }_{4}^{4}}{N}}+\frac{{\sigma }_{2}}{\sqrt{NM}}\underset{x}{sup}|{\partial }_{r}f\left(x,{\varphi }_{\star }\left(x\right)\right)|+\sqrt{\frac{C}{M}}.$

## First case: f Lipschitz.

We write ${\stackrel{~}{\mathcal{ℐ}}}_{M}-\mathcal{ℐ}={\mathcal{𝒯}}_{1}+{\mathcal{𝒯}}_{2}$ with

${\mathcal{𝒯}}_{1}:=\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\overline{R}}_{N}^{\left(m\right)}\right)-\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right),$${\mathcal{𝒯}}_{2}:=\frac{1}{M}\sum _{m=1}^{M}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)-\int f\left(x,{\varphi }_{\star }\left(x\right)\right)\mu \left(x\right)d\lambda \left(x\right).$

For the first term, since f is globally Lipschitz with constant ${C}_{f}$, we have

$𝔼\left[{|{\mathcal{𝒯}}_{1}|}^{2}\right]\le {C}_{f}^{2}𝔼\left[\frac{1}{M}\sum _{m=1}^{M}{|{\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)|}^{2}\right].$

Since $\left({R}^{\left(m,k\right)}:1\le k\le N\right)$ are independent conditionally on ${X}^{\left(1:M\right)}$, we have

$𝔼\left[{\overline{R}}_{N}^{\left(m\right)}|{X}^{\left(1:M\right)}\right]={\varphi }_{\star }\left({X}^{\left(m\right)}\right)$

and

$\mathrm{Var}\left[{\overline{R}}_{N}^{\left(m\right)}\mid {X}^{\left(1:M\right)}\right]\le \frac{{\sigma }_{2}^{2}}{N}.$

Thus,

$𝔼\left[{|{\mathcal{𝒯}}_{1}|}^{2}\right]\le \frac{{C}_{f}^{2}}{N}{\sigma }_{2}^{2}.$

The second term is controlled by assumption (ii). We conclude by the Minkowski inequality.

## Second case: f smooth.

Set ${\mathcal{𝒯}}_{1}={\mathcal{𝒯}}_{1,a}+{\mathcal{𝒯}}_{1,b}$ with

${\mathcal{𝒯}}_{1,a}:=\frac{1}{M}\sum _{m=1}^{M}\left(f\left({X}^{\left(m\right)},{\overline{R}}_{N}^{\left(m\right)}\right)-f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)-{\partial }_{r}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)\left({\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)\right),$${\mathcal{𝒯}}_{1,b}:=\frac{1}{M}\sum _{m=1}^{M}{\partial }_{r}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)\left({\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right).$

A Taylor expansion gives

$|{\mathcal{𝒯}}_{1,a}|\le \frac{1}{2}{C}_{{\partial }_{r}f}\frac{1}{M}\sum _{m=1}^{M}{|{\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)|}^{2},$$𝔼\left[{|{\mathcal{𝒯}}_{1,a}|}^{2}\right]\le {\left(\frac{1}{2}{C}_{{\partial }_{r}f}\right)}^{2}\frac{1}{M}\sum _{m=1}^{M}𝔼\left[{|{\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)|}^{4}\right].$

Invoking that $\left({R}^{\left(m,k\right)}:1\le k\le N\right)$ are independent conditionally on ${X}^{\left(1:M\right)}$ leads to

$𝔼\left[{|{\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)|}^{4}|{X}^{\left(1:M\right)}\right]\le 3{\sigma }_{2}^{4}\frac{N-1}{{N}^{3}}+{\sigma }_{4}^{4}\frac{1}{{N}^{3}}.$

Moreover, upon noting that for $m\ne {m}^{\prime }$,

$𝔼\left[\left({\partial }_{r}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)\left({\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)\right)\left({\partial }_{r}f\left({X}^{\left({m}^{\prime }\right)},{\varphi }_{\star }\left({X}^{\left({m}^{\prime }\right)}\right)\right)\left({\overline{R}}_{N}^{\left({m}^{\prime }\right)}-{\varphi }_{\star }\left({X}^{\left({m}^{\prime }\right)}\right)\right)\right)\right]=0,$

we have

$𝔼\left[|{\mathcal{𝒯}}_{1,b}{|}^{2}\right]=𝔼\left[\frac{1}{{M}^{2}}\sum _{m=1}^{M}{|{\partial }_{r}f\left({X}^{\left(m\right)},{\varphi }_{\star }\left({X}^{\left(m\right)}\right)\right)|}^{2}𝔼\left[{|{\overline{R}}_{N}^{\left(m\right)}-{\varphi }_{\star }\left({X}^{\left(m\right)}\right)|}^{2}|{X}^{\left(1:M\right)}\right]\right]$$\le \frac{{sup}_{x}{|{\partial }_{r}f\left(x,{\varphi }_{\star }\left(x\right)\right)|}^{2}}{M}\frac{{\sigma }_{2}^{2}}{N}.$

This concludes the proof. ∎
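The two-stage estimator ${\stackrel{~}{\mathcal{ℐ}}}_{M}$ analyzed above can be sketched in a few lines: for each outer state ${X}^{\left(m\right)}$, average N conditionally i.i.d. inner draws, then plug the mean ${\overline{R}}_{N}^{\left(m\right)}$ into f. The toy model below (a uniform outer design standing in for the MCMC samples, ${\varphi }_{\star }\left(x\right)={x}^{2}$, Gaussian inner noise, and the smooth choice $f\left(x,r\right)={\left(x+r\right)}^{2}$) is purely an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 500, 64                              # outer and inner sample sizes
X = rng.uniform(0.0, 1.0, size=M)           # stand-in for the MCMC design X^(1:M)
phi_star = lambda x: x ** 2                 # conditional expectation phi_star

# Inner draws R^(m,k), conditionally i.i.d. given X^(m), with bounded
# second and fourth moments as in assumption (i).
R = phi_star(X)[:, None] + 0.2 * rng.standard_normal((M, N))
R_bar = R.mean(axis=1)                      # inner empirical means bar R_N^(m)

f = lambda x, r: (x + r) ** 2               # smooth in the second variable
I_tilde = np.mean(f(X, R_bar))              # two-stage estimator (as in Algorithm 3)

# For this toy model, I = E[(U + U^2)^2] = 31/30 for U ~ Uniform(0,1).
print(abs(I_tilde - 31 / 30) < 0.2)
```

Because f is smooth here, the inner bias contributes at order $1/N$ (it is $\mathrm{Var}\left({\overline{R}}_{N}^{\left(m\right)}\right)={\sigma }^{2}/N$ in this quadratic example), matching the second bound of the proposition.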

## References

• [1] Baraud Y., Comte F. and Viennet G., Adaptive estimation in autoregression or β-mixing regression via model selection, Ann. Statist. 29 (2001), no. 3, 839–875.

• [2] Belomestny D., Kolodko A. and Schoenmakers J., Regression methods for stochastic control problems and their convergence analysis, SIAM J. Control Optim. 48 (2010), no. 5, 3562–3588.

• [3] Blanchet J. and Lam H., State-dependent importance sampling for rare event simulation: An overview and recent advances, Surv. Oper. Res. Manag. Sci. 17 (2012), 38–59.

• [4] Broadie M., Du Y. and Moallemi C. C., Risk estimation via regression, Oper. Res. 63 (2015), no. 5, 1077–1097.

• [5] Delattre S. and Gaïffas S., Nonparametric regression with martingale increment errors, Stochastic Process. Appl. 121 (2011), 2899–2924.

• [6] Devineau L. and Loisel S., Construction d’un algorithme d’accélération de la méthode des “simulations dans les simulations” pour le calcul du capital économique Solvabilité II, Bull. Français d’Actuariat 10 (2009), no. 17, 188–221.

• [7] Douc R., Fort G., Moulines E. and Soulier P., Practical drift conditions for subgeometric rates of convergence, Ann. Appl. Probab. 14 (2004), no. 3, 1353–1377.

• [8] Egloff D., Monte Carlo algorithms for optimal stopping and statistical learning, Ann. Appl. Probab. 15 (2005), 1396–1432.

• [9] Fort G. and Moulines E., Convergence of the Monte Carlo expectation maximization for curved exponential families, Ann. Statist. 31 (2003), no. 4, 1220–1259.

• [10] Fort G. and Moulines E., Polynomial ergodicity of Markov transition kernels, Stochastic Process. Appl. 103 (2003), no. 1, 57–99.

• [11] Gobet E. and Liu G., Rare event simulation using reversible shaking transformations, SIAM J. Sci. Comput. 37 (2015), no. 5, A2295–A2316.

• [12] Gobet E. and Turkedjiev P., Linear regression MDP scheme for discrete backward stochastic differential equations under general conditions, Math. Comp. 85 (2016), no. 299, 1359–1391.

• [13] Gordy M. B. and Juneja S., Nested simulation in portfolio risk measurement, Manag. Sci. 56 (2010), no. 10, 1833–1848.

• [14] Györfi L., Kohler M., Krzyżak A. and Walk H., A Distribution-Free Theory of Nonparametric Regression, Springer Ser. Statist., Springer, New York, 2002.

• [15] Hong L. J. and Juneja S., Estimating the mean of a non-linear function of conditional expectation, Proceedings of the 2009 Winter Simulation Conference (WSC), IEEE Press, Piscataway (2009), 1223–1236.

• [16] Lemor J.-P., Gobet E. and Warin X., Rate of convergence of an empirical regression method for solving generalized backward stochastic differential equations, Bernoulli 12 (2006), no. 5, 889–916.

• [17] Liu M. and Staum J., Stochastic kriging for efficient nested simulation of expected shortfall, J. Risk 12 (2010), no. 3, 3–27.

• [18] Longstaff F. and Schwartz E. S., Valuing American options by simulation: A simple least squares approach, Rev. Financ. Stud. 14 (2001), 113–147.

• [19] McNeil A. J., Frey R. and Embrechts P., Quantitative Risk Management, Princeton Ser. Finance, Princeton University Press, Princeton, 2005.

• [20] Meyn S. P. and Tweedie R. L., Markov Chains and Stochastic Stability, Springer, Berlin, 1993.

• [21] Ren Q. and Mojirsheibani M., A note on nonparametric regression with β-mixing sequences, Comm. Statist. Theory Methods 39 (2010), no. 12, 2280–2287.

• [22] Rosenthal J. S., Optimal Proposal Distributions and Adaptive MCMC, Chapman & Hall/CRC Handb. Mod. Stat. Methods, CRC Press, Boca Raton, 2008.

• [23] Rubinstein R. Y. and Kroese D. P., Simulation and the Monte Carlo Method, 2nd ed., Wiley Ser. Probab. Stat., John Wiley & Sons, Hoboken, 2008.

• [24] Tsitsiklis J. N. and Van Roy B., Regression methods for pricing complex American-style options, IEEE Trans. Neural Netw. 12 (2001), no. 4, 694–703.

## About the article

Accepted: 2017-01-19

Published Online: 2017-02-03

Published in Print: 2017-03-01

Funding Source: Agence Nationale de la Recherche

Award identifier / Grant number: ANR-15-CE05-0024

The second author’s research is part of the Chair Financial Risks of the Risk Foundation, the Finance for Energy Market Research Centre and the ANR project CAESARS (ANR-15-CE05-0024).

Citation Information: Monte Carlo Methods and Applications, Volume 23, Issue 1, Pages 21–42, ISSN (Online) 1569-3961, ISSN (Print) 0929-9629.
© 2017 by De Gruyter.