Gilles Pagès
Numerical
Probability
An Introduction with Applications to
Finance
Universitext
Series editors
Sheldon Axler
San Francisco State University
Carles Casacuberta
Universitat de Barcelona
Angus MacIntyre
Queen Mary, University of London
Kenneth Ribet
University of California, Berkeley
Claude Sabbah
École polytechnique, CNRS, Université Paris-Saclay, Palaiseau
Endre Süli
University of Oxford
Wojbor A. Woyczyński
Case Western Reserve University
Thus as research topics trickle down into graduate-level teaching, first textbooks
written for new, cutting-edge courses may make their way into Universitext.
Gilles Pagès
Laboratoire de Probabilités,
Statistique et Modélisation
Sorbonne Université
Paris
France
Mathematics Subject Classification (2010): 65C05, 91G60, 65C30, 68U20, 60H35, 62L15, 60G40,
62L20, 91B25, 91G20
This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Julie, Romain and Nicolas
Preface
Once these biased methods have been designed, it remains to devise a "bias
chase". This is the objective of Chap. 9, entirely devoted to recent developments in
the multilevel paradigm, with or without weights, where we successively consider
"smooth" and "non-smooth" frameworks.
Beyond these questions directly connected to Monte Carlo simulation, three
chapters deal with less classical aspects: Chap. 6 investigates recursive stochastic
approximation “à la Robbins–Monro”, viewed as a tool for implicitation or cali-
bration but also for risk measure computation (quantiles, VaR). The quasi-Monte
Carlo method, its theoretical foundation, and its limitations, are developed and
analyzed in Chap. 4, supported by plenty of practical advice to practitioners (not
only for the celebrated Sobol sequences). Finally, we dedicate a chapter (Chap. 5)
to optimal vector quantization-based cubature formulas for the fast computation of
expectations in medium dimensions (say up to 10) with possible applications to a
universal stratified sampling variance reduction method.
In the – last – Chap. 12, we present a nonlinear problem (compared to expectation
computation), optimal stopping theory, mostly in a Markovian framework.
Time discretization schemes of the so-called Snell envelope are briefly presented,
corresponding, in the finance of derivative products, to the approximation of
American-style options by Bermudan options, which can only be exercised at
discrete time instants up to maturity. Due to the nonlinear Backward Dynamic Programming
Principle, such a discretization is not enough to devise implementable algorithms.
A spatial discretization is necessary: To this end, we present a detailed analysis of
regression methods (“à la Longstaff–Schwartz”) and of the quantization tree
approach, both developed during the decade 1990–2000.
Each chapter has been written in such a way that, as far as possible, it can be
read independently of the former chapters. However, some remain
connected (in particular Chaps. 7 and 8) and the reading of Chap. 1, devoted to
basic simulation, or the beginning of Chap. 2, which presents Monte Carlo simu-
lation and the crucial notion of confidence interval, remains mandatory to benefit
from the whole book. Within each chapter, we have purposefully scattered various
exercises, in direct connection with their environment, either of a theoretical or
applied nature (simulations).
We have added a large bibliography (more than 250 articles or books) with two
objectives: first, to refer to seminal, or at least important, papers on a topic developed
in the book and, secondly, to propose further readings which complete partially
tackled topics: for instance, in Chap. 12, most efforts are focused on
numerical methods for the pricing of American options. For the adaptation to the
computation of the δ-hedge or the extension to swing options (such as take-or-pay
contracts), we only provide references in order to keep the size of the
book within a reasonable limit.
The mathematical prerequisites to successfully approach the reading or the
consultation of this book are a solid foundation in the theory of
Lebesgue integration (including σ-fields and measurability), Probability and
Measure Theory (see among others [44, 52, 154]). The reader is also assumed to be
familiar with discrete- and continuous-time martingale theory with a focus on
stochastic calculus, at least in a Brownian framework (see [183] for an introduction
or [251], [162, 163] for more advanced textbooks). A basic background on discrete-
and continuous-time Markov processes is also requested, but we will not go far
beyond their very definitions in our proofs. As for programming ability, we leave to
the reader the choice of his/her favorite language. Algorithms are always described
in a meta-language. Of course, experts in high-performance computing and paral-
lelized programming (multi-core, CUDA, etc.) are invited to take advantage of their
expertise to convince themselves of the high compatibility of Monte Carlo simu-
lation with these modern technologies.
Of course, we would like to mention that, though most of our examples are
drawn from or inspired by the pricing and hedging of derivative products, the field of
application of the presented methods is far larger. Thus, stratified sampling or
importance sampling are, to the best of our knowledge, not often used by quants,
structurers and traders but, given their global importance in Monte Carlo simulation,
we decided to present them in detail (with some clues on how to apply them in
the finance of derivative products). By contrast, to limit the size of the manuscript,
we refrained from presenting more advanced and important notions such as the
discretization and simulation of Stochastic Differential Equations with jumps, typically
driven by Lévy processes. However, we hope that, having assimilated the Brownian
case, our readers will be well prepared to successfully immerse themselves in the
huge literature, from which we selected the most significant references for further
reading.
The book contains more than 150 exercises, some theoretical, others devoted to
simulation of numerical experiments. The exercises are distributed over the chap-
ters, usually close to the theoretical results necessary to solve them. Several of them
are accompanied by hints.
To conclude, I wish to thank those who preceded me in teaching this course at
Université Pierre et Marie Curie: Bruno Bouchard, Benjamin Jourdain and
Bernard Lapeyre, but also Damien Lamberton and Annie Millet, who taught similar
courses in other universities in Paris. Their approaches inspired me on several
occasions. I would like to thank the students of the Master 2 "Probabilités et
Finance" course – which I have taught since the academic year 2007–08 – who had
access to the successive versions of the manuscript in its evolutionary phases: many
of their questions, criticisms and suggestions helped improve it. I would like to
express my special gratitude to one of them, Emilio Saïd, for his enthusiastic
involvement in this task. I am grateful to my colleagues Fabienne Comte, Sylvain
Corlay, Noufel Frikha, Daphné Giorgi, Damien Lamberton, Yating Liu, Sophie
Laruelle, Vincent Lemaire, Harald Luschgy, Thibaut Montes, Fabien Panloup,
Victor Reutenauer and Abass Sagna, who volunteered to read carefully some
chapters at different stages of drafting. Of course, all remaining errors are mine.
I also have widely benefited from many illustrations and simulations provided by
Notation
⊳ General Notation
• λ_d denotes the Lebesgue measure on (R^d, Bor(R^d)), where Bor(R^d) denotes the
σ-field of Borel subsets of R^d.
• Let p ∈ (0, +∞). 𝓛^p_{R^d}(Ω, A, μ) = {f : (Ω, A) → R^d s.t. ∫_Ω |f|^p dμ < +∞},
where μ is a non-negative measure (changing the canonical Euclidean norm | · | on
R^d for another norm has no impact on this space but does have an impact on the
value of the integral itself). L^p_{R^d}(Ω, A, μ) denotes the set of equivalence classes
of 𝓛^p_{R^d}(Ω, A, μ) with respect to μ-a.e. equality. We write
‖f‖_p = (∫_Ω |f|^p dμ)^{1/p}.
• P_X or P ∘ X^{-1} (or L(X)) denotes the distribution (or law) on (R^d, Bor(R^d)) of a
random vector X : (Ω, A, P) → R^d, but also of an R^d-valued stochastic process
X = (X_t)_{t∈[0,T]}.
• Let Ω be a set. If C ⊂ P(Ω), σ(C) denotes the σ-field spanned by C or,
equivalently, the smallest σ-field on Ω which contains C.
• The cumulative distribution function (cdf) of a random variable X, usually
denoted F_X or Φ_X, is defined for every x ∈ R by F_X(x) = P(X ≤ x).
However, this naive and abstract definition is not satisfactory because the "scenario"
ω ∈ Ω may not be a "good" one, i.e. not a "generic" one. Many probabilistic
properties (like the law of large numbers, to quote the most basic one) are only
satisfied P-a.s. Thus, if ω lies in the negligible set that does not satisfy one of them,
the induced sequence will not be "admissible".
In any case, one usually cannot have access to an i.i.d. sequence of random
variables (U_n) with distribution U([0, 1])! Any physical device producing such a
sequence would be too slow and unreliable. The work of logicians like Martin-Löf
leads to the conclusion that, roughly speaking, a sequence (x_n) that can be generated
by an algorithm cannot be considered a "random" one. Thus the digits of
the real number π are not random in that sense. This is quite embarrassing since an
essential requested feature for such sequences is to be generated almost instantly on
a computer!
The approach coming from computer and algorithmic science is not much
more tractable, since there the definition of a sequence of random numbers is that the
complexity of the algorithm generating its first n terms behaves like O(n). The rapidly
growing need for good (pseudo-)random sequences that came with the explosion of
Monte Carlo simulation in many fields of Science and Technology after World War II
(we refer not only to neutronics) led to the adoption of a more pragmatic approach
– say heuristic – based on statistical tests. The idea is to submit some sequences to
statistical tests (uniform distribution, block non-correlation, rank tests, etc.).
© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_1
1 Simulation of Random Variables
For practical implementation, such sequences are finite, as is the accuracy of computers.
One considers sequences (x_n) of so-called pseudo-random numbers defined by

x_n = y_n / N,  y_n ∈ {0, …, N − 1}.
One classical process is to generate the integers y_n by a congruential induction

y_{n+1} ≡ a y_n + b mod N,

where a and b are fixed integers. (Here φ denotes the Euler totient function:

φ(N) = N ∏_{p | N, p prime} (1 − 1/p). )
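The congruential induction above translates directly into code. A minimal sketch (the helper name is ours); the default constants are those quoted below for the FORTRAN IMSL generator, N = 2^31 − 1 and a = 7^5 with b = 0, used here purely as an illustration:

```python
def lcg(n, y0=1, a=7 ** 5, b=0, N=2 ** 31 - 1):
    """n pseudo-random numbers x_k = y_k / N obtained from the
    congruential induction y_{k+1} = (a * y_k + b) mod N."""
    xs, y = [], y0
    for _ in range(n):
        y = (a * y + b) % N
        xs.append(y / N)
    return xs
```

The sequence is of course entirely determined by the seed y_0, which is precisely why such outputs are only *pseudo*-random.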
(b) If p = 2, α ≥ 3, then (Z/NZ)* (whose cardinality is 2^{α−1} = N/2) is not cyclic. The
maximal period is then τ = 2^{α−2}, with a ≡ ±3 mod 8. (If α = 2 (N = 4), the group
of size 2 is trivially cyclic!)
(c) If p ≠ 2, then (Z/NZ)* (whose cardinality is p^{α−1}(p − 1)) is cyclic, hence
τ = p^{α−1}(p − 1). It is generated by any element a whose class ā in (Z/pZ) spans
the cyclic group ((Z/pZ)*, ×).
What does this theorem say in connection with our pseudo-random number gen-
eration problem?
First some very good news: when N is a prime number, the group ((Z/NZ)*, ×)
is cyclic, i.e. there exists an a ∈ {1, …, N − 1} such that (Z/NZ)* = {ā^n, 1 ≤ n ≤
N − 1}. The bad news is that we do not know which a satisfies this property (not all do)
and, even worse, we do not know how to find any. Thus, if p = 7, φ(7) = 7 − 1 = 6
and o(3) = o(5) = 6, but o(2) = o(4) = 3 (which divides 6) and o(6) = 2 (which
again divides 6).
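The orders quoted for p = 7 can be checked by direct computation; a small sketch (the helper name is ours):

```python
def mult_order(a, n):
    """Order o(a) of a in the multiplicative group (Z/nZ)*: the smallest
    k >= 1 such that a^k = 1 mod n (a must be coprime with n)."""
    k, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        k += 1
    return k
```

For instance, mult_order(3, 7) returns 6, so 3 generates ((Z/7Z)*, ×), whereas 2, 4 and 6 do not.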
The second bad news is that a maximal period, though a necessary property
of a sequence (y_n)_n, provides no guarantee or even clue that (x_n)_n is a good sequence
of pseudo-random numbers! Thus, the (homogeneous) generator of the FORTRAN
IMSL library does not fit into the formerly described setting: one sets N := 2^31 − 1 =
2 147 483 647 (a Mersenne prime, see below, discovered by Leonhard Euler
in 1772), a := 7^5 and b := 0 (a ≢ ±3 mod 8), the point being that the period of 7^5 is
not maximal.
Another approach to random number generation is based on a shift register and
relies upon the theory of finite fields.
At this stage, a sequence must pass successfully through various statistical tests,
keeping in mind that such a sequence is finite by construction and consequently
cannot satisfy asymptotically such common properties as the Law of Large Numbers,
the Central Limit Theorem or the Law of the Iterated Logarithm (see the next chapter).
Dedicated statistical toolboxes like DIEHARD (Marsaglia, 1998) have been devised
to test and “certify” sequences of pseudo-random numbers.
The aim of this introductory section is just to give the reader the flavor of pseudo-random
number generation, but in no case do we recommend the specific use of any
of the above generators, nor do we discuss the virtues of this or that generator.
For more recent developments on random number generators (like shift register
generators, for example), we refer e.g. to [77, 219]. Nevertheless, let us mention the Mersenne
twister generators. This family of generators was introduced in 1997 by Makoto
Matsumoto and Takuji Nishimura in [211]. The first level of Mersenne Twister Generators
(denoted by MT-p) are congruential generators whose period N_p is a Mersenne
prime, i.e. a prime number of the form N_p = 2^p − 1 where p is itself prime. The
most popular and now worldwide implemented generator is the MT-19937, owing
to its unrivaled period 2^19937 − 1 ≈ 10^6000 (since 2^10 ≈ 10^3). It can simulate a uniform
distribution in 623 dimensions (i.e. on [0, 1]^623). A second "shuffling" device
improves it further. For recent improvements and their implementations in C++,
see
www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
P_X = λ_[0,1] ∘ ϕ^{-1},

where λ_[0,1] ∘ ϕ^{-1} denotes the image measure by ϕ of the Lebesgue measure λ_[0,1]
on the unit interval.
We will admit this theorem. For a proof, we refer to [48] (Theorem A.3.1, p. 38).
It also appears as a “building block” in the proof of the Skorokhod representation
theorem for weakly converging sequences of random variables having values in a
Polish space.
As a consequence this means that, if U denotes any random variable with uniform
distribution on (0, 1) defined on a probability space (Ω, A, P), then

X =(d) ϕ(U).
The interpretation is that any E-valued random variable can be simulated from
a uniform distribution…provided the function ϕ can be computed (at a reasonable
computational cost, i.e. in algorithmic terms with reasonable complexity). If this is the
case, the yield of the simulation is 1 since every (pseudo-)random number u ∈ [0, 1]
produces a PX -distributed random number. Except in very special situations (see
below), this result turns out to be of purely theoretical interest and is of little help for
1.2 The Fundamental Principle of Simulation
The function F is always non-decreasing, "càdlàg" (a French acronym for "right continuous
with left limits", which we shall adopt in what follows) and lim_{x→+∞} F(x) =
1, lim_{x→−∞} F(x) = 0.
One can always associate to F its canonical left inverse F_l^{-1}, defined on the open
unit interval (0, 1) by

∀ u ∈ (0, 1),  F_l^{-1}(u) = inf{x ∈ R : F(x) ≥ u}.
{X ≤ x} = {F_l^{-1}(U) ≤ x} = {U ≤ F(x)}
Remarks. • If F is increasing and continuous on the real line, then F has an inverse
function, denoted by F^{-1} and defined on (0, 1) (also increasing and continuous), such that
F ∘ F^{-1} = Id_{(0,1)} and F^{-1} ∘ F = Id_R. Clearly F^{-1} = F_l^{-1} by the very definition
of F_l^{-1}. The above proof can be made even more straightforward since the event
{F^{-1}(U) ≤ x} = {U ≤ F(x)} by simple left composition of F^{-1} by the increasing
function F.
F(x) = P(F(x) > U) ≤ P(F_r^{-1}(U) ≤ x) ≤ P(F(x) ≥ U) = F(x),

so that P(F_r^{-1}(U) ≤ x) = P(X ≤ x) = F(x).
When X takes finitely many values in R, we will see in Example 4 below that this
simulation method corresponds to the standard simulation method of such random
variables.
Exercise. (a) Show that, for every u ∈ (0, 1),

F(F_l^{-1}(u)−) ≤ u ≤ F ∘ F_l^{-1}(u),

so that if F is continuous (or equivalently μ has no atom: μ({x}) = 0 for every x ∈ R),
then F ∘ F_l^{-1} = Id_{(0,1)}.
(b) Show that if F is continuous, then F(X) =(d) U([0, 1]).
(c) Show that if F is (strictly) increasing, then F_l^{-1} is continuous and F_l^{-1} ∘ F = Id_R.
(d) One defines the survival function of μ by F̄(x) = 1 − F(x) = μ((x, +∞)),
x ∈ R. One defines the canonical right inverse of F̄ by

∀ u ∈ (0, 1),  F̄_r^{-1}(u) = inf{x : F̄(x) ≤ u}.

Show that F̄_r^{-1}(u) = F_l^{-1}(1 − u). Deduce that F̄_r^{-1} is right continuous on (0, 1)
and that F̄_r^{-1}(U) has distribution μ. Define F̄_l^{-1} and show that F̄_l^{-1}(U) has distribution
μ. Finally, establish for F̄_r^{-1} similar properties to (a)–(b)–(c).
Consequently, for every u ∈ (0, 1), F_X^{-1}(u) = − log(1 − u)/λ. Now, using that
1 − U =(d) U if U =(d) U((0, 1)) yields

X = − log(U)/λ =(d) E(λ).
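The two lines above translate verbatim into a simulation routine; a minimal sketch (helper names are ours):

```python
import math
import random

def exp_inverse_cdf(lam, u):
    """Inverse distribution function of E(lam) combined with 1-U =(d) U:
    x = -log(u)/lam."""
    return -math.log(u) / lam

def simulate_exponential(lam, n, seed=0):
    """n i.i.d. E(lam)-distributed numbers via the inverse distribution
    function method; the yield is 1 (one uniform per output)."""
    rng = random.Random(seed)
    return [exp_inverse_cdf(lam, rng.random()) for _ in range(n)]
```

With λ = 2, the empirical mean of a large sample should be close to 1/λ = 0.5, in line with the law of large numbers discussed in the next chapter.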
2. Simulation of a Cauchy(c), c > 0, distribution. We know that

P_X(dx) = (c / (π(x² + c²))) dx.

∀ x ∈ R,  F_X(x) = ∫_{−∞}^{x} c du / (π(u² + c²)) = (1/π) Arctan(x/c) + 1/2.

Hence F_X^{-1}(u) = c tan(π(u − 1/2)). It follows that

X = c tan(π(U − 1/2)) =(d) Cauchy(c).
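The closed-form pair (F_X, F_X^{-1}) above can be coded and checked by a round trip (helper names are ours):

```python
import math

def cauchy_cdf(x, c):
    """F_X(x) = (1/pi) * Arctan(x/c) + 1/2 for the Cauchy(c) distribution."""
    return math.atan(x / c) / math.pi + 0.5

def cauchy_inverse_cdf(u, c):
    """F_X^{-1}(u) = c * tan(pi*(u - 1/2)); applied to a U(0,1) number u
    it produces a Cauchy(c)-distributed number."""
    return c * math.tan(math.pi * (u - 0.5))
```

In particular F_X^{-1}(3/4) = c, the upper quartile of the Cauchy(c) distribution.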
X = U^{−1/θ} =(d) Pareto(θ).
∀ u ∈ (0, 1),  F_{X,l}^{-1}(u) = Σ_{k=1}^{N} x_k 1_{{p_1+···+p_{k−1} < u ≤ p_1+···+p_k}}

so that

X =(d) Σ_{k=1}^{N} x_k 1_{{p_1+···+p_{k−1} < U ≤ p_1+···+p_k}}.
The yield of the procedure is still r = 1. However, when implemented naively, its
complexity – which corresponds to (at most) N comparisons for every simulation –
may be quite high. See [77] for some considerations (in the spirit of quicksort
algorithms) which lead to an O(log N) complexity. Furthermore, this procedure presupposes
that one has access to the probability weights p_k with arbitrary accuracy. This is
not always the case, even in a priori simple situations, as emphasized in Example 6
below.
It should be observed of course that the above simulation formula is still appropriate
for a random variable taking values in any set E, not only for subsets of R!
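The dichotomic-search idea alluded to above can be sketched with the standard bisect module on precomputed cumulative weights (helper names are ours; this is only one way to reach an O(log N) cost per draw, not necessarily the construction of [77]):

```python
import bisect
import itertools

def make_discrete_sampler(values, probs):
    """Inverse distribution function for P(X = values[k]) = probs[k]:
    locate u in the cumulative partition (p_1+...+p_{k-1}, p_1+...+p_k]
    by bisection on the precomputed cumulative sums."""
    cum = list(itertools.accumulate(probs))
    def sampler(u):
        # smallest k with u <= p_1 + ... + p_k
        return values[bisect.bisect_left(cum, u)]
    return sampler
```

Feeding sampler with i.i.d. U(0, 1) numbers then produces i.i.d. draws from the discrete distribution, still with yield r = 1.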
5. Simulation of a Bernoulli random variable B( p), p ∈ (0, 1). This is the simplest
application of the previous method since
X = 1_{{U ≤ p}} =(d) B(p),
X = Σ_{k=1}^{n} 1_{{U_k ≤ p}} =(d) B(n, p),
where U1 , . . . , Un are i.i.d. random variables, uniformly distributed over [0, 1].
Note that this procedure has a very bad yield, namely r = 1/n. Moreover, it needs
n comparisons like the standard method (without any shortcut).
Why not use the above standard method for a random variable taking finitely
many values? Because the cost of the computation of the probability weights pk is
much too high as n grows.
7. Simulation of geometric random variables G( p) and G ∗ ( p), p ∈ (0, 1). This
is the distribution of the first success instant when independently repeating the
same Bernoulli experiment with parameter p. Conventionally, G( p) starts at time 0
whereas G ∗ ( p) starts at time 1.
1.3 The (Inverse) Distribution Function Method
τ* := min{k ≥ 1 : X_k = 1} =(d) G*(p)

and

τ := min{k ≥ 0 : X_k = 1} =(d) G(p).
Hence

P(τ* = k) = p(1 − p)^{k−1},  k ∈ N* := {1, 2, …, n, …}

and

P(τ = k) = p(1 − p)^{k},  k ∈ N := {0, 1, 2, …, n, …}

(so that both random variables are P-a.s. finite since one has Σ_{k≥0} P(τ = k) =
Σ_{k≥1} P(τ* = k) = 1). Note that τ + 1 has the same G*(p)-distribution as τ*.
The (random) yields of the above two procedures are r = 1/τ* and r = 1/(τ + 1), respectively.
Show that
Compute the mean yield of its (natural and straightforward) simulation method.
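The first-success definition translates directly into a simulation loop (the helper name is ours):

```python
import random

def simulate_geometric_star(p, rng):
    """tau* = min{k >= 1 : X_k = 1} for i.i.d. X_k = 1_{U_k <= p}:
    repeat independent Bernoulli(p) trials until the first success,
    so that tau* has the G*(p) distribution P(tau* = k) = p(1-p)^{k-1}."""
    k = 1
    while rng.random() > p:
        k += 1
    return k
```

The empirical mean of a large sample should be close to E τ* = 1/p, which also quantifies the (mean) cost of one draw.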
This method is due to Von Neumann (1951). It is contemporary with the development
of computers and of the Monte Carlo method. The original motivation was to devise
a simulation method for a probability distribution ν on a measurable space (E, E),
absolutely continuous with respect to a σ-finite non-negative measure μ with a density
given, up to a constant, by f : (E, E) → R+ when we know that f is dominated by
a probability distribution g · μ which can be simulated at “low computational cost”.
(Note that ν = (f / ∫_E f dμ) · μ.)
In most elementary applications (see below), E is either a Borel set of Rd equipped
with its Borel σ-field and μ is the trace of the Lebesgue measure on E or a subset
E ⊂ Zd equipped with the counting measure.
Let us be more precise. So, let μ be a non-negative σ-finite measure on (E, E)
and let f, g : (E, E) → R_+ be two Borel functions. Assume that f ∈ L¹_{R_+}(μ) with
∫_E f dμ > 0 and that g is a probability density with respect to μ satisfying furthermore
g > 0 μ-a.e., and that there exists a positive real constant c > 0 such that

f ≤ c g  μ-a.e.

Note that this implies c ≥ ∫_E f dμ. As mentioned above, the aim of this section is to
show how to simulate some random numbers distributed according to the probability
distribution

ν = (f / ∫_{R^d} f dμ) · μ

c > ∫_{R^d} f dμ.
As a first (not so) preliminary step, we will explore a natural connection (in
distribution) between an E-valued random variable X with distribution ν and an
E-valued random variable Y with distribution g · μ. We will see that the key idea
is completely elementary. Let h : (E, E) → R be a test function (measurable and
bounded or non-negative). On the one hand,

E h(X) = (1 / ∫_E f dμ) ∫_E h(x) f(x) μ(dx)
       = (1 / ∫_E f dμ) ∫_E h(y) (f/g)(y) g(y) μ(dy)   (since g > 0 μ-a.e.)
       = (1 / ∫_E f dμ) E[h(Y) (f/g)(Y)].
We can also forget about the last line, stay on the state space E and note in a somewhat
artificial way that

E h(X) = (c / ∫_E f dμ) ∫_E h(y) (∫_0^1 1_{{u ≤ (1/c)(f/g)(y)}} du) g(y) μ(dy)
       = (c / ∫_E f dμ) ∫_E ∫_0^1 h(y) 1_{{u ≤ (1/c)(f/g)(y)}} g(y) μ(dy) du
       = (c / ∫_E f dμ) E[h(Y) 1_{{c U g(Y) ≤ f(Y)}}].
Remark. The (random) yield of the method is 1/τ. Hence, we know that its mean yield
is given by

E(1/τ) = −(p log p)/(1 − p) = (∫_E f dμ / (c − ∫_E f dμ)) log(c / ∫_E f dμ).

Since lim_{p→1} −(p log p)/(1 − p) = 1, the closer the constant c is to ∫_E f dμ, the
higher the yield of the simulation.
E[h(Y) 1_{{c U g(Y) ≤ f(Y)}}] = ∫_{E×[0,1]} h(y) 1_{{c u g(y) ≤ f(y)}} g(y) μ(dy) ⊗ du
 = ∫_E (∫_{[0,1]} 1_{{c u g(y) ≤ f(y)} ∩ {g(y) > 0}} du) h(y) g(y) μ(dy)
 = ∫_E h(y) (∫_0^1 1_{{u ≤ f(y)/(c g(y))} ∩ {g(y) > 0}} du) g(y) μ(dy)
 = ∫_{{g(y)>0}} h(y) (f(y)/(c g(y))) g(y) μ(dy)
 = (1/c) ∫_E h(y) f(y) μ(dy),

where we used successively Fubini's Theorem, g(y) > 0 μ(dy)-a.e., Fubini's Theorem
again and f(y)/(c g(y)) ≤ 1 μ(dy)-a.e. Note that we can apply Fubini's Theorem since
the reference measure μ is σ-finite.
Putting h ≡ 1 yields

c = ∫_E f dμ / P(c U g(Y) ≤ f(Y)).
1.4 The Acceptance-Rejection Method
E[h(Y) | {c U g(Y) ≤ f(Y)}] = ∫_E h(y) (f(y) / ∫_E f dμ) μ(dy) = ∫_E h(y) ν(dy),

i.e.

L(Y | {c U g(Y) ≤ f(Y)}) = ν.
= 1 × ∫_E h(y) ν(dy)
= ∫_E h(y) ν(dy). ♦
and

τ_{n+1} := min{k ≥ τ_n + 1 : c U_k g(Y_k) ≤ f(Y_k)}.

(a) The sequence (τ_n − τ_{n−1})_{n≥1} (with the convention τ_0 = 0) is i.i.d. with distribution
G*(p), and the sequence

X_n := Y_{τ_n}

ρ_n = n/τ_n → p  a.s.  as n → +∞.
Σ_{k=1}^{n} 1_{{c U_k g(Y_k) ≤ f(Y_k)}} h(Y_k) / Σ_{k=1}^{n} 1_{{c U_k g(Y_k) ≤ f(Y_k)}}.

Note that

(Σ_{k=1}^{n} 1_{{c U_k g(Y_k) ≤ f(Y_k)}} h(Y_k)) / (Σ_{k=1}^{n} 1_{{c U_k g(Y_k) ≤ f(Y_k)}})
 = ((1/n) Σ_{k=1}^{n} 1_{{c U_k g(Y_k) ≤ f(Y_k)}} h(Y_k)) / ((1/n) Σ_{k=1}^{n} 1_{{c U_k g(Y_k) ≤ f(Y_k)}}),  n ≥ 1.
Hence, owing to the strong Law of Large Numbers (see the next chapter if necessary),
this quantity a.s. converges, as n goes to infinity, toward

(∫_E ∫_0^1 1_{{c u g(y) ≤ f(y)}} h(y) g(y) du μ(dy)) / (∫_E ∫_0^1 1_{{c u g(y) ≤ f(y)}} g(y) du μ(dy))
 = (∫_E h(y) (f(y)/c) μ(dy)) / (∫_E (f(y)/c) μ(dy))
 = ∫_E h(y) (f(y) / ∫_{R^d} f dμ) μ(dy)
 = ∫_E h(y) ν(dy).
This third way to present the same computations shows that in terms of practical
implementation, this method is in fact very elementary.
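Indeed, a generic implementation fits in a few lines (function names are ours; the target used in the usage note below — density f(x) = 2x on [0, 1] with respect to Lebesgue measure, dominated by the uniform density g ≡ 1 with c = 2 — is an illustrative choice of ours, not the book's):

```python
import random

def acceptance_rejection(f, g_sampler, g_density, c, rng):
    """Von Neumann's method: draw Y with density g and U ~ U(0,1)
    repeatedly until c*U*g(Y) <= f(Y); the accepted Y is distributed
    according to nu = (f / integral of f dmu) . mu."""
    while True:
        y = g_sampler(rng)
        u = rng.random()
        if c * u * g_density(y) <= f(y):
            return y
```

With f(x) = 2x, g ≡ 1 and c = 2, the acceptance probability is p = ∫f dμ / c = 1/2, and the empirical mean of the accepted draws should approach ∫ x · 2x dx = 2/3.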
Classical applications
Uniform distributions on a bounded domain D ⊂ Rd .
Let D ⊂ a + [−M, M]^d with λ_d(D) > 0 (where a ∈ R^d and M > 0), let
Y =(d) U(a + [−M, M]^d) and let τ_D := min{n : Y_n ∈ D}, where (Y_n)_{n≥1} is an i.i.d.
sequence defined on a probability space (Ω, A, P) with the same distribution as
Y. Then

Y_{τ_D} =(d) U(D),
and

f(x) = 1_D(x) ≤ (2M)^d g(x),  with c := (2M)^d,

so that (f / ∫_{R^d} f dμ) · μ = U(D).
As a matter of fact, with the notations of the above proposition,

τ = min{k ≥ 1 : c U_k ≤ (f/g)(Y_k)}.
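This classical application can be sketched as follows, with the closed unit disc of R² as an illustrative choice of D (helper names are ours):

```python
import random

def uniform_on_domain(in_domain, M, d, rng):
    """Draw i.i.d. Y_n ~ U([-M, M]^d) and return Y_{tau_D}, the first
    point falling in D (here with a = 0); it is U(D)-distributed."""
    while True:
        y = tuple(rng.uniform(-M, M) for _ in range(d))
        if in_domain(y):
            return y

def in_unit_disc(y):
    """D = closed unit disc of R^2."""
    return y[0] ** 2 + y[1] ** 2 <= 1.0
```

For the unit disc inside [−1, 1]², the acceptance probability is λ₂(D)/(2M)² = π/4 ≈ 0.785, so the mean yield of the procedure is about 0.785 per trial.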
The fact that g_α is a probability density function follows from an elementary computation.
First, one checks that f_α(x) ≤ c_α g_α(x) for every x ∈ R_+, where

c_α = (α + e)/(α e).
Note that this choice of c_α is optimal since f_α(1) = c_α g_α(1). Then, one uses
the inverse distribution function to simulate the random variable with distribution
P_Y(dy) = g_α(y) λ(dy). Namely, if G_α denotes the distribution function of Y, one
checks that, for every x ∈ R,

G_α(x) = (e/(α + e)) x^α 1_{{0<x<1}} + (α e/(α + e)) (1/e + 1/α − e^{−x}) 1_{{x>1}}.
Note that the computation of the Γ function is never required to implement this
method.
– Case α ≥ 1 . We rely on the following classical lemma about the γ distribution
that we leave as an exercise to the reader.
Lemma 1.1 Let X′ and X″ be two independent random variables with distributions
γ(α′) and γ(α″), respectively. Then X = X′ + X″ has distribution γ(α′ + α″).
X = ξ1 + · · · + ξn
where the random variables ξ_k are i.i.d. with exponential distribution since γ(1) =
E(1). Consequently, if U_1, …, U_n are i.i.d. uniformly distributed random variables,

X =(d) − Σ_{k=1}^{n} log U_k.
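For an integer shape parameter, the identity above gives an exact simulation recipe with yield 1/n; a minimal sketch (the helper name is ours):

```python
import math
import random

def simulate_gamma_int(n, rng):
    """gamma(n), n integer >= 1, as a sum of n i.i.d. E(1) variables:
    X = -(log U_1 + ... + log U_n) = -log(U_1 * ... * U_n)."""
    return -sum(math.log(rng.random()) for _ in range(n))
```

Since a γ(n) random variable has mean n, the empirical mean of a large sample should be close to n.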
3. Let α = α′ + n, α′ = α − n ∈ [0, 1). Show that the yield of the simulation is given
by r = 1/(n + τ_{α′}), where τ_{α′} has a G*(p_{α′}) distribution, with p_{α′} related to the simulation
of the γ(α′)-distribution. Show that

r̄ := E r = − (p/(1 − p)^{n+1}) (log p + Σ_{k=1}^{n} (1 − p)^k / k).
supp(f) = {f ≠ 0} ⊂ K = κ + ∏_{i=1}^{d} [a_i, b_i].

f(x) ≤ c g(x), x ∈ K,  with c = ‖f‖_sup λ_d(K) = ‖f‖_sup ∏_{i=1}^{d} (b_i − a_i).
Then, if (Y_n)_{n≥1} denotes an i.i.d. sequence defined on a probability space (Ω, A, P)
with the uniform distribution over K, the stopping strategy τ of Von Neumann's
acceptance-rejection method reads

τ = min{k : ‖f‖_sup U_k ≤ f(Y_k)}.

V¹_τ =(d) ν  where  τ = min{k ≥ 1 : V²_k ≤ f(V¹_k)}.
f_k ≤ c g_k  with  c = κ/p.

Consequently, if τ := min{k ≥ 1 : U_k ≤ p a_k/κ}, then Y_τ =(d) ν, where

ν_k := a_k (1 − p)^{k−1} / Σ_n a_n (1 − p)^{n−1},  k ≥ 1.
∀ k ∈ N,  P(λ)({k}) = e^{−λ} λ^k / k!.
k!
To simulate this distribution in an exact way, one relies on its close connection with
the Poisson counting process. The (normalized) Poisson counting process is the
counting process induced by the Exponential random walk (with parameter 1). It is
defined by
∀ t ≥ 0,  N_t = Σ_{n≥1} 1_{{S_n ≤ t}} = min{n : S_{n+1} > t}.
Proposition 1.3 The process (Nt )t≥0 has càdlàg (1 ) paths and independent station-
ary increments. In particular, for every s, t ≥ 0, s ≤ t, Nt − Ns is independent of
Ns and has the same distribution as Nt−s . Furthermore, for every t ≥ 0, Nt has a
Poisson distribution with parameter t.
¹ A French acronym for right continuous with left limits (continu à droite, limite à gauche).
1.5 Simulation of Poisson Distributions (and Poisson Processes)
S_n = X_1 + · · · + X_n.
Now, if we set A = P(S_{k1} ≤ t1 < S_{k1+1} ≤ S_{k1+k2} ≤ t2 < S_{k1+k2+1}) for convenience,
we get

A = ∫_{R_+^{k1+k2+1}} 1_{{x1+···+x_{k1} ≤ t1 < x1+···+x_{k1+1}, x1+···+x_{k1+k2} ≤ t2 < x1+···+x_{k1+k2+1}}} e^{−(x1+···+x_{k1+k2+1})} dx1 ··· dx_{k1+k2+1}

A = ∫_{{0 ≤ u1 ≤ ··· ≤ u_{k1} ≤ t1 ≤ u_{k1+1} ≤ ··· ≤ u_{k1+k2} ≤ t2}} du1 ··· du_{k1+k2} e^{−t2}

P(S_{k1} ≤ t1 < t2 < S_{k1+1}) = (t1^{k1}/k1!) e^{−t2} = e^{−t1} (t1^{k1}/k1!) e^{−(t2−t1)}.

P(N_{t1} = k1) = e^{−t1} t1^{k1}/k1!
i.e. N_{t1} =(d) P(t1). Summing over k1 ∈ N shows that, for every k2 ∈ N,

N_{t2} − N_{t1} =(d) N_{t2−t1} =(d) P(t2 − t1).
Proof. It follows from Example 1 in the former Sect. 1.3 that the exponentially
distributed i.i.d. sequence (X k )k≥1 can be written in the following form
X k = − log Uk , k ≥ 1.
Using that the random walk (S_n)_{n≥1} is non-decreasing, it follows from the definition
of a Poisson process that, for every t ≥ 0,
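Since S_n = −log(U_1 ⋯ U_n), the event {S_n ≤ t} reads {U_1 ⋯ U_n ≥ e^{−t}}, which yields the classical multiplicative simulation loop for N_t =(d) P(t); a minimal sketch (the helper name is ours):

```python
import math
import random

def simulate_poisson(t, rng):
    """N_t = min{n : S_{n+1} > t} with S_n = -log(U_1 ... U_n): multiply
    i.i.d. U(0,1) numbers while the running product stays >= e^{-t};
    the number of factors consumed beyond the first exit is N_t."""
    threshold = math.exp(-t)
    n, prod = 0, rng.random()
    while prod >= threshold:
        n += 1
        prod *= rng.random()
    return n
```

The output has the Poisson distribution with parameter t, so both the empirical mean and the empirical variance of a large sample should be close to t.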
Proposition 1.4 Let R² and Θ : (Ω, A, P) → R be two independent r.v. with distributions
L(R²) = E(1/2) and L(Θ) = U([0, 2π]), respectively. Then

X := (R cos Θ, R sin Θ) =(d) N(0; I₂),

where R := √(R²).
Proof. Let f be a bounded Borel function.

∫_{R²} f(x₁, x₂) exp(−(x₁² + x₂²)/2) dx₁ dx₂/(2π) = ∫ f(ρ cos θ, ρ sin θ) e^{−ρ²/2} 1_{R*₊}(ρ) 1_{(0,2π)}(θ) ρ dρ dθ/(2π),

using the standard change of variable x₁ = ρ cos θ, x₂ = ρ sin θ. We use the facts
that (ρ, θ) ↦ (ρ cos θ, ρ sin θ) is a C¹-diffeomorphism from (0, +∞) × (0, 2π) onto
R² \ (R₊ × {0}) and that λ₂(R₊ × {0}) = 0. Setting now ρ = √r, one has:

∫_{R²} f(x₁, x₂) exp(−(x₁² + x₂²)/2) dx₁ dx₂/(2π) = ∫ f(√r cos θ, √r sin θ) (1/2) e^{−r/2} 1_{R*₊}(r) 1_{(0,2π)}(θ) dr dθ/(2π)
 = E[f(√(R²) cos Θ, √(R²) sin Θ)]
 = E[f(X)]. ♦
The yield of the simulation is r = 1/2 with respect to the N (0; 1) distribution and
r = 1 when the aim is to simulate an N (0; I2 )-distributed (pseudo-)random vector
or, equivalently, two N (0; 1)-distributed (pseudo-)random numbers.
Proof. Simulate the exponential distribution using the inverse distribution function
with U₁ =(d) U([0, 1]) and note that if U₂ =(d) U([0, 1]), then 2πU₂ =(d) U([0, 2π])
(where U₂ is taken independent of U₁). ♦
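The resulting Box–Muller recipe fits in a few lines (the helper name is ours):

```python
import math
import random

def box_muller(rng):
    """One pair of independent N(0;1) numbers from two independent
    U(0,1): R^2 = -2 log U1 =(d) E(1/2) and Theta = 2*pi*U2."""
    r = math.sqrt(-2.0 * math.log(rng.random()))
    theta = 2.0 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)
```

Each call consumes two uniform numbers and produces two N(0; 1) numbers, in line with the yield discussion above.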
=(d) U(B(0; 1)). Derive a simulation method for N(0; I₂) combining the above identity
and an appropriate acceptance-rejection algorithm to simulate the N (0; I2 ) distribu-
tion. What is the yield of the resulting procedure?
(c) Compare the performances of Marsaglia’s polar method with those of the
Box–Muller algorithm (i.e. the acceptance-rejection rule versus the computation
of trigonometric functions). Conclude.
If Z =(d) N(0; I_d), then √Σ Z =(d) N(0; Σ).

One can compute √Σ by diagonalizing Σ in the orthogonal group: if
Σ = P Diag(λ₁, …, λ_d) P* with P P* = I_d and λ₁, …, λ_d ≥ 0, then, by uniqueness
of the square root of Σ as defined above, it is clear that √Σ = P Diag(√λ₁, …, √λ_d) P*.
– Cholesky decomposition of Σ. When the covariance matrix Σ is invertible (i.e.
positive definite), it is much more efficient to rely on the Cholesky decomposition (see e.g.
Numerical Recipes [220]) by decomposing Σ as

Σ = T T*,

where T is a lower triangular matrix (i.e. such that T_{ij} = 0 if i < j). Then, owing to
Lemma 1.2,

T Z =(d) N(0; Σ).
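The Cholesky factor itself is computed by the standard row-by-row elimination; a minimal sketch for dense symmetric positive definite matrices stored as nested lists (the helper name is ours; production code would of course use an optimized library routine):

```python
import math

def cholesky(sigma):
    """Lower triangular T (T_ij = 0 if i < j) with T T* = sigma,
    for sigma symmetric positive definite (Cholesky-Banachiewicz)."""
    d = len(sigma)
    T = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i][j] - sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = math.sqrt(s) if i == j else s / T[j][j]
    return T
```

Multiplying a vector of i.i.d. N(0; 1) numbers by T then simulates N(0; Σ), as stated above.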
2. Let Σ be a positive definite matrix. Show the existence of a unique lower triangular
matrix T and a diagonal matrix D, both with positive diagonal entries, such
that Σ = T D T* and Σ_{j=1}^{i} T_{ij}² = 1 for every i = 1, …, d. [Hint: change the reference
Euclidean norm to perform the Hilbert–Schmidt decomposition] (³).

² The correlation ρ_{X₁,X₂} between two square integrable, non-a.s. constant random variables defined
on the same probability space is defined by
ρ_{X₁,X₂} = Cov(X₁, X₂)/(σ(X₁) σ(X₂)) = Cov(X₁, X₂)/√(Var(X₁) Var(X₂)).
Application to the simulation of a standard Brownian motion at fixed times
Let W = (W_t)_{t≥0} be a standard Brownian motion defined on a probability space
(Ω, A, P). Let (t₁, …, t_n) be an increasing n-tuple (0 ≤ t₁ < t₂ < ··· < t_{n−1} <
t_n) of instants. One elementary definition of a standard Brownian motion is that it
is a centered Gaussian process with covariance given by C_W(s, t) = E(W_s W_t) =
s ∧ t (⁴). The resulting simulation method, relying on the Cholesky decomposition
of the covariance structure of the Gaussian vector (W_{t₁}, …, W_{t_n}) given by

Σ^W_{(t₁,…,t_n)} = [t_i ∧ t_j]_{1≤i,j≤n},

is a first possibility.
However, it seems more natural to use the independence and the stationarity of the increments, i.e. that

(W_{t_1}, W_{t_2} − W_{t_1}, …, W_{t_n} − W_{t_{n−1}}) =^d N( 0; Diag(t_1, t_2 − t_1, …, t_n − t_{n−1}) )

so that

[ W_{t_1}, W_{t_2} − W_{t_1}, …, W_{t_n} − W_{t_{n−1}} ]^t =^d Diag( √t_1, √(t_2 − t_1), …, √(t_n − t_{n−1}) ) [ Z_1, …, Z_n ]^t,

where (Z_1, …, Z_n) =^d N(0; I_n). The simulation of (W_{t_1}, …, W_{t_n}) follows by summing the increments.
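A sketch of this increment-based simulation (assuming NumPy; the time grid is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

t = np.array([0.1, 0.25, 0.5, 0.75, 1.0])   # illustrative instants t_1 < ... < t_n
dt = np.diff(t, prepend=0.0)                # t_1, t_2 - t_1, ..., t_n - t_{n-1}

M = 100_000                                 # number of simulated paths
Z = rng.standard_normal((M, t.size))
increments = np.sqrt(dt) * Z                # independent N(0; t_j - t_{j-1}) increments
W = np.cumsum(increments, axis=1)           # (W_{t_1}, ..., W_{t_n}), path by path

emp_cov = np.cov(W, rowvar=False)           # should approach (t_i ∧ t_j)
```

The empirical covariance of the simulated vector can be checked against t_i ∧ t_j, which is exactly the covariance structure recalled above.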
Remark. To be more precise, one derives from the above result that

[ W_{t_1}, W_{t_2}, …, W_{t_n} ]^t = L [ W_{t_1}, W_{t_2} − W_{t_1}, …, W_{t_n} − W_{t_{n−1}} ]^t,  where L = ( 1_{{i ≥ j}} )_{1≤i,j≤n}.
3 This modified Cholesky decomposition is faster than the standard one (see e.g. [275]) since it
avoids square root computations even if, in practice, the cost of the decomposition itself remains
negligible compared to that of a large-scale Monte Carlo simulation.
4 This definition does not include the fact that W has continuous paths; however, it can be derived, using the celebrated Kolmogorov criterion, that W has a modification with continuous paths (see e.g. [251]).
Hence, if we set T = L Diag( √t_1, √(t_2 − t_1), …, √(t_n − t_{n−1}) ), we check on the one hand that T = ( √(t_j − t_{j−1}) 1_{{i ≥ j}} )_{1≤i,j≤n} (with the convention t_0 = 0) and on the other hand that

[ W_{t_1}, W_{t_2}, …, W_{t_n} ]^t = T [ Z_1, …, Z_n ]^t.

We derive, owing to Lemma 1.2, that T T* = T I_n T* = Σ^W_{t_1,…,t_n}. The matrix T being lower triangular, it provides the Cholesky decomposition of the covariance matrix Σ^W_{t_1,…,t_n}.
ℵ Practitioner's corner (Warning!)
In quantitative finance, especially when modeling the dynamics of several risky assets, say d, the correlation between the Brownian sources of randomness B = (B¹, …, B^d) attached to the log-returns is often misleading in terms of notation, since it is usually written as

∀ i ∈ {1, …, d},  B_t^i = Σ_{j=1}^q σ_{ij} W_t^j = (σ_{i·} | W_t),

where σ_{i·} is the column vector [σ_{ij}]_{1≤j≤q}, σ = [σ_{ij}]_{1≤i≤d, 1≤j≤q} and (·|·) denotes the canonical inner product on R^q. So one should process the Cholesky decomposition on the symmetric non-negative matrix Σ^{B_1} = σσ*. Also keep in mind that, if q < d, then Σ^{B_1} has rank at most q and cannot be invertible.
Application to the simulation of a Fractional Brownian motion at fixed times
A fractional Brownian motion (fBm) W^H = (W_t^H)_{t≥0} with Hurst coefficient H ∈ (0, 1) is defined as a centered Gaussian process with a covariance function given for every s, t ∈ R_+ by

C^{W^H}(s, t) = (1/2) ( t^{2H} + s^{2H} − |t − s|^{2H} ).
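Unlike standard Brownian motion, the increments of an fBm are correlated when H ≠ 1/2, but the Cholesky approach described above applies verbatim to any Gaussian vector, hence to (W^H_{t_1}, …, W^H_{t_n}). A sketch assuming NumPy, with an illustrative Hurst exponent and time grid:

```python
import numpy as np

H = 0.7                              # illustrative Hurst exponent in (0, 1)
t = np.linspace(0.1, 1.0, 10)        # illustrative instants 0 < t_1 < ... < t_n

# C(s, t) = (t^{2H} + s^{2H} - |t - s|^{2H}) / 2
S, T_ = np.meshgrid(t, t, indexing="ij")
C = 0.5 * (S**(2 * H) + T_**(2 * H) - np.abs(S - T_)**(2 * H))

L = np.linalg.cholesky(C)            # C is positive definite at distinct instants
rng = np.random.default_rng(2)
Z = rng.standard_normal((t.size, 50_000))
WH = L @ Z                           # columns: samples of (W^H_{t_1}, ..., W^H_{t_n})
```

The marginal variances Var(W^H_t) = t^{2H} give a quick sanity check on the output.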
The basic principle of the Monte Carlo method is to implement on a computer the
Strong Law of Large Numbers (SLLN): if (X k )k≥1 denotes a sequence, defined on
a probability space (, A, P), of independent integrable random variables with the
same distribution as X , then
mean that 1{τ =k} = ψk (U1 , . . . , Uk ) where ψk has an explicit form for every k ≥ 1.
Of course, we also assume that, for every k ≥ 1, the function ϕk defined on the set
of finite [0, 1]k -valued sequences is a computable function as well.
This procedure is at the core of Monte Carlo simulation. We provided several
examples of such representations in the previous chapter. For further developments
on this wide topic, we refer to [77] and the references therein, but in some way,
one may consider that a significant part of scientific activity in Probability Theory is
motivated by or can be applied to simulation.
Once these conditions are fulfilled, a Monte Carlo simulation can be performed.
But this leads to two important issues:
– what about the rate of convergence of the method?
and
– how can the resulting error be controlled?
The answers to these questions rely on fundamental results in Probability Theory and Statistics.
The (weak) rate of convergence in the SLLN is ruled by the Central Limit Theorem (CLT), which says that if X is square integrable (X ∈ L²(P)), then

√M (X̄_M − m_X) →^L N(0; σ_X²) as M → +∞,

where σ_X² = Var(X) := E(X − E X)² = E X² − (E X)² is the variance (its square root σ_X is called the standard deviation) (1). Also note that the mean quadratic error of convergence (i.e. the convergence rate in L²(P)) is exactly

‖ X̄_M − m_X ‖_2 = σ_X / √M.

This error is also known as the RMSE (for Root Mean Square Error).
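The σ_X/√M behaviour of the RMSE is easy to observe empirically. A sketch (assuming NumPy) with X uniform on [0, 1], so that m_X = 1/2 and σ_X² = 1/12:

```python
import numpy as np

rng = np.random.default_rng(3)

m_X = 0.5
sigma_X = (1 / 12) ** 0.5

M, runs = 1_000, 2_000                # size of each simulation, number of repetitions
X = rng.random((runs, M))             # 'runs' independent M-samples of U([0, 1])
errors = X.mean(axis=1) - m_X         # error of each empirical mean
rmse = np.sqrt(np.mean(errors ** 2))  # empirical root mean square error

theoretical = sigma_X / np.sqrt(M)    # sigma_X / sqrt(M)
```

The empirical `rmse` and `theoretical` agree closely, illustrating the exact L²-rate above.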
1 The symbol →^L stands for convergence in distribution (or “in law”, whence the notation “L”): a sequence of random variables (Y_n)_{n≥1} converges in distribution toward Y_∞ if E f(Y_n) → E f(Y_∞) for every bounded continuous function f. It can be defined equivalently as the weak convergence of the distributions P_{Y_n} toward the distribution P_{Y_∞}. Convergence in distribution is also characterized by the following property
A proof of this (difficult) result can be found e.g. in [263]. All these rates stress the main drawback of the Monte Carlo method: it is a slow method, since dividing the error by 2 requires increasing the size of the simulation by a factor of 4 (even a bit more viewed from the LIL).
Assume σ_X > 0. As concerns the control of the error, one relies on the CLT. It is obvious from the definition of convergence in distribution that the CLT also reads

√M (X̄_M − m_X)/σ_X →^L N(0; 1) as M → +∞

since σ_X Z =^d N(0; σ_X²) if Z =^d N(0; 1) and f(·/σ_X) ∈ C_b(R, R) if f ∈ C_b(R, R). Moreover, the normal distribution having a density, it has no atom. Consequently, this convergence implies (in fact it is equivalent, see [45]) that for all real numbers a, b, a > b,

lim_{M→+∞} P( √M (X̄_M − m_X)/σ_X ∈ [b, a] ) = P( N(0; 1) ∈ [b, a] ) = Φ_0(a) − Φ_0(b),
where Φ_0 denotes the cumulative distribution function (c.d.f.) of the standard normal distribution N(0; 1), defined by

Φ_0(x) = ∫_{−∞}^x e^{−ξ²/2} dξ/√(2π).

Exercise. Show that

∀ x ∈ R,  Φ_0(x) + Φ_0(−x) = 1.

Deduce that P(|N(0; 1)| ≤ a) = 2Φ_0(a) − 1 for every a > 0.
In turn, if X_1 ∈ L³(P), the convergence rate in the Central Limit Theorem is ruled by the following Berry–Esseen Theorem (see [263], p. 344).

Theorem 2.1 (Berry–Esseen Theorem) Let X_1 ∈ L³(P) with σ_X > 0 and let (X_k)_{k≥1} be an i.i.d. sequence of real-valued random variables defined on (Ω, A, P). Then

∀ M ≥ 1, ∀ x ∈ R,  | P( √M (X̄_M − m_X) ≤ x σ_X ) − Φ_0(x) | ≤ (C E|X_1 − E X_1|³ / σ_X³) × 1/( √M (1 + |x|³) ).
Hence, the rate of convergence in the CLT is again 1/√M, which is rather slow, at least from a statistical viewpoint. However, this is not a real problem within the usual range of Monte Carlo simulations (at least many thousands, usually one hundred thousand or one million paths). Consequently, one can assume that √M (X̄_M − m_X)/σ_X
has a standard normal distribution. This means in particular that one can design
a probabilistic control of the error directly derived from statistical concepts: let
α ∈ (0, 1) denote a confidence level (close to 1 in practice) and let q_α be the two-sided α-quantile defined as the unique solution to the equation

P(|N(0; 1)| ≤ q_α) = α,  i.e.  2Φ_0(q_α) − 1 = α.

One then considers the interval

J_M^α := [ X̄_M − q_α σ_X/√M,  X̄_M + q_α σ_X/√M ],

which satisfies

P( m_X ∈ J_M^α ) = P( √M |X̄_M − m_X|/σ_X ≤ q_α ) → P(|N(0; 1)| ≤ q_α) = α as M → +∞.
However, at this stage this procedure remains purely theoretical since the confi-
dence interval JMα involves the standard deviation σX of X , which is usually unknown.
Here comes the “magic trick” which led to the tremendous success of the Monte
Carlo method: the variance of X can in turn be evaluated on-line, without any addi-
tional assumption, by simply adding a companion Monte Carlo simulation to estimate
the variance σX2 , namely
V_M = (1/(M−1)) Σ_{k=1}^M (X_k − X̄_M)²   (2.1)
    = (1/(M−1)) Σ_{k=1}^M X_k² − (M/(M−1)) X̄_M²  →  Var(X) = σ_X²  as M → +∞.
The above convergence (2 ) follows from the SLLN applied to the i.i.d. sequence of
integrable random variables (X k2 )k≥1 (and the convergence of X M , which also follows
from the SLLN). It is an easy exercise to show that moreover E V M = σX2 , i.e. using the
terminology of Statistics, V M is an unbiased estimator of σX2 . This remark is of little
importance in practice due to the common large ranges of Monte Carlo simulations.
Note that the above a.s. convergence is again ruled by a CLT if X ∈ L 4 (P) (so that
X 2 ∈ L 2 (P)).
Exercises. 1. Show that V M is unbiased, that is E V M = σX2 .
2. Show that the sequence (X̄_M, V_M)_{M≥1} satisfies the following recursive equations:

X̄_M = X̄_{M−1} + (1/M)(X_M − X̄_{M−1}) = ((M−1)/M) X̄_{M−1} + X_M/M,  M ≥ 1,

V_M = V_{M−1} − (1/(M−1)) ( V_{M−1} − (X_M − X̄_{M−1})(X_M − X̄_M) ) = ((M−2)/(M−1)) V_{M−1} + (X_M − X̄_{M−1})²/M,  M ≥ 2.
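These recursions allow the empirical mean and variance to be updated sample by sample without storing the whole simulation. A sketch in plain Python (the N(1, 4) test distribution is an illustrative choice):

```python
import random

random.seed(4)

mean, var = 0.0, 0.0                 # running empirical mean and variance
xs = [random.gauss(1.0, 2.0) for _ in range(10_000)]

for M, x in enumerate(xs, start=1):
    delta = x - mean                 # X_M - mean_{M-1}
    mean += delta / M                # mean_M = mean_{M-1} + (X_M - mean_{M-1}) / M
    if M >= 2:                       # V_M = (M-2)/(M-1) V_{M-1} + (X_M - mean_{M-1})^2 / M
        var = (M - 2) / (M - 1) * var + delta ** 2 / M
```

This is the classical one-pass (Welford-type) scheme; it agrees with the direct two-pass formulas up to rounding.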
Consequently, since V_M → σ_X² a.s. (3),

√M (X̄_M − m_X)/√V_M = ( √M (X̄_M − m_X)/σ_X ) × ( σ_X/√V_M ) →^L N(0; 1) as M → +∞.
Of course, within the usual range of Monte Carlo simulations, one can always con-
sider that, for large M,
2 When X “already” has an N(0; 1) distribution, (M−1)V_M has a χ²(M−1)-distribution. The χ²(ν)-distribution, known as the χ²-distribution with ν ∈ N* degrees of freedom, is the distribution of the sum Z_1² + ⋯ + Z_ν², where Z_1, …, Z_ν are i.i.d. with N(0; 1) distribution. The loss of one degree of freedom for V_M comes from the fact that X_1, …, X_M and X̄_M satisfy a linear equation which induces a linear constraint. This result is known as Cochran's Theorem.
3 If a sequence of R^d-valued random vectors satisfies Z_n →^L Z_∞ and a sequence of random variables satisfies C_n →^P c_∞ ∈ R, then C_n Z_n →^L c_∞ Z_∞, see e.g. [155].
√M (X̄_M − m_X)/√V_M ≈ N(0; 1).
Note that when X is itself normally distributed, one shows that the empirical mean
X M and the empirical variance V M are independent, so that the true distribution of
the left-hand side of the above (approximate) equation is a Student distribution with
M − 1 degrees of freedom (4 ), denoted by T (M − 1).
Finally, one defines the confidence interval at level α of the Monte Carlo simulation by

I_M^α = [ X̄_M − q_α √(V_M/M),  X̄_M + q_α √(V_M/M) ].
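Putting the pieces together, a sketch of the whole procedure on a toy integrand (assuming NumPy; X exponential with mean 2, so m_X = 2 is known and can be checked against the interval):

```python
import numpy as np

rng = np.random.default_rng(5)

q_alpha = 1.96                           # two-sided quantile for alpha = 95%
M = 100_000
X = rng.exponential(scale=2.0, size=M)   # illustrative integrand, m_X = 2

mean = X.mean()
V = X.var(ddof=1)                        # unbiased empirical variance (2.1)
half = q_alpha * np.sqrt(V / M)
I_alpha = (mean - half, mean + half)     # confidence interval at level 95%
```

Repeating the simulation many times, the interval `I_alpha` should contain m_X = 2 in roughly 95% of the runs.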
dX_t^0 = r X_t^0 dt,  X_0^0 = 1,
dX_t^1 = X_t^1 (r dt + σ_1 dW_t^1),  X_0^1 = x_0^1,   dX_t^2 = X_t^2 (r dt + σ_2 dW_t^2),  X_0^2 = x_0^2,   (2.2)
⟨W^1, W^2⟩_t = ρ t,  ρ ∈ [−1, 1].
This implies that W² can be decomposed as W_t² = ρ W_t¹ + √(1 − ρ²) W̃_t², where (W¹, W̃²) is a standard 2-dimensional Brownian motion. The filtration (F_t)_{t∈[0,T]} of this market is the augmented filtration of W, i.e. F_t = F_t^W := σ(W_s, 0 ≤ s ≤ t; N_P), where N_P denotes the family of P-negligible sets of A (5). By “filtration of the market”, we mean that (F_t)_{t∈[0,T]} is the smallest filtration satisfying the usual conditions to which the process (X_t)_{t∈[0,T]} is adapted. By “risk-neutral”, we mean that e^{−rt} X_t is a (P, (F_t)_t)-martingale. We will not go further into financial modeling at this stage, for which we refer e.g. to [185] or [163], but focus instead on numerical aspects.
For every t ∈ [0, T], we have

X_t^0 = e^{rt},   X_t^i = x_0^i e^{(r − σ_i²/2)t + σ_i W_t^i},  i = 1, 2.
(One easily verifies using Itô's Lemma, see Sect. 12.8, that X_t thus defined satisfies (2.2); formally, one finds the solution by applying Itô's Lemma to log X_t^i, i = 1, 2, assuming a priori that the solutions of (2.2) are positive.)
When r = 0, X i is called a geometric Brownian motion associated to W i with
volatility σi > 0.
A European vanilla option with maturity T > 0 is an option related to a European
payoff
h T := h(X T )
which only depends on X at time T. In such a complete market the option premium at time 0 is given by

V_0 = e^{−rT} E h(X_T).
5 One shows that, owing to the Kolmogorov 0–1 law, this filtration is right continuous, i.e. F_t = ∩_{s>t} F_s. A right continuous filtration which contains the P-negligible sets satisfies the so-called “usual conditions”.
The fact that W has independent stationary increments implies that X¹ and X² have independent stationary ratios, i.e.

( X_T^i / X_t^i )_{i=1,2} =^d ( X_{T−t}^i / x_0^i )_{i=1,2} and is independent of F_t.

Then

V_t = e^{−r(T−t)} E( h(X_T) | F_t )
    = e^{−r(T−t)} E( h( ( X_t^i × (X_T^i / X_t^i) )_{i=1,2} ) | F_t )
    = e^{−r(T−t)} E h( ( x^i X_{T−t}^i / x_0^i )_{i=1,2} ) |_{x^i = X_t^i, i=1,2}   (by independence)
    = v(X_t, T − t).
There is a closed form for such a call option – the celebrated Black–Scholes formula for an option on a stock (without dividend) – given by
For this payoff no closed form is available. One has a choice between a PDE approach
(quite appropriate in this 2-dimensional setting but requiring some specific develop-
ments) and a Monte Carlo simulation.
We will illustrate below the regular Monte Carlo procedure on the example of a
Best-of-Call which is traded on an organized market, unlike its cousin the Exchange
Call Spread.
Pricing a Best-of-Call by a Monte Carlo simulation
To implement a (crude) Monte Carlo simulation we need to write the payoff as a
function of independent uniformly distributed random variables, or, equivalently, as
a tractable function of such random variables. In our case, we write it as a function of
two standard normal variables, i.e. a bi-variate standard normal distribution (Z 1 , Z 2 ),
namely
e^{−rT} h_T =^d ϕ(Z_1, Z_2)
:= ( max( x_0¹ exp(−σ_1²T/2 + σ_1√T Z_1), x_0² exp(−σ_2²T/2 + σ_2√T (ρZ_1 + √(1−ρ²) Z_2)) ) − K e^{−rT} )_+,

where Z = (Z_1, Z_2) =^d N(0; I_2) (the dependence of ϕ on x_0^i, etc., is dropped). Then, simulating an M-sample (Z_m)_{1≤m≤M} of the N(0; I_2) distribution using e.g. the Box–Muller method yields the estimate

Best-of-Call_0 = e^{−rT} E( max(X_T¹, X_T²) − K )_+ = E ϕ(Z_1, Z_2) ≈ ϕ̄_M := (1/M) Σ_{m=1}^M ϕ(Z_m).
One computes an estimate for the variance using the same sample:

V_M(ϕ) = (1/(M−1)) Σ_{m=1}^M ϕ(Z_m)² − (M/(M−1)) ϕ̄_M² ≈ Var(ϕ(Z)),

provided M is large enough. Then one designs a confidence interval for E ϕ(Z) at level α ∈ (0, 1) by setting
I_M^α = [ ϕ̄_M − q_α √(V_M(ϕ)/M),  ϕ̄_M + q_α √(V_M(ϕ)/M) ],

where q_α is defined by P(|N(0; 1)| ≤ q_α) = α (or, equivalently, by 2Φ_0(q_α) − 1 = α).
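A sketch of the whole pricing procedure (assuming NumPy; the parameters below are illustrative and are not the numerical values used in the text):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative parameters (not those of the text's numerical application).
x1, x2, K = 100.0, 100.0, 100.0
r, sigma1, sigma2, rho, T = 0.05, 0.2, 0.3, 0.5, 1.0

M = 200_000
Z1 = rng.standard_normal(M)
Z2 = rng.standard_normal(M)

# e^{-rT} X_T^i = x_i exp(-sigma_i^2 T / 2 + sigma_i sqrt(T) * gaussian)
G2 = rho * Z1 + np.sqrt(1 - rho**2) * Z2
S1 = x1 * np.exp(-0.5 * sigma1**2 * T + sigma1 * np.sqrt(T) * Z1)
S2 = x2 * np.exp(-0.5 * sigma2**2 * T + sigma2 * np.sqrt(T) * G2)
phi = np.maximum(np.maximum(S1, S2) - K * np.exp(-r * T), 0.0)

price = phi.mean()                             # discounted Best-of-Call price
half = 1.96 * phi.std(ddof=1) / np.sqrt(M)     # 95% confidence half-width
```

Changing the payoff amounts to changing the single line defining `phi`, which is exactly the flexibility of the Monte Carlo method emphasized below.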
Numerical Application. We consider a European “Best-of-Call” option with the
following parameters
Remark. Once the script is written for one option, i.e. one payoff function, it is
almost instantaneous to modify it to price another option based on a new payoff
function: the Monte Carlo method is very flexible, much more than a PDE approach.
Table 2.1 Black–Scholes Best-of-Call. Pointwise estimate and confidence intervals as a function of the size M of the Monte Carlo simulation, M = 2^k, k = 10, …, 20, and 2^30

M                     ϕ̄_M       I_{α,M}
2^10 = 1 024          18.5133    [17.4093; 19.6173]
2^11 = 2 048          18.7398    [17.9425; 19.5371]
2^12 = 4 096          18.8370    [18.2798; 19.3942]
2^13 = 8 192          19.3155    [18.9196; 19.7114]
2^14 = 16 384         19.1575    [18.8789; 19.4361]
2^15 = 32 768         19.0911    [18.8936; 19.2886]
2^16 = 65 536         19.0475    [18.9079; 19.1872]
2^17 = 131 072        19.0566    [18.9576; 19.1556]
2^18 = 262 144        19.0373    [18.9673; 19.1073]
2^19 = 524 288        19.0719    [19.0223; 19.1214]
2^20 = 1 048 576      19.0542    [19.0191; 19.0892]
⋯                     ⋯          ⋯
2^30 ≈ 1.0737·10^9    19.0766    [19.0756; 19.0777]
Fig. 2.1 Black–Scholes Best-of-Call. The Monte Carlo estimator (⊗) and its confidence interval I_{α,M} at level α = 95% for sizes M ∈ {2^k, k = 10, …, 30} (x-axis: M = 2^m, m = 1, …, 30)
The practitioner should never forget that performing a Monte Carlo simulation to compute m_X = E X consists of three mandatory steps:
1. Specification of a confidence level α ∈ (0, 1) (α close to 1; typically, α = 95%, 99%, 99.5%, etc.).
2. Simulation of an M-sample X_1, X_2, …, X_M of i.i.d. random vectors having the same distribution as X and (possibly recursive) computation of both its empirical mean X̄_M and its empirical variance V_M.
3. Computation of the resulting confidence interval I_M^α at confidence level α, which will be the only trustworthy result of the Monte Carlo simulation.
We will see further on in Chap. 3 how to specify the size M of the simulation to
comply with an a priori accuracy level. The case of biased simulation is deeply inves-
tigated in Chap. 9 (see Sect. 9.3 for the analysis of a crude Monte Carlo simulation
in a biased framework).
The greeks or sensitivities denote the set of parameters obtained as derivatives of the
premium of an option with respect to some of its parameters: the starting value, the
volatility, etc. It applies more generally to any function defined by an expectation.
In elementary situations, one simply needs to apply some more or less standard
theorems like the one reproduced below (see [52], Chap. 8 for a proof). A typical example of such an “elementary situation” is the case of a possibly multi-dimensional risk-neutral Black–Scholes model.
P(dω)-a.s.  | ∂ϕ/∂x (x, ω) | ≤ Y(ω),

then the function f(x) := E ϕ(x, · ) is defined and differentiable at every x ∈ I, with derivative

f′(x) = E( ∂ϕ/∂x (x, · ) ).
Remarks. • The local version of the above theorem may be necessary to prove the
differentiability of a function defined by an expectation over the whole real line (see
the exercise that follows).
• All of the preceding remains true if one replaces the probability space (, A, P)
by any measurable space (E, E, μ) where μ is a non-negative measure (see again
Chap. 8 in [52]). However, this extension is no longer true as seen when dealing with
the uniform integrability assumption mentioned in the exercises hereafter.
• Some variants of the result can be established to obtain a theorem for right or left differentiability of functions f(x) = E ϕ(x, · ) defined on the real line, for (partially) differentiable functions defined on R^d, for holomorphic functions on C, etc. The proofs follow the same lines.
• There exists a local continuity result for such functions ϕ defined as an expectation
which is quite similar to Claim (a). The domination property by an integrable non-
negative random variable Y is requested on ϕ(x, ω) itself. A precise statement is
provided in Chap. 12 (with the same notations). For a proof we still refer to [52],
Chap. 8. This result is often useful to establish the (continuous) differentiability of a
multi-variable function by combining the existence and the continuity of its partial
derivatives.
Exercise. Let Z =^d N(0; 1) be defined on a probability space (Ω, A, P), ϕ(x, ω) = (x − Z(ω))_+ and f(x) = E ϕ(x, Z) = E(x − Z)_+, x ∈ R.
(a) Show that f is differentiable on the real line by applying the local version of Theorem 2.2 and compute its derivative.
(b) Let I denote a non-trivial interval of R. Show that if ω ∈ {Z ∈ I} (i.e. Z(ω) ∈ I), the function x ↦ (x − Z(ω))_+ is not differentiable on the whole interval I.
Exercises (Extension to uniform integrability). One can replace the domination
property (iii) in Claim (a) (local version) of the above Theorem 2.2 by the less
stringent uniform integrability assumption
(iii)_ui   the family ( (ϕ(x, · ) − ϕ(x_0, · )) / (x − x_0) )_{x ∈ I\{x_0}} is P-uniformly integrable on (Ω, A, P).
For the definition and some background on uniform integrability, see Chap. 12,
Sect. 12.4.
1. Show that (iii) implies (iii)_ui.
2. Show that (i)–(ii)–(iii)ui implies the conclusion of Claim (a) (local version) in
the above Theorem 2.2.
3. State a “uniform integrable” counterpart of (iii)glob to extend Claim (b) (global
version) of Theorem 2.2.
4. Show that uniform integrability of the above family of random variables follows
from its L p -boundedness for (any) p > 1.
To illustrate the different methods to compute the sensitivities, we will consider the one-dimensional risk-neutral Black–Scholes model with constant interest rate r and volatility σ > 0:

dX_t^x = X_t^x (r dt + σ dW_t),  X_0^x = x > 0,

so that X_t^x = x exp((r − σ²/2)t + σ W_t). Then, we consider for every x ∈ (0, +∞),

f(x) = E ϕ(X_T^x),   (2.5)
where ϕ : (0, +∞) → R lies in L¹(P_{X_T^x}) for every x ∈ (0, +∞) and T > 0. This corresponds (when ϕ is non-negative) to vanilla payoffs with maturity T. However,
we skip on purpose the discounting factor in what follows to alleviate notation: one
can always imagine it is included as a constant in the function ϕ since we will work
at a fixed time T . The updating of formulas is obvious. Note that, in many cases,
new parameters directly attached to the function ϕ itself are of interest, typically the
strike price K when ϕ(x) = (x − K )+ (call option), (K − x)+ (Put option), |x − K |
(butterfly option), etc.
First, we are interested in computing the first two derivatives f′(x) and f″(x) of the function f, which correspond (up to the discounting factor) to the δ-hedge of the option and its γ parameter, respectively. The second parameter γ is involved in the so-called “tracking error”. But other sensitivities are of interest to practitioners, like the vega, i.e. the derivative of the (discounted) function f with respect to the volatility parameter σ, the ρ (derivative with respect to the interest rate r), etc. The aim is to derive representations of these sensitivities as expectations in order to compute them using a Monte Carlo simulation, in parallel with the computation of the premium.
The Black–Scholes model is here clearly a toy model to illustrate our approach since, for such a model, closed forms exist for standard payoffs (Call, Put and their linear combinations) and more efficient methods can be successfully implemented, like solving the PDE derived from the Feynman–Kac formula (see Theorem 7.11), at least in a one-dimensional – like here – or a low-dimensional model (say d ≤ 3), using a finite difference (or finite element, finite volume) scheme after an appropriate change of variable. For PDE methods we refer to [2].
We will first work on the scenarii space (, A, P), because this approach contains
the “seed” of methods that can be developed in much more general settings in which
the SDE no longer has an explicit solution. On the other hand, as soon as an explicit
expression is available for the density pT (x, y) of X Tx , it is more efficient to use the
method described in the next Sect. 2.2.3.
(b) If ϕ : (0, +∞) → R is differentiable outside a countable set and is locally Lip-
schitz continuous with polynomial growth in the following sense
Remark. The above formula (2.7) can be seen as a first example of a Malliavin weight used to compute a greek – here the δ-hedge – and the first method that we will use to establish it, based on an integration by parts, as the embryo of the so-called Malliavin Monte Carlo approach to greek computation. See, among other references, [47] for more developments in this direction when the underlying diffusion process is no longer an elementary geometric Brownian motion.
Proof. (a) This straightforwardly follows from the explicit expression for X_T^x and the above differentiation Theorem 2.2 (global version): for every x ∈ (0, +∞),

∂/∂x ϕ(X_T^x) = ϕ′(X_T^x) ∂X_T^x/∂x = ϕ′(X_T^x) X_T^x/x.

Now |ϕ′(u)| ≤ C(1 + |u|^m), where m ∈ N and C ∈ (0, +∞), so that, if 0 < x ≤ L < +∞,

| ∂/∂x ϕ(X_T^x) | ≤ C_{r,σ,T} ( 1 + L^m exp((m + 1)σ W_T) ) ∈ L¹(P),

where C_{r,σ,T} is another positive real constant. This yields the domination condition of the derivative.
(b) This claim follows from Theorem 2.2(a) (local version) and the fact that, for
every T > 0, P(X Tx = ξ) = 0 for every ξ ≥ 0.
(c) First, we still assume that the assumptions of claim (a) are in force. Then, writing μ = r − σ²/2,

f′(x) = ∫_R ϕ′( x exp(μT + σ√T u) ) exp(μT + σ√T u) e^{−u²/2} du/√(2π)
 = (1/(xσ√T)) ∫_R ∂/∂u [ ϕ( x exp(μT + σ√T u) ) ] e^{−u²/2} du/√(2π)
 = −(1/(xσ√T)) ∫_R ϕ( x exp(μT + σ√T u) ) ∂/∂u [ e^{−u²/2} ] du/√(2π)
 = (1/(xσ√T)) ∫_R ϕ( x exp(μT + σ√T u) ) u e^{−u²/2} du/√(2π)
 = (1/(xσT)) ∫_R ϕ( x exp(μT + σ√T u) ) √T u e^{−u²/2} du/√(2π),

where we used an integration by parts to obtain the third equality, taking advantage of the fact that, owing to the polynomial growth assumptions on ϕ,

lim_{|u|→+∞} ϕ( x exp(μT + σ√T u) ) e^{−u²/2} = 0.

Finally, returning to Ω,

f′(x) = (1/(xσT)) E( ϕ(X_T^x) W_T ).   (2.8)
When ϕ is not differentiable, let us first sketch the extension of the formula by a
density argument. When ϕ is continuous and has compact support in R+ , one may
assume without loss of generality that ϕ is defined on the whole real line as a con-
tinuous function with compact support. Then ϕ can be uniformly approximated by
differentiable functions ϕn with compact support (use a convolution by mollifiers,
see [52], Chap. 8). Then, with obvious notations, f_n′(x) := (1/(xσT)) E( ϕ_n(X_T^x) W_T ) converges uniformly on compact sets of (0, +∞) to f′(x) defined by (2.8) since

| f_n′(x) − f′(x) | ≤ ‖ϕ_n − ϕ‖_sup E|W_T| / (xσT).
Remark. We will see in the next section a much quicker way to establish claim (c). The above method of proof, based on an integration by parts, can be seen as a toy-introduction to a systematic way to produce random weights like W_T/(xσT) in the differentiation procedure of f, especially when the differential of ϕ does not exist. The most general extension of this approach, developed on the Wiener space (6) for functionals of the Brownian motion, is known as the Malliavin-Monte Carlo method.
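Formula (2.8) and the pathwise formula of claim (a) are easy to compare numerically against the closed form e^{rT} Φ_0(d_1) for a Call payoff. A sketch (assuming NumPy; the parameters are illustrative):

```python
import numpy as np
from math import exp, log, sqrt
from statistics import NormalDist

rng = np.random.default_rng(7)

x, K, r, sigma, T = 100.0, 110.0, 0.03, 0.25, 1.0   # illustrative parameters
M = 400_000

W_T = sqrt(T) * rng.standard_normal(M)
X_T = x * np.exp((r - 0.5 * sigma**2) * T + sigma * W_T)
payoff = np.maximum(X_T - K, 0.0)                   # undiscounted, as in the text

# Weight formula (2.8): f'(x) = E[ phi(X_T^x) W_T ] / (x sigma T)
delta_weight = np.mean(payoff * W_T) / (x * sigma * T)

# Pathwise formula (claim (a)): f'(x) = E[ phi'(X_T^x) X_T^x / x ]
delta_pathwise = np.mean((X_T > K) * X_T) / x

# Closed form: e^{rT} Phi_0(d_1)
d1 = (log(x / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
delta_exact = exp(r * T) * NormalDist().cdf(d1)
```

Both estimators target the same derivative; their variances differ, which is the subject of the exercises below.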
Exercise (Extension to Borel functions with polynomial growth). (a) Show that
as soon as ϕ is a Borel function with polynomial growth, the function f defined
by (2.5) is continuous. [Hint: use that the distribution X Tx has a probability density
pT (x, y) on the positive real line which continuously depends on x and apply the
continuity theorem for functions defined by an integral, see Theorem 12.5(a) in the
Miscellany Chap. 12.]
(b) Show that (2.8) holds true as soon as ϕ is a bounded Borel function. [Hint:
Apply the Functional Monotone Class Theorem (see Theorem 12.2 in the Miscellany
Chap. 12) to an appropriate vector subspace of functions ϕ and use the Baire σ-field
Theorem.]
(c) Extend the result to Borel functions ϕ with polynomial growth. [Hint: use that ϕ(X_T^x) ∈ L¹(P) and ϕ = lim_n ϕ_n with ϕ_n = (ϕ ∧ n) ∨ (−n).]
(d) Derive from the preceding a simple expression for (x) when ϕ = 1 I is the
indicator function of a nontrivial interval I .
Comments: The extension to Borel functions ϕ always needs at some place an
argument based on the regularizing effect of the diffusion induced by the Brownian
motion. As a matter of fact, if X tx were the solution to a regular ODE this extension to
non-continuous functions ϕ would fail. We propose in the next section an approach –
the log-likelihood method – directly based on this regularizing effect through the
direct differentiation of the probability density pT (x, y) of X Tx .
Exercise. Prove claim (b) in detail.
Note that the assumptions of claim (b) are satisfied by usual payoff functions like ϕ_Call(x) = (x − K)_+ or ϕ_Put(x) := (K − x)_+ (when X_T^x has a continuous distribution). In particular, this shows that

∂/∂x E ϕ_Call(X_T^x) = E( 1_{{X_T^x ≥ K}} X_T^x / x ).

The computation of this quantity – which is part of that of the Black–Scholes formula – finally yields, as expected,

∂/∂x E ϕ_Call(X_T^x) = e^{rT} Φ_0(d_1),

where d_1 is given by (2.4) (keep in mind that the discounting factor is missing).
Exercises. 0. A comparison. Try a direct differentiation of the Black–Scholes formula (2.3) and compare with a (formal) differentiation based on Theorem 2.2. You should find by both methods

∂Call_0^{BS}/∂x (x) = Φ_0(d_1).

But the true question is: “how long did it take you to proceed?”
1. Application to the computation of the γ (i.e. f″(x)). Show that, if ϕ is differentiable with a derivative having polynomial growth,

f″(x) = (1/(x²σT)) E( ( ϕ′(X_T^x) X_T^x − ϕ(X_T^x) ) W_T )

and that, if ϕ is continuous with compact support,

f″(x) = (1/(x²σT)) E( ϕ(X_T^x) ( W_T²/(σT) − W_T − 1/σ ) ).
Extend this identity to the case where ϕ is simply Borel with polynomial growth.
Note that a (somewhat simpler) formula also exists when the function ϕ is itself
twice differentiable, but such a smoothness assumption is not realistic, at least for
financial applications.
2. Variance reduction for the δ (7). The above formulas are clearly not the unique representations of the δ as an expectation: using that E W_T = 0 and E X_T^x = x e^{rT}, one derives immediately that

f′(x) = ϕ′(x e^{rT}) e^{rT} + E( ( ϕ′(X_T^x) − ϕ′(x e^{rT}) ) X_T^x / x )

and

f′(x) = (1/(xσT)) E( ( ϕ(X_T^x) − ϕ(x e^{rT}) ) W_T ).
3. Variance reduction for the γ. Show that

f″(x) = (1/(x²σT)) E( ( ϕ′(X_T^x) X_T^x − ϕ(X_T^x) − x e^{rT} ϕ′(x e^{rT}) + ϕ(x e^{rT}) ) W_T ).
4. Testing the variance reduction, if any. Although the former two exercises are entitled “variance reduction”, the above formulas do not guarantee a variance reduction at a fixed maturity T. It seems intuitive that they do so only when the maturity T is small. Perform some numerical experiments to test whether or not the above formulas induce some variance reduction.
As the maturity increases, test whether or not the regression method introduced in Sect. 3.2 works with these “control variates”.
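A sketch of the experiment suggested in Exercise 4 (assuming NumPy; the parameters are illustrative – an in-the-money Call with a short maturity, i.e. the regime where centering by ϕ(x e^{rT}) is expected to help):

```python
import numpy as np
from math import exp, sqrt

rng = np.random.default_rng(8)

x, K, r, sigma, T = 120.0, 100.0, 0.03, 0.25, 0.1   # in the money, short maturity
M = 200_000

W_T = sqrt(T) * rng.standard_normal(M)
X_T = x * np.exp((r - 0.5 * sigma**2) * T + sigma * W_T)

def phi(xi):                 # Call payoff
    return np.maximum(xi - K, 0.0)

weight = W_T / (x * sigma * T)
crude = phi(X_T) * weight                                # formula (2.8)
controlled = (phi(X_T) - phi(x * exp(r * T))) * weight   # same expectation since E W_T = 0
```

On this configuration the centered estimator has a markedly smaller empirical variance; near the money or for long maturities the gain can disappear, which is precisely what the exercise asks to investigate.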
5. Computation of the vega (8). Show likewise that E ϕ(X_T^x) is differentiable with respect to the volatility parameter σ under the same assumptions on ϕ, namely

∂/∂σ E ϕ(X_T^x) = E( ϕ′(X_T^x) X_T^x (W_T − σT) )

if ϕ is differentiable with a derivative having polynomial growth. Derive without any further computations – but with the help of the previous exercises – that

∂/∂σ E ϕ(X_T^x) = E( ϕ(X_T^x) ( W_T²/(σT) − W_T − 1/σ ) )

if ϕ is simply Borel with polynomial growth. [Hint: use the former exercises.]
This derivative is known (up to an appropriate discounting) as the vega of the option related to the payoff ϕ(X_T^x). Note that the γ and the vega of a Call satisfy (after discounting by e^{−rT})

vega(x, K, r, σ, T) = x²σT γ(x, K, r, σ, T),

7 In this exercise we slightly anticipate the next chapter, which is entirely devoted to variance reduction.
8 Which is not a Greek letter…
In fact, the formulas established in the former section for the Black–Scholes model
can be obtained by working directly on the state space (0, +∞), taking advantage of
the fact that X Tx has a smooth and explicit probability density pT (x, y) with respect to
the Lebesgue measure on (0, +∞), which is known explicitly since it is a log-normal
distribution.
This probability density also depends on the other parameters of the model like the volatility σ, the interest rate r and the maturity T. Let us denote by θ one of these parameters, which is assumed to lie in a parameter set Θ. More generally, we could imagine that X_T^x(θ) is an R^d-valued solution at time T to a stochastic differential equation whose coefficients b(θ, x) and σ(θ, x) depend on a parameter θ ∈ Θ ⊂ R. An important result of stochastic analysis for Brownian diffusions is that,
under uniform ellipticity assumptions (or the less stringent “parabolic Hörmander
ellipticity assumptions”, see [24, 139]), combined with smoothness assumptions on
both the drift and the diffusion coefficient, such a solution of an SDE does have a
smooth density pT (θ, x, y) – at least in (x, y) – with respect to the Lebesgue measure
on Rd . For more details, we refer to [25] or [11, 98]. Formally, we then get
f(θ) = E ϕ(X_T^x(θ)) = ∫_{R^d} ϕ(y) p_T(θ, x, y) μ(dy)
so that, formally,

f′(θ) = ∫_{R^d} ϕ(y) ∂p_T/∂θ (θ, x, y) μ(dy)
 = ∫_{R^d} ϕ(y) ( (∂p_T/∂θ (θ, x, y)) / p_T(θ, x, y) ) p_T(θ, x, y) μ(dy)
 = E( ϕ(X_T^x) ∂ log p_T/∂θ (θ, x, X_T^x) ).   (2.9)
which can be viewed as ancestors of Malliavin calculus, provide the δ-hedge for
vanilla options in local volatility models (see Theorem 10.2 and the application that
follows).
Exercises. 1. Provide simple assumptions to justify the above formal computations
in (2.9), at some point θ0 or for all θ running over a non-empty open interval of
R (or domain of Rd if θ is vector valued). [Hint: use the remark directly below
Theorem 2.2.]
2. Compute the probability density pT (σ, x, y) of X Tx,σ in a Black–Scholes model
(σ > 0 stands for the volatility parameter).
3. Re-establish all the sensitivity formulas established in the former Sect. 2.2.2
(including the exercises at the end of the section) using this approach.
4. Apply these formulas to the case ϕ(x) := e^{−rT}(x − K)_+ and retrieve the classical expressions for the greeks in a Black–Scholes model: the δ, the γ and the vega.
In this section we focused on the case of the marginal X Tx (θ) at time T of a Brow-
nian diffusion as encountered in local volatility models viewed as a generalization
of the Black–Scholes models investigated in the former section. In fact, this method,
known as the log-likelihood method, has a much wider range of application since it works for any family (X(θ))_{θ∈Θ} of R^d-valued random vectors (Θ ⊂ R^q) such that, for every θ ∈ Θ, the distribution of X(θ) has a probability density p(θ, y) with respect to a reference measure μ on R^d, usually the Lebesgue measure.
In fact, when both the payoff function/functional and the coefficients of the SDE are
regular enough, one can differentiate the function/functional of the process directly
with respect to a given parameter. The former Sect. 2.2.2 was a special case of this
method for vanilla payoffs in a Black–Scholes model. We refer to Sect. 10.2.2 for
more detailed developments.
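As a quick numerical illustration of the log-likelihood approach, the sketch below (in Python, with illustrative parameters not taken from the text) estimates the Black–Scholes δ through the weight W_T/(xσT) from (2.9) and compares it with the closed-form value N(d₁):

```python
import numpy as np
from math import log, sqrt, erf

# Log-likelihood (score) estimator of the Black-Scholes delta:
# d/dx E[phi(X_T^x)] = E[ phi(X_T^x) W_T / (x sigma T) ]   (here r = 0)
rng = np.random.default_rng(0)
x, K, sigma, T, r = 100.0, 95.0, 0.5, 1.0, 0.0
M = 400_000

W_T = sqrt(T) * rng.standard_normal(M)
X_T = x * np.exp((r - 0.5 * sigma**2) * T + sigma * W_T)
payoff = np.maximum(X_T - K, 0.0)

delta_lr = np.mean(payoff * W_T / (x * sigma * T))

# closed-form Black-Scholes delta N(d1), for comparison
d1 = (log(x / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
delta_bs = 0.5 * (1 + erf(d1 / sqrt(2)))
```

With a simulation of this size the two values agree up to the Monte Carlo error, without ever differentiating the payoff.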
Chapter 3
Variance Reduction
m_X = E X ∈ R

M ≥ M_X(ε, α) = q_α² Var(X) / ε².   (3.1)
(the last condition only says that X and X′ are not a.s. equal).
Question: Which random vector (distribution…) is more appropriate?
Several examples of such a situation have already been pointed out in the previous
chapter: usually many formulas are available to compute a greek parameter, even
more if one takes into account the (potential) control variates introduced in the
exercises.
A natural answer is: if both X and X′ can be simulated with an equivalent cost (complexity), then the one with the lowest variance is the best choice, i.e. one writes

X′ = X − Ξ + E Ξ.
So we have at hand two formulas for f'(x) that can be implemented in a Monte Carlo simulation. Which one should we choose to compute f'(x) (i.e. the δ-hedge up to a factor of e^{−rT})? Since they have the same expectations, the two random variables should be discriminated through (the square of) their L²-norm, namely

E[ ( ϕ'(X_T^x) X_T^x / x )² ]   and   E[ ( ϕ(X_T^x) W_T / (xσT) )² ].
Such is the case for an in-the-money Call option with payoff ϕ(ξ) = (ξ − K)_+ if X_0^x = x > K.
Long maturities. On the other hand, at least if ϕ is bounded,

E[ ( ϕ(X_T^x) W_T/(xσT) )² ] = O(1/T) → 0 as T → +∞,

whereas, if ϕ' is bounded away from zero at infinity, easy computations show that

lim_{T→+∞} E[ ( ϕ'(X_T^x) X_T^x/x )² ] = +∞.
However, these last two conditions on the function ϕ conflict with each other.
In practice, one observes on the greeks of usual payoffs that the pathwise differentiated estimator has a significantly lower variance (even when this variance goes to 0, like for Puts with long maturities).
Numerical Experiment (Short maturities). We consider the Call payoff ϕ(ξ) = (ξ −
K )+ , with K = 95 and x = 100, still r = 0 and a volatility σ = 0.5 in the Black–
Scholes model. Fig. 3.1 (left) depicts the variance of the pathwise differentiated and
the weighted estimators of the δ-hedge for maturities running from T = 0.001 up to
T = 1.
These estimators can be improved (at least for short maturities) by introducing
control variates as follows (see Exercise 2, Sect. 2.2.2 of the former chapter):

f'(x) = ϕ'(x) + E[ ( ϕ'(X_T^x) − ϕ'(x) ) X_T^x/x ] = E[ ( ϕ(X_T^x) − ϕ(x) ) W_T/(xσT) ].   (3.3)
However the comparisons carried out with these new estimators tend to confirm the
above heuristics, as illustrated by Fig. 3.1 (right).
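These heuristics can be checked empirically. The sketch below (Python, illustrative parameters matching the experiment above) compares the variances of the pathwise and weighted estimators at one short maturity, with and without the control variates of (3.3); the derivative ϕ' of the call payoff is taken a.e. as an indicator:

```python
import numpy as np

# Variance comparison of the two delta estimators at a short maturity.
rng = np.random.default_rng(1)
x, K, sigma, r = 100.0, 95.0, 0.5, 0.0
T, M = 0.01, 200_000

W = np.sqrt(T) * rng.standard_normal(M)
X = x * np.exp((r - 0.5 * sigma**2) * T + sigma * W)

indic = (X > K).astype(float)                 # phi'(xi) = 1_{xi > K} a.e.
pathwise = indic * X / x                      # phi'(X_T) X_T / x
weighted = np.maximum(X - K, 0.0) * W / (x * sigma * T)

# control-variate versions, as in (3.3)
pathwise_cv = (indic - float(x > K)) * X / x
weighted_cv = (np.maximum(X - K, 0.0) - max(x - K, 0.0)) * W / (x * sigma * T)
```

For this in-the-money call at T = 0.01 one observes, as predicted, a much lower variance for the pathwise estimator, and a further reduction of the weighted one by its control variate.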
A variant (pseudo-control variate)
In option pricing, when the random variable X is a payoff, it is usually non-negative. In that case, any random variable Ξ satisfying (i)–(ii) and

(iii) 0 ≤ Ξ ≤ X
3.1 The Monte Carlo Method Revisited: Static Control Variate 53
Fig. 3.1 Black–Scholes Calls. Left: T ↦ Var( ϕ'(X_T^x) X_T^x/x ) (blue line) and T ↦ Var( ϕ(X_T^x) W_T/(xσT) ) (red line), T ∈ (0, 1], x = 100, K = 95, r = 0, σ = 0.5. Right: T ↦ Var( (ϕ'(X_T^x) − ϕ'(x)) X_T^x/x ) (blue line) and T ↦ Var( (ϕ(X_T^x) − ϕ(x)) W_T/(xσT) ) (red line), T ∈ (0, 1]
In particular, considering B = {∅, Ω} yields the above inequality for regular expectation, i.e.

g(E X) ≤ E g(X).
Jensen’s inequality is an efficient tool for designing control variates when dealing
with path-dependent or multi-asset options as emphasized by the following examples.
Examples. 1. Basket or index option. We consider a payoff on a basket of d
(positive) risky assets (this basket can be an index). For the sake of simplicity we
suppose it is a Call with strike K , i.e.
h_T = ( Σ_{i=1}^d α_i X_T^{i,x_i} − K )_+,

where the weights α_i ≥ 0 satisfy Σ_{1≤i≤d} α_i = 1. By the weighted arithmetic–geometric mean inequality (a consequence of the convexity of the exponential),

0 ≤ e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} ≤ Σ_{i=1}^d α_i X_T^{i,x_i}

so that

h_T ≥ k_T := ( e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} − K )_+ ≥ 0.
X_t^{i,x_i} = x_i exp( (r − σ_{i.}²/2) t + Σ_{j=1}^q σ_{ij} W_t^j ),  t ∈ [0, T], x_i > 0, i = 1, …, d,

where

σ_{i.}² = Σ_{j=1}^q σ_{ij}²,  i = 1, …, d.
Exercise. Show that if the matrix σσ* is positive definite (so that q ≥ d), one may assume without modifying the model that X^{i,x_i} only depends on the first i components of a d-dimensional standard Brownian motion. [Hint: consider the Cholesky decomposition in Sect. 1.6.2.]
Now, let us describe the two phases of the variance reduction procedure:
– Phase I: set Ξ = e^{−rT} k_T as a pseudo-control variate and compute its expectation E Ξ.

The vanilla Call option has a closed form in a Black–Scholes model and elementary computations show that

Σ_{1≤i≤d} α_i log( X_T^{i,x_i}/x_i ) =d N( (r − ½ Σ_{1≤i≤d} α_i σ_{i.}²) T ; α*σσ*α T ).
Remark. The extension to more general payoffs of the form ϕ( Σ_{1≤i≤d} α_i X_T^{i,x_i} ) is straightforward provided ϕ is non-decreasing and a closed form exists for the vanilla option with payoff ϕ( e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} ).
Exercise. Other ways to take advantage of the convexity of the exponential function can be explored: thus one can start from

Σ_{1≤i≤d} α_i X_T^{i,x_i} = ( Σ_{1≤i≤d} α_i x_i ) Σ_{1≤i≤d} ᾱ_i ( X_T^{i,x_i}/x_i ),

where ᾱ_i = α_i x_i / Σ_{1≤k≤d} α_k x_k, i = 1, …, d. Compare on simulations the respective performances of these different approaches.
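A minimal implementation of the geometric pseudo-control variate can be sketched as follows (Python; two independent Black–Scholes assets with illustrative parameters, weights α summing to 1). The key point is that the geometric basket is lognormal, so E k_T has a closed form:

```python
import numpy as np
from math import log, sqrt, exp, erf

# Basket call with geometric pseudo-control variate (independent assets).
rng = np.random.default_rng(2)
x = np.array([100.0, 100.0]); alpha = np.array([0.5, 0.5])
sig = np.array([0.3, 0.4]); r, T, K, M = 0.0, 1.0, 100.0, 200_000

Z = rng.standard_normal((M, 2))
X_T = x * np.exp((r - 0.5 * sig**2) * T + sig * sqrt(T) * Z)

h = np.maximum(X_T @ alpha - K, 0.0)                  # basket call payoff
k = np.maximum(np.exp(np.log(X_T) @ alpha) - K, 0.0)  # geometric lower bound k_T

# Closed form for E[k]: log of the geometric basket is N(log(g0)+mu, s2).
g0 = float(np.prod(x**alpha))
mu = (r - 0.5 * float(alpha @ sig**2)) * T
s2 = float((alpha**2) @ sig**2) * T                   # independent assets
N0 = lambda u: 0.5 * (1 + erf(u / sqrt(2)))
d1 = (log(g0 / K) + mu + s2) / sqrt(s2)
Ek = g0 * exp(mu + 0.5 * s2) * N0(d1) - K * N0(d1 - sqrt(s2))

price_cv = np.mean(h - k) + Ek                        # estimator of E[h]
```

Since h_T ≥ k_T samplewise and the two are strongly correlated, the residual h − k has a much smaller variance than h itself.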
2. Asian options and the Kemna–Vorst control variate in a Black–Scholes model
(see [166]). Let

h_T = ϕ( (1/T) ∫_0^T X_t^x dt ),

where (X_t^x)_{t∈[0,T]} follows a regular Black–Scholes dynamics with volatility σ > 0 and interest rate r. Then, the (standard) Jensen inequality applied to the probability measure (1/T) 1_{[0,T]}(t) dt implies

(1/T) ∫_0^T X_t^x dt ≥ x exp( (1/T) ∫_0^T ( (r − σ²/2) t + σ W_t ) dt )
                     = x exp( (r − σ²/2) T/2 + (σ/T) ∫_0^T W_t dt ).
Now

∫_0^T W_t dt = T W_T − ∫_0^T s dW_s = ∫_0^T (T − s) dW_s

so that

(1/T) ∫_0^T W_t dt =d N( 0 ; (1/T²) ∫_0^T s² ds ) = N( 0 ; T/3 ).
This suggests rewriting the right-hand side of the above inequality in a “Black–Scholes asset” style, namely:

(1/T) ∫_0^T X_t^x dt ≥ x e^{−(r/2 + σ²/12)T} exp( (r − (σ²/3)/2) T + (σ/T) ∫_0^T W_t dt ),

so that, ϕ being non-decreasing,

h_T ≥ k_T^{KV}.
(1/T) ∫_0^T f(t) dt ≈ (1/n) Σ_{k=1}^n f( (2k−1)T/(2n) )
or any other numerical integration method, keeping in mind nevertheless that the
(continuous) functions f of interest are here given for the first and the second payoff
functions by
f(t) = ϕ( x exp( (r − σ²/2) t + σ W_t(ω) ) )   and   f(t) = W_t(ω),

respectively. Hence, their regularity is (1/2)⁻-Hölder (i.e. α-Hölder on [0, T], for every α < 1/2, like for the payoff h_T). Finally, in practice, it will amount to simulating independent copies of the n-tuple

( W_{T/(2n)}, …, W_{(2k−1)T/(2n)}, …, W_{(2n−1)T/(2n)} )
from which one can reconstruct a mid-point approximation of both integrals appearing in h_T and k_T^{KV}.
In fact, one can improve this first approach by taking advantage of the fact that
W is a Gaussian process as detailed in the exercise below.
Further developments to reduce the time discretization error are proposed in
Sect. 8.2.5 (see [187] where an in-depth study of the Asian option pricing in a
Black–Scholes model is carried out).
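The whole procedure can be sketched as follows (Python, illustrative parameters). To avoid any discretization mismatch, the control variate used here is the *discrete* geometric average at the midpoints, whose log-moments are computed exactly from the covariance of the sampled Brownian values:

```python
import numpy as np
from math import log, sqrt, exp, erf

# Asian call with a Kemna-Vorst-style (geometric) control variate.
rng = np.random.default_rng(3)
x, K, r, sigma, T, n, M = 100.0, 100.0, 0.0, 0.3, 1.0, 50, 100_000

t = (2 * np.arange(1, n + 1) - 1) * T / (2 * n)           # midpoints
dW = rng.standard_normal((M, n)) * np.sqrt(np.diff(np.concatenate(([0.0], t))))
W = np.cumsum(dW, axis=1)                                 # W_{t_1}, ..., W_{t_n}
S = x * np.exp((r - 0.5 * sigma**2) * t + sigma * W)

arith = S.mean(axis=1)
geo = np.exp(np.log(S).mean(axis=1))                      # discrete geometric avg
h = np.maximum(arith - K, 0.0)
k = np.maximum(geo - K, 0.0)                              # control variate

# E[k] in closed form: log(geo/x) is N(mu, s2) with
mu = (r - 0.5 * sigma**2) * t.mean()
s2 = sigma**2 * np.minimum.outer(t, t).mean()             # Var((1/n) sum W_{t_k})
N0 = lambda u: 0.5 * (1 + erf(u / sqrt(2)))
d1 = (log(x / K) + mu + s2) / sqrt(s2)
Ek = x * exp(mu + 0.5 * s2) * N0(d1) - K * N0(d1 - sqrt(s2))

price_cv = np.mean(h - k) + Ek
```

The arithmetic and geometric averages are so strongly correlated that the residual h − k carries only a small fraction of the original variance.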
Exercises. 1. (a) Show that if f : [0, T] → R is continuous then

lim_n (1/n) Σ_{k=1}^n f( kT/n ) = (1/T) ∫_0^T f(t) dt.
(b) Show that t ↦ x exp( (r − σ²/2) t + σ W_t(ω) ) is α-Hölder for every α ∈ (0, 1/2), with a_k = (2(n−k)+1)T/(2n), k = 1, …, n. [Hint: for Gaussian vectors, conditional expectation and affine regression coincide.]
(c) Propose a variance reduction method in which the pseudo-control variate e−r t k T
will be simulated exactly.
2. Check that the preceding can be applied to payoffs of the form
ϕ( (1/T) ∫_0^T X_t^x dt − X_T^x ).
(a) Using the convexity inequality (that can still be seen as an application of Jensen's inequality)

√(ab) ≤ max(a, b),  a, b > 0,

show that

k_T := ( √(X_T^1 X_T^2) − K )_+
In this section we assume that X and X′ have not only the same expectation m_X but also the same variance, i.e. Var(X) = Var(X′), and can be simulated with the same complexity κ = κ_X = κ_{X′}. We also assume that E(X − X′)² > 0, so that X and X′ are not a.s. equal. In such a situation, choosing between X and X′ may seem a priori a question of little interest. However, it is possible to take advantage of this situation to reduce the variance of a simulation when X and X′ are negatively correlated.

Set

χ = (X + X′)/2.

This corresponds to Ξ = (X − X′)/2 with our formalism. It is reasonable (when no further information on (X, X′) is available) to assume that the simulation complexity of χ is twice that of X and X′, i.e. κ_χ = 2κ. On the other hand
Var(χ) = (1/4) Var(X + X′)
       = (1/4)( Var(X) + Var(X′) + 2 Cov(X, X′) )
       = ( Var(X) + Cov(X, X′) ) / 2.
The sizes M_X(ε, α) and M_χ(ε, α) of the simulations using X and χ respectively to enter a given interval [m − ε, m + ε] with the same confidence level α are given, following (3.1), by

M_X = (q_α/ε)² Var(X) for X   and   M_χ = (q_α/ε)² Var(χ) for χ.

Taking into account the complexity as in the exercise that follows Definition 3.1, this means in terms of CPU computation time that one should rather use χ if and only if

κ_χ M_χ < κ_X M_X ⟺ 2κ M_χ < κ M_X,

i.e.

2 Var(χ) < Var(X),

which finally reads

Cov(X, X′) < 0.
To take advantage of this remark in practice, one usually relies on the following
result.
Furthermore, the inequality holds as an equality if and only if ϕ(Z) = E ϕ(Z) P-a.s. or ψ(Z) = E ψ(Z) P-a.s.

(b) Assume there exists a non-increasing (hence Borel) function T : R → R such that Z =d T(Z). Then X = ϕ(Z) and X′ = ϕ(T(Z)) are identically distributed and satisfy

Cov(X, X′) ≤ 0.
so that the expectation of this product is non-negative (and finite since all random
variables are square integrable). Consequently
E[ϕ(Z)ψ(Z)] + E[ϕ(Z′)ψ(Z′)] − E[ϕ(Z)ψ(Z′)] − E[ϕ(Z′)ψ(Z)] ≥ 0.

Using that Z′ =d Z and that Z, Z′ are independent, we get

2 E[ϕ(Z)ψ(Z)] ≥ E ϕ(Z) E ψ(Z′) + E ϕ(Z′) E ψ(Z) = 2 E ϕ(Z) E ψ(Z),

that is

Cov( ϕ(Z), ψ(Z) ) = E[ϕ(Z)ψ(Z)] − E ϕ(Z) E ψ(Z) ≥ 0.
1 This is always possible owing to Fubini's Theorem for product measures by considering the probability space (Ω², A^{⊗2}, P^{⊗2}): extend Z by Z(ω, ω′) = Z(ω) and define Z′ by Z′(ω, ω′) = Z(ω′).
ϕ(Z) − ϕ(Z′) > 0 a.s. so that ψ(Z) − ψ(Z′) = 0 P-a.s.; which in turn implies that ψ(a + ε) = ψ(b − ε). Then, letting ε go to 0, one derives that ψ(a) = ψ(b) (still keeping in mind the convention on atoms). Finally, this shows that ψ(Z) is P-a.s. constant.
(b) Set ψ = ϕ ∘ T so that ϕ and ψ have opposite monotonicity. Noting that X and X′ have the same distribution and applying claim (a) completes the proof. ♦
since, z d+1 being fixed, the functions z 1:d → ϕ(z 1:d , z d+1 ) and z 1:d → ψ(z 1:d , z d+1 )
satisfy the above joint marginal co-monotonicity assumption on Rd .
Now, the functions z_{d+1} ↦ ϕ(z_{1:d}, z_{d+1}) and z_{d+1} ↦ ψ(z_{1:d}, z_{d+1}) have the same monotonicity, not depending on z_{1:d}, so that Φ : z_{d+1} ↦ E ϕ(Z_{1:d}, z_{d+1}) and Ψ : z_{d+1} ↦ E ψ(Z_{1:d}, z_{d+1}) have the same monotonicity. Hence

∫_R E[ϕ(Z_{1:d}, z_{d+1})] E[ψ(Z_{1:d}, z_{d+1})] P_{Z_{d+1}}(dz_{d+1}) = E[ Φ(Z_{d+1}) Ψ(Z_{d+1}) ]
   ≥ E Φ(Z_{d+1}) E Ψ(Z_{d+1})
   = E ϕ(Z_{1:d+1}) E ψ(Z_{1:d+1}),
where we used Fubini’s Theorem twice (in a reverse way) in the last line. This
completes the proof. ♦
Remarks.
• This result may be successfully applied to functionals of the form f( X̄_{T/n}, …, X̄_{kT/n}, …, X̄_{nT/n} ) of the Euler scheme with step T/n of a one-dimensional Brownian diffusion with a non-decreasing drift and a deterministic strictly positive diffusion coefficient, provided f is “marginally monotonic”, i.e. monotonic in each of its variables with the same monotonicity. (We refer to Chap. 7 for an introduction to the Euler scheme of a diffusion.) The idea is to rewrite the functional as a
“marginally monotonic” function of the n (independent) Brownian increments which
play the role of the random variables Z i . Furthermore, passing to the limit as the step
size goes to zero yields some correlation results for a class of monotonic continuous
functionals defined on the canonical space C([0, T ], R) of the diffusion itself (the
monotonicity should be understood with respect to the naive pointwise partial order:
f ≤ g if f (t) ≤ g(t), t ∈ [0, T ]).
• The co-monotony inequality (3.4) is one of the most powerful tools to establish
lower bounds for expectations. For more insight about these kinds of co-monotony
properties and their consequences for the pricing of derivatives, we refer to [227].
Exercises. 1. A toy simulation. Let f and g be two functions defined on the real line by f(u) = u/√(u² + 1) and g(u) = tanh u, u ∈ R. Set

ϕ(z₁, z₂) = f( z₁ e^{a z₂} )   and   ψ(z₁, z₂) = g( z₁ e^{b z₂} )   with a, b > 0.
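Before attacking the exercise, the antithetic mechanism itself can be checked on an even simpler pair (a Python sketch with an assumed toy example, not the exercise's f and g): X = e^Z and X′ = e^{−Z} with Z ~ N(0, 1) are identically distributed and, since T(z) = −z is non-increasing, claim (b) above yields Cov(X, X′) ≤ 0:

```python
import numpy as np

# Antithetic variables: X = e^Z, X' = e^{-Z}, Z standard normal.
rng = np.random.default_rng(4)
Z = rng.standard_normal(500_000)
X, Xp = np.exp(Z), np.exp(-Z)

chi = 0.5 * (X + Xp)               # antithetic estimator of E[X] = e^{1/2}
cov = np.cov(X, Xp)[0, 1]          # exact value is 1 - e < 0
```

Here 2 Var(χ) < Var(X), so the antithetic estimator beats the crude one even after accounting for its doubled complexity.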
We return to the original situation of two square integrable random variables X and X′, having the same expectation

E X = E X′ = m.

We assume again that X and X′ are not identical in the sense that P(X ≠ X′) > 0, which turns out to be equivalent to
3.2 Regression-Based Control Variates 65
Var(X − X′) > 0.
We saw that if Var(X) < Var(X′), one will naturally choose X to implement the Monte Carlo simulation and we provided several classical examples in that direction. However, we will see that with a little more effort it is possible to improve upon this naive strategy.
This time we simply (and temporarily) set

Ξ := X − X′ (so that E Ξ = 0).

The idea is simply to parametrize the impact of the control variate by a factor λ, i.e. we set for every λ ∈ R,

X^λ := X − λΞ = (1 − λ)X + λX′.

The variance Var(X^λ) is a second-degree polynomial in λ, minimized at

λ_min := Cov(X, Ξ)/Var(Ξ) = E(XΞ)/E Ξ²
       = 1 + Cov(X′, Ξ)/Var(Ξ) = 1 + E(X′Ξ)/E Ξ².
Consequently, setting σ²_min := Var(X^{λ_min}), one gets

σ²_min ≤ min( Var(X), Var(X′) )

and σ²_min = Var(X) if and only if Cov(X, Ξ) = 0.

Remark. Note that Cov(X, Ξ) = 0 if and only if λ_min = 0, i.e. Var(X) = min_{λ∈R} Var(X^λ).

If we denote by ρ_{X,Ξ} the correlation coefficient between X and Ξ, one gets

σ²_min = Var(X)( 1 − ρ²_{X,Ξ} ) = Var(X′)( 1 − ρ²_{X′,Ξ} ),
where σ_X and σ_{X′} denote the standard deviations of X and X′, respectively, and ρ_{X,X′} is the correlation coefficient between X and X′.
Exercise. We go back to the toy example from Sect. 3.1 with ϕ(ξ) = (ξ − K)_+ (with x > K) and f(x) = E ϕ(X_T^x) (see Eq. (3.2)). In order to reduce the variance for short maturities of the estimators of the δ-hedge, we note that, for every λ, μ ∈ R,

f'(x) = λϕ'(x) + E[ ( ϕ'(X_T^x) − λϕ'(x) ) X_T^x/x ] = E[ ( ϕ(X_T^x) − μϕ(x) ) W_T/(xσT) ].

Apply the preceding to this example and implement it with x = 100, K = 95, σ = 0.5 and T ∈ (0, 1] (after reading the next section). Compare the numerical results with the “naive” variance reduction obtained by (3.3).
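The effect of the optimal coefficient λ_min can be checked on a generic toy pair (Python sketch, illustrative payoff): with X = (e^G − 1)_+ and the centered control Ξ = G, the regressed variable X − λ_min Ξ keeps the mean of X while its variance drops to Var(X)(1 − ρ²_{X,Ξ}):

```python
import numpy as np

# Linear regression control variate: lambda_min = Cov(X, Xi)/Var(Xi).
rng = np.random.default_rng(5)
G = rng.standard_normal(300_000)
X = np.maximum(np.exp(G) - 1.0, 0.0)
Xi = G                                     # centered control, E[Xi] = 0

cov = np.mean((X - X.mean()) * (Xi - Xi.mean()))
lam = cov / np.var(Xi)                     # empirical lambda_min
X_lam = X - lam * Xi                       # same mean, reduced variance

rho2 = cov**2 / (np.var(X) * np.var(Xi))   # squared correlation
```

The identity Var(X^{λ_min}) = Var(X)(1 − ρ²) holds exactly for the empirical moments, which the verification below exploits.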
Let (X_k, X′_k)_{k≥1} be an i.i.d. sequence of random vectors with the same distribution as (X, X′) and let λ ∈ R. Set, for every k ≥ 1,

Ξ_k = X_k − X′_k,  X_k^λ = X_k − λΞ_k,

V_M := (1/M) Σ_{k=1}^M Ξ_k²,  C_M := (1/M) Σ_{k=1}^M X_k Ξ_k

and

λ_M := C_M / V_M  (convention: λ_0 = 0),  (3.5)

so that

λ_M → λ_min P-a.s. as M → +∞.
This suggests introducing the batch estimator of m, defined for every size M ≥ 1 of the simulation by

X̄_M^{λ_M} = (1/M) Σ_{k=1}^M X_k^{λ_M}.
Note that

X̄_M^{λ_M} = (1/M) Σ_{k=1}^M X_k − λ_M (1/M) Σ_{k=1}^M Ξ_k
           = X̄_M − λ_M Ξ̄_M.  (3.6)

This estimator a.s. converges to m,

X̄_M^{λ_M} = (1/M) Σ_{k=1}^M X_k^{λ_M} →a.s. E X = m,

and satisfies a CLT (asymptotic normality) with an optimal asymptotic variance σ²_min, i.e.

√M ( X̄_M^{λ_M} − m ) →L N( 0 ; σ²_min ).
Remark. However, note that the batch estimator is a biased estimator of m since E( λ_M Ξ̄_M ) ≠ 0.

As for the a.s. convergence, one has

(1/M) Σ_{k=1}^M X_k^{λ_M} = X̄_M − λ_M Ξ̄_M →a.s. m − λ_min × 0 = m.

As for the CLT, one relies on

(1/√M) Σ_{k=1}^M Ξ_k →L N( 0 ; E Ξ² )

by the regular CLT applied to the centered square integrable i.i.d. random variables Ξ_k, k ≥ 1. Combining these two convergence results yields the announced CLT. ♦
Exercise. Let X̄_m and Ξ̄_m denote the empirical mean processes of the sequences (X_k)_{k≥1} and (Ξ_k)_{k≥1}, respectively. Show that the quadruplet (X̄_M, Ξ̄_M, C_M, V_M) can be computed in a recursive way from the sequence (X_k, X′_k)_{k≥1}. Derive a recursive way to compute the batch estimator.
ℵ Practitioner’s corner
One may proceed as follows:
– Recursive implementation: use the recursion satisfied by the sequence (X̄_k, Ξ̄_k, C_k, V_k)_{k≥1} to compute λ_M and the resulting batch estimator for each size M.
– True batch implementation: a first phase of the simulation, of size M′ ≪ M (say M′ ≈ 5% or 10% of the total budget M of the simulation), is devoted to a rough estimate λ_{M′} of λ_min, based on the Monte Carlo estimator (3.5). A second phase of the simulation computes the estimator of m defined by

(1/(M − M′)) Σ_{k=M′+1}^M X_k^{λ_{M′}}.
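This two-phase recipe can be sketched in a few lines (Python; the toy pair below is an assumed antithetic example with known expectation, not taken from the text):

```python
import numpy as np

# Two-phase batch implementation of the regression control variate.
rng = np.random.default_rng(6)
M, M1 = 200_000, 10_000                     # M1 ~ 5% of the budget
G = rng.standard_normal(M)
X = np.maximum(np.exp(G) - 1.0, 0.0)        # estimator samples
Xp = np.maximum(np.exp(-G) - 1.0, 0.0)      # same law (G symmetric), E X = E X'
Xi = X - Xp                                 # E[Xi] = 0

# Phase 1: rough estimate of lambda_min on the first M1 samples, as in (3.5)
lam1 = np.mean(X[:M1] * Xi[:M1]) / np.mean(Xi[:M1] ** 2)

# Phase 2: frozen coefficient on the remaining budget
est = np.mean(X[M1:] - lam1 * Xi[M1:])

ref = np.exp(0.5) * 0.8413447460685429 - 0.5   # closed form E[(e^G - 1)_+]
```

Since Var(X) = Var(X′) here, λ_min = 1/2 and the method boils down to the antithetic estimator of the previous section.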
Theorem 3.2 Assume X, X′ ∈ L^{2+δ}(P) for some δ > 0. Let (X_k, X′_k)_{k≥1} be an i.i.d. sequence with the same distribution as (X, X′). We set, for every k ≥ 1,

X̃_k = X_k − λ̃_{k−1} Ξ_k = (1 − λ̃_{k−1}) X_k + λ̃_{k−1} X′_k,  where λ̃_k = (−k) ∨ (λ_k ∧ k).

Then the estimator

X̃_M = (1/M) Σ_{k=1}^M X̃_k

is unbiased ( E X̃_M = m ), convergent, i.e.

X̃_M →a.s. m as M → +∞,
What follows in this section can be omitted on first reading, although the method of proof exposed below is quite standard when dealing with the efficiency of an estimator by martingale methods. By the preceding,

E X̃_k² ≤ 2( E X_k² + E λ̃_{k−1}² E Ξ_k² ) < +∞,

where we used that Ξ_k and λ̃_{k−1} are independent and |λ̃_{k−1}| ≤ k − 1. Finally, for every k ≥ 1,
E( X̃_k | F_{k−1} ) = E( X_k | F_{k−1} ) − λ̃_{k−1} E( Ξ_k | F_{k−1} ) = m.
N_k := Σ_{ℓ=1}^k ( X̃_ℓ − m )/ℓ.
It follows from the preceding that the sequence (N_k)_{k≥1} is a square integrable (F_k, P)-martingale since ( (X̃_k − m)/k )_{k≥1} is a sequence of square integrable (F_k, P)-martingale increments. Its conditional variance increment process (also known as the “bracket process”) ⟨N⟩_k, k ≥ 1, is given by

⟨N⟩_k = Σ_{ℓ=1}^k E( (X̃_ℓ − m)² | F_{ℓ−1} ) / ℓ² = Σ_{ℓ=1}^k σ²(λ̃_{ℓ−1}) / ℓ²,

where σ²(λ) := Var(X − λΞ) denotes the variance function minimized at λ_min.
Now, the above series is a.s. convergent since σ²(λ̃_k) a.s. converges towards σ²(λ_min) as k → +∞, because λ̃_k a.s. converges toward λ_min and σ²(·) is continuous. Consequently,
(1/M) Σ_{k=1}^M ( X̃_k − m ) →a.s. 0 as M → +∞,

i.e.

X̃_M := (1/M) Σ_{k=1}^M X̃_k →a.s. m as M → +∞.

To establish the CLT, set

X_{M,k} := ( X̃_k − m )/√M,  1 ≤ k ≤ M.
There are essentially two assumptions to be checked. First, the convergence of the conditional variance increment process toward σ²_min:

Σ_{k=1}^M E( X_{M,k}² | F_{k−1} ) = (1/M) Σ_{k=1}^M E( (X̃_k − m)² | F_{k−1} )
                                  = (1/M) Σ_{k=1}^M σ²(λ̃_{k−1})
                                  → σ²_min := min_λ σ²(λ).
The second one is the so-called Lindeberg condition (see again Theorem 12.8 or [142], p. 58), which reads in our framework:

∀ ε > 0,  Σ_{ℓ=1}^M E( X_{M,ℓ}² 1_{{|X_{M,ℓ}|>ε}} | F_{ℓ−1} ) →P 0.

In turn, owing to the conditional Markov inequality and the definition of X_{M,ℓ}, this condition classically follows from the slightly stronger one: there exists a real number δ > 0 such that

sup_{ℓ≥1} E( |X̃_ℓ − m|^{2+δ} | F_{ℓ−1} ) < +∞ P-a.s.

since

Σ_{ℓ=1}^M E( X_{M,ℓ}² 1_{{|X_{M,ℓ}|>ε}} | F_{ℓ−1} ) ≤ ( 1/(ε^δ M^{1+δ/2}) ) Σ_{ℓ=1}^M E( |X̃_ℓ − m|^{2+δ} | F_{ℓ−1} ).

Now, using that (u + v)^{2+δ} ≤ 2^{1+δ}( u^{2+δ} + v^{2+δ} ), u, v ≥ 0, and the fact that X, X′ ∈ L^{2+δ}(P), one gets

E( |X̃_ℓ − m|^{2+δ} | F_{ℓ−1} ) ≤ 2^{1+δ}( E|X_ℓ − m|^{2+δ} + |λ̃_{ℓ−1}|^{2+δ} E|Ξ|^{2+δ} ).
This means that the expected variance reduction does occur if one implements
the recursive approach described above. ♦
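The recursive estimator of Theorem 3.2 can be sketched as follows (Python, same assumed toy pair as before): λ is re-estimated from the past samples only and clipped at level ±k, which is what preserves unbiasedness:

```python
import numpy as np

# Adaptive (recursive) regression control variate of Theorem 3.2.
rng = np.random.default_rng(7)
M = 100_000
G = rng.standard_normal(M)
X = np.maximum(np.exp(G) - 1.0, 0.0)
Xp = np.maximum(np.exp(-G) - 1.0, 0.0)     # same law as X, E X = E X'
Xi = X - Xp                                # E[Xi] = 0

C = V = S = 0.0
lam_prev = 0.0                             # tilde-lambda_0 = 0
for k in range(M):
    S += X[k] - lam_prev * Xi[k]           # uses lambda from the past only
    C += X[k] * Xi[k]
    V += Xi[k] ** 2
    lam = C / V if V > 0 else 0.0
    lam_prev = max(-(k + 1), min(lam, k + 1))   # truncation at level k
est = S / M

ref = np.exp(0.5) * 0.8413447460685429 - 0.5    # closed form E[(e^G - 1)_+]
```

Unlike the batch estimator, each term X_k − λ̃_{k−1}Ξ_k has conditional mean m, so the running average is exactly unbiased.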
The variance reduction by regression introduced in the former section still relies on the fact that κ_X ≈ κ_{X−λΞ} or, equivalently, that the additional complexity induced by the simulation of Ξ given that of X is negligible. This condition may look demanding but we will see that in the framework of derivative pricing this requirement is always fulfilled as soon as the payoff of interest satisfies a so-called parity equation, i.e. the original payoff can be duplicated by a “synthetic” version.
Furthermore, these parity equations are model free so they can be applied for
various specifications of the dynamics of the underlying asset.
In this section, we denote by (S_t)_{t≥0} the price of the risky asset (with S_0 = s_0 > 0) and set S_t^0 = e^{rt}, the riskless asset. We work under the risk-neutral probability P (supposed to exist), which means that

(e^{−rt} S_t)_{t∈[0,T]} is a martingale on the scenarii space (Ω, A, P)

(with respect to the augmented filtration of (S_t)_{t∈[0,T]}). Furthermore, to comply with the usual assumptions of AOA theory, we will assume that this risk-neutral probability is unique (complete market) to justify that we may price any derivative under this probability. However, this has no real impact on what follows.
the premium of this Call and this Put option, respectively. Since
(S_T − K)_+ − (K − S_T)_+ = S_T − K

and (e^{−rt} S_t)_{t∈[0,T]} is a martingale, one derives the classical Call–Put parity equation:

Call_0 − Put_0 = s_0 − e^{−rT} K,

which turns out to rely on the terminal value e^{−rT}S_T − s_0 of a martingale null at time 0 (this is in fact the generic situation of application of this parity method).
3.3 Application to Option Pricing: Using Parity Equations to Produce Control Variates 73
Note that the simulation of X involves that of S_T, so that the additional cost of the simulation of Ξ is definitely negligible.
and

Put_0^{As} = e^{−rT} E( K − (1/(T − T_0)) ∫_{T_0}^T S_t dt )_+,

so that

Call_0^{As} = E(X) = E(X′)

with

X := e^{−rT}( (1/(T − T_0)) ∫_{T_0}^T S_t dt − K )_+

and

X′ := s_0 (1 − e^{−r(T−T_0)}) / (r(T − T_0)) − e^{−rT} K + e^{−rT}( K − (1/(T − T_0)) ∫_{T_0}^T S_t dt )_+.

This leads to

Ξ = e^{−rT} (1/(T − T_0)) ∫_{T_0}^T S_t dt − s_0 (1 − e^{−r(T−T_0)}) / (r(T − T_0)).
Remark. In both cases, the parity equation directly follows from the P-martingale property of

S̃_t = e^{−rt} S_t.
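For a vanilla Call in a Black–Scholes model, the parity control variate can be sketched as follows (Python, illustrative parameters): since e^{−rT}S_T − s_0 has zero mean under the risk-neutral measure, it can be regressed out of the discounted payoff at no extra simulation cost:

```python
import numpy as np
from math import log, sqrt, exp, erf

# Call-put parity control variate in a Black-Scholes model.
rng = np.random.default_rng(8)
s0, K, r, sigma, T, M = 100.0, 95.0, 0.05, 0.5, 1.0, 300_000

G = rng.standard_normal(M)
S_T = s0 * np.exp((r - 0.5 * sigma**2) * T + sigma * sqrt(T) * G)
X = np.exp(-r * T) * np.maximum(S_T - K, 0.0)   # discounted call payoff
Xi = np.exp(-r * T) * S_T - s0                  # martingale term, E[Xi] = 0

lam = np.mean((X - X.mean()) * Xi) / np.mean(Xi**2)
price = np.mean(X - lam * Xi)

# Black-Scholes closed form, for reference
N0 = lambda u: 0.5 * (1 + erf(u / sqrt(2)))
d1 = (log(s0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
ref = s0 * N0(d1) - K * exp(-r * T) * N0(d1 - sigma * sqrt(T))
```

The control Ξ is computed from the same draw of S_T, so the regression comes almost for free, exactly as discussed below.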
In practical implementations, one often neglects the cost of the computation of λmin
since only a rough estimate is computed: this leads us to stop its computation after
the first 5% or 10% of the simulation.
– However, one must be aware that the case of the existence of parity equations is quite specific since the random variable Ξ is involved in the simulation of X, so the complexity of the simulation process is not increased: thus in the recursive approach the updating of λ_M and of (the empirical mean) X̄_M is (almost) costless. Similar observations can be made to some extent on batch approaches. As a consequence, in that specific setting, the complexity of the adaptive linear regression procedure and that of the original one are (almost) the same!
– Warning! This is no longer true in general… and in a general setting the complexity of the simulation of X and X′ is double that of X itself. Then the regression method is efficient if and only if

σ²_min < ½ min( Var(X), Var(X′) )
(provided one neglects the cost of the estimation of the coefficient λmin ).
The exercise below shows the connection with antithetic variables which then
appears as a special case of regression methods.
Exercise (Connection with the antithetic variable method). Let X, X′ ∈ L²(P) such that E X = E X′ = m and Var(X) = Var(X′).

(a) Show that λ_min = 1/2.

(b) Show that X^{λ_min} = (X + X′)/2 and Var( (X + X′)/2 ) = ½( Var(X) + Cov(X, X′) ).

Characterize the pairs (X, X′) for which the regression method does reduce the variance. Make the connection with the antithetic method.
Asian Calls in a Heston model (See Figs. 3.5, 3.6 and 3.7)
The dynamics of the risky asset is this time a stochastic volatility model, namely the Heston model, defined as follows. Let ϑ, k, a > 0 be such that ϑ²/(2ak) ≤ 1 (so that v_t remains a.s. positive, see [183], Proposition 6.2.4, p. 130):

dS_t = S_t( r dt + √v_t dW_t^1 ),  S_0 = s_0 > 0, t ∈ [0, T],  (risky asset)

dv_t = k(a − v_t) dt + ϑ √v_t dW_t^2,  v_0 > 0,  (stochastic volatility)

where (W^1, W^2) is a two-dimensional Brownian motion with correlation d⟨W^1, W^2⟩_t = ρ dt, ρ ∈ [−1, 1].
Usually, no closed forms are available for Asian payoffs, even in the Black–
Scholes model, and this is also the case in the Heston model. Note however that
(quasi-)closed forms do exist for vanilla European options in this model (see [150]),
which is the origin of its success. The simulation has been carried out by replacing
the above diffusion by an Euler scheme (see Chap. 7 for an introduction to the Euler
time discretization scheme). In fact, the dynamics of the stochastic volatility process
does not fulfill the standard Lipschitz continuous assumptions required to make the
Euler scheme converge, at least at its usual rate. In the present case it is even difficult to define this scheme because of the term √v_t. Since our purpose here is to illustrate
Fig. 3.2 Black–Scholes Calls: Error = Reference BS − (MC Premium), K = 90, …, 120. –o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated Synthetic Call
Fig. 3.3 Black–Scholes Calls: K → 1 − λ_min(K), K = 90, …, 120, for the Interpolated Synthetic Call
Fig. 3.4 Black–Scholes Calls. Standard Deviation (MC Premium). K = 90, . . . , 120. –o–o–o–
Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call
Fig. 3.5 Heston Asian Calls. Standard Deviation (MC Premium). K = 90, . . . , 120. –o–o–o–
Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated Synthetic Call
Fig. 3.6 Heston Asian Calls. K → 1 − λmin (K ), K = 90, . . . , 120, for the Interpolated Syn-
thetic Asian Call
S̄_{kT/n} = S̄_{(k−1)T/n}( 1 + rT/n + √(|v̄_{(k−1)T/n}| T/n) ( ρ Z_k^2 + √(1 − ρ²) Z_k^1 ) ),  S̄_0 = s_0 > 0,

v̄_{kT/n} = v̄_{(k−1)T/n} + (T/n) k( a − v̄_{(k−1)T/n} ) + ϑ √(|v̄_{(k−1)T/n}| T/n) Z_k^2,  v̄_0 = v_0 > 0,
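This Euler scheme, with |v̄| inside the square root to keep it well defined, can be sketched as follows (Python; illustrative parameters satisfying ϑ² ≤ 2ak, with r = 0 so that the simulated asset should remain centered at s₀):

```python
import numpy as np

# Euler scheme for the Heston model, |v| under the square root.
rng = np.random.default_rng(10)
s0, v0, r = 100.0, 0.04, 0.0
kappa, a, theta, rho = 2.0, 0.04, 0.3, -0.5    # theta^2 <= 2*a*kappa holds
T, n, M = 1.0, 100, 20_000
dt = T / n

S = np.full(M, s0)
v = np.full(M, v0)
for _ in range(n):
    Z1 = rng.standard_normal(M)
    Z2 = rng.standard_normal(M)
    dW = np.sqrt(np.abs(v) * dt)               # uses the previous step's v
    S = S * (1 + r * dt + dW * (rho * Z2 + np.sqrt(1 - rho**2) * Z1))
    v = v + kappa * (a - v) * dt + theta * dW * Z2
```

With r = 0 each multiplicative factor has conditional mean 1, so S̄ is a discrete martingale and E S̄_T = s₀; likewise the mean reversion keeps E v̄_T = a when v₀ = a.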
Fig. 3.7 Heston Asian Calls. M = 106 (Reference: MC with M = 108 ). K = 90, . . . , 120.
–o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call
1. Consider a vanilla Call with strike K = 80. The random variable Ξ is defined as above. Estimate λ_min (it should not be too far from 0.825). Then, compute a confidence interval for the Monte Carlo pricing of the Call with and without the linear variance reduction for the following sizes of the simulation: M = 5 000, 10 000, 100 000, 500 000.
2. Proceed as above but with K = 150 (true price 1.49). What do you observe?
Provide an interpretation.
E X = m ∈ R^d,  E Ξ = 0 ∈ R^q.

Let D(X) := [Cov(X^i, X^j)]_{1≤i,j≤d} and D(Ξ) denote the covariance (dispersion) matrices of X and Ξ, respectively. Assume
where
C(X, Ξ) = [Cov(X^i, Ξ^j)]_{1≤i≤d, 1≤j≤q}.
3.4 Pre-conditioning
and
Var( E(X | B) ) = E( E(X | B)² ) − (E X)² ≤ E X² − (E X)² = Var(X)   (3.7)
This shows that the above inequality (3.7) is strict except if X is B-measurable, i.e. it is strict in any case of interest.
The archetypal situation is the following. Assume that

X = g(Z¹, Z²),  g ∈ L²( R², Bor(R²), P_{(Z¹,Z²)} ),

where Z¹, Z² are independent random vectors. Set B := σ(Z²). Then standard results on conditional expectations show that

E X = E G(Z²)  where  G(z₂) = E( g(Z¹, Z²) | Z² = z₂ ) = E g(Z¹, z₂).
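A minimal pre-conditioning check (Python, with an assumed toy integrand rather than an option payoff): for X = (Z¹ + Z²)_+ with independent standard normals, G(z₂) = E(Z¹ + z₂)_+ = z₂Φ(z₂) + φ(z₂) in closed form, and G(Z²) has the same mean as X but a strictly smaller variance:

```python
import numpy as np
from math import erf, sqrt, pi

# Pre-conditioning: replace X = (Z1+Z2)_+ by G(Z2) = E[X | Z2].
rng = np.random.default_rng(9)
M = 200_000
Z1, Z2 = rng.standard_normal(M), rng.standard_normal(M)

X = np.maximum(Z1 + Z2, 0.0)                      # crude samples

Phi = np.vectorize(lambda u: 0.5 * (1 + erf(u / sqrt(2))))
phi = lambda u: np.exp(-u**2 / 2) / sqrt(2 * pi)
G = Z2 * Phi(Z2) + phi(Z2)                        # G(z2) = E[(Z1 + z2)_+]
```

The gap Var(X) − Var(G(Z²)) = E Var(X | Z²) quantifies the variance removed by the conditioning.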
h_T = ( X_T^1 − X_T^2 − K )_+.

Finally, one takes advantage of the closed form available for vanilla Call options in a Black–Scholes model to compute

Premium_BS(x₁, x₂, K, σ₁, σ₂, r, T) = E[ E( e^{−rT} h_T | Z² ) ].
The starting idea of stratification is to localize the Monte Carlo method on the
elements of a measurable partition of the state space E of a random variable
X : (, A, P) → (E, E).
Let (Ai )i∈I be a finite E-measurable partition of the state space E. The Ai ’s are
called strata and (Ai )i∈I a stratification of E. Assume that the weights
pi = P(X ∈ Ai ), i ∈ I,
L( X | X ∈ A_i ) =d ϕ_i(U),
where U is uniformly distributed over [0, 1]^{r_i} (with r_i ∈ N ∪ {∞}, the case r_i = +∞ corresponding to the acceptance–rejection method) and ϕ_i : [0, 1]^{r_i} → E is an (easily) computable function. This second condition simply means that the conditional
distribution L(X | X ∈ Ai ) is easy to simulate on a computer. To be more precise we
implicitly assume in what follows that the simulation of X and of the conditional
distributions L(X | X ∈ Ai ), i ∈ I , or, equivalently, the random vectors ϕi (Ui ), have
approximately the same complexity. One must always keep this in mind since it is a
major constraint for practical implementations of stratification methods.
This simulability condition usually has a strong impact on the possible design of
the strata. For convenience, we will assume in what follows that ri = r .
Let F : (E, E) → (R, Bor(R)) be such that E F²(X) < +∞. By elementary conditioning, we get
E F(X) = Σ_{i∈I} E[ 1_{{X∈A_i}} F(X) ] = Σ_{i∈I} p_i E( F(X) | X ∈ A_i ) = Σ_{i∈I} p_i E F( ϕ_i(U_i) ),
i∈I i∈I i∈I
where the random variables Ui , i ∈ I , are i.i.d. with uniform distribution over [0, 1]r .
This is where the stratification idea is introduced. Let M be the global “budget”
allocated to the simulation of E F(X ). We split this budget into |I | groups by setting
Mi = qi M, i ∈ I,
to be the allocated budget to compute E F(ϕi (U )) in each stratum “Ai ”. This leads
us to define the following (unbiased) estimator
F̂(X)_M := Σ_{i∈I} p_i (1/M_i) Σ_{k=1}^{M_i} F( ϕ_i(U_i^k) ),
where (Uik )1≤k≤Mi ,i∈I are uniformly distributed on [0, 1]r , i.i.d. random variables
(with an abuse of notation since the estimator actually depends on all the Mi ). Then,
elementary computations show that
Var( F̂(X)_M ) = (1/M) Σ_{i∈I} (p_i²/q_i) σ²_{F,i},

and one aims at solving

min_{(q_i)∈S_I} Σ_{i∈I} (p_i²/q_i) σ²_{F,i},  where S_I := { (q_i)_{i∈I} ∈ (0, 1)^{|I|} : Σ_{i∈I} q_i = 1 }.
q_i = p_i, i ∈ I.

Such a choice is first motivated by the fact that the weights p_i are known and of course because it does reduce the variance since

Σ_{i∈I} (p_i²/q_i) σ²_{F,i} = Σ_{i∈I} p_i σ²_{F,i}
= Σ_{i∈I} E[ 1_{{X∈A_i}} ( F(X) − E( F(X) | X ∈ A_i ) )² ]
= ‖ F(X) − E( F(X) | A_I^X ) ‖₂²,  (3.8)

where A_I^X = σ( {X ∈ A_i}, i ∈ I ) denotes the σ-field spanned by the measurable partition {X ∈ A_i}, i ∈ I, of Ω and

E( F(X) | A_I^X ) = Σ_{i∈I} E( F(X) | X ∈ A_i ) 1_{{X∈A_i}}.
Consequently

Σ_{i∈I} (p_i²/q_i) σ²_{F,i} ≤ ‖ F(X) − E F(X) ‖₂² = Var( F(X) )   (3.9)

with equality if and only if E( F(X) | A_I^X ) = E F(X) P-a.s. Or, equivalently, equality holds in this inequality if and only if

E( F(X) | X ∈ A_i ) = E F(X), i ∈ I.
So this choice always reduces the variance of the estimator since we assumed that the
stratification is not trivial. It corresponds in the opinion poll world to the so-called
quota method.
Optimal choice. The optimal choice is a solution to the above constrained min-
imization problem. It follows from a simple application of the Schwarz Inequality
(and its equality case) that
Σ_{i∈I} p_i σ_{F,i} = Σ_{i∈I} ( p_i σ_{F,i}/√q_i ) √q_i
≤ ( Σ_{i∈I} p_i² σ²_{F,i}/q_i )^{1/2} ( Σ_{i∈I} q_i )^{1/2}
= ( Σ_{i∈I} p_i² σ²_{F,i}/q_i )^{1/2}.

Consequently, the optimal choice for the allocation parameters q_i, i.e. the solution to the above constrained minimization problem, is given by

q*_{F,i} = p_i σ_{F,i} / Σ_j p_j σ_{F,j},  i ∈ I,  (3.10)

for which the minimal variance of the stratified estimator is (1/M)( Σ_{i∈I} p_i σ_{F,i} )².
At this stage the problem is that, unlike the weights p_i, the local inertias σ²_{F,i} are not known, which makes the implementation less straightforward and sometimes questionable.
Some attempts have been made to circumvent this problem, see e.g. [86] for a
recent reference based on an adaptive procedure for the computation of the local
F-inertia σ 2F,i .
However, using that the L^p-norms with respect to a probability measure are non-decreasing in p, one derives that

σ_{F,i} = E[ ( F(X) − E( F(X) | X ∈ A_i ) )² | X ∈ A_i ]^{1/2}
≥ E[ | F(X) − E( F(X) | X ∈ A_i ) | | X ∈ A_i ]
= E[ 1_{{X∈A_i}} | F(X) − E( F(X) | X ∈ A_i ) | ] / p_i.
When compared to the resulting variance in (3.9) obtained with the suboptimal
choice qi = pi , this illustrates the magnitude of the gain that can be expected from the
3.5 Stratified Sampling 85
optimal choice q_i = q*_{F,i}: it lies in between ‖ F(X) − E( F(X) | A_I^X ) ‖₁² and ‖ F(X) − E( F(X) | A_I^X ) ‖₂².
Examples. Stratifications for the computation of E F(X), X =d N(0; I_d), d ≥ 1.
(a) Stripes. Let v be a fixed unitary vector (a simple and natural choice for v is v =
e1 = (1, 0, 0, . . . , 0): it is natural to define the strata as hyper-stripes perpendicular
to the main axis Re1 of X ). So, we set, for a given size N of the stratification
(I = {1, . . . , N }),
$$A_i:=\big\{x\in\mathbb R^d\ \text{s.t.}\ (v|x)\in[y_{i-1},y_i]\big\},\qquad i=1,\dots,N,$$
$$p_i=\mathbb P(X\in A_i)=\mathbb P\big(Z\in[y_{i-1},y_i]\big)=\Phi_0(y_i)-\Phi_0(y_{i-1})=\frac1N,$$
where $\Phi_0$ denotes the c.d.f. of the $\mathcal N(0;1)$-distribution. Other choices are possible for the $y_i$, leading to a non-uniform distribution of the $p_i$'s. The simulation of the
conditional distributions follows from the fact that
$$\mathcal L\big(X\,\big|\,(v|X)\in[a,b]\big)=\mathcal L\big(\xi_1 v+\pi_{v^\perp}(\xi_2)\big),$$
where $\xi_1\sim\mathcal L(Z\,|\,Z\in[a,b])$ is independent of $\xi_2\sim\mathcal N(0;I_{d-1})$,
$$\mathcal L(Z\,|\,Z\in[a,b])=\mathcal L\Big(\Phi_0^{-1}\big((\Phi_0(b)-\Phi_0(a))U+\Phi_0(a)\big)\Big),\qquad U\sim\mathcal U([0,1]),$$
and $\pi_{v^\perp}$ denotes the orthogonal projection on $v^\perp$. When $v=e_1$, this reads simply as $\mathcal L\big(X\,|\,X^1\in[a,b]\big)=\mathcal L(\xi_1)\otimes\mathcal N(0;I_{d-1})$.
(b) Hyper-rectangles. We still consider $X=(X^1,\dots,X^d)\sim\mathcal N(0;I_d)$, $d\ge2$. Let $(e_1,\dots,e_d)$ denote the canonical basis of $\mathbb R^d$. We define the strata as hyper-rectangles. Let $N_1,\dots,N_d\ge1$ and set
$$A_{\mathbf i}:=\Big\{x\in\mathbb R^d\ \text{s.t.}\ (e_\ell|x)\in\big[y^\ell_{i_\ell-1},y^\ell_{i_\ell}\big],\ \ell=1,\dots,d\Big\},\qquad \mathbf i\in\prod_{\ell=1}^d\{1,\dots,N_\ell\},$$
where the $y^\ell_i\in\overline{\mathbb R}$ are defined by $\Phi_0(y^\ell_i)=\frac{i}{N_\ell}$, $i=0,\dots,N_\ell$. Then, for every multi-index $\mathbf i=(i_1,\dots,i_d)\in\prod_{\ell=1}^d\{1,\dots,N_\ell\}$,
$$\mathcal L(X\,|\,X\in A_{\mathbf i})=\bigotimes_{\ell=1}^d\mathcal L\big(Z\,\big|\,Z\in[y^\ell_{i_\ell-1},y^\ell_{i_\ell}]\big).\qquad(3.11)$$
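Formula (3.11) reduces the simulation of $\mathcal L(X\,|\,X\in A_{\mathbf i})$ to $d$ independent one-dimensional inverse-cdf draws. A minimal sketch (the function name and grid sizes are our own illustrative choices):

```python
import random
from statistics import NormalDist

def draw_in_stratum(i, Ns, rng):
    """One draw of L(X | X in A_i), X ~ N(0, I_d), following (3.11):
    coordinate l is an inverse-cdf draw of L(Z | Z in [y^l_{i_l-1}, y^l_{i_l}]),
    where the edges satisfy Phi0(y^l_j) = j / N_l."""
    nd = NormalDist()
    x = []
    for i_l, N_l in zip(i, Ns):
        a, b = (i_l - 1) / N_l, i_l / N_l      # stratum edges on the probability scale
        u = a + (b - a) * rng.random()
        x.append(nd.inv_cdf(min(max(u, 1e-12), 1 - 1e-12)))
    return x

x = draw_in_stratum((2, 3), (4, 4), random.Random(1))  # stratum (2,3) of a 4x4 grid
```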
Optimizing the allocation to each stratum in the simulation for a given function F
in order to reduce the variance is of course interesting and can be highly efficient but
with the drawback of being strongly F-dependent, especially when this allocation
needs an extra procedure like in [86]. An alternative and somewhat dual approach is
to try optimizing the strata themselves uniformly with respect to a class of functions
F (namely Lipschitz continuous functions) prior to the allocation across the strata.
This approach emphasizes the connections between stratification and optimal
quantization and provides bounds on the best possible variance reduction factor that
can be expected from a stratification. Some elements are provided in Chap. 5, see
also [70] for further developments in infinite dimensions.
In practice, we will have to simulate several random variables whose distributions are all absolutely continuous with respect to a reference measure μ on a measurable space (E, E). For a first reading, one
may assume that E = Rd and μ is the Lebesgue measure λd , but what follows can
also be applied to more general measure spaces like the Wiener space (equipped
with the Wiener measure), or countable sets (with the counting measure), etc. Let
$h\in L^1(\mathbb P_X)$. Then,
$$\mathbb E\,h(X)=\int_E h(x)\,\mathbb P_X(dx)=\int_E h(x)f(x)\,\mu(dx).$$
Now, for any μ-a.s. positive probability density function g defined on (E, E) (with
respect to μ), one has
$$\mathbb E\,h(X)=\int_E h(x)f(x)\,\mu(dx)=\int_E\frac{h(x)f(x)}{g(x)}\,g(x)\,\mu(dx).$$
One can always enlarge (if necessary) the original probability space (Ω, A, P) to design a random variable $Y:(\Omega,\mathcal A,\mathbb P)\to(E,\mathcal E)$ having $g$ as a probability density with respect to μ. Then, going back to the probability space, one gets, for every non-negative or $\mathbb P_X$-integrable function $h:E\to\mathbb R$,
$$\mathbb E\,h(X)=\mathbb E\left[\frac{h(Y)f(Y)}{g(Y)}\right].\qquad(3.12)$$
So, in order to compute $\mathbb E\,h(X)$, one may also implement a Monte Carlo simulation based on independent copies $(Y_k)_{k\ge1}$ of the random variable $Y$, i.e.
$$\mathbb E\,h(X)=\mathbb E\left[\frac{h(Y)f(Y)}{g(Y)}\right]=\lim_{M\to+\infty}\frac1M\sum_{k=1}^{M}h(Y_k)\,\frac{f(Y_k)}{g(Y_k)}\quad a.s.$$
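The limit above is the whole algorithm. The sketch below wraps it in a generic helper and applies it to a toy rare-event target of our own (not from the text): $X\sim\mathrm{Exp}(1)$, $h=\mathbf 1_{\{x>5\}}$, so $\mathbb E\,h(X)=e^{-5}$; the proposal $Y\sim\mathrm{Exp}(0.2)$ visits the event far more often, and $\frac fg(x)=5e^{-0.8x}$.

```python
import math
import random

def is_estimate(h, ratio, sample_y, M, rng):
    """Importance-sampling estimator of E[h(X)] = E[h(Y) f(Y)/g(Y)]:
    average h(Y_k) * (f/g)(Y_k) over M independent copies of Y."""
    return sum(h(y) * ratio(y) for y in (sample_y(rng) for _ in range(M))) / M

est = is_estimate(
    h=lambda x: 1.0 if x > 5 else 0.0,
    ratio=lambda x: 5.0 * math.exp(-0.8 * x),   # f/g for Exp(1) target, Exp(0.2) proposal
    sample_y=lambda r: r.expovariate(0.2),
    M=20_000,
    rng=random.Random(42),
)
```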
ℵ Practitioner’s corner.
Practical requirements (to undertake the simulation). To proceed, it is necessary to simulate independent copies of $Y$ and to compute the ratio of density functions $f/g$ at a reasonable cost. Note that only the ratio is needed, which makes the computation of some “structural” constants like $(2\pi)^{d/2}$ unnecessary, e.g. when both $f$ and $g$ are Gaussian densities with different means (see below). By “reasonable cost” for the
simulation of Y , we mean at the same cost as that of X (in terms of complexity). As
concerns the ratio $f/g$, this means that its computational cost remains negligible with respect to that of $h$ or, which may be the case in some slightly different situations, that the computational cost of $h\times\frac fg$ is equivalent to that of $h$ alone.
Sufficient conditions (to undertake the simulation). Once the above conditions
are fulfilled, the question is: is it profitable to proceed like this? This is the case
if the complexity of the simulation for a given accuracy (in terms of confidence
interval) is lower with the second method. If one assumes as above that simulating
X and Y on the one hand, and computing h(x) and (h f /g)(x) on the other hand
are both comparable in terms of complexity, the question amounts to comparing the
variances or, equivalently, the squared quadratic norm of the estimators since they
have the same expectation E h(X ).
Now
$$\mathbb E\left[\left(\frac{h(Y)f(Y)}{g(Y)}\right)^{\!2}\right]=\mathbb E\left[\left(\frac{hf}{g}\right)^{\!2}(Y)\right]=\int_E\left(\frac{h(x)f(x)}{g(x)}\right)^{\!2}g(x)\,\mu(dx)=\int_E h(x)^2 f(x)\,\frac{f(x)}{g(x)}\,\mu(dx)=\mathbb E\left[h(X)^2\,\frac fg(X)\right].$$
As a consequence, simulating $\frac{hf}{g}(Y)$ rather than $h(X)$ will reduce the variance if and only if
$$\mathbb E\left[h(X)^2\,\frac fg(X)\right]<\mathbb E\,h(X)^2.\qquad(3.13)$$
On the other hand, for any admissible $g$, the Schwarz inequality yields
$$\mathbb E\left[h(X)^2\,\frac fg(X)\right]=\int_E\frac{\big(h(x)f(x)\big)^2}{g(x)}\,\mu(dx)\;\ge\;\left(\int_E|h(x)|f(x)\,\mu(dx)\right)^{\!2}=\big(\mathbb E\,|h(X)|\big)^2,$$
since $g$ is a probability density function. Now the equality case in the Schwarz inequality says that the variance is 0 if and only if $\sqrt{g(x)}$ and $\frac{h(x)f(x)}{\sqrt{g(x)}}$ are proportional $\mu(dx)$-a.s., i.e. $h(x)f(x)=c\,g(x)$ $\mu(dx)$-a.s. for a (non-negative) real constant $c$.
Finally, when $h$ has a constant sign and $\mathbb E\,h(X)\ne0$, this leads to
$$g(x)=\frac{h(x)}{\mathbb E\,h(X)}\,f(x)\quad \mu(dx)\text{-}a.s.$$
This choice is clearly impossible to make since it would mean that $\mathbb E\,h(X)$ is known, as it is involved in the formula… and the method would then be of no use. A contrario, this may suggest a direction for designing the distribution of $Y$.
The intuition that must guide practitioners when designing an importance sampling method is to replace the random variable $X$ by a random variable $Y$ so that $\frac{hf}{g}(Y)$ is in some way often “closer” than $h(X)$ to their common mean. Let us be more specific.
We consider a Call on the risky asset $(X_t)_{t\in[0,T]}$ with strike price $K$ and maturity $T>0$ (with interest rate $r\equiv0$ for simplicity). If $X_0=x\ll K$, i.e. the option is deep out-of-the-money at the origin of time, then most of the scenarii $X_T(\omega)$ will satisfy $X_T(\omega)\le K$ or, equivalently, $(X_T(\omega)-K)_+=0$. In such a setting, the event $\{(X_T-K)_+>0\}$ – the payoff is positive – is a rare event, so that the number of scenarii producing a non-zero value for $(X_T-K)_+$ will be small, inducing too rough an estimate of the quantity of interest $\mathbb E\,(X_T-K)_+$. Put in a more quantitative way, it means that, even if the expectation and the standard deviation of the payoff are both small in absolute value, their ratio (standard deviation over expectation) will be very large.
and we can reasonably hope that we will simulate more significant scenarii for $(Y_T-K)_+\,\frac{f_{X_T}}{g_{Y_T}}(Y_T)$ than for $(X_T-K)_+$. This effect will be measured by the variance reduction.
This interpretation in terms of “rare events” is in fact the core of importance
sampling, more than the plain “variance reduction” feature. In particular, this is what
a practitioner must have in mind when searching for a “good” probability distribution
g: importance sampling is more a matter of “focusing light where it is needed” than
reducing variance.
When dealing with vanilla options in simple models (typically local volatility),
one usually works on the state space E = R+ and importance sampling amounts to
a change of variable in one-dimensional integrals as emphasized above. However, in
more involved frameworks, one considers the scenarii space as a state space, typically $E=\Omega=\mathcal C(\mathbb R_+,\mathbb R^d)$, and uses Girsanov's Theorem instead of the usual change of variable with respect to the Lebesgue measure.
Of course there is no reason why the solution to the above problem should be θ0
(if so, such a parametric model is inappropriate). At this stage one can follow two
strategies:
– Try to solve by numerical means the above minimization problem.
– Use one's intuition to select a priori a good (though sub-optimal) θ ∈ Θ by applying the heuristic principle: “focus light where needed”.
Example (The Cameron–Martin formula and Importance Sampling by mean trans-
lation). This example takes place in a Gaussian framework. We consider (as a starting
motivation) a one-dimensional Black–Scholes model defined by
$$X_T^x=x\,e^{\mu T+\sigma W_T}\overset{d}{=}x\,e^{\mu T+\sigma\sqrt T\,Z},\qquad Z\sim\mathcal N(0;1),$$
with $x>0$, $\sigma>0$ and $\mu=r-\frac{\sigma^2}2$. Then, the premium of an option with payoff $h:(0,+\infty)\to(0,+\infty)$ reads
$$e^{-rT}\,\mathbb E\,h(X_T^x)=\mathbb E\,\varphi(Z)=\int_{\mathbb R}\varphi(z)\,e^{-\frac{z^2}2}\,\frac{dz}{\sqrt{2\pi}},$$
where $\varphi(z)=e^{-rT}h\big(x\,e^{\mu T+\sigma\sqrt T\,z}\big)$, $z\in\mathbb R$.
From now on, we forget about the financial framework and deal with
$$\mathbb E\,\varphi(Z)=\int_{\mathbb R}\varphi(z)\,g_0(z)\,dz\quad\text{where}\quad g_0(z)=\frac{e^{-\frac{z^2}2}}{\sqrt{2\pi}},$$
and the random variable Z plays the role of X in the above theoretical part. The idea
is to introduce the parametric family
$$Y_\theta=Z+\theta,\qquad \theta\in\Theta:=\mathbb R.$$
We consider the Lebesgue measure $\lambda_1$ on the real line as a reference measure, so that
$$g_\theta(y)=\frac{e^{-\frac{(y-\theta)^2}2}}{\sqrt{2\pi}},\quad y\in\mathbb R,\ \theta\in\Theta:=\mathbb R,\qquad\text{and}\qquad \frac{g_0}{g_\theta}(y)=e^{-\theta y+\frac{\theta^2}2},\quad y\in\mathbb R.$$
Consequently,
$$\mathbb E\,\varphi(Z)=e^{\frac{\theta^2}2}\,\mathbb E\big[\varphi(Y_\theta)\,e^{-\theta Y_\theta}\big]=e^{\frac{\theta^2}2}\,\mathbb E\big[\varphi(Z+\theta)\,e^{-\theta(Z+\theta)}\big]=e^{-\frac{\theta^2}2}\,\mathbb E\big[\varphi(Z+\theta)\,e^{-\theta Z}\big].$$
It is to be noticed again that there is no need to account for the normalization constants to compute the ratio $\frac{g_0}{g_\theta}$.
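The Cameron–Martin identity above is easy to check numerically. A minimal sketch with the illustrative choices (ours) $\varphi(z)=z^2$, so that $\mathbb E\,\varphi(Z)=\mathbb E\,Z^2=1$, and $\theta=1$:

```python
import math
import random

rng = random.Random(7)
theta, M = 1.0, 100_000
acc = 0.0
for _ in range(M):
    z = rng.gauss(0.0, 1.0)
    # right-hand side of E[phi(Z)] = e^{-theta^2/2} E[phi(Z + theta) e^{-theta Z}]
    acc += (z + theta) ** 2 * math.exp(-theta * z)
cm_est = math.exp(-theta ** 2 / 2) * acc / M   # Monte Carlo value of the RHS
```

The Monte Carlo value `cm_est` should be close to $\mathbb E\,\varphi(Z)=1$.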
The next step is to choose a “good” θ which significantly reduces the variance, i.e., following Condition (3.13) (using the formulation involving $Y_\theta=Z+\theta$), such that
$$\mathbb E\Big[\Big(e^{\frac{\theta^2}2}\varphi(Z+\theta)\,e^{-\theta(Z+\theta)}\Big)^{2}\Big]<\mathbb E\,\varphi^2(Z),\quad\text{i.e.}\quad e^{-\theta^2}\,\mathbb E\big[\varphi^2(Z+\theta)\,e^{-2\theta Z}\big]<\mathbb E\,\varphi^2(Z),$$
or, equivalently, if one uses the formulation of (3.13) based on the original random variable (here $Z$),
$$\mathbb E\big[\varphi^2(Z)\,e^{\frac{\theta^2}2-\theta Z}\big]<\mathbb E\,\varphi^2(Z),\qquad\theta\in\mathbb R.$$
It is clear that the solution of this optimization problem and the resulting choice of θ highly depend on the function $h$.
– Optimization approach: When h is smooth enough, an approach based on large
deviation estimates has been proposed by Glasserman et al. (see [115]). We propose
a simple recursive/adaptive approach in Sect. 6.3.1 of Chap. 6 based on Stochastic
Approximation which does not depend upon the regularity of the function h (see
also [12] for a pioneering work in that direction).
– Heuristic suboptimal approach: Let us temporarily return to our pricing problem involving the specified function $\varphi(z)=e^{-rT}\big(x\exp(\mu T+\sigma\sqrt T\,z)-K\big)_+$, $z\in\mathbb R$.
When $x\ll K$ (deep out-of-the-money option), most simulations of φ(Z) will produce 0 as a result. A first simple idea – if one does not wish to carry out the above optimization – can be to “re-center the simulation” of $X_T^x$ around $K$ by replacing $Z$ by $Z+\theta$, where θ satisfies
$$\mathbb E\Big[x\exp\big(\mu T+\sigma\sqrt T\,(Z+\theta)\big)\Big]=K,\quad\text{i.e.}\quad\theta:=-\frac{\log(x/K)+rT}{\sigma\sqrt T}.\qquad(3.14)$$
When $r=0$, this would lead to
$$\theta:=-\frac{\log(x/K)}{\sigma\sqrt T}.\qquad(3.15)$$
Exercises. 1. (a) Show that the function $\theta\mapsto\mathbb E\big[\varphi^2(Z)\,e^{\frac{\theta^2}2-\theta Z}\big]$ is strictly convex and differentiable on the whole real line with a derivative given by
$$\theta\longmapsto\mathbb E\big[\varphi^2(Z)\,(\theta-Z)\,e^{\frac{\theta^2}2-\theta Z}\big].$$
(b) Show that if ϕ is an even function, then this parametric importance sampling
procedure by mean translation is useless. Give a necessary and sufficient condition
(involving ϕ and Z ) that makes it always useful.
2. Set r = 0, σ = 0.2, X 0 = x = 70, T = 1. One wishes to price a Call option with
strike price K = 100 (i.e. deep out-of-the-money). The true Black–Scholes price is
0.248 (see Sect. 12.2).
Compare the performances of
(i) a “crude” Monte Carlo simulation,
(ii) the above “intuitively guided” heuristic choices for θ.
Assume now that x = K = 100. What do you think of the heuristic suboptimal
choice?
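Questions (i)–(ii) can be sketched as follows (the sample size and the naive variance estimator are our own choices; the shift is the heuristic θ of (3.15), which coincides with (3.14) since $r=0$):

```python
import math
import random

x0, K, sigma, T = 70.0, 100.0, 0.2, 1.0           # data of the exercise, r = 0
theta = -math.log(x0 / K) / (sigma * math.sqrt(T))

def payoff(z):   # (x e^{mu T + sigma sqrt(T) z} - K)_+ with mu = -sigma^2/2
    return max(x0 * math.exp(-0.5 * sigma ** 2 * T + sigma * math.sqrt(T) * z) - K, 0.0)

rng = random.Random(0)
M = 100_000
crude, shifted = [], []
for _ in range(M):
    z = rng.gauss(0.0, 1.0)
    crude.append(payoff(z))                        # (i) crude Monte Carlo
    # (ii) mean translation: E[phi(Z)] = e^{-theta^2/2} E[phi(Z+theta) e^{-theta Z}]
    shifted.append(math.exp(-theta ** 2 / 2) * payoff(z + theta) * math.exp(-theta * z))

mean = lambda v: sum(v) / len(v)
var = lambda v: mean([u * u for u in v]) - mean(v) ** 2
est_crude, est_is = mean(crude), mean(shifted)     # both estimate the price ≈ 0.248
```

For the deep out-of-the-money data the empirical variance of the shifted estimator is typically an order of magnitude (or more) below that of the crude one; when $x=K$, the heuristic choice gives $\theta=0$, i.e. no shift at all.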
3. Write all the preceding when $Z\sim\mathcal N(0;I_d)$.
4. Randomization of an integral. Let $h\in L^1(\mathbb R^d,Bor(\mathbb R^d),\lambda_d)$.
(a) Show that for any $\mathbb R^d$-valued random vector $Y$ having an absolutely continuous distribution $\mathbb P_Y=g\cdot\lambda_d$ with $g>0$ $\lambda_d$-a.s. on $\{h\ne0\}$, one has
$$\int_{\mathbb R^d}h\,d\lambda_d=\mathbb E\left[\frac hg(Y)\right].$$
Derive a probabilistic method to compute $\int_{\mathbb R^d}h\,d\lambda_d$.
(b) Propose an importance sampling approach to this problem inspired by the above
examples.
Equation (3.17) has at least one solution since F is continuous, though it may not be unique in general. For convenience, one often defines $\mathrm{VaR}_{\alpha,X}$ as the lowest solution of Equation (3.17). In fact, Value-at-Risk (of X) is not consistent as a measure of risk (as emphasized in [92]), but it is still widely used nowadays to measure financial risk.
One naive way to compute $\mathrm{VaR}_{\alpha,X}$ is to estimate the empirical distribution function of a (large enough) Monte Carlo simulation at some points ξ lying in a grid $\Gamma:=\{\xi_i,\,i\in I\}$, namely
$$\widetilde F_M(\xi):=\frac1M\sum_{k=1}^{M}\mathbf 1_{\{X_k\le\xi\}},\qquad \xi\in\Gamma,$$
where the $X_k$ are independent copies of $X:=\varphi(Z)$, $Z\sim\mathcal N(0;1)$. In this framework, importance sampling by mean translation yields, for every $\theta\in\mathbb R$,
$$\mathbb P(X\le\xi)=e^{-\frac{\theta^2}2}\,\mathbb E\big[\mathbf 1_{\{\varphi(Z+\theta)\le\xi\}}\,e^{-\theta Z}\big].$$
It remains to find good variance reducers θ. This choice depends of course on ξ but
in practice it should be fitted to reduce the variance in the neighborhood of VaRα,X .
We will see in Chap. 6 that more efficient methods based on Stochastic Approxi-
mation can be devised. But they also need variance reduction to be implemented.
Furthermore, similar ideas can be used to compute a consistent measure of risk called
the Conditional Value-at-Risk (or Averaged Value-at-Risk).
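The naive approach above can be sketched as follows; here the grid Γ is simply the sorted sample itself, and the $\mathcal N(0;1)$ toy target (for which the true quantile is known) is our own illustration:

```python
import random
from statistics import NormalDist

def naive_var(sample, alpha):
    """Smallest grid point xi with empirical df F_M(xi) >= alpha, the grid
    being the sorted sample itself (i.e. an empirical quantile)."""
    s = sorted(sample)
    M = len(s)
    for k, xi in enumerate(s, start=1):
        if k / M >= alpha:
            return xi
    return s[-1]

rng = random.Random(3)
v = naive_var([rng.gauss(0.0, 1.0) for _ in range(100_000)], 0.95)
# v should approach the N(0;1) quantile z_0.95 ≈ 1.645
```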
Chapter 4
The Quasi-Monte Carlo Method
In this chapter we present the so-called Quasi-Monte Carlo (QMC) method, which
can be seen as a deterministic alternative to the standard Monte Carlo method: the
pseudo-random numbers are replaced by deterministic computable sequences of
[0, 1]d -valued vectors which, once substituted mutatis mutandis in place of pseudo-
random numbers in the Monte Carlo method, may significantly speed up its rate
of convergence, making it almost independent of the structural dimension d of the
simulation.
or an infinite-dimensional integral
$$\int_{[0,1]^{(\mathbb N^*)}}f(u)\,\lambda_\infty(du),$$
where $[0,1]^{(\mathbb N^*)}$ denotes the set of finite $[0,1]$-valued sequences (or, equivalently, of sequences vanishing from some finite rank on) and $\lambda_\infty=\lambda^{\otimes\mathbb N^*}$ is the Lebesgue measure on $\big([0,1]^{\mathbb N^*},Bor([0,1]^{\mathbb N^*})\big)$. Integrals of the first type show up when $X$ can be simulated by standard methods like the inverse distribution function or the Box–Muller simulation method for Gaussian distributions, etc., so that $X=g(U)$, $U=(U^1,\dots,U^d)\sim\mathcal U([0,1]^d)$, whereas the second type is typical of a simulation using an acceptance-rejection method (like the polar method for Gaussian distributions).
As concerns finite-dimensional integrals, we saw that, if $(U_n)_{n\ge1}$ denotes an i.i.d. sequence of uniformly distributed random vectors on $[0,1]^d$, then, for every function $f\in L^1_{\mathbb R}([0,1]^d,\lambda_d)$,
$$\mathbb P(d\omega)\text{-}a.s.\qquad\frac1n\sum_{k=1}^{n}f\big(U_k(\omega)\big)\longrightarrow\mathbb E\,f(U_1)=\int_{[0,1]^d}f(u^1,\dots,u^d)\,du^1\cdots du^d,\qquad(4.1)$$
where the subset $\Omega_f$ of P-probability 1 on which this convergence holds true depends
on the function f . In particular, the above a.s.-convergence holds for any continuous
function on [0, 1]d . But in fact, taking advantage of the separability of the space
of continuous functions, we will show below that this convergence simultaneously
holds for all continuous functions on [0, 1]d and even on the larger class of Riemann
integrable functions on [0, 1]d .
First we briefly recall the basic definition of weak convergence of probability measures on metric spaces (see [45], Chap. 1, for a general introduction).
Definition 4.1 (Weak convergence) Let (S, δ) be a metric space and let $\mathcal S:=Bor_\delta(S)$ be its Borel σ-field. Let $(\mu_n)_{n\ge1}$ be a sequence of probability measures on $(S,\mathcal S)$ and let μ be a probability measure on the same space. The sequence $(\mu_n)_{n\ge1}$ weakly converges to μ (denoted by $\mu_n\overset{(S)}{\Longrightarrow}\mu$) if, for every function $f\in\mathcal C_b(S,\mathbb R)$,
$$\int_S f\,d\mu_n\longrightarrow\int_S f\,d\mu\quad\text{as }n\to+\infty.\qquad(4.2)$$
Proposition 4.1 (See Theorem 5.1 in [45]) If $\mu_n\overset{(S)}{\Longrightarrow}\mu$, then the above convergence (4.2) holds for every bounded Borel function $f:(S,\mathcal S)\to\mathbb R$ such that $\mu\big(Disc(f)\big)=0$, where $Disc(f)=\{x\in S:\ f\text{ is discontinuous at }x\}$.
$$\mathbb P(d\omega)\text{-}a.s.\qquad\frac1n\sum_{k=1}^{n}\delta_{U_k(\omega)}\overset{([0,1]^d)}{\Longrightarrow}\lambda_d|_{[0,1]^d}=\mathcal U([0,1]^d)\quad\text{as }n\to+\infty,$$
i.e., P(dω)-a.s., for every $f\in\mathcal C([0,1]^d,\mathbb R)$,
$$\frac1n\sum_{k=1}^{n}f\big(U_k(\omega)\big)\longrightarrow\int_{[0,1]^d}f\,d\lambda_d\quad\text{as }n\to+\infty.\qquad(4.3)$$
Proof. The vector space $\mathcal C([0,1]^d,\mathbb R)$ endowed with the sup-norm $\|f\|_{\sup}:=\sup_{x\in[0,1]^d}|f(x)|$ is separable in the sense that there exists a sequence $(f_m)_{m\ge1}$ of continuous functions on $[0,1]^d$ which is everywhere dense in $\mathcal C([0,1]^d,\mathbb R)$ with respect to the sup-norm (1).
Now, for every $m\ge1$, the convergence (4.1) holds with $f=f_m$ for every $\omega\in\Omega_m$, where $\Omega_m\in\mathcal A$ and $\mathbb P(\Omega_m)=1$. Set $\Omega_0=\cap_{m\in\mathbb N^*}\Omega_m$. One has $\mathbb P(\Omega_0^c)\le\sum_{m\ge1}\mathbb P(\Omega_m^c)=0$ by σ-sub-additivity of probability measures, so that $\mathbb P(\Omega_0)=1$. Then,
$$\forall\,\omega\in\Omega_0,\ \forall\,m\ge1,\qquad\frac1n\sum_{k=1}^{n}f_m\big(U_k(\omega)\big)\underset{n\to+\infty}{\longrightarrow}\mathbb E\,f_m(U_1)=\int_{[0,1]^d}f_m(u)\,\lambda_d(du).$$
On the other hand, it is straightforward that, for every $f\in\mathcal C([0,1]^d,\mathbb R)$, every $n\ge1$ and every $\omega\in\Omega$,
$$\left|\frac1n\sum_{k=1}^{n}f_m\big(U_k(\omega)\big)-\frac1n\sum_{k=1}^{n}f\big(U_k(\omega)\big)\right|\le\|f-f_m\|_{\sup},$$
so that, for every $\omega\in\Omega_0$,
$$\varlimsup_n\left|\frac1n\sum_{k=1}^{n}f\big(U_k(\omega)\big)-\mathbb E\,f(U_1)\right|\le 2\,\|f-f_m\|_{\sup}.$$
Now, the fact that the sequence $(f_m)_{m\ge1}$ is everywhere dense in $\mathcal C([0,1]^d,\mathbb R)$ with respect to the sup-norm means precisely that
$$\lim_m\|f-f_m\|_{\sup}=0.$$
1 When d = 1, an easy way to construct this sequence is to consider the countable family of contin-
uous piecewise affine functions with monotonicity breaks at rational points of the unit interval and
taking rational values at these break points (and at 0 and 1). The density follows from that of the set
Q of rational numbers. When d ≥ 2, one proceeds likewise by considering continuous functions
which are affine on hyper-rectangles with rational vertices which tile the unit hypercube [0, 1]d .
We refer to [45] for more details.
Hence, for every $\omega\in\Omega_0$ and every $f\in\mathcal C([0,1]^d,\mathbb R)$,
$$\left|\frac1n\sum_{k=1}^{n}f\big(U_k(\omega)\big)-\mathbb E\,f(U_1)\right|\longrightarrow 0\quad\text{as }n\to+\infty.$$
This completes the proof since it shows that, for every $\omega\in\Omega_0$, the expected convergence holds for every continuous function on $[0,1]^d$. ♦
Corollary 4.1 Owing to Proposition 4.1, one may replace in (4.3) the set of
continuous functions on [0, 1]d by that of all bounded Borel λd -a.s. continuous
functions on [0, 1]d .
Remark. In fact, in the above definition one may even replace Borel measurability by “Lebesgue” measurability. This means, for a function $f:[0,1]^d\to\mathbb R$, replacing $\big(Bor([0,1]^d),Bor(\mathbb R)\big)$-measurability by $\big(\mathcal L([0,1]^d),Bor(\mathbb R)\big)$-measurability, where $\mathcal L([0,1]^d)$ denotes the completion of the Borel σ-field on $[0,1]^d$ by the $\lambda_d$-negligible sets (see [52], Chap. 13). Such functions are known as Riemann integrable functions on $[0,1]^d$ (see again [52], Chap. 13).
The preceding suggests that, as long as one wishes to compute some quanti-
ties E f (U ) for (reasonably) smooth functions f , we only need to have access to
a sequence that satisfies the above convergence property for its empirical distri-
bution. Furthermore, we know from the fundamental theorem of simulation (see
Theorem 1.2) that this situation is generic since all distributions can be simulated
from a uniformly distributed random variable over [0, 1], at least theoretically. This
leads us to formulate the following definition of a uniformly distributed sequence.
Definition 4.3 A [0, 1]d -valued sequence (ξn )n≥1 is uniformly distributed on [0, 1]d
(or simply uniformly distributed in what follows) if
$$\frac1n\sum_{k=1}^{n}\delta_{\xi_k}\overset{([0,1]^d)}{\Longrightarrow}\mathcal U([0,1]^d)\quad\text{as }n\to+\infty.$$
Definition 4.4 (a) We define a componentwise partial order on [0, 1]d , simply
denoted by “≤”, by: for every x = (x 1 , . . . , x d ), y = (y 1 , . . . , y d ) ∈ [0, 1]d
$$x\le y\quad\text{if}\quad x^i\le y^i,\ 1\le i\le d.$$
Note that $[[x,y]]\ne\varnothing$ if and only if $x\le y$ and, if this is the case, $[[x,y]]=\prod_{i=1}^d[x^i,y^i]$.
Notation. In particular, the unit hypercube [0, 1]d can be denoted by [[0, 1]], where
1 = (1, . . . , 1) ∈ Rd .
(ii) for every $x=(x^1,\dots,x^d)\in[0,1]^d$,
$$\frac1n\sum_{k=1}^{n}\mathbf 1_{[[0,x]]}(\xi_k)\longrightarrow\lambda_d([[0,x]])=\prod_{i=1}^d x^i\quad\text{as }n\to+\infty;$$
(iii)
$$D_n^*(\xi):=\sup_{x\in[0,1]^d}\left|\frac1n\sum_{k=1}^{n}\mathbf 1_{[[0,x]]}(\xi_k)-\prod_{i=1}^d x^i\right|\longrightarrow 0\quad\text{as }n\to+\infty;\qquad(4.4)$$
(iv)
$$D_n^\infty(\xi):=\sup_{x,y\in[0,1]^d}\left|\frac1n\sum_{k=1}^{n}\mathbf 1_{[[x,y]]}(\xi_k)-\prod_{i=1}^d(y^i-x^i)\right|\longrightarrow 0\quad\text{as }n\to+\infty.\qquad(4.5)$$
2 The name of this theorem looks mysterious. Intuitively, it can be simply justified by the multiple
properties established as equivalent to the weak convergence of a sequence of probability measures.
However, it is sometimes credited to Jean-Pierre Portmanteau in the paper: Espoir pour l’ensemble
vide, Annales de l’Université de Felletin (1915), 322–325. In fact, one can easily check that no
mathematician called Jean-Pierre Portmanteau ever existed and that there is no university in the
very small French town of Felletin. This reference is just a joke hidden in the bibliography of the
second edition of [45]. The empty set is definitely hopeless…
(v) Weyl's criterion: for every $p\in\mathbb Z^d\setminus\{0\}$,
$$\frac1n\sum_{k=1}^{n}e^{2\tilde\imath\pi(p|\xi_k)}\longrightarrow 0\quad\text{as }n\to+\infty\quad(\text{with }\tilde\imath^2=-1).$$
(vi) Bounded Riemann integrable functions: for every bounded $\lambda_d$-a.s. continuous Lebesgue-measurable function $f:[0,1]^d\to\mathbb R$,
$$\frac1n\sum_{k=1}^{n}f(\xi_k)\longrightarrow\int_{[0,1]^d}f(x)\,\lambda_d(dx)\quad\text{as }n\to+\infty.$$
Definition 4.5 The two moduli introduced in items (iii) and (iv) by (4.4) and (4.5) define the discrepancy at the origin (or star discrepancy) and the extreme discrepancy, respectively.
Remark. By construction these two discrepancies take their values in the unit inter-
val [0, 1].
Sketch of proof. The ingredients of the proof come from the theory of weak convergence of probability measures. For more details in the multi-dimensional setting we refer to [45] (Chap. 1, devoted to the general theory of weak convergence of probability measures on a Polish space) or [175] (an old but great book devoted to uniformly distributed sequences). We provide hereafter some elements of proof in the one-dimensional case.
The equivalence (i) ⟺ (ii) is simply the characterization of weak convergence of probability measures by the convergence of their distribution functions (3), since the distribution function $F_{\mu_n}$ of $\mu_n=\frac1n\sum_{1\le k\le n}\delta_{\xi_k}$ is given by $F_{\mu_n}(x)=\frac1n\sum_{1\le k\le n}\mathbf 1_{\{0\le\xi_k\le x\}}$.
Owing to Dini’s second Lemma, this convergence of non-decreasing (distribution)
functions is uniform as soon as it holds pointwise since its pointwise limiting function,
FU ([0,1]) (x) = x, is continuous. This remark yields the equivalence (ii) ⇐⇒ (iii).
Although more technical, the d-dimensional extension remains elementary and relies
on a similar principle.
The equivalence (iii) ⟺ (iv) is trivial since $D_n^*(\xi)\le D_n^\infty(\xi)\le 2D_n^*(\xi)$ in one dimension. Note that in $d$ dimensions the inequality reads $D_n^*(\xi)\le D_n^\infty(\xi)\le 2^d D_n^*(\xi)$.
Item (v) is based on the fact that weak convergence of finite measures on $[0,1]^d$ is characterized by that of the sequences of their Fourier coefficients. The Fourier coefficients of a finite measure μ on $\big([0,1]^d,Bor([0,1]^d)\big)$ are defined by
$$c_p(\mu):=\int_{[0,1]^d}e^{2\tilde\imath\pi(p|u)}\,\mu(du),\qquad p\in\mathbb Z^d,\ \tilde\imath^2=-1$$
(see e.g. [156]). One checks that the Fourier coefficients $\big(c_p(\lambda_d|_{[0,1]^d})\big)_{p\in\mathbb Z^d}$ are simply $c_p(\lambda_d|_{[0,1]^d})=0$ if $p\ne0$ and $1$ if $p=0$.
Item (vi) follows from (i) and Proposition 4.1 since, for every $x\in[[0,1]]$, the function $f_x(\xi):=\mathbf 1_{\{\xi\le x\}}$ is continuous outside $\{x\}$, which is clearly Lebesgue negligible. Conversely, (vi) implies the pointwise convergence of the distribution functions $F_{\mu_n}$ as defined above toward $F_{\mathcal U([0,1])}$. ♦
The discrepancy at the origin $D_n^*(\xi)$ plays a central rôle in the theory of uniformly distributed sequences: it not only provides a criterion for uniform distribution, it also appears as an upper error modulus for numerical integration when the function $f$ has the appropriate regularity (see the Koksma–Hlawka Inequality below). Note that the quantity $\frac1n\sum_{k=1}^n\mathbf 1_{[[0,x]]}(\xi_k)$ is simply the frequency with which the first $n$ points $\xi_k$ of the sequence fall into $[[0,x]]$. The star discrepancy measures the maximal resulting error when $x$ runs over $[[0,1]]$.
Exercise 1. below provides a first example of a uniformly distributed sequence.
Exercises. 1. (Rotations of the torus $([0,1]^d,+)$). Let $(\alpha^1,\dots,\alpha^d)\in(\mathbb R\setminus\mathbb Q)^d$ be irrational numbers such that $(1,\alpha^1,\dots,\alpha^d)$ are linearly independent over $\mathbb Q$ (4). Let $x=(x^1,\dots,x^d)\in\mathbb R^d$. For every $n\ge1$, set
$$\xi_n:=\big(\{x^i+n\,\alpha^i\}\big)_{1\le i\le d},$$
where $\{u\}$ denotes the fractional part of a real number $u$. Show that the sequence $(\xi_n)_{n\ge1}$ is uniformly distributed on $[0,1]^d$ (and can be recursively generated). [Hint: use Weyl's criterion.]
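A quick numerical illustration of this exercise (the choice $\alpha=(\sqrt2,\sqrt3)$, which together with 1 is linearly independent over $\mathbb Q$, and the test function $f(u)=u^1u^2$ are our own):

```python
import math

def torus_sequence(alphas, x0, n):
    """First n points xi_k = ({x0^i + k alpha^i})_{1<=i<=d} of the rotation."""
    return [[(x0[i] + k * a) % 1.0 for i, a in enumerate(alphas)]
            for k in range(1, n + 1)]

pts = torus_sequence([math.sqrt(2.0), math.sqrt(3.0)], [0.0, 0.0], 10_000)
# equidistribution check: the empirical mean of f(u) = u^1 u^2 should be
# close to its integral over [0,1]^2, namely 1/4
emp = sum(p[0] * p[1] for p in pts) / len(pts)
```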
2. More on the one-dimensional case. (a) Assume $d=1$. Show that, for every $n$-tuple $(\xi_1,\dots,\xi_n)\in[0,1]^n$,
$$D_n^*(\xi_1,\dots,\xi_n)=\max_{1\le k\le n}\max\left(\Big|\frac{k-1}{n}-\xi_k^{(n)}\Big|,\ \Big|\frac kn-\xi_k^{(n)}\Big|\right),$$
where $(\xi_k^{(n)})_{1\le k\le n}$ is the/a reordering of the $n$-tuple $(\xi_1,\dots,\xi_n)$ defined by: $k\mapsto\xi_k^{(n)}$ is non-decreasing and $\{\xi_1^{(n)},\dots,\xi_n^{(n)}\}=\{\xi_1,\dots,\xi_n\}$. [Hint: Where does the càdlàg function $x\mapsto\frac1n\sum_{k=1}^n\mathbf 1_{\{\xi_k\le x\}}-x$ attain its infimum and supremum?]
(b) Deduce that
$$D_n^*(\xi_1,\dots,\xi_n)=\frac1{2n}+\max_{1\le k\le n}\Big|\xi_k^{(n)}-\frac{2k-1}{2n}\Big|.$$
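Both order-statistics formulas of this exercise are straightforward to implement and cross-check (the function name is ours):

```python
def star_discrepancy_1d(xs):
    """D_n^* in dimension 1 from the order statistics xi_(1) <= ... <= xi_(n)."""
    s = sorted(xs)
    n = len(s)
    # formula (a): max over k of max(|(k-1)/n - xi_(k)|, |k/n - xi_(k)|)
    da = max(max(abs((k - 1) / n - x), abs(k / n - x))
             for k, x in enumerate(s, start=1))
    # formula (b): 1/(2n) + max over k of |xi_(k) - (2k-1)/(2n)|
    db = 1 / (2 * n) + max(abs(x - (2 * k - 1) / (2 * n))
                           for k, x in enumerate(s, start=1))
    assert abs(da - db) < 1e-12          # the two formulas agree
    return da

d1 = star_discrepancy_1d([0.5])                          # one midpoint: D_1^* = 1/2
d2 = star_discrepancy_1d([0.125, 0.625, 0.375, 0.875])   # four well-spread points
```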
Definition 4.7 (see [48, 237]) A function $f:[0,1]^d\to\mathbb R$ has finite variation in the measure sense if there exists a signed measure (5) ν on $\big([0,1]^d,Bor([0,1]^d)\big)$ such that $\nu(\{0\})=0$ and
$$\forall\,x\in[0,1]^d,\qquad f(x)=f(1)+\nu([[0,1-x]])$$
(or, equivalently, $f(x)=f(0)-\nu(^c[[0,1-x]])$). The variation $V(f)$ is defined by
$$V(f):=|\nu|([0,1]^d),$$
where $|\nu|$ denotes the total variation measure of ν.
Exercises. 1. Show that $f$ has finite variation in the measure sense if and only if there exists a signed measure ν with $\nu(\{1\})=0$ such that
$$\forall\,x\in[0,1]^d,\qquad f(x)=f(1)+\nu([[x,1]])=f(0)-\nu(^c[[x,1]]),$$
and that its variation is given by $|\nu|([0,1]^d)$. This could of course be taken as a definition of finite variation, equivalent to the above one.
5 A signed measure ν on a measurable space $(X,\mathcal X)$ is a mapping from $\mathcal X$ to $\mathbb R$ which satisfies the two axioms of a measure, namely $\nu(\varnothing)=0$ and, if $A_n$, $n\ge1$, are pairwise disjoint, then $\nu(\cup_n A_n)=\sum_{n\ge1}\nu(A_n)$ (the series being commutatively convergent, hence absolutely convergent). Such a measure is finite and can be decomposed as $\nu=\nu_1-\nu_2$, where $\nu_1$, $\nu_2$ are non-negative finite measures supported by disjoint sets, i.e. there exists $A\in\mathcal X$ such that $\nu_1(A^c)=\nu_2(A)=0$ (see [258]).
2. Show that the function
$$f(x^1,x^2):=(x^1+x^2)\wedge1,\qquad (x^1,x^2)\in[0,1]^2,$$
has finite variation in the measure sense. [Hint: consider the distribution of $(U,1-U)$, $U\sim\mathcal U([0,1])$.]
For the class of functions with finite variation, the Koksma–Hlawka Inequality provides an error bound for $\big|\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f(x)\,dx\big|$ based on the star discrepancy, namely
$$\left|\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f\,d\lambda_d\right|\le V(f)\,D_n^*(\xi_1,\dots,\xi_n).$$
Proof. Set $\widetilde\mu_n=\frac1n\sum_{k=1}^n\delta_{\xi_k}-\lambda_d|_{[0,1]^d}$. It is a signed measure with 0-mass. Then, if $f$ has finite variation with respect to a signed measure ν,
$$\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f(x)\,\lambda_d(dx)=\int f(x)\,\widetilde\mu_n(dx)=f(1)\,\widetilde\mu_n([0,1]^d)+\int_{[0,1]^d}\nu([[0,1-x]])\,\widetilde\mu_n(dx)$$
$$=0+\int_{[0,1]^d}\left(\int\mathbf 1_{\{v\le1-x\}}\,\nu(dv)\right)\widetilde\mu_n(dx)=\int_{[0,1]^d}\widetilde\mu_n([[0,1-v]])\,\nu(dv),$$
where we used Fubini's theorem. Consequently,
$$\left|\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f(x)\,\lambda_d(dx)\right|=\left|\int_{[0,1]^d}\widetilde\mu_n([[0,1-v]])\,\nu(dv)\right|\le\int_{[0,1]^d}\big|\widetilde\mu_n([[0,1-v]])\big|\,|\nu|(dv)\le\sup_{v\in[0,1]^d}\big|\widetilde\mu_n([[0,v]])\big|\,|\nu|([0,1]^d)=D_n^*(\xi)\,V(f).\qquad♦$$
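The bound just established is easy to test numerically in dimension 1: for $f(x)=x$ one may take $\nu=-\lambda|_{[0,1]}$ in Definition 4.7, so $V(f)=1$ and the Koksma–Hlawka bound is $D_n^*$ itself (the point set below is an arbitrary choice of ours):

```python
def dstar(xs):
    """D_n^* in dimension 1 via the order-statistics formula."""
    s, n = sorted(xs), len(xs)
    return max(max(abs((k - 1) / n - x), abs(k / n - x))
               for k, x in enumerate(s, start=1))

pts = [0.1, 0.2, 0.3, 0.4, 0.9]
lhs = abs(sum(pts) / len(pts) - 0.5)   # |(1/n) sum f(xi_k) - int_0^1 x dx|, f(x) = x
rhs = dstar(pts)                       # V(f) * D_n^* with V(f) = 1
```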
Remarks. • The notion of finite variation in the measure sense was introduced in [48, 237]. When $d=1$, it coincides with the notion of left-continuous functions with finite variation. The most general extension to higher dimensions $d\ge2$ is the notion of “finite variation” in the Hardy and Krause sense. We refer e.g. to [175, 219] for its definition and properties. Essentially, it relies on a geometric extension of the one-dimensional finite variation, where meshes on the unit interval are replaced by tilings of $[0,1]^d$ by hyper-rectangles, with an appropriate notion of increment of the function on these hyper-rectangles. However, finite variation in the measure sense is much easier to handle, in particular to establish the above Koksma–Hlawka Inequality. Furthermore, $V(f)\le V_{H\&K}(f)$. Conversely, one shows that a function $f$ with finite variation in the Hardy and Krause sense is $\lambda_d$-a.s. equal to a function $\tilde f$ having finite variation in the measure sense and satisfying $V(\tilde f)\le V_{H\&K}(f)$. In one dimension, finite variation in the Hardy and Krause sense exactly coincides with the standard definition of finite variation.
• A classical criterion (see [48, 237]) for finite variation in the measure sense is the following: if $f:[0,1]^d\to\mathbb R$ has a cross derivative $\frac{\partial^d f}{\partial x^1\cdots\partial x^d}$ in the distribution sense which is an integrable function, i.e.
$$\int_{[0,1]^d}\left|\frac{\partial^d f}{\partial x^1\cdots\partial x^d}(x^1,\dots,x^d)\right|dx^1\cdots dx^d<+\infty,$$
then $f$ has finite variation in the measure sense. This class includes the functions $f$ defined by
$$f(x)=f(1)+\int_{[[0,1-x]]}\varphi(u^1,\dots,u^d)\,du^1\dots du^d,\qquad \varphi\in L^1([0,1]^d,\lambda_d).\qquad(4.6)$$
In fact, the Koksma–Hlawka Inequality turns out to be an equality in the following sense:
$$D_n^*(\xi_1,\dots,\xi_n)=\sup\left\{\left|\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f\,d\lambda_d\right|,\ f:[0,1]^d\to\mathbb R,\ V(f)\le1\right\}.\qquad(4.7)$$
Exercises. 1. Show that the function
$$f(x^1,x^2,x^3):=(x^1+x^2+x^3)\wedge1$$
does not have finite variation in the measure sense. [Hint: its third derivative in the distribution sense is not a measure.] (6)
2. (a) Show directly that if $f$ satisfies (4.6), then for any $n$-tuple $(\xi_1,\dots,\xi_n)$,
$$\left|\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f(x)\,\lambda_d(dx)\right|\le\|\varphi\|_{L^1(\lambda_d|_{[0,1]^d})}\,D_n^*(\xi_1,\dots,\xi_n).$$
(b) Show that if φ also lies in $L^p([0,1]^d,\lambda_d)$ for some $p\in(1,+\infty]$ with Hölder conjugate $q$, then
$$\left|\frac1n\sum_{k=1}^{n}f(\xi_k)-\int_{[0,1]^d}f(x)\,\lambda_d(dx)\right|\le\|\varphi\|_{L^p(\lambda_d|_{[0,1]^d})}\,D_n^{(q)}(\xi_1,\dots,\xi_n),$$
where
$$D_n^{(q)}(\xi_1,\dots,\xi_n)=\left(\int_{[0,1]^d}\left|\frac1n\sum_{k=1}^{n}\mathbf 1_{[[0,x]]}(\xi_k)-\prod_{i=1}^d x^i\right|^q\lambda_d(dx)\right)^{\!\frac1q}.$$
6 In fact, its variation in the Hardy and Krause sense is not finite either.
When $d=1$, one has
$$D_n^{(1)}(\xi_1,\dots,\xi_n)=\sum_{k=0}^{n}\int_{\xi_k^{(n)}}^{\xi_{k+1}^{(n)}}\Big|\frac kn-u\Big|\,du,$$
with the conventions $\xi_0^{(n)}:=0$ and $\xi_{n+1}^{(n)}:=1$.
Let $(U_n)_{n\ge1}$ be an i.i.d. sequence of random vectors uniformly distributed over $[0,1]^d$ defined on a probability space (Ω, A, P). We saw that, P(dω)-a.s., the sequence $(U_n(\omega))_{n\ge1}$ is uniformly distributed, and
$$\mathbb E\,D_n^*\big(U_1,\dots,U_n\big)\sim\frac{\mathbb E\,\sup_{x\in[0,1]^d}|Z^{(d)}_x|}{\sqrt n}\quad\text{as }n\to+\infty,$$
where $(Z^{(d)}_x)_{x\in[0,1]^d}$ denotes the centered Gaussian multi-index process (or “bridged hyper-sheet”) with covariance structure given by
$$\forall\,x=(x^1,\dots,x^d),\,y=(y^1,\dots,y^d)\in[0,1]^d,\qquad \mathrm{Cov}\big(Z^{(d)}_x,Z^{(d)}_y\big)=\prod_{i=1}^d x^i\wedge y^i-\prod_{i=1}^d x^i\prod_{i=1}^d y^i.$$
Remarks. • A Brownian hyper-sheet is a centered Gaussian field $(W^{(d)}_x)_{x\in[0,1]^d}$ characterized by its covariance structure
$$\forall\,x,y\in[0,1]^d,\qquad \mathbb E\big[W^{(d)}_x W^{(d)}_y\big]=\prod_{i=1}^d x^i\wedge y^i.$$
The bridged hyper-sheet can be obtained from it by setting
$$\forall\,x=(x^1,\dots,x^d)\in[0,1]^d,\qquad Z^{(d)}_x=W^{(d)}_x-\Big(\prod_{i=1}^d x^i\Big)W^{(d)}_{(1,\dots,1)}.$$
• In particular, when $d=1$, $Z=Z^{(1)}$ is simply the Brownian bridge over the unit interval $[0,1]$, defined by
$$Z_x=W_x-x\,W_1,\qquad x\in[0,1],$$
where $(W_x)_{x\in[0,1]}$ is a standard Brownian motion on $[0,1]$.
The distribution of its sup-norm is characterized, through its tail (or survival) distribution function, by
$$\forall\,z\in\mathbb R_+,\qquad \mathbb P\Big(\sup_{x\in[0,1]}|Z_x|\ge z\Big)=2\sum_{k\ge1}(-1)^{k-1}e^{-2k^2z^2}=1-\frac{\sqrt{2\pi}}{z}\sum_{k\ge1}e^{-\frac{(2k-1)^2\pi^2}{8z^2}}\qquad(4.8)$$
(see [72] or [45], Chap. 2). This distribution is also known as the Kolmogorov–Smirnov distribution since it appears as the limit distribution used in the eponymous non-parametric goodness-of-fit statistical test.
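The two series in (4.8) are two faces of Jacobi's theta functional equation and can be cross-checked numerically (the truncation level is ours; both series converge extremely fast):

```python
import math

def ks_tail_alternating(z, terms=50):
    """Left-hand series in (4.8): 2 * sum_k (-1)^{k-1} e^{-2 k^2 z^2}."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                     for k in range(1, terms))

def ks_tail_theta(z, terms=50):
    """Right-hand series in (4.8): 1 - (sqrt(2 pi)/z) sum_k e^{-(2k-1)^2 pi^2/(8 z^2)}."""
    s = sum(math.exp(-(2 * k - 1) ** 2 * math.pi ** 2 / (8.0 * z * z))
            for k in range(1, terms))
    return 1.0 - math.sqrt(2.0 * math.pi) / z * s

p1, p2 = ks_tail_alternating(1.0), ks_tail_theta(1.0)   # tail value at z = 1
```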
• The above CLT points out a possible somewhat hidden dependence of the Monte Carlo method upon the dimension $d$. As a matter of fact, let $X^\varphi=\varphi(U)$, $U\sim\mathcal U([0,1]^d)$, and let φ have finite variation $V(\varphi)$. Then, if $\bar X^\varphi_n$ denotes for every $n\ge1$ the empirical mean of $n$ independent copies of $X^\varphi_k=\varphi(U_k)$, one has, owing to (4.7),
$$\sqrt n\;\mathbb E\sup_{\varphi:\,V(\varphi)\le1}\big|\bar X^\varphi_n-\mathbb E\,X^\varphi\big|=\sqrt n\;\mathbb E\,D_n^*\big(U_1,\dots,U_n\big)\longrightarrow\mathbb E\sup_{x\in[0,1]^d}\big|Z^{(d)}_x\big|\quad\text{as }n\to+\infty.$$
This dependence with respect to the dimension appears more clearly when dealing
with Lipschitz continuous functions. For more details we refer to the third paragraph
of Sect. 7.3.
Chung's LIL for the star discrepancy. The random sequence $D_n^*\big(U_1,\dots,U_n\big)$, $n\ge1$, satisfies the following Law of the Iterated Logarithm:
$$\varlimsup_n\sqrt{\frac{2n}{\log(\log n)}}\,D_n^*\big(U_1(\omega),\dots,U_n(\omega)\big)=1\quad \mathbb P(d\omega)\text{-}a.s.$$
At this stage, this suggests a temporary definition of a sequence with low discrepancy on $[0,1]^d$ as a $[0,1]^d$-valued sequence $\xi:=(\xi_n)_{n\ge1}$ such that
$$D_n^*(\xi)=o\left(\sqrt{\frac{\log(\log n)}{n}}\right)\quad\text{as }n\to+\infty,$$
which means that numerical integration by the empirical measure of such a sequence, applied to a function with finite variation, converges faster than the worst rate of the Monte Carlo simulation.
Exercise. Show, using the standard LIL, the easy part of Chung's LIL, that is
$$\varlimsup_n\sqrt{\frac{2n}{\log(\log n)}}\,D^*_n\big(U_1(\omega),\dots,U_n(\omega)\big)\ \ge\ 2\sup_{x\in[0,1]^d}\sqrt{\lambda_d([[0,x]])\big(1-\lambda_d([[0,x]])\big)}\ =\ 1\quad \mathbb{P}(d\omega)\text{-a.s.}$$
Before providing examples of sequences with low discrepancy, let us first describe
some results concerning the known lower bounds for the asymptotics of the (star)
discrepancy of a uniformly distributed sequence.
The first results are due to Roth (see [257]): there exists a universal constant $c_d\in(0,+\infty)$ such that, for any $[0,1]^d$-valued $n$-tuple $(\xi_1,\dots,\xi_n)$,
$$D^*_n(\xi_1,\dots,\xi_n)\ \ge\ c_d\,\frac{(\log n)^{\frac{d-1}{2}}}{n}.\qquad(4.9)$$
Furthermore, there exists a real constant $\tilde c_d\in(0,+\infty)$ such that, for every sequence $\xi=(\xi_n)_{n\ge1}$,
$$D^*_n(\xi)\ \ge\ \tilde c_d\,\frac{(\log n)^{\frac{d}{2}}}{n}\quad\text{for infinitely many }n.\qquad(4.10)$$
4.3 Sequences with Low Discrepancy: Definition(s) and Examples 109
This second lower bound can be derived from the first one, using the Hammersley
procedure introduced and analyzed in the next section (see the exercise at the end of
Sect. 4.3.4).
On the other hand, there exist (see Sect. 4.3.3 that follows) sequences $\xi$ for which
$$\forall\, n\ge1,\qquad D^*_n(\xi)\ \le\ C(\xi)\,\frac{(\log n)^d}{n}\quad\text{where }C(\xi)<+\infty.$$
Based on this, one can derive from the Hammersley procedure (see again Sect. 4.3.4 below) the existence of a real constant $C_d\in(0,+\infty)$ such that
$$\forall\, n\ge1,\ \exists\,(\xi_1,\dots,\xi_n)\in([0,1]^d)^n,\qquad D^*_n(\xi_1,\dots,\xi_n)\ \le\ C_d\,\frac{(\log n)^{d-1}}{n}.$$
In spite of more than fifty years of investigation, the gap between these asymptotic lower and upper bounds has not been significantly reduced: it has still not been proved whether there exists a sequence for which $C(\xi)=0$, i.e. for which the rate $\frac{(\log n)^d}{n}$ would not be optimal.
In fact, it is widely believed in the QMC community that, in the above lower bounds, $\frac{d-1}{2}$ can be replaced by $d-1$ in (4.9) and $\frac d2$ by $d$ in (4.10), so that the rate $O\big(\frac{(\log n)^d}{n}\big)$ is commonly considered as the lowest possible rate of convergence to $0$ for the star discrepancy of a uniformly distributed sequence. When $d=1$, Schmidt proved that this conjecture is true.
This leads to a more convincing definition of a sequence with low discrepancy.
Definition 4.8 A $[0,1]^d$-valued sequence $(\xi_n)_{n\ge1}$ is a sequence with low discrepancy if
$$D^*_n(\xi)=O\left(\frac{(\log n)^d}{n}\right)\quad\text{as }n\to+\infty.$$
For more insight about the other measures of uniform distribution ($L^p$-discrepancy $D^{(p)}_n(\xi)$, diaphony, etc.), we refer e.g. to [46, 219].
where the so-called "radical inverse functions" $\Phi_p$ are defined for every integer $p\ge2$ by
$$\Phi_p(n)=\sum_{k=0}^{r}\frac{a_k}{p^{k+1}},$$
with $n=a_0+a_1p+\cdots+a_rp^r$, $a_k\in\{0,\dots,p-1\}$, the $p$-adic expansion of $n$.
Theorem 4.2 (see [165]) Let $\xi=(\xi_n)_{n\ge1}$ be defined by (4.11). For every $n\ge1$,
$$D^*_n(\xi)\ \le\ \frac1n\prod_{i=1}^d\Big((p_i-1)\,\frac{\log(p_i n)}{\log(p_i)}\Big)\ =\ O\left(\frac{(\log n)^d}{n}\right)\quad\text{as }n\to+\infty.\qquad(4.12)$$
The proof of this upper-bound, due to H.L. Keng and W. Yu, essentially relies on the Chinese Remainder Theorem (known as the "Théorème chinois" in French and as Sunzi's Theorem in China, see [165], Sect. 3.5, among others). Since the proofs of such bounds for sequences with low discrepancy are usually highly technical and rely on combinatorial and number-theoretic arguments not very familiar to specialists in Probability theory, we decided to provide a proof for this first upper-bound, which turns out to be more accessible. This proof is postponed to Sect. 12.10.
In fact, the above upper-bound (4.12) remains true if the sequence $(\xi_n)_{n\ge1}$ is defined with integers $p_1,\dots,p_d\ge2$ which are simply pairwise coprime, i.e. $\gcd(p_i,p_j)=1$, $i\ne j$, $1\le i,j\le d$. In particular, if $d=1$, the (one-dimensional) Van der Corput sequence $VdC(p)$ defined by
$$\xi_n=\Phi_p(n)$$
is uniformly distributed with $D^*_n(\xi)=O\big(\frac{\log n}{n}\big)$ for every integer $p\ge2$.
Several improvements of this classical bound have been established: some non-asymptotic and of numerical interest (see e.g. [222, 278]), some more theoretical. Among them let us cite the following one established by H. Faure (see [88], see also [219], p. 29):
$$D^*_n(\xi)\ \le\ \frac1n\left(d+\prod_{i=1}^d\Big(\frac{p_i-1}{2}\,\frac{\log n}{\log p_i}+\frac{p_i+2}{2}\Big)\right),\qquad n\ge1.$$
One easily checks that the first terms of the $VdC(2)$ sequence are as follows:
$$\xi_1=\frac12,\ \xi_2=\frac14,\ \xi_3=\frac34,\ \xi_4=\frac18,\ \xi_5=\frac58,\ \xi_6=\frac38,\ \xi_7=\frac78,\ \dots$$
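These first terms can be reproduced mechanically. The sketch below (our function names; exact rational arithmetic via Python's `fractions`) implements the radical inverse $\Phi_p$ and the resulting Van der Corput and Halton sequences:

```python
from fractions import Fraction

def radical_inverse(n: int, p: int) -> Fraction:
    """Phi_p(n): mirror the base-p digits of n across the radix point."""
    inv, denom = Fraction(0), 1
    while n > 0:
        n, a = divmod(n, p)   # peel off the least significant base-p digit
        denom *= p
        inv += Fraction(a, denom)
    return inv

def van_der_corput(n_terms: int, p: int = 2):
    """First n_terms of the VdC(p) sequence xi_n = Phi_p(n), n >= 1."""
    return [radical_inverse(n, p) for n in range(1, n_terms + 1)]

def halton(n_terms: int, bases=(2, 3)):
    """d-dimensional Halton sequence built from pairwise coprime bases."""
    return [tuple(radical_inverse(n, p) for p in bases)
            for n in range(1, n_terms + 1)]
```

For instance, `van_der_corput(7, 2)` returns exactly the seven $VdC(2)$ values displayed above.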
Exercise. Let $\xi=(\xi_n)_{n\ge1}$ denote the $p$-adic Van der Corput sequence and let $\xi^{(n)}_1\le\cdots\le\xi^{(n)}_n$ be the reordering of its first $n$ terms.
(a) Show that, for every $k\in\{1,\dots,n\}$,
$$\xi^{(n)}_k\le\frac{k}{n+1}.$$
(b) Deduce that
$$\frac{\xi_1+\cdots+\xi_n}{n}\le\frac12.$$
(c) One considers the $p$-adic Van der Corput sequence $(\tilde\xi_n)_{n\ge1}$ starting at $0$, i.e. $\tilde\xi_n=\Phi_p(n-1)$, $n\ge1$, where $(\xi_n)_{n\ge1}$ is the regular $p$-adic Van der Corput sequence. Show that $\tilde\xi^{(n+1)}_k\le\frac{k-1}{n+1}$, $k=1,\dots,n+1$. Deduce that the $L^1$-discrepancy of the sequence $\tilde\xi$ satisfies
$$D^{(1)}_n(\tilde\xi)=\frac12-\frac{\tilde\xi_1+\cdots+\tilde\xi_n}{n}.$$
Kakutani sequences
This family of sequences was first obtained as a by-product while trying to generate the Halton sequence as the orbit of an ergodic transform (see [181, 185, 223]). The extension is based on the $p$-adic addition on $[0,1]$, $p$ an integer, $p\ge2$, also known as the Kakutani adding machine. It is defined on the regular $p$-adic expansions of real numbers of $[0,1]$ ($^7$) as the addition from the left to the right of the regular $p$-adic expansions with carry-over. The regular $p$-adic expansion of $1$ is conventionally set to $1\overset{p}{=}0.(p-1)(p-1)(p-1)\dots$ and $1$ is not considered as a $p$-adic rational number in the rest of this section.
Let $\oplus_p$ denote this addition (or Kakutani's adding machine).
$^7$ Every real number in $[0,1)$ admits a $p$-adic expansion $x=\sum_{k\ge1}\frac{x_k}{p^k}$, $x_k\in\{0,\dots,p-1\}$, $k\ge1$. If $x$ is not a $p$-adic rational, this expansion is unique. If $x$ is a $p$-adic rational number, i.e. of the form $x=\frac{N}{p^r}$ for some $r\in\mathbb N$ and $N\in\{0,\dots,p^r-1\}$, then $x$ has two $p$-adic expansions, one of the form $x=\sum_{k=1}^{\ell}\frac{x_k}{p^k}$ with $x_\ell\ne0$ and a second reading $x=\sum_{k=1}^{\ell-1}\frac{x_k}{p^k}+\frac{x_\ell-1}{p^\ell}+\sum_{k\ge\ell+1}\frac{p-1}{p^k}$. It is clear that if $x$ is not a $p$-adic rational number, its $p$-adic "digits" $x_k$ cannot all be equal to $p-1$ for $k$ large enough. By definition the regular $p$-adic expansion of $x\in[0,1)$ is the unique expansion of $x$ whose digits $x_k$ are infinitely often not equal to $p-1$. The case of $1$ is specific: its unique $p$-adic expansion $1=\sum_{k\ge1}\frac{p-1}{p^k}$ will be considered as regular. This regular expansion is denoted by $x\overset{p}{=}0.x_1x_2\dots x_k\dots$ for every $x\in[0,1]$.
Given $x,y\in[0,1]$ with regular expansions $x\overset{p}{=}0.x_1x_2\dots$ and $y\overset{p}{=}0.y_1y_2\dots$, set $z_k\equiv x_k+y_k+\varepsilon_{k-1}\ (\mathrm{mod}\ p)$, where the carry $\varepsilon_k$ equals $1$ if $x_k+y_k+\varepsilon_{k-1}\ge p$ and $0$ otherwise, with $\varepsilon_0=0$.
– If $x$ or $y$ is a $p$-adic rational number and $x,y\ne1$, then one easily checks that $z_k=x_k$ or $z_k=y_k$ for every large enough $k$, so that this defines a regular expansion, i.e. the digits $(x\oplus_p y)_k$ of $x\oplus_p y$ are $(x\oplus_p y)_k=z_k$, $k\ge1$.
– If both $x$ and $y$ are not $p$-adic rational numbers, then it may happen that $z_k=p-1$ for every large enough $k$, so that $\sum_{k\ge1}\frac{z_k}{p^k}$ is not the regular $p$-adic expansion of $x\oplus_p y$.
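The adding machine is easy to emulate on truncated digit expansions. The sketch below (our names; the truncation at $K$ digits is exact when the arguments are $p$-adic rationals with short expansions, as in the check that follows) adds digits from the left to the right with carry-over:

```python
from fractions import Fraction

def digits(x, p, K):
    """First K digits of the regular p-adic expansion of x in [0, 1)."""
    out = []
    for _ in range(K):
        x *= p
        d = int(x)          # floor of a non-negative Fraction
        out.append(d)
        x -= d
    return out

def kakutani_add(x, y, p, K=30):
    """x (+)_p y: add the p-adic digits from the left to the right with
    carry-over (the carry at position k propagates to position k+1)."""
    xd, yd = digits(x, p, K), digits(y, p, K)
    z, carry = [0] * K, 0
    for k in range(K):
        s = xd[k] + yd[k] + carry
        z[k], carry = s % p, int(s >= p)
    return sum(Fraction(z[k], p ** (k + 1)) for k in range(K))
```

Iterating $x\mapsto x\oplus_2\frac12$ from $x=\frac12$ reproduces the first terms $\frac12,\frac14,\frac34,\frac18$ of the $VdC(2)$ sequence.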
Proposition 4.4 (see [185, 223]) Let $p_1,\dots,p_d$ denote the first $d$ prime numbers, let $y_1,\dots,y_d\in(0,1)$, where $y_i$ is a $p_i$-adic rational number satisfying $y_i\ge1/p_i$, $i=1,\dots,d$, and let $x_1,\dots,x_d\in[0,1]$. Then the sequence $(\xi_n)_{n\ge1}$ defined by
$$\xi_n:=\big(T^{n-1}_{p_i,y_i}(x_i)\big)_{1\le i\le d},\qquad n\ge1,$$
has a discrepancy at the origin $D^*_n(\xi)$ satisfying an upper-bound similar to (4.12) for the Halton sequence, namely, for every integer $n\ge1$,
$$D^*_n(\xi)\ \le\ \frac1n\left(d-1+\prod_{i=1}^d(p_i-1)\,\frac{\log(p_i n)}{\log(p_i)}\right).$$
Remarks. • Note that if $y_i=x_i=1/p_i\overset{p_i}{=}0.1$, $i=1,\dots,d$, the sequence $\xi$ is simply the regular Halton sequence.
• This upper-bound is obtained by adapting the proof of Theorem 4.2 (see Sect. 12.10)
and we do not pretend it is optimal as a universal bound when the starting values
xi and the angles yi vary. Its main interest is to provide a large family in which
sequences with better performances than the regular Halton sequences are “hidden”,
at least at finite range.
ξn = ξn−1 ⊕ p (y1 , . . . , yd ),
It was later shown in [219] that they share the following Pp,d -property (see
also [48], p. 79).
Proposition 4.5 For every $m\in\mathbb N$ and every $\ell\in\mathbb N^*$, for any $r_1,\dots,r_d\in\mathbb N$ such that $r_1+\cdots+r_d=m$ and every $x_1,\dots,x_d\in\mathbb N$ such that $x_k\le p^{r_k}-1$, $k=1,\dots,d$, there is exactly one term of the sequence $\xi_{\ell p^m+i}$, $i=0,\dots,p^m-1$, lying in the hyper-cube
$$\prod_{k=1}^d\big[x_kp^{-r_k},(x_k+1)p^{-r_k}\big).$$
increases.
The above convergence result is an easy consequence of Stirling's formula and Bertrand's conjecture, which says that, for every integer $d>1$, there exists a prime number $p$ such that $d<p<2d$ ($^8$). To be more precise, the function $x\mapsto\frac{x-1}{\log x}$ being increasing on $(0,+\infty)$,
$$\frac{1}{d!}\left(\frac{p-1}{2\log p}\right)^{d}\ \le\ \frac{1}{d!}\left(\frac{2d-1}{2\log 2d}\right)^{d}\ \le\ \left(\frac{e}{\log 2d}\right)^{d}\frac{1}{\sqrt{2\pi d}}\ \to\ 0\quad\text{as }d\to+\infty.$$
Note that this bound has the same coefficient in its leading term as the asymptotic error bound obtained by Faure. Unfortunately, from a numerical point of view, it becomes efficient only for very large $n$: thus if $d=p=5$ and $n=1\,000$,
$$D^*_n(\xi)\ \le\ \frac1n\left(\frac{1}{d!}\left(\frac{p-1}{2}\,\frac{\log(2n)}{\log p}\right)^{d}+d+1\right)\ \approx\ 1.18,$$
which is of little interest if one keeps in mind that, by construction, the discrepancy takes its values in $[0,1]$. This can be explained by the form of the "constant" term in the $(\log n)^d$-scale in the above upper-bound: one has
8 Bertrand’s
conjecture was stated in 1845 but it is no longer a conjecture since it was proved by P.
Tchebychev in 1850.
$$\lim_{d\to+\infty}\frac{(d+1)^d}{d!}\left(\frac{p-1}{2}\right)^{d}=+\infty.$$
A better bound is provided in Y.-J. Xiao’s PhD thesis [278], provided n ≥ p d+2 /2.
But once again it is of little interest for applications when d increases since, for p ≥ d,
p d+2 /2 ≥ d d+2 /2.
The Sobol’ sequences
The discovery by Ilia Sobol’ of his eponymous sequences has clearly been, not
only a pioneering but also a striking contribution to sequences with low discrepancy
and quasi-Monte Carlo simulation. The publication of his work goes back to 1967
(see [265]). Although it was soon translated into English, it remained ignored to a
large extent for many years, at least by practitioners in the western world. From a
purely theoretical point of view, it is the first family of sequences ever discovered
satisfying the Pp,d -property. In the modern classification of sequences with low dis-
crepancy they appear as subfamilies of the largest family of Niederreiter’s sequences
(see below), discovered later on. However, now that they are widely used especially
for (Quasi-)Monte Carlo simulation in Quantitative Finance, their impact, among
theorists and practitioners, is unrivaled, compared to any other uniformly distributed
sequence.
In terms of implementation, Antonov and Saleev proposed in [10] a new imple-
mentation based on the Gray code, which dramatically speeds up the computation
of these sequences.
In practice, the construction of the Sobol' sequences relies on "direction numbers", which turn out to be crucial for the efficiency of the sequences. Not all admissible choices are equivalent and many authors have proposed efficient initializations of these numbers after Sobol' himself (see [266]), who proposed a solution up to $d=51$ in 1976. Implementations are also available in [247]. For recent developments on this topic, see e.g. [276].
Even if (see below) some sequences have proved to enjoy slightly better "academic" performances, no major progress has been made since in the search for good sequences with low discrepancy. Sobol' sequences remain unrivaled among practitioners and are massively used in Quasi-Monte Carlo simulation, to the point of becoming a synonym for QMC in the quantitative finance world.
The main advances come from post-processing of the sequences like randomiza-
tion and/or scrambling. These points are briefly discussed in Sect. 4.4.2.
The Niederreiter sequences.
These sequences were designed as generalizations of Faure and Sobol’ sequences
(see [219]).
Let $q$ be the smallest primary integer not less than $d$ (a primary integer is an integer of the form $q=p^r$ with $p$ prime, $r\in\mathbb N^*$). The $(0,d)$-Niederreiter sequence is defined for every integer $n\ge1$ by
$$\xi_n=\big(\Phi_{q,1}(n-1),\,\Phi_{q,2}(n-1),\,\dots,\,\Phi_{q,d}(n-1)\big),$$
where
$$\Phi_{q,i}(n):=\sum_{j}\psi^{-1}\Big(\sum_{k}C^{(i)}_{j,k}(a_k)\Big)\,q^{-j},$$
This quite general family of sequences contains both the Faure and the Sobol' sequences. To be more precise:
• when $q$ is the smallest prime number not less than $d$, one retrieves the Faure sequences,
• when $q=2^r$, with $2^{r-1}<d\le2^r$, the sequence coincides with the Sobol' sequences (in their original form).
The main feature of Niederreiter sequences is that they are $(t,d)$-sequences in base $q$ and consequently have a discrepancy satisfying an upper-bound with a structure similar to that of the Faure or Sobol' sequences (which correspond to $t=0$). For a precise definition of $(t,d)$-sequences (and $(t,m,d)$-nets), as well as an in-depth analysis of their properties in terms of discrepancy, we refer to [219]. Note that $(0,d)$-sequences in base $p$ reduce to the $P_{p,d}$-property mentioned above. Records in terms of low discrepancy are usually held within this family, and then beaten by other sequences from this family.
The Hammersley procedure is a canonical method for designing a $[0,1]^d$-valued $n$-tuple from a $[0,1]^{d-1}$-valued one, with a discrepancy at the origin ruled by that of the latter $(d-1)$-dimensional one.
Proposition 4.6 Let $d\ge2$. Let $(\zeta_1,\dots,\zeta_n)$ be a $[0,1]^{d-1}$-valued $n$-tuple. Then, the $[0,1]^d$-valued $n$-tuple defined by
$$(\xi_k)_{1\le k\le n}=\Big(\Big(\zeta_k,\frac kn\Big)\Big)_{1\le k\le n}$$
satisfies
$$D^*_n\big((\xi_k)_{1\le k\le n}\big)\ \le\ \frac{1+\max_{1\le k\le n}k\,D^*_k(\zeta_1,\dots,\zeta_k)}{n}.\qquad(4.14)$$
Proof. It follows from the very definition of the discrepancy at the origin that
$$D^*_n\big((\xi_k)_{1\le k\le n}\big)=\sup_{(x,y)\in[0,1]^{d-1}\times[0,1]}\Big|\frac1n\sum_{k=1}^n\mathbf 1_{\{\zeta_k\in[[0,x]],\,\frac kn\le y\}}-y\prod_{i=1}^{d-1}x^i\Big|$$
$$=\sup_{x\in[0,1]^{d-1}}\ \sup_{y\in[0,1]}\Big|\frac1n\sum_{k=1}^n\mathbf 1_{\{\zeta_k\in[[0,x]],\,\frac kn\le y\}}-y\prod_{i=1}^{d-1}x^i\Big|$$
$$=\max_{1\le k\le n}\Bigg[\sup_{x\in[0,1]^{d-1}}\Big|\frac1n\sum_{\ell=1}^{k}\mathbf 1_{\{\zeta_\ell\in[[0,x]]\}}-\frac kn\prod_{i=1}^{d-1}x^i\Big|\ \vee\ \sup_{x\in[0,1]^{d-1}}\Big|\frac1n\sum_{\ell=1}^{k-1}\mathbf 1_{\{\zeta_\ell\in[[0,x]]\}}-\frac kn\prod_{i=1}^{d-1}x^i\Big|\Bigg]$$
since one can easily check that the functions of the form $y\mapsto\frac1n\sum_{k=1}^n a_k\mathbf 1_{\{\frac kn\le y\}}-b\,y$ ($a_k,b\ge0$) attain their supremum either at some $y=\frac kn$ or at its left limit "$y=\frac kn-$", $k\in\{1,\dots,n\}$. Consequently,
$$D^*_n\big((\xi_k)_{1\le k\le n}\big)=\frac1n\max_{1\le k\le n}\Bigg[k\,D^*_k(\zeta_1,\dots,\zeta_k)\ \vee\ (k-1)\sup_{x\in[0,1]^{d-1}}\Big|\frac1{k-1}\sum_{\ell=1}^{k-1}\mathbf 1_{\{\zeta_\ell\in[[0,x]]\}}-\frac k{k-1}\prod_{i=1}^{d-1}x^i\Big|\Bigg]$$
$$\le\ \frac1n\max_{1\le k\le n}\Big[k\,D^*_k(\zeta_1,\dots,\zeta_k)\ \vee\ \big((k-1)D^*_{k-1}(\zeta_1,\dots,\zeta_{k-1})+1\big)\Big]$$
$$\le\ \frac{1+\max_{1\le k\le n}k\,D^*_k(\zeta_1,\dots,\zeta_k)}{n}.\qquad(4.15)\quad\diamondsuit$$
Corollary 4.2 Let $d\ge1$. There exists a real constant $C_d\in(0,+\infty)$ such that, for every $n\ge1$, there exists an $n$-tuple $(\xi^n_1,\dots,\xi^n_n)\in([0,1]^d)^n$ satisfying
$$D^*_n(\xi^n_1,\dots,\xi^n_n)\ \le\ C_d\,\frac{1+(\log n)^{d-1}}{n}.$$
The main drawback of this procedure is that if one starts from a sequence with low
discrepancy (often defined recursively), one loses the “telescopic” feature of such a
sequence. If one wishes, for a given function f defined on [0, 1]d , to increase n in
order to improve the accuracy of the approximation, all the terms of the sum in the
empirical mean have to be re-computed.
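Proposition 4.6's construction itself is one line once a $(d-1)$-dimensional sequence is available. A minimal sketch (our names), appending $k/n$ as last coordinate to the first $n$ Halton points:

```python
from fractions import Fraction

def radical_inverse(n, p):
    """Phi_p(n): mirror the base-p digits of n across the radix point."""
    inv, denom = Fraction(0), 1
    while n > 0:
        n, a = divmod(n, p)
        denom *= p
        inv += Fraction(a, denom)
    return inv

def hammersley(n, bases=(2,)):
    """n-point Hammersley set in dimension d = len(bases) + 1: the k-th
    point is the k-th Halton(bases) term with k/n appended."""
    return [tuple(radical_inverse(k, p) for p in bases) + (Fraction(k, n),)
            for k in range(1, n + 1)]
```

Note the drawback discussed above in code form: `hammersley(n)` depends on `n` globally, so increasing `n` forces every point to be recomputed.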
where $\xi^{1:d-1}_k=(\xi^1_k,\dots,\xi^{d-1}_k)$. [Hint: Follow the lines of the proof of Proposition 4.6, using that the supremum of the càdlàg function $\varphi:y\mapsto\frac1n\sum_{k=1}^n a_k\mathbf 1_{\{\xi^d_k\le y\}}-b\,y$ ($a_k,b\ge0$) is $\max_{1\le k\le n}|\varphi(\xi^d_k)|\vee|\varphi((\xi^d_k)_-)|\vee|\varphi(1)|$.]
(b) Deduce that the upper-bound in (4.14) can be slightly improved by an appropriate choice of the $d$-th component of the terms of the $n$-tuple $(\xi_k)_{1\le k\le n}$. [Hint: what can be better than $(\frac kn)_{1\le k\le n}$ in one dimension?]
The use of sequences with low discrepancy to compute integrals instead of the Monte Carlo method (based on pseudo-random numbers) is known as the Quasi-Monte Carlo (QMC) method. This terminology extends to so-called good lattice points, not described here (see [219]).
The Pros.
The most attractive feature of sequences with low discrepancy is the combination of the Koksma–Hlawka inequality with the rate of decay of the discrepancy. It suggests that the QMC method is almost dimension free. This should be tempered in practice: standard a priori bounds for the discrepancy do not allow the use of this inequality to provide "100%-confidence intervals".
where $g$ is a bounded Borel function (see e.g. [223]). As a matter of fact, for such coboundaries,
$$\frac1n\sum_{k=1}^n f(\xi_k)-\int_{[0,1]^d}f(u)\,du=\frac{g(\xi_1)-g(\xi_{n+1})}{n}=O\Big(\frac1n\Big).$$
Then Birkhoff’s pointwise ergodic Theorem (see [174]) implies that, for every f ∈ L 1 (μ),
1
n
μ(d x)-a.s. f (T k−1 (x)) −→ f dμ.
n X
k=1
The mapping T is uniquely ergodic if μ is the only measure satisfying T . If X is a topological space,
X = Bor (X ) and T is continuous, then, for any continuous function f : X → R,
1
n
sup f (T k−1 (x)) − f dμ −→ 0 as n → +∞.
x∈X n X
k=1
In particular, it shows that any orbit (T n−1 (x))n≥1 is μ-distributed. When X = [0, 1]d and μ = λd ,
one retrieves the notion of uniformly distributed sequence. This provides a powerful tool for devising
and studying uniformly distributed sequences. This is the case e.g. for Kakutani sequences or
rotations of the torus.
sequences in one dimension and other orbits of the Kakutani transforms. Similar
results also exist for the rotations of the torus.
Extensive numerical tests on problems involving some smooth (periodic) func-
tions on [0, 1]d , d ≥ 2, have been carried out, see e.g. [48, 237]. They suggest that
this improvement still holds in higher dimensions, at least partially.
It remains that, at least for structural dimensions $d$ up to a few tens, Quasi-Monte Carlo integration empirically usually outperforms regular Monte Carlo simulation, even if the integrated function does not have finite variation. We refer to Fig. 4.1 further on (see Application to the Box–Muller method), where $\mathbb E\,|X^1-X^2|$, $X=(X^1,X^2)\overset{d}{=}\mathcal N(0;I_2)$, is computed by using a simple Halton(2, 3) sequence and by using pseudo-random numbers. More generally we refer to [237] for extensive numerical comparisons between Quasi- and regular Monte Carlo methods.
This concludes the pros.
The cons.
As concerns the cons, the first is that all the non-asymptotic bounds for the dis-
crepancy at the origin are very poor from a numerical point of view. We again refer
to [237] for some examples which emphasize that these bounds cannot be relied on
to provide (deterministic) error intervals for numerical integration. This is a major
drawback compared to the regular Monte Carlo method, which automatically pro-
vides, almost for free, a confidence interval at any desired level.
The second significant drawback concerns the family of functions for which we
know that the QMC numerical integration is speeded up thanks to the Koksma–
Hlawka Inequality. This family – mainly the functions with finite variation over
[0, 1]d in some sense – somehow becomes sparser and sparser in the space of Borel
functions as the dimension d increases since the requested condition becomes more
and more stringent (see Exercise 2 immediately before the Koksma–Hlawka formula
in Proposition 4.3 and Exercise 1 which follows).
More importantly, if one is interested in integrating functions sharing a "standard" regularity like Lipschitz continuity, the following theorem due to Proïnov ([248]) shows that the curse of dimensionality comes back into the game in a striking way, without any possible escape.
Theorem 4.3 (see [248]) Let $d\ge1$. There exists a real constant $C_d\in(0,+\infty)$ such that, for every Lipschitz continuous function $f:[0,1]^d\to\mathbb R$ and every $n$-tuple $(\xi_1,\dots,\xi_n)$ of $[0,1]^d$-valued points,
$$\Big|\int_{[0,1]^d}f(x)\,dx-\frac1n\sum_{k=1}^n f(\xi_k)\Big|\ \le\ C_d\,[f]_{\mathrm{Lip}}\,D^*_n(\xi_1,\dots,\xi_n)^{\frac1d}.$$
Exercise. Show, using the Van der Corput sequence starting at 0 (see the exercise in the paragraph devoted to Van der Corput and Halton sequences in Sect. 4.3.3) and the function $f(x)=x$ on $[0,1]$, that the above Proïnov inequality cannot be improved for Lipschitz continuous functions, even in one dimension. [Hint: Reformulate some results of the exercise in Sect. 4.3.3.]
A third drawback of using QMC for numerical integration is that all functions need to be defined on unit hypercubes. One way to partially get around this is to consider integration over domains $C\subset[0,1]^d$ having a regular boundary in the Jordan sense ($^{10}$). Then a Koksma–Hlawka-like inequality holds true:
$$\Big|\int_C f(x)\,\lambda_d(dx)-\frac1n\sum_{k=1}^n\mathbf 1_{\{\xi_k\in C\}}f(\xi_k)\Big|\ \le\ V(f)\,\tilde D^{\infty}_n(\xi)^{\frac1d},$$
where $V(f)$ denotes the variation of $f$ (in the Hardy and Krause or measure sense) and $\tilde D^{\infty}_n(\xi)$ denotes the extreme discrepancy of $(\xi_1,\dots,\xi_n)$ (see again [219]). The simple fact of integrating over such a set annihilates the low discrepancy effect (at least from a theoretical point of view).
Exercise. Prove Proïnov’s Theorem when d = 1. [Hint: read the next chapter and
compare the star discrepancy modulus and the L 1 -mean quantization error.]
This suggests that the rate of numerical integration of Lipschitz continuous functions in dimension $d$ by a sequence with low discrepancy is $O\Big(\frac{\log n}{n^{1/d}}\Big)$ as $n\to+\infty$, or $O\Big(\frac{(\log n)^{\frac{d-1}{d}}}{n^{1/d}}\Big)$ when considering, for a fixed $n$, an $n$-tuple designed by the Hammersley method. This emphasizes that sequences with low discrepancy are not spared by the curse of dimensionality when implemented on functions with standard regularity…
$^{10}$ Namely that for every $\varepsilon>0$, $\lambda_d(\{u\in[0,1]^d:\ \mathrm{dist}(u,\partial C)<\varepsilon\})\le\kappa_C\,\varepsilon$.
Exercise. (a) Let $\xi=(\xi_n)_{n\ge1}$ denote the dyadic Van der Corput sequence. Show that, for every $n\ge0$,
$$\xi_{2n+1}=\xi_{2n}+\frac12\qquad\text{and}\qquad\xi_{2n}=\frac{\xi_n}{2},$$
with the convention $\xi_0=0$. Deduce that
$$\lim_n\frac1n\sum_{k=1}^n\xi_{2k}\,\xi_{2k+1}=\frac5{24}.$$
$$\zeta_n=(\zeta^1_n,\zeta^2_n):=\Big(\sqrt{-2\log(\xi^1_n)}\,\sin(2\pi\xi^2_n),\ \sqrt{-2\log(\xi^1_n)}\,\cos(2\pi\xi^2_n)\Big).$$
Then, for every bounded continuous function $f_1:\mathbb R\to\mathbb R$,
$$\frac1n\sum_{k=1}^n f_1(\zeta^1_k)\longrightarrow\mathbb E\,f_1(Z^1),\qquad Z^1\overset{d}{=}\mathcal N(0;1),$$
since $(\xi^1,\xi^2)\mapsto f_1\big(\sqrt{-2\log\xi^1}\,\sin(2\pi\xi^2)\big)$ is continuous on $(0,1]^2$ and bounded, hence Riemann integrable on $[0,1]^2$. Likewise, for every bounded continuous function $f_2:\mathbb R^2\to\mathbb R$,
$$\frac1n\sum_{k=1}^n f_2(\zeta_k)\longrightarrow\mathbb E\,f_2(Z),\qquad Z\overset{d}{=}\mathcal N(0;I_2).$$
These continuity assumptions on $f_1$ and $f_2$ can be relaxed, e.g. for $f_2$ to: the function $\tilde f_2$ defined on $(0,1]^2$ by
$$\tilde f_2(\xi^1,\xi^2):=f_2\Big(\sqrt{-2\log\xi^1}\,\sin(2\pi\xi^2),\ \sqrt{-2\log\xi^1}\,\cos(2\pi\xi^2)\Big)$$
is Riemann integrable on $[0,1]^2$.
i = 1, . . . , d/2.
In particular, we will see further on in Chap. 7 that simulating the Euler scheme with step $\frac Tm$ of a $d$-dimensional diffusion over $[0,T]$ with an underlying $q$-dimensional Brownian motion consumes $m$ independent $\mathcal N(0;I_q)$-distributed random vectors, i.e. $m\times q$ independent $\mathcal N(0;1)$ random variables. To perform a QMC simulation of a function of this Euler scheme at time $T$, we consequently need to consider a sequence with low discrepancy over $[0,1]^{mq}$. Existing error bounds on sequences with low discrepancy and the sparsity of functions with finite variation make essentially meaningless any use of the Koksma–Hlawka inequality to produce error bounds. Proïnov's theorem itself is difficult to use, owing to the difficult evaluation of $[f]_{\mathrm{Lip}}$. Not to mention that in the latter case, the curse of dimensionality will lead to extremely poor theoretical bounds for Lipschitz functions (like for $\tilde f_2$ in dimension 2).
Fig. 4.1 MC versus QMC. $\mathbb E\,|X^1-X^2|$ computed by simulation with 1 500 trials of pseudo-random numbers (red) and a Halton(2, 3) sequence (blue) plugged into a Box–Muller formula (reference value $\frac{2}{\sqrt\pi}$)
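The QMC half of this experiment can be sketched as follows: feed Halton(2, 3) pairs into the Box–Muller transform (note that $\xi^1_n=\Phi_2(n)\in(0,1)$ for every $n\ge1$, so the logarithm is always defined; the sample size follows the figure):

```python
import math

def phi(n, p):
    """Radical inverse in base p."""
    inv, denom = 0.0, 1.0
    while n:
        n, a = divmod(n, p)
        denom *= p
        inv += a / denom
    return inv

def box_muller(u1, u2):
    """Map a point of (0,1) x [0,1) to a pair of N(0;1) coordinates."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.sin(2 * math.pi * u2), r * math.cos(2 * math.pi * u2)

N = 1500
est = sum(abs(z1 - z2)
          for z1, z2 in (box_muller(phi(n, 2), phi(n, 3))
                         for n in range(1, N + 1))) / N
print(est, 2 / math.sqrt(math.pi))  # estimate vs reference 2/sqrt(pi)
```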
Exercise. (a) Implement a QMC-adapted Box–Muller simulation method for $\mathcal N(0;I_2)$ (based on the sequence with low discrepancy of your choice) and organize a race MC vs QMC to compute various calls, say $\mathrm{Call}_{BS}(K,T)$ ($T=1$, $K\in\{95,96,\dots,104,105\}$) in a Black–Scholes model (with $r=2\%$, $\sigma=30\%$, $x=100$, $T=1$). To simulate this underlying Black–Scholes risky asset, first use the closed expression
$$X^x_T=x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt T\,Z},\qquad Z\overset{d}{=}\mathcal N(0;1).$$
(b) Anticipating Chap. 7, implement the Euler scheme (7.5) of the Black–Scholes dynamics
$$dX^x_t=X^x_t\,(r\,dt+\sigma\,dW_t).$$
$$\xi^i_n=\frac{n}{p_i},\ n\in\{1,\dots,p_i-1\},\qquad \xi^i_n=\frac1{p_i^2}+\frac{n-p_i}{p_i},\ n\in\{p_i,\dots,2p_i-1\},\ \dots$$
so it is clear that the $i$-th and the $(i+1)$-th components will remain highly correlated: if $d=97$ and $(i,i+1)=(d-1,d)$, then $p_{d-1}=503$ and $p_d=509$…
To overcome this correlation, observed for (not so) small values of $n$, the usual method is to discard the first values of the sequence.
Exercise. Let $\xi^1=VdC(p_1)$ and $\xi^2=VdC(p_2)$, where $p_1$ and $p_2$ are two (reasonably large) distinct prime numbers satisfying $p_1<p_2<2p_1$ ($p_2$ exists owing to Bertrand's conjecture). Show that the pseudo-estimator of the correlation between these two sequences at order $n=p_1$ satisfies
$$\frac1n\sum_{k=1}^n\xi^1_k\xi^2_k-\Big(\frac1n\sum_{k=1}^n\xi^1_k\Big)\Big(\frac1n\sum_{k=1}^n\xi^2_k\Big)=\Big(1+\frac1{p_1}\Big)\frac{4p_1-p_2+1}{12\,p_2}>\frac1{12}\Big(1+\frac1{p_1}\Big)\Big(1+\frac1{2p_1}\Big)>\frac1{12}.$$
MC versus QMC: a visual point of view. As a conclusion to this section, let us emphasize graphically in Fig. 4.2 the difference of texture between MC and QMC "sampling", i.e. between (pseudo-)randomly generated points (say 60 000) and the same number of terms of the Halton sequence (with $p_1=2$ and $p_2=3$).
Fig. 4.2 MC versus QMC. Left: $6\times10^4$ randomly generated points; Right: $6\times10^4$ terms of the Halton(2, 3) sequence
The starting idea is to note that a shifted uniformly distributed sequence is still
uniformly distributed. The second stage is to randomly shift such a sequence to
combine the properties of the Quasi- and regular Monte Carlo simulation methods.
Proposition 4.7 (a) Let $U$ be a uniformly distributed random variable on $[0,1]^d$. Then, for every $a\in\mathbb R^d$,
$$\{a+U\}\overset{d}{=}U.$$
Proof. (a) One easily checks that the characteristic functions of U and {a + U }
coincide on Zd since, for every p ∈ Zd ,
Hence (see [155]), $U$ and $\{a+U\}$ have the same distribution since they are $[0,1]^d$-valued (in fact this is the static version of Weyl's criterion used to prove claim (b) below).
(b) This follows from Weyl’s criterion: let p ∈ Nd \ {0},
$$\chi=\chi(f,U):=\frac1n\sum_{k=1}^n f(\{U+\xi_k\})$$
satisfies
$$\mathbb E\,\chi=\frac1n\times n\,\mathbb E\,f(U)=\mathbb E\,f(U).$$
Given $M$ i.i.d. copies $U_1,\dots,U_M$ of $U$, one sets
$$\widehat I(f,\xi)_{n,M}:=\frac1M\sum_{m=1}^M\chi(f,U_m)=\frac1{nM}\sum_{m=1}^M\sum_{k=1}^n f(\{U_m+\xi_k\}).$$
This estimator converges to $\mathbb E\,f(U)$ by the Strong Law of Large Numbers, and its (weak) rate of convergence is ruled by a CLT
$$\sqrt M\big(\widehat I(f,\xi)_{n,M}-\mathbb E\,f(U)\big)\overset{\mathcal L}{\longrightarrow}\mathcal N\big(0;\sigma^2_n(f,\xi)\big)$$
with
$$\sigma^2_n(f,\xi)=\mathrm{Var}\Big(\frac1n\sum_{k=1}^n f(\{U+\xi_k\})\Big).$$
Hence, the specific rate of convergence of the QMC is irremediably lost. In particular, this hybrid method should be compared to a regular Monte Carlo simulation of size $nM$ through their respective variances. It is clear that we will observe a variance reduction if and only if
$$\frac{\sigma^2_n(f,\xi)}{M}<\frac{\mathrm{Var}(f(U))}{nM},$$
i.e.
$$\mathrm{Var}\Big(\frac1n\sum_{k=1}^n f(\{U+\xi_k\})\Big)\le\frac{\mathrm{Var}(f(U))}{n}.$$
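The estimator $\widehat I(f,\xi)_{n,M}$ of the preceding discussion can be sketched as follows (`shifted_qmc` is our name; a toy check on $f(u)=u^1u^2$, whose integral over $[0,1]^2$ is $\frac14$):

```python
import random

def phi(n, p):
    """Radical inverse in base p."""
    inv, denom = 0.0, 1.0
    while n:
        n, a = divmod(n, p)
        denom *= p
        inv += a / denom
    return inv

def shifted_qmc(f, n, M, bases=(2, 3), rng=random):
    """Randomized QMC: average over M i.i.d. uniform shifts U_m of the
    unbiased estimators chi_m = (1/n) * sum_k f({U_m + xi_k})."""
    d = len(bases)
    xi = [[phi(k, p) for p in bases] for k in range(1, n + 1)]
    total = 0.0
    for _ in range(M):
        u = [rng.random() for _ in range(d)]
        total += sum(f([(ui + xki) % 1.0 for ui, xki in zip(u, xk)])
                     for xk in xi) / n
    return total / M

random.seed(0)
est = shifted_qmc(lambda x: x[0] * x[1], n=512, M=20)
print(est)  # close to the exact value E[U^1 U^2] = 1/4
```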
The only natural upper-bound for the left-hand side of this inequality is
$$\sigma^2_n(f,\xi)=\int_{[0,1]^d}\Big(\frac1n\sum_{k=1}^n f(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2du\ \le\ \sup_{u\in[0,1]^d}\Big(\frac1n\sum_{k=1}^n f(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2.$$
One can show that $f_u:v\mapsto f(\{u+v\})$ has finite variation on $[0,1]^d$ as soon as $f$ has (in the same sense) and that $\sup_{u\in[0,1]^d}V(f_u)<+\infty$ (more precise results can be established). Consequently, by the Koksma–Hlawka inequality,
$$\sigma^2_n(f,\xi)\ \le\ \Big(\sup_{u\in[0,1]^d}V(f_u)\Big)^2D^*_n(\xi)^2,$$
so that, if $\xi=(\xi_n)_{n\ge1}$ is a sequence with low discrepancy (say Faure, Halton, Kakutani or Sobol', etc.),
$$\sigma^2_n(f,\xi)\ \le\ C^2_{f,\xi}\,\frac{(\log n)^{2d}}{n^2},\qquad n\ge1.$$
Consequently, in that case, it is clear that randomized QMC provides a very significant variance reduction (for the same complexity), of a magnitude proportional to $\frac{(\log n)^{2d}}{n}$.
the whole $\mathbb R^d$ with the same Lipschitz coefficient, say $[f]_{\mathrm{Lip}}$. Furthermore, it satisfies $f(x)=f(\{x\})$ for every $x\in\mathbb R^d$. Then, it follows from Proïnov's Theorem 4.3 that
$$\sup_{u\in[0,1]^d}\Big(\frac1n\sum_{k=1}^n f(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2\ \le\ C_d^2\,[f]^2_{\mathrm{Lip}}\,D^*_n(\xi_1,\dots,\xi_n)^{\frac2d}\ \le\ C_d^2\,[f]^2_{\mathrm{Lip}}\,C_\xi^{\frac2d}\,\frac{(\log n)^2}{n^{\frac2d}}$$
(where Cd is Proïnov’s constant). This time, still for a prescribed budget, the “gain”
2
factor in terms of variance is proportional to n 1− d (log n)2 , which is no longer a
gain. . . but a loss as soon as d ≥ 2!
For more results and details, we refer to the survey [271] on randomized QMC
and the references therein.
Finally, randomized QMC is a specific (and not so easy to handle) variance reduction method, not a QMC speeding-up method. It suffers from one drawback shared by all QMC-based simulation methods: the sparsity of the class of functions with finite variation and the difficulty of identifying them in practice when $d>1$.
While the principle of mixing randomness and the Quasi-Monte Carlo method is undoubtedly a way to improve rates of convergence of numerical integration over unit hypercubes, the approach based on randomly "shifted" sequences with low discrepancy (or nets) described in the former section turned out not to be completely satisfactory and is no longer considered the most efficient way to proceed by the QMC community.
A new idea emerged at the very end of the 20th century inspired by the pioneering
work by A. Owen (see [221]): to break the undesired regularity which appears even
in the most popular sequences with low discrepancy (like Sobol’ sequences), he
proposed to scramble them in an i.i.d. random way so that these regularity features
disappear while preserving the quality, in terms of discrepancy, of these resulting
sequences (or nets).
The underlying principle – or constraint – was to preserve their “geometric-
combinatorial” properties. Typically, if a sequence shares the (s, d)-property in a
given base (or the (s, m, d)-property for a net), its scrambled version should share it
too. Several attempts to produce efficient deterministic scrambling procedures have
been made as well, but of course the most radical way to get rid of regularity features
was to consider a kind of i.i.d. scrambling as originally developed in [221]. This has
been successfully applied to the “best” Sobol’ sequences by various authors.
We will not go deeper into the detail and technicalities as it would lead us too
far from probabilistic concepts and clearly beyond the mathematical scope of this
textbook.
Nevertheless, one should keep in mind that these types of improvements are
mostly devoted to highly regular functions. In the same spirit, several extensions of
the Koksma–Hlawka Inequality have been established (see [78]) for differentiable
functions with finite variation (in the Hardy and Krause sense) whose partial deriva-
tives also have finite variations up to a given order.
If one looks at the remark “The practitioner’s viewpoint” in Sect. 1.4 devoted to
Von Neumann’s acceptance-rejection method, it is a simple exercise to check that
one can replace pseudo-random numbers by a uniformly distributed sequence in
the procedure (almost) mutatis mutandis, except for some more stringent regularity
assumptions.
We adopt the notations of this section and assume that the $\mathbb R^d$-valued random vectors $X$ and $Y$ have absolutely continuous distributions with respect to a reference $\sigma$-finite measure $\mu$ on $\big(\mathbb R^d,\mathrm{Bor}(\mathbb R^d)\big)$. Assume that $\mathbb P_X$ has a density proportional to $f$, that $\mathbb P_Y=g\cdot\mu$, and that $f$ and $g$ satisfy $f\le c\,g$ for some real constant $c>0$. Moreover, assume that $Y$ can be simulated as
$$Y\overset{d}{=}\Psi(U),\qquad U\overset{d}{=}\mathcal U([0,1]^r).$$
Our aim is to compute $\mathbb E\,\varphi(X)$, where $\varphi\in L^1(\mathbb P_X)$. Since we will use $\varphi(Y)=\varphi\circ\Psi(U)$ to perform this integration (see below), we also ask $\varphi$ to be such that $\varphi\circ\Psi$ is Riemann integrable. This classically holds true if $\varphi$ is continuous (see e.g. [52], Chap. 3).
Let $\xi=(\xi^1_n,\xi^2_n)_{n\ge1}$ be a $[0,1]\times[0,1]^r$-valued sequence, assumed to be with low discrepancy (or simply uniformly distributed) over $[0,1]^{r+1}$. In particular, $(\xi^1_n)_{n\ge1}$ and $(\xi^2_n)_{n\ge1}$ are uniformly distributed over $[0,1]$ and $[0,1]^r$, respectively. If $(U,V)\overset{d}{=}\mathcal U([0,1]\times[0,1]^r)$, then $(U,\Psi(V))\overset{d}{=}(U,Y)$. Consequently, the product of two Riemann integrable functions being Riemann integrable,
$$\frac{\sum_{k=1}^n\mathbf 1_{\{c\,\xi^1_k g(\Psi(\xi^2_k))\le f(\Psi(\xi^2_k))\}}\,\varphi(\Psi(\xi^2_k))}{\sum_{k=1}^n\mathbf 1_{\{c\,\xi^1_k g(\Psi(\xi^2_k))\le f(\Psi(\xi^2_k))\}}}\ \overset{n\to+\infty}{\longrightarrow}\ \frac{\mathbb E\big(\mathbf 1_{\{c\,Ug(Y)\le f(Y)\}}\,\varphi(Y)\big)}{\mathbb P\big(c\,Ug(Y)\le f(Y)\big)}=\int_{\mathbb R^d}\varphi(x)f(x)\,dx=\mathbb E\,\varphi(X),\qquad(4.17)$$
where the last two equalities follow from computations carried out in Sect. 1.4.
The main gap to apply the method in a QMC framework is the a.s. continuity
assumption (4.16). The following proposition yields an easy and natural criterion.
Proposition 4.8 If the function (f/g) ∘ Θ is λ_r-a.s. continuous on [0,1]^r, then Assumption (4.16) is satisfied.

 λ_{r+1}( Disc(I) ) = λ_{r+1}( { (ξ¹, ξ²) ∈ [0,1]^{r+1} s.t. c ξ¹ = (f/g) ∘ Θ(ξ²) } ).
In turn, this subset of [0,1]^{r+1} is negligible with respect to the Lebesgue measure λ_{r+1} since, returning to the independent random variables U and Y and keeping in mind that g(Y) > 0 P-a.s.,

 λ_{r+1}( { (ξ¹, ξ²) ∈ [0,1]^{r+1} s.t. c ξ¹ = (f/g) ∘ Θ(ξ²) } ) = P( c U g(Y) = f(Y) ) = P( U = f(Y)/(c g(Y)) ) = 0,
where we used (see exercise below) that U and Y are independent by construction
and that U has a diffuse distribution (no atom). ♦
Remark. The criterion of the proposition is trivially satisfied when f/g and Θ are continuous on R^d and [0,1]^r, respectively.
Exercise. Show that if X and Y are independent and X or Y has no atom then
P(X = Y ) = 0.
To conclude, note that in this section we provide no rate of convergence for this quasi-Monte Carlo acceptance-rejection method. In fact, there is no such error bound under realistic assumptions on f, g, ϕ and Θ. Only empirical evidence can justify its use in practice.
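Although no rate is guaranteed, the procedure itself is easy to sketch. In the toy example below (not taken from the text), the target is the Beta(2,2) density f(x) = 6x(1−x) on [0,1], the proposal is uniform (g ≡ 1, with the simulation map reduced to the identity), and c = 1.5 = max f; the two coordinates of a Halton pair play the roles of ξ¹ and ξ².

```python
def halton(n, base):
    """n-th term of the van der Corput sequence in the given base."""
    q, denom = 0.0, 1.0
    while n > 0:
        denom *= base
        n, r = divmod(n, base)
        q += r / denom
    return q

# Target density f(x) = 6x(1-x) on [0,1] (Beta(2,2)); proposal Y = U uniform on
# [0,1] (g = 1), and f <= c*g with c = 1.5 (the maximum of f).
f = lambda x: 6.0 * x * (1.0 - x)
c = 1.5
phi = lambda x: x          # test functional: E phi(X) = E X = 1/2

num = den = 0.0
for k in range(1, 20001):
    xi1, xi2 = halton(k, 2), halton(k, 3)    # low-discrepancy pair in [0,1]^2
    if c * xi1 * 1.0 <= f(xi2):              # acceptance test: c u g(y) <= f(y)
        num += phi(xi2)
        den += 1.0
est = num / den
```

The empirical ratio est converges to E ϕ(X) = 1/2 in this example, with roughly two thirds of the points accepted (the acceptance probability is 1/c).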
The Borel sets C_i(Γ) are called Voronoi cells of the partition induced by Γ, and

 C̊_i(Γ) = { ξ ∈ R^d : |ξ − x_i| < min_{j≠i} |ξ − x_j| }.
◦
Furthermore, C̊_i(Γ) and C̄_i(Γ) are polyhedral convex sets, as intersections of the half-spaces containing x_i defined by the median hyperplanes H_{ij} of the pairs (x_i, x_j), j ≠ i.
Let Γ = {x1, …, xN} be an N-quantizer. The nearest neighbor projection Proj_Γ : R^d → {x1, …, xN} induced by a Voronoi partition (C_i(Γ))_{i=1,…,N} is defined by

 ∀ ξ ∈ R^d, Proj_Γ(ξ) := Σ_{i=1}^N x_i 1_{{ξ ∈ C_i(Γ)}}.

The associated Voronoi quantization of X is then

 X̂^Γ = Proj_Γ(X) = Σ_{i=1}^N x_i 1_{{X ∈ C_i(Γ)}}  (5.1)
and does not depend on the selected nearest neighbor projection on Γ. When X has a strongly continuous distribution, i.e. P(X ∈ H) = 0 for every hyperplane H of R^d, the boundaries of the Voronoi cells C_i(Γ) are P-negligible, so that any two quantizations induced by Γ are P-a.s. equal.
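A nearest-neighbor projection is a one-line argmin; the 3-point planar grid below is an arbitrary illustration, not an optimal quantizer.

```python
def proj(grid, xi):
    """Nearest-neighbor projection Proj_Gamma(xi) onto a finite grid in R^d
    (ties are broken by grid order, i.e. by lowest index)."""
    return min(grid, key=lambda x: sum((a - b) ** 2 for a, b in zip(x, xi)))

# An arbitrary 3-point grid in R^2 (illustrative only)
gamma = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
xhat = proj(gamma, (0.2, 0.1))    # lands in the Voronoi cell of (1.0, 0.0)
```

Breaking ties by lowest index is one admissible choice of nearest-neighbor projection; any other tie-breaking rule yields a P-a.s. equal quantization when the distribution is strongly continuous.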
Definition 5.2 (a) The mean quadratic quantization error induced by an N-quantizer Γ ⊂ R^d is defined as the quadratic norm of the pointwise error, i.e.

 ‖X − X̂^Γ‖_2 = ( E min_{1≤i≤N} |X − x_i|² )^{1/2} = ( ∫_{R^d} min_{1≤i≤N} |ξ − x_i|² P_X(dξ) )^{1/2}.  (5.2)
(b) The quadratic distortion function at level N is defined on (R^d)^N as the squared mean quadratic quantization error, namely

 Q_{2,N}(x) = E min_{1≤i≤N} |X − x_i|² = ‖X − X̂^x‖²_2, x = (x1, …, xN) ∈ (R^d)^N.
Theorem 5.1 (Existence of optimal N-quantizers, [129, 224]) Let X ∈ L²_{R^d}(P) and let N ∈ N*.
(a) The quadratic distortion function Q_{2,N} at level N attains a minimum at an N-tuple x^(N) ∈ (R^d)^N and Γ^(N) = { x_i^(N), i = 1, …, N } is an optimal quantizer at level N (though its cardinality may be lower than N, see the remark below).
(b) If the support of the distribution P_X of X has at least N elements, then x^(N) = (x1^(N), …, xN^(N)) has pairwise distinct components, P( X ∈ C_i(x^(N)) ) > 0, i = 1, …, N (and min_{x∈(R^d)^{N−1}} Q_{2,N−1}(x) > 0). Furthermore, the sequence N ↦ inf_{x∈(R^d)^N} Q_{2,N}(x) converges to 0 and is (strictly) decreasing as long as it is positive.
Remark. If supp(P_X) is finite, say supp(P_X) = {x1, …, x_{N0}} ⊂ R^d, N0 ≥ 1 (with pairwise distinct x_i), then x^(N0) = (x1, …, x_{N0}) is an optimal quantizer at level N0 with min_{x∈(R^d)^{N0}} Q_{2,N0}(x) = 0 and, for every level N ≥ N0, min_{x∈(R^d)^N} Q_{2,N}(x) = 0.
Proof. (a) We will proceed by induction on the level N. First note that the L²-mean quantization error function defined on (R^d)^N by

 (x1, …, xN) ↦ √( Q_{2,N}(x1, …, xN) ) = ‖ min_{1≤i≤N} |X − x_i| ‖_2

is clearly 1-Lipschitz continuous with respect to the ℓ^∞-norm on (R^d)^N defined by ‖(x1, …, xN)‖_∞ := max_{1≤i≤N} |x_i|. This is a straightforward consequence of Minkowski's inequality combined with the more elementary inequality

 | min_{1≤i≤N} a_i − min_{1≤i≤N} b_i | ≤ max_{1≤i≤N} |a_i − b_i|

(whose proof is left to the reader). Consequently, its square, the quadratic distortion function Q_{2,N}, is continuous as well.
As a preliminary remark, note that by its very definition the sequence N ↦ inf_{x∈(R^d)^N} ‖X − X̂^x‖_2 is non-increasing.
N = 1. The non-negative strictly convex function Q2,1 clearly goes to +∞ as
|x1 | → +∞. Hence, Q2,1 attains a unique minimum at the mean of X , i.e. x1(1) = E X .
So {x1(1) } is an optimal quantization grid at level 1.
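The N = 1 case is easy to visualize numerically: for a toy three-point distribution (values chosen arbitrarily), a brute-force scan of a ↦ E|X − a|² locates the minimum at E X.

```python
# A toy three-point distribution (values and weights are arbitrary)
xs = [0.0, 1.0, 3.0]
ps = [0.5, 0.3, 0.2]

def q21(a):
    """Level-1 quadratic distortion Q_{2,1}(a) = E|X - a|^2."""
    return sum(p * (x - a) ** 2 for x, p in zip(xs, ps))

mean = sum(p * x for x, p in zip(xs, ps))                      # E X = 0.9
best = min((k / 1000.0 for k in range(-1000, 4001)), key=q21)  # brute-force argmin
```

Since Q_{2,1} is a quadratic function of a with vertex at E X, the scan returns the grid point closest to the mean.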
The sequence (x^[k])_{k∈N*} being K_{N+1}-valued, one has

 inf_{y∈(R^d)^{|I|}} Q_{2,|I|}(y) ≤ λ_{N+1} < inf_{x∈(R^d)^N} Q_{2,N}(x).

In turn, this implies that |I| = N + 1, i.e. the sequence of (N+1)-tuples (x^[k])_{k≥1} is bounded. As a consequence, the set K_{N+1} is compact and the function Q_{2,N+1} attains a minimum over K_{N+1} at an x^(N+1) which is obviously its absolute minimum and has pairwise distinct components such that, with obvious notation, card(x^(N+1)) = N + 1 and P( X ∈ C_i(x^(N+1)) ) > 0, i = 1, …, N + 1.
(b) The strict decrease of N ↦ inf_{(R^d)^N} Q_{2,N} as long as it is not 0 is a straightforward consequence of the proof of Claim (a). If (z_N)_{N≥1} is an everywhere dense sequence in R^d, then

 0 ≤ inf_{(R^d)^N} Q_{2,N} ≤ ‖X − X̂^{(z1,…,zN)}‖²_2 = ‖ min_{1≤i≤N} |X − z_i| ‖²_2 ↓ 0 as N → +∞
2. L^p-optimal quantizers. Let p ∈ (0, +∞) and X ∈ L^p_{R^d}(P). Show that the L^p-distortion function at level N ∈ N* defined by

 Q_{p,N} : x = (x1, …, xN) ↦ E min_{1≤i≤N} |X − x_i|^p = ‖X − X̂^x‖^p_p  (5.4)

attains its minimum on (R^d)^N. (Note that when p = N = 1 this minimum is attained at the median of the distribution of X.)
3. Constrained quantization (at 0). (a) Show that, if X ∈ L²_{R^d}(P) and N ∈ N*, the function defined on (R^d)^N by

 (x1, …, xN) ↦ E( min_{1≤i≤N} |X − x_i|² ∧ |X|² ) = ∫_{R^d} min_{1≤i≤N} |ξ − x_i|² ∧ |ξ|² P_X(dξ)  (5.5)

attains a minimum at an N-tuple (x1*, …, xN*).
(b) How would you interpret (x1*, …, xN*) in terms of quadratic optimal quantization? At which level?
On the other hand, by the definition of an x^(N) = (x1^(N), …, xN^(N))-valued Voronoi quantizer, one has

 ‖X − E( X | X̂^{x^(N)} )‖_2 ≥ inf_{x∈(R^d)^N} ‖ dist(X, x) ‖_2 = ‖X − X̂^{x^(N)}‖_2.
The former equality and this inequality are compatible if and only if E( X | X̂^{x^(N)} ) = X̂^{x^(N)} P-a.s.
(b) Let Y : (Ω, A, P) → R^d with |Y(Ω)| ≤ N and let Γ = {Y(ω), ω ∈ Ω}. It is clear that |Γ| ≤ N, so that

 ‖X − Y‖_2 ≥ ‖ dist(X, Γ) ‖_2 = ‖X − X̂^Γ‖_2 ≥ inf_{x∈(R^d)^N} √( Q_{2,N}(x) ) = ‖X − X̂^{x^(N)}‖_2. ♦
Fig. 5.1 Optimal quadratic quantization of size N = 200 of the bi-variate normal distribution
N (0, I2 )
Fig. 5.2 Voronoi Tessellation of an optimal N -quantizer (N = 500) for the N (0; I2 ) distribution.
Color code: the heavier the cell is, the warmer (i.e. the lighter) the cell looks (with J. Printems)
The closer a cell is to the mode of the distribution, the heavier it weighs. In fact, optimizing a quantizer tends to equalize the local inertia of the cells, i.e.

 E( 1_{{X ∈ C_i(x^(N))}} |X − x_i^(N)|² ) ≈ (1/N) E min_{1≤j≤N} |X − x_j^(N)|², i = 1, …, N.
Fig. 5.3 x_i^(N) ↦ P( X ∈ C_i(x^(N)) ) (flat line) and x_i^(N) ↦ E( 1_{{X ∈ C_i(x^(N))}} |X − x_i^(N)|² ), X ∼ N(0; 1) (Gaussian line), x^(N) optimal N-quantizer, N = 50 (with J.-C. Fort)
This fact can be easily highlighted numerically in one dimension, e.g. on the normal
distribution, as illustrated in Fig. 5.3 (1 ).
Let us return to the study of the quadratic mean quantization error and, to be more
precise, to its asymptotic behavior as the quantization level N goes to infinity. As seen
in Theorem 5.1, the fact that the minimal mean quadratic quantization error goes to
zero as N goes to infinity is relatively obvious since it follows from the existence of
an everywhere dense sequence in Rd . Determining the rate of this convergence is a
much more involved question which is answered by the following theorem, known as
Zador’s Theorem, that we will essentially admit. In fact, optimal vector quantization
can be defined and developed in any Lp -space and, keeping in mind some future
application (in L1 ), we will state the theorem in such a general framework.
1 The “Gaussian” solid line follows the shape of ξ ↦ ϕ_{1/3}(ξ) = e^{−ξ²/(2·3)} / (√3 √(2π)), i.e. P( X ∈ C_i(x^(N)) ) ≈ ϕ_{1/3}(x_i^(N)), i = 1, …, N, in a sense to be made precise. This is another property of optimal quantizers which is beyond the scope of this textbook, see e.g. [132].
2 This means that there is a Borel set A_ν ∈ Bor(R^d) such that λ_d(A_ν) = 0 and ν(A_ν) = ν(R^d). Such a decomposition always exists and is unique.
(b) Non-asymptotic upper bound (see [205]). Let δ > 0. There exists a real constant C_{d,p,δ} ∈ (0, +∞) such that, for every R^d-valued random vector X,

 ∀ N ≥ 1, min_{x∈(R^d)^N} ‖X − X̂^x‖_p ≤ C_{d,p,δ} σ_{p+δ}(X) N^{−1/d}.

Additional proof of Claim (b) (see also Theorem 1 in [236] and the remark that follows). In fact, what is precisely proved in [205] (Lemma 1) is

 ∀ N ≥ 1, min_{x∈(R^d)^N} ‖X − X̂^x‖_p ≤ C_{d,p,δ} ‖X‖_{p+δ} N^{−1/d}.
To derive the above conclusion, just note that the quantization error is invariant under translation: setting Z = X − a and x − a := {x1 − a, …, xN − a}, one has X̂^x − a = Ẑ^{x−a}, so that X − X̂^x = Z − Ẑ^{x−a}, which in turn implies

 ∀ a ∈ R^d, min_{x∈(R^d)^N} ‖X − X̂^x‖_p = min_{x∈(R^d)^N} ‖Z − Ẑ^{x−a}‖_p = min_{x∈(R^d)^N} ‖Z − Ẑ^x‖_p ≤ C_{d,p,δ} ‖X − a‖_{p+δ} N^{−1/d}.

Taking the infimum over a ∈ R^d yields the announced bound with σ_{p+δ}(X) = inf_{a∈R^d} ‖X − a‖_{p+δ}.
Remarks. • By truncating any random vector X, one easily checks that one always has

 lim inf_{N→+∞} N^{1/d} min_{x∈(R^d)^N} ‖X − X̂^x‖_p ≥ J_{p,d} ( ∫_{R^d} ϕ^{d/(d+p)} dλ_d )^{(p+d)/(pd)},

where ϕ denotes the density of the absolutely continuous part of P_X.
• The first rigorous proof of claim (a) in a general framework is due to S. Graf and H.
Luschgy in [129]. Claim (b) is an improved version of the so-called Pierce’s Lemma,
also established in [129].
• The N^{−1/d} factor is known as the curse of dimensionality: this is the optimal rate at which a d-dimensional space can be “filled” by 0-dimensional objects.
• The real constant J_{p,d} clearly corresponds to the case of the uniform distribution U([0,1]^d) over the unit hypercube [0,1]^d, for which the following slightly more precise statement holds:

 lim_N N^{1/d} min_{x∈(R^d)^N} ‖X − X̂^x‖_p = inf_N N^{1/d} min_{x∈(R^d)^N} ‖X − X̂^x‖_p = J_{p,d}.
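In dimension d = 1 with p = 2, the constant is explicit for U([0,1]): the midpoint grid x_i = (2i−1)/(2N) is optimal (a classical fact, easy to check via the stationarity equations), its quantization error is available in closed form, and N·‖X − X̂‖₂ = 1/√12 = J_{2,1} for every N.

```python
import math

def quant_error_uniform(N):
    """Exact L2 quantization error of U([0,1]) for the midpoint grid
    x_i = (2i-1)/(2N): each of the N cells has length 1/N and inertia (1/N)^3/12."""
    return math.sqrt(N * (1.0 / N) ** 3 / 12.0)

vals = [N * quant_error_uniform(N) for N in (10, 100, 1000)]
# every entry equals 1/sqrt(12) = J_{2,1} ≈ 0.2886751, independently of N
```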
The random vector X̂^x takes its values in the finite set (or grid) {x1, …, xN} (of size N), so for every continuous function F : R^d → R with F(X) ∈ L²(P), we have

 E F(X̂^x) = Σ_{i=1}^N F(x_i) P( X ∈ C_i(x) ),
owing to the conditional Jensen inequality applied to the convex function u ↦ u^r (see Proposition 3.1). In particular, using that E F(X) = E( E( F(X) | X̂^x ) ), one derives (with r = 1) that

 | E F(X) − E F(X̂^x) | ≤ ‖ E( F(X) | X̂^x ) − F(X̂^x) ‖_1 ≤ [F]_Lip ‖X − X̂^x‖_1.  (5.8)
This universal bound is optimal in the following sense: considering the 1-Lipschitz continuous function F(ξ) := min_{i=1,…,N} |ξ − x_i| = dist(ξ, x), which vanishes at every component x_i of x, shows that equality may hold in (5.8), so that

 ‖X − X̂^x‖_1 = sup_{[F]_Lip ≤ 1} | E F(X) − E F(X̂^x) |.  (5.9)
3 If μ and ν are distributions on (R^d, Bor(R^d)) with finite p-th moment (1 ≤ p < +∞), the L^p-Wasserstein distance is defined by W_p(μ, ν) = inf{ ‖U − V‖_p : U =^(d) μ, V =^(d) ν, U and V defined on the same probability space }. For p = 1, the Monge–Kantorovich duality yields W_1(μ, ν) = sup{ | ∫ f dμ − ∫ f dν | : [f]_Lip ≤ 1 }. Note that the absolute values can be removed in the above characterization of W_1(μ, ν) since f and −f are simultaneously Lipschitz continuous with the same Lipschitz coefficient.
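In practice the cubature formula is a plain weighted sum. The sketch below uses X uniform on [0,1] and the midpoint grid, for which the weights (1/N each) and the L¹ quantization error (1/(4N)) are explicit; the 1-Lipschitz test function is an arbitrary illustrative choice.

```python
# Quantization-based cubature for X ~ U([0,1]) with the midpoint grid:
# E F(Xhat) = sum_i F(x_i) * P(X in C_i), all cell weights equal to 1/N.
N = 50
grid = [(2 * i - 1) / (2.0 * N) for i in range(1, N + 1)]
weights = [1.0 / N] * N

F = lambda x: abs(x - 0.3)                 # an arbitrary 1-Lipschitz function
cubature = sum(w * F(x) for x, w in zip(grid, weights))
exact = (0.3 ** 2 + 0.7 ** 2) / 2.0        # int_0^1 |x - 0.3| dx = 0.29
l1_err = 1.0 / (4.0 * N)                   # ||X - Xhat||_1 for this grid
# |E F(X) - E F(Xhat)| <= [F]_Lip * ||X - Xhat||_1, as in (5.8)
```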
For an introduction to the Wasserstein distance and its main properties, we refer
to [272].
Moreover, since the bounded Lipschitz continuous functions make up a characterizing family for the weak convergence of probability measures on R^d (see Theorem 12.6(ii)), one derives that, for any sequence of N-quantizers x^(N) satisfying ‖X − X̂^{x^(N)}‖_1 → 0 as N → +∞ (but a priori not optimal),
 P_{X̂^{x^(N)}} = Σ_{1≤i≤N} P( X ∈ C_i(x^(N)) ) δ_{x_i^(N)} ⟹^{(R^d)} P_X,

where ⟹^{(R^d)} denotes the weak convergence of probability measures on R^d.
In fact, still due to the Monge–Kantorovich characterization of the L¹-Wasserstein distance, we have trivially by (5.9) the more powerful result

 W_1( P_X, P_{X̂^{x^(N)}} ) −→ 0 as N → +∞.
Variants of these cubature formulas can be found in [231] or [131] for functions
F having only some local Lipschitz continuous regularity and polynomial growth.
Let x ∈ (R^d)^N be a stationary quantizer, e.g. in the sense of Definition 5.4 (Eq. (5.7)), so that E( X | X̂^x ) = X̂^x, i.e. X̂^x is a stationary quantization of X. Then Jensen's inequality yields, for every convex function F : R^d → R such that F(X) ∈ L¹(P),

 F(X̂^x) ≤ E( F(X) | X̂^x ).
 F(v) = F(u) + ⟨∇F(u) | v − u⟩ + ∫_0^1 ⟨ ∇F(tv + (1−t)u) − ∇F(u) | v − u ⟩ dt.

Using the Schwarz Inequality and the fact that ∇F is Lipschitz continuous, we obtain

 | F(v) − F(u) − ⟨∇F(u) | v − u⟩ | ≤ ∫_0^1 | ⟨ ∇F(tv + (1−t)u) − ∇F(u) | v − u ⟩ | dt
  ≤ ∫_0^1 | ∇F(tv + (1−t)u) − ∇F(u) | · |v − u| dt
  ≤ [∇F]_Lip |v − u|² ∫_0^1 t dt = ([∇F]_Lip / 2) |v − u|².
Applying this with v = X and u = X̂^x yields

 | F(X) − F(X̂^x) − ⟨ ∇F(X̂^x) | X − X̂^x ⟩ | ≤ ([∇F]_Lip / 2) |X − X̂^x|².  (5.11)
Taking conditional expectation given X̂^x yields, keeping in mind that | E( Z | X̂^x ) | ≤ E( |Z| | X̂^x ),

 | E( F(X) | X̂^x ) − F(X̂^x) − E( ⟨ ∇F(X̂^x) | X − X̂^x ⟩ | X̂^x ) | ≤ ([∇F]_Lip / 2) E( |X − X̂^x|² | X̂^x ).
Now, using that the random variable ∇F(X̂^x) is σ(X̂^x)-measurable, one has

 E( ⟨ ∇F(X̂^x) | X − X̂^x ⟩ | X̂^x ) = ⟨ ∇F(X̂^x) | E( X − X̂^x | X̂^x ) ⟩ = 0

so that

 | E( F(X) | X̂^x ) − F(X̂^x) | ≤ ([∇F]_Lip / 2) E( |X − X̂^x|² | X̂^x ).
Then, for every real exponent r ≥ 1, the conditional Jensen inequality applied to the function u ↦ u^r yields

 ‖ E( F(X) | X̂^x ) − F(X̂^x) ‖_r ≤ ([∇F]_Lip / 2) ‖X − X̂^x‖²_{2r}.  (5.12)
In particular, with r = 1,

 | E F(X) − E F(X̂^x) | ≤ ([∇F]_Lip / 2) ‖X − X̂^x‖²_2.  (5.13)
Moreover,

 sup{ | E F(X) − E F(X̂^x) | : F ∈ C¹_Lip(R^d, R), [∇F]_Lip ≤ 1 } = ½ ‖X − X̂^x‖²_2.  (5.14)

In fact, this supremum holds as a maximum, attained for the function F(ξ) = ½|ξ|² since [∇F]_Lip = 1 and

 ‖X − X̂^x‖²_2 = E|X|² − E|X̂^x|².  (5.15)
Similarly, if F is twice differentiable with a bounded Hessian D²F, then

 | E F(X) − E F(X̂^x) | ≤ ½ |||D²F||| ‖X − X̂^x‖²_2,

where |||D²F||| = sup_{x∈R^d} sup_{u : |u|=1} u* D²F(x) u.
Proof. (a) The error bound (5.13) is proved above. Equality in (5.14) holds for the function F(ξ) = ½|ξ|² since [∇F]_Lip = 1 and

 ½ E|X − X̂^x|² = ½ ( E|X|² − 2 E⟨X | X̂^x⟩ + E|X̂^x|² )
  = ½ ( E|X|² − 2 E⟨ E( X | X̂^x ) | X̂^x ⟩ + E|X̂^x|² )
  = ½ ( E|X|² − 2 E|X̂^x|² + E|X̂^x|² )
  = ½ ( E|X|² − E|X̂^x|² )
  = E F(X) − E F(X̂^x).
Variants of these cubature formulas can be found in [231] or [131] for functions
F whose gradient only has local Lipschitz continuous regularity and polynomial
growth.
Exercise. (a) Let x ∈ (R^d)^N be a quantizer and let P_x denote the set of probability distributions supported by the grid {x1, …, xN}. Show that ‖X − X̂^x‖_2 = W_2(μ_X, P_x), where μ_X denotes the distribution of X and W_2(μ_X, P_x) = inf_{ν∈P_x} W_2(μ_X, ν).
(b) Deduce that, if x is a stationary quantizer, then

 sup{ | E F(X) − E F(X̂^x) | : F ∈ C¹_Lip(R^d, R), [∇F]_Lip ≤ 1 } = ½ W_2(μ_X, P_x)².
Let X and Y be two Rd -valued random vectors defined on the same probability
space (, A, P) and let F : Rd → R be a Borel function. Assume that F(X ) ∈ L2 (P).
The natural idea to approximate E( F(X) | Y ) by quantization is to replace, mutatis mutandis, the random vectors X and Y by their quantizations X̂ and Ŷ (with respect to quantizers x and y that we drop in the notation for simplicity). The resulting approximation is then

 E( F(X) | Y ) ≈ E( F(X̂) | Ŷ ).
At this stage, a natural question is to look for a priori estimates of the resulting quadratic error in terms of the quadratic mean quantization errors ‖X − X̂‖_2 and ‖Y − Ŷ‖_2.
To this end, we need further assumptions on F. Let ϕ_F : R^d → R be a regular (Borel) version of the conditional expectation, i.e. satisfying

 E( F(X) | Y ) = ϕ_F(Y).

Usually, no closed form is available for the function ϕ_F but some regularity property can be established, or more precisely “transmitted” or “propagated” from F to ϕ_F. Thus, we may assume that both F and ϕ_F are Lipschitz continuous, with Lipschitz coefficients [F]_Lip and [ϕ_F]_Lip, respectively.
The main example of such a situation is the (homogeneous) Markovian framework, where (X_n)_{n≥0} is a homogeneous Feller Markov chain with transition kernel (P(y, dx))_{y∈R^d}, and where X = X_k and Y = X_{k−1}. Then, with the above notations, ϕ_F = PF. If we assume that the Markov kernel P preserves and propagates Lipschitz continuity, in the sense that [F]_Lip < +∞ implies [PF]_Lip < +∞, the above assumption is clearly satisfied.
Remark. The above property is the key to quantization-based numerical schemes
in Numerical probability. We refer to Chap. 11 for an application to the pricing
of American options where the above principle is extensively applied to the Euler
scheme which is a Markov process and propagates Lipschitz continuity.
We prove below a slightly more general proposition by considering the situation where the function F itself is replaced/approximated by a function G. This means that we approximate E( F(X) | Y ) by E( G(X̂) | Ŷ ).
(b) L^p-case, p ≠ 2. Assume now that F(X), G(X) ∈ L^p(P), p ∈ [1, +∞). Then ϕ_F(Y) ∈ L^p(P) and

 ‖ E( F(X) | Y ) − E( G(X̂) | Ŷ ) ‖_p ≤ ‖ F(X) − G(X̂) ‖_p + 2 [ϕ_F]_Lip ‖Y − Ŷ‖_p.  (5.18)

In particular, with G = F,

 ‖ E( F(X) | Y ) − E( F(X̂) | Ŷ ) ‖_p ≤ [F]_Lip ‖X − X̂‖_p + 2 [ϕ_F]_Lip ‖Y − Ŷ‖_p.  (5.19)
 + ‖ E( F(X) − G(X̂) | Ŷ ) ‖²_2.  (5.20)

Finally,

 ‖ E( F(X) | Y ) − E( G(X̂) | Ŷ ) ‖²_2 ≤ ‖ F(X) − G(X̂) ‖²_2 + [ϕ_F]²_Lip ‖Y − Ŷ‖²_2.  (5.21)
 ‖ ϕ_F(Y) − E( ϕ_F(Y) | Ŷ ) ‖_p ≤ ‖ ϕ_F(Y) − ϕ_F(Ŷ) ‖_p + ‖ E( ϕ_F(Y) − ϕ_F(Ŷ) | Ŷ ) ‖_p
  ≤ 2 ‖ ϕ_F(Y) − ϕ_F(Ŷ) ‖_p

since E( · | Ŷ ) is an L^p-contraction when p ∈ [1, +∞). ♦
Remark. Markov kernels (P(x, dξ))_{x∈R^d} which propagate Lipschitz continuity in the sense that

 [P]_Lip = sup{ [Pf]_Lip : f : R^d → R, [f]_Lip ≤ 1 } < +∞

are especially well-adapted to propagate quantization errors using the above error bounds, since this implies that, if F is Lipschitz continuous, then ϕ_F = PF is Lipschitz continuous too.
Exercises. 1. Detail the proof of the above L^p-error bound when p ≠ 2. [Hint: show that ‖ Z − E( ϕ(Y) | Ŷ ) ‖_p ≤ ‖ Z − ϕ(Y) ‖_p + 2 ‖ ϕ(Y) − ϕ(Ŷ) ‖_p.]
2. Prove that the Euler scheme with step T/n of a diffusion with Lipschitz continuous drift b(t, x) = b(x) and diffusion coefficient σ(t, x) = σ(x), starting at x0 ∈ R^d at time 0 (see Chap. 7 for a definition), is an R^d-valued homogeneous Markov chain with respect to the filtration of the Brownian increments whose transition propagates Lipschitz continuity.
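Exercise 2 can be probed numerically with common random numbers; the one-dimensional coefficients b(x) = −x, σ ≡ 0.4 and the step h = 0.1 below are illustrative choices.

```python
import math, random

# One Euler step x -> x + b(x) h + sigma(x) sqrt(h) Z (illustrative coefficients)
b = lambda x: -x          # Lipschitz drift
sigma = lambda x: 0.4     # constant diffusion coefficient
h = 0.1

def Pf(x, f, zs):
    """Monte Carlo estimate of the one-step transition Pf(x) using the normals zs."""
    return sum(f(x + b(x) * h + sigma(x) * math.sqrt(h) * z) for z in zs) / len(zs)

f = abs                   # a 1-Lipschitz test function
rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(10000)]    # common random numbers
ratio = abs(Pf(1.0, f, zs) - Pf(2.0, f, zs)) / 1.0
# with b(x) = -x and constant sigma, one Euler step contracts by (1 - h) pathwise,
# so the estimated Lipschitz ratio of Pf is at most 1 - h
```

Because the same normals are used at both starting points, the pathwise contraction of this particular scheme survives the Monte Carlo average exactly, not just in expectation.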
5.3.1 Dimension 1…
Though originally introduced to design weighted cubature formulas for the compu-
tation of integrals with respect to distributions in medium dimensions (from 2 to 10
or 12), the quadrature formulas derived for specific one-dimensional distributions or
random variables turn out to be quite useful. This is especially the case when some
commonly encountered random variables are not easy to simulate in spite of the exis-
tence of closed forms for their density, c.d.f. or cumulative first moment functions.
Such is the case for, among others, the (one-sided) Lévy area, the supremum of the
Brownian bridge (or Kolmogorov–Smirnov distribution) which will be investigated
in exercises further on. The main assets of optimal quantization grids and their companion weight vectors are threefold:
• Such optimal quantizers are specifically fitted to the random variable they quantize, without any auxiliary “transfer” function, unlike Gauss–Legendre points, which are naturally adapted to uniform distributions, Gauss–Laguerre points, adapted to exponential distributions, or Gauss–Hermite points, adapted to Gaussian distributions.
• The error bounds in the resulting quadrature formulas for numerical integration are tailored for functions with “natural” Lipschitz or C¹_Lip-regularity.
• Finally, in one dimension as will be seen below, fast deterministic algorithms,
based on fixed point procedures for contracting functions or Newton–Raphson
zero search algorithms, can be implemented, quickly producing optimal quantizers
with high accuracy.
Definition 5.5 Absolutely continuous distributions on the real line with a log-
concave density are called unimodal distributions. The support (in R) of such a dis-
tribution and of its density is a (closed) interval [a, b], where −∞ ≤ a ≤ b ≤ +∞.
with

 x_{i+½} = (x_{i+1} + x_i)/2, i = 1, …, N−1,  x_{½} = a,  x_{N+½} = b,

and the convention that, if a or b is infinite, they are “removed” from “their” Voronoi cell. Also keep in mind that if the density ϕ is unimodal, then supp(P_X) = supp(ϕ) = [a, b] (in R).
We will now compute the gradient and the Hessian of the quadratic distortion
function Q2,N on SNa,b .
Computation of the gradient ∇Q_{2,N}(x). Let x ∈ S^{a,b}_N. Decomposing the quadratic distortion function Q_{2,N}(x) across the Voronoi cells C_i(x) leads to the following expression for the quadratic distortion:

 Q_{2,N}(x) = Σ_{i=1}^N ∫_{x_{i−½}}^{x_{i+½}} (x_i − ξ)² ϕ(ξ) dξ.
If x = x^(N) ∈ S^{a,b}_N is an optimal quantizer at level N, hence with pairwise distinct components, then ∇Q_{2,N}(x) = 0, i.e. x is a zero of the gradient of the distortion function Q_{2,N}.
If we introduce the cumulative distribution function Φ(u) = ∫_{−∞}^u ϕ(v) dv and the cumulative partial first moment function Ψ(u) = ∫_{−∞}^u v ϕ(v) dv, the zeros of ∇Q_{2,N} are solutions to the non-linear system of equations

 x_i ( Φ(x_{i+½}) − Φ(x_{i−½}) ) = Ψ(x_{i+½}) − Ψ(x_{i−½}), i = 1, …, N,  (5.22)
which is actually a rewriting of the stationarity Eq. (5.6), since P( X ∈ C_i(x) ) = Φ(x_{i+½}) − Φ(x_{i−½}). Hence, Eq. (5.22) and its stationarity version are of course true for any absolutely continuous distribution with density ϕ (keep in mind that any minimizer of the distortion function has pairwise distinct components). However, when ϕ is not unimodal, it may happen that ∇Q_{2,N} has several zeros, i.e. several stationary quantizers, which do not all lie in argmin Q_{2,N}.
These computations are special cases of a multi-dimensional result (see Sect. 6.3.5 of the next chapter, devoted to stochastic approximation and optimization) which holds for any distribution P_X – possibly having a singular component – at N-tuples with pairwise distinct components.
Example. Thus, if X =^(d) N(0; 1), the c.d.f. Φ = Φ_0 is (tabulated and) computable (see Sect. 12.1.2) with high accuracy at a low computational cost, whereas the cumulative partial first moment function is simply given by

 Ψ_0(x) = − e^{−x²/2} / √(2π), x ∈ R.
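As a quick check of (5.22), recall that the optimal quantizer of N(0; 1) at level N = 2 is known to be (−m, m) with m = E( X | X > 0 ) = √(2/π) (a classical computation); on the cell (0, +∞) the system reduces to m( Φ_0(+∞) − Φ_0(0) ) = Ψ_0(+∞) − Ψ_0(0).

```python
import math

Phi0 = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))           # N(0,1) c.d.f.
Psi0 = lambda x: -math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)   # partial first moment

# Level N = 2: candidate optimal quantizer (-m, m) with m = sqrt(2/pi);
# check the stationarity equation (5.22) on the cell (0, +inf),
# where Phi0(+inf) = 1 and Psi0(+inf) = 0.
m = math.sqrt(2.0 / math.pi)
lhs = m * (1.0 - Phi0(0.0))
rhs = 0.0 - Psi0(0.0)
# lhs and rhs agree: m/2 = 1/sqrt(2*pi)
```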
Example of the normal distribution. Thus, for the normal distribution N(0; 1), the three functions ϕ_0, Φ_0 and Ψ_0 are explicit and can be computed at a low computational cost. Thus, for N = 1, …, 1 000, tabulations with a 10^{−14} accuracy of the optimal N-quantizers

 x^(N) = (x1^(N), …, xN^(N))

and the resulting quadratic quantization error (through its square, the quadratic distortion) given by

 ‖X − X̂^{x^(N)}‖²_2 = E X² − E (X̂^{x^(N)})² = E X² − Σ_{i=1}^N (x_i^(N))² ( Φ_0(x_{i+½}^(N)) − Φ_0(x_{i−½}^(N)) ),

using (5.15), have been carried out. The vector of local inertia

 ( ∫_{x_{i−½}^(N)}^{x_{i+½}^(N)} (ξ − x_i^(N))² ϕ(ξ) dξ )_{i=1,…,N}

is also made available.
 x_i = Λ_i(x) := ( Ψ(x_{i+½}) − Ψ(x_{i−½}) ) / ( Φ(x_{i+½}) − Φ(x_{i−½}) ), i = 1, …, N.  (5.24)

If the density ϕ is log-concave and not piecewise affine, the function Λ = (Λ_i)_{i=1,…,N} defined by the right-hand side of (5.24) from S^{a,b}_N onto S^{a,b}_N – called Lloyd's map – is contracting (see [168]) and the Lloyd method is defined as the iterative fixed-point procedure based on Λ, namely

 x^{[n+1]} = Λ( x^{[n]} ), n ≥ 0, x^{[0]} ∈ S^{a,b}_N.  (5.25)
Hence, (x[n] )n≥0 converges exponentially fast toward x(N ) . When Λ is not contracting,
no general convergence results are known, even in case of uniqueness of the optimal
N -quantizer.
In terms of implementation, the Lloyd method only involves the c.d.f. Φ and the cumulative partial first moment function Ψ of the random variable X, for which closed forms are required. The multi-dimensional version of this algorithm – which does not require such closed forms – is presented in Sect. 6.3.5 of Chap. 6. When both functions Φ and Ψ admit closed forms (as for the normal distribution, for example), the procedure can be implemented. It is usually slower than the Newton–Raphson zero search algorithm.
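For the standard normal distribution, (5.24)-(5.25) translate into a few lines of code; the increasing initial grid below is an arbitrary choice.

```python
import math

Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))            # c.d.f. of N(0,1)
Psi = lambda x: -math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)    # partial first moment

def lloyd(N, iters=20000, tol=1e-14):
    """Lloyd fixed-point iteration (5.25) for N(0,1).
    Conventions: Phi(-inf)=0, Phi(+inf)=1, Psi(+-inf)=0."""
    x = [-2.0 + 4.0 * (i + 0.5) / N for i in range(N)]   # arbitrary increasing init
    for _ in range(iters):
        mids = [(x[i] + x[i + 1]) / 2.0 for i in range(N - 1)]
        PhiM = [0.0] + [Phi(m) for m in mids] + [1.0]
        PsiM = [0.0] + [Psi(m) for m in mids] + [0.0]
        newx = [(PsiM[i + 1] - PsiM[i]) / (PhiM[i + 1] - PhiM[i]) for i in range(N)]
        if max(abs(a - b) for a, b in zip(newx, x)) < tol:
            x = newx
            break
        x = newx
    return x

x5 = lloyd(5)   # symmetric grid, middle component at 0
```

At the fixed point the stationarity system (5.22) is satisfied up to round-off; by symmetry the middle component vanishes for odd N.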
Remark. In practice, one can successfully implement these two deterministic recur-
sive procedures even if the distribution of X is not unimodal (see the exercises below).
In particular, the procedures converge when uniqueness of the stationary quantizer
holds, which is true beyond the class of unimodal distributions. Examples of unique-
ness of optimal N -quantizers for non-unimodal (one-dimensional) distributions can
be found in [96] (see Theorem 4): thus, uniqueness holds for the normalized Pareto
distributions αx−(α+1) 1[1,+∞) , α > 0, or power distributions αxα−1 1(0,1] , α ∈ (0, 1),
none of them being unimodal (in fact semi-closed forms are established for such
distributions, from which uniqueness is derived).
where x_{i0}^{(N−1)} < x̄ < x_{i0+1}^{(N−1)}.
where as usual Φ_0 denotes the c.d.f. of the standard normal distribution N(0; 1). Show that its density ϕ_X is log-convex and given by

 ϕ_X(x) = Φ_0′(√x) / √x = e^{−x/2} / √(2πx), x ∈ (0, +∞),

where Φ_0 still denotes the c.d.f. of the standard normal distribution N(0; 1).
(b) Derive and implement both the Newton–Raphson zero search algorithm and
Lloyd’s method I for log-normal distributions. Compare.
3. Quantization by Fourier. Let X be a symmetric random variable on (, A, P)
with a characteristic function χ(u) = E eı̃ uX , u ∈ R, ı̃ 2 = −1. We assume that χ ∈
L1 (R, du).
(a) Show that P_X is absolutely continuous with an even continuous density ϕ defined by ϕ(0) = (1/π) ∫_0^{+∞} χ(u) du and, for every x ∈ (0, +∞),

 ϕ(x) = (1/2π) ∫_R e^{ı̃xu} χ(u) du = (1/πx) ∫_0^{+∞} cos(u) χ(u/x) du
  = (1/πx) Σ_{k≥0} (−1)^k ∫_0^π cos(u) χ( (u+kπ)/x ) du.
Furthermore, show that χ is real-valued, even and always (strictly) positive. [Hint:
See an elementary course on Fourier transform and/or characteristic functions, e.g.
[44, 52, 155, 263] among (many) others.]
(b) Show that the c.d.f. Φ_X of X is given by Φ_X(0) = 1/2 and, for every x ∈ (0, +∞),

 Φ_X(x) = 1/2 + (1/π) ∫_0^{+∞} (sin u / u) χ(u/x) du,  Φ_X(−x) = 1 − Φ_X(x).  (5.26)
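Formula (5.26) can be sanity-checked on a distribution whose c.d.f. is known in closed form; the standard normal, with χ(u) = e^{−u²/2}, and the trapezoidal rule below are illustrative choices.

```python
import math

chi = lambda u: math.exp(-u * u / 2.0)    # characteristic function of N(0,1)

def cdf_fourier(x, umax=12.0, n=24000):
    """Phi_X(x) = 1/2 + (1/pi) * int_0^umax (sin u / u) chi(u/x) du (trapezoidal rule);
    the tail beyond umax is negligible for this chi."""
    h = umax / n
    total = 0.5 * chi(0.0)                # integrand tends to 1 * chi(0) as u -> 0
    for k in range(1, n):
        u = k * h
        total += math.sin(u) / u * chi(u / x)
    total += 0.5 * math.sin(umax) / umax * chi(umax / x)
    return 0.5 + h * total / math.pi

val = cdf_fourier(1.0)
ref = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))   # Phi_0(1), for comparison
```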
(c) Assume furthermore that X ∈ L¹(P). Prove that its cumulative partial first moment function Ψ_X is negative, even and given by Ψ_X(0) = −C and, for every x ∈ (0, +∞),

 Ψ_X(x) = −C + x( Φ_X(x) − 1/2 ) − (x/π) ∫_0^{+∞} ( (1 − cos u)/u² ) χ(u/x) du,  Ψ_X(−x) = Ψ_X(x),  (5.27)

 Φ_X(x) = 1/2 + (1/π) Σ_{k≥0} (−1)^k ∫_0^π ( sin u / (u + kπ) ) χ( (u+kπ)/x ) du, x ∈ (0, +∞).
k≥0
Show likewise that Ψ_X(x) also reads, for every x ∈ (0, +∞),

 Ψ_X(x) = −C + x( Φ_X(x) − 1/2 ) − (x/π) Σ_{k≥0} ∫_0^π ( (1 − (−1)^k cos u) / (u + kπ)² ) χ( (u+kπ)/x ) du.
(e) Propose two methods to compute a (small) database of optimal quantizers x^(N) = (x1^(N), …, xN^(N)) and their weight vectors (p1^(N), …, pN^(N)) for a symmetric random variable X satisfying the above conditions, say for levels running from N = 1 up to N_max = 50. Compare their respective efficiency. [Hint: Prior to the implementation have a look at the (recursive) splitting initialization method for scalar distributions described in the above Practitioner's corner, see also Sect. 6.3.5.]
4. Quantized one-sided Lévy area. Let W = (W¹, W²) be a 2-dimensional standard Wiener process and let

 X = ∫_0^1 W_s¹ dW_s²
denote the Lévy area associated to (W 1 , W 2 ) at time 1. We admit that the character-
istic function of X reads
 χ(u) = E e^{ı̃uX} = 1/√(cosh u), u ∈ R,
(see Formula (9.105) in Chap. 9 further on, applied here with μ = 0, and Sect. 12.11
of the Miscellany Chapter for a proof).
(a) Show that

 C := E X_+ = (1/√(2π)) E ‖B‖_{L²([0,1],dt)},
denote the supremum of the standard Brownian bridge (see Chap. 8 for more details,
see also Sect. 4.3 for the connection with uniformly distributed sequences and
(a) Show that the cumulative partial first moment function Ψ_X of X is given, for every x ≥ 0, with Φ̄_X := 1 − Φ_X, by

 Ψ_X(x) = E X 1_{{X ≤ x}} = ∫_0^x Φ̄_X(u) du − x Φ̄_X(x)
0
and implement a second method to compute the same small database. Compare their
respective efficiencies.
The first is a randomized version, in d dimensions, of the fixed-point procedure described above in the one-dimensional setting. The second is a recursive stochastic gradient approximation procedure.
These two stochastic optimization procedures are presented in more detail in
Sect. 6.3.5 of Chap. 6 devoted to Stochastic Approximation.
For N (0; Id ), a large scale optimization has been carried out (with the sup-
port of ACI Fin’Quant) based on a mixed CLVQ -Lloyd’s procedure. To be pre-
cise, grids have been computed for d = 1 up to 10 and N = 1 up to 5 000. Their
companion parameters have also been computed (still by simulation): weight, L1
quantization error, (squared) L2 -distortion, local L1 and L2 -pseudo-inertia of each
Voronoi cell. These enriched grids are available for downloading on the website www.
quantize.maths-fi.com which also contains many papers dealing with quantization
optimization.
Recent implementations of exact or approximate fast nearest neighbor search procedures have dramatically reduced the computation time in higher dimensions, not to speak of implementations on GPUs. For further details on the theoretical aspects we refer to [224] for CLVQ (mainly devoted to compactly supported distributions).
As for Lloyd’s method I, a huge literature is available (often under the name of
k-means algorithm). Beyond the seminal – but purely one-dimensional – paper by
J.C. Kieffer [168], let us cite [79, 80, 85, 238]. For more numerical experiments
with the Gaussian distributions, we refer to [231] and for more general aspects in
connection with classification and various other applications, we refer to [106].
For illustrations depicting optimal quadratic N -quantizers of the bi-variate nor-
mal distribution N (0; I2 ), we refer to Fig. 5.1 (for N = 200), Fig. 6.3 (for N = 500,
with its Voronoi diagram) and Fig. 5.2 (for the same quantizer colored with a coded
representation of the weight of the cells).
Algorithms such as CLVQ and the randomized Lloyd’s procedure developed to quan-
tize the multivariate normal N (0; Id ) distributions in an efficient and systematic way
can be successfully implemented for other multivariate distributions. Sect. 6.3.5 in
the next chapter is entirely devoted to these stochastic optimization procedures, with
some emphasis on their practical implementation as well as techniques to speed
them up.
However, to anticipate their efficiency, we propose in Fig. 5.4 a quantization of
size N = 500 of the joint law (W1 , supt∈[0,1] Wt ) of a standard Brownian motion W .
Fig. 5.4 Optimal N -quantization (N = 500) of W1 , supt∈[0,1] Wt depicted with its Voronoi tes-
sellation, W standard Brownian motion (with B. Wilbertz)
The challenge is to fight against the curse of dimensionality to increase the critical
dimension beyond which the theoretical rate of convergence of the Monte Carlo
method outperforms that of optimal quantization. Combining the above cubature
formula (5.8), (5.13) and the rate of convergence of the (optimal) quantization error,
it seems natural to deduce that the critical dimension to use quantization-based cuba-
ture formulas is d = 4 (when dealing with continuously differentiable functions),
at least when compared to Monte Carlo simulation. Several tests have been carried
out and reported in [229, 231] to refine this a priori theoretical bound. The benchmark was made of several options on a geometric index on d independent assets in a Black–Scholes model: Puts, Put spreads and the same in a smoothed version, always without any control variate. Of course, assuming uncorrelated assets is not realistic, but it is clearly more challenging as far as numerical integration is concerned. Once the dimension d and the number of points N had been chosen, we compared the resulting integration error with a one-standard-deviation confidence interval of the corresponding Monte Carlo estimator for the same number N of integration points. The latter standard deviation was computed thanks to a Monte Carlo simulation carried out using 10⁴ trials.
The results turned out to be more favorable to quantization than predicted by
theoretical bounds, mainly because we carried out our tests with rather small values
of N , whereas the curse of dimensionality is an asymptotic bound. Up to dimension 4,
the larger N is, the more quantization outperforms Monte Carlo simulation. When
the dimension d ≥ 5, quantization always outperforms Monte Carlo (in the above
sense) until a critical size Nc (d ) which decreases as d increases.
In this section, we provide a method to push these critical sizes forward, at least
for sufficiently smooth functions. Let F : ℝ^d → ℝ be a twice differentiable function with Lipschitz continuous Hessian D²F and let (X̂^(N))_{N≥1} be a sequence of optimal quadratic quantizations of X. Then
$$\mathbb E\,F(X)=\mathbb E\,F\big(\widehat X^{(N)}\big)+\frac12\,\mathbb E\Big[D^2F\big(\widehat X^{(N)}\big)\cdot\big(X-\widehat X^{(N)}\big)^{\otimes2}\Big]+O\Big(\mathbb E\,\big|X-\widehat X^{(N)}\big|^3\Big).\tag{5.29}$$
Under some assumptions which are satisfied by most usual distributions (including the normal distribution), it is proved in [131], as a special case of a more general result, that
$$\mathbb E\,\big|X-\widehat X^{(N)}\big|^3=O\big(N^{-3/d}\big)$$
and
$$\mathbb E\Big[D^2F\big(\widehat X^{(N)}\big)\cdot\big(X-\widehat X^{(N)}\big)^{\otimes2}\Big]=\frac{c_{F,X}}{N^{2/d}}+o\big(N^{-3/d}\big),\tag{5.30}$$
so that one can use a Richardson–Romberg extrapolation to compute E F(X).
Quantization-based Richardson–Romberg extrapolation. We consider two sizes
N1 and N2 (in practice one often sets N1 = N /2 and N2 = N with N even). Then
combining (5.29) with N₁ and N₂, we cancel the first-order error term and obtain
$$\mathbb E\,F(X)=\frac{N_2^{2/d}\,\mathbb E\,F\big(\widehat X^{(N_2)}\big)-N_1^{2/d}\,\mathbb E\,F\big(\widehat X^{(N_1)}\big)}{N_2^{2/d}-N_1^{2/d}}+O\bigg(\frac{1}{(N_1\wedge N_2)^{1/d}\,\big(N_2^{2/d}-N_1^{2/d}\big)}\bigg).$$
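This extrapolation can be sketched numerically in dimension one (an illustrative Python sketch, not code from the text: the quantizer is produced by Lloyd's fixed-point iteration, and the test function F(x) = x⁴ with E F(Z) = 3, the grid initialization and the iteration count are our own choices):

```python
import numpy as np
from math import erf

SQRT2PI = np.sqrt(2.0 * np.pi)

def npdf(t):
    # standard normal density (npdf(+/-inf) = 0)
    return np.exp(-np.asarray(t, dtype=float) ** 2 / 2.0) / SQRT2PI

def ncdf(t):
    # standard normal distribution function (math.erf handles +/-inf)
    return np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in np.atleast_1d(t)])

def lloyd_quantizer(n, iters=500):
    """Approximately optimal quadratic n-quantizer of N(0,1) by Lloyd's
    fixed point: each point is the centroid of its Voronoi cell."""
    x = np.linspace(-2.0, 2.0, n)
    for _ in range(iters):
        mid = np.concatenate(([-np.inf], (x[:-1] + x[1:]) / 2.0, [np.inf]))
        p = np.diff(ncdf(mid))                    # p_i = P(X in C_i(x))
        x = (npdf(mid[:-1]) - npdf(mid[1:])) / p  # Gaussian cell centroids
    mid = np.concatenate(([-np.inf], (x[:-1] + x[1:]) / 2.0, [np.inf]))
    return x, np.diff(ncdf(mid))                  # final grid and its weights

def cubature(F, n):
    x, p = lloyd_quantizer(n)
    return float(np.dot(p, F(x)))                 # E F(X-hat^(N))

def rr_extrapolation(F, n1, n2, d=1):
    # cancel the leading c_{F,X}/N^{2/d} error term of the two cubatures
    a1, a2 = n1 ** (2.0 / d), n2 ** (2.0 / d)
    return (a2 * cubature(F, n2) - a1 * cubature(F, n1)) / (a2 - a1)
```

With a smooth F the combined estimate is expected to beat each single cubature, in line with the O-term above.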
where
where
$$g(Z)=e^{-rT/2}\,\psi\Big(e^{\frac{\sigma^2}{2}\left(\frac1d-1\right)\frac T2}\,I_{\frac T2},\,K_1,K_2,r,\sigma,T/2\Big)$$
and Z = (Z_{1,T/2}, …, Z_{d,T/2}) ∼ N(0; I_d). The numerical specifications of the function g are as follows:
s₀ = 100, K₁ = 98, K₂ = 102, r = 5%, σ = 20%, T = 2.
The results are displayed in Fig. 5.5 on a log-log scale for the dimensions d = 4, 6, 8, 10.
First, we recover the theoretical rates of convergence (namely −2/d) for the error bounds. Indeed, slopes β(d) can be estimated (using a regression) for the quantization errors, and we found β(4) = −0.48, β(6) = −0.33, β(8) = −0.25 and β(10) = −0.23 (see Fig. 5.5). These rates plead for the implementation of the Richardson–Romberg extrapolation. Also note that, as already reported in [231],
[Fig. 5.5: four log-log panels; (c) d = 8: QTF g4 (slope −0.25), QTF g4 Romberg (slope −0.8), MC; (d) d = 10: QTF g4 (slope −0.23), QTF g4 Romberg (slope −1.2), MC.]
Fig. 5.5 Errors and standard deviations as functions of the number of points N on a log-log scale. The quantization error is displayed by the cross + and the Richardson–Romberg extrapolation error by the cross ×. The dashed line without crosses denotes the standard deviation of the Monte Carlo estimator. (a) d = 4, (b) d = 6, (c) d = 8, (d) d = 10 (with J. Printems)
Let X ∈ L²_{ℝ^d}(Ω, A, P) take at least N ≥ 1 values with positive probability. We assume that we have access to an N-quantizer x := (x₁, …, x_N) ∈ (ℝ^d)^N with pairwise distinct components and we denote by Proj_x one of its Borel nearest neighbor projections. Let X̂^x = Proj_x(X). We assume that we know the distribution of X̂^x, characterized by the N-tuple (x₁, …, x_N) itself and the weights (probability distribution)
$$p_i:=\mathbb P\big(\widehat X^x=x_i\big)=\mathbb P\big(X\in C_i(x)\big),\quad i=1,\dots,N,$$
where X^{(m)}, m = 1, …, M, are M independent copies of X,
$$\widehat X^{x,(m)}=\mathrm{Proj}_x\big(X^{(m)}\big)$$
and R_{N,M} is a remainder term defined by (5.31). Term (a) can be computed by quantization and term (b) can be computed by a Monte Carlo simulation. Then, it is clear that
$$\big\|R_{N,M}\big\|_2=\frac{\sigma\big(F(X)-F(\widehat X^x)\big)}{\sqrt M}\le\frac{\big\|F(X)-F(\widehat X^x)\big\|_2}{\sqrt M}\le[F]_{\mathrm{Lip}}\,\frac{\big\|X-\widehat X^x\big\|_2}{\sqrt M}$$
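The variance gain behind this hybrid decomposition can be illustrated by a small simulation sketch (illustrative Python, with an ad hoc uniform 11-point grid rather than an optimal quantizer): for a Lipschitz F, the Monte Carlo term F(X) − F(X̂^x) has a much smaller standard deviation than F(X) itself.

```python
import random

def proj(grid, u):
    """Nearest-neighbor projection on a small grid (brute force)."""
    return min(grid, key=lambda g: abs(g - u))

def sd(v):
    mu = sum(v) / len(v)
    return (sum((t - mu) ** 2 for t in v) / len(v)) ** 0.5

def hybrid_std_devs(m=50_000, seed=3):
    """Compare the std of F(X) (plain MC) with the std of F(X) - F(X-hat^x)
    (the Monte Carlo term (b) of the hybrid method), for F = |.| and X ~ N(0,1)."""
    random.seed(seed)
    grid = [i / 2.0 for i in range(-5, 6)]   # ad hoc 11-point quantizer
    xs = [random.gauss(0.0, 1.0) for _ in range(m)]
    plain = [abs(x) for x in xs]
    diff = [abs(x) - abs(proj(grid, x)) for x in xs]
    return sd(plain), sd(diff)
```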
Consequently, if F is simply a Lipschitz continuous function and if (X̂^(N))_{N≥1} is a sequence of optimal quadratic quantizations, one obtains a bound in which C_{2,δ} denotes the constant coming from the non-asymptotic version of Zador's Theorem (see Theorem 5.2(b)).
ℵ Practitioner’s corner
As concerns the practical implementation of this quantization-based variance reduction method, the main bottleneck is the nearest neighbor search needed at each step to compute X̂^{x,(m)} from X^{(m)}.
In one dimension, an (optimal) N-quantizer is usually directly obtained as a monotonic N-tuple with non-decreasing components, and the complexity of a nearest neighbor search on the real line based on a dichotomy procedure is approximately log N / log 2.
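The dichotomy search on a sorted one-dimensional quantizer can be sketched as follows (a generic illustration using Python's bisect module, not code from the text):

```python
import bisect

def nearest_neighbor_1d(grid, u):
    """Index of the nearest point of a sorted 1-D quantizer `grid`,
    found by dichotomy in about log N / log 2 comparisons."""
    j = bisect.bisect_left(grid, u)
    if j == 0:
        return 0
    if j == len(grid):
        return len(grid) - 1
    # u lies between grid[j-1] and grid[j]: pick the closer one
    return j - 1 if u - grid[j - 1] <= grid[j] - u else j
```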
Unfortunately, this one-dimensional setting is of little interest for applications. In d dimensions, there exist nearest neighbor search procedures with O(log N) complexity, once the N-quantizer has been given an appropriate tree structure (which costs O(N log N)). The most popular tree-based procedure for nearest neighbor search is undoubtedly the K-d tree (see [99]). During the last ten years, several attempts to improve it have been carried out; among them, we may mention the Principal Axis Tree algorithm (see [207]). These procedures are efficient for quantizers of large size N lying in a vector space of medium dimension (say up to 10).
An alternative way to speed up the nearest neighbor search is to restrict to product quantizers, whose Voronoi cells are hyper-parallelepipeds. In such a case, the nearest neighbor search reduces to d one-dimensional searches on the marginals, with an approximate resulting complexity of d log N / log 2.
However, this nearest neighbor search procedure undoubtedly slows down the global procedure.
The main drawback of the preceding approach is the repeated use of nearest neighbor search procedures. Using a quantization-based stratification may be a way to take advantage of quantization to reduce the variance without having to implement such time-consuming procedures. On the other hand, one important drawback of the regular stratification method as described in Sect. 3.5 is that it depends on the function F, at least as concerns the optimal choice of the allocation parameters q_i. Our aim is to
show that quantization-based stratification has a uniform efficiency among the class of Lipschitz continuous functions. The first step is the universal stratified sampling for Lipschitz continuous functions detailed in the simple proposition below, where we use the notations introduced in Sect. 3.5. Also keep in mind that for a random vector Y ∈ L²_{ℝ^d}(Ω, A, P), ‖Y‖₂ = (E|Y|²)^{1/2}, where |·| denotes the canonical Euclidean norm.
Proposition 5.4 (Universal stratification) Let X ∈ L²_{ℝ^d}(Ω, A, P) and let (A_i)_{i∈I} be a stratification of ℝ^d. For every i ∈ I, we define the local inertia of the random vector X on the stratum A_i by
$$\sigma_i^2=\mathbb E\big[\,|X-\mathbb E(X\,|\,X\in A_i)|^2\ \big|\ X\in A_i\,\big].$$
(a) Then, for every Lipschitz continuous function F : (ℝ^d, |·|) → (ℝ^d, |·|),
(c) Optimal choice of the q_i (see (3.10) for a closed form of the q_i):
$$\sup_{[F]_{\mathrm{Lip}}\le1}\Big(\sum_{i\in I}p_i\,\sigma_{F,i}\Big)^2=\Big(\sum_{i\in I}p_i\,\sigma_i\Big)^2=\Big\|X-\mathbb E\big(X\,\big|\,\sigma(\{X\in A_i\},\,i\in I)\big)\Big\|_1^2.\tag{5.35}$$
The equalities in (b) and (c) straightforwardly follow from (a). Finally, it follows from Jensen's Inequality (or the monotonicity of conditional L^p-norms) that
$$\sum_{i=1}^N p_i\,\sigma_i=\sum_{i=1}^N p_i\,\mathbb E\big[\,|X-\mathbb E(X\,|\,X\in A_i)|^2\ \big|\ X\in A_i\,\big]^{\frac12}\ \ge\ \Big\|X-\mathbb E\big(X\,\big|\,\sigma(\{X\in A_i\},\,i\in I)\big)\Big\|_1.\qquad\diamond$$
Let X ∈ L²_{ℝ^d}(Ω, A, P) take at least N values with positive probability. The starting idea is to use the Voronoi diagram of an N-quantizer x = (x₁, …, x_N) of X such that P(X ∈ C_i(x)) > 0, i = 1, …, N, to design the strata of a stratification procedure. Firstly, this amounts to setting I = {1, …, N} and
A_i = C_i(x), i ∈ I,
in the preceding. Then, for every i ∈ {1, …, N}, there exists a Borel function ϕ(x_i, ·) : [0,1]^r → ℝ^d such that
$$\varphi(x_i,U)\overset{d}{=}\mathcal L\big(X\mid\widehat X^x=x_i\big)=\frac{\mathbf 1_{C_i(x)}\,P_X(d\xi)}{\mathbb P\big(X\in C_i(x)\big)},$$
where U ∼ U([0,1]^r). Note that the dimension r is arbitrary: one may always assume that r = 1 by the fundamental theorem of simulation, but in order to obtain some closed forms for ϕ(x_i, ·), we are led to consider situations where r ≥ 2 (or even infinite when considering a Von Neumann acceptance-rejection method).
Now let (ξ, U) be a pair of independent random vectors such that ξ ∼ X̂^x and U ∼ U([0,1]^r). Then, one checks that
$$\varphi(\xi,U)\overset{d}{=}X$$
and
$$\sup_{[F]_{\mathrm{Lip}}\le1}\Big(\sum_{i\in I}p_i\,\sigma_{F,i}\Big)^2=\Big(\sum_{i\in I}p_i\,\sigma_i\Big)^2=\big\|X-\mathbb E(X\,|\,\widehat X^x)\big\|_1^2,\tag{5.37}$$
where we used the obvious fact that σ({X ∈ C_i(x)}, i ∈ I) = σ(X̂^x).
(a) It follows from (5.34) that
$$\inf_{(A_i)_{1\le i\le N}}\sum_{i\in I}p_i\,\sigma_i^2=\inf_{(A_i)_{1\le i\le N}}\Big\|X-\mathbb E\big(X\,\big|\,\sigma(\{X\in A_i\},\,i\in I)\big)\Big\|_2^2.$$
Now, it follows from Proposition 5.1(b) that, if x^(N) denotes an optimal quantizer at level N for X (which has N pairwise distinct components),
$$\big\|X-\widehat X^{x^{(N)}}\big\|_2=\min\Big\{\|X-Y\|_2,\ Y:(\Omega,\mathcal A,\mathbb P)\to\mathbb R^d,\ |Y(\Omega)|\le N\Big\}$$
and
$$\widehat X^{x^{(N)}}=\mathbb E\big(X\,\big|\,\widehat X^{x^{(N)}}\big),$$
where A_Ω := {E(X | σ({X ∈ A_i}, i ∈ I))(ω), ω ∈ Ω} has at most N elements. Now ‖dist(X, A_Ω)‖₁ = E dist(X, A_Ω) = ‖X − X̂^{A_Ω}‖₁; consequently,
$$\sum_{i\in I}p_i\,\sigma_i\ \ge\ \big\|X-\widehat X^{A_\Omega}\big\|_1\ \ge\ \min_{x\in(\mathbb R^d)^N}\big\|X-\widehat X^x\big\|_1.\qquad\diamond$$
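The resulting quantization-based stratified estimator can be sketched in one dimension (illustrative Python, not code from the text: X ~ N(0,1), the Gaussian inverse distribution function is computed by bisection, and the proportional allocation m_i ≈ p_i M is one possible choice among those of Sect. 3.5):

```python
import math, random

def ncdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ncdf_inv(v):
    lo, hi = -10.0, 10.0          # bisection; P(|Z| > 10) is negligible
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if ncdf(mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def stratified_estimate(grid, F, m_total=20_000, seed=5):
    """Stratified estimator of E F(X), X ~ N(0,1), over the Voronoi cells of
    a sorted grid, proportional allocation m_i ~ p_i * m_total. Within cell i,
    X is simulated by (5.38): F_X^{-1}(F_X(x_{i-1/2}) + p_i U)."""
    random.seed(seed)
    bounds = [-math.inf] + [(a + b) / 2.0 for a, b in zip(grid, grid[1:])] + [math.inf]
    Fb = [ncdf(b) for b in bounds]
    est = 0.0
    for i in range(len(grid)):
        p = Fb[i + 1] - Fb[i]
        mi = max(1, round(p * m_total))
        s = sum(F(ncdf_inv(Fb[i] + p * random.random())) for _ in range(mi))
        est += p * s / mi
    return est
```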
One dimension
In this case the method applies straightforwardly provided both the distribution function F_X(u) := P(X ≤ u) of X (on ℝ̄) and its right continuous (canonical) inverse on [0,1], denoted by F_X^{-1}, are computable.
We also need the additional assumption that the N-quantizer x = (x₁, …, x_N) satisfies the following continuity assumption:
P(X = x_i) = 0, i = 1, …, N.
Note that this is always the case if X has a density. Then set
$$x_{i+\frac12}=\frac{x_i+x_{i+1}}2,\quad i=1,\dots,N-1,\qquad x_{\frac12}=-\infty,\quad x_{N+\frac12}=+\infty.$$
Then elementary computations show that, with r = 1,
$$\forall\,u\in[0,1],\quad\varphi_N(x_i,u)=F_X^{-1}\Big(F_X\big(x_{i-\frac12}\big)+\big(F_X\big(x_{i+\frac12}\big)-F_X\big(x_{i-\frac12}\big)\big)\,u\Big),\quad i=1,\dots,N.\tag{5.38}$$
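For a distribution with closed-form F_X and F_X^{-1}, formula (5.38) is a one-liner. A sketch for the exponential distribution (an illustrative choice of ours, not the text's example):

```python
import math

def phi_exp(x, i, u):
    """Formula (5.38) for X ~ Exp(1): F(t) = 1 - e^{-t}, F^{-1}(v) = -log(1-v).
    Returns a sample of X conditioned on the i-th Voronoi cell of the
    sorted quantizer x, for u in [0, 1)."""
    F = lambda t: 0.0 if t <= 0 else 1.0 - math.exp(-t)
    Finv = lambda v: -math.log(1.0 - v)
    n = len(x)
    lo = 0.0 if i == 0 else F((x[i - 1] + x[i]) / 2.0)       # F(x_{i-1/2})
    hi = 1.0 if i == n - 1 else F((x[i] + x[i + 1]) / 2.0)   # F(x_{i+1/2})
    return Finv(lo + (hi - lo) * u)
```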
Higher dimensions
We consider a random vector X = (X¹, …, X^d) whose marginals X^i are independent. This may appear as a rather stringent restriction in full generality, although it is often possible to "extract" from a model an innovation with this correlation structure. At least in a Gaussian framework, such a reduction is always possible after an orthogonal diagonalization of the covariance matrix. One considers a product quantizer (see e.g. [225, 226]) defined as follows: for every ℓ ∈ {1, …, d}, let x^{(N_ℓ)} = (x₁^{(N_ℓ)}, …, x_{N_ℓ}^{(N_ℓ)}) be an N_ℓ-quantizer of the marginal X^ℓ and set N := N₁ × ⋯ × N_d. Then, define for every multi-index i := (i₁, …, i_d) ∈ I := ∏_{ℓ=1}^d {1, …, N_ℓ},
$$x_i=\big(x_{i_1}^{(N_1)},\dots,x_{i_d}^{(N_d)}\big).$$
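The product quantizer and its weights can be assembled directly (illustrative Python; the marginal grids and weights are assumed given, e.g. obtained from one-dimensional optimal quantizers — independence of the marginals is what makes the weights multiply):

```python
import itertools
import numpy as np

def product_quantizer(grids, weights):
    """Assemble the product N1*...*Nd-quantizer: points are all d-tuples of
    marginal points; since the marginals are independent, the weight of a
    tuple is the product of the marginal weights."""
    pts = np.array(list(itertools.product(*grids)))
    w = np.array([float(np.prod(c)) for c in itertools.product(*weights)])
    return pts, w
```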
6.1 Motivation
In Finance, one often faces some optimization problems or zero search problems.
The former often reduce to the latter since, at least in a convex framework, mini-
mizing a function amounts to finding a zero of its gradient. The most commonly
encountered examples are the extraction of implicit parameters (implicit volatility of an option, implicit correlations for a best-of option or in the credit markets), calibration, and the optimization of exogenous parameters for variance reduction (regression, importance sampling, etc.). All these situations share a common feature:
the involved functions all have a representation as an expectation, namely they read
h(y) = E H (y, Z) where Z is a q-dimensional random vector. The aim of this chapter
is to provide a toolbox – stochastic approximation – based on simulation to solve
these optimization or zero search problems. It can be viewed as a non-linear extension
of the Monte Carlo method.
Stochastic approximation can also be presented as a probabilistic extension of Newton–Raphson-like zero search recursive procedures of the form
$$ODE_h\equiv\dot y=-h(y).$$
$$\frac{d}{dt}L\big(y(t)\big)=\big(\nabla L(y(t))\,\big|\,\dot y(t)\big)=-(\nabla L\,|\,h)\big(y(t)\big).$$
If such a Lyapunov function does exist (which is not always the case!), the system
is said to be dissipative.
We essentially have two frameworks:
– the function L is identified a priori, it is the object of interest for optimization
purposes, and one designs a function h from L, e.g. by setting h = ∇L (or possibly
h positively proportional to ∇L, i.e. h = ρ∇L, where ρ is – at least – an everywhere
positive function).
– The function naturally entering into the problem is h and one has to search for
a Lyapunov function L (which may not exist). This usually requires a deep under-
standing of the problem from a dynamical point of view.
This duality also occurs in discrete time Stochastic Approximation Theory from
its very beginning in the early 1950s (see [169, 253]).
As concerns the constraints on the Lyapunov function, due to the specificities of
the discrete time setting, we will require some further regularity assumptions on ∇L,
typically that ∇L is Lipschitz continuous and |∇L|² ≤ C(1 + L) (essentially a quadratic growth property).
Exercises. 1. Show that if a function h : ℝ^d → ℝ^d is non-decreasing in the following sense
$$\forall\,x,y\in\mathbb R^d,\quad\big(h(y)-h(x)\,\big|\,y-x\big)\ge0,$$
$$h(y)=\mathbb E\,H(y,Z),\qquad H:\mathbb R^d\times\mathbb R^q\xrightarrow{\ \text{Borel}\ }\mathbb R^d,\qquad Z\overset{d}{=}\mu,\tag{6.2}$$
the recursive procedure
$$Y_{n+1}=Y_n-\gamma_{n+1}H(Y_n,Z_{n+1}),\quad n\ge0,\tag{6.3}$$
where (Z_n)_{n≥1} is an i.i.d. sequence of μ-distributed random vectors and Y₀, defined on the same probability space, is independent of the sequence (Z_n)_{n≥1}, also converges to a zero y* of h, under appropriate assumptions on both H and the gain sequence γ = (γ_n)_{n≥1}. We will call such a recursive procedure a Markovian stochastic algorithm or, more simply, a stochastic algorithm. There are more general classes of recursive stochastic algorithms, but this framework is sufficient for our purpose.
The preceding can be seen as the “meta-theorem” of Stochastic Approximation
since a large part of this theory is focused on making this algorithm converge toward
its “target” (a zero of h). In this framework, the Lyapunov functions mentioned above
are called upon to ensure the stability of the procedure.
A toy-example: the Strong Law of Large Numbers. As a first example, note that the sequence of empirical means (Z̄_n)_{n≥1} of an i.i.d. sequence (Z_n)_{n≥1} of integrable random variables satisfies
$$\bar Z_{n+1}=\bar Z_n-\frac{1}{n+1}\big(\bar Z_n-Z_{n+1}\big),\quad n\ge0,\qquad\bar Z_0=0,$$
the deterministic recursion y_{n+1} = y_n − (1/(n+1))(y_n − z_*) converges at a 1/n-rate to z_* (and this is clearly not the optimal choice for γ_n in this deterministic framework).
However, if we do not know the value of the mean z∗ = E Z but if we are able to
simulate μ-distributed random vectors, the first recursive stochastic procedure can be
easily implemented whereas the deterministic one cannot. The stochastic procedure
we are speaking about is simply the regular Monte Carlo method!
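The recursive form of the empirical mean is worth seeing in code: with step γ_{n+1} = 1/(n+1), the stochastic algorithm below is exactly the running Monte Carlo mean (a direct transcription, nothing more):

```python
def recursive_mean(samples):
    """Monte Carlo mean written as a stochastic algorithm with step 1/(n+1)."""
    z = 0.0
    for n, zn in enumerate(samples):
        z = z - (z - zn) / (n + 1)   # Z_{n+1} = Z_n - (Z_n - z_{n+1})/(n+1)
    return z
```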
A toy-model: extracting implicit volatility in a Black–Scholes model. A second
toy-example is the extraction of implicit volatility in a Black–Scholes model for a
vanilla Call or Put option. In practice it is carried out by a deterministic Newton
procedure (see e.g. [209]) since closed forms are available for both the premium and
the vega of the option. But let us forget about that for a while to illustrate the basic
principle of Stochastic Approximation. Let x, K, T ∈ (0, +∞), let r ∈ R and set for
every σ ∈ R,
$$X_t^{x,\sigma}=x\,e^{\left(r-\frac{\sigma^2}{2}\right)t+\sigma W_t},\quad t\ge0.$$
or, equivalently,
$$\mathbb E\,e^{-rT}\big(K-X_T^{x,\sigma}\big)_+-P_{\mathrm{market}}(x,K,r,T)=0.$$
This naturally suggests devising the following stochastic algorithm to solve this equation numerically:
$$\sigma_{n+1}=\sigma_n-\gamma_{n+1}\Big(e^{-rT}\big(K-x\,e^{(r-\frac{\sigma_n^2}{2})T+\sigma_n\sqrt T\,Z_{n+1}}\big)_+-P_{\mathrm{market}}(x,K,r,T)\Big),$$
where (Z_n)_{n≥1} is an i.i.d. sequence of N(0; 1)-distributed random variables and the step sequence γ = (γ_n)_{n≥1} is, for example, given by γ_n = c/n for some parameter c > 0. After a necessary "tuning" of the constant c (try c = (x+K)/2), one observes that
A priori, one might imagine that σn could converge toward −σimpl (which would
not be a real problem) but it a.s. never happens because this negative solution is
repulsive for the related ODEh and “noisy”. This is an important topic often referred
to in the literature as “how stochastic algorithms never fall into noisy traps” (see [51,
191, 242], for example).
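The implied-volatility toy algorithm above can be sketched as follows (illustrative Python; the "market" price is produced by the Black–Scholes closed form, and σ₀ = 0.5, c = 0.01 and the reflection σ ← |σ| are tuning safeguards of ours, not prescriptions of the text):

```python
import math, random

def bs_put(x, K, r, sigma, T):
    """Black–Scholes put premium (plays the role of the market price)."""
    Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return K * math.exp(-r * T) * Phi(-d2) - x * Phi(-d1)

def implied_vol_rm(x, K, r, T, p_market, c=0.01, n_iter=200_000, seed=12345):
    """Robbins-Monro search for the implied volatility: one simulated
    discounted payoff per step, step gamma_n = c/n. The reflection
    sigma <- |sigma| keeps the iterate positive (the mean function is
    even in sigma), a practical safeguard added here."""
    random.seed(seed)
    sigma = 0.5
    for n in range(1, n_iter + 1):
        z = random.gauss(0.0, 1.0)
        s_T = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
        H = math.exp(-r * T) * max(K - s_T, 0.0) - p_market
        sigma = abs(sigma - (c / n) * H)
    return sigma
```

In practice a deterministic Newton method on the closed-form vega is preferable, as noted above; the point here is only the stochastic approximation mechanism.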
To conclude this introductory section, let us briefly return to the case where h = ∇L and L(y) = E ℓ(y, Z), so that ∇L(y) = E H(y, Z) with H(y, z) := ∂ℓ/∂y (y, z). The function H is sometimes called a local gradient of L and the procedure (6.3) is known as a stochastic gradient procedure. When Y_n converges to some zero y* of h = ∇L at which the algorithm is "noisy enough" – say, e.g., E(H(y*, Z) H(y*, Z)^t) is a positive definite symmetric matrix – then y* is necessarily a local minimum of the potential L: y* cannot be a trap. So, if L is strictly convex and lim_{|y|→+∞} L(y) = +∞, ∇L has a single zero y*, which is simply the global minimum of L: the stochastic gradient turns out to be a minimization procedure.
However, most recursive stochastic algorithms (6.3) are not stochastic gradients
and the Lyapunov function, if it exists, is not naturally associated to the algorithm:
finding a Lyapunov function to “stabilize” the algorithm (by bounding a.s. its paths,
see the Robbins–Siegmund Lemma below) is often a difficult task which requires a
deep understanding of the related ODEh .
As concerns the rate of convergence, one must keep in mind that it is usually ruled by a CLT at a 1/√γ_n-rate, which can attain at most the √n-rate of the regular CLT. So, such a "toolbox" is clearly not competitive with a deterministic procedure, when available, but its rate should be compared to that of the Monte Carlo method (i.e. the SLLN) since their fields of application are similar: stochastic approximation is the natural extension of the Monte Carlo method to solve inverse or optimization problems related to functions having a representation as an expectation of simulatable random functions.
Recently, several contributions (see [12, 196, 199]) have drawn the attention of the quants world to stochastic approximation as a tool for variance reduction, implicitation of parameters, model calibration, risk management… It is also used in other fields of finance, such as algorithmic trading, as an on-line optimizing device for the execution of orders (see e.g. [188, 189]). We will briefly discuss several (toy-)examples of application.
Stochastic Approximation theory provides various theorems which guarantee the a.s. and/or L^p-convergence of recursive stochastic approximation algorithms as defined by (6.3). We provide below a general (multi-dimensional) preliminary result, known as the Robbins–Siegmund Lemma, from which the main convergence results will be easily derived. Note, however, that strictly speaking this result does not provide any direct a.s. convergence result.
In what follows the function H and the sequence (Zn )n≥1 are defined by (6.2) and
h is the vector field from Rd to Rd defined by h(y) = E H (y, Z1 ).
Theorem 6.1 (Robbins–Siegmund Lemma) Let h : ℝ^d → ℝ^d and let H : ℝ^d × ℝ^q → ℝ^d satisfy (6.2). Suppose that there exists a continuously differentiable function L
for some real constant C > 0, such that h satisfies the mean-reverting assumption
$$(\nabla L\,|\,h)\ge0.\tag{6.5}$$
Finally, assume that Y₀ is independent of (Z_n)_{n≥1} and E L(Y₀) < +∞.
Then, the stochastic algorithm defined by (6.3) satisfies the following five properties:
(i) ΔY_n := Y_n − Y_{n−1} → 0 P-a.s. and in L²(P) as n → +∞ (and Σ_{n≥1} |ΔY_n|² < +∞ a.s.),
(iii) L(Y_n) → L_∞ ∈ L¹(P) a.s. as n → +∞,
(iv) Σ_{n≥1} γ_n (∇L|h)(Y_{n−1}) < +∞ a.s. as an integrable random variable,
(v) the sequence M_n^γ = Σ_{k=1}^n γ_k (H(Y_{k−1}, Z_k) − h(Y_{k−1})) is an L²-bounded square integrable martingale and hence converges in L² and a.s.
Remarks and terminology. • The sequence (γn )n≥1 is called a step sequence or a
gain parameter sequence.
• If the function L satisfies Assumptions (6.4), (6.5), (6.6) and moreover
lim|y|→+∞ L(y) = +∞, then L is called a Lyapunov function of the system like in
Ordinary Differential Equation Theory.
• Note that Assumption (6.4) on L implies that ∇√(1+L) is bounded. Hence √L has at most linear growth, so that L itself has at most quadratic growth. This justifies the somewhat unexpected terminology "sub-linear growth" for Assumption (6.6).
• In spite of the standard terminology, the step sequence does not need to be decreas-
ing in Assumption (6.7).
• A careful reading of the proof below shows that the assumption Σ_{n≥1} γ_n = +∞ is not needed. However, we leave it in the statement because it is dramatically useful for any application of this Lemma since it implies, combined with (iv), that
$$\liminf_n\,(\nabla L\,|\,h)(Y_{n-1})=0.$$
The key of the proof is the following quite classical convergence theorem for non-
negative super-martingales (see [217]).
Theorem 6.2 Let (S_n)_{n≥0} be a non-negative super-martingale with respect to a filtration (F_n)_{n≥0} on a probability space (Ω, A, P) (i.e. for every n ≥ 0, S_n ∈ L¹(P) and E(S_{n+1} | F_n) ≤ S_n a.s.). Then S_n converges P-a.s. to an integrable (non-negative) random variable S_∞.
For general convergence theorems for sub-, super- and true martingales we refer
to any standard course on Probability Theory or, preferably, to [217].
where
$$\Delta M_{n+1}=H(Y_n,Z_{n+1})-h(Y_n).$$
$$\mathbb E\big(\nabla L(Y_n)\,\big|\,H(Y_n,Z_{n+1})\big)\le\frac12\Big(\mathbb E\,\big|\nabla L(Y_n)\big|^2+\mathbb E\,\big|H(Y_n,Z_{n+1})\big|^2\Big).$$
Now, Y_n being F_n-measurable and Z_{n+1} being independent of F_n,
$$\mathbb E\big(|H(Y_n,Z_{n+1})|^2\,\big|\,\mathcal F_n\big)=\Big[\mathbb E\,|H(y,Z_1)|^2\Big]_{y=Y_n}$$
so that
$$\mathbb E\,|H(Y_n,Z_{n+1})|^2=\mathbb E\Big(\mathbb E\big(|H(Y_n,Z_{n+1})|^2\,\big|\,\mathcal F_n\big)\Big)=\mathbb E\Big(\big[\mathbb E\,|H(y,Z_1)|^2\big]_{y=Y_n}\Big)\le C^2\big(1+\mathbb E\,L(Y_n)\big)$$
owing to (6.6). Combined with (6.4) and plugged into the above inequality, this yields
$$\mathbb E\big(\nabla L(Y_n)\,\big|\,H(Y_n,Z_{n+1})\big)\le C^2\big(1+\mathbb E\,L(Y_n)\big).$$
Plugging these bounds into (6.9), we derive that L(Y_{n+1}) ∈ L¹(P). Consequently E(ΔM_{n+1} | F_n) = 0. The announced inequality for E(|ΔM_{n+1}|² | F_n) holds with constant 2C², owing to (6.6) and the inequality
At this stage, we derive from the fact that ∇L(Y_n) and ΔM_{n+1} ∈ L²(P) that
$$\mathbb E\Big(\big(\nabla L(Y_n)\,\big|\,\Delta M_{n+1}\big)\,\Big|\,\mathcal F_n\Big)=\big(\nabla L(Y_n)\,\big|\,\mathbb E(\Delta M_{n+1}\,|\,\mathcal F_n)\big)=0.$$
so that
$$S_n:=\Big(L(Y_n)+\sum_{k=1}^{n}\gamma_k\,(\nabla L|h)(Y_{k-1})\Big)\prod_{k=1}^{n}\big(1+C_L\gamma_k^2\big)^{-1}$$
is a non-negative super-martingale; hence S_n → S_∞ a.s. and
$$L(Y_n)+\sum_{k=0}^{n-1}\gamma_{k+1}\,(\nabla L|h)(Y_k)\ \xrightarrow{\ a.s.\ }\ \widetilde S_\infty=S_\infty\prod_{n\ge1}\big(1+C_L\gamma_n^2\big)\ \in L^1(\mathbb P).\tag{6.10}$$
(ii) The super-martingale (S_n)_{n≥0} being L¹(P)-bounded by E S₀ = E L(Y₀) < +∞, one derives likewise that (L(Y_n))_{n≥0} is L¹-bounded since
$$L(Y_n)\le\prod_{k=1}^n\big(1+C_L\gamma_k^2\big)\,S_n,\quad n\ge0,$$
and ∏_{k≥1}(1 + C_L γ_k²) < +∞ owing to the decreasing step assumption (6.7) made on (γ_n)_{n≥1}.
(iv) Now, for the same reason, the series Σ_{0≤k≤n−1} γ_{k+1} (∇L|h)(Y_k) (with non-negative terms) satisfies, for every n ≥ 1,
$$\mathbb E\sum_{k=0}^{n-1}\gamma_{k+1}(\nabla L|h)(Y_k)\le\prod_{k=1}^{n}\big(1+C_L\gamma_k^2\big)\,\mathbb E\,S_0,$$
so that, by the Beppo Levi monotone convergence Theorem for series with non-negative terms,
$$\mathbb E\sum_{n\ge0}\gamma_{n+1}(\nabla L|h)(Y_n)<+\infty$$
so that, in particular,
$$\sum_{n\ge0}\gamma_{n+1}(\nabla L|h)(Y_n)<+\infty\quad\mathbb P\text{-a.s.}$$
so that E|ΔY_n|² → 0 and Σ_{n≥1} |ΔY_n|² < +∞ a.s., which in turn yields ΔY_n = Y_n − Y_{n−1} → 0 a.s.
(v) We have M_n^γ = Σ_{k=1}^n γ_k ΔM_k, so that M^γ is clearly an (F_n)-martingale. Moreover,
$$\langle M^\gamma\rangle_n=\sum_{k=1}^n\gamma_k^2\,\mathbb E\big(|\Delta M_k|^2\,\big|\,\mathcal F_{k-1}\big)\le\sum_{k=1}^n\gamma_k^2\,\mathbb E\big(|H(Y_{k-1},Z_k)|^2\,\big|\,\mathcal F_{k-1}\big)=\sum_{k=1}^n\gamma_k^2\,\big[\mathbb E\,|H(y,Z_1)|^2\big]_{y=Y_{k-1}},$$
where we used in the last line that Z_k is independent of F_{k−1}. Consequently, owing to (6.6) and (ii), one has
$$\mathbb E\,\langle M^\gamma\rangle_\infty<+\infty,$$
which in turn implies by Theorem 12.7 that M_n^γ converges a.s. and in L². ♦
Corollary 6.1 (a) Robbins–Monro algorithm. Assume that the mean function h of the algorithm is continuous and satisfies
$$\forall\,y\in\mathbb R^d,\ y\ne y_*,\quad\big(y-y_*\,\big|\,h(y)\big)>0.\tag{6.11}$$
Finally, assume that the step sequence (γ_n)_{n≥1} satisfies (6.7). Then
$$\{h=0\}=\{y_*\}\quad\text{and}\quad Y_n\xrightarrow{\ a.s.\ }y_*.$$
The convergence also holds in every L^p(P), p ∈ (0, 2) (and (|Y_n − y_*|)_{n≥0} is L²-bounded).
(b) Stochastic gradient (h = ∇L). Let L : ℝ^d → ℝ₊ be a differentiable function satisfying (6.4), lim_{|y|→+∞} L(y) = +∞ and {∇L = 0} = {y_*}. Assume the mean function of the algorithm is given by h = ∇L, that the function H satisfies E|H(y, Z)|² ≤ C(1 + L(y)) and that L(Y₀) ∈ L¹(P). Assume that the step sequence (γ_n)_{n≥1} satisfies (6.7). Then
$$L(y_*)=\min_{\mathbb R^d}L\quad\text{and}\quad Y_n\xrightarrow{\ a.s.\ }y_*\ \text{ as }n\to+\infty.$$
Moreover, ∇L(Y_n) converges to 0 in every L^p(P), p ∈ (0, 2) (and (L(Y_n))_{n≥0} is L¹-bounded so that (∇L(Y_n))_{n≥0} is L²-bounded).
Proof. (a) Assumption (6.11) implies that the mean-reverting assumption (6.5) is satisfied by the quadratic Lyapunov function L(y) = |y − y_*|² (which clearly satisfies (6.4)). The assumption on H is clearly the linear growth Assumption (6.6) for this function L. Consequently, it follows from the above Robbins–Siegmund Lemma that
$$|Y_n-y_*|^2\longrightarrow L_\infty\in L^1(\mathbb P)\quad\text{and}\quad\sum_{n\ge1}\gamma_n\big(h(Y_{n-1})\,\big|\,Y_{n-1}-y_*\big)<+\infty\ \ \mathbb P\text{-a.s.}$$
If liminf_n (Y_{n−1}(ω) − y_* | h(Y_{n−1}(ω))) > 0, the convergence of the above series would contradict the fact that Σ_{n≥1} γ_n = +∞. Let (φ(n, ω))_{n≥1} be a subsequence such that
$$\big(Y_{\varphi(n,\omega)}(\omega)-y_*\,\big|\,h(Y_{\varphi(n,\omega)}(\omega))\big)\longrightarrow0\ \text{ as }n\to+\infty.$$
Finally, for every p ∈ (0, 2), (|Y_n − y_*|^p)_{n≥0} is L^{2/p}(P)-bounded, hence uniformly integrable. As a consequence the a.s. convergence holds in L¹, i.e. Y_n → y_* in L^p(P).
It is clear from (6.11) that {h = 0} ⊂ {y_*}. On the other hand, if y = y_* + εu, |u| = 1, ε > 0, one has ε(u | h(y_* + εu)) > 0. Letting ε → 0 implies that (u | h(y_*)) ≥ 0 for every unitary vector u since h is continuous, which in turn implies (switching from u to −u) that (u | h(y_*)) = 0. Hence h(y_*) = 0 (otherwise u = h(y_*)/|h(y_*)| yields a contradiction).
(b) One may apply the Robbins–Siegmund Lemma with L as a Lyapunov function since (h | ∇L) = |∇L|² ≥ 0. The assumption on H is just the linear growth assumption (6.6). As a consequence,
$$L(Y_n)\longrightarrow L_\infty\in L^1(\mathbb P)\quad\text{and}\quad\sum_{n\ge1}\gamma_n\,\big|\nabla L(Y_{n-1})\big|^2<+\infty\ \ \mathbb P\text{-a.s.},$$
and (L(Y_n))_{n≥0} is L¹(P)-bounded. Let ω ∈ Ω be such that L(Y_n(ω)) converges in ℝ₊,
$$\sum_{n\ge1}\gamma_n\,\big|\nabla L(Y_{n-1}(\omega))\big|^2<+\infty\quad\text{and}\quad Y_n(\omega)-Y_{n-1}(\omega)\to0.$$
From the convergence of L(Y_n(ω)) toward L_∞(ω) and lim_{|y|→+∞} L(y) = +∞, one derives that (Y_n(ω))_{n≥0} is bounded. Then there exists a subsequence (φ(n, ω))_{n≥1} such that
$$Y_{\varphi(n,\omega)}(\omega)\to y,\qquad\nabla L\big(Y_{\varphi(n,\omega)}(\omega)\big)\to0\quad\text{and}\quad L\big(Y_{\varphi(n,\omega)}(\omega)\big)\to L_\infty(\omega)$$
as n → +∞.
Then ∇L(y) = 0, which implies y = y_* and L_∞(ω) = L(y_*). Since L is non-negative, differentiable and goes to infinity at infinity, it attains its unique global minimum at y_*. In particular, {L = L(y_*)} = {∇L = 0} = {y_*}. Consequently, the only possible limiting value of the bounded sequence (Y_n(ω))_{n≥1} is y_* since L(Y_n) → L(y_*), i.e. Y_n(ω) converges toward y_*.
The Lp (P)-convergence to 0 of |∇L(Yn )|, p ∈ (0, 2), follows by the same uniform
integrability argument as in (a). ♦
Exercises. 1. Show that Claim (a) remains true if one only assumes that
a sequence of positive real numbers satisfying the decreasing step Assumption (6.7).
Show that the recursive procedure defined by
The above settings are in fact some special cases of a more general result, the
so-called “pseudo-gradient setting”, stated below. However its proof, in particular in
a multi-dimensional setting, needs additional arguments, mainly the so-called ODE
method (for Ordinary Differential Equation method) originally introduced by Ljung
(see [200]). The underlying idea is to think of a stochastic algorithm as a perturbed
Euler scheme with a decreasing step of the ODE ẏ = −h(y). For an introduction to
the ODE method, see Sect. 6.4.1; we also refer to classical textbooks on Stochastic
Approximation like [39, 81, 180].
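The "perturbed Euler scheme with decreasing step" reading of a stochastic algorithm can be seen on a one-line example (illustrative Python; the choices h(y) = y − 1 and Gaussian innovations are ours):

```python
import random

def sa_path(c=1.0, n_iter=2_000, y0=5.0, seed=1):
    """Stochastic algorithm for h(y) = y - 1 with H(y, z) = y - 1 + z:
    a noisy Euler scheme, with decreasing step c/n, of the ODE y' = -(y - 1)."""
    random.seed(seed)
    y = y0
    for n in range(1, n_iter + 1):
        y -= (c / n) * (y - 1.0 + random.gauss(0.0, 1.0))
    return y
```

The iterate tracks the flow of the ODE toward its equilibrium y = 1, the noise being averaged out by the decreasing step.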
Theorem 6.3 (Pseudo-Stochastic Gradient) Assume that L and h and the step
sequence (γn )n≥1 satisfy all the assumptions of the Robbins–Siegmund Lemma.
Assume furthermore that
Then, P(dω)-a.s., there exists an ℓ_∞ = ℓ_∞(ω) ≥ 0 and a connected component C_∞(ω) of {(∇L|h) = 0} ∩ {L = ℓ_∞} such that
$$\mathrm{dist}\big(Y_n(\omega),\,\mathcal C_\infty(\omega)\big)\longrightarrow0\ \text{ as }n\to+\infty.$$
Proof (One-dimensional case). We consider an ω ∈ Ω for which all the "a.s. conclusions" of the Robbins–Siegmund Lemma hold. Combining Y_n(ω) − Y_{n−1}(ω) → 0 with the boundedness of the sequence (Y_n(ω))_{n≥0}, one can show that the set Y_∞(ω) of the limiting values of (Y_n(ω))_{n≥0} is a connected compact set (²).
On the other hand, Y_∞(ω) ⊂ {L = L_∞(ω)} since L(Y_n(ω)) → L_∞(ω). Furthermore, reasoning as in the proof of claim (b) of the above corollary shows that there exists a limiting value y_* ∈ Y_∞(ω) such that (∇L(y_*) | h(y_*)) = 0, so that y_* ∈ {(∇L|h) = 0} ∩ {L = L_∞(ω)}.
At this stage, we assume that d = 1. Either Y_∞(ω) = {y_*} and the proof is complete, or Y_∞(ω) is a non-trivial compact interval (as a compact connected subset of ℝ). In the latter case the function L is constant on this interval; consequently its derivative L′ is zero on Y_∞(ω), so that Y_∞(ω) ⊂ {(∇L|h) = 0} ∩ {L = L(y_*)}. Hence the conclusion.
When {(∇L|h) = 0} ∩ {L = ℓ} is locally finite, the conclusion is obvious since its connected components are reduced to single points. ♦
This section was originally motivated by the seminal paper [12]. Finally, we followed the strategy developed in [199], which provides, in our mind, an easier-to-implement procedure. Assume we want to compute the expectation
$$\mathbb E\,\varphi(Z)=\int_{\mathbb R^d}\varphi(z)\,e^{-\frac{|z|^2}{2}}\,\frac{dz}{(2\pi)^{\frac d2}}\tag{6.12}$$
where the function φ satisfies P(φ(Z) ≠ 0) > 0.
with A a lower triangular matrix such that the covariance matrix R = AA∗ has diagonal
entries equal to 1 and φ a non-negative, continuous if necessary, payoff function. The
dimension d corresponds to the number of underlying risky assets.
(b) Monte Carlo simulations of functionals of the Euler scheme of a diffusion (or of the Milstein scheme) appear as integrals with respect to a multivariate Gaussian vector.
Then the dimension d can be huge since it corresponds to the product of the number
of time steps by the number of independent Brownian motions driving the dynamics
of the SDE.
Variance reduction by mean translation: first approach (see [12]).
A change of variable z = ζ + θ, for a fixed θ ∈ ℝ^d, leads to
$$\mathbb E\,\varphi(Z)=e^{-\frac{|\theta|^2}{2}}\,\mathbb E\big[\varphi(Z+\theta)\,e^{-(\theta|Z)}\big].\tag{6.13}$$
since Var(e^{−|θ|²/2} φ(Z+θ) e^{−(θ|Z)}) = L(θ) − (E φ(Z))².
A reverse change of variable shows that
$$L(\theta)=e^{\frac{|\theta|^2}{2}}\,\mathbb E\big[\varphi^2(Z)\,e^{-(\theta|Z)}\big].\tag{6.15}$$
Hence, if E[φ²(Z)|Z| e^{a|Z|}] < +∞ for every a ∈ (0, +∞), one can always differentiate the function L on ℝ^d owing to Theorem 2.2(b), with
$$\nabla L(\theta)=e^{\frac{|\theta|^2}{2}}\,\mathbb E\big[\varphi^2(Z)\,e^{-(\theta|Z)}(\theta-Z)\big],\quad\theta\in\mathbb R^d.\tag{6.16}$$
for every z ∈ ℝ^d and P(φ(Z) > 0) > 0. Furthermore, Fatou's Lemma implies
$$\lim_{|\theta|\to+\infty}L(\theta)=+\infty.$$
Consequently, L attains a unique global minimum θ_*, which is also its unique local minimum, whence {∇L = 0} = {θ_*}.
We now prove the classical lemma which shows that if L is strictly convex then
θ → |θ − θ∗ |2 is mean-reverting for ∇L (strictly, in the strengthened sense (6.11) of
the Robbins–Monro framework).
Proof. (a) One introduces the differentiable function g defined on the unit interval by
The conclusion follows by noting that g′(1) − g′(0) ≥ inf_{s∈[0,1]} g″(s). Moreover, noting that g(1) ≥ g(0) + g′(0) + ½ inf_{s∈[0,1]} g″(s) yields inequality (ii). Finally, setting θ′ = 0 yields
$$L(\theta)\ge L(0)+\frac\alpha2\,|\theta|^2-|\nabla L(0)|\,|\theta|\longrightarrow+\infty\ \text{ as }|\theta|\to\infty.\qquad\diamond$$
This suggests (as noted in [12]) considering the quadratic function V defined by V(θ) := |θ − θ_*|² as a Lyapunov function instead of L defined in (6.15). Indeed, L is usually not essentially quadratic: as soon as φ(z) ≥ ε₀ > 0, it is obvious that L(θ) ≥ ε₀² e^{|θ|²}, and this exponential growth is also observed when φ is bounded away from zero outside a ball. Hence ∇L cannot be Lipschitz continuous either and, consequently, L cannot be used as a Lyapunov function.
However, if one uses the representation of ∇L as an expectation derived from (6.17) by pathwise differentiation in order to design a stochastic gradient algorithm, namely considers the local gradient
$$H(\theta,z):=\varphi^2(z)\,e^{-\frac{|z|^2}{2}}\,\frac{\partial}{\partial\theta}\,e^{\frac{|z-\theta|^2}{2}},$$
a major difficulty remains: the convergence results in Corollary 6.1 do not apply, mainly because the linear growth assumption (6.6) in quadratic mean is not fulfilled by such a choice of H. In fact, this "naive" procedure explodes at almost every implementation, as pointed out in [12]. This led the author of [12] to introduce some variants of the algorithm involving repeated re-initializations – the so-called projections "à la Chen" – to force the stabilization of the algorithm and subsequently prevent explosion. The choice we make in the next section is different (and still other approaches are possible to circumvent this problem, see e.g. [161]).
$$\nabla L(\theta)=e^{|\theta|^2}\int_{\mathbb R^d}\varphi^2(\zeta-\theta)\,(2\theta-\zeta)\,e^{-\frac{|\zeta|^2}{2}}\,\frac{d\zeta}{(2\pi)^{d/2}}\qquad(z:=\zeta-\theta)$$
$$\phantom{\nabla L(\theta)}=e^{|\theta|^2}\,\mathbb E\big[\varphi^2(Z-\theta)\,(2\theta-Z)\big],$$
so that
$$\nabla L(\theta)=0\iff\mathbb E\big[\varphi^2(Z-\theta)\,(2\theta-Z)\big]=0.$$
This suggests working with the function inside the expectation (though not exactly a local gradient), up to an explicit appropriate multiplicative factor depending on θ, in order to satisfy the linear growth assumption for the L²-norm in θ.
From now on, we assume that there exist two real constants a ≥ 0 and C > 0 such that
$$0\le\varphi(z)\le C\,e^{\frac a2|z|},\quad z\in\mathbb R^d.\tag{6.18}$$
Exercise. (a) Show that under this assumption, E[ϕ²(Z)|Z| e^{|θ||Z|}] < +∞ for every θ ∈ R^d, which implies that (6.16) holds true.
(b) Show that in fact E[ϕ²(Z)|Z|^m e^{|θ||Z|}] < +∞ for every θ ∈ R^d and every m ≥ 1, which in turn implies that L is C^∞. In particular, show that for every θ ∈ R^d,

D²L(θ) = e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)} (I_d + (θ − Z)(θ − Z)^t)]

(throughout this chapter, we adopt the notation ^t for transposition). Derive that

D²L(θ) ≥ e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)}] I_d > 0

(in the sense of positive definite symmetric matrices), which proves again that L is strictly convex.
Taking Assumption (6.18) into account, we set
192 6 Stochastic Approximation with Applications to Finance
H_a(θ, z) = e^{−a(|θ|²+1)^{1/2}} ϕ²(z − θ)(2θ − z).    (6.19)
Then

E|H_a(θ, Z)|² ≤ 2C⁴ (4|θ|² E[e^{2a|Z|}] + E[e^{2a|Z|}|Z|²]) ≤ C′(1 + |θ|²)

since |Z| has a Laplace transform defined on the whole real line, which in turn implies E[e^{a|Z|}|Z|^r] < +∞ for every a, r > 0.
On the other hand, the resulting mean function h_a reads

h_a(θ) = e^{−a(|θ|²+1)^{1/2}} E[ϕ²(Z − θ)(2θ − Z)]

or, equivalently,

h_a(θ) = e^{−a(|θ|²+1)^{1/2} − |θ|²} ∇L(θ),    (6.20)

so that h_a is continuous, (θ − θ* | h_a(θ)) > 0 for every θ ≠ θ* and {h_a = 0} = {θ*}.
Applying Corollary 6.1(a) (the Robbins–Monro Theorem), one derives that for any step sequence γ = (γ_n)_{n≥1} satisfying (6.7), the sequence (θ_n)_{n≥0} recursively defined by

θ_{n+1} = θ_n − γ_{n+1} H_a(θ_n, Z_{n+1}),  n ≥ 0,    (6.21)

where (Z_n)_{n≥1} is an i.i.d. sequence with distribution N(0; I_d) defined on a probability space (Ω, A, P), independent of the R^d-valued random vector θ₀ ∈ L²_{R^d}(Ω, A, P), satisfies

θ_n → θ*  a.s. as n → +∞.
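To make the recursion concrete, here is a minimal one-dimensional Python sketch of procedure (6.21) with H_a from (6.19). It is only an illustration, not the book's code: the payoff ϕ, the constants a and c, the iteration count and the seed are hypothetical choices. A bounded even payoff (so that a = 0 is admissible in (6.18)) has optimal shift θ* = 0 by symmetry, which gives a simple sanity check.

```python
import math
import random

def H_a(theta, z, phi, a):
    # Local update term (6.19): e^{-a (theta^2+1)^(1/2)} phi^2(z - theta) (2 theta - z)
    return math.exp(-a * math.sqrt(theta * theta + 1.0)) * phi(z - theta) ** 2 * (2.0 * theta - z)

def robbins_monro_shift(phi, a=0.0, n_iter=20_000, c=2.0, theta0=0.0, seed=42):
    """Recursion (6.21): theta_{n+1} = theta_n - gamma_{n+1} H_a(theta_n, Z_{n+1}),
    with step sequence gamma_n = c/n satisfying the decreasing step assumption (6.7)."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        theta -= (c / n) * H_a(theta, rng.gauss(0.0, 1.0), phi, a)
    return theta

# Hypothetical bounded even payoff: by symmetry the optimal variance
# reducer is theta* = 0, which the iterates should approach.
theta = robbins_monro_shift(lambda z: 1.0 / (1.0 + z * z))
```

For an unbounded payoff one would take a > 0 in (6.18); the normalizing factor in (6.19) then tames the quadratic-mean growth of the update, which is exactly what the naive local gradient lacks.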
Remarks. • The reason for introducing (|θ|² + 1)^{1/2} is that this function is explicit, behaves like |θ| at infinity and is everywhere differentiable, which simplifies the discussion of the rate of convergence detailed further on.
• Note that no regularity assumption is made on the payoff ϕ.
• An alternative approach based on a large deviation principle but which needs some
regularity assumption on the payoff ϕ is developed in [116]. See also [243].
• To prevent a possible "freezing" of the procedure, for example when the step sequence has been misspecified or when the payoff function is too anisotropic, one can replace the above procedure (6.21) by the following fully data-driven variant of the algorithm, where

H_a(θ, z) := [ϕ²(z − θ)/(1 + ϕ²(−θ))] (2θ − z).
α > 1/(2 Re(λ_{a,min})) > 0,

where λ_{a,min} is the eigenvalue of Dh_a(θ*) with the lowest real part. Moreover, one shows that the theoretical best choice for α is α_opt := 1/Re(λ_{a,min}). The asymptotic variance is made explicit, once again, in Sect. 6.4.3. Let us focus now on Dh_a(θ*).
Starting from the expression (6.20),

h_a(θ) = e^{−a(|θ|²+1)^{1/2} − |θ|²} ∇L(θ)
       = e^{−a(|θ|²+1)^{1/2} − |θ|²/2} × E[ϕ²(Z)(θ − Z) e^{−(θ|Z)}]    by (6.16)
       = g_a(θ) × E[ϕ²(Z)(θ − Z) e^{−(θ|Z)}],

where g_a(θ) := e^{−a(|θ|²+1)^{1/2} − |θ|²/2}.
Then

Dh_a(θ) = g_a(θ) E[ϕ²(Z) e^{−(θ|Z)} (I_d + ZZ^t − θZ^t)] + e^{−|θ|²/2} ∇L(θ) ⊗ ∇g_a(θ).

Hence Dh_a(θ*) is a positive definite symmetric matrix and its lowest eigenvalue λ_{a,min} satisfies

λ_{a,min} ≥ g_a(θ*) E[ϕ²(Z) e^{−(θ*|Z)}] > 0.
These computations show that if the behavior of the payoff ϕ at infinity is mis-evaluated, this leads to a bad calibration of the algorithm. Indeed, if one considers two real numbers a, a′ satisfying (6.18) with 0 < a < a′, then one checks with obvious notations that

1/(2λ_{a,min}) = [g_{a′}(θ*)/g_a(θ*)] · 1/(2λ_{a′,min}) = e^{(a−a′)(|θ*|²+1)^{1/2}} · 1/(2λ_{a′,min}) < 1/(2λ_{a′,min}).
θ̄_n = (θ₀ + · · · + θ_{n−1})/n,  n ≥ 1.

Then (θ̄_n)_{n≥1} satisfies a Central Limit Theorem at rate √n with an asymptotic variance corresponding to the optimal asymptotic variance obtained for the original algorithm (θ_n)_{n≥0} with the theoretically optimal step sequences γ_n = α_opt/(β + n), n ≥ 1.
The choice of the parameter β, either for the original algorithm or for its averaged version, does not rest on theoretical grounds. A heuristic rule is to choose it so that γ_n does not decrease too fast, to avoid the procedure being "frozen" far from its target.
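The averaging principle can be sketched as follows on a deliberately elementary problem (mean estimation, H(θ, x) = θ − x, so that θ* = E X); the constants α, β, the sample distribution and the seed are hypothetical choices, not taken from the book.

```python
import random

def averaged_rm(n_iter=100_000, alpha=2.0, beta=50.0, seed=11):
    """Run theta_{n+1} = theta_n - gamma_{n+1}(theta_n - X_{n+1}) with
    gamma_n = alpha/(beta + n) and return the Ruppert-Polyak average
    bar_theta = (theta_0 + ... + theta_{n-1})/n, which attains the optimal
    asymptotic variance without fine-tuning the step constant."""
    rng = random.Random(seed)
    theta, avg = 0.0, 0.0
    for n in range(n_iter):
        avg += (theta - avg) / (n + 1)               # running average of past iterates
        x = rng.gauss(1.0, 2.0)                      # hypothetical target: E X = 1
        theta -= (alpha / (beta + n + 1)) * (theta - x)
    return avg

avg = averaged_rm()
```

Note the order of the two updates: the average is formed from θ₀, …, θ_{n−1} before the innovation X_{n+1} is consumed, mirroring the definition of θ̄_n above.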
Adaptive implementation into the computation of E ϕ(Z)
At this stage, like for the variance reduction by regression, we may follow two
strategies – batch or adaptive – to reduce the variance.
The batch strategy. This is the simplest and most elementary strategy.
Phase 1: One first computes a hopefully good approximation of the optimal variance reducer, which we will denote by θ_{n₀} for a large enough n₀ that will remain fixed during the second phase devoted to the computation of E ϕ(Z). It is assumed that an i.i.d. sample (Z_m)_{1≤m≤n₀} of N(0; I_d)-distributed random vectors has been simulated for this purpose.
E ϕ(Z) = lim_{M→+∞} (1/M) Σ_{m=1}^{M} ϕ(Z_{m+n₀} + θ_{n₀}) e^{−(θ_{n₀}|Z_{m+n₀}) − |θ_{n₀}|²/2},

where (Z_m)_{m≥n₀+1} is an i.i.d. sequence of N(0; I_d)-distributed random vectors independent of (Z_m)_{1≤m≤n₀}. This procedure satisfies a CLT with (conditional) variance L(θ_{n₀}) − (E ϕ(Z))² (given θ_{n₀}).
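A hedged Python sketch of this two-phase batch strategy in dimension one follows; the payoff, the step constant and the sample sizes are hypothetical choices used only for illustration.

```python
import math
import random

def batch_is_estimator(phi, a=0.0, n0=5_000, M=20_000, c=2.0, seed=0):
    """Two-phase ("batch") estimator of E[phi(Z)], Z ~ N(0,1).
    Phase 1: recursion (6.21) produces a frozen shift theta_{n0}.
    Phase 2: plain importance-sampled Monte Carlo with that fixed shift."""
    rng = random.Random(seed)
    # Phase 1: stochastic search for the variance reducer theta*.
    theta = 0.0
    for n in range(1, n0 + 1):
        z = rng.gauss(0.0, 1.0)
        h = math.exp(-a * math.sqrt(theta * theta + 1.0)) * phi(z - theta) ** 2 * (2.0 * theta - z)
        theta -= (c / n) * h
    # Phase 2: E[phi(Z)] = E[phi(Z + theta) e^{-theta Z - theta^2/2}] with an
    # independent sample; theta stays frozen, so this is a standard Monte Carlo.
    acc = 0.0
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        acc += phi(z + theta) * math.exp(-theta * z - 0.5 * theta * theta)
    return acc / M

# Hypothetical bounded payoff (so a = 0 is admissible in (6.18)).
est = batch_is_estimator(lambda z: 1.0 / (1.0 + z * z))
```

Whatever the value of the frozen shift, the phase-2 estimator remains unbiased; only its variance depends on the quality of θ_{n₀}.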
The adaptive strategy. This approach, introduced in [12], is similar to the adaptive
variance reduction by regression presented in Sect. 3.2. The aim is to devise a pro-
cedure fully based on the simultaneous computation of the optimal variance reducer
and E ϕ(Z) from the same sequence (Zn )n≥1 of i.i.d. N (0; Id )-distributed random
vectors used in (6.21). To be precise, this leads us to devise the following adaptive
estimator of E ϕ(Z):
(1/M) Σ_{m=1}^{M} ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2},  M ≥ 1.    (6.23)
As a first consequence, the estimator defined by (6.23) is unbiased. Now let us define
the (FM )-martingale
N_M := Σ_{m=1}^{M} (1/m) [ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} − E ϕ(Z)] 1_{{|θ_{m−1}| ≤ m}},  M ≥ 1.
It is clear that (N_M)_{M≥1} has square integrable increments, so that N_M ∈ L²(P) for every M ∈ N* and

E[(ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2})² | F_{m−1}] 1_{{|θ_{m−1}| ≤ m}} = L(θ_{m−1}) 1_{{|θ_{m−1}| ≤ m}} → L(θ*)  a.s.
(1/M) Σ_{m=1}^{M} [ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} − E ϕ(Z)] 1_{{|θ_{m−1}| ≤ m}} → 0  a.s.

Since θ_n a.s. converges, 1_{{|θ_{m−1}| ≤ m}} = 1 for every large enough m, so that finally

(1/M) Σ_{m=1}^{M} ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} → E ϕ(Z)  a.s. as M → +∞.
One can show, using the CLT for triangular arrays of martingale increments (see [142] and Chap. 12, Theorem 12.8), that

√M [ (1/M) Σ_{m=1}^{M} ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} − E ϕ(Z) ] →^{L} N(0, σ*²),

where σ*² = L(θ*) − (E ϕ(Z))² is the minimal variance.
As stated, this second approach seems to perform better owing to its minimal asymptotic variance. For practical use, the verdict is more balanced and the batch approach turns out to be quite satisfactory.
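For comparison with the batch variant, here is a hedged sketch of the adaptive estimator (6.23) in dimension one (again with a hypothetical bounded payoff and hypothetical tuning constants). The shift θ_{m−1} used in the m-th payoff term is measurable with respect to the past, which is what keeps the estimator unbiased.

```python
import math
import random

def adaptive_is_estimator(phi, a=0.0, M=50_000, c=2.0, seed=1):
    """Adaptive estimator (6.23): theta is updated by recursion (6.21) from the
    very same innovations (Z_m) that feed the importance-sampled average."""
    rng = random.Random(seed)
    theta, acc = 0.0, 0.0
    for m in range(1, M + 1):
        z = rng.gauss(0.0, 1.0)
        # payoff term uses theta_{m-1} (frozen w.r.t. the current innovation)...
        acc += phi(z + theta) * math.exp(-theta * z - 0.5 * theta * theta)
        # ...then theta is updated with the same innovation z
        h = math.exp(-a * math.sqrt(theta * theta + 1.0)) * phi(z - theta) ** 2 * (2.0 * theta - z)
        theta -= (c / m) * h
    return acc / M

est = adaptive_is_estimator(lambda z: 1.0 / (1.0 + z * z))
```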
Fig. 6.1 B-S Vanilla Call option. T = 1, r = 0.10, σ = 0.5, X₀ = 100, K = 100. Left: convergence toward θ* (up to n = 10 000). Right: Monte Carlo simulation of size M = 10⁶; dotted line: θ = 0, solid line: θ = θ_{10 000} ≈ θ*
We did not try to optimize the choice of the step γ_n following theoretical results on the weak rate of convergence, nor did we perform an averaging principle. We just applied the heuristic rule that if the function H (here H_a) takes its (usual) values within a few units, then choosing γ_n = c · 20/(20 + n) with c ≈ 1 (say c ∈ [1/2, 2]) leads to satisfactory performances of the algorithm.
The resulting value θ_{10 000} was used in a standard Monte Carlo simulation of size M = 10⁶ based on (6.13) and compared to a crude Monte Carlo simulation with θ = 0. The numerical results are as follows:
– θ = 0: 95% confidence interval = [23.92, 24.11] (pointwise estimate: 24.02).
– θ = θ_{10 000} ≈ 1.51: 95% confidence interval = [23.919, 23.967] (pointwise estimate: 23.94).

The gain ratio in terms of standard deviations is 42.69/11.01 = 3.88 ≈ 4. This is observed in most simulations we made; however, the convergence of θ_n may be more chaotic than displayed in Fig. 6.1 (left), where the convergence is almost instantaneous.
The behavior of the two Monte Carlo simulations is depicted in Fig. 6.1 (right). The alternative original "parametrized" version of the algorithm (H_a(θ, z)) with a = 2σ√T yields quite similar results (when implemented with the same step and the same starting value).
Further comments: As developed in [226], all of the preceding can be extended to non-Gaussian random vectors Z provided their distribution has a log-concave probability density p satisfying, for some positive ρ,

log(p) + ρ| · |² is convex.
One can also replace the mean translation by other importance sampling procedures
like those based on the Esscher transform. This has applications, e.g. when Z = XT
is the value at time T of a process belonging to the family of subordinated Lévy
processes (Lévy processes of the form Zt = WYt , where Y is a subordinator – an
We consider a 2-dimensional B-S toy model as defined by (2.2), i.e. X_t⁰ = e^{rt} (riskless asset) and

X_tⁱ = x₀ⁱ e^{(r − σᵢ²/2)t + σᵢ W_tⁱ},  x₀ⁱ > 0,  i = 1, 2,

for the two risky assets, where ⟨W¹, W²⟩_t = ρ t, ρ ∈ [−1, 1], denotes the correlation between W¹ and W², that is, the correlation between the yields of the risky assets X¹ and X².
In this market, we consider a best-of call option defined by its payoff

(max(X_T¹, X_T²) − K)₊.
A market of such best-of calls is a market of the correlation ρ, since the respective volatilities are obtained from the markets of vanilla options on each asset as implied volatilities. In this 2-dimensional B-S setting, there is a closed formula for the premium involving the bi-variate standard normal distribution (see [159]), but what follows can be applied as soon as the asset dynamics – or their time discretization – can be simulated at a reasonable computational cost.
We will use a stochastic recursive procedure to solve the inverse problem in ρ, where

P_BoC(x₀¹, x₀², K, σ₁, σ₂, r, ρ, T) := e^{−rT} E[(max(X_T¹, X_T²) − K)₊]
= e^{−rT} E[(max(x₀¹ e^{μ₁T + σ₁√T Z¹}, x₀² e^{μ₂T + σ₂√T(ρZ¹ + √(1−ρ²) Z²)}) − K)₊],

where μᵢ = r − σᵢ²/2, i = 1, 2, and Z = (Z¹, Z²) ~ N(0; I₂).
It is intuitive and easy to check (at least empirically by simulation) that the function ρ ↦ P_BoC(x₀¹, x₀², K, σ₁, σ₂, r, ρ, T) is continuous and (strictly) decreasing on [−1, 1]. We assume that the market price is at least consistent, i.e. that P_market ∈ [P_BoC(1), P_BoC(−1)], so that Eq. (6.24) in ρ has exactly one solution, say ρ*. This example is a toy example not only because of its basic B-S dynamics, but also because, in such a model, more efficient deterministic procedures can be
called upon, based on the closed form for the option premium. Our aim is to propose
and illustrate below a general methodology for correlation search.
The most convenient way to prevent edge effects due to the fact that ρ ∈ [−1, 1] is to use a trigonometric parametrization of the correlation by setting

ρ = cos θ,  θ ∈ R.

Consequently, as soon as ρ = cos θ,

ρ Z¹ + √(1 − ρ²) Z² =^{d} (cos θ) Z¹ + (sin θ) Z²

since Z² =^{d} −Z².
The function P_BoC is a 2π-periodic continuous function of θ. Extracting the implied correlation from the market amounts to solving (with obvious notations) the equation

P_BoC(θ) = P_market,    (6.24)

where P_market is the quoted premium of the option (mark-to-market price). We need to slightly strengthen the consistency assumption on the market price, which is in fact necessary with almost any zero search procedure: we assume that P_market lies in the open interval

(min_θ P_BoC(θ), max_θ P_BoC(θ)),

i.e. that P_market is not an extremal value of P_BoC. So we are looking for a zero of the function h defined on R by
and Z = (Z¹, Z²) ~ N(0; I₂).
Proposition 6.1 Assume the above assumptions on P_market and the function P_BoC hold. If, moreover, the equation P_BoC(θ) = P_market has finitely many solutions on [0, 2π], then the stochastic zero search recursive procedure

θ_{n+1} = θ_n − γ_{n+1} H(θ_n, Z_{n+1}),  n ≥ 0,

where (Z_n)_{n≥1} is an i.i.d. N(0; I₂)-distributed sequence and (γ_n)_{n≥1} is a step sequence satisfying the decreasing step assumption (6.7), a.s. converges toward a solution θ* of P_BoC(θ) = P_market.
Proof. For every z ∈ R2 , θ → H (θ, z) is continuous, 2π-periodic and dominated by
a function g(z) such that g(Z) ∈ L2 (P) (g is obtained by replacing z 1 cos θ + z 2 sin θ
by |z 1 | + |z 2 | in the above formula for H ). One deduces that both the mean function
h and θ → E H 2 (θ, Z) are continuous and 2π-periodic, hence bounded.
The main difficulty in applying the Robbins–Siegmund Lemma is to find the appropriate Lyapunov function.
As the quoted value P_market is not an extremum of the function P, ∫₀^{2π} h_±(θ) dθ > 0, where h_± := max(±h, 0). The two functions h_± are 2π-periodic so that ∫_t^{t+2π} h_±(θ) dθ = ∫₀^{2π} h_±(θ) dθ > 0 for every t > 0. We consider any (fixed) solution θ₀ of the equation h(θ) = 0 and two real numbers β^± such that

0 < β^+ < [∫₀^{2π} h_+(θ) dθ] / [∫₀^{2π} h_−(θ) dθ] < β^−.
The resulting function ℓ is clearly continuous, 2π-periodic "on the right" on [θ₀, +∞) and "on the left" on (−∞, θ₀]. In particular, it is a bounded function. Furthermore, owing to the definition of β^±,

∫_{θ₀−2π}^{θ₀} ℓ(θ) dθ < 0  and  ∫_{θ₀}^{θ₀+2π} ℓ(θ) dθ > 0,

so that

lim_{θ→±∞} ∫_{θ₀}^{θ} ℓ(u) du = +∞.

As a consequence, there exists a real constant C > 0 such that the function

L(θ) = ∫₀^{θ} ℓ(u) du + C
It remains to prove that L′ = ℓ is Lipschitz continuous. Calling upon the usual arguments to interchange expectation and differentiation (see Theorem 2.2(b)), one shows that the function P_BoC is differentiable at every θ ∈ R \ 2πZ with

P′_BoC(θ) = σ₂√T E[1_{{X_T²(θ) > max(X_T¹, K)}} X_T²(θ) (cos(θ)Z² − sin(θ)Z¹)],

so that P_BoC is clearly Lipschitz continuous on the interval [0, 2π], hence Lipschitz continuous on the whole real line by periodicity. Consequently h and h_± are Lipschitz continuous, which implies in turn that ℓ is Lipschitz continuous as well.
Moreover, we know that the equation PBoC (θ) = Pmarket has exactly two solutions
on every interval of length 2π. Hence the set {h = 0} is countable and locally finite,
i.e. has a finite trace on any bounded interval.
One may apply Theorem 6.3 (for which we provide a self-contained proof in
one dimension) to deduce that θn will converge toward a solution θ∗ of the equation
PBoC (θ) = Pmarket . ♦
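As an illustration of Proposition 6.1, the following hedged Python sketch implements the correlation search in the toy 2-dimensional B-S model above (parameters of Fig. 6.2). It is not the book's code: the step constants, iteration count and seeds are hypothetical tuning choices.

```python
import math
import random

def bestof_payoff(theta, z1, z2, x0=100.0, K=100.0, r=0.10, sig=0.30, T=1.0):
    """Discounted best-of-call payoff with correlation rho = cos(theta),
    using the trigonometric parametrization (cos t) Z1 + (sin t) Z2."""
    mu = (r - 0.5 * sig * sig) * T
    s1 = x0 * math.exp(mu + sig * math.sqrt(T) * z1)
    g2 = math.cos(theta) * z1 + math.sin(theta) * z2
    s2 = x0 * math.exp(mu + sig * math.sqrt(T) * g2)
    return math.exp(-r * T) * max(max(s1, s2) - K, 0.0)

def implied_correlation(p_market, n_iter=300_000, c=0.2, seed=7):
    """Zero search for P_BoC(theta) = P_market: on each step the noisy
    innovation is (discounted payoff - quoted price), whose mean is h(theta)."""
    rng = random.Random(seed)
    theta = 0.5 * math.pi                     # start from rho = 0
    for n in range(1, n_iter + 1):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        h = bestof_payoff(theta, z1, z2) - p_market
        theta -= (c / (100.0 + n)) * h        # slowly decreasing step
    return math.cos(theta)
```

A natural self-consistency check is to price the option by Monte Carlo at a known correlation (say ρ = −0.5, as in Table 6.1), feed that price to `implied_correlation` and verify that the known correlation is approximately recovered.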
Exercise. Show that PBoC is continuously differentiable on the whole real line. [Hint:
extend the derivative on R by continuity.]
Extend the preceding to any payoff ϕ(XT1 , XT2 ) where ϕ : R2+ → R+ is a Lipschitz
continuous function. In particular, show without the help of differentiation that the
corresponding function θ → P(θ) is Lipschitz continuous.
Fig. 6.2 B-S Best-of-Call option. T = 1, r = 0.10, σ₁ = σ₂ = 0.30, X₀¹ = X₀² = 100, K = 100. Left: convergence of θ_n toward a θ* (up to n = 100 000). Right: convergence of ρ_n := cos(θ_n) toward −0.5
Table 6.1 B-S Best-of-Call option. T = 1, r = 0.10, σ₁ = σ₂ = 0.30, X₀¹ = X₀² = 100, K = 100. Convergence of ρ_n := cos(θ_n) toward −0.5
n ρn := cos(θn )
1000 −0.5606
10000 −0.5429
25000 −0.5197
50000 −0.5305
75000 −0.4929
100000 −0.4952
θ₀ = 0, n = 10⁵.
where χ(σ) = (1 + |σ|) e^{−σ²T/2}. Carefully justify this choice of H and implement the algorithm with x = K = 100, r = 0.1 and a market price equal to 16.73. Choose the step parameter of the form γ_n = (c/x) · (1/n), n ≥ 1, with c ∈ [0.5, 2] (this is simply a suggestion).
Warning. The above exercise is definitely a toy exercise! More efficient methods
for extracting standard implied volatility are available (see e.g. [209], which is based
on a Newton–Raphson zero search algorithm; a dichotomy approach is also very
efficient).
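For comparison, here is a sketch of the dichotomy (bisection) approach mentioned above for standard implied volatility; the bracketing interval and tolerance are hypothetical choices. It relies only on the fact that the B-S call price is increasing in σ.

```python
import math

def norm_cdf(x):
    # standard normal cdf via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def bs_call(x, K, r, sigma, T):
    """Black-Scholes call premium."""
    d1 = (math.log(x / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return x * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def implied_vol(price, x, K, r, T, lo=1e-6, hi=5.0, tol=1e-8):
    """Dichotomy: halve the bracketing interval until the tolerance is met."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(x, K, r, mid, T) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the exercise's data (x = K = 100, r = 0.1, T = 1, market price 16.73) this returns an implied volatility close to 0.30.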
where the functions σi : R2+ → R+ are bounded Lipschitz continuous functions and
the Brownian motions W 1 and W 2 are correlated with correlation ρ ∈ [−1, 1] so that
W 1 , W 2 t = ρt. (This ensures the existence and uniqueness of strong solutions for
this SDE, see Chap. 7.)
Assume that we know how to simulate (X_T¹, X_T²), either exactly, or at least as an approximation by an Euler scheme, from a d-dimensional normal vector Z = (Z¹, . . . , Z^d) ~ N(0; I_d).
Show that the above approach can be extended mutatis mutandis.
(C) ≡ argmin_θ |E Y_θ|_S = argmin_θ ½ |E Y_θ|²_S.
Here are two simple examples to illustrate this somewhat abstract definition.
Examples. 1. Black–Scholes model. Let, for any x, σ > 0, r ∈ R,

X_t^{x,σ} = x e^{(r − σ²/2)t + σW_t},  t ≥ 0,

and

Y_θ := (e^{−rT_i}(X_{T_i}^{x,σ} − K_i)₊ − P_market(T_i, K_i))_{i=1,...,p},

where P_market(T_i, K_i) is the mark-to-market price of the option with maturity T_i > 0 and strike price K_i.
2. Merton model (mini-krach). Now, for every x, σ, λ > 0, a ∈ (0, 1), set

X_t^{x,σ,λ,a} = x e^{(r − σ²/2 + λa)t + σW_t} (1 − a)^{N_t},  t ≥ 0,

where W is as above and N = (N_t)_{t≥0} is a standard Poisson process with jump intensity λ. Set

θ = (σ, λ, a),  Θ = (0, +∞)² × (0, 1)

and

Y_θ = (e^{−rT_i}(X_{T_i}^{x,σ,λ,a} − K_i)₊ − P_market(T_i, K_i))_{i=1,...,p}.
One checks – using the exercise "Extension to uniform integrability" which follows Theorem 2.2 – that θ ↦ E Y_θ is differentiable and that its Jacobian is given by ∂_θ E Y_θ = E ∂_θ Y_θ.
Then, the function L is differentiable everywhere on Θ and its gradient (with respect to the canonical Euclidean norm) is given by

∀ θ ∈ Θ,  ∇L(θ) = (∂_θ E Y_θ)^t S E Y_θ = (E ∂_θ Y_θ)^t S E Y_θ.
where the price dynamics (Xt (θ))t≥0 of the d traded assets is driven by a parametrized
diffusion process
where (W^n)_{n≥1} is an i.i.d. sequence of copies of W and (γ_n)_{n≥1} is a sequence of steps satisfying the usual decreasing step assumptions

Σ_n γ_n = +∞  and  Σ_n γ_n² < +∞.
In such a general framework, of course, one cannot ensure that the functions L and
H will satisfy the basic assumptions needed to make stochastic gradient algorithms
converge, typically
∇L is Lipschitz continuous and ∀ θ ∈ Θ, ‖H(θ, ·)‖₂ ≤ C √(1 + L(θ)),
or one of their numerous variants (see e.g. [39] for a large overview of possible
assumptions). However, in many situations, one can make the problem fit into a con-
verging setting by an appropriate change of variable on θ or by modifying the function
L and introducing an appropriate explicit (strictly) positive “weight function” χ(θ)
that makes the product χ(θ)H (θ, W (ω)) fit with these requirements.
Despite this, the topological structure of the set {∇L = 0} can be nontrivial, in particular disconnected. Nonetheless, as seen in Theorem 6.3, one can show, under natural assumptions, that θ_n converges toward the set {∇L = 0}.
The next step is that if ∇L has several zeros, they cannot all be local minima of L, especially when there are more than two of them (this is a consequence of the well-known Mountain-Pass Lemma, see [164]). Some are local maxima or saddle points of various kinds. These equilibrium points which are not local minima are called traps. An important fact is that, under some non-degeneracy assumptions on H at such a parasitic equilibrium point θ_∞ (typically, E[H(θ_∞, W)H(θ_∞, W)^t] is positive definite at least in the direction of an unstable manifold of h at θ_∞), the algorithm will a.s. never converge toward such a "trap". This question has been extensively investigated in the literature in various settings for many years (see [33, 51, 95, 191, 242]).
A final problem may arise due to an incompatibility between the geometry of the parameter set Θ and the above recursive algorithm: to be really defined by the above recursion, we need Θ to be left stable by (almost) all the mappings θ ↦ θ − γH(θ, w), at least for γ small enough. If this is not the case, we need to introduce some constraints on the algorithm by projecting it onto Θ whenever θ_n skips outside Θ. This question was originally investigated in [54] when Θ is a convex set.
Once all these technical questions have been circumvented, we may state the fol-
lowing meta-theorem, which says that θn a.s. converges toward a local minimum of L.
At this stage it is clear that calibration looks like quite a generic problem for
stochastic optimization and that almost all difficulties arising in the field of Stochastic
Approximation can be encountered when implementing such a (pseudo-)stochastic
gradient to solve it.
The Kiefer–Wolfowitz approach
Practical implementations of the Robbins–Siegmund approach point out a specific technical difficulty: the random functions θ ↦ Y_θ(ω) are not always pathwise differentiable (nor differentiable in the L^r(P)-sense, which would be enough). More importantly, even if one shows that θ ↦ E Y_θ is differentiable, possibly by calling upon other techniques (log-likelihood method, Malliavin weights, etc.), the resulting representation for ∂_θ E Y_θ may turn out to be difficult to simulate, requiring much programming care, whereas the random vectors Y_θ themselves can be simulated in a standard way. In such a setting, an alternative is provided by the Kiefer–Wolfowitz algorithm
(K–W ) which combines the recursive stochastic approximation principle with a finite
difference approach to differentiation. The idea is simply to approximate the gradient
∇L by
∂L/∂θⁱ (θ) ≈ [L(θ + ηⁱ eᵢ) − L(θ − ηⁱ eᵢ)] / (2ηⁱ),  1 ≤ i ≤ p,
where (eᵢ)_{1≤i≤p} denotes the canonical basis of R^p and η = (ηⁱ)_{1≤i≤p}. Writing L(θ) = E Λ(θ, W), this finite difference term has an integral representation given by

[L(θ + ηⁱ eᵢ) − L(θ − ηⁱ eᵢ)] / (2ηⁱ) = E[ (Λ(θ + ηⁱ eᵢ, W) − Λ(θ − ηⁱ eᵢ, W)) / (2ηⁱ) ],

which leads to the componentwise recursion

θⁱ_{n+1} = θⁱ_n − γ_{n+1} [Λ(θ_n + ηⁱ_{n+1} eᵢ, W^{n+1}) − Λ(θ_n − ηⁱ_{n+1} eᵢ, W^{n+1})] / (2ηⁱ_{n+1}),  1 ≤ i ≤ p.
We reproduce below a typical convergence result for K–W procedures (see [39])
which is the natural counterpart of the stochastic gradient framework.
Theorem 6.4 Assume that the function θ ↦ L(θ) is twice differentiable with a Lipschitz continuous Hessian. We assume that

Σ_{n≥1} γ_n = Σ_{n≥1} ηⁱ_n = +∞,  Σ_{n≥1} γ_n² < +∞,  η_n → 0,  Σ_{n≥1} (γ_n/ηⁱ_n)² < +∞.
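A minimal Python sketch of a K–W procedure in dimension p = 1 follows, on a hypothetical toy criterion L(θ) = E(θ − W)², minimized at θ* = E W; the step and spacing sequences are chosen so that the summability conditions above hold (Σ(γ_n/η_n)² ~ Σ n^{−3/2} < ∞).

```python
import random

def kiefer_wolfowitz(l_sample, theta0=0.0, n_iter=50_000, c=0.5, eta0=1.0, seed=3):
    """K-W recursion: theta_{n+1} = theta_n - gamma_{n+1} *
    [Lambda(theta_n + eta_{n+1}, W^{n+1}) - Lambda(theta_n - eta_{n+1}, W^{n+1})] / (2 eta_{n+1}),
    with gamma_n = c/n and eta_n = eta0 / n^{1/4}."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        w = rng.gauss(1.0, 1.0)              # hypothetical innovation, E W = 1
        gamma, eta = c / n, eta0 / n ** 0.25
        # using the same innovation w in both evaluations lowers the variance
        grad = (l_sample(theta + eta, w) - l_sample(theta - eta, w)) / (2.0 * eta)
        theta -= gamma * grad
    return theta

# Toy criterion Lambda(theta, w) = (theta - w)^2, i.e. L(theta) = E (theta - W)^2.
theta = kiefer_wolfowitz(lambda t, w: (t - w) ** 2)
```

Only noisy evaluations of the criterion are used: no pathwise derivative of Λ is ever computed, which is exactly the appeal of the method when ∂_θ Y_θ is hard to simulate.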
Definition 6.1 The Value at Risk at level α ∈ (0, 1) is the (lowest) α-quantile of the distribution of X, i.e.

VaR_α(X) := inf{ξ : P(X ≤ ξ) ≥ α}.    (6.25)

It satisfies

P(X < VaR_α(X)) ≤ α ≤ P(X ≤ VaR_α(X)).
Definition 6.2. Let X ∈ L1 (P) with an atomless distribution. The Conditional Value-
at-Risk at level α ∈ (0, 1) is defined by
CVaRα (X ) := E X | X ≥ VaRα (X ) . (6.27)
Remark. Note that in the case of non-uniqueness of the α-quantile, the Conditional
Value-at-risk is still well-defined since the above conditional expectation does not
depend upon the choice of this α-quantile solution to (6.26).
Exercises. 1. Assume that the distribution of X has no atom. Show that

CVaR_α(X) = VaR_α(X) + (1/(1−α)) ∫_{VaR_α(X)}^{+∞} P(X > u) du.

[Hint: use that for a non-negative r.v. Y, E Y = ∫₀^{+∞} P(Y ≥ y) dy.]
2. Show that the conditional value-at-risk CVaRα (X ) is a consistent measure of risk,
i.e. that it satisfies the following three properties
• ∀ λ > 0, CVaRα (λ X ) = λCVaRα (X ).
• ∀ a ∈ R, CVaRα (X + a) = CVaRα (X ) + a.
• Let X , Y ∈ L1 (P), CVaRα (X + Y ) ≤ CVaRα (X ) + CVaRα (Y ).
L(ξ) = ξ + (1/(1−α)) E[(X − ξ)₊]

attains its minimum at

VaR_α(X) = inf argmin_{ξ∈R} L(ξ).
Proof. The function L is clearly convex and 1-Lipschitz continuous since both func-
tions ξ → ξ and ξ → (x − ξ)+ are convex and 1-Lipschitz continuous for every
x ∈ R. As X has no atom, the function L is also differentiable on the whole real line
with a derivative given for every ξ ∈ R by
L′(ξ) = 1 − (1/(1−α)) P(X > ξ) = (1/(1−α)) (P(X ≤ ξ) − α).
L(ξ_α) = ξ_α + E[(X − ξ_α)₊] / P(X > ξ_α)
       = [ξ_α E 1_{{X>ξ_α}} + E[(X − ξ_α) 1_{{X>ξ_α}}]] / P(X > ξ_α)
       = E[X 1_{{X>ξ_α}}] / P(X > ξ_α)
       = E(X | X > ξ_α).
and

lim_{ξ→+∞} L(−ξ)/ξ = lim_{ξ→+∞} [−1 + (1/(1−α)) E[(X/ξ + 1)₊]] = −1 + 1/(1−α) = α/(1−α).
One checks that the function on the right-hand side of the above inequality attains
its minimum at its only break of monotonicity, i.e. when ξ = E X . This completes
the proof. ♦
Proposition 6.3 Assume that X ∈ L¹(P) has a unique Value-at-Risk VaR_α(X). Let (X_n)_{n≥1} be an i.i.d. sequence of random variables with the same distribution as X, let ξ₀ ∈ L¹(P) be independent of (X_n)_{n≥1}, and let (γ_n)_{n≥1} be a positive sequence of real numbers satisfying the decreasing step assumption (6.7). Then, the stochastic algorithm (ξ_n)_{n≥0} defined by

ξ_{n+1} = ξ_n − γ_{n+1} H(ξ_n, X_{n+1}),  n ≥ 0,    (6.28)

where

H(ξ, x) := 1 − (1/(1−α)) 1_{{x≥ξ}} = (1/(1−α)) (1_{{x<ξ}} − α),    (6.29)

a.s. converges toward VaR_α(X).

Proof. The function (ξ, x) ↦ H(ξ, x) is bounded by α/(1−α), so that ξ ↦ ‖H(ξ, X)‖₂ is bounded as well. The conclusion then follows directly from the Robbins–Monro theorem when ξ₀ ∈ L²(P).
In the general case – ξ₀ ∈ L¹(P) – one introduces the Lyapunov function L(ξ) = (ξ − ξ_α)²/√(1 + (ξ − ξ_α)²), where we set ξ_α = VaR_α(X) for convenience. The derivative of L is given by L′(ξ) = (ξ − ξ_α)(2 + (ξ − ξ_α)²)/(1 + (ξ − ξ_α)²)^{3/2}. One checks on the one hand that L′ is Lipschitz continuous over the real line (e.g. because L″ is bounded) and, on the other hand, that {L′ = 0} = {ξ_α} = {VaR_α(X)}. Then Theorem 6.3 (pseudo-Stochastic Gradient) applies and yields the announced conclusion (note that we need its one-dimensional version, which we established in detail). ♦
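A hedged Python sketch of the recursion (6.28)–(6.29) for a Gaussian toy distribution follows; the step sequence and iteration count are hypothetical tuning choices (a plain γ_n = 1/n would make the early 1/(1−α)-sized kicks very hard to recover from).

```python
import random

def value_at_risk(sample, alpha=0.95, n_iter=200_000, c=5.0, seed=5):
    """Recursion (6.28): xi_{n+1} = xi_n - gamma_{n+1} H(xi_n, X_{n+1}),
    with H(xi, x) = 1 - 1_{x >= xi}/(1 - alpha) from (6.29)."""
    rng = random.Random(seed)
    xi = 0.0
    for n in range(1, n_iter + 1):
        x = sample(rng)
        gamma = c / (100.0 + n)          # slowly decreasing step, heuristic choice
        xi -= gamma * (1.0 - (x >= xi) / (1.0 - alpha))
    return xi

# Gaussian toy case: VaR_0.95 of N(0, 1) is the 0.95-quantile, about 1.645.
xi = value_at_risk(lambda rng: rng.gauss(0.0, 1.0))
```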
Proposition 6.4 If X ∈ L^p(P) for some p > 0, with a continuous distribution function and a unique Value-at-Risk, and if ξ₀ ∈ L^p(P), then the algorithm (6.28) a.s. converges toward VaR_α(X).
Remark. The uniqueness of the value-at-risk can also be relaxed. The conclusion
becomes that (ξn ) a.s. converges to a random variable taking values in the “VaRα (X )
set”: {ξ ∈ R : P(X ≤ ξ) = α} (see [29] for a statement in that direction).
Second step: adaptive computation of CVaR_α(X). The main aim of this section is to compute CVaR_α(X) on-line. The fact that L(ξ_n) → CVaR_α(X) is of no practical help since the function L is not explicit. How can we proceed? The idea is to devise a companion procedure of the above stochastic gradient. Still set temporarily ξ_α = VaR_α(X) for convenience. It follows from Proposition 6.3 and Césaro's averaging principle that (L(ξ₀) + · · · + L(ξ_{n−1}))/n → CVaR_α(X) a.s. and in L¹, since L(ξ_{n−1}) → CVaR_α(X) a.s. and in L¹. In particular,

E[(L(ξ₀) + · · · + L(ξ_{n−1}))/n] → CVaR_α(X) as n → +∞.

Note that

L(ξ) = E ℓ(ξ, X)  where  ℓ(ξ, x) = ξ + (x − ξ)₊/(1 − α).
Using that X_k and (ξ₀, ξ₁, . . . , ξ_{k−1}) are independent for every k ≥ 1, one has

L(ξ_{k−1}) = E[ℓ(ξ, X)]_{|ξ=ξ_{k−1}} = E[ℓ(ξ_{k−1}, X_k) | ξ_{k−1}],  k ≥ 1,

so that

E[(ℓ(ξ₀, X₁) + · · · + ℓ(ξ_{n−1}, X_n))/n] = E[(L(ξ₀) + · · · + L(ξ_{n−1}))/n] → CVaR_α(X) as n → +∞.
This suggests introducing the empirical mean

C_n = (1/n) Σ_{k=0}^{n−1} ℓ(ξ_k, X_{k+1}),  n ≥ 1,  C₀ = 0,

which satisfies the recursion

C_{n+1} = C_n − (1/(n+1)) (C_n − ℓ(ξ_n, X_{n+1})).    (6.30)
Proposition 6.5 Assume that X ∈ L^{1+ρ}(P) for some ρ ∈ (0, 1] and that ξ_n → VaR_α(X) a.s. Then

C_n → CVaR_α(X)  a.s. as n → +∞.
Proof. We will prove this claim in detail in the quadratic case ρ = 1. The proof in
the general case relies on the Chow Theorem (see [81] or the second exercise right
after the proof). First, one decomposes

C_n − L(ξ_α) = (1/n) Σ_{k=0}^{n−1} (L(ξ_k) − L(ξ_α)) + (1/n) Σ_{k=1}^{n} Y_k

with Y_k := ℓ(ξ_{k−1}, X_k) − L(ξ_{k−1}), k ≥ 1. It is clear that (1/n) Σ_{k=0}^{n−1} (L(ξ_k) − L(ξ_α)) → 0 as n → +∞ by Césaro's principle. As for the second term, we first note that
ℓ(ξ, x) − L(ξ) = (1/(1−α)) [(x − ξ)₊ − E(X − ξ)₊]

so that

|ℓ(ξ, x) − L(ξ)| ≤ (1/(1−α)) E|X − x| ≤ (1/(1−α)) (E|X| + |x|).
Hence

E Y_k² ≤ (2/(1−α)²) ((E|X|)² + E X²).
We consider the filtration F_n := σ(ξ₀, X₁, . . . , X_n). One checks that (ξ_n)_{n≥0} is (F_n)-adapted and that, for every k ≥ 1,

E(Y_k | F_{k−1}) = E(ℓ(ξ_{k−1}, X_k) | F_{k−1}) − L(ξ_{k−1}) = L(ξ_{k−1}) − L(ξ_{k−1}) = 0.
Consequently,

N_n := Σ_{k=1}^{n} Y_k / k,  n ≥ 1,

is a square integrable martingale whose predictable bracket satisfies

⟨N⟩_n = Σ_{k=1}^{n} E(Y_k² | F_{k−1}) / k²,

so that

E⟨N⟩_∞ ≤ sup_n E Y_n² × Σ_{k≥1} 1/k² < +∞.

Hence N_n a.s. converges to a finite limit and the Kronecker Lemma implies

(1/n) Σ_{k=1}^{n} Y_k → 0  a.s. as n → +∞,

which completes the proof. ♦
Remark. For practical implementation, one may prefer to first estimate the VaRα (X )
and, once it is done, use a regular Monte Carlo procedure to evaluate the CVaRα (X ).
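Both quantities can nevertheless be estimated on-line from a single stream of innovations. Here is a hedged Python sketch combining (6.28) with the companion recursion (6.30), writing v(ξ, x) = ξ + (x − ξ)₊/(1 − α) for the integrand of L; the step constants and sample sizes are hypothetical choices.

```python
import random

def var_cvar_online(sample, alpha=0.95, n_iter=200_000, c=5.0, seed=5):
    """VaR recursion (6.28) together with its Cesaro companion (6.30):
    C_{n+1} = C_n - (C_n - v(xi_n, X_{n+1}))/(n+1), computed from one stream."""
    rng = random.Random(seed)
    xi, C = 0.0, 0.0
    for n in range(n_iter):
        x = sample(rng)
        # companion average uses the *current* xi, frozen w.r.t. x ... (6.30)
        C -= (C - (xi + max(x - xi, 0.0) / (1.0 - alpha))) / (n + 1)
        # ... then the quantile recursion consumes the same innovation (6.28)
        xi -= (c / (100.0 + n + 1)) * (1.0 - (x >= xi) / (1.0 - alpha))
    return xi, C

# For X ~ N(0,1) and alpha = 0.95: VaR is about 1.645 and
# CVaR = phi(1.645)/0.05 is about 2.06.
xi, cvar = var_cvar_online(lambda rng: rng.gauss(0.0, 1.0))
```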
Exercises. 1. Show that an alternative method to compute CVaR_α(X) is to design the following recursive procedure

C_{n+1} = C_n − γ_{n+1} (C_n − ℓ(ξ_n, X_{n+1})),  n ≥ 0,  C₀ = 0,    (6.31)

where (γ_n)_{n≥1} is the step sequence implemented in the algorithm (6.28) to compute VaR_α(X).
2 (Proof of Proposition 6.5). Show that the conclusion of Proposition 6.5 remains
valid if X ∈ L1+ρ (P). [Hint: rely on the Chow Theorem (3 ).]
ℵ Practitioner's corner
Warning!…Toward an operating procedure! As it is presented, the preceding is essentially a toy exercise for the following reason: in practice α ≈ 1, and the convergence of the above algorithm turns out to be slow and chaotic since P(X > VaR_α(X)) = 1 − α is close to 0. For a practical implementation on real-life portfolios the above algorithm must be combined with an importance sampling transformation to "recenter" the simulation where things do happen. A realistic and efficient procedure is developed and analyzed in [29].
A second practical improvement to the procedure is to make the level α vary slowly from, say, α₀ = ½ to the target level α during the first part of the simulation.
An alternative approach to this recursive algorithm is to invert the empirical measure of the innovations (X_n)_{n≥1} (see [83]). This method is close to the one described above once rewritten in a recursive way (it corresponds to the step sequence γ_n = 1/n).
³ Let (M_n)_{n≥0} be an (F_n, P)-martingale null at 0 and let ρ ∈ (0, 1]; then

M_n → M_∞  a.s. on the event { Σ_{n≥1} E[|ΔM_n|^{1+ρ} | F_{n−1}] < +∞ },

where ΔM_n := M_n − M_{n−1}.
of such a procedure. First, the potential does not go to infinity when the norm of
the N -quantizer goes to infinity; secondly, the procedure is not well-defined when
some components of the quantizers merge. Very partial results are known about the
asymptotic behavior of this procedure (see e.g. [39, 224]) except in one dimension in
the unimodal setting (log-concave density for the distribution to be quantized) where
much faster deterministic Newton–Raphson like procedures can be implemented if
the cumulative distribution and the first moment functions both have closed forms.
But its use remains limited due to its purely one-dimensional feature.
However, these theoretical gaps (possible asymptotic "merging" of components or escape of components to infinity) are not observed in practical simulations. This fully justifies presenting the procedure in detail.
Let us recall that the quadratic distortion function (see Definition 5.1.2) is defined as the squared quadratic mean-quantization error, i.e.

Q_{2,N}(x) = E q_{2,N}(x, X),

where x = {x₁, . . . , x_N} and the local distortion function q_{2,N}(x, ξ) is defined by

∀ x ∈ (R^d)^N, ∀ ξ ∈ R^d,  q_{2,N}(x, ξ) = min_{1≤i≤N} |ξ − xᵢ|² = dist(ξ, {x₁, . . . , x_N})².

Its partial derivatives read

∂Q_{2,N}/∂xᵢ (x) := E[∂q_{2,N}/∂xᵢ (x, X)] = ∫_{R^d} ∂q_{2,N}/∂xᵢ (x, ξ) P_X(dξ),

with

∂q_{2,N}/∂xᵢ (x, ξ) := 2(xᵢ − ξ) 1_{{Proj_x(ξ) = xᵢ}},  1 ≤ i ≤ N,

where Proj_x denotes a (Borel) projection following the nearest neighbor rule on the grid {x₁, . . . , x_N}.
Proof. First note that, as the N-tuple x has pairwise distinct components, all the interiors C̊ᵢ(x), i = 1, . . . , N, of the Voronoi cells induced by x are non-empty. Let ξ ∈ ⋃_{i=1}^{N} C̊ᵢ(x) = R^d \ ⋃_{1≤i≤N} ∂Cᵢ(x). One has min_{i≠j} (|ξ − xᵢ| − |ξ − xⱼ|) > 0 and

q_{2,N}(x, ξ) = Σ_{j=1}^{N} |xⱼ − ξ|² 1_{{ξ ∈ C̊ⱼ(x)}}.

Hence

∀ i ∈ {1, . . . , N},  ∂q_{2,N}/∂xᵢ (x, ξ) = (∂|xᵢ − ξ|²/∂xᵢ) 1_{{ξ ∈ C̊ᵢ(x)}} = 2(xᵢ − ξ) 1_{{ξ ∈ C̊ᵢ(x)}}.
Hence, it follows from the assumption P(⋃_{1≤i≤N} ∂Cᵢ(x)) = 0 that q_{2,N}(·, ξ) is P_X(dξ)-a.s. differentiable with a gradient ∇_x q_{2,N}(x, ξ) given by the above formula. On the other hand, for every x, x′ ∈ (R^d)^N, the function Q_{2,N} is locally Lipschitz continuous.
Remarks. In fact, when p > 1, the L^p-distortion function Q_{p,N} with respect to a Euclidean norm is also differentiable at N-tuples having pairwise distinct components, with gradient

∇Q_{p,N}(x) = ( p ∫_{Cᵢ(x)} |xᵢ − ξ|^{p−1} (xᵢ − ξ)/|xᵢ − ξ| μ(dξ) )_{1≤i≤N}
            = ( p E[ 1_{{X ∈ Cᵢ(x)}} |xᵢ − X|^{p−1} (xᵢ − X)/|xᵢ − X| ] )_{1≤i≤N}.

An extension to the case p ∈ (0, 1] exists under appropriate continuity and integrability assumptions on the distribution μ, namely μ({a}) = 0 for every a and the function a ↦ ∫_{R^d} |ξ − a|^{p−1} μ(dξ) remains bounded on compact sets of R^d. A more general differentiation result exists for strictly convex smooth norms (see Lemma 2.5, p. 28 in [129]).
6.3 Applications to Finance 219
In case of conflict on the winner index one applies a prescribed rule (like selecting the true winner at random, uniformly among the winner candidates).
Warning! Note that choosing a (0, 1)-valued step sequence (γ_n)_{n≥1} is crucial to ensure that the learning phase is a homothety (centered at the innovation) with ratio ρ = 1 − γ_{n+1} ∈ (0, 1) at iteration n + 1.
One can easily check by induction that if x^{[n]} has pairwise distinct components x_i^{[n]}, this property is preserved by the learning phase, so that the above procedure is well-defined. The name of the procedure – Competitive Learning Vector Quantization algorithm – is of course inspired by these two phases.
Heuristics: x^{[n]} \longrightarrow x^{(N)} \in \mathrm{argmin}_{x\in(\mathbb{R}^d)^N} Q_{2,N}(x) as n \to +\infty, or, at least, toward a local minimum of Q_{2,N}. (This implies that x^{(N)} has pairwise distinct components.)
On-line computation of the "companion parameters": this phase is very important in view of numerical applications.

• Weights p_i^{(N)} = P\big( \widehat X^{x^{(N)}} = x_i^{(N)} \big), i = 1, \ldots, N.
– Initialize: p_i^{[0]} := 0, i = 1, \ldots, N,
– Update, using the winner index already computed in the competition phase:
\[
p_i^{[n+1]} := p_i^{[n]} + \gamma_{n+1}\Big( \mathbf{1}_{\{\mathrm{Proj}_{x^{[n]}}(\xi_{n+1}) = x_i^{[n]}\}} - p_i^{[n]} \Big), \quad i = 1, \ldots, N. \tag{6.32}
\]
One has:
p_i^{[n]} \longrightarrow p_i^{(N)} a.s. on the event \{x^{[n]} \to x^{(N)}\} as n \to +\infty.

• Distortion Q_{2,N}(x^{(N)}) = \| X - \widehat X^{x^{(N)}} \|_2^2: set
– Initialize: Q_{2,N}^{[0]} := 0,
– Update:
\[
Q_{2,N}^{[n+1]} := Q_{2,N}^{[n]} + \gamma_{n+1}\Big( \big| \xi_{n+1} - \mathrm{Proj}_{x^{[n]}}(\xi_{n+1}) \big|^2 - Q_{2,N}^{[n]} \Big). \tag{6.33}
\]
One has
Q_{2,N}^{[n]} \longrightarrow Q_{2,N}\big(x^{(N)}\big) a.s. on the event \{x^{[n]} \to x^{(N)}\} as n \to +\infty.
Note that, since the ingredients involved in the above computations are those used
in the competition phase (nearest neighbor search), there is (almost) no extra CPU
time cost induced by this companion procedure. By contrast the nearest neighbor
search is costly.
In some way the CLVQ algorithm can be seen as a Non-linear Monte Carlo
Simulation devised to design an optimal skeleton of the distribution of X .
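To fix ideas, the two phases and the companion recursions can be sketched in a few lines. The following is an illustrative implementation of ours (the function names, the step choice γ_n = 1/(100+n) and the bivariate Gaussian example are ours), not the large-scale implementation referred to below.

```python
import numpy as np

def clvq(sampler, N, n_iter=50_000, seed=0):
    """Sketch of the CLVQ algorithm: competition phase (nearest neighbor
    search), learning phase (the winner moves toward the innovation by a
    homothety of ratio 1 - gamma), plus recursive companion estimates of
    the weights p_i and of the quadratic distortion."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, N)                 # initial grid: N i.i.d. copies of X
    p = np.zeros(N)                     # companion weight estimates
    q = 0.0                             # companion distortion estimate
    for n in range(1, n_iter + 1):
        gamma = 1.0 / (100.0 + n)       # (0,1)-valued decreasing step
        xi = sampler(rng, 1)[0]         # innovation xi_{n+1}, distributed as X
        dist2 = ((x - xi) ** 2).sum(axis=1)
        i = int(dist2.argmin())         # competition phase: winner index
        x[i] += gamma * (xi - x[i])     # learning phase on the winner only
        p += gamma * ((np.arange(N) == i) - p)   # weight update
        q += gamma * (dist2[i] - q)              # distortion update
    return x, p, q

def gauss2d(rng, m):                    # X ~ N(0; I_2)
    return rng.standard_normal((m, 2))

grid, weights, distortion = clvq(gauss2d, N=20)
# weights form (approximately) a probability vector; distortion is positive
```

On the event where the grid stabilizes, `weights` approaches the Voronoi weights of the limiting grid and `distortion` its quadratic distortion, at (almost) no extra cost beyond the nearest neighbor search already performed.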
For partial theoretical results on the convergence of the CLVQ algorithm, we
refer to [224] when X has a compactly supported distribution. To the best of our
Exercise. (a) Replace the step sequence (γ_n) in (6.32) and (6.33) by γ̃_n = 1/n, without modifying anything in the CLVQ procedure itself. Show that, if x^{[n]} → x^{(N)} a.s., the resulting new procedures both converge toward their target. [Hint: follow the lines of the convergence of the adaptive estimator of the CVaR in Sect. 6.3.4.]
(b) Extend this result to prove that the a.s. convergence holds on the event
{x[n] → x(N ) }.
The CLVQ algorithm is recommended to obtain accurate results for small or medium levels N (less than 20) and medium dimensions d (less than 10).
The Randomized Lloyd I procedure
The randomized Lloyd I procedure described below is recommended when N is large
and d is medium. We start again from the fact that if a function is differentiable at
one of its local or global minima, then its gradient is zero at this point. Any global
minimizer x(N ) of the quadratic distortion function Q2,N has pairwise distinct compo-
nents and P-negligible Voronoi cell boundaries as mentioned above. Consequently,
owing to Proposition 6.6, the gradient of the quadratic distortion at such x = x(N )
must be zero, i.e.
\[
\frac{\partial Q_{2,N}}{\partial x_i}(x) = 2\, E\big[ (x_i - X)\, \mathbf{1}_{\{\mathrm{Proj}_x(X) = x_i\}} \big] = 0, \quad 1 \le i \le N.
\]
Moreover, we know from Theorem 5.1.1(b) that P\big( X \in C_i(x) \big) > 0 for every i \in \{1, \ldots, N\}, so that the above equation reads equivalently
\[
x_i = E\big[ X \,|\, \{\widehat X^x = x_i\} \big], \quad i = 1, \ldots, N, \tag{6.34}
\]
since E\big[ X \,|\, \widehat X^x \big] = \sum_{i=1}^N E\big[ X \,|\, \{\widehat X^x = x_i\} \big]\, \mathbf{1}_{\{\widehat X^x = x_i\}}. Or, equivalently, but in a more tractable form, x is a fixed point of the mapping
\[
x \longmapsto E\big[ X \,|\, \widehat X^x \big](\Omega). \tag{6.36}
\]
The Lloyd I algorithm is the recursive procedure associated with the fixed point identity (6.34) (or (6.36)). In its generic form it reads, keeping the notation x^{[n]} for the running N-quantizer at iteration n ≥ 0:
\[
x_i^{[n+1]} =
\begin{cases}
E\big[ X \,|\, \{\widehat X^{x^{[n]}} = x_i^{[n]}\} \big] & \text{if } P\big( \widehat X^{x^{[n]}} = x_i^{[n]} \big) > 0,\\[1mm]
x_i^{[n]} & \text{if } P\big( \widehat X^{x^{[n]}} = x_i^{[n]} \big) = 0,
\end{cases}
\qquad i = 1, \ldots, N. \tag{6.37}
\]
Exercise. Prove that this recursive procedure only involves the distribution μ = PX
of the random vector X .
The Lloyd I algorithm can be viewed as a two-step procedure acting on random vectors as follows:
\[
\begin{cases}
\text{(i) Grid updating:} & x^{[n+1]} = X^{[n+1]}(\Omega) \ \text{ with } \ X^{[n+1]} = E\big[ X \,|\, \widehat X^{x^{[n]}} \big],\\[1mm]
\text{(ii) Voronoi cells (weights) updating:} & \widehat X^{x^{[n+1]}} \leftarrow X^{[n+1]}.
\end{cases}
\]
The first step updates the grid, the second step re-assigns to each element of the grid its Voronoi cell, which can also be interpreted as a weight updating.
Proposition 6.6 The Lloyd I algorithm makes the mean quadratic quantization error decrease, i.e.
\[
n \longmapsto \big\| X - \widehat X^{x^{[n]}} \big\|_2 \ \text{ is non-increasing.}
\]
Proof. It follows from the above decomposition of the procedure and the very definitions of nearest neighbor projection and conditional expectation as an orthogonal projector in L^2(P) that, for every n \in \mathbb{N},
\[
\big\| X - \widehat X^{x^{[n+1]}} \big\|_2 = \big\| \mathrm{dist}\big( X, x^{[n+1]} \big) \big\|_2
\le \big\| X - X^{[n+1]} \big\|_2 = \big\| X - E\big[ X \,|\, \widehat X^{x^{[n]}} \big] \big\|_2
\le \big\| X - \widehat X^{x^{[n]}} \big\|_2. \qquad \diamond
\]
Though attractive, this proposition is far from a convergence result for the Lloyd procedure since components of the running quantizer x^{[n]} can a priori escape to infinity. In spite of a huge literature on the Lloyd I algorithm, also known as k-means in Statistics and Data Science, the question of its a.s. convergence toward a stationary, hopefully optimal, N-quantizer has so far only received partial answers. Let us cite [168], where the procedure is introduced and investigated, probably for the first time, in a one-dimensional setting for unimodal distributions. More recently, as far as strongly continuous distributions are concerned (4), let us cite [79, 80] or [85], where a.s. convergence is established if X has a compactly supported density, and [238], where the convergence is proved for unbounded strongly continuous distributions under an appropriate initialization of the procedure at level N depending on the lowest quantization error at level N − 1 (see the splitting method in the Practitioner's corner below).
keeping in mind that the convergence holds when M → +∞. To be more precise, this amounts to setting, at every iteration n ≥ 0 and for every i = 1, \ldots, N,
\[
x_i^{[n+1]} =
\begin{cases}
\dfrac{\sum_{1\le m\le M} \xi_m\, \mathbf{1}_{\{\widehat\xi_m^{\,x^{[n]}} = x_i^{[n]}\}}}{\#\big\{ 1\le m\le M : \widehat\xi_m^{\,x^{[n]}} = x_i^{[n]} \big\}} & \text{if } \big\{ 1\le m\le M : \widehat\xi_m^{\,x^{[n]}} = x_i^{[n]} \big\} \neq \emptyset,\\[3mm]
x_i^{[n]} & \text{otherwise,}
\end{cases}
\]
i.e. performing the Lloyd iteration with the distribution of X replaced by the empirical measure of the sample (\xi_m)_{1\le m\le M},
\[
\mu(\omega, d\xi)_M = \frac{1}{M} \sum_{m=1}^M \delta_{\xi_m(\omega)}(d\xi).
\]
In particular, if we use the same sample (ξ_m(ω))_{1≤m≤M} at each iteration of the procedure, we still have the property that the procedure decreases a quantization error modulus (at level N) related to the empirical distribution μ(ω, dξ)_M.
This suggests that the random i.i.d. sample (ξ_m)_{m≥1} can also be replaced by deterministic copies obtained through a QMC procedure based on a representation of X of the form X \overset{d}{=} \psi(U), U \sim \mathcal{U}([0, 1]^r).
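When the sample is frozen, the randomized Lloyd I procedure is nothing but the classical k-means iteration on the empirical measure. Below, a minimal sketch of ours (sample sizes and names are illustrative):

```python
import numpy as np

def randomized_lloyd(sample, N, n_iter=60, seed=0):
    """Lloyd I fixed-point iteration applied to the empirical measure of
    `sample` (k-means): (i) nearest-neighbor assignment of the sample points
    (Voronoi cells), (ii) each centroid moves to the mean of its cell.
    An empty cell leaves its centroid unchanged, as in (6.37)."""
    rng = np.random.default_rng(seed)
    x = sample[rng.choice(len(sample), size=N, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((sample[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)                  # nearest-neighbor projection
        for i in range(N):
            cell = sample[idx == i]
            if len(cell):                        # empty cell: keep x_i
                x[i] = cell.mean(axis=0)
    d2 = ((sample[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    return x, d2.min(axis=1).mean()              # grid, empirical distortion

rng = np.random.default_rng(1)
sample = rng.standard_normal((5000, 2))          # M = 5000 copies of X ~ N(0; I_2)
grid, distortion = randomized_lloyd(sample, N=50)
```

In accordance with Proposition 6.6 applied to the empirical measure, the empirical distortion computed on the same frozen sample is non-increasing along the iterations.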
ℵ Practitioner’s corner
Splitting Initialization method II. When computing quantizers of larger and larger sizes for the same distribution, a significant improvement of the method is to initialize the randomized Lloyd procedure or the CLVQ at level N + 1 by adding one component to the N-quantizer resulting from the previous execution of the procedure at level N. To be more precise, one should initialize the procedure with the (N + 1)-tuple (x^{(N)}, ξ) ∈ (R^d)^{N+1}, where x^{(N)} denotes the limiting value of the procedure at level N (assumed to exist, which is the case in practice). Such a protocol is known as the splitting method.
Fast nearest neighbor search procedure(s) in R^d.
This is the key step in all stochastic procedures which intend to compute optimal (or at least "good") quantizers in higher dimensions. Speeding it up, especially when d increases, is one of the major challenges of computer science.
– The Partial Distance Search paradigm (see [62]).
The nearest neighbor search in a Euclidean vector space can be reduced to the simpler problem of checking whether a vector u = (u_1, \ldots, u_d) \in \mathbb{R}^d is closer to 0, with respect to the canonical Euclidean distance, than a given former "minimal record distance" \delta_{rec} > 0. The elementary "trick" is the following:
\[
(u_1)^2 \ge \delta_{rec}^2 \ \Longrightarrow\ |u| \ge \delta_{rec},
\]
\[
\vdots
\]
\[
(u_1)^2 + \cdots + (u_\ell)^2 \ge \delta_{rec}^2 \ \Longrightarrow\ |u| \ge \delta_{rec},
\]
\[
\vdots
\]
so that the computation of |u|^2 can be abandoned as soon as a partial sum of squared coordinates exceeds \delta_{rec}^2.
This is the simplest and easiest idea to implement but it seems that it is also the only
one that still works as d increases.
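A minimal implementation of ours of a nearest neighbor search using this partial-distance abandon rule (names are illustrative):

```python
import numpy as np

def nearest_neighbor_pds(query, grid):
    """Nearest neighbor search with the Partial Distance Search trick:
    while accumulating (u_1)^2 + ... + (u_l)^2 coordinate by coordinate,
    abandon a candidate as soon as the partial sum reaches the current
    minimal record distance delta_rec^2."""
    best_i, best_d2 = -1, np.inf
    for i, x in enumerate(grid):
        partial = 0.0
        for ul in (query - x):              # accumulate squared coordinates
            partial += ul * ul
            if partial >= best_d2:          # partial sum beats the record:
                break                       # |query - x| >= delta_rec, abandon
        else:
            best_i, best_d2 = i, partial    # new minimal record distance
    return best_i, float(np.sqrt(best_d2))

rng = np.random.default_rng(0)
grid = rng.standard_normal((200, 8))
q = rng.standard_normal(8)
i_pds, d_pds = nearest_neighbor_pds(q, grid)
assert i_pds == ((grid - q) ** 2).sum(axis=1).argmin()  # agrees with brute force
```

The early `break` implements the cascade above: most candidates are discarded after scanning only a few coordinates, which is what makes the trick effective even for large d.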
– The K-d tree (Friedman, Bentley, Finkel, 1977, see [99]): the principle is to store the N points of R^d in a tree of depth O(log(N)) based on their coordinates in the canonical basis of R^d.
– Further improvements are due to McNames (see [207]): the idea is to perform a pre-processing of the dataset of N points using a Principal Component Analysis (PCA) and then implement the K-d tree method in the new orthogonal basis induced by the PCA.
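For concreteness, here is a compact, non-optimized version of ours of the K-d tree construction and query (median splits on cyclically chosen coordinate axes); production code would rather rely on a dedicated library.

```python
import numpy as np

def build_kdtree(points, idx=None, depth=0):
    """Build a k-d tree: split at the median along coordinate axes chosen
    cyclically, yielding a tree of depth O(log N)."""
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) == 0:
        return None
    axis = depth % points.shape[1]
    order = idx[np.argsort(points[idx, axis])]
    mid = len(order) // 2
    return {"index": int(order[mid]), "axis": axis,
            "left": build_kdtree(points, order[:mid], depth + 1),
            "right": build_kdtree(points, order[mid + 1:], depth + 1)}

def query_kdtree(node, points, q, best=None):
    """Descend toward q's side of each splitting hyperplane, then backtrack
    only when the hyperplane is closer than the current best distance."""
    if node is None:
        return best
    i, axis = node["index"], node["axis"]
    d = float(np.linalg.norm(points[i] - q))
    if best is None or d < best[1]:
        best = (i, d)
    diff = q[axis] - points[i, axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = query_kdtree(near, points, q, best)
    if abs(diff) < best[1]:            # the far side may hide a closer point
        best = query_kdtree(far, points, q, best)
    return best

rng = np.random.default_rng(0)
pts = rng.standard_normal((500, 3))
tree = build_kdtree(pts)
q = rng.standard_normal(3)
i, d = query_kdtree(tree, pts, q)
assert i == ((pts - q) ** 2).sum(axis=1).argmin()
```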
Numerical optimization of quantizers for the normal distributions N (0; Id ) on
Rd , d ≥ 1
The procedures that minimize the quantization error are usually stochastic (except in one dimension). The most famous ones are undoubtedly the so-called Competitive Learning Vector Quantization algorithm (see [231] or [229]) and the Lloyd I procedure (see [106, 226, 231]), which have just been described and briefly analyzed above.
More algorithmic details are also available on the website www.quantize.maths-fi.com.
For normal distributions a large scale optimization has been carried out based on a mixed CLVQ-Lloyd procedure. To be precise, grids have been computed for d = 1
Fig. 6.3 An optimal quantization of the bi-variate normal distribution with size N = 500 (with
J. Printems)
where h(y) = E\, H(y, Z_1) is the mean function of the algorithm. We define the filtration F_n = σ(Y_0, Z_1, \ldots, Z_n), n ≥ 0. The sequence (Y_n)_{n≥0} is (F_n)-adapted and
\[
M_n = H(Y_{n-1}, Z_n) - E\big( H(Y_{n-1}, Z_n) \,|\, F_{n-1} \big) = H(Y_{n-1}, Z_n) - h(Y_{n-1}), \quad n \ge 1,
\]
is a sequence of (F_n)-martingale increments.
In the preceding we established that, under the assumptions of the Robbins–Siegmund Lemma (based on the existence of a "Lyapunov function", see Theorem 6.1 (ii) and (v)), the sequence (Y_n)_{n≥0} is a.s. bounded and that the martingale
\[
M_n^{\gamma} = \sum_{k=1}^{n} \gamma_k M_k
\]
is a.s. convergent in \mathbb{R}^d.
At this stage, to derive the a.s. convergence of the algorithm itself in various
settings (Robbins–Monro, stochastic gradient, one-dimensional stochastic gradi-
ent), we used direct pathwise arguments based on elementary topology. The main
improvement provided by the ODE method is to provide more powerful tools from
functional analysis derived from further investigations on the asymptotics of the
“tail”-sequences (Yk (ω))k≥n , n ≥ 0, assuming a priori that the whole sequence
(Yn (ω))n≥0 is bounded. The idea is to represent these tail sequences as stepwise
constant càdlàg functions of the cumulative function of the steps. We will also need an additional assumption on the paths of the martingale (M_n^{\gamma})_{n\ge 1}, however significantly less stringent than the above a.s. convergence property. To keep on working in
this direction, it is more convenient to temporarily abandon our stochastic framework
and focus on a discrete time deterministic dynamics.
ODE method I
Let us be more specific: first we consider a recursively defined sequence
\[
y_{n+1} = y_n - \gamma_{n+1}\big( h(y_n) + \pi_{n+1} \big), \quad y_0 \in \mathbb{R}^d, \tag{6.39}
\]
where (\gamma_n)_{n\ge 1} denotes the step sequence and (\pi_n)_{n\ge 1} a perturbation sequence. Set
\[
\Gamma_0 = 0, \qquad \Gamma_n = \sum_{k=1}^n \gamma_k,
\]
let N(t) := \max\{ n \ge 0 : \Gamma_n \le t \}, let y^{(0)} denote the stepwise constant function defined by y^{(0)}_t := y_{N(t)}, and define its shifted versions
\[
y^{(n)}_t = y^{(0)}_{\Gamma_n + t}, \quad t \in \mathbb{R}_+,
\]
so that N(t) = n if and only if t \in [\Gamma_n, \Gamma_{n+1}) (in particular, N(\Gamma_n) = n).
Expanding the recursive Equation (6.39), we get
\[
y_n = y_0 - \sum_{k=1}^n \gamma_k\, h(y_{k-1}) - \sum_{k=1}^n \gamma_k\, \pi_k,
\]
which, in terms of the stepwise constant function y^{(0)}, reads
\[
y^{(0)}_t = y^{(0)}_0 - \int_0^{\Gamma_{N(t)}} h\big( y^{(0)}_s \big)\, ds - \sum_{k=1}^{N(t)} \gamma_k \pi_k
= y^{(0)}_0 - \int_0^{t} h\big( y^{(0)}_s \big)\, ds + h\big( y_{N(t)} \big)\big( t - \Gamma_{N(t)} \big) - \sum_{k=1}^{N(t)} \gamma_k \pi_k. \tag{6.40}
\]
Then, using the very definition of the shifted function y^{(n)} and taking advantage of the fact that N(\Gamma_n) = n, we derive, by subtracting (6.40) at times \Gamma_n + t and \Gamma_n successively, that for every t \in \mathbb{R}_+,
\[
y^{(n)}_t = y^{(n)}_0 - \int_0^{t} h\big( y^{(n)}_s \big)\, ds + R^{(n)}_t \tag{6.41}
\]
with
\[
R^{(n)}_t = h\big( y_{N(\Gamma_n + t)} \big)\big( t + \Gamma_n - \Gamma_{N(\Gamma_n + t)} \big) - \sum_{k=n+1}^{N(\Gamma_n + t)} \gamma_k \pi_k,
\]
keeping in mind that y^{(n)}_0 = y^{(0)}_{\Gamma_n} (= y_n). The term R^{(n)}_t is intended to behave as a remainder term as n goes to infinity. The next proposition establishes a first connection between the asymptotic behavior of the sequence of vectors (y_n)_{n≥0} and that of the sequence of functions (y^{(n)})_{n≥0}.
Then:
(a) The set Y^∞ := set of limiting points of (y_n)_{n≥0} is a compact connected set.
(b) The sequence (y^{(n)})_{n≥0} is sequentially relatively compact (5) with respect to the topology of uniform convergence on compact sets on the space B(R_+, R^d) of bounded functions from R_+ to R^d (6), and all its limiting points lie in C_b(R_+, Y^∞).
5 In a metric space (E, d), a sequence (x_n)_{n≥0} is sequentially relatively compact if from any subsequence one can extract a convergent subsequence with respect to the distance d.
6 This topology is defined by the metric d(f, g) = \sum_{k\ge 1} 2^{-k} \min\big( \sup_{t\in[0,k]} |f(t) - g(t)|, 1 \big).
6.4 Further Results on Stochastic Approximation 229
As a consequence, the set Y^∞ is compact and bien enchaîné (7), hence connected.
(b) The sequence \big( \int_0^{\cdot} h\big( y^{(n)}_s \big)\, ds \big)_{n\ge 0} is uniformly Lipschitz continuous with Lipschitz coefficient M since, for every s, t \in \mathbb{R}_+, s \le t,
\[
\Big| \int_0^t h\big( y^{(n)}_u \big)\, du - \int_0^s h\big( y^{(n)}_u \big)\, du \Big| \le \int_s^t \big| h\big( y^{(n)}_u \big) \big|\, du \le M (t - s),
\]
and (y^{(n)}_0)_{n\ge 0} = (y_n)_{n\ge 0} is bounded; hence it follows from the Arzelà–Ascoli Theorem that the sequence of continuous functions \big( y^{(n)}_0 - \int_0^{\cdot} h\big( y^{(n)}_s \big)\, ds \big)_{n\ge 0} is relatively compact in C(\mathbb{R}_+, \mathbb{R}^d) endowed with the topology, denoted by \mathcal{U}_K, of the uniform convergence on compact intervals. On the other hand, for every T \in (0, +\infty),
\[
\sup_{t\in[0,T]} \Big| y^{(n)}_t - y^{(n)}_0 + \int_0^t h\big( y^{(n)}_s \big)\, ds \Big| = \sup_{t\in[0,T]} \big| R^{(n)}_t \big|
\le M \sup_{k\ge n+1} \gamma_k + \sup_{t\in[0,T]} \Big| \sum_{k=n+1}^{N(\Gamma_n + t)} \gamma_k \pi_k \Big| \to 0 \ \text{ as } n \to +\infty,
\]
which implies that the sequence (y^{(n)})_{n\ge 0} is \mathcal{U}_K-relatively compact. Moreover, both sequences \big( y^{(n)}_0 - \int_0^{\cdot} h\big( y^{(n)}_s \big)\, ds \big)_{n\ge 0} and (y^{(n)})_{n\ge 0} have the same \mathcal{U}_K-limiting values. In particular, these limiting functions are continuous with values in Y^∞. ♦
ODEh∗ ≡ ẏ = h(y).
Theorem 6.5 (ODE II) Assume H_i, i = 1, 2, 3, hold and that the mean function h is continuous. Let y_0 ∈ R^d and let Y^∞ be the set of limiting values of the sequence (y_n)_{n≥0} recursively defined by (6.39).
(a) Any limiting function of the sequence (y^{(n)})_{n≥0} is a Y^∞-valued solution of ODE_h ≡ ẏ = −h(y).
7 A set A in a metric space (E, d) is "bien enchaîné" if for every a, a′ ∈ A and every ε > 0, there exist an integer n ≥ 1 and a_0, \ldots, a_n such that a_0 = a, a_n = a′ and d(a_i, a_{i+1}) ≤ ε for every i = 0, \ldots, n − 1. Any connected set C in E is bien enchaîné. The converse is true if C is compact, see e.g. [13] for more details.
(b) Assume that ODE_h has a flow \Phi_t(\xi) (8). Assume the existence of a flow for ODE_h^*. Then, the set Y^∞ is a compact, connected set, flow-invariant for both ODE_h and ODE_h^*.
Proof: (a) Given the above Proposition 6.7, the conclusion follows if we prove that any limiting function y^{(\infty)} = \mathcal{U}_K\text{-}\lim_n y^{(\varphi(n))} (\varphi(n) \to +\infty) is a solution to ODE_h. For every t \in \mathbb{R}_+, y^{(\varphi(n))}_t \to y^{(\infty)}_t, hence h\big( y^{(\varphi(n))}_t \big) \to h\big( y^{(\infty)}_t \big) since the function h is continuous. Then, by the Lebesgue dominated convergence theorem, one derives that for every t \in \mathbb{R}_+,
\[
\int_0^t h\big( y^{(\varphi(n))}_s \big)\, ds \longrightarrow \int_0^t h\big( y^{(\infty)}_s \big)\, ds.
\]
One also has y^{(\varphi(n))}_0 \to y^{(\infty)}_0, so that finally, letting \varphi(n) \to +\infty in (6.41), we obtain
\[
y^{(\infty)}_t = y^{(\infty)}_0 - \int_0^t h\big( y^{(\infty)}_s \big)\, ds.
\]
\[
y^{(N(\Gamma_{\varphi(n)} - p))} \overset{\mathcal{U}_K}{\longrightarrow} y^{(\infty),p} \quad \text{as } n \to +\infty.
\]
Since y^{(N(\Gamma_{\varphi(n)} - p - 1))}_{t+1} = y^{(N(\Gamma_{\varphi(n)} - p))}_t for every t \in \mathbb{R}_+ and every n \ge n_{p+1}, one has
\[
\forall\, p \in \mathbb{N},\ \forall\, t \in \mathbb{R}_+, \quad y^{(\infty),p+1}_{t+1} = y^{(\infty),p}_t.
\]
Furthermore, it follows from (a) that the functions y^{(\infty),p} are Y^∞-valued solutions to ODE_h. One defines the function y^{(\infty)} by
\[
y^{(\infty)}_t = y^{(\infty),p}_{p-t}, \quad t \in [p-1, p],
\]
8 For every ξ ∈ Y^∞, ODE_h admits a unique solution (\Phi_t(\xi))_{t\in\mathbb{R}_+} starting at \Phi_0(\xi) = \xi.
which satisfies a posteriori, for every p \in \mathbb{N}, y^{(\infty)}_t = y^{(\infty),p}_{p-t}, t \in [0, p]. This implies that y^{(\infty)} is a Y^∞-valued solution to ODE_h^* starting from y_\infty on \cup_{p\ge 0}[0, p] = \mathbb{R}_+. Uniqueness implies that y^{(\infty)}_t = \Phi^*_t(y_\infty), which completes the proof. ♦
Remark. If uniqueness fails either for ODE_h or for ODE_h^*, one still has that Y^∞ is left invariant by ODE_h and ODE_h^* in the weaker sense that, for every y_∞ ∈ Y^∞, there exist Y^∞-valued solutions of ODE_h and ODE_h^* starting from y_∞.
This property is the first step toward the deep connection between the asymptotic behavior of a recursive stochastic algorithm and its associated mean field ODE_h. Item (b) can be seen as a first criterion to screen possible candidates for the set of limiting values of the algorithm. Thus, any zero y^* of h, or equivalently any equilibrium point of ODE_h, satisfies the requested invariance condition since \Phi_t(y^*) = y^* for every t ∈ R_+. No other single point can satisfy this invariance property. More generally, we have the following result.
Corollary 6.2 If there are finitely many compact connected sets Xi , i ∈ I (I finite),
two-sided invariant for ODEh and ODEh∗ , then the sequence (yn )n≥0 converges toward
one of these sets, i.e. there is an integer i0 ∈ I such that dist(yn , Xi0 ) → 0 as n →
+∞.
As an elementary example let us consider the ODE
\[
\dot y = (1 - |y|)\, y + \varsigma\, y^{\perp}, \quad y_0 \in \mathbb{R}^2,
\]
where y = (y_1, y_2), y^{\perp} = (-y_2, y_1) and \varsigma is a positive real constant. Then the unit circle C(0; 1) is clearly a connected compact set invariant under ODE and ODE^*. The singleton \{0\} also satisfies this invariance property. In fact, C(0; 1) is an attractor of the ODE and \{0\} is a repeller, in the sense that the flow of this ODE converges uniformly toward C(0; 1) on every compact set K ⊂ R^2 \ {0}.
We know that any recursive procedure with mean function h of the form h(y_1, y_2) = (1 − |y|)y + ς y^⊥ satisfying H_i, i = 1, 2, 3, will converge either toward C(0; 1) or 0. But at this stage, we cannot eliminate the repulsive equilibrium {0} (this is what happens if y_0 = 0 and the perturbation term sequence π_n ≡ 0).
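This dichotomy is easy to visualize numerically. Below, a toy simulation of ours (step and noise sizes are arbitrary) of a noisy Euler recursion whose mean dynamics is the example ODE above, started close to the repeller {0}: the non-degenerate noise pushes the path away from 0 and it stabilizes near the unit circle.

```python
import numpy as np

# Noisy Euler recursion whose mean field is the example ODE
# ydot = (1 - |y|) y + varsigma * y_perp: C(0; 1) attracts, {0} repels.
rng = np.random.default_rng(0)
varsigma, n_iter = 0.5, 200_000
noise = 0.05 * rng.standard_normal((n_iter, 2))   # non-degenerate perturbation
y1, y2 = 0.01, 0.0                                # start close to the repeller {0}
for n in range(n_iter):
    gamma = 1.0 / (11.0 + n)                      # decreasing step sequence
    r = (y1 * y1 + y2 * y2) ** 0.5
    d1 = (1.0 - r) * y1 - varsigma * y2           # drift, first coordinate
    d2 = (1.0 - r) * y2 + varsigma * y1           # drift, second coordinate
    y1 += gamma * (d1 + noise[n, 0])
    y2 += gamma * (d2 + noise[n, 1])
radius = (y1 * y1 + y2 * y2) ** 0.5               # ends up close to 1
```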
Sharper characterizations of the possible set of limiting points of the sequence (y_n)_{n≥0} have been established in close connection with the theory of perturbed dynamical systems. To be slightly more precise, it has been shown that the set Y^∞ of limiting points of the sequence (y_n)_{n≥0} is internally chain recurrent or, equivalently, contains no strict attractor of the dynamics of the ODE, i.e. no subset A ⊂ Y^∞, A ≠ Y^∞, such that \Phi_t(\xi) converges to A uniformly in ξ ∈ Y^∞. Such results are beyond the scope of this monograph and we refer to [33] (see also [94] when uniqueness fails and the ODE has no flow) for an introduction to internal chain recurrence.
Unfortunately, such refined results are still not able to discriminate between these
two candidates as a limiting set, though, as soon as πn behaves like a non-degenerate
noise when yn is close to 0, it seems more likely that the algorithm converges toward
the unit circle, like the flow of the ODEh does (except when starting from 0). At this
point probability comes back into the game: this intuition can be confirmed under an additional non-degeneracy assumption on the noise at 0 for the algorithm (the notion of "noisy trap"). Thus, if the procedure (6.39) is a generic path of a Markovian algorithm of the form (6.3) satisfying a suitable non-degeneracy condition at 0, this generic path cannot converge toward 0 and will consequently converge to C(0; 1).
Practical aspects of assumption H_3.
To make the connection with the original form of stochastic algorithms, we come back to hypothesis H_3 in the following proposition. In particular, it emphasizes that this condition is less stringent than a standard convergence assumption on the series.
\[
\sup_{t\in[0,T]} \Big| \sum_{k=n+1}^{N(\Gamma_n + t)} \gamma_k \pi_k \Big| \le \sup_{\ell \ge n+1} \Big| \sum_{k=n+1}^{\ell} \gamma_k \pi_k \Big| \to 0 \quad \text{a.s. as } n \to +\infty.
\]
(c) Mixed perturbation: In practice, one often meets the combination of these situations:
\[
\pi_n = M_n + r_n,
\]
where r_n is a remainder term which goes to 0 a.s. and M_n^{\gamma} is an a.s. convergent martingale. The a.s. convergence of the martingale (M_n^{\gamma})_{n\ge 0} follows from the fact that \sup_{n\ge 1} E\big( (M_n)^2 \,|\, \mathcal{G}_{n-1} \big) < +\infty a.s., where \mathcal{G}_n = \sigma(M_k, k = 1, \ldots, n), n \ge 1, and \mathcal{G}_0 = \{\emptyset, \Omega\}. The a.s. convergence of (M_n^{\gamma})_{n\ge 0} can even be relaxed for the martingale term. Thus we have the following classical results where H_3 is satisfied while the martingale M_n^{\gamma} may diverge.
Proposition 6.9 (a) Métivier–Priouret criterion (see [213]). Let (M_n)_{n\ge 1} be a sequence of martingale increments and let (\gamma_n)_{n\ge 1} be a sequence of non-negative steps satisfying \sum_n \gamma_n = +\infty. Then, H_3 is a.s. satisfied with \pi_n = M_n as soon as there exists a pair of Hölder conjugate exponents (p, q) \in (1, +\infty)^2 (i.e. \frac{1}{p} + \frac{1}{q} = 1) such that
\[
\sup_n E\, |M_n|^p < +\infty \qquad \text{and} \qquad \sum_n \gamma_n^{1 + \frac{q}{2}} < +\infty.
\]
This allows for the use of steps of the form \gamma_n \sim c_1 n^{-a}, a > \frac{2}{2+q} = \frac{2(p-1)}{3(p-1)+1}.
(b) Exponential criterion (see e.g. [33]). Assume that there exists a real number c > 0 such that
\[
\forall\, \lambda \in \mathbb{R}, \quad E\, e^{\lambda M_n} \le e^{c \frac{\lambda^2}{2}}.
\]
Then, for every sequence (\gamma_n)_{n\ge 1} such that \sum_{n\ge 1} e^{-\frac{c}{\gamma_n}} < +\infty, Assumption H_3 is satisfied with \pi_n = M_n. This allows for the use of steps of the form \gamma_n \sim c_1 n^{-a}, a > 0, and \gamma_n = c_1 (\log n)^{-(1+a)}, a > 0.
Examples. Typical examples where the sub-Gaussian assumption is satisfied are the following:
• |M_n| \le K \in \mathbb{R}_+ since, owing to Hoeffding's Inequality, E\, e^{\lambda M_n} \le e^{\frac{\lambda^2 (2K)^2}{8}} (see e.g. [193], see also the exercise that follows the proof of Theorem 7.4).
• M_n \overset{d}{=} \mathcal{N}(0; \sigma_n^2) with \sigma_n \le K, so that E\, e^{\lambda M_n} \le e^{\frac{\lambda^2 K^2}{2}}.
The first case is very important since in many situations the perturbation term is a martingale term and is structurally bounded.
Application to an extended Lyapunov approach and pseudo-gradient
By relying on claim (a) in Proposition 6.7, one can also derive directly a.s. conver-
gence results for an algorithm.
Proposition 6.10 (G-Lemma, see [94]) Assume H_i, i = 1, 2, 3. Let G : \mathbb{R}^d \to \mathbb{R}_+ be a function satisfying
\[
(G) \equiv \Big( \lim_n y_n = y_\infty \ \text{ and } \ \lim_n G(y_n) = 0 \Big) \Longrightarrow G(y_\infty) = 0.
\]
Then, there exists a connected component X^* of the set \{G = 0\} such that \mathrm{dist}(y_n, X^*) \to 0 as n \to +\infty.
Remark. Any non-negative lower semi-continuous (l.s.c.) function G : Rd → R+
satisfies (G) (9 ).
Proof. First, it follows from Proposition 6.7 that the sequence (y^{(n)})_{n\ge 0} is \mathcal{U}_K-relatively compact with limiting functions lying in C(\mathbb{R}_+, Y^∞), where Y^∞ still denotes the compact connected set of limiting values of (y_n)_{n\ge 0}.
Set, for every y \in \mathbb{R}^d,
\[
\underline{G}(y) = \liminf_{x \to y} G(x) = \inf\Big\{ \liminf_n G(x_n),\ x_n \to y \Big\},
\]
so that 0 \le \underline{G} \le G. The function \underline{G} is the l.s.c. envelope of the function G, i.e. the highest l.s.c. function not greater than G. In particular, under Assumption (G),
\[
\{\underline{G} = 0\} = \{G = 0\} \ \text{ is closed.}
\]
Consequently, \int_0^{+\infty} \underline{G}\big( y^{(\infty)}_s \big)\, ds = 0, which implies that \underline{G}\big( y^{(\infty)}_s \big) = 0 \ ds-a.s. Now, as the function s \mapsto y^{(\infty)}_s is continuous, it follows that \underline{G}\big( y^{(\infty)}_0 \big) = 0 since \underline{G} is l.s.c. This in turn implies G\big( y^{(\infty)}_0 \big) = 0, i.e. G(y_\infty) = 0. As a consequence, Y^∞ \subset \{G = 0\}, which yields the result since on the other hand Y^∞ is a connected set. ♦
Proposition 6.11 Let (Y_n)_{n\ge 1} be a stochastic algorithm defined by (6.3) where the function H satisfies the quadratic linear growth assumption (6.6), namely
\[
\forall\, y \in \mathbb{R}^d, \quad \big\| H(y, Z) \big\|_2 \le C\,(1 + |y|).
\]
Assume Y_0 \in L^2(P) is independent of the i.i.d. sequence (Z_n)_{n\ge 1}. Assume there exists a y^* \in \mathbb{R}^d and an \alpha > 0 such that the strong mean-reverting assumption
\[
\forall\, y \in \mathbb{R}^d, \quad \big( h(y) \,|\, y - y^* \big) \ge \alpha\, |y - y^*|^2 \tag{6.43}
\]
holds. Finally, assume that the step sequence (\gamma_n)_{n\ge 1} satisfies the usual decreasing step assumption (6.7) and the additional assumption
\[
(G_\alpha) \equiv \lim_n a_n = -\kappa^* < 0 \quad \text{with} \quad a_n := \frac{1}{\gamma_{n+1}} \Big( \big( 1 - 2\,\alpha\,\gamma_{n+1} \big)\, \frac{\gamma_n}{\gamma_{n+1}} - 1 \Big). \tag{6.44}
\]
Then
\[
Y_n \overset{a.s.}{\longrightarrow} y^* \qquad \text{and} \qquad \big\| Y_n - y^* \big\|_2 = O\big( \sqrt{\gamma_n}\, \big).
\]
ℵ Practitioner's corner.
• If \gamma_n = \frac{\gamma_1}{n^{\vartheta}}, \frac{1}{2} < \vartheta < 1, then (G_\alpha) is satisfied for any \alpha > 0.
• If \gamma_n = \frac{\gamma_1}{n}, Condition (G_\alpha) reads \frac{1 - 2\,\alpha\,\gamma_1}{\gamma_1} = -\kappa^* < 0 or, equivalently, \gamma_1 > \frac{1}{2\alpha}.
Proof of Proposition 6.11. The fact that Y_n → y^* a.s. is a straightforward consequence of Corollary 6.1 (Robbins–Monro framework). We also know that (|Y_n − y^*|)_{n≥0} is L²-bounded, so that, in particular, (Y_n)_{n≥0} is L²-bounded. As concerns the quadratic rate of convergence, we re-start from the classical proof of Robbins–Siegmund's Lemma. Let F_n = σ(Y_0, Z_1, \ldots, Z_n), n ≥ 0. Then
\[
|Y_{n+1} - y^*|^2 = |Y_n - y^*|^2 - 2\gamma_{n+1} \big( H(Y_n, Z_{n+1}) \,|\, Y_n - y^* \big) + \gamma_{n+1}^2 \big| H(Y_n, Z_{n+1}) \big|^2.
\]
This implies
\[
E\,|Y_{n+1} - y^*|^2 = E\,|Y_n - y^*|^2 - 2\gamma_{n+1}\, E\big( h(Y_n) \,|\, Y_n - y^* \big) + \gamma_{n+1}^2\, E\,\big| H(Y_n, Z_{n+1}) \big|^2
\]
\[
\le E\,|Y_n - y^*|^2 - 2\gamma_{n+1}\, E\big( h(Y_n) \,|\, Y_n - y^* \big) + 2\,\gamma_{n+1}^2\, C^2 \big( 1 + E\,|Y_n|^2 \big)
\]
\[
\le E\,|Y_n - y^*|^2 - 2\,\alpha\,\gamma_{n+1}\, E\,|Y_n - y^*|^2 + \gamma_{n+1}^2\, C' \big( 1 + E\,|Y_n - y^*|^2 \big),
\]
owing successively to the linear quadratic growth and the strong mean-reverting assumptions. Finally,
\[
E\,|Y_{n+1} - y^*|^2 \le E\,|Y_n - y^*|^2 \big( 1 - 2\,\alpha\,\gamma_{n+1} + C\,\gamma_{n+1}^2 \big) + C\,\gamma_{n+1}^2
\]
for some real constant C > 0.
Set u_n := E\,|Y_n - y^*|^2 / \gamma_n, n \ge 1. Let n_0 be an integer such that, for every n \ge n_0, a_n \le -\frac{3}{4}\kappa^* and C\,\gamma_n \le \frac{\kappa^*}{4}. For these integers n, 1 - \frac{\kappa^*}{2}\gamma_{n+1} > 0 and
\[
u_{n+1} \le u_n \Big( 1 - \frac{\kappa^*}{2}\, \gamma_{n+1} \Big) + C\, \gamma_{n+1}.
\]
Then, one derives by induction that
\[
\forall\, n \ge n_0, \quad 0 \le u_n \le \max\Big( u_{n_0}, \frac{2C}{\kappa^*} \Big),
\]
\[
\big( \nabla L(x) - \nabla L(y) \,|\, x - y \big) \ge \alpha\, |x - y|^2.
\]
Proof. It is clear that such a stochastic algorithm satisfies the assumptions of the above Proposition 6.11, especially the strong mean-reverting assumption (6.43), owing to the preliminaries on L that precede the proposition. By the fundamental theorem of Calculus, for every n \ge 1, there exists a \zeta_n \in (y^*, Y_n) (geometric interval in \mathbb{R}^d) such that
\[
L(Y_n) - L(y^*) = \big( \nabla L(y^*) \,|\, Y_n - y^* \big) + \frac{1}{2}\, (Y_n - y^*)^* D^2 L(\zeta_n) (Y_n - y^*)
\le \frac{1}{2}\, [\nabla L]_1\, \big| Y_n - y^* \big|^2,
\]
where we used in the second line that \nabla L(y^*) = 0 and the above inequality. One concludes by taking the expectation in the above inequality and applying Proposition 6.11. ♦
provided c is not small (see the Practitioner's corner that follows the proof of the proposition). Such a result corresponds to the Law of Large Numbers in quadratic mean and one may reasonably guess that a Central Limit Theorem can be established. In fact, the mean-reverting (or coercivity) assumption can be localized at the target y^*, leading to a CLT at a \sqrt{\gamma_n}-rate, namely that \frac{Y_n - y^*}{\sqrt{\gamma_n}} converges in distribution to some normal distribution involving the dispersion matrix \Sigma_H(y^*) = E\big[ H(y^*, Z) H(y^*, Z)^t \big].
The CLT for Stochastic Approximation algorithms has given rise to an extensive
literature starting from the pioneering work by Kushner [179] in the late 1970’s (see
also [49] for a result with Markovian innovations). We give here a result established
by Pelletier in [241] whose main originality is its “locality” in the following sense:
a CLT is shown to hold for the stable weak convergence, locally on the convergence
set(s) of the algorithm to its equilibrium points. In particular, it solves the case of
multi-target algorithms, which is often the case, e.g. for stochastic gradient descents
associated to non-convex potentials. It could also be of significant help to elucidate
the rate of convergence of algorithms with constraints or repeated projections like
those introduced by Chen. Such is the case for Arouna’s original adaptive variance
reduction procedure for which a CLT has been established in [197] by a direct
approach (see also [196] for a rigorous proof of the convergence).
6.4 Further Results on Stochastic Approximation 239
Theorem 6.6 (Pelletier [241]) We consider the stochastic procedure (Y_n)_{n\ge 0} defined by (6.3). Let y^* \in \{h = 0\} be an equilibrium point. We make the following assumptions.
(i) y^* is an attractor for ODE_h: y^* is a "locally uniform attractor" for ODE_h \equiv \dot y = -h(y) in the following sense: h is differentiable at y^* and all the eigenvalues of Dh(y^*) have positive real parts.
(ii) Regularity and growth control of H: the function H satisfies the following regularity and growth control properties
and
\[
y \mapsto E\, |H(y, Z)|^{2+\beta} \ \text{ is locally bounded on } \mathbb{R}^d.
\]
(iii) Non-degenerate asymptotic variance:
\[
\Sigma_H(y^*) := E\big[ H(y^*, Z) H(y^*, Z)^t \big] \ \text{ is positive definite in } \mathcal{S}(d, \mathbb{R}). \tag{6.45}
\]
(iv) Specification of the step sequence: assume that the step \gamma_n is of the form
\[
\gamma_n = \frac{c}{n^{\vartheta} + b}, \qquad \frac{1}{2} < \vartheta \le 1, \ b \ge 0, \ c > 0, \tag{6.46}
\]
with
\[
c > \frac{1}{2\, \Re e(\lambda_{\min})} \quad \text{if } \vartheta = 1, \tag{6.47}
\]
where \lambda_{\min} denotes the eigenvalue of Dh(y^*) with the lowest real part.
Then, the a.s. convergence is ruled on the event A^* = \{Y_n \to y^*\} by the following stable Central Limit Theorem:
\[
\sqrt{n^{\vartheta}}\, \big( Y_n - y^* \big) \overset{\mathcal{L}^{stably}}{\longrightarrow} \mathcal{N}(0;\, c\,\Sigma) \ \text{ on } A^*, \tag{6.48}
\]
where
\[
\Sigma := \int_0^{+\infty} e^{-s \big( Dh(y^*)^t - \frac{I_d}{2c^*} \big)}\, \Sigma_H(y^*)\, e^{-s \big( Dh(y^*) - \frac{I_d}{2c^*} \big)}\, ds
\]
and c^* = c if \vartheta = 1 and c^* = +\infty otherwise.
Note that the optimal rate is obtained when ϑ = 1, provided c satisfies (6.47).
and every A \in \mathcal{A},
\[
E\Big[ \mathbf{1}_{A^* \cap A}\, f\big( \sqrt{n}\,(Y_n - y^*) \big) \Big] \longrightarrow E\Big[ \mathbf{1}_{A^* \cap A}\, f\big( \sqrt{c}\, Z \big) \Big] \quad \text{as } n \to +\infty.
\]
In fact, when the algorithm a.s. converges toward a unique target y^*, it has been shown in [203] (see Sect. 11.4) that one may assume that Z and \mathcal{A} are independent so that, in particular, for every A \in \mathcal{A},
\[
E\Big[ f\big( \sqrt{n}\,(Y_n - y^*) \big) \,\big|\, A \Big] \longrightarrow E\Big[ f\big( \sqrt{c}\, Z \big) \Big] \quad \text{as } n \to +\infty.
\]
The proof is detailed for scalar algorithms but its extension to higher-dimensional procedures is standard, though slightly more technical. In the possibly multi-target setting, like in the above statement, it is most likely that such an improvement still holds true (though no proof is known to us). As A^* \in \mathcal{A} is independent of Z, this would read, with the notation of the original theorem: if P(A^*) > 0, for every A in \mathcal{A} such that P(A \cap A^*) > 0,
\[
E\Big[ f\big( \sqrt{n}\,(Y_n - y^*) \big) \,\big|\, A \cap A^* \Big] \longrightarrow E\Big[ f\big( \sqrt{c}\, Z \big) \Big] \quad \text{as } n \to +\infty.
\]
ℵ Practitioner's corner
Optimal choice of the step sequence. As mentioned in the theorem itself, the optimal weak rate is obtained for \vartheta = 1, provided the step sequence is of the form \gamma_n = \frac{c}{n+b}, with c > \frac{1}{2\,\Re e(\lambda_{\min})}. The explicit expression for \Sigma – which depends on c as well – suggests that there exists an optimal choice of this parameter c minimizing the asymptotic variance of the algorithm.
We will deal with the one-dimensional case to get rid of some technicalities. If d = 1, then \lambda_{\min} = h'(y^*), \Sigma_H(y^*) = \mathrm{Var}\big( H(y^*, Z) \big) and a straightforward computation shows that
\[
c\,\Sigma = \mathrm{Var}\big( H(y^*, Z) \big)\, \frac{c^2}{2\,c\, h'(y^*) - 1}.
\]
This simple function of c attains its minimum on \big( \frac{1}{2 h'(y^*)}, +\infty \big) at
\[
c_{opt} = \frac{1}{h'(y^*)}.
\]
One shows that this is the lowest possible variance in such a procedure. Consequently, the best choice for the step sequence (\gamma_n)_{n\ge 1} is
\[
\gamma_n := \frac{1}{h'(y^*)\, n} \qquad \text{or, more generally,} \qquad \gamma_n := \frac{1}{h'(y^*)(n + b)},
\]
where b can be tuned to "control" the step at the beginning of the simulation when n is small.
At this stage, one encounters the same difficulties as with deterministic procedures: y^* being unknown, h'(y^*) is even "more" unknown. One can imagine estimating this quantity by a companion procedure of the algorithm, but this turns out not to be very efficient. A more efficient approach, although not completely satisfactory in practice, is to implement the algorithm in its averaged version (see Sect. 6.4.4 below).
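The effect of the choice of c, and the interest of averaging, can be checked on a toy experiment of ours with h(y) = α(y − y*), α = 1, and unit-variance Gaussian innovations: the normalized error n·E(Y_n − y*)² should be close to c²/(2cα − 1) for steps c/n (minimal, equal to 1, at c = c_opt = 1/α), while the Ruppert–Polyak average run with a slower step also reaches Var(H(y*, Z))/h′(y*)² = 1 without requiring the knowledge of h′(y*).

```python
import numpy as np

# Asymptotic variance n * E(Y_n - y*)^2 on h(y) = alpha * (y - y*), Var Z = 1:
# theory gives c^2/(2*c*alpha - 1) for gamma_n = c/n (minimal = 1 at c = 1/alpha)
# and 1/alpha^2 = 1 for the averaged scheme with gamma_n = n^(-3/4).
rng = np.random.default_rng(0)
alpha, n_paths, n_iter = 1.0, 4000, 4000

def run(c=1.0, theta=1.0, average=False):
    Y = np.full(n_paths, 1.0)            # Y_0 = 1, target y* = 0
    Ybar = np.zeros(n_paths)
    for n in range(1, n_iter + 1):
        gamma = c / n ** theta
        Y -= gamma * (alpha * Y + rng.standard_normal(n_paths))
        Ybar += (Y - Ybar) / n           # running average of the iterates
    return Ybar if average else Y

var_opt = n_iter * np.mean(run(c=1.0) ** 2)              # c = c_opt = 1/alpha
var_sub = n_iter * np.mean(run(c=4.0) ** 2)              # c too large: ~ 16/7
var_avg = n_iter * np.mean(run(theta=0.75, average=True) ** 2)
# var_opt and var_avg are close to 1, var_sub is visibly larger
```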
Exploration vs convergence rate. However, one must keep in mind that this tuning
of the step sequence is intended to optimize the rate of convergence of the algorithm
during its final convergence phase. In real applications, this class of recursive pro-
cedures spends most of its time “exploring” the state space before getting trapped
in some attracting basin (which can be the basin of a local minimum in the case of
multiple critical points). The CLT rate occurs once the algorithm is trapped.
Simulated annealing. An alternative to these procedures is to implement a simu-
lated annealing procedure which “super-excites” the algorithm using an exogenous
simulated noise in order to improve the efficiency of the exploring phase (see [103,
104, 241]). Thus, when the mean function h is a gradient (h = ∇L), it finally con-
verges – but only in probability – to the true minimum of the potential/Lyapunov
function L. However, the final convergence rate is worse owing to the additional
exciting noise which slows down the procedure in its convergence phase. Practi-
tioners often use the above Robbins–Monro or stochastic gradient procedure with a
sequence of steps (γn )n≥1 which decreases to a positive limit γ.
Proving a CLT
We now prove this CLT in the 1D-framework when the algorithm a.s. converges
toward a unique “target” y∗ . Our method of proof is the so-called SDE method, which
heavily relies on functional weak convergence arguments. We will have to admit a few important results about weak convergence of processes, for which we will provide
precise references. An alternative proof is possible based on the CLT for triangular
arrays of martingale increments (see [142], see also Theorem 12.8 for a statement in
the Miscellany Chapter). Such an alternative proof – in a one-dimensional setting –
can be found in [203].
We propose below a proof of Theorem 6.6, dealing only with the regular weak
convergence in the case of an a.s. convergence toward a unique equilibrium point y∗.
The extension to a multi-target algorithm is not much more involved and we refer
to the original paper [241]. Before getting into the proof, we need to recall the dis-
crete time Burkholder–Davis–Gundy (B.D.G.) inequality (and the Marcinkiewicz–
Zygmund inequality) for discrete time martingales. We refer to [263] (p. 499) for a
proof and various developments.
The left inequality in (6.50) remains true for p = 1 if the increments (ΔMn)n≥1
are independent. Then, the inequality takes the name of the Marcinkiewicz–Zygmund
inequality.
Proof of Theorem 6.4. We first introduce the augmented filtration of the innovations defined by $\mathcal F_n = \sigma(Y_0, Z_1, \ldots, Z_n)$, $n \ge 0$. It is clear by induction that $(Y_n)_{n\ge0}$ is $(\mathcal F_n)_{n\ge0}$-adapted (in what follows we will sometimes use a random variable $Z$ with the same distribution as $Z_1$).
6.4 Further Results on Stochastic Approximation 243

We first rewrite the recursion satisfied by the algorithm in its canonical form
$$Y_n = Y_{n-1} - \gamma_n\big(h(Y_{n-1}) + \Delta M_n\big), \quad n \ge 1,$$
where
$$\Delta M_n = H(Y_{n-1}, Z_n) - h(Y_{n-1}), \quad n \ge 1, \qquad\text{and we set}\qquad \Upsilon_n := \frac{Y_n - y^*}{\sqrt{\gamma_n}}, \quad n \ge 1.$$
From now on we assume, without loss of generality, that $y^* = 0$.
Writing $h(y) = y\,(h'(0) + \eta(y))$ with $\eta(y) \to 0$ as $y \to 0$, the function $\eta$ is locally bounded on the real line owing to the growth assumption made on $E\,H(y, Z)^2$, which implies that $h$ is locally bounded too, owing to Jensen's inequality. For every $n \ge 1$, we have
$$\begin{aligned}
\Upsilon_{n+1} &= \sqrt{\tfrac{\gamma_n}{\gamma_{n+1}}}\,\Upsilon_n - \sqrt{\gamma_{n+1}}\,\big(h(Y_n) + \Delta M_{n+1}\big)\\
&= \sqrt{\tfrac{\gamma_n}{\gamma_{n+1}}}\,\Upsilon_n - \sqrt{\gamma_{n+1}}\sqrt{\gamma_n}\,\Upsilon_n\big(h'(0) + \eta(Y_n)\big) - \sqrt{\gamma_{n+1}}\,\Delta M_{n+1}\\
&= \Upsilon_n + \Big(\sqrt{\tfrac{\gamma_n}{\gamma_{n+1}}} - 1\Big)\Upsilon_n - \sqrt{\gamma_{n+1}}\sqrt{\gamma_n}\,\Upsilon_n\big(h'(0) + \eta(Y_n)\big) - \sqrt{\gamma_{n+1}}\,\Delta M_{n+1}\\
&= \Upsilon_n - \gamma_{n+1}\,\Upsilon_n\Big[\sqrt{\tfrac{\gamma_n}{\gamma_{n+1}}}\big(h'(0) + \eta(Y_n)\big) - \tfrac{1}{\gamma_{n+1}}\Big(\sqrt{\tfrac{\gamma_n}{\gamma_{n+1}}} - 1\Big)\Big] - \sqrt{\gamma_{n+1}}\,\Delta M_{n+1}.
\end{aligned}$$
Assume that the sequence $(\gamma_n)_{n\ge1}$ is such that there exists a $c \in (0, +\infty]$ satisfying
$$\lim_n a_n = \frac{1}{2c}, \qquad\text{where}\qquad a_n := \frac{1}{\gamma_{n+1}}\Big(\sqrt{\tfrac{\gamma_n}{\gamma_{n+1}}} - 1\Big),$$
so that the recursion takes the form (6.52) below, where $(\alpha_n^i)_{n\ge1}$, $i = 1, 2$, are two deterministic sequences such that $\alpha_n^1 \to 0$ and $\alpha_n^2 \to 1$ as $n \to +\infty$.
Step 2 (Localization(s)). Since $Y_n \to 0$ a.s., one can write the scenario space as follows:
$$\forall\, \varepsilon > 0, \quad \Omega = \bigcup_{N \ge 1} \Omega_{\varepsilon,N} \ \ a.s., \qquad \text{where } \Omega_{\varepsilon,N} := \Big\{ \sup_{n \ge N} |Y_n| \le \varepsilon \Big\}.$$
To alleviate notation, we will drop the exponent $\varepsilon,N$ in what follows and write $Y_n$ instead of $Y_n^{\varepsilon,N}$.
The mean function and $\mathcal F_n$-martingale increments associated to this new algorithm are
$$\widetilde h \qquad\text{and}\qquad \Delta \widetilde M_{n+1} = \mathbf 1_{\{|Y_n| \le \varepsilon\}}\, \Delta M_{n+1}, \quad n \ge N,$$
which satisfy
$$\sup_{n \ge N}\, E\big(|\Delta \widetilde M_{n+1}|^{2+\beta}\,\big|\, \mathcal F_n\big) \le 2^{1+\beta} \sup_{|\theta| \le \varepsilon} E\,|H(\theta, Z)|^{2+\beta} \le A(\varepsilon) < +\infty \quad a.s. \tag{6.51}$$
We also set
$$\Upsilon_n := \frac{Y_n}{\sqrt{\gamma_n}}, \quad n \ge N.$$
Step 3. – If $\gamma_n = \frac{c}{n+b}$ with $c > \frac{1}{2h'(0)}$ (and $b \ge 0$), we may choose $\rho = \rho(h) > 0$ small enough so that
$$c > \frac{1}{2h'(0)(1-\rho)}.$$
– If $\gamma_n = \frac{c}{(n+b)^\vartheta}$, $\frac12 < \vartheta < 1$, and, more generally, as soon as $\lim_n a_n = 0$, any choice of $\rho \in (0, 1)$ is possible.
Now let $\varepsilon(\rho) > 0$ be such that $|y| \le \varepsilon(\rho)$ implies $|\eta(y)| \le \rho\, h'(0)$. It follows that
$$y\,h(y) = y^2\big(h'(0) + \eta(y)\big) \ge y^2 (1-\rho)\,h'(0) \quad\text{if } |y| \le \varepsilon(\rho),$$
so that, for the localized algorithm,
$$\forall\, y \in \mathbf R, \quad y\,\widetilde h(y) \ge K y^2.$$
Consequently, since $(\gamma_n)_{n\ge1}$ satisfies $(G_\alpha)$ with $\alpha = K$ and $\kappa^* = 2K - \frac1c > 0$, one derives, following the lines of the proof of Proposition 6.11 (established for Markovian algorithms), that
$$\|Y_n\|_2 = O\big(\sqrt{\gamma_n}\big).$$
Step 4 (The SDE method). First we apply Step 1 to our framework and write
$$\Upsilon_{n+1} = \Upsilon_n - \gamma_{n+1}\,\Upsilon_n\Big(h'(0) - \frac{1}{2c} + \alpha_n^1 + \alpha_n^2\,\eta(Y_n)\Big) - \sqrt{\gamma_{n+1}}\,\Delta M_{n+1}, \tag{6.52}$$
where $\alpha_n^1 \to 0$ and $\alpha_n^2 \to 1$ are two deterministic sequences and $\eta$ is a bounded function. In what follows we set $a := h'(0) - \frac{1}{2c}$.
At this stage, we want to rewrite the above recursive equation in continuous time, exactly like we did for the ODE method. To this end, we first set
$$\Gamma_n = \gamma_1 + \cdots + \gamma_n, \quad n \ge N,$$
and
$$\forall\, t \in \mathbf R_+, \quad N(t) = \min\big\{k : \Gamma_{k+1} \ge t\big\}, \qquad \underline t = \Gamma_{N(t)}.$$
We define the stepwise constant process
$$\Upsilon^{(0)}(t) = \Upsilon_n, \quad t \in [\Gamma_n, \Gamma_{n+1}), \ n \ge N,$$
so that, in particular, $\Upsilon^{(0)}(t) = \Upsilon^{(0)}(\underline t)$. Following the strategy adopted for the ODE method, one expands the recursion (6.52) in order to obtain
$$\Upsilon^{(0)}(\Gamma_n) = \Upsilon_N - \int_{\Gamma_N}^{\Gamma_n} \Upsilon^{(0)}(\underline s)\big(a + \alpha^1_{N(s)} + \alpha^2_{N(s)}\,\eta(Y_{N(s)})\big)\,ds - \sum_{k=N+1}^{n} \sqrt{\gamma_k}\,\Delta M_k,$$
or, in continuous time,
$$\Upsilon^{(0)}(t) = \Upsilon_N - \int_{\Gamma_N}^{t} \Upsilon^{(0)}(\underline s)\big(a + \alpha^1_{N(s)} + \alpha^2_{N(s)}\,\eta(Y_{N(s)})\big)\,ds - \sum_{k=N+1}^{N(t)} \sqrt{\gamma_k}\,\Delta M_k. \tag{6.53}$$

Still like for the ODE, we are interested in the functional asymptotics at infinity of $\Upsilon^{(0)}(t)$, this time in a weak sense. To this end, we introduce, for every $n \ge N$ and every $t \ge 0$, the sequence of time shifted functions, defined this time on $\mathbf R_+$:
$$\Upsilon^{(n)}(t) = \Upsilon^{(0)}(\Gamma_n + t), \quad t \ge 0, \ n \ge N.$$
It reads
$$\Upsilon^{(n)}(t) = \Upsilon_n - \underbrace{\int_{\Gamma_n}^{\Gamma_n + t} \Upsilon^{(0)}(\underline s)\big(a + \alpha^1_{N(s)} + \alpha^2_{N(s)}\,\eta(Y_{N(s)})\big)\,ds}_{=:\,A^{(n)}(t)} \;+\; \underbrace{\sum_{k=n+1}^{N(\Gamma_n + t)} \sqrt{\gamma_k}\,\Delta M_k}_{=:\,M^{(n)}(t)}. \tag{6.54}$$
Step 5 (Functional tightness). At this stage, we need two fundamental results about
functional weak convergence. The first is a criterion which implies the functional
tightness of the distributions of a sequence of càdlàg processes $X^{(n)}$ (viewed as
probability measures on the space $D(\mathbf R_+, \mathbf R)$ of càdlàg functions from $\mathbf R_+$ to $\mathbf R$). The
second is an extension of Donsker's Theorem for sequences of martingales.
We recall the definition of the uniform continuity modulus, defined for every
function $f : \mathbf R_+ \to \mathbf R$ and $\delta, T > 0$ by
$$w(f, \delta, T) := \sup\big\{|f(t) - f(s)|,\ s, t \in [0, T],\ |t - s| \le \delta\big\}.$$
The terminology comes from the seminal property of this modulus: $f$ is (uniformly)
continuous over $[0, T]$ if and only if $\lim_{\delta\to0} w(f, \delta, T) = 0$.

Theorem 6.7 (C-tightness criterion, see [45], Theorem 15.5, p. 127) Let $(X^{(n)}_t)_{t\ge0}$, $n \ge 1$, be a sequence of càdlàg processes null at $t = 0$.
(a) If, for every $T > 0$ and every $\varepsilon > 0$,
$$\lim_{\delta \to 0}\ \varlimsup_n\ P\big(w(X^{(n)}, \delta, T) \ge \varepsilon\big) = 0, \tag{6.55}$$
then the sequence $(X^{(n)})_{n\ge1}$ is C-tight in the following sense: from any subsequence of $(X^{(n)})_{n\ge1}$ one may extract a further subsequence $(X^{(n')})$ such that $X^{(n')}$ converges in distribution toward a process $X^\infty$ with respect to the weak topology on the space $D(\mathbf R_+, \mathbf R)$ induced by the topology of uniform convergence on compact sets ($^{10}$), with $P(X^\infty \in C(\mathbf R_+, \mathbf R)) = 1$.
(b) A criterion (see [45], proof of Theorem 8.3, p. 56): if, for every $T > 0$ and every $\varepsilon > 0$,
$$\lim_{\delta \to 0}\ \varlimsup_n\ \frac{1}{\delta}\sup_{s \in [0,T]} P\Big(\sup_{s \le t \le s+\delta} \big|X^{(n)}_t - X^{(n)}_s\big| \ge \varepsilon\Big) = 0, \tag{6.56}$$
then Criterion (6.55) is satisfied.
The second theorem below provides a tightness criterion for a sequence of mar-
tingales based on the sequence of their bracket processes.

Theorem 6.8 (Weak functional limit of a sequence of martingales, see [154]) Let $(M^{(n)}(t))_{t\ge0}$, $n \ge 1$, be a C-tight sequence of càdlàg (local) martingales, null at 0, with (existing) predictable bracket processes $\langle M^{(n)}\rangle$. If
$$\forall\, t \ge 0, \quad \langle M^{(n)}\rangle(t) \xrightarrow{a.s.} \sigma^2 t \quad\text{as } n \to +\infty, \ \sigma > 0,$$
then
$$M^{(n)} \xrightarrow{\mathcal L_{D(\mathbf R_+, \mathbf R)}} \sigma\, W,$$
where $W$ denotes a standard Brownian motion ($^{11}$).

$^{10}$ Although this topology is not standard on this space, it is simply defined sequentially by
$X^{(n)} \stackrel{(U)}{\Longrightarrow} X^\infty$ if, for every bounded functional $F : D(\mathbf R_+, \mathbf R) \to \mathbf R$ continuous for the $\|\cdot\|_{\sup}$-norm,
$E\,F(X^{(n)}) \to E\,F(X^\infty)$.
We first deal with the sequence $(A^{(n)})_{n\ge N}$. The boundedness of the integrand yields, for every $n \ge N$ and every $s, \delta > 0$,
$$E\Big[\sup_{s \le t \le s+\delta} \big|A^{(n)}(t) - A^{(n)}(s)\big|^2\Big] \le C^2 \sup_{n \ge N} E\,|\Upsilon_n|^2 \times \big(\delta + \gamma_{N(s+\delta+\Gamma_n)+1}\big)^2,$$
whence, by Markov's inequality,
$$\frac{1}{\delta}\sup_{s \in [0,T]} P\Big(\sup_{s \le t \le s+\delta} \big|A^{(n)}(t) - A^{(n)}(s)\big| \ge \varepsilon\Big) \le C^2 \sup_{n \ge N} \|\Upsilon_n\|_2^2\ \frac{\big(\delta + \gamma_{N(T+\delta+\Gamma_n)+1}\big)^2}{\delta\,\varepsilon^2}.$$
Noting that $\lim_{n\to+\infty} \gamma_{N(T+\delta+\Gamma_n)+1} = 0$, one derives that Criterion (6.56) is satisfied.
Hence, the sequence $(A^{(n)})_{n\ge N}$ is C-tight by applying the above Theorem 6.7.
Now, we deal with the martingales $M^{(n)}$, $n \ge N$. Let us consider the filtration
$\mathcal F^{(0)}_t = \mathcal F_n$, $t \in [\Gamma_n, \Gamma_{n+1})$. We define $M^{(0)}$ by
$$M^{(0)}(t) = 0 \ \text{ if } t \in [0, \Gamma_N] \qquad\text{and}\qquad M^{(0)}(t) = \sum_{k=N}^{N(t)} \sqrt{\gamma_k}\,\Delta M_{k+1} \ \text{ if } t \in [\Gamma_N, +\infty).$$
It is clear that $\big(M^{(0)}(t)\big)_{t\ge0}$ is an $(\mathcal F^{(0)}_t)_{t\ge0}$-martingale. Moreover, we know from (6.51)
that $\sup_n E\,|\Delta M_n|^{2+\beta} \le A(\varepsilon) < +\infty$.

$^{11}$ This means that, for every bounded functional $F : D(\mathbf R_+, \mathbf R) \to \mathbf R$, measurable with respect to the
$\sigma$-field spanned by the finite projections $\alpha \mapsto \alpha(t)$, $t \in \mathbf R_+$, and continuous at every $\alpha \in C(\mathbf R_+, \mathbf R)$, one
has $E\,F(M^{(n)}) \to E\,F(\sigma W)$ as $n \to +\infty$. In fact, this remains true for measurable functionals $F$
which are $P_{\sigma W}(d\alpha)$-a.s. continuous on $C(\mathbf R_+, \mathbf R)$ and such that $\big(F(M^{(n)})\big)_{n\ge1}$ is uniformly integrable.
The B.D.G. inequality then yields, for every $s \in [\Gamma_N, +\infty)$ and every $\delta > 0$,
$$E\Big[\sup_{s\le t\le s+\delta}\big|M^{(0)}(t)-M^{(0)}(s)\big|^{2+\beta}\Big] \le C_\beta\, E\Bigg(\sum_{k=N(s)+1}^{N(s+\delta)} \gamma_k\,(\Delta M_k)^2\Bigg)^{1+\frac\beta2}$$
$$\le C_\beta \Bigg(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Bigg)^{1+\frac\beta2} E\Bigg(\frac{\sum_{k=N(s)+1}^{N(s+\delta)} \gamma_k\,(\Delta M_k)^2}{\sum_{k=N(s)+1}^{N(s+\delta)} \gamma_k}\Bigg)^{1+\frac\beta2}$$
$$\le C_\beta \Bigg(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Bigg)^{1+\frac\beta2} E\Bigg(\frac{\sum_{k=N(s)+1}^{N(s+\delta)} \gamma_k\,|\Delta M_k|^{2+\beta}}{\sum_{k=N(s)+1}^{N(s+\delta)} \gamma_k}\Bigg)$$
$$\le C_\beta \Bigg(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Bigg)^{\frac\beta2}\ \sum_{k=N(s)+1}^{N(s+\delta)} \gamma_k\, E\,|\Delta M_k|^{2+\beta},$$
where the third line follows from Jensen's inequality applied to the convex function $u \mapsto u^{1+\frac\beta2}$ with the weights $\gamma_k/\sum_k\gamma_k$, and $C_\beta$ is a positive real constant. One finally derives that, for every $s \in [\Gamma_N, +\infty)$,
$$E\Big[\sup_{s\le t\le s+\delta}\big|M^{(0)}(t)-M^{(0)}(s)\big|^{2+\beta}\Big] \le C_\beta\, A(\varepsilon)\Bigg(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Bigg)^{1+\frac\beta2} \le C_\beta\, A(\varepsilon)\Big(\delta + \sup_{k\ge N(s)+1}\gamma_k\Big)^{1+\frac\beta2}.$$
Noting that $M^{(n)}(t) = M^{(0)}(\Gamma_n + t) - M^{(0)}(\Gamma_n)$, $t \ge 0$, $n \ge N$, we derive
$$\forall\, n \ge N,\ \forall\, s \ge 0, \quad E\Big[\sup_{s\le t\le s+\delta}\big|M^{(n)}(t)-M^{(n)}(s)\big|^{2+\beta}\Big] \le C_\beta\Big(\delta + \sup_{k\ge N(\Gamma_n)+1}\gamma_k\Big)^{1+\frac\beta2},$$
whence, by Markov's inequality,
$$\varlimsup_n\ \frac1\delta \sup_{s\in[0,T]} P\Big(\sup_{s\le t\le s+\delta}\big|M^{(n)}(t)-M^{(n)}(s)\big|\ge \varepsilon\Big) \le C_\beta\, \frac{\delta^{\frac\beta2}}{\varepsilon^{2+\beta}}.$$
The C-tightness of the sequence (M (n) )n≥N follows again from Theorem 6.7(b).
Furthermore, for every $n \ge N$,
$$\langle M^{(n)}\rangle_t = \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, E\big((\Delta M_k)^2 \,\big|\, \mathcal F_{k-1}\big) = \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\Big(E\big[H(y, Z)^2\big]_{|y=Y_{k-1}} - h(Y_{k-1})^2\Big)$$
$$\sim \big(E[H(0, Z)^2] - h(0)^2\big)\sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k \quad\text{as } n \to +\infty,$$
owing to the a.s. convergence $Y_n \to 0$. Since $\sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k \to t$ as $n \to +\infty$, we get $\langle M^{(n)}\rangle_t \to \sigma^2 t$ a.s. with $\sigma^2 := E[H(0, Z)^2] - h(0)^2$, so that Theorem 6.8 yields
$$M^{(n)} \xrightarrow{\mathcal L_{C(\mathbf R_+,\mathbf R)}} \sigma\, W^{(\infty)},$$
where $W^{(\infty)}$ denotes a standard Brownian motion.
The sequence of random variables $(\Upsilon_n)_{n\ge N}$ is tight since it is $L^2$-bounded, and C-tightness is stable under addition. Consequently, since $A^{(n)}$ and $M^{(n)}$, $n \ge N$, are both C-tight, the sequence of processes $\big(\Upsilon^{(n)}, M^{(n)}\big)_{n\ge N}$ is C-tight as well.
Now let us elucidate the limit of $\big(A^{(n)}(t)\big)_{n\ge N}$. For every $T > 0$,
$$\sup_{t\in[0,T]}\Big|A^{(n)}(t) - a\int_{\Gamma_n}^{\Gamma_n+t} \Upsilon^{(0)}(\underline s)\,ds\Big| \le \bar a_n \int_0^T \big|\Upsilon^{(n)}(\underline s)\big|\,ds,$$
where $\bar a_n := \sup_{k\ge n}|\alpha_k^1| + \big(\sup_k \alpha_k^2\big)\,\sup_{k\ge n}\big|\eta(Y_k)\big|$. As $\eta(Y_n) \to 0$ a.s., we derive that
$\bar a_n \to 0$ a.s., whereas $\int_0^T \big|\Upsilon^{(n)}(\underline s)\big|\,ds$ is $L^1$-bounded since
$$E\int_0^T \big|\Upsilon^{(n)}(\underline s)\big|\,ds = \int_0^T E\,\big|\Upsilon^{(n)}(\underline s)\big|\,ds \le \sup_{n\ge N}\|\Upsilon_n\|_2\, T < +\infty.$$
On the other hand, by the Cauchy–Schwarz inequality,
$$\sup_{t\in[0,T]}\Bigg(\int_{\underline{\Gamma_n+t}}^{\Gamma_n+t}\big|\Upsilon^{(0)}(\underline s)\big|\,ds\Bigg)^2 \le \sup_{t\in[0,T]}\big(\Gamma_n+t-\underline{\Gamma_n+t}\big)\int_{\Gamma_n}^{\Gamma_n+T}\big|\Upsilon^{(0)}(\underline s)\big|^2\,ds \le \sup_{k\ge n}\gamma_k \int_0^T \big|\Upsilon^{(n)}(\underline s)\big|^2\,ds,$$
which implies likewise that $\sup_{t\in[0,T]} \int_{\underline{\Gamma_n+t}}^{\Gamma_n+t} \big|\Upsilon^{(0)}(\underline s)\big|\,ds \to 0$ in probability as $n \to +\infty$.
Consequently, for every $T > 0$,
$$\sup_{t\in[0,T]}\Big|A^{(n)}(t) - a\int_0^t \Upsilon^{(n)}(\underline s)\,ds\Big| \to 0 \ \text{ in probability as } n \to +\infty. \tag{6.57}$$
Let $\big(\Upsilon^{(\infty)}(t), M^{(\infty)}(t)\big)_{t\ge0}$ be a weak functional limiting value of $\big(\Upsilon^{(n)}, M^{(n)}\big)_{n\ge N}$, i.e.
a weak limit along a subsequence $(\varphi(n))$. It follows from (6.54) that
$$\Upsilon^{(n)} - \Upsilon^{(n)}(0) + a\int_0^{\cdot} \Upsilon^{(n)}(\underline s)\,ds = M^{(n)} - \Big(A^{(n)} - a\int_0^{\cdot} \Upsilon^{(n)}(\underline s)\,ds\Big).$$
Using that $(x, y) \mapsto \big(x - x(0) + a\int_0^{\cdot} x(s)\,ds,\ y\big)$ is clearly continuous for the
$\|\cdot\|_{\sup}$-norm, and (6.57), it follows that
$$\Upsilon^{(\infty)}(t) - \Upsilon^{(\infty)}(0) + a\int_0^t \Upsilon^{(\infty)}(s)\,ds = \sigma\,W^{(\infty)}_t, \quad t \ge 0, \tag{6.58}$$
i.e. $\Upsilon^{(\infty)}$ is a solution of the Ornstein–Uhlenbeck equation $d\Upsilon^{(\infty)}_t = -a\,\Upsilon^{(\infty)}_t\,dt + \sigma\,dW^{(\infty)}_t$,
starting from a random variable $\Upsilon^{(\infty)}(0) \in L^2$ such that
$$\big\|\Upsilon^{(\infty)}(0)\big\|_2 \le \sup_{n\ge N}\|\Upsilon_n\|_2.$$
This follows from the weak Fatou Lemma for convergence in distribution (see
Theorem 12.6(v)) since $\Upsilon_{\varphi(n)} \xrightarrow{\mathcal L} \Upsilon^{(\infty)}(0)$. Let $\nu_0$ be a weak limiting value of $(\Upsilon_n)_{n\ge N}$,
i.e. such that $\Upsilon_{\psi(n)} \Rightarrow \nu_0$.
For every $t > 0$, one considers the sequence of integers $\psi_t(n)$ uniquely defined by
$$\Gamma_{\psi_t(n)} \le \Gamma_{\psi(n)} - t < \Gamma_{\psi_t(n)+1},$$
so that, up to a new extraction,
$$\Upsilon^{(\psi_t(n))} \xrightarrow{\mathcal L} \Upsilon^{(\infty,-t)} \quad\text{starting from } \Upsilon^{(\infty,-t)}(0) \stackrel{d}{=} \nu_{-t}.$$
Now, let $(P_t)_{t\ge0}$ denote the semi-group of the Ornstein–Uhlenbeck process, defined
on bounded Borel functions $f : \mathbf R \to \mathbf R$ by $P_t f(x) = E\,f(X^x_t)$. From the preceding,
for every $t \ge 0$,
$$\nu_0 = \nu_{-t} P_t.$$
Moreover, (ν−t )t≥0 is tight since it is L2 -bounded. Let ν−∞ be a weak limiting value
of ν−t as t → +∞.
Let $\Upsilon^\mu$ denote a solution to (6.58) starting from a $\mu$-distributed random variable
independent of $W^{(\infty)}$. It is straightforward that the paths of two such solutions, driven by the same Brownian motion, satisfy the confluence property
$$\big|\Upsilon^\mu_t - \Upsilon^{\mu'}_t\big| \le \big|\Upsilon^\mu_0 - \Upsilon^{\mu'}_0\big|\,e^{-at},$$
hence $\nu_0 = \lim_{t\to+\infty}\nu_{-t}P_t$ coincides with the (unique) invariant distribution $\mathcal N\big(0; \frac{\sigma^2}{2a}\big)$ of the Ornstein–Uhlenbeck process, so that
$$\Upsilon_n \xrightarrow{\mathcal L} \mathcal N\Big(0; \frac{\sigma^2}{2a}\Big) \quad\text{as } n \to +\infty.$$
Now we return to $\Upsilon_n$ (prior to the localization). We just proved that, for $\varepsilon = \varepsilon(\rho)$
and for every $N \ge 1$,
$$\Upsilon^{\varepsilon,N}_n \xrightarrow{\mathcal L} \mathcal N\Big(0; \frac{\sigma^2}{2a}\Big) \quad\text{as } n \to +\infty. \tag{6.59}$$
On the other hand, we already saw that $Y_n \to 0$ a.s. implies that $\Omega = \bigcup_{N\ge1}\Omega_{\varepsilon,N}$ a.s.,
where
$$\Omega_{\varepsilon,N} \subset \big\{Y^{\varepsilon,N}_n = Y_n,\ n \ge N\big\} = \big\{\Upsilon^{\varepsilon,N}_n = \Upsilon_n,\ n \ge N\big\}.$$
Moreover, the events $\Omega_{\varepsilon,N}$ are non-decreasing as $N$ increases, so that
$$\lim_{N\to+\infty} P(\Omega_{\varepsilon,N}) = 1.$$
Hence, for every bounded continuous function $f$ and every $n \ge N$,
$$\Big|E\,f(\Upsilon_n) - E\,f\Big(\tfrac{\sigma}{\sqrt{2a}}\,\zeta\Big)\Big| \le \Big|E\,f\big(\Upsilon^{\varepsilon,N}_n\big) - E\,f\Big(\tfrac{\sigma}{\sqrt{2a}}\,\zeta\Big)\Big| + 2\,\|f\|_{\sup}\,P\big(\Omega_{\varepsilon,N}^c\big),$$
where $\zeta \stackrel{d}{=} \mathcal N(0; 1)$. The result follows by letting $n$, then $N$, go to infinity, observing that
$$\lim_n E\,f(\Upsilon_n) = E\,f\Big(\frac{\sigma}{\sqrt{2a}}\,\zeta\Big),$$
i.e. $\Upsilon_n \xrightarrow{\mathcal L} \mathcal N\big(0; \frac{\sigma^2}{2a}\big)$ as $n \to +\infty$. ♦
Then, we implement the standard recursive stochastic algorithm (6.3) and set
$$\bar Y_n := \frac{Y_0 + \cdots + Y_{n-1}}{n}, \quad n \ge 1.$$
Note that, of course, this empirical mean itself satisfies a recursive formula:
$$\forall\, n \ge 0, \quad \bar Y_{n+1} = \bar Y_n - \frac{1}{n+1}\big(\bar Y_n - Y_n\big), \quad \bar Y_0 = 0.$$
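The recursive form of the empirical mean makes the averaged estimator essentially free to compute alongside the algorithm itself. A minimal sketch (Python; the toy choice H(y, z) = y − z with Gaussian innovations, the target y∗ = 1 and the step constants are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

y, y_bar = 0.0, 0.0   # Y_0 and the empirical mean bar(Y)_0 = 0
m = 1.0               # illustrative target y* (zero of h(y) = y - m)
for n in range(100_000):
    # bar(Y)_{n+1} = bar(Y)_n - (bar(Y)_n - Y_n) / (n + 1)
    y_bar -= (y_bar - y) / (n + 1)
    # slowly decreasing step gamma_n = c / n^theta with 1/2 < theta < 1
    gamma = 0.5 / (n + 1) ** (2 / 3)
    z = m + rng.standard_normal()
    y -= gamma * (y - z)              # Y_{n+1} = Y_n - gamma_{n+1} H(Y_n, Z_{n+1})
print(y_bar)
```

Note that the averaging update is performed before the algorithm's own update, so that `y_bar` is exactly the mean of Y₀, …, Y_{n−1} as in the formula above.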
By Césaro’s averaging principle it is clear that under the assumptions which ensure
that Yn → y∗ , one has
a.s.
Ȳn −→ y∗ as n → +∞
as well. This is even true on the event {Yn → y∗ } (e.g. in the case of multiple tar-
gets). What is more unexpected is that, under natural assumptions (see Theorem 6.9
hereafter), the (weak) rate of this convergence is ruled by a CLT
√ Lstably
n (Ȳn − y∗ ) −→ N (0; ∗) on the event {Yn → y∗ },
As we did for the CLT of the algorithm itself, we again state – and prove – this
CLT for the averaged procedure for Markovian algorithms, though it can be done
for more general recursive procedures of the form
$$Y_{n+1} = Y_n - \gamma_{n+1}\big(h(Y_n) + \Delta M_{n+1}\big).$$

Theorem 6.9 (Ruppert and Polyak, see [245, 259], see also [81, 241, 246]) Let
$H : \mathbf R^d \times \mathbf R^q \to \mathbf R^d$ be a Borel function and let $(Z_n)_{n\ge1}$ be a sequence of i.i.d. $\mathbf R^q$-valued random vectors.
Assume that, for every $y \in \mathbf R^d$, $H(y, Z) \in L^1(P)$, so that the mean vector field $h(y) = E\,H(y, Z)$ is well-defined (this is implied by (iii) below).
We make the following assumptions:
(i) The function $h$ is zero at $y^*$ and is “fast” differentiable at $y^*$, in the sense that
$$h(y) = J_h(y^*)(y - y^*) + O\big(|y - y^*|^2\big) \quad\text{as } y \to y^*,$$
where all eigenvalues of the Jacobian matrix $J_h(y^*)$ of $h$ at $y^*$ have a (strictly) positive
real part (hence $J_h(y^*)$ is invertible).
(ii) The algorithm $Y_n$ converges toward $y^*$ with positive probability.
(iii) There exists an η > 2 such that
(iv) The mapping $y \mapsto E\,H(y, Z)H(y, Z)^t$ is continuous at $y^*$.
Set $\Gamma^* = E\,H(y^*, Z)H(y^*, Z)^t$.
Then, if the step sequence is slowly decreasing of the form $\gamma_n = \frac{c}{n^\vartheta + b}$, $n \ge 1$, with
$1/2 < \vartheta < 1$ and $c > 0$, $b \ge 0$, the empirical mean sequence defined by
$$\bar Y_n = \frac{Y_0 + \cdots + Y_{n-1}}{n}$$
satisfies the CLT with the optimal asymptotic variance, on the event $\{Y_n \to y^*\}$,
namely
$$\sqrt n\,\big(\bar Y_n - y^*\big) \xrightarrow{\mathcal L_{stably}} \mathcal N\Big(0;\ J_h(y^*)^{-1}\,\Gamma^*\,\big(J_h(y^*)^{-1}\big)^t\Big) \quad\text{on } \{Y_n \to y^*\}.$$
Proof (partial). We will prove this theorem in the case of a scalar algorithm, that is,
we assume d = 1. Beyond this dimensional limitation, adopted only for notational
convenience, we will consider a more restrictive setting than the one proposed in the
above statement of the theorem. We refer, for example, to [81] for the general case.
In addition to (or instead of) the above assumptions, we assume that:
– the function H satisfies the linear growth assumption:
– the mean function h satisfies the coercivity assumption (6.43) from Proposi-
tion 6.11 for some α > 0 and has a Lipschitz continuous derivative.
At this point, note that the step sequences $(\gamma_n)_{n\ge1}$ under consideration are non-increasing and all satisfy Condition $(G_\alpha)$ of Proposition 6.11 (i.e. (6.44)) with $\alpha$ from (6.43). In particular, this implies that $Y_n \to y^*$ a.s. and $\|Y_n - y^*\|_2 = O(\sqrt{\gamma_n})$.
Without loss of generality we may assume that $y^* = 0$, by replacing $Y_n$ by $Y_n - y^*$, $H(y, z)$ by $H(y^* + y, z)$, etc. We start from the canonical decomposition
$$h'(0)\,\sqrt n\, \bar Y_n = -\frac{1}{\sqrt n}\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k} - \frac{1}{\sqrt n}\,M_n - \frac{1}{\sqrt n}\sum_{k=0}^{n-1} Y_k^2\,\kappa(Y_k),$$
where $M_n := \sum_{k=1}^n \Delta M_k$ and $\kappa$ is defined by $h(y) = h'(0)\,y + y^2\,\kappa(y)$.
We will successively inspect the three sums on the right-hand side of the equation.
First, by an Abel transform, we get
$$\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k} = \frac{Y_n}{\gamma_n} - \frac{Y_0}{\gamma_1} - \sum_{k=2}^n Y_{k-1}\Big(\frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}}\Big).$$
Hence, using that the sequence $\big(\frac{1}{\gamma_n}\big)_{n\ge1}$ is non-decreasing, we derive
$$\Bigg|\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k}\Bigg| \le \frac{|Y_n|}{\gamma_n} + \frac{|Y_0|}{\gamma_1} + \sum_{k=2}^n |Y_{k-1}|\Big(\frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}}\Big).$$
Taking expectation and using that $E|Y_k| \le \|Y_k\|_2 \le C\sqrt{\gamma_k}$, $k \ge 1$, for some real
constant $C > 0$, we get
$$E\,\Bigg|\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k}\Bigg| \le \frac{E|Y_n|}{\gamma_n} + \frac{E|Y_0|}{\gamma_1} + \sum_{k=2}^n E|Y_{k-1}|\Big(\frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}}\Big)$$
$$\le \frac{C}{\sqrt{\gamma_n}} + \frac{E|Y_0|}{\gamma_1} + C\sum_{k=2}^n \sqrt{\gamma_{k-1}}\Big(\frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}}\Big) = \frac{C}{\sqrt{\gamma_n}} + \frac{E|Y_0|}{\gamma_1} + C\sum_{k=2}^n \frac{1}{\sqrt{\gamma_{k-1}}}\Big(\frac{\gamma_{k-1}}{\gamma_k} - 1\Big).$$
Now $\frac{\gamma_{k-1}}{\gamma_k} - 1 = O\big(\frac1k\big)$ and $\frac{1}{\sqrt{\gamma_{k-1}}} = O\big(k^{\frac\vartheta2}\big)$, so that
$$\sum_{k=2}^n \frac{1}{\sqrt{\gamma_{k-1}}}\Big(\frac{\gamma_{k-1}}{\gamma_k} - 1\Big) = O\big(n^{\frac\vartheta2}\big).$$
Since $\frac{1}{\sqrt{\gamma_n}} = O\big(n^{\frac\vartheta2}\big)$ as well and $\vartheta < 1$, it follows that
$$\frac{1}{\sqrt n}\,E\,\Bigg|\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k}\Bigg| = O\big(n^{\frac{\vartheta-1}{2}}\big) \to 0, \quad\text{i.e.}\quad \frac{1}{\sqrt n}\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k} \xrightarrow{L^1} 0 \ \text{ as } n \to +\infty.$$
The second term involves the martingale $M_n$ obtained by summing the increments $\Delta M_k$.
It is this martingale which will rule the global weak convergence rate. To analyze it,
we rely on Lindeberg's Theorem 12.8 from the Miscellany Chapter, applied with $a_n = n$. First, if we set $\bar H(y) = E\big(H(y, Z) - h(y)\big)^2$, it is straightforward that condition (i)
of Theorem 12.8 involving the martingale increments is satisfied since
$$\frac{\langle M\rangle_n}{n} = \frac1n \sum_{k=1}^n \bar H(Y_{k-1}) \xrightarrow{a.s.} \bar H(y^*),$$
owing to Cesàro's Lemma and the fact that $Y_n \to y^*$ a.s. It remains to check
condition (ii) of Theorem 12.8, known as Lindeberg's condition. Let $\varepsilon > 0$. Then
$$E\big((\Delta M_k)^2\,\mathbf 1_{\{|\Delta M_k|\ge \varepsilon\sqrt n\}}\,\big|\,\mathcal F_{k-1}\big) \le \varsigma\big(Y_{k-1}, \varepsilon\sqrt n\big),$$
where $\varsigma(y, a) = E\big[\big(H(y, Z) - h(y)\big)^2\,\mathbf 1_{\{|H(y,Z)-h(y)|\ge a\}}\big]$, $a > 0$. One shows, owing
to Assumption (6.60), that $h$ is bounded on compact sets which, in turn, implies that
$\big((H(y, Z) - h(y))^2\big)_{|y|\le K}$ makes up a uniformly integrable family. Hence $\varsigma(y, a) \to 0$
as $a \to +\infty$ for every $y$, uniformly on compact sets. Since $Y_n \to y^*$ a.s.,
$(Y_n)_{n\ge0}$ is a.s. bounded, hence
$$\frac1n \sum_{k=1}^n \varsigma\big(Y_{k-1}, \varepsilon\sqrt n\big) \le \max_{0\le k\le n-1} \varsigma\big(Y_k, \varepsilon\sqrt n\big) \to 0 \quad a.s. \ \text{ as } n \to +\infty.$$
The third term can be handled as follows, at least in our strengthened framework.
Under Assumption (6.43) of Proposition 6.11, we know that $E\,Y_n^2 \le C\gamma_n$ for some real
constant $C > 0$, since the class of step sequences we consider satisfies Condition
$(G_\alpha)$ (see the Practitioner's corner after Proposition 6.11). Consequently, $|\kappa|$ being bounded by $[h']_{Lip}$,
$$\frac{1}{\sqrt n}\sum_{k=0}^{n-1} E\big(Y_k^2\,|\kappa(Y_k)|\big) \le [h']_{Lip}\,\frac{1}{\sqrt n}\sum_{k=0}^{n-1} E\,Y_k^2 \le \frac{C\,[h']_{Lip}}{\sqrt n}\Big(E\,Y_0^2 + \sum_{k=1}^{n-1}\gamma_k\Big)$$
$$\le \frac{C c\,[h']_{Lip}}{\sqrt n}\Big(E\,Y_0^2 + \sum_{k=1}^{n-1} k^{-\vartheta}\Big) \le \frac{C c\,[h']_{Lip}}{\sqrt n}\Big(E\,Y_0^2 + \frac{n^{1-\vartheta}}{1-\vartheta}\Big) \sim \frac{C c\,[h']_{Lip}}{1-\vartheta}\, n^{\frac12-\vartheta} \to 0$$
as $n \to +\infty$, since $\vartheta > \frac12$. This implies that $\frac{1}{\sqrt n}\sum_{k=0}^{n-1} Y_k^2\,|\kappa(Y_k)| \xrightarrow{L^1} 0$. Slutsky's Lemma com-
pletes the proof. ♦
Remark. As far as the step sequence is concerned, we only used in the above proof
that the step sequence $(\gamma_n)_{n\ge1}$ is non-increasing and satisfies the following three
conditions:
$$(G_\alpha), \qquad \sum_{k\ge1} \frac{\gamma_k}{\sqrt k} < +\infty \qquad\text{and}\qquad \lim_n n\,\gamma_n = +\infty.$$
Indeed, we have seen in the former section that, in one dimension, the asymptotic
variance $\bar H(y^*)/h'(0)^2$ obtained in the Ruppert–Polyak theorem is the lowest possible asymp-
totic variance in the CLT when specifying the step parameter in an optimal way
($\gamma_n = \frac{c_{opt}}{n+b}$). In fact, this discussion and its conclusions can be easily extended to
higher dimensions (if one considers some matrix-valued step sequences), as empha-
sized, for example, in [81].
So, the Ruppert and Polyak averaging principle performs as fast as the “fastest”
regular stochastic algorithm with no need to optimize the step sequence: the optimal
asymptotic variance is obtained for free!
ℵ Practitioner's corner • The proof above shows that
$$\frac{1}{\sqrt n}\sum_{k=1}^n \frac{Y_k - Y_{k-1}}{\gamma_k} = O_{L^1}\big(n^{\frac{\vartheta-1}{2}}\big) \qquad\text{and}\qquad \frac{1}{\sqrt n}\sum_{k=0}^{n-1} Y_k^2\,|\kappa(Y_k)| = O_{L^1}\big(n^{\frac12-\vartheta}\big).$$
Hence the balance between these two terms is obtained by equalizing the two
exponents $\frac12 - \vartheta$ and $\frac{\vartheta-1}{2}$, i.e.
$$\frac12 - \vartheta = \frac{\vartheta-1}{2} \iff \vartheta_{opt} = \frac23.$$
See also [101].
When to start averaging? In practice, one should not start the averaging at the
true beginning of the procedure but rather wait for its stabilization, ideally once the
“exploration/search” phase is finished. On the other hand, the compromise consisting
in using a moving window (typically of length n after 2n iterations) does not yield
the optimal asymptotic variance, as pointed out in [196].
Exercises. 1. Test the above averaging principle on the former exercises and
“numerical illustrations” by considering $\gamma_n = \gamma_1\, n^{-\frac23}$, $n \ge 1$, as suggested in the
first item of the above Practitioner's corner. Compare with a direct approach with
a step of the form $\gamma_n = \frac{c}{n+b}$, with $c > 0$ “large enough but not too large…”, and
$b \ge 0$.
2. Show that, under the (stronger) assumptions that we considered in the proof of the
former theorem, Proposition 6.12 holds true with the averaged algorithm $(\bar Y_n)_{n\ge1}$,
namely that
$$E\,L(\bar Y_n) - L(y^*) = O\Big(\frac1n\Big).$$
In the presence of multiple equilibrium points, i.e. of multiple zeros of the mean
function h, some of them turn out to be parasitic. This is the case for saddle points
or local maxima of the potential function L in the framework of stochastic gradi-
ent descent. More generally any zero of h whose Jacobian Jh (y∗ ) has at least one
eigenvalue with non-positive real part is parasitic.
There is a wide literature on this problem which says, roughly speaking, that
a noisy enough parasitic equilibrium point is a.s. not a possible limit point for a
stochastic approximation procedure. Although natural and expected, such a conclu-
sion is far from being straightforward to establish, as testified by the various works
on the topic (see [191], see also [33, 81, 95, 242], etc.). If the equilibrium is not noisy,
many situations may occur, as illustrated by the two-armed bandit algorithm, whose
zeros are all noiseless (see [184]).
To some extent local minima are parasitic too but this is another story and stan-
dard stochastic approximation does not provide satisfactory answers to this “second
order problem” for which specific procedures like simulated annealing should be
implemented, with the drawback of degrading the (nature and) rate of convergence.
Going deeper in this direction is beyond the scope of this monograph so we refer
to the literature mentioned above and the references therein for more insight on this
aspect of Stochastic Approximation.
We can apply both above CLTs to the $\mathrm{VaR}_\alpha$ and $\mathrm{CVaR}_\alpha(X)$ algorithms (6.28)
and (6.30). Since
$$h(\xi) = \frac{1}{1-\alpha}\big(F(\xi) - \alpha\big) \qquad\text{and}\qquad E\,H(\xi, X)^2 = \frac{1}{(1-\alpha)^2}\,F(\xi)\big(1 - F(\xi)\big),$$
where $F$ is the c.d.f. of $X$, one easily derives from Theorems 6.6 and 6.9 the following
results.
(a) If $\gamma_n = \frac{c}{n^\vartheta}$, $\frac12 < \vartheta < 1$, $c > 0$, then
$$n^{\frac\vartheta2}\,\big(\xi_n - \xi^*_\alpha\big) \xrightarrow{\mathcal L} \mathcal N\Big(0;\ \frac{c\,\alpha(1-\alpha)}{2 f(\xi^*_\alpha)}\Big).$$
(b) If $\gamma_n = \frac{c}{n+b}$, $b \ge 0$, and $c > \frac{1-\alpha}{2 f(\xi^*_\alpha)}$, then
$$\sqrt n\,\big(\xi_n - \xi^*_\alpha\big) \xrightarrow{\mathcal L} \mathcal N\Big(0;\ \frac{c^2\,\alpha}{2 c\,f(\xi^*_\alpha) - (1-\alpha)}\Big).$$
The algorithm for the $\mathrm{CVaR}_\alpha(X)$ satisfies the same kind of CLT.
This result is not satisfactory because the asymptotic variance remains huge, since
$f(\xi^*_\alpha)$ is usually very close to 0 when $\alpha$ is close to 1. Thus, if $X$ has a normal distribution
$\mathcal N(0; 1)$, then it is clear that $\xi^*_\alpha \to +\infty$ as $\alpha \to 1$. Consequently,
$$1 - \alpha = P(X \ge \xi^*_\alpha) \sim \frac{f(\xi^*_\alpha)}{\xi^*_\alpha} \quad\text{as } \alpha \to 1,$$
so that
$$\frac{\alpha(1-\alpha)}{f(\xi^*_\alpha)^2} \sim \frac{1}{\xi^*_\alpha\, f(\xi^*_\alpha)} \to +\infty \quad\text{as } \alpha \to 1.$$
This simply illustrates the “rare event” effect which implies that, when α is close
to 1, the event {Xn+1 > ξn } is rare especially when ξn gets close to its limit ξα∗ =
VaRα (X ).
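For concreteness, here is a minimal sketch of such a quantile search (Python; the Gaussian loss X ~ N(0, 1) and all tuning constants are illustrative assumptions, not taken from the text). Running it with α close to 1 makes the rare-event effect visible: the event {X_{n+1} > ξ_n} is triggered very rarely, so the procedure is mostly driven by large, infrequent upward corrections.

```python
import numpy as np

rng = np.random.default_rng(0)

def var_rm(alpha, n_iter=200_000):
    """Robbins-Monro search of the alpha-quantile (VaR) of X ~ N(0, 1):
    xi_{n+1} = xi_n - gamma_{n+1} * (1/(1-alpha)) * (1_{X <= xi_n} - alpha)."""
    xi = 0.0
    for n in range(1, n_iter + 1):
        x = rng.standard_normal()
        gamma = 1.0 / n ** 0.75        # step c / n^theta with 1/2 < theta < 1
        xi -= gamma * (float(x <= xi) - alpha) / (1.0 - alpha)
    return xi

print(var_rm(0.95))   # close to 1.645, the 95% Gaussian quantile
```

For α = 0.95 the downward moves (size γ) occur 95% of the time and are balanced by rare upward jumps of size 19γ, which is precisely the noise structure that inflates the asymptotic variance for α near 1.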
The way out is to add an importance sampling procedure to somewhat “re-center”
the distribution around its VaRα (X ). To proceed, we will take advantage of our recur-
sive variance reduction by importance sampling described and analyzed in Sect. 6.3.1.
This is the object of the next section.
As emphasized in the previous section, the asymptotic variances of our “naive” algo-
rithms for $\mathrm{VaR}_\alpha$ and $\mathrm{CVaR}_\alpha$ computation are not satisfactory, in particular when $\alpha$
is close to 1. To improve them, the idea is to mix the recursive data-driven variance
reduction procedure introduced in Sect. 6.3.1 with the above algorithms.
First we make the (not so) restrictive assumption that the r.v. $X$, representative of
a loss, can be represented as a function of a standard normal vector $Z \stackrel{d}{=} \mathcal N(0; I_d)$,
namely
$$X = \varphi(Z), \qquad \varphi : \mathbf R^d \to \mathbf R \ \text{ Borel function}.$$
Hence, for a level $\alpha \in (0, 1]$, in a (temporarily) static framework (i.e. with fixed $\xi \in \mathbf R$),
the function of interest for variance reduction is defined by
$$\varphi_{\alpha,\xi}(z) = \frac{1}{1-\alpha}\big(\mathbf 1_{\{\varphi(z)\le\xi\}} - \alpha\big), \quad z \in \mathbf R^d.$$
So, still following Sect. 6.3.1 and taking advantage of the fact that $\varphi_{\alpha,\xi}$ is bounded,
we design the following data-driven procedure for the adaptive variance reducer
(using the notations from this section), based on the identity
$$E\,\varphi_{\alpha,\xi}(Z) = e^{-\frac{|\theta|^2}{2}}\,E\big[\varphi_{\alpha,\xi}(Z+\theta)\,e^{-(\theta|Z)}\big] = \lim_n \frac1n \sum_{k=1}^n e^{-\frac{|\theta_{k-1}|^2}{2}}\,\varphi_{\alpha,\xi}\big(Z_k + \theta_{k-1}\big)\,e^{-(\theta_{k-1}|Z_k)} \quad a.s.,$$
with an appropriate initialization (see the remark below). This procedure is a.s. con-
vergent toward its target, denoted by $(\theta_\alpha, \xi_\alpha)$, and the averaged component $(\bar\xi_n)_{n\ge0}$
of $(\xi_n)_{n\ge0}$ satisfies a CLT (see [29]):
$$(\xi_n, \theta_n) \xrightarrow[n\to+\infty]{} (\xi_\alpha, \theta_\alpha) \ \ a.s.\quad\text{with}\quad \xi_\alpha = \mathrm{VaR}_\alpha(X)$$
and
$$\theta_\alpha = \mathrm{argmin}_{\theta\in\mathbf R^d}\, V_{\alpha,\xi_\alpha}(\theta), \qquad V_{\alpha,\xi}(\theta) = e^{-|\theta|^2}\,E\Big[\big(\mathbf 1_{\{\varphi(Z+\theta)\le\xi\}} - \alpha\big)^2\,e^{-2(\theta|Z)}\Big].$$
Note that $V_{\alpha,\xi}(0) = F(\xi)\big(1 - F(\xi)\big)$.
(b) If the step sequence satisfies $\gamma_n = \frac{c}{n^\vartheta + b}$, $\frac12 < \vartheta < 1$, $b \ge 0$, $c > 0$, then
$$\sqrt n\,\big(\bar\xi_n - \xi_\alpha\big) \xrightarrow{\mathcal L} \mathcal N\Big(0;\ \frac{V_{\alpha,\xi_\alpha}(\theta_\alpha)}{f(\xi_\alpha)^2}\Big) \quad\text{as } n \to +\infty.$$
Remark. In practice it may be useful, as noted in [29], to make the level α slowly
vary with n in (6.61) and (6.62), e.g. from 0.5 up to the requested level, usually close
to 1. Otherwise, the procedure may freeze. The initialization of the procedure should
be set accordingly.
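The mean-shift identity behind this procedure, E φ(Z) = e^{−|θ|²/2} E[φ(Z + θ)e^{−(θ|Z)}], can be checked numerically on a static, one-dimensional example — a sketch (Python; the tail event {Z > 3} and the shift θ = 3 are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.standard_normal(n)

# Rare-ish event: phi(z) = 1_{z > 3}; exact value P(Z > 3) ~ 1.35e-3
plain = np.mean(z > 3.0)

# Importance sampling via the mean shift:
#   E phi(Z) = exp(-theta^2/2) * E[ phi(Z + theta) * exp(-theta * Z) ]
theta = 3.0   # shift the samples toward the rare region
shifted = np.exp(-0.5 * theta**2) * np.mean((z + theta > 3.0) * np.exp(-theta * z))

print(plain, shifted)
```

Both estimators target the same quantity, but the shifted one hits the rare region on roughly half of the samples instead of one per thousand, which is exactly the variance reduction exploited by the adaptive recursion above.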
A natural question is whether replacing the pseudo-random innovations of a stochastic approximation procedure by quasi-random ones may significantly accelerate the convergence of the procedure, like in Monte Carlo
simulations.
In [186], this question is mostly investigated from a theoretical viewpoint. The
main results are based on an extension of uniformly distributed sequences on unit
hypercubes called averaging systems. The two main results are based, on the one
hand, on a contraction assumption and on the other hand on a monotonicity assump-
tion, which both require some stringent conditions on the function H . In the first set-
ting, some a priori error bounds emphasize that quasi-stochastic approximation does
accelerate the convergence rate of the procedure. Both results are one-dimensional,
though the contracting setting could be easily extended to multi-dimensional proce-
dures. Unfortunately, it turns out to be of little interest for practical applications.
In this section, we want to propose a more natural multi-dimensional framework
in view of applications. First, we give the counterpart of the Robbins–Siegmund
Lemma established in Sect. 6.2. It relies on a pathwise Lyapunov function, which
remains a rather restrictive assumption. It emphasizes what kind of assumption is
needed to establish theoretical results when using deterministic uniformly distributed
sequences. A more detailed version, including several examples of applications, is
available in [190].
Suppose that
$$\{h = 0\} = \{y^*\}$$
and that $H$ fulfills the following pathwise mean-reverting assumption: the function
$\widehat H$ defined for every $y \in \mathbf R^d$ by
$$\widehat H(y) := \inf_{u\in[0,1]^q} \big(\nabla L(y)\,\big|\, H(y, u) - H(y^*, u)\big) \ \text{ is l.s.c. and positive on } \mathbf R^d\setminus\{y^*\}. \tag{6.63}$$
Furthermore, assume that
$$\forall\, y\in\mathbf R^d,\ \forall\, u\in[0,1]^q, \quad |H(y, u)| \le C_H \sqrt{1 + L(y)}. \tag{6.64}$$
Let $\xi := (\xi_n)_{n\ge1}$ be a uniformly distributed sequence over $[0,1]^q$ with low discrep-
ancy, hence satisfying
$$\max_{1\le k\le n} k\, D^*_k(\xi) = O\big((\log n)^q\big).$$
Then the recursive procedure defined by
$$y_{n+1} = y_n - \gamma_{n+1} H(y_n, \xi_{n+1}), \quad y_0 \in \mathbf R^d,$$
with a step sequence $(\gamma_n)_{n\ge1}$ satisfying the step assumption (6.65), satisfies:
$$y_n \longrightarrow y^* \quad\text{as } n \to +\infty.$$
Proof. (a) Step 1 (Regular step). The beginning of the proof is rather similar to the
“regular” stochastic case, except that we will use as a Lyapunov function
$$\Lambda = \sqrt{1 + L}.$$
First note that $\nabla\Lambda = \frac{\nabla L}{2\sqrt{1+L}}$ is bounded (by the constant $C_L$), so that $\Lambda$ is $C_L$-Lipschitz
continuous. Furthermore, for every $y, y' \in \mathbf R^d$,
$$\big|\nabla\Lambda(y) - \nabla\Lambda(y')\big| \le \frac{|\nabla L(y) - \nabla L(y')|}{2\sqrt{1+L(y)}} + \frac{|\nabla L(y')|}{2}\,\Big|\frac{1}{\sqrt{1+L(y)}} - \frac{1}{\sqrt{1+L(y')}}\Big| \tag{6.67}$$
$$\le [\nabla L]_{Lip}\,\frac{|y-y'|}{2\sqrt{1+L(y)}} + \frac{C_L}{\sqrt{1+L(y)}}\,\big|\sqrt{1+L(y)} - \sqrt{1+L(y')}\big|$$
$$\le [\nabla L]_{Lip}\,\frac{|y-y'|}{2\sqrt{1+L(y)}} + \frac{C_L^2\,|y-y'|}{\sqrt{1+L(y)}} \le C\,\frac{|y-y'|}{\sqrt{1+L(y)}}. \tag{6.68}$$
6.5 From Quasi-Monte Carlo to Quasi-Stochastic Approximation 265
Now, the above inequality (6.68), applied with $y = y_n$ and $y' = \zeta_{n+1}$, yields, knowing
that $|\zeta_{n+1} - y_n| \le |y_{n+1} - y_n|$,
$$\Lambda(y_{n+1}) \le \Lambda(y_n) - \gamma_{n+1}\big(\nabla\Lambda(y_n)\,\big|\,H(y_n, \xi_{n+1})\big) + \gamma_{n+1}^2\,C\,\frac{|H(y_n, \xi_{n+1})|^2}{\sqrt{1+L(y_n)}}$$
$$\le \Lambda(y_n) - \gamma_{n+1}\big(\nabla\Lambda(y_n)\,\big|\,H(y_n, \xi_{n+1}) - H(y^*, \xi_{n+1})\big) - \gamma_{n+1}\big(\nabla\Lambda(y_n)\,\big|\,H(y^*, \xi_{n+1})\big) + \gamma_{n+1}^2\,C\,\Lambda(y_n), \tag{6.69}$$
where we used (6.64) in the last term. Set, for every $n \ge 0$,
$$s_n := \frac{\Lambda(y_n) + \sum_{k=1}^n \gamma_k\big(\nabla\Lambda(y_{k-1})\,\big|\,H(y_{k-1}, \xi_k) - H(y^*, \xi_k)\big)}{\prod_{k=1}^n\big(1 + C\gamma_k^2\big)},$$
with the usual convention that an empty sum is 0 and an empty product is 1. It follows from (6.63) that the sequence $(s_n)_{n\ge0}$
is non-negative, since all the terms involved in its numerator are non-negative.
Now (6.69) reads
$$\forall\, n \ge 0, \quad 0 \le s_{n+1} \le s_n - \tilde\gamma_{n+1}\big(\nabla\Lambda(y_n)\,\big|\,H(y^*, \xi_{n+1})\big), \tag{6.70}$$
where $\tilde\gamma_n = \dfrac{\gamma_n}{\prod_{k=1}^n(1 + C\gamma_k^2)}$, $n \ge 1$.
Finally, set
$$m_n := \sum_{k=1}^n \tilde\gamma_k\big(\nabla\Lambda(y_{k-1})\,\big|\,H(y^*, \xi_k)\big) \qquad\text{and}\qquad S^*_n = \sum_{k=1}^n H(y^*, \xi_k).$$
First note that the low discrepancy assumption combined with the Koksma–Hlawka inequality (see Propo-
sition 4.3) implies
$$|S^*_n| \le C_\xi\, V\big(H(y^*, \cdot)\big)\,(\log n)^q, \tag{6.71}$$
where $V\big(H(y^*, \cdot)\big)$ denotes the variation in the measure sense of $H(y^*, \cdot)$. An Abel
transform yields (with the convention $S^*_0 = 0$)
$$m_n = \tilde\gamma_n\big(\nabla\Lambda(y_{n-1})\,\big|\,S^*_n\big) - \sum_{k=1}^{n-1}\big(\tilde\gamma_{k+1}\nabla\Lambda(y_k) - \tilde\gamma_k\nabla\Lambda(y_{k-1})\,\big|\,S^*_k\big)$$
$$= \underbrace{\tilde\gamma_n\big(\nabla\Lambda(y_{n-1})\,\big|\,S^*_n\big)}_{(a)} - \underbrace{\sum_{k=1}^{n-1}\tilde\gamma_k\big(\nabla\Lambda(y_k) - \nabla\Lambda(y_{k-1})\,\big|\,S^*_k\big)}_{(b)} - \underbrace{\sum_{k=1}^{n-1}\big(\tilde\gamma_{k+1} - \tilde\gamma_k\big)\big(\nabla\Lambda(y_k)\,\big|\,S^*_k\big)}_{(c)}.$$
The term (a) goes to 0 owing to (6.71), the boundedness of $\nabla\Lambda$ and the step assumption. As for (b), owing to (6.68) and (6.64),
$$\sum_{k=1}^{n-1}\tilde\gamma_k\Big|\big(\nabla\Lambda(y_k) - \nabla\Lambda(y_{k-1})\,\big|\,S^*_k\big)\Big| \le C'\sum_{k=1}^{n-1}\tilde\gamma_k\,\gamma_k\,\frac{|H(y_{k-1}, \xi_k)|}{\sqrt{1+L(y_{k-1})}}\,|S^*_k| \le C'\,C_H\,V\big(H(y^*,\cdot)\big)\sum_{k=1}^{n-1}\gamma_k^2\,(\log k)^q,$$
for some real constant $C'$, so that the series (b) is absolutely convergent. One checks that the series (c) is also (absolutely) conver-
gent, owing to the boundedness of $\nabla\Lambda$, the step assumption (6.65) and the upper-bound (6.71)
for $S^*_n$.
Hence $m_n$ converges toward a finite limit $m_\infty$. This implies that the sequence
$(s_n + m_n)_n$ is bounded from below, since $(s_n)_n$ is non-negative. Now, we know from (6.70)
that $(s_n + m_n)_n$ is also non-increasing, hence convergent in $\mathbf R$, which in turn implies
that the sequence $(s_n)_{n\ge0}$ itself converges toward a finite limit. The same argu-
ments as in the regular stochastic case yield
$$L(y_n) \longrightarrow L_\infty \ \text{ as } n \to +\infty \qquad\text{and}\qquad \sum_{n\ge1}\gamma_n\,\widehat H(y_{n-1}) < +\infty.$$
One concludes, still like in the stochastic case, that $(y_n)$ is bounded and eventually
converges toward the unique zero $y^*$ of $\widehat H$.
(b) is obvious. ♦
ℵ Practitioner’s corner • The step assumption (6.65) includes all the step sequences
of the form γn = ncα , α ∈ (0, 1]. Note that as soon as q ≥ 2, the condition γn (log n)q →
0 is redundant (it follows from the convergence of the series on the right owing to
an Abel transform).
• One can replace the (slightly unrealistic) assumption on H (y∗ , . ) by a more nat-
ural Lipschitz continuous assumption, provided one strengthens the step assump-
tion (6.65) into 1
γn = +∞, γn (log n)n1− q → 0
n≥1
and 1
max γn − γn+1 , γn2 (log n)n1− q < +∞.
n≥1
Note that the above new assumptions are satisfied by the step sequences γn = c
nρ
,
1 − q1 < ρ ≤ 1.
• It is clear that the mean-reverting assumption on H is much more stringent in the
QMC setting.
• It remains that theoretical spectrum of application of the above theorem is dra-
matically more narrow than the original one. However, from a practical viewpoint,
one observes on simulations a very satisfactory behavior of such quasi-stochastic
procedures, including the improvement of the rate of convergence with respect to the
regular MC implementation.
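As a small experiment in this direction, one can feed a one-dimensional Robbins–Monro-type recursion with a van der Corput (low-discrepancy) sequence instead of pseudo-random numbers. A sketch (Python; the toy choice H(y, u) = y − u, whose mean function h(y) = y − 1/2 vanishes at y∗ = 1/2, and all constants are illustrative assumptions):

```python
def van_der_corput(n, base=2):
    """n-th term of the van der Corput sequence in the given base
    (the radical-inverse of n)."""
    q, bk = 0.0, 1.0 / base
    while n > 0:
        n, r = divmod(n, base)
        q += r * bk
        bk /= base
    return q

# Quasi-stochastic approximation: y_{n+1} = y_n - gamma_{n+1} H(y_n, xi_{n+1})
# with H(y, u) = y - u, so h(y) = y - 1/2 and y* = 1/2.
y = 0.0
for n in range(1, 20_001):
    y -= (1.0 / n) * (y - van_der_corput(n))
print(y)   # close to 0.5
```

With the step γ_n = 1/n the iterate is exactly the running mean of the quasi-random innovations, so its error is governed by the star discrepancy O((log n)/n) rather than the Monte Carlo rate O(1/√n), illustrating the acceleration discussed above.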
Exercise. We assume now that the recursive procedure satisfied by the sequence
$(y_n)_{n\ge0}$ is given by
$$\forall\, n \ge 0, \quad y_{n+1} = y_n - \gamma_{n+1}\big(H(y_n, \xi_{n+1}) + r_{n+1}\big), \quad y_0 \in \mathbf R^d,$$
where the sequence $(r_n)_{n\ge1}$ is a perturbation term. Show that, if $\sum_{n\ge1}\gamma_n r_n$ is a
convergent series, then the conclusion of the above theorem remains true.
Table 6.2 B–S Best-of-Call option. T = 1, r = 0.10, σ₁ = σ₂ = 0.30, X₀¹ = X₀² = 100,
K = 100. Convergence of ρₙ = cos(θₙ) toward ρ∗ = −0.5 (up to n = 100 000)
n        ρₙ := cos(θₙ)
1000 −0.4964
10000 −0.4995
25000 −0.4995
50000 −0.4994
75000 −0.4996
100000 −0.4998
Fig. 6.4 B–S Best-of-Call option. T = 1, r = 0.10, σ₁ = σ₂ = 0.30, X₀¹ = X₀² = 100, K =
100. Convergence of ρₙ = cos(θₙ) toward ρ∗ = −0.5 (up to n = 100 000). Left: MC
implementation. Right: QMC implementation
6.6 Concluding Remarks 269
The aim of this chapter is to investigate several discretization schemes of the (adapted)
solution $(X_t)_{t\in[0,T]}$ to a $d$-dimensional Brownian Stochastic Differential Equation
(SDE) formally reading
$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t, \tag{7.1}$$
where $b : [0,T]\times\mathbf R^d \to \mathbf R^d$, $\sigma : [0,T]\times\mathbf R^d \to \mathbb M(d, q, \mathbf R)$ and $W$ is a $q$-dimensional Brownian motion.
Note that $|\cdot|$ and $\|\cdot\|$ may denote any norm on $\mathbf R^d$ and on $\mathbb M(d, q, \mathbf R)$, respectively,
in the above condition. However, without explicit mention, we will consider the
canonical Euclidean and Frobenius norms in what follows.
We consider the so-called augmented filtration of the SDE, generated by $X_0$ and
$\sigma(W_s, 0 \le s \le t)$, i.e.
$$\mathcal F_t := \sigma\big(X_0, \mathcal N_P, W_s, 0 \le s \le t\big), \quad t \in [0, T], \tag{7.3}$$
where $\mathcal N_P$ denotes the class of $P$-negligible sets of $\mathcal A$ (i.e. all negligible sets if the
$\sigma$-algebra $\mathcal A$ is $P$-complete). When $X_0 = x_0 \in \mathbf R^d$ is deterministic, $\mathcal F_t = \mathcal F^W_t =
\sigma\big(\mathcal N_P, W_s, 0 \le s \le t\big)$ is simply the augmented filtration of the Brownian motion
$W$. One shows using Kolmogorov's 0–1 law (see [251] or [162]) that $(\mathcal F^W_t)$ is right
continuous, i.e. $\mathcal F^W_t = \cap_{s>t} \mathcal F^W_s$ for every $t \in [0, T)$. The same holds for $(\mathcal F_t)$.
In fact, if b and σ are continuous this condition follows from (7.2) applied with (t, x)
and (t, 0), given the fact that t → b(t, 0) is bounded on [0, T ] by continuity.
Moreover, still under this linear growth assumption, the Lipschitz assumption (7.2)
can be relaxed to a local Lipschitz condition on b and σ, namely, for every N ∈ N∗ ,
• By adding the 0-th component t to X, i.e. by setting Y_t := (t, X_t), one may sometimes assume that the SDE is autonomous, i.e. that the coefficients b and σ only depend on the space variable. This is often enough for applications, though it may induce overly stringent assumptions on the time variable in many theoretical results. Furthermore, when some ellipticity assumptions are required, this way of considering the equation no longer works, since the equation dt = 1 dt + 0 dW_t is completely degenerate (in terms of noise).
7 Discretization Scheme(s) of a Brownian Diffusion 273

The above theorem admits an easy extension which allows us to define, like for ODEs, the flow of an SDE. Under the assumptions of Theorem 7.1, for every t ∈ [0, T] and every x ∈ R^d, there exists a unique (F_{t+s}^W)_{s∈[t,T]}-adapted solution (X_s^{t,x})_{s∈[t,T]} to the above SDE (7.1) in the sense that

P-a.s.,  ∀ s ∈ [t, T],  X_s^{t,x} = x + ∫_t^s b(u, X_u^{t,x}) du + ∫_t^s σ(u, X_u^{t,x}) dW_u.   (7.4)

In fact, (X_s^{t,x})_{s∈[t,T]} is adapted to the augmented filtration of the Brownian motion W_s^{(t)} = W_{t+s} − W_t, s ∈ [t, T].
Except for some very specific equations, it is impossible to devise an exact simulation of the process X, even at a fixed time T. By exact simulation, we mean writing X_T = χ(U) in distribution, with U ~ U([0, 1]^r), where r ∈ N* ∪ {+∞} and χ is an explicit function, defined on [0, 1]^r if r < +∞ and on [0, 1]^(N*) when r = +∞. In fact, such an exact simulation has been shown to be possible when d = 1 and σ ≡ 1 (see [43]), by an appropriate acceptance-rejection method. For a brief discussion of recent developments in this direction, we refer to the introduction of Chap. 9. Consequently, to approximate E f(X_T), or more generally E F((X_t)_{t∈[0,T]}), by a Monte Carlo method, one needs to approximate X by a process that can be simulated (at least at a fixed number of instants).
To this end, we will introduce three types of Euler schemes with step T/n (n ∈ N*) associated to the SDE (7.1): the discrete time Euler scheme X̄ = (X̄_{t_k^n})_{0≤k≤n} with step T/n, its càdlàg stepwise constant extension, known as the stepwise constant (Brownian) Euler scheme, and the continuous or genuine (Brownian) Euler scheme. The idea, like for ODEs in the deterministic framework, is to freeze the solution of the SDE between the regularly spaced discretization instants t_k^n = kT/n.
Discrete time Euler scheme

The discrete time Euler scheme with step T/n is defined by

X̄_{t_{k+1}^n} = X̄_{t_k^n} + (T/n) b(t_k^n, X̄_{t_k^n}) + σ(t_k^n, X̄_{t_k^n}) √(T/n) Z_{k+1}^n,  k = 0, …, n − 1,  X̄_0 = X_0,   (7.5)

which, for the Black–Scholes dynamics (with drift r and volatility σ), reads

X̄_{t_{k+1}^n} = X̄_{t_k^n} ( 1 + r T/n + σ √(T/n) Z_{k+1}^n ),  k = 0, …, n − 1,  X̄_0 = X_0.   (7.6)
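As a quick illustration, the recursion (7.5) is straightforward to vectorize over simulated paths. The sketch below (Python with NumPy; the Black–Scholes coefficients and parameter values are illustrative choices, not taken from the text) simulates the discrete time Euler scheme:

```python
import numpy as np

def euler_paths(b, sigma, x0, T, n, n_paths, rng):
    """Discrete time Euler scheme (7.5) with step T/n, vectorized over paths."""
    h = T / n
    x = np.full(n_paths, float(x0))
    path = np.empty((n + 1, n_paths))
    path[0] = x
    for k in range(n):
        z = rng.standard_normal(n_paths)            # Z_{k+1}^n ~ N(0; 1)
        x = x + h * b(k * h, x) + np.sqrt(h) * sigma(k * h, x) * z
        path[k + 1] = x
    return path

# Black-Scholes coefficients b(t, x) = r x, sigma(t, x) = sig x, as in (7.6)
r, sig, x0, T, n = 0.10, 0.30, 100.0, 1.0, 50
rng = np.random.default_rng(0)
paths = euler_paths(lambda t, x: r * x, lambda t, x: sig * x, x0, T, n, 100_000, rng)
print(paths[-1].mean())   # close to x0 * exp(r T) since E Z = 0 at each step
```

The terminal mean of the scheme is x0 (1 + rT/n)^n, which converges to x0 e^{rT} as n grows.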
At this stage it is natural to extend the definition (7.5) of the Euler scheme at every
instant t ∈ [0, T ] by interpolating the drift and the diffusion term in their own scale,
that is, with respect to time for the drift and with respect to the Brownian motion for
the diffusion coefficient, namely
Proof. It is clear from (7.8), the recursive definition (7.5) at the discretization dates t_k^n and the continuity of b and σ that X̄_t → X̄_{t_{k+1}^n} as t → t_{k+1}^n. Consequently, for every t ∈ [t_k^n, t_{k+1}^n],

X̄_t = X̄_{t_k^n} + ∫_{t_k^n}^t b(\underline{s}, X̄_{\underline{s}}) ds + ∫_{t_k^n}^t σ(\underline{s}, X̄_{\underline{s}}) dW_s,

so that the conclusion follows by just concatenating the above identities between t_0^n = 0, t_1^n, …, t_k^n = \underline{t} and t. ♦
Then, the main (classical) result is that under the assumptions on the coefficients
b and σ mentioned above, supt∈[0,T ] |X t − X̄ t | goes to zero in every L p (P), 0 < p <
+∞, as n → +∞. Let us be more specific on this topic by providing error rates
under slightly more stringent assumptions.
How to use this genuine scheme for practical simulation is not obvious, at least not as obvious as for the stepwise constant Euler scheme. However, it turns out to be an important method for improving the convergence rate of MC simulations, e.g. for option pricing. Using this scheme in simulations is possible for specific functionals F. It relies on the so-called diffusion bridge method and will be detailed further on in Chap. 8.
We consider the SDE (7.1) and its Euler–Maruyama scheme(s) as defined by (7.5),
(7.8). The first version of Theorem 7.2 below (including the second remark that fol-
lows) is mainly due to O. Faure in his PhD thesis (see [90]). An important preliminary
step is to establish the existence of finite L p moments of the sup-norm of solutions
when b and σ have linear growth, whenever X 0 itself lies in L p .
Polynomial moments control
It is often useful to have at hand the following uniform bounds for the solution(s) of (SDE) and its Euler schemes, which first appear as a step in the proof of the convergence rate but have many other applications. They are, for instance, an important step in proving the existence of global strong solutions to (SDE) (7.1) when b and σ are only locally Lipschitz continuous but both satisfy a linear growth assumption in the space variable x uniformly in t ∈ [0, T].
for some real constant C = C_T > 0 and a "horizon" T > 0. Then, for every p ∈ (0, +∞), there exists a universal positive real constant κ_{p,d} > 0 such that every strong solution (X_t)_{t∈[0,T]} of (7.1) (if any) satisfies

‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ κ_{p,d} e^{κ_{p,d} C (C+1) T} ( 1 + ‖X_0‖_p ).   (7.12)

Remarks. • The universal constant κ_{p,d} is a numerical function of the BDG real constant C_{p∨2,d}^{BDG}.
• Less synthetic but more precise bounds can be obtained. For example, if p ≥ 2, Inequality (7.55) established in the proof of Proposition 7.6 reads, at time t = T,

‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ √2 e^{( 2 + C (C_{d,p}^{BDG})² ) C T} ( ‖X_0‖_p + C ( T + C_{d,p}^{BDG} √T ) ).
First we introduce the following condition (HTβ ) which strengthens Assumption (7.2)
by adding a time regularity assumption of the Hölder type, namely there exists a
β ∈ (0, 1] such that
7.2 The Strong Error Rate and Polynomial Moments (I) 277
(H_T^β) ≡  ∃ C_{b,σ,T} > 0 such that ∀ s, t ∈ [0, T], ∀ x, y ∈ R^d,
  |b(t, x) − b(s, x)| + ‖σ(t, x) − σ(s, x)‖ ≤ C_{b,σ,T} (1 + |x|) |t − s|^β,
  |b(t, x) − b(t, y)| + ‖σ(t, x) − σ(t, y)‖ ≤ C_{b,σ,T} |y − x|.   (7.14)
Theorem 7.2 (Strong Rate for the Euler scheme) (a) Genuine Euler scheme. Suppose the coefficients b and σ of the SDE (7.1) satisfy the above regularity condition (H_T^β) for a real constant C_{b,σ,T} > 0 and an exponent β ∈ (0, 1]. Then the genuine Euler scheme (X̄_t^n)_{t∈[0,T]} converges toward (X_t)_{t∈[0,T]} in every L^p(P), p > 0, such that X_0 ∈ L^p, at a O(n^{−(β ∧ 1/2)})-rate. To be precise, there exists a universal constant κ_p > 0 only depending on p such that, for every n ≥ T,

‖ sup_{t∈[0,T]} |X_t − X̄_t^n| ‖_p ≤ K(p, b, σ, T) ( 1 + ‖X_0‖_p ) (T/n)^{β ∧ 1/2}   (7.15)

where

K(p, b, σ, T) = κ_p C̃_{b,σ,T} e^{κ_p (1 + C̃_{b,σ,T})² T}

and

C̃_{b,σ,T} := C_{b,σ,T} + sup_{t∈[0,T]} |b(t, 0)| + sup_{t∈[0,T]} ‖σ(t, 0)‖ < +∞.   (7.16)

In particular, the discrete time Euler scheme satisfies

‖ max_{0≤k≤n} |X_{t_k^n} − X̄_{t_k^n}| ‖_p ≤ K(p, b, σ, T) ( 1 + ‖X_0‖_p ) (T/n)^{β ∧ 1/2}.   (7.17)

(a′) If b and σ are defined on the whole R_+ × R^d and satisfy (H_T^β) with the same real constant C_{b,σ} not depending on T, and if b(·, 0) and σ(·, 0) are bounded on R_+, then C̃_{b,σ,T} does not depend on T.
This will be the case in the autonomous setting, i.e. if b(t, x) = b(x) and σ(t, x) = σ(x), t ∈ R_+, x ∈ R^d, with b and σ Lipschitz continuous on R^d.
(b) Stepwise constant Euler scheme. As soon as b and σ satisfy the linear growth assumption (7.11) with a real constant L_{b,σ,T} > 0, then, for every p ∈ (0, +∞) and every n ≥ T,

‖ sup_{t∈[0,T]} |X̄_t^n − X̄_{\underline{t}}^n| ‖_p ≤ κ̃_p e^{κ̃_p L_{b,σ,T} T} ( 1 + ‖X_0‖_p ) √( T (1 + log n)/n ) = O( √( (1 + log n)/n ) ),

where κ̃_p > 0 is a positive real constant only depending on p (and increasing in p).

In particular, if b and σ satisfy the assumption (H_T^β) as in item (a), then the stepwise constant Euler scheme (X̄_{\underline{t}}^n)_{t∈[0,T]} converges toward (X_t)_{t∈[0,T]} in every L^p(P), p > 0, such that X_0 ∈ L^p and, for every n ≥ T,

‖ sup_{t∈[0,T]} |X_t − X̄_{\underline{t}}^n| ‖_p ≤ K̃(p, b, σ, T) ( 1 + ‖X_0‖_p ) ( (T/n)^{β ∧ 1/2} + √( T (1 + log n)/n ) ) = O( (1/n)^β + √( (1 + log n)/n ) ),

where

K̃(p, b, σ, T) = κ̃_p ( 1 + C̃_{b,σ,T} ) e^{κ̃_p (1 + C̃_{b,σ,T})² T}.
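The (T/n)^{β∧1/2} strong rate can be observed numerically on a model with an explicit solution. The sketch below (illustrative Black–Scholes parameters; not from the text) couples the Euler scheme with the exact solution through the same Brownian increments and estimates the terminal L² error; quadrupling n should roughly halve it:

```python
import numpy as np

def euler_terminal_l2_error(r, sig, x0, T, n, n_paths, rng):
    """L^2 distance at T between the Euler scheme and the exact Black-Scholes
    solution x0 * exp((r - sig^2/2) T + sig W_T), both driven by the same W."""
    h = T / n
    x = np.full(n_paths, x0)
    w_T = np.zeros(n_paths)
    for _ in range(n):
        dw = np.sqrt(h) * rng.standard_normal(n_paths)
        x = x + r * x * h + sig * x * dw
        w_T += dw
    exact = x0 * np.exp((r - 0.5 * sig ** 2) * T + sig * w_T)
    return np.sqrt(np.mean((x - exact) ** 2))

r, sig, x0, T = 0.10, 0.30, 100.0, 1.0
rng = np.random.default_rng(1)
errs = {n: euler_terminal_l2_error(r, sig, x0, T, n, 100_000, rng) for n in (25, 100)}
print(errs[25] / errs[100])   # close to 2, i.e. an O(n^{-1/2}) strong rate
```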
Warning! The complete and detailed proof of this theorem in its full generality,
i.e. including the tracking of the constants, is postponed to Sect. 7.8. It makes use
of stochastic calculus. A first version of the proof in the one-dimensional quadratic
case is proposed in Sect. 7.2.3. However, owing to its importance for applications,
the optimality of the upper-bound for the stepwise constant Euler scheme will be
discussed right after the remarks below.
Remarks. • When n ≤ T, the above explicit bounds still hold true with the same constants provided one replaces

(T/n)^{β ∧ 1/2}   by   (1/2) ( (T/n)^β + (T/n)^{1/2} )

and

√( T (1 + log n)/n )   by   (1/2) ( √( T (1 + log n)/n ) + T/n ).
• As a consequence, note that the time regularity exponent β rules the convergence rate of the scheme as soon as β < 1/2. In fact, the method of proof itself will emphasize this fact: the idea is to use a Gronwall Lemma to upper-bound the error X − X̄ in L^p(P) by the L^p(P)-norm of the increments X_s − X_{\underline{s}} or, equivalently, X̄_s − X̄_{\underline{s}}, as we will see later.
• If b(t, x) and σ(t, x) are globally Lipschitz continuous on R_+ × R^d with Lipschitz coefficient C_{b,σ}, one may consider the time t as a (d+1)-th spatial component of X and apply item (a′) of the above theorem directly.
We will see further on that this weak error² rate can be dramatically improved when b, σ and ϕ share higher regularity properties.
On the universality of the rate for the stepwise constant Euler scheme (p ≥ 2)

Note that the rate in claim (b) of the theorem is universal since, as we will see now, it holds as a sharp rate for the Brownian motion itself (here we deal with the case d = 1). Indeed, since W is its own genuine Euler scheme, W̄_t^n − W̄_{\underline{t}}^n = W_t − W_{\underline{t}} for every t ∈ [0, T]. Now,

‖ sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ‖_p = ‖ max_{k=1,…,n} sup_{t∈[t_{k−1}^n, t_k^n)} |W_t − W_{t_{k−1}^n}| ‖_p
  = √(T/n) ‖ max_{k=1,…,n} sup_{t∈[k−1,k)} |B_t − B_{k−1}| ‖_p

where B_t := √(n/T) W_{(T/n) t} is a standard Brownian motion owing to the scaling property. Hence

‖ sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ‖_p = √(T/n) ‖ max_{k=1,…,n} ζ_k ‖_p  with  ζ_k := sup_{t∈[k−1,k)} |B_t − B_{k−1}|,

and

ζ_k ≥ Z_k := |B_k − B_{k−1}|

since the Brownian motion (B_t) has continuous paths. The sequence (Z_k)_{k≥1} is i.i.d. as well, with the same distribution as |W_1|. Hence, the random variables Z_k² are still i.i.d. with a χ²(1)-distribution, so that (see Exercises 1 and 2 below)
² The word "weak" refers here to the fact that the error E ϕ(X_T) − E ϕ(X̄_T^n) is related to the convergence of the distributions P_{X̄_T^n} toward P_{X_T} as n → ∞ (e.g. weakly or in variation), whereas strong L^p(P)-convergence involves the joint distributions P_{(X̄_T^n, X_T)}.
∀ p ≥ 2,  ‖ max_{k=1,…,n} |Z_k| ‖_p = ‖ max_{k=1,…,n} Z_k² ‖_{p/2}^{1/2} ≥ c_p √(log n),

so that

∀ p ≥ 2,  ‖ sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ‖_p ≥ c_p √( T log n / n ).
(see e.g. [251], Reflection principle, p. 105). Hence, using that, for every a, b ≥ 0, e^{(a∨b)²} ≤ e^{a²} + e^{b²}, that sup_{t∈[0,1)} W_t ≥ 0 and that −W is also a standard Brownian motion,

E e^{θ ζ_1²} ≤ 2 E e^{θ (sup_{t∈[0,1)} W_t)²} = 2 E e^{θ W_1²} = 2 ∫_R exp( − u²/( 2 (1/√(1−2θ))² ) ) du/√(2π) = 2/√(1 − 2θ) < +∞

as long as θ ∈ (0, 1/2). Consequently, it follows from Lemma 7.1 below, applied with the sequence (ζ_n²)_{n≥1}, that

∀ p ∈ (0, +∞),  ‖ max_{k=1,…,n} ζ_k ‖_p = ‖ max_{k=1,…,n} ζ_k² ‖_{p/2}^{1/2} ≤ C_{W,p} √(1 + log n),

so that

‖ sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ‖_p ≤ C_{W,p} √( T (1 + log n)/n ).   (7.18)
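The √(T(1+log n)/n) upper bound (7.18) is easy to probe by simulation; the sketch below approximates each per-cell supremum on a finer sub-grid (the sub-grid size m and all sample sizes are arbitrary choices):

```python
import numpy as np

def sup_gap(T, n, n_paths, rng, m=8):
    """Estimate E sup_{t<=T} |W_t - W_{t_k}|, where t_k is the last grid point
    kT/n before t; the supremum over each cell is approximated on m sub-points."""
    h = T / (n * m)
    best = np.zeros(n_paths)
    for _ in range(n):                  # cells are independent Brownian pieces
        dw = np.sqrt(h) * rng.standard_normal((m, n_paths))
        best = np.maximum(best, np.abs(dw.cumsum(axis=0)).max(axis=0))
    return best.mean()

T, rng = 1.0, np.random.default_rng(2)
ratios = [sup_gap(T, n, 4_000, rng) / np.sqrt(T * (1 + np.log(n)) / n)
          for n in (10, 100, 1000)]
print(ratios)   # roughly constant across n, matching sqrt(T (1 + log n) / n)
```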
Lemma 7.1 Let Y_1, …, Y_n be non-negative random variables with the same distribution satisfying E(e^{λY_1}) < +∞ for some λ > 0. Then,

∀ p ∈ (0, +∞),  ‖ max(Y_1, …, Y_n) ‖_p ≤ (1/λ) ( log n + C_{p,Y_1,λ} ).
Proof. We may assume without loss of generality that p ≥ 1 since the ‖·‖_p-norm is non-decreasing in p. First, assume λ = 1. Let p ≥ 1. One sets

ϕ_p(x) = ( log(e^{p−1} + x) )^p − (p − 1)^p,  x > 0.

Hence

‖ max_{k=1,…,n} Y_k ‖_p ≤ log( 1 + n E e^{Y_1} ) + p − 1 = log n + log( E e^{Y_1} + 1/n ) + p − 1 ≤ log n + C_{p,1,Y_1},

where C_{p,λ,Y_1} = log E e^{λY_1} + e^{p−1}.

Let us return to the general case, i.e. E e^{λY_1} < +∞ for a λ > 0. Then

‖ max(Y_1, …, Y_n) ‖_p = (1/λ) ‖ max(λY_1, …, λY_n) ‖_p ≤ (1/λ) ( log n + C_{p,λ,Y_1} ). ♦
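Lemma 7.1 predicts that the L^p-norm of the maximum grows like (log n + C)/λ. A minimal numerical check with Exp(1) random variables, for which E max(Y_1, …, Y_n) = H_n ≈ log n + 0.5772 (harmonic numbers; the sample sizes are arbitrary):

```python
import numpy as np

# Lemma 7.1 check with Y_i ~ Exp(1) (so E e^{lam Y} < +inf for every lam < 1):
# E max(Y_1, ..., Y_n) = H_n = log n + 0.5772... + o(1), i.e. log n plus a constant.
rng = np.random.default_rng(3)
gaps = {}
for n in (10, 100, 1000):
    y = rng.exponential(size=(5_000, n))
    gaps[n] = y.max(axis=1).mean() - np.log(n)   # hovers near Euler's constant
print(gaps)
```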
∀ p ≥ 1,  ‖ max(Z_1, …, Z_n) ‖_p ≥ c ∑_{k=1}^n ( 1 − F^k(a) )/k ≥ c ( log(n + 1) + log( 1 − F(a) ) ).
for any non-negative random variable U and use some basic facts about the Stieltjes integral, like dF(z) = f(z) dz, etc.]
2. (Completion of the proof of the above lower bound II). Show that the χ²(1) distribution, defined by its density on the real line f(u) := e^{−u/2}/√(2πu) 1_{u>0}, satisfies the above Inequality (7.19). [Hint: use an integration by parts and usual comparison theorems for integrals to show that

F̄(z) = 2 e^{−z/2}/√(2πz) − ∫_z^{+∞} e^{−u/2}/( u √(2πu) ) du ∼ 2 e^{−z/2}/√(2πz)  as z → +∞.]
‖ e^{a(Z_1+⋯+Z_n) − n a²/2} − ∏_{k=1}^n ( 1 + a Z_k ) ‖_2.

(c) Let σ > 0. Show the existence of two positive real constants c_i, i = 1, 2, such that

‖ e^{σ(Z_1+⋯+Z_n)/√n − σ²/2} − ∏_{k=1}^n ( 1 + σ Z_k/√n ) ‖_2 = ( σ² e^{σ²/2} / √(2n) ) √( 1 − σ²( c_1 σ² + c_2 )/(2n) + O(1/n²) ).

(d) Deduce the exact convergence rate of the Euler scheme with step T/n of the martingale Black–Scholes model

d X_t = σ X_t dW_t,  X_0 = x_0 > 0.
Theorem 7.3 If b and σ satisfy (H_T^β) for a β ∈ (0, 1] and if X_0 is a.s. finite, the continuous Euler scheme X̄^n = (X̄_t^n)_{t∈[0,T]} a.s. converges toward the diffusion X for the sup-norm over [0, T]. Furthermore, for every α ∈ (0, β ∧ 1/2),

n^α sup_{t∈[0,T]} |X_t − X̄_t^n| → 0  a.s.
We provide below a proof of both Proposition 7.2 and Theorem 7.2 in a simplified one-dimensional, autonomous and quadratic (p = 2) setting. This means that b(t, x) = b(x) and σ(t, x) = σ(x) are defined as Lipschitz continuous functions on the real line. Then (SDE) admits a unique strong solution (X_t)_{t≥0} starting from X_0, hence a unique strong solution on every interval [0, T].
Furthermore, we will not care about the structure of the real constants that come
out, in particular no control in T is provided in this concise version of the proof. The
complete and detailed proofs are postponed to Sect. 7.8.
Proof. It is clear that the non-decreasing (finite) function ϕ(t) := sup_{0≤s≤t} f(s) satisfies (G) instead of f. Now the function e^{−αt} ∫_0^t ϕ(s) ds has a right derivative at every t ≥ 0, equal to

e^{−αt} ( ϕ(t_+) − α ∫_0^t ϕ(s) ds ) ≤ e^{−αt} ψ(t_+),

where ϕ(t_+) and ψ(t_+) denote the right limits of ϕ and ψ at t. Then it follows from the fundamental theorem of Calculus that the function

t ↦ e^{−αt} ∫_0^t ϕ(s) ds − ∫_0^t e^{−αs} ψ(s_+) ds   is non-increasing and null at 0,

where we used successively that a monotonic function is ds-a.s. continuous and that ψ is non-decreasing. ♦
Now we recall the quadratic Doob Inequality, which is needed in the proof (instead of the more sophisticated Burkholder–Davis–Gundy Inequality required in the non-quadratic case).
Doob's Inequality (L² case) (see e.g. [183]).
(a) Let M = (M_t)_{t≥0} be a continuous martingale with M_0 = 0. Then, for every T > 0,

E sup_{t∈[0,T]} M_t² ≤ 4 E M_T² = 4 E ⟨M⟩_T.

(b) If M is simply a continuous local martingale with M_0 = 0, then, for every T > 0,

E sup_{t∈[0,T]} M_t² ≤ 4 E ⟨M⟩_T.
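A quick Monte Carlo sanity check of Doob's L² inequality for the martingale M_t = W_t, for which 4 E ⟨M⟩_T = 4T (the supremum is approximated on a grid; sizes are arbitrary):

```python
import numpy as np

# Doob's L^2 inequality for M_t = W_t on [0, T]:
# E sup_{t<=T} W_t^2 <= 4 E <W>_T = 4 T.
T, n, n_paths = 1.0, 512, 20_000
rng = np.random.default_rng(4)
w = (np.sqrt(T / n) * rng.standard_normal((n, n_paths))).cumsum(axis=0)
lhs = (w ** 2).max(axis=0).mean()
print(lhs, 4 * T)   # lhs lies well below the Doob bound 4T
```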
Proof of Proposition 7.2 (a first partial proof). We may assume without loss of generality that E X_0² < +∞ (otherwise the inequality is trivially fulfilled). Let τ_L := min{ t : |X_t − X_0| ≥ L }, L ∈ N \ {0} (with the usual convention min ∅ = +∞). It is an F-stopping time as the hitting time of a closed set by a process with continuous paths (see Sect. 7.8.2). Furthermore, for every t ∈ [0, T],

|X_t^{τ_L}| ≤ L + |X_0|.

Then,

X_t^{τ_L} = X_0 + ∫_0^{t∧τ_L} b(X_s) ds + ∫_0^{t∧τ_L} σ(X_s) dW_s = X_0 + ∫_0^{t∧τ_L} b(X_s^{τ_L}) ds + ∫_0^{t∧τ_L} σ(X_s^{τ_L}) dW_s

owing to the local feature of both regular and stochastic integrals. The stochastic integral

M_t^{(L)} := ∫_0^{t∧τ_L} σ(X_s^{τ_L}) dW_s

where we used again (in the first inequality) that τ_L ∧ t ≤ t. Finally, this can be rewritten as

E sup_{s∈[0,t]} (X_s^{τ_L})² ≤ C_{b,σ,T} ( 1 + E X_0² + ∫_0^t E |X_s^{τ_L}|² ds ).
As for the Euler scheme, the proof follows closely the above lines, once we replace the stopping time τ_L by

τ̄_L = min{ t : |X̄_t − X_0| ≥ L }.

It suffices to note that, for every s ∈ [0, T], sup_{u∈[0,s]} |X̄_{\underline{u}}| ≤ sup_{u∈[0,s]} |X̄_u|. Then one shows that

sup_{n≥1} E sup_{s∈[0,T]} (X̄_s^n)² ≤ C_{b,σ,T} ( 1 + E X_0² ) e^{C_{b,σ,T} T}. ♦
Proof of Theorem 7.2 (simplified setting). (a) (Convergence rate of the continuous Euler scheme). Combining the equations satisfied by X and its (continuous) Euler scheme yields

X_t − X̄_t = ∫_0^t ( b(X_s) − b(X̄_{\underline{s}}) ) ds + ∫_0^t ( σ(X_s) − σ(X̄_{\underline{s}}) ) dW_s.

Consequently, using that b and σ are Lipschitz continuous, the Schwarz and Doob Inequalities lead to (where the real constant C_{b,σ,T} may vary from line to line)

E sup_{s∈[0,t]} |X_s − X̄_s|² ≤ 2 E ( [b]_Lip ∫_0^t |X_s − X̄_{\underline{s}}| ds )² + 2 E sup_{s∈[0,t]} | ∫_0^s ( σ(X_u) − σ(X̄_{\underline{u}}) ) dW_u |²
  ≤ 2 [b]²_Lip t E ∫_0^t |X_s − X̄_{\underline{s}}|² ds + 8 E ∫_0^t | σ(X_u) − σ(X̄_{\underline{u}}) |² du
  ≤ 2 t [b]²_Lip ∫_0^t E |X_s − X̄_{\underline{s}}|² ds + 8 [σ]²_Lip ∫_0^t E |X_u − X̄_{\underline{u}}|² du
  ≤ C_{b,σ,T} ∫_0^t E |X_s − X̄_{\underline{s}}|² ds
  ≤ C_{b,σ,T} ∫_0^t E |X_s − X̄_s|² ds + C_{b,σ,T} ∫_0^t E |X̄_s − X̄_{\underline{s}}|² ds
  ≤ C_{b,σ,T} ∫_0^t E sup_{u∈[0,s]} |X_u − X̄_u|² ds + C_{b,σ,T} ∫_0^t E |X̄_s − X̄_{\underline{s}}|² ds.
Now

X̄_s − X̄_{\underline{s}} = b(X̄_{\underline{s}})(s − \underline{s}) + σ(X̄_{\underline{s}})(W_s − W_{\underline{s}})   (7.20)

so that, using Step 1 (for the Euler scheme) and the fact that W_s − W_{\underline{s}} and X̄_{\underline{s}} are independent,

E |X̄_s − X̄_{\underline{s}}|² ≤ C_{b,σ} ( (T/n)² E b²(X̄_{\underline{s}}) + E σ²(X̄_{\underline{s}}) · E (W_s − W_{\underline{s}})² )
  ≤ C_{b,σ} ( 1 + E sup_{t∈[0,T]} |X̄_t|² ) ( (T/n)² + T/n )
  ≤ C_{b,σ,T} ( 1 + E X_0² ) (T/n).
By (7.20) and the linear growth of b and σ,

sup_{t∈[0,T]} |X̄_t − X̄_{\underline{t}}| ≤ C_{b,σ} ( 1 + sup_{t∈[0,T]} |X̄_t| ) ( T/n + sup_{t∈[0,T]} |W_t − W_{\underline{t}}| )

so that

‖ sup_{t∈[0,T]} |X̄_t − X̄_{\underline{t}}| ‖_2 ≤ C_{b,σ} ‖ ( 1 + sup_{t∈[0,T]} |X̄_t| ) ( T/n + sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ) ‖_2.

Now, note that if U and V are two real-valued random variables, the Schwarz Inequality implies

‖ U V ‖²_2 = ‖ U² V² ‖_1 ≤ ‖ U² ‖_2 ‖ V² ‖_2 = ‖ U ‖²_4 ‖ V ‖²_4.

Combining this with the Minkowski Inequality in both resulting terms yields

‖ sup_{t∈[0,T]} |X̄_t − X̄_{\underline{t}}| ‖_2 ≤ C_{b,σ} ( 1 + ‖ sup_{t∈[0,T]} |X̄_t| ‖_4 ) ( T/n + ‖ sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ‖_4 ).

Now, as already mentioned in the first remark that follows Theorem 7.2,

‖ sup_{t∈[0,T]} |W_t − W_{\underline{t}}| ‖_4 ≤ C_W √( T (1 + log n)/n ),

which completes the proof… if we admit that ‖ sup_{t∈[0,T]} |X̄_t| ‖_4 ≤ C_{b,σ,T} ( 1 + ‖X_0‖_4 ). ♦
7.3 Non-asymptotic Deviation Inequalities for the Euler Scheme 289
Remarks. • The proof in the general L^p framework follows exactly the same lines, except that one replaces Doob's Inequality for a continuous (local) martingale (M_t)_{t≥0} by the so-called Burkholder–Davis–Gundy Inequality (see e.g. [251]), which holds for every exponent p > 0 (only in the continuous setting):

∀ t ≥ 0,  c_p ‖ ⟨M⟩_t^{1/2} ‖_p ≤ ‖ sup_{s∈[0,t]} |M_s| ‖_p ≤ C_p ‖ ⟨M⟩_t^{1/2} ‖_p = C_p ‖ ⟨M⟩_t ‖_{p/2}^{1/2},
where c p , C p are positive real constants only depending on p. This general setting is
developed in full detail in Sect. 7.8 (in the one-dimensional case to alleviate notation).
• In some so-called mean-reverting situations one may even get boundedness over
t ∈ (0, +∞).
The aim of this section is to establish non-asymptotic deviation inequalities for the Euler scheme in order to provide confidence intervals for the Monte Carlo method. We recall for convenience several important notations. Let ‖A‖ = ( Tr(A A*) )^{1/2} denote the Frobenius norm of A ∈ M(d, q, R) (i.e. the canonical Euclidean norm on R^{dq}) and let |||A||| = sup_{|x|=1} |Ax| be the operator norm of A (with respect to the canonical Euclidean norms on R^d and R^q). Note that |||A||| ≤ ‖A‖.
We still consider the Brownian diffusion process solution to (7.1) on a probability space (Ω, A, P), with the same Lipschitz regularity assumptions made on the drift b and the diffusion coefficient σ. The q-dimensional driving Brownian motion is still denoted by W and the (augmented) natural filtration (F_t)_{t∈[0,T]} is still defined by (7.3). Furthermore, as the functions b : [0, T] × R^d → R^d and σ : [0, T] × R^d → M(d, q, R) satisfy a Lipschitz assumption in x uniformly in t ∈ [0, T], we define
and

[σ]_Lip = sup_{t∈[0,T], x≠y} ‖σ(t, x) − σ(t, y)‖ / |x − y| < +∞.   (7.21b)
The definition of the Brownian Euler scheme with step T/n, starting at X_0, is unchanged but, to alleviate notation in this section, we will temporarily write X̄_k^n instead of X̄_{t_k^n}^n. So we have X̄_0^n = X_0 and
X̄_{k+1}^n = X̄_k^n + (T/n) b(t_k^n, X̄_k^n) + √(T/n) σ(t_k^n, X̄_k^n) Z_{k+1},  k = 0, …, n − 1,

where t_k^n = kT/n, k = 0, …, n, and Z_k = √(n/T) ( W_{t_k^n} − W_{t_{k−1}^n} ), k = 1, …, n, is an i.i.d. sequence of N(0; I_q)-distributed random vectors. When X_0 = x ∈ R^d, we may occasionally denote by (X̄_k^{n,x})_{0≤k≤n} the Euler scheme starting at x.
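The update x ↦ x + (T/n) b(t_k^n, x) + √(T/n) σ(t_k^n, x) z acts as a one-step Markov transition that is trivial to simulate. A minimal sketch (scalar case; the coefficients below are illustrative choices, not from the text):

```python
import numpy as np

def euler_operator(x, z, tk, b, sigma, h):
    """One-step Euler transition: x + h b(t_k, x) + sqrt(h) sigma(t_k, x) z."""
    return x + h * b(tk, x) + np.sqrt(h) * sigma(tk, x) * z

def transition(f, x, tk, b, sigma, h, n_mc, rng):
    """Monte Carlo evaluation of E f(one-step transition from x), Z ~ N(0; 1)."""
    z = rng.standard_normal(n_mc)
    return f(euler_operator(x, z, tk, b, sigma, h)).mean()

# Illustrative scalar coefficients (mean-reverting drift, unit diffusion)
b = lambda t, x: -x
sigma = lambda t, x: np.ones_like(x)
rng = np.random.default_rng(5)
val = transition(lambda y: y, 2.0, 0.0, b, sigma, 0.1, 200_000, rng)
print(val)   # applied to the identity: x + h b(t_k, x) = 2.0 - 0.2 = 1.8, up to MC noise
```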
The Euler scheme defines an R^d-valued Markov chain with transitions P_k(x, dy) = P( X̄_{k+1}^n ∈ dy | X̄_k^n = x ), k = 0, …, n − 1, reading on bounded or non-negative Borel functions f : R^d → R,

P_k(f)(x) = E( f(X̄_{k+1}^n) | X̄_k^n = x ) = E f( E_k(x, Z) ),  k = 0, …, n − 1,

where

E_k(x, z) = x + (T/n) b(t_k^n, x) + √(T/n) σ(t_k^n, x) z,  x ∈ R^d, z ∈ R^q, k = 0, …, n − 1,

denotes the Euler scheme operator and Z ~ N(0; I_q). Then, set for every k, ℓ ∈ {0, …, n − 1}, k < ℓ,
Proposition 7.3 Under the above Lipschitz assumptions (7.21a) and (7.21b) on b and σ, the transitions P_k, k = 0, …, n − 1, satisfy

[P_k]_Lip ≤ ( [ Id + (T/n) b(t_k^n, ·) ]²_Lip + (T/n) [ σ(t_k^n, ·) ]²_Lip )^{1/2}
  ≤ ( 1 + (T/n) ( C_{b,σ} + κ_b T/n ) )^{1/2},

where

C_{b,σ} = 2 [b]_Lip + [σ]²_Lip  and  κ_b = [b]²_Lip.   (7.22)
E | x − y + (T/n) ( b(t_k^n, x) − b(t_k^n, y) ) + √(T/n) ( σ(t_k^n, x) − σ(t_k^n, y) ) Z |²
  = | x − y + (T/n) ( b(t_k^n, x) − b(t_k^n, y) ) |² + (T/n) ‖ σ(t_k^n, x) − σ(t_k^n, y) ‖². ♦
The key property is the following classical exponential inequality for the Gaussian measure (the proof below is due to Ledoux in [195]).

dξ_t^x = −(1/2) ξ_t^x dt + dW_t,  ξ_0^x = x,

where W is a standard q-dimensional Brownian motion. One easily checks that this equation has a (unique) explicit solution on the whole of R_+, given by

ξ_t^x = x e^{−t/2} + e^{−t/2} ∫_0^t e^{s/2} dW_s,  t ∈ R_+.

In particular,

E ξ_t^x = x e^{−t/2}.
Using the Wiener isometry, we derive that the covariance matrix Σ_t^x of ξ_t^x is given by

Σ_t^x = e^{−t} ( E ∫_0^t e^{s/2} dW_s^k ∫_0^t e^{s/2} dW_s^ℓ )_{1≤k,ℓ≤q} = e^{−t} ( ∫_0^t e^s ds ) I_q = (1 − e^{−t}) I_q.

(The time covariance structure of the process can be computed likewise but is of no use in this proof.) As a consequence, for every Borel function g : R^q → R with polynomial growth,

Q_t g(x) := E g(ξ_t^x) = E g( x e^{−t/2} + √(1 − e^{−t}) Z ),  where Z ~ N(0; I_q).   (7.24)
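Representation (7.24) makes Q_t g directly simulatable without discretizing the Ornstein–Uhlenbeck dynamics. A short check against the closed-form first two moments in dimension q = 1 (Monte Carlo sizes are arbitrary):

```python
import numpy as np

def Qt_g(g, x, t, n_mc, rng):
    """Q_t g(x) = E g(x e^{-t/2} + sqrt(1 - e^{-t}) Z), Z ~ N(0; 1): formula (7.24), q = 1."""
    z = rng.standard_normal(n_mc)
    return g(x * np.exp(-t / 2) + np.sqrt(1.0 - np.exp(-t)) * z).mean()

rng = np.random.default_rng(6)
x, t = 1.5, 2.0
m1 = Qt_g(lambda u: u, x, t, 400_000, rng)        # exact value: x e^{-t/2}
m2 = Qt_g(lambda u: u * u, x, t, 400_000, rng)    # exact value: x^2 e^{-t} + 1 - e^{-t}
print(m1, m2)
```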
∇_x Q_t g(x) = e^{−t/2} E ∇g(ξ_t^x).   (7.25)

Lg : ξ ↦ Lg(ξ) = (1/2) ( Δg(ξ) − (ξ | ∇g(ξ)) ),

where Δg(ξ) = ∑_{1≤i≤q} g″_{x_i²}(ξ) denotes the Laplacian of g and ∇g(ξ) = ( g′_{x_1}(ξ), …, g′_{x_q}(ξ) )^t.

Now, if h and g are both twice differentiable with existing partial derivatives having polynomial growth, one has

E ( ∇g(Z) | ∇h(Z) ) = ∑_{k=1}^q ∫_{R^q} g′_{x_k}(z) h′_{x_k}(z) e^{−|z|²/2} dz/(2π)^{q/2}.

Since

∀ z = (z_1, …, z_q) ∈ R^q,  ∂/∂z_k ( h′_{x_k}(z) e^{−|z|²/2} ) = e^{−|z|²/2} ( h″_{x_k²}(z) − z_k h′_{x_k}(z) ),  k = 1, …, q,

integrations by parts in each of the above q integrals yield the following identity:

E ( ∇g(Z) | ∇h(Z) ) = −2 E ( g(Z) Lh(Z) ).   (7.27)
One also derives from (7.26) and the continuity of s ↦ E (Lg)(ξ_s^x) that

(∂/∂t) Q_t g(x) = Q_t Lg(x).

In particular, (∂/∂t) Q_t g|_{t=0}(x) = Q_0 Lg(x) = Lg(x). On the other hand, as Q_t g(x) has partial derivatives with polynomial growth,

lim_{s→0} ( Q_s(Q_t g)(x) − Q_t g(x) )/s = (∂/∂s) Q_s(Q_t g)(x)|_{s=0} = L Q_t g(x)
Let us first check that H_λ is well-defined by the above equality. Note that the function g is Lipschitz continuous since its gradient is bounded, with [g]_Lip ≤ ‖∇g‖_sup = sup_{ξ∈R^q} |∇g(ξ)|. It follows from (7.24) that, for every z ∈ R^q,

This ensures the existence of H_λ since E e^{a|Z|} < +∞ for every a ≥ 0. One shows likewise that, for every z ∈ R^q and every t ∈ R_+,

lim_{t→+∞} H_λ(t) = e^0 = 1.
Furthermore, one shows, still using the same arguments, that H_λ is differentiable over R_+ with a derivative given for every t ∈ R_+ by

H_λ′(t) = λ E ( e^{λ Q_t g(Z)} Q_t Lg(Z) )
  = λ E ( e^{λ Q_t g(Z)} L Q_t g(Z) )   by (7.28)
  = −(λ/2) E ( ∇_z( e^{λ Q_t g(z)} )|_{z=Z} | ∇ Q_t g(Z) )   by (7.27)
  = −(λ²/2) e^{−t} E ( e^{λ Q_t g(Z)} | Q_t ∇g(Z) |² )   by (7.25).

Consequently, as lim_{t→+∞} H_λ(t) = 1,

H_λ(t) = 1 − ∫_t^{+∞} H_λ′(s) ds
  ≤ 1 + (λ²/2) ‖∇g‖²_sup ∫_t^{+∞} e^{−s} E e^{λ Q_s g(Z)} ds
  = 1 + K ∫_t^{+∞} e^{−s} H_λ(s) ds   with K = (λ²/2) ‖∇g‖²_sup.

Iterating this inequality yields, for every m ≥ 0,

H_λ(t) ≤ ∑_{k=0}^m K^k e^{−kt}/k! + H_λ(0) K^{m+1} e^{−(m+1)t}/(m+1)!.

Letting m → +∞ finally yields

H_λ(t) ≤ e^{K e^{−t}} ≤ e^K = e^{(λ²/2) ‖∇g‖²_sup}.

One completes this step of the proof by applying the above inequality to the function g − E g(Z).
Step 3 (The Lipschitz continuous case). This step relies on an approximation technique which is closely related to sensitivity computation for options attached to non-regular payoffs (but in a situation where the Brownian motion plays the role of a

The Hessian of f_ε is also bounded (by a constant depending on ε, but this has no consequence on the inequality of interest). One concludes by Fatou's Lemma:

E e^{λ f(Z)} ≤ lim_{ε→0} E e^{λ f_ε(Z)} ≤ lim_{ε→0} e^{(λ²/2) ‖∇f_ε‖²_sup} ≤ e^{(λ²/2) [f]²_Lip}. ♦
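The resulting Gaussian concentration bound E e^{λ(f(Z) − E f(Z))} ≤ e^{λ²[f]²_Lip/2} can be checked empirically for a concrete Lipschitz function (the choice of f and λ below is illustrative):

```python
import numpy as np

# Check of E exp(lam (f(Z) - E f(Z))) <= exp(lam^2 [f]_Lip^2 / 2) for a
# 1-Lipschitz f on R^3: f(v) = sum_i |v_i| / sqrt(3), so [f]_Lip <= 1.
rng = np.random.default_rng(7)
z = rng.standard_normal((500_000, 3))
vals = np.abs(z).sum(axis=1) / np.sqrt(3.0)
lam = 0.8
lhs = np.exp(lam * (vals - vals.mean())).mean()
bound = np.exp(lam ** 2 / 2)
print(lhs, bound)   # lhs stays below the bound exp(0.32)
```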
We are now in a position to state the main result of this section and its application
to the design of confidence intervals (see also [100], where this result was proved by
a slightly different method).
Theorem 7.4 Assume |||σ|||_sup := sup_{(t,x)∈[0,T]×R^d} |||σ(t, x)||| < +∞. Then, there exists a positive decreasing sequence K(b, σ, T, n) ∈ (0, +∞), n ≥ 1, of real numbers such that, for every n ≥ 1 and every Lipschitz continuous function f : R^d → R,

∀ λ ∈ R,  E e^{λ ( f(X̄_T^n) − E f(X̄_T^n) )} ≤ e^{(λ²/2) |||σ|||²_sup [f]²_Lip K(b,σ,T,n)},   (7.29)

with K(b, σ, T, n) = e^{(C_{b,σ} + κ_b T/n) T} / C_{b,σ}.

³ In fact, for every z, u ∈ R^q, |u| = 1, one has |(∇h(z)|u)| = lim_{ε→0} ε^{−1} |h(z + εu) − h(z)| ≤ [h]_Lip, whereas |∇h(z)| = sup_{|u|=1} |(∇h(z)|u)|. Hence, ‖∇h‖_sup = sup_{z∈R^q} |∇h(z)| ≤ [h]_Lip. The reverse inequality is obvious since for every z, z′ ∈ R^q, |h(z) − h(z′)| = |(∇h(ζ)|z′ − z)| ≤ ‖∇h‖_sup |z′ − z|.
P( (1/M) ∑_{ℓ=1}^M f(X̄_T^{n,ℓ}) − E f(X̄_T^n) > ε ) ≤ e^{−λεM} ( E e^{λ ( f(X̄_T^n) − E f(X̄_T^n) )} )^M ≤ e^{−λεM + (λ²/2) M |||σ|||²_sup [f]²_Lip K(b,σ,T,n)}.

The function λ ↦ −λε + (λ²/2) |||σ|||²_sup [f]²_Lip K(b, σ, T, n) attains its minimum at

λ_min = ε / ( |||σ|||²_sup [f]²_Lip K(b, σ, T, n) )

so that, finally,

P( (1/M) ∑_{ℓ=1}^M f(X̄_T^{n,ℓ}) − E f(X̄_T^n) > ε ) ≤ e^{ − ε² M / ( 2 |||σ|||²_sup [f]²_Lip K(b,σ,T,n) ) }.   (7.30)
One easily derives from (7.30) confidence intervals, provided (upper-bounds of) |||σ|||_sup, [f]_Lip and K(b, σ, T, n) are known.

The main feature of the above inequality (7.30), beyond the fact that it holds for possibly unbounded Lipschitz continuous functions f, is that the right-hand upper-bound does not depend on the time step T/n of the Euler scheme. Consequently, we can design confidence intervals for Monte Carlo simulations based on Euler schemes uniformly in the time discretization step T/n.
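In practice, inverting the bound (7.30) in ε gives the half-width of a non-asymptotic confidence interval. A small sketch (the numerical values used for |||σ|||_sup, [f]_Lip and K are hypothetical placeholders):

```python
import numpy as np

def deviation_bound(eps, M, sig_sup, f_lip, K):
    """Right-hand side of (7.30) for deviation eps with M Monte Carlo samples."""
    return np.exp(-eps ** 2 * M / (2 * sig_sup ** 2 * f_lip ** 2 * K))

def ci_half_width(alpha, M, sig_sup, f_lip, K):
    """Smallest eps making the bound <= alpha, obtained by inverting (7.30)."""
    return np.sqrt(2 * sig_sup ** 2 * f_lip ** 2 * K * np.log(1.0 / alpha) / M)

# Hypothetical values for |||sigma|||_sup, [f]_Lip and K(b, sigma, T, n)
eps = ci_half_width(0.05, 100_000, 0.3, 1.0, 1.5)
print(eps, deviation_bound(eps, 100_000, 0.3, 1.0, 1.5))   # bound equals alpha = 0.05
```

Note that, by construction, the returned ε does not depend on the discretization step, only on M and on the structural constants.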
Doing so, we can design non-asymptotic confidence intervals when computing E f(X_T) by a Monte Carlo simulation. We know that the bias is due to the discretization scheme and only depends on the step T/n: under appropriate assumptions on b, σ and f (see Sect. 7.6 for the first order expansion of the weak error), one has

E f(X̄_T^n) = E f(X_T) + c_1/n^α + o(1/n^α).
Remark. Under the assumptions we make on b and σ, the Euler scheme converges a.s. and in every L^p space (provided X_0 lies in L^p) for the sup-norm over [0, T] toward the diffusion process X, so that we deduce a similar result for independent copies X^ℓ of the diffusion itself, namely

P( | (1/M) ∑_{ℓ=1}^M f(X_T^ℓ) − E f(X_T) | > ε ) ≤ 2 e^{ − ε² M / ( 2 |||σ|||²_sup [f]²_Lip K(b,σ,T) ) }

with K(b, σ, T) = e^{C_{b,σ} T} / C_{b,σ} and C_{b,σ} = 2 [b]_Lip + [σ]²_Lip.
Proof of Theorem 7.4. It follows from Proposition 7.4 that, for every Lipschitz continuous function f : R^d → R, the Euler scheme operator satisfies

∀ λ ∈ R,  E e^{λ ( f(E_k(x,Z)) − P_k f(x) )} ≤ e^{(λ²/2) (T/n) |||σ(t_k^n, x)|||² [f]²_Lip}

since the function z ↦ f(E_k(x, z)) is Lipschitz continuous from R^q to R with respect to the canonical Euclidean norm, with a Lipschitz coefficient upper-bounded by √(T/n) [f]_Lip |||σ(t_k^n, x)|||, where |||σ(t_k^n, x)||| denotes the operator norm of σ(t_k^n, x). Consequently, as |||σ|||_sup := sup_{(t,x)∈[0,T]×R^d} |||σ(t, x)|||,

∀ λ ∈ R,  E ( e^{λ f(X̄_{t_{k+1}^n}^n)} | F_{t_k^n} ) ≤ e^{ λ P_k f(X̄_{t_k^n}^n) + (λ²/2) |||σ|||²_sup [f]²_Lip (T/n) }.
Applying this inequality to P_{k+1,n} f and taking expectation on both sides then yields

∀ λ ∈ R,  E e^{λ P_{k+1,n} f(X̄_{t_{k+1}^n}^n)} ≤ E e^{λ P_{k,n} f(X̄_{t_k^n}^n)} e^{(λ²/2) (T/n) |||σ|||²_sup [P_{k+1,n} f]²_Lip}

with

[P_{k,n}]_Lip ≤ ∏_{ℓ=k}^{n−1} [P_ℓ]_Lip

(with the consistent convention that an empty product is equal to 1). Hence, owing to Proposition 7.3,

[P_{k,n}]_Lip ≤ ( 1 + C_{b,σ}^{(n)} T/n )^{(n−k)/2}  with  C_{b,σ}^{(n)} = C_{b,σ} + κ_b T/n,

so that

(T/n) ∑_{k=1}^n [P_{k,n} f]²_Lip = (T/n) ∑_{k=0}^{n−1} [P_{n−k,n} f]²_Lip ≤ ( ( 1 + C_{b,σ}^{(n)} T/n )^n − 1 ) / C_{b,σ}^{(n)} · [f]²_Lip
  ≤ (1/C_{b,σ}) e^{(C_{b,σ} + κ_b T/n) T} [f]²_Lip = K(b, σ, T, n) [f]²_Lip. ♦
e^{λY} ≤ (1/2) ( 1 + Y/A ) e^{λA} + (1/2) ( 1 − Y/A ) e^{−λA}.

(b) Deduce that, for every λ > 0,

E e^{λY} ≤ cosh(λA) ≤ e^{λ²A²/2}.
At this point we want to emphasize that introducing this supremum inside the
probability radically modifies the behavior of the Monte Carlo error and highlights
a strong dependence on the structural dimension of the simulation. To establish this
behavior, we will rely heavily on an argument from optimal vector quantization
theory developed in Chap. 5.
Let Lip_1(R^d, R) be the set of Lipschitz continuous functions from R^d to R with Lipschitz coefficient [f]_Lip ≤ 1.
Let X^ℓ : (Ω, A, P) → R^d, ℓ ≥ 1, be independent copies of an integrable d-dimensional random vector X with distribution denoted by P_X, independent of X for convenience. For every ω ∈ Ω and every M ≥ 1, let μ_M^X(ω) = (1/M) ∑_{ℓ=1}^M δ_{X^ℓ(ω)} denote the empirical measure associated to (X^ℓ(ω))_{ℓ≥1}. Then

W_1( μ_M^X(ω), P_X ) = sup_{f ∈ Lip_1(R^d,R)} ( (1/M) ∑_{ℓ=1}^M f(X^ℓ(ω)) − ∫_{R^d} f(ξ) P_X(dξ) )

(the absolute values can be removed without damage since f and −f simultaneously belong to Lip_1(R^d, R)). Now let us introduce the function defined for every ξ ∈ R^d by

f_{ω,M}(ξ) = min_{ℓ=1,…,M} |X^ℓ(ω) − ξ| ≥ 0.

It is clear from its very definition that f_{ω,M} ∈ Lip_1(R^d, R), owing to the elementary inequality

| min_{1≤i≤M} a_i − min_{1≤i≤M} b_i | ≤ max_{1≤i≤M} |a_i − b_i|.
The lower bound in the last line is just the optimal L¹-quantization error of the (distribution of the) random vector X at level M. It follows from Zador's Theorem (see the remark that follows Zador's Theorem 5.1.2 in Chap. 5 or [129]) that

lim inf_M M^{1/d} inf_{x∈(R^d)^M} ‖ X − X̂^x ‖_1 ≥ J_{1,d} ‖ ϕ_X ‖_{d/(d+1)},
where ϕ_X denotes the density of the nonsingular part of the distribution P_X of X with respect to the Lebesgue measure on R^d, if it exists. The constant J_{1,d} ∈ (0, +∞) is a universal constant and the pseudo-norm ‖ϕ_X‖_{d/(d+1)} is finite as soon as X ∈ L^{1+} = ∪_{η>0} L^{1+η}. Furthermore, it is clear that, as soon as the support of P_X is infinite, e_{1,M}(X) > 0. Combining these two inequalities, we deduce that, for non-purely singular distributions (i.e. such that ϕ_X ≢ 0),
lim inf_M M^{1/d} E ( sup_{f ∈ Lip_1(R^d,R)} ( (1/M) ∑_{ℓ=1}^M f(X^ℓ) − E f(X) ) ) > 0.
This illustrates that the strong law of large numbers/Monte Carlo method is not as
“dimension free” as is commonly admitted.
For recent results about the (non-asymptotic) behavior of E W_1(μ_M^X, P_X), we refer to [97]. It again emphasizes that the L^1-Wasserstein distance is not dimension free: in fact, the generic behavior of E W_1(μ_M^X, P_X) is M^{−1/d}, at least when d ≥ 3.
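A small numerical experiment (ours, a sketch that is not from the text) makes the M^{−1/d} lower bound tangible: the mean nearest-neighbor distance E min_ℓ |X_ℓ − ξ|, the quantity bounding W_1(μ_M^X, P_X) from below, decays in M with an exponent close to 1/d. The function name and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(d, M, n_test=500):
    """Monte Carlo estimate of E min_l |X_l - xi| for X, xi ~ U([0,1]^d):
    the optimal L^1-quantization error at level M lower-bounds W1(mu_M, P_X)."""
    cloud = rng.random((M, d))      # the M-sample defining the empirical measure
    test = rng.random((n_test, d))  # fresh points xi ~ P_X
    dists = np.linalg.norm(test[:, None, :] - cloud[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

rates = {}
for d in (1, 5):
    e_small, e_big = mean_nn_distance(d, 200), mean_nn_distance(d, 1600)
    # empirical decay exponent of the error in M: close to 1/d by Zador's Theorem
    rates[d] = np.log(e_small / e_big) / np.log(1600 / 200)
print(rates)
```

In dimension 1 the fitted exponent is close to 1, in dimension 5 it collapses towards 1/5: the Monte Carlo method is not as "dimension free" as commonly admitted once path- or distribution-wise distances are involved.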
then

  | E F( (X_t)_{t∈[0,T]} ) − E F( (X̄_t^n)_{t∈[0,T]} ) | ≤ [F]_Lip C_{b,σ,T} √( (1 + log n)/n )

and

  | E F( (X_t)_{t∈[0,T]} ) − E F( (X̄_t^n)_{t∈[0,T]} ) | ≤ C n^{−1/2},

where λ = 1 in the regular Lookback case and λ > 1 in the so-called "partial Lookback" case.
– Vanilla payoffs on supremum (like Calls and Puts) of the form

  h_T = φ( sup_{t∈[0,T]} X_t )  or  ψ( inf_{t∈[0,T]} X_t ),

where φ and ψ are Lipschitz continuous on R_+. In fact, such Asian payoffs are even continuous with respect to the pathwise L^2-norm, i.e. ||f||_{L_T^2} := ( ∫_0^T f(s)^2 ds )^{1/2}.
Throughout this section we will consider an autonomous diffusion just for nota-
tional convenience (b(t, . ) = b and σ(t, . ) = σ). The extension to general non-
autonomous SDE s of the form (7.1) is straightforward (in particular it adds no
further terms to the discretization scheme). The Milstein scheme was originally designed (see [214]) to produce a O(1/n)-error (in L^p), like the Euler scheme for deterministic ODEs. However, in the framework of SDEs, such a scheme is of higher (second) order compared to the Euler scheme. In one dimension, its definition is simple and it can easily be implemented, provided the diffusion coefficient σ
is differentiable (see below). In higher dimensions, several problems arise, of both
a theoretical and practical (simulability) nature, making its use more questionable,
especially when its performances are compared to the weak error rate satisfied by
the Euler scheme (see Sect. 7.6).
where we assume that the functions b and σ are twice differentiable with bounded
existing derivatives (hence Lipschitz continuous).
The starting idea is to expand the solution X_t^x for small t in order to "select" the terms which go to zero at most as fast as t when t → 0 (with respect to the L^2(P)-norm). Let us inspect the two integral terms successively. First,

  ∫_0^t b(X_s^x) ds = b(x) t + o(t)  as t → 0,

so that

  ∫_0^t σ(X_s^x) dW_s = σ(x) W_t + ∫_0^t ∫_0^s σ(X_u^x) σ'(X_u^x) dW_u dW_s + O_{L^2}(t^{3/2})     (7.32)
                      = σ(x) W_t + σσ'(x) ∫_0^t W_s dW_s + o_{L^2}(t) + O_{L^2}(t^{3/2})           (7.33)
                      = σ(x) W_t + (1/2) σσ'(x) (W_t^2 − t) + o_{L^2}(t)                           (7.34)

since, by Itô's formula, ∫_0^t W_s dW_s = (1/2)(W_t^2 − t).
The O_{L^2}(t^{3/2}) in (7.32) comes from the fact that u ↦ σ'(X_u^x) b(X_u^x) + (1/2) σ''(X_u^x) σ^2(X_u^x) is L^2(P)-bounded in the neighborhood of 0 (note that b and σ have at most linear growth and use Proposition 7.2). Consequently, using Itô's fundamental isometry, the Fubini–Tonelli Theorem and Proposition 7.2,

  E [ ( ∫_0^t ∫_0^s ( σ'(X_u^x) b(X_u^x) + (1/2) σ''(X_u^x) σ^2(X_u^x) ) du dW_s )^2 ]
    = E ∫_0^t ( ∫_0^s ( σ'(X_u^x) b(X_u^x) + (1/2) σ''(X_u^x) σ^2(X_u^x) ) du )^2 ds
    ≤ C (1 + x^4) ∫_0^t s^2 ds = (C/3) (1 + x^4) t^3.
The o_{L^2}(t) in Eq. (7.33) also follows from the combination of Itô's fundamental isometry (twice) and the Fubini–Tonelli Theorem, which yields

  E [ ( ∫_0^t ∫_0^s ( σσ'(X_u^x) − σσ'(x) ) dW_u dW_s )^2 ] = ∫_0^t ∫_0^s ε(u) du ds,

where ε(u) = E [ ( σσ'(X_u^x) − σσ'(x) )^2 ] → 0 as u → 0 by the Lebesgue dominated convergence Theorem. Finally note that, by scaling and using that E W_1^2 = 1 and E W_1^4 = 3,

  || W_t^2 − t ||_2 = t || W_1^2 − 1 ||_2 = √2 t,
so that the second term in the right-hand side of (7.34) is exactly of order one. Then, X_t^x expands as follows:

  X_t^x = x + b(x) t + σ(x) W_t + (1/2) σσ'(x) (W_t^2 − t) + o_{L^2}(t).
Using the Markov property of the diffusion, one can reproduce the above reasoning on each time step [t_k^n, t_{k+1}^n), given the value of the scheme at time t_k^n. This suggests to define the discrete time Milstein scheme (X̄_{t_k^n}^{mil,n})_{k=0,…,n} with step T/n as follows:

  X̄_0^{mil,n} = X_0,
  X̄_{t_{k+1}^n}^{mil,n} = X̄_{t_k^n}^{mil,n} + b( X̄_{t_k^n}^{mil,n} ) (T/n) + σ( X̄_{t_k^n}^{mil,n} ) √(T/n) Z_{k+1}^n
                           + (1/2) σσ'( X̄_{t_k^n}^{mil,n} ) (T/n) ( (Z_{k+1}^n)^2 − 1 ),     (7.35)

where

  Z_k^n = √(n/T) ( W_{t_k^n} − W_{t_{k-1}^n} ),  k = 1, …, n.
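The one-step recursion (7.35) translates directly into code. Below is a minimal NumPy sketch (function and argument names are ours, not the book's) for a scalar SDE dX = b(X)dt + σ(X)dW:

```python
import numpy as np

def milstein_1d(x0, b, sigma, dsigma, T, n, n_paths, rng):
    """Discrete-time Milstein scheme (7.35) for dX = b(X)dt + sigma(X)dW in 1D.
    dsigma is the derivative sigma' of the diffusion coefficient."""
    h = T / n
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n):
        z = rng.standard_normal(n_paths)          # the white noise Z_{k+1}^n
        x = (x + b(x) * h
               + sigma(x) * np.sqrt(h) * z
               + 0.5 * sigma(x) * dsigma(x) * h * (z * z - 1.0))
    return x
```

When σ is constant (so σ' ≡ 0) the last term vanishes and the recursion reduces to the Euler scheme, in line with Corollary 7.2 below.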
In this form, gathering the drift terms, it reads

  X̄_0^{mil,n} = X_0,
  X̄_{t_{k+1}^n}^{mil,n} = X̄_{t_k^n}^{mil,n} + ( b − (1/2) σσ' )( X̄_{t_k^n}^{mil,n} ) (T/n) + σ( X̄_{t_k^n}^{mil,n} ) √(T/n) Z_{k+1}^n
                           + (1/2) σσ'( X̄_{t_k^n}^{mil,n} ) (T/n) (Z_{k+1}^n)^2.     (7.36)

Like for the Euler scheme, the stepwise constant Milstein scheme is defined as

  X̃_t^{mil,n} := X̄_{t̲}^{mil,n},  t ∈ [0, T],     (7.37)

where t̲ := t_k^n for t ∈ [t_k^n, t_{k+1}^n). In what follows, when no ambiguity arises, we will often drop the superscript n in the notation of the Milstein scheme(s).
304 7 Discretization Scheme(s) of a Brownian Diffusion
By interpolating the above scheme between the discretization times, i.e. freezing the coefficients of the scheme, we define the continuous or genuine Milstein scheme with step T/n, with our standard notation t̲, by

  X̄_0^{mil} = X_0,
  X̄_t^{mil} = X̄_{t̲}^{mil} + ( b( X̄_{t̲}^{mil} ) − (1/2) σσ'( X̄_{t̲}^{mil} ) ) (t − t̲) + σ( X̄_{t̲}^{mil} ) (W_t − W_{t̲})
               + (1/2) σσ'( X̄_{t̲}^{mil} ) (W_t − W_{t̲})^2     (7.38)

for every t ∈ [0, T].
Example (Black–Scholes model). The discrete time Milstein scheme of a Black–Scholes model starting at x_0 > 0, with interest rate r and volatility σ > 0 over [0, T] reads

  X̄_{t_{k+1}^n}^{mil} = X̄_{t_k^n}^{mil} ( 1 + ( r − σ^2/2 ) (T/n) + σ √(T/n) Z_{k+1}^n + (σ^2/2) (T/n) (Z_{k+1}^n)^2 ).     (7.39)
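As a quick illustration (ours, not from the text), the Black–Scholes scheme (7.39) can be compared pathwise with the exact solution X_T = x_0 exp((r − σ^2/2)T + σW_T) built from the same Brownian increments; the L^1 strong error should then shrink roughly like 1/n, in line with the strong rate stated in the theorem below. Parameter values are illustrative.

```python
import numpy as np

def bs_milstein_strong_error(n, x0=100.0, r=0.15, sig=1.0, T=1.0,
                             n_paths=20000, seed=1):
    """L^1 strong error of the Black-Scholes Milstein scheme (7.39) at time T,
    computed against the closed-form solution driven by the same Brownian path."""
    rng = np.random.default_rng(seed)
    h = T / n
    z = rng.standard_normal((n_paths, n))
    x = np.full(n_paths, x0)
    for k in range(n):
        zk = z[:, k]
        x = x * (1 + (r - sig**2 / 2) * h + sig * np.sqrt(h) * zk
                 + sig**2 / 2 * h * zk**2)
    w_T = np.sqrt(h) * z.sum(axis=1)      # terminal value of the same Brownian path
    exact = x0 * np.exp((r - sig**2 / 2) * T + sig * w_T)
    return np.abs(x - exact).mean()
```

Doubling n twice (n = 10 then n = 40) should divide the observed error by roughly 4, the signature of an O(1/n) strong rate.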
The following theorem gives the rate of strong pathwise convergence of the Milstein scheme under slightly less stringent hypotheses than the usual C^1_Lip-assumptions on b and σ.
Theorem 7.5 (Strong L^p-rate for the Milstein scheme) (See e.g. [170]) Assume that b and σ are C^1 on R with bounded, α_b- and α_σ-Hölder continuous first derivatives respectively, α_b, α_σ ∈ (0, 1]. Let α = min(α_b, α_σ).
(a) Discrete time and genuine Milstein scheme. For every p ∈ (0, +∞), there exists a real constant C_{b,σ,T,p} > 0 such that, for every X_0 ∈ L^p(P), independent of the Brownian motion W, one has

  || max_{0≤k≤n} | X_{t_k^n} − X̄_{t_k^n}^{mil,n} | ||_p ≤ || sup_{t∈[0,T]} | X_t − X̄_t^{mil,n} | ||_p ≤ C_{b,σ,T,p} ( 1 + ||X_0||_p ) (T/n)^{(1+α)/2}.

(b) Stepwise constant Milstein scheme. As concerns the stepwise constant Milstein scheme (X̃_t^{mil,n})_{t∈[0,T]} defined by (7.37), one has (like for the Euler scheme!)

  || sup_{t∈[0,T]} | X_t − X̃_t^{mil,n} | ||_p ≤ C_{b,σ,T} ( 1 + ||X_0||_p ) √( (T/n)(1 + log n) ).
A detailed proof is provided in Sect. 7.8.8 in the case X 0 = x ∈ R and p ∈ [2, +∞).
Remarks. • As soon as the derivatives of b and σ are Hölder, the genuine Milstein scheme converges faster than the Euler scheme. Furthermore, if b and σ are C^1 with Lipschitz continuous derivatives, the L^p-convergence rate of the genuine Milstein scheme is, as expected, O(1/n).
• This O(1/n)-rate obtained when b and σ are (bounded and) Lipschitz continuous should be compared to the weak rate investigated in Sect. 7.6, which is also O(1/n) for the approximation of E f(X_T) by E f(X̄_T^n) (under slightly more stringent assumptions). Comparing the performances of both approaches should rely on numerical evidence and depends on the specified diffusion, function or step parameter.
• Claim (b) of the theorem shows that the stepwise constant Milstein scheme does not converge faster than the stepwise constant Euler scheme (except, of course, at the discretization times t_k^n). To convince yourself of this, just think of the Brownian motion itself: in that case, b ≡ 0 and σ ≡ 1, so that σσ' ≡ 0 and both the stepwise constant Milstein and Euler schemes coincide and consequently converge at the same rate! As a consequence, since it is the only simulable version of the Milstein scheme when dealing with path-dependent functionals, its use for the approximate computation (by Monte Carlo simulation) of expectations of the form E F( (X_t)_{t∈[0,T]} ) should not provide better results than implementing the standard stepwise constant Euler scheme, as briefly described in Sect. 7.4.
By contrast, some functionals of the continuous Euler scheme can be simulated
in an exact way: this is the purpose of Chap. 8 devoted to diffusion bridges. This is
not the case for the Milstein scheme.
Exercises. 1. A.s. convergence of the Milstein scheme. Derive from these L p -rates
an a.s. rate of convergence for the Milstein scheme.
2. Euler scheme of the Ornstein–Uhlenbeck process. We consider on the one hand the sequence of random variables recursively defined by

  Y_{k+1} = Y_k ( 1 + μΔ ) + σ √Δ Z_{k+1},  k ≥ 0,  Y_0 = 0,

where μ > 0 and Δ > 0 are positive real numbers, and on the other hand the Ornstein–Uhlenbeck process solution to the SDE

  dX_t = μ X_t dt + σ dW_t,  X_0 = 0.

Set t_k = kΔ, k ≥ 0, and Z_k = ( W_{t_k} − W_{t_{k-1}} ) / √Δ, k ≥ 1.
(a) Show that, for every k > 0,

  E |Y_k|^2 = (σ^2/μ) · ( (1 + μΔ)^{2k} − 1 ) / ( 2 + μΔ ).

(b) Show that, for every k ≥ 0, X_{t_{k+1}} = e^{μΔ} X_{t_k} + σ e^{μ t_{k+1}} ∫_{t_k}^{t_{k+1}} e^{−μs} dW_s.
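The closed form in (a) is easy to check numerically. The sketch below (ours, with illustrative parameter values; Δ denotes the step) simulates the recursion and compares the empirical second moment with the formula:

```python
import numpy as np

# Euler scheme of the (unstable, mu > 0) Ornstein-Uhlenbeck recursion
mu, sig, delta, kmax, n_paths = 0.5, 0.3, 0.01, 200, 100000
rng = np.random.default_rng(2)
y = np.zeros(n_paths)
for _ in range(kmax):
    y = y * (1 + mu * delta) + sig * np.sqrt(delta) * rng.standard_normal(n_paths)

emp = (y ** 2).mean()
# closed form of E|Y_k|^2 from question (a)
theory = sig**2 / mu * ((1 + mu * delta) ** (2 * kmax) - 1) / (2 + mu * delta)
print(emp, theory)
```

The two values should agree up to the Monte Carlo error of order n_paths^{−1/2}.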
then the discrete time Milstein scheme with step T/n starting at x > 0 defined by (7.36) satisfies

  ∀ k ∈ {0, …, n},  X̄_{t_k^n}^{mil} > 0  a.s.
[Hint: decompose the right-hand side of the Milstein scheme (7.35) as the sum of three terms, one being a square.]
(b) Show that, under Assumption (7.40), the genuine Milstein scheme starting at X_0 > 0 also satisfies X̄_t^{mil} > 0 for every t ∈ [0, T].
(c) Show that the Milstein scheme of a Black–Scholes model is positive if and only
if σ 2 ≤ 2r .
5. Milstein scheme of a positive CIR process. We consider a CIR process solution to the SDE

  dX_t = κ ( a − X_t ) dt + ϑ √(X_t) dW_t,  X_0 = x > 0,  t ≥ 0,

where κ, a, ϑ > 0 and ϑ^2/(2κa) ≤ 1. Such an SDE has a unique strong solution (X_t^x)_{t≥0} living in (0, +∞) (see [183], Proposition 6.2.4, p. 130). We set Y_t = e^{κt} X_t^x, t ≥ 0.
(a) Show that the Milstein scheme of the process (Yt )t≥0 is a.s. positive.
(b) Deduce a way to devise a positive simulable discretization scheme for the CIR
process (the convergence properties of this process are not requested).
6. Extension. We return to the setting of the above exercise 2. We want to relax the
assumption on the drift b. We still assume that the conditions (i)–(ii) in (7.40) on the
function σ are satisfied and we formally set Yt = eρt X tx , t ∈ [0, T ], for some ρ > 0.
Show that (X tx )t∈[0,T ] is a solution to the above SDE if and only if (Yt )t∈[0,T ] is a
solution to a stochastic differential equation
Let us examine what the above expansion in small time becomes if the SDE (7.31) is modified to be driven by a 2-dimensional standard Brownian motion W = (W^1, W^2) (still with d = 1).
The same reasoning as that carried out above shows that the first order term σ(x)σ'(x) ∫_0^t W_s dW_s in (7.33) becomes

  Σ_{i,j=1,2} σ_i(x) σ_j'(x) ∫_0^t W_s^i dW_s^j.

In particular, when i ≠ j, this term involves the two Lévy areas ∫_0^t W_s^1 dW_s^2 and ∫_0^t W_s^2 dW_s^1, linearly combined with, a priori, different coefficients.
0
If we return to the general setting of a d-dimensional diffusion driven by a q-dimensional standard Brownian motion, with (differentiable) drift b : R^d → R^d and diffusion coefficient σ = [σ_{ij}] : R^d → M(d, q, R), elementary though tedious computations lead us to define the (discrete time) Milstein scheme with step T/n as follows:

  X̄_0^{mil} = X_0,
  X̄_{t_{k+1}^n}^{mil} = X̄_{t_k^n}^{mil} + (T/n) b( X̄_{t_k^n}^{mil} ) + σ( X̄_{t_k^n}^{mil} ) ΔW_{t_{k+1}^n}
                         + Σ_{1≤i,j≤q} ∂σ_{.i} σ_{.j}( X̄_{t_k^n}^{mil} ) ∫_{t_k^n}^{t_{k+1}^n} ( W_s^i − W_{t_k^n}^i ) dW_s^j,     (7.41)
  k = 0, …, n − 1,

where ΔW_{t_{k+1}^n} := W_{t_{k+1}^n} − W_{t_k^n} = √(T/n) Z_{k+1}^n, σ_{.i}(x) denotes the i-th column of the matrix σ and, for every i, j ∈ {1, …, q},

  ∀ x = (x^1, …, x^d) ∈ R^d,  ∂σ_{.i} σ_{.j}(x) := Σ_{ℓ=1}^d ( ∂σ_{.i}/∂x^ℓ )(x) σ_{ℓj}(x) ∈ R^d.     (7.42)
Remark. A more synthetic way to memorize this quantity is to note that it is the
Jacobian matrix of the vector σ.i (x) applied to the vector σ. j (x).
The ability of simulating such a scheme entirely relies on the exact simulation of

  ( W_{t_k^n}^1 − W_{t_{k-1}^n}^1, …, W_{t_k^n}^q − W_{t_{k-1}^n}^q, ∫_{t_{k-1}^n}^{t_k^n} ( W_s^i − W_{t_{k-1}^n}^i ) dW_s^j, i, j = 1, …, q, i ≠ j ),  k = 1, …, n

(with step t = T/n). To the best of our knowledge no convincing method (i.e. with a reasonable computational cost) to achieve this has been proposed so far in the literature (see however [102]).
The discrete time Milstein scheme can be successfully simulated when the tensor terms ∂σ_{.i} σ_{.j} commute since in that case the Lévy areas disappear, as shown in the following proposition.
Proposition 7.5 (Commuting case) (a) If the tensor terms ∂σ_{.i} σ_{.j} commute, i.e. if ∂σ_{.i} σ_{.j} = ∂σ_{.j} σ_{.i} for every i, j ∈ {1, …, q}, then the Lévy areas cancel and the last term of the scheme (7.41) reduces to the simulable quantity

  (1/2) Σ_{1≤i,j≤q} ∂σ_{.i} σ_{.j}( X̄_{t_k^n}^{mil} ) ΔW_{t_{k+1}^n}^i ΔW_{t_{k+1}^n}^j,  X̄_0^{mil} = X_0,     (7.43)

since (W_t^i)_{t∈[0,T]} and (W_t^j)_{t∈[0,T]} are independent if i ≠ j. The result follows noting that, for every i ∈ {1, …, q},

  ∫_{t_k^n}^{t_{k+1}^n} ( W_s^i − W_{t_k^n}^i ) dW_s^i = (1/2) ( ( ΔW_{t_{k+1}^n}^i )^2 − T/n ).  ♦
The rate of convergence of the Milstein scheme is formally the same in higher
dimension as it is in one dimension: Theorem 7.5 remains true with a d-dimensional
diffusion driven by a q-dimensional Brownian motion provided b : Rd → Rd and
σ : Rd → M(d, q, R) are C 2 with bounded existing partial derivatives.
Theorem 7.6 (Multi-dimensional discrete time Milstein scheme) (See e.g. [170]) Assume that b and σ are C^1 on R^d with bounded, α_b- and α_σ-Hölder continuous existing partial derivatives, respectively. Let α = min(α_b, α_σ). Then, for every p ∈
(0, +∞), there exists a real constant C p,b,σ,T > 0 such that for every X 0 ∈ L p (P),
independent of the q-dimensional Brownian motion W , the error bound established
in Theorem 7.5(a) remains valid.
However, one should keep in mind that this strong rate result does not prejudge
the ability to simulate this scheme. In a way, the most important consequence of this
theorem concerns the Euler scheme.
Corollary 7.2 (Discrete time Euler scheme with constant diffusion coefficient) If the drift b ∈ C^2(R^d, R^d) with bounded existing partial derivatives and if σ is constant, then the discrete time Euler and Milstein schemes coincide. As a consequence, the strong rate of convergence of the discrete time Euler scheme is, in that very specific case, O(1/n). Namely, for every p ∈ (0, +∞), there exists a real constant C_{p,b,σ,T} > 0 such that

  || max_{0≤k≤n} | X_{t_k^n} − X̄_{t_k^n}^n | ||_p ≤ C_{p,b,σ,T} (T/n) ( 1 + ||X_0||_p ).
7.6 Weak Error for the Discrete Time Euler Scheme (I)
In fact, the first inequality in this chain turns out to be highly non-optimal since it switches from a weak error (the difference only depending on the respective (marginal) distributions of X_T and X̄_T^n) to a pathwise approximation X_T − X̄_T^n (4). To improve asymptotically the other two inequalities is hopeless since it has been shown (see the remark and comments in Sect. 7.8.6 for a brief introduction) that, under appropriate assumptions, X_T − X̄_T^n satisfies a central limit theorem at rate n^{−1/2} with non-zero asymptotic variance. In fact, one even has a functional form of this central limit theorem for the whole process (X_t − X̄_t^n)_{t∈[0,T]} (see [155, 178]) still at rate n^{−1/2}. As a consequence, a rate faster than n^{−1/2} in an L^1 sense would not be
long been known and has been extensively investigated in the literature, starting with
the two seminal papers [270] by Talay–Tubaro and [24] by Bally–Talay, leading
to an expansion of the time discretization error at an arbitrary accuracy when b
and σ are smooth enough as functions. Two main settings have been investigated:
when the function f is itself smooth and when the diffusion is “regularizing”, i.e.
propagates the regularizing effect of the driving Brownian motion (5 ) thanks to a non-
degeneracy assumption on the diffusion coefficient σ, typically uniform ellipticity for
σ (see below) or weaker assumption such as parabolic Hörmander (hypo-ellipticity)
assumption (see [139] (6 )).
The same kind of question has been investigated for specific classes of path-
dependent functionals F of the diffusion X with some applications to exotic option
pricing (see Chap. 8). These results, though partial and specific, often emphasize that
the resulting weak error rate is the same as the strong rate derived from the Milstein
scheme for these types of functionals, especially when they are Lipschitz continuous
with respect to the sup-norm.
As a second step, we will show how the Richardson–Romberg extrapolation method provides a first procedure to take advantage of such weak rates, before fully exploiting higher-order weak error expansions in Chap. 9 with the multilevel paradigm.
(4) A priori X̄_T^n and X_T could be defined on different probability spaces: recall the approximation of the Black–Scholes model by binomial models (see [59]).
(5) The regularizing property of the Brownian motion should be understood in its simplest form as
(6) If the column vectors (σ_{.j}(x))_{1≤j≤q} span R^d for every x ∈ R^d, this condition is satisfied. If not, the same spanning property is requested, after adding enough iterated Lie brackets of the coefficients of the SDE—re-written in a Stratonovich sense—and including the drift this time.
We adopt the notations of the former Sect. 7.1, except that we still consider, for convenience, an autonomous version of the SDE, with initial condition x ∈ R^d.
The notations (X_t^x)_{t∈[0,T]} and (X̄_t^{n,x})_{t∈[0,T]} respectively denote the diffusion and the Euler scheme with step T/n of the diffusion starting at x at time 0 (the superscript n will often be dropped).
The first result is the simplest result on the weak error, obtained under less stringent
assumptions on b and σ.
Theorem 7.7 (see [270]) Assume b and σ are four times continuously differentiable on R^d with bounded existing partial derivatives (this implies that b and σ are Lipschitz continuous). Assume f : R^d → R is four times differentiable with polynomial growth as well as its existing partial derivatives. Then, for every x ∈ R^d,

  E f(X_T^x) − E f(X̄_T^{n,x}) = O(1/n)  as n → +∞.     (7.44)
  P_t g(x) := E g(X_t^x),  t ≥ 0,  x ∈ R.

On the other hand, the Euler scheme with step T/n starting at x ∈ R, denoted by (X̄_{t_k^n}^x)_{0≤k≤n}, is a discrete time homogeneous Markov chain whose transition reads, on Borel test functions g,

  P̄ g(x) = E g( x + σ(x) √(T/n) Z ),  Z ∼ N(0; 1).

To be more precise, this means for the diffusion process that, for any Borel test function g,

  ∀ s, t ≥ 0,  P_t g(x) = E [ g(X_{s+t}) | X_s = x ] = E g(X_t^x).
Now, let us consider the four times differentiable function f. One gets, by the semi-group property satisfied by both P_t and P̄,

  E f(X_T^x) − E f(X̄_T^x) = Σ_{k=1}^n [ P_{kT/n}( P̄^{n−k} f )(x) − P_{(k−1)T/n}( P̄^{n−(k−1)} f )(x) ]
                           = Σ_{k=1}^n P_{(k−1)T/n}( ( P_{T/n} − P̄ )( P̄^{n−k} f ) )(x).     (7.45)

We first deal with the difference P_{T/n} f(x) − P̄ f(x). On the one hand, by Itô's formula applied to f(X_t^x),

  P_{T/n} f(x) = f(x) + (1/2) E ∫_0^{T/n} ( f'' σ^2 )(X_s^x) ds,

where we use that f' σ is bounded to ensure that the stochastic integral is a true martingale.
A Taylor expansion of f( x + σ(x) √(T/n) Z ) at x yields for the transition of the Euler scheme (after taking expectation)

  P̄ f(x) = E f( X̄_{T/n}^x )
          = f(x) + f'(x) σ(x) √(T/n) E Z + (1/2) ( f'' σ^2 )(x) (T/n) E Z^2
            + f^{(3)}(x) ( σ^3(x)/3! ) (T/n)^{3/2} E Z^3 + E [ f^{(4)}(ξ) ( σ^4(x)/4! ) (T/n)^2 Z^4 ],

for a ξ ∈ ( x, X̄_{T/n}^x ),

          = f(x) + (T/(2n)) ( f'' σ^2 )(x) + E [ f^{(4)}(ξ) ( σ^4(x)/4! ) (T/n)^2 Z^4 ]
          = f(x) + (T/(2n)) ( f'' σ^2 )(x) + ( σ^4(x) T^2 / (4! n^2) ) c_n(f),

where |c_n(f)| ≤ 3 ||f^{(4)}||_sup. This follows from the well-known facts that E Z = E Z^3 = 0, E Z^2 = 1 and E Z^4 = 3. Consequently,

  P_{T/n} f(x) − P̄ f(x) = (1/2) E ∫_0^{T/n} [ ( f'' σ^2 )(X_s^x) − ( f'' σ^2 )(x) ] ds − ( σ^4(x) T^2 / (4! n^2) ) c_n(f),     (7.46)
so that

  ∀ s ≥ 0,  sup_{x∈R} | E ( f'' σ^2 )(X_s^x) − ( f'' σ^2 )(x) | ≤ (s/2) γ ||σ^2||_sup  with  γ := || ( f'' σ^2 )'' ||_sup ≤ C_σ max_{k=2,3,4} ||f^{(k)}||_sup,

where C_σ depends on ||σ^{(k)}||_sup, k = 0, 1, 2, but not on f (with the standard convention σ^{(0)} = σ).
Consequently, we derive from (7.46) that

  | P_{T/n}(f)(x) − P̄(f)(x) | ≤ C_{σ,T} max_{k=2,3,4} ||f^{(k)}||_sup (T/n)^2.
The fact that the first derivative f' is not involved in these bounds is an artificial consequence of our assumption that b ≡ 0.
Now we switch to the second task. In order to plug this estimate into (7.45), we need to control the first four derivatives of P̄^ℓ f, ℓ = 1, …, n, uniformly with respect to ℓ and n. In fact, we do not directly need to control the first derivative since b ≡ 0 but we will do it as a first example, illustrating the method in a simpler case.
Let us consider again the generic function f and its four bounded derivatives. First,

  ( P̄ f )'(x) = E [ f'( x + σ(x) √(T/n) Z ) ( 1 + σ'(x) √(T/n) Z ) ],

so that

  |( P̄ f )'(x)| ≤ ||f'||_sup || 1 + σ'(x) √(T/n) Z ||_1 ≤ ||f'||_sup || 1 + σ'(x) √(T/n) Z ||_2
               = ||f'||_sup ( E [ 1 + 2 σ'(x) √(T/n) Z + σ'(x)^2 (T/n) Z^2 ] )^{1/2}
               = ||f'||_sup ( 1 + σ'(x)^2 T/n )^{1/2} ≤ ||f'||_sup ( 1 + σ'(x)^2 T/(2n) )

since √(1+u) ≤ 1 + u/2, u ≥ 0. Hence, we derive by induction that, for every n ≥ 1 and every ℓ ∈ {1, …, n},

  ∀ x ∈ R,  |( P̄^ℓ f )'(x)| ≤ ||f'||_sup ( 1 + σ'(x)^2 T/(2n) )^ℓ ≤ ||f'||_sup e^{σ'(x)^2 T/2},

so that

  ||( P̄^ℓ f )'||_sup ≤ ||f'||_sup e^{||σ'||^2_sup T/2}.
Now, for the second derivative,

  ( P̄ f )''(x) = E [ f''( x + σ(x) √(T/n) Z ) ( 1 + σ'(x) √(T/n) Z )^2 + f'( x + σ(x) √(T/n) Z ) σ''(x) √(T/n) Z ].

On the one hand,

  E [ f''( x + σ(x) √(T/n) Z ) ( 1 + σ'(x) √(T/n) Z )^2 ] ≤ ||f''||_sup ( 1 + σ'(x)^2 T/n )

and, on the other hand, using that f'( x + σ(x) √(T/n) Z ) = f'(x) + f''(ζ) σ(x) √(T/n) Z for some ζ, owing to the fundamental theorem of Calculus, we get

  E [ f'( x + σ(x) √(T/n) Z ) σ''(x) √(T/n) Z ] ≤ ||f''||_sup ||σ σ''||_sup (T/n) E(Z^2)

since E Z = 0. Hence

  ∀ x ∈ R,  |( P̄ f )''(x)| ≤ ||f''||_sup ( 1 + ( ||σ σ''||_sup + ||σ'||^2_sup ) T/n ),
Carrying this analysis up to the fourth derivative and plugging the resulting bounds into (7.45) finally yields

  | E f(X_T^x) − E f(X̄_T^x) | ≤ Σ_{k=1}^n C_{σ,T} (T/n)^2 max_{1≤ℓ≤n, i=1,…,4} ||( P̄^ℓ f )^{(i)}||_sup ≤ C_{σ,T,f} T^2/n.

More generally, one says that the weak error admits an expansion at order R + 1 if

  (E_{R+1}) ≡ E f(X̄_T^{n,x}) − E f(X_T^x) = Σ_{k=1}^R c_k/n^k + O(n^{−(R+1)}).     (7.47)
then the conclusion of (a) holds true for any bounded Borel function.
Method of proof for (a). The idea is to rely on the PDE method, i.e. considering the solution of the parabolic partial differential equation

  ( ∂/∂t + L )(u)(t, x) = 0,  u(T, · ) = f,

where L, defined by

  (L g)(x) = g'(x) b(x) + (1/2) g''(x) σ^2(x),

denotes the infinitesimal generator of the diffusion. It follows from the Feynman–Kac formula that (under some appropriate regularity assumptions)

  u(0, x) = E f(X_T^x).
Formally (in one dimension), the Feynman–Kac formula can be established as follows (see Theorem 7.11 for a more rigorous proof). Assuming that u is regular enough, i.e. C^{1,2}([0, T] × R), to apply Itô's formula (see Sect. 12.8), then

  f(X_T^x) = u(T, X_T^x)
           = u(0, x) + ∫_0^T ( ∂/∂t + L )(u)(t, X_t^x) dt + ∫_0^T ∂_x u(t, X_t^x) σ(X_t^x) dW_t
           = u(0, x) + ∫_0^T ∂_x u(t, X_t^x) σ(X_t^x) dW_t

since u satisfies the above parabolic PDE. Assuming that ∂_x u has polynomial growth, so that the stochastic integral is a true martingale, we can take expectation. Then, we introduce domino differences based on the Euler scheme as follows:

  E f(X̄_T^{n,x}) − E f(X_T^x) = E u(T, X̄_T^{n,x}) − u(0, x)
                               = Σ_{k=1}^n E [ u(t_k^n, X̄_{t_k^n}^{n,x}) − u(t_{k-1}^n, X̄_{t_{k-1}^n}^{n,x}) ].
The core of the proof consists in applying Itô's formula (to u, b and σ) to show that

  E [ u(t_k^n, X̄_{t_k^n}^{n,x}) − u(t_{k-1}^n, X̄_{t_{k-1}^n}^{n,x}) ] = E φ(t_k^n, X_{t_k^n}^x) (1/n^2) + o(1/n^2)

for some continuous function φ. Then, one derives (after new computations) that

  E f(X_T^x) − E f(X̄_T^{n,x}) = (1/n) ∫_0^T E φ(t, X_t^x) dt + o(1/n).
This approach will be developed in full detail in Sect. 7.8.9 where the theorem is
rigorously proved.
Remarks. • The weak error expansion, alone or combined with strong error rates (in quadratic mean), is a major tool to fight against the bias induced by discretization schemes. This aspect is briefly illustrated below, where we first introduce the standard Richardson–Romberg extrapolation for diffusions. Wide classes of multilevel estimators, especially designed to efficiently "kill" the bias while controlling the variance, are introduced and analyzed in Chap. 9.
• A parametrix approach is presented in [172] which naturally leads to the higher-
order expansion stated in Claim (b) of Theorem 7.8. The expansion is derived, in
a uniformly elliptic framework, from an approximation result of the density of the
diffusion by that of the Euler scheme.
• For extensions to less regular f —namely tempered distributions—in the uniformly
elliptic case, we refer to [138].
• The last important information about weak errors from the practitioner's viewpoint is that the weak error induced by the Milstein scheme has exactly the same order as that of the Euler scheme, i.e. O(1/n). So the Milstein scheme seems at first glance to be of little interest as long as one wishes to compute E f(X_T). However, we will see in Chap. 9 that, combined with its fast strong convergence, it leads to unbiased-like multilevel estimators.
Let V be a vector space of continuous functions with linear growth satisfying (E_2) (the case of non-continuous functions is investigated in [225]). Let f ∈ V. For notational convenience, in view of what follows, we set W^{(1)} = W and X^{(1)} = X (including X_0^{(1)} = X_0 ∈ L^2(Ω, A, P)) throughout this section. A regular Monte Carlo simulation based on M independent copies (X̄_T^{(1)})^m, m = 1, …, M, of the Euler scheme X̄_T^{(1)} with step T/n induces the following global (squared) quadratic error:
  E ( f(X_T) − (1/M) Σ_{m=1}^M f( (X̄_T^{(1)})^m ) )^2 = ( E f(X_T) − E f(X̄_T^{(1)}) )^2
                                                        + E ( E f(X̄_T^{(1)}) − (1/M) Σ_{m=1}^M f( (X̄_T^{(1)})^m ) )^2
                                                      = c_1^2/n^2 + Var( f(X̄_T^{(1)}) )/M + O(n^{−3}).     (7.48)
The above formula is the bias-variance decomposition of the approximation error
of the Monte Carlo estimator. The resulting quadratic error bound (7.48) emphasizes
that this estimator does not take full advantage of the above expansion (E2 ).
Richardson–Romberg extrapolation
To take advantage of the expansion, we will perform a Richardson–Romberg extrap-
olation. In this framework (originally introduced in the seminal paper [270]), one
considers the strong solution X (2) of a “copy” of Eq. (7.1), driven by a second Brow-
nian motion W (2) and starting from X 0(2) (independent of W (2) with the same distri-
bution as X 0(1) ) both defined on the same probability space (, A, P) on which W (1)
and X 0(1) are defined. One may always consider such a Brownian motion by enlarging
the probability space if necessary.
Then we consider the Euler scheme with a twice smaller step T/(2n), denoted by X̄^{(2)}, associated to X^{(2)}, i.e. starting from X_0^{(2)} with Brownian increments built from W^{(2)}.
We assume from now on that (E3 ) (as defined in (7.47)) holds for f to get more
precise estimates but the principle would work with a function simply satisfying
(E2 ). Then combining the two time discretization error expansions related to X̄ (1)
and X̄ (2) , respectively, we get
  E f(X_T) = E [ 2 f(X̄_T^{(2)}) − f(X̄_T^{(1)}) ] + c_2/(2 n^2) + O(n^{−3}).
Then, the new global (squared) quadratic error becomes

  E ( f(X_T) − (1/M) Σ_{m=1}^M [ 2 f( (X̄_T^{(2)})^m ) − f( (X̄_T^{(1)})^m ) ] )^2.
In this approach the gain of one order on the bias (switch from c1 n −1 to c2 n −2 )
induces an increase of the variance by 5 and of the complexity by (approximately) 3.
– Consistent simulation (of the Brownian increments). If W^{(i)} = W and X_0^{(i)} = X_0 ∈ L^2(P), i = 1, 2, then

  Var( 2 f(X̄_T^{(2)}) − f(X̄_T^{(1)}) ) → Var( 2 f(X_T) − f(X_T) ) = Var( f(X_T) )  as n → +∞.
ℵ Practitioner’s corner
T
From a practical viewpoint, one first simulates an Euler scheme with step 2n using
(2)
a white Gaussian noise (Z k )k≥1 , then one simulates the Gaussian white noise Z (1)
of the Euler scheme with step Tn by setting
(2) (2)
Z 2k + Z 2k−1
Z k(1) = √ , k ≥ 1.
2
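A minimal NumPy sketch of this recipe (ours; function names and parameter defaults are illustrative) builds the coarse noise by aggregating the fine one exactly as above, then averages 2f(fine) − f(coarse) over M paths for a discounted call payoff:

```python
import numpy as np

def rr_call_price(x0=100.0, K=100.0, r=0.15, sig=1.0, T=1.0,
                  n=10, M=200000, seed=3):
    """Richardson-Romberg estimator of E f(X_T) based on Euler schemes with
    steps T/n and T/(2n) driven by consistent Brownian increments:
    Z_k^(1) = (Z_{2k-1}^(2) + Z_{2k}^(2)) / sqrt(2)."""
    rng = np.random.default_rng(seed)
    z2 = rng.standard_normal((M, 2 * n))            # fine white noise
    z1 = (z2[:, ::2] + z2[:, 1::2]) / np.sqrt(2)    # aggregated coarse noise

    def euler_terminal(z, h):
        x = np.full(M, x0)
        for k in range(z.shape[1]):
            x = x * (1 + r * h + sig * np.sqrt(h) * z[:, k])
        return x

    payoff = lambda x: np.exp(-r * T) * np.maximum(x - K, 0.0)
    fine = euler_terminal(z2, T / (2 * n))
    coarse = euler_terminal(z1, T / n)
    return (2 * payoff(fine) - payoff(coarse)).mean()
```

With the parameters of the numerical example below, the output should fall near the Black–Scholes reference premium 42.9571, the gap being the residual bias plus the Monte Carlo noise.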
Note that such a volatility σ = 100% per year is equivalent to a 4 year maturity with volatility 50% (or 16 years with volatility 25%). A high interest rate is chosen accordingly. We consider the Euler scheme of this SDE with step h = T/n, namely

  X̄_{t_{k+1}} = X̄_{t_k} ( 1 + r h + σ √h Z_{k+1} ),  X̄_0 = X_0,

and compute the premium

  C_0 = e^{−rT} E (X_T − K)_+

using a crude Monte Carlo simulation and an RR extrapolation with consistent Brownian increments as described in the above practitioner's corner. The Black–Scholes reference premium is C_0^{BS} = 42.9571 (see Sect. 12.2). To equalize the complexity of the crude simulation and its RR extrapolated counterpart, we use M sample paths, M = 2^k, k = 14, …, 26 (2^14 ≈ 16 000 and 2^26 ≈ 67 000 000), for the RR-extrapolated simulation and 3M for the crude Monte Carlo simulation. Figure 7.1 depicts the obtained results. The simulation is large enough so that, at its end, the observed error is approximately representative of the residual bias. The blue line (crude MC) shows the magnitude of the theoretical bias (close to 1.5) for such a coarse step whereas the red line highlights the improvement brought by the Richardson–Romberg extrapolation: the residual bias is approximately equal to 0.07, i.e. the bias is divided by more than 20.

Fig. 7.1 Call option in a B–S model priced by an Euler scheme. σ = 1.00, r = 0.15, T = 1, K = X_0 = 100. Step h = 1/10 (n = 10). x-axis: simulation complexity 3M with M = 2^k, k = 13, …, 26; y-axis: estimated premium. Black line: reference price; red line: (consistent) Richardson–Romberg extrapolation of the Euler scheme of size M; blue line: crude Monte Carlo simulation of size 3M of the Euler scheme (equivalent complexity)
Throughout this section, we recall that | · | always denotes the canonical Euclidean norm on R^d and ||A|| = √( Tr(A A*) ) = ( Σ_{ij} a_{ij}^2 )^{1/2} denotes the Frobenius norm of A = [a_{ij}] ∈ M(d, q, R). We will extensively use that, for every u = (u^1, …, u^d) ∈ R^d, |Au| ≤ ||A|| |u|, which is an easy consequence of the Schwarz Inequality (in particular, |||A||| ≤ ||A||). To alleviate notation, we will drop the exponent n in t_k^n = kT/n.
In the non-quadratic case, Doob's Inequality is not sufficient to carry out the proof: we need the more general Burkholder–Davis–Gundy Inequality. Furthermore, to get some real constants having the announced behavior as a function of T, we will also need to use the generalized Minkowski Inequality (see [143]), established below in a probabilistic framework.
The generalized Minkowski Inequality: For any (bi-measurable) process X = (X_t)_{t≥0} and for every p ∈ [1, ∞),

  ∀ T ∈ [0, +∞],  || ∫_0^T X_t dt ||_p ≤ ∫_0^T ||X_t||_p dt.     (7.50)
T T
Proof. First note that, owing to the triangle inequality 0 X s ds ≤ 0 |X s |ds, one
may assume without loss of generality that X Is a non-negative process. If p = 1 the
inequality is obvious. Assume now p ∈ (1, +∞). Let T ∈ (0, +∞) and let Y be a
non-negative random variable defined on the same probability space as (X t )t∈[0,T ] .
Let M > 0. It follows from Fubini’s Theorem and Hölder’s Inequality that
T T
E (X s ∧ M)ds Y = E (X s ∧ M)Y ) ds
0 0
T
≤ X s ∧ M p Y q ds with q = p
p−1
0
T
= Y q X s ∧ M p ds.
0
The above inequality applied with Y := ( ∫_0^T X_s ∧ M ds )^{p−1} yields

  E ( ∫_0^T X_s ∧ M ds )^p ≤ ( E ( ∫_0^T X_s ∧ M ds )^p )^{1−1/p} ∫_0^T ||X_s||_p ds.

If E ( ∫_0^T X_s ∧ M_n ds )^p = 0 for any sequence M_n ↑ +∞, the inequality is obvious since, by Beppo Levi's monotone convergence Theorem, ∫_0^T X_s ds = 0 P-a.s. Otherwise, there is a sequence M_n ↑ +∞ such that all these integrals are non-zero (and finite since X is bounded by M_n and T is finite). Consequently, one can divide both sides of the former inequality to obtain

  ∀ n ≥ 1,  ( E ( ∫_0^T X_s ∧ M_n ds )^p )^{1/p} ≤ ∫_0^T ||X_s||_p ds.
Now letting Mn ↑ +∞ yields exactly the expected result owing to two successive
applications of Beppo Levi’s monotone convergence Theorem, the first with respect
to the Lebesgue measure ds, the second with respect to dP. When T = +∞, the result
follows by Fatou’s Lemma by letting T go to infinity in the inequality obtained for
finite T . ♦
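A quick numerical illustration (ours, with illustrative parameters) of (7.50): take X_t = |B_t| for a Brownian motion B, p = 3, and discretize time on a grid. On the grid the inequality is the finite-sum Minkowski inequality, so it must hold for the empirical moments as well:

```python
import numpy as np

# Check || int_0^T X_t dt ||_p  <=  int_0^T ||X_t||_p dt  for X_t = |B_t|, p = 3,
# with time integrals replaced by Riemann sums on a grid of n_steps points.
rng = np.random.default_rng(4)
n_paths, n_steps, T, p = 50000, 200, 1.0, 3.0
dt = T / n_steps
incs = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
B = np.cumsum(incs, axis=1)          # Brownian paths on the grid
X = np.abs(B)

lhs = (np.mean(np.abs(X.sum(axis=1) * dt) ** p)) ** (1 / p)   # ||∫ X_t dt||_p
rhs = np.sum((np.mean(X ** p, axis=0)) ** (1 / p) * dt)        # ∫ ||X_t||_p dt
print(lhs, rhs)
```

The left-hand side comes out strictly smaller, with visible slack, since |B_t| is far from deterministic.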
isfies 5
  || sup_{[0,T]} | ∫_0^t H_s dW_s | ||_p ≤ C_{d,p}^{BDG} || ( ∫_0^T ||H_t||^2 dt )^{1/2} ||_p.     (7.53)
Proposition 7.6 (a) For every p ∈ (0, +∞), there exists a positive real constant κ_p > 0 (increasing in p), such that, if b, σ satisfy the linear growth assumption |b(t, x)| + ||σ(t, x)|| ≤ C (1 + |x|), then every strong solution of Equation (7.1) starting from the finite random vector X_0 (if any) satisfies

  ∀ p ∈ (0, +∞),  || sup_{s∈[0,T]} |X_s| ||_p ≤ 2 e^{κ_p C T} ( 1 + ||X_0||_p ).

(b) The same conclusion holds under the same assumptions for the continuous Euler scheme with step T/n, n ≥ 1, as defined by (7.8), with the same constant κ_p (which does not depend on n), i.e.

  ∀ p ∈ (0, +∞), ∀ n ≥ 1,  || sup_{s∈[0,T]} |X̄_s^n| ||_p ≤ 2 e^{κ_p C T} ( 1 + ||X_0||_p ).
Remarks. • Note that this proposition makes no assumption either on the existence of strong solutions to (7.1) or on some (strong) uniqueness assumption on a time interval or the whole real line. Furthermore, the inequality is meaningless when X_0 ∉ L^p(P).
• The case p ∈ (0, 2) will be discussed at the end of the proof.
hand,
$$\sup_{s\in[0,t]}|X_{s\wedge\tau_N}| \le |X_0| + \int_0^{t\wedge\tau_N} |b(s,X_s)|\,ds + \sup_{s\in[0,t]}\Big|\int_0^{s\wedge\tau_N}\sigma(u,X_u)\,dW_u\Big|.$$
It follows from successive applications of both the regular and the generalized
Minkowski (7.50) Inequalities and of the BDG Inequality (7.53) that
$$f_N(t) \le \|X_0\|_p + \int_0^t \big\|\mathbf{1}_{\{s\le\tau_N\}}\,b(s,X_s)\big\|_p\,ds + C^{BDG}_{d,p}\,\Big\|\Big(\int_0^{t\wedge\tau_N}\|\sigma(s,X_s)\|^2\,ds\Big)^{\frac12}\Big\|_p$$
$$\le \|X_0\|_p + \int_0^t \big\|b(s\wedge\tau_N,X_{s\wedge\tau_N})\big\|_p\,ds + C^{BDG}_{d,p}\,\Big\|\Big(\int_0^{t}\|\sigma(s\wedge\tau_N,X_{s\wedge\tau_N})\|^2\,ds\Big)^{\frac12}\Big\|_p$$
$$\le \|X_0\|_p + C\int_0^t \big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C^{BDG}_{d,p}\,C\,\Big\|\Big(\int_0^{t}\big(1+|X_{s\wedge\tau_N}|\big)^2\,ds\Big)^{\frac12}\Big\|_p$$
$$\le \|X_0\|_p + C\int_0^t \big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C^{BDG}_{d,p}\,C\,\Big(\sqrt t + \Big\|\Big(\int_0^{t}|X_{s\wedge\tau_N}|^2\,ds\Big)^{\frac12}\Big\|_p\Big),$$
where we used in the last line the Minkowski inequality on $L^2([0,T],dt)$ endowed with its usual Hilbert norm. Hence, the $L^p(\mathbb{P})$-Minkowski Inequality and the obvious identity $\big\|\sqrt{\,\cdot\,}\big\|_p = \big\|\cdot\big\|_{p/2}^{1/2}$ yield
$$f_N(t) \le \|X_0\|_p + C\int_0^t \big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C^{BDG}_{d,p}\,C\,\Big(\sqrt t + \Big\|\int_0^{t}|X_{s\wedge\tau_N}|^2\,ds\Big\|_{\frac p2}^{\frac12}\Big).$$
Now, as $\frac p2\ge1$, the generalized $L^{\frac p2}(\mathbb{P})$-Minkowski Inequality (7.50) yields
7 This holds true for any hitting time of an open set by an Ft -adapted càd process.
7.8 Further Proofs and Results 327
$$f_N(t) \le \|X_0\|_p + C\int_0^t \big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C^{BDG}_{d,p}\,C\,\Big(\sqrt t + \Big(\int_0^{t}\big\|\,|X_{s\wedge\tau_N}|^2\big\|_{\frac p2}\,ds\Big)^{\frac12}\Big)$$
$$= \|X_0\|_p + C\int_0^t \big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C^{BDG}_{d,p}\,C\,\Big(\sqrt t + \Big(\int_0^{t}\big\|X_{s\wedge\tau_N}\big\|_p^2\,ds\Big)^{\frac12}\Big),$$
where
$$\psi(t) = \|X_0\|_p + C\big(t + C^{BDG}_{d,p}\sqrt t\,\big).$$
Step 3 (Conclusion when p ∈ [2, +∞)). Applying the above generalized Gronwall
Lemma to the functions f N and ψ defined in Step 1 leads to
$$\forall\,t\in[0,T],\qquad \Big\|\sup_{s\in[0,t]}|X_{s\wedge\tau_N}|\,\Big\|_p = f_N(t) \le 2\,e^{\big(2+C(C^{BDG}_{d,p})^2\big)Ct}\Big(\|X_0\|_p + C\big(t + C^{BDG}_{d,p}\sqrt t\,\big)\Big).$$
Then Fatou’s Lemma implies, by letting N go to infinity, that, for every t ∈ [0, T ],
$$\Big\|\sup_{s\in[0,t]}|X_s|\,\Big\|_p \le \limsup_{N}\Big\|\sup_{s\in[0,t]}|X_{s\wedge\tau_N}|\,\Big\|_p \le 2\,e^{\big(2+C(C^{BDG}_{d,p})^2\big)Ct}\Big(\|X_0\|_p + C\big(t + C^{BDG}_{d,p}\sqrt t\,\big)\Big),\tag{7.55}$$
which yields, using that $\max(\sqrt u,u)\le e^u$, $u\ge0$,
$$\forall\,t\in[0,T],\qquad \Big\|\sup_{s\in[0,t]}|X_s|\,\Big\|_p \le 2\,e^{\big(2+C(C^{BDG}_{d,p})^2\big)Ct}\Big(\|X_0\|_p + e^{Ct} + e^{C(C^{BDG}_{d,p})^2 t}\Big)$$
$$\le 2\,e^{\big(2+C(C^{BDG}_{d,p})^2\big)Ct}\Big(e^{Ct} + e^{C(C^{BDG}_{d,p})^2 t}\Big)\big(1+\|X_0\|_p\big).$$
One derives the existence of a positive real constant $\kappa_{p,d}$, only depending on the BDG real constant $C^{BDG}_{p,d}$, such that
$$\forall\,t\in[0,T],\qquad \Big\|\sup_{s\in[0,t]}|X_s|\,\Big\|_p \le \kappa_{p,d}\,e^{\kappa_{p,d}\,C(C+1)t}\big(1+\|X_0\|_p\big).$$
Step 4 (Conclusion when p ∈ (0, 2)). The extension can be carried out as follows:
for every x ∈ Rd , the diffusion process starting at x, denoted by (X tx )t∈[0,T ] , satisfies
the following two obvious facts:
– The process $X^x$ is $\mathcal F^W_t$-adapted, where $\mathcal F^W_t := \sigma(\mathcal N_{\mathbb P}, W_s, s\le t)$, $t\in[0,T]$.
– If $X_0$ is an $\mathbb{R}^d$-valued random vector defined on $(\Omega,\mathcal A,\mathbb P)$, independent of $W$, then the process $X=(X_t)_{t\in[0,T]}$ starting from $X_0$ satisfies $X_t = X_t^{X_0}$.
Now
$$\mathbb{E}\sup_{t\in[0,T]}|X_t|^p = \int_{\mathbb{R}^d}\mathbb{P}_{X_0}(dx)\,\mathbb{E}\sup_{t\in[0,T]}|X_t^x|^p \le 2^{(p-1)_+}(\kappa_{2,d})^p\,e^{p\kappa_{2,d}C(C+1)T}\big(1+\mathbb{E}|X_0|^p\big),$$
so that, up to enlarging $\kappa_{2,d}$,
$$\Big\|\sup_{t\in[0,T]}|X_t|\,\Big\|_p \le \kappa_{2,d}\,e^{\kappa_{2,d}CT}\big(1+\|X_0\|_p\big).$$
As concerns the SDE (7.1) itself, the same reasoning can be carried out only
if (7.1) satisfies an existence and uniqueness assumption for any starting value X 0 .
(b) (Euler scheme) The proof follows the same lines as above. One starts from the
integral form (7.9) of the continuous Euler scheme and one introduces for every
n, N ≥ 1 the stopping times
τ̄ N = τ̄ Nn := inf t ∈ [0, T ] : | X̄ tn − X 0 | > N .
To adapt the above proof to the continuous Euler scheme, we just need to note that
$$\forall\,s\in[0,t],\quad 0\le \underline s\le s \quad\text{and}\quad \big\|\bar X^n_{\underline s}\big\|_p \le \Big\|\sup_{s\in[0,t\wedge\bar\tau_N]}\big|\bar X_s\big|\,\Big\|_p.\qquad\diamond$$
Lemma 7.4 Let $p\ge1$ and let $(Y_t)_{t\in[0,T]}$ be an ($\mathbb{R}^d$-valued) Itô process defined on $(\Omega,\mathcal A,\mathbb P)$ by
$$Y_t = Y_0 + \int_0^t G_s\,ds + \int_0^t H_s\,dW_s,\qquad t\in[0,T],$$
$$\forall\,s,t\in[0,T],\quad \|Y_t-Y_s\|_p \le C^{BDG}_{d,p}\sup_{t\in[0,T]}\|H_t\|_p\,|t-s|^{\frac12} + \sup_{t\in[0,T]}\|G_t\|_p\,|t-s|$$
$$\le \Big(C^{BDG}_{d,p}\sup_{t\in[0,T]}\|H_t\|_p + \sqrt T\sup_{t\in[0,T]}\|G_t\|_p\Big)|t-s|^{\frac12}.$$
$$= \sup_{t\in[0,T]}\|G_t\|_p\,(t-s) + C^{BDG}_p\sup_{u\in[0,T]}\|H_u\|_p\,(t-s)^{\frac12}.$$
The second inequality simply follows from $|t-s|\le \sqrt T\,|t-s|^{\frac12}$.
(b) If $p\in[1,2]$, one simply uses that
$$\Big\|\int_s^t \|H_u\|^2\,du\Big\|_{\frac p2}^{\frac12} \le |t-s|^{\frac12}\,\Big\|\sup_{u\in[0,T]}\|H_u\|^2\Big\|_{\frac p2}^{\frac12} = |t-s|^{\frac12}\,\Big\|\sup_{u\in[0,T]}\|H_u\|\Big\|_p.$$
Remark. If $H$, $G$ and $Y$ are defined on the non-negative real line $\mathbb{R}_+$ and $\displaystyle\sup_{t\in\mathbb{R}_+}\big(\|G_t\|_p + \|H_t\|_p\big)<+\infty$, then $t\mapsto Y_t$ is locally $\frac12$-Hölder on $\mathbb{R}_+$. If $H\equiv0$, the process is Lipschitz continuous on $[0,T]$.
Combining the above result for Itô processes (see Sect. 12.8) with those of Propo-
sition 7.6 leads to the following result on pathwise regularity of the diffusion solution
to (7.1) (when it exists) and the related Euler schemes.
Proposition 7.7 If the coefficients b and σ satisfy the linear growth assump-
tion (7.54) over [0, T ] × Rd with a real constant C > 0, then the Euler scheme
with step Tn and any strong solution of (7.1) satisfy for every p ≥ 1,
$$\forall\,n\ge1,\ \forall\,s,t\in[0,T],\qquad \|X_t-X_s\|_p + \big\|\bar X^n_t-\bar X^n_s\big\|_p \le \kappa_{p,d}\,C\,e^{\kappa_{p,d}C(C+1)T}\big(1+\sqrt T\,\big)\big(1+\|X_0\|_p\big)\,|t-s|^{\frac12}$$
since
$$\max\Big(\sup_{t\in[0,T]}\|G_t\|_p,\ \sup_{t\in[0,T]}\|H_t\|_p\Big) \le C\Big(1+\Big\|\sup_{t\in[0,T]}|X_t|\,\Big\|_p\Big).$$
$$\varepsilon_t := X_t - \bar X^n_t,\qquad t\in[0,T],$$
satisfies
$$\varepsilon_t = \int_0^t\big(b(s,X_s)-b(\underline s,\bar X_{\underline s})\big)\,ds + \int_0^t\big(\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big)\,dW_s$$
so that
$$\sup_{s\in[0,t]}|\varepsilon_s| \le \int_0^t\big|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big|\,ds + \sup_{s\in[0,t]}\Big|\int_0^s\big(\sigma(u,X_u)-\sigma(\underline u,\bar X_{\underline u})\big)\,dW_u\Big|.$$
It follows from the (regular) Minkowski Inequality on L p (P), · p , BDG Inequal-
ity (7.53) and the generalized Minkowski inequality (7.50) that
$$f(t) \le \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C^{BDG}_{d,p}\,\Big\|\Big(\int_0^t\big|\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big|^2\,ds\Big)^{\frac12}\Big\|_p$$
$$= \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C^{BDG}_{d,p}\,\Big\|\int_0^t\big|\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big|^2\,ds\Big\|_{\frac p2}^{\frac12}$$
$$\le \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C^{BDG}_{d,p}\,\Big(\int_0^t\big\|\,\big|\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big|^2\big\|_{\frac p2}\,ds\Big)^{\frac12}$$
$$= \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C^{BDG}_{d,p}\,\Big(\int_0^t\big\|\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big\|_p^2\,ds\Big)^{\frac12}.$$
Let us temporarily set $\tau^X_t := 1 + \Big\|\sup_{s\in[0,t]}|X_s|\,\Big\|_p$, $t\in[0,T]$. Using Assumption $(\mathcal H^\beta_T)$ (see (7.14)) and the Minkowski Inequality on the $\big(L^2([0,T],dt),|\,.\,|_{L^2(dt)}\big)$ spaces, we get
$$f(t) \le C_{b,\sigma,T}\bigg(\int_0^t\Big(\big(1+\|X_s\|_p\big)(s-\underline s)^\beta + \big\|X_s-\bar X_{\underline s}\big\|_p\Big)\,ds$$
$$\qquad + C^{BDG}_{d,p}\Big(\int_0^t\Big(\big(1+\|X_s\|_p\big)(s-\underline s)^\beta + \big\|X_s-\bar X_{\underline s}\big\|_p\Big)^2 ds\Big)^{\frac12}\bigg)$$
$$\le C_{b,\sigma,T}\bigg(\int_0^t\Big(\big(1+\|X_s\|_p\big)(s-\underline s)^\beta + \big\|X_s-\bar X_{\underline s}\big\|_p\Big)\,ds$$
$$\qquad + C^{BDG}_{d,p}\Big[\Big(1+\Big\|\sup_{s\in[0,T]}|X_s|\,\Big\|_p\Big)\Big(\int_0^t(s-\underline s)^{2\beta}ds\Big)^{\frac12} + \Big(\int_0^t\big\|X_s-\bar X_{\underline s}\big\|_p^2\,ds\Big)^{\frac12}\Big]\bigg)$$
$$\le C_{b,\sigma,T}\bigg(\tau^X_t\int_0^t(s-\underline s)^\beta ds + \int_0^t\big\|X_s-\bar X_{\underline s}\big\|_p\,ds + C^{BDG}_{d,p}\Big[\tau^X_t\Big(\int_0^t(s-\underline s)^{2\beta}ds\Big)^{\frac12} + \Big(\int_0^t\big\|X_s-\bar X_{\underline s}\big\|_p^2\,ds\Big)^{\frac12}\Big]\bigg).$$
Noting that
$$\big\|X_s-\bar X_{\underline s}\big\|_p \le \|X_s-X_{\underline s}\|_p + \big\|X_{\underline s}-\bar X_{\underline s}\big\|_p = \|X_s-X_{\underline s}\|_p + \|\varepsilon_{\underline s}\|_p \le \|X_s-X_{\underline s}\|_p + f(s),$$
we derive
$$f(t) \le C_{b,\sigma,T}\Big(\int_0^t f(s)\,ds + \sqrt2\,C^{BDG}_{d,p}\Big(\int_0^t f(s)^2\,ds\Big)^{\frac12}\Big) + \psi(t)\tag{7.56}$$
where
$$\psi(t) := \big(1+C^{BDG}_{d,p}\big)\,\tau^X_t\,\Big(\frac Tn\Big)^{\beta} + \int_0^t\|X_s-X_{\underline s}\|_p\,ds + \sqrt2\,C^{BDG}_{d,p}\Big(\int_0^t\|X_s-X_{\underline s}\|_p^2\,ds\Big)^{\frac12}.\tag{7.57}$$
Now, we will use the L p (P)-path regularity of the diffusion process X established
in Proposition 7.7 to provide an upper-bound for the function ψ. We first note that, as
b and σ satisfy (HTβ ) with a positive real constant Cb,σ,T , they also satisfy the linear
growth assumption (7.54) with
$$C'_{b,\sigma,T} := C_{b,\sigma,T} + \sup_{t\in[0,T]}\big(|b(t,0)| + \|\sigma(t,0)\|\big) < +\infty$$
since $b(\,.\,,0)$ and $\sigma(\,.\,,0)$ are $\beta$-Hölder, hence bounded on $[0,T]$. Set for convenience $\widetilde C_{b,\sigma,T} = C'_{b,\sigma,T}\big(C'_{b,\sigma,T}+1\big)$. It follows from (7.57) and Proposition 7.7 that
$$\psi(t) \le \big(1+C^{BDG}_{d,p}\big)\,\tau^X_t\,\Big(\frac Tn\Big)^{\beta} + \kappa_{p,d}\,C'_{b,\sigma,T}\,e^{\kappa_{p,d}\widetilde C_{b,\sigma,T}\,t}\big(1+\|X_0\|_p\big)\big(1+\sqrt t\,\big)\Big(t+\sqrt2\,C^{BDG}_{d,p}\sqrt t\,\Big)\Big(\frac Tn\Big)^{\frac12}$$
$$\le \big(1+C^{BDG}_{d,p}\big)\,\tau^X_t\,\Big(\frac Tn\Big)^{\beta} + \sqrt2\Big(1+\sqrt2\,C^{BDG}_{d,p}\Big)(1+t)^2\,\kappa_{p,d}\,C'_{b,\sigma,T}\,e^{\kappa_{p,d}\widetilde C_{b,\sigma,T}\,t}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\frac12},$$
√ √ B DG √ √ B DG
where we used the inequality (1 + t)(t + 2Cd, p t) ≤ 2(1 + 2Cd, p )
(1 + t) which can be established by inspecting the cases 0 ≤ t ≤ 1 and t ≥ 1.
2
where $\tilde\kappa_{p,d} = 1+\kappa_{p,d}$. Hence, plugging the right-hand side of (7.59) into the above inequality satisfied by $\psi$, we derive the existence of a real constant $\tilde\kappa_{p,d}>0$, only depending on $C^{BDG}_{d,p}$, such that
$$\psi(t) \le \tilde\kappa_{p,d}\,e^{\tilde\kappa_{p,d}\widetilde C_{b,\sigma,T}\,T}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\beta} + \tilde\kappa_{p,d}\,C'_{b,\sigma,T}\,e^{\tilde\kappa_{p,d}\widetilde C_{b,\sigma,T}\,t}\big(1+\|X_0\|_p\big)(1+t)^2\Big(\frac Tn\Big)^{\frac12}$$
$$\le \kappa_{p,d}\,C'_{b,\sigma,T}\,e^{\kappa_{p,d}\big(1+C'_{b,\sigma,T}\big)^2 t}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\beta\wedge\frac12},$$
where we used $(1+u)^2\le 2e^u$, $u\ge0$, in the second line. The real constant $\kappa_{p,d}$ only depends on $\tilde\kappa_{p,d}$, hence on $C^{BDG}_{d,p}$. Finally, one plugs this bound into (7.58) at time $T$ to get the announced upper-bound.
Step 3 ($p\in(0,2)$). It remains to deal with the case $p\in(0,2)$. In fact, once we observe that Assumption $(\mathcal H^\beta_T)$ ensures the global existence and uniqueness of the solution $X$ of (7.1) starting from a given random variable $X_0$ (independent of $W$), it can be handled following the approach developed in Step 4 of the proof of Proposition 7.6. We leave the details to the reader. $\diamond$
then for every $p\in[1,\infty)$, there exists a real constant $\kappa_{p,d}>0$ such that
$$\forall\,n\ge1,\qquad \Big\|\sup_{t\in[0,T]}\big|X_t-\bar X^n_t\big|\,\Big\|_p \le \kappa_{p,d}\,C'_{b,\sigma,T}\,e^{\kappa_{p,d}\big(1+C'_{b,\sigma,T}\big)^2 T}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\beta\wedge\frac12}$$
where $C'_{b,\sigma,T} := C_{b,\sigma,T} + \sup_{t\in[0,T]}\big(|b(t,0)|+\|\sigma(t,0)\|\big) < +\infty$.
The aim of this section is to prove in full generality Claim (b) of Theorem 7.2. We
recall that the stepwise constant Euler scheme is defined by
$$\forall\,t\in[0,T],\qquad \widetilde X_t := \bar X_{\underline t},$$
On the other hand, using the extended Hölder Inequality: for every $p\in(0,+\infty)$,
$$\forall\,r,s\ge1\ \text{with}\ \frac1r+\frac1s=1,\qquad \|fg\|_p \le \|f\|_{rp}\,\|g\|_{sp},$$
with $r=s=2$ (other choices are possible), leads to
$$\Big\|\sup_{t\in[0,T]}\big|\sigma(\underline t,\bar X_{\underline t})(W_t-W_{\underline t})\big|\,\Big\|_p \le \Big\|\sup_{t\in[0,T]}\big\|\sigma(\underline t,\bar X_{\underline t})\big\|\,\sup_{t\in[0,T]}|W_t-W_{\underline t}|\,\Big\|_p$$
$$\le \Big\|\sup_{t\in[0,T]}\big\|\sigma(\underline t,\bar X_{\underline t})\big\|\,\Big\|_{2p}\,\Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\,\Big\|_{2p}.$$
owing to (7.18) in Sect. 7.2.1. Finally, plugging these estimates into (7.60) yields
$$\Big\|\sup_{t\in[0,T]}\big|\bar X^n_t-\widetilde X^n_t\big|\,\Big\|_p \le 2\,e^{\kappa_p C_{b,\sigma,T}T}\big(1+|x|\big)\,\frac Tn + 2\,e^{\kappa_{2p}C_{b,\sigma,T}T}\big(1+|x|\big)\times C_{W,2p}\sqrt{\frac Tn\big(1+\log n\big)}$$
$$\le 2\big(C_{W,2p}+1\big)\,e^{\kappa_{2p}C_{b,\sigma,T}T}\big(1+|x|\big)\Big(\sqrt{\frac Tn\big(1+\log n\big)} + \frac Tn\Big).$$
The result follows by noting that $\sqrt{\frac Tn(1+\log n)} + \frac Tn \le \big(1+\sqrt T\,\big)\sqrt{\frac Tn\big(1+\log n\big)}$ for every integer $n\ge1$ and by setting $\tilde\kappa_p := 2\max\big(\big(1+\sqrt T\,\big)\big(C_{W,2p}+1\big),\,\kappa_{2p}\big)$.
Step 2 (Random X 0 ). When X 0 is no longer deterministic one uses that X 0 and W
are independent so that, with obvious notations,
$$\mathbb{E}\sup_{t\in[0,T]}\big|\bar X^{n,X_0}_t-\widetilde X^{n,X_0}_t\big|^p = \int_{\mathbb{R}^d}\mathbb{P}_{X_0}(dx_0)\,\mathbb{E}\sup_{t\in[0,T]}\big|\bar X^{n,x_0}_t-\widetilde X^{n,x_0}_t\big|^p,$$
One can derive from the above L p -rate of convergence an a.s.-convergence result.
The main result is given in the following theorem (which extends Theorem 7.3 stated
in the homogeneous Lipschitz continuous case).
Theorem 7.9 If (HTβ ) holds and if X 0 is a.s. finite, the continuous Euler scheme
X̄ n = ( X̄ tn )t∈[0,T ] a.s. converges toward the diffusion X for the sup-norm over [0, T ].
Furthermore, for every $\alpha\in[0,\beta\wedge\frac12)$,
$$n^\alpha\sup_{t\in[0,T]}\big|X_t-\bar X^n_t\big| \xrightarrow{a.s.} 0.$$
The same convergence rate holds with the stepwise constant Euler scheme $(\widetilde X^n_t)_{t\in[0,T]}$.
In particular,
$$\mathbb{E}\Big[\mathbf{1}_{\{|X_0|\le N\}}\sup_{t\in[0,T]}\big|\bar X^{n,(N)}_t-X_t\big|^p\Big] = \mathbb{E}\Big[\mathbf{1}_{\{|X_0|\le N\}}\sup_{t\in[0,T]}\big|\bar X^{n,(N)}_t-X^{(N)}_t\big|^p\Big]$$
$$\le \mathbb{E}\sup_{t\in[0,T]}\big|\bar X^{n,(N)}_t-X^{(N)}_t\big|^p \le C_{p,b,\sigma,\beta,T}\big(1+\big\|X^{(N)}_0\big\|_p\big)^p\Big(\frac Tn\Big)^{p(\beta\wedge\frac12)}.$$
Let $\alpha\in(0,\beta\wedge\frac12)$ and let $p>\frac{1}{\beta\wedge\frac12-\alpha}$. Then $\displaystyle\sum_{n\ge1}\frac{1}{n^{p(\beta\wedge\frac12-\alpha)}}<+\infty$. Consequently,
Beppo Levi’s monotone convergence Theorem for series with non-negative terms
implies
$$\mathbb{E}\Big[\mathbf{1}_{\{|X_0|\le N\}}\sum_{n\ge1}n^{p\alpha}\sup_{t\in[0,T]}\big|\bar X^n_t-X_t\big|^p\Big] \le C_{p,b,\sigma,\beta,T}\,(1+N)^p\,T^{p(\beta\wedge\frac12)}\sum_{n\ge1}n^{-p(\beta\wedge\frac12-\alpha)}<+\infty.$$
Hence
$$\sup_{n\ge1}\,n^{p\alpha}\sup_{t\in[0,T]}\big|\bar X^n_t-X_t\big|^p<+\infty\quad\mathbb{P}\text{-a.s.}$$
on the event $\{|X_0|\le N\}$. Since $\bigcup_{N\ge1}\{|X_0|\le N\}=\{X_0\in\mathbb{R}^d\}=\Omega$ a.s., the conclusion follows.
The proof for the stepwise constant Euler scheme follows exactly the same lines
since an additional log n term plays no role in the convergence of the above
series. ♦
Remarks and comments. • The above rate result strongly suggests that the critical index for the a.s. rate of convergence is $\beta\wedge\frac12$. The question is then: what happens when $\alpha=\beta\wedge\frac12$? It is shown in [155, 178] that (when $\beta=1$), $\sqrt n\,(X_t-\bar X^n_t)\xrightarrow{\mathcal L}\Xi_t$, where $\Xi=(\Xi_t)_{t\in[0,T]}$ is a diffusion process driven by a Brownian motion $\widetilde W$ independent of $W$. This weak convergence holds in a functional sense, namely for the topology of the uniform convergence on $C([0,T],\mathbb{R}^d)$. This process is not $\mathbb{P}$-a.s. $\equiv0$ if $\sigma\not\equiv0$, and is even a.s. non-zero if $\sigma$ never vanishes. The "weak functional" feature means first that we consider the processes as random variables taking values in their natural path space, namely the separable Banach space $\big(C([0,T],\mathbb{R}^d),\|\,.\,\|_{\sup}\big)$. Then, one may consider the weak convergence of probability measures defined on the Borel $\sigma$-field of this space (see [45], Chap. 2 for an introduction). In particular, $\|\,.\,\|_{\sup}$ being trivially continuous,
$$\sqrt n\,\sup_{t\in[0,T]}\big|X_t-\bar X^n_t\big| \xrightarrow{\mathcal L} \sup_{t\in[0,T]}|\Xi_t|$$
(this easily follows from the Skorokhod representation theorem; a direct approach is also possible).
Exercise. One considers the geometric Brownian motion $X_t=e^{-\frac t2+W_t}$, solution to
$$dX_t = X_t\,dW_t,\qquad X_0=1.$$
Show that its Euler scheme with step $\frac Tn$ satisfies
$$\bar X^n_{t_k} = \prod_{\ell=1}^k\big(1+\Delta W_{t^n_\ell}\big)\quad\text{where}\quad t^n_\ell=\frac{\ell T}{n},\ \ \Delta W_{t^n_\ell}=W_{t^n_\ell}-W_{t^n_{\ell-1}},\ \ \ell\ge1.$$
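As a numerical companion to this exercise, one can compare the product form of the Euler scheme with the closed-form solution on a common Brownian path and watch the strong error decrease, consistently with the $n^{-1/2}$ strong rate discussed above. The following Python sketch is illustrative only (the Monte Carlo size, seed and grid of values of $n$ are arbitrary choices, not from the text):

```python
import numpy as np

def euler_vs_exact_L2(n, T=1.0, paths=20000, seed=0):
    """L^2 distance at maturity between the Euler scheme of dX = X dW
    (product form) and the exact solution exp(W_T - T/2), both driven
    by the same Brownian increments."""
    rng = np.random.default_rng(seed)
    h = T / n
    dW = rng.normal(0.0, np.sqrt(h), size=(paths, n))
    x_bar = np.prod(1.0 + dW, axis=1)           # Euler: product of (1 + dW)
    x_exact = np.exp(dW.sum(axis=1) - T / 2)    # exact GBM at time T
    return np.sqrt(np.mean((x_bar - x_exact) ** 2))

errs = {n: euler_vs_exact_L2(n) for n in (16, 64, 256)}
# the error should roughly halve each time n is multiplied by 4
```

Each fourfold refinement of the grid should roughly halve the observed $L^2$ error, in line with the $n^{-1/2}$ rate.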
If Assumption (7.2) holds, then for every x ∈ Rd , there exists a unique solution,
denoted by (X tx )t∈[0,T ] , to the SDE (7.1) defined on [0, T ] and starting from x. The
mapping (x, t) → X tx defined on [0, T ] × Rd is called the flow of the SDE (7.1).
One defines likewise the flow of the Euler scheme (which always exists). We will now elucidate the regularity of these flows when Assumption $(\mathcal H^\beta_T)$ holds.
Theorem 7.10 If the coefficients b and σ of (7.1) satisfy Assumption (HTβ ) for a real
constant C > 0, then the unique strong solution (X tx )t∈[0,T ] starting from x ∈ Rd on
[0, T ] and the continuous Euler scheme ( X̄ n,x )t∈[0,T ] satisfy
$$\forall\,x,y\in\mathbb{R}^d,\ \forall\,n\ge1,\qquad \Big\|\sup_{t\in[0,T]}\big|X^x_t-X^y_t\big|\,\Big\|_p + \Big\|\sup_{t\in[0,T]}\big|\bar X^{n,x}_t-\bar X^{n,y}_t\big|\,\Big\|_p \le 2\,e^{\kappa_3(p,CT)}|x-y|,$$
where $\kappa_3(p,u)=\big(2+C(C^{BDG}_{d,p})^2\big)u$, $u\ge0$.
Proof. We focus on the diffusion process $(X_t)_{t\in[0,T]}$. First note that if the above bound holds for some $p>0$, then it holds true for any $p'\in(0,p)$ since the $\|\cdot\|_p$-norm is non-decreasing in $p$. Starting from
$$X^x_t-X^y_t = (x-y) + \int_0^t\big(b(s,X^x_s)-b(s,X^y_s)\big)\,ds + \int_0^t\big(\sigma(s,X^x_s)-\sigma(s,X^y_s)\big)\,dW_s$$
one gets
$$\sup_{s\in[0,t]}\big|X^x_s-X^y_s\big| \le |x-y| + \int_0^t\big|b(s,X^x_s)-b(s,X^y_s)\big|\,ds + \sup_{s\in[0,t]}\Big|\int_0^s\big(\sigma(u,X^x_u)-\sigma(u,X^y_u)\big)\,dW_u\Big|.$$
Then, setting for every $p\ge2$, $f_p(t):=\Big\|\sup_{s\in[0,t]}\big|X^x_s-X^y_s\big|\,\Big\|_p$, it follows from the
BDG Inequality (7.53) and the generalized Minkowski inequality (7.50) that
$$f_p(t) \le |x-y| + C\int_0^t\big\|X^x_s-X^y_s\big\|_p\,ds + C^{BDG}_{d,p}\,\Big\|\Big(\int_0^t\big\|\sigma(s,X^x_s)-\sigma(s,X^y_s)\big\|^2\,ds\Big)^{\frac12}\Big\|_p$$
$$\le |x-y| + C\int_0^t\big\|X^x_s-X^y_s\big\|_p\,ds + C^{BDG}_{d,p}\,C\,\Big\|\int_0^t\big|X^x_s-X^y_s\big|^2\,ds\Big\|_{\frac p2}^{\frac12}$$
$$\le |x-y| + C\int_0^t\big\|X^x_s-X^y_s\big\|_p\,ds + C^{BDG}_{d,p}\,C\,\Big(\int_0^t\big\|X^x_s-X^y_s\big\|_p^2\,ds\Big)^{\frac12}.$$
The proof for the Euler scheme follows the same lines once we observe that $\underline s\in[0,s]$. $\diamond$
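The Lipschitz dependence on the initial value stated in Theorem 7.10 is easy to observe numerically: driving two Euler schemes by the same Brownian increments from starting points $x$ and $y=x+\varepsilon$, the expected sup-distance of the paths scales linearly in $\varepsilon$. A hedged Python sketch (the model $b(x)=-x$, $\sigma(x)=1+\tfrac12\sin x$ and all numerical parameters are arbitrary illustrative choices, not from the text):

```python
import numpy as np

def euler_paths(x0, dW, h, b=lambda x: -x,
                sigma=lambda x: 1.0 + 0.5 * np.sin(x)):
    """Euler scheme of dX = b(X)dt + sigma(X)dW started from x0,
    driven by the given Brownian increments dW (shape (paths, n))."""
    x = np.full(dW.shape[0], x0, dtype=float)
    traj = [x.copy()]
    for k in range(dW.shape[1]):
        x = x + b(x) * h + sigma(x) * dW[:, k]
        traj.append(x.copy())
    return np.stack(traj, axis=1)               # shape (paths, n+1)

rng = np.random.default_rng(1)
n, T, paths = 200, 1.0, 2000
h = T / n
dW = rng.normal(0.0, np.sqrt(h), size=(paths, n))
gaps = []
for eps in (1.0, 0.1, 0.01):
    px = euler_paths(0.0, dW, h)
    py = euler_paths(eps, dW, h)
    sup_gap = np.abs(px - py).max(axis=1)       # sup_t |Xbar^x_t - Xbar^y_t|
    gaps.append(np.mean(sup_gap) / eps)         # should stay O(1) in eps
```

The normalized gaps should stay of the same order of magnitude as $\varepsilon$ shrinks, mirroring the bound $\big\|\sup_t|\bar X^{n,x}_t-\bar X^{n,y}_t|\big\|_p \le C\,|x-y|$.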
7.8.8 The Strong Error Rate for the Milstein Scheme: Proof
of Theorem 7.5
In this section, we return to the scalar case d = q = 1 and we prove Theorem 7.5.
Throughout this section Cb,σ, p,T and K b,σ, p,T are positive real constants that may
vary from line to line.
First we note that the genuine (or continuous) Milstein scheme $(\bar X^{mil,n}_t)_{t\in[0,T]}$, as defined by (7.38), can be written in an integral form as follows
$$\bar X^{mil,n}_t = X_0 + \int_0^t b\big(\bar X^{mil,n}_{\underline s}\big)\,ds + \int_0^t \sigma\big(\bar X^{mil,n}_{\underline s}\big)\,dW_s + \int_0^t\Big(\int_{\underline s}^s(\sigma\sigma')\big(\bar X^{mil,n}_{\underline u}\big)\,dW_u\Big)\,dW_s\tag{7.61}$$
so that
$$\bar X^{mil}_t = X_0 + \int_0^t b\big(\bar X^{mil}_{\underline s}\big)\,ds + \int_0^t \bar H_s\,dW_s.$$
It follows from the boundedness of $b'$ and $\sigma'$ that $b$ and $\sigma$ satisfy a linear growth assumption.
We will follow the lines of the proof of Proposition 7.6, the specificity of the
Milstein framework being that the diffusion coefficient is replaced by the process
$\bar H_s$. So, our task is to control the term
$$\Big\|\sup_{s\in[0,t]}\Big|\int_0^{s\wedge\bar\tau_N}\bar H_u\,dW_u\Big|\,\Big\|_p$$
in $L^p$, where $\bar\tau_N=\bar\tau^n_N := \inf\big\{t\in[0,T]:\ \big|\bar X^{mil,n}_t-X_0\big|>N\big\}$, $n,N\ge1$.
First assume that $p\in[2,\infty)$. Since $\int_0^{t\wedge\bar\tau_N}\bar H_s\,dW_s$ is a continuous local martingale, it follows from the BDG Inequality (7.51) that
$$\Big\|\sup_{s\in[0,t]}\Big|\int_0^{s\wedge\bar\tau_N}\bar H_u\,dW_u\Big|\,\Big\|_p \le C^{BDG}_p\,\Big\|\Big(\int_0^{t\wedge\bar\tau_N}\bar H_s^2\,ds\Big)^{\frac12}\Big\|_p.$$
Using that $\sigma$ and $\sigma\sigma'$ have at most linear growth since $\sigma'$ is bounded, we derive
$$\big\|\mathbf{1}_{\{s\le\bar\tau_N\}}\bar H_{s\wedge\bar\tau_N}\big\|_p \le C_{b,\sigma,T}\Big(1+\big\|\bar X^{mil}_{s\wedge\bar\tau_N}\big\|_p\Big).$$
Finally, following the lines of the first step of the proof of Proposition 7.6 leads to
$$\Big\|\sup_{s\in[0,t]}\Big|\int_0^{s\wedge\bar\tau_N}\bar H_u\,dW_u\Big|\,\Big\|_p \le C_{b,\sigma,T}\,C^{BDG}_p\bigg(\sqrt t + \Big(\int_0^t\Big\|\sup_{u\in[0,s\wedge\bar\tau_N]}\big|\bar X^{mil}_u\big|\,\Big\|_p^2\,ds\Big)^{\frac12}\bigg).$$
Still following the lines of the proof of Proposition 7.6, we rely on its Step 4 to deal with the case $p\in(0,2)$.
Moreover, we get as a by-product that, for every $p>0$ and every $n\ge1$,
$$\Big\|\sup_{t\in[0,T]}\big|\bar H_t\big|\,\Big\|_p \le K_{b,\sigma,T,p}\big(1+\|X_0\|_p\big)<+\infty,\tag{7.63}$$
where $K_{b,\sigma,T,p}$ does not depend on the discretization step $n$. As a matter of fact, this follows from
$$\sup_{t\in[0,T]}\big|\bar H_t\big| \le C_{b,\sigma}\Big(1+\sup_{t\in[0,T]}\big|\bar X^{mil,n}_t\big|\Big)\Big(1+2\sup_{t\in[0,T]}|W_t|\Big),$$
where we used that $\|\cdot\|_{2p}$ is a norm in the right-hand side of the inequality. A similar bound holds when $p\in(0,1/2)$ since $\|1+V\|_{2p}\le 2^{\frac1{2p}}\big(1+\|V\|_{2p}\big)$ for any random variable $V$.
Now, by Lemma 7.4 devoted to the L p -regularity of Itô processes, one derives the
existence of a real constant κb,σ, p,T ∈ (0, +∞) (not depending on n ≥ 1) such that
$$\forall\,t\in[0,T],\ \forall\,n\ge1,\qquad \big\|\bar X^{mil,n}_t-\bar X^{mil,n}_{\underline t}\big\|_p \le \kappa_{b,\sigma,p,T}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\frac12}.\tag{7.64}$$
Using the diffusion equation and the continuous Milstein scheme one gets
$$\varepsilon_t = \int_0^t\big(b(X_s)-b(\bar X^{mil}_{\underline s})\big)\,ds + \int_0^t\big(\sigma(X_s)-\sigma(\bar X^{mil}_{\underline s})\big)\,dW_s - \int_0^t\Big(\int_{\underline s}^s(\sigma\sigma')(\bar X^{mil}_{\underline u})\,dW_u\Big)\,dW_s$$
$$= \int_0^t\big(b(X_s)-b(\bar X^{mil}_s)\big)\,ds + \int_0^t\big(\sigma(X_s)-\sigma(\bar X^{mil}_s)\big)\,dW_s$$
$$\quad + \int_0^t\big(b(\bar X^{mil}_s)-b(\bar X^{mil}_{\underline s})\big)\,ds + \int_0^t\Big(\sigma(\bar X^{mil}_s)-\sigma(\bar X^{mil}_{\underline s})-(\sigma\sigma')(\bar X^{mil}_{\underline s})(W_s-W_{\underline s})\Big)\,dW_s,$$
so that, using twice the generalized Minkowski Inequality (7.50) and the BDG Inequality (7.51), one gets classically
$$f_p(t) \le \|b'\|_{\sup}\int_0^t f_p(s)\,ds + C^{BDG}_p\,\|\sigma'\|_{\sup}\Big(\int_0^t f_p(s)^2\,ds\Big)^{\frac12}$$
$$\quad + \underbrace{\Big\|\sup_{s\in[0,t]}\Big|\int_0^s\big(b(\bar X^{mil}_u)-b(\bar X^{mil}_{\underline u})\big)\,du\Big|\,\Big\|_p}_{(B)} + \underbrace{\Big\|\sup_{s\in[0,t]}\Big|\int_0^s\Big(\sigma(\bar X^{mil}_u)-\sigma(\bar X^{mil}_{\underline u})-(\sigma\sigma')(\bar X^{mil}_{\underline u})(W_u-W_{\underline u})\Big)\,dW_u\Big|\,\Big\|_p}_{(C)}.$$
where $\rho_b(u)$ is defined by the above equation on the event $\{\bar X^{mil}_u\ne\bar X^{mil}_{\underline u}\}$ and is equal to $0$ otherwise. This defines an $(\mathcal F_u)$-adapted process, bounded by the Hölder coefficient $[b']_{\alpha_b}$ of $b'$. Using that for every $\xi\in\mathbb{R}$, $|bb'|(\xi)\le\|b'\|_{\sup}\big(\|b'\|_{\sup}+|b(0)|\big)\big(1+|\xi|\big)$, and (7.64), yields
$$(B) \le \|b'\|_{\sup}\big(\|b'\|_{\sup}+|b(0)|\big)\Big(1+\Big\|\sup_{t\in[0,T]}\big|\bar X^{mil}_t\big|\,\Big\|_p\Big)\frac Tn + [b']_{\alpha_b}\,K_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_b}2}$$
$$\quad + \Big\|\sup_{s\in[0,t]}\Big|\int_0^s b'(\bar X^{mil}_{\underline u})\Big(\int_{\underline u}^u\bar H_v\,dW_v\Big)\,du\Big|\,\Big\|_p.$$
Likewise, if $t\in\big[\frac{(\ell-1)T}n,\frac{\ell T}n\big)$, then
$$\int_{\frac{(\ell-1)T}n}^t\Big(\int_{\frac{(\ell-1)T}n}^s G_u\,dW_u\Big)\,ds = \int_{\frac{(\ell-1)T}n}^t(t-s)\,G_s\,dW_s,$$
which completes the proof by summing all the terms from $k=1$ up to $\ell$, together with the last one. $\diamond$
We adopt a similar approach for the term $(C)$. Elementary computations show that the increment $\sigma(\bar X^{mil}_u)-\sigma(\bar X^{mil}_{\underline u})-(\sigma\sigma')(\bar X^{mil}_{\underline u})(W_u-W_{\underline u})$ can be decomposed with a remainder involving a process $\rho_\sigma(u)$, an $(\mathcal F_u)$-adapted process bounded by the Hölder coefficient $[\sigma']_{\alpha_\sigma}$ of $\sigma'$. Consequently, for every $p\ge1$,
$$\big\|\sigma(\bar X^{mil}_u)-\sigma(\bar X^{mil}_{\underline u})-(\sigma\sigma')(\bar X^{mil}_{\underline u})(W_u-W_{\underline u})\big\|_p$$
$$\le \big\|\sigma' b(\bar X^{mil}_{\underline u})\big\|_p\,(u-\underline u) + \frac12\big\|\sigma(\sigma')^2(\bar X^{mil}_{\underline u})\big\|_p\,\big\|(W_u-W_{\underline u})^2-(u-\underline u)\big\|_p + [\sigma']_{\alpha_\sigma}\big\|\bar X^{mil}_u-\bar X^{mil}_{\underline u}\big\|^{1+\alpha_\sigma}_{p(1+\alpha_\sigma)}$$
$$\le C_{b,\sigma,p,T}\big(1+|x|\big)\Big((u-\underline u) + \|Z^2-1\|_p\,(u-\underline u) + [\sigma']_{\alpha_\sigma}(u-\underline u)^{\frac{1+\alpha_\sigma}2}\Big)$$
$$\le C_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma}2}.$$
Hence
$$(C) \le C^{BDG}_p\,C_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma}2}.$$
$$f_p(t) \le \|b'\|_{\sup}\int_0^t f_p(s)\,ds + C^{BDG}_p\,\|\sigma'\|_{\sup}\Big(\int_0^t f_p(s)^2\,ds\Big)^{\frac12} + C_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma\wedge\alpha_b}2},$$
so that, owing to the "à la Gronwall" Lemma 7.3, there exists a real constant $C_{b,\sigma,p,T}$ such that
$$f_p(T) \le C_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma\wedge\alpha_b}2}.$$
Step 2 (Extension to p ∈ (0, 2) and random starting values X 0 ). First one uses
that p → · p is non-decreasing to extend the above bound to p ∈ (0, 2). Then,
one uses that, if $X_0$ and $W$ are independent, for any non-negative functional $\Phi: C([0,T],\mathbb{R}^d)^2\to\mathbb{R}_+$, one has, with obvious notations,
$$\mathbb{E}\,\Phi\big(X,\bar X^{mil}\big) = \int_{\mathbb{R}^d}\mathbb{P}_{X_0}(dx_0)\,\mathbb{E}\,\Phi\big(X^{x_0},\bar X^{mil,x_0}\big).$$
Applying this identity with $\Phi(x,\bar x)=\sup_{t\in[0,T]}|x(t)-\bar x(t)|^p$, $x,\bar x\in C([0,T],\mathbb{R})$,
completes the proof.
(b) This second claim follows from the error bound established for the Brownian
motion itself: as concerns the Brownian motion, both stepwise constant and continu-
ous versions of the Milstein and the Euler scheme coincide. So a better convergence
rate is hopeless. ♦
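The gap between the Milstein strong rate of Theorem 7.5 and the Euler rate can be seen in practice on $dX_t=X_t\,dW_t$ (so $\sigma(x)=x$ and $\sigma\sigma'(x)=x$), for which the exact solution is available. A hedged Python sketch (sample size, seed and the values of $n$ are arbitrary illustrative choices):

```python
import numpy as np

def strong_errors(n, T=1.0, paths=20000, seed=2):
    """L^1 errors at maturity of the Euler and Milstein schemes for
    dX = X dW, X_0 = 1, against the exact solution exp(W_T - T/2)."""
    rng = np.random.default_rng(seed)
    h = T / n
    dW = rng.normal(0.0, np.sqrt(h), size=(paths, n))
    x_eul = np.ones(paths)
    x_mil = np.ones(paths)
    for k in range(n):
        dw = dW[:, k]
        x_eul = x_eul + x_eul * dw
        # Milstein correction: + (1/2) sigma sigma'(x) (dW^2 - h) = x/2 (dw^2 - h)
        x_mil = x_mil + x_mil * dw + 0.5 * x_mil * (dw * dw - h)
    x_exact = np.exp(dW.sum(axis=1) - T / 2)
    return (np.mean(np.abs(x_eul - x_exact)),
            np.mean(np.abs(x_mil - x_exact)))

e64, m64 = strong_errors(64)
e256, m256 = strong_errors(256)
# Milstein should beat Euler, and improve faster as n grows (rate 1 vs 1/2)
```

When $n$ is multiplied by 4, the Milstein error should shrink by roughly a factor 4 versus roughly 2 for Euler.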
In this section we return to the purely scalar case d = q = 1, mainly for notational
convenience, namely we consider the scalar version of (7.1), i.e.
$$(\mathcal C_\infty)\ \equiv\quad b,\sigma\in C^\infty([0,T]\times\mathbb{R})\quad\text{and}\quad \forall\,k_1,k_2\in\mathbb{N},\ k_1+k_2\ge1,$$
$$\sup_{(t,x)\in[0,T]\times\mathbb{R}}\bigg(\Big|\frac{\partial^{k_1+k_2}b}{\partial t^{k_1}\partial x^{k_2}}(t,x)\Big| + \Big|\frac{\partial^{k_1+k_2}\sigma}{\partial t^{k_1}\partial x^{k_2}}(t,x)\Big|\bigg)<+\infty.$$
In particular, b and σ are Lipschitz continuous in (t, x) ∈ [0, T ] × R. Thus, for every
$t,t'\in[0,T]$ and every $x,x'\in\mathbb{R}$,
$$|b(t',x')-b(t,x)| \le \sup_{(s,\xi)\in[0,T]\times\mathbb{R}}\Big|\frac{\partial b}{\partial x}(s,\xi)\Big|\,|x'-x| + \sup_{(s,\xi)\in[0,T]\times\mathbb{R}}\Big|\frac{\partial b}{\partial t}(s,\xi)\Big|\,|t'-t|.$$
Consequently, the SDE (7.1) always has a unique strong solution $(X_t)_{t\in[0,T]}$ starting from any $\mathbb{R}^d$-valued random vector $X_0$, independent of the Brownian motion $W$ on $(\Omega,\mathcal A,\mathbb P)$. Furthermore, as
$$|b(t,x)|+|\sigma(t,x)| \le \sup_{t\in[0,T]}\big(|b(t,0)|+|\sigma(t,0)|\big) + C|x| \le C'\big(1+|x|\big),$$
We recall that the infinitesimal generator L of the diffusion reads on every function
g ∈ C 1,2 ([0, T ] × R)
$$Lg(t,x) = b(t,x)\,\frac{\partial g}{\partial x}(t,x) + \frac12\,\sigma^2(t,x)\,\frac{\partial^2 g}{\partial x^2}(t,x).$$
As for the source function f appearing in the Feynman–Kac formula, which will
also be the function of interest for the weak error expansion, we make the following
regularity and growth assumption:
$$(\mathcal F_\infty)\ \equiv\quad f\in C^\infty(\mathbb{R},\mathbb{R})\quad\text{and}\quad \forall\,k\in\mathbb{N},\ \exists\,r_k\in\mathbb{N},\ \exists\,C_k\in(0,+\infty),\quad |f^{(k)}(x)| \le C_k\big(1+|x|^{r_k}\big),$$
The first result of the section is the fundamental link between Stochastic Differ-
ential Equations and Partial Differential Equations, namely the representation of the
solution of the above parabolic PDE as the expectation of a marginal function of
the SDE at its terminal time T whose infinitesimal generator is the second-order
differential operator of the PDE. This representation is known as the Feynman–Kac
formula.
Theorem 7.11 (Feynman–Kac formula) Assume C∞ and (F∞ ) hold.
(a) The parabolic PDE
$$\frac{\partial u}{\partial t} + Lu = 0,\qquad u(T,\,.\,)=f\tag{7.66}$$
has a solution given by $u(t,x)=\mathbb{E}\,f\big(X^{t,x}_T\big)$, where $(X^{t,x}_s)_{s\in[t,T]}$ denotes the unique solution of the SDE (7.1) starting at $x$ at time $t$. If, furthermore, the SDE is autonomous—namely $b(t,x)=b(x)$ and $\sigma(t,x)=\sigma(x)$—then
$$\forall\,t\in[0,T],\qquad u(t,x)=\mathbb{E}\,f\big(X^x_{T-t}\big).$$
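The Feynman–Kac representation lends itself to a quick Monte Carlo sanity check on a fully explicit example: for $dX_t=X_t\,dW_t$, $X_0=1$ and $f(x)=x^2$, one has $u(0,1)=\mathbb{E}\,X_T^2=e^T$. The sketch below simulates the exact solution, so only statistical error remains; the sample size and seed are arbitrary choices:

```python
import math
import numpy as np

def feynman_kac_mc(T=1.0, samples=400_000, seed=3):
    """Monte Carlo estimate of u(0, x) = E f(X_T^x) for dX = X dW,
    with x = 1 and f(x) = x^2, using the exact solution
    X_T = exp(W_T - T/2) (no discretization bias)."""
    rng = np.random.default_rng(seed)
    w_T = rng.normal(0.0, math.sqrt(T), size=samples)
    x_T = np.exp(w_T - T / 2)
    return float(np.mean(x_T ** 2))

estimate = feynman_kac_mc()
exact = math.exp(1.0)   # E X_T^2 = e^T with T = 1
```

The estimate should agree with $e^T$ up to the usual $O(1/\sqrt{\text{samples}})$ statistical error.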
Notation: To alleviate notation, we will use throughout this section the notations $\partial_x f$, $\partial_t f$, $\partial^2_{xt}f$, etc., for the partial derivatives instead of $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial t}$, $\frac{\partial^2 u}{\partial x\partial t}$, …
Exercise. Combining the above bound for the spatial partial derivatives with $\partial_t u=-Lu$, show that $\partial_t u$ also has polynomial growth in $x$, uniformly in $t\in[0,T]$.
since $u$ satisfies the PDE (7.66). Now the local martingale $M_t := \int_0^t\partial_x u(s,X_s)\,\sigma(s,X_s)\,dW_s$ is a true martingale since
$$\langle M\rangle_T = \int_0^T\big(\partial_x u(s,X_s)\big)^2\sigma^2(s,X_s)\,ds \le C\Big(1+\sup_{s\in[0,T]}|X_s|^\theta\Big)\in L^1(\mathbb{P})$$
since X t is Ft -measurable. This shows that u(t, . ) is a regular version of the condi-
tional expectation on the left-hand side.
If we rewrite (7.69) with the solution $(X^{t,x}_s)_{s\in[0,T-t]}$, the same reasoning shows that
$$u\big(T,X^{t,x}_T\big) = u(t,x) + \int_t^T\partial_x u\big(s,X^{t,x}_s\big)\,\sigma\big(s,X^{t,x}_s\big)\,dW_s$$
property. Following e.g. [251] (see Theorem (1.7), Chap. IX, p. 368), this implies weak uniqueness, i.e. that $(X^{t,x}_{t+s})_{s\in[0,T-t]}$ and $(X^x_s)_{s\in[0,T-t]}$ have the same distribution. Hence $\mathbb{E}\,f(X^{t,x}_T)=\mathbb{E}\,f(X^x_{T-t})$, which completes the proof. $\diamond$
Remarks. • The proof of the Feynman–Kac formula itself (Claim (b)) only needs u
to be C 1,2 and b and σ to be continuous on [0, T ] × R and Lipschitz in x uniformly
in t ∈ [0, T ].
• In the time homogeneous case $b(t,x)=b(x)$ and $\sigma(t,x)=\sigma(x)$, one can proceed by verification. Under smoothness assumptions on $b$ and $\sigma$, say $C^2$ with bounded existing derivatives and Hölder second-order partial derivatives, one shows, using the tangent process of the diffusion, that the function $u(t,x)$ defined by $u(t,x)=\mathbb{E}\,f(X^x_{T-t})$ is $C^{1,2}$ in $(t,x)$. Then, the above Claim (b) shows the existence of a solution to the parabolic PDE (7.66).
We may now pass to the second result of this section, the Talay–Tubaro weak error
expansion theorem, stated here in the non-homogeneous case (but in one dimension).
Theorem 7.12 (Smooth case) (Talay–Tubaro [270]): Assume that b and σ satisfy
(C∞ ), that f satisfies (F∞ ) and that X 0 ∈ L r0 (P). Then, the weak error can be
expanded at any order, namely
$$\forall\,R\in\mathbb{N}^*,\qquad \mathbb{E}\,f\big(\bar X^n_T\big)-\mathbb{E}\,f(X_T) = \sum_{k=1}^R\frac{c_k}{n^k} + O\Big(\frac{1}{n^{R+1}}\Big).$$
Remarks. • The result at a given order R also holds under weaker smoothness
assumptions on b, σ and f (say CbR+5 , see [134]).
• Standard arguments show that the coefficients ck in the above expansion are the
first R terms of a sequence (ck )k≥1 .
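A standard practical use of this expansion is Richardson–Romberg extrapolation: the combination $2\,\mathbb{E}\,f(\bar X^{2n}_T)-\mathbb{E}\,f(\bar X^n_T)$ cancels the $c_1/n$ term, leaving an $O(1/n^2)$ error. On the toy case $dX_t=X_t\,dW_t$, $X_0=1$, $f(x)=x^2$, the Euler scheme's weak value is closed-form, $\mathbb{E}(\bar X^n_T)^2=(1+T/n)^n$, while $\mathbb{E}\,X_T^2=e^T$, so the cancellation can be checked without any simulation. A short illustrative sketch:

```python
import math

def euler_weak_value(n, T=1.0):
    """E (Xbar_T^n)^2 for the Euler scheme of dX = X dW, X_0 = 1:
    since Xbar_{k+1} = Xbar_k (1 + dW) with independent increments,
    E Xbar_T^2 = (1 + T/n)^n."""
    return (1.0 + T / n) ** n

T = 1.0
exact = math.exp(T)                 # E X_T^2 = e^T
n = 50
err_n = euler_weak_value(n, T) - exact
err_2n = euler_weak_value(2 * n, T) - exact
err_rr = (2 * euler_weak_value(2 * n, T) - euler_weak_value(n, T)) - exact
# |err_rr| should be smaller than |err_n| by roughly a factor of order n
```

The extrapolated error is second order in $1/n$, so for $n=50$ it is typically two orders of magnitude below the plain Euler weak error.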
In order to evaluate the increment $u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})$, we apply Itô's formula (see Sect. 12.8) between $t_{k-1}$ and $t_k$ to the function $u$ and use that the Euler scheme satisfies the pseudo-SDE with "frozen" coefficients
$$d\bar X^n_t = b\big(\underline t,\bar X^n_{\underline t}\big)\,dt + \sigma\big(\underline t,\bar X^n_{\underline t}\big)\,dW_t,$$
where
$$\bar L g(s,\underline s,x,\underline x) = b(\underline s,\underline x)\,\partial_x g(s,x) + \frac12\,\sigma^2(\underline s,\underline x)\,\partial^2_{xx}g(s,x)$$
and $\partial_t g(s,\underline s,x,\underline x) = \partial_t g(s,x)$.
The bracket process of the local martingale $M_t = \int_0^t\partial_x u(s,\bar X^n_s)\,\sigma(\underline s,\bar X^n_{\underline s})\,dW_s$ is given by
$$\langle M\rangle_T = \int_0^T\big(\partial_x u(s,\bar X^n_s)\big)^2\,\sigma^2\big(\underline s,\bar X^n_{\underline s}\big)\,ds.$$
Consequently, using that $\sigma$ has (at most) linear growth in $x$, uniformly in $t\in[0,T]$, and the control (7.67) of $\partial_x u(s,x)$, we have
$$\langle M\rangle_T \le C\Big(1+\sup_{t\in[0,T]}\big|\bar X^n_t\big|^2 + \sup_{t\in[0,T]}\big|\bar X^n_t\big|^{r(1,T)+2}\Big)\in L^1(\mathbb{P})$$
since $\sup_{t\in[0,T]}|\bar X^n_t|$ lies in every $L^p(\mathbb{P})$. Hence, $(M_t)_{t\in[0,T]}$ is a true martingale, so that
$$\mathbb{E}\big[u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big] = \mathbb{E}\int_{t_{k-1}}^{t_k}\big(\partial_t+\bar L\big)u\big(s,\underline s,\bar X^n_s,\bar X^n_{\underline s}\big)\,ds$$
(the integrability of the integral term follows from $\partial_t u=-Lu$, which ensures the polynomial growth of $(\partial_t+\bar L)u(s,\underline s,x,\underline x)$ in $x$ and $\underline x$, uniformly in $s,\underline s$). At this stage, the idea is to expand the above expectation into a term $\mathbb{E}\,\bar\varphi(\underline s,\bar X^n_{\underline s})\,\frac Tn + O\big((\frac Tn)^2\big)$. To this end we will again apply Itô's formula to $\partial_t u(s,\bar X^n_s)$, $\partial_x u(s,\bar X^n_s)$ and $\partial^2_{xx}u(s,\bar X^n_s)$, taking advantage of the regularity of $u$.
– Term 1. The function $\partial_t u$ being $C^{1,2}([0,T]\times\mathbb{R})$, Itô's formula between $\underline s=t_{k-1}$ and $s$ yields
$$\partial_t u(s,\bar X^n_s) = \partial_t u(\underline s,\bar X^n_{\underline s}) + \int_{\underline s}^s\Big(\partial^2_{tt}u(r,\bar X^n_r) + \bar L(\partial_t u)\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big)\Big)\,dr + \int_{\underline s}^s\sigma\big(\underline r,\bar X^n_{\underline r}\big)\,\partial^2_{tx}u(r,\bar X^n_r)\,dW_r.$$
First, let us show that the local martingale term is the increment between $\underline s$ and $s$ of a true martingale, denoted by $(M^{(1)}_t)_{t\in[0,T]}$ from now on. Note that $\partial_t u=-Lu$, so that $\partial^2_{xt}u=-\partial_x(Lu)$ is clearly a function with polynomial growth in $x$, uniformly in $t\in[0,T]$, since
$$\Big|\partial_x\Big(b(t,x)\,\partial_x u + \frac12\,\sigma^2(t,x)\,\partial^2_{xx}u\Big)(t,x)\Big| \le C\big(1+|x|^{\theta_0}\big).$$
This follows from the fact that $\bar\varphi^{(1)}$ is defined as a linear combination of products of $b$, $\partial_t b$, $\partial_x b$, $\partial^2_{xx}b$, $\sigma$, $\partial_t\sigma$, $\partial_x\sigma$, $\partial^2_{xx}\sigma$, $\partial_x u$, $\partial^2_{xx}u$ at $(t,x)$ or $(\underline t,\underline x)$ (with "$x=\bar X^n_r$" and "$\underline x=\bar X^n_{\underline r}$").
– Term 2. The function $\partial_x u$ being $C^{1,2}$, Itô's formula yields
$$\partial_x u(s,\bar X^n_s) = \partial_x u(\underline s,\bar X^n_{\underline s}) + \int_{\underline s}^s\Big(\partial^2_{xt}u(r,\bar X^n_r) + \bar L(\partial_x u)\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big)\Big)\,dr + \int_{\underline s}^s\partial^2_{xx}u(r,\bar X^n_r)\,\sigma\big(\underline r,\bar X^n_{\underline r}\big)\,dW_r.$$
The stochastic integral is the increment of a true martingale (denoted by $(M^{(2)}_t)_{t\in[0,T]}$ in what follows) and, using that $\partial^2_{xt}u=\partial_x(-Lu)$, one shows likewise that
$$\partial^2_{tx}u(r,\bar X^n_r) + \bar L(\partial_x u)\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big) = \big(\bar L(\partial_x u)-\partial_x(Lu)\big)\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big) = \bar\varphi^{(2)}\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big),$$
where $(M^{(3)}_t)_{t\in[0,T]}$ is a martingale and $\bar\varphi^{(3)}$ has polynomial growth in $(x,\underline x)$, uniformly in $(t,\underline t)$.
Step 2 (A first bound). Collecting all the results obtained in Step 1 yields the above expansion with
$$\bar\varphi(r,\underline r,x,\underline x) = \bar\varphi^{(1)}(r,\underline r,x,\underline x) + b(\underline r,\underline x)\,\bar\varphi^{(2)}(r,\underline r,x,\underline x) + \frac12\,\sigma^2(\underline r,\underline x)\,\bar\varphi^{(3)}(r,\underline r,x,\underline x).$$
Now, all three expectations inside the integral are zero since the $M^{(k)}$, $k=1,2,3$, are true martingales. Thus
$$\mathbb{E}\Big[b(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\big)\Big] = \mathbb{E}\Big[\underbrace{b(t_{k-1},\bar X^n_{t_{k-1}})}_{\mathcal F_{t_{k-1}}\text{-measurable}}\,\underbrace{\mathbb{E}\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\,\big|\,\mathcal F_{t_{k-1}}\big)}_{=0}\Big] = 0,$$
so that
$$\big|\mathbb{E}\,u(t_k,\bar X^n_{t_k}) - \mathbb{E}\,u(t_{k-1},\bar X^n_{t_{k-1}})\big| \le \int_{t_{k-1}}^{t_k}ds\int_{t_{k-1}}^{s}dr\,\mathbb{E}\,\big|\bar\varphi\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big)\big|$$
$$\le C_{\bar\varphi}(T)\Big(1+2\,\mathbb{E}\sup_{t\in[0,T]}\big|\bar X^n_t\big|^{\theta\vee\theta'}\Big)\frac{(t_k-t_{k-1})^2}{2} \le C_{b,\sigma,f}(T)\Big(\frac Tn\Big)^2,$$
where, owing to Proposition 7.6, the function $C_{b,\sigma,f}(\cdot)$ only depends on $T$ (in a non-decreasing manner). Summing over the terms for $k=1,\dots,n$ yields, keeping in mind that $\bar X^n_0=X_0$,
$$\big|\mathbb{E}\,f(\bar X^n_T)-\mathbb{E}\,f(X_T)\big| = \big|\mathbb{E}\,u(T,\bar X^n_T)-\mathbb{E}\,u(0,\bar X^n_0)\big| \le \sum_{k=1}^n\big|\mathbb{E}\,u(t_k,\bar X^n_{t_k})-\mathbb{E}\,u(t_{k-1},\bar X^n_{t_{k-1}})\big|$$
$$\le n\,C_{b,\sigma,f}(T)\Big(\frac Tn\Big)^2 = C'_{b,\sigma,f}(T)\,\frac Tn\quad\text{with}\quad C'_{b,\sigma,f}(T)=T\,C_{b,\sigma,f}(T).$$
Let $k\in\{0,\dots,n\}$. It follows from the preceding and the obvious equality $\frac{t_k}{k}=\frac Tn$, $k=1,\dots,n$, that
$$\forall\,k\in\{0,\dots,n\},\qquad \big|\mathbb{E}\,f(\bar X^n_{t_k})-\mathbb{E}\,f(X_{t_k})\big| \le C'_{b,\sigma,f}(t_k)\,\frac{t_k}{k} \le C'_{b,\sigma,f}(T)\,\frac Tn.$$
Set
$$\varphi(t,x) := \bar\varphi(t,t,x,x),$$
which is (at least) a $C^{1,2}$-function whose time and space partial derivatives have polynomial growth in $x$, uniformly in $t\in[0,T]$, owing to the above properties of $\bar\varphi$.
The idea is once again to apply Itô's formula, this time to $\bar\varphi(\,.\,,\underline r,\,.\,,\bar X^n_{\underline r})$. Let $r\in[t_{k-1},t_k)$ (so that $\underline r=t_{k-1}$). Then,
$$\bar\varphi\big(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}\big) = \varphi\big(\underline r,\bar X^n_{\underline r}\big) + \int_{t_{k-1}}^r\Big(\partial_t\bar\varphi\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big) + \bar L\bar\varphi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\Big)\,dv$$
$$\qquad + \int_{t_{k-1}}^r\partial_x\bar\varphi\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\,\sigma\big(\underline v,\bar X^n_{\underline v}\big)\,dW_v,$$
where we used that the mute variable $v$ satisfies $\underline v=\underline r=t_{k-1}$. The stochastic integral turns out to be the increment of a true square integrable martingale since
$$\sup_{v\in[0,T]}\big|\partial_x\bar\varphi\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\,\sigma\big(\underline v,\bar X^n_{\underline v}\big)\big| \le C\Big(1+\sup_{s\in[0,T]}\big|\bar X^n_s\big|^\theta\Big)\in L^2(\mathbb{P})$$
where θ ∈ N, owing to the above (ii) and the linear growth of σ. Then, Fubini’s
Theorem yields
$$\mathbb{E}\big[u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big] = \frac12\Big(\frac Tn\Big)^2\,\mathbb{E}\,\varphi\big(t_{k-1},\bar X^n_{t_{k-1}}\big)\tag{7.71}$$
$$\quad + \mathbb{E}\int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{s}\int_{t_{k-1}}^{r}\Big(\partial_t\bar\varphi\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big) + \bar L\bar\varphi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\Big)\,dv\,dr\,ds$$
$$\quad + \underbrace{\mathbb{E}\int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{s}\big(N_r-N_{t_{k-1}}\big)\,dr\,ds}_{=0}.$$
Now,
$$\mathbb{E}\sup_{v,r\in[0,T]}\Big|\bar L\bar\varphi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\Big|<+\infty$$
owing to (7.65) and the polynomial growth of $b$, $\sigma$, $u$ and its partial derivatives. The same holds for $\partial_t\bar\varphi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})$, so that
$$\Big|\mathbb{E}\int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{s}\int_{t_{k-1}}^{r}\Big(\partial_t\bar\varphi\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big) + \bar L\bar\varphi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\Big)\,dv\,dr\,ds\Big| \le \frac{C_{b,\sigma,f,T}}{3}\Big(\frac Tn\Big)^3.$$
$$\mathbb{E}\,f(\bar X^n_T)-\mathbb{E}\,f(X_T) = \sum_{k=1}^n\mathbb{E}\big[u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big] = \frac{T}{2n}\,\mathbb{E}\int_0^T\varphi\big(\underline s,\bar X^n_{\underline s}\big)\,ds + O\Big(\Big(\frac Tn\Big)^2\Big).$$
In turn, for every k ∈ {0, . . . , n}, the function φ(tk , . ) satisfies Assumption (F∞ )
with some bounds not depending on k (this in turn follows from the fact that the
space partial derivatives of φ have polynomial growth in x uniformly in t ∈ [0, T ]).
Consequently, by Step 2,
$$\max_{0\le k\le n}\big|\mathbb{E}\,\varphi(t_k,\bar X^n_{t_k})-\mathbb{E}\,\varphi(t_k,X_{t_k})\big| \le C_{b,\sigma,f}(T)\,\frac Tn$$
so that
$$\max_{0\le k\le n}\Big|\mathbb{E}\int_0^{t_k}\varphi\big(\underline s,\bar X^n_{\underline s}\big)\,ds - \mathbb{E}\int_0^{t_k}\varphi\big(\underline s,X_{\underline s}\big)\,ds\Big| = \max_{0\le k\le n}\Big|\int_0^{t_k}\Big(\mathbb{E}\,\varphi\big(\underline s,\bar X^n_{\underline s}\big)-\mathbb{E}\,\varphi\big(\underline s,X_{\underline s}\big)\Big)\,ds\Big| \le C_{b,\sigma,f}(T)\,\frac{T^2}{n},$$
which implies
$$\sup_{s\in[0,T]}\big|\mathbb{E}\,\varphi\big(\underline s,X_{\underline s}\big)-\mathbb{E}\,\varphi(s,X_s)\big| \le C_{f,b,\sigma}(T)\,\frac Tn.$$
Hence
$$\max_{0\le k\le n}\Big|\mathbb{E}\int_0^{t_k}\varphi\big(\underline s,X_{\underline s}\big)\,ds - \mathbb{E}\int_0^{t_k}\varphi(s,X_s)\,ds\Big| \le \frac{C_{b,\sigma,f}(T)}{n},\tag{7.72}$$
which completes the proof of the first-order expansion. Note that the function $\varphi$ can be made explicit. $\diamond$
One step beyond (Toward R = 2). To expand at a higher-order, say R = 2, one can
proceed as follows: first we go back to (7.71), which can be re-written as
$$\mathbb{E}\big[u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big] = \frac12\Big(\frac Tn\Big)^2\,\mathbb{E}\,\varphi\big(t_{k-1},\bar X^n_{t_{k-1}}\big) + \mathbb{E}\int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{s}\int_{t_{k-1}}^{r}\bar\chi\big(v,\underline v,\bar X^n_v,\bar X^n_{\underline v}\big)\,dv\,dr\,ds,$$
where $\bar\chi$ and $\chi(t,x)=\bar\chi(t,t,x,x)$ and their partial derivatives satisfy smoothness and growth properties similar to those of $\bar\varphi$ and $\varphi$, respectively. Then, one gets by mimicking the above computations
$$\mathbb{E}\big[u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big] = \frac12\Big(\frac Tn\Big)^2\,\mathbb{E}\,\varphi\big(t_{k-1},\bar X^n_{t_{k-1}}\big) + \frac13\Big(\frac Tn\Big)^3\,\mathbb{E}\,\chi\big(t_{k-1},\bar X^n_{t_{k-1}}\big)$$
$$\quad + \mathbb{E}\int_{t_{k-1}}^{t_k}\int_{t_{k-1}}^{s}\int_{t_{k-1}}^{r}\int_{t_{k-1}}^{v}\bar\psi\big(w,\underline w,\bar X^n_w,\bar X^n_{\underline w}\big)\,dw\,dv\,dr\,ds,$$
(where $B(0,N)$ denotes the ball centered at $0$ with radius $N$ and $\|\cdot\|$ denotes, e.g., the Frobenius norm of matrices).
(ii)$_{LinG}$ Linear growth:
$$\exists\,C>0,\ \forall\,t\in\mathbb{R}_+,\ \forall\,x\in\mathbb{R}^d,\qquad |b(t,x)|\vee\|\sigma(t,x)\| \le C\big(1+|x|^2\big)^{\frac12}.$$
$$\forall\,x\in\mathbb{R}^d,\qquad LV(x) := \big(b(x)\,|\,\nabla V(x)\big) + \frac12\,\mathrm{Tr}\big(\sigma(x)\,D^2V(x)\,\sigma^*(x)\big) \le \lambda\,V(x),\tag{7.74}$$
then, for every x ∈ Rd , the SDE (7.1) has a unique (strong) solution (X tx )t≥0 on the
whole non-negative real line starting from x at time 0.
A typical example where (i)$_{Lip_{loc}}$ and (ii)$_{WLyap}$ are satisfied is the following. Let $p\in\mathbb{R}$ and set
$$b(x) = \kappa\big(1+|x|^2\big)^{\frac p2}\,x,\qquad x\in\mathbb{R}^d.$$
The function $b$ is clearly locally Lipschitz continuous but not Lipschitz continuous when $p>0$. Assume $\sigma$ is locally Lipschitz continuous in $x$ with linear growth in the sense of (ii)$_{LinG}$; then one easily checks with $V(x)=|x|^2+1$ that
$$LV(x) = 2\big(b(x)\,|\,x\big) + \|\sigma(x)\|^2 \le 2\kappa\big(1+|x|^2\big)^{\frac p2}|x|^2 + C^2\big(1+|x|^2\big).$$
7.9 The Non-globally Lipschitz Case (A Few Words On) 361
• If $\kappa\ge0$ and $p\le0$, then (i) $(\mathrm{Lip}_{loc})$ and (ii) $(\mathrm{LinG})$ are satisfied, as well as (7.74) (with $\lambda=2\kappa+C^2$).
• If $\kappa<0$, then (7.74) is satisfied with $\lambda=C^2$ for every $p$, whereas (ii) $(\mathrm{LinG})$ fails when $p>0$.
This allows us to deal with SDEs like perturbed dissipative gradient equations, equations coming from Hamiltonian mechanics (see e.g. [267]), etc., where the drift has mean-reverting properties but (non-linear) polynomial growth at infinity.
Unfortunately, this stability, or more precisely this non-explosion property induced by the existence of a weak Lyapunov function as described above, cannot be transferred to the regular Euler or Milstein schemes, which usually have an explosive behavior, depending on the step. Actually, this phenomenon can already be observed in a deterministic framework with ODEs, but it is more systematic with (true) SDEs since, unlike for ODEs, no stability region depending on the starting value of the scheme usually exists.
A natural idea, inspired by the numerical analysis of ODEs, leads to the introduction of fully and partially drift-implicit Euler schemes, but also of new classes of explicit approximation schemes scarcely more complex than the regular Euler scheme. We will not go further in this direction but refer, for example, to [152] for an in-depth study of the approximation of such SDEs, including moment bounds and convergence rates.
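To fix ideas, one widely used explicit modification of this kind is the so-called "tamed" Euler scheme, in which the drift increment is renormalized so that a single step can never blow up. The sketch below is purely illustrative (the drift, parameters and taming rule are our own choices, not the schemes studied in [152]); it uses the dissipative drift $b(x)=\kappa(1+x^2)x$ with $\kappa<0$ discussed above, for which the regular Euler scheme may explode.

```python
import numpy as np

def tamed_euler(x0, b, sigma, T, n, rng):
    # "Tamed" Euler step: the drift increment h*b(x) is replaced by
    # h*b(x)/(1 + h*|b(x)|), which is bounded by 1 in absolute value, so a
    # single step can never explode, while the scheme still agrees with the
    # regular Euler scheme at first order in h = T/n.
    h = T / n
    x = float(x0)
    for _ in range(n):
        bx = b(x)
        x = x + h * bx / (1.0 + h * abs(bx)) + sigma(x) * np.sqrt(h) * rng.standard_normal()
    return x

# Dissipative drift with polynomial growth: b(x) = kappa*(1 + x^2)*x, kappa < 0 (p = 2)
rng = np.random.default_rng(0)
kappa = -1.0
paths = [tamed_euler(2.0, lambda x: kappa * (1.0 + x * x) * x, lambda x: 0.5,
                     T=1.0, n=200, rng=rng) for _ in range(1000)]
print(np.mean(paths), np.max(np.abs(paths)))
```

With $\kappa<0$ the simulated values stay in a bounded region around $0$, in line with the mean-reverting behavior of the continuous-time dynamics.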
Chapter 8
The Diffusion Bridge Method: Application to Path-Dependent Options (II)
In this section we deal with some “path-dependent” (European) options. Such con-
tracts are characterized by the fact that their payoffs depend on the whole past of
the underlying asset(s) between the origin t = 0 of the contract
and its maturity T .
This means that these payoffs are of the form $F\big((X_t)_{t\in[0,T]}\big)$, where $F$ is a functional usually naturally defined from $\mathbb{D}([0,T],\mathbb{R}^d)$ to $\mathbb{R}_+$ (where $\mathbb{D}([0,T],\mathbb{R}^d)$ is the set of càdlàg functions $x:[0,T]\to\mathbb{R}^d$ $({}^1)$ and $X=(X_t)_{t\in[0,T]}$ denotes the dynamics of the underlying asset). We still assume from now on that $X=(X_t)_{t\in[0,T]}$ is a solution to an $\mathbb{R}^d$-valued SDE of type (7.1):
$^1$ We need to define $F$ on càdlàg functions in view of the stepwise constant Euler scheme, not to speak of jump diffusions driven by Lévy processes.
© Springer International Publishing AG, part of Springer Nature 2018 363
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_8
Theorem 8.1 (a) If the domain $D$ is bounded and has a smooth enough boundary (in fact $\mathcal{C}^3$), if $b\in\mathcal{C}^3(\mathbb{R}^d,\mathbb{R}^d)$, $\sigma\in\mathcal{C}^3\big(\mathbb{R}^d,\mathbb{M}(d,q,\mathbb{R})\big)$ and $\sigma$ is uniformly elliptic on $D$ (i.e. $\sigma\sigma^*(x)\ge\varepsilon_0 I_d$, $\varepsilon_0>0$), then, for every bounded Borel function $f$ vanishing in a neighborhood of $\partial D$,
\[
\mathbb{E}\,f(X_T)\mathbf{1}_{\{\tau_D(X)>T\}}-\mathbb{E}\,f(\bar X^n_T)\mathbf{1}_{\{\tau_D(\bar X^n)>T\}}=O\Big(\frac{1}{\sqrt n}\Big)\quad\text{as }n\to+\infty.
\tag{8.1}
\]
Note, however, that these assumptions are unfortunately not satisfied by standard
barrier options (see below).
– It is suggested in [264] (with a rigorous proof when $X=W$) that if $b,\sigma\in\mathcal{C}^4_b(\mathbb{R},\mathbb{R})$, $\sigma$ is uniformly elliptic and $f\in\mathcal{C}^{4,2}_{pol}(\mathbb{R})$ (existing partial derivatives with
$^2$ When $\alpha$ is continuous or stepwise constant and càdlàg, $\tau_D(\alpha):=\inf\{s\in[0,T]:\alpha(s)\notin D\}$.
8.1 Theoretical Results About Time Discretization of Path-Dependent Functionals 365
More recently, relying on new techniques based on transport inequalities and the Wasserstein metric, a partial result toward this conjecture has been established in [5], with an error of $O\big(n^{-(\frac23-\eta)}\big)$ for every $\eta>0$.
To take advantage of the above rates and the reasonable guess we may have about higher-order expansions, we do not need to simulate the genuine Euler scheme itself (which is meaningless), but only some specific functionals of this scheme — like the maximum, the minimum, time integrals, etc. — between two time discretization instants $t^n_k$ and $t^n_{k+1}$, given the (simulated) values of the discrete time Euler scheme. This means bridging this discrete time Euler scheme into its genuine extension. First, we deal with the standard Brownian motion itself.
We still denote by (FtW )t≥0 the (completed) natural filtration of a standard Brownian
motion W . We begin with a quick study of the standard Brownian bridge between 0
and T .
\[
Y^{W,T}_t:=W_t-\frac tT\,W_T,\qquad t\in[0,T],
\tag{8.4}
\]
\[
\mathbb{E}\big(Y^{W,T}_s Y^{W,T}_t\big)=s\wedge t-\frac{st}{T}=\frac{(s\wedge t)\,(T-s\vee t)}{T},\qquad 0\le s,t\le T.
\]
(b) Let $T_0,T_1\in(0,+\infty)$, $T_0<T_1$. Then
\[
\mathcal{L}\big((W_t)_{t\in[T_0,T_1]}\,\big|\,W_s,\ s\notin(T_0,T_1)\big)=\mathcal{L}\big((W_t)_{t\in[T_0,T_1]}\,\big|\,W_{T_0},W_{T_1}\big)
\]
and
\[
\mathcal{L}\big((W_t)_{t\in[T_0,T_1]}\,\big|\,W_{T_0}=x,\ W_{T_1}=y\big)
=\mathcal{L}\Big(\Big(x+\frac{t-T_0}{T_1-T_0}(y-x)+Y^{B,\,T_1-T_0}_{t-T_0}\Big)_{t\in[T_0,T_1]}\Big).
\]
Proof (a) The process $Y^{W,T}$ is clearly centered since $W$ is. Elementary computations based on the covariance structure of the standard Brownian motion,
\[
\mathbb{E}\,W_sW_t=s\wedge t,\qquad s,t\in[0,T],
\]
yield
\[
\mathbb{E}\big(Y^{W,T}_t Y^{W,T}_s\big)=\mathbb{E}\,W_tW_s-\frac sT\,\mathbb{E}\,W_tW_T-\frac tT\,\mathbb{E}\,W_sW_T+\frac{st}{T^2}\,\mathbb{E}\,W_T^2
= s\wedge t-\frac{st}{T}-\frac{ts}{T}+\frac{ts}{T}
= s\wedge t-\frac{ts}{T}=\frac{(s\wedge t)(T-s\vee t)}{T}.
\]
Let $\mathcal{G}_W=\overline{\mathrm{span}}^{L^2(\mathbb{P})}\{W_t,\ t\ge0\}$ be the closed vector subspace of $L^2(\Omega,\mathcal{A},\mathbb{P})$ spanned by the Brownian motion $W$. Since $W$ is a centered Gaussian process, it is a well-known fact that independence and absence of correlation coincide in $\mathcal{G}_W$. The process $Y^{W,T}$ belongs to this space by construction. Likewise one shows that, for every $u\ge T$, $\mathbb{E}\big(Y^{W,T}_t W_u\big)=0$, so that $Y^{W,T}_t\perp \mathrm{span}(W_u,\ u\ge T)$. Consequently, $Y^{W,T}$ is independent of $(W_{T+u})_{u\ge0}$.
(b) First note that, for every $t\in[T_0,T_1]$,
\[
W_t=W_{T_0}+W^{(T_0)}_{t-T_0},\qquad\text{where } W^{(T_0)}_s:=W_{T_0+s}-W_{T_0},\ s\ge0,
\]
and that, for every $s\in[0,T_1-T_0]$,
8.2 From Brownian to Diffusion Bridge: How to Simulate Functionals … 367
\[
W^{(T_0)}_s=\frac{s}{T_1-T_0}\,W^{(T_0)}_{T_1-T_0}+Y^{W^{(T_0)},\,T_1-T_0}_s.
\tag{8.5}
\]
It follows from (a) that the process $Y:=\big(Y^{W^{(T_0)},\,T_1-T_0}_{t-T_0}\big)_{t\in[T_0,T_1]}$ is a Gaussian process, measurable with respect to $\mathcal{F}^{W^{(T_0)}}_{T_1-T_0}$ by (a), hence independent of $\mathcal{F}^W_{T_0}$ since $W^{(T_0)}$ is. Consequently, $Y$ is independent of $(W_t)_{t\in[0,T_0]}$. Furthermore, $Y$ is independent of $\big(W^{(T_0)}_{T_1-T_0+u}\big)_{u\ge0}$ by (a). Hence it is $L^2(\mathbb{P})$-orthogonal to $W_{T_1+u}-W_{T_0}$, $u\ge0$, in $\mathcal{G}_W$. As it is also orthogonal to $W_{T_0}$ in $\mathcal{G}_W$, since $W_{T_0}$ is $\mathcal{F}^W_{T_0}$-measurable, it is orthogonal to $W_{T_1+u}=W^{(T_0)}_{T_1-T_0+u}+W_{T_0}$, $u\ge0$, which in turn implies independence of $Y$ and $(W_{T_1+u})_{u\ge0}$ since all these random variables lie in $\mathcal{G}_W$.
Finally, the same argument — in $\mathcal{G}_W$, no correlation implies independence — shows that $Y$ is independent of $(W_s)_{s\in\mathbb{R}_+\setminus(T_0,T_1)}$, i.e. of the $\sigma$-field $\sigma(W_s,\ s\notin(T_0,T_1))$. The end of the proof follows from the above identity (8.5) and the exercises below. ♦
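The covariance identity in (a) is easy to check numerically. In the short sketch below (parameter values are our own), bridge paths are built from discretized Brownian paths and the empirical covariance at two fixed times is compared with the closed form $(s\wedge t)(T-s\vee t)/T$.

```python
import numpy as np

# Monte Carlo check of E[Y_s Y_t] = (s ^ t)(T - s v t)/T for Y_t = W_t - (t/T) W_T.
rng = np.random.default_rng(42)
T, n, M = 1.0, 40, 50_000
times = np.linspace(0.0, T, n + 1)
dW = np.sqrt(T / n) * rng.standard_normal((M, n))
W = np.concatenate([np.zeros((M, 1)), np.cumsum(dW, axis=1)], axis=1)
Y = W - (times / T) * W[:, -1:]            # bridge paths, pinned at 0 at both ends
s, t = 0.25, 0.7
i, j = round(s * n), round(t * n)          # grid indices: times[i] = s, times[j] = t
emp = np.mean(Y[:, i] * Y[:, j])
theo = min(s, t) * (T - max(s, t)) / T     # = 0.25 * 0.3 = 0.075
print(emp, theo)
```

Up to the Monte Carlo error (of order $M^{-1/2}$), the empirical value matches the theoretical covariance.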
\[
\mathbb{E}\big(W_t\,\big|\,W_{T_0}=x,\ W_{T_1}=y\big)=\frac{T_1-t}{T_1-T_0}\,x+\frac{t-T_0}{T_1-T_0}\,y,\qquad t\in[T_0,T_1],
\]
and
\[
\mathrm{Cov}\big(W_t,W_s\,\big|\,W_{T_0}=x,\ W_{T_1}=y\big)=\frac{(T_1-t)(s-T_0)}{T_1-T_0},\qquad s\le t,\ s,t\in[T_0,T_1].
\]
Proposition 8.2 (Bridge of the Euler scheme) Assume that $\sigma(t,x)\neq0$ for every $t\in[0,T]$, $x\in\mathbb{R}$.
(a) The processes $(\bar X^n_t)_{t\in[t^n_k,t^n_{k+1}]}$, $k=0,\dots,n-1$, are conditionally independent given the $\sigma$-field $\sigma\big(\bar X^n_{t^n_k},\ k=0,\dots,n\big)$.
(b) Furthermore, for every $k\in\{0,\dots,n-1\}$, the conditional distribution
\[
\mathcal{L}\big((\bar X^n_t)_{t\in[t^n_k,t^n_{k+1}]}\,\big|\,\bar X^n_{t^n_\ell}=x_\ell,\ \ell=0,\dots,n\big)
=\mathcal{L}\big((\bar X^n_t)_{t\in[t^n_k,t^n_{k+1}]}\,\big|\,\bar X^n_{t^n_k}=x_k,\ \bar X^n_{t^n_{k+1}}=x_{k+1}\big)
\]
\[
=\mathcal{L}\Big(\Big(x_k+\frac{n(t-t^n_k)}{T}(x_{k+1}-x_k)+\sigma(t^n_k,x_k)\,Y^{B,\,T/n}_{t-t^n_k}\Big)_{t\in[t^n_k,t^n_{k+1}]}\Big),
\]
where $\big(Y^{B,\,T/n}_s\big)_{s\in[0,T/n]}$ is a Brownian bridge (related to a generic Brownian motion $B$) as defined by (8.4). The distribution of this Gaussian process (sometimes called a diffusion bridge) is entirely characterized by:
– its expectation function $\Big(x_k+\dfrac{n(t-t^n_k)}{T}(x_{k+1}-x_k)\Big)_{t\in[t^n_k,t^n_{k+1}]}$
and
– its covariance operator $\ \sigma^2(t^n_k,x_k)\,\dfrac{(s\wedge t-t^n_k)\,(t^n_{k+1}-s\vee t)}{t^n_{k+1}-t^n_k}$, $\ s,t\in[t^n_k,t^n_{k+1}]$.
Proof. For every $t\in[t^n_k,t^n_{k+1}]$, the genuine Euler scheme can be decomposed as
\[
\bar X^n_t=\bar X^n_{t^n_k}+\frac{t-t^n_k}{t^n_{k+1}-t^n_k}\big(\bar X^n_{t^n_{k+1}}-\bar X^n_{t^n_k}\big)+\sigma\big(t^n_k,\bar X^n_{t^n_k}\big)\,Y^{W^{(t^n_k)},\,T/n}_{t-t^n_k}
\]
(keeping in mind that $t^n_{k+1}-t^n_k=T/n$). Consequently, the conditional independence claim will follow if the processes $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, are independent given $\sigma\big(\bar X^n_{t^n_\ell},\ \ell=0,\dots,n\big)$. Now, it follows from the assumption on the diffusion coefficient $\sigma$ that
\[
\sigma\big(\bar X^n_{t^n_\ell},\ \ell=0,\dots,n\big)=\sigma\big(X_0,\ W_{t^n_\ell},\ \ell=1,\dots,n\big).
\]
\[
\sigma\big(\bar X^n_{t^n_k},\,W_{t^n_k},\,\bar X^n_{t^n_{k+1}}\big)\subset\sigma\big(\{W_{t^n_\ell},\ \ell=1,\dots,n\}\big),
\]
so that $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$ is independent of $\big(\bar X^n_{t^n_k},\,W_{t^n_k},\,\bar X^n_{t^n_{k+1}}\big)$. The conclusion follows. ♦
Now we know the distribution of the genuine Euler scheme between two successive discretization times $t^n_k$ and $t^n_{k+1}$, conditionally on the Euler scheme at its discretization times. We are thus in a position to simulate some functionals of the continuous Euler scheme, starting with its supremum.
Proposition 8.3 The distribution of the supremum of the Brownian bridge starting at $0$ and arriving at $y$ at time $T$, defined by $Y^{W,T,y}_t=\frac tT\,y+W_t-\frac tT\,W_T$ on $[0,T]$, is given by
\[
\mathbb{P}\Big(\sup_{t\in[0,T]}Y^{W,T,y}_t\le z\Big)=
\begin{cases}
1-\exp\big(-\frac 2T\,z(z-y)\big) & \text{if } z\ge\max(y,0),\\[2pt]
0 & \text{if } z\le\max(y,0).
\end{cases}
\]
Proof. The key is to keep in mind that the distribution of $Y^{W,T,y}$ is the conditional distribution of $W$ given $W_T=y$. So we can derive the result from an expression of the joint distribution of $\big(\sup_{t\in[0,T]}W_t,\,W_T\big)$, e.g. from
\[
\mathbb{P}\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big).
\]
It is well-known from the symmetry principle that, for every $z\ge\max(y,0)$,
\[
\mathbb{P}\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big)=\mathbb{P}(W_T\ge 2z-y).
\]
We briefly reproduce the proof for the reader's convenience. If $z=0$, the result is obvious since $W_T\overset{d}{=}-W_T$. If $z>0$, one introduces the hitting time $\tau_z:=\inf\{s>0:\ W_s=z\}$ of $[z,+\infty)$ by $W$ (convention $\inf\emptyset=+\infty$). This is an $(\mathcal{F}^W_t)$-stopping time since $[z,+\infty)$ is a closed set and $W$ is a continuous process (this uses that $z>0$). Furthermore, $\tau_z$ is a.s. finite since $\limsup_{t\to+\infty}W_t=+\infty$ a.s. Consequently, still by continuity of its paths, $W_{\tau_z}=z$ a.s. and $(W_{\tau_z+t}-W_{\tau_z})_{t\ge0}$ is independent of $\mathcal{F}^W_{\tau_z}$. As a consequence, for every $z\ge\max(y,0)$, using that $W_{\tau_z}=z$ on the event $\{\tau_z\le T\}$,
\[
\mathbb{P}\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big)
=\mathbb{P}\big(\tau_z\le T,\ W_T-W_{\tau_z}\le y-z\big)
=\mathbb{P}\big(\tau_z\le T,\ -(W_T-W_{\tau_z})\le y-z\big)
=\mathbb{P}\big(\tau_z\le T,\ W_T\ge 2z-y\big)
=\mathbb{P}\big(W_T\ge 2z-y\big)
=\int_{2z-y}^{+\infty}h_T(\xi)\,d\xi\qquad\text{with}\qquad h_T(\xi)=\frac{e^{-\frac{\xi^2}{2T}}}{\sqrt{2\pi T}}.
\]
Corollary 8.1 Let $\lambda>0$ and let $x,y\in\mathbb{R}$. If $Y^{W,T}$ denotes the standard Brownian bridge of $W$ between $0$ and $T$, then, for every $z\in\mathbb{R}$,
\[
\mathbb{P}\Big(\sup_{t\in[0,T]}\Big(x+\frac tT(y-x)+\lambda\,Y^{W,T}_t\Big)\le z\Big)=
\begin{cases}
1-\exp\big(-\frac{2}{T\lambda^2}(z-x)(z-y)\big) & \text{if } z\ge\max(x,y),\\[2pt]
0 & \text{if } z<\max(x,y).
\end{cases}
\tag{8.6}
\]
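Formula (8.6) is straightforward to implement; a minimal transcription follows (the function name is our own choice).

```python
import math

def sup_bridge_cdf(z, x, y, lam, T):
    # P( sup_{0<=t<=T} ( x + (t/T)(y - x) + lam * Y_t^{W,T} ) <= z ), i.e. (8.6).
    if z < max(x, y):
        return 0.0
    return 1.0 - math.exp(-2.0 * (z - x) * (z - y) / (T * lam * lam))

print(sup_bridge_cdf(1.0, 0.0, 0.0, 1.0, 1.0))   # 1 - exp(-2) = 0.8646647...
print(sup_bridge_cdf(0.5, 0.0, 1.0, 1.0, 1.0))   # 0.0: z below max(x, y)
```

The special case $x=y=0$, $\lambda=1$ recovers the distribution function $1-e^{-2z^2/T}$ of the supremum of the standard Brownian bridge.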
8.2 From Brownian to Diffusion Bridge: How to Simulate Functionals … 371
[…] $x+\frac tT(y-x)+\lambda\,Y^{W,T}_t$ […] are independent. This can also be interpreted as the conditional independence of the random variables $M^{n,k}_{\bar X^n_{t^n_k},\bar X^n_{t^n_{k+1}}}$, $k=0,\dots,n-1$. Set
\[
G^{x,y}_{n,k}(z)=\Big(1-\exp\Big(-\frac{2n}{T\,\sigma^2(t^n_k,x)}\,(z-x)(z-y)\Big)\Big)\mathbf{1}_{\{z\ge\max(x,y)\}},\qquad z\in\mathbb{R}.
\]
Then, the inverse distribution simulation rule (see Proposition 1.1) yields that
\[
\sup_{t\in[0,T/n]}\Big(x+\frac{nt}{T}(y-x)+\sigma(t^n_k,x)\,Y^{W^{(t^n_k)},\,T/n}_t\Big)
\overset{d}{=}\big(G^{x,y}_{n,k}\big)^{-1}(U),\qquad U\overset{d}{=}\mathcal{U}([0,1]),
\]
\[
\overset{d}{=}\big(G^{x,y}_{n,k}\big)^{-1}(1-U),
\tag{8.7}
\]
where we used that $U\overset{d}{=}1-U$. To determine $\big(G^{x,y}_{n,k}\big)^{-1}$ (at $1-u$), it remains to solve the equation $G^{x,y}_{n,k}(z)=1-u$ under the constraint $z\ge\max(x,y)$, i.e.
\[
1-\exp\Big(-\frac{2n}{T\,\sigma^2(t^n_k,x)}(z-x)(z-y)\Big)=1-u,\qquad z\ge\max(x,y),
\]
or, equivalently,
\[
z^2-(x+y)z+xy+\frac{T}{2n}\,\sigma^2(t^n_k,x)\log(u)=0,\qquad z\ge\max(x,y).
\]
This quadratic equation has two solutions, of which only the larger one satisfies the constraint. Consequently,
\[
\big(G^{x,y}_{n,k}\big)^{-1}(1-u)=\frac12\Big(x+y+\sqrt{(x-y)^2-2T\sigma^2(t^n_k,x)\log(u)/n}\Big).
\]
Finally,
\[
\mathcal{L}\Big(\max_{t\in[0,T]}\bar X^n_t\ \Big|\ \bar X^n_{t^n_k}=x_k,\ k=0,\dots,n\Big)
=\mathcal{L}\Big(\max_{0\le k\le n-1}\big(G^{x_k,x_{k+1}}_{n,k}\big)^{-1}(1-U_{k+1})\Big),
\]
where $(U_k)_{1\le k\le n}$ are i.i.d. random variables, uniformly distributed over the unit interval.
Pseudo-code for Lookback style options
We assume for the sake of simplicity that the interest rate $r$ is $0$. By Lookback style options we mean the class of options whose payoffs possibly involve both $\bar X^n_T$ and the maximum of $(X_t)$ over $[0,T]$, i.e.
\[
\mathbb{E}\,f\Big(\bar X^n_T,\ \sup_{t\in[0,T]}\bar X^n_t\Big).
\]
• Set $S^f_0=0$.
for $m=1$ to $M$
• Simulate a path of the discrete time Euler scheme and set $x_k:=\bar X^{n,(m)}_{t^n_k}$, $k=0,\dots,n$.
• Simulate $\xi^{(m)}:=\max_{0\le k\le n-1}\big(G^{x_k,x_{k+1}}_{n,k}\big)^{-1}\big(1-U^{(m)}_{k+1}\big)$, where $(U^{(m)}_k)_{1\le k\le n}$ are i.i.d. with $\mathcal{U}([0,1])$-distribution.
• Compute $f\big(\bar X^{n,(m)}_T,\xi^{(m)}\big)$.
• Compute $S^f_m:=f\big(\bar X^{n,(m)}_T,\xi^{(m)}\big)+S^f_{m-1}$.
end.($m$)
• Eventually,
\[
\mathbb{E}\,f\Big(\bar X^n_T,\ \sup_{t\in[0,T]}\bar X^n_t\Big)\simeq\frac{S^f_M}{M}.\;({}^3)
\]
$^3$ …Of course one needs to compute the empirical variance, (approximately) given by
\[
\frac1M\sum_{m=1}^M f\big(\bar X^{n,(m)}_T,\xi^{(m)}\big)^2-\Big(\frac1M\sum_{m=1}^M f\big(\bar X^{n,(m)}_T,\xi^{(m)}\big)\Big)^2,
\]
in order to design a confidence interval, without which the method is simply nonsense….
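The loop above vectorizes naturally over the $M$ paths. The sketch below prices the Lookback payoff $\max X - X_T$ in the Black–Scholes model with $r=0$ (parameters are our own; for simplicity the grid values are simulated with the exact scheme, the diffusion coefficient being frozen at $\sigma(t^n_k,x_k)=\sigma x_k$ in the bridge as in Proposition 8.2). It also reports the naive estimator based on the discrete-time maximum only, which is always smaller pathwise.

```python
import numpy as np

# Lookback payoff f(X_T, max X) = max X - X_T, with the conditional maximum of
# the Euler bridge on each interval drawn via (G_{n,k}^{x,y})^{-1}(1 - U).
rng = np.random.default_rng(1)
x0, sigma, T, n, M = 100.0, 0.2, 1.0, 50, 20_000
h = T / n
Z = rng.standard_normal((M, n))
W = np.sqrt(h) * np.cumsum(Z, axis=1)
t = h * np.arange(1, n + 1)
X = np.concatenate([np.full((M, 1), x0),
                    x0 * np.exp(-0.5 * sigma**2 * t + sigma * W)], axis=1)
U = rng.uniform(size=(M, n))
xk, xk1 = X[:, :-1], X[:, 1:]
sig_k = sigma * xk                    # frozen diffusion coefficient on [t_k, t_{k+1}]
Msup = 0.5 * (xk + xk1 + np.sqrt((xk - xk1)**2
                                 - 2.0 * T * sig_k**2 * np.log(U) / n))
bridged = np.mean(Msup.max(axis=1) - X[:, -1])    # continuous-time max estimator
discrete = np.mean(X.max(axis=1) - X[:, -1])      # discrete monitoring (biased low)
print(discrete, bridged)
```

Since each sampled interval maximum dominates both endpoints, the bridged estimate always exceeds the discrete-monitoring one, correcting (most of) its systematic downward bias.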
(b) Derive a formula similar to (8.7) for the conditional distribution of the minimum of the continuous Euler scheme, using now the inverse distribution functions
\[
\big(F^{n,k}_{x,y}\big)^{-1}(u)=\frac12\Big(x+y-\sqrt{(x-y)^2-2T\sigma^2(t^n_k,x)\log(u)/n}\Big),\qquad u\in(0,1),
\]
of the random variables $\displaystyle\inf_{t\in[0,T/n]}\Big(x+\frac{nt}{T}(y-x)+\sigma(t^n_k,x)\,Y^{W,T/n}_t\Big)$.
Warning! The above method is not appropriate for simulating the joint distribution of the $(n+3)$-tuple $\Big(\bar X^n_{t^n_k},\ k=0,\dots,n,\ \inf_{t\in[0,T]}\bar X^n_t,\ \sup_{t\in[0,T]}\bar X^n_t\Big)$.
where $K$ denotes the strike price of the option and $L$ ($L>K$) its barrier.
In practice, the "Call" part is activated at $T$ only if the process $(X_t)$ hits the barrier $L$ between $0$ and $T$. In fact, as far as simulation is concerned, this "Call part" can be replaced by any Borel function $f$ such that both $f(X_T)$ and $f(\bar X^n_T)$ are integrable (which is always true if $f$ has polynomial growth, owing to Proposition 7.2). Note that these so-called barrier options are in fact a sub-class of generalized maximum Lookback options, having the specificity that the maximum only shows up through an indicator function.
Then, one may derive a general weighted formula for $\mathbb{E}\,f(\bar X^n_T)\mathbf{1}_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}$, which is an approximation of $\mathbb{E}\,f(X_T)\mathbf{1}_{\{\sup_{t\in[0,T]}X_t\le L\}}$.
Then we use the conditional independence of the bridges of the genuine Euler scheme given the values $\bar X^n_{t_k}$, $k=0,\dots,n$, established in Proposition 8.2. It follows that
\[
\mathbb{E}\,f(\bar X^n_T)\mathbf{1}_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}
=\mathbb{E}\Big(f(\bar X^n_T)\,\mathbb{P}\Big(\sup_{t\in[0,T]}\bar X^n_t\le L\ \Big|\ \bar X^n_{t_k},\ k=0,\dots,n\Big)\Big)
=\mathbb{E}\Big(f(\bar X^n_T)\prod_{k=0}^{n-1}G^{\bar X^n_{t_k},\bar X^n_{t_{k+1}}}_{n,k}(L)\Big)
\]
\[
=\mathbb{E}\Bigg[f(\bar X^n_T)\,\mathbf{1}_{\{\max_{0\le k\le n}\bar X^n_{t_k}\le L\}}\prod_{k=0}^{n-1}\Bigg(1-e^{-\frac{2n}{T\,\sigma^2(t^n_k,\bar X^n_{t_k})}\,(\bar X^n_{t_k}-L)(\bar X^n_{t_{k+1}}-L)}\Bigg)\Bigg].
\tag{8.8}
\]
♦
Furthermore, we know that the random variable on the right-hand side always has a lower variance, since it is a conditional expectation of the random variable on the left-hand side, namely
\[
\mathrm{Var}\Bigg(f(\bar X^n_T)\,\mathbf{1}_{\{\max_{0\le k\le n}\bar X^n_{t_k}\le L\}}\prod_{k=0}^{n-1}\Bigg(1-e^{-\frac{2n}{T\,\sigma^2(t^n_k,\bar X^n_{t_k})}\,(\bar X^n_{t_k}-L)(\bar X^n_{t_{k+1}}-L)}\Bigg)\Bigg)
\le\mathrm{Var}\Big(f(\bar X^n_T)\,\mathbf{1}_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\Big).
\]
Exercises. 1. Down-and-Out option. Show likewise that, for every Borel function $f\in L^1(\mathbb{R},\mathbb{P}_{\bar X^n_T})$,
\[
\cdots
\tag{8.9}
\]
and that the expression in the second expectation has a lower variance.
2. Extend the above results to barriers of the form
\[
L(t):=e^{at+b},\qquad a,b\in\mathbb{R}.
\]
Remark Formulas like (8.8) and (8.9) can be used to produce quantization-based
cubature formulas (see [260]).
\[
X^x_t=x\exp\big(\mu t+\sigma W_t\big),\qquad \mu=r-\frac{\sigma^2}2,\quad x>0.
\]
Then
\[
\int_0^T X^x_s\,ds=\sum_{k=0}^{n-1}\int_{t^n_k}^{t^n_{k+1}}X^x_s\,ds
=\sum_{k=0}^{n-1}X^x_{t^n_k}\int_0^{T/n}\exp\big(\mu s+\sigma W^{(t^n_k)}_s\big)\,ds.
\]
So, we need to approximate the time integrals coming out in the r.h.s. of the above equation. Let $B$ be a standard Brownian motion. It is clear, by a scaling argument already used in Sect. 7.2.2, that $\sup_{s\in[0,T/n]}|B_s|$ is proportional to $\sqrt{T/n}$ in $L^p$, $p\in[1,\infty)$. Although it is not true in the a.s. sense, owing to a missing $\log\log$ term coming from the LIL, we may write "almost" rigorously
\[
\forall\, s\in[0,T/n],\qquad \exp\big(\mu s+\sigma B_s\big)=1+\mu s+\sigma B_s+\frac{\sigma^2}2 B^2_s+\text{``}O(n^{-\frac32})\text{''}.
\]
Hence
\[
\int_0^{T/n}\exp\big(\mu s+\sigma B_s\big)\,ds=\frac Tn+\frac\mu2\,\frac{T^2}{n^2}+\sigma\int_0^{T/n}B_s\,ds+\frac{\sigma^2}2\,\frac{T^2}{2n^2}
+\frac{\sigma^2}2\int_0^{T/n}(B_s^2-s)\,ds+\text{``}O(n^{-\frac52})\text{''}
\]
\[
=\frac Tn+\frac r2\,\frac{T^2}{n^2}+\sigma\int_0^{T/n}B_s\,ds+\frac{\sigma^2}2\Big(\frac Tn\Big)^2\int_0^1(\widetilde B^2_u-u)\,du+\text{``}O(n^{-\frac52})\text{''},
\]
where $\widetilde B_u=\sqrt{\frac nT}\,B_{\frac Tn u}$, $u\in[0,1]$, is a standard Brownian motion on the unit interval since, combining a scaling and a change of variable,
\[
\int_0^{T/n}(B^2_s-s)\,ds=\Big(\frac Tn\Big)^2\int_0^1(\widetilde B^2_u-u)\,du.
\]
Owing to the fact that the random variable $\int_0^1(\widetilde B^2_u-u)\,du$ is centered and that, when $B$ is replaced successively by $\big(W^{(t^n_k)}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, the resulting random variables are independent, hence i.i.d., one can in fact consider that the contribution of this term is $O(n^{-\frac32})$. To be more precise, the random variable $\int_0^1\big((\widetilde W^{(t^n_k)}_u)^2-u\big)\,du$ being centered and independent of $X^x_{t^n_k}$, one has
\[
\Big\|\sum_{k=0}^{n-1}X^x_{t^n_k}\int_0^1\big((\widetilde W^{(t^n_k)}_u)^2-u\big)du\Big\|_2^2
=\sum_{k=0}^{n-1}\Big\|X^x_{t^n_k}\int_0^1\big((\widetilde W^{(t^n_k)}_u)^2-u\big)du\Big\|_2^2
=\sum_{k=0}^{n-1}\big\|X^x_{t^n_k}\big\|_2^2\,\Big\|\int_0^1\big((\widetilde W^{(t^n_k)}_u)^2-u\big)du\Big\|_2^2
\le n\,\big\|X^x_T\big\|_2^2\,\Big\|\int_0^1(\widetilde B^2_u-u)\,du\Big\|_2^2,
\]
so that, taking into account the $\frac{\sigma^2}2\big(\frac Tn\big)^2$ factor in front of this term, its global contribution to the $L^2$-error is indeed $O(n^{-\frac32})$.
One consequently approximates
\[
\int_0^{T/n}\exp\big(\mu s+\sigma W^{(t^n_k)}_s\big)\,ds\;\simeq\;I^n_k:=\frac Tn+\frac r2\,\frac{T^2}{n^2}+\sigma\int_0^{T/n}W^{(t^n_k)}_s\,ds.
\]
Simulation phase: Now, it follows from Proposition 8.2, applied to the Brownian motion (which is its own genuine Euler scheme), that the processes $\big(W^{(t^n_k)}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, are independent given $\sigma(W_{t^n_k},\ k=1,\dots,n)$, with conditional distribution given by
\[
\mathcal{L}\big((W^{(t^n_\ell)}_t)_{t\in[0,T/n]}\,\big|\,W_{t^n_k}=w_k,\ k=1,\dots,n\big)
=\mathcal{L}\big((W^{(t^n_\ell)}_t)_{t\in[0,T/n]}\,\big|\,W_{t^n_\ell}=w_\ell,\ W_{t^n_{\ell+1}}=w_{\ell+1}\big)
=\mathcal{L}\Big(\Big(\frac{nt}{T}(w_{\ell+1}-w_\ell)+Y^{W,T/n}_t\Big)_{t\in[0,T/n]}\Big).
\]
Consequently, conditionally on $(W_{t^n_k})_{1\le k\le n}$, the integral $\int_0^{T/n}W^{(t^n_\ell)}_s\,ds$ is a Gaussian random variable with
\[
m(w_\ell,w_{\ell+1})=\int_0^{T/n}\frac{nt}{T}(w_{\ell+1}-w_\ell)\,dt=\frac{T}{2n}(w_{\ell+1}-w_\ell)
\]
(with $w_0=0$) as a mean and a (deterministic) variance
\[
\mathbb{E}\Big(\int_0^{T/n}Y^{W,T/n}_s\,ds\Big)^2.
\]
We can use the exercise below for a quick computation of this quantity based on stochastic calculus. The computation that follows is more elementary and relies on the distribution of the Brownian bridge between $0$ and $\frac Tn$:
\[
\mathbb{E}\Big(\int_0^{T/n}Y^{W,T/n}_s\,ds\Big)^2=\int_{[0,T/n]^2}\mathbb{E}\big(Y^{W,T/n}_sY^{W,T/n}_t\big)\,ds\,dt
=\frac nT\int_{[0,T/n]^2}(s\wedge t)\Big(\frac Tn-(s\vee t)\Big)\,ds\,dt
=\Big(\frac Tn\Big)^3\int_{[0,1]^2}(u\wedge v)\big(1-(u\vee v)\big)\,du\,dv
=\frac1{12}\Big(\frac Tn\Big)^3.
\]
Now we are in a position to write a pseudo-code for a Monte Carlo simulation.
Pseudo-code for the pricing of an Asian option with payoff $h\Big(\int_0^T X^x_s\,ds\Big)$
for $m:=1$ to $M$
• Simulate the Brownian increments $\Delta W^{(m)}_{t^n_k}:=\sqrt{\frac Tn}\,Z^{(m)}_k$, $k=1,\dots,n$, with $Z^{(m)}_k$ i.i.d. with distribution $\mathcal{N}(0;1)$; set
– $w^{(m)}_k:=\sqrt{\frac Tn}\big(Z^{(m)}_1+\cdots+Z^{(m)}_k\big)$, $w^{(m)}_0:=0$,
– $x^{(m)}_k:=x\exp\big(\mu t^n_k+\sigma w^{(m)}_k\big)$, $k=0,\dots,n$.
• Simulate independently $\zeta^{(m)}_k$, $k=1,\dots,n$, i.i.d. with distribution $\mathcal{N}(0;1)$ and set
\[
I^{n,(m)}_k:=\frac Tn+\frac r2\,\frac{T^2}{n^2}+\sigma\Bigg(\frac{T}{2n}\big(w^{(m)}_{k+1}-w^{(m)}_k\big)+\sqrt{\frac1{12}\Big(\frac Tn\Big)^3}\;\zeta^{(m)}_{k+1}\Bigg),\qquad k=0,\dots,n-1.
\]
• Compute
\[
h^{(m)}_T:=h\Bigg(\sum_{k=0}^{n-1}x^{(m)}_k\,I^{n,(m)}_k\Bigg)
\]
end.($m$)
• Eventually,
\[
\text{Premium}\simeq e^{-rT}\,\frac1M\sum_{m=1}^M h^{(m)}_T.
\]
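The pseudo-code above, vectorized in Python (parameter values and the payoff, an Asian call with strike $100$, are our own choices). As a sanity check, the mean of the simulated integral is compared with its exact value $\mathbb{E}\int_0^T X^x_s\,ds=x\,(e^{rT}-1)/r$.

```python
import numpy as np

# Asian call in Black-Scholes via the bridge-corrected time integrals I_k^n.
rng = np.random.default_rng(3)
x0, r, sigma, T, n, M = 100.0, 0.05, 0.2, 1.0, 50, 40_000
mu = r - 0.5 * sigma**2
h = T / n
Z = rng.standard_normal((M, n))
w = np.concatenate([np.zeros((M, 1)), np.sqrt(h) * np.cumsum(Z, axis=1)], axis=1)
t = h * np.arange(n + 1)
X = x0 * np.exp(mu * t + sigma * w)                   # exact values at grid times
zeta = rng.standard_normal((M, n))                    # independent bridge Gaussians
I = (h + 0.5 * r * h**2
     + sigma * (0.5 * h * (w[:, 1:] - w[:, :-1]) + np.sqrt(h**3 / 12.0) * zeta))
A = (X[:, :-1] * I).sum(axis=1)                       # approximates int_0^T X_s ds
est = np.exp(-r * T) * np.mean(np.maximum(A / T - 100.0, 0.0))   # strike 100
print(np.mean(A), x0 * (np.exp(r * T) - 1.0) / r)     # E A = x (e^{rT} - 1)/r
print(est)
```

Note that the extra Gaussians $\zeta$ carry the conditional variance $\frac1{12}\big(\frac Tn\big)^3$ of the bridge integral computed above; omitting them would understate the variability of $\int X_s\,ds$ within each time step.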
Time discretization error estimates: Set
\[
A_T=\int_0^T X^x_s\,ds\qquad\text{and}\qquad \bar A^n_T=\sum_{k=0}^{n-1}X^x_{t^n_k}\,I^n_k.
\]
This scheme, which is clearly simulable but closely dependent on the Black–Scholes model in its present form, induces the following time discretization error, established in [187]:
\[
\big\|A_T-\bar A^n_T\big\|_p=O\big(n^{-\frac32}\big).
\]
Warning: Even more than in other chapters, we recommend that the reader carefully read the "Practitioner's corner" sections to get more information on practical aspects in view of implementation. In particular, many structure parameters of the multilevel estimators are specified a priori in various tables, but these specifications admit variants which are discussed and detailed in the Practitioner's corner sections.
9.1 Introduction
Then, one can solve the "companion" parabolic PDE associated to its infinitesimal generator. See e.g. [2], where other examples in connection with the pricing and hedging of financial derivatives under a risk-neutral assumption are connected with the corresponding PDE derived from the Feynman–Kac formula (see Theorem 7.11), and where numerical methods based on schemes adapted to these PDEs are proposed. However, these PDE-based approaches are efficient only in low dimensions, say $d\le3$. In higher dimensions, the Feynman–Kac representation is used in the reverse sense: one computes $u(0,x)=\mathbb{E}\,f(X^x_T)$ by a Monte Carlo simulation rather than trying to numerically solve the PDE. As far as derivative pricing and hedging are concerned, this means that PDE methods are efficient — i.e. they outperform Monte Carlo simulation — up to dimension $d=3$; beyond, the Monte Carlo (or quasi-Monte Carlo) method has no credible alternative, except for very specific models. When dealing with risk estimation, where the structural dimension is always high, there is even less of an alternative.
As we saw in Chap. 7, the exact simulation of such a diffusion process, at a fixed horizon $T$ or "through" functionals of its paths, is not possible in general. A noticeable exception is the purely one-dimensional setting ($d=q=1$ with our notations), where Beskos and Roberts proposed in [42] (see also [43]) such an exact procedure for vectors $(X_{t_1},\dots,X_{t_n})$, $t_k=\frac{kT}n$, $k=1,\dots,n$, based on an acceptance-rejection method involving a Poisson process. Unfortunately, this method admits no significant extension in higher dimensions.
More recently, various groups, first in [9, 18], then in [149] and in [240], developed a kind of "weak exact" simulation method for random variables of the form $f(X_T)$ in the following sense: they exhibit a simulable random variable $\Theta$ satisfying $\mathbb{E}\,\Theta=\mathbb{E}\,f(X_T)$ and $\mathrm{Var}(\Theta)<+\infty$. Beyond the assumptions made on $b$ and $\sigma$, the approach is essentially limited to this type of "vanilla" payoff (depending on the diffusion at one fixed time $T$) and seems, from this point of view, less flexible than the
multilevel methods developed in what follows. These methods, as will be seen further
on, rely, in the case of diffusions, on the simulation of discretization schemes which
are intrinsically biased. However, we have at hand two kinds of information: the
strong convergence rate of such schemes toward the diffusion itself in the L p -spaces
(see the results for the Euler and Milstein schemes in Chap. 7) on the one hand and
some expansions of the weak error, i.e. of the bias on the other hand. We briefly saw
in Sect. 7.7 how to use such a weak error expansion to devise a Richardson–Romberg
extrapolation to (partially) kill this bias. We will first improve this approach by intro-
ducing the multistep Richardson–Romberg extrapolation, which takes advantage of
higher-order expansions of the weak error. Then we will present two avatars of the
multilevel paradigm. The multilevel paradigm introduced by M. Giles in [107] (see
also [148]) combines both weak error expansions and a strong approximation rate
to get rid of the bias as quickly as possible. We present two natural frameworks for
multilevel methods, the original one due to Giles, when a first order expansion of
the weak error is available, and one combined with multistep Richardson–Romberg
extrapolation, which takes advantage of a higher-order expansion.
The first section develops a convenient abstract framework for biased Monte
Carlo simulation which will allow us to present and analyze in a unified way various
approaches and improvements, starting from crude Monte Carlo simulation, standard
and multistep Richardson–Romberg extrapolation, the multilevel paradigm in both
its weighted and unweighted forms and, finally, the randomized multilevel methods.
On our way, we illustrate how this framework can be directly applied to time stepping
problems – mainly for Brownian diffusions at this stage – but also to optimize the
so-called Nested Monte Carlo method. The object of the Nested Monte Carlo method is to compute by simulation expectations of functions of conditional expectations, which relies on the nested combination of inner and outer simulations. This is an important field of investigation in modern simulation, especially in actuarial sciences to compute
quantities like the Solvency Capital Requirement (SCR) (see e.g. [53, 76]). Numerical
tests are proposed to highlight the global efficiency of these multilevel methods.
Several exercises will hopefully help to familiarize the reader with the basic principles
of multilevel simulation but also some recent more sophisticated improvements like
antithetic meta-schemes. At the very end of the chapter, we return to the weak exact
simulation and its natural connections with randomized multilevel simulation.
\[
\mathbb{E}\,Y_h\to\mathbb{E}\,Y_0\quad\text{as } h\to0.
\tag{9.1}
\]
384 9 Biased Monte Carlo Simulation, Multilevel Paradigm
However, stronger assumptions will be needed in the sequel, in terms of weak but also strong ($L^2$) approximation. Our ability to specify the simulation complexity of $Y_h$ for every $h\in\mathcal{H}$ will be crucial when selecting such a family for practical implementation. Waiting for some more precise specifications, we will simply assume that […] copies of $Y_h$. For that purpose, all standard ways to measure this error (quadratic norm, CLT to design confidence intervals or LIL) rely on the existence of a variance for $Y_h$, that is, $Y_h\in L^2$. Furthermore, except in very specific situations, one further
natural constraint on the $(Y_h)_{h\in\mathcal{H}}$ is that $(\mathrm{Var}(Y_h))_{h\in\mathcal{H}}$ remains bounded as $h\to0$ in $\mathcal{H}$; but, in practice, this is not enough and we often require the more precise
\[
\mathbb{E}\,Y_h^2\to\mathbb{E}\,Y_0^2\quad\text{as } h\to0.
\]
In some sense, using the term "approximation" to simply denote the convergence of the first two moments is an abuse of language, but we will see that $Y_0$ is usually a functional $F(X)$ of an underlying "structural stochastic process" $X$, and that $Y_h$ is defined as the same functional $F(\bar X^n)$ of a discretization scheme, so that the above two convergences are usually consequences of the weak convergence of $\bar X^n$ toward $X$ in distribution with respect to the topology on the functional space of the paths of the processes $X$ and $\bar X^n$, typically the space $\mathbb{D}([0,T],\mathbb{R}^d)$ of right continuous left limited functions on $[0,T]$.
Let us give two typical examples where this framework will apply.
Examples.
1. Discretization scheme of a diffusion (Euler). Let X = (X t )t∈[0,T ] be a Brownian
diffusion with Lipschitz continuous drift b(t, x) and diffusion coefficient σ(t, x),
both defined on [0, T ] × Rd and having values in Rd and M(d, q, R), respectively,
associated to a q-dimensional Brownian motion W defined on a probability space
9.2 An Abstract Framework for Biased Monte Carlo Simulation 385
\[
\mathcal{H}=\Big\{\frac Tn,\ n\in\mathbb{N}^*\Big\},\qquad h=\frac Tn,\qquad Y_0=F\big((X_t)_{t\in[0,T]}\big)\quad\text{and}\quad Y_h=F\big((\bar X^n_t)_{t\in[0,T]}\big),
\]
respectively. Note that the complexity of the simulation is proportional to the number $n$ of time discretization steps of the scheme, i.e. $\kappa(h)=\frac{\bar\kappa}h$.
2. Nested Monte Carlo. The aim of the so-called Nested Monte Carlo simulation is to compute expectations involving a conditional expectation, i.e. of the form
\[
I_0=\mathbb{E}\,Y_0\qquad\text{with}\qquad Y_0=f\big(\mathbb{E}\,(F(X,Z)\,|\,X)\big),
\]
where the inner conditional expectation is approximated, for the $m$-th copy $X^m$ of $X$, by
\[
\frac1K\sum_{k=1}^K F(X^m,Z^m_k),\qquad\text{where }(Z^m_k)_{k=1,\dots,K}\text{ are i.i.d. copies of }Z,\text{ independent of }X,
\]
leading to the estimator
\[
I_{M,K}=\frac1M\sum_{m=1}^M f\Bigg(\frac1K\sum_{k=1}^K F(X^m,Z^m_k)\Bigg).
\]
and
\[
Y_h=f\Bigg(\frac1K\sum_{k=1}^K F(X,Z_k)\Bigg),\qquad h=\frac1K\in\mathcal{H},\quad (Z_k)_{k=1,\dots,K}\ \text{i.i.d. copies of }Z
\tag{9.4}
\]
independent of X .
If $f$ is continuous with linear growth, it is clear by the SLLN under the above assumptions that $Y_h\to Y_0$ in $L^1$, so that $\mathbb{E}\,Y_h\to\mathbb{E}\,Y_0$. If, furthermore, $F(X,Z)\in L^2(\mathbb{P})$, then $\mathrm{Var}(Y_h)\to\mathrm{Var}(Y_0)$ as $h\to0$ in $\mathcal{H}$ (i.e. as $K\to+\infty$). Consequently, if $K$ and $M$ go to infinity in a proper way, then $I_{M,K}$ will converge toward its target $I_0=\mathbb{E}\,f\big(\mathbb{E}\,(F(X,Z)\,|\,X)\big)$, but it is clearly a biased estimator at finite range.
Note that the complexity of the simulation of $Y_h$ as defined above is proportional to the length $K$ of the inner Monte Carlo simulation, i.e. again of the form $\kappa(h)=\frac{\bar\kappa}h$.
Two main settings are extensively investigated in nested Monte Carlo simulation: the case of smooth functions $f$ (at least Hölder) on the one hand, and that of indicator functions of intervals on the other. This second setting is very important as it corresponds to the computation of quantiles of a conditional expectation and, in a second stage, of quantities like Value-at-Risk, which is the related inverse problem. Thus, it is a major issue in actuarial sciences to compute the Solvency Capital Requirement (SCR), which appears, technically speaking, exactly as a quantile of a conditional expectation. To be more precise, this conditional expectation is the value of the available capital at a maturity $T$ ($T=30$ years or even more) given a short term future (say 1 year). For a first example of computation of the SCR based on a nested Monte Carlo simulation, we refer to [31]. See also [76].
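Before optimizing this inner/outer trade-off, a toy example where the inner-sample bias is known in closed form may help (the model, $f$ and $F$ below are our own choices): with $X,Z$ independent $\mathcal{N}(0;1)$, $F(X,Z)=X+Z$ and $f(v)=v^2$, the target is $I_0=\mathbb{E}\,X^2=1$, while $\mathbb{E}\,Y_h=1+\frac1K$, so the bias is exactly $h=\frac1K$.

```python
import numpy as np

# Nested MC toy: X, Z ~ N(0,1) independent, F(X,Z) = X + Z, f(v) = v^2.
# Then E[F|X] = X, I_0 = E X^2 = 1, and E Y_h = 1 + 1/K: the bias is exactly 1/K.
rng = np.random.default_rng(4)

def nested_estimator(M, K):
    X = rng.standard_normal(M)
    # The inner empirical mean of K i.i.d. N(0,1) noises is exactly N(0, 1/K),
    # so we sample it directly instead of storing an (M, K) inner array.
    inner = X + rng.standard_normal(M) / np.sqrt(K)
    return np.mean(inner ** 2)

results = {K: nested_estimator(200_000, K) for K in (5, 50, 500)}
print(results)   # approx {5: 1.2, 50: 1.02, 500: 1.002}
```

The estimates decrease toward $1$ as $K$ grows, at the rate $1/K$ announced by the weak error expansion (here with $\alpha=1$ in the notation of the next section).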
Our aim in this chapter is to introduce methods which reduce the bias efficiently
while keeping the variance under control so as to satisfy a prescribed global quadratic
error with the smallest possible “budget”, that is, with the lowest possible complexity.
This leads us to model a Monte Carlo simulation as a constrained optimization
problem. This amounts to minimizing the (computational) cost of the simulation for
a prescribed quadratic error. We first develop this point of view for a crude Monte
Carlo simulation to make it clear and, more importantly, to have a reference. To this
end, we will also need to switch to more precise versions of (9.1) and (9.2), as will
be emphasized below. Other optimization approaches can be adopted which turn out
to be equivalent a posteriori.
9.3 Crude Monte Carlo Simulation 387
The starting point is the bias–variance decomposition of the error induced by using independent copies $(Y^k_h)_{k\ge1}$ of $Y_h$ to approximate the quantity of interest $I_0=\mathbb{E}\,Y_0$ by the Monte Carlo estimator
\[
\widehat I_M=\frac1M\sum_{k=1}^M Y^k_h,
\]
namely
\[
\big\|\mathbb{E}\,Y_0-\widehat I_M\big\|_2^2=\underbrace{\big(\mathbb{E}\,Y_0-\mathbb{E}\,Y_h\big)^2}_{\text{squared bias}}+\underbrace{\frac{\mathrm{Var}(Y_h)}M}_{\text{Monte Carlo variance}}.
\tag{9.5}
\]
The two terms are orthogonal in $L^2$ since $\mathbb{E}\Big(\mathbb{E}\,Y_h-\frac1M\sum_{k=1}^M Y^k_h\Big)=0$.
Our aim in what follows is to minimize the cost of a Monte Carlo simulation for a prescribed (upper bound of the) $RMSE=\varepsilon>0$, assumed to be small, and in any case in $(0,1]$ in non-asymptotic results. The idea is to make a balance between the bias term and the variance term, by an appropriate choice of the parameter $h$ and of the size $M$ of the simulation, in order to achieve this prescribed error bound at a minimal cost. To this end, we will strengthen our bias assumption (9.1) by assuming a weak rate of convergence, or first order weak error expansion,
\[
\big(WE^\alpha_1\big)\equiv\qquad \mathbb{E}\,Y_h=\mathbb{E}\,Y_0+c_1h^\alpha+o(h^\alpha),
\tag{9.6}
\]
where $\alpha>0$. We assume that this first order expansion is consistent, i.e. that $c_1\neq0$.
Remark. Note that, under the assumptions of Theorem 7.8 (a) and (b), the above assumption holds true in the framework of diffusion approximation by the Euler scheme, with $\alpha=1$ when $Y_h=f(\bar X^n_T)$ (with $h=\frac Tn$). For path-dependent functionals, we saw in Theorem 8.1 that $\alpha$ drops down (at least) to $\alpha=\frac12$ in situations involving the indicator function of an exit time, like for barrier options and the genuine Euler scheme.
\[
MSE(\widehat I_M)\le c_1^2h^{2\alpha}+\frac{\bar\sigma^2}M+o\big(h^{2\alpha}\big).
\]
On the other hand, we make the natural assumption (see the above examples) that
the simulation of (one copy of) Yh has complexity
\[
\kappa(h)=\frac{\bar\kappa}h.
\tag{9.7}
\]
\[
\mathrm{Cost}(\widehat I_M)=M\,\kappa(h).
\]
If we replace $MSE(\widehat I_M)$ by its above upper bound and neglect the term $o(h^{2\alpha})$, this problem can be re-written as the following approximate one:
\[
\inf\Big\{\frac{M\bar\kappa}h\;:\;h\in\mathcal{H},\ M\in\mathbb{N}^*,\ c_1^2h^{2\alpha}+\frac{\bar\sigma^2}M\le\varepsilon^2\Big\}.
\tag{9.9}
\]
Now, note that
\[
\frac{M\bar\kappa}h=\frac{M\bar\kappa}h\cdot\frac{\bar\sigma^2/M}{\bar\sigma^2/M}=\frac{\bar\kappa\,\bar\sigma^2}{h\,(\bar\sigma^2/M)},
\]
[…] which $h$ lies in $\mathcal{H}$ but is possibly slightly suboptimal. Thus, if $\mathcal{H}$ is of the form $\mathcal{H}=\big\{\frac{\mathbf h}n,\ n\ge1\big\}$,
\[
h^*(\varepsilon)=\mathbf h\,\big\lceil \mathbf h/h_{\min}(\varepsilon)\big\rceil^{-1}.
\tag{9.12}
\]
We derive the simulation size by plugging $h_{\min}(\varepsilon)$ into the saturated constraint $c_1^2h^{2\alpha}+\frac{\bar\sigma^2}M=\varepsilon^2$, which yields, after elementary computations,
\[
M=M^*(\varepsilon)=\Big(1+\frac1{2\alpha}\Big)\frac{\bar\sigma^2}{\varepsilon^2}.
\]
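Numerically, the design boils down to two closed-form formulas. The sketch below (function and parameter names are our own) saturates the constraint by allotting $\varepsilon^2/(1+2\alpha)$ to the squared bias — the split consistent with the expression of $M^*(\varepsilon)$ above.

```python
def crude_mc_design(eps, alpha, c1, sigma_bar):
    # Saturate c1^2 h^{2a} + sigma^2/M = eps^2, the bias part being eps^2/(1+2a);
    # this split reproduces M*(eps) = (1 + 1/(2 alpha)) * sigma^2 / eps^2.
    h_min = (eps**2 / ((1.0 + 2.0 * alpha) * c1**2)) ** (1.0 / (2.0 * alpha))
    M_star = (1.0 + 1.0 / (2.0 * alpha)) * sigma_bar**2 / eps**2
    return h_min, M_star

h_min, M_star = crude_mc_design(eps=0.01, alpha=1.0, c1=1.0, sigma_bar=1.0)
print(h_min, M_star, h_min**2 + 1.0 / M_star)   # constraint value = eps^2 = 1e-4
```

Note how $M^*(\varepsilon)$ scales as $\varepsilon^{-2}$, while $h_{\min}(\varepsilon)\sim\varepsilon^{1/\alpha}$ drives the extra $\varepsilon^{-1/\alpha}$ factor in the overall cost.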
Taking advantage of the fact that $\mathrm{Var}(Y_h)$ converges to $\mathrm{Var}(Y_0)$, one may improve the above minimal complexity as $\varepsilon\to0$ (and make the above optimization more rigorous by taking into account the term $o(h^{2\alpha})$ that we neglected). In fact, one easily checks that
\[
\inf_{RMSE(\widehat I_M)\le\varepsilon}\mathrm{Cost}(\widehat I_M)\ \sim\ \bar\kappa\,\mathrm{Var}(Y_0)\,|c_1|^{\frac1\alpha}\,\frac{(1+2\alpha)^{1+\frac1{2\alpha}}}{2\alpha}\,\frac1{\varepsilon^{2+\frac1\alpha}}\qquad\text{as }\varepsilon\to0.
\tag{9.13}
\]
The main conclusion of this introductory section is the following: the challenge
of reducing (killing…) the bias is to switch from the exponent $2+\frac1\alpha$ to an exponent
as close to 2 as possible. This is a quite important question, as highlighted by the
following simple illustration based on (9.11) or (9.13):
We assume that the weak error expansion of $\mathbb E\,Y_h$ in the scale $(h^{\alpha r})_{r\ge1}$ holds at the second order, i.e.

$$(WE)^{\alpha}_{2}\;\equiv\;\mathbb E\,Y_h=\mathbb E\,Y_0+c_1h^{\alpha}+c_2h^{2\alpha}+o\big(h^{2\alpha}\big), \tag{9.15}$$
and that (9.2) holds, i.e. Var(Yh ) → Var(Y0 ) as well as σ̄ 2 = suph∈H Var(Yh ) < +∞.
Remark. The implementation of Richardson–Romberg extrapolation only requires $(WE)^{\alpha}_1$. We assumed $(WE)^{\alpha}_2$ to enhance its best possible performance. The existence of such an expansion (already tackled in the diffusion framework) will be discussed further on in Sect. 9.5.1, e.g. for nested Monte Carlo simulation.
One deduces by linearly combining (9.15) written with $h$ and $\frac h2$ that

$$2^{\alpha}\,\mathbb E\,Y_{h/2}-\mathbb E\,Y_h=(2^{\alpha}-1)\,\mathbb E\,Y_0+(2^{-\alpha}-1)\,c_2h^{2\alpha}+o\big(h^{2\alpha}\big). \tag{9.16}$$
For the resulting extrapolated approximator $\widetilde Y_h=\frac{2^{\alpha}Y_{h/2}-Y_h}{2^{\alpha}-1}$, the triangle inequality for the standard deviation yields

$$\mathrm{Var}\big(\widetilde Y_h\big)=\sigma\big(\widetilde Y_h\big)^2\le\frac{1}{(2^{\alpha}-1)^2}\big(2^{\alpha}\sigma(Y_{h/2})+\sigma(Y_h)\big)^2\le\Big(\frac{2^{\alpha}+1}{2^{\alpha}-1}\Big)^2\bar\sigma^2 \tag{9.17}$$
with the notations of the former section. Also note that the simulation cost $\widetilde\kappa(h)$ of $\widetilde Y_h$ satisfies $\widetilde\kappa(h)\le\kappa(h)+\kappa(h/2)=\kappa(h)+2\kappa(h)=\frac{3\bar\kappa}{h}$; note that for some applications like nested Monte Carlo one may obtain a slower increase of the complexity since $\widetilde\kappa(h)=\kappa(h)\vee(2\kappa(h))=\frac{2\bar\kappa}{h}$. This leads us to introduce, for every $h\in\mathcal H$, the Richardson–Romberg estimator
$$\widehat I^{RR}_M=\widehat I^{RR}_M(h,M)=\frac1M\sum_{k=1}^M\widetilde Y^k_h=\frac{1}{M(2^{\alpha}-1)}\sum_{k=1}^M\big(2^{\alpha}Y^k_{h/2}-Y^k_h\big),\qquad M\ge1,$$
where $(Y^k_h,Y^k_{h/2})_{k\ge1}$ are independent copies of $(Y_h,Y_{h/2})$. Applying the results of the former section to the family $(\widetilde Y_h)_{h\in\mathcal H}$ with $\widetilde\alpha=2\alpha$, $\widetilde\kappa(h)$, $\widetilde c_1=-(1-2^{-\alpha})c_2$ and the upper bound (9.17) on $\sigma(\widetilde Y_h)$, we deduce from (9.11) that

$$\inf_{\|\widehat I^{RR}_M-I_0\|_2\le\varepsilon}\mathrm{Cost}\big(\widehat I^{RR}_M\big)\ \le\ 3\bar\kappa\,\frac{(1+4\alpha)^{1+\frac{1}{4\alpha}}}{4\alpha}\,\frac{(2^{\alpha}+1)^2}{(2^{\alpha}-1)^2}\,\bar\sigma^2\,\big((1-2^{-\alpha})|c_2|\big)^{\frac{1}{2\alpha}}\,\frac{1}{\varepsilon^{2+\frac{1}{2\alpha}}}.$$
In view of the exponent $2+\frac{1}{2\alpha}$ of $\varepsilon$ (the prescribed RMSE) in the above minimal Cost function, it turns out that we are halfway toward an unbiased simulation.
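The construction can be made concrete on a toy family $Y_h=Z+c_1h+c_2h^2$ (so $\alpha=1$ and $\mathbb E\,Y_0=0$); the sampler and names below are ours, chosen only for illustration:

```python
import random

def rr_estimator(pairs, alpha=1.0):
    """Richardson-Romberg estimator: (1/(M(2^a - 1))) sum_k (2^a Y_{h/2}^k - Y_h^k)."""
    two_a = 2.0 ** alpha
    m = len(pairs)
    return sum(two_a * y_half - y for y, y_half in pairs) / (m * (two_a - 1.0))

def sample_pair(h, rng, c1=1.0, c2=1.0):
    """Consistent pair (Y_h, Y_{h/2}) sharing the same randomness Z."""
    z = rng.gauss(0.0, 1.0)
    return z + c1 * h + c2 * h * h, z + c1 * h / 2 + c2 * (h / 2) ** 2

rng = random.Random(42)
h = 0.2
est = rr_estimator([sample_pair(h, rng) for _ in range(100_000)])
# the first-order bias c1*h cancels; the residual bias is (2^-1 - 1)*c2*h^2 = -0.02
```

One sees on this toy model the general mechanism: sharing the randomness within each pair keeps the variance of $\widetilde Y_h$ under control while the leading bias term cancels exactly.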
However, the brute force upper bound established for $\mathrm{Var}(\widetilde Y_h)$ is not satisfactory since it relies on the triangle inequality for the standard deviation, so we will try to improve it by investigating the internal structure of the family $(Y_h)_{h\in\mathcal H}$. (In what follows all $\lim$, $\liminf$, $\limsup$, etc., as $h\to0$ should always be understood for $h\in\mathcal H$.)
If we apply the reverse triangle inequality to $\widetilde Y_h$, we know that

$$\mathrm{Var}\big(\widetilde Y_h\big)\ \ge\ \frac{1}{(2^{\alpha}-1)^2}\big(2^{\alpha}\sigma(Y_{h/2})-\sigma(Y_h)\big)^2 \tag{9.18}$$

so that

$$\lim_{h\to0}\mathrm{Var}\big(\widetilde Y_h\big)=\frac{2^{2\alpha}+1}{(2^{\alpha}-1)^2}\,\mathrm{Var}(Y_0)-\frac{2^{1+\alpha}}{(2^{\alpha}-1)^2}\,\lim_{h\to0}\mathrm{Cov}\big(Y_h,Y_{h/2}\big).$$
Combining the lower bound (9.18) and the upper bound (9.19) with the former
equality shows that:
$$\lim_{h\to0}\mathrm{Var}\big(\widetilde Y_h\big)=\mathrm{Var}(Y_0)\quad\text{iff}\quad\limsup_{h\to0}\mathrm{Var}\big(\widetilde Y_h\big)\le\mathrm{Var}(Y_0)$$
$$\text{iff}\quad\liminf_{h\to0}\mathrm{Cov}\big(Y_h,Y_{h/2}\big)\ge\mathrm{Var}(Y_0)$$
$$\text{iff}\quad\lim_{h\to0}\mathrm{Cov}\big(Y_h,Y_{h/2}\big)=\mathrm{Var}(Y_0).$$
But, then

$$\big\|Y_h-Y_{h/2}\big\|_2^2=\mathrm{Var}\big(Y_h-Y_{h/2}\big)+\big(\mathbb E(Y_h-Y_{h/2})\big)^2$$
$$=\mathrm{Var}(Y_h)-2\,\mathrm{Cov}\big(Y_h,Y_{h/2}\big)+\mathrm{Var}(Y_{h/2})+\big(\mathbb E(Y_h-Y_{h/2})\big)^2$$
$$\longrightarrow\ \mathrm{Var}(Y_0)(1-2+1)+0^2=0\quad\text{as }h\to0\text{ in }\mathcal H,$$

i.e.

$$Y_h-Y_{h/2}\ \xrightarrow{\;L^2\;}\ 0\quad\text{as }h\to0\text{ in }\mathcal H. \tag{9.20}$$
This strongly suggests, in order to better control the variance of the estimator, to consider a family $(Y_h)_{h\in\mathcal H}$ which strongly converges toward $Y_0$ in quadratic norm (which in turn will imply that $\|Y_h-Y_{h/2}\|_2\to0$), that is

$$Y_h\ \xrightarrow{\;L^2\;}\ Y_0\quad\text{as }h\to0\text{ in }\mathcal H. \tag{9.21}$$
9.4 Richardson–Romberg Extrapolation (II) 393
2. In the nested Monte Carlo framework as described by (9.3) and (9.4) at the begin-
ning of the chapter, (9.21) is an easy consequence of the SLLN and Fubini’s Theorem
when the function f is Lipschitz continuous and F(X, Z ) ∈ L 2 .
Exercise. Prove the statement about the nested Monte Carlo method in the preceding example. Extend the result to the case where f is ρ-Hölder under appropriate
assumptions on F(X, Z ).
As a conclusion to this section, let us mention that it may happen that (9.20)
holds and (9.21) fails. This is possible if the rate of convergence in (9.20) is not
fast enough. Such a situation is observed with the (weak) approximation of the geometric Brownian motion by a binomial tree (see [32]).
Brownian diffusions
Let $X=(X_t)_{t\in[0,T]}$ be the Brownian diffusion solution to the SDE with Lipschitz continuous drift $b(t,x)$ and diffusion coefficient $\sigma(t,x)$, driven by a $q$-dimensional Brownian motion $W$ on a probability space $(\Omega,\mathcal A,\mathbb P)$ (with $X_0$ independent of $W$). The Euler scheme with step $h=\frac Tn$ – so that $\mathcal H=\big\{\frac Tn,\ n\ge1\big\}$ – reads:

$$\bar X^n_{t^n_{k+1}}=\bar X^n_{t^n_k}+\frac Tn\,b\big(t^n_k,\bar X^n_{t^n_k}\big)+\sigma\big(t^n_k,\bar X^n_{t^n_k}\big)\sqrt{\frac Tn}\,U^n_{k+1},\qquad \bar X^n_0=X_0,$$
where $t^n_k=\frac{kT}{n}$ and $U^n_k=\sqrt{\frac nT}\big(W_{t^n_k}-W_{t^n_{k-1}}\big)$, $k=1,\dots,n$. We now consider the Euler scheme with step $\frac{T}{2n}$, designed with the same Brownian motion. We have

$$\bar X^{2n}_{t^{2n}_{k+1}}=\bar X^{2n}_{t^{2n}_k}+\frac{T}{2n}\,b\big(t^{2n}_k,\bar X^{2n}_{t^{2n}_k}\big)+\sigma\big(t^{2n}_k,\bar X^{2n}_{t^{2n}_k}\big)\sqrt{\frac{T}{2n}}\,U^{2n}_{k+1},\qquad \bar X^{2n}_0=X_0,$$

where $t^{2n}_k=\frac{kT}{2n}$ and $U^{2n}_k=\sqrt{\frac{2n}T}\big(W_{t^{2n}_k}-W_{t^{2n}_{k-1}}\big)$, $k=1,\dots,2n$. Hence, it is clear that

$$U^n_k=\frac{U^{2n}_{2k-1}+U^{2n}_{2k}}{\sqrt2},\qquad k=1,\dots,n. \tag{9.22}$$
The joint simulation of these two schemes can be simply performed as follows:
• First simulate 2n independent pseudo-random numbers Uk2n , k = 1, . . . , 2n, with
distribution N (0; Iq );
• then compute the Ukn , k = 1, . . . , n using (9.22).
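A minimal sketch of this first method – simulate the $2n$ fine normalized increments, then aggregate them pairwise as in (9.22) – with a one-dimensional toy drift and diffusion of our own choosing:

```python
import math, random

def consistent_increments(n, rng):
    """Fine normalized increments U^{2n}_k ~ N(0,1) and the coarse ones from
    (9.22): U^n_k = (U^{2n}_{2k-1} + U^{2n}_{2k}) / sqrt(2)."""
    u_fine = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
    u_coarse = [(u_fine[2 * k] + u_fine[2 * k + 1]) / math.sqrt(2.0)
                for k in range(n)]
    return u_fine, u_coarse

def euler_terminal(x0, b, sig, T, steps, u):
    """One-dimensional Euler scheme with step T/steps driven by the increments u."""
    h = T / steps
    x = x0
    for k in range(steps):
        t = k * h
        x += h * b(t, x) + sig(t, x) * math.sqrt(h) * u[k]
    return x

rng = random.Random(0)
u_fine, u_coarse = consistent_increments(64, rng)
# both schemes are driven by the SAME Brownian path
x_fine = euler_terminal(1.0, lambda t, x: 0.1 * x, lambda t, x: 0.3 * x, 1.0, 128, u_fine)
x_coarse = euler_terminal(1.0, lambda t, x: 0.1 * x, lambda t, x: 0.3 * x, 1.0, 64, u_coarse)
```

As a sanity check, when $b\equiv0$ and $\sigma\equiv1$ both schemes reproduce the same Brownian terminal value exactly, confirming the consistency of the two step sizes.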
An alternative method is to:
• first simulate the Brownian increments of the coarse scheme with step Tn ,
• then use the recursive simulation of the Brownian motion to simulate the increments of the refined scheme with step $\frac{T}{2n}$. To be precise, a straightforward application of the Brownian bridge method (see Proposition 8.1, applied with $W^{(T/n)}_t=W_{\frac Tn+t}-W_{\frac Tn}$ between $0$ and $\frac Tn$) yields that

$$\mathcal L\Big(W_{\frac{(2k+1)T}{2n}}-W_{\frac{kT}{n}}\ \Big|\ W_{\frac{kT}{n}}=x,\ W_{\frac{(k+1)T}{n}}=y\Big)=\mathcal N\Big(\frac{y-x}{2};\,\frac{T}{4n}\Big)$$

so that

$$\mathcal L\Big(W_{\frac{(2k+1)T}{2n}}\ \Big|\ W_{\frac{kT}{n}}=x,\ W_{\frac{(k+1)T}{n}}=y\Big)=\mathcal N\Big(\frac{x+y}{2};\,\frac{T}{4n}\Big).$$
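This second method boils down to sampling each midpoint from the bridge law displayed above; a sketch (helper names ours):

```python
import math, random

def refine_once(w_coarse, T, rng):
    """Given Brownian values on {kT/n}, return values on {kT/(2n)} by sampling
    each midpoint from N((x + y)/2; T/(4n)) conditionally on the endpoints."""
    n = len(w_coarse) - 1
    sd = math.sqrt(T / (4 * n))          # bridge standard deviation
    w_fine = [w_coarse[0]]
    for k in range(n):
        x, y = w_coarse[k], w_coarse[k + 1]
        w_fine.append(0.5 * (x + y) + sd * rng.gauss(0.0, 1.0))
        w_fine.append(y)
    return w_fine

rng = random.Random(0)
n, T = 8, 1.0
w = [0.0]                                # coarse Brownian path on {kT/n}
for _ in range(n):
    w.append(w[-1] + math.sqrt(T / n) * rng.gauss(0.0, 1.0))
w2 = refine_once(w, T, rng)              # path on {kT/(2n)}, coarse points kept
```

The coarse values are kept unchanged on the even grid points, so the two paths stay consistent, as required for the extrapolation.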
Note that the joint simulation of (functionals of) two genuine (continuous) Euler schemes with respective steps $\frac Tn$ and $\frac{T}{2n}$ cannot be performed in as elementary a way: the joint simulation of the diffusion bridge method for both schemes remains, to our knowledge, an open problem.
Nested Monte Carlo
– Weak error expansion ($f$ smooth). In the nested Monte Carlo framework defined in (9.4) and (9.3), the question of the existence of a first- or second-order weak error expansion, with $\alpha=1$, remains reasonably elementary when the function $f$ is smooth enough. Thus, if $f$ is five times differentiable with bounded existing partial derivatives and $F(X,Z)\in L^5$, then $(WE)^{\alpha}_2$ holds with $\alpha=1$. This is a special case of an expansion at order $R$ established in [198].
– Weak error expansion ($f$ indicator function). When $f=\mathbf 1_{[a,+\infty)}$ or, more generally, any indicator function of an interval, establishing the existence of a first-order weak error expansion is a much more involved task, first achieved in [128] to our knowledge. It relies on smoothness assumptions on the joint law of $(Y_0,Y_h)$. In [113], a weak error expansion at order $R\ge2$, that is, $(WE)^{\alpha}_R$, still with $\alpha=1$, is established by a duality method (see also Sect. 9.6.2 for a more precise statement).
– Complexity. Temporarily assume $K_h=\frac1h$ – i.e. $K_0=1$ – for simplicity. Given the form of $Y_h=f\big(\frac1K\sum_{k=1}^K F(X,Z_k)\big)$, $(Z_k)_{k\ge1}$ i.i.d., the expression of the complexity of $\widetilde Y_h$ differs from the generic case and can be slightly improved (if the computation of one value of the function $f$ is neglected) since one just needs to simulate $\big(F(X,Z_k)\big)_{k=1,\dots,2K}$ to simulate both $Y_h$ and $Y_{h/2}$. Consequently, the complexity of the computation of $\widetilde Y_h$ in this framework is

$$\widetilde\kappa(h)=\bar\kappa\,2K=\frac{2\bar\kappa}{h}\quad\Big(\text{rather than }\frac{3\bar\kappa}{h}\Big).$$
As for all Taylor-like expansions of this type, the coefficients cr are unique and
do not depend on R ≥ r .
We also make the strong convergence assumption (9.21) ($Y_h\to Y_0$ in $L^2$), though it is not mandatory in this setting.
Definition 9.2. An increasing $R$-tuple $n=(n_1,\dots,n_R)$ of positive integers satisfying $1=n_1<n_2<\dots<n_R$ is called a vector of refiners. For every $h\in\mathcal H$, the resulting sub-family of approximators is denoted by $Y_{n,h}:=\big(Y_{h/n_i}\big)_{i=1,\dots,R}$.
The driving idea is, a vector $n$ of refiners being fixed, to extend the Richardson–Romberg estimator by searching for a linear combination of the components of $\mathbb E\,Y_{n,h}$ which kills the resulting bias up to order $R-1$, relying on the expansion $(WE)^{\alpha}_R$, and
then deriving the multistep estimator accordingly. To determine this linear combination, whose coefficients hopefully will not depend on $h\in\mathcal H$, we consider a generic $R$-tuple of weights $w^R=(w^R_1,\dots,w^R_R)\in\mathbb R^R$. To alleviate notation we will drop the superscript $R$ and write $w$ instead of $w^R$ when no ambiguity arises; idem for the components $w_j$. Then, owing to $(WE)^{\alpha}_R$,

$$\sum_{j=1}^R w_j\,\mathbb E\,Y_{\frac h{n_j}}=\Big(\sum_{j=1}^R w_j\Big)\mathbb E\,Y_0+\sum_{j=1}^R w_j\sum_{r=1}^R c_r\,\frac{h^{\alpha r}}{n_j^{\alpha r}}+o\big(h^{\alpha R}\big)$$
$$=\Big(\sum_{j=1}^R w_j\Big)\mathbb E\,Y_0+\sum_{r=1}^R c_r\,h^{\alpha r}\Big[\sum_{j=1}^R\frac{w_j}{n_j^{\alpha r}}\Big]+o\big(h^{\alpha R}\big), \tag{9.24}$$
where we interchanged the two sums in the second line. If $w$ is a solution to the system

$$\begin{cases}\displaystyle\sum_{j=1}^R w_j=1,\\[2mm]\displaystyle\sum_{j=1}^R\frac{w_j}{n_j^{\alpha r}}=0,\quad r=1,\dots,R-1,\end{cases}$$

or, equivalently,

$$\sum_{j=1}^R\frac{w_j}{n_j^{\alpha(r-1)}}=\mathbf 1_{\{r=1\}},\qquad r=1,\dots,R, \tag{9.25}$$
then, owing to (9.24),

$$\sum_{j=1}^R w_j\,\mathbb E\,Y_{\frac h{n_j}}=\mathbb E\,Y_0+\widetilde W^{(R)}_{R+1}\,c_R\,h^{\alpha R}+o\big(h^{\alpha R}\big),$$

where $\widetilde W^{(R)}_{R+1}$ is defined by

$$\widetilde W^{(R)}_{R+1}=\sum_{j=1}^R\frac{w_j}{n_j^{\alpha R}}. \tag{9.26}$$

By Cramer's rule applied to the system (9.25),

$$w_i=\frac{\det\big[\mathrm{Vand}\big(n_1^{-\alpha},\dots,n_{i-1}^{-\alpha},0,n_{i+1}^{-\alpha},\dots,n_R^{-\alpha}\big)\big]}{\det\big[\mathrm{Vand}\big(n_1^{-\alpha},\dots,n_i^{-\alpha},\dots,n_R^{-\alpha}\big)\big]},\qquad i=1,\dots,R.$$

As a consequence, the weight vector solution to (9.25) has a closed form since it is classical background that

$$\forall\,x_1,\dots,x_R>0,\qquad\det\big[\mathrm{Vand}(x_1,\dots,x_i,\dots,x_R)\big]=\prod_{1\le i<j\le R}(x_j-x_i).$$
Synthetic formulas are given in the following proposition for $w$ and $\widetilde W_{R+1}$.

Proposition 9.1 (a) For every fixed integer $R\ge2$, the weight vector $w$ admits a closed form given by

$$\forall\,i\in\{1,\dots,R\},\qquad w_i=\frac{n_i^{\alpha(R-1)}}{\prod_{j\ne i}\big(n_i^{\alpha}-n_j^{\alpha}\big)}=\frac{(-1)^{R-i}}{\prod_{j\ne i}\big|1-(n_j/n_i)^{\alpha}\big|}. \tag{9.27}$$

(b) Furthermore,

$$\widetilde W_{R+1}=\frac{(-1)^{R-1}}{n!^{\alpha}}\qquad\text{where } n!=\prod_{i=1}^R n_i. \tag{9.28}$$
$$w_i=\frac{n_i^{\alpha(R-1)}}{\prod_{k\ne i}\big(n_i^{\alpha}-n_k^{\alpha}\big)}=(-1)^{R-i}\,\frac{n_i^{\alpha(R-1)}}{\prod_{k\ne i}\big|n_i^{\alpha}-n_k^{\alpha}\big|},$$

where we used in the last line that the refiners are increasing.
Furthermore, setting $X=0$ in the elementary decomposition of the rational fraction

$$\frac{1}{\prod_{k=1}^R\big(X-\frac1{x_k}\big)}=\sum_{i=1}^R\frac{1}{\prod_{k\ne i}\big(\frac1{x_i}-\frac1{x_k}\big)}\cdot\frac{1}{X-\frac1{x_i}}$$

yields

$$(-1)^{R-1}\prod_{i=1}^R x_i=\sum_{i=1}^R\frac{x_i}{\prod_{k\ne i}\big(\frac1{x_i}-\frac1{x_k}\big)}=\sum_{i=1}^R\frac{x_i^R\prod_{k\ne i}x_k}{\prod_{k\ne i}(x_k-x_i)}=\sum_{i=1}^R x_i^R\,w_i.$$

Replacing $x_i$ by its value $n_i^{-\alpha}$ for every index $i$ leads to the announced result since, by the very definition (9.26) of $\widetilde W_{R+1}$,

$$\widetilde W_{R+1}=\sum_{i=1}^R\frac{w_i}{n_i^{\alpha R}}=\sum_{i=1}^R x_i^R\,w_i=(-1)^{R-1}\prod_{i=1}^R x_i=\frac{(-1)^{R-1}}{n!^{\alpha}}.\qquad\diamond$$
It follows that

$$\sum_{i=1}^R w_i\,\mathbb E\,Y_{\frac h{n_i}}=\mathbb E\,Y_0+(-1)^{R-1}\,\frac{c_R}{n!^{\alpha}}\,h^{\alpha R}+o\big(h^{\alpha R}\big). \tag{9.29}$$
Remark. The main point to be noticed is the universality of these weights, which do
not depend on h ∈ H, but only on the chosen vector of refiners n and the exponent α
of the weak error expansion.
These computations naturally suggest the following definition for a family of
multistep estimators of depth R.
Definition 9.3. (Multistep estimator) The family of multistep estimators of depth R
associated to the vector of refiners n is defined for every h ∈ H and every simulation
size M ≥ 1 by
$$\widehat I^{RR}_M=\widehat I^{RR}_M(R,n,h,M)=\frac1M\sum_{k=1}^M\sum_{i=1}^R w_i\,Y^k_{\frac h{n_i}} \tag{9.30}$$

$$=\sum_{i=1}^R w_i\,\frac1M\sum_{k=1}^M Y^k_{\frac h{n_i}}, \tag{9.31}$$

where the $Y^k_{n,h}$ are independent copies of $Y_{n,h}=\big(Y_{h/n_i}\big)_{i=1,\dots,R}$ and the weight vector $w$ is given by (9.27).
The parameter R is called the depth of the estimator.
Remark. If R = 2 and n = (1, 2) the multistep estimator is just the regular
Richardson–Romberg estimator introduced in the previous section.
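The closed form (9.27) and the cancellation property (9.25) it guarantees are easy to check numerically (a small sketch with a helper name of our own):

```python
def multistep_weights(n, alpha):
    """Closed-form weights (9.27): w_i = n_i^{a(R-1)} / prod_{j != i} (n_i^a - n_j^a)."""
    R = len(n)
    w = []
    for i in range(R):
        num = n[i] ** (alpha * (R - 1))
        den = 1.0
        for j in range(R):
            if j != i:
                den *= n[i] ** alpha - n[j] ** alpha
        w.append(num / den)
    return w

w = multistep_weights([1, 2, 3], 1.0)   # R = 3, n_i = i, alpha = 1
# (9.32) predicts w_i = (-1)^{R-i} i^R / (i! (R-i)!), i.e. [1/2, -4, 9/2]
```

One can then verify that the weights sum to 1, that the first $R-1$ bias terms cancel, and that $\sum_i w_i/n_i^{\alpha R}=(-1)^{R-1}/n!^{\alpha}$ as in (9.28).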
Exercises. 1. Weights of interest. Keep in mind that the formulas below, especially
those in (b), will be extensively used in what follows.
(a) Show that, if $n_i=i$ and $\alpha=1$, then, the depth $R\ge2$ being fixed,

$$w_i=(-1)^{R-i}\,\frac{i^R}{i!\,(R-i)!},\quad i=1,\dots,R\qquad\text{and}\qquad\widetilde W_{R+1}=\frac{(-1)^{R-1}}{R!}. \tag{9.32}$$
(b) Show that, if $n_i=N^{i-1}$ ($N\in\mathbb N$, $N\ge2$), then

$$w_i=\frac{(-1)^{R-i}}{\prod_{1\le j\le R,\,j\ne i}\big|1-N^{-\alpha(i-j)}\big|}=(-1)^{R-i}\,\frac{N^{-\frac{\alpha}{2}(R-i)(R-i+1)}}{\prod_{j=1}^{i-1}\big(1-N^{-\alpha j}\big)\,\prod_{j=1}^{R-i}\big(1-N^{-\alpha j}\big)} \tag{9.33}$$

and

$$\widetilde W_{R+1}=\frac{(-1)^{R-1}}{N^{\alpha\frac{R(R-1)}{2}}}.$$
2. When $c_1=0$. Assume that $(WE)^{\alpha}_R$ holds with $c_1=0$:

$$(WE)^{\alpha}_R\;\equiv\;\mathbb E\,Y_h=\mathbb E\,Y_0+\sum_{r=2}^R c_r h^{\alpha r}+o\big(h^{\alpha R}\big).$$

Prove the existence of, and determine, an $(R-1)$-tuple of weights $\big(w^{\dagger}_1,\dots,w^{\dagger}_{R-1}\big)$ and a coefficient $\widetilde W^{\dagger}_R$ such that

$$\sum_{i=1}^{R-1}w^{\dagger}_i\,\mathbb E\,Y_{\frac h{n_i}}=\mathbb E\,Y_0+\widetilde W^{\dagger}_R\,c_R\,h^{\alpha R}+o\big(h^{\alpha R}\big).$$

[Hint: One can proceed either by mimicking the general case or by an appropriate re-scaling of the regular weights at order $R-1$.]
– Computation. The computation of $\widehat I^{RR}_M$ defined in (9.30) is performed in practice via (9.31), which only requires $R$ multiplications.

– Variance. As expected,

$$\mathrm{Var}\big(\widehat I^{RR}_M\big)=\frac{\mathrm{Var}\big(\sum_{1\le i\le R}w_i\,Y_{h/n_i}\big)}{M}.$$
Exercises. 1. Analysis of the Richardson–Romberg estimator. Let $I_0=\mathbb E\,Y_0$. Show by mimicking the analysis of the crude Monte Carlo simulation (carried out under assumption $(WE)^{\alpha}_2$) that, if $\mathcal H=\big\{\frac hn,\ n\ge1\big\}$,
– $(WE)^{\alpha}_R$ holds and
– $\|Y_h-Y_0\|_2\to0$ as $h\to0$ in $\mathcal H$,
then the multistep Richardson–Romberg estimator of $I_0$ satisfies

$$\inf_{h\in\mathcal H,\ \|\widehat I^{RR}_M-I_0\|_2<\varepsilon}\mathrm{Cost}\big(\widehat I^{RR}_M\big)\ \sim\ \frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,|c_R|^{\frac{1}{\alpha R}}\,\mathrm{Var}\Big(\sum_{i=1}^R w_i\,Y_{\frac h{n_i}}\Big)\,\frac{|n|}{n!^{1/R}}\,\frac{1}{\varepsilon^{2+\frac{1}{\alpha R}}}$$

$$\sim\ \frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,|c_R|^{\frac{1}{\alpha R}}\,\mathrm{Var}(Y_0)\,\frac{|n|}{n!^{1/R}}\,\frac{1}{\varepsilon^{2+\frac{1}{\alpha R}}}\quad\text{as }\varepsilon\to0,$$

where $|n|=n_1+\dots+n_R$ and the optimal parameters $h^*(\varepsilon)$ and $M^*(\varepsilon)$ (simulation size) are given by

$$h^*(\varepsilon)=h\,\big\lceil h/h_{\mathrm{opt}}(\varepsilon)\big\rceil^{-1}\in\mathcal H\quad\text{with}\quad h_{\mathrm{opt}}(\varepsilon)=\frac{\varepsilon^{\frac{1}{\alpha R}}}{(1+2\alpha R)^{\frac{1}{2\alpha R}}\,|c_R|^{\frac{1}{\alpha R}}}$$

and

$$M^*(\varepsilon)=\Big\lceil\Big(1+\frac{1}{2\alpha R}\Big)\frac{\mathrm{Var}\big(Y_{h^*(\varepsilon)}\big)}{\varepsilon^2}\Big\rceil.$$
(As defined, h ∗ (ε) is the nearest lower neighbor of h opt (ε) in H.)
2. Show that:

(a) $\dfrac{|n|}{n!^{1/R}}\ge R$.

(b) If $n_i=i$, $i=1,\dots,R$, then $|n|=\frac{R(R+1)}{2}$ and $n!^{1/R}=(R!)^{1/R}\sim\frac Re$ as $R\uparrow\infty$, so that $\dfrac{|n|}{n!^{1/R}}\sim\dfrac{e(R+1)}{2}$.

(c) If $n_i=N^{i-1}$ ($N\in\mathbb N$, $N\ge2$), then $\dfrac{|n|}{n!^{1/R}}\sim\dfrac{N}{N-1}\,N^{\frac{R-1}{2}}$.

(d) Show that the choice $n_i=i$, $i=1,\dots,R$, for the refiners may not be optimal to minimize $\dfrac{|n|}{n!^{1/R}}$ [Hint: test $n_i=i+1$, $i=2,\dots,R$].
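These ratios are easy to tabulate numerically (a quick check of (a)–(c); the helper name is ours):

```python
import math

def ratio(n):
    """|n| / (n!)^{1/R} with |n| = n_1 + ... + n_R and n! = prod n_i."""
    R = len(n)
    log_fact = sum(math.log(x) for x in n)
    return sum(n) / math.exp(log_fact / R)

for R in (5, 10, 20):
    arith = list(range(1, R + 1))                    # n_i = i
    geom = [2 ** (i - 1) for i in range(1, R + 1)]   # n_i = N^{i-1}, N = 2
    print(R, round(ratio(arith), 3), round(ratio(geom), 3))
```

The lower bound in (a) is just the arithmetic–geometric mean inequality, and the geometric family indeed grows like $N^{(R-1)/2}$ up to a constant.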
The conclusions that can be drawn from the above exercises are somewhat contradictory concerning the asymptotic cost of the multistep estimator $\big(\widehat I^{RR}_M\big)_{M\ge1}$ needed to achieve an RMSE of $\varepsilon$:

• At first glance, it does fill the gap between the crude Monte Carlo rate – $\varepsilon^{-2-\frac1\alpha}$
ℵ Practitioner’s corner
Much of the information provided in this section will be useful in Sect. 9.5.1 that
follows, devoted to weighted multilevel methods.
The refiners and the weights. As suggested in the preceding Exercise 2, the choice $n_i=i$ is close to optimality to minimize $\frac{|n|}{n!^{1/R}}$. The other somewhat hidden condition to obtain the announced asymptotic behavior of the complexity as $\varepsilon\to0$ is the strong convergence assumption $Y_h\to Y_0$ in $L^2$.
Two examples of weights.
(a) $\alpha=1$, $n_i=i$, $i=1,\dots,R$: see (9.32).
(b) $\alpha=\frac12$, $n_i=i$, $i=1,\dots,R$:

$$R=2:\quad w^{(R)}_1=-(1+\sqrt2),\qquad w^{(R)}_2=\sqrt2\,(1+\sqrt2).$$

$$R=3:\quad w^{(R)}_1=\frac{\sqrt3-\sqrt2}{2\sqrt2-\sqrt3-1},\qquad w^{(R)}_2=-2\,\frac{\sqrt3-1}{2\sqrt2-\sqrt3-1},\qquad w^{(R)}_3=3\,\frac{\sqrt2-1}{2\sqrt2-\sqrt3-1}.$$

$$R=4:\quad w^{(R)}_1=-\frac{(1+\sqrt2)(1+\sqrt3)}{2},\qquad w^{(R)}_2=2\,(3+2\sqrt2)(\sqrt3+\sqrt2),$$
$$\phantom{R=4:}\quad w^{(R)}_3=-\frac32\,(\sqrt3+\sqrt2)(2+\sqrt3)(3+\sqrt3),\qquad w^{(R)}_4=4\,(2+\sqrt2)(2+\sqrt3).$$
Nested Monte Carlo. Once again, given the form of the approximator $Y_h$, the complexity of the procedure is significantly smaller in this case than announced in the general framework. In fact,

$$\kappa^{RR}(h)=\frac{\bar\kappa\,n_R}{h}$$

so that, revisiting Exercise 1,

$$\inf_{h\in\mathcal H,\ \|\widehat I^{RR}_M-I_0\|_2<\varepsilon}\mathrm{Cost}\big(\widehat I^{RR}_M\big)\ \sim\ \frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,|c_R|^{\frac{1}{\alpha R}}\,\mathrm{Var}(Y_0)\,\frac{n_R}{n!^{1/R}}\,\frac{1}{\varepsilon^{2+\frac{1}{\alpha R}}}\quad\text{as }\varepsilon\to0.$$

What should be noticed is that, when $n_i=i$, $i=1,\dots,R$, the ratio $\frac{|n|}{n!^{1/R}}$ (which grows at least like $R$) is replaced in that framework by $\frac{n_R}{n!^{1/R}}\sim e$ as $R\uparrow+\infty$.
Brownian diffusions. Among the practical aspects to be dealt with, the most important one is undoubtedly to simulate (independent copies of) the vector $Y_{n,h}$ when $Y_0=F(X)$, $X=(X_t)_{t\in[0,T]}$ is a Brownian diffusion and $Y_h=F(\bar X^n)$, where $F$ is a functional defined on $\mathbb D([0,T],\mathbb R^d)$ and $\bar X^n=(\bar X^n_t)_{t\in[0,T]}$ is the (càdlàg) stepwise constant Euler (or Milstein) scheme with step $\frac Tn$ of the underlying SDE, especially when $n=(1,\dots,R)$.
As for standard Richardson–Romberg extrapolation, one has the choice to simulate the Brownian increments in a consistent way either by using an abacus starting from the most refined scheme or by a recursive simulation starting from the coarsest increment. The structure of the refiners ($n_i=i$, $i=1,\dots,R$) makes the task significantly more involved than for the standard Romberg extrapolation.
We refer to [225] for an algorithm adapted to these refined schemes which spares
time and complexity in the simulation of these Brownian increments.
Throughout this section, unless explicitly mentioned (e.g. in exercises), the refiners have a geometric structure, namely

$$n_i=N^{i-1},\qquad i=1,\dots,R. \tag{9.34}$$
The basic principle of the multilevel paradigm is to split the Monte Carlo simulation into two parts: a first part made of a coarse level based on an estimator of $Y_h=Y_{h/n_1}$ with a not too small $h$, which does most of the job of computing $I_0=\mathbb E\,Y_0$, though with a bias $\mathbb E\,Y_h-I_0$; and a second part made of several correcting refined levels relying on increments $Y_{h/n_i}-Y_{h/n_{i-1}}$ which correct the former bias. These increments being small – since both $Y_{h/n_i}$ and $Y_{h/n_{i-1}}$ are close to $Y_0$ – they have small variance and need smaller simulated samples to perform the expected bias correction. By combining these increments in an appropriate way, one can almost "kill" the bias while keeping the variance and the complexity of the simulation under control.
Step 1 (Killing the bias). Let us assume

$$(WE)^{\alpha}_R\;\equiv\;\mathbb E\,Y_h=\mathbb E\,Y_0+\sum_{r=1}^R c_r h^{\alpha r}+o\big(h^{\alpha R}\big). \tag{9.35}$$
Then, starting again from the weighted combination (9.24) of the elements of $Y_{n,h}$ with the weight vector $w$ given by (9.27), one gets by an Abel transform

$$\sum_{i=1}^R w_i\,\mathbb E\,Y_{\frac h{n_i}}=W_1\,\mathbb E\,Y_h+\sum_{i=2}^R W_i\big(\mathbb E\,Y_{\frac h{n_i}}-\mathbb E\,Y_{\frac h{n_{i-1}}}\big)$$
$$=W_1\,\mathbb E\,Y_h+\sum_{i=2}^R W_i\,\mathbb E\big(Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}\big)$$
$$=\mathbb E\Big[\underbrace{Y^{(1)}_h}_{\text{coarse level}}+\underbrace{\sum_{i=2}^R W_i\big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\big)}_{\text{refined levels }i\ge2}\Big] \tag{9.36}$$
where

$$W_i=W^{(R)}_i=w_i+\dots+w_R,\qquad i=1,\dots,R, \tag{9.37}$$

(note that $W_1=w_1+\dots+w_R=1$) and, in the last line, the families $(Y^{(i)}_{n,h})_{i=1,\dots,R}$ are independent copies of $Y_{n,h}$. At this stage this last point may seem immaterial, but we draw the reader's attention to it in view of the forthcoming variance computations (¹).
As the weights satisfy the Vandermonde system (9.25), we derive from (9.36) that

$$\mathbb E\,Y^{(1)}_h+\sum_{i=2}^R W_i\,\mathbb E\big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\big)=\mathbb E\,Y_0+\widetilde W_{R+1}\,c_R\,h^{\alpha R}+o\big(h^{\alpha R}\big), \tag{9.38}$$
which suggests attaching a Monte Carlo simulation of size $M_i$ to each level $i$:

$$\widehat I^{ML2R}_{M_1,\dots,M_R}=\frac{1}{M_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^R\frac{W_i}{M_i}\sum_{k=1}^{M_i}\big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\big),$$

where $Y^{(i),k}_{n,h}$, $i=1,\dots,R$, $k\ge1$, are i.i.d. copies of $Y_{n,h}:=\big(Y_{h/n_i}\big)_{i=1,\dots,R}$ and the weight vector $(W_i)_{i=1,\dots,R}$ is given by (9.37).
However, in view of the optimization of the $M_i$, it is more convenient to introduce the formal framework of stratification: we re-write the above multilevel estimator as a stratified estimator, i.e. we set $q_i=\frac{M_i}{M}$, $i=1,\dots,R$, so that

$$\widehat I^{ML2R}_M:=\frac1M\Bigg[\frac{1}{q_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^R\frac{W_i}{q_i}\sum_{k=1}^{M_i}\big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\big)\Bigg].$$
¹ Other choices for the correlation structure of the families $Y^{(i)}_{n,h}$ could a priori be considered, e.g. some negative correlation between successive levels, but this would cause huge simulation problems since the correlation between two families seems difficult to monitor a priori.
9.5 The Multilevel Paradigm 405
Conversely, if $q=(q_1,\dots,q_R)\in\mathcal S_R:=\big\{q\in(0,1)^R:\ q_1+\dots+q_R=1\big\}$ (so $\mathcal S_R$ denotes here the "open" simplex of $[0,1]^R$), the above estimator is well defined as soon as $M\ge\frac{1}{\min_i q_i}\ (\ge R)$ by setting $M_i=\lfloor q_i M\rfloor\ge1$, $i=1,\dots,R$. Then $M_1+\dots+M_R$ is not equal to $M$ but lies in $\{M-R+1,\dots,M\}$. This leads us to the following slightly different formal definition.
$$\widehat I^{ML2R}_M=\widehat I^{ML2R}_M(q,h,R,n):=\frac{1}{M_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^R\frac{W_i}{M_i}\sum_{k=1}^{M_i}\big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\big), \tag{9.39}$$

(with the convention $\frac10\sum_{k=1}^0=0$) where $M_i=\lfloor q_i M\rfloor$, $i=1,\dots,R$, $Y^{(i),k}_{n,h}$, $i=1,\dots,R$, $k\ge1$, are i.i.d. copies of $Y_{n,h}:=\big(Y_{h/n_i}\big)_{i=1,\dots,R}$ and the weight vector $(W_i)_{i=1,\dots,R}$ is given by (9.37).
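A compact sketch of the estimator (9.39) on the toy family $Y_h=Z+c_1h+c_2h^2$ used earlier (samplers, allocation and parameter values are ours; within a level the pair shares its randomness, the levels are independent):

```python
import random

def ml2r(h, n, w, M, q, rng, c1=1.0, c2=1.0):
    """Weighted multilevel (ML2R) estimator (9.39) on Y_h = Z + c1*h + c2*h^2."""
    R = len(n)
    W = [sum(w[i:]) for i in range(R)]        # W_i = w_i + ... + w_R (so W_1 = 1)
    y = lambda z, hh: z + c1 * hh + c2 * hh * hh
    M_i = [max(1, int(q[i] * M)) for i in range(R)]
    # coarse level
    est = sum(y(rng.gauss(0.0, 1.0), h) for _ in range(M_i[0])) / M_i[0]
    # correcting levels: consistent pairs (same Z within a pair)
    for i in range(1, R):
        s = 0.0
        for _ in range(M_i[i]):
            z = rng.gauss(0.0, 1.0)
            s += y(z, h / n[i]) - y(z, h / n[i - 1])
        est += W[i] * s / M_i[i]
    return est

rng = random.Random(7)
h, n, w = 0.2, [1, 2], [-1.0, 2.0]            # R = 2 weights from (9.27), alpha = 1
est = ml2r(h, n, w, 200_000, [0.5, 0.5], rng)
# residual bias is W~_{R+1} c_R h^{aR} = -c2 h^2 / n! = -0.02
```

On this toy model the level-2 increment is deterministic, which makes the variance reduction of the correcting levels extreme; in realistic settings it is governed by the strong convergence rate β.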
$$\mathrm{Bias}\big(\widehat I^{ML2R}_M\big)=\mathrm{Bias}\big(\widehat I^{RR}_1\big)=(-1)^{R-1}c_R\Big(\frac{h^R}{n!}\Big)^{\alpha}+o\Big(\Big(\frac{h^R}{n!}\Big)^{\alpha}\Big) \tag{9.40}$$
and, the simulations attached to the different levels being independent,

$$\mathrm{Var}\big(\widehat I^{ML2R}_M\big)=\frac1M\sum_{i=1}^R\frac{\sigma^2_W(i,h)}{q_i}, \tag{9.42}$$

where we set

$$\sigma^2_W(1,h)=\mathrm{Var}(Y_h)\quad\text{and}\quad\sigma^2_W(i,h)=W_i^2\,\mathrm{Var}\big(Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}\big),\quad i=2,\dots,R.$$
As for the complexity, since level $i$ requires the joint simulation of two schemes with steps $\frac h{n_i}$ and $\frac h{n_{i-1}}$,

$$\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)\le\frac{\bar\kappa M}{h}\Big(q_1+\sum_{i=2}^R q_i\,(n_i+n_{i-1})\Big)=\frac{\bar\kappa M}{h}\Big(q_1+\big(1+N^{-1}\big)\sum_{i=2}^R q_i\,n_i\Big), \tag{9.43}$$

where we used that $M_i=\lfloor Mq_i\rfloor\le Mq_i$ and the specified form $n_i=N^{i-1}$, $i=1,\dots,R$, (see (9.34)) of the refiners in the penultimate term (and below). If we set

$$\kappa_1=1,\qquad \kappa_i=\big(1+N^{-1}\big)\,n_i,\quad i=2,\dots,R,$$

then, if we neglect the error induced by replacing $M_i=\lfloor Mq_i\rfloor$ by $Mq_i$, we may set

$$\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)=\frac{\bar\kappa M}{h}\sum_{i=1}^R\kappa_i\,q_i. \tag{9.44}$$
Exercise. Show that in the case of nested Monte Carlo, where $Y_h=f\big(\frac1K\sum_{k=1}^K F(X,Z_k)\big)$, the complexity reads (with the same approximation $M_i=\lfloor Mq_i\rfloor\approx Mq_i$ as above)

$$\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)=\bar\kappa\Big(\frac{M_1}{h}+\sum_{i=2}^R\frac{n_i}{h}\,M_i\Big)=\frac{\bar\kappa M}{h}\Big(q_1+\sum_{i=2}^R q_i\,n_i\Big) \tag{9.45}$$
The goal is now to solve, at least approximately, the optimization problem

$$\inf_{RMSE(\widehat I^{ML2R}_M)\le\varepsilon}\mathrm{Cost}\big(\widehat I^{ML2R}_M\big). \tag{9.46}$$

To this end we introduce the effort of the estimator, defined as the product of its cost and its variance, so that

$$\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)=\frac{\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)\,\mathrm{Var}\big(\widehat I^{ML2R}_M\big)}{\mathrm{Var}\big(\widehat I^{ML2R}_M\big)}=\frac{\mathrm{Effort}\big(\widehat I^{ML2R}_M\big)}{\mathrm{Var}\big(\widehat I^{ML2R}_M\big)}. \tag{9.47}$$
Plugging the respective expressions (9.42) and (9.44) of the variance and the complexity into their product yields the following expression of the effort (after simplification by $M$):

$$\mathrm{Effort}\big(\widehat I^{ML2R}_M\big)=\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)\,\mathrm{Var}\big(\widehat I^{ML2R}_M\big)=\frac{\bar\kappa}{h}\Big(\sum_{i=1}^R\kappa_i\,q_i\Big)\Big(\sum_{i=1}^R\frac{\sigma^2_W(i,h)}{q_i}\Big) \tag{9.48}$$
so that the effort does not depend upon the size $M$ of the simulation, in the sense that

$$\mathrm{Effort}\big(\widehat I^{ML2R}_M\big)=\mathrm{Cost}\big(\widehat I^{ML2R}_{M_q}\big)\,\mathrm{Var}\big(\widehat I^{ML2R}_{M_q}\big)=\mathrm{Effort}\big(\widehat I^{ML2R}_{M_q}\big),$$

where $M_q=1/\min_i q_i$. Such a property is universal among Monte Carlo estimators.
The bias–variance decomposition of the weighted multilevel estimator $\widehat I^{ML2R}_M$ and the fact that the RMSE constraint should be saturated to maximize the denominator of the ratio in (9.47) allow us to reformulate our minimization problem as

$$\inf_{\|\widehat I^{ML2R}_M-I_0\|_2\le\varepsilon}\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)=\inf_{|\mathrm{Bias}(\widehat I^{ML2R}_M)|<\varepsilon}\Bigg[\frac{\mathrm{Effort}\big(\widehat I^{ML2R}_M\big)}{\varepsilon^2-\mathrm{Bias}\big(\widehat I^{ML2R}_M\big)^2}\Bigg] \tag{9.49}$$
keeping in mind that the right-hand side does not depend on M when M ≥ Mq .
This reformulation suggests considering the above optimization problem as a function of two of the three free parameters, namely $q$ and $h$, the depth $R$ being fixed (the root $N$ has a special status). Note that the right-hand side of (9.49) seemingly no longer depends on the simulation size $M$. In fact, this size is determined further on in (9.53) as a function of the RMSE $\varepsilon$ and the optimized parameters $q^*(\varepsilon)$ and $h^*(\varepsilon)$. We propose a sub-optimal solution for (9.49), divided into two steps:
– first minimizing the effort for a fixed $h\in\mathcal H$, i.e. the numerator of the above ratio,
– then minimizing the cost in the bias parameter $h\in\mathcal H$, that is, maximizing the denominator of the ratio (and plugging the resulting optimized $h$ into the numerator).

Doing so, it is clear that we will only get sub-optimal solutions to the original optimization problem since the numerator depends on $h$. Nevertheless, the computations become tractable and lead to closed forms for our optimized parameters $q$ and $h$.
Minimizing the effort. This phase follows from the elementary lemma below, a straightforward application of the Schwarz inequality and its equality case (see [52]).

Lemma. For all $a=(a_1,\dots,a_R)$ and $b=(b_1,\dots,b_R)$ with positive components and every $q\in\mathcal S_R$,

$$\Big(\sum_{i=1}^R a_i\,q_i\Big)\Big(\sum_{i=1}^R\frac{b_i}{q_i}\Big)\ \ge\ \Big(\sum_{i=1}^R\sqrt{a_i b_i}\Big)^2$$

and equality holds if and only if $q_i=\dfrac{\sqrt{a_i^{-1}b_i}}{q^{\dagger}}$, $i=1,\dots,R$, with $q^{\dagger}=q^{\dagger}(a,b)=\sum_{i=1}^R\sqrt{a_i^{-1}b_i}$.
Applying this to (9.48) with $a_i=\kappa_i$ and $b_i=\sigma^2_W(i,h)$, we derive the solution to the effort minimization problem:

$$\mathrm{argmin}_{q\in\mathcal S_R}\,\mathrm{Effort}\big(\widehat I^{ML2R}_1\big)=q^*=\Big(\frac{\sigma_W(i,h)}{q^{*,\dagger}(h)\,\sqrt{\kappa_i}}\Big)_{i=1,\dots,R} \tag{9.50}$$

with $q^{*,\dagger}(h)=\sum_{1\le i\le R}\frac{\sigma_W(i,h)}{\sqrt{\kappa_i}}$. The resulting minimal effort reads

$$\min_{q\in\mathcal S_R}\mathrm{Effort}\big(\widehat I^{ML2R}_1\big)=\frac{\bar\kappa}{h}\Big(\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2.$$
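The equality case of the lemma gives a closed-form allocation; a small numerical check of (9.50) on generic inputs of our own choosing:

```python
import math

def optimal_allocation(kappa, sigma):
    """q_i proportional to sigma_i / sqrt(kappa_i), normalized to the simplex (cf. (9.50))."""
    raw = [s / math.sqrt(k) for s, k in zip(sigma, kappa)]
    total = sum(raw)
    return [r / total for r in raw]

def effort(kappa, sigma, q):
    """The h-free factor of (9.48): (sum kappa_i q_i) * (sum sigma_i^2 / q_i)."""
    return (sum(k * x for k, x in zip(kappa, q))
            * sum(s * s / x for s, x in zip(sigma, q)))

kappa = [1.0, 3.0, 6.0]
sigma = [1.0, 0.4, 0.1]
q_star = optimal_allocation(kappa, sigma)
best = sum(math.sqrt(k) * s for k, s in zip(kappa, sigma)) ** 2   # lemma's lower bound
```

At $q^*$ the effort equals the lemma's lower bound $(\sum_i\sqrt{\kappa_i}\,\sigma_i)^2$, and any other allocation (e.g. uniform) does strictly worse.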
However, this formal approach cannot be implemented as such in practice since the
true values of σW (i, h) are usually unknown and should be replaced by an upper-
bound (see (9.58) and (9.61) further on).
Minimizing the resulting cost. If we now compile our results on the above effort minimization, the cost formulation (9.44) and the bias expansion (9.40), the global optimization problem $\inf_{RMSE(\widehat I^{ML2R}_M)\le\varepsilon}\mathrm{Cost}(\widehat I^{ML2R}_M)$ is "dominated", if we neglect the second-order term in the bias, by

$$\inf_{h\in\mathcal H\,:\ c_R^2\left(\frac{h^R}{n!}\right)^{2\alpha}<\varepsilon^2}\ \frac{\bar\kappa}{h}\,\frac{\Big(\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2}{\varepsilon^2-c_R^2\big(\frac{h^R}{n!}\big)^{2\alpha}}.$$
$$\tilde h(\varepsilon)=\Big(\frac{\varepsilon}{|c_R|}\Big)^{\frac{1}{\alpha R}}\frac{n!^{1/R}}{(1+2\alpha R)^{\frac{1}{2\alpha R}}}. \tag{9.51}$$
1
Plugging the value of the variance (9.42) and the bias (9.40) into this equation finally yields

$$M^*(\varepsilon)=\Big\lceil\Big(1+\frac{1}{2\alpha R}\Big)\frac{\mathrm{Var}\big(\widehat I^{ML2R}_1\big)}{\varepsilon^2}\Big\rceil=\Big\lceil\Big(1+\frac{1}{2\alpha R}\Big)\frac{q^{*,\dagger}\sum_{1\le i\le R}\sqrt{\kappa_i}\,\sigma_W\big(i,h^*(\varepsilon)\big)}{\varepsilon^2}\Big\rceil. \tag{9.53}$$
One may re-write this formula in a more tractable or convenient way (see Table 9.1 in
Practitioner’s corner further on) by replacing κi , σW (i, h) and h ∗ (ε) by their available
expressions or bounds.
Remark. If we assume furthermore that $Y_h\to Y_0$ in $L^2$, then $h^*(\varepsilon)\to0$ as $\varepsilon\to0$, so that $\big(\sum_{1\le i\le R}\sqrt{\kappa_i}\,\sigma_W(i,h^*(\varepsilon))\big)^2\to\mathrm{Var}(Y_0)$ since $\sigma_W(i,h^*(\varepsilon))\to0$ for $i=2,\dots,R$, $\sigma_W(1,h^*(\varepsilon))\to\sigma(Y_0)$ and $W_1=1$. Finally, we get the following asymptotic result:

$$\inf_{\|\widehat I^{ML2R}_M-I_0\|_2\le\varepsilon}\mathrm{Cost}^{ML2R}(h,M,q)\ \sim\ \frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,\frac{|c_R|^{\frac{1}{\alpha R}}\,\mathrm{Var}(Y_0)}{n!^{1/R}\,\varepsilon^{2+\frac{1}{\alpha R}}}\quad\text{as }\varepsilon\to0. \tag{9.54}$$
Temporary conclusions.
– The first good news that can be drawn from this asymptotic result is that we still have the term $\varepsilon^{2+\frac{1}{\alpha R}}$, which shows that, the larger the depth $R$ is, the closer we get to an unbiased simulation.
– The second good news is that the ratio $\frac{|n|}{n!^{1/R}}\ (\ge R)$ that appeared in the multistep framework and caused problems at fixed RMSE $\varepsilon>0$ is now replaced by $\frac{1}{n!^{1/R}}=\frac{1}{N^{\frac{R-1}{2}}}$, which goes to $0$ as $R\to+\infty$.
This choice is justified by the following facts: in the diffusion framework ($Y_h=f(\bar X_T)$ or $F\big((\bar X^n_t)_{t\in[0,T]}\big)$, $\bar X^n$ a discretization scheme with step $h=\frac Tn$), Assumption $(VE)_{\beta}$ is consistent with the asymptotic variance of the Central Limit Theorem established in [34], namely a weak convergence $h^{-\frac\beta2}\big(Y_h-Y_{h/N}\big)\Rightarrow\sqrt{\frac{N-1}{N}}\,Z$, $Z\overset{d}=\mathcal N(0;1)$ (when $\beta=1$). This is confirmed by various empirical evidence, not reproduced here, for other values of $\beta$. In the nested Monte Carlo framework, $(VE)_{\beta}$ is also satisfied (see Proposition 9.2 further on) through the elementary inequality $\mathrm{Var}\big(Y_h-Y_{h'}\big)\le\|Y_h-Y_{h'}\|_2^2\le V_1|h-h'|^{\beta}$. However, it is also clear that this bound is not the most accessible one in many situations: thus in a diffusion framework, one more naturally has access to a control of $Y_h-Y_0$ in quadratic norm, relying on results like those established in Chap. 7, devoted to time discretization schemes of SDEs. In fact, what follows can easily be adapted to a less sharp framework where one only has access to a control of the form

$$\forall\,h\in\mathcal H,\qquad\|Y_h-Y_0\|_2^2\le V_1 h^{\beta}.$$

This is, for example, the framework adopted in [198] (see also the next exercise).
Let us briefly discuss this assumption from a technical viewpoint. First, note that $(VE)_{\beta}$ is clearly implied by

$$(SE)_{\beta}\;\equiv\;\forall\,h,h'\in\mathcal H\cup\{0\},\qquad\|Y_h-Y_{h'}\|_2^2\le V_1|h-h'|^{\beta}, \tag{9.56}$$

under which the refined levels satisfy

$$\sigma_W(i,h)\le|W_i|\,\sqrt{V_1}\,\Big|\frac h{n_i}-\frac h{n_{i-1}}\Big|^{\frac\beta2}=|W_i|\,\sqrt{V_1}\,h^{\frac\beta2}\big(1-N^{-1}\big)^{\frac\beta2}n_{i-1}^{-\frac\beta2},\quad i=2,\dots,R,$$

where we used in the last line the specific form of the refiners $n_i=N^{i-1}$. On the coarse level,

$$\sigma_W(1,h)=\sigma(Y_h)=\sqrt{\mathrm{Var}(Y_h)}.$$
As

$$\kappa_i=n_i+n_{i-1}=n_i\Big(1+\frac1N\Big),\qquad i\in\{2,\dots,R\}, \tag{9.59}$$
we finally obtain, after setting

$$\theta=\theta_h=\sqrt{\frac{V_1}{\mathrm{Var}(Y_h)}}, \tag{9.60}$$

the (bounded) allocation

$$q^*_1=\frac{1}{q^{\dagger,*}},\qquad q^*_i=\frac{\theta\,|W_i|\,\big|\frac h{n_i}-\frac h{n_{i-1}}\big|^{\frac\beta2}}{q^{\dagger,*}\,\sqrt{n_i+n_{i-1}}},\quad i=2,\dots,R, \tag{9.61}$$

($q^{\dagger,*}$ being the normalizing constant which makes the $q^*_i$ sum up to 1) and
$$\Big(\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2\le\mathrm{Var}(Y_h)\Big(1+\theta\,h^{\frac\beta2}\sum_{i=2}^R|W_i|\,\big(n_{i-1}^{-1}-n_i^{-1}\big)^{\frac\beta2}\sqrt{n_i+n_{i-1}}\Big)^2 \tag{9.62}$$

$$=\mathrm{Var}(Y_h)\Big(1+\theta\,h^{\frac\beta2}\big(1+N^{-1}\big)^{\frac12}\big(N-1\big)^{\frac\beta2}\sum_{i=2}^R|W_i|\,N^{(i-1)\frac{1-\beta}{2}}\Big)^2,$$

keeping in mind that $(W_i)_{1\le i\le R}=(W^{(R)}_i)_{1\le i\le R}$ is an $R$-tuple, not the first $R$ terms of a sequence.
Remarks. • If the two processes $Y_{\frac h{n_{i-1}}}$ and $Y_{\frac h{n_i}}$ do not remain "close" in a pathwise sense, the multilevel estimator may "diverge": while it maintains its performance as a bias killer, it loses all its variance control abilities, as observed, for example, in [153] with the regular multilevel estimator presented in the next section. The same effect can be reproduced with this weighted estimator. Typically with diffusions, this means that the two discretization schemes involved in each level $i$ are based on the same driving Brownian motion $W^i$ (these $W^i$ being independent across levels).

• However, in some specific situations, one may even get rid of the strong convergence itself, see e.g. [32] for such an approach for diffusion processes, where an approximation of the underlying diffusion process is performed by a kind of binomial tree which preserves the performance of the multilevel paradigm, though not converging strongly. This idea is applied to deal with Lévy-driven diffusion processes whose discretization schemes involve a wienerization of the small jumps.
Exercise (Effort control under quadratic convergence, see [198].) In practice, the simplest or most commonly available information about a family $(Y_h)_{h\in\mathcal H}$ of approximations of $Y_0$ concerns the quadratic rate of convergence of $Y_h$ toward $Y_0$, namely that there exist $\beta>0$ and $\bar V_1>0$ such that the following property holds:

$$(SE)^0_{\beta}\;\equiv\;\forall\,h\in\mathcal H,\qquad\|Y_h-Y_0\|_2^2\le\bar V_1 h^{\beta}. \tag{9.63}$$
(a) Show that, under $(SE)^0_{\beta}$ (and for refiners of the form $n_i=N^{i-1}$),

$$\sigma_W(i,h)\le|W_i|\,\sqrt{\bar V_1}\,h^{\frac\beta2}\,N^{-(i-1)\frac\beta2}\big(1+N^{\frac\beta2}\big),\qquad i=2,\dots,R.$$

(b) Deduce that

$$\Big(\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2\le\mathrm{Var}(Y_h)\Big(1+\bar\theta_h\,h^{\frac\beta2}\big(1+N^{-1}\big)^{\frac12}\big(N^{\frac\beta2}+1\big)\sum_{i=2}^R|W_i|\,N^{(i-1)\frac{1-\beta}{2}}\Big)^2,$$

where $\bar\theta_h=\sqrt{\dfrac{\bar V_1}{\mathrm{Var}(Y_h)}}$.
.
β
(c) Show that on the coarse level σW (1, h) ≤ σ(Y0 ) + V̄1 h 2 . Deduce the less sharp
inequality
R
2 R
2
√ β
−1 1 β
(i−1) 1−β
κi σW (i, h) ≤ Var(Yh ) 1 + θ̄0 h (1 + N
2 ) (N
2 2 + 1) |Wi |N 2
i=1 i=1
where θ̄0 = V̄1
Var(Y0 )
.
Optimization of the depth $R=R(\varepsilon)$. First, we need to make one last additional assumption, namely that the weak error expansion holds true at any order and that the coefficients $c_r$ do not go too fast toward infinity as $r\to+\infty$, namely

$$(WE)^{\alpha}_{\infty}\;\equiv\;\begin{cases}(WE)^{\alpha}_R\ \text{holds for every depth }R\ge2\\[1mm]\text{and}\\[1mm]c_{\infty}=\lim_{R\to+\infty}|c_R|^{\frac1R}\in(0,+\infty).\end{cases} \tag{9.64}$$
Indeed, as $R\uparrow+\infty$, $\frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\to1$ and $|c_R|^{\frac{1}{\alpha R}}$ remains bounded, whereas $n!^{1/R}=N^{\frac{R-1}{2}}\uparrow+\infty$. Moreover, the complexity depends on $\varepsilon$ as $\varepsilon^{-(2+\frac{1}{\alpha R})}$ so that, the larger $R$ is, the more this complexity behaves like that of an unbiased estimator (in $\varepsilon^{-2}$). These facts strongly suggest considering $R$ as large as possible, under the constraint that $h^*(\varepsilon)$ lies in $\mathcal H$ and remains close to $\tilde h(\varepsilon)$, which is equivalent to $\tilde h(\varepsilon)\le h$.
This argument is all the more true if ε is small but remains heuristic at this stage
and cannot be considered as a rigorous optimization: such a procedure may turn
out to be suboptimal, especially if the prescribed RMSE is not small enough. An
alternative for practical implementation is to numerically minimize the upper-bound
of the cost on the right-hand side of (9.52), once the parameter θ has been estimated
(see Practitioner’s corner further on). One reason for computing a closed formula
for the depth R and the other parameters of the Richardson–Romberg estimator is to
obtain asymptotic bounds on the complexity (see Theorem 9.1).
The inequality $\tilde h(\varepsilon)\le h$ reads, owing to (9.51), after taking logarithms and temporarily considering $R$ as a real number,

$$R\ \le\ \varphi_{\varepsilon}(R):=\frac12+\frac{\log\big(c_{\infty}^{1/\alpha}h\big)}{\log N}+\sqrt{\Big(\frac12+\frac{\log\big(c_{\infty}^{1/\alpha}h\big)}{\log N}\Big)^2+\frac{2}{\alpha\log N}\,\log\frac{\sqrt{1+2\alpha R}}{\varepsilon}}.$$

Note that $\varphi_{\varepsilon}$ is increasing in $R$. The highest admissible depth $R^*(\varepsilon)$ is the highest integer $R$ which satisfies the above inequality. If we set $R_0=1$ and $R_{k+1}=\varphi_{\varepsilon}(R_k)$, then $R_k\uparrow R_{\infty}$ and

$$R^*(\varepsilon)=\lfloor R_{\infty}\rfloor. \tag{9.65}$$
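The fixed-point iteration $R_{k+1}=\varphi_{\varepsilon}(R_k)$ converges very fast in practice; a generic sketch (the concrete $\varphi$ below uses illustrative constants of our own, not calibrated values):

```python
import math

def highest_depth(phi, r0=1.0, tol=1e-10, max_iter=10_000):
    """Iterate R_{k+1} = phi(R_k); for a non-decreasing phi with phi(r0) >= r0
    the iterates increase to the fixed point R_inf; return floor(R_inf)."""
    r = r0
    for _ in range(max_iter):
        r_next = phi(r)
        if abs(r_next - r) < tol:
            break
        r = r_next
    return math.floor(r)

# illustrative phi_eps with alpha = 1, N = 2, eps = 1e-4 and constant term A = 1/2
alpha, N, eps, A = 1.0, 2.0, 1e-4, 0.5
phi = lambda R: A + math.sqrt(A * A
                              + 2.0 * math.log(math.sqrt(1.0 + 2.0 * alpha * R) / eps)
                              / (alpha * math.log(N)))
depth = highest_depth(phi)   # -> 6 for these constants
```

Since the slope of $\varphi_{\varepsilon}$ near the fixed point is small, a handful of iterations suffices, which makes this a cheap preprocessing step.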
$$W^{(R)}_j=\sum_{\ell=j}^{R}a_{\ell}\,a_{R-\ell+1}\,b_{R-\ell}=\sum_{\ell=0}^{R-j}a_{R-\ell}\,a_{\ell+1}\,b_{\ell}.$$

(b) The sequence $(a_{\ell})$ increases to $a_{\infty}=\prod_{k\ge1}\big(1-N^{-k\alpha}\big)^{-1}$ as $\ell\uparrow+\infty$ and $B_{\infty}=\sum_{\ell=0}^{+\infty}|b_{\ell}|<+\infty$, so that the weights $W^{(R)}_j$ are uniformly bounded and

$$\forall\,R\in\mathbb N^*,\ \forall\,j\in\{1,\dots,R\},\qquad\big|W^{(R)}_j\big|\le a_{\infty}^2\,B_{\infty}<+\infty. \tag{9.67}$$
$$W^{(R)}_j=\sum_{\ell=j}^{R}a_{\ell}\,b_{R-\ell}\,a_{R-\ell+1}=\sum_{\ell=0}^{R-j}a_{R-\ell}\,b_{\ell}\,a_{\ell+1}.$$
Now, we have to inspect the behavior of $\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W\big(i,h^*(\varepsilon)\big)$ depending on the value of $\beta$. Given (9.48), there are three different cases:
• $\beta\in(1,+\infty)$: fast strong approximation (corresponding, for example, to the use of the Milstein scheme for vanilla payoffs in a local volatility model);
• $\beta=1$: regular strong approximation (corresponding, for example, to the use of the Euler scheme for vanilla or lookback payoffs in a local volatility model);
• $\beta\in(0,1)$: slow strong approximation (corresponding, for example, to the use of the Euler scheme for payoffs including barriers in a local volatility model, or the computation of quantiles for risk measure purposes).

Combining all of the preceding with elementary though slightly technical computations (see [198] for details) leads to the following theorem.
Proof of Theorem 9.1. The starting point of the proof is the upper-bound (9.52) of
the complexity of the ML2R estimator. We will analyze its behaviour as ε → 0 when the
parameters $h$ and $R$ are optimized (see Table 9.1), i.e. set at $h=\mathbf h$ and $R=R^*(\varepsilon)$.
First note that, with our choice of geometric refiners $n_i=N^{i-1}$, $\tilde n_R=N^{\frac{R-1}{2}}$.
Then, if we denote $W^*=\sup_{R\ge1}\max_{1\le i\le R}\big|W^{(R)}_i\big|$, we know that $W^*<+\infty$ owing
to Lemma 9.2. Hence, if we denote the infimum of the complexity of the ML2R
estimators as defined in the theorem statement by
$$\mathrm{Cost}_{opt}=\inf_{\substack{h\in\mathcal H,\ q\in\mathcal S_R,\ R,\ M\ge1,\\ \|\widehat I^{ML2R}_M-I_0\|_2\le\varepsilon}}\mathrm{Cost}\big(\widehat I^{ML2R}_M\big),$$
then (9.52) shows that $\mathrm{Cost}_{opt}$ is controlled through $\mathrm{Effort}^{ML2R}_{opt}$, where $\mathrm{Effort}^{ML2R}_{opt}$ is given by setting $h=\mathbf h$ in (9.48), namely
$$\mathrm{Effort}^{ML2R}_{opt}=\mathrm{Var}(Y_{\mathbf h})\,K_1(\beta,\theta,\mathbf h,V_1)\Big(\sum_{i=1}^R |W_i|\,\big(n_{i-1}^{-1}-n_i^{-1}\big)^{\frac\beta2}\sqrt{n_i+n_{i-1}}\Big)^2.$$
Keeping in mind that $\lim_{\varepsilon\to0}R^*(\varepsilon)=+\infty$ owing to (9.66), one deduces that
$$\lim_{R\to+\infty}\frac{(1+2\alpha R)^{1+\frac1{2\alpha R}}}{2\alpha R}=1\quad\text{and, using }(WE)^\alpha_\infty,\quad \lim_{R\to+\infty}|c_R|^{\frac1{\alpha R}}=c_\infty^{\frac1\alpha}<+\infty.$$
Moreover, note that
$$R^*(\varepsilon)=\bigg\lceil \sqrt{\frac{2\log(1/\varepsilon)}{\alpha\log N}}+r^*+O\Big(\frac1{\sqrt{\log(1/\varepsilon)}}\Big)\bigg\rceil\quad\text{as }\varepsilon\to0.$$
At this stage, it is clear that the announced rate results will follow from the
asymptotic behaviour of
$$\Big(\sum_{i=1}^{R^*(\varepsilon)}|W_i|\,N^{\frac{1-\beta}{2}(i-1)}\Big)^2\quad\text{as }\varepsilon\to0,$$
which eventually yields, when $\beta\in(0,1)$,
$$\lim_{\varepsilon\to0}\ \varepsilon^2\exp\Big(-\frac{1-\beta}{\sqrt\alpha}\sqrt{2\log N\,\log(1/\varepsilon)}\Big)\,\mathrm{Cost}_{opt}<+\infty. \;\diamondsuit$$
9.5 The Multilevel Paradigm 419
$$\lim_{R\to+\infty}\ \sum_{j=2}^{R}\big|W^{(R)}_j\big|\,N^{-\gamma(j-1)}=\frac1{N^\gamma-1}.$$
(b) Let $(v_j)_{j\ge1}$ be a bounded sequence of positive real numbers. Let $\gamma\in\mathbb R$ and
assume that $\lim_j v_j=1$ when $\gamma\ge0$. Then the following limits hold: for every $N\ge2$,
$$\sum_{j=2}^{R}\big|W^{(R)}_j\big|\,N^{\gamma(j-1)}v_j\ \overset{R\to+\infty}{\sim}\ \begin{cases}\displaystyle\sum_{j\ge2}N^{\gamma(j-1)}v_j<+\infty & \text{for }\gamma<0,\\[1mm] R & \text{for }\gamma=0,\\[1mm] \displaystyle N^{\gamma R}a_\infty\sum_{j\ge1}\Big|\sum_{\ell=0}^{j-1}b_\ell\Big|N^{-\gamma j} & \text{for }\gamma>0.\end{cases}$$
(c) Use these results to improve the values of the constants $K^{ML2R}(\alpha,\beta,N,\theta,\mathbf h,c_\infty,V_1)$
in Theorem 9.1 compared to those derived by relying on Lemma 9.2 in the former
exercise.
Remark. These sharper results on the weights are needed to establish the CLT satisfied by the ML2R estimators as ε → 0 (see [112]).
1
Wi(R) = wi(R) + · · · + w(R) where wi(R) = (−1) R−i * , i = 1, . . . , R
R
j=i |1 − N −α(i− j) |
(the formula for wi(R) is established in (9.33)). Keep in mind that W(R)
1 = 1.
– For a given root N ∈ N, N ≥ 2, the refiners are fixed as follows: n i = N i−1 ,
i = 1, . . . , R.
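A direct implementation of (9.33) is a useful sanity check, since one must recover $W^{(R)}_1=1$. The sketch below is ours (function and variable names are not from the book):

```python
def ml2r_weights(R, N=2, alpha=1.0):
    """Compute w_i^{(R)} from (9.33) and the cumulated weights
    W_i^{(R)} = w_i^{(R)} + ... + w_R^{(R)}."""
    w = []
    for i in range(1, R + 1):
        prod = 1.0
        for j in range(1, R + 1):
            if j != i:
                prod *= abs(1.0 - N ** (-alpha * (i - j)))
        w.append((-1) ** (R - i) / prod)
    W = [sum(w[i:]) for i in range(R)]  # W[0] is W_1^{(R)}
    return w, W
```

For instance, with $R=3$, $N=2$, $\alpha=1$, one gets $w=(1/3,\,-2,\,8/3)$, whose tail sums give $W_1^{(3)}=1$ as expected.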
– When β ∈ (0, 1), the optimal choice for the root is N = 2.
– The choice of the coarsest bias parameter $\mathbf h$ is usually naturally suggested by the
model under consideration. As all the approaches so far have been developed for an
absolute error, it seems natural to choose $\mathbf h$ less than or close to one in practice: in
particular, in a diffusion framework with horizon T ≥ 1, we will not choose $\mathbf h = T$
but rather T/2 or T/3, etc.; likewise, for nested Monte Carlo, if the variance of the
inner simulation is too large, the parameter $K_0 = 1/\mathbf h$ will be chosen strictly greater
than 1 (say equal to a few units). See further on in this section for other comments.
How to calibrate $c_\infty$ (and its connections with $\mathbf h$ and $R^*(\varepsilon)$)
First, it is not possible to include an estimation of $c_\infty$ in a pre-processing phase –
beyond the fact that $R^*(\varepsilon)$ depends on $c_\infty$ – since the natural statistical estimators
are far too unstable. On the other hand, to the best of our knowledge, under $(WE)^\alpha_\infty$,
there are no significant results about the behavior of the coefficients $c_r$ as $r\to+\infty$,
whatever the assumptions on the model are.
– If $\mathbf h=1$, then $h\in\mathcal H$ can be considered as small and we observe that if the
coefficients $c_r$ of the weak error expansion have polynomial growth, i.e. $|c_r|=O(r^a)$
as $r\to\infty$, then
$$c_\infty=\lim_{r\to+\infty}|c_r|^{\frac1r}=1=\mathbf h,$$
so that, from a practical point of view, $|c_R|^{\frac1R}\simeq1$ as $R=R(\varepsilon)$ grows.
– If $\mathbf h\ne1$, it is natural to perform a change of unit to return to the former situation,
i.e. express everything with respect to $\mathbf h$. It reads
$$\mathbb E\Big[\frac{Y_h}{\mathbf h}\Big]=\mathbb E\Big[\frac{Y_0}{\mathbf h}\Big]+\sum_{r=1}^R c_r\,\mathbf h^{\alpha r-1}\Big(\frac h{\mathbf h}\Big)^{\alpha r}+\mathbf h^{\alpha R-1}\,o\Big(\Big(\frac h{\mathbf h}\Big)^{\alpha R}\Big).$$
If we consider that this expansion should behave as the above normalized one, it is
natural to “guess” that $\big|c_R\,\mathbf h^{\alpha R-1}\big|^{\frac1R}\to1$ as $R\to+\infty$, i.e.
$$c_\infty=\mathbf h^{-\alpha}. \tag{9.68}$$
Note that this implies that $c_\infty^{\frac1\alpha}\mathbf h=1$, which dramatically simplifies the expression
of $R(\varepsilon)$ in Table 9.1 since several terms involving $\log\big(c_\infty^{\frac1\alpha}\mathbf h\big)$ vanish. The resulting
formula reads
$$R^*(\varepsilon)=\bigg\lceil\frac12+\sqrt{\frac14+\frac{2}{\alpha\log(N)}\log\Big(\frac{\sqrt{1+4\alpha}}{\varepsilon}\Big)}\;\bigg\rceil \tag{9.69}$$
which still grows as $O\big(\sqrt{\log(1/\varepsilon)}\big)$ as $\varepsilon\to0$.
Various numerical experiments (see [113, 114]) show that the ML2R estimator,
implemented with this choice $|c_R|^{\frac1R}\simeq\mathbf h^{-\alpha}$, is quite robust under variations of the
coefficients $c_r$.
At this stage, for a prescribed RMSE ε, R(ε) can be computed. Note that, given the
value of R(ε) (and the way it has been specified), one has
$$h^*(\varepsilon)=\mathbf h.$$
The constant $V_1$ is estimated by the proxy
$$\widehat V_1(h_0,h_0',m)=|h_0-h_0'|^{-\beta}\,\frac1m\sum_{k=1}^m\Big(Y^k_{h_0}-Y^k_{h_0'}-\overline{Y}_{h_0,h_0',m}\Big)^2,\quad\text{where}\quad \overline Y_{h_0,h_0',m}=\frac1m\sum_{k=1}^m\big(Y^k_{h_0}-Y^k_{h_0'}\big). \tag{9.71}$$
We make a rather conservative choice by giving the priority to fulfilling the prescribed
error ε at the cost of a possible small increase in complexity. This led us to consider
in all the numerical experiments that follow
$$h_0=\frac{\mathbf h}{5}\quad\text{and}\quad h_0'=\frac{h_0}{2}=\frac{\mathbf h}{10}.$$
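The proxy (9.71) is an empirical variance of coupled differences, rescaled by $|h_0-h_0'|^{-\beta}$. The sketch below checks it on an assumed toy pair whose coupled difference has variance exactly $v\,(h_0-h_0')$ (so the proxy should return $v$); the toy model and all names are ours.

```python
import random
import statistics

def estimate_v1(coupled_pair, h0, h0p, beta, m):
    """Proxy (9.71): empirical variance of coupled differences Y_{h0} - Y_{h0'},
    scaled by |h0 - h0'|^(-beta).  coupled_pair() returns one coupled draw."""
    diffs = [a - b for a, b in (coupled_pair() for _ in range(m))]
    return abs(h0 - h0p) ** (-beta) * statistics.pvariance(diffs)

# Illustrative toy pair with Var(Y_h0 - Y_h0') = v * (h0 - h0'):
random.seed(1)
v, h0, h0p = 2.0, 0.1, 0.05
pair = lambda: (0.0, random.gauss(0.0, (v * (h0 - h0p)) ** 0.5))
v1_hat = estimate_v1(pair, h0, h0p, beta=1.0, m=50000)
```

With $\beta=1$ and $m=50\,000$ draws, `v1_hat` should lie close to the true value $v=2.0$.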
Exercise. Determine a way to estimate $\bar V_1$ in (9.63) when $(SE)^0_\beta$ is the only
available strong error rate.
Calibration of the root N
This corresponds to the second phase described at the beginning of this section.
– First, let us recall that if β ∈ (0, 1), then the best root is always N = 2.
– When β ≥ 1 – keeping in mind that R will never go beyond 10 or 12 for common
values of the prescribed error ε, and once the parameters $\mathrm{Var}(Y_0)$ and $V_1$ have been
estimated – it is possible to compute the upper-bound (9.54) of $\mathrm{Cost}\big(\widehat I^{ML2R}_M\big)$ for
various values of N and select the minimizing root.
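The selection of the root then reduces to a one-dimensional scan. The sketch below assumes a numerically evaluable cost bound is available (the quadratic profile used here is purely illustrative, not the bound (9.54)):

```python
def best_root(cost_upper_bound, candidates=range(2, 11)):
    """Return the root N minimizing a numerically evaluated cost upper-bound."""
    return min(candidates, key=cost_upper_bound)

# Hypothetical cost profile in N, for illustration only:
N_star = best_root(lambda N: (N - 4) ** 2 + 1.0)
```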
where $v_1\in(0,V_1]$.
As a consequence, it is possible to design a confidence interval under the assumption that this asymptotic Gaussian behavior holds true. The reasoning is as follows:
we start from the fact that
$$\big\|\widehat I^{ML2R}_{M(\varepsilon)}-I_0\big\|_2^2=\mathrm{Bias}^2\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)+\mathrm{Var}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)=\varepsilon^2,$$
keeping in mind that $I_0=\mathbb E\,Y_0$. Then, both $\big|\mathrm{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|$ and the standard deviation
$\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ are dominated by ε. Let $q_c$ denote the quantile at confidence level $c\in(0,1)$
of the normal distribution ($^2$). If ε is small enough,
$$\mathbb P\Big(\widehat I^{ML2R}_{M(\varepsilon)}\in\Big[\mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}-q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big),\ \mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\Big]\Big)\simeq c.$$
On the other hand, $\mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}=I_0+\mathrm{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ so that
$^2$ $c$ is used here to avoid confusion with the exponent α of the weak error expansion.
$$\Big[\mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}-q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big),\ \mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\Big]\subset\Big[I_0-\big(\big|\mathrm{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big),\ I_0+\big(\big|\mathrm{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big)\Big]. \tag{9.72}$$
Note that in general c is small so that $q_c>1$. A sharper (i.e. narrower) confidence
interval at a given confidence level c can be obtained by estimating on-line the
empirical variance of the estimator based on Eq. (9.41), namely
$$\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)=\frac1{\sqrt M}\Bigg[\frac1{M_1}\sum_{k=1}^{M_1}\Big(Y^{(1),k}_{h}-\overline{Y}^{(1)}_{h}\Big)^2+\sum_{i=2}^{R}\frac{W_i^2}{M_i}\sum_{k=1}^{M_i}\Big(\big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\big)-\overline{\big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\big)}\Big)^2\Bigg]^{\frac12}$$
where $\overline Y=\frac1M\sum_{k=1}^M Y^k$ denotes the empirical mean of i.i.d. copies $Y^k$ of the inner
variable $Y$, $M_i=q_i(\varepsilon)M(\varepsilon)$, etc. Then, one evaluates the squared bias by setting
$$\big|\widehat{\mathrm{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|\simeq\sqrt{\varepsilon^2-\widehat{\mathrm{Var}}\big(\widehat I^{ML2R}_M\big)}.$$
Then, plugging these estimates into (9.72), i.e. replacing the bias $\mathrm{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$
and the standard deviation $\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ by their above estimates $\widehat{\mathrm{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ and
$\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$, respectively, in the magnitude $\big|\mathrm{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ of the theoretical confidence interval yields the expected empirical confidence interval at level c:
$$I_{\varepsilon,c}=\Big[\widehat I^{ML2R}_{M(\varepsilon)}-\big(\big|\widehat{\mathrm{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big),\ \widehat I^{ML2R}_{M(\varepsilon)}+\big(\big|\widehat{\mathrm{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big)\Big].$$
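The half-width of the empirical interval combines the conservative bias estimate $\sqrt{\varepsilon^2-\widehat{\mathrm{Var}}}$ with the Gaussian term $q_c\,\widehat\sigma$. A minimal sketch (our names; the inputs are assumed to be already estimated):

```python
import math

def empirical_confidence_interval(point_est, var_hat, eps, qc):
    """Empirical CI around I_0: the bias magnitude is estimated conservatively
    by sqrt(eps^2 - Var_hat), owing to RMSE^2 = Bias^2 + Var."""
    bias_hat = math.sqrt(max(eps ** 2 - var_hat, 0.0))
    half_width = bias_hat + qc * math.sqrt(var_hat)
    return point_est - half_width, point_est + half_width
```

For instance, with $\varepsilon=10^{-2}$, $\widehat{\mathrm{Var}}=0.25\cdot10^{-4}$ and $q_c=1.96$, the half-width is $\approx 0.0185$.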
ℵ Practitioner’s corner III: The assumptions $(VE)_\beta$ and $(WE)^\alpha_R$
In practice, one often checks $(VE)_\beta$ as a consequence of $(SE)_\beta$, as illustrated below.
Brownian diffusions (discretization schemes)
Our task here is essentially to simulate two discretization schemes with step h
and h/N in a coherent way, i.e. based on increments of the same Brownian motion.
A discretization scheme with step $h=\frac Tn$ formally reads
$$\bar X^{(n)}_{kh}=\mathcal M\big(\bar X^{(n)}_{(k-1)h},\sqrt h\,U^{(n)}_k\big),\quad k=1,\dots,n,\qquad \bar X^n_0=X_0,$$
where $U^{(n)}_k=\frac{W_{kh}-W_{(k-1)h}}{\sqrt h}\overset{d}{=}\mathcal N(0;I_q)$ are i.i.d. The most elementary way to jointly
simulate the two schemes is to build both from the increments of the same Brownian motion, which classically yields
$$\big\|Y_h-Y_{\frac hN}\big\|_2^2\le V_1\Big|h-\frac hN\Big|^\beta\quad\text{with}\quad \beta=1.$$
For more details, we refer to [34] and to [112] for the application to path-dependent
functionals.
– Condition $(WE)^\alpha_R$. As for the Euler scheme(s), we refer to Theorem 7.8 (for
1-marginals) and Theorem 8.1 for functionals. For 1-marginals $F(x)=f\big(x(T)\big)$,
the property $(\mathcal E_{R+1})$ implies $(WE)^1_R$. When F is a “true” functional, the available
results are partial since no higher-order expansion of the weak error is established,
and $(WE)^\alpha_R$, usually with $\alpha=\frac12$ (as for barrier options) or 1 (as for lookback
payoffs), should still be considered a conjecture, though massively confirmed by
numerical experiments.
Nested Monte Carlo
We retain the notation introduced at the beginning of the chapter. For convenience,
we write
$$\Lambda_0=\mathbb E\big[F(X,Z)\,|\,X\big]\quad\text{and}\quad \Lambda_h=\frac1K\sum_{k=1}^K F(X,Z_k),\qquad h=\frac1K\in\mathcal H=\Big\{\frac1K,\ K\in K_0\mathbb N^*\Big\},$$
so that $Y_0=f(\Lambda_0)$ and $Y_h=f(\Lambda_h)$.
– Condition $(VE)_\beta$ or $(SE)_\beta$. The following proposition yields criteria for $(SE)_\beta$
to be fulfilled.
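One draw of the biased quantity $Y_h=f(\Lambda_h)$ costs $K=1/h$ inner samples. The sketch below is a toy instance (assumed model, not from the book) in which $F(x,z)=x+z$ with centered $Z$ and $f=\mathrm{id}$, so the target is $I_0=\mathbb E\,X=0$:

```python
import random

def nested_sample(f, F, draw_x, draw_z, K):
    """One draw of Y_h = f((1/K) sum_k F(X, Z_k)) with h = 1/K (nested MC)."""
    x = draw_x()
    inner = sum(F(x, draw_z()) for _ in range(K)) / K
    return f(inner)

random.seed(4)
est = sum(nested_sample(lambda u: u,
                        lambda x, z: x + z,
                        lambda: random.gauss(0.0, 1.0),
                        lambda: random.gauss(0.0, 1.0),
                        K=16)
          for _ in range(20000)) / 20000
```

Here `est` should be close to 0, with a standard error of roughly $\sqrt{(1+1/16)/20000}\approx0.007$.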
where $C_{X,Y,p}$ is the explicit constant
$$C_{X,Y,p}=2\Big(p^{\frac1{p+1}}+p^{-\frac p{p+1}}\Big)\Big(\sup_{h\in(0,h_0]\cap\mathcal H}\|g_h\|_{\sup}\Big)^{\frac p{p+1}}\Big(2\,C_p^{MZ}\big\|F(X,Z)-\mathbb E\big[F(X,Z)\,|\,Y\big]\big\|_p\Big)^{\frac p{p+1}}.$$
Hence, Assumption $(SE)_\beta$ holds with $\beta=\frac p{2(p+1)}\in\big(0,\frac12\big)$.
Remark. The boundedness assumption made on the densities $g_h$ of $\Lambda_h$ in the above
Claim (c) may look unrealistic. However, if the density $g_0$ of $\Lambda_0$ is bounded and
the assumptions of Theorem 9.2(b) further on – to be precise, in the item devoted
to the weak error expansion $(WE)^\alpha_R$ – are satisfied for $R=0$, then
$$g_h(\xi)\le g_0(\xi)+o\big(h^{\frac12}\big)\quad\text{uniformly in }\xi\in\mathbb R^d.$$
Proof. (a) Except for the constant, this claim is a special case of Claim (b) when
p = 2. We leave the direct proof as an exercise.
(b) Set $\widetilde F(\xi,z)=F(\xi,z)-\mathbb E\,F(\xi,Z)$, $\xi\in\mathbb R^d$, $z\in\mathbb R^q$, and assume that $h=\frac1K$ and
$h'=\frac1{K'}>0$, $1\le K\le K'$, $K,K'\in\mathbb N^*$. First note that, X and Z being independent,
$$\big\|Y_h-Y_{h'}\big\|_p^p\le[f]^p_{\mathrm{Lip}}\int_{\mathbb R^d}P_X(d\xi)\,\mathbb E\Big|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)-\frac1{K'}\sum_{k=1}^{K'}\widetilde F(\xi,Z_k)\Big|^p,$$
where we again used (twice) Minkowski’s Inequality in the last line. Finally, for
every $\xi\in\mathbb R^d$,
$$\Big\|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)-\frac1{K'}\sum_{k=1}^{K'}\widetilde F(\xi,Z_k)\Big\|_p\le C^{MZ}_p\big\|\widetilde F(\xi,Z)\big\|_p\Big((h-h')\frac1{\sqrt h}+\sqrt{h'}\Big(1-\frac{h'}{h}\Big)^{\frac12}\Big)$$
$$= C^{MZ}_p\big\|\widetilde F(\xi,Z)\big\|_p\,(h-h')^{\frac12}\Big(\Big(1-\frac{h'}{h}\Big)^{\frac12}+\Big(\frac{h'}{h}\Big)^{\frac12}\Big)\le 2\,C^{MZ}_p\big\|\widetilde F(\xi,Z)\big\|_p\,(h-h')^{\frac12}.$$
Integrating over $\mathbb R^d$ with respect to $P_X$ then yields
$$\big\|Y_h-Y_{h'}\big\|^p_p\le[f]^p_{\mathrm{Lip}}\big(2\,C^{MZ}_p\big)^p\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|^p_p\,(h-h')^{\frac p2}\le[f]^p_{\mathrm{Lip}}\big(4\,C^{MZ}_p\big)^p\big\|F(X,Z)\big\|^p_p\,(h-h')^{\frac p2}.$$
If $h'=0$, we get
$$\mathbb E\,|Y_h-Y_0|^p\le[f]^p_{\mathrm{Lip}}\,\mathbb E\Big|\frac1K\sum_{k=1}^K F(X,Z_k)-\mathbb E\big(F(X,Z)\,|\,X\big)\Big|^p=[f]^p_{\mathrm{Lip}}\int_{\mathbb R^d}\mathbb E\Big|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)\Big|^p P_X(d\xi)$$
since $\mathbb E\big(F(X,Z)\,|\,X\big)=\big(\mathbb E\,F(\xi,Z)\big)_{|\xi=X}$ a.s., and the result follows as above.
(c) The proof of this claim relies on Claim (b) and the following elementary lemma.
Lemma 9.3 Let $\Lambda$ and $\Lambda'$ be two real-valued random variables lying in $L^p(\Omega,\mathcal A,\mathbb P)$,
$p\ge1$, with densities $g$ and $g'$, respectively. Then, for every $a\in\mathbb R$,
$$\mathbb E\,\big|\mathbf 1_{\{\Lambda'\le a\}}-\mathbf 1_{\{\Lambda\le a\}}\big|\le\Big(p^{\frac1{p+1}}+p^{-\frac p{p+1}}\Big)\big(\|g\|_{\sup}+\|g'\|_{\sup}\big)^{\frac p{p+1}}\big\|\Lambda-\Lambda'\big\|_p^{\frac p{p+1}}. \tag{9.75}$$
Exercise. (a) Prove Claim (a) of the above proposition with the announced constant.
(b) Show that if f is ρ-Hölder, $\rho\in(0,1]$, and $p\ge\frac2\rho$, then
$$\forall\,h,h'\in\mathcal H,\quad \big\|Y_h-Y_{h'}\big\|_p\le[f]_{\rho,\mathrm{H\ddot ol}}\,\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|^\rho_{p\rho}\,|h-h'|^{\frac\rho2}.$$
– Condition $(WE)^\alpha_R$. For this expansion we again need to distinguish between the
smooth case and the case of indicator functions. In the non-regular case where f is an
indicator function, we will need a smoothness assumption on the joint distribution of
$(\Lambda_0,\Lambda_h)$ that will be formulated as a smoothness assumption on the pair $(\Lambda_0,\Lambda_h-\Lambda_0)$. The first expansion result in that direction was established in [128]. The result
in Claim (b) below is an extension of this result in the sense that if R = 1, our
assumptions coincide with theirs.
We denote by $\varphi_0$ the natural regular (Borel) version of the conditional mean
function of $F(X,Z)$ given X, namely
$$\varphi_0(\xi)=\mathbb E\,F(\xi,Z)=\mathbb E\big(F(X,Z)\,|\,X=\xi\big),\quad \xi\in\mathbb R.$$
Theorem 9.2 (Weak error expansion) (a) Smooth setting. Let $R\in\mathbb N^*$. Assume
$F(X,Z)\in L^{2R+1}(\mathbb P)$ and let $f:\mathbb R\to\mathbb R$ be a $2R+1$ times differentiable function
with bounded derivatives $f^{(k)}$, $k=R+1,\dots,2R+1$. Then, there exist real coefficients $c_1(f),\dots,c_R(f)$ such that
$$\forall\,h\in\mathcal H,\quad \mathbb E\,Y_h=\mathbb E\,Y_0+\sum_{r=1}^R c_r(f)\,h^r+O\big(h^{R+1/2}\big). \tag{9.76}$$
(b) Density function and smooth joint density. Assume that $F(X,Z)\in L^{2R+1}(\mathbb P)$ for
some $R\in\mathbb N$ and that $d=1$. Assume the distribution of $(\Lambda_0,\Lambda_h-\Lambda_0)$ has a smooth
density with respect to the Lebesgue measure on $\mathbb R^2$. Let $g_0$ be the density of $\Lambda_0$
and let $g_{\Lambda_0\,|\,\Lambda_h-\Lambda_0=\tilde\xi}$ be (a regular version of) the conditional density of $\Lambda_0$ given
$\Lambda_h-\Lambda_0=\tilde\xi$. Then
$$g_h(\xi)=g_0(\xi)+\sum_{r=1}^R\frac{h^r}{r!}\,g_0(\xi)P_r(\xi)+O\big(h^{R+\frac12}\big)\quad\text{uniformly in }\xi\in\mathbb R, \tag{9.77}$$
$$G_h(\xi)=G_0(\xi)+\sum_{r=1}^R\frac{h^r}{r!}\,\mathbb E\big[P_r(\Lambda_0)\mathbf 1_{\{\Lambda_0\le\xi\}}\big]+O\big(h^{R+\frac12}\big). \tag{9.78}$$
(c) If $f=\mathbf 1_{(-\infty,a]}$, then
$$\mathbb E\,Y_h=\mathbb E\,Y_0+\sum_{r=1}^R\frac{h^r}{r!}\,\mathbb E\big[P_r(\Lambda_0)\mathbf 1_{\{\Lambda_0\le a\}}\big]+O\big(h^{R+\frac12}\big).$$
We will admit these results, whose proofs turn out to be too specific and technical
for this textbook. A first proof of Claim (a) can be found in [198]. The whole
theorem is established in [113, 114] with closed forms for the coefficients $c_r(f)$
and the functions $P_r$, involving the derivatives of f (for the coefficients $c_r(f)$), the
conditional mean $\varphi_0$, the densities $g_h$, $g_{\Lambda_0|\Lambda_h-\Lambda_0}$ and their derivatives, and the so-called
partial Bell polynomials (see [69], p. 307). Claim (c) derives from (b) by integrating
with respect to ξ and it can be extended to any bounded Borel function $\varphi:\mathbb R\to\mathbb R$
instead of $\mathbf 1_{(-\infty,a]}$.
– Complexity of nested Monte Carlo. When computing the ML2R estimator in
nested Monte Carlo simulation, one has, at each fine level $i\ge2$,
$$Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}=f\Big(\frac1{n_iK}\sum_{k=1}^{n_iK}F(X,Z_k)\Big)-f\Big(\frac1{n_{i-1}K}\sum_{k=1}^{n_{i-1}K}F(X,Z_k)\Big)$$
with the same terms $Z_k$ inside each instance of the function f. Hence, if we neglect
the computational cost of f itself, the complexity of the fine level i reads
$$\bar\kappa\, n_iK=\bar\kappa\,\frac{n_i}{h}.$$
As a consequence, it is easily checked that $\sqrt{n_{j-1}}+\sqrt{n_j}$ should be replaced by $\sqrt{n_j}$
in the above Table 9.1.
What is called here a “regular” multilevel estimator is the original multilevel estimator introduced by M. Giles in [107] (see also [148] in a numerical integration
framework and [167] for the statistical Romberg extrapolation). The bias reduction
is obtained by an appropriate stratification based on a simple “domino” or cascade
property described hereafter. This simpler weightless structure makes its analysis
possible under a first-order weak error expansion (and the β-control of the strong
convergence similar to that introduced for ML2R estimators). We will see later in
Theorem 9.3 the impact on the performances of this family of estimators compared
to the ML2R family detailed in Theorem 9.1 of the previous section.
As for the ML2R estimator, we assume the refiners are powers of a root N:
$n_i=N^{i-1}$, $i=1,\dots,L$, where L will denote throughout this section the depth of
the regular multilevel estimator, following [107].
Note that, except for the final step – the depth optimization of L – most of what
follows is similar to what we just did in the weighted framework, provided one sets
$R=L$ and $W^{(L)}_i=1$, $i=1,\dots,L$.
Step 1 (Killing the bias). Let $L\ge2$ be an integer which will be the depth of the
regular multilevel estimator. Then
$$\mathbb E\,Y_{\frac h{n_L}}-\mathbb E\,Y_0=c_1\Big(\frac h{n_L}\Big)^\alpha+o\Big(\Big(\frac h{n_L}\Big)^\alpha\Big).$$
Moreover, by the telescopic (“domino”) property,
$$\mathbb E\,Y_{\frac h{n_L}}=\mathbb E\,Y_h+\sum_{i=2}^L\mathbb E\big[Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}\big]=\mathbb E\Big[\underbrace{Y_h}_{\text{coarse level}}+\sum_{i=2}^L\underbrace{\big(Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}\big)}_{\text{refined level }i}\Big].$$
We will assume again that the $Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}$ variables at each level are sampled
independently, i.e.
$$\mathbb E\,Y_0=\mathbb E\Big[Y^{(1)}_h+\sum_{i=2}^L\big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\big)\Big]-c_1\Big(\frac h{n_L}\Big)^\alpha+o\Big(\Big(\frac h{n_L}\Big)^\alpha\Big),$$
where the families $Y^{(i)}_{n,h}$, $i=1,\dots,L$, are independent.
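The domino identity is purely algebraic and easy to check numerically on toy level means (the values below are made up for illustration):

```python
def telescoped_mean(level_means):
    """E Y_{h/n_L} recovered from the coarse mean plus the level corrections
    E[Y_{h/n_i}] - E[Y_{h/n_{i-1}}], i = 2..L (the cascade/domino property)."""
    corrections = [level_means[i] - level_means[i - 1]
                   for i in range(1, len(level_means))]
    return level_means[0] + sum(corrections)
```

Whatever the level means, the telescoping sum collapses to the finest one.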
Step 2 (Regular multilevel estimator). As already noted in the previous section,
$Y^{(1)}_h$ has no reason to be small since it is close to a copy of $Y_0$, whereas, by contrast,
$Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}$ is approximately 0 as the difference of two variables approximating
$Y_0$. The seminal idea of the multilevel paradigm, introduced by Giles in [107], is
to dispatch the global budget as L simulation sizes $M_i$ across the levels $i=1,\dots,L$ so
that $M_1+\cdots+M_L=M$. This leads to the following family of (regular) multilevel
estimators attached to $(Y_h)_{h\in\mathcal H}$ and the refiners vector n:
$$\widehat I^{MLMC}_{M,M_1,\dots,M_L}:=\frac1{M_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^L\frac1{M_i}\sum_{k=1}^{M_i}\Big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\Big),$$
where $M_1,\dots,M_L\in\mathbb N^*$, $M_1+\cdots+M_L=M$, and $\big(Y^{(i),k}_{n,h}\big)_{i=1,\dots,L,\,k\ge1}$ are i.i.d. copies
of $Y_{n,h}:=\big(Y_{\frac h{n_i}}\big)_{i=1,\dots,L}$.
As for weighted ML2R estimators, it is more convenient to re-write the above
multilevel estimator as a stratified estimator by setting $q_i=\frac{M_i}M$, $i=1,\dots,L$. This
leads us to the following formal definition.
Definition 9.5 The family of (regular) multilevel estimators attached to $(Y_h)_{h\in\mathcal H}$ and
the refiners vector n is defined as follows: for every $h\in\mathcal H$, every $q=(q_i)_{1\le i\le L}\in\mathcal S_L$
and every integer $M\ge1$,
$$\widehat I^{MLMC}_M=\widehat I^{MLMC}_M(q,h,L,n)=\frac1M\Bigg[\frac1{q_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^L\frac1{q_i}\sum_{k=1}^{M_i}\Big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\Big)\Bigg] \tag{9.80}$$
(with the convention $\frac10\sum_{k=1}^0=0$), where $M_i=\lfloor q_iM\rfloor$, $i=1,\dots,L$, and $Y^{(i),k}_{n,h}$, $i=1,\dots,L$, $k\ge1$, are i.i.d. copies of $Y_{n,h}:=\big(Y_{\frac h{n_i}}\big)_{i=1,\dots,L}$.
This definition is similar to that of ML2R estimators in which all the weights $W_i$
would have been set to 1. Note that, as soon as $M\ge M(q):=(\min_iq_i)^{-1}$, all $M_i=\lfloor q_iM\rfloor\ge1$, $i=1,\dots,L$. This condition on M will be implicitly assumed in what
follows so that no level is empty.
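The estimator (9.80) can be sketched in a few lines. The toy model below is ours (not from the book): $Y_h=Z^2+h$ with $Z\sim\mathcal N(0,1)$, perfectly coupled across levels, so $I_0=\mathbb E\,Z^2=1$, the level-$i$ correction is the deterministic value $h/n_i-h/n_{i-1}$, and the estimator targets $I_0+h/n_L=1.0625$.

```python
import random

def mlmc_estimate(sample_level, L, M, q):
    """Regular MLMC estimator (9.80): coarse level plus telescoping corrections.
    sample_level(1) returns one draw of Y_h; sample_level(i), i >= 2, returns
    one coupled difference Y_{h/n_i} - Y_{h/n_{i-1}}."""
    est = 0.0
    for i in range(1, L + 1):
        Mi = max(1, int(q[i - 1] * M))
        est += sum(sample_level(i) for _ in range(Mi)) / Mi
    return est

random.seed(0)
h, N, L = 0.5, 2, 4
def sample_level(i):
    if i == 1:
        return random.gauss(0.0, 1.0) ** 2 + h   # Y_h = Z^2 + h (toy bias)
    return h / N ** (i - 1) - h / N ** (i - 2)   # coupled difference (exact here)

q = [0.7, 0.15, 0.1, 0.05]
est = mlmc_estimate(sample_level, L, M=20000, q=q)
```

With 14 000 coarse draws, `est` should be within a few hundredths of $1+h/n_L=1.0625$.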
Bias. The telescopic/domino structure of the estimator implies, regardless of the
sizes $M_i\ge1$, that
$$\mathrm{Bias}\big(\widehat I^{MLMC}_M\big)=\mathrm{Bias}\big(\widehat I^{MLMC}_1\big)=\mathbb E\,Y_{\frac h{n_L}}-\mathbb E\,Y_0=c_1\Big(\frac h{n_L}\Big)^\alpha+o\Big(\Big(\frac h{n_L}\Big)^\alpha\Big). \tag{9.81}$$
Variance. Owing to the independence of the levels,
$$\mathrm{Var}\big(\widehat I^{MLMC}_M\big)=\frac1M\sum_{i=1}^L\frac{\sigma^2(i,h)}{q_i}, \tag{9.82}$$
where $\sigma^2(1,h)=\mathrm{Var}\big(Y^{(1)}_h\big)$ and $\sigma^2(i,h)=\mathrm{Var}\big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\big)$, $i=2,\dots,L$.
Complexity. The complexity, or simulation cost, of the MLMC estimator is clearly
the same as that of its weighted counterpart investigated in the previous section, i.e.
it is a priori given – or at least dominated – by
$$\mathrm{Cost}\big(\widehat I^{MLMC}_M\big)=\frac{\bar\kappa M}{h}\sum_{i=1}^L\kappa_iq_i, \tag{9.83}$$
where $\kappa_1=1$, $\kappa_i=(1+N^{-1})\,n_i$, $i=2,\dots,L$.
As usual, at this stage, the aim is to minimize the cost or complexity of the whole
simulation for a prescribed RMSE $\varepsilon>0$, i.e. solving
$$\inf_{RMSE(\widehat I^{MLMC}_M)\le\varepsilon}\mathrm{Cost}\big(\widehat I^{MLMC}_M\big). \tag{9.84}$$
Following the lines of what was done for the weighted multilevel ML2R estimator,
we check that this problem is equivalent to solving
$$\inf_{|\mathrm{Bias}(\widehat I^{MLMC}_M)|<\varepsilon}\frac{\mathrm{Effort}\big(\widehat I^{MLMC}_M\big)}{\varepsilon^2-\mathrm{Bias}^2\big(\widehat I^{MLMC}_M\big)},\quad\text{where}$$
$$\mathrm{Effort}\big(\widehat I^{MLMC}_M\big)=\mathrm{Cost}\big(\widehat I^{MLMC}_M\big)\,\mathrm{Var}\big(\widehat I^{MLMC}_M\big)=\frac{\bar\kappa}{h}\sum_{i=1}^L\kappa_iq_i\sum_{i=1}^L\frac{\sigma^2(i,h)}{q_i}. \tag{9.85}$$
Note that neither the effort nor the bias depend on M provided $M\ge M(q)$.
We still proceed by first minimizing the effort for a fixed h (and L) as a function
of the vector $q\in\mathcal S_L$ and then minimizing the denominator of the above ratio cost in
the bias parameter $h\in\mathcal H$.
Minimizing the effort. Lemma 9.1 yields the solution to the minimization of the
effort, namely
$$\mathrm{argmin}_{q\in\mathcal S_L}\mathrm{Effort}\big(\widehat I^{MLMC}_M\big)=q^*=\Big(\frac{\sigma(i,h)}{q^{*,\dagger}(h)\sqrt{\kappa_i}}\Big)_{i=1,\dots,L}\quad\text{with}\quad q^{*,\dagger}(h)=\sum_{1\le i\le L}\frac{\sigma(i,h)}{\sqrt{\kappa_i}}$$
and a resulting minimal effort reading
$$\min_{q\in\mathcal S_L}\mathrm{Effort}\big(\widehat I^{MLMC}_M\big)=\frac{\bar\kappa}{h}\Big(\sum_{i=1}^L\sqrt{\kappa_i}\,\sigma(i,h)\Big)^2.$$
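The optimal allocation $q^*_i\propto\sigma(i,h)/\sqrt{\kappa_i}$ and the resulting minimal effort $\big(\sum_i\sqrt{\kappa_i}\,\sigma(i,h)\big)^2$ (up to the factor $\bar\kappa/h$) are easy to verify numerically; the level statistics below are made-up inputs:

```python
import math

def optimal_allocation(sigmas, kappas):
    """q_i* proportional to sigma(i,h)/sqrt(kappa_i) (Lemma 9.1), normalized."""
    raw = [s / math.sqrt(k) for s, k in zip(sigmas, kappas)]
    total = sum(raw)
    return [r / total for r in raw]

def effort(q, sigmas, kappas):
    """Effort up to the factor kappa_bar/h: (sum kappa_i q_i)(sum sigma_i^2/q_i)."""
    return sum(k * qi for k, qi in zip(kappas, q)) * \
           sum(s ** 2 / qi for s, qi in zip(sigmas, q))

sigmas, kappas = [1.0, 0.5, 0.25], [1.0, 3.0, 6.0]
q_star = optimal_allocation(sigmas, kappas)
min_effort = sum(math.sqrt(k) * s for s, k in zip(sigmas, kappas)) ** 2
```

By the Cauchy–Schwarz inequality, no other allocation in the simplex can do better than `min_effort`.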
Minimizing the resulting cost. Compiling the above results, the cost minimization
$$\inf_{RMSE(\widehat I^{MLMC}_M)\le\varepsilon}\mathrm{Cost}\big(\widehat I^{MLMC}_M\big)$$
amounts, since $\big|\mathrm{Bias}\big(\widehat I^{MLMC}_M\big)\big|\le\varepsilon$ reads $h\le n_L\big(\frac{\varepsilon}{|c_1|}\big)^{\frac1\alpha}$, to maximizing a denominator of the form
$$\sup_{h:\,0<h\le n_L\big(\frac{\varepsilon}{|c_1|}\big)^{\frac1\alpha}}h\Big(\varepsilon^2-c_1^2\Big(\frac h{n_L}\Big)^{2\alpha}\Big).$$
This equivalence can be made rigorous (see [198] for more details), as well as the
fact that it suffices (at least asymptotically as ε → 0) to maximize the denominator.
First note that
$$\sup_{0<h<n_L\big(\frac{\varepsilon}{|c_1|}\big)^{\frac1\alpha}}h\Big(\varepsilon^2-c_1^2\Big(\frac h{n_L}\Big)^{2\alpha}\Big)=\varepsilon^2\,\frac{2\alpha}{(1+2\alpha)^{1+\frac1{2\alpha}}}\,n_L\Big(\frac{\varepsilon}{|c_1|}\Big)^{\frac1\alpha}$$
and that the supremum is attained at
$$h(\varepsilon)=\frac{n_L}{(1+2\alpha)^{\frac1{2\alpha}}}\Big(\frac{\varepsilon}{|c_1|}\Big)^{\frac1\alpha}. \tag{9.86}$$
$$\Big(\sum_{i=1}^L\sqrt{\kappa_i}\,\sigma(i,h)\Big)^2=\mathrm{Var}(Y_h)\Bigg(1+\theta\,h^{\frac\beta2}\,a_N\,\frac{1-N^{\frac{1-\beta}{2}(L-1)}}{1-N^{\frac{1-\beta}{2}}}\Bigg)^2 \tag{9.89}$$
with
$$a_N=(1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}N^{\frac{1-\beta}2}\quad\text{and the convention}\quad \frac{1-1^{L-1}}{1-1}=L-1 \tag{9.90}$$
corresponding to the case $\beta=1$ ($^3$).
Optimization of the depth L = L(ε). One checks that the upper-bound (9.88) of the
complexity formally goes to 0 as $L\uparrow+\infty$ for a fixed ε > 0 since $n_L=N^{L-1}\uparrow+\infty$.
As in the weighted setting, the depth L is limited by the constraint on the bias
parameter $h^*(\varepsilon)\le\mathbf h$. This constraint implies, owing to (9.87), that
$$L\le L_{\max}(\varepsilon)=1+\frac{\log\Big(\mathbf h\big(\frac{|c_1|\sqrt{1+2\alpha}}{\varepsilon}\big)^{\frac1\alpha}\Big)}{\log(N)}.$$
However, unlike the ML2R estimator, the exponent $2+\frac1\alpha$ of the (prescribed) RMSE
ε appearing in the upper-bound of the complexity does not depend on the depth L,
so it seems natural to try to go deeper in the optimization. This can be achieved in
different manners (see e.g. [141]) depending on what is assumed to be a variable or
a fixed parameter. We shall carry on with our approach, which lets the bias parameter
h vary in $\mathcal H$.
Based on the formula (9.88) of the complexity, the expression of the effort (9.89)
and the formula (9.86) of h(ε) (before projection on $\mathcal H$), the minimization
problem
$^3$ Note that in the nested Monte Carlo framework, $\kappa_i=n_i$, so that $a_N=(N-1)^{\frac\beta2}N^{\frac{1-\beta}2}$.
$$\min_{1\le L\le L_{\max}(\varepsilon)}\frac{\Big(\sum_{1\le i\le L}\sqrt{\kappa_i}\,\sigma\big(i,h(\varepsilon)\big)\Big)^2}{n_L} \tag{9.91}$$
is dominated by
$$\sup_{h\in\mathcal H}\mathrm{Var}(Y_h)\ \min_{1\le L\le L_{\max}}\Bigg(N^{-\frac{L-1}2}+\theta\Big(\frac{\varepsilon}{|c_1|\sqrt{1+2\alpha}}\Big)^{\frac\beta{2\alpha}}a_N\,N^{-\frac{1-\beta}2(L-1)}\,\frac{1-N^{\frac{1-\beta}2(L-1)}}{1-N^{\frac{1-\beta}2}}\Bigg)^2, \tag{9.92}$$
still with our convention when β = 1, where $a_N$ is given by (9.90).
In (9.92) the function to be minimized in L is not decreasing (where L is viewed as
a real variable), so that saturating the constraint on L, e.g. by the condition $h(\varepsilon)\le\mathbf h$,
may not be optimal.
Lemma 9.4 The function of L, viewed as a $[1,+\infty)$-valued real variable, to be
minimized in (9.92), attains its minimum at
$$\ell^*(\varepsilon)=1+\frac1{\alpha\log N}\log\Big(\frac{|c_1|\sqrt{1+2\alpha}}{\varepsilon}\Big)+\frac{2}{\beta\log N}\log\Bigg(\frac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}\cdot\frac{1}{2\theta(1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}N^{\frac{1-\beta}2}}\Bigg) \tag{9.93}$$
with the convention $\dfrac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}=\log N$ if $\beta=1$.
Exercise. Prove the lemma. [Hint: consider directly the function between brackets
and inspect the three cases β ∈ (0, 1), β = 1 and β > 1.]
At this stage we define the optimized depth of the MLMC simulation for a prescribed RMSE ε by
$$L^*(\varepsilon)=\big\lceil \ell^*(\varepsilon)\big\rceil\wedge L_{\max}(\varepsilon). \tag{9.94}$$
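In practice, rather than evaluating the closed form (9.93), one can simply scan the integer depths up to $L_{\max}(\varepsilon)$. The objective used below is a hypothetical U-shaped profile (geometric bias decay plus a linear level cost), for illustration only:

```python
def optimal_depth(objective, L_max):
    """Minimize the bracketed function of L in (9.92) over the integer depths
    1 <= L <= L_max by direct scan (the closed-form solution is (9.93))."""
    return min(range(1, L_max + 1), key=objective)

# Hypothetical objective: bias-type decay 2^{-L} plus a per-level cost 0.01 L.
L_star = optimal_depth(lambda L: 2.0 ** (-L) + 0.01 * L, 20)
```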
Optimal size of the simulation. The global size $M^*(\varepsilon)$ of the simulation is obtained
by saturating the RMSE constraint, i.e. $\big\|\widehat I^{MLMC}_M-I_0\big\|_2\le\varepsilon$ with $h=h(\varepsilon)$, using that
$\big\|\widehat I^{MLMC}_M-I_0\big\|_2^2=\frac1{M^*(\varepsilon)}\mathrm{Var}\big(\widehat I^{MLMC}_1\big)+\mathrm{Bias}^2\big(\widehat I^{MLMC}_M\big)$. We get
$$M^*(\varepsilon)=\frac{\mathrm{Var}\big(\widehat I^{MLMC}_1\big)}{\varepsilon^2-\mathrm{Bias}^2\big(\widehat I^{MLMC}_M\big)}.$$
Plugging the value of the variance (9.82) and the bias (9.81) into this equation finally
yields
$$M^*(\varepsilon)=\Big(1+\frac1{2\alpha}\Big)\frac{\mathrm{Var}\big(\widehat I^{MLMC}_1\big)}{\varepsilon^2}=\Big(1+\frac1{2\alpha}\Big)\frac{q^{*,\dagger}\sum_{1\le i\le L(\varepsilon)}\sqrt{\kappa_i}\,\sigma\big(i,h^*(\varepsilon)\big)}{\varepsilon^2} \tag{9.95}$$
which can in turn be re-written in an implementable form (see Table 9.2 in Practitioner’s corner) with the available formulas or bounds for $\kappa_i$, $\sigma(i,h)$ and $h^*(\varepsilon)$.
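The formula (9.95) is one line of arithmetic once $\mathrm{Var}\big(\widehat I_1^{MLMC}\big)$ has been estimated (the rounding to an integer size is our choice):

```python
import math

def global_size(var_level1, eps, alpha):
    """Global simulation size (9.95): ceil((1 + 1/(2 alpha)) * Var(I_1) / eps^2)."""
    return math.ceil((1.0 + 1.0 / (2.0 * alpha)) * var_level1 / eps ** 2)
```

For instance, `global_size(1.0, 1e-2, 1.0)` returns 15000.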
At this stage, it remains to inspect again the same three cases, depending on the
parameter β (β > 1, β = 1 and β ∈ (0, 1)), to obtain the following theorem.
Theorem 9.3 (MLMC estimator, see [107, 198]) Let $n_i=N^{i-1}$, $i=1,\dots,L$,
$N\in\mathbb N\setminus\{0,1\}$. Assume $(SE)_\beta$ and $(WE)^\alpha_1$. Let $\theta=\theta_{\mathbf h}=\sqrt{\frac{V_1}{\mathrm{Var}(Y_{\mathbf h})}}$.
(a) Then
$$\inf_{\substack{h\in\mathcal H,\ q\in\mathcal S_L,\ L,\ M\ge1\\ \|\widehat I^{MLMC}_M-I_0\|_2\le\varepsilon}}\mathrm{Cost}\big(\widehat I^{MLMC}_M\big)\le K^{MLMC}\times\frac{\mathrm{Var}(Y_{\mathbf h})}{\varepsilon^2}\times\begin{cases}1 & \text{if }\beta>1,\\ \big(\log(1/\varepsilon)\big)^2 & \text{if }\beta=1,\\ \varepsilon^{-\frac{1-\beta}{\alpha}} & \text{if }\beta<1,\end{cases}$$
where $K^{MLMC}=K^{MLMC}(\alpha,\beta,\theta,\mathbf h,c_1,V_1)$.
(b) The rates in the right-hand side in (a) are achieved by setting, in the definition (9.80) of the estimator $\widehat I^{MLMC}_M$: $h=h^*(\varepsilon,L^*(\varepsilon))$, $q=q^*(\varepsilon)$, $L=L^*(\varepsilon)$ and
$M=M^*(\varepsilon)$.
Closed forms for the depth $L^*(\varepsilon)$, the bias parameter $h^*=h^*(\varepsilon,L^*(\varepsilon))\in\mathcal H$,
the optimal allocation vector $q^*(\varepsilon)$ and the global size $M^*(\varepsilon)$ of the simulation are
reported in Table 9.2 below (see Practitioner’s corner hereafter for these formulas
and their variants).
Remarks. • The MLMC estimator achieves the rate $\varepsilon^{-2}$ of an unbiased simulation
as soon as β > 1 (fast strong rate) under lighter conditions than ML2R since it only
requires a first-order weak error expansion.
• When β = 1, the cost of the MLMC estimator for a prescribed RMSE is $o\big(\varepsilon^{-(2+\eta)}\big)$
for every η > 0, but it is slower by a factor $\log\frac1\varepsilon$ than the weighted ML2R multilevel
estimator. Keep in mind that for $\varepsilon=1\,\mathrm{bp}=10^{-4}$, $\log\frac1\varepsilon\simeq9.2$.
• When β ∈ (0, 1), the simulation cost is no longer close to that of an unbiased simulation since
$\varepsilon^{-\frac{1-\beta}\alpha}\to+\infty$ as ε → 0 at a polynomial rate, while the weighted ML2R estimator
remains close to the unbiased framework, see Theorem 9.1 and the comments that
follow. In that setting, which is of great interest in derivative pricing since it corresponds,
for example, to barrier options or γ-hedging, the weighted ML2R clearly outperforms
the regular MLMC estimator.
Proof of Theorem 9.3. The announced rates can be obtained by saturating the value
of h at $\mathbf h$, i.e. by replacing $\ell^*(\varepsilon)$ in (9.94) by $+\infty$. Then the proof follows the lines
of that of Theorem 9.1. At some place one has to use that
$$\mathrm{Var}\big(Y_{h^*(\varepsilon)}\big)\le\Big(\sigma\big(Y_{h^*(\varepsilon)}-Y_{\mathbf h}\big)+\sigma\big(Y_{\mathbf h}\big)\Big)^2\le\mathrm{Var}(Y_{\mathbf h})\Big(1+\sqrt{V_1}\,\big(\mathbf h-h^*(\varepsilon)\big)^{\frac12}\Big)^2\le\mathrm{Var}(Y_{\mathbf h})\Big(1+\sqrt{V_1}\,\mathbf h^{\frac12}\Big)^2.$$
Table 9.2 Optimized parameters of the MLMC estimator ($n_i=N^{i-1}$):
$$L=L^*(\varepsilon)=\min\Bigg(1+\Bigg\lceil\frac{\log\Big(\mathbf h\big(\frac{|c_1|\sqrt{1+2\alpha}}{\varepsilon}\big)^{\frac1\alpha}\Big)}{\log(N)}\Bigg\rceil,\ \big\lceil\ell(\varepsilon)\big\rceil\Bigg)$$
$$\ell(\varepsilon)=1+\frac{\log\Big((1+2\alpha)^{\frac1{2\alpha}}\big(\frac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}\Big)}{\log(N)}+\frac{2}{\beta\log N}\log\Bigg(\frac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}\cdot\frac{1}{2\theta\,N^{\frac{1-\beta}2}(1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}}\Bigg)$$
with the convention $\frac{N^0-1}{0}=\log N$ when $\beta=1$.
$$h=h^*(\varepsilon)=\frac{\mathbf h}{\Big\lceil\mathbf h\,(1+2\alpha)^{\frac1{2\alpha}}\big(\frac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}N^{-(L-1)}\Big\rceil}$$
$$q=q^*(\varepsilon):\quad q_1(\varepsilon)=\frac1{q^\dagger_\varepsilon},\qquad q_j(\varepsilon)=\frac1{q^\dagger_\varepsilon}\,\theta\,h^{\frac\beta2}\,\frac{\big(n_{j-1}^{-1}-n_j^{-1}\big)^{\frac\beta2}}{\sqrt{n_{j-1}+n_j}},\quad j=2,\dots,L,$$
with $q^\dagger_\varepsilon$ s.t. $\sum_{j=1}^Lq_j(\varepsilon)=1$ and $\theta=\sqrt{\frac{V_1}{\mathrm{Var}(Y_{\mathbf h})}}$.
$$M=M^*(\varepsilon)=\Bigg\lceil\Big(1+\frac1{2\alpha}\Big)\frac{\mathrm{Var}(Y_{\mathbf h})\,q^\dagger_\varepsilon\Big(1+\theta\,h^{\frac\beta2}\sum_{j=2}^L\big(n_{j-1}^{-1}-n_j^{-1}\big)^{\frac\beta2}\sqrt{n_{j-1}+n_j}\Big)}{\varepsilon^2}\Bigg\rceil$$
ℵ Practitioner’s corner
Most recommendations are similar to those made for the ML2R estimator. We focus in
what follows on specific features of the regular MLMC estimators. The parameters
have been designed based on a complexity of fine level i given by $\kappa(n_i+n_{i-1})$.
When this complexity is of the form $\kappa n_i$ (as in nested Monte Carlo), the terms
$\sqrt{n_j+n_{j-1}}$ should be replaced by $\sqrt{n_j}$.
Payoff dependent parameters for the MLMC estimator
We set $n_i=N^{i-1}$, $i=1,\dots,L$. The values of the (optimized) parameters are listed
in Table 9.2 (∗ superscripts are often removed for simplicity).
Let us make a few additional remarks:
• Note that, by contrast with the ML2R framework, there is no a priori reason to
choose N = 2 as a root when β ∈ (0, 1).
• The value of the depth $L^*(\varepsilon)$ proposed in Table 9.2 is sharper than the value
$$\widetilde L^*(\varepsilon)=1+\Big\lceil\log\Big((1+2\alpha)^{\frac1{2\alpha}}\Big(\frac{|c_1|}{\varepsilon}\Big)^{\frac1\alpha}\mathbf h\Big)\Big/\log(N)\Big\rceil$$
given in [198]. However, the impact of this sharper optimized depth on the numerical simulations remains
limited and $\widetilde L^*(\varepsilon)$ is a satisfactory choice in practice (it corresponds to setting $\ell(\varepsilon)=+\infty$ in (9.94))
and $\theta=\sqrt{\frac{V_1}{\mathrm{Var}(Y_{h^*(\varepsilon)})}}$, both used to specify the parameters reported in Table 9.2. The
estimation procedures are the same as those developed in Sect. 9.5.1 in Practitioner’s
corner I (see (9.70) and (9.71)).
Calibration of the parameter $c_1$
This is an opportunity of improvement offered by the MLMC estimator: if the coefficient α is known, e.g. by theoretical means, it is possible to estimate the coefficient $c_1$
in the first-order weak error expansion $(WE)^\alpha_1$. First we approximate $c_1$ by considering
the following proxies, defined for $h_0,h_0'\in\mathcal H$, $0<h_0'<h_0$, by
$$\widetilde c_1(h_0,h_0')=\frac{\mathbb E\,Y_{h_0}-\mathbb E\,Y_{h_0'}}{h_0^\alpha-(h_0')^\alpha}\simeq c_1$$
if $h_0$ and $h_0'$ are small (but not too small). Nevertheless, as for the estimation of
$V_1$ (see Practitioner’s corner I in Sect. 9.5.1), we adopt a conservative strategy which
naturally leads us to consider not too small $h_0$ and $h_0'$, namely, as for the ML2R
estimator,
$$h_0=\frac{\mathbf h}5\quad\text{and}\quad h_0'=\frac{h_0}2=\frac{\mathbf h}{10}.$$
Then, it remains to replace $\mathbb E\big[Y_{h_0}-Y_{\frac{h_0}2}\big]$ by its usual estimator to obtain the (obviously biased) estimator of the form
$$\widehat c_1(h_0,h_0',m)=\frac{(h_0)^{-\alpha}}{1-2^{-\alpha}}\,\frac1m\sum_{k=1}^m\Big(Y^k_{h_0}-Y^k_{\frac{h_0}2}\Big),$$
where $\big(Y^k_{h_0},Y^k_{\frac{h_0}2}\big)_{k\ge1}$ are i.i.d. copies of $\big(Y_{h_0},Y_{\frac{h_0}2}\big)$ and $m\ll M$ (m is a priori
small with respect to the size M of the master simulation).
One must be aware that this opportunity to estimate $c_1$ (which has no counterpart
for ML2R) is also a factor of instability for the MLMC estimators if $c_1$ is not
estimated, or not estimated with enough accuracy. From this point of view the MLMC
estimators are less robust than the ML2R estimators.
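The estimator $\widehat c_1$ above can be checked on an assumed toy weak error $\mathbb E\,Y_h=I_0+c_1h$ (so $\alpha=1$); coupling makes the common noise cancel in each difference $Y^k_{h_0}-Y^k_{h_0/2}$, so here the proxy recovers $c_1$ exactly (toy model and names are ours):

```python
import random

def estimate_c1(coupled_pair, h0, alpha, m):
    """Biased proxy for c1: E[Y_{h0} - Y_{h0/2}] = c1 h0^alpha (1 - 2^{-alpha})
    at first order, estimated by the empirical mean of coupled differences."""
    mean_diff = sum(a - b for a, b in (coupled_pair() for _ in range(m))) / m
    return mean_diff / (h0 ** alpha * (1.0 - 2.0 ** (-alpha)))

random.seed(2)
c1_true, h0 = 3.0, 0.2
def pair():
    z = random.gauss(0.0, 1.0)               # common (coupled) noise
    return z + c1_true * h0, z + c1_true * h0 / 2

c1_hat = estimate_c1(pair, h0, alpha=1.0, m=1000)
```

Since the noise cancels exactly, `c1_hat` equals 3.0 up to floating-point error; with imperfect coupling the estimate would carry Monte Carlo noise, hence the conservative choice of $h_0$ in the text.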
are the key to controlling the effort of the multilevel estimator after the optimization
of the allocation vector $(q_i)_i$, see e.g. (9.57), still in the weighted multilevel section.
Using the seemingly more natural $(SE)^0_\beta$: $\|Y_h-Y_0\|_2^2\le\bar V_1h^\beta$ (see [198]) yields a
less efficient design of the multilevel estimators, not to speak of the fact that $V_1$ is
easier to estimate in practice than $\bar V_1$. But the virtue of this choice goes beyond this
calibrating aspect. In fact, as emphasized in [32], assumptions $(VE)_\beta$ and $(SE)_\beta$
may be satisfied in situations where $Y_h$ does not converge in $L^2$ toward $Y_0$ because
the underlying scheme itself does not converge toward the continuous time process.
This is the case for the binomial tree (see e.g. [59]) with respect to the Black–Scholes model.
More interestingly, the author takes advantage of this situation to show that the
Euler discretization scheme of a Lévy driven diffusion can be wienerized and still
satisfies $(SE)_\beta$ for a larger β: the wienerization of the small jumps of a Lévy process
consists in replacing the increments of the compensated small enough jump process
over a time interval $[t,t+\delta t]$ by increments $c(B_{t+\delta t}-B_t)$, where B is a standard
Brownian motion and c is chosen to equalize the variances (B is independent of the
existing Brownian and the large jump components of the Lévy process, if any). The
wienerization trick was introduced in [14, 66] and makes this modified Euler scheme
simulable, which is usually not the case for the standard Euler scheme if the process
jumps infinitely often on each nontrivial time interval with non-summable jumps.
Nested Monte Carlo. As for the strong convergence $(SE)_\beta$ and the weak error expansion $(WE)^\alpha_R$, we refer to Practitioner’s corner II, Sect. 9.5.1.
Limiting behavior. The asymptotics of multilevel estimators, especially the CLT ,
has been investigated in various frameworks (Brownian diffusions, Lévy-driven dif-
fusions, etc) in several papers (see [34, 35, 75]). See also [112] for a general approach
where both weighted and regular Multilevel estimators are analyzed: a SLLN and a
CLT are established for multilevel estimators as the RMSE ε → 0.
We consider our standard Brownian diffusion SDE (7.1) with drift b and diffusion
coefficient σ driven by a q-dimensional standard Brownian motion W defined on
a probability space (, A, P) with q ≥ 2. In such a framework, we saw that the
Milstein scheme cannot be simulated efficiently due to the presence of Lévy areas
induced by the rectangular terms coming out in the second-order term of the scheme.
This seems to make it impossible to reach the unbiased setting “β > 1” since the Euler scheme corresponds, as we saw above, to the critical case β = 1 (scheme of order 1/2).
First we introduce a truncated Milstein scheme with step h = T/n. We start from the definition of the discrete time Milstein scheme as defined by (7.41). The scheme is not considered as simulable when q ≥ 2 because of the Lévy areas

∫_{t_k^n}^{t_{k+1}^n} ( W_s^i − W_{t_k^n}^i ) dW_s^j,

for which no efficient method of simulation is known when i ≠ j. On the other hand, a simple integration by parts shows that, for every i ≠ j and every k = 0, …, n − 1,

∫_{t_k^n}^{t_{k+1}^n} ( W_s^i − W_{t_k^n}^i ) dW_s^j + ∫_{t_k^n}^{t_{k+1}^n} ( W_s^j − W_{t_k^n}^j ) dW_s^i = ( W_{t_{k+1}^n}^i − W_{t_k^n}^i )( W_{t_{k+1}^n}^j − W_{t_k^n}^j ),

whereas

∫_{t_k^n}^{t_{k+1}^n} ( W_s^i − W_{t_k^n}^i ) dW_s^i = ½ ( ( W_{t_{k+1}^n}^i − W_{t_k^n}^i )² − T/n ),  i = 1, …, q.
442 9 Biased Monte Carlo Simulation, Multilevel Paradigm
The idea is to symmetrize the “rectangular” Lévy areas, i.e. to replace them by half of their sum with their symmetric counterpart, namely ½ ( W_{t_{k+1}^n}^i − W_{t_k^n}^i )( W_{t_{k+1}^n}^j − W_{t_k^n}^j ). This yields the truncated Milstein scheme defined as follows:

X̆_0^n = X_0,
X̆_{t_{k+1}^n}^n = X̆_{t_k^n}^n + h b(t_k^n, X̆_{t_k^n}^n) + σ(t_k^n, X̆_{t_k^n}^n) ΔW_{t_{k+1}^n} + Σ_{1≤i≤j≤q} σ̆_{ij}(X̆_{t_k^n}^n) ( ΔW_{t_{k+1}^n}^i ΔW_{t_{k+1}^n}^j − 1_{{i=j}} h ),  (9.96)

where t_k^n = kT/n, ΔW_{t_{k+1}^n} = W_{t_{k+1}^n} − W_{t_k^n}, k = 0, …, n − 1, and

σ̆_{ij}(x) = ½ ( ∂σ_{·i} σ_{·j}(x) + ∂σ_{·j} σ_{·i}(x) ), 1 ≤ i < j ≤ q,
σ̆_{ii}(x) = ½ ∂σ_{·i} σ_{·i}(x), 1 ≤ i ≤ q

(where ∂σ_{·i} σ_{·j} is defined by (7.42)).
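In concrete terms, one step of scheme (9.96) only involves the Brownian increments over [t_k^n, t_{k+1}^n]. A minimal Python sketch for the autonomous case — the function names and calling convention are ours, not the book's: b returns a d-vector, sigma a d×q matrix, and sigma_breve(x, i, j) the d-vector σ̆_{ij}(x) with 0-based indices:

```python
import numpy as np

def truncated_milstein_step(x, h, dW, b, sigma, sigma_breve):
    """One step of the truncated Milstein scheme (9.96), autonomous case.

    x           : current state, shape (d,)
    h           : time step T/n
    dW          : Brownian increments over the step, shape (q,)
    b, sigma    : drift x -> (d,) and diffusion x -> (d, q)
    sigma_breve : (x, i, j) -> (d,), the symmetrized coefficient
                  sigma-breve_{ij}(x), 0-based indices with i <= j
    """
    q = dW.shape[0]
    x_new = x + h * b(x) + sigma(x) @ dW
    for i in range(q):
        for j in range(i, q):
            # symmetrized "rectangular" term: dW^i dW^j - 1_{i=j} h
            incr = dW[i] * dW[j] - (h if i == j else 0.0)
            x_new = x_new + sigma_breve(x, i, j) * incr
    return x_new
```

In one dimension (q = 1) this reduces to the standard Milstein scheme, which is the sanity check used below.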
This scheme can clearly be simulated, but the analysis of its strong convergence
rate, e.g. in L 2 when X 0 ∈ L 2 , shows a behavior quite similar to the Euler scheme,
i.e.
‖ max_{k=0,…,n} | X̆_{t_k^n}^n − X_{t_k^n} | ‖_2 ≤ C_{b,σ,T} ( 1 + ‖X_0‖_2 ) √(T/n).
9.6 Antithetic Schemes (a Quest for β > 1) 443
We consider a first scheme ( X̆_{t_k^{2n}}^{2n,[1]} )_{k=0,…,2n} with step h/2 = T/(2n), which reads on two consecutive time steps

X̆_{t_{2k+1}^{2n}}^{2n,[1]} = M( X̆_{t_{2k}^{2n}}^{2n,[1]}, h/2, ΔW_{t_{2k+1}^{2n}} ),
X̆_{t_{2(k+1)}^{2n}}^{2n,[1]} = M( X̆_{t_{2k+1}^{2n}}^{2n,[1]}, h/2, ΔW_{t_{2(k+1)}^{2n}} ), k = 0, …, n − 1.
We consider a second scheme with step T/(2n), denoted by ( X̆_{t_k^{2n}}^{2n,[2]} )_{k=0,…,2n}, identical to the first one except that the Brownian increments are pairwise swapped, i.e. the increments ΔW_{t_{2k+1}^{2n}} and ΔW_{t_{2(k+1)}^{2n}} are swapped. It reads

X̆_{t_{2k+1}^{2n}}^{2n,[2]} = M( X̆_{t_{2k}^{2n}}^{2n,[2]}, h/2, ΔW_{t_{2(k+1)}^{2n}} ),
X̆_{t_{2(k+1)}^{2n}}^{2n,[2]} = M( X̆_{t_{2k+1}^{2n}}^{2n,[2]}, h/2, ΔW_{t_{2k+1}^{2n}} ), k = 0, …, n − 1.
The following theorem, established in [110] in the autonomous case (b(t, x) = b(x),
σ(t, x) = σ(x)), makes precise the fact that it produces a meta-scheme satisfying
(S E)β with β = 2 (and a root N = 2) in a multilevel scheme.
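The increment-swapping mechanism behind this meta-scheme can be sketched as follows (Python; `step` stands for a generic one-step map M(x, h, δ) as in the two displayed recursions — the names and interface are ours):

```python
import numpy as np

def antithetic_pair(x0, n, T, step, dW_fine):
    """Run the two antithetic fine schemes of step T/(2n).

    step    : one-step map M(x, h, dW), e.g. a truncated Milstein step
    dW_fine : Brownian increments over the 2n fine intervals, shape (2n, q)
    Scheme [2] uses the same increments as scheme [1], with each
    consecutive pair of increments swapped.
    """
    h2 = T / (2 * n)
    x1 = x2 = x0
    for k in range(n):
        d_odd, d_even = dW_fine[2 * k], dW_fine[2 * k + 1]
        x1 = step(step(x1, h2, d_odd), h2, d_even)   # increments in order
        x2 = step(step(x2, h2, d_even), h2, d_odd)   # increments swapped
    return x1, x2
```

Both schemes consume the same Brownian path; only the order of the increments within each coarse interval differs, which is what brings the antithetic average close to the coarse scheme in L².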
Theorem 9.4 Assume b ∈ C²(R^d, R^d) and σ ∈ C²(R^d, M(d, q, R)) with bounded existing partial derivatives of b, σ, σ̆ (up to order 2).
(a) Smooth payoff. Let f ∈ C²(R^d, R), also with bounded existing partial derivatives (up to order 2). Then there exists a real constant C_{b,σ,T} > 0 such that

‖ f(X̆_T^n) − ½ ( f(X̆_T^{2n,[1]}) + f(X̆_T^{2n,[2]}) ) ‖_2 ≤ C_{b,σ,T} ( 1 + ‖X_0‖_2 ) T/n.  (9.97)

(b) Non-smooth payoff. Under an additional condition (9.98) on f and the distribution of X_T, for every η > 0 there exists a real constant C_{b,σ,T,η} > 0 such that

‖ f(X̆_T^n) − ½ ( f(X̆_T^{2n,[1]}) + f(X̆_T^{2n,[2]}) ) ‖_2 ≤ C_{b,σ,T,η} ( 1 + ‖X_0‖_2 ) (T/n)^{3/4 − η}.  (9.99)
For detailed proofs, we refer to Theorems 4.10 and 5.2 in [110]. At least in a one-dimensional framework, the above condition (9.98) is satisfied if the distribution of
X T has a bounded density.
ℵ Practitioner’s corner: Application to multilevel estimators with root N = 2
The principle is rather simple at this stage. Let us deal with the simplest case of a
vanilla payoff (or 1-marginal) Y0 = f (X T ) and assume that f is such that (S E)β is
satisfied for some β ∈ (1, 2].
Coarse level. On the coarse (first) level i = 1, we simply set Y_h^{(1)} = f( X̆_T^{n,(1)} ), where the truncated Milstein scheme is driven by a q-dimensional Brownian motion W^{(1)}. Keep in mind that this scheme is of order 1/2, so that coarse and fine levels will not be ruled by the same β.
Fine levels. At each fine level i ≥ 2, the regular difference Y_{h/n_i}^{(i)} − Y_{h/n_{i−1}}^{(i)} is replaced by its antithetic counterpart

½ ( Y_{h/n_i}^{[1],(i)} + Y_{h/n_i}^{[2],(i)} ) − Y_{h/n_{i−1}}^{(i)},

where Y_{h/n_{i−1}}^{(i)} = f( X̆_T^{nN^{i−2},(i)} ) and Y_{h/n_i}^{[r],(i)} = f( X̆_T^{nN^{i−1},[r],(i)} ), r = 1, 2. At each level, the three truncated antithetic Milstein schemes are driven by the same q-dimensional Brownian motion W^{(i)}, i = 1, …, R (or L, depending on the implemented type of multilevel meta-scheme). Owing to Theorem 9.4, it is clear that, under appropriate assumptions,

‖ ½ ( Y_{h/n_i}^{[1],(i)} + Y_{h/n_i}^{[2],(i)} ) − Y_{h/n_{i−1}}^{(i)} ‖_2² ≤ V_1 (h/n_i)^β with β = (3/2)^− or β = 2.
For the sake of simplicity, we assume in this section that the root N = 2, but what
follows can be extended to any integer N ≥ 2 (see e.g. [113]). In a regular multilevel
Nested Monte Carlo simulation, the fine level i ≥ 2 – with or without weight – relies
9.6 Antithetic Schemes (a Quest for β > 1) 445
Table 9.3 Parameters of the ML2R estimator (standard antithetic diffusion framework): root N = 2 (in our described setting); depth R = R*(ε) (a closed-form ⌈·⌉-expression involving α, N, h and ε); bias parameter h = h*(ε) = h; refiners n_i = N^{i−1}, i = 1, …, R; weights q = q*(ε) = (q_1(ε), …, q_R(ε)) with normalizing constant q_ε^† such that Σ_{j=1}^R q_j(ε) = 1, where θ = √(V_1/V̆_h) and V̆_h = Var(Y̆_h); size M = M*(ε) (a closed-form expression of order ε^{−2}).
on the difference Y_h − Y_{h/2}, where h = 1/K = h_i = 1/(2^{i−1} K_0) ∈ H. If f is Lipschitz continuous, we know (see Proposition 9.2(a)) that ‖Y_h − Y_{h/2}‖_2² ≤ [f]_{Lip}² Var( F(X, Z_1) ) h, so that (SE)_β and, subsequently, (VE)_β are satisfied with β = 1.
The antithetic version of the fine level Y_h − Y_{h/2} that we present below seems to have been introduced independently in [55, 61, 140]; see also [109, 114]. It consists in replacing the estimator Y_h in a smart enough way by a random variable Y̆_h with the same expectation as Y_h, designed from the same X and the same innovations Z_k but closer to Y_{h/2} in L²(P). This leads us to set
Y̆_h = ½ [ f( (1/K) Σ_{k=1}^{K} F(X, Z_k) ) + f( (1/K) Σ_{k=K+1}^{2K} F(X, Z_k) ) ].  (9.100)
owing to the discrete time B.D.G. inequality (see (6.49)) applied to the sequence ( F(X, Z_k) − F(X, Z_{K+k}) )_{k≥1} of σ(X, Z_1, …, Z_n)-adapted martingale increments. Then, Minkowski's Inequality yields
Exercises. 1. Extend the above antithetic multilevel nested Monte Carlo method
to locally ρ-Hölder continuous functions f : Rd → R satisfying
∀ x, y ∈ R^d,  | f(x) − f(y) | ≤ [f]_{ρ,loc} |x − y|^ρ ( 1 + |x|^r + |y|^r ).
2. Update Table 9.3 to take into account the specific complexity of the antithetic
nested Monte Carlo method.
3. Extend the antithetic approach to any root N ≥ 2 by replacing Y_h in the generic expression Y_{h/N} − Y_h of a fine level with root N by
Y̆_h = (1/N) Σ_{n=1}^{N} f( (1/K) Σ_{k=1}^{K} F(X, Z_{(n−1)K+k}) ).
We know that Y_h → Y_0 = f( E[ F(X, Z_1) | X ] ) in L²(P) as h → 0 in H. It follows that, for every fixed h ∈ H,

E Y_h ≥ E Y_{h/N^k} → E Y_0 as k → +∞.

Consequently, for the regular MLMC estimator (with h = h*(ε), optimized bias parameter), one has (see (9.81)):

E Î_M^{MLMC} = E Y_{h/N^{L−1}} ≥ E Y_0 = I_0,

i.e. the regular MLMC estimator is upper-biased. This allows us to produce asymmetrical, narrower confidence intervals (see Practitioner's corner in Sect. 9.5.2) since the bias E Î_M^{MLMC} − I_0 is always non-negative.
Fig. 9.1 Clark–Cameron oscillator: CPU-time ratio w.r.t. CPU-time of ML2R (in seconds)
as a function of the prescribed RMSE ε (with D. Giorgi)
E cos(X_T^2) = e^{ −(μ²T/2) ( 1 − tanh(σT)/(σT) ) } / √( cosh(σT) ).  (9.103)

P(1, 1, 1) ≈ 7.14556.
We defer to Sect. 12.11 in the Miscellany chapter the proof of this identity (also used
in [113] as a reference value).
We implemented the following schemes with their correspondences in terms of
characteristic exponents (α, β):
This model can be simulated in an exact way at fixed time t ∈ [0, T] since the above SDE has the closed-form solution X_t = x_0 e^{(r − σ²/2)t + σW_t}.
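A sketch of this exact simulation (Python; the parameter values used below are ours, for illustration):

```python
import numpy as np

def bs_exact(x0, r, sigma, t, n_paths, rng):
    """Exact samples of X_t = x0 * exp((r - sigma^2/2) t + sigma W_t):
    no time discretization is needed, only one Gaussian draw per path."""
    w_t = np.sqrt(t) * rng.standard_normal(n_paths)
    return x0 * np.exp((r - 0.5 * sigma**2) * t + sigma * w_t)
```

A convenient sanity check is the martingale property of the discounted price: e^{−rt} E[X_t] = x_0.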
All the simulations of this section have been performed by running a C++ script on a MacBook Pro (2.7 GHz Intel Core i5).
Vanilla Call (α = β = 1)
The (discounted) Vanilla Call payoff is given by ϕ(X_T) = e^{−rT} (X_T − K)_+. As for the model parameters, we set:

I_0 = Call_0 ≈ 25.9308.
– The selected time discretization schemes are the Euler scheme (see (7.6)), for which β = 1, and the Milstein scheme (see (7.39)), for which β = 2. Both are simulable since we are in one dimension, and both share the weak error exponent α = 1.
– The constant c∞ is set at 1 in the ML2R estimator and c1 is roughly estimated
in a preliminary phase to design the MLMC estimator (see Practitioner’s corner of
Sect. 9.5.2).
Though of no practical interest, one must keep in mind that the Black–Scholes SDE is quite a demanding benchmark for testing the efficiency of discretization schemes, because both its drift and diffusion coefficients grow to infinity as fast as possible under
a linear growth control assumption. This is all the more true given the high interest
rate.
All estimators are designed following the specifications detailed in the Practi-
tioner’s corners sections of this chapter (including the crude unbiased Monte Carlo
simulation based on the exact simulation of X T ). As for (weighted and unweighted)
multilevel estimators, we set h = 1 and estimate the “payoff-dependent” parameters
following the recommendations made in Practitioner’s corner I (see (9.70) and (9.71)
with h 0 = h/5 and h 0 = h 0 /2):
The parameters of the multilevel estimators have been settled following the values in
Tables 9.1 and 9.2. The crude Monte Carlo simulation has been implemented using
the parameter values given in Sect. 9.3.
We present below tables of numerical results for crude Monte Carlo, MLMC
and ML2R estimators for both Euler and Milstein schemes. Crude Monte Carlo
has been run for target RMSE ε = 2^{−k}, k = 1, …, 6, and the multilevel estimators for ε = 2^{−k}, k = 1, …, 8. We observe globally that our estimators fulfill the prescribed RMSE, so that it is relevant to compare the CPU computation times.
Crude versus Multilevel estimators: See Tables 9.4, 9.5, 9.6, 9.7, 9.8 and 9.9. We observe, at the level ε = 2^{−6} for the Euler scheme, that MLMC is 45 times faster than crude MC and ML2R is 130 times faster than crude MC. As for the Milstein scheme, these
ratios attain 166 and 255, respectively. This illustrates in a striking way that multilevel
methods simply make possible simulations that would be unreasonable to undertake
otherwise (in fact, that is why we limit ourselves to this lower range of RMSE).
ML2R versus MLMC estimators: See Tables 9.6, 9.7, 9.8 and 9.9. This time we
make the comparison at an RMSE level of ε = 2−8 . As for the Euler scheme we
observe that the ML2R is 4.62 times faster than the MLMC estimator whereas, with
the Milstein scheme, this ratio is still 1.58. Note that, being in a setting where β = 2 > 1, both estimators are expected to behave like unbiased estimators, which means that the constant C in the asymptotic complexity rate C ε^{−2} seemingly remains lower for ML2R than for MLMC.
In the following figures, the performances (ordinates) of the estimators are depicted against the empirical RMSE ε (abscissas). The label above each point of the graph
indicates the target RMSE ε. The empirical RMSE, based on the bias-variance decom-
position, has been computed a posteriori by performing 250 independent trials for
each estimator. The performances of the estimators are measured and compared in
various ways, mixing CPU time and ε. Details are provided for each figure.
9.7 Examples of Simulation 451
Fig. 9.2 Call option pricing: CPU-time as a function of the empirical RMSE ε̃. Top: log-regular scale; Bottom: log-log scale (with D. Giorgi and V. Lemaire)
[Figure: MSE × CPU-time against the empirical RMSE (log-log scale) for the MC, MLMC and ML2R estimators with the Euler, Milstein and exact schemes; point labels indicate the target RMSE ε = 2^{−1}, …, 2^{−8}.]
Fig. 9.3 Call option pricing by Euler and Milstein schemes: MSE × CPU-time as a function of the empirical RMSE ε̃, log-log scale (with D. Giorgi and V. Lemaire)
Figure 9.3 illustrates for the Milstein scheme the “virtually” unbiased setting (β =
2) obtained here with the Multilevel Milstein scheme. We verify that multilevel curves
remain approximately flat but lie significantly above the true unbiased simulation
which, as expected, remains unrivaled. The weighted multilevel seems to be still
faster in practice.
For more simulations involving path-dependent options, we refer to [107, 109]
for regular multilevel and to [112, 198] for a comparison between weighted and
regular Multilevel estimators on other options like Lookback options (note that the
specifications are slightly more “conservative” in these last two papers than those
given here, whereas in Giles’ papers, the estimators designed are in an adaptive
form). Now we will briefly explore a path-dependent option, namely a barrier option, for which the characteristic exponent β is lower than 1, namely β = 0.5 (in (SE)_β^0), and α is most likely also equal to 0.5. We still consider a Black–Scholes dynamics
to have access to a closed form for the option premium.
Barrier option (α = β = 0.5)
We consider an Up-and-Out Call option to illustrate the case β = 0.5 < 1 and α =
0.5. This path-dependent option with strike K and barrier B > K is defined by its
functional payoff
With K = 100 and B = 120, the price obtained by the standard closed-form solution
is
I0 = Up-and-Out Call0 = 1.855225.
Remark. The performances of the ML2R estimator on this option tend to support the conjecture that a higher-order expansion of the weak error exists.
[Figure: Up-and-Out Call priced by the Euler scheme: CPU-time × MSE (top) and CPU-time (bottom) against the empirical RMSE for the MC, MLMC and ML2R estimators; point labels indicate the target RMSE ε = 2^{−2}, …, 2^{−6}.]
A compound option being simply an option on an option, its payoff involves the
value of another option. We consider here the example of a European style Put-on-
Call, still in a Black–Scholes model, with parameters (r, σ) (see (9.104) above). We
consider a Call of maturity T2 and strike K 2 . At date T1 , T1 < T2 , the holder of the
option has the right to sell it at (strike) price K 1 . The payoff of such a Put-on-Call
option reads
e^{−rT_1} ( K_1 − e^{−r(T_2 − T_1)} E[ (X_{T_2} − K_2)_+ | X_{T_1} ] )_+.
Note that, in these experiments, the underlying process (X t )t∈[0,T2 ] is not dis-
cretized in time since it can be simulated in an exact way. The bias error is exclusively
due to the inner Monte Carlo estimator of the conditional expectation.
The parameters of the Black–Scholes dynamics (X t )t∈[0,T2 ] are
whereas those of the Put-on-Call payoff are T1 = 1/12, T2 = 1/2 and K 1 = 6.5,
K 2 = 100.
To compute a reference price in spite of the absence of closed form we proceed
as follows: we first note that, by the (homogeneous) Markov property, the Black–Scholes formula at time T_1 yields (see Sect. 12.2)

e^{−r(T_2 − T_1)} E[ (X_{T_2} − K_2)_+ | X_{T_1} ] = Call_BS( X_{T_1}, K_2, r, σ, T_2 − T_1 ),

so that Y_0 = g(X_{T_1}), where

g(x) = ( K_1 − Call_BS( x, K_2, r, σ, T_2 − T_1 ) )_+.
This leads, with the numerical values of the current example, to the reference value

E Y_0 ≈ 1.39456 with | E Y_0 − 1.39456 | ≤ 2^{−15} at a 95%-confidence level.
Exercise. Check the above reference value using a large enough Monte Carlo simulation with the above control variate. How could you further speed up this simulation?
Such a value can still be seen as a compound option, namely the value of a digital
option on the premium at time T1 of a call of maturity T2 . We kept the above param-
eters for the Black–Scholes model and the underlying call option but also for the
digital component, namely K 1 = 6.5, T1 = 1/2; K 2 = 100 and T2 = 1/2.
[Figure: Digital option on a Call: CPU-time × MSE (top) and CPU-time (bottom) against the empirical RMSE for the MC, MLMC and ML2R estimators with the Nested and Antithetic schemes; point labels indicate the target RMSE ε = 2^{−3}, …, 2^{−8}.]
Fig. 9.6 Digital option on a Call: CPU-time (y-axis, log scale) as a function of the empirical RMSE ε̃ (x-axis, log₂ scale) (with V. Lemaire)
We only tested the two regular MLMC and ML2R estimators since antithetic pseudo-schemes are not efficient with non-smooth functions (here an indicator function).
Theorem 9.2(c) strongly suggests that a weak error expansion holds with α = 1, and Proposition 9.2(c) suggests that the strong error rate is satisfied with β = 0.5^−, so we adopt these values.
The multilevel estimators have been implemented for this simulation with the
parameters from [198], which are slightly more conservative than those prescribed
in this chapter, which explains why the crosses are slightly ahead of their optimal
positions (the resulting empirical RMSE is lower than the prescribed one).
Note in Fig. 9.6 that the ML2R estimator is faster than the MLMC estimator as a function of the empirical RMSE ε̃ by a factor approximately equal to 5 within the range of our simulations. This is expected since we are in a framework where β < 1.
Note that this toy example is close in spirit to the computation of Solvency Capital
Requirement which consists by its very principle in computing a kind of quantile – or
a value-at-risk – of a conditional expectation.
The webpage below, created and maintained by Mike Giles, is an attempt to list
the research groups working on multilevel Monte Carlo methods and their main
contributions:
people.maths.ox.ac.uk/gilesm/mlmc_community.html
For an interactive ride through the world of multilevel, have a look at the website
simulations@lpsm at the url:
simulations.lpsm.paris/
We also set p_n = P(τ = n) = π_n − π_{n+1} for every n ∈ N*. We assume that Σ_{n≥1} ‖Z_n‖_2 < +∞ so that, classically (⁶),

Z_1 + ··· + Z_n → Y_0 = Σ_{n≥1} Z_n in L² and a.s.
Our aim is to compute I0 = E Y0 taking advantage of the assumption that the random
variables Z n are simulable (see further on) and to be more precise, to devise an
unbiased estimator of I0 . To this end, we will use the random time τ (with the
disadvantage of adding an exogenous variance to our estimator).
Let

I_τ = Σ_{i=1}^{τ} Z_i / π_i = Σ_{i≥1} 1_{{i≤τ}} Z_i / π_i.
⁶ As the L¹-norm is dominated by the L²-norm, Σ_{n≥1} ‖Z_n‖_1 < +∞, so that E Σ_{n≥1} |Z_n| < +∞, which in turn implies that Σ_{n≥1} Z_n is P-a.s. absolutely convergent.
If

Σ_{n≥1} ‖Z_n‖_2 / √π_n < +∞,  (9.105)

then

I_{τ∧n} → I_τ in L² and a.s.

(and Z_1 + ··· + Z_n → Y_0 in L² and a.s. as n → +∞). Then the random variable I_τ is unbiased with respect to Y_0 in the sense that

E I_τ = E [ Σ_{i=1}^{τ} Z_i / π_i ] = E Y_0

and

Var(I_τ) = ‖I_τ‖_2² − I_0² with ‖I_τ‖_2 ≤ Σ_{n≥1} ‖Z_n‖_2 / √π_n.  (9.106)
If

Σ_{n≥1} Var(Z_n) / π_n < +∞ and Σ_{n≥1} |E Z_n| / √π_n < +∞,  (9.107)

then I_{τ∧n} → I_τ in L² and a.s. as n → +∞ and

‖I_τ‖_2 ≤ ( Σ_{n≥1} Var(Z_n) / π_n )^{1/2} + Σ_{n≥1} |E Z_n| / √π_n.  (9.108)
9.8 Randomized Multilevel Monte Carlo (Unbiased Simulation) 463
Remark. When the random variables Z_n are independent and E Z_n = o(‖Z_n‖_2), the assumption (9.107) in (c) is strictly less stringent than that in (a) (under the independence assumption). Furthermore, if one may neglect the additional term Σ_{n≥1} |E Z_n|/√π_n involving expectations in the upper-bound (9.108) of ‖I_τ‖_2, it is clear that the variance in the independent case is most likely lower than the general upper-bound established in (9.106), since a_n² = o(a_n) when a_n → 0.
Proof of Proposition 9.4. (a) The space L²(Ω, A, P) endowed with the L²-norm ‖·‖_2 is complete, so it suffices to show that the series with general term (Z_n/π_n) 1_{{n≤τ}} is ‖·‖_2-normally convergent. Now, using that τ and the sequence (Z_n)_{n≥1} are mutually independent, we have for every n ≥ 1,

‖ (Z_n/π_n) 1_{{n≤τ}} ‖_2 = ( ‖Z_n‖_2 / π_n ) ‖ 1_{{n≤τ}} ‖_2 = ( ‖Z_n‖_2 / π_n ) √π_n = ‖Z_n‖_2 / √π_n.

Then the sequence (I_{τ∧n}) converges in L²(P), hence toward I_τ, as the latter is clearly its a.s. limit since τ < +∞.
The L²-convergence of Y_n toward Y_∞ is straightforward under our assumptions, since 1 ≤ 1/√π_i, by similar ‖·‖_2-normal convergence arguments.
On the other hand, E[ (Z_n/π_n) 1_{{n≤τ}} ] = (E Z_n/π_n) π_n = E Z_n, n ≥ 1. Now, as a consequence of the L² (hence L¹) convergence of I_{τ∧n} toward I_τ, their expectations converge, i.e. E I_τ = lim_n E I_{τ∧n} = lim_n Σ_{i=1}^{n} E Z_i = E Y_0. Finally,

‖I_τ‖_2 ≤ Σ_{n≥1} ‖Z_n‖_2 / √π_n.
Now

√π_{n+1} ‖ Σ_{i=1}^{n} Z_i/π_i ‖_2 ≤ √π_n Σ_{i=1}^{n} ‖Z_i‖_2/π_i → 0 as n → +∞

owing to Kronecker's Lemma (see Lemma 12.1 applied with a_n = ‖Z_n‖_2/√π_n and b_n = 1/√π_n).
(c) First we decompose I_{τ∧n} into

I_{τ∧n} = Σ_{i=1}^{n} ( Z̄_i/π_i ) 1_{{i≤τ}} + Σ_{i=1}^{n} ( E Z_i/π_i ) 1_{{i≤τ}}, where Z̄_i := Z_i − E Z_i,

since E Z̄_i Z̄_j = E Z̄_i E Z̄_j = 0 if i ≠ j. Hence, for all integers n, m ≥ 1,

‖ Σ_{i=n+1}^{n+m} ( Z̄_i/π_i ) 1_{{i≤τ}} ‖_2 ≤ ( Σ_{i=n+1}^{n+m} Var(Z_i)/π_i )^{1/2},

which implies, under the assumption Σ_{n≥1} Var(Z_n)/π_n < +∞, that the sequence Σ_{i=1}^{n} ( Z̄_i/π_i ) 1_{{i≤τ}} is a Cauchy sequence, hence convergent in the complete space (L²(P), ‖·‖_2). Its limit is necessarily Σ_{i=1}^{τ} Z̄_i/π_i since that is its a.s. limit.
On the other hand, as

Σ_{i≥1} ‖ ( |E Z_i|/π_i ) 1_{{i≤τ}} ‖_2 = Σ_{i≥1} ( |E Z_i|/π_i ) √π_i = Σ_{i≥1} |E Z_i|/√π_i < +∞,

one deduces that Σ_{i=1}^{n} ( E Z_i/π_i ) 1_{{i≤τ}} is convergent in L²(P). Its limit is clearly Σ_{i=1}^{τ} E Z_i/π_i.
Finally, I_{τ∧n} → I_τ in L²(P) (and P-a.s.) and the estimate of Var(I_τ) is a straightforward consequence of Minkowski's Inequality. ♦
Unbiased Multilevel estimator
The resulting unbiased multilevel estimator reads, for every integer M ≥ 1,
Î_M^{RML} = (1/M) Σ_{m=1}^{M} I_{τ^{(m)}}^{(m)} = (1/M) Σ_{m=1}^{M} Σ_{i=1}^{τ^{(m)}} Z_i^{(m)} / π_i,  (9.109)

where the ( τ^{(m)}, (Z_i^{(m)})_{i≥1} ), m ≥ 1, are independent and distributed as ( τ, (Z_i)_{i≥1} ).
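A sketch of the estimator (9.109) with a geometric level distribution (Python; the sampler `sample_Z` for the level variables Z_i is a user-supplied assumption, e.g. a multilevel correction at level i):

```python
import numpy as np

def rml_estimator(sample_Z, p, M, rng, n_max=60):
    """Randomized multilevel estimator (9.109), unbiased for E[sum_n Z_n].

    Here P(tau = n) = p (1 - p)^{n-1}, hence pi_n = P(tau >= n) = (1-p)^{n-1}.
    n_max is a hard cap kept only as a numerical safeguard.
    """
    total = 0.0
    for _ in range(M):
        tau = min(int(rng.geometric(p)), n_max)
        total += sum(sample_Z(i, rng) / (1.0 - p) ** (i - 1)
                     for i in range(1, tau + 1))
    return total / M
```

With deterministic levels Z_i = 2^{−i} and p = 1/2, each draw equals τ/2, so the estimator targets Σ_i 2^{−i} = 1, which gives a cheap unbiasedness check.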
• Complexity. Assume the simulation cost of Z_n is κ(Z_n) = κ N^n, n ≥ 1. Then the mean complexity of I_τ reads

κ̄(I_τ) = Σ_{n≥1} κ p_n N^n.

If τ has a geometric distribution, p_n = p(1 − p)^{n−1}, this quantity is finite if and only if

(1 − p) N < 1, and then κ̄(I_τ) = κ p N / ( 1 − (1 − p) N ).
• Variance. We make the assumption that there exists some β > 0 such that

E Z_n² ≤ V_1 N^{−nβ}, n ≥ 1,

for the same N as for the complexity. Then the finiteness of the variance is ensured if Assumption (9.105) holds true. With the above assumptions and choice, we have

‖Z_n‖_2 / √π_n ≤ √V_1 N^{−nβ/2} / (1 − p)^{(n−1)/2} = √( V_1 (1 − p) ) ( N^β (1 − p) )^{−n/2},

which is summable if and only if N^β (1 − p) > 1. Combined with the complexity constraint, this yields

1/N^β < 1 − p < 1/N,

which may hold if and only if β > 1. A reasonable choice is then to set 1 − p to be the geometric mean of these bounds, namely

p = 1 − N^{−(1+β)/2}.  (9.110)
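The constraint 1/N^β < 1 − p < 1/N and the geometric-mean choice (9.110) are easy to check numerically (a small sketch; the values N = 2, β = 2 in the check are ours, for illustration):

```python
def geometric_mean_p(N, beta):
    """Choice (9.110): 1 - p is the geometric mean of 1/N^beta and 1/N.
    Both the variance and the mean-complexity requirements then hold,
    which is only possible when beta > 1."""
    one_minus_p = N ** (-(1.0 + beta) / 2.0)
    assert N ** (-beta) < one_minus_p < 1.0 / N
    return 1.0 - one_minus_p
```

For N = 2 and β = 2 (a Milstein-like strong rate), this gives p = 1 − 2^{−3/2} ≈ 0.6464, and the finite-complexity condition (1 − p)N < 1 is satisfied.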
• Size of the simulation. To specify our estimator for a given RMSE, we take advantage of its unbiasedness and note that

‖ Î_M^{RML} − I_0 ‖_2² = Var( Î_M^{RML} ) = Var(I_τ) / M.

The prescription ‖ Î_M^{RML} − I_0 ‖_2 = ε reads, like for any unbiased estimator,

M*(ε) = ε^{−2} Var(I_τ).

In view of practical implementation, a short pre-processing is necessary to roughly calibrate Var(I_τ).
However, we insist upon the fact that this unbiased simulation method can only be implemented if the family (Y_h)_{h∈H} satisfies (SE)_β with β > 1 (fast quadratic approximation). By contrast, at first glance, no assumption on the weak error expansion is needed. This observation should be tempered by the result stated in Claim (c) of the former proposition, where the Z_n are assumed independent. This setting is actually the closest to the multilevel approach, and the second series in (9.107) involves E Z_n, which implies a control of the weak error to ensure its convergence.
Exercise (Linear refiners). Set n_i = i, i ≥ 1, in the multilevel framework. Assume H = {h/n, n ≥ 1} and that the sequence (Z_n)_{n≥1} is made of independent random variables (see Claim (c) of the above Proposition 9.4).
• Inverse linear complexity: there exists a κ > 0 such that κ(Y_h) = κ/h, h ∈ H.
• (SE)_β ≡ Var( Y_h − Y_{h/i} ) ≤ V_1 h^β ( 1 − 1/i )^β, h ∈ H, i ≥ 2.
• (WE)_α^1 ≡ E Y_h = E Y_0 + c_1 h + o(h) as h → 0 in H.
(a) Devise a new version of randomized Monte Carlo based on these assumptions
(including a new distribution for τ ).
(b) Show that, once again, it is unfortunately only efficient (by efficient we mean that
it is possible to choose the distribution of τ so that the mean cost and the variance of
the unbiased randomized multilevel estimate both remain finite) if β > 1.
Application to weak exact simulation of SDEs
In particular, in an SDE framework, a multilevel implementation of the Milstein
scheme provides an efficient unbiased method of simulation of the expectation of a
marginal function of a diffusion E f (X T ). This is always possible in one dimension
(scalar Brownian motion, i.e. q = 1 with our usual notations) under appropriate
smoothness assumptions on the drift and the diffusion coefficient of the SDE.
In higher dimensions (q ≥ 2), an unbiased simulation method can be devised by
considering the antithetic (meta-)scheme, see Sect. 9.6.1,
Z_1 = Y_h^{(1)}, and Z_i = ½ ( Y_{h/n_i}^{[1],(i)} + Y_{h/n_i}^{[2],(i)} ) − Y_{h/n_{i−1}}^{(i)}, i ≥ 2,
(with the notations of this section). Such an approach provides a rather simple weak
exact simulation method of a diffusion among those mentioned in the introduction
of this chapter.
We tested the performances of the Randomized Multilevel estimator described above, with consistent and with independent levels, on a regular Black–Scholes Call in one dimension,
discretized by a Milstein scheme in order to guarantee that β > 1. The levels have
been designed following the multilevel framework described above. The parameters
of the Call are
x0 = 100, K = 80, σ = 0.4.
The randomization parameter has been set to p* = 0.931959 owing to Formula (9.110) (further numerical experiments show that this choice is close to optimal).
First we compared the Randomized Multilevel estimator (see the former subsec-
tion) with consistent levels (the same Brownian underlying increments are used for
the simulation at all levels) and the Randomized Multilevel estimator with inde-
pendent levels (in the spirit of the multilevel paradigm described at the beginning
of Sect. 9.5). Figure 9.7 depicts the respective performances of the two estimators.
Together with the computations carried out in Proposition 9.4, it confirms the intuition that the Randomized Multilevel estimator with independent levels outperforms that with consistent levels. To be more precise, the variance of I_τ is approximately equal to 1617.05 in the consistent levels case and 1436.61 in the independent levels case.
As a second step, with the same toy benchmark, we tested the Randomized Multi-
level estimator versus the weighted and regular Multilevel estimators described in the
former sections. Figure 9.8 highlights that, even for not so small prescribed RMSE (ε = 2^{−k}, k = 1, …, 5 in this simulation), the Randomized Multilevel estimator (with independent levels) is outperformed by both weighted and regular multilevel estimators.
Fig. 9.7 RML estimators: Constant vs independent levels (on a Black–Scholes Call). CPU-time
(y–axis, log scale) as a function of the empirical RMSE ε̃ (x-axis, log scale) (with D. Giorgi)
Fig. 9.8 RML estimator vs ML2R and MLMC estimators: CPU-time as a function of the
empirical RMSE ε̃ (x-axis, log scale) (with D. Giorgi)
Assume that the function f is regular, at least at some points. Our aim is to devise a method to compute by simulation f′(x), or higher derivatives f^{(k)}(x), k ≥ 1, at such points x. If the functional F(x, z) is differentiable in its first variable at x for P_Z-almost every z, if a domination or uniform integrability property holds (like (10.1) or (10.5) below), and if the partial derivative ∂F/∂x(x, z) can be computed and Z can be simulated, both at a reasonable computational cost, it is natural to compute f′(x) using a Monte Carlo simulation based on the representation formula

f′(x) = E [ ∂F/∂x (x, Z) ].
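As an illustration of this representation (our example, not the book's), the Black–Scholes call delta admits an explicit pathwise derivative, which can then be averaged by Monte Carlo; the parameter values below are ours:

```python
import numpy as np
from math import erf, log, sqrt

def pathwise_delta_call(x0, K, r, sigma, T, n_paths, rng):
    """f'(x) = E[ dF/dx (x, Z) ] for
    F(x, z) = e^{-rT} (x e^{(r - sigma^2/2)T + sigma sqrt(T) z} - K)_+,
    whose a.s. derivative in x is e^{-rT} 1_{S_T > K} S_T / x."""
    Z = rng.standard_normal(n_paths)
    S_T = x0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    return float(np.mean(np.exp(-r * T) * (S_T > K) * S_T / x0))

def bs_delta(x0, K, r, sigma, T):
    """Closed-form delta Phi(d1), used only as a consistency check."""
    d1 = (log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return 0.5 * (1.0 + erf(d1 / sqrt(2.0)))
```

The indicator in the derivative is harmless here because the payoff is Lipschitz; the kink at S_T = K has probability zero.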
1 Although this chapter comes right after the presentation of multilevel methods, we decided to
expose the main available approaches for simulation-based sensitivity computation on their own,
in order not to be polluted by technicalities induced by a mix (or a merge). However, the multilevel
paradigm may be applied efficiently to improve the performances of the Monte Carlo-based methods
that we will present: in most cases we have at hand a strong error rate and a weak error expansion.
This can be easily checked on the finite difference method, for example. For an application of the
multilevel approach to Greek computation, we refer to [56].
2 In fact, E can be replaced by the probability space itself: Z becomes the canonical vari-
able/process on this probability space (endowed with the distribution P = P Z of the process).
In particular, Z can be the Brownian motion or any process at time T starting at x or its entire path,
etc. The notation is essentially formal and could be replaced by the more general F(x, ω).
© Springer International Publishing AG, part of Springer Nature 2018 471
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_10
472 10 Back to Sensitivity Computation
This approach has already been introduced in Chap. 2 and will be more deeply developed further on, in Sect. 10.2, mainly devoted to the tangent process method for diffusions.
Otherwise, when ∂F/∂x(x, z) does not exist or cannot be computed easily (whereas F can), a natural idea is to introduce a stochastic finite difference approach. Other methods, based on the introduction of an appropriate weight, will be introduced in the last two sections of this chapter.
The finite difference method is in some way the most elementary and natural method for computing sensitivity parameters, known as Greeks when dealing with financial derivatives, although it is an approximate method in its standard form. It is also known in financial engineering as the “Bump Method” or “Shock Method”. It can be described in a very general setting, which corresponds to its wide field of application. Finite difference methods were originally investigated in [117, 119, 194].
Proposition 10.1 Let x ∈ R. Assume that F satisfies the following local mean quadratic Lipschitz continuity assumption at x:

∃ ε_0 > 0, ∀ x′ ∈ (x − ε_0, x + ε_0), ‖ F(x, Z) − F(x′, Z) ‖_2 ≤ C_{F,Z} |x − x′|.  (10.1)

Then, for every ε ∈ (0, ε_0) and every M ≥ 1, with (Z_k)_{k≥1} i.i.d. copies of Z,

‖ f′(x) − (1/M) Σ_{k=1}^{M} ( F(x+ε, Z_k) − F(x−ε, Z_k) )/(2ε) ‖_2
 ≤ [ ( (ε²/2) [f″]_Lip )² + ( C_{F,Z}² − ( f′(x) − (ε²/2) [f″]_Lip )_+² ) / M ]^{1/2}  (10.2)
 ≤ [ ( (ε²/2) [f″]_Lip )² + C_{F,Z}² / M ]^{1/2}  (10.3)
 ≤ (ε²/2) [f″]_Lip + C_{F,Z} / √M.
Proof. Let ε ∈ (0, ε_0). It follows from the Taylor formula applied to f between x and x ± ε, respectively, that

| f′(x) − ( f(x+ε) − f(x−ε) )/(2ε) | ≤ (ε²/2) [f″]_Lip.  (10.4)
On the other hand,

E [ ( F(x+ε, Z) − F(x−ε, Z) )/(2ε) ] = ( f(x+ε) − f(x−ε) )/(2ε)

and

Var( ( F(x+ε, Z) − F(x−ε, Z) )/(2ε) )
 = E [ ( ( F(x+ε, Z) − F(x−ε, Z) )/(2ε) )² ] − ( ( f(x+ε) − f(x−ε) )/(2ε) )²
 = E [ ( F(x+ε, Z) − F(x−ε, Z) )² ]/(4ε²) − ( ( f(x+ε) − f(x−ε) )/(2ε) )²
 ≤ C_{F,Z}² − ( ( f(x+ε) − f(x−ε) )/(2ε) )² ≤ C_{F,Z}².  (10.5)
Using the bias-variance decomposition (sometimes called Huygens' Theorem in Statistics (³)), we derive the following upper-bound for the mean squared error:

‖ f′(x) − (1/M) Σ_{k=1}^{M} ( F(x+ε, Z_k) − F(x−ε, Z_k) )/(2ε) ‖_2²
 = ( f′(x) − ( f(x+ε) − f(x−ε) )/(2ε) )² + Var( (1/M) Σ_{k=1}^{M} ( F(x+ε, Z_k) − F(x−ε, Z_k) )/(2ε) )
 = ( f′(x) − ( f(x+ε) − f(x−ε) )/(2ε) )² + (1/M) Var( ( F(x+ε, Z) − F(x−ε, Z) )/(2ε) )
 ≤ ( (ε²/2) [f″]_Lip )² + (1/M) ( C_{F,Z}² − ( ( f(x+ε) − f(x−ε) )/(2ε) )² ),

where we combined (10.4) and (10.5) to derive the last inequality, i.e. (10.3). To get
the improved bound (10.2), we first derive from (10.4) that

( f(x+ε) − f(x−ε) )/(2ε) ≥ f′(x) − (ε²/2) [f″]_Lip,

so that

( E [ ( F(x+ε, Z) − F(x−ε, Z) )/(2ε) ] )² = ( ( f(x+ε) − f(x−ε) )/(2ε) )² ≥ ( ( f′(x) − (ε²/2) [f″]_Lip )_+ )².

Plugging this lower bound into the definition of the variance yields the announced inequality. ♦
However, this is not realistic since C_{F,Z} is only an upper-bound of the variance term
and [f'']_Lip is usually unknown. An alternative is to choose M(ε) such that

ε⁻⁴ = o( M(ε) )
10.1 Finite Difference Method(s) 475
to be sure that the statistical error becomes smaller. However, it is of course useless
to carry on the simulation too far since the bias error is not impacted. Note that
such specification of the size M of the simulation breaks the recursive feature of the
estimator. Another way to use such an error bound is to keep in mind that, in order
to reduce the error by a factor of 2, we need to reduce ε and increase M as follows:

ε → ε/√2  and  M → 4M.
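The constant-step central ("bump") estimator can be sketched as follows in Python. This is a minimal illustration in a hypothetical Black–Scholes call setting: the payoff h, the parameter values (K, r, σ, T) and the closed-form delta used as a benchmark are all assumptions of this sketch, not taken from the text.

```python
import math
import random

def bs_payoff(x, z, K=100.0, r=0.05, sigma=0.2, T=1.0):
    # F(x, z) = e^{-rT} h(x e^{(r - sigma^2/2)T + sigma sqrt(T) z}), call payoff h
    s = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s - K, 0.0)

def fd_delta(x, eps, M, seed=42):
    # Central finite-difference estimator of f'(x) using the SAME innovation
    # Z_k for the up- and down-bumped payoffs (common random numbers).
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        acc += bs_payoff(x + eps, z) - bs_payoff(x - eps, z)
    return acc / (2.0 * eps * M)

def bs_delta(x, K=100.0, r=0.05, sigma=0.2, T=1.0):
    # Closed-form Black-Scholes delta, used here only as a benchmark
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))

delta_hat = fd_delta(100.0, eps=0.5, M=100_000)
```

With a Lipschitz payoff and common random numbers, the estimator stays within a few times C_{F,Z}/√M of the benchmark delta.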
Warning (what should never be done)! Imagine that we are using two independent
samples (Z_k)_{k≥1} and (Z̃_k)_{k≥1} to simulate the copies F(x+ε, Z) and F(x−ε, Z). Then

Var( (1/M) Σ_{k=1}^M (F(x+ε, Z_k) − F(x−ε, Z̃_k))/(2ε) )
  = (1/(4Mε²)) ( Var F(x+ε, Z) + Var F(x−ε, Z) )
  ≈ Var F(x, Z) / (2Mε²),

so that the resulting upper-bound for the RMSE reads

(ε²/2) [f'']_Lip + σ(F(x, Z)) / (ε √(2M)),

where σ(F(x, Z)) = √(Var F(x, Z)) is the standard deviation of F(x, Z). This leads
to consider the unrealistic constraint M(ε) ∝ ε⁻⁶ to keep the balance between the
bias term and the variance term, or equivalently to switch ε → ε/√2 and M → 8M
to reduce the error by a factor of 2.
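The warning above can be checked empirically. The sketch below (same hypothetical Black–Scholes call and parameter values as before, all assumptions of this illustration) compares the per-sample variance of the difference quotient computed with common versus independent innovations:

```python
import math
import random

def payoff(x, z, K=100.0, r=0.05, sigma=0.2, T=1.0):
    s = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s - K, 0.0)

def diff_samples(x, eps, M, common, seed=7):
    # One finite-difference sample per draw; common=True reuses the same Z for
    # both bumps, common=False uses two independent draws (the bad idea).
    rng = random.Random(seed)
    out = []
    for _ in range(M):
        z_up = rng.gauss(0.0, 1.0)
        z_dn = z_up if common else rng.gauss(0.0, 1.0)
        out.append((payoff(x + eps, z_up) - payoff(x - eps, z_dn)) / (2 * eps))
    return out

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

eps = 0.01
v_common = variance(diff_samples(100.0, eps, 20_000, common=True))
v_indep = variance(diff_samples(100.0, eps, 20_000, common=False))
# With common randomness the variance stays O(1); with independent samples it
# behaves like Var(F(x, Z)) / (2 eps^2) and blows up as eps -> 0.
```

For ε = 0.01 the independent-sample variance is larger by several orders of magnitude, as the 1/ε² factor predicts.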
Examples (Greeks computation). 1. Sensitivity in a Black–Scholes model. Vanilla
payoffs viewed as functions of a normal distribution correspond to functions F of
the form

F(x, z) = e^{−rT} h( x e^{(r − σ²/2)T + σ√T z} ),  z ∈ ℝ, x ∈ (0, +∞),

so that elementary computations show, using that Z ∼ N(0; 1),

‖F(x, Z) − F(x', Z)‖₂ ≤ [h]_Lip e^{σ²T/2} |x − x'|.
476 10 Back to Sensitivity Computation
convolution on (0, +∞) (4 ) with similar regularizing effects as the standard con-
volution on the whole real line. Under the appropriate growth assumption on the
function h (say polynomial growth), one shows from the above identity that the
function f is in fact infinitely differentiable over (0, +∞). In particular, it is twice
differentiable with Lipschitz continuous second derivative over any compact interval
included in (0, +∞).
2. Diffusion model with Lipschitz continuous coefficients. Let X^x = (X_t^x)_{t∈[0,T]}
denote the Brownian diffusion solution to the SDE

dX_t^x = b(X_t^x) dt + ϑ(X_t^x) dW_t,  X_0^x = x,

where b and ϑ are locally Lipschitz continuous functions (on the real line) with at
most linear growth (which implies the existence and uniqueness of a strong solution
(X_t^x)_{t∈[0,T]} starting from X_0^x = x). In such a case, one should instead write

F(x, ω) = h( X_T^x(ω) ).
The Lipschitz continuity of the flow of the above SDE (see Theorem 7.10) shows
that

‖F(x, ·) − F(x', ·)‖₂ ≤ C_{b,ϑ} [h]_Lip e^{C_{b,ϑ} T} |x − x'|,

where C_{b,ϑ} is a positive constant only depending on the Lipschitz continuous coefficients of b and ϑ. In fact, this also holds for multi-dimensional diffusion processes
and for path-dependent functionals.
The regularity of the function f is a less straightforward question. But the answer
is positive in two situations: either h, b and σ are regular enough to apply results on
the flow of the SDE which allows pathwise differentiation of x → F(x, ω) (see The-
orem 10.1 further on in Sect. 10.2.2) or ϑ satisfies a uniform ellipticity assumption
ϑ ≥ ε0 > 0.
3. Euler scheme of a Brownian diffusion model with Lipschitz continuous coeffi-
cients. The same holds for the Euler scheme. Furthermore, Assumption (10.1) holds
uniformly with respect to n if T/n is the step size of the Euler scheme.
4. F can also be a functional of the whole path of a diffusion, provided F is Lipschitz
continuous with respect to the sup-norm over [0, T ].
4 The convolution on (0, +∞) is defined between two non-negative functions f and g on (0, +∞)
by f ∗ g(x) = ∫₀^{+∞} f(x/y) g(y) dy.
As emphasized in the section devoted to the tangent process below, the generic
parameter x can be the maturity T (in practice the residual maturity T − t, also
known as seniority), or any finite-dimensional parameter on which the diffusion
coefficients depend since they can always be seen as an additional component or a
starting value of the diffusion.
Exercises. 1. Adapt the results of this section to the case where f'(x) is estimated
by its “forward” approximation

f'(x) ≈ ( f(x+ε) − f(x) ) / ε.
2. Noting that

(f(x+ε) − f(x−ε))/(2ε) = f'(x) + (ε²/6) f⁽³⁾(x) + O(ε³),

propose a Richardson–Romberg extrapolation method (see Sect. 9.4) based on the
estimators

f̂'(x)_{ε,M} = (1/M) Σ_{1≤k≤M} ( F(x+ε, Z_k) − F(x−ε, Z_k) ) / (2ε)

implemented with the steps ε and ε/2, respectively.
3. Multilevel estimators (see Sect. 9.5). (a) Assume f is C 4 and refine the above
expansion given in the previous exercise. Give a condition on the random function
F(x, ω) which makes it possible to design and implement a regular multilevel esti-
mator for f (x).
(b) Under which assumptions is it possible to design and implement a weighted
Multilevel estimator (ML2R)?
(c) What conclusions can be drawn from (a) and (b)?
4. Apply the above method(s) to approximate the γ-parameter by considering that

f''(x) ≈ ( f(x+ε) + f(x−ε) − 2 f(x) ) / ε²
under suitable assumptions on f and its derivatives.
The singular setting
In the above setting described in Proposition 10.1, we are close to a framework
in which one can interchange differentiation and expectation: the (local) Lipschitz
continuity assumption on the random function x' → F(x', Z) implies that

( (F(x', Z) − F(x, Z)) / (x' − x) )_{x' ∈ (x−ε₀, x+ε₀) \ {x}}

is a uniformly integrable family. Hence, as soon as x' → F(x', Z) is P-a.s. pathwise
differentiable at x (or even simply in L²(P)), one has f'(x) = E[F'_x(x, Z)].
Consequently, it is important to investigate the singular setting in which f is
differentiable at x and F( . , Z ) is not Lipschitz continuous in L 2 . This is the purpose
of the next proposition (whose proof is quite similar to the Lipschitz continuous
setting and is subsequently left to the reader as an exercise).
≤ (ε²/2) [f'']_Lip + C_{Höl,F,Z} / ( (2ε)^{1−θ} √M ).   (10.6)
A dual point of view in this singular case is to (roughly) optimize the parameter
ε = ε(M), given a simulation size M, in order to minimize the quadratic error, or
at least its natural upper-bounds. Such an optimization performed on (10.6) yields

ε_opt = ( 2θ C_{Höl,F,Z} / ( [f'']_Lip √M ) )^{1/(3−θ)}.

One sees that, when θ ∈ (0, 1), the lack of L²-regularity of x → F(x, Z) slows down the convergence of the finite difference method, by contrast with the Lipschitz continuous
case, where the standard rate of convergence of the Monte Carlo method is preserved.
Example of the digital option. A typical example of such a situation is the pricing
of digital options (or, equivalently, the computation of the δ-hedge of Call or Put
options).
Let us consider, still in the standard risk neutral Black–Scholes model, a digital
Call option with strike price K > 0 defined by its payoff
h(ξ) = 1{ξ≥K }
and set F(x, z) = e^{−rT} h( x e^{(r − σ²/2)T + σ√T z} ), z ∈ ℝ, x ∈ (0, +∞) (r denotes the constant interest rate as usual). We know that the premium of this option is given for
every initial price x > 0 of the underlying risky asset by

f(x) = E[ F(x, Z) ]  with  Z ∼ N(0; 1).

Set μ = r − σ²/2. It is clear, since Z ∼ −Z, that

f(x) = e^{−rT} P( x e^{μT + σ√T Z} ≥ K )
     = e^{−rT} P( Z ≥ − (log(x/K) + μT) / (σ√T) )
     = e^{−rT} Φ₀( (log(x/K) + μT) / (σ√T) ),

where Φ₀ denotes the c.d.f. of the N(0; 1) distribution. Hence the function f is
infinitely differentiable on (0, +∞).
On the other hand, still using that Z ∼ −Z, for every x, x' ∈ (0, +∞),

‖F(x, Z) − F(x', Z)‖₂²
  = e^{−2rT} E[ ( 1_{Z ≥ −(log(x/K)+μT)/(σ√T)} − 1_{Z ≥ −(log(x'/K)+μT)/(σ√T)} )² ]
  = e^{−2rT} E[ ( 1_{Z ≤ (log(x/K)+μT)/(σ√T)} − 1_{Z ≤ (log(x'/K)+μT)/(σ√T)} )² ]
  = e^{−2rT} ( Φ₀( (log(max(x, x')/K) + μT)/(σ√T) ) − Φ₀( (log(min(x, x')/K) + μT)/(σ√T) ) ).

Since Φ₀ is Lipschitz continuous with Lipschitz coefficient κ₀ = (2π)^{−1/2}, it follows that

‖F(x, Z) − F(x', Z)‖₂² ≤ ( κ₀ e^{−2rT} / (σ√T) ) | log x − log x' |.

Consequently, for every interval I ⊂ (0, +∞) bounded away from 0, there exists a
real constant C_{r,σ,T,I} > 0 such that

∀ x, x' ∈ I,  ‖F(x, Z) − F(x', Z)‖₂ ≤ C_{r,σ,T,I} |x − x'|^{1/2},

i.e. the functional F is ½-Hölder in L²(P) and the above proposition applies.
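As a quick numerical sanity check of the closed form for the digital premium, one can compare a Monte Carlo estimate of f(x) = E F(x, Z) with e^{−rT} Φ₀((log(x/K) + μT)/(σ√T)). The sketch below assumes the parameter values K = 50, r = 0.04, σ = 0.1, T = 1/12 (those used in the numerical illustration later in the chapter):

```python
import math
import random

def norm_cdf(u):
    # c.d.f. of the N(0;1) distribution via the error function
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def digital_premium_mc(x, K=50.0, r=0.04, sigma=0.1, T=1.0 / 12.0,
                       M=200_000, seed=3):
    # Monte Carlo estimate of f(x) = E F(x, Z), payoff h(xi) = 1_{xi >= K}
    rng = random.Random(seed)
    mu = r - 0.5 * sigma ** 2
    hits = 0
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        if x * math.exp(mu * T + sigma * math.sqrt(T) * z) >= K:
            hits += 1
    return math.exp(-r * T) * hits / M

def digital_premium_closed(x, K=50.0, r=0.04, sigma=0.1, T=1.0 / 12.0):
    # f(x) = e^{-rT} Phi_0((log(x/K) + mu T) / (sigma sqrt(T)))
    mu = r - 0.5 * sigma ** 2
    return math.exp(-r * T) * norm_cdf(
        (math.log(x / K) + mu * T) / (sigma * math.sqrt(T)))

mc_val = digital_premium_mc(50.0)
cf_val = digital_premium_closed(50.0)
```

The two values agree up to the Monte Carlo statistical error of order 1/√M.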
In the former finite difference method with constant step, the bias never fades. Consequently, to increase the accuracy of the sensitivity computation, one has to resume it
from the beginning with a new ε. In fact, it is easy to propose a recursive version of
the above finite difference procedure by considering variable steps ε_k which go
to 0. This can be seen as an application of the Kiefer–Wolfowitz principle originally
developed for Stochastic Approximation purposes.
We will focus on the “regular setting” (F Lipschitz continuous in L²) in this
section; the singular setting is proposed as an exercise. Let (ε_k)_{k≥1} be a sequence of
positive real numbers decreasing to 0. With the notation and the assumptions of the
former section, consider the estimator

f̂'(x)_M := (1/M) Σ_{k=1}^M ( F(x+ε_k, Z_k) − F(x−ε_k, Z_k) ) / (2ε_k).   (10.7)
Then, by the bias-variance decomposition,

‖ f'(x) − f̂'(x)_M ‖₂²
  = ( f'(x) − (1/M) Σ_{k=1}^M (f(x+ε_k) − f(x−ε_k))/(2ε_k) )²
    + (1/M²) Σ_{k=1}^M Var( F(x+ε_k, Z_k) − F(x−ε_k, Z_k) ) / (4ε_k²)
  ≤ ( [f'']²_Lip / (4M²) ) ( Σ_{k=1}^M ε_k² )² + C²_{F,Z} / M   (10.8)
  = (1/M) ( ( [f'']²_Lip / (4M) ) ( Σ_{k=1}^M ε_k² )² + C²_{F,Z} ),

where we again used (10.4) to get Inequality (10.8). As a consequence, the RMSE
satisfies

‖ f'(x) − f̂'(x)_M ‖₂ ≤ (1/√M) √( ( [f'']²_Lip / (4M) ) ( Σ_{k=1}^M ε_k² )² + C²_{F,Z} ).   (10.9)
The bias term then becomes negligible with respect to the statistical error as soon as

ε_k = o( k^{−1/4} )  as k → +∞,

since, then, Σ_{k=1}^M ε_k² = o( Σ_{k=1}^M k^{−1/2} ) as M → +∞ and

Σ_{k=1}^M k^{−1/2} = √M (1/M) Σ_{k=1}^M (k/M)^{−1/2} ∼ √M ∫₀¹ dx/√x = 2√M  as M → +∞.
However, the choice of too small steps ε_k may introduce numerical instability in
the computations, so we recommend choosing the ε_k of the form

ε_k = c k^{−(1/4 + δ)},  c > 0,

with δ > 0 small enough.
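A minimal sketch of the recursive estimator (10.7) with decreasing steps follows, again in a hypothetical Black–Scholes call setting (payoff and parameter values K = 100, r = 0.05, σ = 0.2, T = 1 are assumptions of this illustration, not taken from the text); the step choice ε_k = c k^{−(1/4+δ)} matches the recommendation above:

```python
import math
import random

def payoff(x, z, K=100.0, r=0.05, sigma=0.2, T=1.0):
    s = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s - K, 0.0)

def fd_delta_decreasing_step(x, M, c=1.0, delta=0.05, seed=11):
    # Estimator (10.7): one fresh innovation Z_k and one step
    # eps_k = c * k^{-(1/4 + delta)} per term; the mean is updated recursively,
    # so the estimator can be refined without restarting from scratch.
    rng = random.Random(seed)
    est = 0.0
    for k in range(1, M + 1):
        eps_k = c * k ** -(0.25 + delta)
        z = rng.gauss(0.0, 1.0)
        incr = (payoff(x + eps_k, z) - payoff(x - eps_k, z)) / (2 * eps_k)
        est += (incr - est) / k  # recursive running-mean update
    return est

delta_dec = fd_delta_decreasing_step(100.0, M=100_000)
```

Unlike the constant-step version, refining the accuracy here only requires continuing the recursion with further terms.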
Now, since (f(x+ε_k) − f(x−ε_k))/(2ε_k) → f'(x) as k → +∞,

(1/M) Σ_{k=1}^M ( f(x+ε_k) − f(x−ε_k) )² / (4ε_k²) → f'(x)²  as M → +∞.

Plugging this into the above computations yields the refined asymptotic upper-bound

lim_{M→+∞} √M ‖ f'(x) − f̂'(x)_M ‖₂ ≤ √( C²_{F,Z} − f'(x)² ).

This approach has the same quadratic rate of convergence as a regular Monte
Carlo simulation (e.g. a simulation carried out with ∂F/∂x(x, Z) if it exists).
Now, we show that the estimator f̂'(x)_M is consistent, i.e. converges toward
f'(x).
Proposition 10.3 Under the assumptions of Proposition 10.1 and if ε_k goes to zero
as k goes to infinity, the estimator f̂'(x)_M P-a.s. converges to its target f'(x).
Proof. Since (f(x+ε_k) − f(x−ε_k))/(2ε_k) → f'(x) by (10.4), it suffices to show that

(1/M) Σ_{k=1}^M [ (F(x+ε_k, Z_k) − F(x−ε_k, Z_k))/(2ε_k) − (f(x+ε_k) − f(x−ε_k))/(2ε_k) ] → 0  a.s.   (10.10)

Set

L_M = Σ_{k=1}^M (1/k) ( F(x+ε_k, Z_k) − F(x−ε_k, Z_k) − (f(x+ε_k) − f(x−ε_k)) ) / (2ε_k),  M ≥ 1.

Then (L_M) is a martingale with independent increments ΔL_k = L_k − L_{k−1} and

E[L_M²] = Σ_{k=1}^M E[(ΔL_k)²] = Σ_{k=1}^M (1/(4k²ε_k²)) Var( F(x+ε_k, Z_k) − F(x−ε_k, Z_k) )
  ≤ Σ_{k=1}^M (1/(4k²ε_k²)) E[ ( F(x+ε_k, Z_k) − F(x−ε_k, Z_k) )² ]
  ≤ C²_{F,Z} Σ_{k=1}^M (1/(4k²ε_k²)) 4ε_k² = C²_{F,Z} Σ_{k=1}^M 1/k²,

so that

sup_M E[L_M²] < +∞.
where v̄ = C²_{F,Z} − f'(x)².

[Hint: Note that the sequence ( Var( (F(x+ε_k, Z_k) − F(x−ε_k, Z_k))/(2ε_k) ) )_{k≥1} is bounded and
use the following Central Limit Theorem: if (Y_n)_{n≥1} is a sequence of independent centered random
variables such that there exists an η > 0 satisfying

sup_n E|Y_n|^{2+η} < +∞  and  ∃ N_n → +∞ with (1/N_n) Σ_{k=1}^{N_n} Var(Y_k) → σ² > 0 as n → +∞,

then

(1/√N_n) Σ_{k=1}^{N_n} Y_k converges in law to N(0; σ²) as n → +∞.]
(b) Show that the resulting estimator f̂'(x)_M a.s. converges to its target f'(x) as
soon as

Σ_{k≥1} 1 / ( k² ε_k^{2(1−θ)} ) < +∞.

(c) Assume that ε_k = c/k^a, k ≥ 1, where c is a positive real constant and a ∈ (0, 1).
Show that the exponent a corresponds to an admissible step if and only if a ∈
(0, 1/(2(1−θ))). Justify the choice of a* = 1/(2(3−θ)) for the exponent a and deduce that
the resulting rate of decay of the RMSE is O( (√M)^{−2/(3−θ)} ).
The above exercise shows that the (lack of) regularity of x → F(x, Z ) in L 2 (P)
impacts the rate of convergence of the finite difference method.
We retain the notation of the former section. We assume that there exists a p ∈
[1, +∞) such that

∀ ξ ∈ (x−ε₀, x+ε₀),  F(ξ, Z) ∈ L^p(P).

f'(x) = E[ ∂_x F(x, ω) ].
This result must be understood as follows: when a sensitivity (the δ-hedge and the
γ parameter but also other “greek parameters”, as will be seen further on) related to
the premium E h(X Tx ) of an option cannot be computed by “simply” interchanging
differentiation and expectation, this lack of differentiability comes from the payoff
function h. This also holds true for path-dependent options.
Let us come to a precise statement of Kunita’s Theorem on the regularity of the
flow of an SDE.
∀ t ∈ ℝ₊,  Y_t^{ij}(x) = δ_{ij} + Σ_{ℓ=1}^d ∫₀^t (∂b_i/∂y_ℓ)(s, X_s^x) Y_s^{ℓj}(x) ds
    + Σ_{ℓ=1}^d Σ_{k=1}^q ∫₀^t (∂ϑ_{ik}/∂y_ℓ)(s, X_s^x) Y_s^{ℓj}(x) dW_s^k,  1 ≤ i, j ≤ d,
We will admit this theorem, which is beyond the scope of this monograph. We
refer to Theorem 3.3, p. 223, from [176] for Claim (a) and Theorem 4.4 (and its
proof), p. 227, for Claim (b).
Remarks. • If b and σ are C_b^{n+α} for some n ∈ ℕ, then, a.s., the flow x → X_t^x is C_b^{n+β}
for every 0 < β < α (see again Theorem 3.3 in [176]).
One easily derives from the above theorem the slightly more general result about
the tangent process to the solution (X st,x )s∈[t,T ] starting from x at time t. This pro-
cess, denoted by (Y (t, x)s )s≥t , can be deduced from Y (x) by the following inverted
transition formula
∀ s ≥ t, Ys (t, x) = Ys (x)[Yt (x)]−1 .
Thus, in the Black–Scholes model,

(d/dx) X_t^x = Y_t(x) = X_t^x / x.
Exercise. Let d = q = 1. Show that, under the assumptions of Theorem 10.1, the
tangent process at x, Y_t(x) = (d/dx) X_t^x, satisfies

sup_{s,t∈[0,T]} Y_t(x)/Y_s(x) ∈ L^p(P),  p ∈ (0, +∞).
Applications to δ-hedging
The tangent process and the δ hedge are closely related. Assume for convenience
that the interest rate is 0 and that a basket is made up of d risky assets whose price
dynamics (X tx )t∈[0,T ] , X 0x = x ∈ (0, +∞)d , is a solution to (10.11).
Then the premium of the payoff h(X Tx ) on the basket is given by
f (x) := E h(X Tx ).
The δ-hedge vector of this option (at time 0 and) at x = (x 1 , . . . , x d ) ∈ (0, +∞)d
is given by ∇ f (x).
We have the following proposition that establishes the existence and the repre-
sentation of f (x) as an expectation (with in view its computation by a Monte Carlo
simulation). It is a straightforward application of Theorem 2.2(b).
Then f is differentiable at x and its gradient ∇f(x) = ( ∂f/∂x_i (x) )_{1≤i≤d} admits the
following representation as an expectation:

∂f/∂x_i (x) = E[ ( ∇h(X_T^x) | ∂X_T^x/∂x_i ) ],  i = 1, …, d.   (10.13)
Remark. One can also consider the payoff of a forward start option
h(X_{T₁}^x, …, X_{T_N}^x), 0 < T₁ < ⋯ < T_N, where h : (ℝ^d)^N → ℝ₊. Then, under similar a.s. differentiability assumptions on h (with respect to P_{(X_{T₁},…,X_{T_N})}) and uniform
integrability, its premium f(x) is differentiable and

∂f/∂x_i (x) = E[ Σ_{1≤j≤N} ( ∇_{x_j} h(X_{T₁}^x, …, X_{T_N}^x) | ∂X_{T_j}^x/∂x_i ) ].
In one dimension, one can take advantage of the semi-closed formula (10.12)
obtained in the above exercise for the tangent process.
Extension to an exogenous parameter θ
All theoretical results obtained for the δ, i.e. for the differentiation of the flow of an
SDE with respect to its initial value, can be, at least first formally, extended to any
parameter provided no ellipticity is required. This follows from the remark that if
5 Ifh is Lipschitz continuous, or even locally Lipschitz with polynomial growth (|h(x) − h(y)| ≤
C|x − y|(1 + |x|r + |y|r ) r > 0) and X x is a solution to an SDE with Lipschitz continuous coef-
ficients b and ϑ in the sense of (7.2), this uniform integrability is always satisfied: it follows from
Theorem 7.10 applied with p > 1.
10.2 Pathwise Differentiation Method 489
in [176], if (θ, x) → b(θ, x) and (θ, x) → ϑ(θ, x) are Cbk+α (0 < α < 1) with respect
to x and θ (6 ) then the solution of the SDE at a given time t will be C k+β (for every
β ∈ (0, α)) as a function of (x and θ). A more specific approach would show that
some regularity in the sole variable θ would be enough, but then this result does not
follow for free from the general theorem of the differentiability of the flows.
Assume b = b(θ, ·) and ϑ = ϑ(θ, ·) and the initial value x = x(θ), θ ∈ Θ, Θ ⊂ ℝ^q
an open set. One can also differentiate an SDE with respect to this parameter θ.
We can assume that q = 1 (by considering a partial derivative if necessary). Then,
one gets

∂X_t(θ)/∂θ = ∂x(θ)/∂θ + ∫₀^t [ (∂b/∂θ)(θ, X_s(θ)) + (∂b/∂x)(θ, X_s(θ)) ∂X_s(θ)/∂θ ] ds
    + ∫₀^t [ (∂ϑ/∂θ)(θ, X_s(θ)) + (∂ϑ/∂x)(θ, X_s(θ)) ∂X_s(θ)/∂θ ] dW_s.
ℵ Practitioner’s corner
Once coupled with the original diffusion process X^x, this yields some expressions for the sensitivity with respect to the parameter θ, possibly closed, but usually computable by a Monte Carlo simulation of the Euler scheme of the couple
( X_t(θ), ∂X_t(θ)/∂θ ).
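A minimal sketch of such a coupled Euler scheme follows, for the special case θ = x (the δ). The diffusion coefficients, the call payoff and the parameter values (K, r, σ, T, the step number n) are assumptions of this illustration; in this linear (Black–Scholes-type) case the tangent SDE reads dY = b'(X) Y dt + ϑ'(X) Y dW with Y₀ = 1.

```python
import math
import random

def euler_tangent_path(x, b, db_dx, vth, dvth_dx, T=1.0, n=25, rng=None):
    # Euler scheme for the couple (X_t, Y_t), Y_t = dX_t/dx, driven by the SAME
    # Brownian increments: dX = b(X)dt + vth(X)dW, dY = b'(X)Y dt + vth'(X)Y dW.
    rng = rng or random.Random()
    h = T / n
    X, Y = x, 1.0
    for _ in range(n):
        dW = rng.gauss(0.0, math.sqrt(h))
        X, Y = (X + b(X) * h + vth(X) * dW,
                Y + db_dx(X) * Y * h + dvth_dx(X) * Y * dW)
    return X, Y

def pathwise_delta_call(x, K=100.0, r=0.05, sigma=0.2, T=1.0,
                        M=30_000, seed=5):
    # delta ~ E[ e^{-rT} h'(X_T) Y_T ] with h a call payoff, h'(xi) = 1_{xi >= K}
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(M):
        XT, YT = euler_tangent_path(x,
                                    lambda u: r * u, lambda u: r,
                                    lambda u: sigma * u, lambda u: sigma,
                                    T=T, n=25, rng=rng)
        if XT >= K:
            acc += math.exp(-r * T) * YT
    return acc / M

delta_tp = pathwise_delta_call(100.0)
```

The simultaneous tuple update keeps X and Y evaluated at the same time step, which is the essential point of simulating the couple.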
As a conclusion, let us mention that this tangent process approach is close to the
finite difference method applied to F(x, ω) = h(X_T^x(ω)) in the “regular setting”: it
appears as a limit case of the finite difference method. Their behaviors are similar,
except that, for a given size of Monte Carlo simulation, the tangent process method has
a lower computational complexity. The only constraint is that it is more demanding
in terms of “preliminary” human – hence expensive – calculations. This is no longer
true since the recent emergence (in quantitative finance) of Automatic Differentiation
(AD) as described by Griewank in [135–137], Naumann in [215] and Hascoet in [144],
which, once included in the types of the variables, spares much time prior to and during
the simulation itself (at least for the adjoint version of AD).
6 A function g has C_b^{k+α} regularity if g is C^k with k-th order partial derivatives globally α-Hölder
and if all its partial derivatives up to order k are bounded.
The approach based on the tangent process clearly requires smooth payoff functions,
typically almost everywhere differentiable, as emphasized above (and in Sect. 2.2).
Unfortunately, this assumption is not fulfilled by many of the usual payoff functions,
like the digital payoff
In Sect. 2.2, we considered a family of random vectors X(θ) indexed by a parameter
θ ∈ Θ, an open interval of ℝ, all having an a.e. positive probability density p(θ, y)
with respect to a reference non-negative measure μ on ℝ^d. Assume this density is
positive on a domain D of ℝ^d for every θ ∈ Θ, where D does not depend on θ. Then
one can derive the sensitivity with respect to θ of functions of the form

f(θ) := E[ ϕ(X(θ)) ] = ∫_D ϕ(y) p(θ, y) μ(dy),

namely

∀ θ ∈ Θ,  f'(θ) = E[ ϕ(X(θ)) (∂ log p/∂θ)(θ, X(θ)) ].
We considered an autonomous SDE mainly for simplicity and to get nice formulas,
but the extension to general SDEs is just a matter of notation.
Let p_T(θ, x, y) and p̄_T(θ, x, y) denote the densities of X_T^x(θ) and of its discrete time
Euler scheme ( X̄_{t_k^n}^x(θ) )_{0≤k≤n} with step size T/n (the superscript n is dropped until the end
of the section). Then one may naturally propose the following naive approximation:

f'(θ) = E[ ϕ(X_T^x(θ)) (∂ log p_T/∂θ)(θ, x, X_T^x(θ)) ]
      ≈ E[ ϕ(X̄_T^x(θ)) (∂ log p̄_T/∂θ)(θ, x, X̄_T^x(θ)) ].
Proposition 10.8 (Density of the Euler scheme) Let q ≥ d. Assume ϑϑ*(θ, x) ∈
GL(d, ℝ) for every x ∈ ℝ^d and θ ∈ Θ.
(a) Then the distribution P_{X̄_{T/n}^x}(dy) of X̄_{T/n}^x is absolutely continuous with respect to
the Lebesgue measure, with a probability density given by

p̄_{T/n}(θ, x, y) = exp( −(n/(2T)) ( (ϑϑ*(θ, x))^{−1} (y − x − (T/n) b(θ, x)) | y − x − (T/n) b(θ, x) ) )
                   / ( (2πT/n)^{d/2} √( det ϑϑ*(θ, x) ) ).

(b) The distribution P_{(X̄_{t₁^n}^x, …, X̄_{t_k^n}^x, …, X̄_{t_n^n}^x)}(dy₁, …, dy_n) of the n-tuple (X̄_{t₁^n}^x, …, X̄_{t_n^n}^x) has density

p̄_{t₁^n,…,t_n^n}(θ, x, y₁, …, y_n) = Π_{k=1}^n p̄_{T/n}(θ, y_{k−1}, y_k)   (convention: y₀ = x).
Proof. (a) is a straightforward consequence of the first step of the definition (7.5) of
the Euler scheme at time T/n applied to the diffusion (10.14) and of the standard formula
for the density of a Gaussian vector, since ϑ(θ, x)ϑ*(θ, x) is invertible. Claim (b)
follows from an easy induction based on the Markov property satisfied by the Euler
scheme. First, it is clear, again from the recursion satisfied by the discrete time Euler
scheme of (10.14) (adapted from (7.5)), that

L( X̄_{t_{k+1}^n}^x(θ) | F_{t_k^n}^W ) = L( X̄_{t_{k+1}^n}^x(θ) | X̄_{t_k^n}^x(θ) ),  k = 0, …, n−1,

and that

L( X̄_{t_{k+1}^n}^x(θ) | X̄_{t_k^n}^x(θ) = y_k ) = L( X̄_{t₁^n}^x(θ) | X̄_0^x(θ) = y_k ) = p̄_{T/n}(θ, y_k, y) λ_d(dy).

One concludes by an easy induction that, for every bounded Borel function g :
(ℝ^d)^k → ℝ₊,

E[ g( X̄_{t₁^n}^x, …, X̄_{t_k^n}^x ) ] = ∫_{(ℝ^d)^k} g(y₁, …, y_k) Π_{ℓ=1}^k p̄_{T/n}(θ, y_{ℓ−1}, y_ℓ) dy₁ ⋯ dy_k

with the convention y₀ = x, using the Markov property satisfied by the discrete time
Euler scheme. ♦
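For a one-step Euler transition in d = 1, the log-density is an explicit Gaussian expression, and a basic consistency check of the log-likelihood approach is the score identity E[ (∂/∂θ) log p̄(θ, x, X̄) ] = 0 when X̄ is sampled from p̄(θ, x, ·). The sketch below uses hypothetical toy coefficients b(θ, x) = θx and ϑ(θ, x) = 0.2x (assumptions of this illustration only):

```python
import math
import random

def euler_step_logdensity(y, x, theta, h):
    # Log of the one-step Euler transition density: Gaussian with mean
    # x + h*b(theta, x) and variance h*vartheta(theta, x)^2, with the toy
    # coefficients b(theta, x) = theta*x and vartheta(theta, x) = 0.2*x.
    m = x + h * theta * x
    s2 = h * (0.2 * x) ** 2
    return -0.5 * math.log(2 * math.pi * s2) - (y - m) ** 2 / (2 * s2)

def score_mean(theta, x=1.0, h=0.1, M=100_000, seed=9):
    # Monte Carlo check of E[ d/dtheta log pbar(theta, x, Xbar) ] = 0 when
    # Xbar ~ pbar(theta, x, .); the theta-derivative is taken by central
    # finite differences (exact here since log pbar is quadratic in theta).
    rng = random.Random(seed)
    dtheta = 1e-4
    acc = 0.0
    for _ in range(M):
        y = x + h * theta * x + 0.2 * x * math.sqrt(h) * rng.gauss(0.0, 1.0)
        acc += (euler_step_logdensity(y, x, theta + dtheta, h)
                - euler_step_logdensity(y, x, theta - dtheta, h)) / (2 * dtheta)
    return acc / M

s_hat = score_mean(0.05)
```

The empirical mean of the score is zero up to the Monte Carlo error, which is the identity underlying the weighted (log-likelihood) sensitivity estimators.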
The above proposition proves that every marginal X̄_{t_k^n}^x has a density which, unfortunately, cannot be made explicit. Moreover, to take advantage of the above closed
form for the density of the n-tuple, we can write
At this stage it is clear that the method also works for path-dependent options,
i.e. when considering F(θ) = E[ Φ( (X_t(θ))_{t∈[0,T]} ) ] instead of f(θ) = E[ ϕ(X_T(θ)) ]
(at least for specific functionals involving time averaging, a finite number of
instants, the supremum, the infimum, etc.). This raises new difficulties in connection with
the Brownian bridge method for diffusions that need to be overcome.

10.3 Sensitivity Computation for Non-smooth Payoffs: The Log-Likelihood Approach (II) 493
Finally, let us mention that evaluating the rate of convergence of these approx-
imations from a theoretical point of view is quite a challenging problem since it
involves not only the rate of convergence of the Euler scheme itself, but also that of
the probability density functions of the scheme toward that of the diffusion (see [25]
where this problem is addressed).
Exercise. Apply the preceding to the case θ = x (starting value) when d = 1.
Comments. In fact, there is a way to get rid of non-smooth payoffs by regularizing
them over one time step before maturity and then applying the tangent process method.
Such an approach has been developed by M. Giles under the name of Vibrato Monte
Carlo, which appears as a kind of degenerate nested Monte Carlo combined with a
pathwise differentiation based on the tangent process (see [108]).
In this section, for the sake of simplicity, we assume that d = 1 and q = 1 (scalar
Brownian motion).
Theorem 10.2 (Bismut’s formula) Let W = (Wt )t∈[0,T ] be a standard Brownian
motion on a probability space (, A, P) and let F W := (FtW )t∈[0,T ] be its augmented
natural filtration. Let X x = (X tx )t∈[0,T ] be a diffusion process, unique (Ft )t∈[0,T ] -
adapted solution to the autonomous SDE
7 This means that for every t ∈ [0, T ], (Hs (ω))(s,ω)∈[0,t]× is Bor ([0, t]) ⊗ Ft -measurable.
E[ f(X_T^x) ∫₀^T H_s dW_s ] = E[ f'(X_T^x) Y_T ∫₀^T ( ϑ(X_s^x) H_s / Y_s ) ds ],

where Y_t = dX_t^x/dx is the tangent process of X^x at x.
Proof (Sketch). To simplify the arguments, we will assume in the proof that
|H_t| ≤ C < +∞, where C is a real constant, and that f and f' are bounded functions.
Let ε ≥ 0. Set, on the probability space (Ω, F_T, P),

P^{(ε)} = L_T^{(ε)} · P,

where

L_t^{(ε)} = exp( −ε ∫₀^t H_s dW_s − (ε²/2) ∫₀^t H_s² ds ),  t ∈ [0, T].

Consequently (see [251], Theorem 1.11, p. 372), the process X^x has the same
distribution under P^{(ε)} as X^{(ε)}, the solution to

dX_t^{(ε)} = ( b(X_t^{(ε)}) − ε H_t ϑ(X_t^{(ε)}) ) dt + ϑ(X_t^{(ε)}) dW_t,  X_0^{(ε)} = x.
where we used once again Theorem 2.2 and the obvious fact that X^{(0)} = X.
Using the tangent process method with ε as an auxiliary variable, one derives that
the process U_t := ∂X_t^{(ε)}/∂ε |_{ε=0} satisfies

dU_t = (U_t/Y_t) dY_t − H_t ϑ(X_t^x) dt.   (10.15)

We know that Y_t is never 0, hence (up to some localization if necessary) we can apply
Itô's formula to the ratio U_t/Y_t: elementary computations of the partial derivatives of
the function (u, y) → u/y on ℝ × (0, +∞), combined with Eq. (10.15), show that

d(U_t/Y_t) = dU_t/Y_t − (U_t/Y_t²) dY_t + (1/2) ( −2 d⟨U, Y⟩_t / Y_t² + 2 U_t d⟨Y⟩_t / Y_t³ )
           = −( H_t ϑ(X_t^x) / Y_t ) dt + (1/2) ( −2 d⟨U, Y⟩_t / Y_t² + 2 U_t d⟨Y⟩_t / Y_t³ ).

Now, (10.15) implies

d⟨U, Y⟩_t = (U_t/Y_t) d⟨Y⟩_t,

which yields

d(U_t/Y_t) = −( ϑ(X_t^x) H_t / Y_t ) dt.

Noting that U₀ = dX₀^{(ε)}/dε = dx/dε = 0 finally leads to

U_t = −Y_t ∫₀^t ( ϑ(X_s) H_s / Y_s ) ds,  t ∈ [0, T],
then (∂/∂x) E[f(X_T^x)] appears as a weighted expectation of f(X_T^x), namely

(∂/∂x) E[f(X_T^x)] = E[ f(X_T^x) · (1/T) ∫₀^T ( Y_s / ϑ(X_s^x) ) dW_s ],   (10.16)

where the random variable (1/T) ∫₀^T ( Y_s / ϑ(X_s^x) ) dW_s is the weight.
Proof. We proceed like we did with the Black–Scholes model in Sect. 2.2: we first
assume that f is regular, namely bounded and differentiable with a bounded derivative.
Then, using the tangent process approach,

(∂/∂x) E[f(X_T^x)] = E[ f'(X_T^x) Y_T ],

still with Y_t = dX_t^x/dx. Then, we set

H_t = Y_t / ϑ(X_t^x),
which yields the announced result. The extension to continuous functions with poly-
nomial growth relies on an approximation argument (e.g. by convolution). ♦
Remarks. • One retrieves, in the case of a Black–Scholes model, the formula (2.8)
for the δ, obtained in Sect. 2.2 by an elementary integration by parts, since Y_t = X_t^x/x
and ϑ(x) = σx.
10.4 Flavors of Stochastic Variational Calculus 497

• Note that the assumption

∫₀^T ( Y_t / ϑ(X_t^x) )² dt < +∞

is essentially an ellipticity assumption. Thus, if ϑ²(x) ≥ ε₀ > 0, one checks that the
assumption is always satisfied.
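In the Black–Scholes case the weight in (10.16) collapses to a closed expression, since Y_s/ϑ(X_s^x) = 1/(σx): the weight is W_T/(σxT). A minimal sketch follows, with the payoff and parameter values (call payoff, K, r, σ, T) assumed for the sake of this illustration:

```python
import math
import random

def bismut_delta_bs(x, K=100.0, r=0.05, sigma=0.2, T=1.0,
                    M=200_000, seed=13):
    # In Black-Scholes, Y_s = X_s/x and vartheta(u) = sigma*u, so the weight
    # (1/T) int_0^T Y_s/vartheta(X_s) dW_s equals W_T/(sigma*x*T), giving
    # delta = E[ f(X_T) * W_T / (x*sigma*T) ] with f a (discounted) call payoff.
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(M):
        wT = rng.gauss(0.0, math.sqrt(T))
        xT = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * wT)
        acc += math.exp(-r * T) * max(xT - K, 0.0) * wT / (x * sigma * T)
    return acc / M

delta_b = bismut_delta_bs(100.0)
```

No derivative of the payoff is needed, which is the point of the weighted (Bismut/Malliavin) representation; the price to pay is a larger variance than the pathwise estimator, especially for short maturities.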
Exercises. 1. Apply the preceding to get a formula for the γ-parameter in a general
diffusion model. [Hint: Apply the above “derivative free” formula to the δ-hedge
formula obtained using the tangent process method.]
2. Show that if b' − b ϑ'/ϑ − (1/2) ϑ'' ϑ = c ∈ ℝ, then

(∂/∂x) E[ f(X_T^x) ] = ( e^{cT} / ϑ(x) ) E[ f(X_T^x) W_T ].
with Lipschitz continuous coefficients b and σ and X x = (X tx )t∈[0,T ] its unique (FtW )-
adapted solution starting at x, where (Ft )t∈[0,T ] is the (augmented) filtration of the
Brownian motion W . We admit the following theorem, stated in a one-dimensional
setting for (at least notational) convenience.
where Y (t) is the tangent process of X x at time t, the solution to the SDE
The simplest interpretation (and original motivation!) is that the Malliavin derivative
is a derivative with respect to the Brownian path of W (viewed at time t). It can be
easily understood owing to the following formal chain rule of differentiation

D_t F(X^x) = ( ∂F(X^x)/∂X_t^x ) × ( ∂X_t^x/∂W ).
If one notes that, for any s ≥ t, X_s^x = X_s^{X_t^x, t}, the first term “∂F(X^x)/∂X_t^x” in the above
product is clearly equal to DF(X)·(1_{[t,T]} Y_·^{(t)}), whereas the second term is the result
of a formal differentiation of the SDE at time t with respect to W, namely ϑ(X_t^x).
An interesting feature of this derivative in practice is that it satisfies the usual
chain rules like

D_t F²(X^x) = 2 F(X^x) D_t F(X^x),

etc.
What is called Malliavin calculus is a way to extend this notion of differentiation
to more general functionals using functional analysis arguments (closure of
operators, domain of the operator D, etc.) (see e.g. [16, 208]).
Using the Haussman–Clark–Occone formula to get Bismut's formula
As a first conclusion we will show that the Haussman–Clark–Occone formula contains the Bismut formula. Let X^x, H, f and T be as in Sect. 10.4.1. We consider the
two true martingales

M_t = ∫₀^t H_s dW_s  and  N_t = E[f(X_T^x)] + ∫₀^t E[ f'(X_T) Y_T^{(s)} | F_s ] dW_s,  t ∈ [0, T].
For practical implementation, one should be aware that the weighted estimators of
sensitivities, obtained by log-likelihood methods (exact, see Sect. 2.2.3, or approx-
imate, see Sect. 10.3.1), by Bismut’s formula or more general Malliavin inspired
methods, often suffer from high variance, especially for short maturities.
This phenomenon can be observed in particular when such a formula coexists with
a pathwise differentiated formula involving the tangent process for smooth enough
payoff functions.
This can be easily understood on the formula for the δ-hedge when the maturity
is small (see the toy-example in Sect. 3.1).
Consequently, weighted formulas for sensitivities need to be speeded up by effi-
cient variance reduction methods. The usual approach, known as localization, is to
isolate the singular part (where differentiation does not apply) from the smooth part.
Let us illustrate the principle of localization functions on a very simple toy-
example (d = 1). Assume that, for every ε > 0,
Functions ϕ can be obtained as mollifiers in convolution theory but other choices are
possible, like simply Lipschitz continuous functions (see the numerical illustration
in Sect. 10.4.4). Set f_i(x) = E[F_i(x, Z)] so that f(x) = f₁(x) + f₂(x). Then one
may use a direct differentiation to compute

f₁'(x) = E[ ∂F₁(x, Z)/∂x ].
(Figures 10.1 and 10.2: plots of the payoff and of its smoothed version, for the underlying price ranging from 35 to 65.)
reproduced in Figs. 10.1 and 10.2 (pay attention to the scales of the y-axis in each
figure).
Let F(x, z) = e^{−rT} h₁( x e^{(r−σ²/2)T + σ√T z} ) in the digital Call case and let F(x, z) =
e^{−rT} h_S( x e^{(r−σ²/2)T + σ√T z} ) in the Asset-or-Nothing Call case. We define f(x) = E[F(x, Z)],
where Z is a standard Gaussian variable. With both payoff functions, we are in the
singular setting in which F(·, Z) is not Lipschitz continuous but only ½-Hölder
in L² (see the singular setting in Sect. 10.1.1), whereas f is C^∞ on (0, +∞). We are
interested in computing the delta (δ-hedge) of the two options, i.e. f'(x).
In such a singular case, the variance of the finite difference estimator explodes as
ε → 0 (see Proposition 10.2) and ξ → F(ξ, Z ) is not L p -differentiable for p ≥ 1
so that the tangent process approach cannot be used (see Sect. 10.2.2).
(Figures 10.3 and 10.4: variance of the crude finite difference estimator and of the RR extrapolation as a function of ε ∈ (0, 1).)
We first illustrate the variance explosion in Figs. 10.3 and 10.4 where the param-
eters have been set to r = 0.04, σ = 0.1, T = 1/12 (one month), x0 = K = 50 and
M = 106 .
To avoid the explosion of the variance, one considers a smooth (namely Lipschitz
continuous) approximation of both payoffs. Given a small parameter η > 0, one
defines

h_{1,η}(ξ) = h₁(ξ)  if |ξ − K| > η,
h_{1,η}(ξ) = (1/(2η)) ξ + (1/2)(1 − K/η)  if |ξ − K| ≤ η,

and

h_{S,η}(ξ) = h_S(ξ)  if |ξ − K| > η,
h_{S,η}(ξ) = ((K+η)/(2η)) ξ + ((K+η)/2)(1 − K/η)  if |ξ − K| ≤ η.
We then consider the two finite difference estimators

f̂'(x)_{ε,M} = (1/M) Σ_{k=1}^M ( F(x+ε, Z_k) − F(x−ε, Z_k) ) / (2ε)

and

f̂'_η(x)_{ε,M} = (1/M) Σ_{k=1}^M ( F_η(x+ε, Z_k) − F_η(x−ε, Z_k) ) / (2ε).
The extrapolation is done using a linear combination of the estimators with steps
ε and ε/2, respectively. As

E[ f̂'(x)_{ε,M} ] = f'(x) + ( f⁽³⁾(x)/6 ) ε² + ( f⁽⁵⁾(x)/5! ) ε⁴ + O(ε⁶),

we easily check that the combination that kills the first term of this bias (in ε²) is
(4/3) f̂'(x)_{ε/2,M} − (1/3) f̂'(x)_{ε,M} (see Exercise 4 in Sect. 10.1.1). The same holds for f_η.
Then, as in the proofs of Propositions 10.1 (with θ = ½) and 10.2, we prove
that

‖ f'(x) − ( (4/3) f̂'(x)_{ε/2,M} − (1/3) f̂'(x)_{ε,M} ) ‖₂ = O(ε⁴) + O( 1/√(εM) ),   (10.18)

and

‖ f'_η(x) − ( (4/3) f̂'_η(x)_{ε/2,M} − (1/3) f̂'_η(x)_{ε,M} ) ‖₂ = O(ε⁴) + O( 1/√M ).   (10.19)
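The Richardson–Romberg combination (4/3) f̂'_{ε/2,M} − (1/3) f̂'_{ε,M} can be sketched as follows. For testability this sketch uses a Lipschitz vanilla call payoff rather than the singular payoffs of this section, so that the result can be checked against a closed-form benchmark; the parameter values (K = 50, r = 0.04, σ = 0.3, T = 1/52, matching those of the numerical experiment) are assumptions of the illustration:

```python
import math
import random

def call_F(x, z, K=50.0, r=0.04, sigma=0.3, T=1.0 / 52.0):
    # F(x, z) = e^{-rT} h(x e^{(r - sigma^2/2)T + sigma sqrt(T) z}), call payoff
    s = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s - K, 0.0)

def rr_fd_delta(x, eps, M, seed=17):
    # Richardson-Romberg combination (4/3) fhat'_{eps/2,M} - (1/3) fhat'_{eps,M}
    # using the SAME innovations Z_k for both steps to control the variance.
    rng = random.Random(seed)
    acc_eps, acc_half = 0.0, 0.0
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        acc_eps += (call_F(x + eps, z) - call_F(x - eps, z)) / (2 * eps)
        acc_half += (call_F(x + eps / 2, z) - call_F(x - eps / 2, z)) / eps
    return (4 * acc_half - acc_eps) / (3 * M)

delta_rr = rr_fd_delta(50.0, eps=0.5, M=100_000)
```

Reusing the innovations for both steps keeps the variance of the combined estimator comparable to that of the crude one while killing the ε² bias term.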
The control of the variance in the smooth case is illustrated in Figs. 10.5 and 10.6
when η = 1, and in Figs. 10.7 and 10.8 when η = 0.5. The variance increases when
η decreases to 0 but does not explode as ε goes to 0.
For a given ε, note that the variance is usually higher using the standard RR
extrapolation. However, in the Lipschitz continuous case the variance of the RR-
estimator and that of the crude finite difference converge toward the same value
when ε goes to 0. Moreover, from (10.18) we deduce the choice ε = O(M −1/9 ) to
keep the balance between the bias term of the RR estimator and the variance term.
As a consequence, for a given level of the L 2 -error, we can choose a bigger ε with
the RR estimator, which reduces the bias without increasing the variance.
The parameters of the model are the following: x₀ = 50, K = 50, r = 0.04, σ =
0.3 and T = 1/52 (one week). The size of the Monte Carlo simulation is fixed at
M = 10⁶. We now compare the following estimators in the two cases:
Fig. 10.5 Digital Call: Variance of the two estimators (crude and RR extrapolation) as a function of ε. T = 1/52. Smoothed payoff with η = 1
Fig. 10.6 Asset-or-nothing Call: Variance of the two estimators (crude and RR extrapolation) as a function of ε, T = 1/52. Smoothed payoff with η = 1
≈ 0.0631, since the RMSE has the form f^{(3)}(x) ε² + O(1/√(εM)) (see Proposition 10.1 with θ = 1/2).
• Finite difference estimator with RR-extrapolation on the non-smooth payoffs with ε = M^{−1/9} ≈ 0.2154 (based on (10.18)).
• Crude weighted estimator (with the standard Black–Scholes δ-weight (2.7)) on the non-smooth payoffs h_1 and h_S.
• Localization (smoothing): Finite difference estimator on the smooth payoffs h_{1,η} and h_{S,η} with η = 1 and ε = M^{−1/4}, combined with the weighted estimator on the
Fig. 10.7 Digital Call: Variance of the two estimators (crude and RR extrapolation) as a function of ε, T = 1/52. Smoothed payoff with η = 0.5
Fig. 10.8 Asset-or-nothing Call: Variance of the two estimators (crude and RR extrapolation) as a function of ε, T = 1/52. Smoothed payoff with η = 0.5
Table 10.2 Asset-or-nothing Call: results with T = 1/52 (one week)
Estimator Mean Variance 95% CI
Finite difference 10.156 3736.2 [10.036; 10.276]
FD + RR 10.094 3001.1 [9.9865; 10.201]
Weighted estimator 10.101 227.98 [10.071; 10.131]
Loc. FD + WE 10.085 143.74 [10.061; 10.11]
Loc. FD + WE + RR 10.078 142.4 [10.054; 10.103]
The results are summarized in Tables 10.1 (delta of the digital Call option) and 10.2 (delta of the Asset-or-Nothing Call option).
In practice, the variance of these estimators can be improved by appropriate control
variates, not implemented in this numerical experiment.
Chapter 11
Optimal Stopping, Multi-asset
American/Bermudan Options
11.1 Introduction
This chapter is devoted to probabilistic numerical methods for solving optimal stop-
ping problems, in particular the pricing of multi-asset American and Bermudan
options. However, we do not propose an in-depth presentation of optimal stopping
theory in discrete and continuous time, which would be far beyond the scope of this
monograph. When necessary, we will refer to reference books or papers on this topic,
and many important theoretical results will be stated without proof throughout this chapter. For
an introduction to optimal stopping theory in discrete time we refer to [217] or more
recently [183] and, for a comprehensive overview of the whole theory, including
continuous time models and its applications to the pricing of American derivatives,
we recommend Shiryaev’s book [182, 262], which we will often refer to. On our
side, we will focus on time and space discretization aspects in Markovian models,
usually in connection with Brownian diffusion processes.
We will first explain how one can approximate a continuous time optimal stop-
ping problem, namely for Brownian diffusions, by a discrete time optimal stopping
problem for Rd -valued Markov chains. The specificity of direct Markovian optimal
stopping problems is that the quantity of interest (the Snell envelope) satisfies a
backward Dynamical Programming Principle (BDPP) which is a major step toward
an operating numerical scheme. Once this time discretization phase is achieved, we
will present two different methods of space discretization which allow us to imple-
ment approximations of the BDPP: Least squares regression methods, also known
as American Monte Carlo and optimal quantization methods (quantization trees).
As it does not add any specific difficulty, we will consider a general diffusion
framework (7.1), namely
\[
dX_t=b(t,X_t)\,dt+\sigma(t,X_t)\,dW_t,\qquad X_0\in L^r(\mathbb P),
\]
where W is a q-dimensional standard Brownian motion.
1 This statement should be considered with caution since the traded asset may not correspond to
what is quoted: thus in the case of Foreign Exchange, the quoted process is the exchange rate itself
whereas the risky asset in arbitrage theory is the foreign bond quoted in domestic currency, i.e. the
value of the foreign bond (of an appropriate maturity) multiplied by the exchange rate. Thus, if r F
denotes the foreign interest rate, r should be replaced by r − r F in the dynamics of the exchange
rate. A similar situation occurs with a stock continuously distributing dividends at an exponential
rate λ: r should be replaced by r − λ, or with a future contract where r should be replaced by 0 in
the dynamics of the quoted price.
² F_t = σ(N_P, X_s, s ∈ [0, t]).
them. In mathematical terms, this means that admissible stopping strategies τ are
those which satisfy that the decision to stop between 0 and t is Ft -measurable, i.e.
∀ t ∈ [0, T ], {τ ≤ t} ∈ Ft .
In other words, τ is a [0, T ]-valued (Ft )-stopping time. Note that simply asking
{τ = t} to lie in Ft (the player decides to stop exactly at time t) may seem more
natural as an assumption, but, if it is the right notion in discrete time (equivalent to
the above condition), in continuous time, this simpler condition turns out to be too
loose to devise a consistent theory.
Exercise. Show that for any F-stopping time τ one has, for every t ∈ [0, T ],
{τ = t} ∈ Ft (the converse is not true in general in a continuous time setting).
For the sake of simplicity, we will assume that the instantaneous interest rate is constant, i.e. the interest rate curve is flat, equal to r, so that
the discounting factor at time t is given by e−r t (this also has consequences on the
dynamics of risky assets under a risk neutral probability, see further on). At this stage,
the supremum of the mean gain over all admissible stopping rules at each time is the
best the player can hope for. Concerning measurability, it is not possible to consider
such a supremum in a naive way. It should be replaced by a P-essential supremum
(see Sect. 12.9 in the Miscellany Chapter). This quantity, defined below, is known as
the Snell envelope.
It is clear from Theorem 7.2 that, under the above assumptions made on b and σ, if X_0 ∈ L^r(P) for some r ≥ 1, then \(\sup_{t\in[0,T]}|X_t|\in L^r(\mathbb P)\), so that as soon as the Borel function f : [0, T] × R^d → R_+ has r-polynomial growth in x uniformly in t ∈ [0, T], that is, 0 ≤ f(t, x) ≤ C(1 + |x|^r), then \(\sup_{t\in[0,T]}f(t,X_t)\in L^r(\mathbb P)\). In particular, for any random variable τ : (Ω, A) → [0, T], 0 ≤ f(τ, X_τ) ≤ \(\sup_{t\in[0,T]}f(t,X_t)\in L^r(\mathbb P)\).
Definition 11.1 (a) The (F_t)-adapted process (f(t, X_t))_{t∈[0,T]} is called the payoff process of the game (or the American payoff). Its discounted version, denoted by Z_t, reads (³)
\[
Z_t=e^{-rt}f(t,X_t),\quad t\in[0,T].
\]
(b) Assume that X_0 ∈ L^r(P) and f has r-polynomial growth. The (F, P)-Snell envelope of the discounted payoff (Z_t)_{t∈[0,T]} is defined for every t ∈ [0, T] by
\[
U_t=\mathbb P\text{-}\mathrm{esssup}\bigl\{\mathbb E\bigl(Z_\tau\,|\,\mathcal F_t\bigr),\ \tau\in\mathcal T^F_{t,T}\bigr\}
=\mathbb P\text{-}\mathrm{esssup}\bigl\{\mathbb E\bigl(e^{-r\tau}f(\tau,X_\tau)\,|\,\mathcal F_t\bigr),\ \tau\in\mathcal T^F_{t,T}\bigr\}<+\infty\quad\mathbb P\text{-a.s.}\tag{11.2}
\]
where \(\mathcal T^F_{t,T}\) denotes the set of [t, T]-valued (F_s)_{s∈[0,T]}-stopping times.
³ In this chapter, to preserve the internal consistency of the notations, we will not use a tilde to denote discounted payoffs or prices.
Then, if
X_0 = x, the Snell envelope of (Z_t)_{t∈[0,T]} satisfies U_t = e^{-rt} F(t, X_t^x), t ∈ [0, T], for a (value) function F : [0, T] × R^d → R_+.
Remark. • If the components X i of X are positive such that e−r t X ti are (P, Ft )-
martingales, and if the representation theorem of (FtW )-martingales as stochastic
integrals with respect to W can be reformulated as a representation theorem with
respect to the above (discounted) processes X ti , then the diffusion (X t )t∈[0,T ] is an
abstract model of complete financial market where X i are the quoted prices of the
risky assets. The last representation assumption is essentially satisfied under uniform
ellipticity of the diffusion coefficient of the SDE satisfied by the log-risky assets. In
such a model, one shows by hedging arguments that F(0, x) is the premium of an
American option with payoff f (t, X tx ), t ∈ [0, T ]. We refer to [182] for a detailed
presentation with a focus on a model of stock market distributing dividends (see
also Sect. 11.1.2 for a few more technical aspects).
where ∂f/∂x_i(t, x) denotes the right derivative with respect to x_i at (t, x).
2. If f(t, ·) ∈ C¹(R^d, R) and ∇_x f(t, ·) is Lipschitz continuous in x, uniformly in t ∈ [0, T], then
\[
f(t,y)-f(t,x)\ \ge\ \bigl(\nabla_x f(t,x)\,|\,y-x\bigr)-[\nabla_x f]_{\mathrm{Lip}}\,|\xi-x|\,|y-x|,\qquad \xi\in[x,y],
\]
\[
\phantom{f(t,y)-f(t,x)}\ \ge\ \bigl(\nabla_x f(t,x)\,|\,y-x\bigr)-[\nabla_x f]_{\mathrm{Lip}}\,|y-x|^2,
\]
modeling a financial market, one usually assumes that the quoted prices of the d risky assets X_t^i, i = 1, …, d, are non-negative, which has important consequences on the possible choices for b and σ. As we assume that all components are observable, our framework corresponds to a multi-asset local volatility model. To ensure the positivity of the risky assets, a natural condition is to assume that
\[
\forall\,t\in[0,T],\quad X_t^i=x_0^i\,\exp\Bigl(rt-\frac12\int_0^t|\vartheta_{i\cdot}(s,X_s)|^2\,ds+\int_0^t\bigl(\vartheta_{i\cdot}(s,X_s)\,|\,dW_s\bigr)_q\Bigr),\qquad x_0^i>0.
\]
where \(\mathcal T^F_{k:n}=\bigl\{\tau:(\Omega,\mathcal A)\to\{k,\dots,n\},\ \tau\ (\mathcal F_\ell)_{0\le\ell\le n}\text{-stopping time}\bigr\}\), and its sequence of values (E U_k)_{0≤k≤n}.
Note that U_k ≥ Z_k a.s. for every k ∈ {0, …, n}.
Proposition 11.2 (Backward Dynamic Programming Principle (BDPP)) (a) Pathwise Backward Dynamic Programming Principle. The Snell envelope (U_k)_{0≤k≤n} satisfies the following Backward Dynamic Programming Principle (⁵)
\[
U_n=Z_n,\qquad U_k=\max\bigl(Z_k,\ \mathbb E(U_{k+1}\,|\,\mathcal F_k)\bigr),\quad k=0,\dots,n-1.
\]
(b) Optimal stopping times. The sequence of stopping times (τ_k)_{0≤k≤n} defined by
\[
\tau_k:=\min\bigl\{\ell\in\{k,\dots,n\}:\ U_\ell=Z_\ell\bigr\},\quad k=0,\dots,n,
\]
is optimal, i.e. U_k = E(Z_{τ_k} | F_k) a.s., k = 0, …, n; equivalently, τ_k coincides a.s. with the stopping time θ_k defined by the backward recursion
\[
\theta_n:=n,\qquad \theta_k:=k\,\mathbf 1_{\{Z_k\ge\mathbb E(Z_{\theta_{k+1}}|\mathcal F_k)\}}+\theta_{k+1}\,\mathbf 1_{\{Z_k<\mathbb E(Z_{\theta_{k+1}}|\mathcal F_k)\}},\quad k=0,\dots,n-1.\tag{11.8}
\]
Moreover, U_k = u_k(X_k), k = 0, …, n, where
\[
u_n=f_n\quad\text{and}\quad u_k=\max\bigl(f_k,\ P_ku_{k+1}\bigr),\quad k=0,\dots,n-1.
\]
⁵ Let us recall that this means that, for every k ∈ {0, …, n−1} and every bounded or non-negative Borel function f : R^d → R, E(f(X_{k+1}) | F_k) = P_k f(X_k).
The fact that θ_k is a stopping time is left as an exercise, since {Z_k ≥ E(Z_{θ_{k+1}} | F_k)} ∈ F_k, {Z_k < E(Z_{θ_{k+1}} | F_k)} = {Z_k ≥ E(Z_{θ_{k+1}} | F_k)}^c ∈ F_k and θ_{k+1} is an (F_ℓ)-stopping time.
As a first step we will prove (a) and (b) where X k is replaced by Fk in the condi-
tioning.
We proceed by a backward induction on the discrete instant k. If k = n, \(\mathcal T^F_{k:n}\) is reduced to {n}, so that U_n = E(Z_n | F_n) = Z_n. Now, let k ∈ {0, …, n−1} and let θ ∈ \(\mathcal T^F_{k:n}\). Set θ̃ = θ ∨ (k+1), so that θ = k 1_{{θ=k}} + θ̃ 1_{{θ≥k+1}}. The random variable θ̃ is a {k+1, …, n}-valued (F_ℓ)-stopping time since {θ ≥ k+1} = {θ = k}^c ∈ F_k. Then, using the chain rule for conditional expectation in the third line and the definition of U_{k+1} in the fourth line, we obtain
\[
\mathbb E\bigl(f_\theta(X_\theta)\,|\,\mathcal F_k\bigr)=f_k(X_k)\mathbf 1_{\{\theta=k\}}+\mathbb E\bigl(f_{\tilde\theta}(X_{\tilde\theta})\mathbf 1_{\{\theta\ge k+1\}}\,|\,\mathcal F_k\bigr)
\]
\[
=f_k(X_k)\mathbf 1_{\{\theta=k\}}+\mathbb E\bigl(f_{\tilde\theta}(X_{\tilde\theta})\,|\,\mathcal F_k\bigr)\mathbf 1_{\{\theta\ge k+1\}}
\]
\[
=f_k(X_k)\mathbf 1_{\{\theta=k\}}+\mathbb E\bigl(\mathbb E\bigl(f_{\tilde\theta}(X_{\tilde\theta})\,|\,\mathcal F_{k+1}\bigr)\,|\,\mathcal F_k\bigr)\mathbf 1_{\{\theta\ge k+1\}}
\]
\[
\le f_k(X_k)\mathbf 1_{\{\theta=k\}}+\mathbb E\bigl(U_{k+1}\,|\,\mathcal F_k\bigr)\mathbf 1_{\{\theta\ge k+1\}}\quad\text{a.s.}
\]
\[
\le\max\bigl(f_k(X_k),\ \mathbb E(U_{k+1}\,|\,\mathcal F_k)\bigr)\quad\text{a.s.}
\]
By the very definition of the Snell envelope, U_k ≥ E(Z_{θ_k} | F_k) a.s., and
\[
\mathbb E(Z_{\theta_k}\,|\,\mathcal F_k)=Z_k\mathbf 1_{\{\theta_k=k\}}+\mathbb E\bigl(Z_{\theta_{k+1}}\,|\,\mathcal F_k\bigr)\mathbf 1_{\{\theta_k\ge k+1\}}
\]
\[
=Z_k\mathbf 1_{\{\theta_k=k\}}+\mathbb E\bigl(\mathbb E(Z_{\theta_{k+1}}\,|\,\mathcal F_{k+1})\,|\,\mathcal F_k\bigr)\mathbf 1_{\{\theta_k\ge k+1\}}
\]
\[
=Z_k\mathbf 1_{\{\theta_k=k\}}+\mathbb E\bigl(U_{k+1}\,|\,\mathcal F_k\bigr)\mathbf 1_{\{\theta_k\ge k+1\}}\quad\text{a.s.}
\]
owing to the induction assumption. Going back to the definition of θ_k we check that {θ_k = k} = {Z_k ≥ E(Z_{θ_{k+1}} | F_k)} and {θ_k ≥ k+1} = {Z_k < E(Z_{θ_{k+1}} | F_k)}. Hence
\[
U_k\ \ge\ \mathbb E(Z_{\theta_k}\,|\,\mathcal F_k)=\max\bigl(Z_k,\ \mathbb E(U_{k+1}\,|\,\mathcal F_k)\bigr).
\]
Plugging this equality into the definition of the stopping times θ_k shows, again by a backward induction, that θ_k = τ_k a.s., since
\[
\mathbb E(Z_{\tau_{k+1}}\,|\,\mathcal F_k)=\mathbb E\bigl(\mathbb E(Z_{\tau_{k+1}}\,|\,\mathcal F_{k+1})\,|\,\mathcal F_k\bigr)=\mathbb E(U_{k+1}\,|\,\mathcal F_k)=\mathbb E(U_{k+1}\,|\,X_k)
=\mathbb E\bigl(\mathbb E(Z_{\tau_{k+1}}\,|\,\mathcal F_{k+1})\,|\,X_k\bigr)=\mathbb E(Z_{\tau_{k+1}}\,|\,X_k)
\]
a.s. for every k ∈ {0, …, n−1}. This completes the proof of (11.8).
Finally, we derive that
\[
\tau_k=k\,\mathbf 1_{\{Z_k\ge\mathbb E(U_{k+1}|\mathcal F_k)\}}+\tau_{k+1}\,\mathbf 1_{\{Z_k<\mathbb E(U_{k+1}|\mathcal F_k)\}}
=k\,\mathbf 1_{\{U_k=Z_k\}}+\tau_{k+1}\,\mathbf 1_{\{U_k>Z_k\}},\quad k=0,\dots,n,
\]
since Un = Z n . ♦
Remarks. • The optimal stopping time (starting at time k) is in general not unique (see [182] or [59], Problem 21, p. 253 for an example dealing with the American Put in a binomial model), but
\[
\tau_k=\min\bigl\{\ell\in\{k,\dots,n\}:\ U_\ell=Z_\ell\bigr\}
\]
is the smallest one.
This proposition shows that two backward dynamic programming principles coex-
ist, a first one based on the Snell envelope, which can be considered as the “regular”
BDPP formula, and a second one, (11.8), based on a recursion on optimal stopping
times depending on the effective date k of entry into the game, which can be seen as
a dual form of the BDPP. It is on this second BDPP that the original paper [202] on
least squares regression methods is based (see the next section).
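As a toy illustration of the "regular" BDPP, the Snell envelope of an American Put can be computed on a binomial (CRR) tree, where the conditional expectations E(U_{k+1} | X_k) reduce to exact two-point averages. All parameters below are illustrative, not taken from the text:

```python
import numpy as np

# Snell envelope of an American Put via the BDPP on a CRR binomial tree.
S0, K, r, sigma, T, n = 50.0, 50.0, 0.04, 0.3, 1.0, 50
dt = T / n
u, d = np.exp(sigma * np.sqrt(dt)), np.exp(-sigma * np.sqrt(dt))
q = (np.exp(r * dt) - d) / (u - d)      # risk-neutral up-probability
disc = np.exp(-r * dt)

def payoff(s):
    return np.maximum(K - s, 0.0)       # obstacle Z_k at each node

def layer(k):
    j = np.arange(k + 1)                # number of down moves
    return S0 * u ** (k - j) * d ** j

U = payoff(layer(n))                    # terminal layer: U_n = Z_n
for k in range(n - 1, -1, -1):          # U_k = max(Z_k, E(U_{k+1} | X_k))
    cont = disc * (q * U[:-1] + (1.0 - q) * U[1:])
    U = np.maximum(payoff(layer(k)), cont)

price = float(U[0])
print(price)
```

The resulting value exceeds the European Put price, the difference being the early-exercise premium.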
We aim at discretizing the optimal stopping problem defined by (11.2). This dis-
cretization is two-fold: first we discretize the instants at which the payoffs can be
“exercised” and, as a second step, we will discretize the diffusion itself, if necessary,
in order to have at hand a simulable underlying structure process. In this first phase,
the Markov chain of interest is (X tkn )0≤k≤n and the payoff functions are f k = f (tkn , .),
k = 0, . . . , n.
First we note that if t_k^n = kT/n, k = 0, …, n, then (X_{t_k^n})_{0≤k≤n} is an (F_{t_k^n})_{0≤k≤n}-Markov chain with transitions
and, for every x ∈ R^d, the value function (for discrete time observations or exercise):
\[
F_n(t_k^n,x)=\sup_{\tau\in\mathcal T_k^n}\mathbb E\Bigl(e^{-r\tau}f\bigl(\tau,X_\tau^{t_k^n,x}\bigr)\Bigr),\tag{11.10}
\]
where \((X_s^{t_k^n,x})_{s\in[t_k^n,T]}\) is the unique solution to the (SDE) starting from x at time t_k^n.
(a) For every x ∈ R^d, \(\bigl(e^{-rt_k^n}F_n(t_k^n,X^x_{t_k^n})\bigr)_{k=0,\dots,n}\) is the (P, (F_{t_k^n}))-Snell envelope of \(\bigl(e^{-rt_k^n}f(t_k^n,X^x_{t_k^n})\bigr)_{k=0,\dots,n}\).
(b)
\[
F_n(T,X_T^x)=f(T,X_T^x)
\]
and
\[
F_n(t_k^n,X^x_{t_k^n})=\max\Bigl(f(t_k^n,X^x_{t_k^n}),\ e^{-r\frac Tn}\,\mathbb E\bigl(F_n(t_{k+1}^n,X^x_{t_{k+1}^n})\,|\,\mathcal F_{t_k^n}\bigr)\Bigr),\quad k=0,\dots,n-1.\tag{11.11}
\]
(c) The functions F_n(t_k^n, ·) satisfy the “functional” BDPP
\[
F_n(T,x)=f(T,x)\quad\text{and}\quad F_n(t_k^n,x)=\max\Bigl(f(t_k^n,x),\ e^{-r\frac Tn}\,P_k\bigl(F_n(t_{k+1}^n,\cdot)\bigr)(x)\Bigr),\quad k=0,\dots,n-1.
\]
(d) If the function f satisfies the Lipschitz assumption (Lip), then the functions
Fn (tkn , . ) are all Lipschitz continuous, uniformly in k ∈ {0, . . . , n}, n ≥ 1.
(b) ⇒ (a). This is a trivial consequence of the (BDPP) since \(\bigl(e^{-rt_k^n}F_n(t_k^n,X_{t_k^n})\bigr)_{k=0,\dots,n}\) is the (P, (F_{t_k^n})_{0≤k≤n})-Snell envelope associated to the obstacle sequence \(\bigl(e^{-rt_k^n}f(t_k^n,X_{t_k^n})\bigr)_{k=0,\dots,n}\).
Applying the preceding to the case X_0 = x, we derive from the general theory of optimal stopping that
\[
F_n(0,x)=\sup_{\tau\in\mathcal T_0^n}\mathbb E\bigl(e^{-r\tau}f(\tau,X_\tau)\bigr).
\]
The extension to times t_k^n, k = 1, …, n, follows likewise from the same reasoning carried out with \((X^{t_k^n,x}_{t_\ell^n})_{\ell=k,\dots,n}\).
(d) A slight adaptation of Theorem 7.10 shows that there exists a real constant \(C_{[b]_{\mathrm{Lip}},[\sigma]_{\mathrm{Lip}},T}\) such that
\[
\sup_{0\le t\le s\le T}\bigl\|X_s^{t,x}-X_s^{t,y}\bigr\|_p\le C_{[b]_{\mathrm{Lip}},[\sigma]_{\mathrm{Lip}},T}\,|x-y|.
\]
Using the Lipschitz continuity of f, the conclusion follows from the definition (11.10) of the functions F_n(t_k^n, ·). ♦
The following rates of convergence of both the Snell envelope related to the optimal stopping problem with payoff (f(t_k^n, X_{t_k^n}))_{k=0,…,n} and its value function toward the continuous time Snell envelope and its value function at the discretization times were established in [21]. Note that this rate can be significantly improved when f is semi-convex.
Theorem 11.1 (Diffusion at discrete times) (a) Discretization of the stopping rules for the structure process X: If f satisfies (Lip), i.e. is Lipschitz continuous in x uniformly in t ∈ [0, T], then so are the value functions F_n(t_k^n, ·) and F(t_k^n, ·), uniformly with respect to t_k^n, k = 0, …, n, n ≥ 1.
Furthermore, F(t_k^n, ·) ≥ F_n(t_k^n, ·) and
\[
\Bigl\|\max_{0\le k\le n}\bigl|F(t_k^n,X_{t_k^n})-F_n(t_k^n,X_{t_k^n})\bigr|\Bigr\|_p\le\frac{C_{b,\sigma,f,T}}{\sqrt n}.
\]
(b) Semi-convex payoffs. If f is semi-convex, then there exist real constants C_{b,σ,f,T} and C_{b,σ,f,T,K} > 0 such that
\[
\Bigl\|\max_{0\le k\le n}\bigl|F_n(t_k^n,X_{t_k^n})-F(t_k^n,X_{t_k^n})\bigr|\Bigr\|_p\le\frac{C_{b,\sigma,f,T}}{n}.
\]
If the diffusion process is simulable at the instants tkn , as may happen if X tkn =
ϕ(X 0 , tkn , Wtkn ), where ϕ has an explicit (computable) form, the time discretization
can be stopped here. Otherwise, we need to perform a second time discretization of
the underlying diffusion process in order to be able to simulate the dynamics.
Now we pass to the second phase: we approximate the (usually not simulable) Markov chain (X_{t_k^n})_{0≤k≤n} by the discrete time Euler scheme (X̄^n_{t_k^n})_{0≤k≤n} as defined by
Eq. (7.5) in Chap. 7. We recall its definition for convenience (to emphasize a change of notation concerning the Gaussian noise):
\[
\bar X^n_{t^n_{k+1}}=\bar X^n_{t^n_k}+\frac Tn\,b\bigl(t^n_k,\bar X^n_{t^n_k}\bigr)+\sigma\bigl(t^n_k,\bar X^n_{t^n_k}\bigr)\sqrt{\frac Tn}\,\xi^n_{k+1},\qquad \bar X^n_0=X_0,\quad k=0,\dots,n-1,\tag{11.12}
\]
where (ξ_k^n)_{1≤k≤n} denotes the sequence of i.i.d. N(0; I_q)-distributed random vectors given by
\[
\xi^n_k:=\sqrt{\frac nT}\,\bigl(W_{t^n_k}-W_{t^n_{k-1}}\bigr),\quad k=1,\dots,n.
\]
In the absence of ambiguity, we will drop the superscript n in X̄^n and ξ_k^n to write X̄ and ξ_k. Also note that we temporarily gave up the notation Z_k^n for the Gaussian innovations in favor of ξ_k^n, since Z often denotes the reward process in this chapter.
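A minimal vectorized sketch of the Euler scheme (11.12) with Gaussian innovations ξ_k follows; the drift and diffusion coefficients below are illustrative Black–Scholes choices, not taken from the text:

```python
import numpy as np

def euler_paths(b, sigma, x0, T, n, M, rng):
    """Simulate M paths of the Euler scheme with step T/n for the scalar SDE
    dX_t = b(t, X_t) dt + sigma(t, X_t) dW_t."""
    dt = T / n
    X = np.empty((n + 1, M))
    X[0] = x0
    for k in range(n):
        xi = rng.standard_normal(M)   # innovations xi_{k+1}^n, i.i.d. N(0,1)
        t = k * dt
        X[k + 1] = X[k] + b(t, X[k]) * dt + sigma(t, X[k]) * np.sqrt(dt) * xi
    return X

rng = np.random.default_rng(1)
# Sanity check with Black-Scholes coefficients: E X_T should be close to
# x0 * exp(r T) up to the Euler bias and the Monte Carlo error.
paths = euler_paths(lambda t, x: 0.04 * x, lambda t, x: 0.3 * x,
                    50.0, 1.0, 50, 100_000, rng)
print(paths[-1].mean())
```

For multi-dimensional diffusions the same loop applies with ξ drawn from N(0; I_q) and σ(t, x) a d × q matrix.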
Proposition 11.4 (Euler scheme) The above proposition remains true when replacing the sequence (X_{t_k^n})_{0≤k≤n} by its Euler scheme (X̄^n_{t_k^n})_{0≤k≤n} with step T/n, still with the filtration F_{t_k^n} = σ(X_0, F^W_{t_k^n}), k = 0, …, n. In both cases one just has to replace the transitions P_k(x, dy) of the original process by those of its Euler scheme with step T/n, namely P̄_k^n(x, dy) defined for every k ∈ {0, …, n−1} by
\[
\bar P^n_kf(x)=\mathbb E\,f\Bigl(x+\frac Tn\,b(t^n_k,x)+\sigma(t^n_k,x)\sqrt{\frac Tn}\,\xi\Bigr),\qquad \xi\overset{d}{=}\mathcal N(0;I_q),
\]
and F_n by F̄_n.
Remark. In fact, the result also holds true with the (smaller) innovation filtration F_k^{X_0,ξ} = σ(X_0, ξ_1, …, ξ_k), k = 0, …, n, since the Euler scheme is still a Markov chain with the same transitions with respect to this filtration.
The rate of convergence between the Snell envelopes and their value functions
when switching mutatis mutandis from the diffusion “sampled” at discrete times
(X tkn )k=0,...,n to the Euler scheme ( X̄ tkn )k=0,...,n is also established in [21].
Theorem 11.2 (Euler scheme approximation) Under the assumptions of the former Theorem 11.1(a), there exist real constants C_{b,σ,f,T} and C_{b,σ,f,T,K} > 0 such that
\[
\Bigl\|\max_{0\le k\le n}\bigl|F_n(t^n_k,X_{t^n_k})-\bar F_n(t^n_k,\bar X^n_{t^n_k})\bigr|\Bigr\|_p\le\frac{C_{b,\sigma,f,T}}{\sqrt n}.
\]
ℵ Practitioner’s corner
If the diffusion process X is simulable at times tkn (in an exact way), there is no
need to introduce the Euler scheme and it will be possible to take advantage of the
semi-convexity of the payoff/obstacle function f to get a time discretization error of
the value function F by Fn at a O(1/n)-rate.
A typical example of this situation is provided by the multi-dimensional Black–
Scholes model (and its avatars for Foreign Exchange (Garman–Kohlhagen) or future
contracts (Black)) and, more generally, by models where the process (X tx )t∈[0,T ] can
be written at each time t ∈ [0, T ] as an explicit function of Wt , namely
∀ t ∈ [0, T ], X t = ϕ(t, Wt ),
where ϕ(t, x) can be computed at a very low cost. When d = 1 (although of smaller
interest for application in view of the available analytical methods based on vari-
ational inequalities), one can also rely on the exact simulation method of one-
dimensional diffusions (see [43]).
In the general case, we will rely on these two successive steps of discretization:
one for the optimal stopping rules and one for the underlying process to make it
simulable. However, in both cases, as far as numerics are concerned, we will rely on
the BDPP, which itself requires the computation of conditional expectations.
In both cases we now have access to a simulable Markov chain (either the Euler scheme or the process itself at times t_k^n). Computing the resulting discrete time Snell envelope (and its value function) requires a second, spatial, discretization phase.
Mostly for convenience, we will consider a sequence e_i : R^d → R, i ∈ N*, of Borel functions such that, for every k ∈ {0, …, n}, (e_i(X_k))_{i∈N*} is a complete system of L²(Ω, σ(X_k), P), i.e.
\[
\bigl\{e_\ell(X_k),\ \ell\in\mathbb N^*\bigr\}^{\perp}=\{0\},\quad k=0,\dots,n.
\]
In practice, one may choose different functions at every time k, i.e. families (e_{i,k})_{i≥1} so that (e_{i,k}(X_k))_{i≥1} makes up a Hilbert basis of L²_R(Ω, σ(X_k), P).
Example. If X_k = W_{t_k^n}, or equivalently after normalization, if X_0 = 0 and \(X_k=\sqrt{\frac n{kT}}\,W_{t_k^n}\), k = 1, …, n, the family of Hermite polynomials provides an orthonormal basis of L²_R(Ω, σ(X_k), P) at each time step t_k^n, by setting e_ℓ = H_{ℓ−1}, ℓ ≥ 1 (see e.g. [162], Chap. 3, p. 167).
and set
\[
\tau_n^{[N_n]}:=n,
\]
where
\[
\alpha_k^{[N_k]}:=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb R^{N_k}}\ \mathbb E\Bigl(Z_{\tau_{k+1}^{[N_{k+1}]}}-\bigl(\alpha\,|\,e^{[N_k]}(X_k)\bigr)\Bigr)^2,\quad k=0,\dots,n-1.
\]
(Keep in mind that, at each step k, (·|·) denotes the canonical inner product on R^{N_k}.)
In fact, this finite-dimensional optimization problem has a well-known solution given by
\[
\alpha_k^{[N_k]}=\Bigl[\mathrm{Gram}\bigl(e^{[N_k]}(X_k)\bigr)\Bigr]^{-1}\Bigl(\bigl(Z_{\tau_{k+1}^{[N_{k+1}]}}\,\big|\,e_\ell^{[N_k]}(X_k)\bigr)_{L^2(\mathbb P)}\Bigr)_{1\le\ell\le N_k},
\]
Backward phase:
– At time n: For every m ∈ {1, …, M},
\[
\tau_n^{[N_n],m,M}:=n.
\]
– For k = n−1 down to 0: Compute
\[
\alpha_k^{[N_k],M}:=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb R^{N_k}}\ \frac1M\sum_{m=1}^M\Bigl(Z^{(m)}_{\tau_{k+1}^{[N_{k+1}],m,M}}-\bigl(\alpha\,|\,e^{[N_k]}(X_k^{(m)})\bigr)\Bigr)^2.
\]
Finally, the resulting approximation of the mean value at the origin of the Snell envelope reads
\[
\mathbb E\,U_0=\mathbb E\,(Z_{\tau_0})\simeq\mathbb E\,\bigl(Z_{\tau_0^{[N_0]}}\bigr)\simeq\frac1M\sum_{m=1}^M Z^{(m)}_{\tau_0^{[N_0],m,M}}\quad\text{as }M\to+\infty.
\]
Remarks. • One may formally rewrite the second approximation phase by simply replacing the distribution of the chain (X_k)_{0≤k≤n} by the empirical measure
\[
\frac1M\sum_{m=1}^M\delta_{X^{(m)}}.
\]
\[
H_i(x)=(-1)^i\,e^{\frac{x^2}2}\,\frac{d^i}{dx^i}\,e^{-\frac{x^2}2},\quad i\ge0.
\]
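Equivalently, these (probabilists') Hermite polynomials can be generated by the classical three-term recurrence H_{i+1}(x) = x H_i(x) − i H_{i−1}(x), starting from H_0 = 1 and H_1 = x; a small sketch:

```python
def hermite(i, x):
    """Probabilists' Hermite polynomial H_i evaluated at x, via the
    three-term recurrence H_{i+1}(x) = x H_i(x) - i H_{i-1}(x)."""
    if i == 0:
        return 1.0
    h_prev, h = 1.0, x          # H_0 and H_1
    for j in range(1, i):
        h_prev, h = h, x * h - j * h_prev
    return h
```

For instance H_2(x) = x² − 1 and H_3(x) = x³ − 3x, consistent with the Rodrigues-type formula above.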
• The algorithmic analysis of the above described procedure shows that its imple-
mentation requires
– a forward simulation of M paths of the Markov chain,
– a backward non-linear optimization phase in which all the (stored) paths have
to interact through the computation at every time k of αk[N ],M , which depends on all
the simulated values X k(m) , m = 1, . . . , M.
However, still in very specific situations, the forward phase can be skipped if a backward simulation method for the Markov chain (X_k)_{0≤k≤n} is available. Such is the case for the Brownian motion at times kT/n, using a backward recursive simulation method, or, equivalently, the Brownian bridge method (see Chap. 8).
The rate of convergence of the Monte Carlo phase of the procedure is ruled by a
Central Limit Theorem due to Clément–Lamberton–Protter in [64], stated below.
Theorem 11.3 (CLT, see [64] (2003)) The Monte Carlo approximation satisfies a Central Limit Theorem, namely
\[
\sqrt M\,\Bigl(\frac1M\sum_{m=1}^M Z^{(m)}_{\tau_k^{[N_k],m,M}}-\mathbb E\,Z_{\tau_k^{[N_k]}},\ \alpha_k^{[N_k],M}-\alpha_k^{[N_k]}\Bigr)_{0\le k\le n-1}\ \xrightarrow{\ \mathcal L\ }\ \mathcal N(0;\Sigma)\quad\text{as }M\to+\infty,
\]
where
\[
\alpha_k^{[N_k],M}=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb R^{N_k}}\ \sum_{m=1}^M\Bigl(U^{(m)}_{k+1}-\bigl(\alpha\,|\,e^{[N_k]}(X_k^{(m)})\bigr)\Bigr)^2
=\Bigl[\frac1M\sum_{m=1}^M e_\ell^{[N_k]}(X_k^{(m)})\,e_{\ell'}^{[N_k]}(X_k^{(m)})\Bigr]^{-1}_{1\le\ell,\ell'\le N_k}\Bigl(\frac1M\sum_{m=1}^M U^{(m)}_{k+1}\,e_\ell^{[N_k]}(X_k^{(m)})\Bigr)_{1\le\ell\le N_k}
\]
and
\[
U^{(m)}_k=\max\Bigl(f_k(X_k^{(m)}),\ \bigl(\alpha_k^{[N_k],M}\,|\,e^{[N_k]}(X_k^{(m)})\bigr)\Bigr),\quad m=1,\dots,M.
\]
\[
\mathbb E\,U_0\simeq\frac1M\sum_{m=1}^M U_0^{(m)}=\frac1M\sum_{m=1}^M\max\Bigl(f_0(X_0^{(m)}),\ \bigl(\alpha_0^{[N_0],M}\,|\,e^{[N_0]}(X_0^{(m)})\bigr)\Bigr).\tag{11.13}
\]
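The backward phase above is essentially the Longstaff–Schwartz algorithm. The following sketch implements it for a Bermudan Put under Black–Scholes with the monomial basis (1, x, x²), restricting exercise to in-the-money paths (a common simplification, not spelled out in the text); all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
S0, K, r, sigma, T, n, M = 50.0, 50.0, 0.04, 0.3, 1.0, 10, 50_000
dt = T / n

# forward phase: exact simulation of the Black-Scholes chain at times kT/n
xi = rng.standard_normal((n, M))
loginc = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * xi
S = np.vstack([np.full(M, S0), S0 * np.exp(np.cumsum(loginc, axis=0))])

payoff = lambda s: np.maximum(K - s, 0.0)
Z = np.exp(-r * dt * np.arange(n + 1))[:, None] * payoff(S)  # discounted payoffs

# backward phase on stopping times: tau_n = n, then regress Z_{tau_{k+1}}
tau = np.full(M, n)
for k in range(n - 1, 0, -1):
    e = np.vander(S[k], 3)                       # basis columns (x^2, x, 1)
    y = Z[tau, np.arange(M)]                     # realized Z_{tau_{k+1}}
    alpha, *_ = np.linalg.lstsq(e, y, rcond=None)
    cont = e @ alpha                             # estimated continuation value
    stop = (Z[k] > 0.0) & (Z[k] >= cont)         # exercise only in-the-money
    tau = np.where(stop, k, tau)

price = Z[tau, np.arange(M)].mean()
print(price)
```

With only three basis functions the stopping rule is sub-optimal, so this estimator tends to produce a slightly low value, in line with the negative bias discussed below.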
Exercise. Justify the above algorithm starting from the BDPP.
Pros
• The method is “natural”: it involves the approximation of a conditional expectation
by an affine regression operator performed on a truncated generating system of
L 2 (σ(X k ), P).
• The method appears to be “flexible”: there is an opportunity to change or adapt the
(truncated) basis of L 2 (σ(X k ), P) at each step of the procedure, e.g. by including the
payoff function in the (truncated) basis, at each step at least during the first backward
iterations of the induction.
Cons
• From a purely theoretical point of view the regression approach does not provide error bounds or a rate of approximation for the convergence of E Z_{τ_0^{[N_0]}} toward E Z_{τ_0} = E U_0, which is mostly ruled by the rate at which the family e^{[N_k]}(X_k) “fills” L²(Ω, σ(X_k), P) when N_k goes to infinity. This point is discussed, for example, in [118] in connection with the size of the Monte Carlo sample. However, in practice this information may be difficult to exploit; especially in higher dimensions, the possible choices for the size N_k are limited by the storage capacity of the computing device.
• Most computations are performed on-line since they are payoff-dependent. However, note that the Gram matrix of (e_ℓ^{[N_k]}(X_k))_{1≤ℓ≤N_k} can be computed off-line since it only depends on the structure process.
• Due to the sub-optimality of the stopping times involved in the regression procedure, the method tends to produce lower bounds for u_0(x_0) = E U_0, since it has a systematic negative bias.
• The choice of the functions (e_ℓ(x))_{ℓ∈N*} is crucial and needs much care (and intuition). In practical implementations, it may vary at each time step (our choice of a single family is mostly motivated by notational simplicity). Furthermore, it may have a biasing effect for options deep in- or out-of-the-money, since the coordinates of the functions e_i(X_k) are computed “locally” from simulated data which, of course, mostly lie where things happen (around the mean). This has an impact on the prices of options with long maturity and/or deep in- or out-of-the-money, since the calibration of the coordinates of e^{[N_k]}(X_k), mainly performed “at-the-money”, induces their behavior in the tails. On the other hand, this choice of the functions, if they are smooth, has a smoothing effect which can be interesting to users (provided it does not induce any hidden arbitrage…). To overcome the first problem one may choose local functions like indicator functions of a Voronoi diagram (see the next Sect. 11.3.2 devoted to quantization tree methods, or Chap. 5), with the counterpart that a smoothing effect can no longer be expected.
When there is a family of distributions “related” to the underlying Markov struc-
ture process, a natural idea can be to consider an orthonormal basis of L 2 (μ0 ), where
μ0 is a normalized distribution of the family. A typical example is the sequence of
Hermite polynomials for the normal distribution N (0; 1).
When no simple solution is available, the canonical polynomial basis (x^ℓ)_{ℓ≥0} remains a quite natural and efficient choice in one dimension.
In higher dimensions (in fact the only case of interest in practice, since the one-dimensional setting is usually solved through the associated variational inequality by specific numerical schemes, see [2, 158]), this choice becomes more and more influenced by the payoff itself.
• A huge RAM capacity is needed to store all the paths of the simulated Markov
chain (forward phase) except when a backward simulation procedure is available.
This induces a stringent limitation on the size M of the simulation, even with recent devices, to prevent a swapping effect which would dramatically slow down the procedure. By a swapping effect we mean that, when the quantity of data to be stored becomes too large, the computer uses its hard disk to store it, but access to disk storage is incredibly slow compared to access to RAM.
• Regression methods are strongly payoff-dependent in the sense that a significant
part of the procedure (the product of the inverted Gram matrix by the projection of
the payoff at every time k) has to be done for each payoff.
Exercise. Write a regression algorithm based on the “primal” BDPP.
In this section, we continue to deal with the simple discrete time Markovian optimal stopping problem introduced in the former section. The underlying idea of the quantization tree method is to approximate the whole Markovian dynamics of the chain (X_k)_{0≤k≤n} using a skeleton of its distribution supported by a tree. In other words, one designs optimized tree methods in higher dimensions, fitted to the marginal distributions of the chain, in order to prevent an explosion of their size (number of nodes per level).
For every k ∈ {0, …, n}, we replace the marginal X_k by a function X̂_k of X_k taking values in a grid Γ_k, namely X̂_k = π_k(X_k), where π_k : R^d → Γ_k is a Borel function. The grid Γ_k = π_k(R^d) (also known as a codebook in Signal Processing or Information Theory) will always be assumed finite in practice, with size |Γ_k| = N_k ∈ N*.
However, note that all the error bounds established below still hold if the grids are infinite, provided π_k is sub-linear, i.e.
\[
|\pi_k(x)|\le C\bigl(1+|x|\bigr),
\]
so that X̂_k has at least as many finite moments as X_k. In such a situation, all integrability assumptions on X_k and X̂_k should be coupled, i.e. X_k, X̂_k ∈ L^r(P) instead of X_k ∈ L^r(P).
\[
\widehat U_k=\widehat u_k(\widehat X_k),\qquad \widehat u_k:\mathbb R^d\to\mathbb R_+\ \text{Borel function}.
\]
Error bounds
The following theorem establishes the control of the approximation of the true Snell envelope (U_k)_{0≤k≤n} by the quantized pseudo-Snell envelope (Û_k)_{0≤k≤n} in terms of the L^p-mean approximation errors ‖X_k − X̂_k‖_p.
Theorem 11.4 (see [20] (2001), [235] (2011)) Assume that all functions f_k : R^d → R_+ are Lipschitz continuous and that all the transitions P_k(x, dy) = P(X_{k+1} ∈ dy | X_k = x) are Lipschitz continuous in the following sense. Then
\[
\bigl\|U_k-\widehat U_k\bigr\|_p\le2[f]_{\mathrm{Lip}}\sum_{\ell=k}^n\bigl([P]_{\mathrm{Lip}}\vee1\bigr)^{n-\ell}\bigl\|X_\ell-\widehat X_\ell\bigr\|_p.
\]
Proof. Step 1. First, we control the Lipschitz constants of the functions u_k. It follows from the classical inequality |max(a, b) − max(a′, b′)| ≤ max(|a − a′|, |b − b′|) that
\[
[u_k]_{\mathrm{Lip}}\le\max\bigl([f_k]_{\mathrm{Lip}},\,[P_ku_{k+1}]_{\mathrm{Lip}}\bigr)\le\max\bigl([f]_{\mathrm{Lip}},\,[P]_{\mathrm{Lip}}[u_{k+1}]_{\mathrm{Lip}}\bigr)
\]
with the convention [u_{n+1}]_Lip = 0. An easy backward induction yields
\[
[u_k]_{\mathrm{Lip}}\le[f]_{\mathrm{Lip}}\bigl([P]_{\mathrm{Lip}}\vee1\bigr)^{n-k}.
\]
Step 2. We focus on claim (b) when p = 2. It follows from both Backward Dynamic Programming formulas (original and quantized) and the above elementary inequality that
\[
|U_k-\widehat U_k|\le\max\Bigl(|f_k(X_k)-f_k(\widehat X_k)|,\ \bigl|\mathbb E(U_{k+1}\,|\,X_k)-\mathbb E(\widehat U_{k+1}\,|\,\widehat X_k)\bigr|\Bigr)
\]
so that
\[
|U_k-\widehat U_k|^2\le|f_k(X_k)-f_k(\widehat X_k)|^2+\bigl|\mathbb E(U_{k+1}\,|\,X_k)-\mathbb E(\widehat U_{k+1}\,|\,\widehat X_k)\bigr|^2.\tag{11.15}
\]
Moreover,
\[
\bigl\|\mathbb E(U_{k+1}\,|\,X_k)-\mathbb E(\widehat U_{k+1}\,|\,\widehat X_k)\bigr\|_2^2=\bigl\|P_ku_{k+1}(X_k)-\mathbb E\bigl(\widehat u_{k+1}(\widehat X_{k+1})\,|\,\widehat X_k\bigr)\bigr\|_2^2
\le[P_ku_{k+1}]^2_{\mathrm{Lip}}\bigl\|X_k-\widehat X_k\bigr\|_2^2+\bigl\|u_{k+1}(X_{k+1})-\widehat u_{k+1}(\widehat X_{k+1})\bigr\|_2^2.\tag{11.16}
\]
Plugging this inequality into (11.15) and taking the expectation yields, for every k ∈ {0, …, n},
\[
\bigl\|U_k-\widehat U_k\bigr\|_2^2\le\bigl([f]^2_{\mathrm{Lip}}+[P]^2_{\mathrm{Lip}}[u_{k+1}]^2_{\mathrm{Lip}}\bigr)\bigl\|X_k-\widehat X_k\bigr\|_2^2+\bigl\|U_{k+1}-\widehat U_{k+1}\bigr\|_2^2.
\]
Consequently,
\[
\bigl\|U_k-\widehat U_k\bigr\|_2^2\le2[f]^2_{\mathrm{Lip}}\sum_{\ell=k}^{n-1}\bigl(1\vee[P]_{\mathrm{Lip}}\bigr)^{2(n-\ell)}\bigl\|X_\ell-\widehat X_\ell\bigr\|_2^2+[f]^2_{\mathrm{Lip}}\bigl\|X_n-\widehat X_n\bigr\|_2^2
\le2[f]^2_{\mathrm{Lip}}\sum_{\ell=k}^{n}\bigl(1\vee[P]_{\mathrm{Lip}}\bigr)^{2(n-\ell)}\bigl\|X_\ell-\widehat X_\ell\bigr\|_2^2,
\]
Remark. The above control emphasizes the interest of minimizing the L^p-mean quantization error ‖X_k − X̂_k‖_p at each time step of the Markov chain to reduce the final resulting error.
Exercise. Prove claim (a) starting from the L^p-error bound (5.18).
Example of application: the Euler scheme. Let (X̄^n_{t^n_k})_{0≤k≤n} be the Euler scheme with step T/n of the d-dimensional diffusion solution to the SDE (11.1). It defines a Markov chain with transitions
\[
\bar P^n_kg(x)=\mathbb E\,g\Bigl(x+\frac Tn\,b(t^n_k,x)+\sigma(t^n_k,x)\sqrt{\frac Tn}\,Z\Bigr),\qquad Z\overset{d}{=}\mathcal N(0;I_q).
\]
As a consequence, setting \(C_{b,\sigma}=[b]_{\mathrm{Lip}}+\frac{[\sigma]^2_{\mathrm{Lip}}}2\) yields
\[
[\bar P^n_kg]_{\mathrm{Lip}}\le\Bigl(1+\frac{C_{b,\sigma,T}}n\Bigr)[g]_{\mathrm{Lip}},\quad k=0,\dots,n-1,
\]
i.e.
\[
[\bar P^n]_{\mathrm{Lip}}\le1+\frac{C_{b,\sigma,T}}n.
\]
Applying the control established in claim (b) of the above theorem yields, with obvious notations,
\[
\bigl\|U_k-\widehat U_k\bigr\|_2\le\sqrt2\,[f]_{\mathrm{Lip}}\Bigl(\sum_{\ell=k}^n\Bigl(1+\frac{C_{b,\sigma,T}}n\Bigr)^{2(n-\ell)}\bigl\|X_\ell-\widehat X_\ell\bigr\|_2^2\Bigr)^{\frac12}
\le\sqrt2\,e^{C_{b,\sigma,T}}\,[f]_{\mathrm{Lip}}\Bigl(\sum_{\ell=k}^n\bigl\|X_\ell-\widehat X_\ell\bigr\|_2^2\Bigr)^{\frac12}.
\]
Keeping in mind that Û_k = û_k(X̂_k), k = 0, …, n, we first get
\[
\begin{cases}
\widehat u_n=f_n\ \text{on }\Gamma_n,\\[4pt]
\widehat u_k(x_i^k)=\max\Bigl(f_k(x_i^k),\ \mathbb E\bigl(\widehat u_{k+1}(\widehat X_{k+1})\,\big|\,\widehat X_k=x_i^k\bigr)\Bigr),\quad i=1,\dots,N_k,\ k=0,\dots,n-1.
\end{cases}\tag{11.17}
\]
\[
p_{ij}^k=\mathbb P\bigl(\widehat X_{k+1}=x_j^{k+1}\,\big|\,\widehat X_k=x_i^k\bigr),\quad 1\le i\le N_k,\ 1\le j\le N_{k+1}.\tag{11.19}
\]
Although the above super-matrix defines a family of Markov transitions, the sequence (X̂_k)_{0≤k≤n} is definitely not a Markov chain since there is no reason why
\[
\mathbb P\bigl(\widehat X_{k+1}=x_j^{k+1}\,\big|\,\widehat X_k=x_i^k\bigr)\quad\text{and}\quad\mathbb P\bigl(\widehat X_{k+1}=x_j^{k+1}\,\big|\,\widehat X_k=x_i^k,\ \widehat X_\ell=x_{i_\ell}^\ell,\ \ell=0,\dots,k-1\bigr)
\]
should be equal.
In fact, one should rather view the quantized transitions
\[
\widehat P_k(x_i^k,dy)=\sum_{j=1}^{N_{k+1}}p_{ij}^k\,\delta_{x_j^{k+1}},\qquad x_i^k\in\Gamma_k,\ k=0,\dots,n-1,
\]
as spatial discretizations of the transitions P_k(x, dy) of the original Markov chain.
Definition 11.2 The family of grids $(\Gamma_k)_{0\le k\le n}$ and the transition super-matrix $[p_{ij}^k]$ defined by (11.19) define a quantization tree of size $N = N_0 + \cdots + N_n$.
11.3 Numerical Methods 533
Remark. A quantization tree in the sense of the above definition does not characterize the distribution of the sequence $(\widehat X_k)_{0\le k\le n}$.
Implementation of the quantization tree descent. The implementation of the whole quantization tree method relies on the computation of this transition super-matrix. Once the grids (optimal or not) have been specified and the weights of the super-matrix have been computed or, to be more precise, estimated, the computation of the approximate value $\mathbb E\,\widehat U_0$ at time 0 amounts to an almost instantaneous "backward descent" of the quantization tree based on (11.18).
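The backward descent itself can be sketched in a few lines — a minimal sketch assuming the obstacle values $f_k(x_i^k)$ and the (estimated) transition weights are already stored layer by layer; the function name `tree_descent` and the list-of-lists layout are illustrative, not the book's.

```python
def tree_descent(payoff, weights):
    """Backward descent of a quantization tree (quantized BDPP).

    payoff[k][i]     : obstacle value f_k(x_i^k) at node i of layer k
    weights[k][i][j] : weight p_ij^k of the transition from node i of
                       layer k to node j of layer k+1
    Returns the layer-0 values (u_0(x_i^0))_i.
    """
    n = len(payoff) - 1
    u = list(payoff[n])                       # terminal condition: u_n = f_n
    for k in range(n - 1, -1, -1):            # backward in time
        u = [max(payoff[k][i],
                 sum(p * v for p, v in zip(weights[k][i], u)))
             for i in range(len(payoff[k]))]
    return u
```

On a toy two-date tree with one initial node, payoff `[[1.0], [0.0, 3.0]]` and weights `[[[0.5, 0.5]]]`, the descent returns `max(1.0, 1.5) = 1.5`.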
If we can simulate $M$ independent copies $X^{(1)},\dots,X^{(M)}$ of the Markov chain $(X_k)_{0\le k\le n}$, then the weights $p_{ij}^k$ can be estimated by a standard Monte Carlo estimator
$$p_{ij}^{(M),k} = \frac{\#\big\{m \in \{1,\dots,M\} :\ \widehat X_{k+1}^{(m)} = x_j^{k+1}\ \text{and}\ \widehat X_k^{(m)} = x_i^k\big\}}{\#\big\{m \in \{1,\dots,M\} :\ \widehat X_k^{(m)} = x_i^k\big\}} \xrightarrow[M\to+\infty]{} p_{ij}^k.$$
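This counting estimator is straightforward to code. The sketch below works on scalar chains with nearest-neighbour projection onto each grid; the helper names (`nearest`, `estimate_weights`) are illustrative assumptions, not the book's notation.

```python
import random

def nearest(grid, x):
    """Index of the nearest-neighbour projection of x onto a 1-d grid."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - x))

def estimate_weights(paths, grids):
    """Monte Carlo estimator of the transition super-matrix p_ij^k.

    paths : list of M simulated chain paths [X_0, ..., X_n] (scalars here)
    grids : list of the n+1 quantization grids
    Returns the frequency estimates p[k][i][j]."""
    n = len(grids) - 1
    p = [[[0.0] * len(grids[k + 1]) for _ in grids[k]] for k in range(n)]
    hits = [[0] * len(g) for g in grids]
    for path in paths:
        idx = [nearest(grids[k], path[k]) for k in range(n + 1)]
        for k in range(n):
            p[k][idx[k]][idx[k + 1]] += 1.0   # joint counts
            hits[k][idx[k]] += 1              # marginal counts
    for k in range(n):
        for i in range(len(grids[k])):
            if hits[k][i]:
                p[k][i] = [c / hits[k][i] for c in p[k][i]]
    return p
```

Each estimated row sums to 1 on every node that has been visited at least once, as required of a transition matrix.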
Remark. By contrast with the regression methods, for which theoretical results are mostly focused on the rate of convergence of the Monte Carlo phase, here we will analyze the part of the procedure where the transition super-matrix is computed; its impact on the quantization tree is deeply analyzed in [17], to which we refer.
Application. We can apply the preceding, still within the framework of an Euler scheme, to our original optimal stopping problem. We assume that all random vectors $X_k$ lie in $L^p(\mathbb P)$ for a real exponent $p > 2$ and that they have been optimally quantized in $L^2$ by grids of size $N_k$, $k = 0,\dots,n$. Then, relying on the non-asymptotic Zador Theorem (claim (b) from Theorem 5.1.2), we get with obvious notations
$$\big\|\bar U_0^n - \widehat{\bar U}{}_0^n\big\|_2 \le \sqrt2\, e^{C_{b,\sigma,T}}\,[f]_{\rm Lip}\, C_{p,d}\bigg(\sum_{k=0}^n e^{2C_{b,\sigma,T}}\,\sigma_p\big(\bar X^n_{t_k^n}\big)^2\, N_k^{-\frac2d}\bigg)^{\frac12}.$$
Optimizing the quantization tree. At this stage, one can process an optimization of the quantization tree. To be precise, one can optimize the size of the grids $\Gamma_k$ subject to a "budget" (or total allocation) constraint, typically
$$\min\bigg\{\sum_{k=0}^n e^{2C_{b,\sigma,T}}\,\sigma_p\big(\bar X^n_{t_k^n}\big)^2\, N_k^{-\frac2d} :\ N_k \ge 1,\ N_0 + \cdots + N_n = N\bigg\}.$$
Exercise. Derive an asymptotically optimal choice for the grid size allocation (as $N \to +\infty$).
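The continuous relaxation of this budget-constrained minimization has a closed-form solution by a Lagrange multiplier argument: with $a_k = e^{2C_{b,\sigma,T}}\sigma_p(\bar X^n_{t^n_k})^2$, the optimal sizes are $N_k \propto a_k^{d/(d+2)}$. The sketch below (hypothetical function names, integrality constraint ignored) computes this allocation and can be checked against uniform allocation.

```python
def optimal_allocation(a, N, d):
    """Continuous-relaxation minimizer of sum_k a_k * N_k^(-2/d)
    subject to sum_k N_k = N: take N_k proportional to a_k^(d/(d+2))."""
    w = [ak ** (d / (d + 2.0)) for ak in a]
    s = sum(w)
    return [N * wk / s for wk in w]

def cost(a, alloc, d):
    """Objective sum_k a_k * N_k^(-2/d) of an allocation."""
    return sum(ak * nk ** (-2.0 / d) for ak, nk in zip(a, alloc))
```

For $a = (1, 4, 9)$, $N = 300$, $d = 2$ one gets the allocation $(50, 100, 150)$ with cost $0.12$, against $0.14$ for the uniform choice $(100, 100, 100)$.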
This optimization turns out to have a significant numerical impact, even if, in terms of rate, the uniform choice $N_k = \bar N = N/n$ (doing so we implicitly assume that $X_0 = x_0 \in \mathbb R^d$, so that $\widehat X_0 = X_0 = x_0$, $N_0 = 1$ and $\widehat U_0 = \widehat u_0(x_0)$) leads to a quantization error of the form
$$\big|\bar u_0^n(x_0) - \widehat{\bar u}{}_0^n(x_0)\big| \le \big\|\bar U_0^n - \widehat{\bar U}{}_0^n\big\|_2 \le \kappa_{b,\sigma,T}\,[f]_{\rm Lip}\,\max_{0\le k\le n}\sigma_p\big(\bar X^n_{t_k^n}\big)\times\frac{\sqrt n}{\bar N^{\frac1d}} \le \kappa_{b,\sigma,T,p}\,[f]_{\rm Lip}\,\frac{\sqrt n}{\bar N^{\frac1d}}$$
since we know (see Proposition 7.2 in Chap. 7) that $\sup_n \mathbb E\,\max_{0\le k\le n}\big|\bar X^n_{t_k^n}\big|^p < +\infty$. If we plug that into the global estimate obtained in Theorem 11.2, we obtain the typical error bound
$$\big|u_0(x_0) - \widehat{\bar u}{}_0^n(x_0)\big| \le C_{b,\sigma,T,f,d}\,\bigg(\frac1{n^{\frac12}} + \frac{n^{\frac12}}{\bar N^{\frac1d}}\bigg). \tag{11.20}$$
Remark. If we can directly simulate the sampled diffusion $(X_{t_k^n})_{0\le k\le n}$ instead of its Euler scheme and if the obstacle/payoff function is semi-convex in the sense of Condition (SC), then we get as a typical error bound
$$\big|u_0(x_0) - \widehat{\bar u}{}_0^n(x_0)\big| \le C_{b,\sigma,T,f,d}\,\bigg(\frac1n + \frac{n^{\frac12}}{\bar N^{\frac1d}}\bigg).$$
Remark. • The rate of decay $\bar N^{-\frac1d}$ obviously deteriorates as the spatial dimension $d$ of the underlying Markov process increases, but this phenomenon cannot be overcome by such tree methods. This is a consequence of Zador's Theorem. This rate degradation is known as the curse of dimensionality.
• These rates can be significantly improved by introducing a Romberg-like extrapolation method and/or some martingale corrections to the quantization tree (see [235]).
ℵ Practitioner’s corner
Examples. 1. Brownian motion: $X_k = W_{t_k}$. Then $\widehat W_0 = 0$ and
$$\|W_{t_k}\|_{2+\delta} = C_\delta\,\sqrt{t_k}, \quad k = 0,\dots,n.$$
Hence $N_0 = 1$,
$$N_k \approx \frac{2(d+1)}{d+2}\,\Big(\frac kn\Big)^{\frac d{d+2}}\,\bar N, \quad k = 1,\dots,n,$$
and
$$\big|V_0 - \widehat v_0(0)\big| \le C_{W,\delta}\,\Big(\frac{2(d+1)}{d+2}\Big)^{-\frac12-\frac1d}\,\frac{n^{\frac12+\frac1d}}{N^{\frac1d}} = O\Big(\frac{\sqrt n}{\bar N^{\frac1d}}\Big)$$
with $\bar N = N/n$.
Theoretically this choice may not look crucial since it has no impact on the convergence rate but, in practice, it does influence the numerical performance.
2. Stationary processes. The process $(X_k)_{0\le k\le n}$ is stationary and $X_0 \in L^{2+\delta}$ for some $\delta > 0$. A typical example in the Gaussian world is, as expected, the stationary Ornstein–Uhlenbeck process: then the process $(X_t)_{t\ge0}$ is stationary, i.e. $(X_{t+s})_{s\ge0}$ and $(X_s)_{s\ge0}$ have the same distribution as processes, with common marginal distribution $\nu_0$. If we "sample" (or observe) this process at times $t_k^n = \frac{kT}n$, $k = 0,\dots,n$, between $0$ and $T$, it takes the form of a Gaussian autoregressive process of order 1:
$$X_{t_{k+1}^n} = \Big(I_d - \frac T{2n}\,B\Big)X_{t_k^n} + \sqrt{\frac Tn}\,Z_{k+1}^n, \qquad (Z_k^n)_{1\le k\le n}\ \text{i.i.d.},\ \mathcal N(0; I_d)\text{-distributed}.$$
The key feature of such a setting is that the quantization tree relies on a single optimal $\bar N$-grid $\Gamma = \Gamma_{0,(\bar N)} = \{x_1^0,\dots,x_{\bar N}^0\}$ (say, $L^2$-optimal for the distribution of $X_0$ at level $\bar N$) and a single quantized transition matrix $\big[\mathbb P\big(\widehat X_1 = x_j^0\,\big|\,\widehat X_0 = x_i^0\big)\big]_{1\le i,j\le \bar N}$.
536 11 Optimal Stopping, Multi-asset American/Bermudan Options
For every $k \in \{0,\dots,n\}$, $\|X_k\|_{2+\delta} = \|X_0\|_{2+\delta}$, hence $N_k = \big\lfloor\frac N{n+1}\big\rfloor$, $k = 0,\dots,n$, and
$$\big\|V_0 - \widehat v_0(\widehat X_0)\big\|_2 \le C_{X,\delta}\,\frac{n^{\frac12+\frac1d}}{N^{\frac1d}} \le C_{X,\delta}\,\frac{\sqrt n}{\bar N^{\frac1d}} \quad\text{with } \bar N = \frac N{n+1}.$$
$$\mathrm{Err}(n, N) = \frac{c_1}{n^\alpha} + \frac{c_2\, n^{\frac12+\frac1d}}{N^{\frac1d}}.$$
In fact, this identity defines by a backward induction new grids $\widetilde\Gamma_k$. As a final step, one translates these new grids so that $\widetilde\Gamma_0$ and $\Gamma_0$ have the same mean. Of course, such a procedure is entirely heuristic.
Pros
• The quantization tree, once optimized, appears as a skeleton of the distribution
of the underlying Markov chain. This optimization phase can be performed off-line
whereas the payoff dependent part, the Quantized BDPP, is almost instantaneous.
• In many situations, including the multi-dimensional Black–Scholes model, the quantization tree can be designed on the Brownian motion itself, for which pre-computed optimal grids are available.
We rely in this brief section on the notations introduced in Sect. 11.2.1. So far we have seen that the $(\mathbb P, (\mathcal F_k)_{k=0,\dots,n})$-Snell envelope $(U_k)_{k=0,\dots,n}$ of an $(\mathcal F_k)_{0\le k\le n}$-adapted sequence $(Z_k)_{k=0,\dots,n}$ of integrable non-negative random variables defined on a filtered probability space $(\Omega, \mathcal A, \mathbb P, (\mathcal F_k)_{k=0,\dots,n})$ is defined as a $\mathbb P$-essential supremum (see Sect. 12.9) and satisfies a Backward Dynamic Programming Principle
Theorem 11.5 (Dual form of the Snell envelope (Rogers)) Let $(Z_k)_{k=0,\dots,n}$ be as above and let $(U_k)_{k=0,\dots,n}$ be its $(\mathbb P, (\mathcal F_k)_{0\le k\le n})$-Snell envelope. Let
$$\mathcal M_k = \big\{M = (M_\ell)_{\ell=0,\dots,n}\ (\mathbb P, (\mathcal F_\ell))\text{-martingale},\ M_k = 0\big\}.$$
Proposition 11.5 (Doob's decomposition) Let $(S_k)_{k=0,\dots,n}$ be a $(\mathbb P, (\mathcal F_k)_{0\le k\le n})$-super-martingale, i.e. a sequence of integrable $(\mathcal F_k)_{0\le k\le n}$-adapted random variables satisfying
$$\forall\, k \in \{0,\dots,n-1\}, \quad \mathbb E(S_{k+1}\,|\,\mathcal F_k) \le S_k.$$
Then there exists a pair $(M, A)$, unique up to $\mathbb P$-a.s. equality, where $M = (M_k)_{k=0,\dots,n}$ is a $(\mathbb P, (\mathcal F_k)_{0\le k\le n})$-martingale null at 0 and $A = (A_k)_{k=0,\dots,n}$ is a non-decreasing sequence of integrable random variables with $A_0 = 0$ and $A_k$ $\mathcal F_{k-1}$-measurable for every $k \in \{1,\dots,n\}$, such that
$$\forall\, k \in \{0,\dots,n\}, \quad S_k = S_0 + M_k - A_k.$$
Exercise. Prove this proposition. [Hint: first establish uniqueness by showing that $A_k = -\sum_{\ell=1}^k \mathbb E\big(S_\ell - S_{\ell-1}\,\big|\,\mathcal F_{\ell-1}\big)$, $k = 1,\dots,n$.]
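The hint can be checked numerically on a toy super-martingale — a sketch under assumptions of my own (names and example are illustrative): for a symmetric $\pm1$ random walk $(R_k)$, $S_k = -|R_k|$ is a super-martingale with $\mathbb E(S_{k+1}-S_k\,|\,\mathcal F_k) = -\mathbf 1_{\{R_k=0\}}$, so the hint gives the compensator $A_k = \#\{\ell < k : R_\ell = 0\}$, and $M_k = S_k - S_0 + A_k$ should be a martingale null at 0, i.e. $\mathbb E\,M_n = 0$.

```python
import random

def doob_path(n, rng):
    """One path of S_k = -|R_k| together with its compensator
    A_k = #{l < k : R_l = 0} (from A_k = -sum_l E(S_l - S_{l-1} | F_{l-1}))."""
    rw, s, a = 0, [0.0], [0.0]
    for _ in range(n):
        a.append(a[-1] + (1.0 if rw == 0 else 0.0))   # A is predictable
        rw += rng.choice((-1, 1))
        s.append(-abs(rw))
    return s, a

rng = random.Random(42)
n, M = 20, 20000
# M_n = S_n - S_0 + A_n has zero mean if (M_k) is indeed a martingale
mean_end = sum(s[-1] - s[0] + a[-1]
               for s, a in (doob_path(n, rng) for _ in range(M))) / M
```

Over 20 000 simulated paths the empirical mean of $M_n$ should be statistically indistinguishable from 0.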
Proof of Theorem 11.5. Step 1 ($k = 0$). Let $M \in \mathcal M_0$ and let $\tau$ be an $(\mathcal F_k)_{0\le k\le n}$-stopping time. Then, by the optional stopping theorem (see [217]), $\mathbb E\,M_\tau = 0$, so that
$$\mathbb E(Z_\tau\,|\,\mathcal F_0) = \mathbb E(Z_\tau - M_\tau\,|\,\mathcal F_0) \le \mathbb E\Big(\sup_{k\in\{0,\dots,n\}}(Z_k - M_k)\,\Big|\,\mathcal F_0\Big) \quad \mathbb P\text{-a.s.}$$
11.4 Dual Form of the Snell Envelope (Discrete Time) 539
Conversely, it is clear from the BDPP (11.7) satisfied by the Snell envelope that $(U_k)_{k=0,\dots,n}$ is a $(\mathbb P, (\mathcal F_k)_{0\le k\le n})$-super-martingale which dominates $(Z_k)_{k=0,\dots,n}$ (i.e. $U_k \ge Z_k$ $\mathbb P$-a.s. for every $k = 0,\dots,n$). Hence, it admits a Doob decomposition $(M^*, A^*)$ such that $U_k = U_0 + M_k^* - A_k^*$, $k = 0,\dots,n$, with $M^*$ and $A^*$ as in the above Proposition 11.5. Consequently, for every $k = 0,\dots,n$, $Z_k - M_k^* \le U_k - M_k^* = U_0 - A_k^* \le U_0$, which in turn implies $\mathbb E\big(\sup_{0\le k\le n}(Z_k - M_k^*)\,\big|\,\mathcal F_0\big) \le \mathbb E(U_0\,|\,\mathcal F_0) = U_0$.
Step 2 (Generic $k$). For a generic $k \in \{0,\dots,n\}$, one adapts the proof of the above step by considering for $\tau$ a $\{k,\dots,n\}$-valued stopping time and, in the second part of the proof, by considering the martingale $M_\ell^{*,k} = M_{\ell\vee k}^* - M_k^*$, $\ell = 0,\dots,n$. Conditioning by $\mathcal F_k$ completes the proof. ♦
$$\chi_Z'(u) = \tilde\imath \int_{-\infty}^{+\infty} x\, e^{\tilde\imath u x}\, e^{-\frac{x^2}2}\,\frac{dx}{\sqrt{2\pi}} = -u\,\chi_Z(u)$$
so that
$$\chi_Z(u) = \chi_Z(0)\,e^{-\frac{u^2}2} = e^{-\frac{u^2}2}. \qquad \diamond$$
Corollary 12.1 If $Z = (Z^1,\dots,Z^d) \overset{d}{=} \mathcal N(0; I_d)$ is a multivariate normal distribution, then its characteristic function $\chi_Z(u) = \mathbb E\, e^{\tilde\imath(u\,|\,Z)}$, $u \in \mathbb R^d$, is given by
$$\chi_Z(u) = e^{-\frac{|u|^2}2}.$$
© Springer International Publishing AG, part of Springer Nature 2018 541
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_12
542 12 Miscellany
Proof. For every $u \in \mathbb R^d$, $(u\,|\,Z) = \sum_{i=1}^d u^i Z^i$ has a Gaussian distribution with variance $\sum_{i=1}^d (u^i)^2 = |u|^2$ since the components $Z^i$ are independent and $\mathcal N(0;1)$-distributed. Consequently,
$$(u\,|\,Z) \overset{d}{=} |u|\,\zeta, \quad \zeta \overset{d}{=} \mathcal N(0;1), \qquad\text{so that}\qquad \forall\, u \in \mathbb R^d, \quad \chi_Z(u) = e^{-\frac{|u|^2}2}. \qquad \diamond$$
Remark. An alternative argument is that the $\mathbb C$-valued random variables $e^{\tilde\imath u^i Z^i}$, $i = 1,\dots,d$, are independent, so that
$$\chi_Z(u) = \mathbb E\, e^{\tilde\imath\sum_{i=1}^d u^i Z^i} = \prod_{i=1}^d \mathbb E\, e^{\tilde\imath u^i Z^i} = \prod_{i=1}^d e^{-\frac{(u^i)^2}2} = e^{-\frac{|u|^2}2}.$$
$$\forall\, x \in \mathbb R_+, \quad \Phi_0(x) = 1 - \frac{e^{-\frac{x^2}2}}{\sqrt{2\pi}}\big(a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5\big) + O\big(e^{-\frac{x^2}2}\, t^6\big),$$
where $t := \frac1{1+px}$, $p := 0.231\,6419$ and
$$a_1 = 0.319\,381\,530,\quad a_2 = -0.356\,563\,782,\quad a_3 = 1.781\,477\,937,\quad a_4 = -1.821\,255\,978,\quad a_5 = 1.330\,274\,429,$$
inducing a maximal error of the form $O\big(e^{-\frac{x^2}2}\, t^6\big) \le 7.5\cdot 10^{-8}$.
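This approximation is a few lines of code. The sketch below implements it for $x \ge 0$ (negative arguments follow from the symmetry $\Phi_0(-x) = 1 - \Phi_0(x)$); the names `phi0`, `A`, `P` are mine, and the expansion can be checked against `math.erf` since $\Phi_0(x) = \frac12\big(1+\operatorname{erf}(x/\sqrt2)\big)$.

```python
import math

# Coefficients of the classical polynomial approximation of the N(0;1) c.d.f.
A = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)
P = 0.2316419

def phi0(x):
    """Approximate Phi_0(x) for x >= 0, maximal error below 7.5e-8."""
    t = 1.0 / (1.0 + P * x)
    poly = sum(a * t ** (i + 1) for i, a in enumerate(A))
    return 1.0 - math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) * poly
```

For instance, `phi0(1.23)` reproduces the tabulated value $\Phi_0(1.23) \approx 0.8907$ below.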
The distribution function of the N (0; 1) distribution is given for every real number
t by
12.1 More on the Normal Distribution 543
$$\Phi_0(t) := \frac1{\sqrt{2\pi}}\int_{-\infty}^t e^{-\frac{x^2}2}\,dx.$$
The following tables give the values of $\Phi_0(t)$ for $t = x_0.x_1x_2$, where $x_0 \in \{0, 1, 2\}$, $x_1 \in \{0, 1, \dots, 9\}$ and $x_2 \in \{0, 1, \dots, 9\}$.
For example, if $t = 1.23$ (i.e. row 1.2 and column 0.03) one has $\Phi_0(t) \approx 0.8907$.
t 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6661 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7290 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9779 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
One notes that $\Phi_0(t) = 0.9986$ for $t = 2.99$. This comes from the fact that the mass of the normal distribution is mainly concentrated on the interval $[-3, 3]$, as emphasized by the table of "large" values hereafter (for instance, we observe that $\mathbb P(\{|X| \le 4.5\}) \ge 0.99999$!).
t 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.8 4.0 4.5
$\Phi_0(t)$ .99865 .99904 .99931 .99952 .99966 .99976 .999841 .999928 .999968 .999997
$$X_t^{x_0} = x_0\, e^{(r-\frac{\sigma^2}2)t+\sigma W_t}, \qquad W_t \overset{d}{=} \mathcal N(0; t),$$
where
$$\mathrm{Call}_0(x_0, K, r, \sigma, \tau) = x_0\,\Phi_0(d_1) - e^{-r\tau}K\,\Phi_0(d_2), \quad \tau > 0, \tag{12.1}$$
with
$$d_1 = \frac{\log\frac{x_0}K + \big(r + \frac{\sigma^2}2\big)\tau}{\sigma\sqrt\tau}, \qquad d_2 = d_1 - \sigma\sqrt\tau. \tag{12.2}$$
As for the put option written on the payoff $h_T = (K - X_T^{x_0})_+$, the premium is
$$\mathrm{Put}_0(x_0, K, r, \sigma, \tau) = e^{-r\tau}K\,\Phi_0(-d_2) - x_0\,\Phi_0(-d_1). \tag{12.3}$$
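Formulas (12.1)–(12.3) translate directly into code. The sketch below (function names `call0`, `put0` are mine) computes $\Phi_0$ via the error function and obtains the put through call–put parity, which is equivalent to (12.3).

```python
import math

def norm_cdf(x):
    """Phi_0(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call0(x0, K, r, sigma, tau):
    """Black-Scholes call premium (12.1)-(12.2)."""
    d1 = (math.log(x0 / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return x0 * norm_cdf(d1) - math.exp(-r * tau) * K * norm_cdf(d2)

def put0(x0, K, r, sigma, tau):
    """Black-Scholes put premium (12.3), via call-put parity."""
    return call0(x0, K, r, sigma, tau) - x0 + math.exp(-r * tau) * K
```

For instance, with $x_0 = K = 100$, $r = 5\%$, $\sigma = 20\%$, $\tau = 1$, one finds a call premium close to $10.45$.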
Theorem 12.1 (Baire σ-field Theorem) Let $(S, d_S)$ be a metric space. Then
$$\mathrm{Bor}(S, d_S) = \sigma\big(\mathcal C(S, \mathbb R)\big),$$
where $\mathcal C(S, \mathbb R)$ denotes the set of continuous functions from $(S, d_S)$ to $\mathbb R$. When $S$ is σ-compact (i.e. is a countable union of compact sets), one may replace the space $\mathcal C(S, \mathbb R)$ by the space $\mathcal C_K(S, \mathbb R)$ of continuous functions with compact support.
Remarks. • All norms being strongly equivalent on $\mathbb R^d$, claims (i) and (ii) do not depend on the selected norm on $\mathbb R^d$.
• $L^1$-uniform integrability of a family of probability distributions $(\mu_i)_{i\in I}$ defined on $(\mathbb R^d, \mathrm{Bor}(\mathbb R^d))$ can be defined accordingly by
$$\lim_{R\to+\infty}\ \sup_{i\in I}\int_{\{|x|\ge R\}} |x|\,\mu_i(dx) = 0.$$
Proof of Theorem 12.3. Assume first that (i) holds. It is clear that
$$\sup_{i\in I}\mathbb E\,|X_i| \le R + \sup_{i\in I}\mathbb E\big(|X_i|\,\mathbf 1_{\{|X_i|\ge R\}}\big) < +\infty,$$
at least for large enough $R \in (0, +\infty)$. Now, for every $i \in I$ and every $A \in \mathcal A$,
$$\int_A |X_i|\,d\mathbb P \le R\,\mathbb P(A) + \int_{\{|X_i|\ge R\}} |X_i|\,d\mathbb P.$$
Owing to (i) there exists a real number $R = R(\varepsilon) > 0$ such that $\sup_{i\in I}\int_{\{|X_i|\ge R\}}|X_i|\,d\mathbb P \le \frac\varepsilon2$. Then setting $\eta = \eta(\varepsilon) = \frac\varepsilon{2R}$ yields (ii).
Conversely, for every real number $R > 0$, the Markov inequality implies
$$\sup_{i\in I}\mathbb P(|X_i| \ge R) \le \frac{\sup_{i\in I}\mathbb E\,|X_i|}R.$$
Let $\eta = \eta(\varepsilon)$ be given by (ii)(β). As soon as $R > \frac{\sup_{i\in I}\mathbb E\,|X_i|}\eta$, $\sup_{i\in I}\mathbb P(\{|X_i| \ge R\}) \le \eta$ and (ii)(β) implies that
12.4 Uniform Integrability as a Domination Property 547
$$\forall\, i \in I, \quad |X_i| \le Y_i \quad \mathbb P\text{-a.s.}$$
Proof. One derives from (ii) the existence of a subsequence $(X_{n'})_{n\ge1}$ such that $X_{n'} \to X$ a.s. Hence, by Fatou's Lemma,
12.5 Interchanging…
Theorem 12.5 (Interchanging continuity and expectation) (see e.g. [52], Chap. 8)
(a) Let $(\Omega, \mathcal A, \mathbb P)$ be a probability space, let $I$ be a nontrivial interval of $\mathbb R$ and let $\Phi : I\times\Omega \to \mathbb R$ be a $\mathrm{Bor}(I)\otimes\mathcal A$-measurable function. Let $x_0 \in I$. If the function $\Phi$ satisfies:
(i) for every $x \in I$, the random variable $\Phi(x, \cdot) \in L^1_{\mathbb R}(\mathbb P)$,
(ii) $\mathbb P(d\omega)$-a.s., $x \mapsto \Phi(x, \omega)$ is continuous at $x_0$,
(iii) there exists $Y \in L^1_{\mathbb R_+}(\mathbb P)$ such that for every $x \in I$, $|\Phi(x, \cdot)| \le Y$ $\mathbb P$-a.s.,
The same extension as Claim (b), based on uniform integrability, holds true for
the differentiation Theorem 2.2 (see the exercise following the theorem).
The main reference for this topic is [45]. See also [239].
The basic result of weak convergence theory is the so-called Portmanteau Theorem
stated below (the definition and notation of weak convergence of probability measures
are recalled in Sect. 4.1).
Theorem 12.6 Let (μn )n≥1 be a sequence of probability measures on a Polish (met-
ric) space (S, δ) equipped with its Borel σ-field S and let μ be a probability measure
on the same space. The following properties are equivalent:
S
(i) μn =⇒ μ as n → +∞,
(ii) For every open set $O$ of $(S, \delta)$,
$$\liminf_n \mu_n(O) \ge \mu(O).$$
(iii) For every closed set $F$ of $(S, \delta)$,
$$\limsup_n \mu_n(F) \le \mu(F).$$
(iv) For every Borel set $A \in \mathcal S$ such that $\mu(\partial A) = 0$ (where $\partial A = \bar A\setminus\mathring A$ is the boundary of $A$),
$$\lim_n \mu_n(A) = \mu(A).$$
(v) Weak Fatou Lemma: for every non-negative lower semi-continuous function $f : S \to \mathbb R_+$,
$$0 \le \int_S f\,d\mu \le \liminf_n \int_S f\,d\mu_n.$$
(vi) For every bounded Borel function $f : S \to \mathbb R$ such that $\mu\big(\mathrm{Disc}(f)\big) = 0$,
$$\lim_n \int_S f\,d\mu_n = \int_S f\,d\mu,$$
where $\mathrm{Disc}(f) = \{x \in S : f \text{ is not continuous at } x\}$.
For a proof, we refer to [45], Chap. 1.
When dealing with unbounded functions, there is a kind of weak Lebesgue dom-
inated convergence theorem.
Proposition 12.2 Let (μn )n≥1 be a sequence of probability measures on a Polish
(metric) space (S, δ) weakly converging to μ.
(a) Let $g : S \to \mathbb R_+$ be a (non-negative) $\mu$-integrable continuous function and let $f : S \to \mathbb R$ be a $\mu$-a.s. continuous Borel function. If
$$0 \le |f| \le g \quad\text{and}\quad \int_S g\,d\mu_n \longrightarrow \int_S g\,d\mu \ \text{ as } n\to+\infty,$$
then $f \in L^1(\mu)$ and
$$\int_S f\,d\mu_n \longrightarrow \int_S f\,d\mu \ \text{ as } n\to+\infty.$$
(b) The conclusion still holds if $f$ is $(\mu_n)_{n\ge1}$-uniformly integrable, i.e.
$$\lim_{R\to+\infty}\ \sup_{n\ge1}\int_{\{|f|\ge R\}} |f|\,d\mu_n = 0.$$
we derive from the above proposition and the assumption made on $g$ that, for every such $R$,
$$\limsup_n\Big|\int f\,d\mu_n - \int f\,d\mu\Big| \le 2\int (g - g_R)\,d\mu = 2\int g\,\mathbf 1_{\{g>R\}}\,d\mu.$$
As there are at most countably many $R$ which are $\mu$-atoms for $g$ and $f$, we may let $R$ go to infinity, so that $\int g\,\mathbf 1_{\{g>R\}}\,d\mu \to 0$ as $R \to +\infty$. This completes the proof. ♦
$$\mu_n \overset{(S)}{\Longrightarrow} \mu \iff \forall\, u \in \mathbb R^d, \quad \widehat\mu_n(u) \longrightarrow \widehat\mu(u).$$
For a proof we refer, for example, to [156], or to any textbook presenting a first
course in Probability Theory.
Theorem 12.7 Let $(M_n)_{n\ge0}$ be a square integrable discrete time $(\mathcal F_n)$-martingale defined on a filtered probability space $(\Omega, \mathcal A, \mathbb P, (\mathcal F_n)_{n\ge0})$.
(a) There exists a unique non-decreasing $(\mathcal F_n)$-predictable process null at 0, denoted by $(\langle M\rangle_n)_{n\ge0}$, such that $\big(M_n^2 - \langle M\rangle_n\big)_{n\ge0}$ is an $(\mathcal F_n)$-martingale. Moreover,
$$M_n \xrightarrow[n\to+\infty]{} M_\infty \quad\text{on the event } \big\{\langle M\rangle_\infty < +\infty\big\},$$
where $M_\infty$ is a finite random variable. If, furthermore, $\mathbb E\,\langle M\rangle_\infty < +\infty$, then $M_\infty \in L^2(\mathbb P)$.
Lemma 12.1 (Kronecker Lemma) Let $(a_n)_{n\ge1}$ be a sequence of real numbers and let $(b_n)_{n\ge1}$ be a non-decreasing sequence of positive real numbers with $\lim_n b_n = +\infty$. Then
$$\sum_{n\ge1}\frac{a_n}{b_n}\ \text{converges in } \mathbb R \text{ as a series} \implies \frac1{b_n}\sum_{k=1}^n a_k \longrightarrow 0 \ \text{ as } n \to +\infty.$$
Proof. Set $C_n = \sum_{k=1}^n \frac{a_k}{b_k}$, $n \ge 1$, $C_0 = 0$. The assumption says that $C_n \to C_\infty = \sum_{k\ge1}\frac{a_k}{b_k} \in \mathbb R$ as $n \to +\infty$. Now, for every $n \ge 1$, $b_n > 0$, $a_k = b_k\,\Delta C_k$ and
$$\frac1{b_n}\sum_{k=1}^n a_k = \frac1{b_n}\sum_{k=1}^n b_k\,\Delta C_k = \frac1{b_n}\Big(b_n C_n - \sum_{k=1}^n C_{k-1}\,\Delta b_k\Big) = C_n - \frac1{b_n}\sum_{k=1}^n \Delta b_k\, C_{k-1},$$
where we used an Abel transform for series. The result follows from the extended Cesàro Theorem since $\Delta b_n \ge 0$ and $\lim_n b_n = +\infty$. ♦
n
where
• X 0 is F0 -measurable,
• $(H_t)_{t\ge0} = \big([H_t^{ij}]_{i=1:d,\,j=1:q}\big)_{t\ge0}$ and $(K_t)_{t\ge0} = \big([K_t^i]_{i=1:d}\big)_{t\ge0}$ are $(\mathcal F_t)$-progressively measurable processes ($^1$) having values in $\mathbb M(d, q, \mathbb R)$ and $\mathbb R^d$, respectively,
• $\int_0^T |K_s|\,ds < +\infty$ $\mathbb P$-a.s.,
• $\int_0^T \|H_s\|^2\,ds < +\infty$ $\mathbb P$-a.s.,
• W is a q-dimensional (Ft )-standard Brownian motion, (2 )
is called an Itô process (3 ).
Note that the processes K and H in (12.4) are P-a.s. essentially unique (since a
continuous (local) martingale null at zero with finite variation is indistinguishable
from the null process).
In particular, an Itô process is a local martingale ($^4$) if and only if, $\mathbb P$-a.s., $K_t = 0$ for every $t \in [0, T]$.
If $\mathbb E\int_0^t \|H_s\|^2\,ds = \int_0^t \mathbb E\,\|H_s\|^2\,ds < +\infty$, then the process $\big(\int_0^t H_s\,dW_s\big)_{t\ge0}$ is an $\mathbb R^d$-valued square integrable martingale with a bracket process
$$\Big\langle \Big(\int_0^\cdot H_s\,dW_s\Big)^i, \Big(\int_0^\cdot H_s\,dW_s\Big)^j \Big\rangle_t = \int_0^t (H_s H_s^*)^{ij}\,ds, \quad 1\le i,j\le d,$$
$^1$ A stochastic process $(Y_t)_{t\ge0}$ defined on $(\Omega, \mathcal A, \mathbb P)$ is $(\mathcal F_t)$-progressively measurable if for every $t \in \mathbb R_+$, the mapping $(s, \omega) \mapsto Y_s(\omega)$ defined on $[0, t]\times\Omega$ is $\mathrm{Bor}([0, t])\otimes\mathcal F_t$-measurable.
$^2$ This means that $W$ is $(\mathcal F_t)$-adapted and, for every $s, t \ge 0$, $s \le t$, $W_t - W_s$ is independent of $\mathcal F_s$.
$^3$ The stochastic integral is defined by $\big(\int_0^s H_u\,dW_u\big)^i = \sum_{j=1}^q \int_0^s H_u^{ij}\,dW_u^j$, $1\le i\le d$.
$^4$ An $(\mathcal F_t)$-adapted continuous process $(M_t)_{t\ge0}$ is a local martingale if there exists a sequence $(\tau_n)_{n\ge1}$ of $(\mathcal F_t)$-stopping times, increasing to $+\infty$, such that $(M_{t\wedge\tau_n} - M_0)_{t\ge0}$ is an $(\mathcal F_t)$-martingale for every $n \ge 1$.
and, for every $i, j \in \{1,\dots,d\}$, the processes
$$\Big(\int_0^t H_s\,dW_s\Big)^i\Big(\int_0^t H_s\,dW_s\Big)^j - \Big\langle \Big(\int_0^\cdot H_s\,dW_s\Big)^i, \Big(\int_0^\cdot H_s\,dW_s\Big)^j \Big\rangle_t$$
are $(\mathbb P, (\mathcal F_t))$-martingales. Otherwise, these are only local martingales.
where
$$\mathcal L f(t, x) = \big(b(t, x)\,\big|\,\nabla_x f(t, x)\big) + \frac12\,\mathrm{Tr}\big(\sigma\, D_x^2 f\,\sigma^*\big)(t, x)$$
denotes the infinitesimal generator of the diffusion process.
For a comprehensive exposition of the theory of continuous martingales and stochastic calculus, we refer to [162, 163, 249, 251, 256] among many other references. For a more synthetic introduction with a view to applications to Finance, we refer to [183].
The essential supremum is an object playing for a.s. comparisons a similar role
for random variables defined on a probability space (, A, P) to that played by the
regular supremum for real numbers on the real line.
Let us understand on a simple example why the regular supremum is not the right notion in a framework where equalities and inequalities hold in an a.s. sense. First, it is usually not measurable when taken over a non-countable infinite set of indices: indeed, let $(\Omega, \mathcal A, \mathbb P) = ([0,1], \mathrm{Bor}([0,1]), \lambda)$, where $\lambda$ denotes the Lebesgue measure on $[0,1]$, let $I \subset [0,1]$ and $X_i = \mathbf 1_{\{i\}}$, $i \in I$. Then $\sup_{i\in I} X_i = \mathbf 1_I$ is measurable if and only if $I$ is a Borel subset of the interval $[0,1]$.
Furthermore, even when $I = [0,1]$, $\sup_{i\in I} X_i = 1$ is certainly measurable but is constantly equal to 1, whereas $X_i = 0$ $\mathbb P$-a.s. for every $i \in I$.
12.9 Essential Supremum (and Infimum) 555
We will see, however, that the notion of essential supremum defined below does satisfy $\operatorname*{esssup}_{i\in I} X_i = 0$ $\mathbb P$-a.s.
If $(X_i)_{i\in I}$ is stable under finite supremum, then there exists a $\mathbb P$-a.s. non-decreasing sequence $(X_{i_n})_{n\in\mathbb N}$ such that
$$X_{i_n} \uparrow \operatorname*{esssup}_{i\in I} X_i \quad \mathbb P\text{-a.s.}$$
(c) Let $(\Omega, \mathcal P(\Omega), \mathbb P)$ be a finite probability space such that $\mathbb P(\{\omega\}) > 0$ for every $\omega \in \Omega$. Then $\sup_{i\in I} X_i$ is the unique essential supremum of the $X_i$.
We adopt the simplified notation $\operatorname{esssup}$ (without the index $i \in I$) in the proof below to alleviate notation.
Proof. (a) Let us begin by $\mathbb P$-a.s. uniqueness. Assume there exists a random variable $Z$ satisfying (i) and (ii). Following (i) for $Z$ and (ii) for $\operatorname{esssup}_{i\in I} X_i$, we derive that $\operatorname{esssup}_{i\in I} X_i \le Z$ $\mathbb P$-a.s. Finally, (i) for $\operatorname{esssup}_{i\in I} X_i$ and (ii) for $Z$ give the reverse inequality.
As for the existence, first assume that the variables $X_i$ are $[0,1]$-valued. Set
$$M_{\mathcal X} := \sup\Big\{\mathbb E\,\sup_{j\in J} X_j,\ J \subset I,\ J \text{ countable}\Big\} \in [0,1].$$
This definition is consistent since, $J$ being countable, $\sup_{j\in J} X_j$ is measurable and $[0,1]$-valued. As $M_{\mathcal X} \in [0,1]$ is a finite real number, there exists a sequence $(J_n)_{n\ge1}$ of countable subsets of $I$ satisfying
$$\mathbb E\,\sup_{j\in J_n} X_j \ge M_{\mathcal X} - \frac1n.$$
Set
$$\operatorname*{esssup}_{i\in I} X_i := \sup_{j\in J} X_j \quad\text{where } J = \cup_{n\ge1} J_n.$$
This defines a random variable since $J$ is itself countable as the countable union of countable sets. This yields $\mathbb E\,\sup_{j\in J} X_j = M_{\mathcal X}$. Consequently, for every $i \in I$,
$$\mathbb E\Big(\underbrace{\sup_{j\in J\cup\{i\}} X_j - \sup_{j\in J} X_j}_{\ge 0}\Big) \le 0,$$
so that $X_i \le \sup_{j\in J} X_j$ $\mathbb P$-a.s.
When the $X_i$ have values in $[-\infty, +\infty]$, one introduces an increasing bijection $\phi : [-\infty,+\infty] \to [0,1]$ and sets
$$\operatorname*{esssup}_{i\in I} X_i := \phi^{-1}\Big(\operatorname*{esssup}_{i\in I}\,\phi(X_i)\Big).$$
It is easy to check that, thus defined, $\operatorname*{esssup}_{i\in I} X_i$ satisfies (i) and (ii). By uniqueness, $\operatorname*{esssup}_{i\in I} X_i$ does not depend $\mathbb P$-a.s. on the selected function $\phi$.
(b) Following (a), there exists a sequence $(j_n)_{n\in\mathbb N}$ such that $J = \{j_n,\ n \ge 0\}$ satisfies the above identity. Now, owing to the stability property of the supremum (12.6), one may build a sequence $(i_n)_{n\in\mathbb N}$ such that $X_{i_0} = X_{j_0}$ and, for every $n \ge 1$,
We will not detail all the properties of the $\mathbb P$-essential supremum, which are quite similar to those of the regular supremum (up to an a.s. equality). Typically, one has, with obvious notations and $\mathbb P$-a.s.,
$$\forall\,\lambda \ge 0, \quad \operatorname*{esssup}_{i\in I}(\lambda X_i) = \lambda\operatorname*{esssup}_{i\in I} X_i,$$
etc.
As a conclusion, we may define the essential infimum by setting, with the notation of the above theorem,
$$\operatorname*{essinf}_{i\in I} X_i := -\operatorname*{esssup}_{i\in I}(-X_i).$$
$$S_n(x, y) = \sum_{k=1}^n \mathbf 1_{\{\xi_k^1\le x\}\cap\{\xi_k^2\le y\}}, \quad (x, y) \in [0,1]^2,$$
and
$$\widetilde S_n(x, y) = S_n(x, y) - n\,x\,y.$$
Let $n \ge 1$ be a fixed integer. Let $r_i = \big\lfloor\frac{\log n}{\log p_i}\big\rfloor$, $i = 1, 2$. We consider the increasingly reordered $n$-tuples of both components, denoted by $(\xi_k^{i,(n)})_{k=1,\dots,n}$, $i = 1, 2$, with the two additional terms $\xi_0^{i,(n)} = 0$ and $\xi_{n+1}^{i,(n)} = 1$. Then, set
$$\mathcal X_n^{1,2} = \Big\{\big(\xi_{k_1}^{1,(n)}, \xi_{k_2}^{2,(n)}\big),\ 0\le k_1, k_2 \le n\Big\}.$$
$$x = \frac{x_1}{p_1} + \cdots + \frac{x_{r_1}}{p_1^{r_1}}, \qquad y = \frac{y_1}{p_2} + \cdots + \frac{y_{r_2}}{p_2^{r_2}}.$$
The function $S_n$ is constant over rectangles of the form $\big(\xi_k^{1,(n)}, \xi_{k+1}^{1,(n)}\big]\times\big(\xi_\ell^{2,(n)}, \xi_{\ell+1}^{2,(n)}\big]$, $0\le k, \ell\le n$, so that $\widetilde S_n$ is decreasing on each of them with respect to the componentwise partial order on $[0,1]^2$ (the order induced by inclusions of boxes $[[0, x]]$). Moreover, $\widetilde S_n$ is "right-continuous" in the sense that $\widetilde S_n(x, y) = \lim_{(u,v)\to(x,y),\,u>x,\,v>y}\widetilde S_n(u, v)$. As both sequences are $(0,1)$-valued, $\widetilde S_n(1,1) = 0$, $\widetilde S_n$ is continuous at $(1,1)$ and, for $x, y \in [0,1)$,
$$\widetilde S_n(x, 1) = \lim_{(u,v)\to(x,1),\,u>x,\,v<1}\widetilde S_n(u, v), \qquad \widetilde S_n(1, y) = \lim_{(u,v)\to(1,y),\,u<1,\,v>y}\widetilde S_n(u, v).$$
Consequently,
$$\inf_{(x,y)\in[0,1]^2}\widetilde S_n(x, y) = \min_{1\le k,\ell\le n+1}\widetilde S_n\big(\xi_k^{1,(n)-}, \xi_\ell^{2,(n)-}\big).$$
$$\widetilde S_n(1^-, y^-) = \lim_{v\to y,\,v<y}\widetilde S_n(1, v) \ge -n\,D_n^*(\xi^2) \quad\text{and}\quad \widetilde S_n(x^-, 1^-) \ge -n\,D_n^*(\xi^1),$$
12.10 Halton Sequence Discrepancy (Proof of an Upper-Bound) 559
so that
$$\inf_{(x,y)\in[0,1]^2}\widetilde S_n(x, y) \ge -\max\big(n\,D_n^*(\xi^1),\ n\,D_n^*(\xi^2)\big).$$
Consequently,
$$\sup_{(x,y)\in[0,1]^2}\big|\widetilde S_n(x, y)\big| \le \max\Big(n\,D_n^*(\xi^1),\ n\,D_n^*(\xi^2),\ \max_{(x,y)\in\mathcal X_n^{1,2}}\widetilde S_n(x, y)\Big). \tag{12.9}$$
which is equivalent to a congruence condition on $k$. Consequently, the joint conditions $\xi_k^1 < x$ and $\xi_k^2 < y$ read: there exist $r \in \{0,\dots,r_1\}$, $s \in \{0,\dots,r_2\}$, $u_r \in \{0,\dots,x_{r+1}-1\}$ and $v_s \in \{0,\dots,y_{s+1}-1\}$ such that
$$\mathcal S_{r,s}(u_r, v_s) \equiv \begin{cases} k \equiv x_1 + x_2\,p_1 + \cdots + x_r\,p_1^{r-1} + u_r\,p_1^r \mod p_1^{r+1},\\[1mm] k \equiv y_1 + y_2\,p_2 + \cdots + y_s\,p_2^{s-1} + v_s\,p_2^s \mod p_2^{s+1}.\end{cases}$$
Since $p_1^{r+1}$ and $p_2^{s+1}$ are mutually prime, we know by the Chinese Remainder Theorem that the system $\mathcal S_{r,s}(u_r, v_s)$ has a unique solution in $\{0,\dots,p_1^{r+1}p_2^{s+1}-1\}$, hence $\big\lfloor\frac n{p_1^{r+1}p_2^{s+1}}\big\rfloor + \eta_{r,s,u_r,v_s}$ solutions lying in $\{0,\dots,n\}$ with $\eta_{r,s,u_r,v_s} \in \{0,1\}$. The solution $k = 0$ of the system $\mathcal S_{0,0}(0,0)$ should be removed since it is not admissible. Consequently, as $(x, y) \in \mathcal X_n^{1,2}$,
$$S_n(x, y) = 1 + \sum_{k=1}^n \mathbf 1_{\{\xi_k^1<x\}}\,\mathbf 1_{\{\xi_k^2<y\}} = 1 - 1 + \sum_{r=0}^{r_1}\sum_{s=0}^{r_2}\sum_{u_r=0}^{x_{r+1}-1}\sum_{v_s=0}^{y_{s+1}-1}\Big(\Big\lfloor\frac n{p_1^{r+1}p_2^{s+1}}\Big\rfloor + \eta_{r,s,u_r,v_s}\Big) = \sum_{r=0}^{r_1}\sum_{s=0}^{r_2} x_{r+1}\,y_{s+1}\Big(\Big\lfloor\frac n{p_1^{r+1}p_2^{s+1}}\Big\rfloor + \eta_{r,s}\Big),$$
where $\eta_{r,s} \in [0,1]$. Owing to (12.8), (12.9) and the obvious fact that $n\,x\,y = \sum_{0\le r\le r_1}\sum_{0\le s\le r_2}\frac{n\,x_{r+1}\,y_{s+1}}{p_1^{r+1}p_2^{s+1}}$, we derive that for every $(x, y) \in \mathcal X_n^{1,2}$,
$$\widetilde S_n(x, y) \le \sum_{r=0}^{r_1}\sum_{s=0}^{r_2}\sum_{u_r=0}^{x_{r+1}-1}\sum_{v_s=0}^{y_{s+1}-1}\bigg(\underbrace{\Big\lfloor\frac n{p_1^{r+1}p_2^{s+1}}\Big\rfloor - \frac n{p_1^{r+1}p_2^{s+1}}}_{\in[-1,0]} + \underbrace{\eta_{r,s,u_r,v_s}}_{\in[0,1]}\bigg) \le (r_1+1)(r_2+1)(p_1-1)(p_2-1).$$
We conclude by noting that, on the one hand, $r_i + 1 \le \frac{\log(p_i n)}{\log p_i}$, $i = 1, 2$ and, on the other hand,
$$n\,D_n^*(\xi) \le \max\Big(\sup_{(x,y)\in[0,1]^2}\big|\widetilde S_n(x, y)\big|,\ n\,D_n^*(\xi^1),\ n\,D_n^*(\xi^2)\Big).$$
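The sequences $\xi^1, \xi^2$ analyzed here are the coordinates of the Halton sequence, i.e. radical-inverse (van der Corput) sequences in coprime bases. The sketch below generates them exactly with rationals and evaluates a crude lower statistic for the star discrepancy at the points themselves; the helper names are mine and the statistic is only an illustration, not the quantity $D_n^*$ of the proof.

```python
from fractions import Fraction

def radical_inverse(k, p):
    """Van der Corput radical inverse of k in base p (exact rational)."""
    x, q = Fraction(0), Fraction(1, p)
    while k:
        k, digit = divmod(k, p)
        x += digit * q
        q /= p
    return x

def halton(n, p1=2, p2=3):
    """First n points of the (p1, p2) Halton sequence in [0, 1)^2."""
    return [(radical_inverse(k, p1), radical_inverse(k, p2))
            for k in range(1, n + 1)]

def star_disc_lower(pts):
    """Crude discrepancy statistic max_{points} |card/n - x*y|."""
    n = len(pts)
    best = 0.0
    for (x, y) in pts:
        card = sum(1 for (u, v) in pts if u <= x and v <= y)
        best = max(best, abs(card / n - float(x * y)))
    return best
```

For $n = 100$ points in bases $(2, 3)$ the statistic is small, consistently with the logarithmic upper bound $n D_n^* = O\big((\log n)^2\big)$ proved above.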
Finally, following the above lines – in a simpler way – one shows that each one-dimensional term $n\,D_n^*(\xi^i)$, $i = 1, 2$, satisfies a similar logarithmic bound.
12.11 A Pitman–Yor Identity as a Benchmark 561
Conditional on the process $(W_s^1)_{0\le s\le t}$, $X_t^2$ has a centered Gaussian distribution with stochastic variance
$$\sigma^2\int_0^t\big(\mu s + W_s^1\big)^2\,ds.$$
Let us compute $\mathbb E\,e^{\tilde\imath X_t^2}$. As $W^1$ and $W^2$ are independent,
$$\mathbb E\,e^{\tilde\imath X_t^2} = \mathbb E\,e^{\tilde\imath\sigma\int_0^t(\mu s+W_s^1)\,dW_s^2} = \mathbb E\Big[\Big(\mathbb E\,e^{\tilde\imath\sigma\int_0^t(\mu s+w_s)\,dW_s^2}\Big)_{\big|\,w=W^1}\Big] = \mathbb E\,e^{-\frac{\sigma^2}2\int_0^t(\mu s+W_s^1)^2\,ds}.$$
We apply Girsanov's Theorem and make the change of probability $d\mathbb P^* = e^{-\mu W_t^1-\frac{\mu^2}2 t}\cdot d\mathbb P$, under which $B_s = \mu s + W_s^1$ is a standard Brownian motion and $\frac{d\mathbb P}{d\mathbb P^*} = e^{\mu B_t-\frac12\mu^2 t}$, which yields
$$\mathbb E\,e^{\tilde\imath X_t^2} = \mathbb E^*\Big(e^{\mu B_t-\frac12\mu^2 t}\,e^{-\frac{\sigma^2}2\int_0^t B_s^2\,ds}\Big) \tag{12.10}$$
$$= \mathbb E^*\Big(e^{\mu B_t-\frac12\mu^2 t}\,\mathbb E^*\big(e^{-\frac{\sigma^2}2\int_0^t B_s^2\,ds}\,\big|\,B_t\big)\Big) = e^{-\frac12\mu^2 t}\,\mathbb E^*\Big(e^{\mu B_t}\,\mathbb E^*\big(e^{-\frac{\sigma^2}2\int_0^t B_s^2\,ds}\,\big|\,(B_t)^2\big)\Big), \tag{12.11}$$
where we used in the third line that, as $B \overset{d}{=} -B$ and $\int_0^t B_s^2\,ds = \int_0^t(-B_s)^2\,ds$,
$$\mathbb E^*\big(e^{-\frac{\sigma^2}2\int_0^t B_s^2\,ds}\,\big|\,B_t\big) = \mathbb E^*\big(e^{-\frac{\sigma^2}2\int_0^t B_s^2\,ds}\,\big|\,(B_t)^2\big). \tag{12.12}$$
As $B_t \overset{d}{=} \mathcal N(0; t)$,
$$\mathbb E^*\,e^{\mu B_t+\frac{B_t^2}{2t}(1-\sigma t\coth\sigma t)} = \int_{-\infty}^{+\infty} e^{\mu x+\frac{x^2}{2t}(1-\sigma t\coth\sigma t)}\,e^{-\frac{x^2}{2t}}\,\frac{dx}{\sqrt{2\pi t}} = \frac1{\sqrt t}\int_{-\infty}^{+\infty}\exp\Big(-\frac12\big(x^2\,\sigma\coth\sigma t - 2\mu x\big)\Big)\frac{dx}{\sqrt{2\pi}}.$$
We set $a = \sqrt{\sigma\coth\sigma t}$ and $b = \mu/a$ and we get
$$\mathbb E^*\,e^{\mu B_t+\frac{B_t^2}{2t}(1-\sigma t\coth\sigma t)} = \frac1{\sqrt t}\,e^{\frac{b^2}2}\int_{-\infty}^{+\infty}\exp\Big(-\frac{(ax-b)^2}2\Big)\frac{dx}{\sqrt{2\pi}} = \frac{e^{\frac{b^2}2}}{a\sqrt t}.$$
Hence,
$$\mathbb E\,e^{\tilde\imath X_t^2} = e^{-\frac12\mu^2 t+\frac{b^2}2}\,\sqrt{\frac{\sigma t}{\sinh\sigma t}}\cdot\frac1{\sqrt t\,\sqrt{\sigma\coth\sigma t}} = \frac{e^{-\frac{\mu^2 t}2\big(1-\frac{\tanh\sigma t}{\sigma t}\big)}}{\sqrt{\cosh\sigma t}}.$$
Since the right-hand side of the above equality has no imaginary part, we obtain the announced formula (9.105):
$$\mathbb E\,\cos\big(X_t^2\big) = \frac{e^{-\frac{\mu^2 t}2\big(1-\frac{\tanh\sigma t}{\sigma t}\big)}}{\sqrt{\cosh\sigma t}}.$$
Remark. Note that these computations are shortened when there is no drift term, i.e.
$$X_t^1 = W_t^1, \qquad X_t^2 = \sigma\int_0^t X_s^1\,dW_s^2.$$
In particular, for $t = 1$ and $\mu = 0$ one retrieves the classical closed form
$$\mathbb E\,e^{-\sigma\int_0^1(W_s^1)^2\,ds} = \big(\cosh\sqrt{2\sigma}\big)^{-\frac12}.$$
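This closed form makes a convenient benchmark for a crude Monte Carlo simulation — a sketch under assumptions of my own (Euler discretization of the Brownian path, left-point Riemann sum for the integral, illustrative function name):

```python
import math
import random

def mc_functional(sigma, n_steps=200, n_paths=5000, seed=1):
    """Crude Monte Carlo estimate of E exp(-sigma * int_0^1 W_s^2 ds)."""
    rng = random.Random(seed)
    acc, h = 0.0, 1.0 / n_steps
    for _ in range(n_paths):
        w, integral = 0.0, 0.0
        for _ in range(n_steps):
            w += math.sqrt(h) * rng.gauss(0.0, 1.0)   # Brownian increment
            integral += w * w * h                     # Riemann sum of W_s^2
        acc += math.exp(-sigma * integral)
    return acc / n_paths

# Benchmark value for sigma = 1: (cosh sqrt(2))^(-1/2)
closed_form = 1.0 / math.sqrt(math.cosh(math.sqrt(2.0)))
```

With these (modest) simulation sizes the estimator should land within a few hundredths of the closed-form value, illustrating why such identities serve as benchmarks for numerical schemes.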
16. V. Bally, An elementary introduction to Malliavin calculus, technical report (Inria) (2003),
https://hal.inria.fr/inria-00071868/document
17. V. Bally, The central limit theorem for a non-linear algorithm based on quantization. Proc. R.
Soc. 460, 221–241 (2004)
18. V. Bally, A. Kohatsu-Higa, A probabilistic interpretation of the parametrix method. Ann.
Appl. Probab. 25(6), 3095–3138 (2015)
19. V. Bally, G. Pagès, J. Printems, A stochastic quantization method for non-linear problems.
Monte Carlo Methods Appl. 7(1), 21–34 (2001)
20. V. Bally, G. Pagès, A quantization algorithm for solving discrete time multi-dimensional
optimal stopping problems. Bernoulli 9(6), 1003–1049 (2003)
21. V. Bally, G. Pagès, Error analysis of the quantization algorithm for obstacle problems. Stoch.
Process. Appl. 106(1), 1–40 (2003)
22. V. Bally, G. Pagès, J. Printems, First order schemes in the numerical quantization method.
Math. Financ. 13(1), 1–16 (2003)
23. V. Bally, G. Pagès, J. Printems, A quantization tree method for pricing and hedging multi-
dimensional American options. Math. Financ. 15(1), 119–168 (2005)
24. V. Bally, D. Talay, The distribution of the Euler scheme for stochastic differential equations:
I. Convergence rate of the distribution function. Probab. Theory Relat. Fields 104(1), 43–60
(1996)
25. V. Bally, D. Talay, The law of the Euler scheme for stochastic differential equations. II.
Convergence rate of the density. Monte Carlo Methods Appl. 2(2), 93–128 (1996)
26. C. Barrera-Esteve, F. Bergeret, C. Dossal, E. Gobet, A. Meziou, R. Munos, D. Reboul-Salze,
Numerical methods for the pricing of Swing options: a stochastic control approach. Methodol.
Comput. Appl. Probab. 8(4), 517–540 (2006). https://doi.org/10.1007/s11009-006-0427-8
27. O. Bardou, S. Bouthemy, G. Pagès, Pricing swing options using optimal quantization. Appl.
Math. Financ. 16(2), 183–217 (2009)
28. O. Bardou, S. Bouthemy, G. Pagès, When are Swing options bang-bang? Int. J. Theor. Appl.
Financ. 13(6), 867–899 (2010)
29. O. Bardou, N. Frikha, G. Pagès, Computing VaR and CVaR using stochastic approximation
and adaptive unconstrained importance sampling. Monte Carlo Appl. J. 15(3), 173–210 (2009)
30. J. Barraquand, D. Martineau, Numerical valuation of high dimensional multivariate American
securities. J. Financ. Quant. Anal. 30, 383–405 (1995)
31. D. Bauer, A. Reuss, D. Singer, On the calculation of the solvency capital requirement based
on nested simulations. Astin Bull. 42(2), 453–499 (2012)
32. D. Belomestny, T. Nagapetyan, Multilevel path simulation for weak approximation schemes
with application to Lévy-driven SDEs. Bernoulli 23(2), 927–950 (2017)
33. M. Benaïm, Dynamics of stochastic approximation algorithms, in Séminaire de Probabilités
XXXIII, ed. by J. Azéma, M. Émery, M. Ledoux, M. Yor. LNM, vol. 1709 (1999), pp. 1–68
34. M. Ben Alaya, A. Kebaier, Central limit theorem for the multilevel Monte Carlo Euler method.
Ann. Appl. Probab. 25(1), 211–234 (2015)
35. M. Ben Alaya, A. Kebaier, Multilevel Monte Carlo for Asian options and limit theorems.
Monte Carlo Methods Appl. 20(3), 181–194 (2014)
36. C. Bender, Dual pricing of multi-exercise options under volume constraints. Financ. Stoch.
15(1), 1–26 (2011)
37. C. Bender, C. Gärtner, N. Schweizer, Iterative improvement of lower and upper bounds for
backward SDEs. SIAM J. Sci. Comput. 39(2), B442–B466 (2017)
38. C. Bender, J. Schoenmakers, J. Zhang, Dual representations for general multiple stopping
problems. Math. Financ. 25(2), 339–370 (2015)
39. A. Benveniste, M. Métivier, P. Priouret, Algorithmes Adaptatifs et Approximations Stochas-
tiques (Masson, Paris, 1987), pp. 367 (Adaptive Algorithms and Stochastic Approximations
(Springer, Berlin, English version, 1993))
40. J. Bertoin, Lévy Processes. Cambridge tracts in Mathematics, vol. 121 (Cambridge University
Press, Cambridge, 1996), pp. 262
Bibliography 565
41. A. Berkaoui, M. Bossy, A. Diop, Euler scheme for SDEs with non-Lipschitz diffusion coef-
ficient: strong convergence. ESAIM Probab. Stat. 12, 1–11 (2008)
42. A. Beskos, O. Papaspiliopoulos, G.O. Roberts, Retrospective exact simulation of diffusion
sample paths with applications. Bernoulli 12(6), 1077–1098 (2006)
43. A. Beskos, G.O. Roberts, Exact simulation of diffusions. Ann. Appl. Probab. 15(4), 2422–
2444 (2005)
44. P. Billingsley, Probability and Measure, 3rd edn. (1st edn., 1979). Wiley Series in Probability
and Mathematical Statistics (A Wiley–Interscience Publication, Wiley, New York, 1995),
pp. xiv+593
45. P. Billingsley, Convergence of Probability Measures, 1st edn. (Wiley, New York, 1968),
pp. 253; 2nd edn. (1999), pp. 277
46. J.-P. Borel, G. Pagès, Y.-J. Xiao, Suites à discrépance faible et intégration numérique, in
Probabilités Numériques, ed. by N. Bouleau, D. Talay (Coll. didactique, INRIA, 1992). ISBN-
10: 2726107087
47. B. Bouchard, I. Ekeland, N. Touzi, On the Malliavin approach to Monte Carlo approximation
of conditional expectations. Financ. Stoch. 8(1), 45–71 (2004)
48. N. Bouleau, D. Lépingle, Numerical Methods for Stochastic Processes. Wiley Series in Prob-
ability and Mathematical Statistics: Applied Probability and Statistics (A Wiley-Interscience
Publication; Wiley, New York, 1994), pp. 359
49. C. Bouton, Approximation gaussienne d'algorithmes stochastiques à dynamique markovienne.
Annales de l'I.H.P. Probabilités et Statistiques 24(1), 131–155 (1988)
50. P.P. Boyle, Options: a Monte Carlo approach. J. Financ. Econ. 4(3), 323–338 (1977)
51. O. Brandière, M. Duflo, Les algorithmes stochastiques contournent-ils les pièges? (French)
[Do stochastic algorithms go around traps?]. Ann. Inst. H. Poincaré Probab. Stat. 32(3), 395–
427 (1996)
52. M. Briane, G. Pagès, Théorie de l'intégration, convolution et transformée de Fourier, coll.
Cours et Exercices, 6th edn. (Vuibert, Paris, 2015), pp. 400
53. M. Broadie, Y. Du, C.C. Moallemi, Efficient risk estimation via nested sequential simulation.
Manag. Sci. 57(6), 1172–1194 (2011)
54. R. Buche, H.J. Kushner, Rate of convergence for constrained stochastic approximation algo-
rithm. SIAM J. Control Optim. 40(4), 1001–1041 (2002)
55. K. Bujok, B. Hambly, C. Reisinger, Multilevel simulation of functionals of Bernoulli random
variables with application to basket credit derivatives. Methodol. Comput. Appl. Probab.
17(3), 579–604 (2015)
56. S. Burgos, M. Giles, Computing Greeks using multilevel path simulation, in Monte Carlo and
Quasi-Monte Carlo Methods 2010 (Springer, Berlin, 2012), pp. 281–296
57. G. Callegaro, L. Fiorin, M. Grasselli, Quantized calibration in local volatility, in Risk (Cutting
Edge: Derivatives Pricing) (2015), pp. 56–67, https://www.risk.net/risk-magazine/
technical-paper/2402156/quantized-calibration-in-local-volatility
58. G. Callegaro, L. Fiorin, M. Grasselli, Pricing via quantization in stochastic volatility models.
Quant. Financ. 17(6), 855–872 (2017)
59. L. Carassus, G. Pagès, Finance de marché, Modèles mathématiques à temps discret (Vuibert,
Paris, 2016), pp. xii+385. ISBN 978-2-311-40136-3
60. J.F. Carrière, Valuation of the early-exercise price for options using simulations and nonpara-
metric regression. Insur. Math. Econ. 19, 19–30 (1996)
61. N. Chen, Y. Liu, Estimating expectations of functionals of conditional expectations via
multilevel nested simulation, presentation at the MCQMC'12 conference, Sydney (2012)
62. D.-Y. Cheng, A. Gersho, B. Ramamurthi, Y. Shoham, Fast search algorithms for vector
quantization and pattern matching. Proc. IEEE Int. Conf. Acoust. Speech Signal Process 1,
9.11.1–9.11.4 (1984)
63. K.L. Chung, An estimate concerning the Kolmogorov limit distribution. Trans. Am. Math.
Soc. 67, 36–50 (1949)
64. É. Clément, A. Kohatsu-Higa, D. Lamberton, A duality approach for the weak approximation
of stochastic differential equations. Ann. Appl. Probab. 16(3), 1124–1154 (2006)
65. É. Clément, D. Lamberton, P. Protter, An analysis of a least squares regression method for
American option pricing. Financ. Stoch. 6(2), 449–471 (2002)
66. S. Cohen, M.M. Meerschaert, J. Rosinski, Modeling and simulation with operator scaling.
Stoch. Process. Appl. 120(12), 2390–2411 (2010)
67. P. Cohort, Sur quelques problèmes de quantification, thèse de l’Université Paris 6, Paris (2000),
pp. 187
68. P. Cohort, Limit theorems for random normalized distortion. Ann. Appl. Probab. 14(1), 118–
143 (2004)
69. L. Comtet, Advanced Combinatorics. The Art of Finite and Infinite Expansions. Revised and
enlarged edition (D. Reidel Publishing Co., Dordrecht, 1974), pp. xi+343. ISBN: 90-277-0441-4
70. S. Corlay, G. Pagès, Functional quantization-based stratified sampling methods. Monte Carlo
Methods Appl. 21(1), 1–32 (2015). https://doi.org/10.1515/mcma-2014-0010
71. R. Cranley, T.N.L. Patterson, Randomization of number theoretic methods for multiple inte-
gration. SIAM J. Numer. Anal. 13(6), 904–914 (1976)
72. D. Dacunha-Castelle, M. Duflo, Probabilités et Statistique II: Problèmes à temps mobile
(Masson, Paris, 1986). English version: Probability and Statistics II (translated by
D. McHale) (Springer, New York, 1986), pp. 290
73. R.B. Davis, D.S. Harte, Tests for Hurst effect. Biometrika 74, 95–101 (1987)
74. S. Dereich, F. Heidenreich, A multilevel Monte Carlo algorithm for Lévy-driven stochastic
differential equations. Stoch. Process. Appl. 121, 1565–1587 (2011)
75. S. Dereich, S. Li, Multilevel Monte Carlo for Lévy-driven SDEs: central limit theorems for
adaptive Euler schemes. Ann. Appl. Probab. 26(1), 136–185 (2016)
76. L. Devineau, S. Loisel, Construction d'un algorithme d'accélération de la méthode des
simulations dans les simulations pour le calcul du capital économique Solvabilité II. Bull.
Français d'Actuariat, Institut des Actuaires 10(17), 188–221 (2009)
77. L. Devroye, Non-Uniform Random Variate Generation (Springer, New York, 1986), 843pp.
First edition available at http://www.eirene.de/Devroye.pdf or http://www.nrbook.com/
devroye/
78. J. Dick, Higher order scrambled digital nets achieve the optimal rate of the root mean square
error for smooth integrands. Ann. Stat. 39(3), 1372–1398 (2011)
79. Q. Du, V. Faber, M. Gunzburger, Centroidal Voronoi tessellations: applications and algorithms.
SIAM Rev. 41, 637–676 (1999)
80. Q. Du, M. Emelianenko, L. Ju, Convergence of the Lloyd algorithm for computing centroidal
Voronoi tessellations. SIAM J. Numer. Anal. 44, 102–119 (2006)
81. M. Duflo, Algorithmes stochastiques, coll. Mathématiques et Applications, vol. 23
(Springer, Berlin, 1996), pp. 319
82. M. Duflo, Random Iterative Models, translated from the 1990 French original by Stephen S.
Wilson and revised by the author. Applications of Mathematics (New York), vol. 34 (Springer,
Berlin, 1996), pp. 385
83. D. Egloff, M. Leippold, Quantile estimation with adaptive importance sampling. Ann. Stat.
38(2), 1244–1278 (2010)
84. N. El Karoui, J.P. Lepeltier, A. Millet, A probabilistic approach of the réduite in optimal
stopping. Probab. Math. Stat. 13(1), 97–121 (1988)
85. M. Emelianenko, L. Ju, A. Rand, Nondegeneracy and weak global convergence of the Lloyd
algorithm in R^d. SIAM J. Numer. Anal. 46(3), 1423–1441 (2008)
86. P. Étoré, B. Jourdain, Adaptive optimal allocation in stratified sampling methods. Methodol.
Comput. Appl. Probab. 12(3), 335–360 (2010)
87. M. Fathi, N. Frikha, Transport-entropy inequalities and deviation estimates for stochastic
approximations schemes. Electron. J. Probab. 18(67), 36 (2013)
88. H. Faure, Suite à discrépance faible dans T^s, technical report, Université de Limoges
(France, 1986)
89. H. Faure, Discrépance associée à un système de numération (en dimension s). Acta Arith. 41,
337–361 (1982)
90. O. Faure, Simulation du mouvement brownien et des diffusions, Thèse de l'ENPC (France,
Marne-la-Vallée, 1992), pp. 133
91. L. Fiorin, G. Pagès, A. Sagna, Markovian and product quantization of an R^d-valued Euler
scheme of a diffusion process with applications to finance, to appear in Methodology and
Computing in Applied Probability (2017), arXiv:1511.01758v3
92. H. Föllmer, A. Schied, Stochastic Finance. An Introduction in Discrete Time (2nd edn.,
2004). De Gruyter Studies in Mathematics, vol. 27 (De Gruyter, Berlin, 2002), pp. 422
93. G. Fort, É. Moulines, A. Schreck, M. Vihola, Convergence of Markovian stochastic approxi-
mation with discontinuous dynamics. SIAM J. Control Optim. 54(2), 866–893 (2016)
94. J.C. Fort, G. Pagès, Convergence of stochastic algorithms: from the Kushner & Clark Theorem
to the Lyapunov functional. Adv. Appl. Probab. 28, 1072–1094 (1996)
95. J.C. Fort, G. Pagès, Decreasing step stochastic algorithms: a.s. behavior of weighted empirical
measures. Monte Carlo Methods Appl. 8(3), 237–270 (2002)
96. J.C. Fort, G. Pagès, Asymptotics of optimal quantizers for some scalar distributions. J. Comput.
Appl. Math. 146(2), 253–275 (2002)
97. N. Fournier, A. Guillin, On the rate of convergence in Wasserstein distance of the empirical
measure. Probab. Theory Relat. Fields 162(3–4), 707–738 (2015)
98. A. Friedman, Stochastic Differential Equations and Applications Volume 1–2. Probability and
Mathematical Statistics, vol. 28 (Academic Press [Harcourt Brace Jovanovich, Publishers],
New York, 1975), pp. 528
99. J.H. Friedman, J.L. Bentley, R.A. Finkel, An algorithm for finding best matches in logarithmic
expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
100. N. Frikha, S. Menozzi, Concentration bounds for stochastic approximations. Electron.
Commun. Probab. 17(47), 1–15 (2012)
101. S. Gadat, F. Panloup, Optimal non-asymptotic bound of the Ruppert-Polyak averaging without
strong convexity, pre-print, (2017), arXiv:1709.03342v1
102. J.G. Gaines, T. Lyons, Random generation of stochastic area integrals. SIAM J. Appl. Math.
54(4), 1132–1146 (1994)
103. S.B. Gelfand, S.K. Mitter, Recursive stochastic algorithms for global optimization in R^d.
SIAM J. Control Optim. 29, 999–1018 (1991)
104. S.B. Gelfand, S.K. Mitter, Metropolis-type annealing algorithms for global optimization in
R^d. SIAM J. Control Optim. 31, 111–131 (1993)
105. A. Gersho, R.M. Gray (eds.), Special issue on quantization, I–II. IEEE Trans. Inf. Theory
28, 139–148 (1982)
106. A. Gersho, R.M. Gray, Vector Quantization and Signal Compression (Kluwer, Boston, 1992),
pp. 732
107. M.B. Giles, Multilevel Monte Carlo path simulation. Oper. Res. 56(3), 607–617 (2008)
108. M.B. Giles, Vibrato Monte Carlo sensitivities, in Monte Carlo and Quasi-Monte Carlo Methods
2008 (Springer, Berlin, 2009), pp. 369–382
109. M.B. Giles, Multilevel Monte Carlo methods. Acta Numer. 24, 259–328 (2015)
110. M.B. Giles, L. Szpruch, Antithetic multilevel Monte Carlo estimation for multi-dimensional
SDEs, in Monte Carlo and quasi-Monte Carlo methods 2012. Springer Proceedings in Math-
ematics and Statistics, vol. 65, (Springer, Heidelberg, 2013), pp. 367–384
111. M.B. Giles, L. Szpruch, Multilevel Monte Carlo methods for applications in finance, in Recent
Developments in Computational Finance, Interdisciplinary Mathematics Sciences, vol. 14
(World Scientific Publishing, Hackensack, 2013), pp. 3–47
112. D. Giorgi, V. Lemaire, G. Pagès, Limit theorems for weighted and regular Multilevel estima-
tors. Monte Carlo Methods Appl. 23(1), 43–70 (2017)
113. D. Giorgi, Théorèmes limites pour estimateurs Multilevel avec et sans poids. Comparaisons
et applications, PhD thesis, UPMC (2017), pp. 123
114. D. Giorgi, V. Lemaire, G. Pagès, Weak error for nested Multilevel Monte Carlo, (2018),
arXiv:1806.07627.
115. P. Glasserman, Monte Carlo Methods in Financial Engineering (Springer, New York, 2003),
pp. 596
163. I. Karatzas, S.E. Shreve, Methods of Mathematical Finance. Applications of Mathematics, vol. 39
(Springer, New York, 1998), xvi+407pp
164. O. Kavian, Introduction à la théorie des points critiques et applications aux problèmes
elliptiques (1993) (French) [Introduction to critical point theory and applications to elliptic
problems]. coll. Mathématiques et Applications [Mathematics and Applications], vol. 13
(Springer, Paris [Berlin]), pp. 325. ISBN: 2-287-00410-6
165. H.L. Keng, W. Yuan, Application of Number Theory to Numerical Analysis (Springer and
Science Press, Beijing, 1981), pp. 241
166. A.G. Kemna, A.C. Vorst, A pricing method for options based on average asset value. J. Bank.
Financ. 14, 113–129 (1990)
167. A. Kebaier, Statistical Romberg extrapolation: a new variance reduction method and applica-
tions to option pricing. Ann. Appl. Probab. 15(4), 2681–2705 (2005)
168. J.C. Kieffer, Exponential rate of convergence for Lloyd’s method I. IEEE Trans. Inf. Theory
(Special issue on quantization) 28(2), 205–210 (1982)
169. J. Kiefer, J. Wolfowitz, Stochastic estimation of the maximum of a regression function. Ann.
Math. Stat. 23, 462–466 (1952)
170. P.E. Kloeden, E. Platen, Numerical Solution of Stochastic Differential Equations. Applications
of Mathematics, vol. 23 (Springer, Berlin (New York), 1992), pp. 632
171. A. Kohatsu-Higa, R. Pettersson, Variance reduction methods for simulation of densities on
Wiener space. SIAM J. Numer. Anal. 40(2), 431–450 (2002)
172. V. Konakov, E. Mammen, Edgeworth type expansions for Euler schemes for stochastic dif-
ferential equations. Monte Carlo Methods Appl. 8(3), 271–285 (2002)
173. V. Konakov, S. Menozzi, S. Molchanov, Explicit parametrix and local limit theorems for some
degenerate diffusion processes. Ann. Inst. H. Poincaré Probab. Stat. (série B) 46(4), 908–923
(2010)
174. U. Krengel, Ergodic Theorems. De Gruyter Studies in Mathematics, vol. 6 (De Gruyter,
Berlin, 1985), pp. 357
175. L. Kuipers, H. Niederreiter, Uniform Distribution of Sequences (Wiley, New York, 1974),
pp. 390
176. H. Kunita, Stochastic differential equations and stochastic flows of diffeomorphisms, in Cours
d’école d’été de Saint-Flour XII–1982, LN-1097 (Springer, Berlin, 1984), pp. 143–303
177. H. Kunita, Stochastic flows and stochastic differential equations. Reprint of the 1990 orig-
inal. Cambridge Studies in Advanced Mathematics, vol. 24 (Cambridge University Press,
Cambridge, 1997), pp. xiv+346
178. T.G. Kurtz, P. Protter, Weak limit theorems for stochastic integrals and stochastic differential
equations. Ann. Probab. 19, 1035–1070 (1991)
179. H.J. Kushner, H. Huang, Rates of convergence for stochastic approximation type algorithms.
SIAM J. Control 17(5), 607–617 (1979)
180. H.J. Kushner, G.G. Yin, Stochastic Approximation and Recursive Algorithms and Applica-
tions, 2nd edn. Applications of Mathematics, Stochastic Modeling and Applied Probability,
vol. 35 (Springer, New York, 2003)
181. J.P. Lambert, Quasi-Monte Carlo, low-discrepancy, and ergodic transformations. J. Comput.
Appl. Math. 12–13, 419–423 (1985)
182. D. Lamberton, Optimal stopping and American options, a course at the Ljubljana
Summer School on Financial Mathematics (2009), https://www.fmf.uni-lj.si/finmath09/
ShortCourseAmericanOptions.pdf
183. D. Lamberton, B. Lapeyre, Introduction to Stochastic Calculus Applied to Finance (Chapman
& Hall, London, 1996), pp. 185
184. D. Lamberton, G. Pagès, P. Tarrès, When can the two-armed bandit algorithm be trusted?
Ann. Appl. Probab. 14(3), 1424–1454 (2004)
185. B. Lapeyre, G. Pagès, Familles de suites à discrépance faible obtenues par itération d’une
transformation de [0, 1]. Comptes rendus de l’Académie des Sciences de Paris, Série I(308),
507–509 (1989)
186. B. Lapeyre, G. Pagès, K. Sab, Sequences with low discrepancy. Generalization and
application to Robbins–Monro algorithm. Statistics 21(2), 251–272 (1990)
187. B. Lapeyre, E. Temam, Competitive Monte Carlo methods for the pricing of Asian options.
J. Comput. Financ. 5(1), 39–59 (2001)
188. S. Laruelle, C.-A. Lehalle, G. Pagès, Optimal split of orders across liquidity pools: a stochastic
algorithm approach. SIAM J. Financ. Math. 2, 1042–1076 (2011)
189. S. Laruelle, C.-A. Lehalle, G. Pagès, Optimal posting price of limit orders: learning by trading.
Math. Financ. Econ. 7(3), 359–403 (2013)
190. S. Laruelle, G. Pagès, Stochastic approximation with averaging innovation. Monte Carlo
Methods Appl. 18, 1–51 (2012)
191. V.A. Lazarev, Convergence of stochastic approximation procedures in the case of a regression
equation with several roots (translated from Russian). Problemy Peredachi Informatsii 28(1), 66–78 (1992)
192. A. Lejay, V. Reutenauer, A variance reduction technique using a quantized Brownian motion
as a control variate. J. Comput. Financ. 16(2), 61–84 (2012)
193. M. Ledoux, M. Talagrand, Probability in Banach Spaces. Isoperimetry and Processes. Reprint
of the 1991 edition. Classics in Mathematics (Springer, Berlin, 2011), pp. xii+480
194. P. L’Ecuyer, G. Perron, On the convergence rates of IPA and FDC derivatives estimators.
Oper. Res. 42, 643–656 (1994)
195. M. Ledoux, Concentration of measure and logarithmic Sobolev inequalities, technical report
(1997), http://www.math.univ-toulouse.fr/~ledoux/Berlin.pdf
196. J. Lelong, Almost sure convergence of randomly truncated stochastic algorithms under veri-
fiable conditions. Stat. Probab. Lett. 78(16), 2632–2636 (2008)
197. J. Lelong, Asymptotic normality of randomly truncated stochastic algorithms. ESAIM
Probab. Stat. 17, 105–119 (2013)
198. V. Lemaire, G. Pagès, Multilevel Richardson-Romberg extrapolation. Bernoulli 23(4A),
2643–2692 (2017)
199. V. Lemaire, G. Pagès, Unconstrained recursive importance sampling. Ann. Appl. Probab.
20(3), 1029–1067 (2010)
200. L. Ljung, Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control 22(4),
551–575 (1977)
201. S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)
202. F.A. Longstaff, E.S. Schwartz, Valuing American options by simulation: a simple least-squares
approach. Rev. Financ. Stud. 14, 113–148 (2001)
203. H. Luschgy, Martingale in diskreter Zeit: Theorie und Anwendungen (Springer, Berlin, 2013),
pp. 452
204. H. Luschgy, G. Pagès, Functional quantization of Gaussian processes. J. Funct. Anal. 196(2),
486–531 (2002)
205. H. Luschgy, G. Pagès, Functional quantization rate and mean regularity of processes with an
application to Lévy processes. Ann. Appl. Probab. 18(2), 427–469 (2008)
206. D. McLeish, A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods
Appl. 17(4), 301–315 (2011)
207. J. McNames, A fast nearest-neighbor algorithm based on principal axis search tree. IEEE
Trans. Pattern Anal. Mach. Intell. 23(9), 964–976 (2001)
208. P. Malliavin, A. Thalmaier, Stochastic Calculus of Variations in Mathematical Finance.
Springer Finance. (Springer, Berlin, 2006), pp. xii+142
209. S. Manaster, G. Koehler, The calculation of implied variance from the Black-Scholes model:
a note. J. Financ. 37(1), 227–230 (1982)
210. G. Marsaglia, T.A. Bray, A convenient method for generating normal variables. SIAM Rev.
6, 260–264 (1964)
211. M. Matsumoto, T. Nishimura, Mersenne twister: a 623-dimensionally equidistributed uniform
pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8(1), 3–30 (1998)
212. G. Marsaglia, W.W. Tsang, The ziggurat method for generating random variables. J. Stat.
Softw. 5(8), 363–372 (2000)
213. M. Métivier, P. Priouret, Théorèmes de convergence presque sûre pour une classe
d’algorithmes stochastiques à pas décroissant. (French. English summary) [Almost sure con-
vergence theorems for a class of decreasing-step stochastic algorithms]. Probab. Theory Relat.
Fields 74(3), 403–428 (1987)
214. G.N. Milstein, A method of second order accuracy for stochastic differential equations. Theor.
Probab. Appl. (USSR) 23, 396–401 (1976)
215. U. Naumann, The Art of Differentiating Computer Programs: An Introduction to Algorith-
mic Differentiation. Software, Environments and Tools (SIAM, RWTH Aachen University,
Aachen, Germany, 2012), pp. xviii+333
216. J. Neveu, Bases mathématiques du calcul des probabilités (Masson, Paris, 1964), pp. 213;
English translation: Mathematical Foundations of the Calculus of Probability (Holden-Day,
San Francisco, 1965)
217. J. Neveu, Martingales à temps discret (Masson, Paris, 1972), pp. 218; English translation:
Discrete-Parameter Martingales (North-Holland, New York, 1975), pp. 236
218. D.J. Newman, The Hexagon Theorem. IEEE Trans. Inf. Theory (A. Gersho and R.M. Gray
eds.) 28, 137–138 (1982)
219. H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods. CBMS-NSF
Regional Conference Series in Applied Mathematics (SIAM, Philadelphia, 1992)
220. Numerical Recipes: Textbook available at http://www.ma.utexas.edu/documentation/nr/
bookcpdf.html. See also [247]
221. A. Owen, Local antithetic sampling with scrambled nets. Ann. Stat. 36(5), 2319–2343 (2008)
222. G. Pagès, Sur quelques problèmes de convergence, thèse de l'Université Paris 6, Paris (1987),
pp. 144
223. G. Pagès, Van der Corput sequences, Kakutani transform and one-dimensional numerical
integration. J. Comput. Appl. Math. 44, 21–39 (1992)
224. G. Pagès, A space vector quantization method for numerical integration. J. Comput. Appl.
Math. 89, 1–38 (1998) (extended version of “Voronoi tessellation, space quantization
algorithms and numerical integration”, in Proceedings of the ESANN'93, ed. by M. Verleysen
(Quorum Editions, Bruxelles, 1993), pp. 221–228)
225. G. Pagès, Multistep Richardson-Romberg extrapolation: controlling variance and complexity.
Monte Carlo Methods Appl. 13(1), 37–70 (2007)
226. G. Pagès, Quadratic optimal functional quantization methods and numerical applications, in
Proceedings of MCQMC, Ulm’06 (Springer, Berlin, 2007), pp. 101–142
227. G. Pagès, Functional co-monotony of processes with an application to peacocks and barrier
options, in Séminaire de Probabilités XXVI, LNM, vol. 2078 (Springer, Cham, 2013), pp.
365–400
228. G. Pagès, Introduction to optimal quantization for numerics. ESAIM Proc. Surv. 48, 29–79
(2015)
229. G. Pagès, H. Pham, J. Printems, Optimal quantization methods and applications to numeri-
cal problems in finance, in Handbook on Numerical Methods in Finance, ed. by S. Rachev
(Birkhauser, Boston, 2004), pp. 253–298
230. G. Pagès, H. Pham, J. Printems, An optimal Markovian Quantization algorithm for multi-
dimensional stochastic control problems. Stoch. Dyn. 4(4), 501–545 (2004)
231. G. Pagès, J. Printems, Optimal quadratic quantization for numerics: the Gaussian case. Monte
Carlo Methods Appl. 9(2), 135–165 (2003)
232. G. Pagès, J. Printems (2005), http://www.quantize.maths-fi.com, website devoted to optimal
quantization
233. G. Pagès, J. Printems, Optimal quantization for finance: from random vectors to stochastic
processes, in Mathematical Modeling and Numerical Methods in Finance (special volume),
ed. by A. Bensoussan, Q. Zhang (guest eds.), coll. Handbook of Numerical Analysis, ed. by
P.G. Ciarlet (North Holland, 2009), pp. 595–649
234. G. Pagès, A. Sagna, Recursive marginal quantization of the Euler scheme of a diffusion
process. Appl. Math. Financ. 22(5), 463–498 (2015)
235. G. Pagès, B. Wilbertz, Optimal Delaunay and Voronoi quantization methods for pricing Amer-
ican options, in Numerical Methods in Finance, ed. by R. Carmona, P. Hu, P. Del Moral, N.
Oudjane (Springer, New York, 2012), pp. 171–217
236. G. Pagès, B. Wilbertz, Sharp rate for the dual quantization problem, forthcoming, in Séminaire
de Probabilités XLIX (Springer, Berlin, 2018)
237. G. Pagès, Y.-J. Xiao, Sequences with low discrepancy and pseudo-random numbers:
theoretical results and numerical tests. J. Stat. Comput. Simul. 56, 163–188 (1997)
238. G. Pagès, J. Yu, Pointwise convergence of the Lloyd I algorithm in higher dimension. SIAM
J. Control Optim. 54(5), 2354–2382 (2016)
239. K.R. Parthasarathy, Probability measures on metric spaces, in Probability and Mathematical
Statistics, vol. 3 (Academic Press, Inc., New York-London, 1967), pp. xi+276
240. L. Paulot, Unbiased Monte Carlo Simulation of Diffusion Processes (2016), arXiv:1605.01998
241. M. Pelletier, Weak convergence rates for stochastic approximation with application to multiple
targets and simulated annealing. Ann. Appl. Probab. 8(1), 10–44 (1998)
242. R. Pemantle, Non-convergence to unstable points in urn models and stochastic approxima-
tions. Ann. Probab. 18(2), 698–712 (1990)
243. H. Pham, Some applications and methods of large deviations in finance and insurance, in
Paris-Princeton Lectures on Mathematical Finance 2004. LNM, vol. 1919 (Springer, New
York, 2007), pp. 191–244
244. J. Pitman, M. Yor, A decomposition of Bessel bridges. Z. Wahrsch. Verw. Gebiete 59(4),
425–457 (1982)
245. B.T. Polyak, A new method of stochastic approximation type, (Russian) Avtomat. i Telemekh.
7, 98–107 (1991); translation in Automation Remote Control, 51, part 2, 937–946
246. B.T. Polyak, A.B. Juditsky, Acceleration of stochastic approximation by averaging. SIAM J.
Control Optim. 30(4), 838–855 (1992). https://doi.org/10.1137/0330046
247. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C++. The Art
of Scientific Computing, 2nd edn. updated for C++ (Cambridge University Press, Cambridge,
2002), pp. 1002
248. P.D. Proïnov, Discrepancy and integration of continuous functions. J. Approx. Theory 52,
121–131 (1988)
249. P.E. Protter, Stochastic Integration and Differential Equations, 2nd edn. Version 2.1. Corrected
Third Printing. Stochastic Modelling and Applied Probability, vol. 21 (Springer, Berlin, 2004),
pp. xiv+419
250. P. Protter, D. Talay, The Euler scheme for Lévy driven stochastic differential equations. Ann.
Probab. 25(1), 393–423 (1997)
251. D. Revuz, M. Yor, Continuous Martingales and Brownian Motion, 3rd edn. Grundlehren der
Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol.
293 (Springer, Berlin, 1999), pp. 602
252. C. Rhee, P.W. Glynn, Unbiased estimation with square root convergence for SDE models.
Oper. Res. 63(5), 1026–1043 (2015)
253. H. Robbins, S. Monro, A stochastic approximation method. Ann. Math. Stat. 22, 400–407
(1951)
254. R.T. Rockafellar, S. Uryasev, Optimization of conditional value-at-risk. J. Risk 2(3), 21–41
(2000)
255. L.C.G. Rogers, Monte Carlo valuation of American options. Math. Financ. 12(3), 271–286
(2002)
256. L.C.G. Rogers, D. Williams, in Diffusions, Markov Processes, and Martingales, vol. 2 (Itô
calculus) (2000). Reprint of the second (1994) edition. Cambridge Mathematical Library
(Cambridge University Press, Cambridge 2000), pp. xiv+480
257. K.F. Roth, On irregularities of distribution. Mathematika 1, 73–79 (1954)
258. W. Rudin, Real and Complex Analysis (McGraw-Hill, New York, 1966), pp. xiii+416
259. D. Ruppert, Efficient Estimations from a Slowly Convergent Robbins-Monro Process, Cornell
University Operations Research and Industrial Engineering, Technical Report 781 Ithaca, New
York, 1988, pp. 29
260. A. Sagna, Pricing of barrier options by marginal functional quantization. Monte Carlo Methods
Appl. 17(4), 371–398 (2011)
261. K.-I. Sato, Lévy Processes and Infinitely Divisible Distributions (Cambridge University
Cambridge, 1999)
262. A.N. Shiryaev, Optimal Stopping Rules. Translated from the 1976 Russian second edition
by A.B. Aries. Reprint of the 1978 translation. Stochastic Modelling and Applied Probability,
vol. 8 (Springer, Berlin, 2008), pp. 217
263. A.N. Shiryaev, Probability. Graduate Texts in Mathematics, 2nd edn. (Springer, New York,
1995), pp. 621; Original Russian edition, 1980, original Russian second edition, 1989
264. P. Seumen-Tonou, Méthodes numériques probabilistes pour la résolution d'équations du
transport et pour l'évaluation d'options exotiques, Thèse de l'Université de Provence
(France, Marseille, 1997), pp. 116
265. I.M. Sobol’, Distribution of points in a cube and approximate evaluation of integrals, Zh.
Vych. Mat. Mat. Fiz., 7, 784–802 (1967) (in Russian); U.S.S.R Comput. Math. Math. Phys.
7, 86–112 (1967) (in English)
266. I.M. Sobol’, Y.L. Levitan, The production of points uniformly distributed in a multi-
dimensional cube. Technical Report 40, Institute of Applied Mathematics, USSR Academy
of Sciences (1976) (in Russian)
267. C. Soize, The Fokker–Planck Equation for Stochastic Dynamical Systems and Its Explicit
Steady State Solutions. Series on Advances in Mathematics for Applied Sciences, vol. 17
(World Scientific Publishing Co., Inc., River Edge, 1994), pp. xvi+321
268. M. Sussman, W. Crutchfield, M. Papakipos, Pseudo-random number generation on GPU,
in Graphic Hardware 2006, Proceeding of the Eurographics Symposium Vienna, Austria,
September 3–4, 2006, ed. by M. Olano, P. Slusallek (A K Peters Ltd., 2006), pp. 86–94
269. J.N. Tsitsiklis, B. Van Roy, Regression methods for pricing complex American-style options.
IEEE Trans. Neural Netw. 12(4), 694–703 (2001)
270. D. Talay, L. Tubaro, Expansion of the global error for numerical schemes solving stochastic
differential equations. Stoch. Anal. Appl. 8, 94–120 (1990)
271. B. Tuffin, Randomization of Quasi-Monte Carlo Methods for error estimation: survey and
normal approximation. Monte Carlo Methods Appl. 10(3–4), 617–628 (2004)
272. C. Villani, Topics in Optimal Transportation. Graduate Studies in Mathematics, vol. 58
(American Mathematical Society, Providence, 2003), pp. xvi+370
273. S. Villeneuve, A. Zanette, Parabolic A.D.I. methods for pricing American option on two
stocks. Math. Oper. Res. 27(1), 121–149 (2002)
274. H.F. Walker, P. Ni, Anderson acceleration for fixed-point iterations. SIAM J. Numer. Anal.
49(4), 1715–1735 (2011)
275. D.S. Watkins, Fundamentals of Matrix Computations. Pure and Applied Mathematics (Hobo-
ken), 3rd edn. (Wiley, Hoboken, 2010), pp. xvi+644
276. M. Winiarski, Quasi-Monte Carlo Derivative valuation and Reduction of simulation bias,
Master thesis (Royal Institute of Technology (KTH), Stockholm (Sweden), 2006)
277. A. Wood, G. Chan, Simulation of stationary Gaussian processes in [0, 1]^d. J. Comput.
Graph. Stat. 3(4), 409–432 (1994)
278. Y.-J. Xiao, Contributions aux méthodes arithmétiques pour la simulation accélérée (Thèse
de l’ENPC, Paris, 1990), pp. 110
279. Y.-J. Xiao, Suites équiréparties associées aux automorphismes du tore. C.R. Acad. Sci. Paris
(Série I) 317, 579–582 (1990)
280. P.L. Zador, Development and evaluation of procedures for quantizing multivariate distribu-
tions. Ph.D. dissertation, Stanford University (1963), pp. 111
281. P.L. Zador, Asymptotic quantization error of continuous signals and the quantization dimen-
sion. IEEE Trans. Inf. Theory 28(2), 139–149 (1982)
Index
A
Acceptance-rejection method, 10
α-quantile, 209
α-strictly convex function, 237
American Monte Carlo, 509
Anderson's acceleration method, 157
Antithetic method (variance reduction), 59
Antithetic random variables, 62
Antithetic schemes, 441
Antithetic schemes (Brownian diffusions), 441
Antithetic schemes (nested Monte Carlo), 444
Asian Call, 75
Asian call-put parity equation, 73
Asian option, 56, 375
a.s. rate (Euler schemes), 283
Asset-or-Nothing Call, 501, 502
Automatic differentiation, 489
Automorphism of the torus, 101
Averaging principle, 194
Averaging principle (stochastic approximation), 253

B
Backward Dynamic Programming Principle (BDPP), 515
Backward Dynamic Programming Principle (BDPP) (functional), 516
Backward Dynamic Programming Principle (BDPP) (pathwise), 515
Baire σ-field, 545
Bally–Talay theorem, 317
Barrier option, 81
Basket option, 54
Bernoulli distribution, 8
Berry–Esseen Theorem, 30
Bertrand's conjecture, 114
Best-of-call, 34, 59, 198
Binomial distribution, 8
Birkhoff's pointwise ergodic theorem, 119
Bismut's formula, 493
Black formula, 545
Black–Scholes formula(s), 544
Black–Scholes formula (call), 34, 544
Black–Scholes formula (put), 544
Black–Scholes (Milstein scheme), 304
Black–Scholes with dividends, 544
Blackwell–Rao method, 80
Box, 99
Box–Muller method, 20
Bridge of a diffusion, 368
Brownian bridge, 365
Brownian bridge method (barrier), 371
Brownian bridge method (lookback), 371
Brownian bridge method (max/min), 371
Brownian motion (simulation), 24
Bump method, 472
Burkhölder–Davis–Gundy (B.D.G.) inequality (continuous time), 324
Burkhölder–Davis–Gundy (B.D.G.) inequality (discrete time), 242

C
Càdlàg, 5
Call option (European), 34
Call-put parity equation, 72
Call-put parity equation (Asian), 73
Cameron–Martin formula, 90
Central Limit Theorem (Lindeberg), 552
Cheng's partial distance search, 224
χ²-distribution, 31
© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0
Cholesky decomposition, 23
Cholesky matrix, 23
Chow Theorem, 216
Chung's CLT, 106
Chung's LIL, 108
CIR process, 307
Clark–Cameron oscillator, 447
Clark–Cameron oscillator (quantization), 160
Co-monotony, 60
Coboundary, 119
Competitive Learning Vector Quantization (CLVQ), 161, 216
Complexity, 51
Componentwise partial order, 99
Compound option, 455
Confidence interval, 32
Confidence interval (multilevel), 423
Confidence interval (theoretical), 30
Confidence level, 30
Continuation function, 526
Control variate (adaptive), 64
Control variate (static), 49
Convergence (weak), 96
Covariance matrix, 22
Cubature formula (quantization), 144
Curse of dimensionality, 534
Curse of dimensionality (discrepancy), 120
Curse of dimensionality (quantization), 143, 163

D
De La Vallée Poussin criterion, 547
Δ-method, 237
Depth (ML2R estimator), 405
Depth (multilevel estimator), 431
Depth (multistep estimator), 398
Deviation inequality (diffusion), 297
Deviation inequality (Euler scheme), 296
Diffusion process (Brownian), 272
Digital call, 502
Digital option, 478
Discrepancy at the origin, 99
Discrepancy (extreme), 99
Discrepancy (star), 99
Discrete time Euler scheme, 273
Distortion function (quadratic), 135
Distribution function, 5
Doob's decomposition, 538
Doob's inequality, 284

E
Effort, 51
Ergodic transform, 119
Essential infimum, 557
Essential supremum, 554
Euler indicator function, 2
Euler–Maruyama schemes, 273
Euler schemes, 273
Euler scheme (continuous), 274
Euler scheme (discrete time), 273
Euler scheme (genuine), 274
Euler scheme (stepwise constant), 274
Exchange spread option, 80
Exponential distribution, 7
Extreme discrepancy, 99

F
Faure sequences, 113
Feynman–Kac formula, 317, 348
Finite difference (constant step), 472
Finite difference (decreasing step), 480
Finite difference methods (greeks), 472
Finite variation function (Hardy and Krause sense), 104
Finite variation function (measure sense), 102
First order weak expansion error, 387
Flow of an SDE, 272, 339
Flow (of the Euler scheme), 339
Forward start option, 488
Fractional Brownian motion (simulation), 25
Functional monotone class theorem, 545

G
Gamma distribution, 15
Garman–Kohlhagen formula, 545
Generalized Minkowski inequality, 323
Geometric Brownian motion, 33
Geometric distribution(s), 8
Geometric mean option, 59
Glivenko–Cantelli's theorem, 96
Greek computation, 472
Gronwall lemma, 283

H
Hölder Inequality (extended), 335
Halton sequences, 109
Halton sequence (super-), 113
Hammersley procedure, 116
Haussmann–Clark–Ocone formula, 497
Hoeffding's inequality, 233, 298
L
Lévy's area (quantized), 159
Lindeberg Central Limit Theorem, 552
Lloyd's algorithm, 156
Lloyd's algorithm (randomized), 161
Lloyd's map, 156
Lloyd's method I, 156
Lloyd's method I (randomized), 161
Local inertia, 82
Local martingale (continuous), 553
Local volatility model, 514
Log-likelihood method, 45
Lookback options, 372
Low discrepancy (sequence with), 109
Lower semi-continuity, 233
Lower Semi-Continuous (L.S.C.), 233
L^p-Wasserstein distance, 145
L^q-discrepancy, 105
Lyapunov function, 360

N
Nearest neighbor projection, 135
Negative binomial distribution(s), 9
Negatively correlated variables, 59
Nested Monte Carlo, 385
Newton–Raphson (zero search), 155
Niederreiter sequences, 115
Non-central χ²(1) distribution (quantization), 158
Normal distribution, 541
Numerical integration (quantization), 163

O
Optimal quantizer, 138
Option (forward start), 488
Option (lookback), 372
Option on a basket, 54
Option on an index, 54
V
Value function, 512

Z
Ziggurat method, 18