
Universitext

Gilles Pagès

Numerical Probability
An Introduction with Applications to Finance

Universitext

Series editors
Sheldon Axler
San Francisco State University

Carles Casacuberta
Universitat de Barcelona

Angus MacIntyre
Queen Mary, University of London

Kenneth Ribet
University of California, Berkeley

Claude Sabbah
École polytechnique, CNRS, Université Paris-Saclay, Palaiseau

Endre Süli
University of Oxford

Wojbor A. Woyczyński
Case Western Reserve University

Universitext is a series of textbooks that presents material from a wide variety of
mathematical disciplines at master's level and beyond. The books, often well
class-tested by their author, may have an informal, personal, even experimental
approach to their subject matter. Some of the most successful and established books
in the series have evolved through several editions, always following the evolution
of teaching curricula, to very polished texts.

Thus as research topics trickle down into graduate-level teaching, first textbooks
written for new, cutting-edge courses may make their way into Universitext.

More information about this series at http://www.springer.com/series/223


Gilles Pagès

Numerical Probability
An Introduction with Applications to Finance

Gilles Pagès
Laboratoire de Probabilités,
Statistique et Modélisation
Sorbonne Université
Paris
France

ISSN 0172-5939    ISSN 2191-6675 (electronic)
Universitext
ISBN 978-3-319-90274-6    ISBN 978-3-319-90276-0 (eBook)
https://doi.org/10.1007/978-3-319-90276-0

Library of Congress Control Number: 2018939129

Mathematics Subject Classification (2010): 65C05, 91G60, 65C30, 68U20, 60H35, 62L15, 60G40,
62L20, 91B25, 91G20

© Springer International Publishing AG, part of Springer Nature 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Julie, Romain and Nicolas
Preface

This book is an extended written version of the Master 2 course "Probabilités
Numériques" (i.e., Numerical Probability or Numerical Methods in Probability),
which has been part of the Master program "Probability and Finance" since 2007.
This Master program has been jointly run in Paris by Université Pierre et Marie
Curie and École Polytechnique since its creation by Nicole El Karoui in 1990.
Our aim is to present different aspects of the Monte Carlo method and its various
avatars (the quasi-Monte Carlo method, stochastic approximation, quantization-based
cubature formulas, etc.), both on a theoretical level – most of the stated
results are rigorously proved – and on a practical level, through numerous
examples arising in the pricing and hedging of derivatives.
This book is divided into two parts, one devoted to exact, hence unbiased,
simulation and the other to approximate, hence biased, simulation. The first part is
mostly devoted to general simulation methods (Chap. 1), basic principles of Monte
Carlo simulation (the confidence interval in Chap. 2), sensitivity computation
(still in Chap. 2), and variance reduction (in Chap. 3). The second part deals
with approximate – and consequently biased – simulation through the presentation
and analysis of discretization schemes for Brownian diffusions (defined as
solutions to Stochastic Differential Equations driven by Brownian motions),
especially the Euler–Maruyama and Milstein schemes (in Chap. 7). The analysis of
their convergence in both the strong sense (L^p) and the weak sense (higher-order
expansions of the bias error in the scale of the discretization step) is presented
in detail, together with the consequences for the Monte Carlo simulation of
"vanilla" payoffs in local volatility models, i.e., of functionals depending only
on the diffusion at a fixed time T. As a second step (Chap. 8), this analysis is
partially extended to path-dependent functionals, relying on the diffusion bridge
(or Brownian bridge), with applications to the pricing of various classes of exotic
options. The computation of sensitivities – known as Greeks in finance – is revisited
in this more general biased setting in Chap. 10 (shock method, tangent flow, the
log-likelihood method for the Euler scheme, Malliavin–Monte Carlo).
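Since the Euler–Maruyama scheme plays such a central role in the second part, here is a minimal illustrative sketch of it (ours, not the book's meta-language; it assumes Python with NumPy) for a scalar SDE dX_t = b(X_t) dt + σ(X_t) dW_t:

```python
import numpy as np

# Illustrative sketch (not from the book): Euler–Maruyama scheme for
# dX_t = b(X_t) dt + sigma(X_t) dW_t on [0, T], with n steps of size
# h = T/n and Gaussian Brownian increments W_{t+h} - W_t ~ N(0, h).
def euler_maruyama(b, sigma, x0, T, n, rng):
    h = T / n
    x = x0
    for _ in range(n):
        dw = np.sqrt(h) * rng.standard_normal()
        x = x + b(x) * h + sigma(x) * dw
    return x

# Black–Scholes-type example: b(x) = r x, sigma(x) = theta x,
# so that E[X_T] = x0 * exp(r T).
r, theta, x0, T = 0.05, 0.2, 100.0, 1.0
rng = np.random.default_rng(0)
terminal = np.array([
    euler_maruyama(lambda x: r * x, lambda x: theta * x, x0, T, 50, rng)
    for _ in range(10_000)
])
print(np.mean(terminal))  # close to x0 * exp(r * T)
```

Averaging the simulated terminal values then yields a crude (biased, because of the time discretization) Monte Carlo estimator of E[f(X_T)] for f = identity; the bias analysis sketched above quantifies the gap.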


Once these biased methods have been designed, it remains to devise a "bias
chase." This is the objective of Chap. 9, entirely devoted to recent developments in
the multilevel paradigm, with or without weights, where we successively consider
"smooth" and "non-smooth" frameworks.
Beyond these questions directly connected to Monte Carlo simulation, three
chapters deal with less classical aspects: Chap. 6 investigates recursive stochastic
approximation "à la Robbins–Monro", viewed as a tool for implicitation or
calibration, but also for risk measure computation (quantiles, VaR). The quasi-Monte
Carlo method, its theoretical foundation and its limitations are developed and
analyzed in Chap. 4, supported by plenty of practical advice for practitioners (not
only for the celebrated Sobol sequences). Finally, we dedicate a chapter (Chap. 5)
to optimal vector quantization-based cubature formulas for the fast computation of
expectations in medium dimensions (say, up to 10), with possible applications to a
universal stratified sampling variance reduction method.
In Chap. 11, we present a nonlinear problem (compared to expectation
computation): optimal stopping theory, mostly in a Markovian framework.
Time discretization schemes of the so-called Snell envelope are briefly presented,
corresponding, in the finance of derivative products, to the approximation of
American-style options by Bermudan options, which can only be exercised at
discrete time instants up to maturity. Due to the nonlinear Backward Dynamic
Programming Principle, such a discretization is not enough to devise implementable
algorithms. A spatial discretization is necessary: to this end, we present a detailed
analysis of regression methods ("à la Longstaff–Schwartz") and of the quantization
tree approach, both developed during the decade 1990–2000.
Each chapter has been written in such a way that, as far as possible, it can be
read autonomously, independently of the former chapters. However, some chapters
remain connected (in particular, Chaps. 7 and 8), and reading Chap. 1, devoted to
basic simulation, or the beginning of Chap. 2, which presents Monte Carlo
simulation and the crucial notion of a confidence interval, remains mandatory to
benefit from the whole book. Within each chapter, we have purposefully scattered
various exercises, in direct connection with their environment, either of a
theoretical or applied nature (simulations).
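The crucial notion just mentioned fits in a few lines. Here is an illustrative snippet of ours (not from the book, assuming Python with NumPy): a crude Monte Carlo estimator of m = E[f(X)] together with its CLT-based 95% confidence interval.

```python
import numpy as np

# Illustrative sketch (not from the book): crude Monte Carlo estimate of
# m = E[exp(Z)], Z ~ N(0, 1), whose exact value is exp(1/2), together with
# the CLT-based 95% confidence interval  m_hat -/+ 1.96 * s_hat / sqrt(M).
rng = np.random.default_rng(42)
M = 100_000                              # number of i.i.d. samples
sample = np.exp(rng.standard_normal(M))

m_hat = sample.mean()                    # Monte Carlo estimator of m
s_hat = sample.std(ddof=1)               # empirical standard deviation
half_width = 1.96 * s_hat / np.sqrt(M)   # 95% confidence half-width

print(f"m_hat = {m_hat:.4f}, 95% CI = [{m_hat - half_width:.4f}, "
      f"{m_hat + half_width:.4f}], exact = {np.exp(0.5):.4f}")
```

The half-width shrinks like M^{-1/2}, which is precisely the rate – and the data-driven error control – discussed at length in Chap. 2.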
We have added a large bibliography (more than 250 articles or books) with two
objectives: first, to refer to seminal, or at least important, papers on a topic
developed in the book and, second, to propose further reading that completes topics
only partially tackled here. For instance, in Chap. 11, most efforts are focused on
numerical methods for the pricing of American options; for the adaptation to the
computation of the δ-hedge or the extension to swing options (such as take-or-pay
contracts), we only provide references in order to keep the size of the book within
a reasonable limit.
The mathematical prerequisites for successfully approaching the reading or the
consultation of this book are, on the one hand, a solid foundation in the theory of
Lebesgue integration (including σ-fields and measurability) and in Probability and
Measure Theory (see among others [44, 52, 154]). The reader is also assumed to be
familiar with discrete- and continuous-time martingale theory, with a focus on
stochastic calculus, at least in a Brownian framework (see [183] for an introduction
or [251], [162, 163] for more advanced textbooks). A basic background on discrete-
and continuous-time Markov processes is also required, but we will not go far
beyond their very definitions in our proofs. As for programming ability, we leave to
the reader the choice of his/her favorite language. Algorithms are always described
in a meta-language. Of course, experts in high-performance computing and paral-
lelized programming (multi-core, CUDA, etc.) are invited to take advantage of their
expertise to convince themselves of the high compatibility of Monte Carlo simu-
lation with these modern technologies.
Of course, we would like to mention that, though most of our examples are
drawn from or inspired by the pricing and hedging of derivative products, the field
of application of the presented methods is far larger. Thus, stratified sampling and
importance sampling are, to the best of our knowledge, not often used by quants,
structurers, and traders but, given their global importance in Monte Carlo
simulation, we decided to present them in detail (with some clues on how to apply
them in the finance of derivative products). By contrast, to limit the size of the
manuscript, we chose not to present more advanced and important notions such as
the discretization and simulation of Stochastic Differential Equations with jumps,
typically driven by Lévy processes. However, we hope that, having assimilated the
Brownian case, our readers will be well prepared to successfully immerse
themselves in the huge literature, from which we selected the most significant
references for further reading.
The book contains more than 150 exercises, some theoretical, others devoted to
simulations and numerical experiments. The exercises are distributed over the
chapters, usually close to the theoretical results needed to solve them. Several of
them are accompanied by hints.
To conclude, I wish to thank those who preceded me in teaching this course at
Université Pierre et Marie Curie: Bruno Bouchard, Benjamin Jourdain, and
Bernard Lapeyre, but also Damien Lamberton and Annie Millet, who taught similar
courses at other universities in Paris. Their approaches inspired me on several
occasions. I would like to thank the students of the Master 2 "Probability and
Finance" course since I began to teach it in the academic year 2007–08, who had
access to the successive versions of the manuscript in its evolutionary phases: many
of their questions, criticisms, and suggestions helped improve it. I would like to
express my special gratitude to one of them, Emilio Saïd, for his enthusiastic
involvement in this task. I am grateful to my colleagues Fabienne Comte, Sylvain
Corlay, Noufel Frikha, Daphné Giorgi, Damien Lamberton, Yating Liu, Sophie
Laruelle, Vincent Lemaire, Harald Luschgy, Thibaut Montes, Fabien Panloup,
Victor Reutenauer, and Abass Sagna, who volunteered to carefully read some
chapters at different stages of drafting. Of course, all remaining errors are mine.
I have also widely benefited from many illustrations and simulations provided by
Daphné Giorgi, Jean-Claude Fort, Vincent Lemaire, Jacques Printems, and
Benedikt Wilbertz. Many of them are also long-term collaborators with whom I
have shared exciting mathematical adventures over so many years. I thank them
for that.

Paris, France
May 2017

Gilles Pagès
Contents

1 Simulation of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 Pseudo-random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Fundamental Principle of Simulation . . . . . . . . . . . . . . . 4
1.3 The (Inverse) Distribution Function Method . . . . . . . . . . . . . 5
1.4 The Acceptance-Rejection Method . . . . . . . . . . . . . . . . . . . . 10
1.5 Simulation of Poisson Distributions
(and Poisson Processes) . . . . . . . . . . . . . . . . . . . . . ....... 18
1.6 Simulation of Gaussian Random Vectors . . . . . . . . . ....... 20
1.6.1 d-dimensional Standard Normal Vectors . . . ....... 20
1.6.2 Correlated d-dimensional Gaussian Vectors,
Gaussian Processes . . . . . . . . . . . . . . . . . . ....... 22
2 The Monte Carlo Method and Applications to Option Pricing ... 27
2.1 The Monte Carlo Method . . . . . . . . . . . . . . . . . . . . . . . . ... 27
2.1.1 Rate(s) of Convergence . . . . . . . . . . . . . . . . . . . ... 28
2.1.2 Data Driven Control of the Error:
Confidence Level and Confidence Interval . . . . . ... 29
2.1.3 Vanilla Option Pricing in a Black–Scholes Model:
The Premium . . . . . . . . . . . . . . . . . . . . . . . . . . ... 33
2.1.4 @ Practitioner’s Corner . . . . . . . . . . . . . . . . . . . ... 37
2.2 Greeks (Sensitivity to the Option Parameters): A First
Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 38
2.2.1 Background on Differentiation of Functions
Defined by an Integral . . . . . . . . . . . . . . . . . . . . ... 38
2.2.2 Working on the Scenarii Space (Black–Scholes
Model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 40
2.2.3 Direct Differentiation on the State Space:
The Log-Likelihood Method . . . . . . . . . . . . . . . ... 45
2.2.4 The Tangent Process Method . . . . . . . . . . . . . . . ... 47


3 Variance Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..... 49


3.1 The Monte Carlo Method Revisited: Static Control
Variate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..... 49
3.1.1 Jensen’s Inequality and Variance Reduction . . ..... 53
3.1.2 Negatively Correlated Variables,
Antithetic Method . . . . . . . . . . . . . . . . . . . . . ..... 59
3.2 Regression-Based Control Variates . . . . . . . . . . . . . . . ..... 64
3.2.1 Optimal Mean Square Control Variates . . . . . . ..... 64
3.2.2 Implementation of the Variance Reduction:
Batch versus Adaptive . . . . . . . . . . . . . . . . . . ..... 66
3.3 Application to Option Pricing: Using Parity Equations
to Produce Control Variates . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Complexity Aspects in the General Case . . . . . . . . . 73
3.3.2 Examples of Numerical Simulations . . . . . . . . . . . . . 74
3.3.3 The Multi-dimensional Case . . . . . . . . . . . . . . . . . . . 79
3.4 Pre-conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.5 Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.6.1 The Abstract Paradigm of Importance Sampling . . . . . 86
3.6.2 How to Design and Implement
Importance Sampling . . . . . . . . . . . . . . . . . . . ..... 88
3.6.3 Parametric Importance Sampling . . . . . . . . . . ..... 89
3.6.4 Computing the Value-At-Risk by Monte Carlo
Simulation: First Approach . . . . . . . . . . . . . . ..... 93
4 The Quasi-Monte Carlo Method . . . . . . . . . . . . . . . . . . . . . ..... 95
4.1 Motivation and Definitions . . . . . . . . . . . . . . . . . . . . . ..... 95
4.2 Application to Numerical Integration: Functions
with Finite Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 Sequences with Low Discrepancy: Definition(s)
and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.3.1 Back Again to the Monte Carlo
Method on [0, 1]^d . . . 106
4.3.2 Roth’s Lower Bounds for the Star Discrepancy . . . . . 108
4.3.3 Examples of Sequences . . . . . . . . . . . . . . . . . . . . . . 109
4.3.4 The Hammersley Procedure . . . . . . . . . . . . . . . . . . . 116
4.3.5 Pros and Cons of Sequences with Low
Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.3.6 @ Practitioner’s Corner . . . . . . . . . . . . . . . . . . . . . . 122
4.4 Randomized QMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4.1 Randomization by Shifting . . . . . . . . . . . . . . . . . . . . 126
4.4.2 Scrambled (Randomized) QMC . . . . . . . . . . . . . . . . 129

4.5 QMC in Unbounded Dimension: The Acceptance-Rejection
Method . . . 130
4.6 Quasi-stochastic Approximation I . . . . . . . . . . . . . . . . . . . . . 132
5 Optimal Quantization Methods I: Cubatures . . . . . . . . . . . . . . . . . 133
5.1 Theoretical Background on Vector Quantization . . . . . . . . . . 134
5.2 Cubature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2.1 Lipschitz Continuous Functions . . . . . . . . . . . . . . . . 145
5.2.2 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.3 Differentiable Functions With Lipschitz Continuous
Gradients (C^1_Lip) . . . 146
5.2.4 Quantization-Based Cubature Formulas for
E(F(X) | Y) . . . 149
5.3 How to Get Optimal Quantization? . . . . . . . . . . . . . . . . . . . . 152
5.3.1 Dimension 1... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3.2 The Case of the Normal Distribution
N(0, I_d) on R^d, d ≥ 2 . . . 161
5.3.3 Other Multivariate Distributions . . . . . . . . . . . . . . . . 162
5.4 Numerical Integration (II): Quantization-Based
Richardson–Romberg Extrapolation . . . . . . . . . . . . . . . . . . . 163
5.5 Hybrid Quantization-Monte Carlo Methods . . . . . . . . . . . . . . 167
5.5.1 Optimal Quantization as a Control Variate . . . . . . . . 167
5.5.2 Universal Stratified Sampling . . . . . . . . . . . . . . . . . . 168
5.5.3 A(n Optimal) Quantization-Based Universal
Stratification: A Minimax Approach . . . . . . . . . . . . . 170
6 Stochastic Approximation with Applications to Finance . . . . . . . . 175
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.2 Typical a.s. Convergence Results . . . 179
6.3 Applications to Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.3.1 Application to Recursive Variance Reduction
by Importance Sampling . . . . . . . . . . . . . . . . . . . . . 188
6.3.2 Application to Implicit Correlation Search . . . . . . . . 198
6.3.3 The Paradigm of Model Calibration
by Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.3.4 Recursive Computation of the VaR
and the CVaR (I) . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.3.5 Stochastic Optimization Methods
for Optimal Quantization . . . . . . . . . . . . . . . . . . . . . 216
6.4 Further Results on Stochastic Approximation . . . . . . . . . . . . . 225
6.4.1 The Ordinary Differential Equation (ODE)
Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.4.2 L^2-Rate of Convergence and Application
to Convex Optimization . . . 235

6.4.3 Weak Rate of Convergence: CLT . . . . . . . . . . . . . . . 238


6.4.4 The Averaging Principle for Stochastic
Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.4.5 Traps (A Few Words About) . . . 259
6.4.6 (Back to) VaR_α and CVaR_α Computation (II):
Weak Rate . . . 260
6.4.7 VaR_α and CVaR_α Computation (III) . . . 261
6.5 From Quasi-Monte Carlo to Quasi-Stochastic
Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
7 Discretization Scheme(s) of a Brownian Diffusion . . . . . . . . . . . . . 271
7.1 Euler–Maruyama Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.1.1 The Discrete Time and Stepwise Constant
Euler Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.1.2 The Genuine (Continuous) Euler Scheme . . . . . . . . . 274
7.2 The Strong Error Rate and Polynomial Moments (I) . . . . . . . 275
7.2.1 Main Results and Comments . . . . . . . . . . . . . . . . . . 275
7.2.2 Uniform Convergence Rate in L^p(P) . . . 276
7.2.3 Proofs in the Quadratic Lipschitz Case for
Autonomous Diffusions . . . . . . . . . . . . . . . . . . . . . . 283
7.3 Non-asymptotic Deviation Inequalities for the Euler
Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.4 Pricing Path-Dependent Options (I)
(Lookback, Asian, etc) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
7.5 The Milstein Scheme (Looking for Better Strong Rates...) . . . 301
7.5.1 The One Dimensional Setting . . . . . . . . . . . . . . . . . . 301
7.5.2 Higher-Dimensional Milstein Scheme . . . . . . . . . . . . 307
7.6 Weak Error for the Discrete Time Euler Scheme (I) . . . . . . . . 310
7.6.1 Main Results for E f(X_T): the Talay–Tubaro
and Bally–Talay Theorems . . . 312
7.7 Bias Reduction by Richardson–Romberg Extrapolation
(First Approach) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
7.7.1 Richardson–Romberg Extrapolation
with Consistent Brownian Increments . . . . . . . . . . . . 319
7.8 Further Proofs and Results . . . . . . . . . . . . . . . . . . . . . . . . . . 323
7.8.1 Some Useful Inequalities . . . . . . . . . . . . . . . . . . . . . 323
7.8.2 Polynomial Moments (II) . . . . . . . . . . . . . . . . . . . . . 325
7.8.3 L^p-Pathwise Regularity . . . 329
7.8.4 L^p-Convergence Rate (II): Proof of Theorem 7.2 . . . 331
7.8.5 The Stepwise Constant Euler Scheme . . . 335
7.8.6 Application to the a.s. Convergence
of the Euler Schemes and its Rate . . . 336

7.8.7 The Flow of an SDE, Lipschitz Continuous
Regularity . . . 339
7.8.8 The Strong Error Rate for the Milstein Scheme:
Proof of Theorem 7.5 . . . . . . . . . . . . . . . . . . . . . . . 340
7.8.9 The Feynman–Kac Formula and Application to the
Weak Error Expansion by the PDE Method . . . . . . . 346
7.9 The Non-globally Lipschitz Case (A Few Words On) . . . . . . 359
8 The Diffusion Bridge Method: Application to Path-Dependent
Options (II) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
8.1 Theoretical Results About Time Discretization
of Path-Dependent Functionals . . . . . . . . . . . . . . . . . . . . . . . 363
8.2 From Brownian to Diffusion Bridge: How to Simulate
Functionals of the Genuine Euler Scheme . . . . . . . . . . . . . . . 365
8.2.1 The Brownian Bridge Method . . . . . . . . . . . . . . . . . 365
8.2.2 The Diffusion Bridge (Bridge of the Genuine
Euler Scheme) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
8.2.3 Application to Lookback Style Path-Dependent
Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
8.2.4 Application to Regular Barrier Options:
Variance Reduction by Pre-conditioning . . . . . . . . . . 374
8.2.5 Asian Style Options . . . . . . . . . . . . . . . . . . . . . . . . . 375
9 Biased Monte Carlo Simulation, Multilevel Paradigm . . . . . . . . . . 381
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
9.2 An Abstract Framework for Biased Monte Carlo
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
9.3 Crude Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . 387
9.4 Richardson–Romberg Extrapolation (II) . . . . . . . . . . . . . . . . . 390
9.4.1 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . 390
9.4.2 @ Practitioner’s Corner . . . . . . . . . . . . . . . . . . . . . . 393
9.4.3 Going Further in Killing the Bias:
The Multistep Approach . . . . . . . . . . . . . . . . . . . . . 395
9.5 The Multilevel Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
9.5.1 Weighted Multilevel Setting . . . . . . . . . . . . . . . . . . . 403
9.5.2 Regular Multilevel Estimator (Under First
Order Weak Error Expansion) . . . . . . . . . . . . . . . . . 430
9.5.3 Additional Comments and Provisional Remarks . . . . 440
9.6 Antithetic Schemes (a Quest for β > 1) . . . 441
9.6.1 The Antithetic Scheme for Brownian
Diffusions: Definition and Results . . . . . . . . . . . . . . 441
9.6.2 Antithetic Scheme for Nested Monte Carlo
(Smooth Case) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444

9.7 Examples of Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447


9.7.1 The Clark–Cameron System . . . . . . . . . . . . . . . . . . . 447
9.7.2 Option Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
9.7.3 Nested Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . 455
9.7.4 Multilevel Monte Carlo Research Worldwide . . . . . . 460
9.8 Randomized Multilevel Monte Carlo
(Unbiased Simulation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
9.8.1 General Paradigm of Unbiased Simulation . . . . . . . . 461
9.8.2 Connection with Former Multilevel Frameworks . . . . 466
9.8.3 Numerical Illustration . . . . . . . . . . . . . . . . . . . . . . . 467
10 Back to Sensitivity Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 471
10.1 Finite Difference Method(s) . . . . . . . . . . . . . . . . . . . . . . . . . 472
10.1.1 The Constant Step Approach . . . . . . . . . . . . . . . . . . 472
10.1.2 A Recursive Approach: Finite Difference
with Decreasing Step . . . . . . . . . . . . . . . . . . . . . . . . 480
10.2 Pathwise Differentiation Method . . . . . . . . . . . . . . . . . . . . . . 484
10.2.1 (Temporary) Abstract Point of View . . . . . . . . . . . . . 484
10.2.2 The Tangent Process of a Diffusion
and Application to Sensitivity Computation . . . . . . . 485
10.3 Sensitivity Computation for Non-smooth Payoffs:
The Log-Likelihood Approach (II) . . . . . . . . . . . . . . . . . . . . 490
10.3.1 A General Abstract Result . . . 490
10.3.2 The Log-Likelihood Method for the Discrete
Time Euler Scheme . . . 491
10.4 Flavors of Stochastic Variational Calculus . . . . . . . . . . . . . . . 493
10.4.1 Bismut’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . 493
10.4.2 The Haussmann–Clark–Ocone Formula:
Toward Malliavin Calculus . . . 497
10.4.3 Toward Practical Implementation:
The Paradigm of Localization . . . . . . . . . . . . . . . . . 500
10.4.4 Numerical Illustration: What is Localization
Useful for? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
11 Optimal Stopping, Multi-asset American/Bermudan Options . . . . . 509
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
11.1.1 Optimal Stopping in a Brownian Diffusion
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
11.1.2 Interpretation in Terms of American Options
(Sketch) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
11.2 Optimal Stopping for Discrete Time R^d-Valued Markov
Chains . . . 515

11.2.1 General Theory, the Backward Dynamic
Programming Principle . . . 515
11.2.2 Time Discretization for Snell Envelopes Based
on a Diffusion Dynamics . . . . . . . . . . . . . . . . . . . . . 518
11.3 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
11.3.1 The Regression Methods . . . . . . . . . . . . . . . . . . . . . 522
11.3.2 Quantization Methods II: Non-linear Problems
(Quantization Tree) . . . . . . . . . . . . . . . . . . . . . . . . . 528
11.4 Dual Form of the Snell Envelope (Discrete Time) . . . . . . . . . 537
12 Miscellany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
12.1 More on the Normal Distribution . . . . . . . . . . . . . . . . . . . . . 541
12.1.1 Characteristic Function . . . . . . . . . . . . . . . . . . . . . . 541
12.1.2 Numerical Approximation of the Cumulative
Distribution Function Φ_0 . . . 542
12.1.3 Table of the Distribution Function
of the Normal Distribution . . . . . . . . . . . . . . . . . . . . 542
12.2 Black–Scholes Formula(s) (To Compute Reference Prices) . . . 544
12.3 Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
12.4 Uniform Integrability as a Domination Property . . . . . . . . . . . 545
12.5 Interchanging... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
12.6 Weak Convergence of Probability Measures
on a Polish Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
12.7 Martingale Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
12.8 Itô Formula for Itô Processes . . . . . . . . . . . . . . . . . . . . . . . . 553
12.8.1 Itô Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
12.8.2 The Itô Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
12.9 Essential Supremum (and Infimum) . . . . . . . . . . . . . . . . . . . . 554
12.10 Halton Sequence Discrepancy
(Proof of an Upper-Bound) . . . . . . . . . . . . . . . . . . . . . . . . . . 557
12.11 A Pitman–Yor Identity as a Benchmark . . . . . . . . . . . . . . . . 560

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
Notation

⊳ General Notation

• N = {0, 1, 2, …} and N* = {1, 2, …} = N \ {0}. R denotes the set of real
numbers, R_+ denotes the set of nonnegative real numbers, etc.
• The cardinality of a set A is denoted by |A| or card(A) (in case of possible
ambiguity with a norm).
• ⌊x⌋ denotes the integer part of the real number x, i.e., the largest integer not
greater than x. {x} = x − ⌊x⌋ denotes the fractional part of the real number x.
x^+ = max(x, 0).
• The base of imaginary numbers in C is denoted by ~ı (with ~ı2 ¼ 1).
• The notation u 2 Kd denotes the column vector u of the vector space Kd ,
K ¼ R or C. A row vector will be denoted by u or ut .
P
• ðujvÞ ¼ u v ¼ 1  i  d ui vi denotes the canonical inner product of vectors u ¼
pffiffiffiffiffiffiffiffiffiffi
ðu1 ; . . .; ud Þ and v ¼ ðv1 ; . . .; vd Þ of Rd . juj ¼ ðujuÞ denotes the canonical
Euclidean norm on Rd derived from this inner product (unless stated otherwise).
Sometimes, it will be specified as j  jd .
• A or At denotes the transpose of a matrix A, depending on the notational
environment.
• Sðd; RÞ denotes the set of symmetric d d matrices and S þ ðd; RÞ the set of
positive semidefinite (symmetric) matrices. For A 2 Sðd; RÞ and u 2 Rd , we
write Au 2 ¼ u Au.
• Mðd; q; KÞ denotes a vector space of matrices with d rows and q columns with
K-valued entries. For A 2 Mðd; q; KÞ, the Fröbenius norm of A, denoted by
pffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P
k Ak, is defined by k Ak ¼ AA  ¼ jaij j2 and the operator norm is
i;j
denoted by jk Akj.
 
• Cb ðS; Rd Þ :¼ f : ðS; dÞ ! Rd ; continuous and bounded , where ðS; dÞ denotes
a metric space.

xix
xx Notation

• Dð½0; T; Rd Þ ¼ ff : ½0; T ! Rd ; càdlàg} (càdlàg: a commonly used French


acronym for “continu à droite, limité à gauche” which means “right continuous
left limited”).
jf ðxÞf ðyÞjp
• For a function f : ðRd ; j  jd Þ ! ðRp ; j  jp Þ, ½f Lip ¼ sup jxyjd and f is
x6¼y
Lipschitz continuous with coefficient ½f Lip if ½f Lip \ þ 1.
• An assertion PðxÞ depending on a generic element x of a measure space
ðE; e; lÞ is true l-almost everywhere (denoted l-a:e:) if it is true outside a
l-negligible set of E.
• For a function f : X ! ðRd ; j  jÞ, k f ksup ¼ sup jf ðxÞj denotes the sup-norm.
x2X
• C n ðX; FÞ denotes the set of n times continuously differentiable functions f :
X ! F (X E, open subset of a Banach space E, and F also a Banach space).
When E ¼ R, X may be any non-trivial interval of R with the usual conventions
at the endpoints if contained in X.
• For any sequence ðan Þn  0 having values in vector space (or any additive
structure), we define Dan ¼ an  an1 , n  1.
• lim an and lim an will denote the limsup and the liminf of the sequence ðan Þn
n n
respectively.
an
• For ðan Þn  1 and ðbn Þn  1 two sequences of real numbers, an † bn if lim  1.
n bn

⊳ Probability (and Integration) Theory Notation

• λd denotes the Lebesgue measure on (R^d , Bor(R^d )), where Bor(R^d ) denotes the
σ-field of Borel subsets of R^d .
• Let p ∈ (0, +∞). 𝓛^p_{R^d}(Ω, A, μ) = { f : (Ω, A) → R^d s.t. ∫_Ω | f |^p dμ < +∞},
where μ is a nonnegative measure (changing the canonical Euclidean norm | · | on
R^d for another norm has no impact on this space but does have an impact on the
value of the integral itself). L^p_{R^d}(Ω, A, μ) denotes the set of equivalence classes
of 𝓛^p_{R^d}(Ω, A, μ) with respect to the μ-a.e. equality. We write
‖ f ‖_p = (∫_Ω | f |^p dμ)^{1/p} .
• P_X or P ◦ X^{−1} (or L(X )) denotes the distribution (or law) on (R^d , Bor(R^d )) of a
random vector X : (Ω, A, P) → R^d , but also of an R^d -valued stochastic process
X = (X_t )_{t∈[0,T ]} .
• Let Ω be a set. If C ⊂ P(Ω), σ(C) denotes the σ-field spanned by C or,
equivalently, the smallest σ-field on Ω which contains C.
• The cumulative distribution function (cdf) of a random variable X , usually
denoted F_X or Φ_X , is defined for every x ∈ R by F_X (x) = P(X ≤ x).
• χ_X will denote the characteristic function of an R^d -valued random vector X . It is
defined by χ_X (u) = E[e^{ı̃(u | X )} ] (where ı̃^2 = −1).
• X =ᵈ Y stands for the equality in distribution of the random vectors (or processes)
X and Y .
• μ_n ⟹^{(S)} μ denotes the weak convergence of μ_n toward μ, where μ_n and μ are
probability measures on the metric space (S, d) equipped with its Borel σ-field
Bor(S).
• ⟶^L denotes the convergence in distribution of random vectors (i.e., the weak
convergence of their distributions).
• The acronym i.i.d. stands for independent identically distributed.
• N(m, σ^2 ) denotes the Gaussian distribution with mean m and variance σ^2 on the
real line. Φ₀ always denotes the cdf of the normal distribution N(0, 1). In higher
dimensions, N(m, Σ) denotes the distribution of the Gaussian vector with mean m
and covariance matrix Σ.
Chapter 1
Simulation of Random Variables

1.1 Pseudo-random Numbers

From a mathematical point of view, the definition of a sequence of (uniformly dis-


tributed) random numbers (over the unit interval [0, 1]) should be:

Definition. A sequence (xn )n≥1 of [0, 1]-valued real numbers is a sequence of
random numbers if there exists a probability space (Ω, A, P), a sequence Un , n ≥ 1,
of i.i.d. random variables with uniform distribution U ([0, 1]) and ω ∈ Ω such
that xn = Un (ω) for every n ≥ 1.

However, this naive and abstract definition is not satisfactory because the “sce-
nario” ω ∈ Ω may not be a “good” one, i.e. not a “generic” one. Many probabilistic
properties (like the law of large numbers, to quote the most basic one) are only sat-
isfied P-a.s. Thus, if ω lies in a negligible set on which one of them fails, the
induced sequence will not be “admissible”.
In any case, one usually cannot have access to an i.i.d. sequence of random
variables (Un ) with distribution U ([0, 1])! Any physical device producing such a
sequence would be too slow and unreliable. The works of logicians like Martin-
Löf led to the conclusion that, roughly speaking, a sequence (xn ) that can be gen-
erated by an algorithm cannot be considered as a “random” one. Thus the digits of
the real number π are not random in that sense. This is quite embarrassing since an
essential required feature for such sequences is to be generated almost instantly on
a computer!
The approach coming from computer and algorithmic sciences is not much
more tractable, since their definition of a sequence of random numbers is that the com-
plexity of the algorithm generating the first n terms behaves like O(n). The rapidly
growing need for good (pseudo-)random sequences that came with the explosion of
Monte Carlo simulation in many fields of Science and Technology after World War II
(we refer not only to neutronics) led to the adoption of a more pragmatic approach –
say heuristic – based on statistical tests. The idea is to submit some sequences to
statistical tests (uniform distribution, block non-correlation, rank tests, etc.).
© Springer International Publishing AG, part of Springer Nature 2018 1
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_1

For practical implementation, such sequences are finite, as is the accuracy of
computers. One considers some sequences (xn ) of so-called pseudo-random numbers
defined by

     xn = yn /N ,   yn ∈ {0, . . . , N − 1}.
One classical process is to generate the integers yn by a congruential induction:

yn+1 ≡ ayn + b mod N

where gcd(a, N ) = 1, so that ā (class of a modulo N ) is invertible with respect to


the multiplication (modulo N ). Let (Z/N Z)∗ denote the set of such invertible classes
(modulo
 N ). The multiplication
 of classes (modulo N ) is an internal law on (Z/N Z)∗
and (Z/N Z)∗ , × is a commutative group for this operation. By the very definition
of the Euler indicator function ϕ(N ) as the number of integers a in {0, . . . , N − 1}
such that gcd(a, N ) = 1, the cardinality of (Z/N Z)∗ is equal to ϕ(N ). Let us recall
that the Euler function is multiplicative and given by the following closed formula

     ϕ(N ) = N ∏_{ p | N, p prime} (1 − 1/ p).

Thus ϕ( p) = p − 1 for every prime integer p ∈ N and ϕ( p^k ) = p^k − p^{k−1} (pri-
mary numbers), etc. In particular, if N = p is prime, then (Z/N Z)∗ = (Z/N Z) \ {0̄}.
If b = 0 (the most common case), one speaks of a homogeneous generator. We will
focus on this type of generator in what follows.
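As an aside, the congruential recursion above is only a few lines of code. Here is a minimal Python sketch; the parameters a = 7^5, b = 0, N = 2^31 − 1 are the classical “minimal standard” choice of Lewis, Goodman and Miller, used here purely for illustration, not as a recommendation:

```python
def lcg(seed, a=7**5, b=0, N=2**31 - 1):
    """Congruential generator y_{n+1} = a*y_n + b (mod N), with x_n = y_n / N."""
    y = seed % N
    while True:
        y = (a * y + b) % N
        yield y / N          # pseudo-random number x_n in [0, 1)

gen = lcg(seed=1)
xs = [next(gen) for _ in range(5)]   # deterministic: same seed, same stream
```

With seed 1 the first output is simply 16807/N, which makes the (entirely deterministic) nature of such “random” numbers rather obvious.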
 Homogeneous congruential generators. When b = 0, the period of the sequence (yn )
is given by the multiplicative order of a in ((Z/N Z)∗ , ×), i.e.

     τa := min{k ≥ 1 : a^k ≡ 1 mod N } = min{k ≥ 1 : ā^k = 1̄}.

Moreover, we know by Lagrange’s Theorem that τa divides the cardinality ϕ(N )


of (Z/N Z)∗ .
For pseudo-random number simulation purposes, we search for pairs (N , a) such
that the period τa of a in ((Z/N Z)∗ , ×) is very large. This needs an in-depth study
of the multiplicative groups ((Z/N Z)∗ , ×), bearing in mind that N itself should be
large to allow the element a to have a large period. This suggests focusing on prime
or primary numbers since, as seen above, in these cases ϕ(N ) is itself large.
In fact, the structure of these groups has long been understood. We summarize
these results below.

Theorem 1.1 Let N = p^α , p prime, α ∈ N∗ , be a primary number.
(a) If α = 1 (i.e. N = p prime), then ((Z/N Z)∗ , ×) (whose cardinality is p − 1) is a
cyclic group. This means that there exists an a ∈ {1, . . . , p − 1} s.t. (Z/ pZ)∗ = ⟨ā⟩.
Hence the maximal period is τ = p − 1.
(b) If p = 2 and α ≥ 3, then (Z/N Z)∗ (whose cardinality is 2^{α−1} = N /2) is not cyclic. The
maximal period is then τ = 2^{α−2}, with a ≡ ±3 mod 8. (If α = 2 (N = 4), the group,
of size 2, is trivially cyclic!)
(c) If p ≠ 2, then (Z/N Z)∗ (whose cardinality is p^{α−1}( p − 1)) is cyclic, hence
τ = p^{α−1}( p − 1). It is generated by any element a whose class ā in (Z/ pZ) spans
the cyclic group ((Z/ pZ)∗ , ×).

What does this theorem say in connection with our pseudo-random number gen-
eration problem?
First, some very good news: when N is a prime number, the group ((Z/N Z)∗ , ×)
is cyclic, i.e. there exists an a ∈ {1, . . . , N − 1} such that (Z/N Z)∗ = {ā^n , 1 ≤ n ≤
N − 1}. The bad news is that we do not know which a satisfies this property (not all do)
and, even worse, we do not know how to find one. Thus, if p = 7, ϕ(7) = 7 − 1 = 6
and o(3) = o(5) = 6, but o(2) = o(4) = 3 (which divides 6) and o(6) = 2 (which
again divides 6).
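The little computation for p = 7 is easy to check mechanically; a throwaway sketch computing the multiplicative order o(a) of each invertible class modulo 7:

```python
def mult_order(a, N):
    """Multiplicative order of a modulo N (assumes gcd(a, N) = 1)."""
    k, x = 1, a % N
    while x != 1:
        x = (x * a) % N
        k += 1
    return k

orders = {a: mult_order(a, 7) for a in range(1, 7)}
# orders == {1: 1, 2: 3, 3: 6, 4: 3, 5: 6, 6: 2}: only 3 and 5 generate (Z/7Z)*
```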
The second piece of bad news is that a long period, though a necessary property
of a sequence (yn )n , provides no guarantee or even clue that (xn )n is a good sequence
of pseudo-random numbers! Thus, the (homogeneous) generator of the FORTRAN
IMSL library does not fit into the formerly described setting: one sets N := 2^31 − 1 =
2 147 483 647 (a Mersenne prime, see below, discovered by Leonhard Euler
in 1772), a := 7^5 and b := 0, the point being that the period of 7^5 is
not maximal.
Another approach to random number generation is based on a shift register and
relies upon the theory of finite fields.
At this stage, a sequence must pass successfully through various statistical tests,
keeping in mind that such a sequence is finite by construction and consequently
cannot satisfy asymptotically such common properties as the Law of Large Numbers,
the Central Limit Theorem or the Law of the Iterated Logarithm (see the next chapter).
Dedicated statistical toolboxes like DIEHARD (Marsaglia, 1998) have been devised
to test and “certify” sequences of pseudo-random numbers.
The aim of this introductory section is just to give the reader the flavor of pseudo-
random number generation, but in no case do we recommend the specific use of any
of the above generators or discuss the virtues of this or that generator.
For more recent developments on random number generators (like shift registers,
for example), we refer, e.g., to [77, 219]. Nevertheless, let us mention the Mersenne
twister generators. This family of generators was introduced in 1997 by Makoto
Matsumoto and Takuji Nishimura in [211]. The first level of Mersenne Twister Gener-
ators (denoted by MT- p) are congruential generators whose period N_p is a Mersenne
prime, i.e. a prime number of the form N_p = 2^p − 1 where p is itself prime. The
most popular and now worldwide implemented generator is the MT-19937, owing
to its unrivaled period 2^19937 − 1 ≃ 10^6000 (since 2^10 ≃ 10^3 ). It can simulate a uni-
form distribution in 623 dimensions (i.e. on [0, 1]^623 ). A second “shuffling” device
improves it further. For recent improvements and their implementations in C++,
see
www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html

Recently, new developments in massively parallel computing have drawn the


attention back to pseudo-random number simulation. In particular, consider the GPU-
based intensive computations which use the graphics device of a computer as a com-
puting unit which may run hundreds of parallel computations. One can imagine that
each pixel is a (virtual) small computing unit achieving a small chain of elemen-
tary computations (a thread). What is really new is that access to such intensive
parallel computing becomes cheap, although it requires some specific programming
language (like CUDA on Nvidia GPU). As concerns its use for intensively parallel
Monte Carlo simulation, some new questions arise: in particular, the ability to gen-
erate independently (in parallel!) many “independent” sequences of pseudo-random
numbers, since the computing units of a GPU never “speak” to each other or to any-
body else while running: each pixel is a separate (virtual) thread. The Wichmann–Hill
pseudo-random number generator (which is in fact a family of 273 different meth-
ods) seems to be a good candidate for Monte Carlo simulation on a GPU. For more
insight on this topic, we refer to [268] and the references therein.
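As a concrete point of reference, the MT-19937 generator discussed above is nowadays the default engine of many scientific environments; Python’s standard random module, for instance, is built on it, and a seeded instance is fully reproducible:

```python
import random

rng1 = random.Random(2018)   # Python's Random class is based on MT-19937
rng2 = random.Random(2018)
xs = [rng1.random() for _ in range(5)]
ys = [rng2.random() for _ in range(5)]
assert xs == ys                           # same seed, same stream
assert all(0.0 <= x < 1.0 for x in xs)    # uniform pseudo-random numbers in [0, 1)
```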

1.2 The Fundamental Principle of Simulation

Theorem 1.2 (Fundamental theorem of simulation) Let (E, d_E ) be a Polish
space (complete and separable metric space) and let X : (Ω, A, P) → (E, Bor_{d_E}(E)) be a ran-
dom variable with distribution PX . Then there exists a Borel function ϕ : ([0, 1],
B([0, 1]), λ[0,1] ) → (E, Bor_{d_E}(E), PX ) such that

     PX = λ[0,1] ◦ ϕ^{−1} ,

where λ[0,1] ◦ ϕ^{−1} denotes the image measure by ϕ of the Lebesgue measure λ[0,1]
on the unit interval.

We will admit this theorem. For a proof, we refer to [48] (Theorem A.3.1, p. 38).
It also appears as a “building block” in the proof of the Skorokhod representation
theorem for weakly converging sequences of random variables having values in a
Polish space.
As a consequence this means that, if U denotes any random variable with uniform
distribution on (0, 1) defined on a probability space (Ω, A, P), then

     X =ᵈ ϕ(U ).

The interpretation is that any E-valued random variable can be simulated from
a uniform distribution…provided the function ϕ can be computed (at a reasonable
computational cost, i.e. in algorithmic terms with reasonable complexity). If this is the
case, the yield of the simulation is 1 since every (pseudo-)random number u ∈ [0, 1]
produces a PX -distributed random number. Except in very special situations (see
below), this result turns out to be of purely theoretical interest and is of little help for

practical simulation. However, the fundamental theorem of simulation is important
from a theoretical point of view in Probability Theory since, as mentioned above, it
provides a fundamental step in the proof of the Skorokhod representation theorem
(see e.g. [45], Chap. 5).
In the three sections below we provide a short background on the most classical
simulation methods (inversion of the distribution function, the acceptance-rejection
method, Box–Muller for Gaussian vectors). This is, of course, far from being exhaus-
tive. For an overview of the different aspects of simulation of non-uniform random
variables or vectors, we refer to [77]. In fact, many results from Probability Theory
can give rise to simulation methods.

1.3 The (Inverse) Distribution Function Method

Let μ be a probability distribution on (R, Bor (R)) with distribution function F
defined for every x ∈ R by

     F(x) := μ((−∞, x]).

The function F is always non-decreasing, “càdlàg” (a French acronym for “right con-
tinuous with left limits”, which we shall adopt in what follows), and lim_{x→+∞} F(x) =
1, lim_{x→−∞} F(x) = 0.
One can always associate to F its canonical left inverse Fl−1 , defined on the open
unit interval (0, 1) by

     ∀ u ∈ (0, 1),  Fl−1 (u) = inf{x ∈ R : F(x) ≥ u}.

One easily checks that Fl−1 is non-decreasing, left-continuous and satisfies

∀ u ∈ (0, 1), Fl−1 (u) ≤ x ⇐⇒ u ≤ F(x).

Proposition 1.1 If U =ᵈ U ((0, 1)), then X := Fl−1 (U ) =ᵈ μ.

Proof. Let x ∈ R. It follows from the preceding that

     {X ≤ x} = {Fl−1 (U ) ≤ x} = {U ≤ F(x)},

so that P(X ≤ x) = P(Fl−1 (U ) ≤ x) = P(U ≤ F(x)) = F(x). ♦

Remarks. • If F is increasing and continuous on the real line, then F has an inverse
function denoted by F −1 , defined on (0, 1) (also increasing and continuous), such that
F ◦ F −1 = Id|(0,1) and F −1 ◦ F = IdR . Clearly F −1 = Fl−1 by the very definition
of Fl−1 . The above proof can be made even more straightforward since the event
{F −1 (U ) ≤ x} = {U ≤ F(x)} by simple left composition with the increasing
function F.

• If μ has a probability density f such that { f = 0} has an empty interior, then
F(x) = ∫_{−∞}^x f (u) du is continuous and increasing.
• One can replace R by any interval [a, b] ⊂ R or R (with obvious conventions).
• One could also have considered the right continuous canonical inverse Fr−1 defined by

     ∀ u ∈ (0, 1),  Fr−1 (u) = inf{x : F(x) > u}.

One shows that Fr−1 is non-decreasing, right continuous and that

Fr−1 (u) ≤ x =⇒ u ≤ F(x) and u < F(x) =⇒ Fr−1 (u) ≤ x.

Hence Fr−1 (U ) =ᵈ X since

     P(Fr−1 (U ) ≤ x) ≤ P(U ≤ F(x)) = F(x)   and   P(Fr−1 (U ) ≤ x) ≥ P(U < F(x)) = F(x),

so that P(Fr−1 (U ) ≤ x) = F(x) = P(X ≤ x).

When X takes finitely many values in R, we will see in Example 4 below that this
simulation method corresponds to the standard simulation method of such random
variables.
 Exercise. (a) Show that, for every u ∈ (0, 1),

     F(Fl−1 (u)−) ≤ u ≤ F ◦ Fl−1 (u),

so that if F is continuous (or, equivalently, μ has no atom: μ({x}) = 0 for every x ∈ R),
then F ◦ Fl−1 = Id(0,1) .
(b) Show that if F is continuous, then F(X ) =ᵈ U ([0, 1]).
(c) Show that if F is (strictly) increasing, then Fl−1 is continuous and Fl−1 ◦ F = IdR .
(d) One defines the survival function of μ by F̄(x) = 1 − F(x) = μ((x, +∞)),
x ∈ R. One defines the canonical right inverse of F̄ by

     ∀ u ∈ (0, 1),  F̄r−1 (u) = inf{x : F̄(x) ≤ u}.

Show that F̄r−1 (u) = Fl−1 (1 − u). Deduce that F̄r−1 is right continuous on (0, 1)
and that F̄r−1 (U ) has distribution μ. Define F̄l−1 and show that F̄l−1 (U ) has distribu-
tion μ. Finally, establish for F̄r−1 properties similar to (a)–(b)–(c).

(Informal) Definition. The yield (often denoted by r ) of a simulation procedure is


defined as the inverse of the number of pseudo-random numbers used to generate
one PX -distributed random number.
One must keep in mind that the yield is attached to a simulation method, not to
a probability distribution (the fundamental theorem of simulation always provides a
simulation method with yield 1, except that it is usually not tractable).

Typically, if X = ϕ(U1 , . . . , Um ), where ϕ is a Borel function defined on [0, 1]^m
and U1 , . . . , Um are independent and uniformly distributed over [0, 1], the yield of
this ϕ-based procedure to simulate the distribution of X is r = 1/m.
The yield of the (inverse) distribution function method is consequently always
equal to r = 1.
d
 Examples. 1. Simulation of an exponential distribution. Let X =ᵈ E(λ), λ > 0.
Then
     ∀ x ∈ (0, +∞),  FX (x) = ∫_0^x λ e^{−λξ} dξ = 1 − e^{−λx} .

Consequently, for every u ∈ (0, 1), FX^{−1} (u) = − log(1 − u)/λ. Now, using that
U =ᵈ 1 − U if U =ᵈ U ((0, 1)) yields

     X = − log(U )/λ =ᵈ E(λ).
2. Simulation of a Cauchy(c), c > 0, distribution. We know that PX (d x) =
(c/(π(x^2 + c^2 ))) d x, so that

     ∀ x ∈ R,  FX (x) = ∫_{−∞}^x c/(π(u^2 + c^2 )) du = (1/π) Arctan(x/c) + 1/2.

Hence FX^{−1} (u) = c tan(π(u − 1/2)). It follows that

     X = c tan(π(U − 1/2)) =ᵈ Cauchy(c).

3. Simulation of a Pareto(θ), θ > 0, distribution. This distribution reads
PX (d x) = (θ/x^{1+θ} ) 1{x≥1} d x. Its distribution function is FX (x) = 1 − x^{−θ} , x ≥ 1,
so that, still using U =ᵈ 1 − U if U =ᵈ U ((0, 1)),

     X = U^{−1/θ} =ᵈ Pareto(θ).
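The three closed-form inversions of Examples 1–3 translate directly into code; a minimal sketch (λ, c, θ denote the parameters of the respective distributions):

```python
import math
import random

def exp_inv(u, lam):        # E(lam): F^{-1}(u) = -log(1 - u)/lam
    return -math.log(1.0 - u) / lam

def cauchy_inv(u, c):       # Cauchy(c): F^{-1}(u) = c*tan(pi*(u - 1/2))
    return c * math.tan(math.pi * (u - 0.5))

def pareto_inv(u, theta):   # Pareto(theta): F^{-1}(u) = (1 - u)^{-1/theta}
    return (1.0 - u) ** (-1.0 / theta)

rng = random.Random(0)
sample = [exp_inv(rng.random(), lam=2.0) for _ in range(10_000)]
# the empirical mean of `sample` is close to 1/lam = 0.5
```

Each call consumes exactly one uniform number, in line with the yield r = 1 of the method.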

4. Simulation of a distribution supported by a finite set E. Let E := {x1 , . . . , x N }
be a subset of R indexed so that i → xi is increasing. Let X : (Ω, A, P) → E be
an E-valued random variable with distribution P(X = xk ) = pk , 1 ≤ k ≤ N , where
pk ∈ [0, 1], p1 + · · · + p N = 1. Then, one checks that its distribution function FX
reads
     ∀ x ∈ R,  FX (x) = p1 + · · · + pi  if xi ≤ x < xi+1

with the convention x0 = −∞ and x N +1 = +∞. As a consequence, its left continuous
canonical inverse is given by

     ∀ u ∈ (0, 1),  F_{X,l}^{−1} (u) = Σ_{k=1}^N xk 1{ p1 +···+ pk−1 < u ≤ p1 +···+ pk }

so that
     X =ᵈ Σ_{k=1}^N xk 1{ p1 +···+ pk−1 < U ≤ p1 +···+ pk } .

The yield of the procedure is still r = 1. However, when implemented naively, its
complexity – which corresponds to (at most) N comparisons for every simulation –
may be quite high. See [77] for some considerations (in the spirit of quick sort algo-
rithms) which lead to an O(log N ) complexity. Furthermore, this procedure presupposes
that one has access to the probability weights pk with an arbitrary accuracy. This is
not always the case, even in a priori simple situations, as emphasized in Example 6
below.
It should be observed, of course, that the above simulation formula remains appro-
priate for a random variable taking values in any set E, not only for subsets of R!
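The indicator formula above amounts to locating U among the cumulative weights p1, p1 + p2, . . . ; a sketch using binary search, which already brings the per-draw cost down to the O(log N) mentioned above:

```python
import bisect
from itertools import accumulate

def make_finite_sampler(xs, ps):
    """xs: support points; ps: probability weights summing to 1."""
    cum = list(accumulate(ps))                 # p1, p1+p2, ..., 1
    def sampler(u):                            # u in (0, 1)
        # index k such that p1+...+p_{k-1} < u <= p1+...+p_k
        return xs[bisect.bisect_left(cum, u)]
    return sampler

draw = make_finite_sampler(["x1", "x2", "x3"], [0.2, 0.5, 0.3])
# draw(0.10) -> "x1",  draw(0.65) -> "x2",  draw(0.90) -> "x3"
```

Note that nothing in the code requires the support points to be real numbers, in line with the closing remark above.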
5. Simulation of a Bernoulli random variable B( p), p ∈ (0, 1). This is the simplest
application of the previous method since

     X = 1{U ≤ p} =ᵈ B( p).

6. Simulation of a Binomial random variable B(n, p), p ∈ (0, 1), n ≥ 1. One
relies on the very definition of the binomial distribution as the law of the sum of n
independent B( p)-distributed random variables, i.e.

     X = Σ_{k=1}^n 1{Uk ≤ p} =ᵈ B(n, p),

where U1 , . . . , Un are i.i.d. random variables, uniformly distributed over [0, 1].
Note that this procedure has a very bad yield, namely r = 1/n. Moreover, it needs
n comparisons, like the standard method (without any shortcut).
Why not use the above standard method for a random variable taking finitely
many values? Because the cost of the computation of the probability weights pk is
much too high as n grows.
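A direct transcription of this definition (with its r = 1/n yield):

```python
import random

def binomial(n, p, rng):
    """B(n, p) as the sum of n independent Bernoulli indicators 1{U_k <= p};
    each draw consumes n uniform numbers, hence the yield 1/n."""
    return sum(rng.random() <= p for _ in range(n))

rng = random.Random(7)
x = binomial(10, 0.3, rng)   # an integer between 0 and 10
```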
7. Simulation of geometric random variables G( p) and G ∗ ( p), p ∈ (0, 1). This
is the distribution of the first success instant when independently repeating the
same Bernoulli experiment with parameter p. Conventionally, G( p) starts at time 0
whereas G ∗ ( p) starts at time 1.

To be precise, if (X k )k≥0 denotes an i.i.d. sequence of random variables with
distribution B( p), p ∈ (0, 1), then

     τ ∗ := min{k ≥ 1 : X k = 1} =ᵈ G ∗ ( p)

and
     τ := min{k ≥ 0 : X k = 1} =ᵈ G( p).

Hence
     P(τ ∗ = k) = p (1 − p)^{k−1} , k ∈ N∗ := {1, 2, . . . , n, . . .},

and
     P(τ = k) = p (1 − p)^k , k ∈ N := {0, 1, 2, . . . , n, . . .}

(so that both random variables are P-a.s. finite since Σ_{k≥0} P(τ = k) =
Σ_{k≥1} P(τ ∗ = k) = 1). Note that τ + 1 has the same G ∗ ( p)-distribution as τ ∗ .

The (random) yields of the above two procedures are r ∗ = 1/τ ∗ and r = 1/(τ + 1),
respectively. Their common mean (average yield) r̄ = r̄ ∗ is given by

     E[1/(τ + 1)] = E[1/τ ∗ ] = Σ_{k≥1} (1/k) p (1 − p)^{k−1}
                  = ( p/(1 − p)) Σ_{k≥1} (1 − p)^k /k
                  = −( p/(1 − p)) log(1 − x)|_{x=1− p}
                  = −( p/(1 − p)) log( p).
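This average-yield formula is easy to check numerically: simulate τ∗ by repeated Bernoulli trials and average 1/τ∗. A sketch (for p = 0.5 the formula gives r̄ = log 2 ≈ 0.693):

```python
import math
import random

def tau_star(p, rng):
    """First success time of i.i.d. Bernoulli(p) trials (starting at 1)."""
    k = 1
    while rng.random() > p:
        k += 1
    return k

rng = random.Random(42)
p, n = 0.5, 100_000
mean_yield = sum(1.0 / tau_star(p, rng) for _ in range(n)) / n
theory = -p * math.log(p) / (1.0 - p)   # = log 2 for p = 0.5
# mean_yield agrees with theory to about two decimal places
```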

 Exercises. 1. (Conditional distributions). Let X : (Ω, A, P) → (R, Bor (R)) be a
real-valued random variable with distribution function F and left continuous canon-
ical inverse Fl−1 . Let I = [a, b], −∞ ≤ a < b ≤ +∞, be a nontrivial interval of R.
Show that, if U =ᵈ U ([0, 1]), then

     Fl−1 (F(a) + (F(b) − F(a))U ) =ᵈ L(X | X ∈ I ).

2. Negative binomial distributions. The negative binomial distribution with parameter


(n, p) is the law μn, p of the arrival time of the n-th success in an infinite sequence of
independent Bernoulli trials, namely, using the above notations for the geometric distributions, the
distribution of

     τ (n) = min{k ≥ 1 : card{1 ≤ ℓ ≤ k : X ℓ = 1} = n}.

Show that

     μn, p (k) = 0, k ≤ n − 1,    μn, p (k) = C_{k−1}^{n−1} p^n (1 − p)^{k−n} , k ≥ n.

Compute the mean yield of its (natural and straightforward) simulation method.

1.4 The Acceptance-Rejection Method

This method is due to Von Neumann (1951). It is contemporary with the development
of computers and of the Monte Carlo method. The original motivation was to devise
a simulation method for a probability distribution ν on a measurable space (E, E),
absolutely continuous with respect to a σ-finite non-negative measure μ, with a density
given, up to a constant, by f : (E, E) → R+ , when we know that f is dominated by
(a multiple of) the density g of a probability distribution g · μ which can be simulated
at “low computational cost”. (Note that ν = ( f /∫_E f dμ) · μ.)
In most elementary applications (see below), E is either a Borel set of Rd equipped
with its Borel σ-field and μ is the trace of the Lebesgue measure on E or a subset
E ⊂ Zd equipped with the counting measure.
Let us be more precise. So, let μ be a non-negative σ-finite measure on (E, E)
and let f, g : (E, E) −→ R+ be two Borel functions. Assume that f ∈ L^1_{R+}(μ) with
∫_E f dμ > 0 and that g is a probability density with respect to μ satisfying further-
more g > 0 μ-a.e., and that there exists a positive real constant c > 0 such that

     f (x) ≤ c g(x)   μ(d x)-a.e.
Note that this implies c ≥ ∫_E f dμ. As mentioned above, the aim of this section is to
show how to simulate some random numbers distributed according to the probability
distribution

     ν = ( f /∫_E f dμ) · μ

using some g · μ-distributed (pseudo-)random numbers. In particular, to make the
problem consistent, we will assume that ν ≠ g · μ, which in turn implies that

     c > ∫_E f dμ.

The underlying requirements on f , g and μ to undertake a practical implementa-


tion of the method described below are the following:

• the numerical value of the real constant c is known;



• we know how to simulate (at a reasonable cost) on a computer a sequence of i.i.d.
random vectors (Yk )k≥1 with the distribution g · μ;
• we can compute on a computer the ratio ( f /g)(x) at every x ∈ E (again at a reasonable
cost).

As a first (not so) preliminary step, we will explore a natural connection (in
distribution) between an E-valued random variable X with distribution ν and an
E-valued random variable Y with distribution g · μ. We will see that the key idea
is completely elementary. Let h : (E, E) → R be a test function (measurable and
bounded or non-negative). On the one hand,

     E h(X ) = (1/∫_E f dμ) ∫_E h(x) f (x) μ(d x)
             = (1/∫_E f dμ) ∫_E h(y) ( f /g)(y) g(y) μ(dy)   since g > 0 μ-a.e.
             = (1/∫_E f dμ) E[h(Y ) ( f /g)(Y )].

We can also forget about the last line, stay on the state space E and note in a somewhat
artificial way that

     E h(X ) = (c/∫_E f dμ) ∫_E h(y) (∫_0^1 1{u ≤ (1/c)( f /g)(y)} du) g(y) μ(dy)
             = (c/∫_E f dμ) ∫_E ∫_0^1 h(y) 1{u ≤ (1/c)( f /g)(y)} g(y) μ(dy) du
             = (c/∫_E f dμ) E[h(Y ) 1{c U ≤ ( f /g)(Y )} ],

where U is uniformly distributed over [0, 1] and independent of Y .


By considering h ≡ 1, we derive from the above identity that c/∫_E f dμ =
1/P(c U ≤ ( f /g)(Y )), so that finally

     E h(X ) = E[h(Y ) | c U ≤ ( f /g)(Y )].

The proposition below takes advantage of this identity in distribution to propose a


simulation procedure. In fact, it is simply a reverse way to make (and interpret) the
above computations.

Proposition 1.2 (Acceptance-rejection simulation method) Let (Un , Yn )n≥1
be a sequence of i.i.d. random variables with distribution U ([0, 1]) ⊗ PY (indepen-
dent marginals) defined on (Ω, A, P), where PY (dy) = g(y)μ(dy) is the distribution
of Y on (E, E). Set

     τ := min{k ≥ 1 : c Uk g(Yk ) ≤ f (Yk )}.

Then τ has a geometric distribution G ∗ ( p) with parameter given by

     p := P(c U1 g(Y1 ) ≤ f (Y1 )) = (∫_E f dμ)/c,   and   X := Yτ =ᵈ ν.

Remark. The (random) yield of the method is 1/τ . Hence, we know that its mean yield
is given by

     E[1/τ ] = −( p log p)/(1 − p) = (∫_E f dμ/(c − ∫_E f dμ)) log(c/∫_E f dμ).

Since lim_{ p→1} −( p log p)/(1 − p) = 1, the closer the constant c is to ∫_E f dμ, the higher
the yield of the simulation.

Proof. Step 1. Let (U, Y ) be a pair of random variables with distribution
U ([0, 1]) ⊗ PY and let h : E → R be a bounded Borel test function. We have

     E[h(Y ) 1{c U g(Y ) ≤ f (Y )} ] = ∫_{E×[0,1]} h(y) 1{c u g(y) ≤ f (y)} g(y) μ(dy) ⊗ du
        = ∫_E (∫_{[0,1]} 1{c u g(y) ≤ f (y)} ∩ {g(y)>0} du) h(y) g(y) μ(dy)
        = ∫_E h(y) (∫_0^1 1{u ≤ f (y)/(c g(y))} ∩ {g(y)>0} du) g(y) μ(dy)
        = ∫_{{g>0}} h(y) ( f (y)/(c g(y))) g(y) μ(dy)
        = (1/c) ∫_E h(y) f (y) μ(dy),

where we used successively Fubini’s Theorem, g(y) > 0 μ(dy)-a.e., Fubini’s Theo-
rem again and f (y)/(c g(y)) ≤ 1 μ(dy)-a.e. Note that we can apply Fubini’s Theorem
since the reference measure μ is σ-finite.
Putting h ≡ 1 yields

     c = (∫_E f dμ) / P(c U g(Y ) ≤ f (Y )).

Then, elementary conditioning yields

     E[h(Y ) | {c U g(Y ) ≤ f (Y )}] = ∫_E h(y) ( f (y)/∫_E f dμ) μ(dy) = ∫_E h(y) ν(dy),

i.e.
     L(Y | {c U g(Y ) ≤ f (Y )}) = ν.

Step 2. Then (using that τ is P-a.s. finite),

     E[h(Yτ )] = Σ_{n≥1} E[1{τ =n} h(Yn )]
              = Σ_{n≥1} P({c U1 g(Y1 ) > f (Y1 )})^{n−1} E[h(Y1 ) 1{c U1 g(Y1 ) ≤ f (Y1 )} ]
              = Σ_{n≥1} (1 − p)^{n−1} E[h(Y1 ) 1{c U1 g(Y1 ) ≤ f (Y1 )} ]
              = Σ_{n≥1} p (1 − p)^{n−1} E[h(Y1 ) | {c U1 g(Y1 ) ≤ f (Y1 )}]
              = 1 × ∫_E h(y) ν(dy)
              = ∫_E h(y) ν(dy). ♦

Remark. An important point to be noticed is that we do not need to know the
numerical value of ∫_E f dμ to implement the above acceptance-rejection procedure.

Corollary 1.1 Set by induction, for every n ≥ 1,

     τ1 := min{k ≥ 1 : c Uk g(Yk ) ≤ f (Yk )}

and
     τn+1 := min{k ≥ τn + 1 : c Uk g(Yk ) ≤ f (Yk )}.

(a) The sequence (τn − τn−1 )n≥1 (with the convention τ0 = 0) is i.i.d. with distribu-
tion G ∗ ( p) and the sequence

     X n := Yτn

is an i.i.d. PX -distributed sequence of random variables.
(b) Furthermore, the random yield of the simulation of the first n PX -distributed
random variables Yτk , k = 1, . . . , n, is

     ρn = n/τn −→ p   a.s. as n → +∞.

Proof. (a) Left as an exercise (see below).
(b) The fact that ρn = n/τn is obvious. The announced a.s. convergence follows from
the Strong Law of Large Numbers since (τn − τn−1 )n≥1 is i.i.d. and E τ1 = 1/ p, which
implies that τn /n → 1/ p a.s. ♦

 Exercise. Prove item (a) of the above corollary.


Before proposing some first applications, let us briefly present a more applied point
of view which is closer to what is really implemented in practice when performing
a Monte Carlo simulation based on the acceptance-rejection method.
ℵ Practitioner’s corner (the Practitioner’s point of view)
The practical implementation of the acceptance-rejection method is rather simple.
Let h : E → R be a PX -integrable Borel function. How to compute E h(X ) using
Von Neumann’s acceptance-rejection method? It amounts to the simulation of an
n-sample (Uk , Yk )1≤k≤n on a computer and to the computation of

     (Σ_{k=1}^n 1{c Uk g(Yk ) ≤ f (Yk )} h(Yk )) / (Σ_{k=1}^n 1{c Uk g(Yk ) ≤ f (Yk )}).

Note that

     (Σ_{k=1}^n 1{c Uk g(Yk ) ≤ f (Yk )} h(Yk )) / (Σ_{k=1}^n 1{c Uk g(Yk ) ≤ f (Yk )})
        = ((1/n) Σ_{k=1}^n 1{c Uk g(Yk ) ≤ f (Yk )} h(Yk )) / ((1/n) Σ_{k=1}^n 1{c Uk g(Yk ) ≤ f (Yk )}),   n ≥ 1.

Hence, owing to the strong Law of Large Numbers (see the next chapter if necessary),
this quantity a.s. converges, as n goes to infinity, toward

     (∫_E ∫_0^1 1{c u g(y) ≤ f (y)} h(y) g(y) du μ(dy)) / (∫_E ∫_0^1 1{c u g(y) ≤ f (y)} g(y) du μ(dy))
        = (∫_E h(y) ( f (y)/c) μ(dy)) / (∫_E ( f (y)/c) μ(dy))
        = ∫_E h(y) ( f (y)/∫_E f dμ) μ(dy)
        = ∫_E h(y) ν(dy).

This third way to present the same computations shows that in terms of practical
implementation, this method is in fact very elementary.
Classical applications
 Uniform distributions on a bounded domain D ⊂ R^d .
Let D ⊂ a + [−M, M]^d with λd (D) > 0 (where a ∈ R^d and M > 0), let
Y =ᵈ U (a + [−M, M]^d ) and let τ_D := min{n : Yn ∈ D}, where (Yn )n≥1 is an i.i.d.
sequence defined on a probability space (Ω, A, P) with the same distribution as
Y . Then

     Yτ_D =ᵈ U (D),

where U (D) denotes the uniform distribution over D.


This follows from the above Proposition 1.2 with E = a + [−M, M]^d , μ :=
λd|(a+[−M,M]^d ) (the Lebesgue measure on a + [−M, M]^d ),

     g(u) := (2M)^{−d} 1_{a+[−M,M]^d} (u)

and
     f (x) = 1_D (x) ≤ (2M)^d g(x),   with c := (2M)^d ,

so that ( f /∫_E f dμ) · μ = U (D).
As a matter of fact, with the notations of the above proposition,

     τ = min{k ≥ 1 : c Uk ≤ ( f /g)(Yk )}.

However, f (y)/(c g(y)) = 1_D (y), so that the acceptance test simply reads Uk ≤ 1_D (Yk ),
i.e. Yk ∈ D (P-a.s., since Uk > 0 a.s.). Consequently, τ = τ_D .


A standard application is to consider the unit ball of Rd , D := Bd (0; 1). When
d = 2, this is a step of the so-called polar method, see below, for the simulation of
N (0; I2 ) random vectors.
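As a minimal Python sketch of this classical application (ours, not from the book; the function name is illustrative), here is the rejection sampler for U(B_2(0; 1)) from the square [−1, 1]²; the expected number of trials per accepted point is 4/π ≈ 1.27:

```python
import random

def uniform_in_disk(rng=random):
    """Sample U(B_2(0;1)) by rejection from the square [-1,1]^2.

    A proposal is accepted iff it falls inside the unit disk, which
    happens with probability pi/4."""
    while True:
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:   # accept iff the point falls in D
            return (x, y)

random.seed(0)
pts = [uniform_in_disk() for _ in range(10_000)]
# Sanity check: P(X in half-disk {x > 0}) should be close to 1/2.
frac_right = sum(1 for (x, _) in pts if x > 0) / len(pts)
```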
 The γ(α)-distribution.
Let α > 0 and P_X(dx) = (f_α(x)/Γ(α)) dx, where

    f_α(x) = x^{α−1} e^{−x} 1_{x>0}(x).

(Keep in mind that Γ(a) = ∫_0^{+∞} u^{a−1} e^{−u} du, a > 0.) Note that when α = 1 the
gamma distribution is simply the exponential distribution. We will consider E =
(0, +∞) and the reference σ-finite measure μ = λ_1|(0,+∞).
– Case 0 < α < 1. We use the rejection method, based on the probability density

    g_α(x) = (αe/(α+e)) ( x^{α−1} 1_{0<x<1} + e^{−x} 1_{x≥1} ).

The fact that g_α is a probability density function follows from an elementary computation.
First, one checks that f_α(x) ≤ c_α g_α(x) for every x ∈ R_+, where

    c_α = (α+e)/(αe).

Note that this choice of c_α is optimal since f_α(1) = c_α g_α(1). Then, one uses
the inverse distribution function to simulate the random variable with distribution
P_Y(dy) = g_α(y)λ(dy). Namely, if G_α denotes the distribution function of Y, one
checks that, for every x ∈ R,

    G_α(x) = (e/(α+e)) x^α 1_{0<x<1} + (αe/(α+e)) ( 1/e + 1/α − e^{−x} ) 1_{x≥1}

so that, for every u ∈ (0, 1),

    G_α^{−1}(u) = ( (α+e)u/e )^{1/α} 1_{u < e/(α+e)} − log( (1−u)(α+e)/(αe) ) 1_{u ≥ e/(α+e)}.

Note that the computation of the Γ function is never required to implement this
method.
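The case 0 < α < 1 can be sketched in Python as follows (an illustrative implementation of ours, not from the book). After inverting G_α, the acceptance test v ≤ f_α(y)/(c_α g_α(y)) reduces to v ≤ e^{−y} on (0, 1) and to v ≤ y^{α−1} on [1, +∞):

```python
import math
import random

def gamma_lt1(alpha, rng=random):
    """Rejection sampler for gamma(alpha), 0 < alpha < 1, using the
    envelope g_alpha above and the constant c_alpha = (alpha + e)/(alpha * e)."""
    assert 0.0 < alpha < 1.0
    e = math.e
    while True:
        u, v = rng.random(), rng.random()
        if u < e / (alpha + e):                      # left branch of G_alpha^{-1}
            y = ((alpha + e) * u / e) ** (1.0 / alpha)
            accept = v <= math.exp(-y)               # f/(c g) = e^{-y} on (0, 1)
        else:                                        # right branch of G_alpha^{-1}
            y = -math.log((1.0 - u) * (alpha + e) / (alpha * e))
            accept = v <= y ** (alpha - 1.0)         # f/(c g) = y^{alpha-1} on [1, oo)
        if accept:
            return y

random.seed(1)
alpha = 0.5
sample = [gamma_lt1(alpha) for _ in range(20_000)]
mean = sum(sample) / len(sample)   # E X = alpha for the gamma(alpha) distribution
```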
– Case α ≥ 1. We rely on the following classical lemma about the γ distribution,
which we leave as an exercise to the reader.

Lemma 1.1 Let X′ and X″ be two independent random variables with distributions
γ(α′) and γ(α″), respectively. Then X = X′ + X″ has distribution γ(α′ + α″).

Consequently, if α = n ∈ N, an induction based on the lemma shows that

    X = ξ_1 + ··· + ξ_n,

where the random variables ξ_k are i.i.d. with exponential distribution since γ(1) =
E(1). Consequently, if U_1, ..., U_n are i.i.d. uniformly distributed random variables,

    X =(d) − log( ∏_{k=1}^n U_k ).

To simulate a random variable with general distribution γ(α), one writes α =
⌊α⌋ + {α}, where ⌊α⌋ := max{k ≤ α, k ∈ N} denotes the integer part of α and {α} ∈
[0, 1) its fractional part.
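For integer α = n, the representation X =(d) −log(U_1 ··· U_n) gives the following sketch (ours, not from the book; for non-integer α one would add an independent γ({α}) draw, e.g. from the previous rejection method, owing to Lemma 1.1):

```python
import math
import random

def gamma_integer(n, rng=random):
    """gamma(n) for integer n >= 1 as a sum of n Exp(1) variables,
    i.e. X = -log(U_1 ... U_n).  (For very large n, summing -log U_k
    avoids underflow of the product.)"""
    prod = 1.0
    for _ in range(n):
        prod *= rng.random()
    return -math.log(prod)

random.seed(2)
n = 4
sample = [gamma_integer(n) for _ in range(20_000)]
mean = sum(sample) / len(sample)   # E X = n for the gamma(n) distribution
```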

 Exercises. 1. Prove the above lemma.


2. Show that considering the normalized probability density function of the γ(α)-distribution
(which involves the computation of Γ(α)) as the function f_α will not
improve the yield of the simulation.

3. Let α = α + n, α = α ∈ [0, 1). Show that the yield of the simulation is given
by r = n+τ1
α
, where τα has a G ∗ ( pα ) distribution with pα related to the simulation
of the γ(α)-distribution. Show that

p  (1 − p)k 
n
r̄ := E r = − log p + .
(1 − p)n+1 k=1
k

 Acceptance-rejection method for a bounded density with compact support.


Let f : R^d → R_+ be a bounded Borel function with compact support (hence integrable
with respect to the Lebesgue measure). If f can be computed at a reasonable cost,
one may simulate the distribution ν = ( f / ∫_{R^d} f dλ_d ) λ_d by simply considering a uniform
distribution on a hypercube dominating f. To be more precise, let a = (a_1, ..., a_d),
b = (b_1, ..., b_d), κ ∈ R^d be such that

    supp(f) (the closure of {f ≠ 0}) ⊂ K = κ + ∏_{i=1}^d [a_i, b_i].

Let E = K, let μ = λ_d|E be the reference measure and g = (1/λ_d(K)) 1_K the density of the
uniform distribution over K (this distribution is very easy to simulate, as emphasized
in a former example). Then

    f(x) ≤ c g(x), x ∈ K, with c = ‖f‖_sup λ_d(K) = ‖f‖_sup ∏_{i=1}^d (b_i − a_i).

Then, if (Y_n)_{n≥1} denotes an i.i.d. sequence defined on a probability space (Ω, A, P)
with the uniform distribution over K, the stopping strategy τ of Von Neumann's
acceptance-rejection method reads

    τ = min{ k : ‖f‖_sup U_k ≤ f(Y_k) }.

Equivalently, this can be rewritten in a more intuitive way as follows: let (V_n)_{n≥1} =
(V_n^1, V_n^2)_{n≥1} be an i.i.d. sequence of random vectors defined on a probability space
(Ω, A, P) having a uniform distribution over K × [0, ‖f‖_sup]. Then

    V_τ^1 =(d) ν, where τ = min{ k ≥ 1 : V_k^2 ≤ f(V_k^1) }.
 Simulation of a discrete distribution supported by the non-negative integers.


Let Y =(d) G*(p), p ∈ (0, 1), i.e. such that its distribution satisfies P_Y = g · m, where
m is the counting measure on N* (m({k}) = 1, k ∈ N*) and g_k = p(1−p)^{k−1}, k ≥ 1.
Let f = (f_k)_{k≥1} be a function from N* → R_+ defined by f_k = a_k(1−p)^{k−1} and
satisfying κ := sup_n a_n < +∞ (so that Σ_n f_n < +∞). Then

    f_k ≤ c g_k with c = κ/p.

Consequently, if τ := min{ k ≥ 1 : U_k ≤ a_{Y_k}/κ }, then Y_τ =(d) ν, where

    ν_k := a_k(1−p)^{k−1} / Σ_n a_n(1−p)^{n−1},   k ≥ 1.

There are many other applications of Von Neumann's acceptance-rejection
method, e.g. in Physics, to take advantage of the fact that the density to be simulated
is only known up to a constant. Several methods have been devised to speed it up,
i.e. to increase its yield. Among them, let us cite the Ziggurat method, for which we
refer to [212]. It was developed by Marsaglia and Tsang in the early 2000s.

1.5 Simulation of Poisson Distributions (and Poisson Processes)

The Poisson distribution with parameter λ > 0, denoted by P(λ), is an integer-valued
probability measure analytically defined by

    ∀ k ∈ N, P(λ)({k}) = e^{−λ} λ^k/k! .

To simulate this distribution in an exact way, one relies on its close connection with
the Poisson counting process. The (normalized) Poisson counting process is the
counting process induced by the Exponential random walk (with parameter 1). It is
defined by

    ∀ t ≥ 0, N_t = Σ_{n≥1} 1_{S_n ≤ t} = min{ n : S_{n+1} > t },

where S_n = X_1 + ··· + X_n, n ≥ 1, S_0 = 0 and (X_n)_{n≥1} is an i.i.d. sequence of random
variables with distribution E(1) defined on a probability space (Ω, A, P).

Proposition 1.3 The process (Nt )t≥0 has càdlàg (1 ) paths and independent station-
ary increments. In particular, for every s, t ≥ 0, s ≤ t, Nt − Ns is independent of
Ns and has the same distribution as Nt−s . Furthermore, for every t ≥ 0, Nt has a
Poisson distribution with parameter t.

1 A French acronym for right continuous with left limits (continu à droite, limite à gauche).

Proof. Let (X_k)_{k≥1} be a sequence of i.i.d. random variables with an exponential
distribution E(1). Set, for every n ≥ 1,

    S_n = X_1 + ··· + X_n.

Let t_1, t_2 ∈ R_+, t_1 < t_2 and let k_1, k_2 ∈ N. Assume temporarily that k_2 ≥ 1. Then

    P(N_{t_2} − N_{t_1} = k_2, N_{t_1} = k_1) = P(N_{t_1} = k_1, N_{t_2} = k_1 + k_2)
        = P(S_{k_1} ≤ t_1 < S_{k_1+1} ≤ S_{k_1+k_2} ≤ t_2 < S_{k_1+k_2+1}).

Now, if we set A = P(S_{k_1} ≤ t_1 < S_{k_1+1} ≤ S_{k_1+k_2} ≤ t_2 < S_{k_1+k_2+1}) for convenience,
we get

    A = ∫_{R_+^{k_1+k_2+1}} 1_{x_1+···+x_{k_1} ≤ t_1 < x_1+···+x_{k_1+1}, x_1+···+x_{k_1+k_2} ≤ t_2 < x_1+···+x_{k_1+k_2+1}} e^{−(x_1+···+x_{k_1+k_2+1})} dx_1 ··· dx_{k_1+k_2+1}.

The usual change of variable x_1 = u_1 and x_i = u_i − u_{i−1}, i = 2, ..., k_1+k_2+1,
yields

    A = ∫_{0 ≤ u_1 ≤ ··· ≤ u_{k_1} ≤ t_1 ≤ u_{k_1+1} ≤ ··· ≤ u_{k_1+k_2} ≤ t_2 < u_{k_1+k_2+1}} e^{−u_{k_1+k_2+1}} du_1 ··· du_{k_1+k_2+1}.

Integrating downward from u_{k_1+k_2+1} to u_1 we get, owing to Fubini's Theorem,

    A = ∫_{0 ≤ u_1 ≤ ··· ≤ u_{k_1} ≤ t_1 ≤ u_{k_1+1} ≤ ··· ≤ u_{k_1+k_2} ≤ t_2} du_1 ··· du_{k_1+k_2} e^{−t_2}
      = ∫_{0 ≤ u_1 ≤ ··· ≤ u_{k_1} ≤ t_1} du_1 ··· du_{k_1} ((t_2 − t_1)^{k_2}/k_2!) e^{−t_2}
      = (t_1^{k_1}/k_1!) ((t_2 − t_1)^{k_2}/k_2!) e^{−t_2}
      = e^{−t_1} (t_1^{k_1}/k_1!) × e^{−(t_2−t_1)} ((t_2 − t_1)^{k_2}/k_2!).

When k_2 = 0, one computes likewise

    P(S_{k_1} ≤ t_1 < t_2 < S_{k_1+1}) = (t_1^{k_1}/k_1!) e^{−t_2} = e^{−t_1} (t_1^{k_1}/k_1!) e^{−(t_2−t_1)}.

Summing over k_2 ∈ N shows that, for every k_1 ∈ N,

    P(N_{t_1} = k_1) = e^{−t_1} t_1^{k_1}/k_1!,

i.e. N_{t_1} =(d) P(t_1). Summing over k_1 ∈ N shows that, for every k_2 ∈ N,

    N_{t_2} − N_{t_1} =(d) N_{t_2−t_1} =(d) P(t_2 − t_1).

Finally, this yields, for every k_1, k_2 ∈ N,

    P(N_{t_2} − N_{t_1} = k_2, N_{t_1} = k_1) = P(N_{t_2} − N_{t_1} = k_2) × P(N_{t_1} = k_1),

i.e. the increments N_{t_2} − N_{t_1} and N_{t_1} are independent.
One shows likewise, with a few more technicalities, that in fact, if 0 < t_1 < t_2 <
··· < t_n, n ≥ 1, then the increments (N_{t_i} − N_{t_{i−1}})_{i=1,...,n} are independent and stationary,
in the sense that N_{t_i} − N_{t_{i−1}} =(d) P(t_i − t_{i−1}). ♦

Corollary 1.2 (Simulation of a Poisson distribution) Let (U_n)_{n≥1} be an i.i.d.
sequence of uniformly distributed random variables on the unit interval. The process
null at zero and defined for every t > 0 by

    N_t = min{ k ≥ 0 : U_1 ··· U_{k+1} < e^{−t} } =(d) P(t)

is a normalized Poisson process.

Proof. It follows from Example 1 in the former Sect. 1.3 that the exponentially
distributed i.i.d. sequence (X_k)_{k≥1} can be written in the following form:

    X_k = − log U_k, k ≥ 1.

Using that the random walk (S_n)_{n≥1} is non-decreasing, it follows from the definition
of a Poisson process that, for every t ≥ 0,

    N_t = min{ k ≥ 0 such that S_{k+1} > t }
        = min{ k ≥ 0 such that − log(U_1 ··· U_{k+1}) > t }
        = min{ k ≥ 0 such that U_1 ··· U_{k+1} < e^{−t} }. ♦
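Corollary 1.2 yields the classical multiplicative algorithm for simulating P(t); here is a minimal Python sketch (ours, not from the book):

```python
import math
import random

def poisson(t, rng=random):
    """N_t = min{k >= 0 : U_1 ... U_{k+1} < e^{-t}}, which is P(t)-distributed."""
    threshold = math.exp(-t)
    k, prod = 0, rng.random()       # prod = U_1 corresponds to k = 0
    while prod >= threshold:
        k += 1
        prod *= rng.random()        # multiply in U_{k+1}
    return k

random.seed(4)
t = 3.0
sample = [poisson(t) for _ in range(20_000)]
mean = sum(sample) / len(sample)    # E N_t = t
```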

1.6 Simulation of Gaussian Random Vectors

1.6.1 d-dimensional Standard Normal Vectors


 
A first method to simulate a bi-variate normal distribution N(0; I_2) is the so-called
Box–Muller method, described below.

Proposition 1.4 Let R² and Θ : (Ω, A, P) → R be two independent random variables with
distributions L(R²) = E(1/2) and L(Θ) = U([0, 2π]), respectively. Then

    X := (R cos Θ, R sin Θ) =(d) N(0; I_2),

where R := √(R²).
Proof. Let f be a bounded Borel function. Then

    ∫_{R²} f(x_1, x_2) exp( −(x_1² + x_2²)/2 ) dx_1 dx_2/(2π)
        = ∫ f(ρ cos θ, ρ sin θ) e^{−ρ²/2} 1_{R_+^*}(ρ) 1_{(0,2π)}(θ) ρ dρ dθ/(2π),

using the standard change of variable x_1 = ρ cos θ, x_2 = ρ sin θ. We use the facts
that (ρ, θ) ↦ (ρ cos θ, ρ sin θ) is a C¹-diffeomorphism from (0, +∞) × (0, 2π) onto
R² \ (R_+ × {0}) and that λ_2(R_+ × {0}) = 0. Setting now ρ = √r, one has

    ∫_{R²} f(x_1, x_2) exp( −(x_1² + x_2²)/2 ) dx_1 dx_2/(2π)
        = ∫ f(√r cos θ, √r sin θ) e^{−r/2} (dr/2) 1_{R_+^*}(r) 1_{(0,2π)}(θ) dθ/(2π)
        = E f( √(R²) cos Θ, √(R²) sin Θ )
        = E( f(X) ). ♦

Corollary 1.3 (Box–Muller method) One can simulate a distribution N(0; I_2)
from a pair (U_1, U_2) of independent random variables with distribution U([0, 1]) by
setting

    X := ( √(−2 log U_1) cos(2πU_2), √(−2 log U_1) sin(2πU_2) ).

The yield of the simulation is r = 1/2 with respect to the N(0; 1) distribution and
r = 1 when the aim is to simulate an N(0; I_2)-distributed (pseudo-)random vector
or, equivalently, two N(0; 1)-distributed (pseudo-)random numbers.

Proof. Simulate the exponential distribution using the inverse distribution function
with U_1 =(d) U([0, 1]) and note that if U_2 =(d) U([0, 1]), then 2πU_2 =(d) U([0, 2π])
(where U_2 is taken independent of U_1). ♦

Application to the simulation of the multivariate normal distribution
To simulate a d-dimensional vector X = (X_1, ..., X_d) with N(0; I_d) distribution,
one may assume that d is even and "concatenate" the above process. We consider a
d-tuple (U_1, ..., U_d) =(d) U([0, 1]^d) (so that U_1, ..., U_d are i.i.d. with
distribution U([0, 1])) and we set

    (X_{2i−1}, X_{2i}) = ( √(−2 log U_{2i−1}) cos(2πU_{2i}), √(−2 log U_{2i−1}) sin(2πU_{2i}) ),   i = 1, ..., d/2.   (1.1)
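Formula (1.1) can be sketched as follows in Python (an illustrative implementation of ours, not from the book; u_1 is replaced by 1 − u_1, which lies in (0, 1], to avoid log 0):

```python
import math
import random

def box_muller_pair(rng=random):
    """One pair of independent N(0;1) variables from (U1, U2) ~ U([0,1])^2."""
    u1 = 1.0 - rng.random()   # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def standard_normal_vector(d, rng=random):
    """Concatenate Box-Muller pairs into one N(0; I_d) draw (last value
    dropped when d is odd)."""
    out = []
    while len(out) < d:
        out.extend(box_muller_pair(rng))
    return out[:d]

random.seed(5)
zs = [standard_normal_vector(2) for _ in range(20_000)]
m1 = sum(z[0] for z in zs) / len(zs)        # empirical mean of first coordinate
v1 = sum(z[0] ** 2 for z in zs) / len(zs)   # empirical second moment
```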

A second popular method for simulating bi-variate normal distributions is the
Polar method due to Marsaglia. It relies on the simulation of uniformly distributed
random variables on the 2-dimensional Euclidean unit ball by an acceptance-rejection
method.

Exercise (Marsaglia's Polar method). See [210]. Let (V_1, V_2) =(d) U(B(0; 1)),
where B(0; 1) denotes the canonical Euclidean unit ball in R². Set R² = V_1² + V_2²
and

    X := √( −2 log(R²)/R² ) (V_1, V_2).

(a) Show that R² =(d) U([0, 1]), that (V_1/R, V_2/R) =(d) (cos Θ, sin Θ) and that R² and
(V_1/R, V_2/R) are independent. Deduce that X =(d) N(0; I_2).
(b) Let (U_1, U_2) =(d) U([−1, 1]²). Show that L( (U_1, U_2) | U_1² + U_2² ≤ 1 )
= U(B(0; 1)). Derive a simulation method for N(0; I_2) combining the above identity
and an appropriate acceptance-rejection algorithm. What is the yield of the resulting
procedure?
(c) Compare the performances of Marsaglia's polar method with those of the
Box–Muller algorithm (i.e. the acceptance-rejection rule versus the computation
of trigonometric functions). Conclude.

1.6.2 Correlated d-dimensional Gaussian Vectors, Gaussian Processes

Let X = (X_1, ..., X_d) be a centered R^d-valued Gaussian vector with covariance
matrix Σ = [Cov(X_i, X_j)]_{1≤i,j≤d}. The symmetric matrix Σ is positive semidefinite
(but possibly non-invertible). It follows from the very definition of Gaussian vectors
that any linear combination (u|X) = Σ_i u_i X_i has a (centered) Gaussian distribution
with variance u*Σu, so that the characteristic function of X is given by

    χ_X(u) := E e^{i(u|X)} = e^{−u*Σu/2},   u ∈ R^d.

As a well-known consequence, the covariance matrix Σ characterizes the distribution
of X, which allows us to denote by N(0; Σ) the distribution of X.
The key to simulating such a random vector is the following general lemma (which
has nothing to do with Gaussian vectors). It describes how the covariance is modified
by a linear transform.

Lemma 1.2 Let Y be an R^d-valued square integrable random vector and let A ∈
M(q, d) be a q × d matrix. Then the covariance matrix C_{AY} of the random vector
AY is given by

    C_{AY} = A C_Y A*,

where A* stands for the transpose of A.



This result can be used in two different ways.
– Square root of Σ. It is a classical result that there exists a unique positive semidefinite
matrix √Σ commuting with Σ such that Σ = (√Σ)² = √Σ(√Σ)*. Consequently,
owing to the above lemma,

    if Z =(d) N(0; I_d), then √Σ Z =(d) N(0; Σ).

One can compute √Σ by diagonalizing Σ in the orthogonal group: if
Σ = P Diag(λ_1, ..., λ_d) P* with P P* = I_d and λ_1, ..., λ_d ≥ 0, then, by uniqueness
of the square root of Σ as defined above, it is clear that √Σ = P Diag(√λ_1, ..., √λ_d) P*.
– Cholesky decomposition of Σ. When the covariance matrix Σ is invertible (i.e.
positive definite), it is much more efficient to rely on the Cholesky decomposition (see e.g.
Numerical Recipes [220]) by decomposing Σ as

    Σ = T T*,

where T is a lower triangular matrix (i.e. such that T_{ij} = 0 if i < j). Then, owing to
Lemma 1.2,

    T Z =(d) N(0; Σ).

In fact, the Cholesky decomposition is the matrix formulation of the Hilbert–Schmidt
orthonormalization procedure. In particular, there exists a unique such lower triangular
matrix T with positive diagonal terms (T_ii > 0, i = 1, ..., d), called the Cholesky
matrix of Σ.
The Cholesky-based approach performs better since it approximately divides the
complexity of this phase of the simulation by a factor of almost 2.
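A minimal, dependency-free Python sketch of the Cholesky-based simulation (ours, not from the book; in practice one would rather call a linear-algebra library):

```python
import math
import random

def cholesky(sigma):
    """Lower-triangular T with positive diagonal such that T T* = sigma
    (sigma symmetric positive definite)."""
    d = len(sigma)
    t = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(t[i][k] * t[j][k] for k in range(j))
            if i == j:
                t[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                t[i][j] = (sigma[i][j] - s) / t[j][j]
    return t

def gaussian_vector(t, rng=random):
    """One draw of T Z with Z ~ N(0; I_d), hence N(0; T T*) by Lemma 1.2."""
    d = len(t)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [sum(t[i][k] * z[k] for k in range(i + 1)) for i in range(d)]

sigma = [[1.0, 0.5], [0.5, 2.0]]
t = cholesky(sigma)
random.seed(6)
draws = [gaussian_vector(t) for _ in range(20_000)]
cov01 = sum(x * y for (x, y) in draws) / len(draws)   # empirical Cov(X_1, X_2)
```

The factor T is computed once; every subsequent draw only costs one matrix-vector product on the lower triangle.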
Exercises. 1. Let Z = (Z_1, Z_2) be a Gaussian vector such that Z_1 =(d) Z_2 =(d) N(0; 1)
and Cov(Z_1, Z_2) = ρ ∈ [−1, 1].
(a) Compute for every u ∈ R² the Laplace transform L(u) = E e^{(u|Z)} of Z.
(b) Compute for every σ_1, σ_2 > 0 the correlation (2) ρ_{X_1,X_2} between the random
variables X_1 = e^{σ_1 Z_1} and X_2 = e^{σ_2 Z_2}.
(c) Show that inf_{ρ∈[−1,1]} ρ_{X_1,X_2} ∈ (−1, 0) and that, when σ_i = σ > 0,
inf_{ρ∈[−1,1]} ρ_{X_1,X_2} = −e^{−σ²}.

2. Let Σ be a positive definite matrix. Show the existence of a unique lower triangular
matrix T and a diagonal matrix D such that both T and D have positive

2 The correlation ρ_{X_1,X_2} between two square integrable, non-a.s. constant random variables defined
on the same probability space is defined by

    ρ_{X_1,X_2} = Cov(X_1, X_2)/(σ(X_1)σ(X_2)) = Cov(X_1, X_2)/√(Var(X_1) Var(X_2)).


diagonal entries, Σ = T D T* and Σ_{j=1}^i T_{ij}² = 1 for every i = 1, ..., d. [Hint: change the reference
Euclidean norm to perform the Hilbert–Schmidt decomposition] (3).
Application to the simulation of a standard Brownian motion at fixed times
Let W = (W_t)_{t≥0} be a standard Brownian motion defined on a probability space
(Ω, A, P). Let (t_1, ..., t_n) be an increasing n-tuple of instants (0 ≤ t_1 < t_2 < ··· < t_{n−1} <
t_n). One elementary definition of a standard Brownian motion is that it
is a centered Gaussian process with covariance given by C^W(s, t) = E(W_s W_t) =
s ∧ t (4). The resulting simulation method, relying on the Cholesky decomposition
of the covariance structure of the Gaussian vector (W_{t_1}, ..., W_{t_n}) given by

    Σ^W_{(t_1,...,t_n)} = [ t_i ∧ t_j ]_{1≤i,j≤n},

is a first possibility.
However, it seems more natural to use the independence and the stationarity of
the increments, i.e. that

    (W_{t_1}, W_{t_2} − W_{t_1}, ..., W_{t_n} − W_{t_{n−1}}) =(d) N( 0; Diag(t_1, t_2 − t_1, ..., t_n − t_{n−1}) ),

so that

    (W_{t_1}, W_{t_2} − W_{t_1}, ..., W_{t_n} − W_{t_{n−1}})^* =(d) Diag( √t_1, √(t_2 − t_1), ..., √(t_n − t_{n−1}) ) (Z_1, ..., Z_n)^*,

where (Z_1, ..., Z_n) =(d) N(0; I_n). The simulation of (W_{t_1}, ..., W_{t_n}) follows by summing
the increments.

Remark. To be more precise, one derives from the above result that

    (W_{t_1}, W_{t_2}, ..., W_{t_n})^* = L (W_{t_1}, W_{t_2} − W_{t_1}, ..., W_{t_n} − W_{t_{n−1}})^*, where L = [ 1_{i≥j} ]_{1≤i,j≤n}.

3 This modified Cholesky decomposition is faster than the standard one (see e.g. [275]) since it
avoids square root computations, even if, in practice, the cost of the decomposition itself remains
negligible compared to that of a large-scale Monte Carlo simulation.
4 This definition does not include the fact that W has continuous paths; however, it can be derived,
using the celebrated Kolmogorov criterion, that W has a modification with continuous paths (see e.g.
[251]).
Hence, if we set T = L Diag( √t_1, √(t_2 − t_1), ..., √(t_n − t_{n−1}) ), one checks on the one
hand that T = [ √(t_j − t_{j−1}) 1_{i≥j} ]_{1≤i,j≤n} (with t_0 = 0) and on the other hand that

    (W_{t_1}, ..., W_{t_n})^* = T (Z_1, ..., Z_n)^*.

We derive, owing to Lemma 1.2, that T T* = T I_n T* = Σ^W_{(t_1,...,t_n)}. The matrix T being
lower triangular, it provides the Cholesky decomposition of the covariance matrix
Σ^W_{(t_1,...,t_n)}.
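The increment-based simulation of (W_{t_1}, ..., W_{t_n}) can be sketched as follows (an illustrative Python implementation of ours, not from the book):

```python
import math
import random

def brownian_at(times, rng=random):
    """Simulate (W_{t1}, ..., W_{tn}) by cumulating the independent increments
    W_{t_i} - W_{t_{i-1}} ~ N(0; t_i - t_{i-1}), with t_0 = 0."""
    w, prev, out = 0.0, 0.0, []
    for t in times:
        w += math.sqrt(t - prev) * rng.gauss(0.0, 1.0)
        prev = t
        out.append(w)
    return out

random.seed(7)
times = [0.25, 0.5, 0.75, 1.0]
paths = [brownian_at(times) for _ in range(20_000)]
var_T = sum(p[-1] ** 2 for p in paths) / len(paths)     # should approach t_n = 1
cov_13 = sum(p[0] * p[2] for p in paths) / len(paths)   # should approach t_1 ^ t_3 = 0.25
```

The two empirical moments reproduce the covariance s ∧ t, which is exactly the content of the remark above (T T* = Σ^W).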
ℵ Practitioner’s corner (Warning!)
In quantitative finance, especially when modeling the dynamics of several risky
assets, say d, the correlation between the Brownian sources of randomness B =
(B 1 , . . . , B d ) attached to the log-return is often misleading in terms of notations:
since it is usually written as


q
j
∀ i ∈ {1, . . . , d}, Bti = σi j Wt ,
j=1

where W = (W 1 , . . . , W q ) is a standard q-dimensional Brownian motion (i.e. each


component W j , j = 1, . . . , q, is a standard Brownian motion and all these compo-
nents are independent). The normalized covariance matrix of B (at time 1) is given
by
 i j 
q
 
Cov B1 , B1 , = σi σ j = σi. |σ j. = (σσ ∗ )i j ,
=1

where σi. is the column vector [σi j ]1≤ j≤q , σ = [σi j ]1≤i≤d,1≤ j≤q and (·|·) denotes the
canonical inner product on Rq . So one should process the Cholesky decomposition
on the symmetric non-negative matrix  B1 = σσ ∗ . Also keep in mind that, if q < d,
then σ B1 has rank at most q and cannot be invertible.
Application to the simulation of a fractional Brownian motion at fixed times
A fractional Brownian motion (fBm) W^H = (W_t^H)_{t≥0} with Hurst coefficient H ∈
(0, 1) is defined as a centered Gaussian process with a covariance function given for
every s, t ∈ R_+ by

    C^{W^H}(s, t) = (1/2)( t^{2H} + s^{2H} − |t − s|^{2H} ).

When H = 1/2, W^H is simply a standard Brownian motion. When H ≠ 1/2, W^H has
none of the usual properties of a Brownian motion (except the stationarity of its
increments and some self-similarity properties): it has dependent increments, it is
not a martingale (or even a semi-martingale), neither is it a Markov process.
A natural approach to simulating the fBm W^H at times t_1, ..., t_n is to rely on the
Cholesky decomposition of its auto-covariance matrix [C^{W^H}(t_i, t_j)]_{1≤i,j≤n}. On the
one hand, this matrix is ill-conditioned, which induces instability in the computation
of the Cholesky decomposition. This should be balanced against the fact that such a
computation can be made only once and stored off-line (at least for usual values of
the t_i like t_i = iT/n, i = 1, ..., n).
Other methods have been introduced in which the auto-covariance matrix is
embedded in a circulant matrix; they rely on a (fast) Fourier transform procedure. This
method, originally introduced in [73], has recently been improved in [277], where a
precise algorithm is described.
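A direct Python sketch of the Cholesky approach for the fBm (ours, not from the book; as suggested above, the Cholesky factor is computed once and can then be reused for every draw):

```python
import math
import random

def fbm_cholesky_factor(times, hurst):
    """Cholesky factor of the fBm covariance C(s,t) = (t^2H + s^2H - |t-s|^2H)/2
    evaluated on the grid `times` (all times > 0, increasing)."""
    def cov(s, t):
        return 0.5 * (t ** (2 * hurst) + s ** (2 * hurst) - abs(t - s) ** (2 * hurst))
    n = len(times)
    sigma = [[cov(times[i], times[j]) for j in range(n)] for i in range(n)]
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(t[i][k] * t[j][k] for k in range(j))
            t[i][j] = math.sqrt(sigma[i][i] - s) if i == j else (sigma[i][j] - s) / t[j][j]
    return t

def fbm_at(chol, rng=random):
    """One draw of (W^H_{t1}, ..., W^H_{tn}) as T Z with Z ~ N(0; I_n)."""
    n = len(chol)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

random.seed(8)
H = 0.7
times = [k / 4 for k in range(1, 5)]
chol = fbm_cholesky_factor(times, H)
draws = [fbm_at(chol) for _ in range(20_000)]
var_last = sum(d[-1] ** 2 for d in draws) / len(draws)  # should approach t_n^{2H} = 1
```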
Chapter 2
The Monte Carlo Method and
Applications to Option Pricing

2.1 The Monte Carlo Method

The basic principle of the Monte Carlo method is to implement on a computer the
Strong Law of Large Numbers (SLLN): if (X_k)_{k≥1} denotes a sequence, defined on
a probability space (Ω, A, P), of independent integrable random variables with the
same distribution as X, then

    P(dω)-a.s.   X̄_M(ω) := (X_1(ω) + ··· + X_M(ω))/M → m_X := E X as M → +∞.
The sequence (X_k)_{k≥1} is also called an i.i.d. sequence of random variables with
distribution μ = P_X (that of X) or a(n infinite) sample of the distribution μ. Two
conditions are required to undertake the Monte Carlo simulation of (the distribution
μ of) X, i.e. to implement the above SLLN on a computer for the distribution μ or
for the random vector X.
– One can generate on a computer some as perfect as possible (pseudo-)random
numbers which can be regarded as a "generic" realization (U_k(ω))_{k≥1} of a sample
(U_k)_{k≥1} of the uniform distribution over the unit interval [0, 1]. Note that, as a
consequence, for any integer d ≥ 1, ((U_{(k−1)d+ℓ}(ω))_{1≤ℓ≤d})_{k≥1} is also generic for the
sample ((U_{(k−1)d+ℓ})_{1≤ℓ≤d})_{k≥1} of the uniform distribution over [0, 1]^d (at least theoretically)
since U([0, 1]^d) = U([0, 1])^{⊗d}. This problem has already been briefly
discussed in the introductory chapter.
 In practice, X reads or is represented either as
– X =(d) ϕ(U), where U is a uniformly distributed random vector on [0, 1]^d
and the Borel function ϕ : u ↦ ϕ(u) can be computed at any u ∈ [0, 1]^d,
or
– X =(d) ϕ_τ(U_1, ..., U_τ), where (U_n)_{n≥1} is a sequence of i.i.d. random variables
uniformly distributed over [0, 1] and τ is a simulable finite stopping rule (or
stopping time) for the sequence (U_k)_{k≥1}, taking values in N*. By "stopping rule" we
mean that the event {τ = k} ∈ σ(U_1, ..., U_k) for every k ≥ 1 and by "simulable" we
mean that 1_{τ=k} = ψ_k(U_1, ..., U_k), where ψ_k has an explicit form for every k ≥ 1.
Of course, we also assume that, for every k ≥ 1, the function ϕ_k defined on the set
of finite [0, 1]^k-valued sequences is a computable function as well.

© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_2
This procedure is at the core of Monte Carlo simulation. We provided several
examples of such representations in the previous chapter. For further developments
on this wide topic, we refer to [77] and the references therein; in some way,
one may consider that a significant part of scientific activity in Probability Theory is
motivated by or can be applied to simulation.
Once these conditions are fulfilled, a Monte Carlo simulation can be performed.
But this leads to two important issues:
– what about the rate of convergence of the method?
and
– how can the resulting error be controlled?
The answers to these questions rely on fundamental results in Probability Theory
and Statistics.

2.1.1 Rate(s) of Convergence

The (weak) rate of convergence in the SLLN is ruled by the Central Limit Theorem
(CLT), which says that if X is square integrable (X ∈ L²(P)), then

    √M ( X̄_M − m_X ) →(L) N(0; σ_X²) as M → +∞,

where σ_X² = Var(X) := E( (X − E X)² ) = E X² − (E X)² is the variance (its square
root σ_X is called the standard deviation) (1). Also note that the mean quadratic error of
convergence (i.e. the convergence rate in L²(P)) is exactly

    ‖ X̄_M − m_X ‖_2 = σ_X/√M.

This error is also known as the RMSE (for Root Mean Square Error).

1 The symbol →(L) stands for convergence in distribution (or "in law", whence the notation "L"): a
sequence of random variables (Y_n)_{n≥1} converges in distribution toward Y_∞ if

    ∀ f ∈ C_b(R, R), E f(Y_n) → E f(Y_∞) as n → +∞.

It can be defined equivalently as the weak convergence of the distributions P_{Y_n} toward the distribution
P_{Y_∞}. Convergence in distribution is also characterized by the following property:

    ∀ A ∈ B(R), P(Y_∞ ∈ ∂A) = 0 ⟹ P(Y_n ∈ A) → P(Y_∞ ∈ A) as n → +∞.

The extension to R^d-valued random vectors is straightforward. See [45] for a presentation of weak
convergence of probability measures in a general framework.

If σ_X = 0, then X̄_M = X = m_X P-a.s. So, throughout this chapter, we may assume
without loss of generality that σ_X > 0.
The Law of the Iterated Logarithm (LIL) provides an a.s. rate of convergence,
namely

    limsup_M √( M/(2 log(log M)) ) ( X̄_M − m_X ) = σ_X  and  liminf_M √( M/(2 log(log M)) ) ( X̄_M − m_X ) = −σ_X  a.s.

A proof of this (difficult) result can be found e.g. in [263]. All these rates stress the
main drawback of the Monte Carlo method: it is a slow method, since dividing the
error by 2 requires increasing the size of the simulation by 4 (even a bit more in
view of the LIL).

2.1.2 Data Driven Control of the Error: Confidence Level and Confidence Interval

Assume σ_X > 0. As concerns the control of the error, one relies on the CLT. It is
obvious from the definition of convergence in distribution that the CLT also reads

    √M ( X̄_M − m_X )/σ_X →(L) N(0; 1) as M → +∞,

since σ_X Z =(d) N(0; σ_X²) if Z =(d) N(0; 1) and f(·/σ_X) ∈ C_b(R, R) if f ∈ C_b(R, R).
Moreover, the normal distribution having a density, it has no atom. Consequently,
this convergence implies (in fact it is equivalent, see [45]) that, for all real numbers
a, b, a > b,

    lim_M P( √M ( X̄_M − m_X )/σ_X ∈ [b, a] ) = P( N(0; 1) ∈ [b, a] ) = Φ_0(a) − Φ_0(b),

where Φ_0 denotes the cumulative distribution function (c.d.f.) of the standard normal
distribution N(0; 1), defined by

    Φ_0(x) = ∫_{−∞}^x e^{−ξ²/2} dξ/√(2π)

(see Sect. 12.1.3 for a tabulation of Φ_0).


 Exercise. Show that, like any distribution function of a symmetric random variable
without atom, the distribution function of the centered normal distribution on the real
line satisfies

    ∀ x ∈ R, Φ_0(x) + Φ_0(−x) = 1.

Deduce that P( |N(0; 1)| ≤ a ) = 2Φ_0(a) − 1 for every a > 0.

In turn, if X_1 ∈ L³(P), the convergence rate in the Central Limit Theorem is ruled
by the following Berry–Esseen Theorem (see [263], p. 344).

Theorem 2.1 (Berry–Esseen Theorem) Let X_1 ∈ L³(P) with σ_X > 0 and let
(X_k)_{k≥1} be an i.i.d. sequence of real-valued random variables defined on (Ω, A, P).
Then

    ∀ M ≥ 1, ∀ x ∈ R,  | P( √M ( X̄_M − m_X ) ≤ x σ_X ) − Φ_0(x) | ≤ ( C E|X_1 − E X_1|³ / (σ_X³ √M) ) · 1/(1 + |x|³).


Hence, the rate of convergence in the CLT is again 1/√M, which is rather slow,
at least from a statistical viewpoint. However, this is not a real problem within the
usual range of Monte Carlo simulations (at least many thousands of paths, usually one hundred
thousand or one million). Consequently, one can assume that √M ( X̄_M − m_X )/σ_X
has a standard normal distribution. This means in particular that one can design
a probabilistic control of the error directly derived from statistical concepts: let
α ∈ (0, 1) denote a confidence level (close to 1 in practice) and let q_α be the two-sided
α-quantile defined as the unique solution to the equation

    P( |N(0; 1)| ≤ q_α ) = α, i.e. 2Φ_0(q_α) − 1 = α.

Then, setting a = q_α and b = −q_α, one defines a theoretical random confidence
interval at level α by

    J_M^α := [ X̄_M − q_α σ_X/√M , X̄_M + q_α σ_X/√M ],

which satisfies

    P( m_X ∈ J_M^α ) = P( √M |X̄_M − m_X|/σ_X ≤ q_α ) → P( |N(0; 1)| ≤ q_α ) = α as M → +∞.

However, at this stage this procedure remains purely theoretical since the confidence
interval J_M^α involves the standard deviation σ_X of X, which is usually unknown.
Here comes the "magic trick" which led to the tremendous success of the Monte
Carlo method: the variance of X can in turn be evaluated on-line, without any additional
assumption, by simply adding a companion Monte Carlo simulation to estimate
the variance σ_X², namely

    V̄_M = (1/(M−1)) Σ_{k=1}^M ( X_k − X̄_M )²   (2.1)
        = (1/(M−1)) Σ_{k=1}^M X_k² − (M/(M−1)) X̄_M² → Var(X) = σ_X² as M → +∞.

The above convergence (2) follows from the SLLN applied to the i.i.d. sequence of
integrable random variables (X_k²)_{k≥1} (and the convergence of X̄_M, which also follows
from the SLLN). It is an easy exercise to show that, moreover, E V̄_M = σ_X², i.e., using the
terminology of Statistics, V̄_M is an unbiased estimator of σ_X². This remark is of little
importance in practice due to the commonly large size of Monte Carlo simulations.
Note that the above a.s. convergence is again ruled by a CLT if X ∈ L⁴(P) (so that
X² ∈ L²(P)).
 Exercises. 1. Show that V M is unbiased, that is E V M = σX2 .
2. Show that the sequence (X̄_M, V̄_M)_{M≥1} satisfies the following recursive equations:

    X̄_M = X̄_{M−1} + (1/M)( X_M − X̄_{M−1} ) = ((M−1)/M) X̄_{M−1} + X_M/M,   M ≥ 1

(with the convention X̄_0 = 0) and

    V̄_M = V̄_{M−1} − (1/(M−1)) ( V̄_{M−1} − (X_M − X̄_{M−1})(X_M − X̄_M) )
        = ((M−2)/(M−1)) V̄_{M−1} + (X_M − X̄_{M−1})²/M,   M ≥ 2.

As a consequence, one derives from Slutsky's Theorem (3) that

    √M ( X̄_M − m_X )/√V̄_M = √M ( X̄_M − m_X )/σ_X × σ_X/√V̄_M →(L) N(0; 1) as M → +∞.

Of course, within the usual range of Monte Carlo simulations, one can always consider
that, for large M,

2 When X "already" has an N(0; 1) distribution, then (M−1)V̄_M has a χ²(M−1)-distribution. The χ²(ν)-distribution,
known as the χ²-distribution with ν ∈ N* degrees of freedom, is the distribution of the
sum Z_1² + ··· + Z_ν², where Z_1, ..., Z_ν are i.i.d. with N(0; 1) distribution. The loss of one degree
of freedom for V̄_M comes from the fact that X_1, ..., X_M and X̄_M satisfy a linear equation which
induces a linear constraint. This result is known as Cochran's Theorem.
3 If a sequence of R^d-valued random vectors satisfies Z_n →(L) Z_∞ and a sequence of random variables
satisfies C_n →(P) c_∞ ∈ R, then C_n Z_n →(L) c_∞ Z_∞, see e.g. [155].

    √M ( X̄_M − m_X )/√V̄_M ≈ N(0; 1).

Note that when X is itself normally distributed, one shows that the empirical mean
X̄_M and the empirical variance V̄_M are independent, so that the true distribution of
the left-hand side of the above (approximate) equation is a Student distribution with
M − 1 degrees of freedom (4), denoted by T(M − 1).
Finally, one defines the confidence interval at level α of the Monte Carlo simulation
by

    I_M^α = [ X̄_M − q_α √(V̄_M/M) , X̄_M + q_α √(V̄_M/M) ],

which will still satisfy (for large M)

    P( m_X ∈ I_M^α ) ≈ P( |N(0; 1)| ≤ q_α ) = α.

For numerical implementation one often considers q_α = 2, which corresponds
to the confidence level α = 95.44% ≈ 95%. With the constant increase in the performance
of computing devices, higher confidence levels α have become common, typically
α = 99% or 99.5% (as for the computation of Value-at-Risk, following Basel II
recommendations).
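The whole procedure — empirical mean, companion variance estimator (2.1) and confidence interval I_M^α — fits in a few lines; here is an illustrative Python sketch (ours, not from the book), with q_α = 1.96 (α ≈ 95%) and, as a toy target, E e^Z = e^{1/2} for Z =(d) N(0; 1):

```python
import math
import random

def monte_carlo(sample_x, m, q_alpha=1.96, rng=random):
    """Crude Monte Carlo estimate of E X together with the companion
    variance estimate V_M and the confidence interval
    [X_M -/+ q_alpha * sqrt(V_M / M)]."""
    xs = [sample_x(rng) for _ in range(m)]
    x_bar = sum(xs) / m
    v_m = sum((x - x_bar) ** 2 for x in xs) / (m - 1)   # unbiased variance estimator
    half = q_alpha * math.sqrt(v_m / m)
    return x_bar, (x_bar - half, x_bar + half)

# Toy target: E e^Z = e^{1/2} ~ 1.6487 for Z ~ N(0;1).
random.seed(9)
est, (lo, hi) = monte_carlo(lambda rng: math.exp(rng.gauss(0.0, 1.0)), 100_000)
```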
The "academic" birth date of the Monte Carlo method is 1949, with the publication
of a seemingly seminal paper by Metropolis and Ulam, "The Monte Carlo method",
in the Journal of the American Statistical Association (44, 335–341). However, the method had
already been extensively used for several years as a secret project of the U.S. Defense
Department.
One can also consider that, in fact, the Monte Carlo method goes back to the
celebrated Buffon's needle experiment and should subsequently be credited to the
French naturalist Georges Louis Leclerc, Comte de Buffon (1707–1788).
As concerns quantitative finance, and more precisely the pricing of derivatives, it
seems difficult to trace the origin of the implementation of the Monte Carlo method
for the pricing and hedging of options. In the academic literature, the first academic
paper dealing in a systematic manner with a Monte Carlo approach seems to go back
to Boyle in [50].

4 The Student distribution with ν degrees of freedom, denoted by T(ν), is the law of the random variable √ν Z_0 / √(Z_1² + · · · + Z_ν²), where Z_0, Z_1, . . . , Z_ν are i.i.d. with N(0; 1) distribution. As expected for the coherence of the preceding, one shows that T(ν) converges in distribution to the normal distribution N(0; 1) as ν → +∞.

2.1.3 Vanilla Option Pricing in a Black–Scholes Model: The Premium

For the sake of simplicity, we consider a 2-dimensional risk-neutral correlated Black–Scholes model for two risky assets X^1 and X^2, under its unique risk-neutral probability (but a general d-dimensional model can be defined likewise):

dX_t^0 = r X_t^0 dt,  X_0^0 = 1,
dX_t^1 = X_t^1 ( r dt + σ_1 dW_t^1 ),  X_0^1 = x_0^1,    dX_t^2 = X_t^2 ( r dt + σ_2 dW_t^2 ),  X_0^2 = x_0^2,    (2.2)

with the usual notations (r interest rate, σ_i > 0 volatility of X^i). In particular, W = (W^1, W^2) denotes a correlated bi-dimensional Brownian motion such that

⟨W^1, W^2⟩_t = ρ t,  ρ ∈ [−1, 1].

This implies that W^2 can be decomposed as W_t^2 = ρ W_t^1 + √(1 − ρ²) W̃_t^2, where (W^1, W̃^2) is a standard 2-dimensional Brownian motion. The filtration (F_t)_{t∈[0,T]} of this market is the augmented filtration of W, i.e. F_t = F_t^W := σ(W_s, 0 ≤ s ≤ t; N_P), where N_P denotes the family of P-negligible sets of A (5). By “filtration of the market”, we mean that (F_t)_{t∈[0,T]} is the smallest filtration satisfying the usual conditions to which the process (X_t)_{t∈[0,T]} is adapted. By “risk-neutral”, we mean that e^{−rt} X_t is a (P, (F_t)_t)-martingale. We will not go further into financial modeling at this stage, for which we refer e.g. to [185] or [163], but focus instead on numerical aspects.
For every t ∈ [0, T], we have

X_t^0 = e^{rt},    X_t^i = x_0^i exp( (r − σ_i²/2) t + σ_i W_t^i ),  i = 1, 2.

(One easily verifies using Itô’s Lemma, see Sect. 12.8, that X_t thus defined satisfies (2.2); one formally finds the solution by applying Itô’s Lemma to log X_t^i, i = 1, 2, assuming a priori that the solutions of (2.2) are positive.)
When r = 0, X^i is called a geometric Brownian motion associated to W^i with volatility σ_i > 0.
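Since X_t^i is an explicit function of a single Gaussian variable, it can be simulated exactly at any fixed date, with no discretization of the SDE. A minimal sketch (ours, with illustrative parameter values):

```python
import math
import random

def simulate_gbm_at(x0, r, sigma, T, rng):
    """Exact draw of X_T = x0 * exp((r - sigma^2/2) T + sigma W_T),
    using W_T = sqrt(T) Z with Z ~ N(0; 1)."""
    z = rng.gauss(0.0, 1.0)
    return x0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)

rng = random.Random(1)
samples = [simulate_gbm_at(100.0, 0.1, 0.2, 1.0, rng) for _ in range(200_000)]
# sanity check of the martingale property: E[e^{-rT} X_T] = x0, i.e. E X_T = x0 e^{rT}
mean_est = sum(samples) / len(samples)
```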
A European vanilla option with maturity T > 0 is an option related to a European payoff

h_T := h(X_T)

which only depends on X at time T. In such a complete market the option premium at time 0 is given by

V_0 = e^{−rT} E h(X_T)

5 One shows that, owing to the 0-1 Kolmogorov law, this filtration is right continuous, i.e. Ft =
∩s>t Fs . A right continuous filtration which contains the P-negligible sets satisfies the so-called
“usual conditions”.

and more generally, at any time t ∈ [0, T],

V_t = e^{−r(T−t)} E[ h(X_T) | F_t ].

The fact that W has independent stationary increments implies that X^1 and X^2 have independent stationary ratios, i.e.

( X_T^i / X_t^i )_{i=1,2} =(d) ( X_{T−t}^i / x_0^i )_{i=1,2}  and is independent of F_t.

As a consequence, if we define, for every T > 0 and x_0 = (x_0^1, x_0^2) ∈ (0, +∞)²,

v(x_0, T) = e^{−rT} E h(X_T),

then

V_t = e^{−r(T−t)} E[ h(X_T) | F_t ]
    = e^{−r(T−t)} E[ h( ( X_t^i × X_T^i / X_t^i )_{i=1,2} ) | F_t ]
    = e^{−r(T−t)} E h( ( x^i X_{T−t}^i / x_0^i )_{i=1,2} ) |_{x^i = X_t^i, i=1,2}    by independence
    = v(X_t, T − t).

 Examples. 1. Vanilla Call with strike price K:

h(x^1, x^2) = ( x^1 − K )_+.

There is a closed form for such a call option – the celebrated Black–Scholes
formula for option on stock (without dividend) – given by

Call0B S = C(x0 , K , r, σ1 , T ) = x0 0 (d1 ) − e−r t K 0 (d2 ) (2.3)


σ12
log(x0 /K ) + (r + )T √
with d1 = √ 2
, d2 = d1 − σ1 T , (2.4)
σ1 T

where 0 denotes the c.d.f. of the N (0; 1)-distribution.
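Formulas (2.3)–(2.4) translate directly into code; in the sketch below (ours), the c.d.f. Φ_0 is computed from the error function available in any standard library.

```python
import math

def norm_cdf(x):
    # Phi_0, the c.d.f. of N(0; 1), via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(x0, K, r, sigma, T):
    """Black-Scholes Call premium, formulas (2.3)-(2.4)."""
    d1 = (math.log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return x0 * norm_cdf(d1) - math.exp(-r * T) * K * norm_cdf(d2)
```

For instance, bs_call(100, 100, 0.1, 0.2, 1) ≈ 13.27.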


2. Best-of-Call with strike price K:

h_T = ( max(X_T^1, X_T^2) − K )_+.

A quasi-closed form is available involving the distribution function of the bi-variate


(correlated) normal distribution. Laziness may lead to price it by Monte Carlo sim-
ulation (a PDE approach is also appropriate but needs more care) as detailed below.
3. Exchange Call Spread with strike price K:

h_T = ( (X_T^1 − X_T^2) − K )_+.

For this payoff no closed form is available. One has a choice between a PDE approach
(quite appropriate in this 2-dimensional setting but requiring some specific develop-
ments) and a Monte Carlo simulation.
We will illustrate below the regular Monte Carlo procedure on the example of a
Best-of-Call which is traded on an organized market, unlike its cousin the Exchange
Call Spread.
Pricing a Best-of-Call by a Monte Carlo simulation
To implement a (crude) Monte Carlo simulation we need to write the payoff as a function of independent uniformly distributed random variables or, equivalently, as a tractable function of such random variables. In our case, we write it as a function of two standard normal variables, i.e. of a bi-variate standard normal vector (Z^1, Z^2), namely

e^{−rT} h_T =(d) ϕ(Z^1, Z^2)
:= ( max( x_0^1 exp( −σ_1²T/2 + σ_1 √T Z^1 ), x_0^2 exp( −σ_2²T/2 + σ_2 √T (ρ Z^1 + √(1−ρ²) Z^2) ) ) − K e^{−rT} )_+

where Z = (Z^1, Z^2) =(d) N(0; I_2) (the dependence of ϕ on x_0^i, etc., is dropped). Then, simulating an M-sample (Z_m)_{1≤m≤M} of the N(0; I_2) distribution, using e.g. the Box–Muller method, yields the estimate

Best-of-Call_0 = e^{−rT} E( max(X_T^1, X_T^2) − K )_+ = E ϕ(Z^1, Z^2) ≈ ϕ̄_M := (1/M) Σ_{m=1}^M ϕ(Z_m).

One computes an estimate of the variance using the same sample:

V̄_M(ϕ) = (1/(M−1)) Σ_{m=1}^M ϕ(Z_m)² − (M/(M−1)) ϕ̄_M² ≈ Var(ϕ(Z))

since M is large enough. Then one designs a confidence interval for E ϕ(Z ) at level
α ∈ (0, 1) by setting
I_M^α = [ ϕ̄_M − q_α √( V̄_M(ϕ) / M ) , ϕ̄_M + q_α √( V̄_M(ϕ) / M ) ]

 
where q_α is defined by P( |N(0; 1)| ≤ q_α ) = α (or, equivalently, by 2Φ_0(q_α) − 1 = α).
 Numerical Application. We consider a European “Best-of-Call” option with the following parameters:

r = 0.1, σ_i = 0.2 = 20%, ρ = 0.5, X_0^i = 100, T = 1, K = 100.

The confidence level is set at α = 0.95. The Monte Carlo simulation parameters are M = 2^m, m = 10, . . . , 20 (keep in mind that 2^10 = 1024). The (typical) results of a numerical simulation are reported in Table 2.1 below, see also Fig. 2.1.
 Exercise. Proceed likewise with an Exchange Call Spread option.

Remark. Once the script is written for one option, i.e. one payoff function, it is
almost instantaneous to modify it to price another option based on a new payoff
function: the Monte Carlo method is very flexible, much more than a PDE approach.
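The procedure described above can be sketched as follows (our code: the function name is illustrative and Python's built-in Gaussian generator replaces the Box–Muller method). With the parameters of the numerical application, it reproduces the orders of magnitude of Table 2.1.

```python
import math
import random

def best_of_call_mc(x0, sigma, rho, r, T, K, M, seed=42):
    """Crude Monte Carlo estimate of the Best-of-Call premium with a
    95% confidence interval (q_alpha = 1.96)."""
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    s, s2 = 0.0, 0.0
    for _ in range(M):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w1 = math.sqrt(T) * z1                                     # W_T^1
        w2 = math.sqrt(T) * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)  # W_T^2
        x1 = x0[0] * math.exp((r - 0.5 * sigma[0] ** 2) * T + sigma[0] * w1)
        x2 = x0[1] * math.exp((r - 0.5 * sigma[1] ** 2) * T + sigma[1] * w2)
        phi = disc * max(max(x1, x2) - K, 0.0)   # discounted payoff phi(Z_m)
        s += phi
        s2 += phi * phi
    mean = s / M
    var = (s2 - M * mean ** 2) / (M - 1)         # V_M(phi)
    hw = 1.96 * math.sqrt(var / M)
    return mean, (mean - hw, mean + hw)

price, (lo, hi) = best_of_call_mc((100.0, 100.0), (0.2, 0.2), 0.5, 0.1, 1.0, 100.0, M=200_000)
```

The triple (price, lo, hi) plays the role of one row of Table 2.1.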

Table 2.1 Black–Scholes Best-of-Call. Pointwise estimate and confidence intervals as a function of the size M of the Monte Carlo simulation, M = 2^k, k = 10, . . . , 20, and 2^30

M | ϕ̄_M | I_{α,M}
2^10 = 1 024 | 18.5133 | [17.4093; 19.6173]
2^11 = 2 048 | 18.7398 | [17.9425; 19.5371]
2^12 = 4 096 | 18.8370 | [18.2798; 19.3942]
2^13 = 8 192 | 19.3155 | [18.9196; 19.7114]
2^14 = 16 384 | 19.1575 | [18.8789; 19.4361]
2^15 = 32 768 | 19.0911 | [18.8936; 19.2886]
2^16 = 65 536 | 19.0475 | [18.9079; 19.1872]
2^17 = 131 072 | 19.0566 | [18.9576; 19.1556]
2^18 = 262 144 | 19.0373 | [18.9673; 19.1073]
2^19 = 524 288 | 19.0719 | [19.0223; 19.1214]
2^20 = 1 048 576 | 19.0542 | [19.0191; 19.0892]
··· | ··· | ···
2^30 ≈ 1.0737 · 10^9 | 19.0766 | [19.0756; 19.0777]

[Figure: the Monte Carlo estimates and confidence intervals I_{α,M} (vertical axis, from 17.5 to 20.5) plotted against m for M = 2^m, m = 10, . . . , 30.]

Fig. 2.1 Black–Scholes Best-of-Call. The Monte Carlo estimator (⊗) and its confidence interval at level α = 95% for sizes M ∈ {2^k, k = 10, . . . , 30}

2.1.4 ℵ Practitioner’s Corner

The practitioner should never forget that performing a Monte Carlo simulation to compute m_X = E X consists of three mandatory steps:
1. Specification of a confidence level α ∈ (0, 1) (α ≈ 1; typically, α = 95%, 99%, 99.5%, etc.).
2. Simulation of an M-sample X_1, X_2, . . . , X_M of i.i.d. random vectors having the same distribution as X and (possibly recursive) computation of both its empirical mean X̄_M and its empirical variance V̄_M.
3. Computation of the resulting confidence interval I_M^α at confidence level α, which will be the only trustworthy result of the Monte Carlo simulation.
We will see further on in Chap. 3 how to specify the size M of the simulation to comply with an a priori accuracy level. The case of biased simulation is deeply investigated in Chap. 9 (see Sect. 9.3 for the analysis of a crude Monte Carlo simulation in a biased framework).
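The “(possibly recursive)” computation in step 2 can be performed in a single pass, without storing the sample; one standard way of doing so (not spelled out in the text) is Welford's online algorithm, sketched below with an illustrative class name.

```python
class OnlineMeanVar:
    """Recursively updated empirical mean and unbiased empirical variance
    (Welford's algorithm): one pass, no storage of the sample."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):  # V_M with the 1/(M-1) normalization
        return self._m2 / (self.n - 1)

acc = OnlineMeanVar()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    acc.update(x)
```

After M updates, acc.mean and acc.variance are exactly X̄_M and V̄_M.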

2.2 Greeks (Sensitivity to the Option Parameters): A First Approach

2.2.1 Background on Differentiation of Functions Defined by an Integral

The greeks or sensitivities denote the set of parameters obtained as derivatives of the premium of an option with respect to some of its parameters: the starting value, the volatility, etc. The notion applies more generally to any function defined by an expectation. In elementary situations, one simply needs to apply some more or less standard theorems, like the one reproduced below (see [52], Chap. 8, for a proof). A typical example of such an “elementary situation” is the case of a possibly multi-dimensional risk-neutral Black–Scholes model.

Theorem 2.2 (Interchanging differentiation and expectation) Let (Ω, A, P) be a probability space and let I be a nontrivial interval of R. Let ϕ : I × Ω → R be a Bor(I) ⊗ A-measurable function.
(a) Local version. Let x_0 ∈ I. If the function ϕ satisfies:
(i) for every x ∈ I, the random variable ϕ(x, ·) ∈ L¹_R(Ω, A, P),
(ii) P(dω)-a.s., ∂ϕ/∂x (x_0, ω) exists,
(iii) there exists a Y ∈ L¹_{R_+}(P) such that, for every x ∈ I,

P(dω)-a.s.  |ϕ(x, ω) − ϕ(x_0, ω)| ≤ Y(ω) |x − x_0|,

then the function f(x) := E ϕ(x, ·) = ∫_Ω ϕ(x, ω) P(dω) is defined at every x ∈ I and is differentiable at x_0 with derivative

f′(x_0) = E[ ∂ϕ/∂x (x_0, ·) ].

(b) Global version. If ϕ satisfies (i) and
(ii)_glob P(dω)-a.s., ∂ϕ/∂x (x, ω) exists at every x ∈ I,
(iii)_glob there exists a Y ∈ L¹_{R_+}(Ω, A, P) such that, for every x ∈ I,

P(dω)-a.s.  |∂ϕ/∂x (x, ω)| ≤ Y(ω),

then the function f(x) := E ϕ(x, ·) is defined and differentiable at every x ∈ I, with derivative

f′(x) = E[ ∂ϕ/∂x (x, ·) ].

Remarks. • The local version of the above theorem may be necessary to prove the differentiability of a function defined by an expectation over the whole real line (see the exercise that follows).
• All of the preceding remains true if one replaces the probability space (Ω, A, P) by any measurable space (E, E, μ), where μ is a non-negative measure (see again Chap. 8 in [52]). However, this extension is no longer true when dealing with the uniform integrability assumption mentioned in the exercises hereafter.
• Some variants of the result can be established to obtain a theorem for right or left differentiability of functions f(x) = E_ω ϕ(x, ω) defined on the real line, for (partially) differentiable functions defined on R^d, for holomorphic functions on C, etc. The proofs follow the same lines.
• There exists a local continuity result, quite similar to Claim (a), for such functions f defined as an expectation. The domination property by an integrable non-negative random variable Y is then required on ϕ(x, ω) itself. A precise statement is provided in Chap. 12 (with the same notations). For a proof we still refer to [52], Chap. 8. This result is often useful to establish the (continuous) differentiability of a multi-variable function by combining the existence and the continuity of its partial derivatives.
 Exercise. Let Z =(d) N(0; 1) be defined on a probability space (Ω, A, P), let ϕ(x, ω) = (x − Z(ω))_+ and f(x) = E ϕ(x, ·) = E(x − Z)_+, x ∈ R.
(a) Show that f is differentiable on the real line by applying the local version of Theorem 2.2 and compute its derivative.
(b) Let I denote a non-trivial interval of R. Show that if ω ∈ {Z ∈ I} (i.e. Z(ω) ∈ I), the function x ↦ (x − Z(ω))_+ is not differentiable on the whole interval I.
 Exercises (Extension to uniform integrability). One can replace the domination property (iii) in Claim (a) (local version) of the above Theorem 2.2 by the less stringent uniform integrability assumption

(iii)_ui  the family ( (ϕ(x, ·) − ϕ(x_0, ·)) / (x − x_0) )_{x ∈ I\{x_0}} is P-uniformly integrable on (Ω, A, P).

For the definition and some background on uniform integrability, see Chap. 12, Sect. 12.4.
1. Show that (iii) implies (iii)_ui.
2. Show that (i)–(ii)–(iii)_ui implies the conclusion of Claim (a) (local version) of the above Theorem 2.2.
3. State a “uniformly integrable” counterpart of (iii)_glob to extend Claim (b) (global version) of Theorem 2.2.
4. Show that uniform integrability of the above family of random variables follows from its L^p-boundedness for (any) p > 1.

2.2.2 Working on the Scenarii Space (Black–Scholes Model)

To illustrate the different methods to compute the sensitivities, we will consider the one-dimensional risk-neutral Black–Scholes model with constant interest rate r and volatility σ > 0:

dX_t^x = X_t^x ( r dt + σ dW_t ),  X_0^x = x > 0,

so that X_t^x = x exp( (r − σ²/2) t + σ W_t ). Then, we consider, for every x ∈ (0, +∞),

f(x, r, σ, T) := E ϕ(X_T^x),    (2.5)

where ϕ : (0, +∞) → R lies in L¹(P_{X_T^x}) for every x ∈ (0, +∞) and T > 0. This corresponds (when ϕ is non-negative) to vanilla payoffs with maturity T. However, we skip on purpose the discounting factor in what follows to alleviate notation: one can always imagine it is included as a constant in the function ϕ, since we will work at a fixed time T. The updating of the formulas is obvious. Note that, in many cases, new parameters directly attached to the function ϕ itself are of interest, typically the strike price K when ϕ(x) = (x − K)_+ (Call option), (K − x)_+ (Put option), |x − K| (butterfly option), etc.
First, we are interested in computing the first two derivatives f′(x) and f″(x) of the function f, which correspond (up to the discounting factor) to the δ-hedge of the option and to its γ parameter, respectively. The second parameter γ is involved in the so-called “tracking error”. But other sensitivities are of interest to the practitioners, like the vega, i.e. the derivative of the (discounted) function f with respect to the volatility parameter, the ρ (derivative with respect to the interest rate r), etc. The aim is to derive representations of these sensitivities as expectations in order to compute them by a Monte Carlo simulation, in parallel with the computation of the premium.
The Black–Scholes model is here clearly a toy model to illustrate our approach since, for such a model, closed forms exist for standard payoffs (Call, Put and their linear combinations) and more efficient methods can be successfully implemented, like solving the PDE derived from the Feynman–Kac formula (see Theorem 7.11), at least in a one-dimensional – like here – or a low-dimensional model (say d ≤ 3), using a finite difference (elements, volumes) scheme after an appropriate change of variable. For PDE methods we refer to [2].
We will first work on the scenarii space (, A, P), because this approach contains
the “seed” of methods that can be developed in much more general settings in which
the SDE no longer has an explicit solution. On the other hand, as soon as an explicit
expression is available for the density pT (x, y) of X Tx , it is more efficient to use the
method described in the next Sect. 2.2.3.

Proposition 2.1 (a) If ϕ : (0, +∞) → R is differentiable and ϕ′ has polynomial growth, then the function f defined by (2.5) is differentiable and

∀ x > 0,  f′(x) = E[ ϕ′(X_T^x) X_T^x / x ].    (2.6)

(b) If ϕ : (0, +∞) → R is differentiable outside a countable set and is locally Lipschitz continuous with polynomial growth in the following sense:

∃ m ≥ 0, ∃ C > 0, ∀ u, v ∈ R_+,  |ϕ(u) − ϕ(v)| ≤ C |u − v| (1 + |u|^m + |v|^m),

then f is differentiable everywhere on (0, +∞) and f′ is given by (2.6).
(c) If ϕ : (0, +∞) → R is simply a Borel function with polynomial growth, then f is still differentiable and

∀ x > 0,  f′(x) = E[ ϕ(X_T^x) W_T / (x σ T) ].    (2.7)

Remark. The above formula (2.7) can be seen as a first example of a Malliavin weight used to compute a greek – here the δ-hedge – and the first method that we will use to establish it, based on an integration by parts, can be seen as the embryo of the so-called Malliavin Monte Carlo approach to greek computation. See, among other references, [47] for more developments in this direction when the underlying diffusion process is no longer an elementary geometric Brownian motion.
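As a numerical sanity check (our sketch, not from the text), the δ of a Call can be estimated both by the pathwise representation (2.6), with ϕ′(y) = 1_{y ≥ K}, and by the weighted representation (2.7), and compared with the closed-form value e^{rT} Φ_0(d_1) (recall that the discounting factor is skipped).

```python
import math
import random

x0, K, r, sigma, T, M = 100.0, 100.0, 0.1, 0.2, 1.0, 400_000
rng = random.Random(7)

sum_pw, sum_w = 0.0, 0.0
for _ in range(M):
    w = math.sqrt(T) * rng.gauss(0.0, 1.0)               # W_T
    x_T = x0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * w)
    sum_pw += (x_T / x0) if x_T >= K else 0.0            # formula (2.6), phi'(y) = 1_{y >= K}
    sum_w += max(x_T - K, 0.0) * w / (x0 * sigma * T)    # formula (2.7), Malliavin-type weight

delta_pathwise, delta_weight = sum_pw / M, sum_w / M

# undiscounted closed-form reference: e^{rT} Phi_0(d1)
d1 = (math.log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
delta_ref = math.exp(r * T) * 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))
```

Both estimators converge to the same value; their variances differ, which motivates the variance reduction techniques discussed in the exercises below and in Chap. 3.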

Proof. (a) This straightforwardly follows from the explicit expression for X_T^x and the above differentiation Theorem 2.2 (global version): for every x ∈ (0, +∞),

∂/∂x ϕ(X_T^x) = ϕ′(X_T^x) ∂X_T^x/∂x = ϕ′(X_T^x) X_T^x / x.

Now |ϕ′(u)| ≤ C(1 + |u|^m), where m ∈ N and C ∈ (0, +∞), so that, if 0 < x ≤ L < +∞,

| ∂/∂x ϕ(X_T^x) | ≤ C_{r,σ,T} (1 + L^m) exp( (m+1) σ W_T ) ∈ L¹(P),

where C_{r,σ,T} is another positive real constant. This yields the domination condition for the derivative.
(b) This claim follows from Theorem 2.2(a) (local version) and the fact that, for every T > 0, P(X_T^x = ξ) = 0 for every ξ ≥ 0.
(c) First, we still assume that the assumption of claim (a) is in force. Setting μ := r − σ²/2, so that X_T^x = x exp(μT + σ√T U) with U =(d) N(0; 1), we get

f′(x) = ∫_R ϕ′( x exp(μT + σ√T u) ) exp(μT + σ√T u) e^{−u²/2} du/√(2π)
      = (1/(xσ√T)) ∫_R ∂/∂u [ ϕ( x exp(μT + σ√T u) ) ] e^{−u²/2} du/√(2π)
      = −(1/(xσ√T)) ∫_R ϕ( x exp(μT + σ√T u) ) ∂/∂u [ e^{−u²/2} ] du/√(2π)
      = (1/(xσ√T)) ∫_R ϕ( x exp(μT + σ√T u) ) u e^{−u²/2} du/√(2π)
      = (1/(xσT)) ∫_R ϕ( x exp(μT + σ√T u) ) √T u e^{−u²/2} du/√(2π),

where we used an integration by parts to obtain the third equality, taking advantage of the fact that, owing to the polynomial growth assumptions on ϕ,

lim_{|u|→+∞} ϕ( x exp(μT + σ√T u) ) e^{−u²/2} = 0.

Finally, returning to f,

f′(x) = (1/(xσT)) E[ ϕ(X_T^x) W_T ].    (2.8)
When ϕ is not differentiable, let us first sketch the extension of the formula by a density argument. When ϕ is continuous with compact support in R_+, one may assume without loss of generality that ϕ is defined on the whole real line as a continuous function with compact support. Then ϕ can be uniformly approximated by differentiable functions ϕ_n with compact support (use a convolution by mollifiers, see [52], Chap. 8). Then, with obvious notations, f_n′(x) := (1/(xσT)) E[ ϕ_n(X_T^x) W_T ] converges uniformly on compact sets of (0, +∞) to the function g given by the right-hand side of (2.8), since

| f_n′(x) − g(x) | ≤ ‖ϕ_n − ϕ‖_sup E|W_T| / (xσT).

Furthermore, f_n(x) converges (uniformly) toward f(x) on (0, +∞). Consequently, f is differentiable with derivative f′ = g. ♦

Remark. We will see in the next section a much quicker way to establish claim (c). The above method of proof, based on an integration by parts, can be seen as a toy introduction to a systematic way to produce random weights like W_T/(xσT) in the differentiation procedure of f, especially when the differential of ϕ does not exist. The most general extension of this approach, developed on the Wiener space (6) for functionals of the Brownian motion, is known as the Malliavin Monte Carlo method.

6 The Wiener space C(R_+, R^d), endowed with its Borel σ-field for the topology of uniform convergence on compact sets, namely σ(ω ↦ ω(t), t ∈ R_+), and with the Wiener measure, i.e. the distribution of a standard d-dimensional Brownian motion (W^1, . . . , W^d).

 Exercise (Extension to Borel functions with polynomial growth). (a) Show that, as soon as ϕ is a Borel function with polynomial growth, the function f defined by (2.5) is continuous. [Hint: use that the distribution of X_T^x has a probability density p_T(x, y) on the positive real line which continuously depends on x and apply the continuity theorem for functions defined by an integral, see Theorem 12.5(a) in the Miscellany Chap. 12.]
(b) Show that (2.8) holds true as soon as ϕ is a bounded Borel function. [Hint: apply the Functional Monotone Class Theorem (see Theorem 12.2 in the Miscellany Chap. 12) to an appropriate vector subspace of functions ϕ and use the Baire σ-field Theorem.]
(c) Extend the result to Borel functions ϕ with polynomial growth. [Hint: use that ϕ(X_T^x) ∈ L¹(P) and ϕ = lim_n ϕ_n with ϕ_n = (n ∧ ϕ) ∨ (−n).]
(d) Derive from the preceding a simple expression for f′(x) when ϕ = 1_I is the indicator function of a nontrivial interval I.
Comments: The extension to Borel functions ϕ always needs at some place an argument based on the regularizing effect of the diffusion induced by the Brownian motion. As a matter of fact, if X_t^x were the solution to a regular ODE, this extension to non-continuous functions ϕ would fail. We propose in the next section an approach – the log-likelihood method – directly based on this regularizing effect, through the direct differentiation of the probability density p_T(x, y) of X_T^x.
 Exercise. Prove claim (b) in detail.
Note that the assumptions of claim (b) are satisfied by usual payoff functions like ϕ_Call(x) = (x − K)_+ or ϕ_Put(x) := (K − x)_+ (when X_T^x has a continuous distribution). In particular, this shows that

∂ E ϕ_Call(X_T^x) / ∂x = E[ 1_{X_T^x ≥ K} X_T^x / x ].

The computation of this quantity – which is part of that of the Black–Scholes formula – finally yields, as expected,

∂ E ϕ_Call(X_T^x) / ∂x = e^{rT} Φ_0(d_1),

where d_1 is given by (2.4) (keep in mind that the discounting factor is missing).
 Exercises. 0. A comparison. Try a direct differentiation of the Black–Scholes formula (2.3) and compare with a (formal) differentiation based on Theorem 2.2. You should find by both methods

∂Call_0^BS/∂x (x) = Φ_0(d_1).

But the true question is: “how long did it take you to proceed?”

1. Application to the computation of the γ (i.e. f″(x)). Show that, if ϕ is differentiable with a derivative having polynomial growth,

f″(x) = (1/(x²σT)) E[ ( ϕ′(X_T^x) X_T^x − ϕ(X_T^x) ) W_T ]

and that, if ϕ is continuous with compact support,

f″(x) = (1/(x²σT)) E[ ϕ(X_T^x) ( W_T²/(σT) − W_T − 1/σ ) ].

Extend this identity to the case where ϕ is simply Borel with polynomial growth. Note that a (somewhat simpler) formula also exists when the function ϕ is itself twice differentiable, but such a smoothness assumption is not realistic, at least for financial applications.
2. Variance reduction for the δ (7). The above formulas are clearly not the unique representations of the δ as an expectation: using that E W_T = 0 and E X_T^x = x e^{rT}, one derives immediately that

f′(x) = ϕ′(x e^{rT}) e^{rT} + E[ ( ϕ′(X_T^x) − ϕ′(x e^{rT}) ) X_T^x / x ]

as soon as ϕ is differentiable at x e^{rT}. When ϕ is simply Borel,

f′(x) = (1/(xσT)) E[ ( ϕ(X_T^x) − ϕ(x e^{rT}) ) W_T ].
3. Variance reduction for the γ. Show that

f″(x) = (1/(x²σT)) E[ ( ϕ′(X_T^x) X_T^x − ϕ(X_T^x) − x e^{rT} ϕ′(x e^{rT}) + ϕ(x e^{rT}) ) W_T ].

4. Testing the variance reduction, if any. Although the former two exercises are entitled “variance reduction”, the above formulas do not guarantee a variance reduction at a fixed time T. It seems intuitive that they do only when the maturity T is small. Perform some numerical experiments to test whether or not the above formulas induce some variance reduction.
As the maturity increases, test whether or not the regression method introduced in Sect. 3.2 works with these “control variates”.
5. Computation of the vega (8). Show likewise that E ϕ(X_T^x) is differentiable with respect to the volatility parameter σ under the same assumptions on ϕ, namely

7 In this exercise we slightly anticipate the next chapter, which is entirely devoted to variance reduction.
8 Which is not a Greek letter…

∂/∂σ E ϕ(X_T^x) = E[ ϕ′(X_T^x) X_T^x ( W_T − σT ) ]

if ϕ is differentiable with a derivative having polynomial growth. Derive, without any further computations – but with the help of the previous exercises – that

∂/∂σ E ϕ(X_T^x) = E[ ϕ(X_T^x) ( W_T²/(σT) − W_T − 1/σ ) ]

if ϕ is simply Borel with polynomial growth. [Hint: use the former exercises.]
This derivative is known (up to an appropriate discounting) as the vega of the option related to the payoff ϕ(X_T^x). Note that the γ and the vega of a Call satisfy (after discounting by e^{−rT})

vega(x, K, r, σ, T) = x² σT γ(x, K, r, σ, T),

which is the key to the tracking error formula.
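Exercise 4 above can be attacked numerically; below is one possible sketch (ours) comparing, with common random numbers, the empirical variance of the plain weighted δ-estimator with that of its centered version from Exercise 2, for a short maturity where the reduction is expected to be visible.

```python
import math
import random

def delta_weight_samples(x0, K, r, sigma, T, M, centered, seed=3):
    """Samples of (phi(X_T) - c) W_T/(x sigma T), with c = phi(x e^{rT})
    if centered, c = 0 otherwise; here phi(y) = (y - K)_+."""
    rng = random.Random(seed)
    c = max(x0 * math.exp(r * T) - K, 0.0) if centered else 0.0
    out = []
    for _ in range(M):
        w = math.sqrt(T) * rng.gauss(0.0, 1.0)
        x_T = x0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * w)
        out.append((max(x_T - K, 0.0) - c) * w / (x0 * sigma * T))
    return out

def empirical_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

T = 0.1  # short maturity: centering should reduce the variance here
plain = delta_weight_samples(100.0, 100.0, 0.1, 0.2, T, 50_000, centered=False)
ctrl = delta_weight_samples(100.0, 100.0, 0.1, 0.2, T, 50_000, centered=True)
```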


In fact, the beginning of this section can be seen as an introduction to the so-called
tangent process method (see Sect. 2.2.4 at the end of this chapter and Sect. 10.2.2).

2.2.3 Direct Differentiation on the State Space: The Log-Likelihood Method

In fact, the formulas established in the former section for the Black–Scholes model can be obtained by working directly on the state space (0, +∞), taking advantage of the fact that X_T^x has a smooth and explicit probability density p_T(x, y) with respect to the Lebesgue measure on (0, +∞), which is known explicitly since it is a log-normal distribution.
This probability density also depends on the other parameters of the model, like the volatility σ, the interest rate r and the maturity T. Let us denote by θ one of these parameters, which is assumed to lie in a parameter set Θ. More generally, we could imagine that X_T^x(θ) is an R^d-valued solution at time T of a stochastic differential equation whose coefficients b(θ, x) and σ(θ, x) depend on a parameter θ ∈ Θ ⊂ R. An important result of stochastic analysis for Brownian diffusions is that, under uniform ellipticity assumptions (or the less stringent “parabolic Hörmander ellipticity assumptions”, see [24, 139]), combined with smoothness assumptions on both the drift and the diffusion coefficient, such a solution of an SDE does have a smooth density p_T(θ, x, y) – at least in (x, y) – with respect to the Lebesgue measure on R^d. For more details, we refer to [25] or [11, 98]. Formally, we then get

f(θ) = E ϕ(X_T^x(θ)) = ∫_{R^d} ϕ(y) p_T(θ, x, y) μ(dy)

so that, formally,

f′(θ) = ∫_{R^d} ϕ(y) ∂p_T/∂θ (θ, x, y) μ(dy)
      = ∫_{R^d} ϕ(y) [ (∂p_T/∂θ)(θ, x, y) / p_T(θ, x, y) ] p_T(θ, x, y) μ(dy)
      = E[ ϕ(X_T^x) ∂ log p_T/∂θ (θ, x, X_T^x) ].    (2.9)
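In the Black–Scholes model with θ = x, the density p_T(x, y) is log-normal and a direct computation gives ∂ log p_T/∂x (x, y) = (log(y/x) − (r − σ²/2)T)/(x σ² T), which is nothing but W_T/(xσT) evaluated at y = X_T^x: formula (2.9) then recovers (2.7). The sketch below (ours) evaluates this score from the simulated value y only, as one would when y comes from a black-box simulator.

```python
import math
import random

x0, K, r, sigma, T, M = 100.0, 100.0, 0.1, 0.2, 1.0, 400_000
mu = r - 0.5 * sigma ** 2
rng = random.Random(11)

acc = 0.0
for _ in range(M):
    y = x0 * math.exp(mu * T + sigma * math.sqrt(T) * rng.gauss(0.0, 1.0))
    # score dlog p_T/dx, computed from the simulated value y alone
    score = (math.log(y / x0) - mu * T) / (x0 * sigma ** 2 * T)
    acc += max(y - K, 0.0) * score
delta_loglik = acc / M

# undiscounted closed-form reference: e^{rT} Phi_0(d1)
d1 = (math.log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
delta_ref = math.exp(r * T) * 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))
```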

Of course, to be valid, the above computations need appropriate assumptions (domination, uniform integrability, etc.) to justify the interchange of integration and differentiation (see the exercises below and also Sect. 10.3.1).
Using this approach in the simple case of a Black–Scholes model, one immediately
retrieves the formulas established in Proposition 2.1(c) of Sect. 2.2.2 for the δ-hedge
(in particular, when the function ϕ defining the payoff is only Borel). One can also
retrieve by the same method the formulas for all the greeks.
However, this straightforward and simple approach to “greek” computation remains limited beyond the Black–Scholes world by the fact that it is mandatory to have access not only to the regularity of the probability density p_T(θ, x, y) of the asset at time T, but also to its explicit expression, as well as to that of the partial derivative of its logarithm, ∂ log p_T/∂θ (θ, x, X_T^x), in order to include it in a simulation process.
A solution in practice is to replace the true diffusion process (X_t^x(θ))_{t∈[0,T]} (starting at x) by an approximation, typically its discrete time Euler scheme with step T/n, denoted by (X̄_{t_k^n}^x(θ))_{0≤k≤n}, where X̄_0^x(θ) = x and t_k^n := kT/n (see Sect. 7.1 in Chap. 7 for its definition and first properties). Then, under a light ellipticity assumption (non-degeneracy of the volatility), the whole scheme (X̄_{t_k^n}^x)_{k=1,...,n} starting at x (with step T/n) does have an explicit density p̄^n_{t_1^n,...,t_n^n}(θ, x, y_1, . . . , y_n) (see Proposition 10.8). On the other hand, the Euler scheme is of course simulable, so that an approximation of Eq. (2.9), where X_T^x(θ) is replaced by X̄_T^x(θ), can be computed by a Monte Carlo simulation. More details are given in Sect. 10.3.1.
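A minimal sketch (ours) of such an Euler scheme with step T/n for a scalar SDE dX_t = b(X_t) dt + σ(X_t) dW_t, applied to the Black–Scholes coefficients b(x) = rx and σ(x) = σx; the identity E X_T = x_0 e^{rT} gives a quick sanity check, up to the (small) discretization bias.

```python
import math
import random

def euler_path(x0, b, sig, T, n, rng):
    """One path (Xbar_{t_k})_{0<=k<=n} of the Euler scheme with step T/n for
    dX_t = b(X_t) dt + sig(X_t) dW_t."""
    h = T / n
    x, path = x0, [x0]
    for _ in range(n):
        dw = math.sqrt(h) * rng.gauss(0.0, 1.0)  # Brownian increment over [t_k, t_{k+1}]
        x = x + b(x) * h + sig(x) * dw
        path.append(x)
    return path

r, sigma = 0.1, 0.2
rng = random.Random(5)
terminal = [euler_path(100.0, lambda x: r * x, lambda x: sigma * x, 1.0, 20, rng)[-1]
            for _ in range(50_000)]
mean_T = sum(terminal) / len(terminal)  # should be close to x0 e^{rT}
```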
An alternative is to “go back” to the “scenarii” space Ω. Then, some extensions of the first two approaches are possible when the risky asset prices follow a Brownian diffusion: if the payoff function and the diffusion coefficients are both smooth, one may rely on the tangent process (derivative of the process with respect to its starting value or, more generally, with respect to one of its structure parameters, see below). When the payoff function does not have enough regularity to make a pathwise differentiation possible, a more sophisticated method is to call upon Malliavin calculus or stochastic variational analysis. In (very) short, it provides a differentiation
theory with respect to the generic Brownian paths. In particular, an integration by
parts formula can be established which plays the role on the Wiener space of the
elementary integration by parts used in Sect. 2.2.2. This second topic is beyond the
scope of the present monograph. However, a flavor of Malliavin calculus is proposed in Sect. 10.4 through the Bismut and Clark–Ocone formulas. These formulas, which can be viewed as ancestors of Malliavin calculus, provide the δ-hedge for vanilla options in local volatility models (see Theorem 10.2 and the application that follows).
 Exercises. 1. Provide simple assumptions to justify the above formal computations in (2.9), at some point θ_0 or for all θ running over a non-empty open interval Θ of R (or a domain of R^d if θ is vector-valued). [Hint: use the remark directly below Theorem 2.2.]
2. Compute the probability density pT (σ, x, y) of X Tx,σ in a Black–Scholes model
(σ > 0 stands for the volatility parameter).
3. Re-establish all the sensitivity formulas established in the former Sect. 2.2.2
(including the exercises at the end of the section) using this approach.
4. Apply these formulas to the case ϕ(x) := e^{−rT}(x − K)_+ and retrieve the classical expressions for the greeks in a Black–Scholes model: the δ, the γ and the vega.
In this section we focused on the case of the marginal X_T^x(θ) at time T of a Brownian diffusion, as encountered in local volatility models, viewed as a generalization of the Black–Scholes models investigated in the former section. In fact, this method, known as the log-likelihood method, has a much wider range of application since it works for any family (X(θ))_{θ∈Θ} of R^d-valued random vectors (Θ ⊂ R^q) such that, for every θ ∈ Θ, the distribution of X(θ) has a probability density p(θ, y) with respect to a reference measure μ on R^d, usually the Lebesgue measure.

2.2.4 The Tangent Process Method

In fact, when both the payoff function/functional and the coefficients of the SDE are
regular enough, one can differentiate the function/functional of the process directly
with respect to a given parameter. The former Sect. 2.2.2 was a special case of this
method for vanilla payoffs in a Black–Scholes model. We refer to Sect. 10.2.2 for
more detailed developments.
Chapter 3
Variance Reduction

3.1 The Monte Carlo Method Revisited: Static Control Variate

Let X ∈ L²_R(Ω, A, P) be a random variable, assumed to be easy to simulate at a reasonable computational cost. We wish to compute

m_X = E X ∈ R

as the result of a Monte Carlo simulation.


Confidence interval revisited from the simulation viewpoint
The parameter m_X is to be computed by a Monte Carlo simulation. Let X_k, k ≥ 1, be a sequence of i.i.d. copies of X. Then the Strong Law of Large Numbers (SLLN) yields

m_X = lim_{M→+∞} X̄_M  P-a.s.  with  X̄_M := (1/M) Σ_{k=1}^M X_k.

This convergence is ruled by the Central Limit Theorem (CLT):

√M ( X̄_M − m_X ) → N( 0; Var(X) )  in distribution as M → +∞.

Hence, for every q ∈ (0, +∞) and for large enough M,

P( m_X ∈ [ X̄_M − q √(V̄_M/M) , X̄_M + q √(V̄_M/M) ] ) ≈ 2Φ_0(q) − 1,

© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_3

where V̄_M ≈ Var(X) is an estimator of the variance (see (2.1)) and Φ_0(x) = ∫_{−∞}^x e^{−ξ²/2} dξ/√(2π) is the c.d.f. of the normal distribution.
In numerical probability, we adopt the following reverse point of view based on the target or prescribed accuracy $\varepsilon > 0$: to make $\bar X_M$ enter a confidence interval $[m-\varepsilon, m+\varepsilon]$ with a confidence level $\alpha := 2\,\Phi_0(q_\alpha) - 1$, one needs to perform a Monte Carlo simulation of size

$$M \ge M_X(\varepsilon,\alpha) = \frac{q_\alpha^2\,\operatorname{Var}(X)}{\varepsilon^2}. \tag{3.1}$$

In practice, of course, the estimator $\bar V_M$ is computed on-line to estimate the variance $\operatorname{Var}(X)$, as presented in the previous chapter. This estimate need not be as sharp as the estimation of $m_X$, so it can be processed at the beginning of the simulation on a smaller sample size.
As a first conclusion, this shows that, a confidence level being fixed, the size of a
Monte Carlo simulation grows linearly with the variance of X for a given accuracy
and quadratically as the inverse of the prescribed accuracy for a given variance.
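Formula (3.1) translates directly into a routine returning the minimal simulation size. A minimal sketch (the bisection used to invert the normal c.d.f. for the quantile $q_\alpha$ is our own device, not from the text):

```python
import math

def mc_sample_size(var_x: float, eps: float, alpha: float) -> int:
    """Size M_X(eps, alpha) = q_alpha^2 * Var(X) / eps^2 from (3.1),
    where q_alpha solves alpha = 2*Phi_0(q_alpha) - 1."""
    p = (1.0 + alpha) / 2.0  # Phi_0(q_alpha) = (1 + alpha) / 2
    # Invert Phi_0(x) = (1 + erf(x / sqrt(2))) / 2 by bisection on [0, 10].
    lo, hi = 0.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if (1.0 + math.erf(mid / math.sqrt(2.0))) / 2.0 < p:
            lo = mid
        else:
            hi = mid
    q_alpha = 0.5 * (lo + hi)
    return math.ceil(q_alpha ** 2 * var_x / eps ** 2)
```

For instance, with $\operatorname{Var}(X)=1$, $\varepsilon=10^{-2}$ and $\alpha=0.95$ (so $q_\alpha\simeq 1.96$), one gets $M \simeq 38\,415$; halving $\varepsilon$ quadruples $M$, as stated above.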
Variance reduction: (not so) naive approach
Assume now that we know two random variables $X, X' \in L^2_{\mathbb{R}}(\Omega,\mathcal{A},\mathbb{P})$ satisfying

$$m = \mathbb{E}\,X = \mathbb{E}\,X' \in \mathbb{R}, \qquad \operatorname{Var}(X),\ \operatorname{Var}(X'),\ \operatorname{Var}(X-X') = \mathbb{E}\,(X-X')^2 > 0$$

(the last condition only says that $X$ and $X'$ are not a.s. equal).
Question: Which random vector (distribution…) is more appropriate?
Several examples of such a situation have already been pointed out in the previous
chapter: usually many formulas are available to compute a greek parameter, even
more if one takes into account the (potential) control variates introduced in the
exercises.
A natural answer is: if both X and X  can be simulated with an equivalent cost
(complexity), then the one with the lowest variance is the best choice, i.e.

X  if Var(X  ) < Var(X ), X otherwise,

provided this fact is known a priori.


ℵ Practitioner's corner
Usually, the problem appears as follows: there exists a random variable $\Xi \in L^2_{\mathbb{R}}(\Omega,\mathcal{A},\mathbb{P})$ such that:
(i) $\mathbb{E}\,\Xi$ can be computed at a very low cost by a deterministic method (closed form, numerical analysis method),
(ii) the random variable $X-\Xi$ can be simulated with the same cost (complexity) as $X$,
(iii) the variance $\operatorname{Var}(X-\Xi) < \operatorname{Var}(X)$.

Then, the random variable

$$X^{\Xi} = X - \Xi + \mathbb{E}\,\Xi$$

can be simulated at the same cost as $X$,

$$\mathbb{E}\,X^{\Xi} = \mathbb{E}\,X = m \quad\text{and}\quad \operatorname{Var}(X^{\Xi}) = \operatorname{Var}(X-\Xi) < \operatorname{Var}(X).$$

Definition 3.1 A random variable $\Xi$ satisfying (i)–(ii)–(iii) is called a control variate for $X$.

Exercise. Show that if the simulation processes of $X$ and $X-\Xi$ have complexity $\kappa$ and $\kappa'$, respectively, then (iii) becomes

(iii′) $\kappa'\operatorname{Var}(X-\Xi) < \kappa\operatorname{Var}(X)$.

The product of the variance of a random variable by its simulation complexity is called the effort. It will be a central notion when we introduce and analyze Multilevel methods in Chap. 9.
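As a toy illustration of conditions (i)–(iii) (the pair below is our own choice, not from the text): with $X = e^Z$, $Z \sim \mathcal{N}(0;1)$, the first-order expansion $\Xi = 1+Z$ has closed-form expectation $\mathbb{E}\,\Xi = 1$ and is strongly correlated with $X$, so $X-\Xi+\mathbb{E}\,\Xi$ has a smaller variance. A minimal sketch:

```python
import math
import random

def control_variate_mc(sample_x_and_xi, e_xi, M, seed=0):
    """Monte Carlo estimator of m = E[X] based on a control variate Xi:
    empirical mean of X_k - Xi_k + E[Xi] over M i.i.d. draws."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(M):
        x, xi = sample_x_and_xi(rng)
        acc += x - xi + e_xi
    return acc / M

# Hypothetical toy pair: X = exp(Z), Xi = 1 + Z with Z ~ N(0, 1),
# so that E[Xi] = 1 and E[X] = exp(1/2).
def sampler(rng):
    z = rng.gauss(0.0, 1.0)
    return math.exp(z), 1.0 + z
```

Here $\operatorname{Var}(X) = e^2-e \simeq 4.67$ while $\operatorname{Var}(X-\Xi) \simeq 2.37$, so the same budget roughly halves the confidence interval width.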

Toy-example. In the previous chapter, we showed in Proposition 2.1 that, in a risk-neutral Black–Scholes model $X_t^x = x\,e^{(r-\frac{\sigma^2}{2})t+\sigma W_t}$ ($x>0$, $\sigma>0$, $t\in[0,T]$), if the payoff function is differentiable outside a countable set and locally Lipschitz continuous with polynomial growth at infinity, then the function $\Pi(x) = \mathbb{E}\,\varphi(X_T^x)$ is differentiable on $(0,+\infty)$ and

$$\Pi'(x) = \mathbb{E}\left[\varphi'(X_T^x)\,\frac{X_T^x}{x}\right] = \mathbb{E}\left[\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\right]. \tag{3.2}$$

So we have at hand two formulas for $\Pi'(x)$ that can be implemented in a Monte Carlo simulation. Which one should we choose to compute $\Pi'(x)$ (i.e. the δ-hedge up to a factor of $e^{-rT}$)? Since they have the same expectation, the two random variables should be discriminated through (the square of) their $L^2$-norm, namely

$$\mathbb{E}\left[\left(\varphi'(X_T^x)\,\frac{X_T^x}{x}\right)^2\right] \quad\text{and}\quad \mathbb{E}\left[\left(\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\right)^2\right].$$

– Short maturities. It is clear that, if $\varphi(x) \ne 0$,

$$\mathbb{E}\left[\left(\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\right)^2\right] \sim \frac{\varphi(x)^2}{T\,x^2\sigma^2} \to +\infty \quad\text{as } T\to 0,$$

whereas, at least if $\varphi'$ has polynomial growth,

$$\lim_{T\to 0}\ \mathbb{E}\left[\left(\varphi'(X_T^x)\,\frac{X_T^x}{x}\right)^2\right] = \varphi'(x)^2.$$

Since $\Pi'(x) \to \varphi'(x)$ as $T\to 0$, it follows that, if $\varphi'$ has polynomial growth and $\varphi(x)\ne 0$,

$$\operatorname{Var}\left(\varphi'(X_T^x)\,\frac{X_T^x}{x}\right) \to 0 \quad\text{and}\quad \operatorname{Var}\left(\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\right) \to +\infty \quad\text{as } T\to 0.$$

Such is the case for an in-the-money Call option with payoff $\varphi(\xi)=(\xi-K)_+$ if $X_0^x = x > K$.
– Long maturities. On the other hand, at least if $\varphi$ is bounded,

$$\mathbb{E}\left[\left(\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\right)^2\right] = O\!\left(\frac{1}{T}\right) \to 0 \quad\text{as } T\to+\infty,$$

whereas, if $\varphi'$ is bounded away from zero at infinity, easy computations show that

$$\lim_{T\to+\infty}\ \mathbb{E}\left[\left(\varphi'(X_T^x)\,\frac{X_T^x}{x}\right)^2\right] = +\infty.$$

However, these last two conditions on the function ϕ conflict with each other.
In practice, one observes on the greeks of usual payoffs that the pathwise differ-
entiated estimator has a significantly lower variance (even when this variance goes
to 0 like for Puts in long maturities).
Numerical Experiment (Short maturities). We consider the Call payoff ϕ(ξ) = (ξ −
K )+ , with K = 95 and x = 100, still r = 0 and a volatility σ = 0.5 in the Black–
Scholes model. Fig. 3.1 (left) depicts the variance of the pathwise differentiated and
the weighted estimators of the δ-hedge for maturities running from T = 0.001 up to
T = 1.
These estimators can be improved (at least for short maturities) by introducing control variates as follows (see Exercise 2, Sect. 2.2.2 of the former chapter):

$$\Pi'(x) = \varphi'(x) + \mathbb{E}\left[\big(\varphi'(X_T^x)-\varphi'(x)\big)\,\frac{X_T^x}{x}\right] = \mathbb{E}\left[\big(\varphi(X_T^x)-\varphi(x)\big)\,\frac{W_T}{x\sigma T}\right]. \tag{3.3}$$

However, the comparisons carried out with these new estimators tend to confirm the above heuristics, as illustrated by Fig. 3.1 (right).
A variant (pseudo-control variate)
In option pricing, when the random variable $X$ is a payoff, it is usually non-negative. In that case, any random variable $\Xi$ satisfying (i)–(ii) and

(iii″) $0 \le \Xi \le X$
can be considered as a good candidate to reduce the variance, especially if $\Xi$ is close to $X$, so that $X-\Xi$ is “small”.

Fig. 3.1 Black–Scholes Calls. Left: $T \mapsto \operatorname{Var}\big(\varphi'(X_T^x)\,\frac{X_T^x}{x}\big)$ (blue line) and $T \mapsto \operatorname{Var}\big(\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\big)$ (red line), $T\in(0,1]$, $x=100$, $K=95$, $r=0$, $\sigma=0.5$. Right: $T \mapsto \operatorname{Var}\big((\varphi'(X_T^x)-\varphi'(x))\,\frac{X_T^x}{x}\big)$ (blue line) and $T \mapsto \operatorname{Var}\big((\varphi(X_T^x)-\varphi(x))\,\frac{W_T}{x\sigma T}\big)$ (red line), $T\in(0,1]$.
However, note that this does not imply (iii). Here is a trivial counter-example: let $X\equiv 1$, so that $\operatorname{Var}(X)=0$, whereas a random variable $\Xi$ uniformly distributed on the interval $[1-\eta,1]$, $0<\eta<1$, will (almost) satisfy (i)–(ii) but $\operatorname{Var}(X-\Xi)>0$. Consequently, this variant is only a heuristic method to reduce the variance which often works, but with no a priori guarantee.

3.1.1 Jensen’s Inequality and Variance Reduction

This section is in fact an illustration of the notion of pseudo-control variate described


above.
Proposition 3.1 (Jensen's Inequality) Let $X : (\Omega,\mathcal{A},\mathbb{P}) \to \mathbb{R}$ be a random variable and let $g : \mathbb{R}\to\mathbb{R}$ be a convex function. Suppose $X$ and $g(X)$ are integrable. Then, for any sub-σ-field $\mathcal{B}$ of $\mathcal{A}$,

$$g\big(\mathbb{E}(X\mid\mathcal{B})\big) \le \mathbb{E}\big(g(X)\mid\mathcal{B}\big) \quad \mathbb{P}\text{-a.s.}$$

In particular, considering $\mathcal{B} = \{\emptyset,\Omega\}$ yields the above inequality for regular expectation, i.e.

$$g(\mathbb{E}\,X) \le \mathbb{E}\,g(X).$$

Proof. The inequality is a straightforward consequence of the linearity of conditional expectation and the following classical characterization of a convex function:

$$g \text{ is convex if and only if } \forall\,x\in\mathbb{R},\ g(x) = \sup\big\{ax+b,\ a,b\in\mathbb{Q},\ ay+b \le g(y),\ \forall\,y\in\mathbb{R}\big\}. \qquad\diamondsuit$$

Jensen's inequality is an efficient tool for designing control variates when dealing with path-dependent or multi-asset options, as emphasized by the following examples.

Examples. 1. Basket or index option. We consider a payoff on a basket of $d$ (positive) risky assets (this basket can be an index). For the sake of simplicity we suppose it is a Call with strike $K$, i.e.

$$h_T = \left(\sum_{i=1}^{d} \alpha_i X_T^{i,x_i} - K\right)_+$$

where $(X^{1,x_1},\ldots,X^{d,x_d})$ models the prices of $d$ traded risky assets on a market and the $\alpha_i$ are some positive ($\alpha_i>0$) weights satisfying $\sum_{1\le i\le d}\alpha_i = 1$. Then, the convexity of the exponential implies that

$$0 \le e^{\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i})} \le \sum_{i=1}^{d} \alpha_i X_T^{i,x_i}$$

so that

$$h_T \ge k_T := \left(e^{\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i})} - K\right)_+ \ge 0.$$

The motivation for this example is that in a (possibly correlated) $d$-dimensional Black–Scholes model (see below), $\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i})$ still has a normal distribution, so that the European Call option written on the payoff

$$k_T := \left(e^{\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i})} - K\right)_+$$

has a closed form.


Let us be more specific about the model and the variance reduction procedure.
The correlated $d$-dimensional Black–Scholes model (under the risk-neutral probability measure, with $r>0$ denoting the interest rate) can be defined by the following system of SDEs which governs the prices of the $d$ risky assets $i=1,\ldots,d$:

$$dX_t^{i,x_i} = X_t^{i,x_i}\left(r\,dt + \sum_{j=1}^{q}\sigma_{ij}\,dW_t^j\right), \quad t\in[0,T],\ x_i>0,\ i=1,\ldots,d,$$

where $W = (W^1,\ldots,W^q)$ is a standard $q$-dimensional Brownian motion and $\sigma = [\sigma_{ij}]_{1\le i\le d,\,1\le j\le q} \in \mathcal{M}(d,q,\mathbb{R})$ (matrix with $d$ rows and $q$ columns with real entries). Its solution is

$$X_t^{i,x_i} = x_i\exp\left(\Big(r - \frac{\sigma_{i.}^2}{2}\Big)t + \sum_{j=1}^{q}\sigma_{ij}W_t^j\right), \quad t\in[0,T],\ x_i>0,\ i=1,\ldots,d,$$

where

$$\sigma_{i.}^2 = \sum_{j=1}^{q}\sigma_{ij}^2, \quad i=1,\ldots,d.$$

Exercise. Show that if the matrix $\sigma\sigma^*$ is positive definite, then $q \ge d$ and one may assume, without modifying the model, that $X^{i,x_i}$ only depends on the first $i$ components of a $d$-dimensional standard Brownian motion. [Hint: consider the Cholesky decomposition in Sect. 1.6.2.]

Now, let us describe the two phases of the variance reduction procedure:
– Phase I: $\Xi = e^{-rT}k_T$ as a pseudo-control variate and computation of its expectation $\mathbb{E}\,\Xi$.
The vanilla Call option has a closed form in a Black–Scholes model and elementary computations show that

$$\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i}/x_i) \stackrel{d}{=} \mathcal{N}\left(\Big(r - \frac{1}{2}\sum_{1\le i\le d}\alpha_i\sigma_{i.}^2\Big)T;\ \alpha^*\sigma\sigma^*\alpha\,T\right)$$

where $\alpha$ is the column vector with components $\alpha_i$, $i=1,\ldots,d$.
Consequently, the premium at the origin $e^{-rT}\,\mathbb{E}\,k_T$ admits a closed form (see Sect. 12.2 in the Miscellany Chapter) given by

$$e^{-rT}\,\mathbb{E}\,k_T = \text{Call}_{BS}\left(\prod_{i=1}^{d} x_i^{\alpha_i}\,e^{-\frac{1}{2}\big(\sum_{1\le i\le d}\alpha_i\sigma_{i.}^2 - \alpha^*\sigma\sigma^*\alpha\big)T},\ K,\ r,\ \sqrt{\alpha^*\sigma\sigma^*\alpha},\ T\right).$$

– Phase II: Joint simulation of the pair $(h_T, k_T)$.
We need to simulate $M$ independent copies of the pair $(h_T, k_T)$ or, to be more precise, of the quantity

$$e^{-rT}(h_T - k_T) = e^{-rT}\left[\left(\sum_{i=1}^{d}\alpha_i X_T^{i,x_i} - K\right)_+ - \left(e^{\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i})} - K\right)_+\right].$$

This task clearly amounts to simulating $M$ independent copies of the $q$-dimensional standard Brownian motion $W$ at time $T$, namely

$$W_T^{(m)} = (W_T^{1,(m)},\ldots,W_T^{q,(m)}), \quad m=1,\ldots,M,$$

i.e. $M$ independent copies $Z^{(m)} = (Z_1^{(m)},\ldots,Z_q^{(m)})$ of the $\mathcal{N}(0;I_q)$ distribution, in order to set $W_T^{(m)} \stackrel{d}{=} \sqrt{T}\,Z^{(m)}$, $m=1,\ldots,M$.
The resulting pointwise estimator of the premium is given, with obvious notations, by

$$\frac{e^{-rT}}{M}\sum_{m=1}^{M}\big(h_T^{(m)} - k_T^{(m)}\big) + \text{Call}_{BS}\left(\prod_{i=1}^{d} x_i^{\alpha_i}\,e^{-\frac{1}{2}\big(\sum_{1\le i\le d}\alpha_i\sigma_{i.}^2 - \alpha^*\sigma\sigma^*\alpha\big)T},\ K,\ r,\ \sqrt{\alpha^*\sigma\sigma^*\alpha},\ T\right).$$

 
Remark. The extension to more general payoffs of the form $\varphi\big(\sum_{1\le i\le d}\alpha_i X_T^{i,x_i}\big)$ is straightforward, provided $\varphi$ is non-decreasing and a closed form exists for the vanilla option with payoff $\varphi\big(e^{\sum_{1\le i\le d}\alpha_i\log(X_T^{i,x_i})}\big)$.
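For the (hypothetical) special case of a diagonal volatility matrix, i.e. independent assets with $q=d$, so that $\sigma_{i.} = \sigma_i$ and $\alpha^*\sigma\sigma^*\alpha = \sum_i \alpha_i^2\sigma_i^2$, the two phases above can be sketched as follows (`bs_call`, the function names and the parameter choices are ours, not the book's):

```python
import math
import random

def bs_call(s0, k, r, sig, t):
    """Black-Scholes Call closed form: s0*Phi(d1) - K*exp(-rt)*Phi(d2)."""
    phi0 = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    d1 = (math.log(s0 / k) + (r + 0.5 * sig * sig) * t) / (sig * math.sqrt(t))
    d2 = d1 - sig * math.sqrt(t)
    return s0 * phi0(d1) - k * math.exp(-r * t) * phi0(d2)

def basket_call_cv(x, alpha, sig, r, K, T, M, seed=1):
    """Premium estimator e^{-rT} E[h_T] using the geometric-mean pseudo-control
    variate k_T: Monte Carlo mean of e^{-rT}(h_T - k_T) plus the closed form
    for k_T (Phase I + Phase II), for independent assets."""
    rng = random.Random(seed)
    bar2 = sum(a * a * s * s for a, s in zip(alpha, sig))  # alpha* s s* alpha
    spot = math.prod(xi ** a for xi, a in zip(x, alpha)) * math.exp(
        -0.5 * (sum(a * s * s for a, s in zip(alpha, sig)) - bar2) * T)
    acc = 0.0
    for _ in range(M):
        # log-returns log(X_T^{i,x_i}/x_i) of the d independent assets
        logs = [(r - 0.5 * s * s) * T + s * math.sqrt(T) * rng.gauss(0.0, 1.0)
                for s in sig]
        h = max(sum(a * xi * math.exp(l)
                    for a, xi, l in zip(alpha, x, logs)) - K, 0.0)
        k = max(math.exp(sum(a * (math.log(xi) + l)
                             for a, xi, l in zip(alpha, x, logs))) - K, 0.0)
        acc += h - k                                      # h_T >= k_T >= 0
    return math.exp(-r * T) * acc / M + bs_call(spot, K, r, math.sqrt(bar2), T)
```

Only the small, low-variance difference $h_T - k_T$ is simulated; the dominant part of the premium comes from the closed form.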

Exercise. Other ways to take advantage of the convexity of the exponential function can be explored: thus one can start from

$$\sum_{1\le i\le d}\alpha_i X_T^{i,x_i} = \Big(\sum_{1\le i\le d}\alpha_i x_i\Big)\sum_{1\le i\le d}\widetilde\alpha_i\,\frac{X_T^{i,x_i}}{x_i},$$

where $\widetilde\alpha_i = \dfrac{\alpha_i x_i}{\sum_{1\le k\le d}\alpha_k x_k}$, $i=1,\ldots,d$. Compare on simulations the respective performances of these different approaches.
2. Asian options and the Kemna–Vorst control variate in a Black–Scholes model (see [166]). Let

$$h_T = \varphi\left(\frac{1}{T}\int_0^T X_t^x\,dt\right)$$

be a generic Asian payoff, where $\varphi$ is a non-negative, non-decreasing function defined on $\mathbb{R}_+$, and let

$$X_t^x = x\exp\left(\Big(r - \frac{\sigma^2}{2}\Big)t + \sigma W_t\right), \quad x>0,\ t\in[0,T],$$

be a regular Black–Scholes dynamics with volatility $\sigma>0$ and interest rate $r$. Then, the (standard) Jensen inequality applied to the probability measure $\frac{1}{T}\mathbf{1}_{[0,T]}(t)\,dt$ implies

$$\frac{1}{T}\int_0^T X_t^x\,dt \ge x\exp\left(\frac{1}{T}\int_0^T\big((r-\sigma^2/2)t + \sigma W_t\big)\,dt\right) = x\exp\left((r-\sigma^2/2)\frac{T}{2} + \frac{\sigma}{T}\int_0^T W_t\,dt\right).$$
Now

$$\int_0^T W_t\,dt = T\,W_T - \int_0^T s\,dW_s = \int_0^T (T-s)\,dW_s$$

so that

$$\mathcal{L}\left(\frac{1}{T}\int_0^T W_t\,dt\right) = \mathcal{N}\left(0;\ \frac{1}{T^2}\int_0^T s^2\,ds\right) = \mathcal{N}\left(0;\ \frac{T}{3}\right).$$

This suggests rewriting the right-hand side of the above inequality in a “Black–Scholes asset” style, namely:

$$\frac{1}{T}\int_0^T X_t^x\,dt \ge x\,e^{-\left(\frac{r}{2}+\frac{\sigma^2}{12}\right)T}\exp\left(\Big(r - \frac{\sigma^2/3}{2}\Big)T + \frac{\sigma}{T}\int_0^T W_t\,dt\right).$$

This naturally leads us to introduce the so-called Kemna–Vorst (pseudo-)control variate

$$k_T^{KV} := \varphi\left(x\,e^{-\left(\frac{r}{2}+\frac{\sigma^2}{12}\right)T}\exp\left(\Big(r - \frac{1}{2}\,\frac{\sigma^2}{3}\Big)T + \frac{\sigma}{T}\int_0^T W_t\,dt\right)\right)$$

which is clearly of Black–Scholes type and, moreover, satisfies

$$h_T \ge k_T^{KV}.$$

– Phase I: The random variable $k_T^{KV}$ is an admissible control variate as soon as the vanilla option related to the payoff $\varphi(X_T^x)$ has a closed form. Indeed, if

$$e^{-rT}\,\mathbb{E}\,\varphi(X_T^x) = \text{Premium}^\varphi_{BS}(x, r, \sigma, T),$$

then one has

$$e^{-rT}\,\mathbb{E}\,k_T^{KV} = \text{Premium}^\varphi_{BS}\left(x\,e^{-\left(\frac{r}{2}+\frac{\sigma^2}{12}\right)T},\ r,\ \frac{\sigma}{\sqrt{3}},\ T\right).$$

– Phase II: One has to simulate independent copies of $h_T - k_T^{KV}$, i.e. in practice, independent copies of the pair $(h_T, k_T^{KV})$. Theoretically speaking, this requires us to know how to simulate paths of the standard Brownian motion $(W_t)_{t\in[0,T]}$ exactly and, moreover, to compute with infinite accuracy integrals of the form $\frac{1}{T}\int_0^T f(t)\,dt$.
In practice these two tasks are clearly impossible (one cannot even compute a real-valued function $f(t)$ at every $t\in[0,T]$ with a computer). In fact, one relies on quadrature formulas to approximate the time integrals in both payoffs, which makes this simulation possible since only finitely many random marginals of the Brownian motion, say $W_{t_1},\ldots,W_{t_n}$, are necessary, which is then quite realistic. Typically, one uses a mid-point quadrature formula

$$\frac{1}{T}\int_0^T f(t)\,dt \simeq \frac{1}{n}\sum_{k=1}^{n} f\!\left(\frac{2k-1}{2n}\,T\right)$$

or any other numerical integration method, keeping in mind nevertheless that the (continuous) functions $f$ of interest are here given, for the first and the second payoff, by

$$f(t) = x\exp\left(\Big(r - \frac{\sigma^2}{2}\Big)t + \sigma W_t(\omega)\right) \quad\text{and}\quad f(t) = W_t(\omega),$$

respectively. Hence, their regularity is $\frac{1}{2}$-Hölder (i.e. α-Hölder on $[0,T]$, for every $\alpha < \frac{1}{2}$, like for the payoff $h_T$). Finally, in practice, it amounts to simulating independent copies of the $n$-tuple

$$\Big(W_{\frac{T}{2n}},\ \ldots,\ W_{\frac{(2k-1)T}{2n}},\ \ldots,\ W_{\frac{(2n-1)T}{2n}}\Big)$$

from which one can reconstruct a mid-point approximation of both integrals appearing in $h_T$ and $k_T^{KV}$.
In fact, one can improve this first approach by taking advantage of the fact that
W is a Gaussian process as detailed in the exercise below.
Further developments to reduce the time discretization error are proposed in
Sect. 8.2.5 (see [187] where an in-depth study of the Asian option pricing in a
Black–Scholes model is carried out).
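A sketch of Phase II with the mid-point quadrature (the function and its parameter choices are ours, not the book's): one draw of the pair $(h_T, k_T^{KV})$ for the Call payoff $\varphi(\xi) = (\xi - K)_+$. Note that with the mid-point rule the discrete means of the two exponents coincide exactly, so the discretized inequality $h_T \ge k_T^{KV}$ is preserved.

```python
import math
import random

def asian_call_kv_pair(x, r, sig, T, K, n, rng):
    """One draw of (h_T, k_T^KV) for an Asian Call, mid-point rule with n nodes.
    The Brownian path is sampled at the mid-points (2k-1)T/(2n) by cumulating
    independent Gaussian increments."""
    path = []                       # pairs (t, W_t) at the mid-points
    w, t_prev = 0.0, 0.0
    for k in range(1, n + 1):
        t = (2 * k - 1) * T / (2 * n)
        w += math.sqrt(t - t_prev) * rng.gauss(0.0, 1.0)
        path.append((t, w))
        t_prev = t
    # mid-point quadratures of (1/T) int X_t dt and (1/T) int W_t dt
    avg_x = sum(x * math.exp((r - 0.5 * sig**2) * t + sig * wt)
                for t, wt in path) / n
    avg_w = sum(wt for _, wt in path) / n
    h = max(avg_x - K, 0.0)
    kv = max(x * math.exp(-(r / 2 + sig**2 / 12) * T)
             * math.exp((r - sig**2 / 6) * T + sig * avg_w) - K, 0.0)
    return h, kv
```

In a Monte Carlo loop one averages $e^{-rT}(h_T - k_T^{KV})$ and adds the closed form of Phase I; the variance of the difference is typically orders of magnitude below that of $h_T$ alone.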
 Exercises. 1. (a) Show that if f : [0, T ] → R is continuous then

$$\lim_n\ \frac{1}{n}\sum_{k=1}^{n} f\!\left(\frac{kT}{n}\right) = \frac{1}{T}\int_0^T f(t)\,dt.$$

Show that $t \mapsto x\exp\big((r-\frac{\sigma^2}{2})t + \sigma W_t(\omega)\big)$ is α-Hölder for every $\alpha\in(0,\frac{1}{2})$ (with a random Hölder ratio, of course).
(b) Show that

$$\mathcal{L}\left(\int_0^T W_t\,dt\ \Big|\ W_{\frac{kT}{n}} - W_{\frac{(k-1)T}{n}} = \delta w_k,\ 1\le k\le n\right) = \mathcal{N}\left(\sum_{k=1}^{n} a_k\,\delta w_k;\ \frac{T^3}{12\,n^2}\right)$$

with $a_k = \frac{2(n-k)+1}{2n}\,T$, $k=1,\ldots,n$. [Hint: for Gaussian vectors, conditional expectation and affine regression coincide.]
(c) Propose a variance reduction method in which the pseudo-control variate $e^{-rT}k_T^{KV}$ will be simulated exactly.
2. Check that the preceding can be applied to payoffs of the form

$$\varphi\left(\frac{1}{T}\int_0^T X_t^x\,dt - X_T^x\right)$$

where $\varphi$ is still a non-negative, non-decreasing function defined on $\mathbb{R}$.


3. Best-of-Call option. We consider the Best-of-Call payoff given by

$$h_T = \big(\max(X_T^1, X_T^2) - K\big)_+.$$

(a) Using the convexity inequality (that can still be seen as an application of Jensen's inequality)

$$\sqrt{ab} \le \max(a,b), \quad a,b>0,$$

show that

$$k_T := \left(\sqrt{X_T^1 X_T^2} - K\right)_+$$

is a natural (pseudo-)control variate for $h_T$.
(b) Show that, in a 2-dimensional Black–Scholes (possibly correlated) model (see the example in Sect. 2.1.3), the premium of the option with payoff $k_T$ (known as the geometric mean option) has a closed form. Show that this closed form can be written as a Black–Scholes formula with appropriate parameters (and maturity $T$). [Hint: see Sect. 12.2 in the Miscellany Chapter.]
(c) Check on (at least one) simulation(s) that this procedure does reduce the variance (use the parameters of the model specified in Sect. 2.1.3).
(d) When $\sigma_1$ and $\sigma_2$ are not equal, improve the above variance reduction protocol by considering a parametrized family of (pseudo-)control variates, obtained from the more general inequality $a^\theta b^{1-\theta} \le \max(a,b)$, $\theta\in(0,1)$.
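Exercise 1(b)–(c) above point to exact simulation of the pseudo-control variate: the pair $(W_T, \int_0^T W_t\,dt)$ is Gaussian with $\operatorname{Var}(W_T)=T$, $\operatorname{Var}\big(\int_0^T W_t\,dt\big)=T^3/3$ and covariance $T^2/2$, so the integral can be drawn exactly by affine regression on $W_T$. A minimal sketch (ours, not the book's code):

```python
import math
import random

def integral_bm_pair(T, rng):
    """Exact draw of (W_T, int_0^T W_t dt): jointly Gaussian with
    Var(W_T) = T, Var(int) = T^3/3, Cov = T^2/2, hence
    int | W_T  =  (T/2) * W_T + sqrt(T^3/12) * Z,  Z ~ N(0,1) independent."""
    w_T = math.sqrt(T) * rng.gauss(0.0, 1.0)
    integ = 0.5 * T * w_T + math.sqrt(T ** 3 / 12.0) * rng.gauss(0.0, 1.0)
    return w_T, integ
```

The conditional variance $T^3/12$ is consistent with the $\frac{T^3}{12n^2}$ of Exercise 1(b) for $n=1$.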

3.1.2 Negatively Correlated Variables, Antithetic Method

In this section we assume that $X$ and $X'$ not only have the same expectation $m_X$ but also the same variance, i.e. $\operatorname{Var}(X) = \operatorname{Var}(X')$, and can be simulated with the same complexity $\kappa = \kappa_X = \kappa_{X'}$. We also assume that $\mathbb{E}(X-X')^2 > 0$, so that $X$ and $X'$ are not a.s. equal. In such a situation, choosing between $X$ and $X'$ may seem a priori a question of little interest. However, it is possible to take advantage of this situation to reduce the variance of a simulation when $X$ and $X'$ are negatively correlated.
Set

$$\chi = \frac{X+X'}{2}.$$

This corresponds to $\Xi = \frac{X-X'}{2}$ with our formalism. It is reasonable (when no further information on $(X,X')$ is available) to assume that the simulation complexity of $\chi$ is twice that of $X$ and $X'$, i.e. $\kappa_\chi = 2\kappa$. On the other hand,
$$\operatorname{Var}(\chi) = \frac{1}{4}\operatorname{Var}(X+X') = \frac{1}{4}\big(\operatorname{Var}(X) + \operatorname{Var}(X') + 2\operatorname{Cov}(X,X')\big) = \frac{\operatorname{Var}(X) + \operatorname{Cov}(X,X')}{2}.$$

The sizes $M^X(\varepsilon,\alpha)$ and $M^\chi(\varepsilon,\alpha)$ of the simulations using $X$ and $\chi$, respectively, needed to enter a given interval $[m-\varepsilon, m+\varepsilon]$ with the same confidence level $\alpha$ are given, following (3.1), by

$$M^X = \Big(\frac{q_\alpha}{\varepsilon}\Big)^2\operatorname{Var}(X) \ \text{ for } X \quad\text{and}\quad M^\chi = \Big(\frac{q_\alpha}{\varepsilon}\Big)^2\operatorname{Var}(\chi) \ \text{ for } \chi.$$

Taking into account the complexity as in the exercise that follows Definition 3.1, this means in terms of CPU computation time that one should rather use $\chi$ if and only if

$$\kappa_\chi M^\chi < \kappa_X M^X \iff 2\kappa M^\chi < \kappa M^X,$$

i.e.

$$2\operatorname{Var}(\chi) < \operatorname{Var}(X).$$

Given the above identity for $\operatorname{Var}(\chi)$, this reads

$$\operatorname{Cov}(X,X') < 0.$$

To take advantage of this remark in practice, one usually relies on the following result.

Proposition 3.2 (co-monotony) (a) Let $Z : (\Omega,\mathcal{A},\mathbb{P}) \to \mathbb{R}$ be a random variable and let $\varphi, \psi : \mathbb{R}\to\mathbb{R}$ be two monotonic (hence Borel) functions with the same monotonicity. Assume that $\varphi(Z), \psi(Z) \in L^2_{\mathbb{R}}(\Omega,\mathcal{A},\mathbb{P})$. Then

$$\operatorname{Cov}\big(\varphi(Z),\psi(Z)\big) = \mathbb{E}\,\varphi(Z)\psi(Z) - \mathbb{E}\,\varphi(Z)\,\mathbb{E}\,\psi(Z) \ge 0. \tag{3.4}$$

If, mutatis mutandis, $\varphi$ and $\psi$ have opposite monotonicity, then

$$\operatorname{Cov}\big(\varphi(Z),\psi(Z)\big) \le 0.$$

Furthermore, the inequality holds as an equality if and only if $\varphi(Z) = \mathbb{E}\,\varphi(Z)$ $\mathbb{P}$-a.s. or $\psi(Z) = \mathbb{E}\,\psi(Z)$ $\mathbb{P}$-a.s.
(b) Assume there exists a non-increasing (hence Borel) function $T : \mathbb{R}\to\mathbb{R}$ such that $Z \stackrel{d}{=} T(Z)$. Then $X = \varphi(Z)$ and $X' = \varphi(T(Z))$ are identically distributed and satisfy

$$\operatorname{Cov}(X,X') \le 0.$$
In that case, the random variables $X$ and $X'$ are called antithetic.

Proof. (a) Inequality. Let $Z$, $Z'$ be two independent random variables defined on the same probability space $(\Omega,\mathcal{A},\mathbb{P})$ with distribution $\mathbb{P}_Z$ (¹). Then, using that $\varphi$ and $\psi$ are monotonic with the same monotonicity, we have

$$\big(\varphi(Z)-\varphi(Z')\big)\big(\psi(Z)-\psi(Z')\big) \ge 0$$

so that the expectation of this product is non-negative (and finite, since all random variables are square integrable). Consequently,

$$\mathbb{E}\,\varphi(Z)\psi(Z) + \mathbb{E}\,\varphi(Z')\psi(Z') - \mathbb{E}\,\varphi(Z)\psi(Z') - \mathbb{E}\,\varphi(Z')\psi(Z) \ge 0.$$

Using that $Z' \stackrel{d}{=} Z$ and that $Z$, $Z'$ are independent, we get

$$2\,\mathbb{E}\,\varphi(Z)\psi(Z) \ge \mathbb{E}\,\varphi(Z)\,\mathbb{E}\,\psi(Z') + \mathbb{E}\,\varphi(Z')\,\mathbb{E}\,\psi(Z) = 2\,\mathbb{E}\,\varphi(Z)\,\mathbb{E}\,\psi(Z),$$

that is

$$\operatorname{Cov}\big(\varphi(Z),\psi(Z)\big) = \mathbb{E}\,\varphi(Z)\psi(Z) - \mathbb{E}\,\varphi(Z)\,\mathbb{E}\,\psi(Z) \ge 0.$$

If the functions $\varphi$ and $\psi$ have opposite monotonicity, then

$$\big(\varphi(Z)-\varphi(Z')\big)\big(\psi(Z)-\psi(Z')\big) \le 0$$

and one concludes as above up to sign changes.


Equality case. As for the equality case under the co-monotony assumption, we may assume without loss of generality that $\varphi$ and $\psi$ are non-decreasing. Moreover, we make the following convention: if $a$ is not an atom of the distribution $\mathbb{P}_Z$ of $Z$, then set $\varphi(a)=\varphi(a+)$, $\psi(a)=\psi(a+)$, idem for $b$.
Now, if $\varphi(Z)$ or $\psi(Z)$ is $\mathbb{P}$-a.s. constant (hence equal to its expectation), then equality clearly holds.
Conversely, it follows, by reading the above proof backwards, that if equality holds, then $\mathbb{E}\big[(\varphi(Z)-\varphi(Z'))(\psi(Z)-\psi(Z'))\big] = 0$, so that $(\varphi(Z)-\varphi(Z'))(\psi(Z)-\psi(Z')) = 0$ $\mathbb{P}$-a.s. Now, let $I$ be the (closed) convex hull of the support of the distribution $\mu = \mathbb{P}_Z$ of $Z$ on the real line. Assume e.g. that $I = [a,b] \subset \mathbb{R}$, $a,b\in\mathbb{R}$, $a<b$ (other cases can be adapted easily from this one).
By construction, $a$ and $b$ are in the support of $\mu$, so that for every $\varepsilon \in (0, b-a)$, both $\mathbb{P}(a \le Z < a+\varepsilon)$ and $\mathbb{P}(b-\varepsilon < Z \le b)$ are (strictly) positive. If $a$ is an atom of $\mathbb{P}_Z$, one may choose $\varepsilon = 0$ on its side, idem for $b$. Hence the event $C_\varepsilon = \{a \le Z < a+\varepsilon\} \cap \{b-\varepsilon < Z' \le b\}$ has a positive probability since $Z$ and $Z'$ are independent.
Now assume that $\varphi(Z)$ is not $\mathbb{P}$-a.s. constant. Then $\varphi$ cannot be constant on $I$ and $\varphi(a) < \varphi(b)$ (with the above convention on atoms). Consequently, on $C_\varepsilon$ with $\varepsilon$ small enough, $\varphi(Z') - \varphi(Z) > 0$ a.s., so that $\psi(Z) - \psi(Z') = 0$ $\mathbb{P}$-a.s. on $C_\varepsilon$; which in turn implies that $\psi(a+\varepsilon) = \psi(b-\varepsilon)$. Then, letting $\varepsilon$ go to 0, one derives that $\psi(a) = \psi(b)$ (still keeping in mind the convention on atoms). Finally, this shows that $\psi(Z)$ is $\mathbb{P}$-a.s. constant.
(b) Set $\psi = \varphi \circ T$, so that $\varphi$ and $\psi$ have opposite monotonicity. Noting that $X$ and $X'$ have the same distribution and applying claim (a) completes the proof. ♦

¹ This is always possible, owing to Fubini's Theorem for product measures, by considering the probability space $(\Omega^2, \mathcal{A}^{\otimes 2}, \mathbb{P}^{\otimes 2})$: extend $Z$ by $Z(\omega,\omega') = Z(\omega)$ and define $Z'$ by $Z'(\omega,\omega') = Z(\omega')$.

This leads to the well-known “antithetic random variables method”.

The antithetic random variables method
This terminology is shared by two classical situations in which the above approach applies:
– the symmetric random variable $Z$: $Z \stackrel{d}{=} -Z$ (i.e. $T(z) = -z$);
– the $[0,L]$-valued random variable $Z$ such that $Z \stackrel{d}{=} L-Z$ (i.e. $T(z) = L-z$). This is satisfied by $U \stackrel{d}{=} \mathcal{U}([0,1])$ with $L=1$.

Examples. 1. European option pricing in a BS model. Let $h_T = \varphi(X_T^x)$ with $\varphi$ monotonic (like for Calls, Puts, spreads, etc.). Then $h_T = \varphi\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt{T}Z}\big)$, where $Z = \frac{W_T}{\sqrt{T}} \stackrel{d}{=} \mathcal{N}(0;1)$. The function $z \mapsto \varphi\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt{T}z}\big)$ is monotonic as the composition of two monotonic functions, and $Z$ is symmetric.
2. Uniform distribution on the unit interval. If $\varphi$ is monotonic on $[0,1]$ and $U \stackrel{d}{=} \mathcal{U}([0,1])$, then

$$\operatorname{Var}\left(\frac{\varphi(U)+\varphi(1-U)}{2}\right) \le \frac{1}{2}\operatorname{Var}\big(\varphi(U)\big).$$
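Example 2 can be checked numerically: for a monotonic ϕ, the antithetic pair $\chi = \frac{\varphi(U)+\varphi(1-U)}{2}$ should satisfy the effort criterion $2\operatorname{Var}(\chi) < \operatorname{Var}(\varphi(U))$. A sketch (function names are ours):

```python
import math
import random

def antithetic_vs_plain(phi, M, seed=4):
    """Empirical variances of the plain estimator phi(U) and of the
    antithetic pair chi = (phi(U) + phi(1-U)) / 2, with U ~ U([0, 1])."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(M)]
    plain = [phi(u) for u in us]
    anti = [0.5 * (phi(u) + phi(1.0 - u)) for u in us]
    mean = lambda v: sum(v) / len(v)
    var = lambda v: mean([x * x for x in v]) - mean(v) ** 2
    return var(plain), var(anti)
```

With $\varphi = \exp$, the criterion is comfortably met: analytically $\operatorname{Var}(\varphi(U)) \simeq 0.242$ while $\operatorname{Var}(\chi) \simeq 0.004$.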

The above one-dimensional Proposition 3.2 admits a multi-dimensional extension that reads as follows.

Theorem 3.1 Let $d\in\mathbb{N}^*$ and let $\varphi, \psi : \mathbb{R}^d \to \mathbb{R}$ be two functions satisfying the following joint marginal monotonicity assumption:
for every $i \in \{1,\ldots,d\}$ and every $(z_{i+1},\ldots,z_d) \in \mathbb{R}^{d-i}$, the functions $z_i \mapsto \varphi(z_1,\ldots,z_i,\ldots,z_d)$ and $z_i \mapsto \psi(z_1,\ldots,z_i,\ldots,z_d)$ have the same monotonicity, not depending on $(z_1,\ldots,z_{i-1}) \in \mathbb{R}^{i-1}$ (but possibly on $i$ and on $(z_{i+1},\ldots,z_d)$).
Let $Z_1,\ldots,Z_d$ be independent real-valued random variables defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$. Then, if $\varphi(Z_1,\ldots,Z_d), \psi(Z_1,\ldots,Z_d) \in L^2(\Omega,\mathcal{A},\mathbb{P})$,

$$\operatorname{Cov}\big(\varphi(Z_1,\ldots,Z_d),\ \psi(Z_1,\ldots,Z_d)\big) \ge 0.$$

Corollary 3.1 If $T_i : \mathbb{R}\to\mathbb{R}$, $i=1,\ldots,d$, are non-increasing functions such that $T_i(Z_i) \stackrel{d}{=} Z_i$, then, if $\varphi$ and $\psi$ are as in the above theorem,

$$\operatorname{Cov}\big(\varphi(Z_1,\ldots,Z_d),\ \psi(T_1(Z_1),\ldots,T_d(Z_d))\big) \le 0.$$

Proof of Theorem 3.1. We proceed by induction on the dimension $d$, using the notation $z_{d:d'}$ to denote $(z_d,\ldots,z_{d'})$. When $d=1$, the result follows from Proposition 3.2(a). Then, assume it holds true on $\mathbb{R}^d$. By Fubini's Theorem,

$$\mathbb{E}\,\varphi(Z_{1:d+1})\psi(Z_{1:d+1}) = \int_{\mathbb{R}} \mathbb{E}\,\varphi(Z_{1:d},z_{d+1})\psi(Z_{1:d},z_{d+1})\,\mathbb{P}_{Z_{d+1}}(dz_{d+1}) \ge \int_{\mathbb{R}} \mathbb{E}\,\varphi(Z_{1:d},z_{d+1})\,\mathbb{E}\,\psi(Z_{1:d},z_{d+1})\,\mathbb{P}_{Z_{d+1}}(dz_{d+1})$$

since, $z_{d+1}$ being fixed, the functions $z_{1:d} \mapsto \varphi(z_{1:d},z_{d+1})$ and $z_{1:d} \mapsto \psi(z_{1:d},z_{d+1})$ satisfy the above joint marginal co-monotonicity assumption on $\mathbb{R}^d$.
Now, the functions $z_{d+1} \mapsto \varphi(z_{1:d},z_{d+1})$ and $z_{d+1} \mapsto \psi(z_{1:d},z_{d+1})$ have the same monotonicity, not depending on $z_{1:d}$, so that $\Phi : z_{d+1} \mapsto \mathbb{E}\,\varphi(Z_{1:d},z_{d+1})$ and $\Psi : z_{d+1} \mapsto \mathbb{E}\,\psi(Z_{1:d},z_{d+1})$ have the same monotonicity. Hence

$$\int_{\mathbb{R}} \mathbb{E}\,\varphi(Z_{1:d},z_{d+1})\,\mathbb{E}\,\psi(Z_{1:d},z_{d+1})\,\mathbb{P}_{Z_{d+1}}(dz_{d+1}) = \mathbb{E}\,\Phi(Z_{d+1})\Psi(Z_{d+1}) \ge \mathbb{E}\,\Phi(Z_{d+1})\,\mathbb{E}\,\Psi(Z_{d+1}) = \mathbb{E}\,\varphi(Z_{1:d+1})\,\mathbb{E}\,\psi(Z_{1:d+1}),$$

where we used Fubini's Theorem twice (in a reverse way) in the last line. This completes the proof. ♦

The proof of the corollary is obvious.

Exercise. (a) Show that if there is a permutation $\sigma : \{1,\ldots,d\} \to \{1,\ldots,d\}$ such that the functions $\varphi_\sigma$ and $\psi_\sigma$, respectively defined by $\varphi_\sigma(z_1,\ldots,z_d) = \varphi(z_{\sigma(1)},\ldots,z_{\sigma(d)})$ and $\psi_\sigma(z_1,\ldots,z_d) = \psi(z_{\sigma(1)},\ldots,z_{\sigma(d)})$, satisfy the above joint marginal monotonicity assumption and if $Z_1,\ldots,Z_d$ are i.i.d., then the conclusion of Theorem 3.1 remains valid.
(b) Show that if $\sigma(i) = d+1-i$, the conclusion remains true when $Z_1,\ldots,Z_d$ are simply independent.

Remarks. • This result may be successfully applied to functions of the form $f\big(\bar X_{\frac{T}{n}},\ldots,\bar X_{\frac{kT}{n}},\ldots,\bar X_{\frac{nT}{n}}\big)$ of the Euler scheme with step $\frac{T}{n}$ of a one-dimensional Brownian diffusion with a non-decreasing drift and a deterministic strictly positive diffusion coefficient, provided $f$ is “marginally monotonic”, i.e. monotonic in each of its variables with the same monotonicity. (We refer to Chap. 7 for an introduction to the Euler scheme of a diffusion.) The idea is to rewrite the functional as a “marginally monotonic” function of the $n$ (independent) Brownian increments, which play the role of the random variables $Z_i$. Furthermore, passing to the limit as the step size goes to zero yields some correlation results for a class of monotonic continuous functionals defined on the canonical space $C([0,T],\mathbb{R})$ of the diffusion itself (the monotonicity should be understood with respect to the naive pointwise partial order: $f \le g$ if $f(t) \le g(t)$, $t\in[0,T]$).
• The co-monotony inequality (3.4) is one of the most powerful tools to establish lower bounds for expectations. For more insight into these kinds of co-monotony properties and their consequences for the pricing of derivatives, we refer to [227].

Exercises. 1. A toy simulation. Let $f$ and $g$ be two functions defined on the real line by $f(u) = \frac{u}{\sqrt{u^2+1}}$ and $g(u) = \tanh u$, $u\in\mathbb{R}$. Set

$$\varphi(z_1,z_2) = f\big(z_1 e^{a z_2}\big) \quad\text{and}\quad \psi(z_1,z_2) = g\big(z_1 e^{b z_2}\big) \quad\text{with } a,b>0.$$

Show that, if $Z_1$, $Z_2$ are two independent random variables, then

$$\mathbb{E}\,\varphi(Z_1,Z_2)\psi(Z_1,Z_2) \ge \mathbb{E}\,\varphi(Z_1,Z_2)\,\mathbb{E}\,\psi(Z_1,Z_2).$$

2. Show that if $\varphi$ and $\psi$ are non-negative, Borel functions defined on $\mathbb{R}$, monotonic with opposite monotonicity, then

$$\mathbb{E}\,\varphi(Z)\psi(Z) \le \mathbb{E}\,\varphi(Z)\,\mathbb{E}\,\psi(Z) \le +\infty,$$

so that, if $\varphi(Z), \psi(Z) \in L^1(\mathbb{P})$, then $\varphi(Z)\psi(Z) \in L^1(\mathbb{P})$.
3. Use Propositions 3.2(a) and 2.1(b) to derive directly from its representation as an expectation that, in the Black–Scholes model, the δ-hedge of a European option whose payoff function is convex is non-negative.
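Exercise 1 can be checked by simulation; a sketch of our own, drawing $Z_1, Z_2 \sim \mathcal{N}(0;1)$ independent (one admissible choice among many), where the empirical covariance of $\varphi(Z_1,Z_2)$ and $\psi(Z_1,Z_2)$ should come out non-negative:

```python
import math
import random

def comonotone_cov_check(M=200000, a=0.5, b=1.0, seed=7):
    """Empirical Cov(phi(Z1,Z2), psi(Z1,Z2)) for phi = f(z1*exp(a*z2)),
    psi = g(z1*exp(b*z2)), f(u) = u/sqrt(u^2+1), g = tanh,
    with Z1, Z2 independent N(0, 1)."""
    rng = random.Random(seed)
    f = lambda u: u / math.sqrt(u * u + 1.0)
    vals = []
    for _ in range(M):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        vals.append((f(z1 * math.exp(a * z2)), math.tanh(z1 * math.exp(b * z2))))
    mean = lambda v: sum(v) / len(v)
    return (mean([p * q for p, q in vals])
            - mean([p for p, _ in vals]) * mean([q for _, q in vals]))
```

Note that both functions are increasing in $z_1$ for fixed $z_2$, so the covariance is in fact strictly positive here.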

3.2 Regression-Based Control Variates

3.2.1 Optimal Mean Square Control Variates

We return to the original situation of two square integrable random variables $X$ and $X'$, having the same expectation

$$\mathbb{E}\,X = \mathbb{E}\,X' = m$$

with non-zero variances, i.e.

$$\operatorname{Var}(X),\ \operatorname{Var}(X') > 0.$$

We assume again that $X$ and $X'$ are not identical in the sense that $\mathbb{P}(X \ne X') > 0$, which turns out to be equivalent to
$$\operatorname{Var}(X-X') > 0.$$

We saw that if $\operatorname{Var}(X') \ll \operatorname{Var}(X)$, one will naturally choose $X'$ to implement the Monte Carlo simulation, and we provided several classical examples in that direction. However, we will see that with a little more effort it is possible to improve on this naive strategy.
This time we simply (and temporarily) set

$$\Xi := X - X' \quad\text{with}\quad \mathbb{E}\,\Xi = 0 \quad\text{and}\quad \operatorname{Var}(\Xi) > 0.$$

The idea is simply to parametrize the impact of the control variate $\Xi$ by a factor $\lambda$, i.e. we set, for every $\lambda\in\mathbb{R}$,

$$X^\lambda := X - \lambda\,\Xi = (1-\lambda)X + \lambda X'.$$

Then the strictly convex parabolic function $\Phi$ defined by

$$\Phi(\lambda) := \operatorname{Var}(X^\lambda) = \lambda^2\operatorname{Var}(\Xi) - 2\lambda\operatorname{Cov}(X,\Xi) + \operatorname{Var}(X)$$

attains its minimum value at $\lambda_{\min}$ defined by

$$\lambda_{\min} := \frac{\operatorname{Cov}(X,\Xi)}{\operatorname{Var}(\Xi)} = \frac{\mathbb{E}(X\Xi)}{\mathbb{E}\,\Xi^2} = 1 + \frac{\operatorname{Cov}(X',\Xi)}{\operatorname{Var}(\Xi)} = 1 + \frac{\mathbb{E}(X'\Xi)}{\mathbb{E}\,\Xi^2}.$$

Consequently,

$$\sigma_{\min}^2 := \operatorname{Var}(X^{\lambda_{\min}}) = \operatorname{Var}(X) - \frac{\big(\operatorname{Cov}(X,\Xi)\big)^2}{\operatorname{Var}(\Xi)} = \operatorname{Var}(X') - \frac{\big(\operatorname{Cov}(X',\Xi)\big)^2}{\operatorname{Var}(\Xi)},$$

so that

$$\sigma_{\min}^2 \le \min\big(\operatorname{Var}(X),\operatorname{Var}(X')\big)$$

and $\sigma_{\min}^2 = \operatorname{Var}(X)$ if and only if $\operatorname{Cov}(X,\Xi) = 0$.
Remark. Note that $\operatorname{Cov}(X,\Xi) = 0$ if and only if $\lambda_{\min} = 0$, i.e. $\operatorname{Var}(X) = \min_{\lambda\in\mathbb{R}}\Phi(\lambda)$.
If we denote by $\rho_{X,\Xi}$ the correlation coefficient between $X$ and $\Xi$, one gets

$$\sigma_{\min}^2 = \operatorname{Var}(X)\big(1-\rho_{X,\Xi}^2\big) = \operatorname{Var}(X')\big(1-\rho_{X',\Xi}^2\big).$$

A more symmetric expression for $\operatorname{Var}(X^{\lambda_{\min}})$ is

$$\sigma_{\min}^2 = \frac{\operatorname{Var}(X)\operatorname{Var}(X')\big(1-\rho_{X,X'}^2\big)}{\big(\sqrt{\operatorname{Var}(X)}-\sqrt{\operatorname{Var}(X')}\big)^2 + 2\sqrt{\operatorname{Var}(X)\operatorname{Var}(X')}\,(1-\rho_{X,X'})} \le \sigma_X\sigma_{X'}\,\frac{1+\rho_{X,X'}}{2},$$

where $\sigma_X$ and $\sigma_{X'}$ denote the standard deviations of $X$ and $X'$, respectively, and $\rho_{X,X'}$ is the correlation coefficient between $X$ and $X'$.

Exercise. We go back to the Toy-example from Sect. 3.1 with $\varphi(\xi) = (\xi-K)_+$ (with $x>K$) and $\Pi(x) = \mathbb{E}\,\varphi(X_T^x)$ (see Eq. (3.2)). In order to reduce the variance for short maturities of the estimators of the δ-hedge, we note that, for every $\lambda, \mu \in \mathbb{R}$,

$$\Pi'(x) = \lambda\varphi'(x) + \mathbb{E}\left[\big(\varphi'(X_T^x)-\lambda\varphi'(x)\big)\frac{X_T^x}{x}\right] = \mathbb{E}\left[\big(\varphi(X_T^x)-\mu\,\varphi(x)\big)\frac{W_T}{x\sigma T}\right].$$

Apply the preceding to this example and implement it with $x=100$, $K=95$, $\sigma=0.5$ and $T\in(0,1]$ (after reading the next section). Compare the numerical results with the “naive” variance reduction obtained by (3.3).

3.2.2 Implementation of the Variance Reduction: Batch versus Adaptive

Let $(X_k, X'_k)_{k\ge1}$ be an i.i.d. sequence of random vectors with the same distribution as $(X,X')$ and let $\lambda\in\mathbb{R}$. Set, for every $k\ge1$,

$$\Xi_k = X_k - X'_k, \qquad X_k^\lambda = X_k - \lambda\,\Xi_k.$$

Now, set for every size $M\ge1$ of the simulation:

$$\bar V_M := \frac{1}{M}\sum_{k=1}^{M}\Xi_k^2, \qquad \bar C_M := \frac{1}{M}\sum_{k=1}^{M}X_k\Xi_k$$

and

$$\lambda_M := \frac{\bar C_M}{\bar V_M} \quad (\text{convention: } \lambda_0 = 0). \tag{3.5}$$

The “batch” approach

Definition of the batch estimator. The Strong Law of Large Numbers implies that both

$$\bar V_M \longrightarrow \operatorname{Var}(X-X') \quad\text{and}\quad \bar C_M \longrightarrow \operatorname{Cov}(X, X-X') \quad \mathbb{P}\text{-a.s. as } M\to+\infty,$$

so that

$$\lambda_M \to \lambda_{\min} \quad \mathbb{P}\text{-a.s. as } M\to+\infty.$$

This suggests introducing the batch estimator of $m$, defined for every size $M\ge1$ of the simulation by

$$\bar X_M^{\lambda_M} = \frac{1}{M}\sum_{k=1}^{M} X_k^{\lambda_M}.$$

One checks that, for every $M\ge1$,

$$\bar X_M^{\lambda_M} = \frac{1}{M}\sum_{k=1}^{M} X_k - \lambda_M\,\frac{1}{M}\sum_{k=1}^{M}\Xi_k = \bar X_M - \lambda_M\,\bar\Xi_M \tag{3.6}$$

with standard notations for empirical means.

Convergence of the batch estimator. The asymptotic behavior of the batch estimator is summed up in the proposition below.

Proposition 3.3 The batch estimator $\mathbb{P}$-a.s. converges to $m$ (consistency), i.e.

$$\bar X_M^{\lambda_M} = \frac{1}{M}\sum_{k=1}^{M} X_k^{\lambda_M} \xrightarrow{\ \text{a.s.}\ } \mathbb{E}\,X = m,$$

and satisfies a CLT (asymptotic normality) with the optimal asymptotic variance $\sigma_{\min}^2$, i.e.

$$\sqrt{M}\,\big(\bar X_M^{\lambda_M} - m\big) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\big(0;\,\sigma_{\min}^2\big).$$

Remark. However, note that the batch estimator is a biased estimator of $m$ since $\mathbb{E}\big(\lambda_M\,\bar\Xi_M\big) \ne 0$.

Proof. First, one checks from (3.6) that

$$\bar X_M^{\lambda_M} = \bar X_M - \lambda_M\,\bar\Xi_M \xrightarrow{\ \text{a.s.}\ } m - \lambda_{\min}\times 0 = m.$$

Now, it follows from the regular CLT that

$$\sqrt{M}\left(\frac{1}{M}\sum_{k=1}^{M} X_k^{\lambda_{\min}} - m\right) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\big(0;\,\sigma_{\min}^2\big)$$

since $\operatorname{Var}(X - \lambda_{\min}\Xi) = \sigma_{\min}^2$. On the other hand,

$$\sqrt{M}\left(\frac{1}{M}\sum_{k=1}^{M}\big(X_k^{\lambda_M} - X_k^{\lambda_{\min}}\big)\right) = -\big(\lambda_M - \lambda_{\min}\big)\times\frac{1}{\sqrt{M}}\sum_{k=1}^{M}\Xi_k \xrightarrow{\ \mathbb{P}\ } 0$$

owing to Slutsky's Lemma (²), since $\lambda_M - \lambda_{\min} \to 0$ a.s. as $M\to+\infty$ and

$$\frac{1}{\sqrt{M}}\sum_{k=1}^{M}\Xi_k \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\big(0;\,\mathbb{E}\,\Xi^2\big)$$

by the regular CLT applied to the centered square integrable i.i.d. random variables $\Xi_k$, $k\ge1$. Combining these two convergence results yields the announced CLT. ♦

Exercise. Let $\bar X_M$ and $\bar\Xi_M$ denote the empirical mean processes of the sequences $(X_k)_{k\ge1}$ and $(\Xi_k)_{k\ge1}$, respectively. Show that the quadruple $(\bar X_M, \bar\Xi_M, \bar C_M, \bar V_M)$ can be computed in a recursive way from the sequence $(X_k, X'_k)_{k\ge1}$. Derive a recursive way to compute the batch estimator.

ℵ Practitioner’s corner
One may proceed as follows:
– Recursive implementation: use the recursion satisfied by the sequence $(\bar X_k, \bar\Xi_k, C_k, V_k)_{k\ge1}$ to compute $\lambda_M$ and the resulting batch estimator for each size M.
– True batch implementation: a first phase of the simulation, of size $M'$, $M'\ll M$ (say $M'\simeq$ 5% or 10% of the total budget M of the simulation), is devoted to a rough estimate $\lambda_{M'}$ of $\lambda_{\min}$, based on the Monte Carlo estimator (3.5). A second phase of the simulation then computes the estimator of m defined by

$$ \frac{1}{M-M'}\sum_{k=M'+1}^{M} X_k^{\lambda_{M'}}, $$

whose asymptotic variance – given the first phase of the simulation – is $\sigma^2(\lambda_{M'}) := \mathrm{Var}(X^{\lambda})_{|\lambda=\lambda_{M'}}$. This approach is not recursive at all. On the other hand, the resulting estimator satisfies a CLT with asymptotic variance $\sigma^2(\lambda_{M'})$. In particular, we will most likely observe a significant – although not optimal – variance reduction. So, from this point of view, you can stop reading this section at this point.
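A minimal sketch of this true batch implementation (hypothetical code; the arrays of simulated pairs and the toy distribution are our own assumptions):

```python
import numpy as np

def two_phase_batch(x, x_prime, frac=0.10):
    """Two-phase ("true batch") estimator of m = E X.

    Phase 1 uses the first frac*M pairs only, to get a rough estimate of
    lambda_min via (3.5); phase 2 averages X_k - lambda_{M'} * Xi_k over
    the remaining pairs.
    """
    x = np.asarray(x, dtype=float)
    xi = x - np.asarray(x_prime, dtype=float)     # Xi = X - X'
    m_prime = max(1, int(frac * len(x)))          # size M' of phase 1
    v = np.mean(xi[:m_prime] ** 2)
    lam = np.mean(x[:m_prime] * xi[:m_prime]) / v if v > 0 else 0.0
    return np.mean(x[m_prime:] - lam * xi[m_prime:])

rng = np.random.default_rng(1)
z = rng.standard_normal(200_000)
est = two_phase_batch(z + 1.0, 0.5 * z + 1.0)     # toy pair with E X = E X' = 1
```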
The adaptive unbiased approach
Another approach is to design an adaptive estimator of m by considering at each step k the (predictable) estimator $\widetilde\lambda_{k-1}$ of $\lambda_{\min}$. This adaptive estimator is defined and analyzed below.

² If $Y_n \to c$ in probability and $Z_n \to Z_\infty$ in distribution, then $Y_n Z_n \to c\,Z_\infty$ in distribution. In particular, if $c = 0$ the last convergence holds in probability.

Theorem 3.2 Assume $X, X' \in L^{2+\delta}(\mathbb P)$ for some δ > 0. Let $(X_k, X'_k)_{k\ge1}$ be an i.i.d. sequence with the same distribution as $(X, X')$. We set, for every k ≥ 1,

$$ \widetilde X_k = X_k - \widetilde\lambda_{k-1}\,\Xi_k = (1-\widetilde\lambda_{k-1})\,X_k + \widetilde\lambda_{k-1}\,X'_k \qquad\text{where}\quad \widetilde\lambda_k = (-k)\vee(\lambda_k\wedge k) $$

and $\lambda_k$ is defined by (3.5). Then, the adaptive estimator of m defined by

$$ \widetilde X_M^{\widetilde\lambda} = \frac{1}{M}\sum_{k=1}^M \widetilde X_k $$

is unbiased $\big(\mathbb E\,\widetilde X_M^{\widetilde\lambda} = m\big)$, convergent, i.e.

$$ \widetilde X_M^{\widetilde\lambda} \overset{a.s.}{\longrightarrow} m \quad\text{as } M\to+\infty, $$

and asymptotically normal with minimal variance, i.e.

$$ \sqrt M\,\Big(\widetilde X_M^{\widetilde\lambda} - m\Big) \overset{\mathcal L}{\longrightarrow} \mathcal N\big(0;\,\sigma^2_{\min}\big) \quad\text{as } M\to+\infty. $$

The remainder of this section can be omitted on a first reading, although the method of proof exposed below is quite standard when studying the efficiency of an estimator by martingale methods.

Proof. Step 1 (a.s. convergence). Let $\mathcal F_0 = \{\emptyset, \Omega\}$ and, for every $k\ge1$, let $\mathcal F_k := \sigma(X_1, X'_1, \dots, X_k, X'_k)$ be the filtration of the simulation.

First, we show that $(\widetilde X_k - m)_{k\ge1}$ is a sequence of square integrable $(\mathcal F_k, \mathbb P)$-martingale increments. It is clear by construction that $\widetilde X_k$ is $\mathcal F_k$-measurable. Moreover,

$$ \mathbb E\,\widetilde X_k^2 \le 2\big(\mathbb E\,X_k^2 + \mathbb E\,(\widetilde\lambda_{k-1}\Xi_k)^2\big) = 2\big(\mathbb E\,X_k^2 + \mathbb E\,\widetilde\lambda_{k-1}^2\;\mathbb E\,\Xi_k^2\big) < +\infty, $$

where we used that $\Xi_k$ and $\widetilde\lambda_{k-1}$ are independent and $|\widetilde\lambda_{k-1}| \le k-1$. Finally, for every $k\ge1$,

$$ \mathbb E(\widetilde X_k\,|\,\mathcal F_{k-1}) = \mathbb E(X_k\,|\,\mathcal F_{k-1}) - \widetilde\lambda_{k-1}\,\mathbb E(\Xi_k\,|\,\mathcal F_{k-1}) = m. $$

This shows that the adaptive estimator is unbiased since $\mathbb E\,\widetilde X_k = m$ for every $k\ge1$. In fact, we can also compute the conditional variance increment process:

$$ \mathbb E\big((\widetilde X_k - m)^2\,|\,\mathcal F_{k-1}\big) = \mathrm{Var}(X^\lambda)_{|\lambda=\widetilde\lambda_{k-1}} =: \sigma^2(\widetilde\lambda_{k-1}). $$

Now, we set, for every $k\ge1$,

$$ N_k := \sum_{\ell=1}^k \frac{\widetilde X_\ell - m}{\ell}. $$

It follows from the preceding that the sequence $(N_k)_{k\ge1}$ is a square integrable $\big((\mathcal F_k)_k, \mathbb P\big)$-martingale since $(\widetilde X_k - m)_{k\ge1}$ is a sequence of square integrable $\big((\mathcal F_k)_k, \mathbb P\big)$-martingale increments. Its conditional variance increment process (also known as the “bracket process”) $\langle N\rangle_k$, $k\ge1$, is given by

$$ \langle N\rangle_k = \sum_{\ell=1}^k \frac{\mathbb E\big((\widetilde X_\ell - m)^2\,|\,\mathcal F_{\ell-1}\big)}{\ell^2} = \sum_{\ell=1}^k \frac{\sigma^2(\widetilde\lambda_{\ell-1})}{\ell^2}. $$

Now, the above series is a.s. convergent since $\sigma^2(\widetilde\lambda_k) = \mathrm{Var}(X - \widetilde\lambda_k\,\Xi)$ a.s. converges towards $\sigma^2(\lambda_{\min})$ as $k\to+\infty$, because $\widetilde\lambda_k$ a.s. converges toward $\lambda_{\min}$ and $\sigma^2(\cdot)$ is continuous. Consequently,

$$ \langle N\rangle_\infty = \text{a.s.-}\!\lim_{M\to+\infty}\langle N\rangle_M < +\infty \quad\text{a.s.} $$

Hence, it follows from Theorem 12.7 in the Miscellany Chapter that $N_M \to N_\infty$ $\mathbb P$-a.s. as $M\to+\infty$, where $N_\infty$ is an a.s. finite random variable. In turn, the Kronecker Lemma (see Lemma 12.1, Sect. 12.7 of the Miscellany Chapter) implies

$$ \frac{1}{M}\sum_{k=1}^M \big(\widetilde X_k - m\big) \overset{a.s.}{\longrightarrow} 0 \quad\text{as } M\to+\infty, $$

i.e.

$$ \widetilde X_M := \frac{1}{M}\sum_{k=1}^M \widetilde X_k \overset{a.s.}{\longrightarrow} m \quad\text{as } M\to+\infty. $$

Step 2 (CLT, weak rate of convergence). This is a consequence of the Lindeberg Central Limit Theorem for (square integrable) martingale increments (see Theorem 12.8 in the Miscellany Chapter, or Theorem 3.2 and its Corollary 3.1, p. 58, in [142], referred to as Lindeberg’s CLT in what follows). In our case, the array of martingale increments is defined by

$$ \widetilde X_{M,k} := \frac{\widetilde X_k - m}{\sqrt M}, \qquad 1\le k\le M. $$

There are essentially two assumptions to be checked. First, the convergence of the conditional variance increment process toward $\sigma^2_{\min}$:

$$ \sum_{k=1}^M \mathbb E\big(\widetilde X_{M,k}^2\,|\,\mathcal F_{k-1}\big) = \frac{1}{M}\sum_{k=1}^M \mathbb E\big((\widetilde X_k - m)^2\,|\,\mathcal F_{k-1}\big) = \frac{1}{M}\sum_{k=1}^M \sigma^2(\widetilde\lambda_{k-1}) \longrightarrow \sigma^2_{\min} := \min_\lambda\,\sigma^2(\lambda). $$

The second one is the so-called Lindeberg condition (see again Theorem 12.8 or [142], p. 58), which reads in our framework:

$$ \forall\,\varepsilon>0,\qquad \sum_{\ell=1}^M \mathbb E\big(\widetilde X_{M,\ell}^2\,\mathbf 1_{\{|\widetilde X_{M,\ell}|>\varepsilon\}}\,|\,\mathcal F_{\ell-1}\big) \overset{\mathbb P}{\longrightarrow} 0. $$

In turn, owing to the conditional Markov inequality and the definition of $\widetilde X_{M,\ell}$, this condition classically follows from the slightly stronger one: there exists a real number δ > 0 such that

$$ \sup_{\ell\ge1}\ \mathbb E\big(|\widetilde X_\ell - m|^{2+\delta}\,|\,\mathcal F_{\ell-1}\big) < +\infty \quad \mathbb P\text{-a.s.}, $$

since

$$ \sum_{\ell=1}^M \mathbb E\big(\widetilde X_{M,\ell}^2\,\mathbf 1_{\{|\widetilde X_{M,\ell}|>\varepsilon\}}\,|\,\mathcal F_{\ell-1}\big) \le \frac{1}{\varepsilon^\delta M^{1+\frac\delta2}}\sum_{\ell=1}^M \mathbb E\big(|\widetilde X_\ell - m|^{2+\delta}\,|\,\mathcal F_{\ell-1}\big). $$

Now, using that $(u+v)^{2+\delta} \le 2^{1+\delta}(u^{2+\delta}+v^{2+\delta})$, $u, v\ge0$, and the fact that $X, X'\in L^{2+\delta}(\mathbb P)$, one gets

$$ \mathbb E\big(|\widetilde X_\ell - m|^{2+\delta}\,|\,\mathcal F_{\ell-1}\big) \le 2^{1+\delta}\Big(\mathbb E\,|X-m|^{2+\delta} + |\widetilde\lambda_{\ell-1}|^{2+\delta}\,\mathbb E\,|\Xi|^{2+\delta}\Big). $$

Finally, the Lindeberg Central Limit Theorem implies

$$ \sqrt M\left(\frac{1}{M}\sum_{k=1}^M \widetilde X_k - m\right) \overset{\mathcal L}{\longrightarrow} \mathcal N\big(0;\,\sigma^2_{\min}\big). $$

This means that the expected variance reduction does occur if one implements the recursive approach described above. ♦
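A hedged sketch of this adaptive scheme (our own Python code, not the book's; the toy pair used for the check is an assumption):

```python
import numpy as np

def adaptive_estimator(sample_pair, M, rng):
    """Adaptive unbiased estimator (1/M) * sum_k (X_k - lam_tilde_{k-1} * Xi_k).

    The coefficient used at step k is predictable (computed from the first
    k-1 pairs only) and truncated at level k-1, as in Theorem 3.2.
    """
    c = v = 0.0                                     # running C_{k-1}, V_{k-1} of (3.5)
    total = 0.0
    for k in range(1, M + 1):
        lam = c / v if v > 0 else 0.0
        lam_tilde = max(-(k - 1), min(lam, k - 1))  # (-(k-1)) v (lambda ^ (k-1))
        x, x_prime = sample_pair(rng)
        xi = x - x_prime
        total += x - lam_tilde * xi                 # X_tilde_k
        c += (x * xi - c) / k                       # updated AFTER use: predictability
        v += (xi * xi - v) / k
    return total / M

def sample_pair(rng):                               # toy pair: E X = E X' = 1
    z = rng.standard_normal()
    return z + 1.0, 1.0 - z                         # here lambda_min = 1/2

rng = np.random.default_rng(2)
est = adaptive_estimator(sample_pair, 50_000, rng)  # close to m = 1
```

The key design point is that `lam_tilde` is computed before the k-th pair is drawn, which is exactly what keeps the estimator unbiased.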

3.3 Application to Option Pricing: Using Parity Equations to Produce Control Variates

The variance reduction by regression introduced in the former section relies on the fact that $\kappa_{X-\lambda\Xi} \simeq \kappa_X$ or, equivalently, that the additional complexity induced by the simulation of Ξ, given that of X, is negligible. This condition may look demanding, but we will see that, in the framework of derivative pricing, this requirement is always fulfilled as soon as the payoff of interest satisfies a so-called parity equation, i.e. the original payoff can be duplicated by a “synthetic” version.

Furthermore, these parity equations are model free, so they can be applied for various specifications of the dynamics of the underlying asset.

In this section, we denote by $(S_t)_{t\ge0}$ the risky asset (with $S_0 = s_0 > 0$) and set $S_t^0 = e^{rt}$ for the riskless asset. We work under the risk-neutral probability P (supposed to exist), which means that

$$ \big(e^{-rt}S_t\big)_{t\in[0,T]} \text{ is a martingale on the scenarii space } (\Omega,\mathcal A,\mathbb P) $$

(with respect to the augmented filtration of $(S_t)_{t\in[0,T]}$). Furthermore, to comply with the usual assumptions of AOA theory, we will assume that this risk-neutral probability is unique (complete market) to justify that we may price any derivative under this probability. However, this has no real impact on what follows.

Vanilla Call-Put parity (d = 1)

We consider a Call and a Put with common maturity T and strike K. We denote by

$$ \mathrm{Call}_0(K,T) = e^{-rT}\,\mathbb E\,(S_T-K)_+ \quad\text{and}\quad \mathrm{Put}_0(K,T) = e^{-rT}\,\mathbb E\,(K-S_T)_+ $$

the premium of this Call and this Put option, respectively. Since

$$ (S_T-K)_+ - (K-S_T)_+ = S_T - K $$

and $\big(e^{-rt}S_t\big)_{t\in[0,T]}$ is a martingale, one derives the classical Call-Put parity equation:

$$ \mathrm{Call}_0(K,T) - \mathrm{Put}_0(K,T) = s_0 - e^{-rT}K, $$

so that $\mathrm{Call}_0(K,T) = \mathbb E\,(X) = \mathbb E\,(X')$ with

$$ X := e^{-rT}(S_T-K)_+ \quad\text{and}\quad X' := e^{-rT}(K-S_T)_+ + s_0 - e^{-rT}K. $$

As a result, one sets

$$ \Xi = X - X' = e^{-rT}S_T - s_0, $$

which turns out to be the terminal value of a martingale null at time 0 (this is in fact the generic situation of application of this parity method).

Note that the simulation of X involves that of $S_T$, so the additional cost of the simulation of Ξ is definitely negligible.
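For concreteness, here is a hedged Python sketch (parameter values and variable names are ours, purely illustrative) of a crude Monte Carlo Call price versus the regression-based control variate built on $\Xi = e^{-rT}S_T - s_0$:

```python
import numpy as np

# Illustrative sketch (not the book's code): vanilla Call in the Black-Scholes
# model, with the parity control variate Xi = e^{-rT} S_T - s_0 (E Xi = 0 by
# the martingale property of the discounted asset).
s0, K, r, sigma, T, M = 100.0, 110.0, 0.05, 0.2, 1.0, 400_000
rng = np.random.default_rng(3)
z = rng.standard_normal(M)
s_T = s0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)

x = np.exp(-r * T) * np.maximum(s_T - K, 0.0)   # crude discounted payoff X
xi = np.exp(-r * T) * s_T - s0                  # control variate Xi

lam = np.mean(x * xi) / np.mean(xi**2)          # batch estimate of lambda_min (3.5)
x_lam = x - lam * xi                            # X^lambda = X - lambda * Xi

price_crude, price_cv = x.mean(), x_lam.mean()
var_ratio = x_lam.var() / x.var()               # < 1 means variance was reduced
```

On such an example the two price estimates agree up to Monte Carlo error, while the empirical variance ratio should be well below 1 since Ξ is strongly correlated with the payoff.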

Asian Call-Put parity

We consider an Asian Call and an Asian Put with common maturity T, strike K and averaging period $[T_0, T]$, $0\le T_0 < T$:

$$ \mathrm{Call}_0^{As} = e^{-rT}\,\mathbb E\left(\frac{1}{T-T_0}\int_{T_0}^T S_t\,dt - K\right)_+ $$

and

$$ \mathrm{Put}_0^{As} = e^{-rT}\,\mathbb E\left(K - \frac{1}{T-T_0}\int_{T_0}^T S_t\,dt\right)_+. $$

Still using that $\widetilde S_t = e^{-rt}S_t$ is a P-martingale and, this time, the Fubini–Tonelli Theorem, one derives

$$ \mathrm{Call}_0^{As} - \mathrm{Put}_0^{As} = s_0\,\frac{1-e^{-r(T-T_0)}}{r(T-T_0)} - e^{-rT}K, $$

so that

$$ \mathrm{Call}_0^{As} = \mathbb E\,(X) = \mathbb E\,(X') $$

with

$$ X := e^{-rT}\left(\frac{1}{T-T_0}\int_{T_0}^T S_t\,dt - K\right)_+, $$

$$ X' := s_0\,\frac{1-e^{-r(T-T_0)}}{r(T-T_0)} - e^{-rT}K + e^{-rT}\left(K - \frac{1}{T-T_0}\int_{T_0}^T S_t\,dt\right)_+. $$

This leads to

$$ \Xi = \frac{e^{-rT}}{T-T_0}\int_{T_0}^T S_t\,dt - s_0\,\frac{1-e^{-r(T-T_0)}}{r(T-T_0)}. $$

Remark. In both cases, the parity equation directly follows from the P-martingale property of $\widetilde S_t = e^{-rt}S_t$.

3.3.1 Complexity Aspects in the General Case

In practical implementations, one often neglects the cost of the computation of λmin
since only a rough estimate is computed: this leads us to stop its computation after
the first 5% or 10% of the simulation.

– However, one must be aware that the case of the existence of parity equations is quite specific, since the random variable Ξ is involved in the simulation of X, so the complexity of the simulation process is not increased: thus, in the recursive approach, the updating of $\lambda_M$ and of the empirical mean $\bar X_M$ is (almost) costless. Similar observations can be made, to some extent, on batch approaches. As a consequence, in that specific setting, the complexity of the adaptive linear regression procedure and that of the original one are (almost) the same!

– Warning! This is no longer true in general… and in a general setting the complexity of the simulation of $(X, X')$ is double that of X itself. Then the regression method is efficient if and only if

$$ \sigma^2_{\min} < \frac12\,\min\big(\mathrm{Var}(X),\,\mathrm{Var}(X')\big) $$

(provided one neglects the cost of the estimation of the coefficient $\lambda_{\min}$).
The exercise below shows the connection with antithetic variables, which then appear as a special case of regression methods.

▷ Exercise (Connection with the antithetic variable method). Let $X, X'\in L^2(\mathbb P)$ be such that $\mathbb E\,X = \mathbb E\,X' = m$ and $\mathrm{Var}(X) = \mathrm{Var}(X')$.

(a) Show that $\lambda_{\min} = \frac12$.

(b) Show that $X^{\lambda_{\min}} = \frac{X+X'}{2}$ and $\mathrm{Var}\Big(\frac{X+X'}{2}\Big) = \frac12\big(\mathrm{Var}(X) + \mathrm{Cov}(X,X')\big)$.

Characterize the pairs $(X, X')$ for which the regression method does reduce the variance. Make the connection with the antithetic method.

3.3.2 Examples of Numerical Simulations

Vanilla B-S Calls (See Figs. 3.2, 3.3 and 3.4)

The model parameters are specified as follows:

$$ T = 1, \quad x_0 = 100, \quad r = 5\%, \quad \sigma = 20\%, \quad K = 90,\dots,120. $$

The simulation size is set at $M = 10^6$.



Asian Calls in a Heston model (See Figs. 3.5, 3.6 and 3.7)

The dynamics of the risky asset is this time a stochastic volatility model, namely the Heston model, defined as follows. Let ϑ, k, a be such that $\vartheta^2/(2ak)\le1$ (so that $v_t$ remains a.s. positive, see [183], Proposition 6.2.4, p. 130):

$$ dS_t = S_t\big(r\,dt + \sqrt{v_t}\,dW_t^1\big), \quad S_0 = s_0 > 0, \ t\in[0,T] \quad\text{(risky asset)}, $$

$$ dv_t = k(a - v_t)\,dt + \vartheta\,\sqrt{v_t}\,dW_t^2, \quad v_0 > 0, $$

with $\langle W^1, W^2\rangle_t = \rho\,t$, $\rho\in[-1,1]$, $t\in[0,T]$.

The payoff is an Asian call with strike price K:

$$ \mathrm{AsCall}^{Hest} = e^{-rT}\,\mathbb E\left(\frac1T\int_0^T S_s\,ds - K\right)_+. $$

Usually, no closed forms are available for Asian payoffs, even in the Black–Scholes model, and this is also the case in the Heston model. Note, however, that (quasi-)closed forms do exist for vanilla European options in this model (see [150]), which is the origin of its success. The simulation has been carried out by replacing the above diffusion by an Euler scheme (see Chap. 7 for an introduction to the Euler time discretization scheme). In fact, the dynamics of the stochastic volatility process does not fulfill the standard Lipschitz continuity assumptions required to make the Euler scheme converge, at least at its usual rate. In the present case it is even difficult to define this scheme because of the term $\sqrt{v_t}$. Since our purpose here is to illustrate
Fig. 3.2 Black–Scholes Calls: Error = Reference BS − (MC Premium). K = 90, …, 120. –o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated Synthetic Call

Fig. 3.3 Black–Scholes Calls: $K\mapsto 1-\lambda_{\min}(K)$, K = 90, …, 120, for the Interpolated Synthetic Call

Fig. 3.4 Black–Scholes Calls. Standard Deviation (MC Premium). K = 90, …, 120. –o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call

the efficiency of parity relations to reduce variance, we adopted a rather “basic” scheme, namely

Fig. 3.5 Heston Asian Calls. Standard Deviation (MC Premium). K = 90, …, 120. –o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated Synthetic Call

Fig. 3.6 Heston Asian Calls. $K\mapsto 1-\lambda_{\min}(K)$, K = 90, …, 120, for the Interpolated Synthetic Asian Call

$$ \bar S_{\frac{kT}{n}} = \bar S_{\frac{(k-1)T}{n}}\left(1 + \frac{rT}{n} + \sqrt{\Big|\bar v_{\frac{(k-1)T}{n}}\Big|}\,\sqrt{\frac Tn}\,\Big(\rho\,Z_k^2 + \sqrt{1-\rho^2}\,Z_k^1\Big)\right), \qquad \bar S_0 = s_0 > 0, $$

$$ \bar v_{\frac{kT}{n}} = \bar v_{\frac{(k-1)T}{n}} + \frac Tn\,k\Big(a - \bar v_{\frac{(k-1)T}{n}}\Big) + \vartheta\,\sqrt{\Big|\bar v_{\frac{(k-1)T}{n}}\Big|}\,\sqrt{\frac Tn}\,Z_k^2, \qquad \bar v_0 = v_0 > 0, $$

Fig. 3.7 Heston Asian Calls. $M = 10^6$ (Reference: MC with $M = 10^8$). K = 90, …, 120. –o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call

where $(Z_k)_{k\ge1} = \big((Z_k^1, Z_k^2)\big)_{k\ge1}$ is an i.i.d. sequence of $\mathcal N(0;\,I_2)$-distributed random vectors. This scheme is consistent but its rate of convergence is not optimal. The simulation of the Heston model has given rise to an extensive literature, see e.g. [3, 7, 41, 115] and, more recently, [4], devoted to general affine diffusion processes.
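A hedged, vectorized sketch of this “basic” Euler scheme (function and parameter names are ours, not the book's):

```python
import numpy as np

def heston_euler_paths(s0, v0, r, k, a, theta, rho, T, n, M, rng):
    """Simulate M paths of the "basic" Euler scheme above with step T/n.

    Returns an (M, n+1) array of bar-S values at times jT/n; the absolute
    value under the square root is the crude fix used in the text.
    """
    dt = T / n
    sq_dt = np.sqrt(dt)
    s = np.full(M, float(s0))
    v = np.full(M, float(v0))
    paths = np.empty((M, n + 1))
    paths[:, 0] = s0
    for j in range(1, n + 1):
        z1 = rng.standard_normal(M)
        z2 = rng.standard_normal(M)
        vol = np.sqrt(np.abs(v))                 # sqrt(|v_bar|), crude positivity fix
        s = s * (1.0 + r * dt
                 + vol * sq_dt * (rho * z2 + np.sqrt(1.0 - rho**2) * z1))
        v = v + k * (a - v) * dt + theta * vol * sq_dt * z2
        paths[:, j] = s
    return paths

rng = np.random.default_rng(4)
paths = heston_euler_paths(s0=100.0, v0=0.10, r=0.05, k=2.0, a=0.01,
                           theta=0.20, rho=0.5, T=1.0, n=50, M=20_000, rng=rng)
disc_mean = np.exp(-0.05 * 1.0) * paths[:, -1].mean()   # martingale sanity check: ~ s0
```

A quick sanity check is that the discounted terminal mean stays close to $s_0$, reflecting the (approximate) martingale property of the discretized discounted asset.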
– Parameters of the model:

s0 = 100, k = 2, a = 0.01, ρ = 0.5, v0 = 10%, ϑ = 20%.

– Parameters of the option portfolio:

T = 1, K = 90, . . . , 120 (31 strikes).

 Exercises. We consider a one dimensional Black–Scholes model with market


parameters
r = 0, σ = 0.3, x0 = 100, T = 1.

1. Consider a vanilla Call with strike K = 80. The random variable Ξ is defined as above. Estimate $\lambda_{\min}$ (it should not be too far from 0.825). Then, compute a confidence interval for the Monte Carlo pricing of the Call with and without the linear variance reduction for the following simulation sizes: M = 5 000, 10 000, 100 000, 500 000.

2. Proceed as above but with K = 150 (true price 1.49). What do you observe? Provide an interpretation.

3.3.3 The Multi-dimensional Case

Let $X := (X^1,\dots,X^d) : (\Omega,\mathcal A,\mathbb P) \longrightarrow \mathbb R^d$ and $\Xi := (\Xi^1,\dots,\Xi^q) : (\Omega,\mathcal A,\mathbb P) \longrightarrow \mathbb R^q$ be square integrable random vectors and assume that

$$ \mathbb E\,X = m\in\mathbb R^d, \qquad \mathbb E\,\Xi = 0\in\mathbb R^q. $$

Let $D(X) := \big[\mathrm{Cov}(X^i, X^j)\big]_{1\le i,j\le d}$ and $D(\Xi)$ denote the covariance (dispersion) matrices of X and Ξ, respectively. Assume that

$$ D(X) > 0 \quad\text{and}\quad D(\Xi) > 0 $$

as positive definite symmetric matrices.

The problem is to find a matrix solution $\Lambda\in\mathcal M(d,q,\mathbb R)$ to the optimization problem

$$ \mathrm{Var}(X - \Lambda\,\Xi) = \min\big\{\mathrm{Var}(X - L\,\Xi),\ L\in\mathcal M(d,q,\mathbb R)\big\}, $$

where Var(Y) is defined by $\mathrm{Var}(Y) := \mathbb E\,|Y-\mathbb E\,Y|^2 = \mathbb E\,|Y|^2 - |\mathbb E\,Y|^2$ for any $\mathbb R^d$-valued random vector Y.

The solution is given by

$$ \Lambda = C(X,\Xi)\,D(\Xi)^{-1}, $$

where

$$ C(X,\Xi) = \big[\mathrm{Cov}(X^i, \Xi^j)\big]_{1\le i\le d,\,1\le j\le q}. $$

▷ Examples-exercises. Let $X_t = (X_t^1,\dots,X_t^d)$, $t\in[0,T]$, be the price process of d risky traded assets (be careful: the notations collide at this point, since here X denotes the traded assets and the aim is to reduce the variance of the discounted payoff, usually denoted by the letter h).

1. Options on various baskets:

$$ h_T^i = \Big(\sum_{j=1}^d \theta_j^i\,X_T^j - K\Big)_+, \quad i = 1,\dots,d, $$

where the $\theta_j^i$ are positive real coefficients.

Remark. This approach also produces an optimal asset selection (since it is essentially a PCA), which helps for hedging.

2. Portfolio of forward start options:

$$ h^{i,j} = \big(X_{T_{i+1}}^j - X_{T_i}^j\big)_+, \quad i = 1,\dots,d-1, $$

where Ti , i = 1, . . . , d is an increasing sequence of maturities.

3.4 Pre-conditioning

The principle of the pre-conditioning method – also known as the Blackwell–Rao method – is based on the very definition of conditional expectation. Let $(\Omega,\mathcal A,\mathbb P)$ be a probability space and let $X : (\Omega,\mathcal A,\mathbb P)\to\mathbb R$ be a square integrable random variable. The practical constraint for implementation is the ability to simulate $\mathbb E\,(X\,|\,\mathcal B)$ at a competitive computational cost. Such is the case in the typical examples hereafter.

For every sub-σ-field $\mathcal B\subset\mathcal A$,

$$ \mathbb E\,X = \mathbb E\big(\mathbb E\,(X\,|\,\mathcal B)\big) $$

and

$$ \mathrm{Var}\big(\mathbb E\,(X\,|\,\mathcal B)\big) = \mathbb E\big(\mathbb E\,(X\,|\,\mathcal B)^2\big) - (\mathbb E\,X)^2 \le \mathbb E\,X^2 - (\mathbb E\,X)^2 = \mathrm{Var}(X) \tag{3.7} $$

since conditional expectation is a contraction in $L^2(\mathbb P)$, being an orthogonal projection. In fact, one easily checks the following more precise result:

$$ \mathrm{Var}\big(\mathbb E\,(X\,|\,\mathcal B)\big) = \mathrm{Var}(X) - \big\|X - \mathbb E\,(X\,|\,\mathcal B)\big\|_2^2. $$

This shows that the above inequality (3.7) is strict unless X is B-measurable, i.e. it is strict in any case of interest.
The archetypal situation is the following. Assume that

$$ X = g(Z_1, Z_2), \qquad g\in L^2\big(\mathbb R^2,\,\mathcal Bor(\mathbb R^2),\,\mathbb P_{(Z_1,Z_2)}\big), $$

where $Z_1$, $Z_2$ are independent random vectors. Set $\mathcal B := \sigma(Z_2)$. Then standard results on conditional expectations show that

$$ \mathbb E\,X = \mathbb E\,G(Z_2), \qquad\text{where } G(z_2) = \mathbb E\big(g(Z_1,Z_2)\,|\,Z_2 = z_2\big) = \mathbb E\,g(Z_1,z_2) $$

is a version of the conditional expectation of $g(Z_1,Z_2)$ given $\sigma(Z_2)$. At this stage, the pre-conditioning method can be implemented as soon as the following conditions are satisfied:
– a closed form is available for the function G, and
– $Z_2$ can be simulated with the same complexity as X.
▷ Examples. 1. Exchange spread options. Let $X_T^i = x_i\,e^{(r-\frac{\sigma_i^2}{2})T + \sigma_i W_T^i}$, $x_i, \sigma_i > 0$, $i = 1, 2$, be two “Black–Scholes” assets at time T related to two Brownian motions $W^i$, $i = 1, 2$, with correlation $\rho\in[-1,1]$. One considers an exchange spread option with strike K, i.e. related to the payoff

$$ h_T = (X_T^1 - X_T^2 - K)_+. $$

Then one can write

$$ (W_T^1, W_T^2) = \sqrt T\,\Big(\sqrt{1-\rho^2}\,Z_1 + \rho\,Z_2,\ Z_2\Big), $$

where $Z = (Z_1, Z_2)$ is an $\mathcal N(0;\,I_2)$-distributed random vector. Then (see e.g. Sect. 12.2 in the Miscellany Chapter),

$$ e^{-rT}\,\mathbb E\,(h_T\,|\,Z_2) = e^{-rT}\,\mathbb E\Big(\big(x_1 e^{(r-\frac{\sigma_1^2}{2})T+\sigma_1\sqrt T(\sqrt{1-\rho^2}\,Z_1+\rho z_2)} - x_2 e^{(r-\frac{\sigma_2^2}{2})T+\sigma_2\sqrt T z_2} - K\big)_+\Big)_{\big|z_2=Z_2} $$

$$ = \mathrm{Call}_{BS}\Big(x_1 e^{-\frac{\rho^2\sigma_1^2 T}{2}+\sigma_1\rho\sqrt T Z_2},\ x_2 e^{(r-\frac{\sigma_2^2}{2})T+\sigma_2\sqrt T Z_2}+K,\ r,\ \sigma_1\sqrt{1-\rho^2},\ T\Big). $$

Finally, one takes advantage of the closed form available for vanilla Call options in a Black–Scholes model to compute

$$ \mathrm{Premium}_{BS}(x_1, x_2, K, \sigma_1, \sigma_2, r, T) = \mathbb E\big(\mathbb E\,(e^{-rT}h_T\,|\,Z_2)\big) $$

with a smaller variance than with the original payoff.

2. Barrier options. This example will be detailed in Sect. 8.2.3, devoted to the pricing (of some classes) of barrier options in a general model using the simulation of a continuous Euler scheme (using the so-called Brownian bridge method).
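To illustrate the exchange spread example numerically, here is a hedged sketch (all parameter values and names are our own illustrative assumptions) comparing the plain estimator with its pre-conditioned version:

```python
import numpy as np
from math import erf, exp, log, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(x, strike, r, sig, T):
    """Closed-form Black-Scholes Call premium (strike > 0)."""
    d1 = (log(x / strike) + (r + 0.5 * sig**2) * T) / (sig * sqrt(T))
    d2 = d1 - sig * sqrt(T)
    return x * norm_cdf(d1) - strike * exp(-r * T) * norm_cdf(d2)

x1, x2, K = 100.0, 100.0, 5.0                 # illustrative parameters
sig1, sig2, r, rho, T, M = 0.2, 0.3, 0.05, 0.5, 1.0, 50_000
rng = np.random.default_rng(5)
z1, z2 = rng.standard_normal(M), rng.standard_normal(M)

# plain Monte Carlo on the spread payoff (X_T^1 - X_T^2 - K)_+
xt1 = x1 * np.exp((r - 0.5 * sig1**2) * T
                  + sig1 * np.sqrt(T) * (np.sqrt(1 - rho**2) * z1 + rho * z2))
xt2 = x2 * np.exp((r - 0.5 * sig2**2) * T + sig2 * np.sqrt(T) * z2)
plain = np.exp(-r * T) * np.maximum(xt1 - xt2 - K, 0.0)

# pre-conditioning: average the closed form for E(e^{-rT} h_T | Z_2)
x1_tilde = x1 * np.exp(-0.5 * rho**2 * sig1**2 * T + sig1 * rho * np.sqrt(T) * z2)
strikes = x2 * np.exp((r - 0.5 * sig2**2) * T + sig2 * np.sqrt(T) * z2) + K
precond = np.array([bs_call(a, b, r, sig1 * sqrt(1 - rho**2), T)
                    for a, b in zip(x1_tilde, strikes)])

est_plain, est_cond = plain.mean(), precond.mean()
var_ratio = precond.var() / plain.var()       # below 1, as guaranteed by (3.7)
```

Both estimators target the same premium; only the noise coming from $Z_1$ has been integrated out in closed form, which is exactly the variance removed by (3.7).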

3.5 Stratified Sampling

The starting idea of stratification is to localize the Monte Carlo method on the
elements of a measurable partition of the state space E of a random variable
X : (, A, P) → (E, E).
Let (Ai )i∈I be a finite E-measurable partition of the state space E. The Ai ’s are
called strata and (Ai )i∈I a stratification of E. Assume that the weights

pi = P(X ∈ Ai ), i ∈ I,

are known and (strictly) positive, and that, still for every i ∈ I,

$$ \mathcal L(X\,|\,X\in A_i) \overset{d}{=} \varphi_i(U), $$

where U is uniformly distributed over $[0,1]^{r_i}$ (with $r_i\in\mathbb N\cup\{\infty\}$, the case $r_i = +\infty$ corresponding to the acceptance-rejection method) and $\varphi_i : [0,1]^{r_i}\to E$ is an (easily) computable function. This second condition simply means that the conditional distribution $\mathcal L(X\,|\,X\in A_i)$ is easy to simulate on a computer. To be more precise, we implicitly assume in what follows that the simulation of X and of the conditional distributions $\mathcal L(X\,|\,X\in A_i)$, $i\in I$, or, equivalently, of the random vectors $\varphi_i(U_i)$, have approximately the same complexity. One must always keep this in mind since it is a major constraint for practical implementations of stratification methods.

This simulability condition usually has a strong impact on the possible design of the strata. For convenience, we will assume in what follows that $r_i = r$.
Let $F : (E,\mathcal E)\to(\mathbb R, \mathcal Bor(\mathbb R))$ be such that $\mathbb E\,F^2(X) < +\infty$. By elementary conditioning, we get

$$ \mathbb E\,F(X) = \sum_{i\in I}\mathbb E\big(\mathbf 1_{\{X\in A_i\}}F(X)\big) = \sum_{i\in I}p_i\,\mathbb E\big(F(X)\,|\,X\in A_i\big) = \sum_{i\in I}p_i\,\mathbb E\big(F(\varphi_i(U_i))\big), $$

where the random variables $U_i$, $i\in I$, are i.i.d. with uniform distribution over $[0,1]^r$.

This is where the stratification idea is introduced. Let M be the global “budget” allocated to the simulation of $\mathbb E\,F(X)$. We split this budget into |I| groups by setting

$$ M_i = q_i\,M, \quad i\in I, $$

to be the budget allocated to the computation of $\mathbb E\,F(\varphi_i(U))$ in each stratum $A_i$. This leads us to define the following (unbiased) estimator:

$$ \widehat{F(X)}_M := \sum_{i\in I} p_i\,\frac{1}{M_i}\sum_{k=1}^{M_i} F\big(\varphi_i(U_i^k)\big), $$

where $(U_i^k)_{1\le k\le M_i,\,i\in I}$ are i.i.d. random variables, uniformly distributed on $[0,1]^r$ (with an abuse of notation since the estimator actually depends on all the $M_i$). Then, elementary computations show that

$$ \mathrm{Var}\big(\widehat{F(X)}_M\big) = \frac{1}{M}\sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma_{F,i}^2, $$

where, for every i ∈ I, the local inertia reads

$$ \sigma_{F,i}^2 = \mathrm{Var}\big(F(\varphi_i(U))\big) = \mathrm{Var}\big(F(X)\,|\,X\in A_i\big) = \mathbb E\Big(\big(F(X)-\mathbb E\,(F(X)\,|\,X\in A_i)\big)^2\,\Big|\,X\in A_i\Big) = \frac{\mathbb E\Big(\big(F(X)-\mathbb E\,(F(X)\,|\,X\in A_i)\big)^2\,\mathbf 1_{\{X\in A_i\}}\Big)}{p_i}. $$

Optimizing the simulation budget allocation to each stratum amounts to solving the following minimization problem:

$$ \min_{(q_i)\in\mathcal S_I}\ \sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma_{F,i}^2 \qquad\text{where}\qquad \mathcal S_I := \Big\{(q_i)_{i\in I}\in(0,1)^{|I|}\ \Big|\ \sum_{i\in I}q_i = 1\Big\}. $$

▷ A sub-optimal choice. It is natural and simple to set

$$ q_i = p_i, \quad i\in I. $$

Such a choice is first motivated by the fact that the weights $p_i$ are known, and of course because it does reduce the variance since

$$ \sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma_{F,i}^2 = \sum_{i\in I}p_i\,\sigma_{F,i}^2 = \sum_{i\in I}\mathbb E\Big(\mathbf 1_{\{X\in A_i\}}\big(F(X)-\mathbb E\,(F(X)\,|\,X\in A_i)\big)^2\Big) = \big\|F(X)-\mathbb E\big(F(X)\,\big|\,\mathcal A_I^X\big)\big\|_2^2, \tag{3.8} $$

where $\mathcal A_I^X = \sigma\big(\{X\in A_i\},\,i\in I\big)$ denotes the σ-field spanned by the measurable partition $\{X\in A_i\}$, $i\in I$, of Ω and

$$ \mathbb E\big(F(X)\,\big|\,\mathcal A_I^X\big) = \sum_{i\in I}\mathbb E\big(F(X)\,|\,X\in A_i\big)\,\mathbf 1_{\{X\in A_i\}}. $$

Consequently,

$$ \sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma_{F,i}^2 \le \big\|F(X) - \mathbb E\,F(X)\big\|_2^2 = \mathrm{Var}\big(F(X)\big) \tag{3.9} $$

with equality if and only if $\mathbb E\big(F(X)\,\big|\,\mathcal A_I^X\big) = \mathbb E\,F(X)$ $\mathbb P$-a.s., or, equivalently, if and only if

$$ \mathbb E\big(F(X)\,|\,X\in A_i\big) = \mathbb E\,F(X), \quad i\in I. $$

So this choice always reduces the variance of the estimator, since we assumed that the stratification is not trivial. It corresponds, in the opinion poll world, to the so-called quota method.

▷ Optimal choice. The optimal choice is the solution to the above constrained minimization problem. It follows from a simple application of the Schwarz Inequality (and its equality case) that

$$ \sum_{i\in I}p_i\,\sigma_{F,i} = \sum_{i\in I}\frac{p_i\,\sigma_{F,i}}{\sqrt{q_i}}\,\sqrt{q_i} \le \Big(\sum_{i\in I}\frac{p_i^2\,\sigma_{F,i}^2}{q_i}\Big)^{\frac12}\Big(\sum_{i\in I}q_i\Big)^{\frac12} = \Big(\sum_{i\in I}\frac{p_i^2\,\sigma_{F,i}^2}{q_i}\Big)^{\frac12}. $$

Consequently, the optimal choice for the allocation parameters $q_i$, i.e. the solution to the above constrained minimization problem, is given by

$$ q^*_{F,i} = \frac{p_i\,\sigma_{F,i}}{\sum_j p_j\,\sigma_{F,j}}, \quad i\in I, \tag{3.10} $$

with a resulting minimal variance

$$ \Big(\sum_{i\in I}p_i\,\sigma_{F,i}\Big)^2. $$

At this stage the problem is that, unlike the weights $p_i$, the local inertias $\sigma_{F,i}^2$ are not known, which makes the implementation less straightforward and sometimes questionable. Some attempts have been made to circumvent this problem, see e.g. [86] for a recent reference based on an adaptive procedure for the computation of the local F-inertias $\sigma_{F,i}^2$.
However, using that the $L^p$-norms with respect to a probability measure are non-decreasing in p, one derives that

$$ \sigma_{F,i} = \Big(\mathbb E\big(\big|F(X)-\mathbb E\,(F(X)\,|\,X\in A_i)\big|^2\,\big|\,X\in A_i\big)\Big)^{\frac12} \ge \mathbb E\big(\big|F(X)-\mathbb E\,(F(X)\,|\,X\in A_i)\big|\,\big|\,X\in A_i\big) = \frac{\mathbb E\big(\big|F(X)-\mathbb E\,(F(X)\,|\,X\in A_i)\big|\,\mathbf 1_{\{X\in A_i\}}\big)}{p_i}, $$

so that, owing to Minkowski’s Inequality,

$$ \Big(\sum_{i\in I}p_i\,\sigma_{F,i}\Big)^2 \ge \big\|F(X)-\mathbb E\big(F(X)\,\big|\,\mathcal A_I^X\big)\big\|_1^2. $$

When compared to the resulting variance in (3.9) obtained with the suboptimal choice $q_i = p_i$, this illustrates the magnitude of the gain that can be expected from the optimal choice $q_i = q^*_{F,i}$: the minimal variance lies in between $\big\|F(X)-\mathbb E\big(F(X)\,\big|\,\mathcal A_I^X\big)\big\|_1^2$ and $\big\|F(X)-\mathbb E\big(F(X)\,\big|\,\mathcal A_I^X\big)\big\|_2^2$.
▷ Examples. Stratifications for the computation of $\mathbb E\,F(X)$, $X \overset{d}{=} \mathcal N(0;\,I_d)$, $d\ge1$.

(a) Stripes. Let v be a fixed unit vector (a simple and natural choice for v is $v = e_1 = (1,0,0,\dots,0)$: it is natural to define the strata as hyper-stripes perpendicular to the main axis $\mathbb R e_1$ of X). So we set, for a given size N of the stratification ($I = \{1,\dots,N\}$),

$$ A_i := \big\{x\in\mathbb R^d \text{ s.t. } (v|x)\in[y_{i-1},\,y_i]\big\}, \quad i = 1,\dots,N, $$

where $y_i$ is defined by $\Phi_0(y_i) = \frac iN$, $i = 0,\dots,N$ (the N-quantiles of the $\mathcal N(0;1)$ distribution). In particular, $y_0 = -\infty$ and $y_N = +\infty$. Then, if Z denotes an $\mathcal N(0;1)$-distributed random variable,

$$ p_i = \mathbb P(X\in A_i) = \mathbb P\big(Z\in[y_{i-1},y_i]\big) = \Phi_0(y_i) - \Phi_0(y_{i-1}) = \frac1N, $$

where $\Phi_0$ denotes the c.d.f. of the $\mathcal N(0;1)$ distribution. Other choices are possible for the $y_i$, leading to a non-uniform distribution of the $p_i$. The simulation of the conditional distributions follows from the fact that

$$ \mathcal L\big(X\,|\,(v|X)\in[a,b]\big) = \mathcal L\big(\xi_1\,v + \pi_{v^\perp}(\xi_2)\big), $$

where $\xi_1 \overset{d}{=} \mathcal L(Z\,|\,Z\in[a,b])$ is independent of $\xi_2 \overset{d}{=} \mathcal N(0;\,I_{d-1})$,

$$ \mathcal L(Z\,|\,Z\in[a,b]) = \mathcal L\Big(\Phi_0^{-1}\big((\Phi_0(b)-\Phi_0(a))\,U + \Phi_0(a)\big)\Big), \qquad U \overset{d}{=} \mathcal U([0,1]), $$

and $\pi_{v^\perp}$ denotes the orthogonal projection on $v^\perp$. When $v = e_1$, this simply reads

$$ \mathcal L\big(X\,|\,(v|X)\in[a,b]\big) = \mathcal L(Z\,|\,Z\in[a,b]) \otimes \mathcal N(0;\,I_{d-1}). $$

(b) Hyper-rectangles. We still consider $X = (X^1,\dots,X^d) \overset{d}{=} \mathcal N(0;\,I_d)$, $d\ge2$. Let $(e_1,\dots,e_d)$ denote the canonical basis of $\mathbb R^d$. We now define the strata as hyper-rectangles. Let $N_1,\dots,N_d\ge1$ and set

$$ A_i := \Big\{x\in\mathbb R^d \text{ s.t. } (e_\ell|x)\in[y^\ell_{i_\ell-1},\,y^\ell_{i_\ell}],\ \ell = 1,\dots,d\Big\}, \qquad i\in\prod_{\ell=1}^d\{1,\dots,N_\ell\}, $$

where the $y^\ell_i$ are defined by $\Phi_0(y^\ell_i) = \frac{i}{N_\ell}$, $i = 0,\dots,N_\ell$. Then, for every multi-index $i = (i_1,\dots,i_d)\in\prod_{\ell=1}^d\{1,\dots,N_\ell\}$,

$$ \mathcal L(X\,|\,X\in A_i) = \bigotimes_{\ell=1}^d \mathcal L\big(Z\,|\,Z\in[y^\ell_{i_\ell-1},\,y^\ell_{i_\ell}]\big). \tag{3.11} $$

Optimizing the allocation to each stratum in the simulation for a given function F
in order to reduce the variance is of course interesting and can be highly efficient but
with the drawback of being strongly F-dependent, especially when this allocation
needs an extra procedure like in [86]. An alternative and somewhat dual approach is
to try optimizing the strata themselves uniformly with respect to a class of functions
F (namely Lipschitz continuous functions) prior to the allocation across the strata.
This approach emphasizes the connections between stratification and optimal
quantization and provides bounds on the best possible variance reduction factor that
can be expected from a stratification. Some elements are provided in Chap. 5, see
also [70] for further developments in infinite dimensions.
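As a small numerical illustration of the stripe stratification (a) with the sub-optimal proportional allocation $q_i = p_i = 1/N$ (a sketch under our own naming, using the standard library's inverse normal c.d.f.):

```python
import numpy as np
from statistics import NormalDist

def stratified_mc(F, N, M, rng):
    """Stratified estimator of E F(X), X ~ N(0;1), with N equiprobable stripes
    (p_i = 1/N) and the sub-optimal proportional allocation q_i = p_i."""
    nd = NormalDist()
    m_i = M // N                           # budget per stratum, M_i = p_i * M
    est = 0.0
    for i in range(N):                     # stratum i+1: Phi_0 in (i/N, (i+1)/N)
        u = rng.random(m_i)
        # inverse-c.d.f. sampling of L(Z | Z in [y_i, y_{i+1}])
        z = np.array([nd.inv_cdf((i + uk) / N) for uk in u])
        est += np.mean(F(z)) / N           # p_i * empirical conditional mean
    return est

rng = np.random.default_rng(6)
est = stratified_mc(lambda z: z**2, N=100, M=100_000, rng=rng)   # E Z^2 = 1
```

Because most of the variance of F(Z) across the real line is captured by the strata themselves, the residual (within-stratum) variance of this estimator is far smaller than that of a plain Monte Carlo average of the same size.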

3.6 Importance Sampling

3.6.1 The Abstract Paradigm of Importance Sampling

The basic principle of importance sampling is the following: let $X : (\Omega,\mathcal A,\mathbb P)\to(E,\mathcal E)$ be an E-valued random variable. Let μ be a σ-finite reference measure on $(E,\mathcal E)$ such that $\mathbb P_X \ll \mu$, i.e. there exists a probability density $f : (E,\mathcal E)\to(\mathbb R_+, \mathcal B(\mathbb R_+))$ such that

$$ \mathbb P_X = f\cdot\mu. $$

In practice, we will have to simulate several random variables whose distributions are all absolutely continuous with respect to this reference measure μ. For a first reading, one may assume that $E = \mathbb R^d$ and μ is the Lebesgue measure $\lambda_d$, but what follows can also be applied to more general measure spaces, like the Wiener space (equipped with the Wiener measure), countable sets (with the counting measure), etc. Let $h\in L^1(\mathbb P_X)$. Then,

$$ \mathbb E\,h(X) = \int_E h(x)\,\mathbb P_X(dx) = \int_E h(x)f(x)\,\mu(dx). $$

Now, for any μ-a.s. positive probability density function g defined on $(E,\mathcal E)$ (with respect to μ), one has

$$ \mathbb E\,h(X) = \int_E h(x)f(x)\,\mu(dx) = \int_E \frac{h(x)f(x)}{g(x)}\,g(x)\,\mu(dx). $$

One can always enlarge (if necessary) the original probability space $(\Omega,\mathcal A,\mathbb P)$ to design a random variable $Y : (\Omega,\mathcal A,\mathbb P)\to(E,\mathcal E)$ having g as a probability density

with respect to μ. Then, going back to the probability space Ω, this yields, for every non-negative or $\mathbb P_X$-integrable function $h : E\to\mathbb R$,

$$ \mathbb E\,h(X) = \mathbb E\Big(\frac{h(Y)f(Y)}{g(Y)}\Big). \tag{3.12} $$

So, in order to compute $\mathbb E\,h(X)$, one may also implement a Monte Carlo simulation based on the simulation of independent copies of the random variable Y, i.e.

$$ \mathbb E\,h(X) = \mathbb E\Big(\frac{h(Y)f(Y)}{g(Y)}\Big) = \lim_{M\to+\infty}\frac1M\sum_{k=1}^M h(Y_k)\,\frac{f(Y_k)}{g(Y_k)} \quad\text{a.s.} $$

ℵ Practitioner’s corner.

▷ Practical requirements (to undertake the simulation). To proceed, it is necessary to simulate independent copies of Y and to compute the ratio f/g of the density functions at a reasonable cost. Note that only the ratio is needed, which makes the computation of some “structural” constants useless, like $(2\pi)^{d/2}$, e.g. when both f and g are Gaussian densities with different means (see below). By a “reasonable cost” for the simulation of Y, we mean the same cost as that of X (in terms of complexity). As concerns the ratio f/g, this means that its computational cost remains negligible with respect to that of h or, which may be the case in some slightly different situations, that the computational cost of $h\times\frac fg$ is equivalent to that of h alone.

▷ Sufficient conditions (to undertake the simulation). Once the above conditions are fulfilled, the question is: is it profitable to proceed like this? This is the case if the complexity of the simulation for a given accuracy (in terms of confidence interval) is lower with the second method. If one assumes, as above, that simulating X and Y on the one hand, and computing $h(x)$ and $(hf/g)(x)$ on the other hand, are both comparable in terms of complexity, the question amounts to comparing the variances or, equivalently, the squared quadratic norms of the estimators, since they have the same expectation $\mathbb E\,h(X)$.

Now,

$$ \mathbb E\Big(\frac{h(Y)f(Y)}{g(Y)}\Big)^2 = \mathbb E\Big(\frac{hf}{g}\Big)^2(Y) = \int_E\Big(\frac{h(x)f(x)}{g(x)}\Big)^2 g(x)\,\mu(dx) = \int_E h(x)^2\,\frac{f(x)}{g(x)}\,f(x)\,\mu(dx) = \mathbb E\Big(h(X)^2\,\frac fg(X)\Big). $$

As a consequence, simulating $\frac{hf}{g}(Y)$ rather than $h(X)$ will reduce the variance if and only if

$$ \mathbb E\Big(h(X)^2\,\frac fg(X)\Big) < \mathbb E\,h(X)^2. \tag{3.13} $$

Remark. In fact, theoretically, as soon as h is non-negative and $\mathbb E\,h(X)\ne0$, one may reduce the variance of the new simulation to… 0. As a matter of fact, using the Schwarz Inequality one gets, as if trying to “reprove” that $\mathrm{Var}\big(\frac{h(Y)f(Y)}{g(Y)}\big)\ge0$,

$$ \big(\mathbb E\,h(X)\big)^2 = \Big(\int_E h(x)f(x)\,\mu(dx)\Big)^2 = \Big(\int_E \frac{h(x)f(x)}{\sqrt{g(x)}}\,\sqrt{g(x)}\,\mu(dx)\Big)^2 \le \int_E\frac{(h(x)f(x))^2}{g(x)}\,\mu(dx)\times\int_E g\,d\mu = \int_E\frac{(h(x)f(x))^2}{g(x)}\,\mu(dx) $$

since g is a probability density function. Now, the equality case in the Schwarz Inequality says that the variance is 0 if and only if $\sqrt{g(x)}$ and $\frac{h(x)f(x)}{\sqrt{g(x)}}$ are proportional $\mu(dx)$-a.s., i.e. $h(x)f(x) = c\,g(x)$ $\mu(dx)$-a.s. for a (non-negative) real constant c. Finally, when h has a constant sign and $\mathbb E\,h(X)\ne0$, this leads to

$$ g(x) = \frac{h(x)}{\mathbb E\,h(X)}\,f(x) \quad \mu(dx)\text{-a.s.} $$

This choice is clearly impossible to make since it would mean that $\mathbb E\,h(X)$ is known, as it is involved in the formula… and the simulation would then be of no use. A contrario, this may suggest a direction for designing the (distribution of) Y.

3.6.2 How to Design and Implement Importance Sampling

The intuition that must guide practitioners when designing an importance sampling method is to replace a random variable X by a random variable Y so that $\frac{hf}{g}(Y)$ is in some sense often “closer” than h(X) to their common mean. Let us be more specific. We consider a Call on the risky asset $(X_t)_{t\in[0,T]}$ with strike price K and maturity T > 0 (with interest rate r ≡ 0 for simplicity). If $X_0 = x \ll K$, i.e. the option is deep out-of-the-money at the origin of time, then most of the scenarii $X_T(\omega)$ will satisfy $X_T(\omega)\le K$ or, equivalently, $(X_T(\omega)-K)_+ = 0$. In such a setting, the event $\{(X_T-K)_+ > 0\}$ – the payoff is positive – is a rare event, so the number of scenarii that produce a non-zero value for $(X_T-K)_+$ will be small, inducing a too rough estimate of the quantity of interest $\mathbb E\,(X_T-K)_+$. Put in a more quantitative way, this means that, even if the expectation and the standard deviation of the payoff are both small in absolute value, their ratio (standard deviation over expectation) will be very large.

By contrast, if we switch from (X_t)_{t∈[0,T]} to (Y_t)_{t∈[0,T]} so that:
– we can compute the ratio (f_{X_T}/g_{Y_T})(y), where f_{X_T} and g_{Y_T} are the probability densities of X_T and Y_T, respectively,
– Y_T takes most, or at least a significant part, of its values in [K, +∞),
then
$$
\mathbb{E}\,(X_T-K)_+=\mathbb{E}\Big[(Y_T-K)_+\,\frac{f_{X_T}}{g_{Y_T}}(Y_T)\Big]
$$
and we can reasonably hope that we will simulate more significant scenarii for (Y_T − K)_+ (f_{X_T}/g_{Y_T})(Y_T) than for (X_T − K)_+. This effect will be measured by the variance reduction.
This interpretation in terms of “rare events” is in fact the core of importance
sampling, more than the plain “variance reduction” feature. In particular, this is what
a practitioner must have in mind when searching for a “good” probability distribution
g: importance sampling is more a matter of “focusing light where it is needed” than
reducing variance.
When dealing with vanilla options in simple models (typically local volatility), one usually works on the state space E = R_+ and importance sampling amounts to a change of variable in one-dimensional integrals, as emphasized above. However, in more involved frameworks, one considers the scenarii space as a state space, typically E = Ω = C(R_+, R^d), and uses Girsanov's Theorem instead of the usual change of variable with respect to the Lebesgue measure.
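The rare-event effect just described can be observed on a small simulation (a sketch with parameters of our choosing, deep out-of-the-money: x = 70, K = 100, σ = 0.2, T = 1, r = 0): only a few percent of the crude scenarii produce a non-zero payoff, and the ratio standard deviation/expectation of the payoff is large.

```python
import numpy as np

rng = np.random.default_rng(1)
x, K, sigma, T, r = 70.0, 100.0, 0.2, 1.0, 0.0
mu = r - sigma**2 / 2

M = 100_000
Z = rng.standard_normal(M)
payoff = np.maximum(x * np.exp(mu * T + sigma * np.sqrt(T) * Z) - K, 0.0)

frac_nonzero = np.mean(payoff > 0)       # frequency of the "rare" event {payoff > 0}
ratio = payoff.std() / payoff.mean()     # std/mean: a large value signals a rough estimate
print(frac_nonzero, payoff.mean(), ratio)
```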

3.6.3 Parametric Importance Sampling

In practice, the starting idea is to introduce a parametric family of random variables (Y_θ)_{θ∈Θ} (often defined on the same probability space (Ω, A, P) as X) such that:
– for every θ ∈ Θ, Y_θ has a probability density g_θ > 0 μ-a.e. with respect to a reference measure μ, and Y_θ is as easy to simulate as X in terms of complexity,
– the ratio f/g_θ has a small computational cost, where f is the probability density of the distribution of X with respect to μ.
Furthermore, we can always assume, by adding a value to Θ if necessary, that for a value θ_0 ∈ Θ of the parameter, Y_{θ_0} = X (at least in distribution).
The problem becomes a parametric optimization problem, typically solving the minimization problem
$$
\min_{\theta\in\Theta}\Big\{\,\mathbb{E}\Big[\Big(h(Y_\theta)\,\frac{f}{g_\theta}(Y_\theta)\Big)^{2}\Big]=\mathbb{E}\Big[h(X)^2\,\frac{f}{g_\theta}(X)\Big]\Big\}.
$$

Of course there is no reason why the solution to the above problem should be θ0
(if so, such a parametric model is inappropriate). At this stage one can follow two
strategies:
– Try to solve the above minimization problem by numerical means.
– Use one's intuition to select a priori a good (though sub-optimal) θ ∈ Θ by applying the heuristic principle: "focus light where needed".
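As a minimal sketch of the first (numerical) strategy, anticipating the Gaussian example below: for the mean-translation family Y_θ = Z + θ, one can estimate the objective θ ↦ e^{θ²/2} E[φ²(Z) e^{−θZ}] on a grid of θ values from a single sample of normal draws and retain the grid minimizer. The payoff and parameters – a deep out-of-the-money Call with x = 70, K = 100, σ = 0.2, T = 1, r = 0 – are ours:

```python
import numpy as np

rng = np.random.default_rng(2)
x, K, sigma, T = 70.0, 100.0, 0.2, 1.0
mu = -sigma**2 / 2   # r = 0

def phi(z):          # Call payoff seen as a function of the normal variable
    return np.maximum(x * np.exp(mu * T + sigma * np.sqrt(T) * z) - K, 0.0)

Z = rng.standard_normal(200_000)
thetas = np.linspace(0.0, 4.0, 81)

# Monte Carlo estimate of v(theta) = e^{theta^2/2} E[phi(Z)^2 e^{-theta Z}],
# re-using the same draws for every theta
v = np.array([np.exp(t**2 / 2) * np.mean(phi(Z)**2 * np.exp(-t * Z)) for t in thetas])

theta_star = thetas[np.argmin(v)]
print(theta_star, v.min(), v[0])   # v[0] is the crude second moment E[phi(Z)^2]
```

A more robust, sample-free-of-grid alternative (the recursive Stochastic Approximation approach) is developed in Chap. 6.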
Example (The Cameron–Martin formula and Importance Sampling by mean translation). This example takes place in a Gaussian framework. We consider (as a starting motivation) a one-dimensional Black–Scholes model defined by
$$
X_T^x=x\,e^{\mu T+\sigma W_T}\overset{d}{=}x\,e^{\mu T+\sigma\sqrt{T}\,Z},\qquad Z\sim\mathcal{N}(0;1),
$$
with x > 0, σ > 0 and μ = r − σ²/2. Then, the premium of an option with payoff h : (0, +∞) → (0, +∞) reads
$$
e^{-rT}\,\mathbb{E}\,h(X_T^x)=\mathbb{E}\,\varphi(Z)=\int_{\mathbb{R}}\varphi(z)\,e^{-\frac{z^2}{2}}\,\frac{dz}{\sqrt{2\pi}},
$$
where φ(z) = e^{−rT} h(x e^{μT+σ√T z}), z ∈ R.
From now on, we forget about the financial framework and deal with
$$
\mathbb{E}\,\varphi(Z)=\int_{\mathbb{R}}\varphi(z)\,g_0(z)\,dz\quad\text{where}\quad g_0(z)=\frac{e^{-\frac{z^2}{2}}}{\sqrt{2\pi}}
$$
and the random variable Z plays the role of X in the above theoretical part. The idea is to introduce the parametric family
$$
Y_\theta=Z+\theta,\qquad \theta\in\Theta:=\mathbb{R}.
$$
We consider the Lebesgue measure λ_1 on the real line as a reference measure, so that
$$
g_\theta(y)=\frac{e^{-\frac{(y-\theta)^2}{2}}}{\sqrt{2\pi}},\qquad y\in\mathbb{R},\ \theta\in\Theta:=\mathbb{R}.
$$
Elementary computations show that
$$
\frac{g_0}{g_\theta}(y)=e^{-\theta y+\frac{\theta^2}{2}},\qquad y\in\mathbb{R},\ \theta\in\Theta:=\mathbb{R}.
$$

Hence, we derive the Cameron–Martin formula


$$
\mathbb{E}\,\varphi(Z)=e^{\frac{\theta^2}{2}}\,\mathbb{E}\big[\varphi(Y_\theta)\,e^{-\theta Y_\theta}\big]
=e^{\frac{\theta^2}{2}}\,\mathbb{E}\big[\varphi(Z+\theta)\,e^{-\theta(Z+\theta)}\big]
=e^{-\frac{\theta^2}{2}}\,\mathbb{E}\big[\varphi(Z+\theta)\,e^{-\theta Z}\big].
$$

Remark. In fact, a standard change of variable based on the invariance of the Lebesgue measure by translation yields the same result in a much more straightforward way: setting z = u + θ shows that
$$
\mathbb{E}\,\varphi(Z)=\int_{\mathbb{R}}\varphi(u+\theta)\,e^{-\frac{\theta^2}{2}-\theta u-\frac{u^2}{2}}\,\frac{du}{\sqrt{2\pi}}
=e^{-\frac{\theta^2}{2}}\,\mathbb{E}\big[e^{-\theta Z}\varphi(Z+\theta)\big]
=e^{\frac{\theta^2}{2}}\,\mathbb{E}\big[\varphi(Z+\theta)\,e^{-\theta(Z+\theta)}\big].
$$
It is to be noticed again that there is no need to account for the normalization constants to compute the ratio g_0/g_θ.
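The Cameron–Martin identity is easy to check numerically; here on the test function φ(z) = z₊ (for which E φ(Z) = 1/√(2π)) with θ = 1, both choices being ours:

```python
import numpy as np

rng = np.random.default_rng(3)
phi = lambda z: np.maximum(z, 0.0)   # E phi(Z) = 1/sqrt(2*pi) for Z ~ N(0;1)
theta = 1.0

Z = rng.standard_normal(1_000_000)
lhs = phi(Z).mean()                                                    # E phi(Z)
rhs = np.exp(-theta**2 / 2) * np.mean(phi(Z + theta) * np.exp(-theta * Z))

print(lhs, rhs, 1 / np.sqrt(2 * np.pi))   # the three values agree up to MC error
```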

The next step is to choose a "good" θ which significantly reduces the variance, i.e., following Condition (3.13) (using the formulation involving Y_θ = Z + θ), such that
$$
\mathbb{E}\Big[\big(e^{\frac{\theta^2}{2}}\,\varphi(Z+\theta)\,e^{-\theta(Z+\theta)}\big)^2\Big]<\mathbb{E}\,\varphi^2(Z),
$$
i.e.
$$
e^{-\theta^2}\,\mathbb{E}\big[\varphi^2(Z+\theta)\,e^{-2\theta Z}\big]<\mathbb{E}\,\varphi^2(Z)
$$
or, equivalently, if one uses the formulation of (3.13) based on the original random variable (here Z),
$$
\mathbb{E}\big[\varphi^2(Z)\,e^{\frac{\theta^2}{2}-\theta Z}\big]<\mathbb{E}\,\varphi^2(Z).
$$
Consequently the variance minimization amounts to the following problem:
$$
\min_{\theta\in\mathbb{R}}\Big\{\,e^{\frac{\theta^2}{2}}\,\mathbb{E}\big[\varphi^2(Z)\,e^{-\theta Z}\big]=e^{-\theta^2}\,\mathbb{E}\big[\varphi^2(Z+\theta)\,e^{-2\theta Z}\big]\Big\}.
$$

It is clear that the solution of this optimization problem and the resulting choice of θ highly depend on the function h.
– Optimization approach: When h is smooth enough, an approach based on large deviation estimates has been proposed by Glasserman et al. (see [115]). We propose a simple recursive/adaptive approach in Sect. 6.3.1 of Chap. 6 based on Stochastic Approximation which does not depend upon the regularity of the function h (see also [12] for a pioneering work in that direction).
– Heuristic suboptimal approach: Let us temporarily return to our pricing problem involving the specified function φ(z) = e^{−rT}(x exp(μT + σ√T z) − K)_+, z ∈ R. When x ≪ K (deep out-of-the-money option), most simulations of φ(Z) will produce 0 as a result. A first simple idea – if one does not wish to carry out the above optimization – can be to "re-center the simulation" of X_T^x around K by replacing Z by Z + θ, where θ satisfies
$$
\mathbb{E}\Big[x\exp\big(\mu T+\sigma\sqrt{T}(Z+\theta)\big)\Big]=K,
$$
which yields, since E X_T^x = x e^{rT},
$$
\theta:=-\frac{\log(x/K)+rT}{\sigma\sqrt{T}}. \tag{3.14}
$$

Solving the similar, though slightly different, equation
$$
\mathbb{E}\Big[x\exp\big(\mu T+\sigma\sqrt{T}(Z+\theta)\big)\Big]=e^{rT}K
$$
would lead to
$$
\theta:=-\frac{\log(x/K)}{\sigma\sqrt{T}}. \tag{3.15}
$$

A third simple, intuitive idea is to search for θ such that
$$
\mathbb{P}\Big(x\exp\big(\mu T+\sigma\sqrt{T}(Z+\theta)\big)<K\Big)=\frac{1}{2},
$$
which yields
$$
\theta:=-\frac{\log(x/K)+\mu T}{\sigma\sqrt{T}}. \tag{3.16}
$$
This choice is also the solution to the equation x e^{μT+σ√T θ} = K, etc.
All these choices are suboptimal but reasonable when x ≪ K. However, if we need to price a whole portfolio including many options with various strikes, maturities (and underlyings…), the above approach is no longer possible and a data-driven optimization method like the one developed in Chap. 6 becomes mandatory.
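A sketch of such a comparison, in the setting of Exercise 2 below (r = 0, σ = 0.2, x = 70, K = 100, T = 1, for which the text quotes the Black–Scholes price 0.248), with the heuristic choice (3.14) of θ:

```python
import numpy as np

rng = np.random.default_rng(4)
x, K, sigma, T, r = 70.0, 100.0, 0.2, 1.0, 0.0
mu = r - sigma**2 / 2
phi = lambda z: np.exp(-r * T) * np.maximum(
    x * np.exp(mu * T + sigma * np.sqrt(T) * z) - K, 0.0)

theta = -(np.log(x / K) + r * T) / (sigma * np.sqrt(T))   # heuristic choice (3.14)

M = 200_000
Z = rng.standard_normal(M)
crude = phi(Z)                                                          # crude Monte Carlo
shifted = np.exp(-theta**2 / 2) * phi(Z + theta) * np.exp(-theta * Z)   # Cameron-Martin

print(crude.mean(), crude.std())       # both means target the price ~ 0.248
print(shifted.mean(), shifted.std())   # the shifted estimator has a much smaller std
```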
Other parametric methods can be introduced, especially in non-Gaussian frame-
works, like for example the so-called “exponential tilting” (or Esscher transform)
for distributions having a Laplace transform on the whole real line (see e.g. [199]).
Thus, when dealing with the NIG distribution (for Normal Inverse Gaussian) this
transform has an impact on the thickness of the tail of the distribution. Of course,
there is no a priori limit to what can be designed on a specific problem. When dealing
with path-dependent options, one usually relies on the Girsanov theorem to modify
likewise the drift of the risky asset dynamics (see [199]). All the preceding can be
adapted to multi-dimensional models.

Exercises. 1. (a) Show that, under appropriate integrability assumptions on h to be specified, the function
$$
\theta\longmapsto \mathbb{E}\big[\varphi^2(Z)\,e^{\frac{\theta^2}{2}-\theta Z}\big]
$$
is strictly convex and differentiable on the whole real line, with a derivative given by
$$
\theta\longmapsto \mathbb{E}\big[\varphi^2(Z)\,(\theta-Z)\,e^{\frac{\theta^2}{2}-\theta Z}\big].
$$
(b) Show that if φ is an even function, then this parametric importance sampling procedure by mean translation is useless. Give a necessary and sufficient condition (involving φ and Z) that makes it always useful.
2. Set r = 0, σ = 0.2, X 0 = x = 70, T = 1. One wishes to price a Call option with
strike price K = 100 (i.e. deep out-of-the-money). The true Black–Scholes price is
0.248 (see Sect. 12.2).
Compare the performances of
(i) a “crude” Monte Carlo simulation,
(ii) the above “intuitively guided” heuristic choices for θ.
Assume now that x = K = 100. What do you think of the heuristic suboptimal
choice?
3. Write all the preceding when Z ∼ N(0; I_d).
4. Randomization of an integral. Let h ∈ L¹(R^d, Bor(R^d), λ_d).
(a) Show that, for any R^d-valued random vector Y having an absolutely continuous distribution P_Y = g·λ_d with g > 0 λ_d-a.s. on {h ≠ 0}, one has
$$
\int_{\mathbb{R}^d} h\,d\lambda_d=\mathbb{E}\Big[\frac{h}{g}(Y)\Big].
$$
Derive a probabilistic method to compute ∫_{R^d} h dλ_d.
(b) Propose an importance sampling approach to this problem inspired by the above examples.

3.6.4 Computing the Value-at-Risk by Monte Carlo Simulation: First Approach

Let X be a real-valued random variable defined on a probability space, representative of a loss. For the sake of simplicity, we suppose here that X has a continuous distribution, i.e. that its distribution function, defined for every x ∈ R by F(x) := P(X ≤ x), is continuous. For a given confidence level α ∈ (0,1) (usually close to 1), the Value-at-Risk at level α (denoted by VaR_α or VaR_{α,X}, following [92]) is any real number satisfying the equation
$$
\mathbb{P}\big(X\le \mathrm{VaR}_{\alpha,X}\big)=\alpha\in(0,1). \tag{3.17}
$$

Equation (3.17) has at least one solution since F is continuous, but this solution may fail to be unique in general. For convenience, one often takes the lowest solution of Equation (3.17) as the VaR_{α,X}. In fact, Value-at-Risk (of X) is not consistent as a measure of risk (as emphasized in [92]), but nowadays it is still widely used to measure financial risk.
One naive way to compute VaR_{α,X} is to estimate the empirical distribution function of a (large enough) Monte Carlo simulation at points ξ lying in a grid Γ := {ξ_i, i ∈ I}, namely
$$
\widehat{F}(\xi)_M:=\frac{1}{M}\sum_{k=1}^{M}\mathbf{1}_{\{X_k\le\xi\}},\qquad \xi\in\Gamma,
$$
where (X_k)_{k≥1} is an i.i.d. sequence of X-distributed random variables. Then one solves the equation F̂(ξ)_M = α (using an interpolation step, of course).
Such an approach based on the empirical distribution of X needs to simulate extreme values of X since α is usually close to 1. So implementing a Monte Carlo simulation directly on the above equation is usually a somewhat meaningless exercise. Importance sampling becomes the natural way to "re-center" the equation.
Assume, for example, that
$$
X:=\varphi(Z),\qquad Z\sim\mathcal{N}(0;1).
$$
Then, for every ξ ∈ R,
$$
\mathbb{P}(X\le\xi)=e^{-\frac{\theta^2}{2}}\,\mathbb{E}\big[\mathbf{1}_{\{\varphi(Z+\theta)\le\xi\}}\,e^{-\theta Z}\big],
$$
so that the above Eq. (3.17) now reads (θ being fixed)
$$
\mathbb{E}\big[\mathbf{1}_{\{\varphi(Z+\theta)\le \mathrm{VaR}_{\alpha,X}\}}\,e^{-\theta Z}\big]=e^{\frac{\theta^2}{2}}\,\alpha.
$$

It remains to find good variance reducers θ. This choice depends of course on ξ but
in practice it should be fitted to reduce the variance in the neighborhood of VaRα,X .
We will see in Chap. 6 that more efficient methods based on Stochastic Approxi-
mation can be devised. But they also need variance reduction to be implemented.
Furthermore, similar ideas can be used to compute a consistent measure of risk called
the Conditional Value-at-Risk (or Averaged Value-at-Risk).
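As a toy illustration (the loss model and the choice of θ are ours): take X = φ(Z) with φ(z) = e^{0.3z} and α = 0.99, so that the exact Value-at-Risk is e^{0.3 z_α} ≈ 2.0095, where z_α ≈ 2.3263 is the 0.99-quantile of N(0;1). The importance sampling version estimates the tail mass P(X > ξ) = e^{−θ²/2} E[1_{φ(Z+θ)>ξ} e^{−θZ}] and reads off the ξ at which the weighted tail mass equals 1 − α:

```python
import numpy as np

rng = np.random.default_rng(5)
phi = lambda z: np.exp(0.3 * z)     # loss X = phi(Z)
alpha, theta, M = 0.99, 1.5, 100_000
z_alpha = 2.3263                    # 0.99-quantile of N(0;1): exact VaR = e^{0.3 z_alpha}
Z = rng.standard_normal(M)

# crude estimate: empirical alpha-quantile of a plain sample of X
var_crude = np.quantile(phi(Z), alpha)

# importance sampling: weighted empirical tail of the mean-shifted sample
X_shift = phi(Z + theta)
w = np.exp(-theta**2 / 2 - theta * Z) / M        # Cameron-Martin weights
desc = np.argsort(X_shift)[::-1]                 # sort losses in decreasing order
tail = np.cumsum(w[desc])                        # weighted estimate of P(X > xi)
var_is = X_shift[desc][np.searchsorted(tail, 1 - alpha)]

print(var_crude, var_is, np.exp(0.3 * z_alpha))
```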
Chapter 4
The Quasi-Monte Carlo Method

In this chapter we present the so-called Quasi-Monte Carlo (QMC) method, which
can be seen as a deterministic alternative to the standard Monte Carlo method: the
pseudo-random numbers are replaced by deterministic computable sequences of
[0, 1]d -valued vectors which, once substituted mutatis mutandis in place of pseudo-
random numbers in the Monte Carlo method, may significantly speed up its rate
of convergence, making it almost independent of the structural dimension d of the
simulation.

4.1 Motivation and Definitions

Computing an expectation E ϕ(X ) using a Monte Carlo simulation ultimately


amounts to computing either a finite-dimensional integral
$$
\int_{[0,1]^d} f(u^1,\ldots,u^d)\,du^1\cdots du^d
$$
or an infinite-dimensional integral
$$
\int_{[0,1]^{(\mathbb{N}^*)}} f(u)\,\lambda_\infty(du),
$$
where [0,1]^{(N*)} denotes the set of finite [0,1]-valued sequences (or, equivalently, of sequences vanishing from some index onward) and λ_∞ = λ^{⊗N*} is the Lebesgue measure on ([0,1]^{N*}, Bor([0,1]^{N*})). Integrals of the first type show up when X can be simulated by standard methods like the inverse distribution function, the Box–Muller simulation method for Gaussian distributions, etc., so that X = g(U), U = (U¹, …, U^d) ∼ U([0,1]^d), whereas the second type is typical of a simulation using an acceptance-rejection method (like the polar method for Gaussian distributions).

© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext, https://doi.org/10.1007/978-3-319-90276-0_4
As concerns finite-dimensional integrals, we saw that, if (U_n)_{n≥1} denotes an i.i.d. sequence of uniformly distributed random vectors on [0,1]^d, then, for every function f ∈ L¹_R([0,1]^d, λ_d), P(dω)-a.s.,
$$
\frac{1}{n}\sum_{k=1}^{n} f\big(U_k(\omega)\big)\longrightarrow \mathbb{E}\,f(U_1)=\int_{[0,1]^d} f(u^1,\ldots,u^d)\,du^1\cdots du^d, \tag{4.1}
$$
where the subset Ω_f of P-probability 1 on which this convergence holds true depends on the function f. In particular, the above a.s. convergence holds for any continuous function on [0,1]^d. But in fact, taking advantage of the separability of the space of continuous functions, we will show below that this convergence simultaneously holds for all continuous functions on [0,1]^d, and even on the larger class of Riemann integrable functions on [0,1]^d.
First we briefly recall the basic definition of weak convergence of probability
measures on metric spaces (see [45] Chap. 1 for a general introduction to weak
convergence of probability measures on metric spaces).

Definition 4.1 (Weak convergence) Let (S, δ) be a metric space and let S := Bor_δ(S) be its Borel σ-field. Let (μ_n)_{n≥1} be a sequence of probability measures on (S, S) and μ a probability measure on the same space. The sequence (μ_n)_{n≥1} weakly converges to μ (denoted by μ_n ⟹^{(S)} μ) if, for every function f ∈ C_b(S, R),
$$
\int_S f\,d\mu_n\longrightarrow \int_S f\,d\mu\quad\text{as }n\to+\infty. \tag{4.2}
$$

In view of applications, the first important result on weak convergence of proba-


bility measures of this chapter is the following.

Proposition 4.1 (See Theorem 5.1 in [45]) If μ_n ⟹^{(S)} μ, then the above convergence (4.2) holds for every bounded Borel function f : (S, S) → R such that μ(Disc(f)) = 0, where Disc(f) = {x ∈ S : f is discontinuous at x}.

Functions f such that μ(Disc( f )) = 0 are called μ-a.s. continuous functions.

Theorem 4.1 (Glivenko–Cantelli Theorem) If (U_n)_{n≥1} is an i.i.d. sequence of uniformly distributed random variables on the unit hypercube [0,1]^d, then
$$
\mathbb{P}(d\omega)\text{-a.s.}\qquad \frac{1}{n}\sum_{k=1}^{n}\delta_{U_k(\omega)}\ \overset{([0,1]^d)}{\Longrightarrow}\ \lambda_{d|[0,1]^d}=\mathcal{U}([0,1]^d)\quad\text{as }n\to+\infty,
$$
i.e.
$$
\mathbb{P}(d\omega)\text{-a.s.}\quad \forall\,f\in\mathcal{C}([0,1]^d,\mathbb{R}),\qquad \frac{1}{n}\sum_{k=1}^{n} f(U_k(\omega))\ \overset{n\to+\infty}{\longrightarrow}\ \int_{[0,1]^d} f(x)\,\lambda_d(dx). \tag{4.3}
$$

Proof. The vector space C([0,1]^d, R) endowed with the sup-norm ‖f‖_sup := sup_{x∈[0,1]^d} |f(x)| is separable in the sense that there exists a sequence (f_m)_{m≥1} of continuous functions on [0,1]^d which is everywhere dense in C([0,1]^d, R) with respect to the sup-norm (1).
Now, for every m ≥ 1, the convergence (4.1) holds with f = f_m for every ω ∈ Ω_m, where Ω_m ∈ A and P(Ω_m) = 1. Set Ω_0 = ∩_{m∈N*} Ω_m. One has P(Ω_0^c) ≤ Σ_{m≥1} P(Ω_m^c) = 0 by σ-sub-additivity of probability measures, so that P(Ω_0) = 1. Then,
$$
\forall\,\omega\in\Omega_0,\ \forall\,m\ge 1,\qquad \frac{1}{n}\sum_{k=1}^{n} f_m(U_k(\omega))\ \overset{n\to+\infty}{\longrightarrow}\ \mathbb{E}\,f_m(U_1)=\int_{[0,1]^d} f_m(u)\,\lambda_d(du).
$$
On the other hand, it is straightforward that, for every f ∈ C([0,1]^d, R), every n ≥ 1 and every ω ∈ Ω,
$$
\Big|\frac{1}{n}\sum_{k=1}^{n} f_m(U_k(\omega))-\frac{1}{n}\sum_{k=1}^{n} f(U_k(\omega))\Big|\le \|f-f_m\|_{\sup}
$$
and |E(f_m(U_1)) − E(f(U_1))| ≤ ‖f − f_m‖_sup. As a consequence, for every ω ∈ Ω_0 and every m ≥ 1,
$$
\limsup_{n}\Big|\frac{1}{n}\sum_{k=1}^{n} f(U_k(\omega))-\mathbb{E}\,f(U_1)\Big|\le 2\,\|f-f_m\|_{\sup}.
$$
Now, the fact that the sequence (f_m)_{m≥1} is everywhere dense in C([0,1]^d, R) with respect to the sup-norm means precisely that
$$
\lim_{m}\|f-f_m\|_{\sup}=0.
$$

1 When d = 1, an easy way to construct this sequence is to consider the countable family of contin-
uous piecewise affine functions with monotonicity breaks at rational points of the unit interval and
taking rational values at these break points (and at 0 and 1). The density follows from that of the set
Q of rational numbers. When d ≥ 2, one proceeds likewise by considering continuous functions
which are affine on hyper-rectangles with rational vertices which tile the unit hypercube [0, 1]d .
We refer to [45] for more details.

Consequently, for every f ∈ C([0,1]^d, R),
$$
\Big|\frac{1}{n}\sum_{k=1}^{n} f(U_k(\omega))-\mathbb{E}\,f(U_1)\Big|\longrightarrow 0\quad\text{as }n\to+\infty.
$$
This completes the proof since it shows that, for every ω ∈ Ω_0, the expected convergence holds for every continuous function on [0,1]^d. ♦
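A quick numerical illustration of Theorem 4.1 – and of its extension, stated next, to bounded λ_d-a.s. continuous functions – with two test functions of our choosing on [0,1]²:

```python
import numpy as np

rng = np.random.default_rng(6)
U = rng.random((100_000, 2))   # i.i.d. sample, uniformly distributed on [0,1]^2

f = lambda u: np.cos(2 * np.pi * u[:, 0]) * u[:, 1]**2   # continuous, integral = 0
g = lambda u: (u[:, 0] + u[:, 1] <= 1.0)   # triangle indicator: Riemann integrable, integral = 1/2

print(f(U).mean())   # close to 0
print(g(U).mean())   # close to 1/2
```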

Corollary 4.1 Owing to Proposition 4.1, one may replace in (4.3) the set of
continuous functions on [0, 1]d by that of all bounded Borel λd -a.s. continuous
functions on [0, 1]d .

Definition 4.2 A bounded λd -a.s. continuous Borel function f : [0, 1]d → R is


called Riemann integrable.

Remark. In fact, in the above definition one may even replace Borel measurability by "Lebesgue" measurability. This means, for a function f : [0,1]^d → R, replacing (Bor([0,1]^d), Bor(R))-measurability by (L([0,1]^d), Bor(R))-measurability, where L([0,1]^d) denotes the completion of the Borel σ-field on [0,1]^d by the λ_d-negligible sets (see [52], Chap. 13). Such functions are known as Riemann integrable functions on [0,1]^d (see again [52], Chap. 13).

The preceding suggests that, as long as one wishes to compute some quanti-
ties E f (U ) for (reasonably) smooth functions f , we only need to have access to
a sequence that satisfies the above convergence property for its empirical distri-
bution. Furthermore, we know from the fundamental theorem of simulation (see
Theorem 1.2) that this situation is generic since all distributions can be simulated
from a uniformly distributed random variable over [0, 1], at least theoretically. This
leads us to formulate the following definition of a uniformly distributed sequence.

Definition 4.3 A [0,1]^d-valued sequence (ξ_n)_{n≥1} is uniformly distributed on [0,1]^d (or simply uniformly distributed in what follows) if
$$
\frac{1}{n}\sum_{k=1}^{n}\delta_{\xi_k}\ \overset{([0,1]^d)}{\Longrightarrow}\ \mathcal{U}([0,1]^d)\quad\text{as }n\to+\infty.
$$

We need some characterizations of uniform distribution which can be established


more easily on examples. These are provided by the next proposition. To this end we
need to introduce further definitions and notations.

Definition 4.4 (a) We define a componentwise partial order on [0,1]^d, simply denoted by "≤", by: for every x = (x¹,…,x^d), y = (y¹,…,y^d) ∈ [0,1]^d,
$$
x\le y\quad\text{if}\quad x^i\le y^i,\ 1\le i\le d.
$$
(b) The "box" [[x, y]] is defined, for every x = (x¹,…,x^d), y = (y¹,…,y^d) ∈ [0,1]^d, by
$$
[[x,y]]:=\{\xi\in[0,1]^d,\ x\le\xi\le y\}.
$$
Note that [[x, y]] ≠ ∅ if and only if x ≤ y and, if this is the case, [[x, y]] = ∏_{i=1}^d [x^i, y^i].

Notation. In particular, the unit hypercube [0, 1]d can be denoted by [[0, 1]], where
1 = (1, . . . , 1) ∈ Rd .

Proposition 4.2 (Portmanteau Theorem) (See among other references [175] or [48] (2)) Let (ξ_n)_{n≥1} be a [0,1]^d-valued sequence. The following assertions are equivalent.
(i) (ξ_n)_{n≥1} is uniformly distributed on [0,1]^d.
(ii) Convergence of distribution functions: for every x ∈ [0,1]^d,
$$
\frac{1}{n}\sum_{k=1}^{n}\mathbf{1}_{[[0,x]]}(\xi_k)\longrightarrow \lambda_d([[0,x]])=\prod_{i=1}^{d} x^i\quad\text{as }n\to+\infty.
$$
(iii) "Discrepancy at the origin" or "star discrepancy":
$$
D_n^*(\xi):=\sup_{x\in[0,1]^d}\Big|\frac{1}{n}\sum_{k=1}^{n}\mathbf{1}_{[[0,x]]}(\xi_k)-\prod_{i=1}^{d} x^i\Big|\longrightarrow 0\quad\text{as }n\to+\infty. \tag{4.4}
$$
(iv) "Extreme discrepancy":
$$
D_n^\infty(\xi):=\sup_{x,y\in[0,1]^d}\Big|\frac{1}{n}\sum_{k=1}^{n}\mathbf{1}_{[[x,y]]}(\xi_k)-\prod_{i=1}^{d}(y^i-x^i)\Big|\longrightarrow 0\quad\text{as }n\to+\infty. \tag{4.5}
$$

2 The name of this theorem looks mysterious. Intuitively, it can be simply justified by the multiple
properties established as equivalent to the weak convergence of a sequence of probability measures.
However, it is sometimes credited to Jean-Pierre Portmanteau in the paper: Espoir pour l’ensemble
vide, Annales de l’Université de Felletin (1915), 322–325. In fact, one can easily check that no
mathematician called Jean-Pierre Portmanteau ever existed and that there is no university in the
very small French town of Felletin. This reference is just a joke hidden in the bibliography of the
second edition of [45]. The empty set is definitely hopeless…

(v) Weyl's criterion: for every integer p ∈ Z^d \ {0},
$$
\frac{1}{n}\sum_{k=1}^{n} e^{2\tilde{\imath}\pi(p\,|\,\xi_k)}\longrightarrow 0\quad\text{as }n\to+\infty\qquad(\text{with }\tilde{\imath}^{\,2}=-1).
$$
(vi) Bounded Riemann integrable functions: for every bounded λ_d-a.s. continuous Lebesgue-measurable function f : [0,1]^d → R,
$$
\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)\longrightarrow \int_{[0,1]^d} f(x)\,\lambda_d(dx)\quad\text{as }n\to+\infty.
$$

Definition 4.5 The two moduli introduced in items (iii) by (4.4) and (iv) by (4.5)
define the discrepancy at the origin and the extreme discrepancy, respectively.

Remark. By construction these two discrepancies take their values in the unit inter-
val [0, 1].

Sketch of proof. The ingredients of the proof come from the theory of weak convergence of probability measures. For more details in the multi-dimensional setting, we refer to [45] (Chap. 1, devoted to the general theory of weak convergence of probability measures on a Polish space) or [175] (an old but great book devoted to uniformly distributed sequences). We provide hereafter some elements of proof in the one-dimensional case.
The equivalence (i) ⟺ (ii) is simply the characterization of weak convergence of probability measures by the convergence of their distribution functions (3), since the distribution function F_{μ_n} of μ_n = (1/n) Σ_{1≤k≤n} δ_{ξ_k} is given by F_{μ_n}(x) = (1/n) Σ_{1≤k≤n} 1_{{0≤ξ_k≤x}}.
Owing to Dini's second Lemma, this convergence of non-decreasing (distribution) functions is uniform as soon as it holds pointwise, since its pointwise limiting function, F_{U([0,1])}(x) = x, is continuous. This remark yields the equivalence (ii) ⟺ (iii). Although more technical, the d-dimensional extension remains elementary and relies on a similar principle.
The equivalence (iii) ⟺ (iv) is trivial since D_n^*(ξ) ≤ D_n^∞(ξ) ≤ 2 D_n^*(ξ) in one dimension. Note that in d dimensions the inequality reads D_n^*(ξ) ≤ D_n^∞(ξ) ≤ 2^d D_n^*(ξ).
Item (v) is based on the fact that weak convergence of finite measures on [0,1]^d is characterized by that of the sequences of their Fourier coefficients. The Fourier coefficients of a finite measure μ on ([0,1]^d, Bor([0,1]^d)) are defined by

3 The distribution function Fμ of a probability measure μ on [0, 1] is defined by Fμ (x) = μ([0, x]).
One shows that a sequence of probability measures μn converges toward a probability measure μ if
and only if their distribution functions Fμn and Fμ satisfy that Fμn (x) converges to Fμ (x) at every
continuity point x ∈ R of Fμ (see [45], Chap. 1).

$$
c_p(\mu):=\int_{[0,1]^d} e^{2\tilde{\imath}\pi(p\,|\,u)}\,\mu(du),\qquad p\in\mathbb{Z}^d,\ \tilde{\imath}^{\,2}=-1
$$
(see e.g. [156]). One checks that the Fourier coefficients (c_p(λ_{d|[0,1]^d}))_{p∈Z^d} are simply c_p(λ_{d|[0,1]^d}) = 0 if p ≠ 0 and 1 if p = 0.
Item (vi) follows from (i) and Proposition 4.1 since, for every x ∈ [[0,1]], f_x(ξ) := 1_{{ξ≤x}} is continuous outside {x}, which is clearly Lebesgue negligible. Conversely, (vi) implies the pointwise convergence of the distribution functions F_{μ_n} as defined above toward F_{U([0,1])}. ♦

The discrepancy at the origin D_n^*(ξ) plays a central rôle in the theory of uniformly distributed sequences: it does not only provide a criterion for uniform distribution, it also appears as an upper error modulus for numerical integration when the function f has the appropriate regularity (see the Koksma–Hlawka Inequality below).

Definition 4.6 (Discrepancy of an n-tuple) One defines the star discrepancy (at the origin) D_n^*(ξ_1,…,ξ_n) of an n-tuple (ξ_1,…,ξ_n) ∈ ([0,1]^d)^n by (4.4) of the above Proposition, and likewise the extreme discrepancy D_n^∞(ξ_1,…,ξ_n) by (4.5).
The geometric interpretation of the star discrepancy is the following: if x = (x¹,…,x^d) ∈ [[0,1]], the hyper-volume of [[0,x]] is equal to the product x¹⋯x^d and
$$
\frac{1}{n}\sum_{k=1}^{n}\mathbf{1}_{[[0,x]]}(\xi_k)=\frac{\mathrm{card}\,\{k\in\{1,\ldots,n\},\ \xi_k\in[[0,x]]\}}{n}
$$
is simply the frequency with which the first n points ξ_k of the sequence fall into [[0,x]]. The star discrepancy measures the maximal resulting error when x runs over [[0,1]].
Exercise 1 below provides a first example of a uniformly distributed sequence.

Exercises. 1. (Rotations of the torus ([0,1]^d, +)). Let (α_1,…,α_d) ∈ (R \ Q)^d be irrational numbers such that (1, α_1,…,α_d) are linearly independent over Q (4). Let x = (x¹,…,x^d) ∈ R^d. For every n ≥ 1, set
$$
\xi_n:=\big(\{x^i+n\,\alpha_i\}\big)_{1\le i\le d},
$$
where {x} denotes the fractional part of a real number x. Show that the sequence (ξ_n)_{n≥1} is uniformly distributed on [0,1]^d (and can be recursively generated). [Hint: use Weyl's criterion.]
2. More on the one dimensional case. (a) Assume d = 1. Show that, for every n-tuple
(ξ1 , . . . , ξn ) ∈ [0, 1]n

4 This means that if the rational scalars λ_i ∈ Q, i = 0,…,d, satisfy λ_0 + λ_1 α_1 + ⋯ + λ_d α_d = 0, then λ_0 = λ_1 = ⋯ = λ_d = 0. Thus α ∈ R is irrational if and only if (1, α) are linearly independent over Q.

$$
D_n^*(\xi_1,\ldots,\xi_n)=\max_{1\le k\le n}\max\Big(\Big|\frac{k-1}{n}-\xi_k^{(n)}\Big|,\ \Big|\frac{k}{n}-\xi_k^{(n)}\Big|\Big),
$$
where (ξ_k^{(n)})_{1≤k≤n} is the/a reordering of the n-tuple (ξ_1,…,ξ_n) such that k ↦ ξ_k^{(n)} is non-decreasing and {ξ_1^{(n)},…,ξ_n^{(n)}} = {ξ_1,…,ξ_n}. [Hint: where does the càdlàg function x ↦ (1/n) Σ_{k=1}^n 1_{{ξ_k≤x}} − x attain its infimum and supremum?]
(b) Deduce that
$$
D_n^*(\xi_1,\ldots,\xi_n)=\frac{1}{2n}+\max_{1\le k\le n}\Big|\xi_k^{(n)}-\frac{2k-1}{2n}\Big|.
$$

(c) Minimal discrepancy at the origin. Show that the n-tuple with the lowest discrepancy (at the origin) is ((2k−1)/(2n))_{1≤k≤n} (the "mid-point" uniform n-tuple), with discrepancy 1/(2n).
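The closed-form expressions of Exercise 2 make the star discrepancy easy to compute in one dimension. The sketch below (ours) evaluates D_n^* via the formula of (b) for the mid-point n-tuple of (c), for a rotation of the torus as in Exercise 1 (with α = √2), and for an i.i.d. uniform sample:

```python
import numpy as np

def star_discrepancy_1d(xi):
    """D_n^* of a one-dimensional point set, via the sorted-sample formula of (b)."""
    xi = np.sort(np.asarray(xi))
    n = len(xi)
    k = np.arange(1, n + 1)
    return 1 / (2 * n) + np.max(np.abs(xi - (2 * k - 1) / (2 * n)))

n = 1000
midpoints = (2 * np.arange(1, n + 1) - 1) / (2 * n)    # the "mid-point" n-tuple of (c)
kron = np.mod(np.arange(1, n + 1) * np.sqrt(2), 1.0)   # torus rotation, alpha = sqrt(2)
rng = np.random.default_rng(7)
unif = rng.random(n)                                    # i.i.d. uniform sample

d_mid, d_kron, d_unif = (star_discrepancy_1d(p) for p in (midpoints, kron, unif))
print(d_mid)    # = 1/(2n), the minimal value
print(d_kron)   # much smaller than for random points
print(d_unif)   # of order ~ 1/sqrt(n)
```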

4.2 Application to Numerical Integration: Functions with Finite Variation

Definition 4.7 (see [48, 237]) A function f : [0,1]^d → R has finite variation in the measure sense if there exists a signed measure (5) ν on ([0,1]^d, Bor([0,1]^d)) such that ν({0}) = 0 and
$$
\forall\,x\in[0,1]^d,\qquad f(x)=f(\mathbf{1})+\nu([[0,\mathbf{1}-x]])
$$
(or, equivalently, f(x) = f(0) − ν(^c[[0, 1−x]])). The variation V(f) is defined by
$$
V(f):=|\nu|([0,1]^d),
$$
where |ν| is the variation measure of ν.

Exercises. 1. Show that f has finite variation in the measure sense if and only if there exists a signed measure ν with ν({1}) = 0 such that
$$
\forall\,x\in[0,1]^d,\qquad f(x)=f(\mathbf{1})+\nu([[x,\mathbf{1}]])=f(0)-\nu(^c[[x,\mathbf{1}]])
$$
and that its variation is given by |ν|([0,1]^d). This could of course be taken as the definition of finite variation, equivalently to the above one.

5 A signed measure ν on a space (X, X) is a mapping from X to R which satisfies the two axioms of a measure, namely ν(∅) = 0 and, if A_n, n ≥ 1, are pairwise disjoint, then ν(∪_n A_n) = Σ_{n≥1} ν(A_n) (the series is commutatively convergent, hence absolutely convergent). Such a measure is finite and can be decomposed as ν = ν_1 − ν_2, where ν_1, ν_2 are non-negative finite measures supported by disjoint sets, i.e. there exists A ∈ X such that ν_1(A^c) = ν_2(A) = 0 (see [258]).

2. Show that the function f defined on [0,1]² by
$$
f(x^1,x^2):=(x^1+x^2)\wedge 1,\qquad (x^1,x^2)\in[0,1]^2,
$$
has finite variation in the measure sense. [Hint: consider the distribution of (U, 1−U), U ∼ U([0,1]).]
For the class of functions with finite variation, the Koksma–Hlawka Inequality provides an error bound for
$$
\Big|\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)-\int_{[0,1]^d} f(x)\,dx\Big|
$$
based on the star discrepancy.

Proposition 4.3 (Koksma–Hlawka Inequality (1943 when d = 1)) Let (ξ_1,…,ξ_n) be an n-tuple of [0,1]^d-valued vectors and let f : [0,1]^d → R be a function with finite variation (in the measure sense). Then
$$
\Big|\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)-\int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big|\le V(f)\,D_n^*(\xi_1,\ldots,\xi_n).
$$


Proof. Set μn = n1 nk=1 δξk − λd|[0,1]d . It is a signed measure with 0-mass. Then, if
f has finite variation with respect to a signed measure ν,
 
1
n
f (ξk ) − f (x)λd (d x) = f (x)
μn (d x)
n k=1 [0,1]d

= f (1) μn ([0, 1]d ) + ν([[0, 1 − x]]))μn (d x)
[0,1]d
 
= 0+ 1{v≤1−x} ν(dv)  μn (d x)
[0,1]d

= 
μn ([[0, 1 − v]])ν(dv),
[0,1]d

where we used Fubini’s Theorem to interchange the integration order (which is


possible since |ν| ⊗ |μn | is a finite measure). Finally, using the extended triangle
inequality for integrals with respect to signed measures,
104 4 The Quasi-Monte Carlo Method

 
1
n
f (ξk ) − f (x)λd (d x) = 
μn ([[0, 1 − v]])ν(dv)
n k=1 [0,1]d [0,1]d

≤ |
μn ([[0, 1 − v]])| |ν|(dv)
[0,1]d
≤ sup |
μn ([[0, v]])||ν|([0, 1]d )
v∈[0,1]d
= Dn∗ (ξ)V ( f ). ♦

Remarks. • The notion of finite variation in the measure sense has been introduced in [48, 237]. When d = 1, it coincides with the notion of left-continuous functions with finite variation. The most general extension to higher dimensions d ≥ 2 is the notion of "finite variation" in the Hardy and Krause sense. We refer e.g. to [175, 219] for its definition and properties. Essentially, it relies on a geometric extension of the one-dimensional finite variation where meshes of the unit interval are replaced by tilings of [0,1]^d by hyper-rectangles, with an appropriate notion of increment of the function on these hyper-rectangles. However, finite variation in the measure sense is much easier to handle, in particular to establish the above Koksma–Hlawka Inequality. Furthermore, V(f) ≤ V_{H&K}(f). Conversely, one shows that a function with finite variation in the Hardy and Krause sense is λ_d-a.s. equal to a function f̃ having finite variation in the measure sense satisfying V(f̃) ≤ V_{H&K}(f). In one dimension, finite variation in the Hardy and Krause sense exactly coincides with the standard definition of finite variation.
• A classical criterion (see [48, 237]) for finite variation in the measure sense is the following: if f : [0,1]^d → R has a cross derivative ∂^d f/(∂x¹⋯∂x^d) in the distribution sense which is an integrable function, i.e.
$$
\int_{[0,1]^d}\Big|\frac{\partial^d f}{\partial x^1\cdots\partial x^d}(x^1,\ldots,x^d)\Big|\,dx^1\cdots dx^d<+\infty,
$$
then f has finite variation in the measure sense. This class includes the functions f defined by
$$
f(x)=f(\mathbf{1})+\int_{[[0,\mathbf{1}-x]]}\varphi(u^1,\ldots,u^d)\,du^1\ldots du^d,\qquad \varphi\in L^1([0,1]^d,\lambda_d). \tag{4.6}
$$
• One derives from the above Koksma–Hlawka Inequality that
$$
D_n^*(\xi_1,\ldots,\xi_n)=\sup\Big\{\Big|\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)-\int_{[0,1]^d} f\,d\lambda_d\Big|,\ f:[0,1]^d\to\mathbb{R},\ V(f)\le 1\Big\}. \tag{4.7}
$$

The inequality ≤ is obvious. The reverse inequality follows from the fact that the functions f_x(u) = 1_{[[0,1−x]]}(u) = δ_x([[0, 1−u]]) have finite variation as soon as x ≠ 0, with variation V(f_x) = 1. Moreover, (1/n) Σ_{k=1}^n f_0(ξ_k) − ∫_{[0,1]^d} f_0 dλ_d = 0, so that D_n^*(ξ_1,…,ξ_n) = sup_{x∈[0,1]^d} |(1/n) Σ_{k=1}^n f_x(ξ_k) − ∫_{[0,1]^d} f_x dλ_d|.
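The Koksma–Hlawka Inequality can be checked numerically in one dimension (example ours): f(x) = x² has V(f) = ∫₀¹ |f′(x)| dx = 1 and ∫₀¹ f dλ₁ = 1/3, and the integration error along the first n points of the rotation sequence ({k√2})_{k≥1} is indeed dominated by V(f)·D_n^*, the discrepancy being computed with the sorted-sample formula of Exercise 2(b) in Sect. 4.1:

```python
import numpy as np

def star_discrepancy_1d(xi):
    # sorted-sample formula: D_n^* = 1/(2n) + max_k |xi_(k) - (2k-1)/(2n)|
    xi = np.sort(np.asarray(xi))
    n = len(xi)
    k = np.arange(1, n + 1)
    return 1 / (2 * n) + np.max(np.abs(xi - (2 * k - 1) / (2 * n)))

n = 1000
xi = np.mod(np.arange(1, n + 1) * np.sqrt(2), 1.0)   # low-discrepancy point set

err = abs(np.mean(xi**2) - 1 / 3)          # quadrature error for f(x) = x^2
bound = 1.0 * star_discrepancy_1d(xi)      # V(f) * D_n^* with V(f) = 1
print(err, bound)                          # err <= bound, as the Proposition asserts
```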

Exercises. 1. Show that the function f on [0,1]³ defined by
$$
f(x^1,x^2,x^3):=(x^1+x^2+x^3)\wedge 1
$$
does not have finite variation in the measure sense. [Hint: its third derivative in the distribution sense is not a measure.] (6)
2. (a) Show directly that if f satisfies (4.6), then for any n-tuple (ξ_1,…,ξ_n),
$$
\Big|\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)-\int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big|\le \|\varphi\|_{L^1(\lambda_{d|[0,1]^d})}\,D_n^*(\xi_1,\ldots,\xi_n).
$$
n k=1 [0,1]d

(b) Show that if φ also belongs to L^p([0,1]^d, λ_d), for a p ∈ (1, +∞] with Hölder conjugate q, then
$$
\Big|\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)-\int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big|\le \|\varphi\|_{L^p(\lambda_{d|[0,1]^d})}\,D_n^{(q)}(\xi_1,\ldots,\xi_n),
$$
where
$$
D_n^{(q)}(\xi_1,\ldots,\xi_n)=\Big(\int_{[0,1]^d}\Big|\frac{1}{n}\sum_{k=1}^{n}\mathbf{1}_{[[0,x]]}(\xi_k)-\prod_{i=1}^{d} x^i\Big|^{q}\lambda_d(dx)\Big)^{\frac{1}{q}}.
$$
This modulus is called the L^q-discrepancy of (ξ_1,…,ξ_n).


3. Other forms of finite variation and the Koksma–Hlawka inequality. Let f : [0,1]^d → R be defined by

$$f(x) = \nu([[0,x]]), \qquad x\in[0,1]^d,$$

where ν is a signed measure on [0,1]^d. Show that

$$\left| \frac{1}{n}\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx) \right| \le |\nu|([0,1]^d)\, D_n^{\infty}(\xi_1,\ldots,\xi_n).$$
(6) In fact, its variation in the Hardy and Krause sense is not finite either.
4. L¹-discrepancy in one dimension. Let d = 1 and q = 1. Show that the L¹-discrepancy satisfies

$$D_n^{(1)}(\xi_1,\ldots,\xi_n) = \sum_{k=0}^{n} \int_{\xi_k^{(n)}}^{\xi_{k+1}^{(n)}} \left| \frac{k}{n} - u \right| du,$$

where ξ_0^{(n)} = 0, ξ_{n+1}^{(n)} = 1 and ξ_1^{(n)} ≤ ⋯ ≤ ξ_n^{(n)} is the/a reordering of (ξ_1,…,ξ_n).
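In one dimension the segment formula of Exercise 4 can be cross-checked against a direct Riemann-sum evaluation of ∫₀¹ |F_n(u) − u| du, where F_n is the empirical distribution function. A Python sketch (function names ours):

```python
import bisect

def l1_discrepancy(xs):
    """D_n^(1) via the segment formula of Exercise 4 (d = 1)."""
    n = len(xs)
    pts = [0.0] + sorted(xs) + [1.0]
    G = lambda u, c: 0.5 * (u - c) * abs(u - c)   # antiderivative of |u - c|
    return sum(G(pts[k + 1], k / n) - G(pts[k], k / n) for k in range(n + 1))

def l1_discrepancy_direct(xs, m=100_000):
    """Midpoint Riemann sum of the integral of |F_n(u) - u| (cross-check)."""
    n, s = len(xs), sorted(xs)
    return sum(abs(bisect.bisect_right(s, (i + 0.5) / m) / n - (i + 0.5) / m)
               for i in range(m)) / m

print(l1_discrepancy([0.5]))   # 0.25
```

For a single point at 1/2 both evaluations give 1/4, the one-point minimum.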

4.3 Sequences with Low Discrepancy: Definition(s) and Examples
4.3.1 Back Again to the Monte Carlo Method on [0, 1]d

Let (U_n)_{n≥1} be an i.i.d. sequence of random vectors uniformly distributed over [0,1]^d, defined on a probability space (Ω, A, P). We saw that

P(dω)-a.s., the sequence (U_n(ω))_{n≥1} is uniformly distributed,

so that, by the Portmanteau Theorem,

P(dω)-a.s., D_n^*(U_1(ω),…,U_n(ω)) → 0 as n → +∞.
 
So, it is natural to evaluate its (random) discrepancy D_n^*(U_1(ω),…,U_n(ω)) as a measure of its uniform distribution, and one may wonder at which rate it goes to zero. To be more precise, is there a kind of transfer of the Central Limit Theorem (CLT) and the Law of the Iterated Logarithm (LIL) – which rule the weak and strong rates of convergence in the Strong Law of Large Numbers (SLLN), respectively – to this discrepancy modulus? The answer is positive since D_n^*(U_1,…,U_n) satisfies both a CLT and a LIL. Both results are due to Chung (see e.g. [63]).
 
 Chung’s CLT for the star discrepancy. The random sequence Dn∗ U1 , . . . , Un ,
n ≥ 1, satisfies
√   L
n Dn∗ U1 , . . . , Un −→ sup Z x(d) as n → +∞,
x∈[0,1]d

and  
   E supx∈[0,1]d |Z x(d) |
E Dn∗ U1 , . . . , Un ∼ √ as n → +∞,
n

where (Z x(d) )x∈[0,1]d denotes the centered Gaussian multi-index process (or “bridged
hyper-sheet”) with covariance structure given by
4.3 Sequences with Low Discrepancy: Definition(s) and Examples 107

∀ x = (x 1 , . . . , x d ), y = (y 1 , . . . , y d ) ∈ [0, 1]d ,
  d  d  d 
Cov Z x(d) , Z (d)
y = x i ∧ yi − xi yi .
i=1 i=1 i=1

 
Remarks. • A Brownian hyper-sheet is a centered Gaussian field (W_x^{(d)})_{x∈[0,1]^d} characterized by its covariance structure

$$\forall\, x, y\in[0,1]^d, \quad \mathbb{E}\big[W_x^{(d)} W_y^{(d)}\big] = \prod_{i=1}^d x^i\wedge y^i.$$

Then the bridged hyper-sheet is defined by

$$\forall\, x=(x^1,\ldots,x^d)\in[0,1]^d, \quad Z_x^{(d)} = W_x^{(d)} - \Big(\prod_{i=1}^d x^i\Big) W_{(1,\ldots,1)}^{(d)}.$$

• In particular, when d = 1, Z = Z^{(1)} is simply the Brownian bridge over the unit interval [0,1] defined by

$$Z_x = W_x - x W_1, \qquad x\in[0,1],$$

where (W_x)_{x∈[0,1]} is a standard Brownian motion on [0,1]. The distribution of its sup-norm is characterized, through its tail (or survival) distribution function, by

$$\forall\, z>0,\quad \mathbb{P}\Big(\sup_{x\in[0,1]}|Z_x|\ge z\Big) = 2\sum_{k\ge 1}(-1)^{k-1} e^{-2k^2z^2} = 1 - \frac{\sqrt{2\pi}}{z}\sum_{k\ge 1} e^{-\frac{(2k-1)^2\pi^2}{8z^2}} \tag{4.8}$$

(see [72] or [45], Chap. 2). This distribution is also known as the Kolmogorov–Smirnov distribution since it appears as the limit distribution used in the eponymous non-parametric goodness-of-fit statistical test.
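Both series in (4.8) converge extremely fast and can be checked against one another numerically. A Python sketch (function names ours):

```python
import math

def ks_survival_alt(z, terms=50):
    """P(sup|Z_x| >= z) via the alternating series 2*sum (-1)^(k-1) e^(-2 k^2 z^2)."""
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                   for k in range(1, terms + 1))

def ks_survival_theta(z, terms=50):
    """Same quantity via the theta-function form
    1 - (sqrt(2*pi)/z) * sum e^(-(2k-1)^2 pi^2 / (8 z^2))."""
    s = sum(math.exp(-((2 * k - 1) ** 2) * math.pi ** 2 / (8 * z * z))
            for k in range(1, terms + 1))
    return 1 - math.sqrt(2 * math.pi) / z * s

print(ks_survival_alt(1.0))  # ≈ 0.27
```

For z around 1 the alternating series needs only a handful of terms.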
• The above CLT points out a possible somewhat hidden dependence of the Monte Carlo method upon the dimension d. As a matter of fact, let X^ϕ = ϕ(U), U ∼ U([0,1]^d), where ϕ has finite variation V(ϕ). Then, if X̄_n^ϕ denotes for every n ≥ 1 the empirical mean of n independent copies X_k^ϕ = ϕ(U_k), one has, owing to (4.7),

$$\sqrt{n}\,\Big\| \sup_{\varphi:\, V(\varphi)\le 1} \big| \bar X_n^{\varphi} - \mathbb{E}\, X^{\varphi} \big| \Big\|_1 = \sqrt{n}\, \mathbb{E}\, D_n^*(U_1,\ldots,U_n) \to \mathbb{E}\Big[\sup_{x\in[0,1]^d} |Z_x^{(d)}|\Big] \quad \text{as } n\to+\infty.$$
This dependence with respect to the dimension appears more clearly when dealing
with Lipschitz continuous functions. For more details we refer to the third paragraph
of Sect. 7.3.
 
 Chung’s LIL for the star discrepancy. The random sequence Dn∗ U1 , . . . , Un ,
n ≥ 1, satisfies the following Law of the Iterated Logarithm:

2n  
lim D ∗ U1 (ω), . . . , Un (ω) = 1 P(dω)-a.s.
n log(log n) n

At this stage, this suggests a temporary definition of a sequence with low discrepancy on [0,1]^d as a [0,1]^d-valued sequence ξ := (ξ_n)_{n≥1} such that

$$D_n^*(\xi) = o\left(\sqrt{\frac{\log(\log n)}{n}}\right) \quad \text{as } n\to+\infty,$$

which means that its use with a function with finite variation will speed up the convergence rate of numerical integration by the empirical measure with respect to the worst rate of the Monte Carlo simulation.

▸ Exercise. Show, using the standard LIL, the easy part of Chung's LIL, that is

$$\limsup_n \sqrt{\frac{2n}{\log(\log n)}}\; D_n^*\big(U_1(\omega),\ldots,U_n(\omega)\big) \ge 2\sup_{x\in[0,1]^d} \sqrt{\lambda_d([[0,x]])\big(1-\lambda_d([[0,x]])\big)} = 1 \quad \mathbb{P}(d\omega)\text{-a.s.}$$

4.3.2 Roth’s Lower Bounds for the Star Discrepancy

Before providing examples of sequences with low discrepancy, let us first describe
some results concerning the known lower bounds for the asymptotics of the (star)
discrepancy of a uniformly distributed sequence.
The first results are due to Roth (see [257]): there exists a universal constant c_d ∈ (0,+∞) such that, for any [0,1]^d-valued n-tuple (ξ_1,…,ξ_n),

$$D_n^*(\xi_1,\ldots,\xi_n) \ge c_d\, \frac{(\log n)^{\frac{d-1}{2}}}{n}. \tag{4.9}$$

Furthermore, there exists a real constant c̃_d ∈ (0,+∞) such that, for every sequence ξ = (ξ_n)_{n≥1},

$$D_n^*(\xi) \ge \tilde c_d\, \frac{(\log n)^{\frac{d}{2}}}{n} \quad \text{for infinitely many } n. \tag{4.10}$$
This second lower bound can be derived from the first one, using the Hammersley procedure introduced and analyzed in the next section (see the exercise at the end of Sect. 4.3.4).

On the other hand, there exist (see Sect. 4.3.3 below) sequences for which

$$\forall\, n\ge 1, \quad D_n^*(\xi) \le C(\xi)\, \frac{(\log n)^d}{n} \quad \text{where } C(\xi) < +\infty.$$

Based on this, one can derive from the Hammersley procedure (see again Sect. 4.3.4 below) the existence of a real constant C_d ∈ (0,+∞) such that

$$\forall\, n\ge 1,\ \exists\, (\xi_1,\ldots,\xi_n)\in([0,1]^d)^n, \quad D_n^*(\xi_1,\ldots,\xi_n) \le C_d\, \frac{(\log n)^{d-1}}{n}.$$

In spite of more than fifty years of investigation, the gap between these asymptotic lower and upper bounds has not been significantly reduced: it has still not been proved whether there exists a sequence for which C(ξ) = 0, i.e. for which the rate (log n)^d/n would not be optimal.
In fact, it is a widely shared belief in the QMC community that in the above lower bounds (d−1)/2 can be replaced by d − 1 in (4.9) and d/2 by d in (4.10), so that the rate O((log n)^d/n) is commonly considered as the lowest possible rate of convergence to 0 for the star discrepancy of a uniformly distributed sequence. When d = 1, Schmidt proved that this conjecture is true.
This leads to a more convincing definition of a sequence with low discrepancy.

Definition 4.8 A [0,1]^d-valued sequence (ξ_n)_{n≥1} is a sequence with low discrepancy if

$$D_n^*(\xi) = O\left(\frac{(\log n)^d}{n}\right) \quad \text{as } n\to+\infty.$$

For more insight into the other measures of uniform distribution (L^p-discrepancy D_n^{(p)}(ξ), diaphony, etc.), we refer e.g. to [46, 219].

4.3.3 Examples of Sequences

▸ Van der Corput and Halton sequences

Let p_1,…,p_d be the first d prime numbers. The d-dimensional Halton sequence is defined, for every n ≥ 1, by

$$\xi_n = \big(\Phi_{p_1}(n),\ldots,\Phi_{p_d}(n)\big), \tag{4.11}$$

where the so-called “radical inverse functions” Φ_p are defined for every integer p ≥ 2 by

$$\Phi_p(n) = \sum_{k=0}^{r} \frac{a_k}{p^{k+1}},$$

where n = a_0 + a_1 p + ⋯ + a_r p^r, a_i ∈ {0,…,p−1}, a_r ≠ 0, denotes the p-adic expansion of n.
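The radical inverse functions, hence the Halton sequence (4.11), are straightforward to implement. A minimal Python sketch (function names ours):

```python
def radical_inverse(n, p):
    """Phi_p(n): mirror the base-p digits of n across the radix point."""
    x, q = 0.0, 1.0 / p
    while n > 0:
        n, a = divmod(n, p)
        x += a * q
        q /= p
    return x

def halton(n, primes=(2, 3, 5)):
    """n-th term of the d-dimensional Halton sequence (4.11)."""
    return tuple(radical_inverse(n, p) for p in primes)

# For p = 2 this reproduces the first terms of the VdC(2) sequence listed further on.
print([radical_inverse(k, 2) for k in range(1, 8)])
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```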

Theorem 4.2 (see [165]) Let ξ = (ξ_n)_{n≥1} be defined by (4.11). For every n ≥ 1,

$$D_n^*(\xi) \le \frac{1}{n}\prod_{i=1}^d (p_i-1)\,\frac{\log(p_i n)}{\log p_i} = O\left(\frac{(\log n)^d}{n}\right) \quad \text{as } n\to+\infty. \tag{4.12}$$

The proof of this upper-bound, due to H.L. Keng and W. Yu, essentially relies
on the Chinese Remainder Theorem (known as “Théorème chinois” in French and
Sunzi’s Theorem in China, see [165], Sect. 3.5, among others). Since the methods
of proof of such bounds for sequences with low discrepancy are usually highly
technical and rely on combinatorics and Number theory arguments not very familiar
to specialists in Probability theory, we decided to provide a proof for this first upper-
bound which turns out to be more accessible. This proof is postponed to Sect. 12.10.
In fact, the above upper-bound (4.12) remains true if the sequence (ξ_n)_{n≥1} is defined with integers p_1,…,p_d ≥ 2 which are simply pairwise coprime, i.e. gcd(p_i, p_j) = 1, i ≠ j, 1 ≤ i, j ≤ d. In particular, if d = 1, the (one-dimensional) Van der Corput sequence VdC(p) defined by

$$\xi_n = \Phi_p(n)$$

is uniformly distributed with D_n^*(ξ) = O(log n / n) for every integer p ≥ 2.
Several improvements of this classical bound have been established: some non-asymptotic and of numerical interest (see e.g. [222, 278]), some more theoretical. Among them, let us cite the following one established by H. Faure (see [88], see also [219], p. 29):

$$D_n^*(\xi) \le \frac{1}{n}\left( d + \prod_{i=1}^d \left( (p_i-1)\,\frac{\log n}{2\log p_i} + \frac{p_i+2}{2} \right) \right), \qquad n\ge 1,$$

which provides a smaller constant in front of (log n)^d/n than in (4.12).
One easily checks that the first terms of the VdC(2) sequence are as follows:

$$\xi_1 = \tfrac12,\ \xi_2 = \tfrac14,\ \xi_3 = \tfrac34,\ \xi_4 = \tfrac18,\ \xi_5 = \tfrac58,\ \xi_6 = \tfrac38,\ \xi_7 = \tfrac78,\ \ldots$$
▸ Exercise. Let ξ = (ξ_n)_{n≥1} denote the p-adic Van der Corput sequence and let ξ_1^{(n)} ≤ ⋯ ≤ ξ_n^{(n)} be the reordering of its first n terms.
(a) Show that, for every k ∈ {1,…,n},

$$\xi_k^{(n)} \le \frac{k}{n+1}.$$

[Hint: Use an induction on q where n = qp + r, 0 ≤ r ≤ p − 1.]
(b) Derive that, for every n ≥ 1,

$$\frac{\xi_1+\cdots+\xi_n}{n} \le \frac12.$$

(c) One considers the p-adic Van der Corput sequence (ξ̃_n)_{n≥1} starting at 0, i.e.

$$\tilde\xi_1 = 0, \qquad \tilde\xi_n = \xi_{n-1},\ n\ge 2,$$

where (ξ_n)_{n≥1} is the regular p-adic Van der Corput sequence. Show that ξ̃_k^{(n+1)} ≤ (k−1)/(n+1), k = 1,…,n+1. Deduce that the L¹-discrepancy of the sequence ξ̃ satisfies

$$D_n^{(1)}(\tilde\xi) = \frac12 - \frac{\xi_1+\cdots+\xi_n}{n}.$$

▸ Kakutani sequences

This family of sequences was first obtained as a by-product of attempts to generate the Halton sequence as the orbit of an ergodic transform (see [181, 185, 223]). This extension is based on the p-adic addition on [0,1], p an integer, p ≥ 2, also known as the Kakutani adding machine. It is defined on the regular p-adic expansions of the real numbers of [0,1] (7) as the addition, from the left to the right, of the regular p-adic expansions with carry-over. The p-adic regular expansion of 1 is conventionally set to 1 =_p 0.(p−1)(p−1)(p−1)… and 1 is not considered as a p-adic rational number in the rest of this section.
Let ⊕_p denote this addition (or Kakutani's adding machine). Thus, as an example,

0.123333… ⊕_10 0.412777… = 0.535011…


(7) Every real number in [0,1) admits a p-adic expansion x = Σ_{k≥1} x_k/p^k, x_k ∈ {0,…,p−1}, k ≥ 1. If x is not a p-adic rational, this expansion is unique. If x is a p-adic rational number, i.e. of the form x = N/p^r for some r ∈ N and N ∈ {0,…,p^r−1}, then x has two p-adic expansions, one of the form x = Σ_{k=1}^{ℓ} x_k/p^k with x_ℓ ≠ 0 and a second reading x = Σ_{k=1}^{ℓ−1} x_k/p^k + (x_ℓ−1)/p^ℓ + Σ_{k≥ℓ+1} (p−1)/p^k. It is clear that if x is not a p-adic rational number, its p-adic “digits” x_k cannot all be equal to p − 1 for k large enough. By definition, the regular p-adic expansion of x ∈ [0,1) is the unique expansion of x whose digits x_k are infinitely often not equal to p − 1. The case of 1 is specific: its unique p-adic expansion 1 = Σ_{k≥1} (p−1)/p^k will be considered as regular. This regular expansion is denoted by x =_p 0.x_1x_2…x_k… for every x ∈ [0,1].
In a more formal way, if x, y ∈ [0,1] have respective regular p-adic expansions 0.x_1x_2⋯x_k⋯ and 0.y_1y_2⋯y_k⋯, then x ⊕_p y is defined by x ⊕_p y = Σ_{k≥1} z_k/p^k, where the {0,…,p−1}-valued sequence (z_k)_{k≥1} is given by

$$z_k = x_k + y_k + \varepsilon_{k-1} \ \mathrm{mod}\ p \quad \text{and} \quad \varepsilon_k = 1_{\{x_k+y_k+\varepsilon_{k-1}\,\ge\, p\}}, \quad k\ge 1,$$

with ε_0 = 0.
– If x or y is a p-adic rational number and x, y ≠ 1, then one easily checks that z_k = x_k or z_k = y_k for every large enough k, so that this defines a regular expansion, i.e. the digits (x ⊕_p y)_k of x ⊕_p y are (x ⊕_p y)_k = z_k, k ≥ 1.
– If both x and y are not p-adic rational numbers, then it may happen that z_k = p − 1 for every large enough k, so that Σ_{k≥1} z_k/p^k is not the regular p-adic expansion of x ⊕_p y. This is the case, for example, in the following pseudo-sum:

0.123333… ⊕_10 0.412666… = 0.535999… = 0.536,

where z_1 = 5, z_2 = 3, z_3 = 5 and z_k = 9, k ≥ 4.


Then, for every y ∈ [0,1], one defines the associated p-adic rotation with angle y by

$$T_{p,y}(x) := x \oplus_p y.$$
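On truncated digit strings, the adding machine is a short carry loop; since the carry propagates from left to right, the digits computed from a finite prefix are exact. A Python sketch (function name ours):

```python
def kakutani_add(x_digits, y_digits, p):
    """x (+)_p y on lists of p-adic digits read from the radix point:
    z_k = x_k + y_k + eps_{k-1} mod p, carry eps_k = 1 iff the sum reaches p."""
    m = max(len(x_digits), len(y_digits)) + 1   # one spare slot for a trailing carry
    xs = list(x_digits) + [0] * (m - len(x_digits))
    ys = list(y_digits) + [0] * (m - len(y_digits))
    z, carry = [], 0
    for xk, yk in zip(xs, ys):
        s = xk + yk + carry
        z.append(s % p)
        carry = 1 if s >= p else 0
    return z

# Book example: 0.123333... (+)_10 0.412777... = 0.535011...
print(kakutani_add([1, 2, 3, 3, 3, 3], [4, 1, 2, 7, 7, 7], 10)[:6])  # [5, 3, 5, 0, 1, 1]

# T_{2,1/2} applied to 1/2: 0.1_2 (+)_2 0.1_2 = 0.01_2 = 1/4
print(kakutani_add([1], [1], 2))  # [0, 1]
```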

Proposition 4.4 (see [185, 223]) Let p_1,…,p_d denote the first d prime numbers, y_1,…,y_d ∈ (0,1), where y_i is a p_i-adic rational number satisfying y_i ≥ 1/p_i, i = 1,…,d, and x_1,…,x_d ∈ [0,1]. Then the sequence (ξ_n)_{n≥1} defined by

$$\xi_n := \big( T_{p_i,y_i}^{n-1}(x_i) \big)_{1\le i\le d}, \qquad n\ge 1,$$

has a discrepancy at the origin D_n^*(ξ) satisfying an upper-bound similar to (4.12) for the Halton sequence, namely, for every integer n ≥ 1,

$$D_n^*(\xi) \le \frac{1}{n}\prod_{i=1}^d \left( d-1 + (p_i-1)\,\frac{\log(p_i n)}{\log p_i} \right).$$

Remarks. • Note that if y_i = x_i = 1/p_i =_{p_i} 0.1, i = 1,…,d, the sequence ξ is simply the regular Halton sequence.
• This upper-bound is obtained by adapting the proof of Theorem 4.2 (see Sect. 12.10) and we do not claim it is optimal as a universal bound when the starting values x_i and the angles y_i vary. Its main interest is to provide a large family in which sequences with better performances than the regular Halton sequences are “hidden”, at least at finite range.
ℵ Practitioner’s corner.  Iterative generation. One asset of this approach is to


provide an easy recursive form for the computation of ξn since

ξn = ξn−1 ⊕ p (y1 , . . . , yd ),

where, with a slight abuse of notation, ⊕ p denotes the componentwise pseudo-


addition with p = ( p1 , . . . , pd ). Once the pseudo-addition has been coded, this
method is tremendously fast.
Appropriate choices of the starting vector and the “angle” can significantly reduce
the discrepancy, at least in a finite range (see below).
▸ A “super-Halton” sequence. Heuristic arguments too lengthy to be developed here suggest that a good choice for the “angles” y_i and the starting values x_i is

$$y_i = 1/p_i, \quad x_i = 1/5,\ i = 3, 4, \quad x_i = \frac{2p_i - 1 - \sqrt{(p_i+2)^2 + 4p_i}}{3},\ i \neq 3, 4.$$

This specified Kakutani – or “super-Halton” – sequence is much easier to implement than the Sobol' sequences and behaves quite well up to medium dimensions d, say 1 ≤ d ≤ 20 (see [48, 237]).
▸ Faure sequences

These sequences were introduced in [89]. Let p be the smallest prime integer satisfying p ≥ d. The d-dimensional Faure sequence is defined for every n ≥ 1 by

$$\xi_n = \Big( \Phi_p(n-1),\ C_p\big(\Phi_p(n-1)\big),\ \ldots,\ C_p^{d-1}\big(\Phi_p(n-1)\big) \Big),$$

where Φ_p still denotes the p-adic radical inverse function and, for every p-adic rational number u with regular p-adic expansion u = Σ_{k≥0} u_k p^{−(k+1)} ∈ [0,1] (note that u_k = 0 for large enough k),

$$C_p(u) = \sum_{k\ge 0} \underbrace{\Big( \sum_{j\ge k} \binom{j}{k} u_j \ \mathrm{mod}\ p \Big)}_{\in\{0,\ldots,p-1\}}\ p^{-(k+1)}.$$
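The construction can be sketched in a few lines for small d, applying the successive powers C_p^{i−1} (Pascal matrices mod p) to the base-p digits of n − 1. A minimal Python version (function names ours; a production implementation would precompute the matrices):

```python
import math

def digits(n, p):
    """Base-p digits of n, least significant first."""
    ds = []
    while n > 0:
        n, a = divmod(n, p)
        ds.append(a)
    return ds

def faure_point(n, d, p):
    """n-th point (n >= 1) of the d-dimensional Faure sequence in base p
    (p a prime >= d): coordinate i applies C_p^(i-1) to the digits, i.e.
    iterates y_k = sum_{j>=k} binom(j, k) u_j mod p."""
    u = digits(n - 1, p)
    point = []
    for _ in range(d):
        point.append(sum((a * p ** -(k + 1) for k, a in enumerate(u)), 0.0))
        u = [sum(math.comb(j, k) * u[j] for j in range(k, len(u))) % p
             for k in range(len(u))]
    return tuple(point)

print([faure_point(n, 2, 2) for n in range(1, 5)])
# [(0.0, 0.0), (0.5, 0.5), (0.25, 0.75), (0.75, 0.25)]
```

These first four points form a (0, 2, 2)-net in base 2: every dyadic box of area 1/4 contains exactly one of them.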

The discrepancy at the origin of these sequences satisfies (see [89])

$$D_n^*(\xi) \le \frac{1}{n}\left( \frac{1}{d!} \left(\frac{p-1}{2\log p}\right)^d (\log n)^d + O\big((\log n)^{d-1}\big) \right). \tag{4.13}$$

It was later shown in [219] that they share the following P_{p,d}-property (see also [48], p. 79).
Proposition 4.5 For every m ∈ N and every ℓ ∈ N*, for any r_1,…,r_d ∈ N such that r_1 + ⋯ + r_d = m and every x_1,…,x_d ∈ N such that x_k ≤ p^{r_k} − 1, k = 1,…,d, there is exactly one term among ξ_{ℓp^m+i}, i = 0,…,p^m − 1, lying in the hyper-cube

$$\prod_{k=1}^d \big[\, x_k\, p^{-r_k},\ (x_k+1)\, p^{-r_k} \big).$$

This property is a special case of the (t,d)-sequences in base p defined in [219] (see Definitions 4.1 and 4.2, p. 48), corresponding to t = 0.
The prominent feature of Faure’s estimate (4.13) is that the coefficient of the
leading error term (in the (log n)k -scale) satisfies
d
1 p−1
lim = 0,
d d! 2 log p
 
which seems to suggest that the rate is asymptotically better than O (lognn) as d
d

increases.
The above convergence result is an easy consequence of Stirling's formula and Bertrand's conjecture, which says that, for every integer d > 1, there exists a prime number p such that d < p < 2d (8). To be more precise, the function x ↦ (x−1)/log x being increasing on (0,+∞),

$$\frac{1}{d!}\left(\frac{p-1}{2\log p}\right)^d \le \frac{1}{d!}\left(\frac{2d-1}{2\log 2d}\right)^d \approx \left(\frac{e}{\log 2d}\right)^d \frac{1}{\sqrt{2\pi d}} \to 0 \quad \text{as } d\to+\infty.$$

A non-asymptotic upper-bound is provided in [237] (due to Y.-J. Xiao in his PhD thesis [278]):

$$\forall\, n\ge 1, \quad D_n^*(\xi) \le \frac{1}{n}\,\frac{1}{d!}\left(\frac{p-1}{2}\right)^d \left( \frac{\log(2n)}{\log p} + d + 1 \right)^d.$$

Note that this bound has the same coefficient in its leading term as the asymptotic error bound obtained by Faure. Unfortunately, from a numerical point of view, it becomes efficient only for very large n: thus, if d = p = 5 and n = 1 000, the right-hand side is approximately 1.18, which is of little interest if one keeps in mind that, by construction, the discrepancy takes its values in [0,1]. This can be explained by the form of the “constant” term in the (log n)^k-scale in the above upper-bound: one has

8 Bertrand’s
conjecture was stated in 1845 but it is no longer a conjecture since it was proved by P.
Tchebychev in 1850.
$$\lim_{d\to+\infty} \frac{(d+1)^d}{d!}\left(\frac{p-1}{2}\right)^d = +\infty.$$

A better bound is given in Y.-J. Xiao's PhD thesis [278], provided n ≥ p^{d+2}/2. But once again it is of little interest for applications when d increases since, for p ≥ d, p^{d+2}/2 ≥ d^{d+2}/2.
 The Sobol’ sequences
The discovery by Ilia Sobol’ of his eponymous sequences has clearly been, not
only a pioneering but also a striking contribution to sequences with low discrepancy
and quasi-Monte Carlo simulation. The publication of his work goes back to 1967
(see [265]). Although it was soon translated into English, it remained ignored to a
large extent for many years, at least by practitioners in the western world. From a
purely theoretical point of view, it is the first family of sequences ever discovered
satisfying the Pp,d -property. In the modern classification of sequences with low dis-
crepancy they appear as subfamilies of the largest family of Niederreiter’s sequences
(see below), discovered later on. However, now that they are widely used especially
for (Quasi-)Monte Carlo simulation in Quantitative Finance, their impact, among
theorists and practitioners, is unrivaled, compared to any other uniformly distributed
sequence.
In terms of implementation, Antonov and Saleev proposed in [10] a new imple-
mentation based on the Gray code, which dramatically speeds up the computation
of these sequences.
In practice, the construction of the Sobol’ sequences relies on “direction numbers”,
which turn out to be crucial for the efficiency of the sequences. Not all admissible
choices are equivalent and many authors proposed efficient initializations of these
numbers after Sobol’ himself (see [266]), who have proposed a solution up to d = 51
in 1976. Implementations are also available in [247]. For recent developments on
this topic, see e.g. [276].
Even if (see below) some sequences have been shown to achieve slightly better “academic” performances, no major progress has been made since in the search for good sequences with low discrepancy. Sobol' sequences remain unrivaled among practitioners and are massively used in Quasi-Monte Carlo simulation, now becoming a synonym for QMC in the quantitative finance world.
The main advances come from post-processing of the sequences like randomiza-
tion and/or scrambling. These points are briefly discussed in Sect. 4.4.2.
▸ The Niederreiter sequences

These sequences were designed as generalizations of the Faure and Sobol' sequences (see [219]).
Let q ≥ d be the smallest primary integer not lower than d (a primary integer is an integer of the form q = p^r with p prime, r ∈ N*). The (0,d)-Niederreiter sequence is defined for every integer n ≥ 1 by

$$\xi_n = \big( \Psi_{q,1}(n-1),\ \Psi_{q,2}(n-1),\ \ldots,\ \Psi_{q,d}(n-1) \big),$$

where

$$\Psi_{q,i}(n) := \sum_j \psi^{-1}\Big( \sum_k C^{(i)}_{(j,k)}\, \psi(a_k) \Big)\, q^{-j},$$

ψ : {0,…,q−1} → F_q is a one-to-one correspondence between {0,…,q−1} (to be specified) and the finite field F_q with cardinality q, satisfying ψ(0) = 0, the a_k denote the q-adic digits of n, and

$$C^{(i)}_{(j,k)} = \binom{k}{j-1}\,(i-1)^{k-j+1}.$$

This quite general family of sequences contains both the Faure and the Sobol' sequences. To be more precise:
• when q is the smallest prime number not less than d, one retrieves the Faure sequences,
• when q = 2^r, with 2^{r−1} < d ≤ 2^r, the sequence coincides with the Sobol' sequences (in their original form).
The main feature of Niederreiter sequences is that they are (t,d)-sequences in base q and consequently have a discrepancy satisfying an upper-bound with a structure similar to that of the Faure or Sobol' sequences (which correspond to t = 0). For a precise definition of (t,d)-sequences (and (t,m,d)-nets), as well as an in-depth analysis of their properties in terms of discrepancy, we refer to [219]. Note that (0,d)-sequences in base p reduce to the P_{p,d}-property mentioned above. Records in terms of low discrepancy are usually held within this family, only to be beaten later by other sequences from the same family.

4.3.4 The Hammersley Procedure

The Hammersley procedure is a canonical method for designing a [0,1]^d-valued n-tuple from a [0,1]^{d−1}-valued one, with a discrepancy at the origin ruled by that of the latter (d−1)-dimensional one.

Proposition 4.6 Let d ≥ 2. Let (ζ_1,…,ζ_n) be a [0,1]^{d−1}-valued n-tuple. Then the [0,1]^d-valued n-tuple defined by

$$(\xi_k)_{1\le k\le n} = \Big( \zeta_k,\ \frac{k}{n} \Big)_{1\le k\le n}$$

satisfies

$$\frac{\max_{1\le k\le n} k\, D_k^*(\zeta_1,\ldots,\zeta_k)}{n} \;\le\; D_n^*\big((\xi_k)_{1\le k\le n}\big) \;\le\; \frac{1 + \max_{1\le k\le n} k\, D_k^*(\zeta_1,\ldots,\zeta_k)}{n}. \tag{4.14}$$
4.3 Sequences with Low Discrepancy: Definition(s) and Examples 117

Proof. It follows from the very definition of the discrepancy at the origin that

$$D_n^*\big((\xi_k)_{1\le k\le n}\big) = \sup_{(x,y)\in[0,1]^{d-1}\times[0,1]} \left| \frac1n \sum_{k=1}^n 1_{\{\zeta_k\in[[0,x]],\ \frac kn\le y\}} - y \prod_{i=1}^{d-1} x^i \right|$$
$$= \sup_{x\in[0,1]^{d-1}}\ \sup_{y\in[0,1]} \left| \frac1n \sum_{k=1}^n 1_{\{\zeta_k\in[[0,x]],\ \frac kn\le y\}} - y \prod_{i=1}^{d-1} x^i \right|$$
$$= \max_{1\le k\le n} \left[ \sup_{x\in[0,1]^{d-1}} \left| \frac1n \sum_{\ell=1}^{k} 1_{\{\zeta_\ell\in[[0,x]]\}} - \frac kn \prod_{i=1}^{d-1} x^i \right| \vee \sup_{x\in[0,1]^{d-1}} \left| \frac1n \sum_{\ell=1}^{k-1} 1_{\{\zeta_\ell\in[[0,x]]\}} - \frac kn \prod_{i=1}^{d-1} x^i \right| \right],$$

since one can easily check that the functions of the form y ↦ |(1/n) Σ_{k=1}^n a_k 1_{{k/n ≤ y}} − b y| (a_k, b ≥ 0) attain their supremum either at some y = k/n or at its left limit “y = (k/n)−”, k ∈ {1,…,n}. Consequently,

$$D_n^*\big((\xi_k)_{1\le k\le n}\big) = \frac1n \max_{1\le k\le n} \left\{ k\, D_k^*(\zeta_1,\ldots,\zeta_k) \vee (k-1) \sup_{x\in[0,1]^{d-1}} \left| \frac{1}{k-1}\sum_{\ell=1}^{k-1} 1_{\{\zeta_\ell\in[[0,x]]\}} - \frac{k}{k-1}\prod_{i=1}^{d-1} x^i \right| \right\}$$
$$\le \frac1n \max_{1\le k\le n} \Big\{ k\, D_k^*(\zeta_1,\ldots,\zeta_k) \vee \big( (k-1)\, D_{k-1}^*(\zeta_1,\ldots,\zeta_{k-1}) + 1 \big) \Big\}$$
$$\le \frac{1 + \max_{1\le k\le n} k\, D_k^*(\zeta_1,\ldots,\zeta_k)}{n}. \tag{4.15}$$

The lower bound is obvious from (4.15). ♦

Corollary 4.2 Let d ≥ 1. There exists a real constant C_d ∈ (0,+∞) such that, for every n ≥ 1, there exists an n-tuple (ξ_1^n,…,ξ_n^n) ∈ ([0,1]^d)^n satisfying

$$D_n^*(\xi_1^n,\ldots,\xi_n^n) \le C_d\, \frac{1 + (\log n)^{d-1}}{n}.$$

Proof. If d = 1, a solution is given for a fixed integer n ≥ 1 by setting ξ_k = k/n, k = 1,…,n (or ξ_k = (2k−1)/(2n), k = 1,…,n, etc.). If d ≥ 2, it suffices to apply for every n ≥ 1 the Hammersley procedure to any (d−1)-dimensional sequence ζ = (ζ_n)_{n≥1} with low discrepancy in the sense of Definition 4.8. In this case, the constant C_d can be taken equal to 2(c_{d−1}(ζ) ∨ 1), where D_k^*(ζ) ≤ c_{d−1}(ζ)(1 + (log k)^{d−1})/k for every k ≥ 1. ♦
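Both the construction and the bound (4.14) are easy to check numerically, e.g. in dimension d = 2 with ζ the VdC(2) sequence. A Python sketch (function names ours); note that scanning a finite grid only yields a lower estimate of the supremum defining D_n^*, so the estimate must fall below the upper-bound of (4.14):

```python
def radical_inverse(n, p):
    x, q = 0.0, 1.0 / p
    while n > 0:
        n, a = divmod(n, p)
        x += a * q
        q /= p
    return x

def star_disc_1d(xs):
    """Exact 1-d star discrepancy from the sorted sample."""
    xs, n = sorted(xs), len(xs)
    return max(max(k / n - x, x - (k - 1) / n)
               for k, x in enumerate(xs, start=1))

n = 64
zeta = [radical_inverse(k, 2) for k in range(1, n + 1)]   # VdC(2) prefix
pts = [(zeta[k - 1], k / n) for k in range(1, n + 1)]     # Hammersley n-tuple

bound = (1 + max(k * star_disc_1d(zeta[:k]) for k in range(1, n + 1))) / n

m = 100  # grid scan: a lower estimate of D_n* of the 2-d point set
est = max(abs(sum(1 for (u, v) in pts if u <= i / m and v <= j / m) / n
              - (i / m) * (j / m))
          for i in range(1, m + 1) for j in range(1, m + 1))

assert est <= bound
print(round(est, 4), "<=", round(bound, 4))
```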

The main drawback of this procedure is that if one starts from a sequence with low
discrepancy (often defined recursively), one loses the “telescopic” feature of such a
sequence. If one wishes, for a given function f defined on [0, 1]d , to increase n in
order to improve the accuracy of the approximation, all the terms of the sum in the
empirical mean have to be re-computed.

 Exercises. 1. (Roth’s lower-bound). Derive the theoretical lower bound (4.10)


for infinite sequences from the one in (4.9).
2. Extension of Hammersley’s procedure.
(a) Let (ξ1 , . . . , ξn ) be a [0, 1]d -valued n-tuple such that 0 ≤ ξ1d < ξ2d < . . .
< ξnd ≤ 1. Prove that

max1≤k≤n k Dk∗ (ξ11:d−1 , . . . , ξk1:d−1 )


Dn∗ (ξ11:d−1 , . . . , ξn1:d−1 ) ≤ Dn∗ (ξ1 , . . . , ξn ) ≤
n
+ Dn∗ (ξ1d , . . . , ξnd ),

where ξk1:d−1 = (ξk1 , . . . , ξnd−1 ). [Hint: Follow the lines of the proof of Proposition 4.6,

using that the supremum of the càdlàg function ϕ : y → n1 nk=1 ak 1{ξkd ≤y} − b y
(ak , b ≥ 0) is max1≤k≤n |ϕ(ξkd )| ∨ |ϕ((ξkd )− )| ∨ |ϕ(1)|.]
(b) Deduce that the upper-bound in (4.14) can be slightly improved by an appropriate
choice of the dth component of the terms of the n-tuple (ξk )1≤k≤n [Hint: what can
be better that ( nk )1≤k≤n in one dimension?]

4.3.5 Pros and Cons of Sequences with Low Discrepancy

The use of sequences with low discrepancy to compute integrals instead of the Monte Carlo method (based on pseudo-random numbers) is known as the Quasi-Monte Carlo method (QMC). This terminology extends to the so-called good lattice points, not described here (see [219]).
The Pros.
▸ The main attractive feature of sequences with low discrepancy is the combination of the Koksma–Hlawka inequality with the rate of decay of the discrepancy. It suggests that the QMC method is almost dimension free. This should be tempered in practice: standard a priori bounds for the discrepancy do not allow for the use of this inequality to provide “100%-confidence intervals”.
▸ When the sequence ξ can be obtained as the orbit ξ_n = T^{n−1}(ξ_1), n ≥ 1, of an ergodic or, better, a uniquely ergodic transform T : [0,1]^d → [0,1]^d (9), one shows that the integration rate of a function f : [0,1]^d → R can be O(1/n) if f is a coboundary for T, i.e. can be written

$$f - \int_{[0,1]^d} f(u)\,du = g - g\circ T,$$

where g is a bounded Borel function (see e.g. [223]). As a matter of fact, for such coboundaries,

$$\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(u)\,du = \frac{g(\xi_1) - g(\xi_{n+1})}{n} = O\Big(\frac1n\Big).$$

The main difficulty is to determine practical criteria.
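As an illustration, for the rotation of the torus T(x) = {x + α}, the function f(x) = cos(2πx) is a coboundary: solving f = g − g∘T in Fourier terms gives g(x) = Re(e^{2iπx}/(1 − e^{2iπα})), and the telescoping is exact. A Python check (the choice of f, α and x_0 is ours):

```python
import cmath, math

alpha = math.sqrt(2) - 1                       # irrational rotation angle
c = 1 / (1 - cmath.exp(2j * math.pi * alpha))  # Fourier solution of f = g - g∘T

def g(x):
    return (c * cmath.exp(2j * math.pi * x)).real

def f(x):                                      # coboundary: integral of f is 0
    return math.cos(2 * math.pi * x)

x0, n = 0.3, 10_000
xs = [(x0 + k * alpha) % 1.0 for k in range(n + 1)]   # orbit of the rotation
lhs = sum(f(x) for x in xs[:n]) / n                   # empirical mean of f
rhs = (g(xs[0]) - g(xs[n])) / n                       # telescoped form, O(1/n)
print(abs(lhs - rhs), abs(lhs))
```

The empirical mean is of order 1/n, far faster than the generic √(log log n / n) Monte Carlo rate.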


The Kakutani transforms (rotations with respect to ⊕_p) and the rotations of the torus are typical examples of ergodic transforms, the former being uniquely ergodic (keep in mind that the p-adic Van der Corput sequence is an orbit of the Kakutani transform T_{p,1/p}). The Kakutani transforms – or, to be precise, their representation on [0,1]^d – are unfortunately not continuous. However, their (original) natural representations on the space of power series with coefficients in {0,…,p−1}, endowed with its natural compact metric topology, definitely are (see [223]). One can take advantage of this unique ergodicity and characterize their coboundaries. Easy-to-check criteria based on the rate of decay of the Fourier coefficients c_p(f), p = (p^1,…,p^d) ∈ Z^d, of a function f as |p^1| × ⋯ × |p^d| goes to infinity have been established (see [223, 278, 279]), at least for the p-adic Van der Corput

(9) Let (X, X, μ) be a probability space. A mapping T : (X, X) → (X, X) is ergodic if
(i) μ ∘ T^{−1} = μ, i.e. μ is invariant under T,
(ii) ∀ A ∈ X, T^{−1}(A) = A ⟹ μ(A) = 0 or 1.
Then Birkhoff's pointwise ergodic theorem (see [174]) implies that, for every f ∈ L¹(μ),

$$\mu(dx)\text{-a.s.}\quad \frac1n\sum_{k=1}^n f\big(T^{k-1}(x)\big) \longrightarrow \int_X f\,d\mu.$$

The mapping T is uniquely ergodic if μ is the only T-invariant probability measure. If X is a topological space, X = Bor(X) and T is continuous, then, for any continuous function f : X → R,

$$\sup_{x\in X}\left| \frac1n\sum_{k=1}^n f\big(T^{k-1}(x)\big) - \int_X f\,d\mu \right| \longrightarrow 0 \quad \text{as } n\to+\infty.$$

In particular, it shows that any orbit (T^{n−1}(x))_{n≥1} is μ-distributed. When X = [0,1]^d and μ = λ_d, one retrieves the notion of uniformly distributed sequence. This provides a powerful tool for devising and studying uniformly distributed sequences. This is the case, e.g., for the Kakutani sequences or the rotations of the torus.
sequences in one dimension and other orbits of the Kakutani transforms. Similar
results also exist for the rotations of the torus.
Extensive numerical tests on problems involving some smooth (periodic) func-
tions on [0, 1]d , d ≥ 2, have been carried out, see e.g. [48, 237]. They suggest that
this improvement still holds in higher dimensions, at least partially.
▸ It remains that, at least for structural dimensions d up to a few tens, Quasi-Monte Carlo integration empirically usually outperforms regular Monte Carlo simulation, even if the integrated function does not have finite variation. We refer to Fig. 4.1 further on (see Application to the Box–Muller method), where E|X^1 − X^2|, X = (X^1, X^2) ∼ N(0; I_2), is computed by using a simple Halton(2,3) sequence and pseudo-random numbers. More generally, we refer to [237] for extensive numerical comparisons between Quasi- and regular Monte Carlo methods.
This concludes the pros.
The cons.
 As concerns the cons, the first is that all the non-asymptotic bounds for the dis-
crepancy at the origin are very poor from a numerical point of view. We again refer
to [237] for some examples which emphasize that these bounds cannot be relied on
to provide (deterministic) error intervals for numerical integration. This is a major
drawback compared to the regular Monte Carlo method, which automatically pro-
vides, almost for free, a confidence interval at any desired level.
 The second significant drawback concerns the family of functions for which we
know that the QMC numerical integration is speeded up thanks to the Koksma–
Hlawka Inequality. This family – mainly the functions with finite variation over
[0, 1]d in some sense – somehow becomes sparser and sparser in the space of Borel
functions as the dimension d increases since the requested condition becomes more
and more stringent (see Exercise 2 immediately before the Koksma–Hlawka formula
in Proposition 4.3 and Exercise 1 which follows).
More importantly, if one is interested in integrating functions sharing a “standard”
regularity like Lipschitz continuity, the following theorem due to Proïnov ([248])
shows that the curse of dimensionality comes back into the game in a striking way,
without any possible escape.

Theorem 4.3 (Proïnov) Assume R^d is equipped with the ℓ^∞-norm defined by |(x^1,…,x^d)|_∞ := max_{1≤i≤d} |x^i|. Let

$$w(f,\delta) := \sup_{x,y\in[0,1]^d,\ |x-y|_\infty\le\delta} |f(x)-f(y)|, \qquad \delta\in(0,1),$$

denote the uniform continuity modulus of f (with respect to the ℓ^∞-norm).
(a) Let (ξ_1,…,ξ_n) ∈ ([0,1]^d)^n. For every continuous function f : [0,1]^d → R,

$$\left| \int_{[0,1]^d} f(x)\,dx - \frac1n\sum_{k=1}^n f(\xi_k) \right| \le C_d\, w\Big(f,\ D_n^*(\xi_1,\ldots,\xi_n)^{\frac1d}\Big),$$

where C_d ∈ (0,+∞) is a universal optimal real constant only depending on d.
In particular, if f is Lipschitz continuous with coefficient [f]^∞_Lip := sup_{x≠y∈[0,1]^d} |f(x)−f(y)|/|x−y|_∞, then

$$\left| \int_{[0,1]^d} f(x)\,dx - \frac1n\sum_{k=1}^n f(\xi_k) \right| \le C_d\, [f]^\infty_{\mathrm{Lip}}\, D_n^*(\xi_1,\ldots,\xi_n)^{\frac1d}.$$

(b) If d = 1, C_d = 1 and if d ≥ 2, C_d ∈ [1, 4].


Claim (b) should be understood in the following sense: there exist (families of)
functions f with Lipschitz coefficient [ f ]Lip = 1 for which the above inequality
holds (at least asymptotically) as an equality for a value of the constant Cd equal to 1
or lying in [1, 4] depending on the dimension d.

 Exercise. Show, using the Van der Corput sequences starting at 0 (see the exercise
in the paragraph devoted to Van der Corput and Halton sequences in Sect. 4.3.3) and
the function f (x) = x on [0, 1], that the above Proïnov Inequality cannot be improved
for Lipschitz continuous functions even in one dimension. [Hint: Reformulate some
results of the Exercise in Sect. 4.3.3.]
▸ A third drawback of using QMC for numerical integration is that all functions need to be defined on unit hypercubes. One way to partially get rid of this restriction may be to consider integration over domains C ⊂ [0,1]^d having a regular boundary in the Jordan sense (10). Then a Koksma–Hlawka-like inequality holds true:

$$\left| \int_C f(x)\,\lambda_d(dx) - \frac1n\sum_{k=1}^n 1_{\{\xi_k\in C\}}\, f(\xi_k) \right| \le V(f)\, \widetilde D_n^{\infty}(\xi)^{\frac1d},$$

where V(f) denotes the variation of f (in the Hardy and Krause or measure sense) and D̃_n^∞(ξ) denotes the extreme discrepancy of (ξ_1,…,ξ_n) (see again [219]). The simple fact of integrating over such a set annihilates the low discrepancy effect (at least from a theoretical point of view).

Exercise. Prove Proïnov's theorem when $d=1$. [Hint: read the next chapter and compare the star discrepancy modulus and the $L^1$-mean quantization error.]

This suggests that the rate of numerical integration of Lipschitz continuous functions in dimension $d$ by a sequence with low discrepancy is $O\Big(\frac{\log n}{n^{1/d}}\Big)$ as $n\to+\infty$, or $O\Big(\frac{(\log n)^{\frac{d-1}{d}}}{n^{1/d}}\Big)$ when considering, for a fixed $n$, an $n$-tuple designed by the Hammersley method. This emphasizes that sequences with low discrepancy are not spared by the curse of dimensionality when implemented on functions with standard regularity…
(10) Namely that for every $\varepsilon>0$, $\lambda_d\big(\{u\in[0,1]^d : \mathrm{dist}(u,\partial C)<\varepsilon\}\big)\le\kappa_C\,\varepsilon$.
4.3.6 ℵ Practitioner's Corner

WARNING! The dimensional trap. Although it is not strictly speaking a drawback, this last con is undoubtedly the most dangerous trap, at least for beginners: a given (one-dimensional) sequence $(\xi_n)_{n\ge1}$ does not "simulate" independence, as emphasized by the classical exercise below.
Exercise. (a) Let $\xi=(\xi_n)_{n\ge1}$ denote the dyadic Van der Corput sequence. Show that, for every $n\ge0$,

$$\xi_{2n+1}=\xi_{2n}+\frac12 \qquad\text{and}\qquad \xi_{2n}=\frac{\xi_n}{2},$$

with the convention $\xi_0=0$. Deduce that

$$\lim_n \frac1n\sum_{k=1}^n \xi_{2k}\,\xi_{2k+1}=\frac{5}{24}.$$

Compare with $\mathbb{E}(UV)$, where $U$, $V$ are independent with uniform distribution over $[0,1]$. Conclude.
(b) Show, still for the dyadic Van der Corput sequence, that $\frac1n\sum_{k=1}^n \delta_{(\xi_{2k},\xi_{2k+1})}$ weakly converges toward a Borel distribution $\mu$ on $[0,1]^2$ to be specified explicitly. Extend this result to $p$-adic Van der Corput sequences.
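Claim (a) of the exercise is easy to check numerically (our illustration, not part of the book): the recursion $\xi_{2n}=\xi_n/2$, $\xi_{2n+1}=\xi_{2n}+\tfrac12$ holds for the radical-inverse implementation below, and the empirical mean of the products $\xi_{2k}\xi_{2k+1}$ settles near $5/24\approx0.2083$ rather than $\mathbb{E}(UV)=1/4$.

```python
def van_der_corput(n: int, base: int = 2) -> float:
    """Radical inverse of n in the given base (dyadic VdC for base 2)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

n = 20000
# Empirical mean of xi_{2k} * xi_{2k+1}, k = 1..n.
mean_prod = sum(van_der_corput(2 * k) * van_der_corput(2 * k + 1)
                for k in range(1, n + 1)) / n
print(mean_prod)   # close to 5/24, visibly away from 1/4
```

The gap between $5/24$ and $1/4$ is precisely the failure of a single one-dimensional sequence to "simulate" independence.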
In fact, this phenomenon is typical of the price to pay for "filling up the gaps" faster than random numbers do. This is the reason why it is absolutely mandatory to use $d$-dimensional sequences with low discrepancy to perform QMC computations related to a random vector $X$ of the form $X \stackrel{d}{=} \varphi(U_d)$, $U_d \stackrel{d}{=} \mathcal U([0,1]^d)$ ($d$ is sometimes called the structural dimension of the simulation). The $d$ components of these $d$-dimensional sequences do simulate independence.

This has important consequences for very standard simulation methods, as illustrated below.
Application to the "QMC Box–Muller method". To adapt the Box–Muller method of simulation of a normal distribution $\mathcal N(0;1)$ introduced in Corollary 1.3, we proceed as follows: let $\xi=(\xi_n^1,\xi_n^2)_{n\ge1}$ be a uniformly distributed sequence over $[0,1]^2$ (in practice chosen with low discrepancy). We set, for every $n\ge1$,

$$\zeta_n=(\zeta_n^1,\zeta_n^2):=\Big(\sqrt{-2\log(\xi_n^1)}\,\sin\big(2\pi\xi_n^2\big),\ \sqrt{-2\log(\xi_n^1)}\,\cos\big(2\pi\xi_n^2\big)\Big).$$
Then, for every bounded continuous function $f_1:\mathbb{R}\to\mathbb{R}$,

$$\lim_n\frac1n\sum_{k=1}^n f_1(\zeta_k^1)=\mathbb{E}\,f_1(Z_1),\qquad Z_1\stackrel{d}{=}\mathcal N(0;1),$$

since $(\xi^1,\xi^2)\mapsto f_1\big(\sqrt{-2\log\xi^1}\,\sin(2\pi\xi^2)\big)$ is continuous on $(0,1]^2$ and bounded, hence Riemann integrable on $[0,1]^2$. Likewise, for every bounded continuous function $f_2:\mathbb{R}^2\to\mathbb{R}$,

$$\lim_n\frac1n\sum_{k=1}^n f_2(\zeta_k)=\mathbb{E}\,f_2(Z),\qquad Z\stackrel{d}{=}\mathcal N(0;I_2).$$
These continuity assumptions on $f_1$ and $f_2$ can be relaxed, e.g. for $f_2$, to: the function $\widetilde f_2$ defined on $(0,1]^2$ by

$$(\xi^1,\xi^2)\longmapsto \widetilde f_2(\xi^1,\xi^2):=f_2\Big(\sqrt{-2\log\xi^1}\,\sin\big(2\pi\xi^2\big),\ \sqrt{-2\log\xi^1}\,\cos\big(2\pi\xi^2\big)\Big)$$

is Riemann integrable. To benefit from the Koksma–Hlawka inequality, we need $\widetilde f_2$ to have finite variation. Establishing that $\widetilde f_2$ does have finite variation is quite demanding, even when it is true, which is clearly not the generic situation.

All these considerations admit easy extensions to functions $f(Z)$, $Z\stackrel{d}{=}\mathcal N(0;I_d)$. To be more precise, the extension of the Box–Muller method to multivariate normal distributions (see (1.1)) should be performed following the same rule of the structural dimension (here $d$, assumed to be even for convenience): such a sequence $(\zeta_n)_{n\ge1}$ can be constructed from a uniformly distributed sequence $\xi=(\xi_n)_{n\ge1}$ over $[0,1]^d$ by plugging the components of $\xi$ into (1.1) in place of $(U_1,\ldots,U_d)$, that is,

$$(\zeta_n^{2i-1},\zeta_n^{2i})=\Big(\sqrt{-2\log(\xi_n^{2i-1})}\,\cos\big(2\pi\xi_n^{2i}\big),\ \sqrt{-2\log(\xi_n^{2i-1})}\,\sin\big(2\pi\xi_n^{2i}\big)\Big),\qquad i=1,\ldots,d/2.$$
In particular, we will see further on in Chap. 7 that simulating the Euler scheme with step $\frac Tm$ of a $d$-dimensional diffusion over $[0,T]$ with an underlying $q$-dimensional Brownian motion consumes $m$ independent $\mathcal N(0;I_q)$-distributed random vectors, i.e. $m\times q$ independent $\mathcal N(0;1)$ random variables. To perform a QMC simulation of a function of this Euler scheme at time $T$, we consequently need to consider a sequence with low discrepancy over $[0,1]^{mq}$. Existing error bounds on sequences with low discrepancy and the sparsity of functions with finite variation make essentially meaningless any use of the Koksma–Hlawka inequality to produce error bounds. Proïnov's theorem itself is difficult to use, owing to the difficult evaluation of $[f]_{\mathrm{Lip}}$. Not to mention that, in the latter case, the curse of dimensionality will lead to extremely poor theoretical bounds for Lipschitz functions (like for $\widetilde f_2$ in dimension 2).
Example. We depict below in Fig. 4.1 a "competition" between pseudo-random numbers and a basic Halton(2,3) sequence to compute $\mathbb{E}|X^1-X^2|$, $X=(X^1,X^2)\stackrel{d}{=}\mathcal N(0;I_2)$, on a short trial of size 1 500.

Fig. 4.1 MC versus QMC. $\mathbb{E}|X^1-X^2|$ computed by simulation with 1 500 trials of pseudo-random numbers (red) and a Halton(2,3) sequence (blue) plugged into a Box–Muller formula (reference value $\frac{2}{\sqrt\pi}$).
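The experiment behind Fig. 4.1 can be reproduced in a few lines (our sketch, with our own helper `radical_inverse`; the book's trial size is 1 500, we take a longer run to stabilize the estimate). The target value is $2/\sqrt\pi\approx1.1284$.

```python
import math

def radical_inverse(n: int, base: int) -> float:
    """Radical inverse of n in the given base; Halton(2,3) pairs VdC(2), VdC(3)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

n, acc = 20000, 0.0
for k in range(1, n + 1):
    u1 = radical_inverse(k, 2)          # first Halton coordinate (u1 > 0 for k >= 1)
    u2 = radical_inverse(k, 3)          # second Halton coordinate
    r = math.sqrt(-2.0 * math.log(u1))  # Box-Muller radius
    z1, z2 = r * math.sin(2 * math.pi * u2), r * math.cos(2 * math.pi * u2)
    acc += abs(z1 - z2)

qmc_estimate = acc / n
print(qmc_estimate)   # reference value: 2/sqrt(pi)
```

Note that $k$ starts at 1, so $\xi_k^1>0$ and the logarithm is always defined.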

Exercise. (a) Implement a QMC-adapted Box–Muller simulation method for $\mathcal N(0;I_2)$ (based on the sequence with low discrepancy of your choice) and organize a race MC vs. QMC to compute various calls, say $\mathrm{Call}_{BS}(K,T)$ ($T=1$, $K\in\{95,96,\ldots,104,105\}$) in a Black–Scholes model (with $r=2\%$, $\sigma=30\%$, $x=100$, $T=1$). To simulate this underlying Black–Scholes risky asset, first use the closed expression

$$X_T^x=x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt T\,Z},\qquad Z\stackrel{d}{=}\mathcal N(0;1).$$
(b) Anticipating Chap. 7, implement the Euler scheme (7.5) of the Black–Scholes dynamics

$$dX_t^x=X_t^x\,(r\,dt+\sigma\,dW_t).$$

Consider steps of the form $\frac Tm$ with $m=10,\,20,\,50,\,100$. What conclusions can be drawn?
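A possible starting point for part (a) is sketched below (our code, not the book's solution; only $K=100$ is priced, and the normal cdf is computed through `math.erf`): the QMC price obtained from a Halton(2,3) sequence plugged into the Box–Muller formula is compared with the Black–Scholes closed formula.

```python
import math

def radical_inverse(n: int, base: int) -> float:
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x0, K, r, sigma, T = 100.0, 100.0, 0.02, 0.3, 1.0

# Black-Scholes closed formula for the call price.
d1 = (math.log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
d2 = d1 - sigma * math.sqrt(T)
bs_price = x0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

# QMC estimate: Halton(2,3) -> Box-Muller normal -> discounted payoff.
n, acc = 20000, 0.0
for k in range(1, n + 1):
    u1, u2 = radical_inverse(k, 2), radical_inverse(k, 3)
    z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2 * math.pi * u2)
    xT = x0 * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    acc += math.exp(-r * T) * max(xT - K, 0.0)
qmc_price = acc / n
print(bs_price, qmc_price)
```

Running the same loop with pseudo-random uniforms in place of the Halton pairs provides the MC competitor of the race.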
Statistical correlation at finite range. A final difficulty encountered by practitioners is to "reach" the effective statistical independence between the coordinates of a $d$-dimensional sequence with low discrepancy. This independence is only true asymptotically so that, in high dimensions and for small values of $n$, the coordinates of the Halton sequence remain highly correlated for a long time. As a matter of fact, the $i$-th component of the canonical $d$-dimensional Halton sequence (i.e. designed from the first $d$ prime numbers $p_1,\ldots,p_d$) starts as follows:

$$\xi_n^i=\frac{n}{p_i},\quad n\in\{1,\ldots,p_i-1\},\qquad \xi_n^i=\frac{1}{p_i^2}+\frac{n-p_i}{p_i},\quad n\in\{p_i,\ldots,2p_i-1\},\ \ldots$$
so it is clear that the $i$-th and the $(i+1)$-th components will remain highly correlated for a long time when the bases are large: if $d=97$ and $(i,i+1)=(d-1,d)$, then $p_{d-1}=503$ and $p_d=509$…

To overcome this correlation, observed for (not so) small values of $n$, the usual method is to discard the first values of the sequence.
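This initial-segment correlation is easy to exhibit (our sketch, not from the book): for $n<p_i$ both coordinates are pure ramps $n/p_i$, hence almost perfectly correlated over the first few hundred indices for bases such as 503 and 509.

```python
def radical_inverse(n: int, base: int) -> float:
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 400  # n < 503: both coordinates are still in their first "ramp"
xs = [radical_inverse(k, 503) for k in range(1, n + 1)]
ys = [radical_inverse(k, 509) for k in range(1, n + 1)]
corr_start = pearson(xs, ys)
print(corr_start)   # essentially 1: the two coordinates are almost proportional
```

Discarding a long initial segment (the usual remedy mentioned above) breaks these ramps apart.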

Exercise. Let $\xi^1=\mathrm{VdC}(p_1)$ and $\xi^2=\mathrm{VdC}(p_2)$, where $p_1$ and $p_2$ are two (reasonably) distinct large prime numbers satisfying $p_1<p_2<2p_1$ ($p_2$ exists owing to Bertrand's conjecture). Show that the pseudo-estimator of the correlation between these two sequences at order $n=p_1$ satisfies

$$\frac1n\sum_{k=1}^n\xi_k^1\xi_k^2-\Big(\frac1n\sum_{k=1}^n\xi_k^1\Big)\Big(\frac1n\sum_{k=1}^n\xi_k^2\Big)=\Big(1+\frac1{p_1}\Big)\,\frac{4p_1-p_2+1}{12\,p_2}>\frac1{12}\Big(1+\frac1{p_1}\Big)\Big(1+\frac1{2p_1}\Big)>\frac1{12}.$$
MC versus QMC: a visual point of view. As a conclusion to this section, let us emphasize graphically in Fig. 4.2 the differences of texture between MC and QMC "sampling", i.e. between (pseudo-)randomly generated points (say 60 000) and the same number of terms of the Halton sequence (with $p_1=2$ and $p_2=3$).

Fig. 4.2 MC versus QMC. Left: $6\cdot10^4$ randomly generated points; Right: $6\cdot10^4$ terms of the Halton(2,3) sequence.
4.4 Randomized QMC

The principle at the origin of randomized QMC is to introduce some randomness in the QMC method in order to produce a confidence interval or, returning to a regular MC viewpoint, to use QMC sequences as a variance reducer in the Monte Carlo method.

Throughout this section, $\{x\}$ denotes the componentwise fractional part of $x=(x^1,\ldots,x^d)\in\mathbb{R}^d$, i.e. $\{x\}=(\{x^1\},\ldots,\{x^d\})$.

Moreover, we denote by $(\xi_n)_{n\ge1}$ a uniformly distributed sequence over $[0,1]^d$.
4.4.1 Randomization by Shifting

The starting idea is to note that a shifted uniformly distributed sequence is still uniformly distributed. The second stage is to randomly shift such a sequence to combine the properties of the Quasi- and regular Monte Carlo simulation methods.

Proposition 4.7 (a) Let $U$ be a uniformly distributed random variable on $[0,1]^d$. Then, for every $a\in\mathbb{R}^d$,

$$\{a+U\}\stackrel{d}{=}U.$$

(b) Let $a=(a^1,\ldots,a^d)\in\mathbb{R}^d$. The sequence $(\{a+\xi_n\})_{n\ge1}$ is uniformly distributed.
Proof. (a) One easily checks that the characteristic functions of $U$ and $\{a+U\}$ coincide on $\mathbb{Z}^d$ since, for every $p\in\mathbb{Z}^d$,

$$\mathbb{E}\,e^{2i\pi(p|\{a+U\})}=\mathbb{E}\,e^{2i\pi(p|a+U)}=e^{2i\pi(p|a)}\,\mathbb{E}\,e^{2i\pi(p|U)}=e^{2i\pi(p|a)}\mathbf{1}_{\{p=0\}}=\mathbf{1}_{\{p=0\}}=\mathbb{E}\,e^{2i\pi(p|U)}.$$

Hence (see [155]), $U$ and $\{a+U\}$ have the same distribution since they are $[0,1]^d$-valued (in fact, this is the static version of Weyl's criterion used to prove claim (b) below).

(b) This follows from Weyl's criterion: let $p\in\mathbb{Z}^d\setminus\{0\}$,

$$\frac1n\sum_{k=1}^n e^{2i\pi(p|\{a+\xi_k\})}=\frac1n\sum_{k=1}^n e^{2i\pi(p^1\{a^1+\xi_k^1\}+\cdots+p^d\{a^d+\xi_k^d\})}=\frac1n\sum_{k=1}^n e^{2i\pi(p^1(a^1+\xi_k^1)+\cdots+p^d(a^d+\xi_k^d))}=e^{2i\pi(p|a)}\,\frac1n\sum_{k=1}^n e^{2i\pi(p|\xi_k)}\longrightarrow0\ \text{ as }n\to+\infty.\qquad\diamond$$
Consequently, if $U$ is a uniformly distributed random variable on $[0,1]^d$ and $f:[0,1]^d\to\mathbb{R}$ is a Riemann integrable function, the random variable

$$\chi=\chi(f,U):=\frac1n\sum_{k=1}^n f(\{U+\xi_k\})$$

satisfies

$$\mathbb{E}\,\chi=\frac1n\times n\,\mathbb{E}f(U)=\mathbb{E}f(U),$$

owing to claim (a). Then, starting from an $M$-sample $U_1,\ldots,U_M$ of the uniform distribution over $[0,1]^d$, one defines the empirical Monte Carlo estimator of size $M$ attached to $\chi$ by

$$\widehat I(f,\xi)_{n,M}:=\frac1M\sum_{m=1}^M\chi(f,U_m)=\frac{1}{nM}\sum_{m=1}^M\sum_{k=1}^n f(\{U_m+\xi_k\}).$$
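A compact implementation of this randomized estimator is sketched below (our code: the Halton(2,3) sequence plays the role of $\xi$, the test integrand $f(u^1,u^2)=u^1u^2$ has integral $1/4$ over $[0,1]^2$, and Python's `random` module supplies the i.i.d. shifts $U_1,\ldots,U_M$).

```python
import random

def radical_inverse(n: int, base: int) -> float:
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def f(u):                       # test integrand, integral over [0,1]^2 is 1/4
    return u[0] * u[1]

random.seed(1)
n, M = 512, 32
xi = [(radical_inverse(k, 2), radical_inverse(k, 3)) for k in range(1, n + 1)]

est = 0.0
for _ in range(M):
    U = (random.random(), random.random())          # random shift U_m
    est += sum(f(((U[0] + x) % 1.0, (U[1] + y) % 1.0)) for x, y in xi) / n
est /= M
print(est)   # unbiased estimator of the integral 1/4
```

Repeating the experiment over the $M$ shifts also yields an empirical variance, hence a confidence interval, which plain QMC cannot provide.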

This estimator has a complexity approximately equal to $\kappa\times nM$, where $\kappa$ is the unitary complexity induced by the computation of one value of the function $f$. From the preceding, it satisfies

$$\widehat I(f,\xi)_{n,M}\xrightarrow{a.s.}\mathbb{E}f(U)$$

by the Strong Law of Large Numbers and its (weak) rate of convergence is ruled by a CLT,

$$\sqrt M\,\big(\widehat I(f,\xi)_{n,M}-\mathbb{E}f(U)\big)\xrightarrow{\mathcal L}\mathcal N\big(0;\sigma_n^2(f,\xi)\big)$$

with

$$\sigma_n^2(f,\xi)=\mathrm{Var}\Big(\frac1n\sum_{k=1}^n f(\{U+\xi_k\})\Big).$$

Hence, the specific rate of convergence of the QMC is irremediably lost. In particular, this hybrid method should be compared to a regular Monte Carlo simulation of size $nM$ through their respective variances. It is clear that we will observe a variance reduction if and only if

$$\frac{\sigma_n^2(f,\xi)}{M}<\frac{\mathrm{Var}(f(U))}{nM},\qquad\text{i.e.}\qquad \mathrm{Var}\Big(\frac1n\sum_{k=1}^n f(\{U+\xi_k\})\Big)\le\frac{\mathrm{Var}(f(U))}{n}.$$
The only natural upper bound for the left-hand side of this inequality is

$$\sigma_n^2(f,\xi)=\int_{[0,1]^d}\Big(\frac1n\sum_{k=1}^n f(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2du\ \le\ \sup_{u\in[0,1]^d}\Big(\frac1n\sum_{k=1}^n f(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2.$$

One can show that $f_u:v\mapsto f(\{u+v\})$ has finite variation on $[0,1]^d$ as soon as $f$ has (in the same sense) and that $\sup_{u\in[0,1]^d}V(f_u)<+\infty$ (more precise results can be established). Consequently,

$$\sigma_n^2(f,\xi)\le\sup_{u\in[0,1]^d}V(f_u)^2\;D_n^*(\xi_1,\ldots,\xi_n)^2,$$

so that, if $\xi=(\xi_n)_{n\ge1}$ is a sequence with low discrepancy (say Faure, Halton, Kakutani or Sobol', etc.),

$$\sigma_n^2(f,\xi)\le C_{f,\xi}^2\,\frac{(\log n)^{2d}}{n^2},\qquad n\ge1.$$
Consequently, in that case, it is clear that randomized QMC provides a very significant variance reduction (for the same complexity), of a magnitude proportional to $\frac{(\log n)^{2d}}{n}$ (with an impact of magnitude $\frac{(\log n)^d}{\sqrt n}$ on the confidence interval). But one must bear in mind once again that such functions with finite variation become dramatically sparse among Riemann integrable functions as $d$ increases.
In fact, an even better bound of the form $\sigma_n^2(f,\xi)\le\frac{C_{f,\xi}^2}{n^2}$ can be obtained for some classes of functions, as emphasized in the Pros part of Sect. 4.3.5: when the sequence $(\xi_n)_{n\ge1}$ is the orbit of a (uniquely) ergodic transform and $f$ is a coboundary for this transform. But of course this class is even sparser. For sequences obtained by iterating rotations – of the torus or of the Kakutani adding machine – some criteria can be obtained involving the rate of decay of the Fourier coefficients $c_p(f)$, $p=(p^1,\ldots,p^d)\in\mathbb{Z}^d$, of $f$ as $\bar p:=p^1\times\cdots\times p^d$ goes to infinity since, in that case, one has $\sigma_n^2(f,\xi)\le\frac{C_{f,\xi}^2}{n^2}$. Hence, the gain in terms of variance becomes proportional to $\frac1n$ for such functions (a global budget/complexity being prescribed for the simulation).
By contrast, if we consider Lipschitz continuous functions, things go radically differently: assume that $f:[0,1]^d\to\mathbb{R}$ is Lipschitz continuous and isotropically periodic, i.e. for every $x\in[0,1]^d$ and every vector $e_i=(\delta_{ij})_{1\le j\le d}$, $i=1,\ldots,d$, of the canonical basis of $\mathbb{R}^d$ ($\delta_{ij}$ stands for the Kronecker symbol), $f(x+e_i)=f(x)$ as soon as $x+e_i\in[0,1]^d$. Then $f$ can be extended as a Lipschitz continuous function on the whole of $\mathbb{R}^d$ with the same Lipschitz coefficient, say $[f]_{\mathrm{Lip}}$. Furthermore, it satisfies $f(x)=f(\{x\})$ for every $x\in\mathbb{R}^d$. Then, it follows from Proïnov's Theorem 4.3 that

$$\sup_{u\in[0,1]^d}\Big(\frac1n\sum_{k=1}^n f(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2\le C_d^2\,[f]_{\mathrm{Lip}}^2\,D_n^*(\xi_1,\ldots,\xi_n)^{\frac2d}\le C_d^2\,[f]_{\mathrm{Lip}}^2\,C_\xi^{\frac2d}\,\frac{(\log n)^2}{n^{\frac2d}}$$

(where $C_d$ is Proïnov's constant). This time, still for a prescribed budget, the "gain" factor in terms of variance is proportional to $n^{1-\frac2d}(\log n)^2$, which is no longer a gain… but a loss as soon as $d\ge2$!

For more results and details, we refer to the survey [271] on randomized QMC and the references therein.
Finally, randomized QMC is a specific (and not so easy to handle) variance reduction method, not a QMC speed-up method. It suffers from one drawback shared by all QMC-based simulation methods: the sparsity of the class of functions with finite variation and the difficulty of identifying them in practice when $d>1$.
4.4.2 Scrambled (Randomized) QMC

If the principle of mixing randomness and the Quasi-Monte Carlo method is undoubtedly a way to improve rates of convergence of numerical integration over unit hypercubes, the approach based on randomly "shifted" sequences with low discrepancy (or nets) described in the former section turned out to be not completely satisfactory, and it is no longer considered the most efficient way to proceed by the QMC community.

A new idea emerged at the very end of the 20th century, inspired by the pioneering work of A. Owen (see [221]): to break the undesired regularity which appears even in the most popular sequences with low discrepancy (like Sobol' sequences), he proposed to scramble them in an i.i.d. random way so that these regularity features disappear while preserving the quality, in terms of discrepancy, of the resulting sequences (or nets).

The underlying principle – or constraint – was to preserve their "geometric-combinatorial" properties. Typically, if a sequence shares the (s,d)-property in a given base (or the (s,m,d)-property for a net), its scrambled version should share it too. Several attempts to produce efficient deterministic scrambling procedures have been made as well, but of course the most radical way to get rid of regularity features was to consider a kind of i.i.d. scrambling, as originally developed in [221]. This has been successfully applied to the "best" Sobol' sequences by various authors.
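Owen's nested scrambling is too technical to reproduce here, but its simplest relative – the random digital shift in base 2, which XORs the binary digits of every point with one fixed random bit string and thus preserves the dyadic structure of base-2 points – already illustrates the idea of randomizing a sequence without destroying its regular spread. The sketch below is ours and makes no claim of matching the performance of properly scrambled Sobol' sequences.

```python
import random

def radical_inverse(n: int, base: int = 2) -> float:
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def digital_shift(u: float, shift_bits, m: int = 32) -> float:
    """XOR the first m binary digits of u in [0,1) with the given bit string."""
    out = 0.0
    for j in range(m):
        u *= 2.0
        d = int(u)          # j-th binary digit of the original point
        u -= d
        out += (d ^ shift_bits[j]) / 2.0 ** (j + 1)
    return out

random.seed(0)
shift = [random.randint(0, 1) for _ in range(32)]
scrambled = [digital_shift(radical_inverse(k), shift) for k in range(1, 1025)]
print(min(scrambled), max(scrambled))   # all points remain in [0,1)
```

Averaging a QMC estimator over several independent shift strings again produces a confidence interval, as in Sect. 4.4.1.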
We will not go deeper into the details and technicalities, as this would lead us too far from probabilistic concepts and clearly beyond the mathematical scope of this textbook.

Nevertheless, one should keep in mind that these types of improvements are mostly devoted to highly regular functions. In the same spirit, several extensions of the Koksma–Hlawka inequality have been established (see [78]) for differentiable functions with finite variation (in the Hardy and Krause sense) whose partial derivatives also have finite variation up to a given order.
4.5 QMC in Unbounded Dimension: The Acceptance-Rejection Method

If one looks at the remark "The practitioner's viewpoint" in Sect. 1.4 devoted to Von Neumann's acceptance-rejection method, it is a simple exercise to check that one can replace pseudo-random numbers by a uniformly distributed sequence in the procedure (almost) mutatis mutandis, except for some more stringent regularity assumptions.
We adopt the notations of this section and assume that the $\mathbb{R}^d$-valued random vectors $X$ and $Y$ have absolutely continuous distributions with respect to a reference $\sigma$-finite measure $\mu$ on $\big(\mathbb{R}^d,\mathcal{B}or(\mathbb{R}^d)\big)$. Assume that $\mathbb{P}_X$ has a density proportional to $f$, that $\mathbb{P}_Y=g\cdot\mu$, and that $f$ and $g$ satisfy

$$f\le c\,g\quad\mu\text{-a.e.}\qquad\text{and}\qquad g>0\quad\mu\text{-a.e.},$$

where $c$ is a positive real constant.

Furthermore, we make the assumption that $Y$ can be simulated at a reasonable cost, as in the original acceptance-rejection method, i.e. that

$$Y\stackrel{d}{=}\Phi(U),\qquad U\stackrel{d}{=}\mathcal U([0,1]^r)$$

for some $r\in\mathbb{N}^*$, where $\Phi:[0,1]^r\to\mathbb{R}^d$.

Additional “ QMC assumptions”:
 The first additional assumption in this QMC framework is that we ask  to be
a Riemann integrable function (i.e. Borel, bounded and λr -a.s. continuous).
 We also assume that the function

I : (u 1 , u 2 ) → 1{c u 1 g((u 2 ))≤ f ((u 2 ))} is λr +1 − a.s. continuous on[0, 1]r +1


(4.16)
(which also amounts to Riemann integrability since I is bounded).
– Our aim is to compute $\mathbb{E}\,\varphi(X)$, where $\varphi\in L^1(\mathbb{P}_X)$. Since we will use $\varphi(Y)=\varphi\circ\Phi(U)$ to perform this integration (see below), we also ask $\varphi$ to be such that

$$\varphi\circ\Phi\ \text{ is Riemann integrable.}$$

This classically holds true if $\varphi$ is continuous (see e.g. [52], Chap. 3).

Let $\xi=(\xi_n^1,\xi_n^2)_{n\ge1}$ be a $[0,1]\times[0,1]^r$-valued sequence, assumed to be with low discrepancy (or simply uniformly distributed) over $[0,1]^{r+1}$. In particular, $(\xi_n^1)_{n\ge1}$ and $(\xi_n^2)_{n\ge1}$ are uniformly distributed over $[0,1]$ and $[0,1]^r$, respectively. If $(U,V)\stackrel{d}{=}\mathcal U([0,1]\times[0,1]^r)$, then $(U,\Phi(V))\stackrel{d}{=}(U,Y)$. Consequently, the product of two Riemann integrable functions being Riemann integrable,

$$\frac{\sum_{k=1}^n\mathbf{1}_{\{c\,\xi_k^1 g(\Phi(\xi_k^2))\le f(\Phi(\xi_k^2))\}}\,\varphi(\Phi(\xi_k^2))}{\sum_{k=1}^n\mathbf{1}_{\{c\,\xi_k^1 g(\Phi(\xi_k^2))\le f(\Phi(\xi_k^2))\}}}\ \xrightarrow{n\to+\infty}\ \frac{\mathbb{E}\big(\mathbf{1}_{\{c\,U g(Y)\le f(Y)\}}\,\varphi(Y)\big)}{\mathbb{P}\big(c\,U g(Y)\le f(Y)\big)}=\frac{\int_{\mathbb{R}^d}\varphi(x)\,f(x)\,\mu(dx)}{\int_{\mathbb{R}^d}f(x)\,\mu(dx)}=\mathbb{E}\,\varphi(X),\tag{4.17}$$

where the last two equalities follow from computations carried out in Sect. 1.4.
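As a toy instance of this procedure (ours, not from the book): take the target density $3x^2$ on $[0,1]$, i.e. $f(x)=x^2$ up to the normalizing constant, the uniform proposal $Y=\Phi(U)$ with $\Phi=\mathrm{identity}$ ($r=1$, $g\equiv1$, $c=1$), $\varphi(x)=x$, and a Halton(2,3) sequence for $(\xi_n^1,\xi_n^2)$; the ratio in (4.17) must approach $\mathbb{E}X=3/4$.

```python
def radical_inverse(n: int, base: int) -> float:
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

n = 50000
num = den = 0.0
for k in range(1, n + 1):
    u1 = radical_inverse(k, 2)      # xi_k^1: acceptance coordinate
    u2 = radical_inverse(k, 3)      # xi_k^2: proposal coordinate, Y = Phi(u2) = u2
    if u1 <= u2 * u2:               # c * u1 * g(Phi(u2)) <= f(Phi(u2)) with c = 1
        num += u2                   # phi(Phi(u2)) = u2
        den += 1.0

ar_estimate = num / den
print(ar_estimate, den / n)   # ratio near 3/4, acceptance rate near 1/3
```

The acceptance rate converges to $\int_0^1 x^2\,dx=1/3$, consistent with the denominator of (4.17).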
The main gap to apply the method in a QMC framework is the a.s. continuity assumption (4.16). The following proposition yields an easy and natural criterion.

Proposition 4.8 If the function $\frac fg\circ\Phi$ is $\lambda_r$-a.s. continuous on $[0,1]^r$, then Assumption (4.16) is satisfied.
Proof. First we note that

$$\mathrm{Disc}(I)\subset\Big([0,1]\times\mathrm{Disc}\Big(\frac fg\circ\Phi\Big)\Big)\cup\Big\{(\xi^1,\xi^2)\in[0,1]^{r+1}\ \text{s.t.}\ c\,\xi^1=\frac fg\circ\Phi(\xi^2)\Big\},$$

where $I$ denotes the function defined in (4.16) and $\mathrm{Disc}(h)$ the set of discontinuity points of a function $h$. Now, it is clear that

$$\lambda_{r+1}\Big([0,1]\times\mathrm{Disc}\Big(\frac fg\circ\Phi\Big)\Big)=\lambda_1([0,1])\times\lambda_r\Big(\mathrm{Disc}\Big(\frac fg\circ\Phi\Big)\Big)=1\times0=0,$$

owing to the $\lambda_r$-a.s. continuity of $\frac fg\circ\Phi$. Consequently,

$$\lambda_{r+1}\big(\mathrm{Disc}(I)\big)\le\lambda_{r+1}\Big(\Big\{(\xi^1,\xi^2)\in[0,1]^{r+1}\ \text{s.t.}\ c\,\xi^1=\frac fg\circ\Phi(\xi^2)\Big\}\Big).$$
In turn, this subset of $[0,1]^{r+1}$ is negligible with respect to the Lebesgue measure $\lambda_{r+1}$ since, returning to the independent random variables $U$ and $Y$ and keeping in mind that $g(Y)>0$ $\mathbb{P}$-a.s.,

$$\lambda_{r+1}\Big(\Big\{(\xi^1,\xi^2)\in[0,1]^{r+1}\ \text{s.t.}\ c\,\xi^1=\frac fg\circ\Phi(\xi^2)\Big\}\Big)=\mathbb{P}\big(c\,U g(Y)=f(Y)\big)=\mathbb{P}\Big(U=\frac{f}{c\,g}(Y)\Big)=0,$$

where we used (see the exercise below) that $U$ and $Y$ are independent by construction and that $U$ has a diffuse distribution (no atom). $\diamond$

Remark. The criterion of the proposition is trivially satisfied when $\frac fg$ and $\Phi$ are continuous on $\mathbb{R}^d$ and $[0,1]^r$, respectively.
Exercise. Show that if $X$ and $Y$ are independent and $X$ or $Y$ has no atom, then $\mathbb{P}(X=Y)=0$.
As a conclusion, note that in this section we provide no rate of convergence for this acceptance-rejection method by quasi-Monte Carlo. In fact, there is no such error bound under realistic assumptions on $f$, $g$, $\varphi$ and $\Phi$. Only empirical evidence can justify its use in practice.
4.6 Quasi-stochastic Approximation I

It is natural to try to replace regular pseudo-random numbers by quasi-random numbers in other procedures where they are commonly implemented. This is the case for Stochastic Approximation, which can be seen as the stochastic counterpart of recursive zero search or optimization procedures like the Newton–Raphson algorithm, etc. These aspects of QMC will be investigated in Chap. 6 (Sect. 6.5).
Chapter 5
Optimal Quantization Methods I: Cubatures
Optimal Vector Quantization is a method coming from Signal Processing, originally devised and developed in the 1950's (see [105]) to optimally discretize a continuous
(stationary) signal in view of its transmission. It was introduced as a quadrature
formula for numerical integration in the early 1990’s (see [224]), and for conditional
expectation approximations in the early 2000’s, in order to price multi-asset American
style options [19–21]. In this brief chapter, we focus on the cubature formulas for
numerical integration with respect to the distribution of a random vector X taking
values in Rd .
In view of applications, we will only deal in this monograph with the canonical
Euclidean quadratic setting (i.e. the L2 optimal vector quantization in Rd equipped
with the canonical Euclidean norm), but a general theory of optimal vector quantiza-
tion can be developed in a general framework (with any norm on Rd and an Lp -norm
or pseudo-norm – 0 < p < +∞ – on the probability space $(\Omega,\mathcal A,\mathbb P)$). For a more
comprehensive introduction to optimal vector quantization theory, we refer to [129]
and for an introduction more oriented toward applications in Numerical Probabil-
ity we refer to [228]. For an extension to infinite-dimensional spaces (known as
functional quantization) see, among other references [204, 233].
Optimal quantization is closely connected to unsupervised automatic classifica-
tion since it is the natural theoretical framework for modeling and analyzing cele-
brated classification procedures like k-means. We will not investigate these aspects
of quantization in this chapter, for which we refer to [106], for example.
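To make the k-means connection concrete, here is a minimal Lloyd iteration in dimension 1 (our toy sketch, not part of the book): assign each data point to its nearest centroid – a Voronoi partition of the sample – and recenter every centroid at the mean of its cell; the empirical distortion never increases.

```python
def lloyd_step(points, centroids):
    """One k-means/Lloyd iteration: nearest-neighbor assignment, then recentering."""
    cells = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
        cells[i].append(p)
    return [sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(cells)]

def distortion(points, centroids):
    """Empirical quadratic distortion of the sample w.r.t. the centroids."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points) / len(points)

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
cents0 = [0.0, 5.0]
cents1 = lloyd_step(data, cents0)
print(cents1, distortion(data, cents1))   # centroids move to the cell means
```

Iterating this map is exactly the k-means algorithm; its fixed points are the stationary quantizers discussed in the sequel.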
We recall that the canonical Euclidean norm on the vector space Rd is denoted
by | · |.

© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_5
5.1 Theoretical Background on Vector Quantization

Let $X$ be an $\mathbb{R}^d$-valued random vector defined on a probability space $(\Omega,\mathcal A,\mathbb P)$. Unless stated otherwise, we will assume throughout this chapter that $X\in L^2_{\mathbb{R}^d}(\Omega,\mathcal A,\mathbb P)$. The purpose of vector quantization is to study the best approximation of $X$ by random vectors taking at most $N$ fixed values $x_1,\ldots,x_N\in\mathbb{R}^d$.
Definition 5.1 (a) Let $\Gamma=\{x_1,\ldots,x_N\}\subset\mathbb{R}^d$ be a subset of size $N$, called an $N$-quantizer. A Borel partition $\big(C_i(\Gamma)\big)_{i=1,\ldots,N}$ of $\mathbb{R}^d$ is a Voronoi partition of $\mathbb{R}^d$ induced by the $N$-quantizer $\Gamma$ if, for every $i\in\{1,\ldots,N\}$,

$$C_i(\Gamma)\subset\Big\{\xi\in\mathbb{R}^d,\ |\xi-x_i|\le\min_{j\ne i}|\xi-x_j|\Big\}.$$

The Borel sets $C_i(\Gamma)$ are called Voronoi cells of the partition induced by $\Gamma$.
In vector quantization theory, an $N$-quantizer $\Gamma$ is also called a codebook, as a testimony to its links with Information theory, or a grid in applications to Numerical Probability. Elements of $\Gamma$ are also called codewords, again in reference to Information theory.
Notation and terminology. Any such $N$-quantizer is in correspondence with the $N$-tuple $x=(x_1,\ldots,x_N)\in(\mathbb{R}^d)^N$ as well as with all $N$-tuples obtained by a permutation of the components of $x$. In many applications, when there is no ambiguity regarding the fact that the points $x_i$ are pairwise distinct, we will use the notation $x$ rather than $\Gamma$ to designate such an $N$-quantizer. Conversely, an $N$-tuple $x=(x_1,\ldots,x_N)\in(\mathbb{R}^d)^N$ is in correspondence with the grid (or codebook) $\Gamma_x=\{x_1,\ldots,x_N\}$ made up of its components (or codewords). However, one must keep in mind that in general $\Gamma_x$ has size $|\Gamma_x|\in\{1,\ldots,N\}$, which may be lower than $N$ when several components of $x$ are equal. If this is the case, the indexing of $\Gamma_x$ is not one-to-one.
Remarks. • Although, in the above definition, $|\cdot|$ denotes the canonical Euclidean norm, many results in what follows still hold true for any norm on $\mathbb{R}^d$ (see [129]), except for some differentiability results on the quadratic distortion function (see Proposition 6.3.1 in Sect. 6.3.5).

• In our regular setting where $|\cdot|$ denotes the canonical Euclidean norm, the closure and the interior of the Voronoi cells $C_i(\Gamma)$ satisfy, for every $i\in\{1,\ldots,N\}$,

$$\overline C_i(\Gamma)=\Big\{\xi\in\mathbb{R}^d,\ |\xi-x_i|=\min_{1\le j\le N}|\xi-x_j|\Big\}$$

and

$$\mathring C_i(\Gamma)=\Big\{\xi\in\mathbb{R}^d,\ |\xi-x_i|<\min_{j\ne i}|\xi-x_j|\Big\}.$$

Furthermore, $\overline C_i(\Gamma)$ and $\mathring C_i(\Gamma)$ are polyhedral convex sets, as intersections of the half-spaces containing $x_i$ defined by the median hyperplanes $H_{ij}$ of the pairs $(x_i,x_j)$, $j\ne i$.
Let $\Gamma=\{x_1,\ldots,x_N\}$ be an $N$-quantizer. The nearest neighbor projection $\mathrm{Proj}_\Gamma:\mathbb{R}^d\to\{x_1,\ldots,x_N\}$ induced by a Voronoi partition $\big(C_i(\Gamma)\big)_{i=1,\ldots,N}$ is defined by

$$\forall\,\xi\in\mathbb{R}^d,\qquad \mathrm{Proj}_\Gamma(\xi):=\sum_{i=1}^N x_i\,\mathbf{1}_{\{\xi\in C_i(\Gamma)\}}.$$

Then, we define the resulting quantization of $X$ by composing $\mathrm{Proj}_\Gamma$ and $X$, namely

$$\widehat X^\Gamma=\mathrm{Proj}_\Gamma(X)=\sum_{i=1}^N x_i\,\mathbf{1}_{\{X\in C_i(\Gamma)\}}.\tag{5.1}$$
There are as many quantizations of $X$ as Voronoi partitions induced by $\Gamma$ (denoting all of them by $\widehat X^\Gamma$ is then an abuse of notation). The pointwise error induced when replacing $X$ by $\widehat X^\Gamma$ is given by

$$|X-\widehat X^\Gamma|=\mathrm{dist}\big(X,\{x_1,\ldots,x_N\}\big)=\min_{1\le i\le N}|X-x_i|$$

and does not depend on the selected nearest neighbor projection on $\Gamma$. When $X$ has a strongly continuous distribution, i.e. $\mathbb P(X\in H)=0$ for any hyperplane $H$ of $\mathbb{R}^d$, the boundaries of the Voronoi cells $C_i(\Gamma)$ are $\mathbb P$-negligible, so that any two quantizations induced by $\Gamma$ are $\mathbb P$-a.s. equal.
Definition 5.2 (a) The mean quadratic quantization error induced by an $N$-quantizer $\Gamma\subset\mathbb{R}^d$ is defined as the quadratic norm of the pointwise error, i.e.

$$\|X-\widehat X^\Gamma\|_2=\Big(\mathbb E\min_{1\le i\le N}|X-x_i|^2\Big)^{\frac12}=\Big(\int_{\mathbb{R}^d}\min_{1\le i\le N}|\xi-x_i|^2\,\mathbb P_X(d\xi)\Big)^{\frac12}.\tag{5.2}$$

(b) The quadratic distortion function at level $N$ is defined on $(\mathbb{R}^d)^N$ as the squared mean quadratic quantization error, namely

$$Q_{2,N}:\ x=(x_1,\ldots,x_N)\longmapsto\mathbb E\min_{1\le i\le N}|X-x_i|^2=\|X-\widehat X^{\Gamma_x}\|_2^2.\tag{5.3}$$
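In concrete terms (our illustration, with Python's `random` module): for $X\stackrel{d}{=}\mathcal U([0,1])$ and the 2-quantizer $\Gamma=\{1/4,3/4\}$, the Voronoi cells are $[0,1/2)$ and $[1/2,1]$, and the distortion (5.3) equals $2\int_0^{1/2}(u-\tfrac14)^2du=1/48\approx0.0208$, which a plain Monte Carlo estimate recovers.

```python
import random

def quantize(xi, grid):
    """Nearest-neighbor projection of xi onto the grid."""
    return min(grid, key=lambda x: (xi - x) ** 2)

random.seed(0)
grid = [0.25, 0.75]
n = 20000
samples = [random.random() for _ in range(n)]

# Monte Carlo estimate of the quadratic distortion E min_i |X - x_i|^2.
mc_distortion = sum((u - quantize(u, grid)) ** 2 for u in samples) / n
print(mc_distortion)   # exact value for U([0,1]) and this grid: 1/48
```

The same two functions, applied to any sample and any grid, estimate $\|X-\widehat X^\Gamma\|_2^2$ for arbitrary distributions.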

The following facts are obvious:

– if $\Gamma=\{x_1,\ldots,x_N\}$ is an $N$-quantizer, then $Q_{2,N}(x_1,\ldots,x_N)=\|X-\widehat X^\Gamma\|_2^2$;

– the quadratic distortion function clearly satisfies

$$\inf_{x\in(\mathbb{R}^d)^N}Q_{2,N}(x)=\inf\big\{\|X-\widehat X^\Gamma\|_2^2,\ \Gamma\subset\mathbb{R}^d,\ |\Gamma|\le N\big\}$$

since any grid $\Gamma$ with cardinality at most $N$ can be "represented" by an $N$-tuple by repeating some components in an appropriate way.

We briefly recall some classical facts about theoretical and numerical aspects of Optimal Quantization. For further details, we refer e.g. to [129, 228, 231–233].
Theorem 5.1 (Existence of optimal $N$-quantizers, [129, 224]) Let $X\in L^2_{\mathbb{R}^d}(\mathbb P)$ and let $N\in\mathbb N^*$.

(a) The quadratic distortion function $Q_{2,N}$ at level $N$ attains a minimum at an $N$-tuple $x^{(N)}\in(\mathbb{R}^d)^N$ and $\Gamma_{x^{(N)}}=\{x_i^{(N)},\ i=1,\ldots,N\}$ is an optimal quantizer at level $N$ (though its cardinality may be lower than $N$, see the remark below).

(b) If the support of the distribution $\mathbb P_X$ of $X$ has at least $N$ elements, then $x^{(N)}=(x_1^{(N)},\ldots,x_N^{(N)})$ has pairwise distinct components, $\mathbb P\big(X\in C_i(x^{(N)})\big)>0$, $i=1,\ldots,N$ (and $\min_{x\in(\mathbb{R}^d)^{N-1}}Q_{2,N-1}(x)>0$). Furthermore, the sequence $N\mapsto\inf_{x\in(\mathbb{R}^d)^N}Q_{2,N}(x)$ converges to 0 and is (strictly) decreasing as long as it is positive.

Remark. If $\mathrm{supp}(\mathbb P_X)$ is finite, say $\mathrm{supp}(\mathbb P_X)=\{x_1,\ldots,x_{N_0}\}\subset\mathbb{R}^d$, $N_0\ge1$ (with pairwise distinct $x_i$), then $x^{(N_0)}=(x_1,\ldots,x_{N_0})$ is an optimal quantizer at level $N_0$ with $\min_{x\in(\mathbb{R}^d)^{N_0}}Q_{2,N_0}(x)=0$, and for every level $N\ge N_0$, $\min_{x\in(\mathbb{R}^d)^N}Q_{2,N}(x)=0$.
Proof. (a) We will proceed by induction on the level $N$. First note that the $L^2$-mean quantization error function defined on $(\mathbb{R}^d)^N$ by

$$(x_1,\ldots,x_N)\longmapsto Q_{2,N}(x_1,\ldots,x_N)^{1/2}=\Big\|\min_{1\le i\le N}|X-x_i|\Big\|_2$$

is clearly 1-Lipschitz continuous with respect to the $\ell^\infty$-norm on $(\mathbb{R}^d)^N$ defined by $\|(x_1,\ldots,x_N)\|_\infty:=\max_{1\le i\le N}|x_i|$. This is a straightforward consequence of Minkowski's inequality combined with the more elementary inequality

$$\Big|\min_{1\le i\le N}a_i-\min_{1\le i\le N}b_i\Big|\le\max_{1\le i\le N}|a_i-b_i|$$

(whose proof is left to the reader). As a consequence, it implies the continuity of its square, the quadratic distortion function $Q_{2,N}$.

As a preliminary remark, note that by its very definition the sequence $N\mapsto\inf_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^{\Gamma_x}\|_2$ is non-increasing.

• $N=1$. The non-negative strictly convex function $Q_{2,1}$ clearly goes to $+\infty$ as $|x_1|\to+\infty$. Hence, $Q_{2,1}$ attains a unique minimum at the mean of $X$, i.e. $x_1^{(1)}=\mathbb EX$. So $\{x_1^{(1)}\}$ is an optimal quantization grid at level 1.
• $N\Rightarrow N+1$. Assume there exists an $x^{(N)}\in(\mathbb{R}^d)^N$ such that $Q_{2,N}(x^{(N)})=\min_{(\mathbb{R}^d)^N}Q_{2,N}$. Set $\Gamma_{x^{(N)}}=\{x_i^{(N)},\ i=1,\ldots,N\}$. Then, either $\mathrm{supp}(\mathbb P_X)\setminus\Gamma_{x^{(N)}}=\emptyset$ and any $(N+1)$-tuple of $(\mathbb{R}^d)^{N+1}$ which "exhausts" the grid $\Gamma_{x^{(N)}}$ makes the function $Q_{2,N+1}$ equal to 0, its lowest possible value, or there exists a $\xi_{N+1}\in\mathrm{supp}(\mathbb P_X)\setminus\Gamma_{x^{(N)}}$. In this second case, let $\Gamma^*=\Gamma_{x^{(N)}}\cup\{\xi_{N+1}\}$ and let $\big(C_i(\Gamma^*)\big)_{1\le i\le N+1}$ be a Voronoi partition of $\mathbb{R}^d$ where $C_{N+1}(\Gamma^*)$ is the Voronoi cell of $\xi_{N+1}$. As $\xi_{N+1}\notin\Gamma_{x^{(N)}}$, it is clear that $\mathring C_{N+1}(\Gamma^*)\ne\emptyset$ and that $|X-\xi_{N+1}|<\min_{1\le i\le N}|X-x_i^{(N)}|$ on the interior of this cell. Furthermore, $\mathbb P\big(X\in C_{N+1}(\Gamma^*)\big)\ge\mathbb P\big(X\in\mathring C_{N+1}(\Gamma^*)\big)>0$ since $\xi_{N+1}\in\mathring C_{N+1}(\Gamma^*)$ and $\xi_{N+1}\in\mathrm{supp}(\mathbb P_X)$. Note that one always has $|X-\xi_{N+1}|\wedge\min_{1\le i\le N}|X-x_i^{(N)}|\le\min_{1\le i\le N}|X-x_i^{(N)}|$, so that combining both inequalities yields

$$\lambda_{N+1}:=\|X-\widehat X^{\Gamma^*}\|_2^2=\mathbb E\Big(|X-\xi_{N+1}|^2\wedge\min_{1\le i\le N}|X-x_i^{(N)}|^2\Big)<\mathbb E\min_{1\le i\le N}|X-x_i^{(N)}|^2=Q_{2,N}(x^{(N)})=\min_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^{\Gamma_x}\|_2^2.$$

In particular, $\mathrm{card}(\Gamma_{x^{(N)}})=N$: otherwise $\mathrm{card}(\Gamma^*)\le N-1+1=N$, which would contradict the fact that $Q_{2,N}$ attains its minimum at $x^{(N)}$, since any $N$-tuple $x^*$ "exhausting" the values of $\Gamma^*$ would satisfy $Q_{2,N}(x^*)<Q_{2,N}(x^{(N)})$. Hence, the set

$$K_{N+1}=\big\{x\in(\mathbb{R}^d)^{N+1}:\ Q_{2,N+1}(x)\le\lambda_{N+1}\big\}$$

is non-empty since it contains all the $(N+1)$-tuples which "exhaust" the elements of $\Gamma^*$. It is closed since $Q_{2,N+1}$ is continuous. Let us show that it is also a bounded subset of $(\mathbb{R}^d)^{N+1}$. Let $x^{[k]}=\big(x^{[k],1},\ldots,x^{[k],N+1}\big)$, $k\in\mathbb N^*$, be a $K_{N+1}$-valued sequence of $(N+1)$-tuples. Up to at most $N+1$ extractions, one may assume without loss of generality that there exists a subset $I\subset\{1,\ldots,N+1\}$ such that for every $i\in I$, $x^{[k],i}\to x^{[\infty],i}\in\mathbb{R}^d$, and for every $i\notin I$, $|x^{[k],i}|\to+\infty$ as $k\to+\infty$. By a straightforward application of Fatou's Lemma (where $|I|$ denotes here the cardinality of the index set $I$),

$$\liminf_k Q_{2,N+1}(x^{[k]})\ge\Big\|\min_{i\in I}|X-x^{[\infty],i}|\Big\|_2^2\ge\inf_{y\in(\mathbb{R}^d)^{|I|}}Q_{2,|I|}(y).$$

The sequence $\big(x^{[k]}\big)_{k\in\mathbb N^*}$ being $K_{N+1}$-valued, one has

$$\inf_{y\in(\mathbb{R}^d)^{|I|}}Q_{2,|I|}(y)\le\lambda_{N+1}<\inf_{x\in(\mathbb{R}^d)^N}Q_{2,N}(x).$$

In turn, this implies that $|I|=N+1$, i.e. the sequence of $(N+1)$-tuples $\big(x^{[k]}\big)_{k\ge1}$ is bounded. As a consequence, the set $K_{N+1}$ is compact and the function $Q_{2,N+1}$ attains a minimum over $K_{N+1}$ at an $x^{(N+1)}$ which is obviously its absolute minimum and has pairwise distinct components such that, with obvious notation, $\mathrm{card}(\Gamma_{x^{(N+1)}})=N+1$ and $\mathbb P\big(X\in C_i(x^{(N+1)})\big)>0$, $i=1,\ldots,N+1$.
138 5 Optimal Quantization Methods I: Cubatures

(b) The strict decrease of N ↦ inf_{(ℝ^d)^N} Q_{2,N} as long as it is not 0 is a straightforward
consequence of the proof of Claim (a). If (z_N)_{N≥1} is an everywhere dense sequence
in ℝ^d, then

    0 ≤ inf_{(ℝ^d)^N} Q_{2,N} ≤ ‖X − X̂^{(z_1,…,z_N)}‖_2² = ‖ min_{1≤i≤N} |X − z_i| ‖_2² ↓ 0 as N → +∞

by the Lebesgue dominated convergence theorem (min_{1≤i≤N} |X − z_i| ≤ |X − z_1| ∈
L² ensures the domination property). The other claims follow from the proof of
Claim (a). ♦

The preceding leads naturally to the following definition.


Definition 5.3 A grid associated to any N -tuple solution to the above distortion
minimization problem is called an optimal quadratic N -quantizer or an optimal
quadratic quantizer at level N (the term “quadratic” may be dropped in the absence
of ambiguity).
Notation (a slight abuse of). When an N-tuple x = (x_1, …, x_N) has pairwise
distinct components, we will often use the notation X̂^x instead of X̂^{Γ_x}. For simplicity,
we will also call it an N-quantizer. Similarly, we will write C_i(x) instead of
C_i(Γ_x) to denote Voronoi cells.

Remarks. • When N = 1, the N-optimal quantizer is always unique and equal to E X.
• When N ≥ 2, the set argmin Q_{2,N} is never reduced to a single N-tuple (except if X
is P-a.s. constant), simply because argmin Q_{2,N} is stable under the action of the
N! permutations of {1, …, N}. The question of the geometric uniqueness – as a grid
(non-ordered set) – is much more involved. When d ≥ 2, uniqueness usually fails
if the distribution of X is invariant under isometries. Thus, the normal distribution
N(0; I_d) is invariant under all orthogonal transforms and so is argmin Q_{2,N}. But
there are also examples (see [129]) for which optimal grids at level N do not even
make up a "connected" set.
• However, in one dimension, it has been proved (see e.g. [168] and Sect. 5.3.1
further on) that, as soon as μ is absolutely continuous with a log-concave density,
there exists exactly one optimal quantization grid at level N . This grid has full size N
and is characterized by its stationarity so that argmin Q2,N is made of the N ! resulting
N -tuples. Such distributions are often called unimodal.
• This existence result admits many extensions, in particular in infinite dimensions
when Rd is replaced by a separable Hilbert space or, more generally, a reflexive
Banach space and X is a Radon random vector (but also for L1 -spaces). In such
infinite-dimensional settings vector quantization is known as functional quantization,
see [226] for a general introduction and [131, 204] for various existence and regularity
results for functional quantizers.

 Exercises. 1. (Unimodality). Show that if X has a unimodal distribution, then |X|
also has a unimodal distribution.
5.1 Theoretical Background on Vector Quantization 139

2. (L^p-optimal quantizers). Let p ∈ (0, +∞) and X ∈ L^p_{ℝ^d}(P). Show that the L^p-
distortion function at level N ∈ ℕ* defined by

    Q_{p,N} : x = (x_1, …, x_N) ↦ E min_{1≤i≤N} |X − x_i|^p = ‖X − X̂^x‖_p^p    (5.4)

attains its minimum on (ℝ^d)^N. (Note that when p = N = 1 this minimum is attained
at the median of the distribution of X.)
3. (Constrained quantization at 0). (a) Show that, if X ∈ L²_{ℝ^d}(P) and N ∈ ℕ*, the
function defined on (ℝ^d)^N by

    (x_1, …, x_N) ↦ E [ min_{1≤i≤N} |X − x_i|² ∧ |X|² ] = ∫_{ℝ^d} min_{1≤i≤N} |ξ − x_i|² ∧ |ξ|² P_X(dξ)    (5.5)

attains a minimum at an N-tuple (x*_1, …, x*_N).
(b) How would you interpret (x*_1, …, x*_N) in terms of quadratic optimal quantization?
At which level?

Proposition 5.1 Assume that the support of P_X has at least N elements.
(a) Any L²-optimal N-quantizer x^{(N)} ∈ (ℝ^d)^N is stationary in the following sense
(see [224, 226]): for every Voronoi quantization X̂^{x^{(N)}} of X,

    E(X | X̂^{x^{(N)}}) = X̂^{x^{(N)}}.    (5.6)

(b) An optimal N-quantization X̂^{x^{(N)}} of X (i.e. x^{(N)} ∈ argmin Q_{2,N}) is the best
quadratic approximation of X among all random variables Y : (Ω, A, P) → ℝ^d
taking at most N values:

    ‖X − X̂^{x^{(N)}}‖_2 = min { ‖X − Y‖_2, Y : (Ω, A, P) → ℝ^d, measurable, |Y(Ω)| ≤ N }.

Proof. (a) Let x^{(N)} be an optimal N-quantizer and let X̂^{x^{(N)}} be an optimal quan-
tization of X given by (5.1), where (C_i(x^{(N)}))_{1≤i≤N} is a Voronoi partition induced
by x^{(N)}. By the definition of conditional expectation as an orthogonal projector on
L²(σ(X̂^{x^{(N)}})) = { φ(X̂^{x^{(N)}}), φ : Γ_{x^{(N)}} → ℝ, Borel }, we know that X − E(X | X̂^{x^{(N)}})
⊥ L²(σ(X̂^{x^{(N)}})) in L²(Ω, A, P). As X̂^{x^{(N)}} − E(X | X̂^{x^{(N)}}) lies in L²(σ(X̂^{x^{(N)}})), it
follows from Pythagoras' Theorem that

    ‖X − X̂^{x^{(N)}}‖_2² = ‖X − E(X | X̂^{x^{(N)}})‖_2² + ‖E(X | X̂^{x^{(N)}}) − X̂^{x^{(N)}}‖_2².

On the other hand, the random variable E(X | X̂^{x^{(N)}}) takes at most N values so that,
x^{(N)} being an optimal N-quantizer and by the definition of a Γ_{x^{(N)}} = {x_1^{(N)}, …, x_N^{(N)}}-valued
Voronoi quantizer,

    ‖X − E(X | X̂^{x^{(N)}})‖_2 ≥ min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_2 = ‖dist(X, Γ_{x^{(N)}})‖_2 = ‖X − X̂^{x^{(N)}}‖_2.
2 2 2
 
The former equality and this inequality are compatible if and only if E(X | X̂^{x^{(N)}}) =
X̂^{x^{(N)}} P-a.s.
(b) Let Y : (Ω, A, P) → ℝ^d with |Y(Ω)| ≤ N and let Γ = {Y(ω), ω ∈ Ω}. It is clear
that |Γ| ≤ N, so that

    ‖X − Y‖_2 ≥ ‖dist(X, Γ)‖_2 = ‖X − X̂^Γ‖_2 ≥ min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_2 = ‖X − X̂^{x^{(N)}}‖_2. ♦

Remark. • An important additional property of an optimal quantizer is shown
in [129] (Theorem 4.2, p. 38): the boundaries of any of its Voronoi partitions are
P_X-negligible, i.e. P(X ∈ ∪_{i=1}^N ∂C_i(x)) = 0 (even if P_X has atoms).
• Let x ∈ (ℝ^d)^N be an N-tuple with pairwise distinct components, all of whose Voronoi
partitions have a P_X-negligible boundary, i.e. P(X ∈ ∪_{i=1}^N ∂C_i(x)) = 0. Then the
Voronoi quantization X̂^x of X is P-a.s. uniquely defined and, if x is only a local
minimum, or any other kind of critical point, of the quadratic distortion function
Q_{2,N}, then x is a stationary quantizer, still in the sense that

    E(X | X̂^x) = X̂^x.

This is a straightforward consequence of the differentiability of the quadratic
distortion Q_{2,N} at N-tuples with pairwise distinct components and negligible Voronoi
cell boundaries, established further on in Chap. 6 (see Proposition 6.3.1).
An extended definition of a stationary quantizer is as follows (note that it is not
intrinsic since it depends upon the choice of a Voronoi partition).

Definition 5.4 An N-tuple x ∈ (ℝ^d)^N with pairwise distinct components is a station-
ary quantizer if there exists a Voronoi partition of ℝ^d induced by x whose resulting
quantization X̂^x satisfies P(X ∈ C_i(x)) > 0, i = 1, …, N, and the stationarity prop-
erty

    E(X | X̂^x) = X̂^x.    (5.7)

 Exercise (Non-optimal stationary quantizer). Let P_X = μ = (1/4)(δ_0 + δ_{1/2}) + (1/2) δ_{3/4}
be a distribution on the real line.
(a) Show that (1/4, 3/4) is a stationary quantizer for μ (in the sense of the above
definition).
(b) Show that (1/8, 5/8) is an optimal quadratic 2-quantizer for μ.
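Part (a) lends itself to a mechanical check. Below is a minimal illustrative sketch in exact rational arithmetic (the assignment of the boundary atom 1/2 to the left cell is one admissible choice of Voronoi partition, as Definition 5.4 allows):

```python
from fractions import Fraction as F

# mu = (1/4)(delta_0 + delta_{1/2}) + (1/2) delta_{3/4}
atoms = [(F(0), F(1, 4)), (F(1, 2), F(1, 4)), (F(3, 4), F(1, 2))]
grid = [F(1, 4), F(3, 4)]
boundary = (grid[0] + grid[1]) / 2      # = 1/2, the edge between the two cells

# One admissible Voronoi partition: the boundary atom 1/2 goes to the left cell
cells = [[(xi, p) for xi, p in atoms if xi <= boundary],
         [(xi, p) for xi, p in atoms if xi > boundary]]

for point, cell in zip(grid, cells):
    mass = sum(p for _, p in cell)
    centroid = sum(xi * p for xi, p in cell) / mass
    assert mass > 0 and centroid == point   # stationarity: E[X | X in C_i] = x_i
```

Each cell has positive mass and its conditional mean equals the corresponding grid point, which is exactly the stationarity property (5.7).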

Figure 5.1 shows a quadratic optimal – or at least close to optimal – N-quantizer
for a bivariate normal distribution N(0, I_2) with N = 200.
Figures 5.2 and 5.3 (Gaussian shaped line) illustrate on the bivariate normal distri-
bution the intuitive fact that optimal vector quantization does not produce quantizers
whose Voronoi cells all have the same "weights". One observes that the closer a cell


Fig. 5.1 Optimal quadratic quantization of size N = 200 of the bi-variate normal distribution
N (0, I2 )

Fig. 5.2 Voronoi Tessellation of an optimal N -quantizer (N = 500) for the N (0; I2 ) distribution.
Color code: the heavier the cell is, the warmer (i.e. the lighter) the cell looks (with J. Printems)

is to the mode of the distribution, the heavier it weighs. In fact, optimizing a quantizer
tends to equalize the local inertia of the cells, i.e.

    E [ 1_{X∈C_i(x^{(N)})} |X − x_i^{(N)}|² ] ≈ E [ min_{1≤j≤N} |X − x_j^{(N)}|² ] / N,   i = 1, …, N.

Fig. 5.3 x_i^{(N)} ↦ P(X ∈ C_i(x^{(N)})) (flat line) and x_i^{(N)} ↦ E [1_{X∈C_i(x^{(N)})} |X − x_i^{(N)}|²],
X ∼ N(0; 1) (Gaussian line), x^{(N)} optimal N-quantizer, N = 50 (with J.-C. Fort)

This fact can be easily highlighted numerically in one dimension, e.g. on the normal
distribution, as illustrated in Fig. 5.3 (1).

Let us return to the study of the quadratic mean quantization error and, to be more
precise, to its asymptotic behavior as the quantization level N goes to infinity. As seen
in Theorem 5.1, the fact that the minimal mean quadratic quantization error goes to
zero as N goes to infinity is relatively obvious since it follows from the existence of
an everywhere dense sequence in Rd . Determining the rate of this convergence is a
much more involved question which is answered by the following theorem, known as
Zador’s Theorem, that we will essentially admit. In fact, optimal vector quantization
can be defined and developed in any Lp -space and, keeping in mind some future
application (in L1 ), we will state the theorem in such a general framework.

Theorem 5.2 (Zador’s Theorem) Let d ∈ N∗ and let p ∈ (0, +∞).


p+δ
(a) Sharp rate (see [129]). Let X ∈ LRd (P) for some δ > 0. Let PX (d ξ) =
ϕ(ξ)λd (d ξ) + ν(d ξ), where ν ⊥ λd , i.e. is singular with respect to the Lebesgue
measure λd on Rd (2 ). Then, there is a constant 
Jp,d ∈ (0, +∞) such that
  p1 + d1
x =
1 d
lim N min d X −X p
Jp,d ϕ d +p d λd .
N →+∞ x∈(Rd )N Rd

− 2·3
ξ2  
1 The “Gaussian” solid line follows the shape of ξ → ϕ 1 (ξ) = √e √ , i.e. P X ∈ Ci (x(N ) ) 
3 3 2π
(N )
ϕ 1 (xi ), i = 1, . . . , N , in a sense to be made precise. This is another property of optimal quantizers
3
which is beyond the scope of this textbook, see e.g. [132].
2 This means that there is a Borel set A ∈ B or(Rd ) such that λ (A ) = 0 and ν(A ) = ν(Rd ). Such
ν d ν ν
a decomposition always exists and is unique.

(b) Non-asymptotic upper bound (see [205]). Let δ > 0. There exists a real
constant C_{d,p,δ} ∈ (0, +∞) such that, for every ℝ^d-valued random vector X,

    ∀ N ≥ 1,  min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_p ≤ C_{d,p,δ} σ_{p+δ}(X) N^{−1/d},

where, for r ∈ (0, +∞), σ_r(X) = min_{a∈ℝ^d} ‖X − a‖_r ≤ +∞.

Additional proof of Claim (b) (see also Theorem 1 in [236] and the remark that
follows). In fact, what is precisely proved in [205] (Lemma 1) is

    ∀ N ≥ 1,  min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_p ≤ C_{d,p,δ} ‖X‖_{p+δ} N^{−1/d}.

To derive the above conclusion, just note that the quantization error is invariant under
translation: the Voronoi quantization of X − a induced by the translated grid
x ⊖ a := {x_1 − a, …, x_N − a} is X̂^x − a, so that X − X̂^x = (X − a) − (X̂^x − a), which
in turn implies

    ∀ a ∈ ℝ^d,  min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_p = min_{x∈(ℝ^d)^N} ‖(X − a) − (X̂^x − a)‖_p
                                           ≤ C_{d,p,δ} ‖X − a‖_{p+δ} N^{−1/d},

where the inequality follows from the above bound applied to X − a (minimizing over
all grids x is the same as minimizing over all translated grids x ⊖ a). Minimizing
over a completes the proof. ♦

Remarks. • By truncating any random vector X, one easily checks that one always
has

    lim inf_{N→+∞} N^{1/d} min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_p ≥ J̃_{p,d} ( ∫_{ℝ^d} φ^{d/(d+p)} dλ_d )^{1/p + 1/d}.

• The first rigorous proof of claim (a) in a general framework is due to S. Graf and H.
Luschgy in [129]. Claim (b) is an improved version of the so-called Pierce Lemma,
also established in [129].
• The N^{−1/d} factor is known as the curse of dimensionality: this is the optimal rate to
"fill" a d-dimensional space by 0-dimensional objects.
• The real constant J̃_{p,d} clearly corresponds to the case of the uniform distribution
U([0, 1]^d) over the unit hypercube [0, 1]^d, for which the following slightly more
precise statement holds:

    lim_N N^{1/d} min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_p = inf_N N^{1/d} min_{x∈(ℝ^d)^N} ‖X − X̂^x‖_p = J̃_{p,d}.

A key part of the proof is a self-similarity argument "à la Hammersley" which
establishes the theorem for the U([0, 1]^d) distributions.
• Zador's Theorem holds true for any general – possibly non-Euclidean – norm on ℝ^d
and the value of J̃_{p,d} depends on the norm on ℝ^d under consideration. When d = 1,
elementary computations show that J̃_{2,1} = 1/(2√3). When d = 2, with the canonical
Euclidean norm, one shows (see [218] for a proof, see also [129]) that J̃_{2,2}² = 5/(18√3).
Its exact value is unknown for d ≥ 3 but, still for the canonical Euclidean norm, one
has, using some random quantization arguments (see [129]),

    J̃_{2,d} ∼ √(d/(2πe)) ≃ √(d/17.08) as d → +∞.

5.2 Cubature Formulas

The random vector X̂^x takes its values in the finite set (or grid) Γ_x = {x_1, …, x_N} (of
size N), so for every continuous function F : ℝ^d → ℝ with F(X) ∈ L²(P), we have

    E F(X̂^x) = Σ_{i=1}^N F(x_i) P(X ∈ C_i(x)),

which is the quantization-based cubature formula to approximate E F(X) (see [224,
228]). Indeed, as X̂^x is close to X in L²(P), it is natural to estimate E F(X) by E F(X̂^x)
when F is continuous. Furthermore, when F is Lipschitz continuous, one can provide
an upper bound for the resulting error using the quantization errors ‖X − X̂^x‖_1 (which
comes out naturally) and ‖X − X̂^x‖_2, or its square when the quantizer x is stationary
(see the following sections).
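In code, this cubature formula is just a weighted sum over the grid. A minimal sketch follows; for concreteness it uses the midpoint grid x_i = (2i − 1)/(2N) with uniform weights 1/N, which is the optimal quadratic N-quantizer of U([0,1]) (see Sect. 5.3.1); for any other distribution, the grid and its companion weights P(X ∈ C_i(x)) must be supplied:

```python
import math

def quantization_cubature(grid, weights, F):
    # E F(X^x) = sum_i F(x_i) * P(X in C_i(x))
    return sum(p * F(xi) for xi, p in zip(grid, weights))

N = 100
grid = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]   # midpoint quantizer of U([0,1])
weights = [1.0 / N] * N

approx = quantization_cubature(grid, weights, math.exp)
exact = math.e - 1.0     # E e^X = integral of e^x over [0,1]
assert abs(approx - exact) < 1e-4
```

Note that once the grid and weights are stored, evaluating the cubature for a new function F costs only N function evaluations.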


Likewise, one can consider a priori the σ(X̂^x)-measurable random variable F(X̂^x)
as a good approximation of the conditional expectation E(F(X) | X̂^x).
This principle also suggests to approximate the conditional expectation E(F(X) | Y)
by E(F(X̂^x) | Ŷ^y). In that case one needs the "transition" probabilities

    P(X ∈ C_j(x) | Y ∈ C_i(y)).

Numerical computation of E F(X̂^x) is possible as soon as F(ξ) can be computed
at any ξ ∈ ℝ^d and the distribution (P(X̂^x = x_i))_{1≤i≤N} of X̂^x is known. The induced
quantization error ‖X − X̂^x‖_2 is used to control the error (see hereafter). These
quantities related to the quantizer x are also called companion parameters.



5.2.1 Lipschitz Continuous Functions

Assume that the function F : ℝ^d → ℝ is Lipschitz continuous on ℝ^d with Lipschitz
coefficient [F]_Lip. Then

    |E(F(X) | X̂^x) − F(X̂^x)| = |E(F(X) − F(X̂^x) | X̂^x)| ≤ [F]_Lip E(|X − X̂^x| | X̂^x)

so that, for every real exponent r ≥ 1,

    ‖E(F(X) | X̂^x) − F(X̂^x)‖_r ≤ [F]_Lip ‖X − X̂^x‖_r

owing to the conditional Jensen inequality applied to the convex function u ↦ u^r,
see Proposition 3.1. In particular, using that E F(X) = E [E(F(X) | X̂^x)], one
derives (with r = 1) that

    |E F(X) − E F(X̂^x)| ≤ ‖E(F(X) | X̂^x) − F(X̂^x)‖_1 ≤ [F]_Lip ‖X − X̂^x‖_1.

Finally, using the monotonicity of the L^r(P)-norms as a function of r yields

    |E F(X) − E F(X̂^x)| ≤ [F]_Lip ‖X − X̂^x‖_1 ≤ [F]_Lip ‖X − X̂^x‖_2.    (5.8)

This universal bound is optimal in the following sense: considering the 1-Lipschitz
continuous function F(ξ) := min_{i=1,…,N} |ξ − x_i| = dist(ξ, Γ_x), which vanishes at
every component x_i of x, shows that equality may hold in (5.8), so that

    ‖X − X̂^x‖_1 = sup_{[F]_Lip ≤ 1} |E F(X) − E F(X̂^x)|.    (5.9)

In turn, due to the Monge–Kantorovich characterization of the L¹-Wasserstein dis-
tance (3) (see [272] for a definition and a characterization), (5.9) also reads

    ‖X − X̂^x‖_1 = W_1(P_X, P_{X̂^x}).

3 If μ and ν are distributions on (ℝ^d, Bor(ℝ^d)) with a finite p-th moment (1 ≤ p < +∞), the
L^p-Wasserstein distance between μ and ν is defined by

    W_p(μ, ν) = inf { ( ∫_{ℝ^d×ℝ^d} |x − y|^p m(dx, dy) )^{1/p} : m Borel distribution on ℝ^d × ℝ^d,
                      m(dx × ℝ^d) = μ(dx), m(ℝ^d × dy) = ν(dy) }.

When p = 1, the Monge–Kantorovich characterization of W_1 reads as follows:

    W_1(μ, ν) = sup { | ∫_{ℝ^d} f dμ − ∫_{ℝ^d} f dν | : f : ℝ^d → ℝ, [f]_Lip ≤ 1 }.

Note that the absolute values can be removed in the above characterization of W_1(μ, ν) since f and
−f are simultaneously Lipschitz continuous with the same Lipschitz coefficient.

For an introduction to the Wasserstein distance and its main properties, we refer
to [272].
Moreover, since the bounded Lipschitz continuous functions make up a character-
izing family for the weak convergence of probability measures on ℝ^d (see Theo-
rem 12.6(ii)), one derives that, for any sequence of N-quantizers x^{(N)} satisfying
‖X − X̂^{x^{(N)}}‖_1 → 0 as N → +∞ (but a priori not optimal),

    P_{X̂^{x^{(N)}}} = Σ_{1≤i≤N} P(X̂^{x^{(N)}} = x_i^{(N)}) δ_{x_i^{(N)}}  ⟹  P_X,

where ⟹ denotes the weak convergence of probability measures on ℝ^d.
In fact, still due to the Monge–Kantorovich characterization of the L¹-Wasserstein
distance, (5.9) trivially yields the more powerful result

    W_1(P_X, P_{X̂^{x^{(N)}}}) → 0 as N → +∞.

Variants of these cubature formulas can be found in [231] or [131] for functions
F having only some local Lipschitz continuous regularity and polynomial growth.
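As a numerical sanity check of the bound (5.8), one can take X ∼ U([0,1]) with its midpoint N-quantizer, for which ‖X − X̂^x‖_1 = 1/(4N). The sketch below is only illustrative (the kink location 0.305 and the closed-form value of E F(X) are ad hoc choices):

```python
N = 50
grid = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]  # midpoint N-quantizer of U([0,1])
weights = [1.0 / N] * N

F = lambda xi: abs(xi - 0.305)             # 1-Lipschitz test function, [F]_Lip = 1
exact = 0.305 ** 2 / 2 + 0.695 ** 2 / 2    # E |X - 0.305| for X ~ U([0,1])
approx = sum(p * F(xi) for xi, p in zip(grid, weights))

L1_error = 1.0 / (4 * N)                   # ||X - X^x||_1 for the midpoint grid
assert abs(approx - exact) <= L1_error     # the cubature error bound (5.8)
```

Here the cubature error is in fact far below the universal bound, since F is affine on most Voronoi cells; the bound is saturated only by the distance function dist(·, Γ_x) used in (5.9).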

5.2.2 Convex Functions

Let x ∈ (ℝ^d)^N be a stationary quantizer, e.g. in the sense of Definition 5.4 (Eq. (5.7)),
so that E(X | X̂^x) = X̂^x is a stationary quantization of X. Then, Jensen's inequality
yields, for every convex function F : ℝ^d → ℝ such that F(X) ∈ L¹(P),

    F(X̂^x) ≤ E(F(X) | X̂^x).

In particular, this implies

    E F(X̂^x) ≤ E F(X).    (5.10)

5.2.3 Differentiable Functions With Lipschitz Continuous
Gradients (C¹_Lip)

In this chapter we will temporarily denote the canonical inner product on ℝ^d by
⟨x | y⟩ rather than (x | y) to avoid confusion with conditional expectation.
Assume now that F is differentiable on ℝ^d, with a Lipschitz continuous gradient
∇F. Let x ∈ (ℝ^d)^N be a stationary quantizer in the sense of Definition 5.4 and let
X̂^x be the resulting stationary quantization. Note first that F(X) ∈ L²(P) since F has
(at most) quadratic growth and X ∈ L²_{ℝ^d}(P).

We start from the first-order Taylor expansion with integral remainder of F at a
point u ∈ ℝ^d: for every v ∈ ℝ^d,

    F(v) = F(u) + ⟨∇F(u) | v − u⟩ + ∫_0^1 ⟨∇F(tv + (1−t)u) − ∇F(u) | v − u⟩ dt.

Using the Schwarz Inequality and the fact that ∇F is Lipschitz continuous, we obtain

    |F(v) − F(u) − ⟨∇F(u) | v − u⟩| ≤ ∫_0^1 |⟨∇F(tv + (1−t)u) − ∇F(u) | v − u⟩| dt
                                    ≤ ∫_0^1 |∇F(tv + (1−t)u) − ∇F(u)| |v − u| dt
                                    ≤ [∇F]_Lip |v − u|² ∫_0^1 t dt = ([∇F]_Lip / 2) |v − u|².

Then, with v = X and u = X̂^x, the inequality reads

    |F(X) − F(X̂^x) − ⟨∇F(X̂^x) | X − X̂^x⟩| ≤ ([∇F]_Lip / 2) |X − X̂^x|².    (5.11)

Taking conditional expectation given X̂^x yields, keeping in mind that |E(Z | X̂^x)| ≤
E(|Z| | X̂^x),

    |E(F(X) | X̂^x) − F(X̂^x) − E(⟨∇F(X̂^x) | X − X̂^x⟩ | X̂^x)|
                ≤ ([∇F]_Lip / 2) E(|X − X̂^x|² | X̂^x).

Now, using that the random variable ∇F(X̂^x) is σ(X̂^x)-measurable, one has

    E(⟨∇F(X̂^x) | X − X̂^x⟩ | X̂^x) = ⟨∇F(X̂^x) | E(X − X̂^x | X̂^x)⟩ = 0

so that

    |E(F(X) | X̂^x) − F(X̂^x)| ≤ ([∇F]_Lip / 2) E(|X − X̂^x|² | X̂^x).

Then, for every real exponent r ≥ 1, the conditional Jensen inequality applied to
the function u ↦ u^r yields

    ‖E(F(X) | X̂^x) − F(X̂^x)‖_r ≤ ([∇F]_Lip / 2) ‖X − X̂^x‖_{2r}².    (5.12)

In particular, when r = 1, one derives

    |E F(X) − E F(X̂^x)| ≤ ([∇F]_Lip / 2) ‖X − X̂^x‖_2².
2 2

These computations open the way to the following proposition.


Proposition 5.2 Let X ∈ L²(P) and let x (or Γ_x) be a stationary quantizer in the
sense of Definition 5.4.
(a) If F ∈ C¹_Lip(ℝ^d, ℝ), i.e. is differentiable with a Lipschitz continuous gradient ∇F,
then

    |E F(X) − E F(X̂^x)| ≤ (1/2) [∇F]_Lip ‖X − X̂^x‖_2².    (5.13)

Moreover,

    sup { |E F(X) − E F(X̂^x)| : F ∈ C¹_Lip(ℝ^d, ℝ), [∇F]_Lip ≤ 1 } = (1/2) ‖X − X̂^x‖_2².    (5.14)

In fact, this supremum holds as a maximum, attained for the function F(ξ) = (1/2)|ξ|²
since [∇F]_Lip = 1 and

    ‖X − X̂^x‖_2² = E |X|² − E |X̂^x|².    (5.15)

(b) In particular, if F is twice differentiable with a bounded Hessian D²F, then

    |E F(X) − E F(X̂^x)| ≤ (1/2) |||D²F||| ‖X − X̂^x‖_2²,

where |||D²F||| = sup_{x∈ℝ^d} sup_{u:|u|=1} |u* D²F(x) u|.

Proof. (a) The error bound (5.13) is proved above. Equality in (5.14) holds for the
function F(ξ) = (1/2)|ξ|² since [∇F]_Lip = 1 and

    (1/2) E |X − X̂^x|² = (1/2) ( E |X|² − 2 E ⟨X | X̂^x⟩ + E |X̂^x|² )
                        = (1/2) ( E |X|² − 2 E ⟨E(X | X̂^x) | X̂^x⟩ + E |X̂^x|² )
                        = (1/2) ( E |X|² − 2 E |X̂^x|² + E |X̂^x|² )
                        = E F(X) − E F(X̂^x),

where we used in the penultimate line that X̂^x = E(X | X̂^x).
(b) This error bound is straightforward by applying Taylor's formula at order 2 to
F between X and X̂^x: this amounts to replacing [∇F]_Lip by |||D²F||| in (5.11). Or,
equivalently, by showing that |||D²F||| = [∇F]_Lip (see also footnote (3) in the proof
of Step 3 of Proposition 7.4 in Chap. 7). ♦

Variants of these cubature formulas can be found in [231] or [131] for functions
F whose gradient only has local Lipschitz continuous regularity and polynomial
growth.

 Exercise. (a) Let Γ ⊂ ℝ^d be a quantizer and let P_Γ denote the set of probabil-
ity distributions supported by Γ. Show that ‖X − X̂^Γ‖_2 = W_2(μ_X, P_Γ), where μ_X
denotes the distribution of X and W_2(μ_X, P_Γ) = inf_{ν∈P_Γ} W_2(μ_X, ν).
(b) Deduce that, if x is a stationary quantizer, then

    sup { |E F(X) − E F(X̂^x)| : F ∈ C¹_Lip(ℝ^d, ℝ), [∇F]_Lip ≤ 1 } = (1/2) W_2(μ_X, P_{Γ_x})².
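The identity (5.15) behind these quadratic bounds is easy to observe numerically. As an illustrative sketch, take again X ∼ U([0,1]) and its midpoint N-quantizer, which is stationary with squared quantization error 1/(12N²):

```python
N = 4
grid = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]  # stationary midpoint quantizer
weights = [1.0 / N] * N

E_X2 = 1.0 / 3.0                                  # E|X|^2 for X ~ U([0,1])
E_Xhat2 = sum(p * xi ** 2 for xi, p in zip(grid, weights))
quadratic_error = 1.0 / (12.0 * N ** 2)           # ||X - X^x||_2^2 for this grid

# (5.15): ||X - X^x||_2^2 = E|X|^2 - E|X^x|^2 for a stationary quantizer
assert abs((E_X2 - E_Xhat2) - quadratic_error) < 1e-14
```

In particular, E|X̂^x|² always underestimates E|X|², with a gap exactly equal to the squared quantization error, in line with (5.10).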

5.2.4 Quantization-Based Cubature Formulas for E (F(X) | Y)

Let X and Y be two ℝ^d-valued random vectors defined on the same probability
space (Ω, A, P) and let F : ℝ^d → ℝ be a Borel function. Assume that F(X) ∈ L²(P).
The natural idea to approximate E(F(X) | Y) by quantization is to replace mutatis
mutandis the random vectors X and Y by their quantizations X̂ and Ŷ (with respect
to quantizers x and y that we drop in the notation for simplicity). The resulting
approximation is then

    E(F(X) | Y) ≈ E(F(X̂) | Ŷ).

At this stage, a natural question is to look for a priori estimates for the result-
ing quadratic error given the quadratic mean quantization errors ‖X − X̂‖_2 and
‖Y − Ŷ‖_2.
To this end, we need further assumptions on F. Let φ_F : ℝ^d → ℝ be a regular
(Borel) version of the conditional expectation, i.e. satisfying

    E(F(X) | Y) = φ_F(Y).

Usually, no closed form is available for the function φ_F but some regularity property
can be established, or more precisely "transmitted" or "propagated" from F to φ_F.
Thus, we may assume that both F and φ_F are Lipschitz continuous with Lipschitz
coefficients [F]_Lip and [φ_F]_Lip, respectively.
The main example of such a situation is the (homogeneous) Markovian frame-
work, where (X_n)_{n≥0} is a homogeneous Feller Markov chain with transition
(P(y, dx))_{y∈ℝ^d}, and where X = X_k and Y = X_{k−1}. Then, with the above notation,
φ_F = PF. If we assume that the Markov kernel P preserves and propagates Lipschitz
continuity in the sense that [F]_Lip < +∞ implies [PF]_Lip < +∞, the above
assumption is clearly satisfied.
Remark. The above property is the key to quantization-based numerical schemes
in Numerical probability. We refer to Chap. 11 for an application to the pricing
of American options where the above principle is extensively applied to the Euler
scheme which is a Markov process and propagates Lipschitz continuity.
We prove below a slightly more general proposition by considering the situation
where the function F itself is replaced/approximated by a function G. This means
that we approximate E(F(X) | Y) by E(G(X̂) | Ŷ).

Proposition 5.3 Let X, Y : (Ω, A, P) → ℝ^d be two random vectors and let F, G :
ℝ^d → ℝ be two Borel functions.
(a) Quadratic case p = 2. Assume F(X), G(X̂) ∈ L²(P) and that there exists a Lip-
schitz continuous function φ_F such that E(F(X) | Y) = φ_F(Y) ∈ L²(P). Then

    ‖E(F(X) | Y) − E(G(X̂) | Ŷ)‖_2² ≤ ‖F(X) − G(X̂)‖_2² + [φ_F]_Lip² ‖Y − Ŷ‖_2².    (5.16)

In particular, if G = F and F is Lipschitz continuous,

    ‖E(F(X) | Y) − E(F(X̂) | Ŷ)‖_2² ≤ [F]_Lip² ‖X − X̂‖_2² + [φ_F]_Lip² ‖Y − Ŷ‖_2².    (5.17)

(b) L^p-case, p ≠ 2. Assume now that F(X), G(X̂) ∈ L^p(P), p ∈ [1, +∞). Then
φ_F(Y) ∈ L^p(P) and

    ‖E(F(X) | Y) − E(G(X̂) | Ŷ)‖_p ≤ ‖F(X) − G(X̂)‖_p + 2 [φ_F]_Lip ‖Y − Ŷ‖_p.    (5.18)

In particular, if G = F and F is Lipschitz continuous,

    ‖E(F(X) | Y) − E(F(X̂) | Ŷ)‖_p ≤ [F]_Lip ‖X − X̂‖_p + 2 [φ_F]_Lip ‖Y − Ŷ‖_p.    (5.19)

Proof. (a) We first note that

    E(F(X) | Y) − E(G(X̂) | Ŷ) = E(F(X) | Y) − E(F(X) | Ŷ) + E(F(X) − G(X̂) | Ŷ)
                               = E(F(X) | Y) − E(E(F(X) | Y) | Ŷ) + E(F(X) − G(X̂) | Ŷ),

where we used that Ŷ is σ(Y)-measurable.
Now E(F(X) | Y) − E(E(F(X) | Y) | Ŷ) and E(F(X) − G(X̂) | Ŷ) are clearly
orthogonal in L²(P) by the very definition of the orthogonal projector E(· | Ŷ) on
L²(σ(Ŷ), P), so that

    ‖E(F(X) | Y) − E(G(X̂) | Ŷ)‖_2² = ‖E(F(X) | Y) − E(E(F(X) | Y) | Ŷ)‖_2²
                                    + ‖E(F(X) − G(X̂) | Ŷ)‖_2².    (5.20)

Now, using the definition of the conditional expectation given Ŷ as the best quadratic
approximation among σ(Ŷ)-measurable random variables, we get

    ‖E(F(X) | Y) − E(E(F(X) | Y) | Ŷ)‖_2 = ‖φ_F(Y) − E(φ_F(Y) | Ŷ)‖_2
                                         ≤ ‖φ_F(Y) − φ_F(Ŷ)‖_2 ≤ [φ_F]_Lip ‖Y − Ŷ‖_2.

On the other hand, using that E(· | σ(Ŷ)) is an L²-contraction yields

    ‖E(F(X) − G(X̂) | Ŷ)‖_2 ≤ ‖F(X) − G(X̂)‖_2.

Finally,

    ‖E(F(X) | Y) − E(G(X̂) | Ŷ)‖_2² ≤ ‖F(X) − G(X̂)‖_2² + [φ_F]_Lip² ‖Y − Ŷ‖_2².    (5.21)

The case when G = F is Lipschitz continuous is obvious.


(b) First, when p ≠ 2, the Pythagoras-like equality (5.20) should be replaced by the
standard Minkowski inequality. Secondly, using that φ_F(Ŷ) is σ(Ŷ)-measurable, one
has

    ‖φ_F(Y) − E(φ_F(Y) | Ŷ)‖_p ≤ ‖φ_F(Y) − φ_F(Ŷ)‖_p + ‖φ_F(Ŷ) − E(φ_F(Y) | Ŷ)‖_p
                               = ‖φ_F(Y) − φ_F(Ŷ)‖_p + ‖E(φ_F(Ŷ) − φ_F(Y) | Ŷ)‖_p
                               ≤ 2 ‖φ_F(Y) − φ_F(Ŷ)‖_p ≤ 2 [φ_F]_Lip ‖Y − Ŷ‖_p

since E(· | Ŷ) is an L^p-contraction when p ∈ [1, +∞). ♦

 
Remark. Markov kernels (P(x, dξ))_{x∈ℝ^d} which propagate Lipschitz continuity in
the sense that

    [P]_Lip = sup { [Pf]_Lip : f : ℝ^d → ℝ, [f]_Lip ≤ 1 } < +∞

are especially well-adapted to propagate quantization errors using the above error
bounds, since this implies that, if F is Lipschitz continuous, then φ_F = PF is Lipschitz
continuous too.

 Exercises. 1. Detail the proof of the above L^p-error bound when p ≠ 2. [Hint:
show that ‖Z − E(φ(Y) | Ŷ)‖_p ≤ ‖Z − φ(Y)‖_p + 2 ‖φ(Y) − φ(Ŷ)‖_p.]
2. Prove that the Euler scheme with step T/n of a diffusion with Lipschitz continuous
drift b(t, x) = b(x) and diffusion coefficient σ(t, x) = σ(x), starting at x_0 ∈ ℝ^d at
time 0 (see Chap. 7 for a definition), is an ℝ^d-valued homogeneous Markov chain
with respect to the filtration of the Brownian increments whose transition propagates
Lipschitz continuity.

5.3 How to Get Optimal Quantization?

This phase is often considered as the prominent drawback of optimal quantization-
based cubature methods for expectation or conditional expectation computation, at
least when compared to the Monte Carlo method. While computing optimal or opti-
mized quantization grids and their weights is less flexible and more time consuming
than simulating a random vector, one must keep in mind that such grids can be
stored off-line forever and made available instantly. This means that optimal quan-
tization is mainly useful when one needs to compute many integrals (or conditional
expectations) with respect to the same probability distribution, such as the Gaussian
distributions.

5.3.1 Dimension 1…

Though originally introduced to design weighted cubature formulas for the compu-
tation of integrals with respect to distributions in medium dimensions (from 2 to 10
or 12), the quadrature formulas derived for specific one-dimensional distributions or
random variables turn out to be quite useful. This is especially the case when some
commonly encountered random variables are not easy to simulate in spite of the exis-
tence of closed forms for their density, c.d.f. or cumulative first moment functions.
Such is the case for, among others, the (one-sided) Lévy area and the supremum of the
Brownian bridge (or Kolmogorov–Smirnov distribution), which will be investigated
in exercises further on. The main assets of optimal quantization grids and their
companion weight vectors are threefold:

• Such optimal quantizers are specifically fitted to the random variable they quantize,
without any auxiliary "transfer" function – unlike Gauss–Legendre points, which
are naturally adapted to uniform distributions, Gauss–Laguerre points, adapted to
exponential distributions, or Gauss–Hermite points, adapted to Gaussian distributions.
• The error bounds in the resulting quadrature formulas for numerical integration
are tailored for functions with "natural" Lipschitz or C¹_Lip-regularity.
• Finally, in one dimension, as will be seen below, fast deterministic algorithms,
based on fixed point procedures for contracting functions or Newton–Raphson
zero search algorithms, can be implemented, quickly producing optimal quantizers
with high accuracy.

Although of little interest for applications (since other deterministic methods, like
the PDE approach, are available), we propose below, for the sake of completeness, a
Newton–Raphson method to compute optimal quantizers of scalar unimodal distri-
butions, i.e. absolutely continuous distributions whose density is log-concave. The
starting point is the following theorem due to Kieffer [168], which gives the unique-
ness of the optimal quantizer in that setting.

Theorem 5.3 (1D-distributions, see [168]) If d = 1 and P_X(dξ) = φ(ξ) dξ with
log φ concave, then, for every level N ≥ 1, there is exactly one stationary N-quantizer
(up to the reordering of its components, i.e. as a grid). This unique stationary quan-
tizer is a global (and local) minimum of the distortion function, i.e.

    ∀ N ≥ 1,  argmin_{ℝ^N} Q_{2,N} = {x^{(N)}}.

Definition 5.5 Absolutely continuous distributions on the real line with a log-
concave density are called unimodal distributions. The support (in R) of such a dis-
tribution and of its density is a (closed) interval [a, b], where −∞ ≤ a ≤ b ≤ +∞.

Examples of unimodal distributions: The (non-degenerate) Gaussian distributions,


the gamma distributions γ(α, β) with α ≥ 1, β > 0 (including the exponential distri-
butions), the B(α, β)-distributions with α, β ≥ 1 (including the uniform distribution
over the unit interval), etc.
In this one-dimensional setting, a deterministic optimization approach, based on
the Newton–Raphson zero search algorithm or on the Lloyd fixed point algorithm,
can be developed. Both algorithms are detailed below (the procedures will be written
formally for absolutely continuous distributions on the real line, not only unimodal
ones).

 Specification of the Voronoi cells. Let x = (x_1, …, x_N) ∈ S^{a,b}_N := { ξ ∈ (a, b)^N :
ξ_1 < ξ_2 < ⋯ < ξ_N }, where [a, b] (with −∞ ≤ a ≤ b ≤ +∞) denotes the convex hull
of the support of P_X. Then we set

    C_i(x) = [x_{i−1/2}, x_{i+1/2}),  i = 1, …, N − 1,   C_N(x) = [x_{N−1/2}, b]

with

    x_{i+1/2} = (x_{i+1} + x_i)/2,  i = 1, …, N − 1,   x_{1/2} = a,   x_{N+1/2} = b

and the convention that, if a or b is infinite, they are "removed" from "their" Voronoi
cell. Also keep in mind that if the density φ is unimodal, then supp(P_X) = supp(φ) =
[a, b].
We will now compute the gradient and the Hessian of the quadratic distortion
function Q_{2,N} on S^{a,b}_N.
 Computation of the gradient ∇Q_{2,N}(x). Let x ∈ S^{a,b}_N. Decomposing the quadratic
distortion function Q_{2,N}(x) across the Voronoi cells C_i(x) leads to the following
expression for the quadratic distortion:

    Q_{2,N}(x) = Σ_{i=1}^N ∫_{x_{i−1/2}}^{x_{i+1/2}} (x_i − ξ)² φ(ξ) dξ.

If the density function is continuous, elementary differentiation with respect to every


variable xi then yields that Q2,N is continuously differentiable at x with
 !
∂Q2,N xi+ 1
∇Q2,N (x) := =2 (xi − ξ) ϕ(ξ) d ξ .
2

∂xi 1≤i≤N xi− 1


2 1≤i≤N

If x = x(N ) ∈ SNa,b is the optimal quantizer at level N , hence with pairwise dis-
tinct components, ∇Q2,N (x) = 0, i.e. x is a zero of the gradient of the distortion
function Q2,N .
u
If we introduce the cumulative distribution function (u) = ϕ(v)dv and the
−∞
u
cumulative partial first moment function Ψ (u) = v ϕ(v)dv, the zeros of ∇Q2,N
−∞
are solutions to the non-linear system of equations
 
xi (xi+ 21 ) − (xi− 21 ) = Ψ (xi+ 21 ) − Ψ (xi− 21 ), i = 1, . . . , N . (5.22)

Note that this formula also reads

$$x_i=\frac{\Psi(x_{i+\frac12})-\Psi(x_{i-\frac12})}{\Phi(x_{i+\frac12})-\Phi(x_{i-\frac12})},\quad i=1,\ldots,N,$$

which is actually a rewriting of the stationarity Eq. (5.6) since we know that $\Phi(x_{i+\frac12})\neq\Phi(x_{i-\frac12})$. Hence, Eq. (5.22) and its stationarity version are of course true for any absolutely continuous distribution with density $\varphi$ (keep in mind that any minimizer of the distortion function has pairwise distinct components). However, when $\varphi$ is not unimodal, it may happen that $\nabla Q_{2,N}$ has several zeros, i.e. several stationary quantizers, which do not all lie in $\mathrm{argmin}\,Q_{2,N}$.
These computations are special cases of a multi-dimensional result (see Sect. 6.3.5 of the next chapter, devoted to stochastic approximation and optimization) which holds for any distribution $\mathbb P_X$ – possibly having a singular component – at $N$-tuples with pairwise distinct components.

 Example. Thus, if $X\stackrel{d}{=}\mathcal N(0;1)$, the c.d.f. $\Phi=\Phi_0$ is (tabulated and) computable (see Sect. 12.1.2) with high accuracy at a low computational cost, whereas the cumulative partial first moment is simply given by

$$\Psi_0(x)=-\frac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi}},\quad x\in\mathbb R.$$

 Computation of the Hessian ∇ 2 Q2,N (x). If ϕ is moreover continuous (at least in


the neighborhood of each component xi of x), the Hessian ∇ 2 Q2,N (x) can in turn
be computed. Note that, if ϕ is unimodal, ϕ is continuous on (a, b) but possibly

discontinuous at the endpoints $a$, $b$ (consider the uniform distribution). The Hessian reads at such $x=(x_1,\ldots,x_N)$,

$$\nabla^2Q_{2,N}(x)=\Big[\frac{\partial^2Q_{2,N}}{\partial x_i\partial x_j}(x)\Big]_{1\le i,j\le N},$$

where, for every $i,j\in\{1,\ldots,N\}$,

$$\frac{\partial^2Q_{2,N}}{\partial x_i^2}(x)=2\big(\Phi(x_{i+\frac12})-\Phi(x_{i-\frac12})\big)-\frac{x_{i+1}-x_i}{2}\,\varphi\big(x_{i+\frac12}\big)-\frac{x_i-x_{i-1}}{2}\,\varphi\big(x_{i-\frac12}\big),$$

$$\frac{\partial^2Q_{2,N}}{\partial x_i\partial x_{i+1}}(x)=-\frac{x_{i+1}-x_i}{2}\,\varphi\big(x_{i+\frac12}\big),\qquad \frac{\partial^2Q_{2,N}}{\partial x_i\partial x_{i-1}}(x)=-\frac{x_i-x_{i-1}}{2}\,\varphi\big(x_{i-\frac12}\big),$$

$$\frac{\partial^2Q_{2,N}}{\partial x_i\partial x_j}(x)=0\quad\text{otherwise},$$

when these partial derivatives make sense, with the convention (only valid in the above formulas) $\varphi\big(x_{\frac12}\big)=\varphi\big(x_{N+\frac12}\big)=0$.

The Newton–Raphson zero search procedure


The Newton–Raphson algorithm can be viewed as an accelerated gradient descent. For many distributions, such as strictly log-concave distributions, it allows an almost instant search for the unique optimal $N$-quantizer with the requested accuracy. Assume $\mathrm{supp}(\mathbb P_X)=[a,b]\cap\mathbb R$ (or at least that the convex hull of $\mathrm{supp}(\mathbb P_X)$ is equal to $[a,b]\cap\mathbb R$).
Let $x^{[0]}\in S_N^{a,b}$. The zero search Newton–Raphson procedure starting at $x^{[0]}$ is then defined as follows:

$$x^{[n+1]}=x^{[n]}-\big(\nabla^2Q_{2,N}(x^{[n]})\big)^{-1}\big(\nabla Q_{2,N}(x^{[n]})\big),\quad n\in\mathbb N. \tag{5.23}$$

 Example of the normal distribution. For the normal distribution $\mathcal N(0;1)$, the three functions $\varphi_0$, $\Phi_0$ and $\Psi_0$ are explicit and can be computed at a low computational cost. Thus, for $N=1,\ldots,1\,000$, tabulations with a $10^{-14}$ accuracy of the optimal $N$-quantizers

$$x^{(N)}=\big(x_1^{(N)},\ldots,x_N^{(N)}\big)$$

have been computed and can be downloaded at the website www.quantize.maths-fi.com (package due to S. Corlay). Their companion weight parameters are computed as well, with the same accuracy, namely the weights

$$\mathbb P\big(X\in C_i(x^{(N)})\big)=\Phi_0\big(x_{i+\frac12}^{(N)}\big)-\Phi_0\big(x_{i-\frac12}^{(N)}\big),\quad i=1,\ldots,N,$$

and the resulting (squared) quadratic quantization error (through its square, the quadratic distortion) given by

$$\big\|X-\widehat X^{x^{(N)}}\big\|_2^2=\mathbb E\,X^2-\mathbb E\big(\widehat X^{x^{(N)}}\big)^2=\mathbb E\,X^2-\sum_{i=1}^N\big(x_i^{(N)}\big)^2\Big(\Phi_0\big(x_{i+\frac12}^{(N)}\big)-\Phi_0\big(x_{i-\frac12}^{(N)}\big)\Big)$$

using (5.15). The vector of local inertia $\Big(\int_{x_{i-\frac12}^{(N)}}^{x_{i+\frac12}^{(N)}}\big(\xi-x_i^{(N)}\big)^2\varphi(\xi)\,d\xi\Big)_{i=1,\ldots,N}$ is also made available.
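As an illustration, here is a minimal Python sketch of the Newton–Raphson iteration (5.23) for $\mathcal N(0;1)$, assembling the closed-form gradient and tridiagonal Hessian computed above. The quantile-based starting tuple and the crude step-damping safeguard are ad hoc choices; this is not the code behind the tabulations mentioned above.

```python
import numpy as np
from math import erf, exp, sqrt, pi
from statistics import NormalDist

SQRT2PI = sqrt(2.0 * pi)

def Phi0(u):   # c.d.f. of N(0;1)
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def Psi0(u):   # cumulative partial first moment of N(0;1): Psi0 = -phi0
    return -exp(-0.5 * u * u) / SQRT2PI

def phi0(u):   # density of N(0;1)
    return exp(-0.5 * u * u) / SQRT2PI

def newton_quantizer(N, n_iter=60):
    """Damped Newton-Raphson zero search (5.23) for the optimal N-quantizer of N(0;1)."""
    # start from the (2i-1)/(2N)-quantiles: an ad hoc but reasonable increasing N-tuple
    x = np.array([NormalDist().inv_cdf((2 * i - 1) / (2 * N)) for i in range(1, N + 1)])
    for _ in range(n_iter):
        m = 0.5 * (x[1:] + x[:-1])                                # mid-points x_{i+1/2}
        F = np.concatenate(([0.0], [Phi0(v) for v in m], [1.0]))  # Phi at cell boundaries
        P = np.concatenate(([0.0], [Psi0(v) for v in m], [0.0]))  # Psi vanishes at -oo, +oo
        grad = 2.0 * (x * (F[1:] - F[:-1]) - (P[1:] - P[:-1]))    # gradient of Q_{2,N}
        off = -0.5 * (x[1:] - x[:-1]) * np.array([phi0(v) for v in m])
        diag = 2.0 * (F[1:] - F[:-1])
        diag[:-1] += off                          # -(x_{i+1}-x_i)/2 * phi(x_{i+1/2})
        diag[1:] += off                           # -(x_i-x_{i-1})/2 * phi(x_{i-1/2})
        H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)    # tridiagonal Hessian
        step, t = np.linalg.solve(H, grad), 1.0
        while np.any(np.diff(x - t * step) <= 0.0):               # damping: keep x increasing
            t *= 0.5
        x = x - t * step
    return x
```

For $N=2$ the iteration reaches the optimal pair $\pm\sqrt{2/\pi}$ in one step from any symmetric start; for moderate $N$ it converges to machine precision in a handful of iterations.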

The Lloyd method (or Lloyd’s method I)


The Lloyd method (also known as Lloyd’s method I) was first devised in 1957 by
S.P. Lloyd but was only published in 1982 in [201]. To our knowledge, it was the
first method devoted to the numerical computation of quantizers. It was rediscovered
independently by Max in the early 1960s. In another seminal paper [168], Kieffer
first establishes (in one dimension) the uniqueness of the stationary – hence optimal – $N$-quantizer when $X$ (is square integrable and) has a log-concave (unimodal) density
ϕ. Then, he shows the convergence at an exponential rate of the Lloyd method when,
furthermore, log ϕ is not piecewise affine.
Let us be more precise. Lloyd’s method is essentially a fixed point algorithm
based on the fact that the unique stationary quantizer x = x(N ) ∈ SNa,b satisfies the
stationarity Eq. (5.7) (or (5.22)), which may be re-written as

$$x_i=\Lambda_i(x):=\frac{\Psi(x_{i+\frac12})-\Psi(x_{i-\frac12})}{\Phi(x_{i+\frac12})-\Phi(x_{i-\frac12})},\quad i=1,\ldots,N. \tag{5.24}$$

If the density $\varphi$ is log-concave and not piecewise affine, the function $\Lambda=(\Lambda_i)_{i=1,\ldots,N}$ defined by the right-hand side of (5.24) from $S_N^{a,b}$ onto $S_N^{a,b}$ – called Lloyd's map – is contracting (see [168]) and the Lloyd method is defined as the iterative fixed point procedure based on $\Lambda$, namely

$$x^{[n+1]}=\Lambda\big(x^{[n]}\big),\quad n\ge0,\qquad x^{[0]}\in S_N^{a,b}. \tag{5.25}$$

Hence, (x[n] )n≥0 converges exponentially fast toward x(N ) . When Λ is not contracting,
no general convergence results are known, even in case of uniqueness of the optimal
N -quantizer.
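When $\Phi$ and $\Psi$ are in closed form, the iteration (5.25) is a few lines of code. Below is an illustrative Python sketch for the $\mathcal N(0;1)$ distribution (the starting tuple and the iteration budget are ad hoc choices):

```python
import numpy as np
from math import erf, exp, sqrt, pi

def Phi0(u):   # c.d.f. of N(0;1)
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def Psi0(u):   # cumulative partial first moment of N(0;1)
    return -exp(-0.5 * u * u) / sqrt(2.0 * pi)

def lloyd_map(x):
    """One application of Lloyd's map (5.24) for the N(0;1) distribution."""
    m = 0.5 * (x[1:] + x[:-1])                                # mid-points x_{i+1/2}
    F = np.concatenate(([0.0], [Phi0(v) for v in m], [1.0]))  # Phi on cell boundaries
    P = np.concatenate(([0.0], [Psi0(v) for v in m], [0.0]))  # Psi on cell boundaries
    return (P[1:] - P[:-1]) / (F[1:] - F[:-1])                # cell-wise conditional means

def lloyd(x0, n_iter=500):
    """Fixed point Lloyd iteration (5.25): x^{[n+1]} = Lambda(x^{[n]})."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = lloyd_map(x)
    return x
```

For instance, `lloyd(np.linspace(-2.0, 2.0, 3))` converges to the (symmetric) optimal 3-quantizer of $\mathcal N(0;1)$.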
In terms of implementation, the Lloyd method only involves the c.d.f. $\Phi$ and the cumulative partial first moment function $\Psi$ of the random variable $X$, for which closed forms are required. The multi-dimensional version of this algorithm – which does not require such closed forms – is presented in Sect. 6.3.5 of Chap. 6.
When both functions $\Phi$ and $\Psi$ admit closed forms (as for the normal distribution, for example), the procedure can be implemented. It is usually slower than the

above Newton–Raphson procedure (which is in fact an accelerated version of the deterministic “mean zero search procedure” associated with the Competitive Learning Vector Quantization algorithm presented in Sect. 6.3.5 of the next chapter). However, when the quantization level $N$ increases, the Lloyd method turns out to be more stable than the Newton–Raphson algorithm.
Like all fixed point procedures relying on a contracting function, the Lloyd method can be sped up by storing and exploiting the past iterates of the procedure through an appropriate regression method, known as Anderson's acceleration. For more details, we refer to [274].

Remark. In practice, one can successfully implement these two deterministic recursive procedures even if the distribution of $X$ is not unimodal (see the exercises below). In particular, the procedures converge when uniqueness of the stationary quantizer holds, which is true beyond the class of unimodal distributions. Examples of uniqueness of optimal $N$-quantizers for non-unimodal (one-dimensional) distributions can be found in [96] (see Theorem 4): thus, uniqueness holds for the normalized Pareto distributions $\alpha x^{-(\alpha+1)}\mathbf 1_{[1,+\infty)}(x)$, $\alpha>0$, or the power distributions $\alpha x^{\alpha-1}\mathbf 1_{(0,1]}(x)$, $\alpha\in(0,1)$, none of them being unimodal (in fact, semi-closed forms are established for such distributions, from which uniqueness is derived).

ℵ Practitioner’s corner: Splitting Initialization method for scalar distributions.


In view of applications, one is usually interested in building a database of optimal
quantizers of a random variable X for all levels between N = 1 and a level Nmax . In
such a situation, one may take advantage of the optimal quantizer at level N − 1 to
initialize any of the above two procedures at level N . The idea is to mimic the proof
of the existence of an optimal quantizer (see Theorem 5.1).
 
General case. Let $N\ge2$ and suppose the optimal $(N-1)$-quantizer $x^{(N-1)}=\big(x_1^{(N-1)},\ldots,x_{N-1}^{(N-1)}\big)$ and $\bar x=\mathbb E\,X$ are at hand. One natural and usually efficient way to initialize the procedure to compute an optimal $N$-quantizer is to set

$$x^{[0]}=\big(x_1^{(N-1)},\ldots,x_{i_0}^{(N-1)},\,\bar x,\,x_{i_0+1}^{(N-1)},\ldots,x_{N-1}^{(N-1)}\big),$$

where $x_{i_0}^{(N-1)}<\bar x<x_{i_0+1}^{(N-1)}$.
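As a hedged sketch of this splitting strategy for $\mathcal N(0;1)$ (where $\bar x=\mathbb E\,X=0$), reusing the Lloyd map in its closed form; the collision tolerance `eps`, the nudge and the iteration budget below are ad hoc safeguards for the case where the previous grid already contains a point at the mean:

```python
import numpy as np
from math import erf, exp, sqrt, pi

def Phi0(u): return 0.5 * (1.0 + erf(u / sqrt(2.0)))   # c.d.f. of N(0;1)
def Psi0(u): return -exp(-0.5 * u * u) / sqrt(2.0 * pi)

def lloyd(x, n_iter=1000):
    """Lloyd fixed point iteration (5.25) for the N(0;1) distribution."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        m = 0.5 * (x[1:] + x[:-1])
        F = np.concatenate(([0.0], [Phi0(v) for v in m], [1.0]))
        P = np.concatenate(([0.0], [Psi0(v) for v in m], [0.0]))
        x = (P[1:] - P[:-1]) / (F[1:] - F[:-1])
    return x

def split_init(prev, xbar=0.0, eps=1e-3):
    """Insert xbar = E X into the sorted (N-1)-quantizer; nudge it upon collision."""
    if np.min(np.abs(prev - xbar)) < eps:
        xbar += eps            # previous (odd) level already has a point at the mean
    return np.insert(prev, np.searchsorted(prev, xbar), xbar)

def build_database(Nmax):
    """Grow a quantizer database for N = 1,...,Nmax by splitting initialization."""
    db = {1: np.array([0.0])}  # the optimal 1-quantizer of N(0;1) is its mean
    for N in range(2, Nmax + 1):
        db[N] = lloyd(split_init(db[N - 1]))
    return db
```

Each level is solved starting from the previous one, which in practice keeps every run in the basin of the optimal quantizer.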

Symmetric random variables. When $X$ has a symmetric distribution (with a density, or at least assigning no mass to 0), the optimal quantizers of $S_N^{a,b}$ are themselves symmetric, at least under a uniqueness assumption like unimodality. Thus quantizers are of the form $\big(-x^*_{1:(N-1)/2},\,0,\,x^*_{1:(N-1)/2}\big)$ if $N$ is odd and $\big(-x^*_{1:N/2},\,x^*_{1:N/2}\big)$ if $N$ is even.
Then, one checks that, if $N$ is even, $x^*_{1:N/2}$ is obtained as the optimal $N/2$-quantizer of $|X|$ and, if $N$ is odd, $x^*_{1:(N-1)/2}$ is obtained as the optimal $(N-1)/2$-quantizer with a constraint at 0, solution to the minimization problem of the constrained distortion function (5.5) associated with $|X|$ instead of $X$.

 Exercises. 1. Quantization of the non-central $\chi^2(1)$ distribution. (a) Let $X=Z^2$, $Z\stackrel{d}{=}\mathcal N(0;1)$, so that $X\stackrel{d}{=}\chi^2(1)$. Let $\Phi_X$ denote the c.d.f. of $X$ and let $\Psi_X(x)=\mathbb E\,X\mathbf 1_{\{X\le x\}}$, $x\in\mathbb R_+$, denote its cumulated first moment. Show that, for every $x\in\mathbb R_+$,

$$\Phi_X(x)=2\,\Phi_0(\sqrt x)-1\quad\text{and}\quad \Psi_X(x)=\Phi_X(x)-\sqrt{\frac{2x}{\pi}}\,e^{-\frac x2},$$

where as usual $\Phi_0$ denotes the c.d.f. of the standard normal distribution $\mathcal N(0;1)$. Show that its density $\varphi_X$ is log-convex and given by

$$\varphi_X(x)=\frac{1}{\sqrt x}\,\varphi_0(\sqrt x)=\frac{e^{-\frac x2}}{\sqrt{2\pi x}},\quad x\in(0,+\infty),$$

so that we have no a priori guarantee that optimal $N$-quantizers are unique.


(b) Write and implement the Newton–Raphson zero search algorithm and the fixed point Lloyd method for the standard $\chi^2(1)$-distribution at a level $N\ge1$. [Hint: for a practical implementation, choose the starting value $\big(x_1^{(0)},\ldots,x_N^{(0)}\big)$ carefully, keeping in mind that the density of $X$ goes to $+\infty$ at 0.]
(c) Establish $\Phi_0$-based closed formulas for $\Phi_X$ and $\Psi_X$ when $X=(Z+m)^2$, $m\in\mathbb R$ (non-central $\chi^2(1)$-distribution).
(d ) Derive and implement both the Newton–Raphson zero search algorithm and the
Lloyd method to compute optimal N -quantizers for non-central χ2 (1)-distributions.
Compare.
This exercise is important when trying to quantize the Milstein scheme of a one-dimensional diffusion (see Sect. 7.5.1 further on, in particular, the exercise after Theorem 7.5).
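The closed forms of (a) are immediate to code and to sanity-check ($\Phi_X(1)$ must equal $\mathbb P(|Z|\le1)\approx0.6827$, and $\Psi_X(x)\to\mathbb E\,X=1$ as $x\to+\infty$); a minimal sketch:

```python
from math import erf, exp, sqrt, pi

def Phi0(u):
    """c.d.f. of N(0;1)."""
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def PhiX(x):
    """c.d.f. of the chi^2(1) distribution, X = Z^2."""
    return 2.0 * Phi0(sqrt(x)) - 1.0

def PsiX(x):
    """Cumulated first moment E[X 1_{X<=x}] of the chi^2(1) distribution."""
    return PhiX(x) - sqrt(2.0 * x / pi) * exp(-0.5 * x)
```

These two functions are exactly what the Newton–Raphson and Lloyd routines of (b) need as inputs.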
2. Quantization of the log-normal distribution. (a) Let $X=e^{\sigma Z+m}$, $\sigma>0$, $m\in\mathbb R$, $Z\stackrel{d}{=}\mathcal N(0;1)$. Let $\Phi_X$ and $\Psi_X$ denote the c.d.f. of $X$ and its cumulated first moment $\Psi_X(x)=\mathbb E\,X\mathbf 1_{\{X\le x\}}$, respectively. Show that, for every $x\in(0,+\infty)$,

$$\Phi_X(x)=\Phi_0\Big(\frac{\log x-m}{\sigma}\Big)\quad\text{and}\quad \Psi_X(x)=e^{m+\frac{\sigma^2}{2}}\,\Phi_0\Big(\frac{\log x-m-\sigma^2}{\sigma}\Big),$$

where $\Phi_0$ still denotes the c.d.f. of the standard normal distribution $\mathcal N(0;1)$.
(b) Derive and implement both the Newton–Raphson zero search algorithm and
Lloyd’s method I for log-normal distributions. Compare.
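One possible implementation of Lloyd's method I for the log-normal distribution plugs the closed forms of (a) into the Lloyd map (5.24); in the sketch below, the parameters, the starting grid and the iteration budget are arbitrary illustrative choices. Note that any stationary quantizer preserves the mean $\mathbb E\,X=e^{m+\sigma^2/2}$, which gives a cheap sanity check.

```python
import numpy as np
from math import erf, log, exp, sqrt

def Phi0(u):   # c.d.f. of N(0;1)
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def lloyd_lognormal(N, m=0.0, sigma=0.25, n_iter=1000):
    """Lloyd's method I for X = exp(sigma*Z + m), Z ~ N(0;1), via the closed forms of (a)."""
    EX = exp(m + 0.5 * sigma * sigma)                  # E X = Psi_X(+oo)
    PhiX = lambda y: Phi0((log(y) - m) / sigma)
    PsiX = lambda y: EX * Phi0((log(y) - m - sigma * sigma) / sigma)
    x = EX * np.linspace(0.5, 1.5, N)                  # crude increasing start around E X
    for _ in range(n_iter):
        mid = 0.5 * (x[1:] + x[:-1])
        F = np.concatenate(([0.0], [PhiX(v) for v in mid], [1.0]))
        P = np.concatenate(([0.0], [PsiX(v) for v in mid], [EX]))
        x = (P[1:] - P[:-1]) / (F[1:] - F[:-1])        # Lloyd map (5.24)
    mid = 0.5 * (x[1:] + x[:-1])
    weights = np.diff(np.concatenate(([0.0], [PhiX(v) for v in mid], [1.0])))
    return x, weights
```

The same skeleton, with $\Phi_X,\Psi_X$ swapped, handles the $\chi^2(1)$ distribution of Exercise 1.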
3. Quantization by Fourier. Let $X$ be a symmetric random variable on $(\Omega,\mathcal A,\mathbb P)$ with a characteristic function $\chi(u)=\mathbb E\,e^{\tilde\imath uX}$, $u\in\mathbb R$, $\tilde\imath^2=-1$. We assume that $\chi\in L^1(\mathbb R,du)$.
(a) Show that $\mathbb P_X$ is absolutely continuous with an even continuous density $\varphi$ defined by $\varphi(0)=\frac1\pi\int_0^{+\infty}\chi(u)\,du$ and, for every $x\in(0,+\infty)$,

$$\varphi(x)=\frac{1}{2\pi}\int_{\mathbb R}e^{\tilde\imath xu}\chi(u)\,du=\frac{1}{\pi x}\int_0^{+\infty}\cos(u)\,\chi\Big(\frac ux\Big)du=\frac{1}{\pi x}\sum_{k\ge0}(-1)^k\int_0^{\pi}\cos(u)\,\chi\Big(\frac{u+k\pi}{x}\Big)du.$$

Furthermore, show that χ is real-valued, even and always (strictly) positive. [Hint:
See an elementary course on Fourier transform and/or characteristic functions, e.g.
[44, 52, 155, 263] among (many) others.]
(b) Show that the c.d.f. $\Phi_X$ of $X$ is given by $\Phi_X(0)=\frac12$ and, for every $x\in(0,+\infty)$,

$$\Phi_X(x)=\frac12+\frac1\pi\int_0^{+\infty}\frac{\sin u}{u}\,\chi\Big(\frac ux\Big)du,\qquad \Phi_X(-x)=1-\Phi_X(x). \tag{5.26}$$

(c) Assume furthermore that $X\in L^1(\mathbb P)$. Prove that its cumulated partial first moment function $\Psi_X$ is negative, even and given by $\Psi_X(0)=-C$ and, for every $x\in(0,+\infty)$,

$$\Psi_X(x)=-C+x\Big(\Phi_X(x)-\frac12\Big)-\frac x\pi\int_0^{+\infty}\frac{1-\cos u}{u^2}\,\chi\Big(\frac ux\Big)du,\qquad \Psi_X(-x)=\Psi_X(x), \tag{5.27}$$

where $C=\mathbb E\,X_+$. [Hint: Use in both cases Fubini's theorem and integration(s) by parts.]
(d) Show that $\Phi_X(x)$ can be written on $(0,+\infty)$ as an alternating series reading

$$\Phi_X(x)=\frac12+\frac1\pi\sum_{k\ge0}(-1)^k\int_0^\pi\frac{\sin u}{u+k\pi}\,\chi\Big(\frac{u+k\pi}{x}\Big)du,\quad x\in(0,+\infty).$$

Show likewise that $\Psi_X(x)$ also reads, for every $x\in(0,+\infty)$,

$$\Psi_X(x)=-C+x\Big(\Phi_X(x)-\frac12\Big)-\frac x\pi\sum_{k\ge0}\int_0^\pi\frac{1-(-1)^k\cos u}{(u+k\pi)^2}\,\chi\Big(\frac{u+k\pi}{x}\Big)du.$$

(e) Propose two methods to compute a (small) database of optimal quantizers $x^{(N)}=\big(x_1^{(N)},\ldots,x_N^{(N)}\big)$ and their weight vectors $\big(p_1^{(N)},\ldots,p_N^{(N)}\big)$ for a symmetric random variable $X$ satisfying the above conditions, say for levels running from $N=1$ up to $N_{\max}=50$. Compare their respective efficiency. [Hint: Prior to the implementation, have a look at the (recursive) splitting initialization method for scalar distributions described in the above Practitioner's corner; see also Sect. 6.3.5.]
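As a sketch of the kind of routine (e) calls for, the integral (5.26) can be truncated and discretized by a plain midpoint rule (the truncation bound `u_max` and the grid size `n` below are ad hoc and must be adapted to the decay of $\chi$). Taking $\chi(u)=e^{-u^2/2}$, i.e. the $\mathcal N(0;1)$ characteristic function, one must recover $\Phi_0$, which provides a test; the same routine applies verbatim to the $\chi$ of Exercise 4 below.

```python
import numpy as np
from math import erf, sqrt, pi

def cdf_from_cf(x, chi, u_max=40.0, n=200_000):
    """Phi_X(x), x > 0, via (5.26) with a midpoint rule on (0, u_max]."""
    h = u_max / n
    u = (np.arange(n) + 0.5) * h      # midpoints: avoids the removable point u = 0
    return 0.5 + (h / pi) * float(np.sum(np.sin(u) / u * chi(u / x)))
```

Once $\Phi_X$ (and, similarly, $\Psi_X$ from (5.27)) is available this way, the Lloyd or Newton–Raphson machinery of Sect. 5.3.1 applies unchanged.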
4. Quantized one-sided Lévy area. Let $W=(W^1,W^2)$ be a 2-dimensional standard Wiener process and let

$$X=\int_0^1W_s^1\,dW_s^2$$

denote the Lévy area associated with $(W^1,W^2)$ at time 1. We admit that the characteristic function of $X$ reads

$$\chi(u)=\mathbb E\,e^{\tilde\imath uX}=\frac{1}{\sqrt{\cosh u}},\quad u\in\mathbb R,$$

(see Formula (9.105) in Chap. 9 further on, applied here with μ = 0, and Sect. 12.11
of the Miscellany Chapter for a proof).
(a) Show that

$$C:=\mathbb E\,X_+=\frac{1}{\sqrt{2\pi}}\,\mathbb E\,\|B\|_{L^2([0,1],dt)},$$

where $B=(B_t)_{t\in[0,1]}$ denotes a standard Brownian motion.


(b) Establish the elementary identity

$$\mathbb E\,\|B\|_{L^2([0,1],dt)}=\frac14+\mathbb E\Big(\|B\|_{L^2([0,1],dt)}-\frac12\,\|B\|^2_{L^2([0,1],dt)}\Big)$$

and justify why $\mathbb E\,\|B\|_{L^2([0,1],dt)}$ should be computed by a Monte Carlo simulation using this identity. [Hint: An appropriate Monte Carlo simulation should yield a result close to 0.2485, but this approximation is not accurate enough to compute optimal quantizers⁴.]
(c) Describe in detail a method (or possibly two methods) for computing a small database of $N$-quantizers of the Lévy area for levels running from $N=1$ to $N_{\max}=50$, including, for every level $N\ge1$, both their weights and their induced quadratic mean quantization error. [Hint: Use (5.15) to compute $\|X-\widehat X\|_2$ when $\widehat X$ is a stationary quantizer.]
5. Clark–Cameron oscillator. Using the identity (9.105) in its full generality, extend the quantization procedure of Exercise 4 to the case where

$$X=\int_0^1\big(W_s^1+\mu s\big)\,dW_s^2,$$

with $\mu$ a fixed real constant.


6. Supremum of the Brownian bridge. Let

$$X=\sup_{t\in[0,1]}\big|W_t-t\,W_1\big|$$

denote the supremum of the standard Brownian bridge (see Chap. 8 for more details,
see also Sect. 4.3 for the connection with uniformly distributed sequences and

⁴ A more precise approximation is C = 0.24852267852801818 ± 2.033·10⁻⁷, obtained by implementing an ML2R estimator with a target RMSE ε = 3.0·10⁻⁷, see Chap. 9.

discrepancy). This distribution, also known as the Kolmogorov–Smirnov (K–S) distribution since it is the limiting distribution emerging from the non-parametric eponymous goodness-of-fit statistical test, is characterized by its survival function given (see (4.8)) by

$$\bar\Phi_X(x)=2\sum_{k\ge1}(-1)^{k-1}e^{-2k^2x^2}. \tag{5.28}$$

(a) Show that the cumulative partial first moment function $\Psi_X$ of $X$ is given, for every $x\ge0$, by

$$\Psi_X(x)=\mathbb E\,X\mathbf 1_{\{X\le x\}}=\int_0^x\bar\Phi_X(u)\,du-x\,\bar\Phi_X(x)$$

and deduce that

$$\forall\,x\ge0,\qquad \Psi_X(x)=\sqrt{2\pi}\,\sum_{k\ge1}\frac{(-1)^{k-1}}{k}\Big(\Phi_0(2kx)-\frac12\Big)-x\,\bar\Phi_X(x),$$

where $\Phi_0$ denotes the c.d.f. of the normal distribution $\mathcal N(0;1)$.

(b) Compute a small database of N -quantizers, say N = 1 up to 50, of the K–S
distribution (including the weights and the quantization error at each level), based
on the fixed point Lloyd method. [Hint: Recall the splitting method.]
(c) Show that the K–S distribution is absolutely continuous with a density $\varphi_X$ given by

$$\varphi_X(x)=8x\sum_{k\ge1}(-1)^{k-1}k^2e^{-2k^2x^2},\quad x\ge0,$$

and implement a second method to compute the same small database. Compare their
respective efficiencies.
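A hedged numerical sketch for this exercise: the survival series (5.28) is evaluated by truncation (for very small x the truncated series is unreliable, but there $\bar\Phi_X$ equals 1 up to a negligible error, whence the clamp below), and $\Psi_X$ is obtained from the integral representation of (a) by a midpoint rule; truncation orders and grid sizes are ad hoc choices.

```python
import numpy as np
from math import sqrt, log, pi

def ks_survival(x, kmax=400):
    """Survival function (5.28) of the K-S distribution, truncated at kmax terms."""
    if x < 0.05:
        return 1.0                 # survival equals 1 up to a negligible error here
    k = np.arange(1, kmax + 1)
    return 2.0 * float(np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k * k * x * x)))

def ks_psi(x, n=20_000):
    """Psi_X(x) = int_0^x survival(u) du - x * survival(x), cf. (a), midpoint rule."""
    u = (np.arange(n) + 0.5) * (x / n)
    return (x / n) * sum(ks_survival(v) for v in u) - x * ks_survival(x)
```

With $\bar\Phi_X$ and $\Psi_X$ at hand, the Lloyd map of (b) is obtained exactly as in the previous exercises (note $\Phi_X=1-\bar\Phi_X$).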

5.3.2 The Case of the Normal Distribution N(0; I_d) on R^d, d ≥ 2

As soon as d ≥ 2, most procedures to optimize the quantization error are stochastic


and based on some nearest neighbor search. Let us cite:
– the randomized Lloyd’s method I or randomized Lloyd’s algorithm
and
– the Competitive Learning Vector Quantization algorithm (CLVQ).

The first is a randomized $d$-dimensional version of the fixed point procedure described above in the one-dimensional setting. The second is a recursive stochastic gradient approximation procedure.
These two stochastic optimization procedures are presented in more detail in
Sect. 6.3.5 of Chap. 6 devoted to Stochastic Approximation.
For $\mathcal N(0;I_d)$, a large scale optimization has been carried out (with the support of ACI Fin'Quant) based on a mixed CLVQ–Lloyd procedure. To be precise, grids have been computed for $d=1$ up to 10 and $N=1$ up to 5 000. Their
companion parameters have also been computed (still by simulation): weight, L1
quantization error, (squared) L2 -distortion, local L1 and L2 -pseudo-inertia of each
Voronoi cell. These enriched grids are available for downloading on the website www.
quantize.maths-fi.com which also contains many papers dealing with quantization
optimization.
Recent implementations of exact or approximate fast nearest neighbor search procedures have dramatically reduced the computation time in higher dimensions, not to mention implementations on GPUs. For further details on the theoretical aspects
we refer to [224] for CLVQ (mainly devoted to compactly supported distributions).
As for Lloyd’s method I, a huge literature is available (often under the name of
k-means algorithm). Beyond the seminal – but purely one-dimensional – paper by
J.C. Kieffer [168], let us cite [79, 80, 85, 238]. For more numerical experiments
with the Gaussian distributions, we refer to [231] and for more general aspects in
connection with classification and various other applications, we refer to [106].
For illustrations depicting optimal quadratic N -quantizers of the bi-variate nor-
mal distribution N (0; I2 ), we refer to Fig. 5.1 (for N = 200), Fig. 6.3 (for N = 500,
with its Voronoi diagram) and Fig. 5.2 (for the same quantizer colored with a coded
representation of the weight of the cells).

5.3.3 Other Multivariate Distributions

Algorithms such as CLVQ and the randomized Lloyd’s procedure developed to quan-
tize the multivariate normal N (0; Id ) distributions in an efficient and systematic way
can be successfully implemented for other multivariate distributions. Sect. 6.3.5 in
the next chapter is entirely devoted to these stochastic optimization procedures, with
some emphasis on their practical implementation as well as techniques to speed
them up.
However, to anticipate their efficiency, we propose in Fig. 5.4 a quantization of
size N = 500 of the joint law (W1 , supt∈[0,1] Wt ) of a standard Brownian motion W .

 
Fig. 5.4 Optimal N-quantization (N = 500) of $\big(W_1,\sup_{t\in[0,1]}W_t\big)$ depicted with its Voronoi tessellation, W a standard Brownian motion (with B. Wilbertz)

5.4 Numerical Integration (II): Quantization-Based Richardson–Romberg Extrapolation

The challenge is to fight against the curse of dimensionality to increase the critical
dimension beyond which the theoretical rate of convergence of the Monte Carlo
method outperforms that of optimal quantization. Combining the above cubature
formula (5.8), (5.13) and the rate of convergence of the (optimal) quantization error,
it seems natural to deduce that the critical dimension to use quantization-based cuba-
ture formulas is d = 4 (when dealing with continuously differentiable functions),
at least when compared to Monte Carlo simulation. Several tests have been carried
out and reported in [229, 231] to refine this a priori theoretical bound. The bench-
mark was made of several options on a geometric index on d independent assets
in a Black–Scholes model: Puts, Put spreads and the same in a smoothed version, always without any control variate. Of course, assuming uncorrelated assets is not realistic, but it is clearly more challenging as far as numerical integration is concerned.
we compared the resulting integration error with a one standard deviation confidence
interval of the corresponding Monte Carlo estimator for the same number of inte-
gration points N . The last standard deviation is computed thanks to a Monte Carlo
simulation carried out using 104 trials.
The results turned out to be more favorable to quantization than predicted by
theoretical bounds, mainly because we carried out our tests with rather small values
of N , whereas the curse of dimensionality is an asymptotic bound. Up to dimension 4,

the larger N is, the more quantization outperforms Monte Carlo simulation. When
the dimension d ≥ 5, quantization always outperforms Monte Carlo (in the above
sense) until a critical size Nc (d ) which decreases as d increases.
In this section, we provide a method to push these critical sizes forward, at least
for sufficiently smooth functions. Let $F:\mathbb R^d\to\mathbb R$ be a twice differentiable function with Lipschitz continuous Hessian $D^2F$. Let $(\widehat X^{(N)})_{N\ge1}$ be a sequence of optimal quadratic quantizations of $X$. Then

 
$$\mathbb E\,F(X)=\mathbb E\,F\big(\widehat X^{(N)}\big)+\frac12\,\mathbb E\,D^2F\big(\widehat X^{(N)}\big)\cdot\big(X-\widehat X^{(N)}\big)^{\otimes2}+O\big(\mathbb E\,|X-\widehat X^{(N)}|^3\big). \tag{5.29}$$

Under some assumptions which are satisfied by most usual distributions (including the normal distribution), it is proved in [131], as a special case of a more general result, that

$$\mathbb E\,|X-\widehat X|^3=O\big(N^{-\frac3d}\big)$$

or at least (in particular, when $d=2$) $\mathbb E\,|X-\widehat X|^3=O\big(N^{-\frac{3-\varepsilon}{d}}\big)$, $\varepsilon>0$. If, furthermore, we conjecture the existence of a real constant $c_{F,X}\in\mathbb R$ such that

$$\mathbb E\,D^2F\big(\widehat X^{(N)}\big)\cdot\big(X-\widehat X^{(N)}\big)^{\otimes2}=\frac{c_{F,X}}{N^{\frac2d}}+o\big(N^{-\frac3d}\big), \tag{5.30}$$
one can use a Richardson–Romberg extrapolation to compute E F(X ).
Quantization-based Richardson–Romberg extrapolation. We consider two sizes
N1 and N2 (in practice one often sets N1 = N /2 and N2 = N with N even). Then
combining (5.29) with N1 and N2 , we cancel the first order error term and obtain
$$\mathbb E\,F(X)=\frac{N_2^{2/d}\,\mathbb E\,F\big(\widehat X^{(N_2)}\big)-N_1^{2/d}\,\mathbb E\,F\big(\widehat X^{(N_1)}\big)}{N_2^{2/d}-N_1^{2/d}}+O\bigg(\frac{1}{(N_1\wedge N_2)^{1/d}\,\big(N_2^{2/d}-N_1^{2/d}\big)}\bigg).$$
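The extrapolation itself is a one-line combination of two cubature values computed at levels $N_1<N_2$. A minimal sketch (the sanity check exploits the fact that the combination cancels an exact first-order term $c\,N^{-2/d}$, whatever $c$):

```python
def rr_extrapolate(I1, N1, I2, N2, d):
    """Richardson-Romberg combination of two quantization-based cubature values."""
    w1, w2 = N1 ** (2.0 / d), N2 ** (2.0 / d)
    return (w2 * I2 - w1 * I1) / (w2 - w1)
```

In practice `I1` and `I2` are the quantized cubature values $\sum_i p_i^{(N)}F(x_i^{(N)})$ at the two levels, e.g. with $N_1=N/2$ and $N_2=N$.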

Numerical illustration. In order to see the effect of the extrapolation technique described above, numerical computations have been carried out with regularized versions of some Put Spread options on geometric indices in dimension $d=4,6,8,10$. By "regularized", we mean that the payoff at maturity $T$ has been replaced by its price function at some time $T'<T$. Numerical integration was performed using the Gaussian optimal grids of size $N=2^k$, $k=2,\ldots,12$ (available at the website www.quantize.maths-fi.com).
We consider again one of the test functions implemented in [231] (p. 152). These test functions were borrowed from classical option pricing in mathematical finance, namely a Put Spread option (on a geometric index, which is less classical). Moreover, we will use a "regularized" version of the payoff. One considers $d$ independent traded assets $S^1,\ldots,S^d$ following a $d$-dimensional Black and Scholes dynamics (under its

risk neutral probability)

$$S_t^i=s_0^i\,\exp\Big(\big(r-\tfrac{\sigma_i^2}{2}\big)t+\sigma_i\sqrt t\,Z^{i,t}\Big),\quad i=1,\ldots,d,$$

where $Z^{i,t}=W_t^i/\sqrt t$ and $W=(W^1,\ldots,W^d)$ is a $d$-dimensional standard Brownian motion.


Independence is unrealistic but corresponds to the most unfavorable case for numer-
ical experiments. We also assume that s0i = s0 > 0, i = 1, . . . , d , and that the d
assets share the same volatility σi = σ > 0. One considers the geometric index
 1 σ2 1
It = St1 . . . Std d . One shows that e− 2 ( d −1)t It itself has a risk neutral Black–Scholes
dynamics. We want to test the regularized Put Spread option on this geometric index
with strikes K1 < K2 (at time T /2). Let ψ(s0 , K1 , K2 , r, σ, T ) denote the premium
at time 0 of a Put Spread on any of the assets S i . We have

ψ(x, K1 , K2 , r, σ, T ) = π(x, K2 , r, σ, T ) − π(x, K1 , r, σ, T ),


π(x, K, r, σ, T ) = Ke−rt 0 (−d2 ) − x 0 (−d1 ),
log(x/K) + (r + σ2
)T 
d1 = √ 2d
, d2 = d1 − σ T /d .
σ T /d

Using the martingale property of the discounted value of the premium of a European option yields that the premium $e^{-rT}\,\mathbb E\big[(K_1-I_T)_+-(K_2-I_T)_+\big]$ of the Put Spread option on the index $I$ satisfies, on the one hand,

$$e^{-rT}\,\mathbb E\big[(K_1-I_T)_+-(K_2-I_T)_+\big]=\psi\Big(s_0\,e^{\frac{\sigma^2}{2}(\frac1d-1)T},K_1,K_2,r,\sigma/\sqrt d,T\Big)$$

and, on the other hand,

$$e^{-rT}\,\mathbb E\big[(K_1-I_T)_+-(K_2-I_T)_+\big]=\mathbb E\,g(Z),$$

where

$$g(Z)=e^{-rT/2}\,\psi\Big(e^{\frac{\sigma^2}{2}(\frac1d-1)\frac T2}\,I_{\frac T2},K_1,K_2,r,\sigma,T/2\Big)$$

and $Z=\big(Z^{1,\frac T2},\ldots,Z^{d,\frac T2}\big)\stackrel{d}{=}\mathcal N(0;I_d)$. The numerical specifications of the function $g$ are as follows:

$s_0=100$, $K_1=98$, $K_2=102$, $r=5\%$, $\sigma=20\%$, $T=2$.
The results are displayed in Fig. 5.5 in a log-log-scale for the dimensions d =
4, 6, 8, 10.
First, we recover the theoretical rates of convergence (namely $-2/d$) for the error bounds. Indeed, slopes $\beta(d)$ can be derived (using a regression) from the quantization errors, and we found $\beta(4)=-0.48$, $\beta(6)=-0.33$, $\beta(8)=-0.25$ and $\beta(10)=-0.23$ (see Fig. 5.5). These rates plead for the implementation of the Richardson–Romberg extrapolation. Also note that, as already reported in [231],

(a) 10 d = 4 | European Put Spread (K1,K2) (regularized)


g4 (slope -0.48)
(b) 10
d=6
QTF g4 (slope -0.33)
g4 Romberg (slope ...) QTF g4 Romberg (slope -0.84)
MC standart deviation (slope -0.5) MC

1
1

0.1
0.1
0.01
0.01
0.001

0.001
1e-04

1e-05 0.0001
1 10 100 1000 10000 1 10 100 1000 10000

d=8 d = 10
(c) 0.1
QTF g4 (slope -0.25)
(d) 0.1 QTF g4 (slope -0.23)
QTF g4 Romberg (slope -0.8)
QTF g4 Romberg (slope -1.2)
MC MC

0.01

0.01

0.001

0.0001 0.001
100 1000 10000 100 1000 10000

Fig. 5.5 Errors and standard deviations as functions of the number of points N in a log-log-scale.
The quantization error is displayed by the cross + and the Richardson–Romberg extrapolation error
by the cross ×. The dashed line without crosses denotes the standard deviation of the Monte Carlo
estimator. a d = 4, b d = 6, c d = 8, d d = 10 (with J. Printems)

when d ≥ 5, quantization still outperforms MC simulations (in the above sense) up to


a critical number Nc (d ) of points (Nc (6) ∼ 5000, Nc (7) ∼ 1000, Nc (8) ∼ 500, etc).
As concerns the Richardson–Romberg extrapolation method itself, note first that
it always gives better results than “crude” quantization. As regards now the compar-
ison with Monte Carlo simulation, no critical number of points NRomb (d ) comes out
beyond which MC simulation outperforms Richardson–Romberg extrapolation. This
means that NRomb (d ) is greater than the range of use of quantization-based cubature
formulas in our benchmark, namely 5 000.
Richardson–Romberg extrapolation techniques are sometimes considered a little
unstable, and indeed, it has not always been possible to satisfactorily estimate its rate
of convergence on our benchmark. But when a significant slope (in a log-log scale)
can be estimated from the Richardson–Romberg errors (like for d = 8 and d = 10
in Fig. 5.5c, d), its absolute value is larger than 1/2, and so, these extrapolations
always outperform the MC method, even for large values of N . As a by-product, our
results plead in favour of the conjecture (5.30) and lead us to think that Richardson–
Romberg extrapolation is a powerful tool to accelerate numerical integration by
optimal quantization, even in higher dimensions.

5.5 Hybrid Quantization-Monte Carlo Methods

In this section we explore two aspects of variance reduction by quantization. First


we show how to use optimal quantization as a control variate, then we present a
stratified sampling method relying on a quantization-based stratification. This second
method can be seen as a guided Monte Carlo method or a hybrid Quantization-Monte
Carlo method. This method was originally introduced in [231, 233] to deal with
Lipschitz continuous functionals of Brownian motion. We also refer to [192] for other quantization-based variance reduction methods on the Wiener space. Here, we only consider a finite-dimensional setting.

5.5.1 Optimal Quantization as a Control Variate

Let $X\in L^2_{\mathbb R^d}(\Omega,\mathcal A,\mathbb P)$ take at least $N\ge1$ values with positive probability. We assume that we have access to an $N$-quantizer $x:=(x_1,\ldots,x_N)\in(\mathbb R^d)^N$ with pairwise distinct components and we denote by $\mathrm{Proj}_x$ one of its Borel nearest neighbor projections. Let $\widehat X^x=\mathrm{Proj}_x(X)$. We assume that we know the distribution of $\widehat X^x$, characterized by the $N$-tuple $(x_1,\ldots,x_N)$ itself and the weights (probability distribution)

$$p_i:=\mathbb P\big(\widehat X^x=x_i\big)=\mathbb P\big(X\in C_i(x)\big),\quad i=1,\ldots,N,$$

where $C_i(x)=\mathrm{Proj}_x^{-1}(\{x_i\})$, $i=1,\ldots,N$, denotes the Voronoi tessellation of the $N$-quantizer induced by the above nearest neighbor projection. By "know" we mean that we have access to sufficiently accurate numerical values of the $x_i$ and their companion weights $p_i$.
Let F : Rd → Rd be a Lipschitz continuous function such that F(X ) ∈ L2 (P). In
order to compute $\mathbb E\,F(X)$, one writes, for every simulation size $M\ge1$,

$$\mathbb E\,F(X)=\mathbb E\,F(\widehat X^x)+\mathbb E\big(F(X)-F(\widehat X^x)\big)=\underbrace{\mathbb E\,F(\widehat X^x)}_{(a)}+\underbrace{\frac1M\sum_{m=1}^M\Big(F\big(X^{(m)}\big)-F\big(\widehat X^{x,(m)}\big)\Big)}_{(b)}+R_{N,M}, \tag{5.31}$$

where $X^{(m)}$, $m=1,\ldots,M$, are $M$ independent copies of $X$, $\widehat X^{x,(m)}=\mathrm{Proj}_x(X^{(m)})$ and $R_{N,M}$ is a remainder term defined by (5.31). Term (a) can be computed by quantization and term (b) can be computed by a Monte Carlo simulation. Then, it is clear that

$$\|R_{N,M}\|_2=\frac{\sigma\big(F(X)-F(\widehat X^x)\big)}{\sqrt M}\le\frac{\big\|F(X)-F(\widehat X^x)\big\|_2}{\sqrt M}\le[F]_{\mathrm{Lip}}\,\frac{\|X-\widehat X^x\|_2}{\sqrt M},$$

as M → +∞, where σ(Y ) denotes the standard deviation of a random variable Y .


Furthermore,
√ L  
x )) as M → +∞.
M RN ,M −→ N 0; Var(F(X ) − F(X

Consequently, if $F$ is simply a Lipschitz continuous function and if $(\widehat X^{(N)})_{N\ge1}$ is a sequence of optimal quadratic quantizations of $X$, then

$$\|R_{N,M}\|_2\le\frac{\big\|F(X)-F\big(\widehat X^{x^{(N)}}\big)\big\|_2}{\sqrt M}\le C_{2,\delta}\,[F]_{\mathrm{Lip}}\,\frac{\sigma_{2+\delta}(X)}{\sqrt M\,N^{\frac1d}}, \tag{5.32}$$

where $C_{2,\delta}$ denotes the constant coming from the non-asymptotic version of Zador's Theorem (see Theorem 5.2(b)).
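As an illustration (a hedged one-dimensional sketch, not the book's benchmark code): for $X\sim\mathcal N(0;1)$ and a hand-rounded version of the optimal 5-quantizer, the estimator (5.31) can be implemented as follows. Any quantizer grid works here, since the weights are recomputed consistently with the nearest neighbor projection; the Lipschitz test function F is an arbitrary choice.

```python
import numpy as np
from math import erf, sqrt

def Phi0(u):   # c.d.f. of N(0;1)
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def cv_estimator(F, x, M, rng):
    """Quantization-based control variate (5.31) for E F(X), X ~ N(0;1), x sorted."""
    mid = 0.5 * (x[1:] + x[:-1])                       # 1D Voronoi cell boundaries
    p = np.diff(np.concatenate(([0.0], [Phi0(v) for v in mid], [1.0])))  # weights p_i
    term_a = float(np.dot(p, F(x)))                    # (a): E F(Xhat), by quantization
    X = rng.standard_normal(M)
    Xhat = x[np.searchsorted(mid, X)]                  # nearest neighbor projection
    corr = F(X) - F(Xhat)                              # (b): Monte Carlo correction
    return term_a + corr.mean(), corr.std()

rng = np.random.default_rng(0)
x5 = np.array([-1.7242, -0.7646, 0.0, 0.7646, 1.7242])  # ~ optimal 5-quantizer of N(0;1)
F = lambda z: np.abs(z - 0.5)                           # Lipschitz test function
est, sd = cv_estimator(F, x5, 100_000, rng)
```

The returned `sd` is the standard deviation of the corrected samples; it is markedly smaller than that of the crude samples $F(X^{(m)})$, in line with (5.32).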

ℵ Practitioner’s corner
As concerns the practical implementation of this quantization-based variance reduction method, the main gap is the nearest neighbor search needed at each step to compute $\widehat X^{x,(m)}$ from $X^{(m)}$.
In one dimension, an (optimal) $N$-quantizer is usually directly obtained as a monotonic $N$-tuple with non-decreasing components, and the complexity of a nearest neighbor search on the real line based on a dichotomy procedure is approximately $\frac{\log N}{\log 2}$. Unfortunately, this one-dimensional setting is of little interest for applications.
Unfortunately, this one-dimensional setting is of little interest for applications.
In $d$ dimensions, there exist nearest neighbor search procedures with a $O(\log N)$ complexity, once the $N$-quantizer has been given an appropriate tree structure (which costs $O(N\log N)$). The most popular tree-based procedure for nearest neighbor search is undoubtedly the K-d tree (see [99]). During the last ten years, several attempts to improve it have been carried out; among them, we may mention the Principal Axis Tree algorithm (see [207]). These procedures are efficient for quantizers with a large size $N$ lying in a vector space of medium dimension (say up to 10).
An alternative to speed up the nearest neighbor search is to restrict to product quantizers, whose Voronoi cells are hyper-parallelepipeds. In such a case, the nearest neighbor search reduces to those on the $d$ marginals, with an approximate resulting complexity of $d\,\frac{\log N}{\log 2}$.
However, this nearest neighbor search procedure undoubtedly slows down the
global procedure.
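To illustrate the product-quantizer shortcut on a made-up toy grid: the nearest neighbor in a product grid is found coordinate by coordinate through a dichotomy on each sorted marginal, i.e. at a cost proportional to d·log N rather than a search over all product points.

```python
import numpy as np

def product_nn_index(grids, X):
    """Multi-index of the nearest product-grid point for each row of X (shape (M, d))."""
    cols = []
    for j, g in enumerate(grids):            # g: sorted 1D grid of the j-th marginal
        mid = 0.5 * (g[1:] + g[:-1])         # 1D Voronoi boundaries of that marginal
        cols.append(np.searchsorted(mid, X[:, j]))   # dichotomy per coordinate
    return np.stack(cols, axis=1)

grids = [np.array([-1.0, 0.0, 1.0]), np.array([0.0, 2.0])]   # toy 3 x 2 product quantizer
X = np.array([[0.4, 1.2], [-0.9, 0.2]])
idx = product_nn_index(grids, X)
```

Here `idx` gives, row by row, the marginal indices of the nearest product point (e.g. the first sample is closest to the grid point (0.0, 2.0)).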

5.5.2 Universal Stratified Sampling

The main drawback of the preceding is the repeated use of nearest neighbor search
procedures. Using a quantization-based stratification may be a way to take advan-
tage of quantization to reduce the variance without having to implement such time
consuming procedures. On the other hand, one important drawback of the regular
stratification method as described in Sect. 3.5 is to depend on the function F, at least
when concerned by the optimal choice for the allocation parameters qi . Our aim is to
5.5 Hybrid Quantization-Monte Carlo Methods 169

show that quantization-based stratification has a uniform efficiency among the class of Lipschitz continuous functions. The first step is the universal stratified sampling for Lipschitz continuous functions detailed in the simple proposition below, where we use the notations introduced in Sect. 3.5. Also keep in mind that, for a random vector $Y \in L^2_{\mathbb{R}^d}(\Omega,\mathcal{A},\mathbb{P})$, $\|Y\|_2 = \big(\mathbb{E}\,|Y|^2\big)^{1/2}$, where $|\cdot|$ denotes the canonical Euclidean norm.

Proposition 5.4 (Universal stratification) Let $X \in L^2_{\mathbb{R}^d}(\Omega,\mathcal{A},\mathbb{P})$ and let $(A_i)_{i\in I}$ be a stratification of $\mathbb{R}^d$. For every $i\in I$, we define the local inertia of the random vector $X$ on the stratum $A_i$ by
$$\sigma_i^2 = \mathbb{E}\big(|X-\mathbb{E}(X\,|\,X\in A_i)|^2\,\big|\,X\in A_i\big).$$
(a) Then, for every Lipschitz continuous function $F:(\mathbb{R}^d,|\cdot|)\to(\mathbb{R}^d,|\cdot|)$,
$$\forall\, i\in I,\quad \sup_{[F]_{\mathrm{Lip}}\le 1}\sigma_{F,i} = \sigma_i, \tag{5.33}$$
where $\sigma_{F,i}$ is non-negative and defined by
$$\sigma_{F,i}^2 = \min_{a\in\mathbb{R}^d}\mathbb{E}\big(|F(X)-a|^2\,\big|\,X\in A_i\big) = \mathbb{E}\big(|F(X)-\mathbb{E}(F(X)\,|\,X\in A_i)|^2\,\big|\,X\in A_i\big).$$

(b) Suboptimal choice: $q_i = p_i$. One has
$$\sup_{[F]_{\mathrm{Lip}}\le 1}\ \sum_{i\in I} p_i\,\sigma_{F,i}^2 = \sum_{i\in I} p_i\,\sigma_i^2 = \big\|X-\mathbb{E}\big(X\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_2^2. \tag{5.34}$$
(c) Optimal choice of the $q_i$ (see (3.10) for a closed form of the $q_i$). One has
$$\sup_{[F]_{\mathrm{Lip}}\le 1}\Big(\sum_{i\in I} p_i\,\sigma_{F,i}\Big)^2 = \Big(\sum_{i\in I} p_i\,\sigma_i\Big)^2 \ge \big\|X-\mathbb{E}\big(X\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_1^2. \tag{5.35}$$

Remark. Any real-valued Lipschitz continuous function can be canonically seen as an $\mathbb{R}^d$-valued Lipschitz function, but then the above equalities (5.33)–(5.35) only hold as inequalities.

Proof. (a) Note that
$$\sigma_{F,i}^2 = \mathrm{Var}\big(F(X)\,|\,X\in A_i\big) = \mathbb{E}\big((F(X)-\mathbb{E}(F(X)\,|\,X\in A_i))^2\,\big|\,X\in A_i\big) \le \mathbb{E}\big((F(X)-F(\mathbb{E}(X\,|\,X\in A_i)))^2\,\big|\,X\in A_i\big)$$
170 5 Optimal Quantization Methods I: Cubatures

owing to the very definition of conditional expectation as a minimizer with respect to the conditional distribution. Now, using that $F$ is Lipschitz continuous, it follows that
$$\sigma_{F,i}^2 \le [F]_{\mathrm{Lip}}^2\,\frac{1}{p_i}\,\mathbb{E}\Big(\mathbf{1}_{\{X\in A_i\}}\,\big|X-\mathbb{E}(X\,|\,X\in A_i)\big|^2\Big) = [F]_{\mathrm{Lip}}^2\,\sigma_i^2.$$
Moreover, the supremum in (5.33) is attained: the identity map $F = \mathrm{Id}_{\mathbb{R}^d}$ satisfies $[F]_{\mathrm{Lip}} = 1$ and $\sigma_{F,i} = \sigma_i$.
The equalities in (b) and (c) straightforwardly follow from (a). Finally, it follows from Jensen's Inequality (or the monotonicity of conditional $L^p$-norms) that
$$\sum_{i=1}^N p_i\,\sigma_i = \sum_{i=1}^N p_i\,\Big(\mathbb{E}\big(|X-\mathbb{E}(X\,|\,X\in A_i)|^2\,\big|\,X\in A_i\big)\Big)^{1/2} \ge \big\|X-\mathbb{E}\big(X\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_1. \qquad\diamond$$
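Before specializing the strata, here is a minimal numerical sketch (ours, not from the book) of stratified sampling with the suboptimal proportional allocation $q_i = p_i$ of claim (b), for the hypothetical choice $X \sim \mathcal{U}([0,1])$ and $F(x) = x^2$, where exact conditional sampling within each stratum is trivial.

```python
import random

def stratified_mean(f, strata, n_total, rng):
    """Stratified estimator of E f(X), X ~ U([0,1]): stratum i, of
    probability p_i, receives n_i ~ p_i * n_total samples (proportional
    allocation) and contributes p_i times its conditional sample mean."""
    est = 0.0
    for lo, hi in strata:
        p = hi - lo                                  # p_i = P(X in A_i)
        n_i = max(1, round(p * n_total))
        s = sum(f(lo + (hi - lo) * rng.random()) for _ in range(n_i))
        est += p * s / n_i
    return est

rng = random.Random(0)
strata = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
estimate = stratified_mean(lambda x: x * x, strata, 100_000, rng)
```

Here the estimator targets $\mathbb{E}\,X^2 = 1/3$, and the within-stratum dispersions $\sigma_{F,i}$ are much smaller than the global standard deviation of $F(X)$.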

5.5.3 A(n Optimal) Quantization-Based Universal Stratification: A Minimax Approach

Let $X \in L^2_{\mathbb{R}^d}(\Omega,\mathcal{A},\mathbb{P})$ take at least $N$ values with positive probability. The starting idea is to use the Voronoi diagram of an $N$-quantizer $x = (x_1,\ldots,x_N)$ of $X$ such that $\mathbb{P}(X\in C_i(x)) > 0$, $i = 1,\ldots,N$, to design the strata in a stratification procedure. Firstly, this amounts to setting $I = \{1,\ldots,N\}$ and
$$A_i = C_i(x),\quad i\in I,$$
in the preceding. Then, for every $i\in\{1,\ldots,N\}$, there exists a Borel function $\varphi(x_i,\cdot):[0,1]^q\to\mathbb{R}^d$ such that
$$\varphi(x_i,U) \stackrel{d}{=} \mathcal{L}\big(X\,|\,\widehat X^x = x_i\big) = \frac{\mathbf{1}_{C_i(x)}\,\mathbb{P}_X(d\xi)}{\mathbb{P}\big(X\in C_i(x)\big)},$$
where $U \stackrel{d}{=} \mathcal{U}([0,1]^q)$. Note that the dimension $q$ is arbitrary: one may always assume that $q = 1$ by the fundamental theorem of simulation but, in order to obtain some closed forms for $\varphi(x_i,\cdot)$, we are led to consider situations where $q\ge 2$ (or even infinite when considering a Von Neumann acceptance-rejection method).
Now let $(\xi, U)$ be a pair of independent random vectors such that $\xi \stackrel{d}{=} \widehat X^x$ and $U \stackrel{d}{=} \mathcal{U}([0,1]^q)$. Then, one checks that
$$\varphi(\xi,U) \stackrel{d}{=} X$$
so that one may assume without loss of generality that
$$X = \varphi\big(\widehat X^x, U\big),\quad U \stackrel{d}{=} \mathcal{U}\big([0,1]^q\big),\ \text{where } U \text{ and } \widehat X^x \text{ are independent.}$$

In terms of implementation, as mentioned above we need a closed formula for the function $\varphi$, which induces some stringent constraints on the choice of the $N$-quantizers. In particular, there is no reasonable hope to consider true optimal quadratic quantizers for that purpose. A reasonable compromise is to consider some optimal product quantization, for which the function $\varphi$ can easily be made explicit (see Sect. 3.5).
Let Ad (N ) denote the family of all Borel partitions of Rd having at most N
elements.

Proposition 5.5 (a) Suboptimal stratification ($p_i = q_i$). One has
$$\inf_{(A_i)_{1\le i\le N}\in\mathcal{A}_d(N)}\ \sup_{[F]_{\mathrm{Lip}}\le 1}\ \sum_{i\in I} p_i\,\sigma_{F,i}^2 \;=\; \min_{x\in(\mathbb{R}^d)^N}\big\|X-\widehat X^x\big\|_2^2.$$
(b) Optimal stratification. One has
$$\inf_{(A_i)_{1\le i\le N}\in\mathcal{A}_d(N)}\ \sup_{[F]_{\mathrm{Lip}}\le 1}\ \sum_{i\in I} p_i\,\sigma_{F,i} \;\ge\; \min_{x\in(\mathbb{R}^d)^N}\big\|X-\widehat X^x\big\|_1.$$

Proof. Let $A_i = C_i(x)$, $i = 1,\ldots,N$. Then (5.34) and (5.35), rewritten in terms of quantization, read
$$\sup_{[F]_{\mathrm{Lip}}\le 1}\ \sum_{i\in I} p_i\,\sigma_{F,i}^2 = \sum_{i\in I} p_i\,\sigma_i^2 = \big\|X-\mathbb{E}\big(X\,|\,\widehat X^x\big)\big\|_2^2 \tag{5.36}$$
and
$$\sup_{[F]_{\mathrm{Lip}}\le 1}\Big(\sum_{i\in I} p_i\,\sigma_{F,i}\Big)^2 = \Big(\sum_{i\in I} p_i\,\sigma_i\Big)^2 \ge \big\|X-\mathbb{E}\big(X\,|\,\widehat X^x\big)\big\|_1^2, \tag{5.37}$$
where we used the obvious fact that $\sigma\big(\{X\in C_i(x)\},\,i\in I\big) = \sigma\big(\widehat X^x\big)$.
(a) It follows from (5.34) that
$$\inf_{(A_i)_{1\le i\le N}}\ \sum_{i\in I} p_i\,\sigma_i^2 = \inf_{(A_i)_{1\le i\le N}}\big\|X-\mathbb{E}\big(X\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_2^2.$$
Now, it follows from Proposition 5.1(b) that, if $x^{(N)}$ denotes an optimal quantizer at level $N$ for $X$ (which has $N$ pairwise distinct components),
$$\big\|X-\widehat X^{x^{(N)}}\big\|_2 = \min\Big\{\|X-Y\|_2,\ Y:(\Omega,\mathcal{A},\mathbb{P})\to\mathbb{R}^d,\ |Y(\Omega)|\le N\Big\}$$
and
$$\widehat X^{x^{(N)}} = \mathbb{E}\big(X\,|\,\widehat X^{x^{(N)}}\big).$$
Consequently, (5.36) completes the proof.


(b) It follows from (5.35) that
$$\sum_{i\in I} p_i\,\sigma_i \ge \big\|X-\mathbb{E}\big(X\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_1 \ge \big\|\mathrm{dist}(X,A)\big\|_1,$$
where $A := \big\{\mathbb{E}\big(X\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)(\omega),\ \omega\in\Omega\big\}$ has at most $N$ elements. Now $\big\|\mathrm{dist}(X,A)\big\|_1 = \mathbb{E}\,\mathrm{dist}(X,A) = \big\|X-\widehat X^{A}\big\|_1$, consequently,
$$\sum_{i\in I} p_i\,\sigma_i \ge \big\|X-\widehat X^{A}\big\|_1 \ge \min_{x\in(\mathbb{R}^d)^N}\big\|X-\widehat X^x\big\|_1. \qquad\diamond$$

As a conclusion, we see that the notion of universal stratification (with respect to Lipschitz continuous functions) and quantization are closely related, since the variance reduction factor that can be obtained by such an approach is essentially ruled by the optimal quantization rate of the random vector $X$, that is $c_X\,N^{-\frac{1}{d}}$, according to Zador's Theorem (see Theorem 5.2).

One dimension
In this case the method applies straightforwardly, provided both the distribution function $F_X(u) := \mathbb{P}(X\le u)$ of $X$ (on $\overline{\mathbb{R}}$) and its right continuous (canonical) inverse on $[0,1]$, denoted by $F_X^{-1}$, are computable.
We also need the additional assumption that the $N$-quantizer $x = (x_1,\ldots,x_N)$ satisfies the following continuity assumption
$$\mathbb{P}(X = x_i) = 0,\quad i = 1,\ldots,N.$$
Note that this is always the case if $X$ has a density. Then set
$$x_{i+\frac12} = \frac{x_i+x_{i+1}}{2},\ i = 1,\ldots,N-1,\qquad x_{\frac12} = -\infty,\qquad x_{N+\frac12} = +\infty.$$
Then elementary computations show that, with $q = 1$,
$$\forall\, u\in[0,1],\quad \varphi_N(x_i,u) = F_X^{-1}\Big(F_X\big(x_{i-\frac12}\big) + \big(F_X\big(x_{i+\frac12}\big)-F_X\big(x_{i-\frac12}\big)\big)\,u\Big),\quad i = 1,\ldots,N. \tag{5.38}$$
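As an illustration, the following self-contained sketch (ours, with a bisection-based inverse standing in for a library quantile function) implements (5.38) for a standard normal $X$: $F_X = \Phi$ is computed via `math.erf` and $F_X^{-1}$ by dichotomy.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function F_X."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_inv(p, lo=-10.0, hi=10.0):
    """Inverse of norm_cdf on [lo, hi], computed by bisection."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def phi_N(grid, i, u):
    """Formula (5.38): simulate X given that it falls in the i-th Voronoi
    cell of the sorted quantizer grid = (x_1, ..., x_N); i is 0-based."""
    half = [-math.inf] + [(a + b) / 2 for a, b in zip(grid, grid[1:])] + [math.inf]
    a, b = norm_cdf(half[i]), norm_cdf(half[i + 1])
    return norm_inv(a + (b - a) * u)
```

For the 3-point grid $(-1.2, 0, 1.2)$, `phi_N(grid, 1, u)` returns values in the middle cell $(-0.6, 0.6)$; $u = 0.5$ gives the conditional median, here 0 by symmetry.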

Higher dimensions
We consider a random vector $X = (X^1,\ldots,X^d)$ whose marginals $X^\ell$ are independent. This may appear as a rather stringent restriction in full generality, although it is often possible to “extract” in a model an innovation with this correlation structure. At least in a Gaussian framework, such a reduction is always possible after an orthogonal diagonalization of its covariance matrix. One considers a product quantizer (see e.g. [225, 226]) defined as follows: for every $\ell\in\{1,\ldots,d\}$, let $x^{(N_\ell)} = \big(x_1^{(N_\ell)},\ldots,x_{N_\ell}^{(N_\ell)}\big)$ be an $N_\ell$-quantizer of the marginal $X^\ell$ and set $N := N_1\times\cdots\times N_d$. Then, define for every multi-index $i := (i_1,\ldots,i_d)\in I := \prod_{\ell=1}^d\{1,\ldots,N_\ell\}$,
$$x_i = \big(x_{i_1}^{(N_1)},\ldots,x_{i_d}^{(N_d)}\big).$$
Then, one defines $\varphi_{N,X}(x_i,u)$ by setting $q = d$ and
$$\varphi_{N,X}\big(x_i,(u_1,\ldots,u_d)\big) = \Big(\varphi_{N_\ell,X^\ell}\big(x_{i_\ell}^{(N_\ell)},u_\ell\big)\Big)_{1\le\ell\le d},$$
where each $\varphi_{N_\ell,X^\ell}$ is defined by (5.38).
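A toy sketch of this componentwise construction (ours, with hypothetical $\mathcal{U}([0,1])$ marginals, for which $F_{X^\ell} = \mathrm{id}$ and (5.38) reduces to an affine map onto the cell):

```python
def marginal_phi(grid, i, u):
    """(5.38) for a U([0,1]) marginal: map u affinely onto the i-th cell
    (x_{i-1/2}, x_{i+1/2}) of the sorted 1-D quantizer grid (0-based i)."""
    half = [0.0] + [(a + b) / 2 for a, b in zip(grid, grid[1:])] + [1.0]
    lo, hi = half[i], half[i + 1]
    return lo + (hi - lo) * u

def product_phi(grids, multi_index, u):
    """Componentwise sampler phi_{N,X}(x_i, u) for a product quantizer."""
    return tuple(marginal_phi(g, i, ui)
                 for g, i, ui in zip(grids, multi_index, u))
```

With marginal sizes $N_1 = 2$ and $N_2 = 3$, the product quantizer has $N = 6$ rectangular cells, and `product_phi(grids, (0, 2), (0.5, 0.5))` returns the conditional midpoint of the selected cell.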

For various numerical experiments, we refer to [70] in finite dimensions, or to [192, 233] for an implementation on the Wiener space.
Chapter 6
Stochastic Approximation
with Applications to Finance

6.1 Motivation

In Finance, one often faces some optimization problems or zero search problems.
The former often reduce to the latter since, at least in a convex framework, mini-
mizing a function amounts to finding a zero of its gradient. The most commonly
encountered examples are the extraction of implicit parameters (the implicit volatility of an option, implicit correlations for a single best-of option or in the credit markets), calibration, and the optimization of exogenous parameters for variance reduction (regression, importance sampling, etc.). All these situations share a common feature:
the involved functions all have a representation as an expectation, namely they read
h(y) = E H (y, Z) where Z is a q-dimensional random vector. The aim of this chapter
is to provide a toolbox – stochastic approximation – based on simulation to solve
these optimization or zero search problems. It can be viewed as a non-linear extension
of the Monte Carlo method.
Stochastic approximation can also be presented as a probabilistic extension of
Newton–Raphson like zero search recursive procedures of the form

∀ n ≥ 0, yn+1 = yn − γn+1 h(yn ) (0 < γn ≤ γ0 ), (6.1)

where $h:\mathbb{R}^d\to\mathbb{R}^d$ is a continuous vector field satisfying a sub-linear growth assumption at infinity. Under some appropriate mean-reverting assumptions, one shows that such a procedure is bounded and eventually converges to a zero $y_*$ of $h$.
As an example, if one sets $\gamma_n = \big(J_h(y_{n-1})\big)^{-1}$ – where $J_h(y)$ denotes the Jacobian of $h$ at $y$ – the above recursion is just the regular Newton–Raphson procedure for the zero search of the function $h$ (one can also set $\gamma_n = \gamma\,\big(J_h(y_{n-1})\big)^{-1}$, $\gamma > 0$).
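As a minimal numerical illustration of the recursion (6.1) (ours, not the book's), take the hypothetical scalar field $h(y) = \arctan(y-2)$, which is increasing, bounded (hence sub-linear) and mean-reverting around its unique zero $y_* = 2$:

```python
import math

def zero_search(h, y0, gamma, n_iter):
    """Deterministic recursion (6.1): y_{n+1} = y_n - gamma(n+1) * h(y_n)."""
    y = y0
    for n in range(n_iter):
        y -= gamma(n + 1) * h(y)
    return y

h = lambda y: math.atan(y - 2.0)              # unique zero y* = 2
y = zero_search(h, y0=10.0, gamma=lambda n: 2.0 / n, n_iter=10_000)
```

With the step choice $\gamma_n = 2/n$ the iterates drift monotonically toward $y_* = 2$, as the mean-reversion argument of the next paragraph predicts.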
In one dimension, mean-reversion may be obtained by an increasing assumption made on the function $h$ or, more simply, by assuming that $h(y)(y-y_*) > 0$ for every $y\ne y_*$: if this is the case, $y_n$ decreases as long as $y_n > y_*$ and increases whenever $y_n < y_*$. In higher dimensions, this assumption becomes $(h(y)\,|\,y-y_*) > 0$, $y\ne y_*$, and will be extensively called upon later.

© Springer International Publishing AG, part of Springer Nature 2018 175


G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_6

More generally, mean-reversion may follow from a monotonicity assumption on $h$ in one (or higher) dimension or, even more generally, from the existence of a so-called Lyapunov function. To introduce this notion, let us make a (light) connection with Ordinary Differential Equations (ODEs) and differential dynamical systems: thus, when $\gamma_n = \gamma > 0$, Eq. (6.1) is but the Euler scheme with step $\gamma > 0$ of the ODE
$$ODE_h \equiv \dot y = -h(y).$$
A Lyapunov function for $ODE_h$ is a function $L:\mathbb{R}^d\to\mathbb{R}_+$ such that, for any solution $t\mapsto y(t)$ of the equation, $t\mapsto L(y(t))$ is non-increasing as $t$ increases. If $L$ is differentiable, this is equivalent to the condition $(\nabla L\,|\,h)\ge 0$ since
$$\frac{d}{dt}L(y(t)) = \big(\nabla L(y(t))\,|\,\dot y(t)\big) = -(\nabla L\,|\,h)(y(t)).$$
If such a Lyapunov function does exist (which is not always the case!), the system
is said to be dissipative.
We essentially have two frameworks:
– the function L is identified a priori, it is the object of interest for optimization
purposes, and one designs a function h from L, e.g. by setting h = ∇L (or possibly
h positively proportional to ∇L, i.e. h = ρ∇L, where ρ is – at least – an everywhere
positive function).
– The function naturally entering into the problem is h and one has to search for
a Lyapunov function L (which may not exist). This usually requires a deep under-
standing of the problem from a dynamical point of view.
This duality also occurs in discrete time Stochastic Approximation Theory from
its very beginning in the early 1950s (see [169, 253]).
As concerns the constraints on the Lyapunov function, due to the specificities of
the discrete time setting, we will require some further regularity assumptions on ∇L,
typically ∇L is Lipschitz continuous and |∇L|2 ≤ C(1 + L) (essentially a quadratic
growth property).
 Exercises. 1. Show that if a function h : Rd → Rd is non-decreasing in the fol-
lowing sense
∀ x, y ∈ Rd , (h(y) − h(x)|y − x) ≥ 0

and if h(y∗ ) = 0 then L(y) = |y − y∗ |2 is a Lyapunov function for ODEh .


2. Assume furthermore that {h = 0} = {y∗ } and that h satisfies a sub-linear growth
assumption: |h(y)| ≤ C(1 + |y|), y ∈ Rd . Show that the sequence (yn )n≥0 defined
by (6.1) converges toward y∗ .

Now imagine that no straightforward access to numerical values of $h(y)$ is available, but that $h$ has an integral representation with respect to an $\mathbb{R}^q$-valued random vector $Z$, say
$$h(y) = \mathbb{E}\,H(y,Z),\qquad H:\mathbb{R}^d\times\mathbb{R}^q\xrightarrow{\ \text{Borel}\ }\mathbb{R}^d,\qquad Z\sim\mu, \tag{6.2}$$
satisfying $\mathbb{E}\,|H(y,Z)| < +\infty$ for every $y\in\mathbb{R}^d$. Assume that


– H (y, z) is easy to compute for any pair (y, z),
– the distribution μ of Z can be simulated at a reasonable cost.
A first idea is to simply “randomize” the above zero search procedure (6.1) by
using at each iterate a Monte Carlo simulation to approximate h(yn ).
A more sophisticated idea is to try to do both simultaneously by using, on the one hand, $H(y_n, Z_{n+1})$ instead of $h(y_n)$ and, on the other hand, by letting the step $\gamma_n$ go to 0 to asymptotically smoothen the chaotic effect induced by this “local” randomization. However, one should not allow $\gamma_n$ to go to 0 too fast, so that an averaging effect occurs as in the Monte Carlo method. In fact, one should impose that $\sum_n \gamma_n = +\infty$, if only to ensure that the initial value $Y_0$ will be “forgotten” by the procedure.
Based on this heuristic analysis, we can reasonably hope that the Rd -valued recur-
sive procedure
∀n ≥ 0, Yn+1 = Yn − γn+1 H (Yn , Zn+1 ), (6.3)

where

(Zn )n≥1 is an i.i.d. sequence with distribution μ defined on (, A, P)

and Y0 , defined on the same probability space, is independent of the sequence (Zn )n≥1 ,
also converges to a zero y∗ of h, under appropriate assumptions on both H and the gain
sequence γ = (γn )n≥1 . We will call such a recursive procedure a Markovian stochastic
algorithm or, more simply, a stochastic algorithm. There are more general classes of
recursive stochastic algorithms but this framework is sufficient for our purpose.
The preceding can be seen as the “meta-theorem” of Stochastic Approximation
since a large part of this theory is focused on making this algorithm converge toward
its “target” (a zero of h). In this framework, the Lyapunov functions mentioned above
are called upon to ensure the stability of the procedure.
 A toy-example: the Strong Law of Large Numbers. As a first example note that the
sequence of empirical means $(\bar Z_n)_{n\ge1}$ of an i.i.d. sequence $(Z_n)_{n\ge1}$ of integrable random variables satisfies
$$\bar Z_{n+1} = \bar Z_n - \frac{1}{n+1}\big(\bar Z_n - Z_{n+1}\big),\quad n\ge 0,\quad \bar Z_0 = 0,$$
i.e. a stochastic approximation procedure of type (6.3) with $H(y,z) = y-z$ and $h(y) := y-z_*$, where $z_* = \mathbb{E}\,Z$ (so that $Y_n = \bar Z_n$). Then the procedure converges a.s. (and in $L^1$) to the unique zero $z_*$ of $h$.
The (weak) rate of convergence of $(\bar Z_n)_{n\ge1}$ is ruled by the CLT, which may suggest that the generic rate of convergence of this kind of procedure is of the same type. Note that the deterministic counterpart of (6.3) with the same gain parameter, $y_{n+1} = y_n - \frac{1}{n+1}(y_n - z_*)$, converges at a $\frac1n$-rate to $z_*$ (and this is clearly not the optimal choice for $\gamma_n$ in this deterministic framework).
However, if we do not know the value of the mean $z_* = \mathbb{E}\,Z$ but are able to simulate $\mu$-distributed random vectors, the first recursive stochastic procedure can easily be implemented whereas the deterministic one cannot. The stochastic procedure we are speaking about is simply the regular Monte Carlo method!
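In code (a trivial sketch of the above, ours): the recursive form reproduces exactly the empirical mean, i.e. the plain Monte Carlo estimator.

```python
import random

def recursive_mean(samples):
    """Stochastic procedure Ybar_{n+1} = Ybar_n - (Ybar_n - Z_{n+1})/(n+1),
    started at Ybar_0 = 0: identical to the running empirical mean."""
    y = 0.0
    for n, z in enumerate(samples):
        y -= (y - z) / (n + 1)
    return y

random.seed(42)
zs = [random.gauss(1.0, 2.0) for _ in range(200_000)]
y = recursive_mean(zs)
```

Up to floating-point rounding, `y` coincides with `sum(zs) / len(zs)` and converges to $\mathbb{E}\,Z = 1$.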
 A toy-model: extracting implicit volatility in a Black–Scholes model. A second
toy-example is the extraction of the implicit volatility in a Black–Scholes model for a vanilla Call or Put option. In practice it is carried out by a deterministic Newton procedure (see e.g. [209]) since closed forms are available for both the premium and the vega of the option. But let us forget about that for a while to illustrate the basic principle of Stochastic Approximation. Let $x, K, T\in(0,+\infty)$, let $r\in\mathbb{R}$ and set, for every $\sigma\in\mathbb{R}$,
$$X_t^{x,\sigma} = x\,e^{(r-\frac{\sigma^2}{2})t+\sigma W_t},\quad t\ge 0.$$
We know that $\sigma\mapsto\mathrm{Put}_{BS}(x,K,\sigma,r,T) = e^{-rT}\,\mathbb{E}\,\big(K-X_T^{x,\sigma}\big)_+$ is an even function, increasing on $(0,+\infty)$, continuous, with $\lim_{\sigma\to0}\mathrm{Put}_{BS}(x,K,\sigma,r,T) = (e^{-rT}K-x)_+$ and $\lim_{\sigma\to+\infty}\mathrm{Put}_{BS}(x,K,\sigma,r,T) = e^{-rT}K$ (these bounds are model-free and can be directly derived by arbitrage arguments). Let $P_{\mathrm{market}}(x,K,r,T)\in\big[(e^{-rT}K-x)_+,\,e^{-rT}K\big]$ be a consistent mark-to-market price for the Put option with maturity $T$ and strike price $K$. Then the implied volatility $\sigma_{\mathrm{impl}} := \sigma_{\mathrm{impl}}(x,K,r,T)$ is defined as the unique positive solution to the equation
$$\mathrm{Put}_{BS}(x,K,\sigma_{\mathrm{impl}},r,T) = P_{\mathrm{market}}(x,K,r,T)$$
or, equivalently,
$$\mathbb{E}\Big(e^{-rT}\big(K-X_T^{x,\sigma_{\mathrm{impl}}}\big)_+\Big) - P_{\mathrm{market}}(x,K,r,T) = 0.$$

This naturally suggests to devise the following stochastic algorithm to solve this equation numerically:
$$\sigma_{n+1} = \sigma_n - \gamma_{n+1}\Big(e^{-rT}\Big(K - x\,e^{(r-\frac{\sigma_n^2}{2})T+\sigma_n\sqrt{T}\,Z_{n+1}}\Big)_+ - P_{\mathrm{market}}(x,K,r,T)\Big),$$
where $(Z_n)_{n\ge1}$ is an i.i.d. sequence of $\mathcal{N}(0;1)$-distributed random variables and the step sequence $\gamma = (\gamma_n)_{n\ge1}$ is, for example, given by $\gamma_n = \frac{c}{n}$ for some parameter $c > 0$. After a necessary “tuning” of the constant $c$ (try $c = \frac{2}{x+K}$), one observes that
$$\sigma_n \longrightarrow \sigma_{\mathrm{impl}}\quad \text{a.s. as } n\to+\infty.$$
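Here is a runnable sketch of this toy algorithm (our own implementation and tuning, not the book's): the market price is manufactured from the closed-form Black–Scholes put at $\sigma = 0.2$, so the target $\sigma_{\mathrm{impl}} = 0.2$ is known in advance, and the step is damped as $\gamma_n = c/(n+100)$ to tame the first iterates.

```python
import math, random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_put(x, K, sigma, r, T):
    """Closed-form Black-Scholes put premium (used only to fake P_market)."""
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return K * math.exp(-r * T) * norm_cdf(-d2) - x * norm_cdf(-d1)

def implied_vol_sa(x, K, r, T, p_market, sigma0, c=0.05, n_iter=200_000, seed=1):
    """sigma_{n+1} = sigma_n - gamma_{n+1} * (e^{-rT}(K - X_T)_+ - p_market)."""
    rng = random.Random(seed)
    sigma = sigma0
    for n in range(1, n_iter + 1):
        z = rng.gauss(0.0, 1.0)
        x_T = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
        payoff = math.exp(-r * T) * max(K - x_T, 0.0)
        sigma -= (c / (n + 100)) * (payoff - p_market)
    return sigma

p_market = bs_put(100.0, 100.0, 0.2, 0.02, 1.0)
sigma_n = implied_vol_sa(100.0, 100.0, 0.02, 1.0, p_market, sigma0=0.5)
```

Starting from $\sigma_0 = 0.5$, the iterates settle near the target volatility 0.2 after a few hundred thousand simulations, in line with the slow Monte-Carlo-type rate discussed below.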



 Exercise. Try it!

A priori, one might imagine that σn could converge toward −σimpl (which would
not be a real problem) but it a.s. never happens because this negative solution is
repulsive for the related ODEh and “noisy”. This is an important topic often referred
to in the literature as “how stochastic algorithms never fall into noisy traps” (see [51,
191, 242], for example).
To conclude this introductory section, let us briefly return to the case where $h = \nabla L$ and $L(y) = \mathbb{E}\,\Lambda(y,Z)$, so that $\nabla L(y) = \mathbb{E}\,H(y,Z)$ with $H(y,z) := \frac{\partial \Lambda}{\partial y}(y,z)$. The function $H$ is sometimes called a local gradient of $L$ and the procedure (6.3) is known as a stochastic gradient procedure. When $Y_n$ converges to some zero $y_*$ of $h = \nabla L$ at which the algorithm is “noisy enough” – say, e.g., $\mathbb{E}\big(H(y_*,Z)\,H(y_*,Z)^*\big)$ is a positive definite symmetric matrix – then $y_*$ is necessarily a local minimum of the potential $L$: $y_*$ cannot be a trap. So, if $L$ is strictly convex and $\lim_{|y|\to+\infty}L(y) = +\infty$, then $\nabla L$ has a single zero $y_*$, which is simply the global minimum of $L$: the stochastic gradient turns out to be a minimization procedure.
However, most recursive stochastic algorithms (6.3) are not stochastic gradients
and the Lyapunov function, if it exists, is not naturally associated to the algorithm:
finding a Lyapunov function to “stabilize” the algorithm (by bounding a.s. its paths,
see the Robbins–Siegmund Lemma below) is often a difficult task which requires a
deep understanding of the related ODEh .
As concerns the rate of convergence, one must keep in mind that it is usually ruled by a CLT at a $1/\sqrt{\gamma_n}$-rate, which can attain at most the $\sqrt{n}$-rate of the regular CLT. So, such a “toolbox” is clearly not competitive compared to a deterministic procedure, when available, but this rate should be compared to that of the Monte Carlo method (i.e. the SLLN), since their fields of application are similar: stochastic approximation is the natural extension of the Monte Carlo method to solve inverse or optimization problems related to functions having a representation as an expectation of simulable random functions.
Recently, several contributions (see [12, 196, 199]) have drawn the attention of the quants world to stochastic approximation as a tool for variance reduction, implicitation of parameters, model calibration, risk management… It is also used in other fields of finance, like algorithmic trading, as an on-line optimizing device for the execution of orders (see e.g. [188, 189]). We will briefly discuss several (toy-)examples of application.

6.2 Typical a.s. Convergence Results

Stochastic Approximation theory provides various theorems which guarantee the a.s. and/or $L^p$-convergence of recursive stochastic approximation algorithms as defined by (6.3). We provide below a general (multi-dimensional) preliminary result, known as the Robbins–Siegmund Lemma, from which the main convergence results will be easily derived, even though, strictly speaking, this result does not provide any direct a.s. convergence result.
In what follows the function H and the sequence (Zn )n≥1 are defined by (6.2) and
h is the vector field from Rd to Rd defined by h(y) = E H (y, Z1 ).
Theorem 6.1 (Robbins–Siegmund Lemma) Let $h:\mathbb{R}^d\to\mathbb{R}^d$ and let $H:\mathbb{R}^d\times\mathbb{R}^q\to\mathbb{R}^d$ satisfy (6.2). Suppose that there exists a continuously differentiable function $L:\mathbb{R}^d\to\mathbb{R}_+$ satisfying
$$\nabla L\ \text{is Lipschitz continuous and}\ |\nabla L|^2\le C(1+L) \tag{6.4}$$
for some real constant $C>0$, such that $h$ satisfies the mean-reverting assumption
$$(\nabla L\,|\,h)\ge 0. \tag{6.5}$$
Furthermore, suppose that $H$ satisfies the following sub-linear growth assumption
$$\forall\, y\in\mathbb{R}^d,\quad \big\|H(y,Z)\big\|_2\le C\sqrt{1+L(y)} \tag{6.6}$$
(which implies $|h|\le C\sqrt{1+L}$).
Let $\gamma = (\gamma_n)_{n\ge1}$ be a sequence of positive real numbers satisfying the (so-called) decreasing step assumption
$$\sum_{n\ge1}\gamma_n = +\infty\quad\text{and}\quad \sum_{n\ge1}\gamma_n^2 < +\infty. \tag{6.7}$$
Finally, assume that $Y_0$ is independent of $(Z_n)_{n\ge1}$ and $\mathbb{E}\,L(Y_0) < +\infty$.
Then the stochastic algorithm defined by (6.3) satisfies the following five properties:
(i) $Y_n - Y_{n-1} \to 0$ $\mathbb{P}$-a.s. and in $L^2(\mathbb{P})$ as $n\to+\infty$ (and $\sum_{n\ge1}|Y_n - Y_{n-1}|^2 < +\infty$ a.s.),
(ii) the sequence $(L(Y_n))_{n\ge0}$ is $L^1(\mathbb{P})$-bounded,
(iii) $L(Y_n)\xrightarrow{a.s.} L_\infty\in L^1(\mathbb{P})$ as $n\to+\infty$,
(iv) $\sum_{n\ge1}\gamma_n(\nabla L\,|\,h)(Y_{n-1}) < +\infty$ a.s. as an integrable random variable,
(v) the sequence $M_n^\gamma = \sum_{k=1}^n \gamma_k\big(H(Y_{k-1},Z_k) - h(Y_{k-1})\big)$ is an $L^2$-bounded square integrable martingale, and hence converges in $L^2$ and a.s.
Remarks and terminology. • The sequence $(\gamma_n)_{n\ge1}$ is called a step sequence or a gain parameter sequence.
• If the function $L$ satisfies Assumptions (6.4), (6.5), (6.6) and moreover $\lim_{|y|\to+\infty}L(y) = +\infty$, then $L$ is called a Lyapunov function of the system, as in Ordinary Differential Equation Theory.
• Note that Assumption (6.4) on $L$ implies that $\nabla\sqrt{1+L}$ is bounded. Hence $\sqrt{1+L}$ has at most a linear growth, so that $L$ itself has at most a quadratic growth. This justifies the somewhat unexpected terminology “sub-linear growth” for Assumption (6.6).
• In spite of the standard terminology, the step sequence does not need to be decreasing in Assumption (6.7).
• A careful reading of the proof below shows that the assumption $\sum_{n\ge1}\gamma_n = +\infty$ is not needed. However, we leave it in the statement because it is dramatically useful for any application of this Lemma since it implies, combined with (iv), that
$$\liminf_n\,(\nabla L\,|\,h)(Y_{n-1}) = 0.$$
These assumptions are known as “Robbins–Siegmund assumptions”.
• When $H(y,z) := h(y)$ (i.e. the procedure is noiseless), the above theorem provides a convergence result for the original deterministic procedure (6.1).

The key to the proof is the following quite classical convergence theorem for non-negative super-martingales (see [217]).

Theorem 6.2 Let $(S_n)_{n\ge0}$ be a non-negative super-martingale with respect to a filtration $(\mathcal{F}_n)_{n\ge0}$ on a probability space $(\Omega,\mathcal{A},\mathbb{P})$ (i.e. for every $n\ge0$, $S_n\in L^1(\mathbb{P})$ and $\mathbb{E}(S_{n+1}\,|\,\mathcal{F}_n)\le S_n$ a.s.). Then $S_n$ converges $\mathbb{P}$-a.s. to an integrable (non-negative) random variable $S_\infty$.

For general convergence theorems for sub-, super- and true martingales, we refer to any standard course on Probability Theory or, preferably, to [217].

Proof of Theorem 6.1. Set $\mathcal{F}_n := \sigma(Y_0,Z_1,\ldots,Z_n)$, $n\ge1$, and for notational convenience $\Delta Y_n := Y_n - Y_{n-1}$, $n\ge1$. It follows from the fundamental theorem of Calculus that there exists $\xi_{n+1}\in(Y_n,Y_{n+1})$ (geometric interval) such that
$$\begin{aligned}
L(Y_{n+1}) &= L(Y_n) + \big(\nabla L(\xi_{n+1})\,|\,\Delta Y_{n+1}\big)\\
&\le L(Y_n) + \big(\nabla L(Y_n)\,|\,\Delta Y_{n+1}\big) + [\nabla L]_{\mathrm{Lip}}\,|\Delta Y_{n+1}|^2\\
&= L(Y_n) - \gamma_{n+1}\big(\nabla L(Y_n)\,|\,H(Y_n,Z_{n+1})\big) + [\nabla L]_{\mathrm{Lip}}\,\gamma_{n+1}^2\,|H(Y_n,Z_{n+1})|^2 \qquad (6.8)\\
&= L(Y_n) - \gamma_{n+1}\big(\nabla L(Y_n)\,|\,h(Y_n)\big) - \gamma_{n+1}\big(\nabla L(Y_n)\,|\,\Delta M_{n+1}\big) + [\nabla L]_{\mathrm{Lip}}\,\gamma_{n+1}^2\,|H(Y_n,Z_{n+1})|^2, \qquad (6.9)
\end{aligned}$$
where
$$\Delta M_{n+1} = H(Y_n,Z_{n+1}) - h(Y_n).$$

We aim at showing that $(\Delta M_n)_{n\ge1}$ is a sequence of (square integrable) $(\mathcal{F}_n)$-martingale increments satisfying $\mathbb{E}\big(|\Delta M_{n+1}|^2\,|\,\mathcal{F}_n\big)\le C(1+L(Y_n))$ for an appropriate real constant $C>0$.
First, it is clear by induction that $Y_n$ is $\mathcal{F}_n$-measurable for every $n\ge0$, and so is $\Delta M_n$ owing to its very definition.
As a second step, note that $L(Y_n)\in L^1(\mathbb{P})$ and $H(Y_n,Z_{n+1})\in L^2(\mathbb{P})$ for every index $n\ge0$. This follows again by an induction based on (6.9): using that $|(a|b)|\le\frac12(|a|^2+|b|^2)$, $a,b\in\mathbb{R}^d$, we first get
$$\mathbb{E}\,\big|\big(\nabla L(Y_n)\,|\,H(Y_n,Z_{n+1})\big)\big| \le \frac12\Big(\mathbb{E}\,|\nabla L(Y_n)|^2 + \mathbb{E}\,|H(Y_n,Z_{n+1})|^2\Big).$$
Now, $Y_n$ being $\mathcal{F}_n$-measurable and $Z_{n+1}$ being independent of $\mathcal{F}_n$,
$$\mathbb{E}\big(|H(Y_n,Z_{n+1})|^2\,|\,\mathcal{F}_n\big) = \big[\mathbb{E}\,|H(y,Z_1)|^2\big]_{|y=Y_n}$$
so that
$$\mathbb{E}\,|H(Y_n,Z_{n+1})|^2 = \mathbb{E}\Big(\mathbb{E}\big(|H(Y_n,Z_{n+1})|^2\,|\,\mathcal{F}_n\big)\Big) = \mathbb{E}\Big(\big[\mathbb{E}\,|H(y,Z_1)|^2\big]_{|y=Y_n}\Big) \le C^2\big(1+\mathbb{E}\,L(Y_n)\big)$$
owing to (6.6). Combined with (6.4) and plugged into the above inequality, this yields
$$\mathbb{E}\,\big|\big(\nabla L(Y_n)\,|\,H(Y_n,Z_{n+1})\big)\big| \le C^2\big(1+\mathbb{E}\,L(Y_n)\big).$$
By the same argument, we get
$$\mathbb{E}\big(H(Y_n,Z_{n+1})\,|\,\mathcal{F}_n\big) = \big[\mathbb{E}\,H(y,Z_1)\big]_{|y=Y_n} = h(Y_n).$$
Plugging these bounds into (6.9), we derive that $L(Y_{n+1})\in L^1(\mathbb{P})$. Consequently, $\mathbb{E}(\Delta M_{n+1}\,|\,\mathcal{F}_n) = 0$. The announced inequality for $\mathbb{E}(|\Delta M_{n+1}|^2\,|\,\mathcal{F}_n)$ holds with $C = 2\,C^2$, owing to (6.6) and the inequality
$$|\Delta M_{n+1}|^2 \le 2\big(|H(Y_n,Z_{n+1})|^2 + |h(Y_n)|^2\big).$$
At this stage, since $\nabla L(Y_n)$ and $\Delta M_{n+1}\in L^2(\mathbb{P})$, we derive that
$$\mathbb{E}\big((\nabla L(Y_n)\,|\,\Delta M_{n+1})\,|\,\mathcal{F}_n\big) = \big(\nabla L(Y_n)\,|\,\mathbb{E}(\Delta M_{n+1}\,|\,\mathcal{F}_n)\big) = 0.$$
Conditioning (6.9) with respect to $\mathcal{F}_n$ reads
$$\mathbb{E}\big(L(Y_{n+1})\,|\,\mathcal{F}_n\big) + \gamma_{n+1}(\nabla L\,|\,h)(Y_n) \le L(Y_n) + C_L\,\gamma_{n+1}^2\big(1+L(Y_n)\big) \le L(Y_n)\big(1+C_L\,\gamma_{n+1}^2\big) + C_L\,\gamma_{n+1}^2,$$
where $C_L = C^2\,[\nabla L]_{\mathrm{Lip}} > 0$. Then, adding the non-negative term
$$\sum_{k=1}^n \gamma_k(\nabla L\,|\,h)(Y_{k-1}) + C_L\sum_{k\ge n+2}\gamma_k^2$$
on the left-hand side of the above inequality, adding $(1+C_L\gamma_{n+1}^2)$ times this term on the right-hand side and dividing the resulting inequality by $\prod_{k=1}^{n+1}(1+C_L\gamma_k^2)$ shows that the $\mathcal{F}_n$-adapted sequence
$$S_n = \frac{L(Y_n) + \sum_{k=0}^{n-1}\gamma_{k+1}(\nabla L\,|\,h)(Y_k) + C_L\sum_{k\ge n+1}\gamma_k^2}{\prod_{k=1}^{n}\big(1+C_L\gamma_k^2\big)},\quad n\ge0,$$
is a (non-negative) super-martingale with $S_0 = L(Y_0) + C_L\sum_{k\ge1}\gamma_k^2 \in L^1(\mathbb{P})$. The fact that the added term is non-negative follows from the mean-reverting inequality $(\nabla L\,|\,h)\ge0$. Hence $(S_n)_{n\ge0}$ is $\mathbb{P}$-a.s. convergent toward a non-negative integrable random variable $S_\infty$ by Theorem 6.2. Consequently, using that $\sum_{k\ge n+1}\gamma_k^2\to0$ as $n\to+\infty$, one gets
$$L(Y_n) + \sum_{k=0}^{n-1}\gamma_{k+1}(\nabla L\,|\,h)(Y_k) \xrightarrow{a.s.} \widetilde S_\infty = S_\infty\prod_{n\ge1}\big(1+C_L\gamma_n^2\big) \in L^1(\mathbb{P}). \tag{6.10}$$

(ii) The super-martingale $(S_n)_{n\ge0}$ being $L^1(\mathbb{P})$-bounded by $\mathbb{E}\,S_0 < +\infty$, one derives likewise that $(L(Y_n))_{n\ge0}$ is $L^1$-bounded since
$$L(Y_n) \le \Big(\prod_{k=1}^{n}\big(1+C_L\gamma_k^2\big)\Big)\,S_n,\quad n\ge0,$$
and $\prod_{k\ge1}\big(1+C_L\gamma_k^2\big) < +\infty$ owing to the decreasing step assumption (6.7) made on $(\gamma_n)_{n\ge1}$.
(iv) Now, for the same reason, the series $\sum_{0\le k\le n-1}\gamma_{k+1}(\nabla L\,|\,h)(Y_k)$ (with non-negative terms) satisfies, for every $n\ge1$,
$$\mathbb{E}\Big(\sum_{k=0}^{n-1}\gamma_{k+1}(\nabla L\,|\,h)(Y_k)\Big) \le \prod_{k=1}^{n}\big(1+C_L\gamma_k^2\big)\,\mathbb{E}\,S_0$$
so that, by the Beppo Levi monotone convergence Theorem for series with non-negative terms,
$$\mathbb{E}\Big(\sum_{n\ge0}\gamma_{n+1}(\nabla L\,|\,h)(Y_n)\Big) < +\infty$$
so that, in particular,
$$\sum_{n\ge0}\gamma_{n+1}(\nabla L\,|\,h)(Y_n) < +\infty\quad \mathbb{P}\text{-a.s.}$$
and the series converges in $L^1$ to its a.s. limit.
(iii) It follows from (6.10) that, $\mathbb{P}$-a.s., $L(Y_n)\to L_\infty$ as $n\to+\infty$, which is integrable since $(L(Y_n))_{n\ge0}$ is $L^1$-bounded.
(i) Note that, again by Beppo Levi's monotone convergence Theorem for series with non-negative terms,
$$\mathbb{E}\Big(\sum_{n\ge1}|\Delta Y_n|^2\Big) = \sum_{n\ge1}\mathbb{E}\,|\Delta Y_n|^2 \le \sum_{n\ge1}\gamma_n^2\,\mathbb{E}\,|H(Y_{n-1},Z_n)|^2 \le C\sum_{n\ge1}\gamma_n^2\big(1+\mathbb{E}\,L(Y_{n-1})\big) < +\infty,$$
so that $\mathbb{E}\,|\Delta Y_n|^2\to0$ and $\sum_{n\ge1}|\Delta Y_n|^2 < +\infty$ a.s., which in turn yields $\Delta Y_n = Y_n - Y_{n-1}\to0$ a.s.
(v) We have $M_n^\gamma = \sum_{k=1}^n \gamma_k\,\Delta M_k$, so that $M^\gamma$ is clearly an $(\mathcal{F}_n)$-martingale. Moreover,
$$\langle M^\gamma\rangle_n = \sum_{k=1}^n \gamma_k^2\,\mathbb{E}\big(|\Delta M_k|^2\,|\,\mathcal{F}_{k-1}\big) \le \sum_{k=1}^n \gamma_k^2\,\mathbb{E}\big(|H(Y_{k-1},Z_k)|^2\,|\,\mathcal{F}_{k-1}\big) = \sum_{k=1}^n \gamma_k^2\,\big[\mathbb{E}\,|H(y,Z_1)|^2\big]_{|y=Y_{k-1}},$$
where we used in the last line that $Z_k$ is independent of $\mathcal{F}_{k-1}$. Consequently, owing to (6.6) and (ii), one has
$$\mathbb{E}\,\langle M^\gamma\rangle_\infty < +\infty,$$
which in turn implies by Theorem 12.7 that $M_n^\gamma$ converges a.s. and in $L^2$. $\diamond$

Corollary 6.1 (a) Robbins–Monro algorithm. Assume that the mean function $h$ of the algorithm is continuous and satisfies
$$\forall\, y\in\mathbb{R}^d,\ y\ne y_*,\quad \big(y-y_*\,|\,h(y)\big) > 0. \tag{6.11}$$
Suppose furthermore that $Y_0\in L^2(\mathbb{P})$ and that $H$ satisfies
$$\forall\, y\in\mathbb{R}^d,\quad \big\|H(y,Z)\big\|_2 \le C(1+|y|).$$
Finally, assume that the step sequence $(\gamma_n)_{n\ge1}$ satisfies (6.7). Then
$$\{h=0\} = \{y_*\}\quad\text{and}\quad Y_n \xrightarrow{a.s.} y_*.$$
The convergence also holds in every $L^p(\mathbb{P})$, $p\in(0,2)$ (and $\big(|Y_n-y_*|\big)_{n\ge0}$ is $L^2$-bounded).
(b) Stochastic gradient ($h = \nabla L$). Let $L:\mathbb{R}^d\to\mathbb{R}_+$ be a differentiable function satisfying (6.4), $\lim_{|y|\to+\infty}L(y) = +\infty$ and $\{\nabla L = 0\} = \{y_*\}$. Assume that the mean function of the algorithm is given by $h = \nabla L$, that the function $H$ satisfies $\mathbb{E}\,|H(y,Z)|^2 \le C\big(1+L(y)\big)$ and that $L(Y_0)\in L^1(\mathbb{P})$. Assume that the step sequence $(\gamma_n)_{n\ge1}$ satisfies (6.7). Then
$$L(y_*) = \min_{\mathbb{R}^d} L\quad\text{and}\quad Y_n \xrightarrow{a.s.} y_*\ \text{as } n\to+\infty.$$
Moreover, $\nabla L(Y_n)$ converges to 0 in every $L^p(\mathbb{P})$, $p\in(0,2)$ (and $\big(L(Y_n)\big)_{n\ge0}$ is $L^1$-bounded, so that $\big(\nabla L(Y_n)\big)_{n\ge0}$ is $L^2$-bounded).

Proof. (a) Assumption (6.11) implies that the mean-reverting assumption (6.5) is satisfied by the quadratic Lyapunov function $L(y) = |y-y_*|^2$ (which clearly satisfies (6.4)). The assumption on $H$ is clearly the linear growth Assumption (6.6) for this function $L$. Consequently, it follows from the above Robbins–Siegmund Lemma that
$$|Y_n-y_*|^2 \to L_\infty\in L^1(\mathbb{P})\quad\text{and}\quad \sum_{n\ge1}\gamma_n\big(h(Y_{n-1})\,|\,Y_{n-1}-y_*\big) < +\infty\quad\mathbb{P}\text{-a.s.},$$
and that $\big(|Y_n-y_*|^2\big)_{n\ge0}$ is $L^1(\mathbb{P})$-bounded.
Let $\omega\in\Omega$ be such that $|Y_n(\omega)-y_*|^2$ converges in $\mathbb{R}_+$ and the series $\sum_{n\ge1}\gamma_n\big(Y_{n-1}(\omega)-y_*\,|\,h(Y_{n-1}(\omega))\big)$ converges. Since $\sum_{n\ge1}\gamma_n = +\infty$, it follows that
$$\liminf_n\,\big(Y_{n-1}(\omega)-y_*\,|\,h(Y_{n-1}(\omega))\big) = 0;$$
indeed, if $\liminf_n\big(Y_{n-1}(\omega)-y_*\,|\,h(Y_{n-1}(\omega))\big) > 0$, the convergence of the above series would contradict the fact that $\sum_{n\ge1}\gamma_n = +\infty$. Let $\big(\varphi(n,\omega)\big)_{n\ge1}$ be a subsequence such that
$$\big(Y_{\varphi(n,\omega)}(\omega)-y_*\,|\,h(Y_{\varphi(n,\omega)}(\omega))\big) \to 0\quad\text{as } n\to+\infty.$$
Now, $(Y_n(\omega))_{n\ge0}$ being bounded, one may assume, up to one further extraction, still denoted by $\big(\varphi(n,\omega)\big)_{n\ge1}$, that $Y_{\varphi(n,\omega)}(\omega)\to y_\infty = y_\infty(\omega)$. It follows from the continuity of $h$ that $\big(y_\infty-y_*\,|\,h(y_\infty)\big) = 0$, which in turn implies that $y_\infty = y_*$. Now, since we know that $L(Y_n(\omega)) = |Y_n(\omega)-y_*|^2$ converges,
$$\lim_n\big|Y_n(\omega)-y_*\big|^2 = \lim_n\big|Y_{\varphi(n,\omega)}(\omega)-y_*\big|^2 = 0.$$
Finally, for every $p\in(0,2)$, $\big(|Y_n-y_*|^p\big)_{n\ge0}$ is $L^{\frac{2}{p}}(\mathbb{P})$-bounded, hence uniformly integrable. As a consequence, the a.s. convergence holds in $L^1$, i.e. $Y_n\to y_*$ converges in $L^p(\mathbb{P})$.
It is clear from (6.11) that $\{h=0\}\subset\{y_*\}$. On the other hand, if $y-y_* = \varepsilon u$, $|u| = 1$, $\varepsilon > 0$, one has $\varepsilon\big(u\,|\,h(y_*+\varepsilon u)\big) > 0$. Letting $\varepsilon\to0$ implies that $\big(u\,|\,h(y_*)\big)\ge 0$ for every unitary vector $u$, since $h$ is continuous, which in turn implies (switching from $u$ to $-u$) that $\big(u\,|\,h(y_*)\big) = 0$. Hence $h(y_*) = 0$ (otherwise $u = \frac{h(y_*)}{|h(y_*)|}$ yields a contradiction).
(b) One may apply the Robbins–Siegmund Lemma with $L$ as a Lyapunov function since $(h\,|\,\nabla L) = |\nabla L|^2\ge0$. The assumption on $H$ is just the linear growth assumption (6.6). As a consequence,
$$L(Y_n)\to L_\infty\in L^1(\mathbb{P})\quad\text{and}\quad \sum_{n\ge1}\gamma_n\big|\nabla L(Y_{n-1})\big|^2 < +\infty\quad\mathbb{P}\text{-a.s.},$$
and $\big(L(Y_n)\big)_{n\ge0}$ is $L^1(\mathbb{P})$-bounded. Let $\omega\in\Omega$ be such that $L\big(Y_n(\omega)\big)$ converges in $\mathbb{R}_+$,
$$\sum_{n\ge1}\gamma_n\big|\nabla L(Y_{n-1}(\omega))\big|^2 < +\infty\quad\text{and}\quad Y_n(\omega)-Y_{n-1}(\omega)\to0.$$
The same arguments as above show that
$$\liminf_n\big|\nabla L(Y_n(\omega))\big|^2 = 0.$$
From the convergence of $L\big(Y_n(\omega)\big)$ toward $L_\infty(\omega)$ and $\lim_{|y|\to+\infty}L(y) = +\infty$, one derives that $(Y_n(\omega))_{n\ge0}$ is bounded. Then there exists a subsequence $\big(\varphi(n,\omega)\big)_{n\ge1}$ such that
$$Y_{\varphi(n,\omega)}(\omega)\to y,\quad \nabla L\big(Y_{\varphi(n,\omega)}(\omega)\big)\to0\quad\text{and}\quad L\big(Y_{\varphi(n,\omega)}(\omega)\big)\to L_\infty(\omega)\quad\text{as } n\to+\infty.$$
Then $\nabla L(y) = 0$, which implies $y = y_*$ and $L_\infty(\omega) = L(y_*)$. Since $L$ is non-negative, differentiable and goes to infinity at infinity, it attains its unique global minimum at $y_*$. In particular, $\{L = L(y_*)\} = \{\nabla L = 0\} = \{y_*\}$. Consequently, the only possible limiting value of the bounded sequence $(Y_n(\omega))_{n\ge1}$ is $y_*$ since $L(Y_n(\omega))\to L(y_*)$, i.e. $Y_n(\omega)$ converges toward $y_*$.
The $L^p(\mathbb{P})$-convergence to 0 of $|\nabla L(Y_n)|$, $p\in(0,2)$, follows by the same uniform integrability argument as in (a). $\diamond$
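As a concrete instance of the Robbins–Monro setting (our illustration, not from the book), consider a quantile search: for a level $\alpha\in(0,1)$, $H(y,z) = \mathbf{1}_{\{z\le y\}}-\alpha$ has mean $h(y) = F_Z(y)-\alpha$, which satisfies (6.11) at $y_* = F_Z^{-1}(\alpha)$ whenever $F_Z$ is increasing.

```python
import random

def quantile_search(sample, alpha, y0=0.0, c=2.0):
    """Robbins-Monro recursion Y_n = Y_{n-1} - (c/n) * H(Y_{n-1}, Z_n),
    with H(y, z) = 1_{z <= y} - alpha; converges a.s. to the alpha-quantile."""
    y = y0
    for n, z in enumerate(sample, start=1):
        y -= (c / n) * ((1.0 if z <= y else 0.0) - alpha)
    return y

rng = random.Random(7)
zs = [rng.gauss(0.0, 1.0) for _ in range(400_000)]
median = quantile_search(zs, alpha=0.5, y0=2.0)   # target: N(0,1) median = 0
```

Note that neither $h$ nor its zero is computed explicitly: each iteration only uses one simulated $Z_n$, exactly as in the “meta-theorem” of Sect. 6.1.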

 Exercises. 1. Show that Claim (a) remains true if one only assumes that

y −→ (h(y)| y − y∗ ) is lower semi-continuous.

2. Non-homogeneous $L^2$-strong law of large numbers by stochastic approximation. Let $(Z_n)_{n\ge1}$ be an i.i.d. sequence of square integrable random vectors. Let $(\gamma_n)_{n\ge1}$ be a sequence of positive real numbers satisfying the decreasing step Assumption (6.7).
Show that the recursive procedure defined by

Yn+1 = Yn − γn+1 (Yn − Zn+1 )

a.s. converges toward y∗ = E Z1 .
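As a quick illustration of Exercise 2 (not from the text), the following minimal Python sketch runs the recursion Y_{n+1} = Yn − γ_{n+1}(Yn − Z_{n+1}); the Gaussian sample, seed and size are illustrative choices. With the step γn = 1/n, the iterate coincides exactly with the running empirical mean, which makes the connection to the strong law of large numbers transparent:

```python
import random

random.seed(42)
zs = [random.gauss(0.5, 1.0) for _ in range(200_000)]  # i.i.d., E Z = 0.5

# Y_{n+1} = Y_n - gamma_{n+1} (Y_n - Z_{n+1}); with gamma_n = 1/n this recursion
# reproduces the empirical mean exactly (Y_1 = Z_1, Y_2 = (Z_1 + Z_2)/2, ...).
y = 0.0
for n, z in enumerate(zs, start=1):
    y -= (1.0 / n) * (y - z)

print(round(y, 2))
```

Other step sequences satisfying (6.7) also yield a.s. convergence, only at different rates.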

The above settings are in fact special cases of a more general result, the so-called "pseudo-gradient setting", stated below. However, its proof, in particular in a multi-dimensional setting, needs additional arguments, mainly the so-called ODE method (for Ordinary Differential Equation method), originally introduced by Ljung (see [200]). The underlying idea is to think of a stochastic algorithm as a perturbed Euler scheme with decreasing step of the ODE ẏ = −h(y). For an introduction to the ODE method, see Sect. 6.4.1; we also refer to classical textbooks on Stochastic Approximation like [39, 81, 180].

Theorem 6.3 (Pseudo-Stochastic Gradient) Assume that L, h and the step sequence (γn)_{n≥1} satisfy all the assumptions of the Robbins–Siegmund Lemma. Assume furthermore that

lim_{|y|→+∞} L(y) = +∞ and (∇L|h) is lower semi-continuous.

Then, P(dω)-a.s., there exists an ℓ∞ = ℓ∞(ω) ≥ 0 and a connected component C∞(ω) of {(∇L|h) = 0} ∩ {L = ℓ∞} such that

dist(Yn(ω), C∞(ω)) −→ 0 as n → +∞.

In particular, if for every ℓ ≥ 0 the set {(∇L|h) = 0} ∩ {L = ℓ} is locally finite (¹), then, P(dω)-a.s., there exists an ℓ∞(ω) such that Yn converges toward a point of the set {(∇L|h) = 0} ∩ {L = ℓ∞(ω)}.

Proof (One-dimensional case). We consider an ω ∈ Ω for which all the "a.s. conclusions" of the Robbins–Siegmund Lemma hold true. Combining Yn(ω) − Y_{n−1}(ω) → 0 with the boundedness of the sequence (Yn(ω))_{n≥0}, one can show that the set Y∞(ω) of the limiting values of (Yn(ω))_{n≥0} is a connected compact set (²).
On the other hand, Y∞(ω) ⊂ {L = L∞(ω)} since L(Yn(ω)) → L∞(ω). Furthermore, reasoning as in the proof of Claim (b) of the above corollary shows that there exists a limiting value y∗ ∈ Y∞(ω) such that (∇L(y∗) | h(y∗)) = 0, so that y∗ ∈ {(∇L|h) = 0} ∩ {L = L∞(ω)}.

¹ By locally finite, we mean "finite on every compact set".
² The method of proof is to first establish that Y∞(ω) is "bien enchaîné" as a set. A subset A ⊂ R^d is "bien enchaîné" if for every a, a′ ∈ A and every ε > 0, there exist p ∈ N∗ and b0, b1, …, bp ∈ A such that b0 = a, bp = a′ and |bi − b_{i−1}| ≤ ε, i = 1, …, p. Any connected set A is "bien enchaîné" and the converse is true if A is compact. What we need here is precisely this converse (see e.g. [13] for details).
188 6 Stochastic Approximation with Applications to Finance

At this stage, we assume that d = 1. Either Y∞(ω) = {y∗} and the proof is complete, or Y∞(ω) is a non-trivial compact interval, being a compact connected subset of R. The function L is constant on this interval, consequently its derivative L′ is zero on Y∞(ω), so that Y∞(ω) ⊂ {(∇L|h) = 0} ∩ {L = L(y∗)}. Hence the conclusion.
When {(∇L|h) = 0} ∩ {L = ℓ} is locally finite, the conclusion is obvious since its connected components are reduced to single points. ♦

6.3 Applications to Finance

6.3.1 Application to Recursive Variance Reduction by Importance Sampling

This section was originally motivated by the seminal paper [12]. Eventually, we followed the strategy developed in [199] which provides, in our mind, an easier to implement procedure. Assume we want to compute the expectation

E ϕ(Z) = ∫_{R^d} ϕ(z) e^{−|z|²/2} dz/(2π)^{d/2},    (6.12)

where ϕ : R^d → R is integrable with respect to the normalized Gaussian measure. In order to deal with a consistent problem, we assume throughout this section that

P(ϕ(Z) ≠ 0) > 0.

ℵ Examples. (a) A typical example is provided by option pricing in a d-dimensional Black–Scholes model where, with the usual notations,

ϕ(z) = e^{−rT} φ((x0^i e^{(r − σi²/2)T + σi √T (Az)^i})_{1≤i≤d}),  x0 = (x0^1, …, x0^d) ∈ (0, +∞)^d,

with A a lower triangular matrix such that the covariance matrix R = AA∗ has diagonal entries equal to 1 and φ a non-negative, continuous if necessary, payoff function. The dimension d corresponds to the number of underlying risky assets.
(b) Monte Carlo simulation of functionals of the Euler scheme of a diffusion (or
Milstein scheme) appear as integrals with respect to a multivariate Gaussian vector.
Then the dimension d can be huge since it corresponds to the product of the number
of time steps by the number of independent Brownian motions driving the dynamics
of the SDE.
Variance reduction by mean translation: first approach (see [12]).
A change of variable z = ζ + θ, for a fixed θ ∈ R^d, leads to

E ϕ(Z) = e^{−|θ|²/2} E[ϕ(Z + θ) e^{−(θ|Z)}].    (6.13)
6.3 Applications to Finance 189

Such a change of variable yields what is known as the Cameron–Martin formula, which can be seen here either as a somewhat elementary version of the Girsanov change of probability, or as the first step of an importance sampling procedure.
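To make the identity (6.13) concrete, here is a minimal numerical check in dimension d = 1 (not from the text); the payoff ϕ(z) = (e^z − 1)+, the shift θ = 1 and the sample size are illustrative choices, and the closed-form value of E ϕ(Z) used for comparison is elementary:

```python
import math, random

random.seed(123)
phi = lambda z: max(math.exp(z) - 1.0, 0.0)   # illustrative call-type payoff
# closed form: E phi(Z) = e^{1/2} Phi(1) - 1/2, with Phi the standard normal cdf
Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
exact = math.exp(0.5) * Phi(1.0) - 0.5

theta, M = 1.0, 400_000
crude = shifted = 0.0
for _ in range(M):
    z = random.gauss(0.0, 1.0)
    crude += phi(z)                                                  # plain estimator
    shifted += phi(z + theta) * math.exp(-theta * z - 0.5 * theta**2)  # (6.13)
print(round(crude / M, 3), round(shifted / M, 3))
```

Both estimators target the same expectation; the shifted one already shows a visibly smaller statistical dispersion for this payoff, which is the effect the rest of the section optimizes over θ.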
One natural way to optimize the computation by Monte Carlo simulation of E ϕ(Z) is to choose, among the above representations depending on the parameter θ ∈ R^d, the one with the lowest variance. This means solving, at least roughly, the following minimization problem

min_{θ∈R^d} L(θ) := e^{−|θ|²} E[ϕ²(Z + θ) e^{−2(Z|θ)}]    (6.14)

since Var(e^{−|θ|²/2} ϕ(Z + θ) e^{−(θ|Z)}) = L(θ) − (E ϕ(Z))².
A reverse change of variable shows that

L(θ) = e^{|θ|²/2} E[ϕ²(Z) e^{−(Z|θ)}].    (6.15)

Hence, if E[ϕ²(Z)|Z| e^{a|Z|}] < +∞ for every a ∈ (0, +∞), one can always differentiate the function L on R^d owing to Theorem 2.2(b), with

∇L(θ) = e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)}(θ − Z)], θ ∈ R^d.    (6.16)

Rewriting Eq. (6.15) as

L(θ) = E[ϕ²(Z) e^{−|Z|²/2} e^{|Z−θ|²/2}]    (6.17)

clearly shows that L is strictly convex, since θ ↦ e^{|θ−z|²/2} is strictly convex for every z ∈ R^d and P(ϕ(Z) ≠ 0) > 0. Furthermore, Fatou's Lemma implies lim_{|θ|→+∞} L(θ) = +∞.
Consequently, L has a unique global minimum θ∗, which is also local, whence satisfies {∇L = 0} = {θ∗}.
We now prove the classical lemma which shows that if L is strictly convex then θ ↦ |θ − θ∗|² is mean-reverting for ∇L (strictly, in the strengthened sense (6.11) of the Robbins–Monro framework).

Lemma 6.1 (a) Let L : R^d → R+ be a differentiable convex function. Then

∀ θ, θ′ ∈ R^d, (∇L(θ) − ∇L(θ′) | θ − θ′) ≥ 0.

If, furthermore, L is strictly convex, the above inequality is strict whenever θ ≠ θ′.
(b) If L is twice differentiable and D²L ≥ α I_d for some real constant α > 0 (in the sense that u∗D²L(θ)u ≥ α|u|² for every θ, u ∈ R^d), then lim_{|θ|→+∞} L(θ) = +∞ and, for every θ, θ′ ∈ R^d,

(i) (∇L(θ) − ∇L(θ′) | θ − θ′) ≥ α|θ − θ′|²,
(ii) L(θ′) ≥ L(θ) + (∇L(θ) | θ′ − θ) + ½ α|θ′ − θ|².

Proof. (a) One introduces the differentiable function defined on the unit interval by

g(t) = L(θ + t(θ′ − θ)) − L(θ), t ∈ [0, 1].

The function g is convex and differentiable. Hence its derivative

g′(t) = (∇L(θ + t(θ′ − θ)) | θ′ − θ)

is non-decreasing, so that g′(1) ≥ g′(0), which yields the announced inequality. If L is strictly convex, then g′(1) > g′(0) (otherwise g′ would be constant on [0, 1], which would imply that L is affine on the segment [θ, θ′]).
(b) The function g is twice differentiable under this assumption and

g″(t) = (θ′ − θ)∗ D²L(θ + t(θ′ − θ))(θ′ − θ) ≥ α|θ′ − θ|².

The conclusion (i) follows by noting that g′(1) − g′(0) ≥ inf_{s∈[0,1]} g″(s). Moreover, noting that g(1) ≥ g(0) + g′(0) + ½ inf_{s∈[0,1]} g″(s) yields the inequality (ii). Finally, setting θ = 0 in (ii) yields

L(θ′) ≥ L(0) + ½ α|θ′|² − |∇L(0)| |θ′| → +∞ as |θ′| → ∞. ♦

This suggests (as noted in [12]) to consider the quadratic function V defined by V(θ) := |θ − θ∗|² as a Lyapunov function instead of L defined in (6.15). Indeed, L is usually not essentially quadratic: as soon as ϕ(z) ≥ ε0 > 0, it is obvious that L(θ) ≥ ε0² e^{|θ|²}, and this exponential growth is also observed when ϕ is bounded away from zero outside a ball. Hence ∇L cannot be Lipschitz continuous either and, consequently, L cannot be used as a Lyapunov function.
However, if one uses the representation of ∇L as an expectation derived from (6.17) by pathwise differentiation in order to design a stochastic gradient algorithm, namely considers the local gradient H(θ, z) := ϕ²(z) e^{−|z|²/2} ∂_θ(e^{|z−θ|²/2}), a major difficulty remains: the convergence results in Corollary 6.1 do not apply, mainly because the linear growth assumption (6.6) in quadratic mean is not fulfilled by such a choice for H. In fact, this "naive" procedure explodes at almost every implementation, as pointed out in [12]. This led the author to introduce in [12] some variants of the algorithm involving repeated re-initializations – the so-called projections "à la Chen" – to force the stabilization of the algorithm and subsequently prevent explosion. The choice we make in the next section is different (and still other approaches are possible to circumvent this problem, see e.g. [161]).
6.3 Applications to Finance 191

An "unconstrained" approach based on a third change of variable (see [199])

The starting point is to find a new local gradient to represent ∇L in order to apply the above standard convergence results. We know, and already used above, that the Gaussian density is smooth, by contrast with the payoff ϕ (at least in quantitative finance of derivative products): to differentiate L, we already switched the parameter θ from ϕ to the Gaussian density to take advantage of its smoothness by a change of variable. At this stage, we face the converse problem: we usually know what the behavior of ϕ at infinity is, whereas we cannot efficiently control the behavior of e^{−(θ|Z)} inside the expectation as θ goes to infinity. So it is natural to try to cancel this exponential term by plugging θ back into the payoff ϕ.
The first step is to make a new change of variable. Starting from (6.16), one gets

∇L(θ) = e^{|θ|²/2} ∫_{R^d} ϕ²(z)(θ − z) e^{−(θ|z) − |z|²/2} dz/(2π)^{d/2}
       = e^{|θ|²} ∫_{R^d} ϕ²(ζ − θ)(2θ − ζ) e^{−|ζ|²/2} dζ/(2π)^{d/2}    (z := ζ − θ)
       = e^{|θ|²} E[ϕ²(Z − θ)(2θ − Z)].

Consequently, canceling the positive factor e^{|θ|²}, we get

∇L(θ) = 0 ⟺ E[ϕ²(Z − θ)(2θ − Z)] = 0.

This suggests to work with the function inside the expectation (though not exactly a local gradient), up to an explicit appropriate multiplicative factor depending on θ, to satisfy the linear growth assumption for the L²-norm in θ.
From now on, we assume that there exist two real constants a ≥ 0 and C > 0 such that

0 ≤ ϕ(z) ≤ C e^{(a/2)|z|}, z ∈ R^d.    (6.18)
 
ℵ Exercise. (a) Show that under this assumption, E[ϕ²(Z)|Z| e^{|θ||Z|}] < +∞ for every θ ∈ R^d, which implies that (6.16) holds true.
(b) Show that in fact E[ϕ²(Z)|Z|^m e^{|θ||Z|}] < +∞ for every θ ∈ R^d and every m ≥ 1, which in turn implies that L is C^∞. In particular, show that for every θ ∈ R^d,

D²L(θ) = e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)}(I_d + (θ − Z)(θ − Z)^t)]

(throughout this chapter, we adopt the notation ^t for transposition). Derive that

D²L(θ) ≥ e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)}] I_d > 0

(in the sense of positive definite symmetric matrices), which proves again that L is strictly convex.
Taking Assumption (6.18) into account, we set

Ha(θ, z) = e^{−a(|θ|²+1)^{1/2}} ϕ²(z − θ)(2θ − z).    (6.19)

One checks that

E|Ha(θ, Z)|² ≤ 2 C⁴ e^{−2a(|θ|²+1)^{1/2}} E[e^{2a|Z| + 2a|θ|}(4|θ|² + |Z|²)]
            ≤ 2 C⁴ (4|θ|² E[e^{2a|Z|}] + E[e^{2a|Z|}|Z|²])
            ≤ C′(1 + |θ|²)

since |Z| has a Laplace transform defined on the whole real line, which in turn implies E[e^{a|Z|}|Z|^r] < +∞ for every a, r > 0.
On the other hand, it follows that the resulting mean function ha reads

ha(θ) = e^{−a(|θ|²+1)^{1/2}} E[ϕ²(Z − θ)(2θ − Z)]

or, equivalently,

ha(θ) = e^{−a(|θ|²+1)^{1/2} − |θ|²} ∇L(θ),    (6.20)

so that ha is continuous, (θ − θ∗ | ha(θ)) > 0 for every θ ≠ θ∗ and {ha = 0} = {θ∗}.
Applying Corollary 6.1(a) (the Robbins–Monro Theorem), one derives that for any step sequence γ = (γn)_{n≥1} satisfying (6.7), the sequence (θn)_{n≥0} recursively defined by

θ_{n+1} = θn − γ_{n+1} Ha(θn, Z_{n+1}), n ≥ 0,    (6.21)

where (Zn)_{n≥1} is an i.i.d. sequence with distribution N(0; I_d), defined on a probability space (Ω, A, P) and independent of the R^d-valued random vector θ0 ∈ L²_{R^d}(Ω, A, P), satisfies

θn → θ∗ a.s. as n → +∞.
Remarks. • The reason for introducing (|θ|² + 1)^{1/2} is that this function is explicit, behaves like |θ| at infinity and is also everywhere differentiable, which simplifies the discussion about the rate of convergence detailed further on.
• Note that no regularity assumption is made on the payoff ϕ.
• An alternative approach based on a large deviation principle but which needs some
regularity assumption on the payoff ϕ is developed in [116]. See also [243].
• To prevent a possible "freezing" of the procedure, for example when the step sequence has been misspecified or when the payoff function is too anisotropic, one can replace the above procedure (6.21) by the following fully data-driven variant of the algorithm:

∀ n ≥ 0, θ̃_{n+1} = θ̃n − γ_{n+1} H̃a(θ̃n, Z_{n+1}), θ̃0 = θ0,    (6.22)

where

H̃a(θ, z) := ϕ²(z − θ)(2θ − z) / (1 + ϕ²(−θ)).

This procedure also converges a.s. under a sub-multiplicativity assumption on the payoff function ϕ (see [199]).
• A final – and often crucial – trick to boost the convergence when dealing with rare events, as in importance sampling, is to "drive" a parameter from a "regular" value to the value that makes the event rare. Typically, when trying to reduce the variance of a deep-out-of-the-money Call option as in the numerical illustrations below, a strategy can be to implement the above algorithm with a slowly varying strike Kn moving from K0 = x0 to the "target" strike K (see below) during the first half of the iterations.
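As a sanity-check illustration of recursion (6.21) (a sketch, not the book's implementation), consider dimension d = 1 with the bounded payoff ϕ(z) = 1/(1 + z²), so that (6.18) holds with a = 0 and C = 1 and (6.19) reduces to H0(θ, z) = ϕ²(z − θ)(2θ − z). This payoff is chosen purely because the target is then known: ϕ² is even, so L in (6.15) is even and strictly convex, hence θ∗ = 0. The step tuning and the starting point θ0 = 2 are illustrative:

```python
import random

random.seed(5)
phi = lambda z: 1.0 / (1.0 + z * z)   # bounded payoff: (6.18) holds with a = 0, C = 1

# With a = 0, (6.19) reads H_0(theta, z) = phi(z - theta)^2 * (2*theta - z);
# since phi is even, L is even and strictly convex, so theta_* = 0.
theta = 2.0
for n in range(1, 200_001):
    z = random.gauss(0.0, 1.0)
    theta -= 2.0 / (20.0 + n) * phi(z - theta) ** 2 * (2.0 * theta - z)
print(round(theta, 2))
```

The bounded payoff keeps every increment at most γ_{n+1}|2θn − Z_{n+1}| in magnitude, which is precisely the linear growth control that the factor e^{−a(|θ|²+1)^{1/2}} provides in the general case.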

ℵ Practitioner’s corner (On the weak rate of convergence)


This paragraph anticipates Sects. 6.4.3 and 6.4.4. It can be skipped on a first reading. Assume that the step sequence has the following parametric form: γn = α/(β + n), n ≥ 1, and that Dha(θ∗) is positive in the following sense: all the eigenvalues of Dha(θ∗) have a positive real part. Then, the rate of convergence of θn toward θ∗ is ruled by a CLT (at rate √n) if and only if (see Sect. 6.4.3)

α > 1/(2 ℜe(λ_{a,min})) > 0,

where λ_{a,min} is the eigenvalue of Dha(θ∗) with the lowest real part. Moreover, one shows that the theoretical best choice for α is α_opt := 1/ℜe(λ_{a,min}). The asymptotic variance is made explicit, once again, in Sect. 6.4.3. Let us now focus on Dha(θ∗).
Starting from the expression (6.20),

ha(θ) = e^{−a(|θ|²+1)^{1/2} − |θ|²} ∇L(θ)
      = e^{−a(|θ|²+1)^{1/2} − |θ|²/2} × E[ϕ²(Z)(θ − Z) e^{−(θ|Z)}]    by (6.16)
      = ga(θ) × E[ϕ²(Z)(θ − Z) e^{−(θ|Z)}].

Then

Dha(θ) = ga(θ) E[ϕ²(Z) e^{−(θ|Z)}(I_d + ZZ^t − θZ^t)] + e^{−|θ|²/2} ∇L(θ) ⊗ ∇ga(θ)

(where u ⊗ v = [u^i v^j]_{1≤i,j≤d}). Using that ∇L(θ∗) = ha(θ∗) = 0, so that ha(θ∗)θ∗^t = 0, we get

Dha(θ∗) = ga(θ∗) E[ϕ²(Z) e^{−(θ∗|Z)}(I_d + (Z − θ∗)(Z − θ∗)^t)].
194 6 Stochastic Approximation with Applications to Finance

Hence Dha(θ∗) is a positive definite symmetric matrix and its lowest eigenvalue λ_{a,min} satisfies

λ_{a,min} ≥ ga(θ∗) E[ϕ²(Z) e^{−(θ∗|Z)}] > 0.

These computations show that if the behavior of the payoff ϕ at infinity is mis-evaluated, this leads to a bad calibration of the algorithm. Indeed, if one considers two real numbers a, a′ satisfying (6.18) with 0 < a < a′, then one checks with obvious notations that

1/(2λ_{a,min}) = (g_{a′}(θ∗)/g_a(θ∗)) · 1/(2λ_{a′,min}) = e^{(a−a′)(|θ∗|²+1)^{1/2}} · 1/(2λ_{a′,min}) < 1/(2λ_{a′,min}).

So the condition on α is more stringent with a′ than with a. Of course, in practice, the user does not know these values (since she does not know the target θ∗); however, she will be led to consider higher values of α than requested, which will lead to a deterioration of the asymptotic variance (see again Sect. 6.4.3).
These weak rate results seem to be of little help in practice: since θ∗ is unknown, the quantities evaluated at θ∗ are unknown as well. One way to circumvent this difficulty is to implement Ruppert and Polyak's averaging principle, described and analyzed in Sect. 6.4.4. First, the procedure should be implemented with a slowly decreasing step of the form

γn = α/(β + n^b), 1/2 < b < 1, α > 0, β ≥ 0,

and, as a second step, an averaging phase is added, namely set

θ̄n = (θ0 + ··· + θ_{n−1})/n, n ≥ 1.

Then (θ̄n)_{n≥1} satisfies a Central Limit Theorem at rate √n with an asymptotic variance corresponding to the optimal asymptotic variance obtained for the original algorithm (θn)_{n≥0} with the theoretical optimal step sequence γn = α_opt/(β + n), n ≥ 1.
The choice of the parameter β, either for the original algorithm or for its averaged version, does not depend on theoretical motivations. A heuristic rule is to choose it so that γn does not decrease too fast, to avoid being "frozen" far from the target.
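The averaging principle can be sketched on the toy mean-search recursion Y_{n+1} = Yn − γ_{n+1}(Yn − Z_{n+1}) from the exercise of Sect. 6.2 (an illustrative example, with b = 3/4 and the Gaussian sample chosen for the demonstration):

```python
import random

random.seed(9)
m, n_iter = 1.0, 100_000              # target: y_* = E Z = m
y, avg = 0.0, 0.0
for n in range(1, n_iter + 1):
    z = random.gauss(m, 1.0)
    y -= (1.0 / n ** 0.75) * (y - z)  # slowly decreasing step: b = 3/4 in (1/2, 1)
    avg += (y - avg) / n              # running Cesaro average of the iterates
print(round(y, 3), round(avg, 3))
```

The raw iterate fluctuates at the scale √γn while the Cesaro average smooths these fluctuations out, which is the mechanism behind the √n-CLT with optimal asymptotic variance.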
Adaptive implementation into the computation of E ϕ(Z)
At this stage, as for the variance reduction by regression, we may follow two strategies – batch or adaptive – to reduce the variance.
 The batch strategy. This is the simplest and most elementary strategy.
Phase 1: One first computes a hopefully good approximation of the optimal variance reducer, denoted by θ_{n0} for a large enough n0, which will remain fixed during the second phase devoted to the computation of E ϕ(Z). It is assumed that this phase relies on an n0-sample (Zm)_{1≤m≤n0} of i.i.d. N(0; I_d)-distributed random vectors.
6.3 Applications to Finance 195

Phase 2: As a second step, one implements a Monte Carlo simulation based on ϕ̃(z) := ϕ(z + θ_{n0}) e^{−(θ_{n0}|z) − |θ_{n0}|²/2}, i.e.

E ϕ(Z) = lim_M (1/M) Σ_{m=1}^M ϕ(Z_{m+n0} + θ_{n0}) e^{−(θ_{n0}|Z_{m+n0}) − |θ_{n0}|²/2},

where (Zm)_{m≥n0+1} is an i.i.d. sequence of N(0; I_d)-distributed random vectors independent of (Zm)_{1≤m≤n0}. This procedure satisfies a CLT with (conditional) variance L(θ_{n0}) − (E ϕ(Z))² (given θ_{n0}).

 The adaptive strategy. This approach, introduced in [12], is similar to the adaptive variance reduction by regression presented in Sect. 3.2. The aim is to devise a procedure fully based on the simultaneous computation of the optimal variance reducer and of E ϕ(Z) from the same sequence (Zn)_{n≥1} of i.i.d. N(0; I_d)-distributed random vectors used in (6.21). To be precise, this leads us to devise the following adaptive estimator of E ϕ(Z):

(1/M) Σ_{m=1}^M ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2}, M ≥ 1,    (6.23)

where the sequence (θm)_{m≥0} is obtained by iterating (6.21).


We will briefly prove that the above estimator is unbiased and convergent.
Let Fm := σ(θ0, Z1, …, Zm), m ≥ 0, be the filtration of the (whole) simulation process. Using that θ_{m−1} is F_{m−1}-measurable and Zm is independent of F_{m−1} with the same distribution as Z, one classically derives that

E[ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2} | F_{m−1}] = E[ϕ(Z + θ) e^{−(θ|Z) − |θ|²/2}]_{|θ=θ_{m−1}}.

As a first consequence, the estimator defined by (6.23) is unbiased. Now let us define the (FM)-martingale

NM := Σ_{m=1}^M (1/m)(ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2} − E ϕ(Z)) 1_{|θ_{m−1}|≤m}, M ≥ 1.

It is clear that (NM)_{M≥1} has square integrable increments, so that NM ∈ L²(P) for every M ∈ N∗, and

E[(ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2})² | F_{m−1}] 1_{|θ_{m−1}|≤m} = L(θ_{m−1}) 1_{|θ_{m−1}|≤m} → L(θ∗) a.s.
196 6 Stochastic Approximation with Applications to Finance

as m → +∞, which in turn implies that

⟨N⟩∞ ≤ (sup_{m≥0} L(θm)) Σ_{m≥1} 1/m² < +∞ a.s.

Consequently (see Theorem 12.7 in the Miscellany chapter), NM → N∞ P-a.s., where N∞ is an a.s. finite random variable. Finally, Kronecker's Lemma (see Lemma 12.1) implies

(1/M) Σ_{m=1}^M (ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2} − E ϕ(Z)) 1_{|θ_{m−1}|≤m} → 0 a.s.

as M → +∞. Since θm → θ∗ a.s. as m → +∞, 1_{|θ_{m−1}|≤m} = 1 for large enough m, so that

(1/M) Σ_{m=1}^M ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2} → E ϕ(Z) a.s. as M → +∞.

One can show, using the CLT for triangular arrays of martingale increments (see [142] and Chap. 12, Theorem 12.8), that

√M ((1/M) Σ_{m=1}^M ϕ(Zm + θ_{m−1}) e^{−(θ_{m−1}|Zm) − |θ_{m−1}|²/2} − E ϕ(Z)) → N(0; σ∗²) in distribution,

where σ∗² = L(θ∗) − (E ϕ(Z))² is the minimal variance.
As designed, this second approach seems to perform better owing to its minimal asymptotic variance. For practical use, the verdict is more balanced and the batch approach turns out to be quite satisfactory.
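A minimal sketch of the adaptive estimator (6.23) in dimension d = 1 follows (illustrative choices throughout: the bounded even payoff ϕ(z) = 1/(1 + z²), for which E ϕ(Z) has the closed form √(eπ/2) erfc(1/√2) and θ∗ = 0, the starting point θ0 = 1 and the step tuning). Note that each term of the estimator uses the value θ_{m−1} held *before* the update driven by the same Zm, which is what makes the estimator unbiased:

```python
import math, random

random.seed(11)
phi = lambda z: 1.0 / (1.0 + z * z)     # illustrative even payoff: theta_* = 0
exact = math.sqrt(math.e * math.pi / 2.0) * math.erfc(1.0 / math.sqrt(2.0))  # E phi(Z)

theta, acc, M = 1.0, 0.0, 200_000
for m in range(1, M + 1):
    z = random.gauss(0.0, 1.0)
    # term of (6.23): weight computed with theta_{m-1}
    acc += phi(z + theta) * math.exp(-theta * z - 0.5 * theta * theta)
    # update theta via (6.21) with a = 0 (phi is bounded)
    theta -= 2.0 / (20.0 + m) * phi(z - theta) ** 2 * (2.0 * theta - z)
estimate = acc / M
print(round(estimate, 3))
```

No truncation 1_{|θ_{m−1}|≤m} is needed in this toy case since the increments of θ are already well controlled; the truncation in the text is what makes the martingale argument work in general.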

Numerical illustrations. (a) At-the-money Black–Scholes setting. We consider a vanilla Call in a B-S model,

XT = x0 e^{(r − σ²/2)T + σ√T Z}, Z ~ N(0; 1),

with the following parameters: T = 1, r = 0.10, σ = 0.5, x0 = 100, K = 100. The Black–Scholes reference price of the vanilla Call is 23.93.
The recursive optimization of θ was achieved by running the data-driven version (6.22) with a sample (Zn)n of size 10 000. A first renormalization was made prior to the computation: we considered the equivalent problem (as far as variance reduction is concerned) where the starting value of the asset is 1 and the strike is the moneyness K/x0. The procedure was initialized at θ0 = 1. (Using (3.4) would have led us to set θ0 = −0.2.)

Fig. 6.1 B-S vanilla Call option. T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100. Left: convergence toward θ∗ (up to n = 10 000). Right: Monte Carlo simulation of size M = 10⁶; dotted line: θ = 0, solid line: θ = θ_{10 000} ≈ θ∗

We did not try to optimize the choice of the step γn following the theoretical results on the weak rate of convergence, nor did we perform an averaging principle. We just applied the heuristic rule that if the function H (here Ha) takes its (usual) values within a few units, then choosing γn = c · 20/(20 + n) with c ≈ 1 (say c ∈ [1/2, 2]) leads to satisfactory performances of the algorithm.
The resulting value θ_{10 000} was used in a standard Monte Carlo simulation of size M = 10⁶ based on (6.13) and compared to a crude Monte Carlo simulation with θ = 0. The numerical results are as follows:
– θ = 0: 95% confidence interval = [23.92, 24.11] (pointwise estimate: 24.02).
– θ = θ_{10 000} ≈ 1.51: 95% confidence interval = [23.919, 23.967] (pointwise estimate: 23.94).
The gain ratio in terms of standard deviations is 42.69/11.01 = 3.88 ≈ 4. This is observed on most simulations we made; however, the convergence of θn may be more chaotic than displayed in Fig. 6.1 (left), where the convergence is almost instantaneous. The behavior of the two Monte Carlo simulations is depicted in Fig. 6.1 (right).
The alternative original "parametrized" version of the algorithm (Ha(θ, z)) with a = 2σ√T yields quite similar results (when implemented with the same step and the same starting value).
Further comments: As developed in [226], all of the preceding can be extended to non-Gaussian random vectors Z provided their distribution has a log-concave probability density p satisfying, for some positive ρ,

log(p) + ρ| · |² is convex.

One can also replace the mean translation by other importance sampling procedures, like those based on the Esscher transform. This has applications, e.g., when Z = XT is the value at time T of a process belonging to the family of subordinated Lévy processes (Lévy processes of the form Zt = W_{Yt}, where Y is a subordinator – an increasing Lévy process – independent of the standard Brownian motion W). For more insight on such processes, we refer to [40, 261].

6.3.2 Application to Implicit Correlation Search

We consider a 2-dimensional B-S toy model as defined by (2.2), i.e. X_t^0 = e^{rt} (riskless asset) and

X_t^i = x_0^i e^{(r − σi²/2)t + σi W_t^i}, x_0^i > 0, i = 1, 2,

for the two risky assets, where ⟨W¹, W²⟩_t = ρt, ρ ∈ [−1, 1], denotes the correlation between W¹ and W², that is, the correlation between the yields of the risky assets X¹ and X².
In this market, we consider a best-of call option defined by its payoff

(max(X_T^1, X_T^2) − K)+.

A market of such best-of calls is a market of the correlation ρ, since the respective volatilities are obtained from the markets of vanilla options on each asset as implicit volatilities. In this 2-dimensional B-S setting, there is a closed formula for the premium involving the bi-variate standard normal distribution (see [159]), but what follows can be applied as soon as the asset dynamics – or their time discretization – can be simulated at a reasonable computational cost.
We will use a stochastic recursive procedure to solve the inverse problem in ρ:

P_BoC(x_0^1, x_0^2, K, σ1, σ2, r, ρ, T) = P_market [mark-to-market premium],    (6.24)

where

P_BoC(x_0^1, x_0^2, K, σ1, σ2, r, ρ, T) := e^{−rT} E[(max(X_T^1, X_T^2) − K)+]
  = e^{−rT} E[(max(x_0^1 e^{μ1 T + σ1√T Z¹}, x_0^2 e^{μ2 T + σ2√T (ρ Z¹ + √(1−ρ²) Z²)}) − K)+],

where μi = r − σi²/2, i = 1, 2, and Z = (Z¹, Z²) ~ N(0; I2).
It is intuitive and easy to check (at least empirically by simulation) that the function ρ ↦ P_BoC(x_0^1, x_0^2, K, σ1, σ2, r, ρ, T) is continuous and (strictly) decreasing on [−1, 1]. We assume that the market price is at least consistent, i.e. that P_market ∈ [P_BoC(1), P_BoC(−1)], so that Equation (6.24) in ρ has exactly one solution, say ρ∗. This example is a toy model not only because of its basic B-S dynamics, but also because, in such a model, more efficient deterministic procedures can be called upon, based on the closed form for the option premium. Our aim is to propose and illustrate below a general methodology for correlation search.
The most convenient way to prevent edge effects due to the fact that ρ ∈ [−1, 1] is to use a trigonometric parametrization of the correlation by setting

ρ = cos θ, θ ∈ R.

At this stage, note that

√(1 − ρ²) Z² =^{(d)} |sin θ| Z² =^{(d)} (sin θ) Z²

since Z² =^{(d)} −Z². Consequently, as soon as ρ = cos θ,

ρ Z¹ + √(1 − ρ²) Z² =^{(d)} (cos θ) Z¹ + (sin θ) Z²

owing to the independence of Z¹ and Z².


In general, this introduces an over-parametrization, even inside [0, 2π], since Arccos(ρ∗) ∈ [0, π] and 2π − Arccos(ρ∗) ∈ [π, 2π] are both solutions to our zero search problem. But this is not at all a significant problem for practical implementation: a more careful examination would show that one of these two equilibrium points is "repulsive" and the other is "attractive" for the procedure; see Sects. 6.4.1 and 6.4.5 for a brief discussion (this terminology refers to the status of an equilibrium for the ODE associated to a stochastic algorithm and to the presence, or not, of noise). A noisy repulsive equilibrium cannot, a.s., be the limit of a stochastic algorithm.
From now on, for convenience, we will just mention the dependence of the premium function on the variable θ, namely

θ ↦ P_BoC(θ) := e^{−rT} E[(max(x_0^1 e^{μ1 T + σ1√T Z¹}, x_0^2 e^{μ2 T + σ2√T ((cos θ) Z¹ + (sin θ) Z²)}) − K)+].

The function P_BoC is a 2π-periodic continuous function. Extracting the implicit correlation from the market amounts to solving (with obvious notations) the equation

P_BoC(θ) = P_market (ρ = cos θ),

where P_market is the quoted premium of the option (mark-to-market price). We need to slightly strengthen the consistency assumption on the market price, which is in fact necessary with almost any zero search procedure: we assume that P_market lies in the open interval

P_market ∈ (min_θ P_BoC(θ), max_θ P_BoC(θ)),

i.e. that P_market is not an extremal value of P_BoC. So we are looking for a zero of the function h defined on R by
200 6 Stochastic Approximation with Applications to Finance

h(θ) = P_BoC(θ) − P_market.

This function admits a representation as an expectation, given by

h(θ) = E H(θ, Z),

where H : R × R² → R is defined for every θ ∈ R and every z = (z¹, z²) ∈ R² by

H(θ, z) = e^{−rT} (max(x_0^1 e^{μ1 T + σ1√T z¹}, x_0^2 e^{μ2 T + σ2√T (z¹ cos θ + z² sin θ)}) − K)+ − P_market

and Z = (Z¹, Z²) ~ N(0; I2).
Proposition 6.1 Assume that the above assumptions on P_market and on the function P_BoC hold. If, moreover, the equation P_BoC(θ) = P_market has finitely many solutions on [0, 2π], then the stochastic zero search recursive procedure defined by

θ_{n+1} = θn − γ_{n+1} H(θn, Z_{n+1}), θ0 ∈ R,

where (Zn)_{n≥1} is an i.i.d. N(0; I2)-distributed sequence and (γn)_{n≥1} is a step sequence satisfying the decreasing step assumption (6.7), a.s. converges toward a solution θ∗ of P_BoC(θ) = P_market.
Proof. For every z ∈ R², θ ↦ H(θ, z) is continuous, 2π-periodic and dominated by a function g(z) such that g(Z) ∈ L²(P) (g is obtained by replacing z¹ cos θ + z² sin θ by |z¹| + |z²| in the above formula for H). One deduces that both the mean function h and θ ↦ E H²(θ, Z) are continuous and 2π-periodic, hence bounded.
The main difficulty in applying the Robbins–Siegmund Lemma is to find an appropriate Lyapunov function.
As the quoted value P_market is not an extremum of the function P_BoC, ∫_0^{2π} h±(θ) dθ > 0, where h± := max(±h, 0). The two functions h± are 2π-periodic, so that

∫_t^{t+2π} h±(θ) dθ = ∫_0^{2π} h±(θ) dθ > 0 for every t ∈ R.

We consider any (fixed) solution θ0 to the equation h(θ) = 0 and two real numbers β± such that

0 < β+ < (∫_0^{2π} h+(θ) dθ)/(∫_0^{2π} h−(θ) dθ) < β−

and we set, for every θ ∈ R,

ψ(θ) := h+(θ) − β+ h−(θ) 1_{θ≥θ0} − β− h−(θ) 1_{θ≤θ0}.

The function ψ is clearly continuous, 2π-periodic "on the right" on [θ0, +∞) and "on the left" on (−∞, θ0]. In particular, it is a bounded function. Furthermore, owing to the definition of β±,

∫_{θ0−2π}^{θ0} ψ(θ) dθ < 0 and ∫_{θ0}^{θ0+2π} ψ(θ) dθ > 0,

so that

lim_{θ→±∞} ∫_{θ0}^{θ} ψ(u) du = +∞.

As a consequence, there exists a real constant C > 0 such that the function

L(θ) = ∫_0^{θ} ψ(u) du + C

is non-negative. Its derivative is given by L′ = ψ, so that

L′ h = ψ (h+ − h−) ≥ (h+)² + β+ (h−)² ≥ 0 and {L′ h = 0} = {h = 0}.

It remains to prove that L′ = ψ is Lipschitz continuous. Calling upon the usual arguments to interchange expectation and differentiation (see Theorem 2.2(b)), one shows that the function P_BoC is differentiable at every θ ∈ R \ 2πZ with

P′_BoC(θ) = e^{−rT} σ2 √T E[1_{X_T^2(θ) > max(X_T^1, K)} X_T^2(θ) ((cos θ) Z² − (sin θ) Z¹)]

(with an obvious definition for X_T^2(θ)). Furthermore,

sup_{θ∈R\2πZ} |P′_BoC(θ)| ≤ σ2 √T E[x_0^2 e^{μ2 T + σ2√T (|Z¹|+|Z²|)} (|Z²| + |Z¹|)] < +∞,

so that P_BoC is clearly Lipschitz continuous on the interval [0, 2π], hence on the whole real line by periodicity. Consequently, h and h± are Lipschitz continuous, which in turn implies that ψ is Lipschitz continuous as well.
Moreover, we know that the equation PBoC (θ) = Pmarket has exactly two solutions
on every interval of length 2π. Hence the set {h = 0} is countable and locally finite,
i.e. has a finite trace on any bounded interval.
One may apply Theorem 6.3 (for which we provide a self-contained proof in
one dimension) to deduce that θn will converge toward a solution θ∗ of the equation
PBoC (θ) = Pmarket . ♦
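The whole procedure of Proposition 6.1 can be sketched as follows (an illustration, not the book's implementation: spot prices are normalized to x_0^1 = x_0^2 = 1 with K = 1, the "market" price is generated synthetically at a known ρ∗ = −0.5, and the step tuning is an untested guess rather than an optimized choice):

```python
import math, random

r, sigma, T, K = 0.10, 0.30, 1.0, 1.0     # normalized: x0^1 = x0^2 = 1
mu = r - 0.5 * sigma * sigma
disc = math.exp(-r * T)

def payoff(z1, z2, c, s):
    # discounted best-of-call payoff with correlation rho = c (= cos theta), s = sin theta
    x1 = math.exp(mu * T + sigma * math.sqrt(T) * z1)
    x2 = math.exp(mu * T + sigma * math.sqrt(T) * (c * z1 + s * z2))
    return disc * max(max(x1, x2) - K, 0.0)

# Phase 1: a synthetic "market" price generated at the (here known) rho_* = -0.5.
random.seed(1)
rho_star = -0.5
s_star = math.sqrt(1.0 - rho_star ** 2)
M = 400_000
p_market = sum(payoff(random.gauss(0, 1), random.gauss(0, 1), rho_star, s_star)
               for _ in range(M)) / M

# Phase 2: stochastic zero search of P_BoC(theta) = p_market.
random.seed(2)
theta = 0.0
for n in range(1, 200_001):
    gamma = 30.0 / (300.0 + n)            # illustrative tuning, not optimized
    H = payoff(random.gauss(0, 1), random.gauss(0, 1),
               math.cos(theta), math.sin(theta)) - p_market
    theta -= gamma * H
print(round(math.cos(theta), 2))
```

Note that the recursion may converge to either ±Arccos(ρ∗) (mod 2π); both yield the same implied correlation cos(θn), which is the quantity of interest.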

Exercise. Show that PBoC is continuously differentiable on the whole real line. [Hint:
extend the derivative on R by continuity.]
Extend the preceding to any payoff ϕ(XT1 , XT2 ) where ϕ : R2+ → R+ is a Lipschitz
continuous function. In particular, show without the help of differentiation that the
corresponding function θ → P(θ) is Lipschitz continuous.

Fig. 6.2 B-S Best-of-Call option. T = 1, r = 0.10, σ1 = σ2 = 0.30, X0¹ = X0² = 100, K = 100. Left: convergence of θn toward a θ∗ (up to n = 100 000). Right: convergence of ρn := cos(θn) toward −0.5

Table 6.1 B-S Best-of-Call option. T = 1, r = 0.10, σ1 = σ2 = 0.30, X0¹ = X0² = 100, K = 100. Convergence of ρn := cos(θn) toward −0.5

n        ρn := cos(θn)
1000     −0.5606
10000    −0.5429
25000    −0.5197
50000    −0.5305
75000    −0.4929
100000   −0.4952

Numerical experiment. We set the model parameters to the following values

x01 = x02 = 100, r = 0.10, σ1 = σ2 = 0.30, ρ = −0.50

and the payoff parameters


T = 1, K = 100.

The reference “Black–Scholes” price 30.75 is used as a mark-to-market price, so that the target of the stochastic algorithm is θ∗ = ±Arccos(−0.5) mod 2π. The stochastic approximation procedure parameters are

θ0 = 0,  n = 10⁵.

The choice of θ0 is “blind” on purpose. Finally, we set γn = 0.5/n. No re-scaling of the procedure has been made in the example below (see Table 6.1 and Fig. 6.2).
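The experiment above can be reproduced with a short simulation. The sketch below is a minimal implementation of the recursion θn+1 = θn − γn+1 (e^{−rT}(max(X_T¹, X_T²(θn)) − K)+ − Pmarket) under the parametrization ρ = cos(θ); the function names and the seed are our own choices, not the book's.

```python
# Minimal sketch of the implied-correlation search above, with rho = cos(theta);
# parameter values follow the numerical experiment (gamma_n = 0.5/n, theta_0 = 0).
import math
import random

def discounted_best_of_call(theta, z1, z2,
                            x0=100.0, r=0.10, sigma=0.30, T=1.0, K=100.0):
    """Discounted Best-of-Call payoff from two independent N(0,1) draws."""
    sqT = math.sqrt(T)
    x1 = x0 * math.exp((r - 0.5 * sigma**2) * T + sigma * sqT * z1)
    # Second Brownian driver with correlation cos(theta) to the first one.
    z2c = math.cos(theta) * z1 + math.sin(theta) * z2
    x2 = x0 * math.exp((r - 0.5 * sigma**2) * T + sigma * sqT * z2c)
    return math.exp(-r * T) * max(max(x1, x2) - K, 0.0)

def implied_corr_theta(p_market=30.75, n_iter=100_000, theta0=0.0, c=0.5, seed=1):
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        h = discounted_best_of_call(theta, rng.gauss(0, 1), rng.gauss(0, 1)) - p_market
        theta -= (c / n) * h          # Robbins-Monro update with gamma_n = c/n
    return theta

theta_star = implied_corr_theta()
print(math.cos(theta_star))           # should approach the implied correlation -0.5
```

With these parameters the iterates should behave like Table 6.1: cos(θn) drifts toward −0.5, with residual fluctuations of a few hundredths.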
 Exercise (Yet another toy-example: extracting B-S-implied volatility by stochastic approximation). Devise a similar procedure to compute the implied volatility in a standard Black–Scholes model (starting at x > 0 at t = 0, with interest rate r and maturity T).
(a) Show that the B-S premium C_BS(σ) is even, increasing on [0, +∞) and continuous as a function of the volatility σ. Show that lim_{σ→0} C_BS(σ) = (x − e^{−rT} K)+ and lim_{σ→+∞} C_BS(σ) = x.
(b) Deduce from (a) that for any mark-to-market price Pmarket ∈ [(x − e^{−rT} K)+, x], there is a unique (positive) B-S implied volatility for this price.
(c) Consider, for every σ ∈ R,

H(σ, z) = χ(σ) ( x e^{−(σ²/2)T + σ√T z} − K e^{−rT} )_+ ,

where χ(σ) = (1 + |σ|) e^{−(σ²/2)T}. Carefully justify this choice of H and implement the algorithm with x = K = 100, r = 0.1 and a market price equal to 16.73. Choose the step parameter of the form γn = (c/x)(1/n), n ≥ 1, with c ∈ [0.5, 2] (this is simply a suggestion).

Warning. The above exercise is definitely a toy exercise! More efficient methods
for extracting standard implied volatility are available (see e.g. [209], which is based
on a Newton–Raphson zero search algorithm; a dichotomy approach is also very
efficient).
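For comparison, the dichotomy approach mentioned above takes a few lines, since σ → C_BS(σ) is increasing on (0, +∞); the bracketing interval below is an illustrative assumption.

```python
# Hedged sketch of the bisection ("dichotomy") implied-volatility search;
# the Black-Scholes call premium is the standard closed formula.
import math

def bs_call(x, K, r, T, sigma):
    """Black-Scholes call premium."""
    d1 = (math.log(x / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return x * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

def implied_vol(p_market, x=100.0, K=100.0, r=0.1, T=1.0, tol=1e-8):
    lo, hi = 1e-6, 5.0          # assumed bracketing interval for sigma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(x, K, r, T, mid) < p_market:
            lo = mid            # price increases in sigma, so move the bracket up
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(implied_vol(16.73))       # close to 0.30 for these parameters
```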

 Exercise (Extension to more general asset dynamics). We now consider a pair of


risky assets following two correlated local volatility models,

dX_t^i = X_t^i ( r dt + σ_i(X_t) dW_t^i ),  X_0^i = x_i > 0,  i = 1, 2,

where the functions σi : R²₊ → R₊ are bounded Lipschitz continuous functions and the Brownian motions W¹ and W² are correlated with correlation ρ ∈ [−1, 1], so that ⟨W¹, W²⟩_t = ρ t. (This ensures the existence and uniqueness of strong solutions for this SDE, see Chap. 7.)
Assume that we know how to simulate (X_T¹, X_T²), either exactly, or at least as an approximation by an Euler scheme, from a d-dimensional normal vector Z = (Z¹, . . . , Z^d) ∼ N(0; I_d).
Show that the above approach can be extended mutatis mutandis.
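A minimal sketch of such an Euler scheme for the correlated pair is given below; the constant volatility functions are illustrative placeholders for bounded Lipschitz σi (with constant σi, the scheme can be checked against closed-form Black–Scholes moments).

```python
import numpy as np

def euler_pair(n_paths=20_000, n_steps=50, x0=(100.0, 100.0),
               r=0.10, rho=-0.5, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x1 = np.full(n_paths, x0[0])
    x2 = np.full(n_paths, x0[1])
    # Placeholder local volatilities (bounded, Lipschitz); constants here so
    # that the scheme can be checked against Black-Scholes moments.
    sigma1 = lambda y1, y2: 0.3 * np.ones_like(y1)
    sigma2 = lambda y1, y2: 0.3 * np.ones_like(y1)
    for _ in range(n_steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rng.standard_normal(n_paths)
        dw1 = np.sqrt(dt) * z1
        dw2 = np.sqrt(dt) * (rho * z1 + np.sqrt(1 - rho**2) * z2)  # <W1,W2>_t = rho*t
        s1, s2 = sigma1(x1, x2), sigma2(x1, x2)
        x1 = x1 * (1.0 + r * dt + s1 * dw1)
        x2 = x2 * (1.0 + r * dt + s2 * dw2)
    return x1, x2

x1_T, x2_T = euler_pair()
print(x1_T.mean())    # should be close to x0 * exp(rT), i.e. about 110.5
```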

6.3.3 The Paradigm of Model Calibration by Simulation

Let Θ ⊂ R^d be an open convex set of R^d. Let

Y : (Θ × Ω, Bor(Θ) ⊗ A) −→ (R^p, Bor(R^p))
(θ, ω) −→ Yθ(ω) = (Yθ¹(ω), . . . , Yθ^p(ω))

be a random vector representative of p payoffs, “re-centered” by their mark-to-market price (see examples below). In particular, for every i ∈ {1, . . . , p}, E Yθ^i is representative of the error between the “theoretical price” obtained with parameter θ and the quoted price. To make the problem consistent we assume throughout this section that

∀ θ ∈ Θ,  Yθ ∈ L¹_{R^p}(Ω, A, P).

Let S ∈ S₊(p, R) ∩ GL(p, R) be a (positive definite) matrix. The resulting inner product is defined by

∀ u, v ∈ R^p,  ⟨u | v⟩_S := u* S v

and the associated Euclidean norm | · |_S by |u|_S := √⟨u | u⟩_S.
A natural choice for the matrix S is a simple diagonal matrix S = Diag(w1, . . . , wp) with “weights” wi > 0, i = 1, . . . , p.
The paradigm of model calibration is to find the parameter θ∗ that minimizes the “aggregated error” with respect to the | · |_S-norm. This leads to the following minimization problem

(C) ≡ argmin_{θ∈Θ} |E Yθ|_S = argmin_{θ∈Θ} (1/2) |E Yθ|²_S.
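As a minimal numerical illustration of this criterion, with S = Diag(w1, . . . , wp) and a hypothetical Monte Carlo sample of Yθ (both the sample values and the weights below are made up for the example):

```python
import numpy as np

# Hypothetical Monte Carlo draws of Y_theta (p = 3 re-centered payoffs, 4 scenarios).
samples = np.array([[1.0, -0.5, 0.2],
                    [0.6, -0.1, 0.0],
                    [0.8, -0.3, 0.1],
                    [1.2, -0.7, 0.3]])
w = np.array([1.0, 2.0, 0.5])          # diagonal weights w_i > 0

y_bar = samples.mean(axis=0)           # Monte Carlo estimate of E[Y_theta]
L_hat = 0.5 * np.dot(w, y_bar**2)      # (1/2) |E Y_theta|_S^2 with S = Diag(w)
print(L_hat)
```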
Here are two simple examples to illustrate this somewhat abstract definition.
 Examples. 1. Black–Scholes model. Let, for any x, σ > 0, r ∈ R,

X_t^{x,σ} = x e^{(r − σ²/2)t + σW_t},  t ≥ 0,

where W is a standard Brownian motion. Then let (Ki, Ti)_{i=1,...,p} be p “maturity-strike price” pairs. Set

θ := σ,  Θ := (0, +∞)

and

Yθ := ( e^{−rT_i}(X_{T_i}^{x,σ} − Ki)+ − Pmarket(Ti, Ki) )_{i=1,...,p},

where Pmarket(Ti, Ki) is the mark-to-market price of the option with maturity Ti > 0 and strike price Ki.
2. Merton model (mini-krach). Now, for every x, σ, λ > 0, a ∈ (0, 1), set

X_t^{x,σ,λ,a} = x e^{(r − σ²/2 + λa)t + σW_t} (1 − a)^{N_t},  t ≥ 0,

where W is as above and N = (N_t)_{t≥0} is a standard Poisson process with jump intensity λ. Set

θ = (σ, λ, a),  Θ = (0, +∞)² × (0, 1)

and

Yθ = ( e^{−rT_i}(X_{T_i}^{x,σ,λ,a} − Ki)+ − Pmarket(Ti, Ki) )_{i=1,...,p}.

We will also have to make simulability assumptions on Yθ and, if necessary, on its


derivatives with respect to θ (see below). Otherwise our simulation-based approach
would be meaningless.
At this stage, there are essentially two approaches that can be considered in order
to solve this problem by simulation:
• A Robbins–Siegmund zero search approach applied to ∇L, which requires access to a representation of the gradient of L – assumed to exist – as an expectation.
• A more direct treatment based on the so-called Kiefer–Wolfowitz procedure, which
is a variant of the Robbins–Siegmund approach based on a finite difference method
(with decreasing step) which does not require the existence of a representation of
∇L as an expectation.
The Robbins–Siegmund approach
We make the following assumptions: for every θ0 ∈ Θ,

(Cal_RZ) ≡ (i) P(dω)-a.s., θ → Yθ(ω) is differentiable at θ0 with Jacobian ∂θ0 Yθ(ω);
(ii) there exists Uθ0, a neighborhood of θ0 in Θ, such that ( (Yθ − Yθ0)/|θ − θ0| )_{θ ∈ Uθ0 \ {θ0}} is uniformly integrable.

One checks – using the exercise “Extension to uniform integrability” which follows
Theorem 2.2 – that θ −→ E Yθ is differentiable and that its Jacobian is given by

∂θ E Yθ = E ∂θ Yθ .

Then, the function L is differentiable everywhere on Θ and its gradient (with respect to the canonical Euclidean norm) is given by

∀ θ ∈ Θ,  ∇L(θ) = (∂θ E Yθ)^t S E Yθ = E[∂θ Yθ]^t S E[Yθ].

At this stage we need a representation of ∇L(θ) as an expectation. To this end, we construct, for every θ ∈ Θ, an independent copy Ỹθ of Yθ defined as follows: we consider the product probability space (Ω², A⊗², P⊗²) and set, for every (ω, ω̃) ∈ Ω², Yθ(ω, ω̃) = Yθ(ω) (the extension of Yθ to Ω² still being denoted by Yθ) and Ỹθ(ω, ω̃) = Yθ(ω̃). It is straightforward by the product measure theorem that the two families (Yθ)θ∈Θ and (Ỹθ)θ∈Θ are independent with the same distribution. From now on we will make the usual abuse of notation consisting in assuming that these two independent copies live on the probability space (Ω, A, P).
Now, one can write
∀ θ ∈ Θ,  ∇L(θ) = E[(∂θ Yθ)^t] S E[Yθ]
                 = E[(∂θ Yθ)^t S Ỹθ].

The standard situation, as announced above, is that Yθ is a vector of payoffs written


on d traded risky assets, re-centered by their respective quoted prices. The model
dynamics of the d risky assets depends on the parameter θ, say
 
Yθ = ( Fi(θ, X_{T_i}(θ)) )_{i=1,...,p},

where the price dynamics (Xt (θ))t≥0 of the d traded assets is driven by a parametrized
diffusion process

dXt (θ) = b(θ, t, Xt (θ)) dt + σ(θ, t, Xt (θ))dWt , X0 (θ) = x0 (θ) ∈ Rd ,

where W is an R^q-valued standard Brownian motion defined on a probability space (Ω, A, P), b is an R^d-valued parametrized function defined on Θ × [0, T] × R^d and σ is an M(d, q)-valued parametrized function defined on the same product space, both satisfying appropriate regularity assumptions.
The pathwise differentiability of Yθ in θ requires that of Xt(θ) with respect to θ. This question is closely related to the θ-tangent process ( ∂Xt(θ)/∂θ )_{t≥0} of X(θ). A precise statement is provided in Sect. 10.2.2, which ensures that if b and σ are smooth enough with respect to the variable θ, then such a θ-tangent process does exist and is a solution to a linear SDE (involving X(θ) in its coefficients).
Some differentiability properties are also required on the functions Fi in order to
fulfill the above differentiability Assumption (Cal_RZ)(i). As for model calibrations on vanilla derivative products performed in Finance, Fi is never everywhere differentiable – typically Fi(y) := e^{−rT_i}(y − Ki)+ − Pmarket(Ti, Ki) – but, if Xt(θ) has an absolutely continuous distribution (i.e. a probability density) for every time t > 0 and every θ ∈ Θ, then Fi only needs to be differentiable outside a Lebesgue negligible subset of R₊. Finally, we can write formally

H(θ, W(ω)) := (∂θ Yθ(ω))^t S Ỹθ(ω),

where W stands for an abstract random innovation taking values in an appropriate space. We denote the innovation by the capital letter W because, when the underlying dynamics is a Brownian diffusion or its Euler–Maruyama scheme, it refers to a finite-dimensional functional of (two independent copies of) the R^q-valued standard Brownian motion on an interval [0, T]: either (two independent copies of) (W_{T_1}, . . . , W_{T_p}), or (two independent copies of) the sequence (W_{kT/n} − W_{(k−1)T/n})_{1≤k≤n} of Brownian increments with step T/n over the interval [0, T]. Thus, these increments naturally appear in the simulation of the Euler scheme ( X̄^n_{kT/n}(θ) )_{0≤k≤n} of the process ( Xt(θ) )_{t∈[0,T]} when the latter cannot be simulated directly (see Chap. 7, entirely devoted to the Euler scheme of Brownian diffusions). Of course, other situations may occur, especially when dealing with jump diffusions, where W usually becomes the increment process of the driving Lévy process.
Nevertheless, we make the following reasonable meta-assumptions that
– the process W is simulable,
– the functional H (θ, w) can be easily computed for any input (θ, w).
Then, one may define recursively the following zero search algorithm for ∇L(θ) = E H(θ, W) by setting

θn+1 = θn − γn+1 H(θn, W^{n+1}),

where (W^n)n≥1 is an i.i.d. sequence of copies of W and (γn)n≥1 is a sequence of steps satisfying the usual decreasing step assumptions

Σ_n γn = +∞  and  Σ_n γn² < +∞.

In such a general framework, of course, one cannot ensure that the functions L and
H will satisfy the basic assumptions needed to make stochastic gradient algorithms
converge, typically
  
∇L is Lipschitz continuous and ∀ θ ∈ Θ, ‖H(θ, · )‖₂ ≤ C √(1 + L(θ)),

or one of their numerous variants (see e.g. [39] for a large overview of possible
assumptions). However, in many situations, one can make the problem fit into a con-
verging setting by an appropriate change of variable on θ or by modifying the function
L and introducing an appropriate explicit (strictly) positive “weight function” χ(θ)
that makes the product χ(θ)H (θ, W (ω)) fit with these requirements.
Despite this, the topological structure of the set {∇L = 0} can be nontrivial, in
particular disconnected. Nonetheless, as seen in Theorem 6.3, one can show, under
natural assumptions, that

θn converges to a connected component of {χ|∇L|2 = 0} = {∇L = 0}.

The next step is that if ∇L has several zeros, they cannot all be local minima of L,
especially when there are more than two of them (this is a consequence of the well-
known Mountain-Pass Lemma, see [164]). Some are local maxima or saddle points of
various kinds. These equilibrium points which are not local minima are called traps.
An important fact is that, under some non-degeneracy assumptions on H at such a par-
asitic equilibrium point θ∞ (typically E H (θ∞ , W )H (θ∞ , W )t is positive definite at
least in the direction of an unstable manifold of h at θ∞ ), the algorithm will a.s. never
converge toward such a “trap”. This question has been extensively investigated in the
literature in various settings for many years (see [33, 51, 95, 191, 242]).
A final problem may arise due to the incompatibility between the geometry of the parameter set Θ and the above recursive algorithm: to be really defined by the above recursion, we need Θ to be left stable by (almost) all the mappings θ → θ − γH(θ, w), at least for γ small enough. If this is not the case, we need to introduce some constraints on the algorithm by projecting it onto Θ whenever θn skips outside Θ. This question was originally investigated in [54] when Θ is a convex set.
Once all these technical questions have been circumvented, we may state the fol-
lowing meta-theorem, which says that θn a.s. converges toward a local minimum of L.
At this stage it is clear that calibration looks like quite a generic problem for
stochastic optimization and that almost all difficulties arising in the field of Stochastic
Approximation can be encountered when implementing such a (pseudo-)stochastic
gradient to solve it.
The Kiefer–Wolfowitz approach
Practical implementations of the Robbins–Siegmund approach point out a specific
technical difficulty: the random functions θ → Yθ (ω) are not always pathwise dif-
ferentiable (nor in the Lr (P)-sense, which could be enough). More important in
some way, even if one shows that θ → E Y (θ) is differentiable, possibly by calling
upon other techniques (log-likelihood method, Malliavin weights, etc) the resulting
representation for ∂θ Y (θ) may turn out to be difficult to simulate, requiring much
programming care, whereas the random vectors Yθ can be simulated in a standard
way. In such a setting, an alternative is provided by the Kiefer–Wolfowitz algorithm
(K–W ) which combines the recursive stochastic approximation principle with a finite
difference approach to differentiation. The idea is simply to approximate the gradient
∇L by

∂L/∂θi (θ) ≈ ( L(θ + η^i e_i) − L(θ − η^i e_i) ) / (2η^i),  1 ≤ i ≤ p,

where (e_i)_{1≤i≤p} denotes the canonical basis of R^p and η = (η^i)_{1≤i≤p}. This finite difference term has an integral representation given by

( L(θ + η^i e_i) − L(θ − η^i e_i) ) / (2η^i) = E[ ( Φ(θ + η^i e_i, W) − Φ(θ − η^i e_i, W) ) / (2η^i) ],

where, with obvious temporary notations,

Φ(θ, W) := ⟨Y(θ, W), Ỹ(θ, W)⟩_S = Y(θ, W)* S Ỹ(θ, W)

(Y(θ, W) and Ỹ(θ, W) are related to the innovation W). Starting from this representation, we may derive a recursive updating formula for θn as follows:

θ^i_{n+1} = θ^i_n − γ_{n+1} ( Φ(θn + η^i_{n+1} e_i, W^{n+1}) − Φ(θn − η^i_{n+1} e_i, W^{n+1}) ) / (2η^i_{n+1}),  1 ≤ i ≤ p.

We reproduce below a typical convergence result for K–W procedures (see [39])
which is the natural counterpart of the stochastic gradient framework.

Theorem 6.4. Assume that the function θ → L(θ) is twice differentiable with a Lipschitz continuous Hessian. We assume that

θ → ‖Φ(θ, W)‖₂ has (at most) linear growth

and that the two step sequences respectively satisfy

Σ_{n≥1} γn = +∞,  Σ_{n≥1} γn² < +∞,  η^i_n → 0  and  Σ_{n≥1} (γn/η^i_n)² < +∞,  1 ≤ i ≤ p.

Then θn a.s. converges to a connected component of {L = ℓ} ∩ {∇L = 0} for some level ℓ ≥ 0.

A special case of this procedure in a linear framework is proposed in Sect. 10.1.2:


the decreasing step finite difference method for greeks computation. The traps prob-
lem for the K–W algorithm (convergence toward a local minimum of L) has been
more specifically investigated in [191].
Users must keep in mind that this procedure needs some care in the tuning of the
step parameters γn and ηn . This may need some preliminary numerical experiments.
Of course, all the recommendations made for the Robbins–Siegmund procedures
remain valid. For more details on the K–W procedure we refer to [39].

6.3.4 Recursive Computation of the VaR and the CVaR (I)

For a more detailed introduction to Value-at-risk and Conditional Value-at-Risk


(CVaR), see e.g. [254]. For a comprehensive overview on risk measures, see [92].
Theoretical background on Value-at-risk and Conditional Value-at-Risk
Let X : (, A, P) → R be a random variable representative of a loss: X ≥ 0 means
a loss equal to X .

Definition 6.1. The Value at Risk at level α ∈ (0, 1) is the (lowest) α-quantile of the
distribution of X , i.e.
 
VaRα (X ) := inf ξ : P(X ≤ ξ) ≥ α . (6.25)

The Value-at-Risk exists since lim_{ξ→+∞} P(X ≤ ξ) = 1, and it satisfies

P( X < VaRα(X) ) ≤ α ≤ P( X ≤ VaRα(X) ).

As soon as the distribution function FX of X is continuous – or, equivalently, the distribution of X has no atom – the Value-at-Risk satisfies

P( X ≤ VaRα(X) ) = α.   (6.26)

Moreover, if FX is also (strictly) increasing, then it is the unique solution of the


above equation (otherwise it is the lowest one). In such a case, we will say that the
Value-at-Risk is unique.
Roughly speaking α represents here an alarm or warning level, typically 0.95 or
0.99 like for confidence intervals. Beyond this level, the loss becomes unacceptable.
Unfortunately this measure of risk is not consistent, for several reasons which are
discussed, e.g. by Föllmer and Schied in [92]. The main point is that it does not favor
the diversification of portfolios to guard against the risk.
When X ∈ L1 (P) with a continuous distribution, a consistent measure of risk is
provided by the Conditional Value-at-Risk (at level α).

Definition 6.2. Let X ∈ L1 (P) with an atomless distribution. The Conditional Value-
at-Risk at level α ∈ (0, 1) is defined by
 
CVaRα (X ) := E X | X ≥ VaRα (X ) . (6.27)

Remark. Note that in the case of non-uniqueness of the α-quantile, the Conditional
Value-at-risk is still well-defined since the above conditional expectation does not
depend upon the choice of this α-quantile solution to (6.26).
 Exercises. 1. Assume that the distribution of X has no atom. Show that

CVaRα(X) = VaRα(X) + (1/(1−α)) ∫_{VaRα(X)}^{+∞} P(X > u) du.

[Hint: use that for a non-negative r.v. Y, E Y = ∫_0^{+∞} P(Y ≥ y) dy.]
2. Show that the conditional value-at-risk CVaRα (X ) is a consistent measure of risk,
i.e. that it satisfies the following three properties
• ∀ λ > 0, CVaRα (λ X ) = λCVaRα (X ).
• ∀ a ∈ R, CVaRα (X + a) = CVaRα (X ) + a.
• Let X , Y ∈ L1 (P), CVaRα (X + Y ) ≤ CVaRα (X ) + CVaRα (Y ).

The following formulation of the VaRα and CVaRα as solutions to an optimization


problem is due to Rockafellar and Uryasev in [254].

Proposition 6.2 (The Rockafellar–Uryasev representation formula). Let X ∈ L¹(P) with a continuous distribution. The function L : R → R defined by

L(ξ) = ξ + (1/(1−α)) E (X − ξ)+

is convex and lim_{|ξ|→+∞} L(ξ) = +∞. Furthermore, L attains a minimum

CVaRα(X) = min_{ξ∈R} L(ξ) ≥ E X

at

VaRα(X) = inf argmin_{ξ∈R} L(ξ).

Proof. The function L is clearly convex and Lipschitz continuous since both functions ξ → ξ and ξ → (x − ξ)+ are convex and 1-Lipschitz continuous for every x ∈ R. As X has no atom, the function L is also differentiable on the whole real line with a derivative given for every ξ ∈ R by

L′(ξ) = 1 − (1/(1−α)) P(X > ξ) = (1/(1−α)) ( P(X ≤ ξ) − α ).

This follows from the interchange of differentiation and expectation allowed by Theorem 2.2(a) since ξ → ξ + (1/(1−α))(X − ξ)+ is differentiable at a given ξ0 on the event {X ≠ ξ0}, i.e. P-a.s. since X is atomless, and, on the other hand, is Lipschitz continuous with a ratio dominated by 1 + 1/(1−α). The second equality is obvious. Then L attains an absolute minimum at any solution ξα of the equation P(X > ξα) = 1 − α, i.e. P(X ≤ ξα) = α. In particular, L attains a minimum at the Value-at-Risk, which is the lowest such solution. Furthermore,

L(ξα) = ξα + E((X − ξα)+) / P(X > ξα)
      = ( ξα E 1_{{X>ξα}} + E((X − ξα) 1_{{X>ξα}}) ) / P(X > ξα)
      = E( X 1_{{X>ξα}} ) / P(X > ξα)
      = E( X | {X > ξα} ).

The function L satisfies

lim_{ξ→+∞} L(ξ)/ξ = lim_{ξ→+∞} ( 1 + (1/(1−α)) E (X/ξ − 1)+ ) = 1

and

lim_{ξ→+∞} L(−ξ)/ξ = lim_{ξ→+∞} ( −1 + (1/(1−α)) E (X/ξ + 1)+ ) = −1 + 1/(1−α) = α/(1−α).

Hence, lim_{ξ→±∞} L(ξ) = +∞.
Finally, by Jensen's Inequality,

L(ξ) ≥ ξ + (1/(1−α)) ( E X − ξ )+.

One checks that the function on the right-hand side of the above inequality attains its minimum at its only break of monotonicity, i.e. when ξ = E X. This completes the proof. ♦
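For X ∼ N(0, 1) everything in Proposition 6.2 is explicit, which gives a quick numerical sanity check (using the standard closed forms E(X − ξ)+ = φ(ξ) − ξ(1 − Φ(ξ)) and CVaRα(X) = φ(VaRα(X))/(1 − α)):

```python
import math

alpha = 0.95
phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def L(xi):
    """Rockafellar-Uryasev objective for X ~ N(0,1): xi + E(X - xi)_+ / (1 - alpha)."""
    call = phi(xi) - xi * (1.0 - Phi(xi))       # E (X - xi)_+ in closed form
    return xi + call / (1.0 - alpha)

var_95 = 1.6449                                  # 0.95-quantile of N(0,1)
cvar_95 = phi(var_95) / (1.0 - alpha)            # = E[X | X > VaR], about 2.063

print(L(var_95), cvar_95)                        # the minimum of L equals the CVaR
```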

Computing VaRα (X ) and CVaRα (X ) by stochastic approximation


 First step: a stochastic gradient to compute the Value-at-Risk. The Rockafellar–Uryasev representation suggests implementing a stochastic gradient descent since the function L has a representation as an expectation

L(ξ) = E[ ξ + (1/(1−α)) (X − ξ)+ ].

Furthermore, if the distribution PX has no atom, we know that the function L is convex and differentiable with derivative L′ = (F − α)/(1−α), where F = FX denotes the c.d.f. of X. It satisfies

∀ ξ, ξ′ ∈ R,  ( L′(ξ) − L′(ξ′) )(ξ − ξ′) = (1/(1−α)) ( F(ξ) − F(ξ′) )(ξ − ξ′) ≥ 0

and, if the Value-at-Risk VaRα(X) is the unique solution to F(ξ) = α,

∀ ξ ∈ R, ξ ≠ VaRα(X),  L′(ξ)( ξ − VaRα(X) ) = (1/(1−α)) ( F(ξ) − α )( ξ − VaRα(X) ) > 0.

Proposition 6.3. Assume that X ∈ L¹(P) has a unique Value-at-Risk VaRα(X). Let (Xn)n≥1 be an i.i.d. sequence of random variables with the same distribution as X, let ξ0 ∈ L¹(P) be independent of (Xn)n≥1, and let (γn)n≥1 be a positive sequence of real numbers satisfying the decreasing step assumption (6.7). Then the stochastic algorithm (ξn)n≥0 defined by

ξn+1 = ξn − γn+1 H(ξn, Xn+1),  n ≥ 0,   (6.28)

where

H(ξ, x) := 1 − (1/(1−α)) 1_{{x ≥ ξ}} = (1/(1−α)) ( 1_{{x < ξ}} − α ),   (6.29)

a.s. converges toward the Value-at-Risk, i.e.

ξn → VaRα(X) a.s.

Furthermore, the sequence (L(ξn))n≥0 is L¹-bounded, so that L(ξn) → CVaRα(X) a.s. and in every Lp(P), p ∈ (0, 1].
Proof. First assume that ξ0 ∈ L²(P). The sequence (ξn)n≥0 defined by (6.28) is the stochastic gradient related to the Lyapunov function ξ → L(ξ) − E X but, owing to the convexity of L and the uniqueness of VaRα(X), it is more convenient to consider the quadratic Lyapunov function ξ → ( ξ − VaRα(X) )² and rely on Corollary 6.1(a) (Robbins–Monro theorem), once we have observed that the function (ξ, x) → H(ξ, x) is bounded by max(1, α/(1−α)), so that ξ → ‖H(ξ, X)‖₂ is bounded as well. The conclusion follows directly from the Robbins–Monro theorem.
In the general case – ξ0 ∈ L¹(P) – one introduces the Lyapunov function L̃(ξ) = (ξ − ξα)²/√(1 + (ξ − ξα)²), where we set ξα = VaRα(X) for convenience. The derivative of L̃ is given by L̃′(ξ) = (ξ − ξα)(2 + (ξ − ξα)²)/(1 + (ξ − ξα)²)^{3/2}. One checks on the one hand that L̃′ is Lipschitz continuous over the real line (e.g. because L̃″ is bounded) and, on the other hand, that {L̃ = 0} = {L̃′ = 0} = {VaRα(X)}. Then Theorem 6.3 (pseudo-Stochastic Gradient) applies and yields the announced conclusion (note that we need its one-dimensional version, which we established in detail). ♦

 Exercises. 1. Show that if X has a bounded density fX, then a direct application of the stochastic gradient convergence result (Corollary 6.1(b)) yields the announced result under the assumption ξ0 ∈ L¹(P). [Hint: show that L′ is Lipschitz continuous.]
2. Under the additional assumption of Exercise 1, show that the mean function h satisfies h′(x) = L″(x) = fX(x)/(1−α). Deduce a way to optimize the step sequence of the algorithm based on the CLT for stochastic algorithms stated further on in Sect. 6.4.3.

The second exercise is inspired by a simpler “α-quantile” approach. It leads to


a more general a.s. convergence result for our algorithm, stated in the proposition
below.

Proposition 6.4. If X ∈ Lp(P) for some p > 0, has an atomless distribution and a unique Value-at-Risk, and if ξ0 ∈ Lp(P), then the algorithm (6.28) a.s. converges toward VaRα(X).

Remark. The uniqueness of the value-at-risk can also be relaxed. The conclusion
becomes that (ξn ) a.s. converges to a random variable taking values in the “VaRα (X )
set”: {ξ ∈ R : P(X ≤ ξ) = α} (see [29] for a statement in that direction).

 Second step: adaptive computation of CVaRα(X). The main aim of this section is to compute on-line the CVaRα(X). The fact that L(ξn) → CVaRα(X) is of no practical help since the function L is not explicit. How can we proceed? The idea is to devise a companion procedure of the above stochastic gradient. Still set temporarily ξα = VaRα(X) for convenience. It follows from Proposition 6.3 and Césaro's averaging principle that ( L(ξ0) + · · · + L(ξn−1) )/n → CVaRα(X) a.s. and in L¹, since L(ξn−1) → CVaRα(X) a.s. and in L¹. In particular,

E[ ( L(ξ0) + · · · + L(ξn−1) )/n ] → CVaRα(X) as n → +∞.

On the other hand, we know that, for every ξ ∈ R,

L(ξ) = E ℓ(ξ, X)  where  ℓ(ξ, x) = ξ + (x − ξ)+/(1−α).

Using that Xk and (ξ0, ξ1, . . . , ξk−1) are independent for every k ≥ 1, one has

L(ξk−1) = E[ ℓ(ξ, X) ]_{|ξ=ξk−1} = E[ ℓ(ξk−1, Xk) | ξk−1 ],  k ≥ 1,

so that

E[ ( ℓ(ξ0, X1) + · · · + ℓ(ξn−1, Xn) )/n ] = E[ ( L(ξ0) + · · · + L(ξn−1) )/n ]
                                          → CVaRα(X) as n → +∞.

This suggests to consider the sequence (Cn)n≥0 defined by

Cn = (1/n) Σ_{k=0}^{n−1} ℓ(ξk, Xk+1),  n ≥ 1,  C0 = 0,

as a candidate to be an estimator of CVaRα(X). This sequence can clearly be recursively defined since, for every n ≥ 0,

Cn+1 = Cn − (1/(n+1)) ( Cn − ℓ(ξn, Xn+1) ).   (6.30)

Proposition 6.5. Assume that X ∈ L^{1+ρ}(P) for some ρ ∈ (0, 1] and that ξn → VaRα(X) a.s. Then

Cn → CVaRα(X) a.s. as n → +∞.

Proof. We will prove this claim in detail in the quadratic case ρ = 1. The proof in the general case relies on the Chow Theorem (see [81] or the second exercise right after the proof). First, one decomposes

Cn − L(ξα) = (1/n) Σ_{k=0}^{n−1} ( L(ξk) − L(ξα) ) + (1/n) Σ_{k=1}^{n} Yk

with Yk := ℓ(ξk−1, Xk) − L(ξk−1), k ≥ 1. It is clear that (1/n) Σ_{k=0}^{n−1} ( L(ξk) − L(ξα) ) → 0 as n → +∞ by Césaro's principle. As for the second term, we first note that

ℓ(ξ, x) − L(ξ) = (1/(1−α)) ( (x − ξ)+ − E (X − ξ)+ )

so that, x → x+ being 1-Lipschitz continuous,

|ℓ(ξ, x) − L(ξ)| ≤ (1/(1−α)) E |X − x| ≤ (1/(1−α)) ( E |X| + |x| ).

Consequently, for every k ≥ 1,

E Yk² ≤ (2/(1−α)²) ( (E |X|)² + E X² ).

We consider the filtration Fn := σ(ξ0, X1, . . . , Xn). One checks that (ξn)n≥1 is (Fn)-adapted and that, for every k ≥ 1,

E( Yk | Fk−1 ) = E( ℓ(ξk−1, Xk) | Fk−1 ) − L(ξk−1) = L(ξk−1) − L(ξk−1) = 0.

Hence, the sequence defined by N0 = 0 and

Nn := Σ_{k=1}^{n} Yk/k,  n ≥ 1,

is a square integrable martingale with a predictable bracket process given by

⟨N⟩n = Σ_{k=1}^{n} E( Yk² | Fk−1 )/k²

so that

E ⟨N⟩∞ ≤ sup_{n≥1} E Yn² × Σ_{k≥1} 1/k² < +∞.

Consequently, Nn → N∞ a.s. and in L² as n → +∞ (see Theorem 12.7(b)). Then the Kronecker Lemma (see Lemma 12.1) implies that

(1/n) Σ_{k=1}^{n} Yk → 0 a.s. as n → +∞,

which finally implies that

Cn → CVaRα(X) a.s. as n → +∞. ♦

Remark. For practical implementation, one may prefer to first estimate the VaRα (X )
and, once it is done, use a regular Monte Carlo procedure to evaluate the CVaRα (X ).
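A minimal sketch of the pair (6.28)–(6.30) on a toy example with closed-form answers (X ∼ N(0, 1), α = 0.95, so VaRα(X) ≈ 1.645 and CVaRα(X) = φ(1.645)/0.05 ≈ 2.063); the step sequence γn = 10/(50 + n) is an illustrative tuning choice.

```python
# Recursive VaR estimation (6.28)-(6.29) with its averaged CVaR companion (6.30).
import random

def var_cvar(alpha=0.95, n_iter=200_000, seed=5):
    rng = random.Random(seed)
    xi, C = 0.0, 0.0
    for n in range(1, n_iter + 1):
        x = rng.gauss(0, 1)                          # simulated loss X_{n+1}
        ell = xi + max(x - xi, 0.0) / (1.0 - alpha)  # ell(xi_n, X_{n+1})
        C -= (1.0 / n) * (C - ell)                   # CVaR companion, cf. (6.30)
        H = 1.0 - (x >= xi) / (1.0 - alpha)          # H(xi, x), cf. (6.29)
        xi -= 10.0 / (50.0 + n) * H                  # VaR recursion, cf. (6.28)
    return xi, C

xi_n, C_n = var_cvar()
print(xi_n, C_n)     # should approach about 1.645 and 2.063
```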
 Exercises. 1. Show that an alternative method to compute CVaRα(X) is to design the following recursive procedure

Cn+1 = Cn − γn+1 ( Cn − ℓ(ξn, Xn+1) ),  n ≥ 0,  C0 = 0,   (6.31)

where (γn)n≥1 is the step sequence implemented in the algorithm (6.28) to compute VaRα(X).

2 (Proof of Proposition 6.5). Show that the conclusion of Proposition 6.5 remains
valid if X ∈ L1+ρ (P). [Hint: rely on the Chow Theorem (3 ).]

ℵ Practitioner’s corner
 Warning!…Toward an operating procedure! As it is presented, the preceding is essentially a toy exercise for the following reason: in practice α ≃ 1, and the convergence of the above algorithm turns out to be slow and chaotic since P(X > VaRα(X)) = 1 − α is close to 0. For a practical implementation on real life portfolios the above algorithm must be combined with an importance sampling transformation to “recenter” the simulation where things do happen. A realistic and efficient procedure is developed and analyzed in [29].
 A second practical improvement to the procedure is to make the level α vary slowly from, say, α0 = 1/2 to the target level α during the first part of the simulation.
 An alternative approach to this recursive algorithm is to invert the empirical measure of the innovations (Xn)n≥1 (see [83]). This method is close to the one described above once rewritten in a recursive way (it corresponds to the step sequence γn = 1/n).

6.3.5 Stochastic Optimization Methods for Optimal


Quantization

Let X : (, A, P) → Rd be a random vector taking at least N ∈ N∗ values. Hence


any optimal N -quantizer has N pairwise distinct components. We want to produce
an optimal quadratic quantization of X at level N , i.e. to produce an N -quantizer
which minimizes the quadratic quantization error as introduced in Chap. 5.
The starting point of numerical methods is that any optimal (or even locally opti-
mal) quantizer x = (x1 , . . . , xN ) satisfies the stationarity equation as briefly recalled
below.
Competitive Learning Vector Quantization
The Competitive Learning Vector Quantization algorithm – or CLVQ – is simply
the stochastic gradient descent derived from the quadratic distortion function (see
below). Unfortunately, this function, viewed as a potential to be minimized, does
not fulfill any of the required assumptions to be a Lyapunov function in the sense
of Theorem 6.1 (the Robbins–Siegmund Lemma) and Corollary 6.1(b) (Stochastic
algorithm). So we cannot rely on these general results to ensure the a.s. convergence

3 Let (Mn)n≥0 be an (Fn, P)-martingale null at 0 and let ρ ∈ (0, 1]; then

Mn → M∞ a.s. on the event { Σ_{n≥1} E( |ΔMn|^{1+ρ} | Fn−1 ) < +∞ },

where ΔMn := Mn − Mn−1.

of such a procedure. First, the potential does not go to infinity when the norm of
the N -quantizer goes to infinity; secondly, the procedure is not well-defined when
some components of the quantizers merge. Very partial results are known about the
asymptotic behavior of this procedure (see e.g. [39, 224]) except in one dimension in
the unimodal setting (log-concave density for the distribution to be quantized) where
much faster deterministic Newton–Raphson like procedures can be implemented if
the cumulative distribution and the first moment functions both have closed forms.
But its use remains limited due to its purely one-dimensional feature.
However, these theoretical gaps (possible asymptotic “merge” of components or
escape of components at infinity) are not observed in practical simulations. This fully
justifies presenting it in detail.
Let us recall that the quadratic distortion function (see Definition 5.1.2) is defined as the squared quadratic mean-quantization error, i.e.

∀ x = (x1, . . . , xN) ∈ (R^d)^N,  Q2,N(x) = ‖X − X̂^x‖²₂ = E q2,N(x, X),

where X̂^x = Projx(X) is a nearest neighbor projection of X onto the grid {x1, . . . , xN} and the local distortion function q2,N(x, ξ) is defined by

∀ x ∈ (R^d)^N, ∀ ξ ∈ R^d,  q2,N(x, ξ) = min_{1≤i≤N} |ξ − xi|² = dist( ξ, {x1, . . . , xN} )².

Proposition 6.6. The distortion function Q2,N is continuously differentiable at every N-tuple x ∈ (R^d)^N satisfying

x has pairwise distinct components and P( ⋃_{1≤i≤N} ∂Ci(x) ) = 0,

with a gradient ∇Q2,N = ( ∂Q2,N/∂xi )_{1≤i≤N} given by

∂Q2,N/∂xi (x) := E[ ∂q2,N/∂xi (x, X) ] = ∫_{R^d} ∂q2,N/∂xi (x, ξ) PX(dξ),

the local gradient being given by

∂q2,N/∂xi (x, ξ) := 2 (xi − ξ) 1_{{Projx(ξ) = xi}},  1 ≤ i ≤ N,

where Projx denotes a (Borel) projection following the nearest neighbor rule onto the grid {x1, . . . , xN}.

Proof. First note that, as the N-tuple x has pairwise distinct components, all the interiors C̊i(x), i = 1, . . . , N, of the Voronoi cells induced by x are non-empty. Let ξ ∈ ⋃_{i=1}^{N} C̊i(x) = R^d \ ⋃_{1≤i≤N} ∂Ci(x) and let i = i(ξ) denote the index such that ξ ∈ C̊i(x). One has min_{j≠i} ( |ξ − xj| − |ξ − xi| ) > 0 and

q2,N(x, ξ) = Σ_{j=1}^{N} |xj − ξ|² 1_{{ξ ∈ C̊j(x)}}.

Now, if x′ ∈ (R^d)^N satisfies max_{1≤i≤N} |x′i − xi| < (1/2) min_{j≠i} ( |ξ − xj| − |ξ − xi| ) ∈ (0, +∞), then 1_{{ξ ∈ C̊j(x)}} = 1_{{ξ ∈ C̊j(x′)}} for every j = 1, . . . , N. Consequently,

∀ i ∈ {1, . . . , N},  ∂q2,N/∂xi (x, ξ) = ∂|xi − ξ|²/∂xi 1_{{ξ ∈ C̊i(x)}} = 2 (xi − ξ) 1_{{ξ ∈ C̊i(x)}}.

Hence, it follows from the assumption P( ⋃_{1≤i≤N} ∂Ci(x) ) = 0 that q2,N( · , ξ) is PX(dξ)-a.s. differentiable with a gradient ∇x q2,N(x, ξ) given by the above formula.
On the other hand, for every x, x′ ∈ (R^d)^N, the function Q2,N is locally Lipschitz continuous since

|Q2,N(x′) − Q2,N(x)|
≤ ∫_{R^d} | min_{1≤i≤N} |x′i − ξ| − min_{1≤i≤N} |xi − ξ| | ( min_{1≤i≤N} |x′i − ξ| + min_{1≤i≤N} |xi − ξ| ) PX(dξ)
≤ max_{1≤i≤N} |x′i − xi| ∫_{R^d} ( min_{1≤i≤N} |x′i| + min_{1≤i≤N} |xi| + 2|ξ| ) PX(dξ)
≤ CX max_{1≤i≤N} |x′i − xi| ( 1 + max_{1≤i≤N} |x′i| + max_{1≤i≤N} |xi| ).

As a consequence of the local interchange (Lebesgue differentiation) Theorem 2.2(a), $Q_{2,N}$ is differentiable at $x$ with the announced gradient. In turn, the continuity of the gradient $\nabla Q_{2,N}$ follows likewise from the Lebesgue continuity theorem (see e.g. [52]). ♦

Remarks. In fact, when $p>1$, the $L^p$-distortion function $Q_{p,N}$ with respect to a Euclidean norm is also differentiable at $N$-tuples having pairwise distinct components, with gradient
$$\nabla Q_{p,N}(x)= p\left(\int_{C_i(x)}|x_i-\xi|^{p-1}\,\frac{x_i-\xi}{|x_i-\xi|}\,\mu(d\xi)\right)_{1\le i\le N}
= p\left(\mathbb{E}\Big[\mathbf{1}_{\{X\in C_i(x)\}}\,|x_i-X|^{p-1}\,\frac{x_i-X}{|x_i-X|}\Big]\right)_{1\le i\le N}.$$
An extension to the case $p\in(0,1]$ exists under appropriate continuity and integrability assumptions on the distribution $\mu$, namely that $\mu(\{a\})=0$ for every $a$ and that the function $a\mapsto \int_{\mathbb{R}^d}|\xi-a|^{p-1}\mu(d\xi)$ remains bounded on compact sets of $\mathbb{R}^d$. A more general differentiation result exists for strictly convex smooth norms (see Lemma 2.5, p. 28 in [129]).
6.3 Applications to Finance 219

Remark. Note that if $x$ is an optimal $N$-quantizer (and $X$ is supported by at least $N$ values), then the condition $\mathbb{P}\big(\bigcup_{1\le i\le N}\partial C_i(x)\big)=0$ is always satisfied, even if the distribution of $X$ is not absolutely continuous and possibly assigns mass to hyperplanes. See Theorem 4.2 in [129] for a proof.
As emphasized in the introduction of this chapter, since the gradient $\nabla Q_{2,N}$ has an integral representation, it is formally possible to minimize $Q_{2,N}$ by a stochastic gradient descent. Unfortunately, it is easy to check that $\liminf_{|x|\to+\infty}Q_{2,N}(x)<+\infty$, though $\lim_{\min_i|x_i|\to+\infty}Q_{2,N}(x)=+\infty$. Consequently, it is hopeless to apply the standard a.s. convergence result for the stochastic gradient procedure from Corollary 6.1(b). But of course, we can still write it down formally and implement it.
▷ Ingredients:
– A sequence $\xi^1,\dots,\xi^n,\dots$ of (simulated) independent copies of $X$,
– A $(0,1)$-valued step sequence $(\gamma_n)_{n\ge1}$. One usually chooses the step in the parametric family $\gamma_n=\frac{c}{b+n}\downarrow 0$ (decreasing steps satisfying (6.7)). Other choices are possible: a slowly decreasing step $\gamma_n=\frac{c}{(b+n)^{\vartheta}}$, $\frac12<\vartheta<1$, in view of Ruppert and Polyak's averaging procedure, or a small constant step $\gamma_n=\gamma_\infty\simeq 0$ (or $\gamma_n\downarrow\gamma_\infty>0$) to better explore the state space.

▷ The Stochastic Gradient Descent procedure.
Let $x^{[n]}=\big(x^{[n]}_1,\dots,x^{[n]}_N\big)$ denote the running $N$-quantizer at iteration $n$ (keep in mind that the level $N$ remains fixed throughout the procedure). The procedure formally reads:
$$x^{[n+1]}=x^{[n]}-\gamma_{n+1}\,\nabla_x q_{2,N}\big(x^{[n]},\xi^{n+1}\big),\qquad x^{[0]}\in(\mathbb{R}^d)^N,$$
where $x^{[0]}=\big(x^{[0]}_1,\dots,x^{[0]}_N\big)$ is a starting value with pairwise distinct components in $\mathbb{R}^d$.
The updating of the current quantizer at time $n+1$ is performed as follows: let $x^{[n]}=\big(x^{[n]}_1,\dots,x^{[n]}_N\big)$,

• Competition phase (winner selection): $i^{[n+1]}\in \operatorname{argmin}_i\big|x^{[n]}_i-\xi^{n+1}\big|$ (nearest neighbor search),

• Learning phase:
$$x^{[n+1]}_{i^{[n+1]}}:=\mathrm{Dilatation}\big(\xi^{n+1},1-\gamma_{n+1}\big)\big(x^{[n]}_{i^{[n+1]}}\big),\qquad
x^{[n+1]}_{i}:=x^{[n]}_{i},\quad i\ne i^{[n+1]},$$

where $\mathrm{Dilatation}(a,\rho)$ is the dilatation with center $a\in\mathbb{R}^d$ and coefficient $\rho\in(0,1)$ defined by
$$\mathrm{Dilatation}(a,\rho)(u)=a+\rho\,(u-a).$$
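In concrete terms, the learning phase moves the winner a fraction $\gamma_{n+1}$ of the way toward the observation, since $\mathrm{Dilatation}(\xi,1-\gamma)(x)=\xi+(1-\gamma)(x-\xi)=x+\gamma(\xi-x)$. A two-line check (the numerical values below are arbitrary):

```python
def dilatation(a, rho):
    # homothety with center a and coefficient rho: u -> a + rho (u - a)
    return lambda u: tuple(ai + rho * (ui - ai) for ai, ui in zip(a, u))

xi, gamma, x = (1.0, 2.0), 0.25, (5.0, -2.0)
moved = dilatation(xi, 1.0 - gamma)(x)
# equivalently: x + gamma (xi - x)
alt = tuple(xk + gamma * (xik - xk) for xk, xik in zip(x, xi))
assert moved == alt
```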

In case of a tie for the winner index, one applies a prescribed rule (like selecting the true winner at random, uniformly among the winning candidates).
Warning! Note that choosing a $(0,1)$-valued step sequence $(\gamma_n)_{n\ge1}$ is crucial to ensure that the learning phase is a dilatation with coefficient $\rho=1-\gamma_{n+1}$ at iteration $n+1$.
One easily checks by induction that if $x^{[n]}$ has pairwise distinct components $x^{[n]}_i$, this property is preserved by the learning phase, so that the above procedure is well-defined. The name of the procedure – Competitive Learning Vector Quantization algorithm – is of course inspired by these two phases.

▷ Heuristics: $x^{[n]}\longrightarrow x^{(N)}\in\operatorname{argmin}_{x\in(\mathbb{R}^d)^N}Q_{2,N}(x)$ as $n\to+\infty$, or at least toward a local minimum of $Q_{2,N}$. (This implies that $x^{(N)}$ has pairwise distinct components.)
▷ On-line computation of the "companion parameters": this phase is very important in view of numerical applications.
• Weights $p^{(N)}_i=\mathbb{P}\big(\widehat X^{x^{(N)}}=x^{(N)}_i\big)$, $i=1,\dots,N$:
– Initialize: $p^{[0]}_i:=0$, $i=1,\dots,N$,
– Update: $p^{[n+1]}_i:=(1-\gamma_{n+1})\,p^{[n]}_i+\gamma_{n+1}\mathbf{1}_{\{i=i^{[n+1]}\}}$, $n\ge0$. (6.32)

One has
$$p^{[n]}_i\longrightarrow p^{(N)}_i\quad\text{a.s. on the event }\{x^{[n]}\to x^{(N)}\}\quad\text{as } n\to+\infty.$$

• Distortion $Q_{2,N}(x^{(N)})=\big\|X-\widehat X^{x^{(N)}}\big\|_2^2$:
– Initialize: $Q^{[0]}_{2,N}:=0$,
– Update: $Q^{[n+1]}_{2,N}:=(1-\gamma_{n+1})\,Q^{[n]}_{2,N}+\gamma_{n+1}\big|x^{[n]}_{i^{[n+1]}}-\xi^{n+1}\big|^2$, $n\ge0$. (6.33)

One has
$$Q^{[n]}_{2,N}\longrightarrow Q_{2,N}\big(x^{(N)}\big)\quad\text{a.s. on the event }\{x^{[n]}\to x^{(N)}\}\quad\text{as } n\to+\infty.$$

Note that, since the ingredients involved in the above computations are those used
in the competition phase (nearest neighbor search), there is (almost) no extra CPU
time cost induced by this companion procedure. By contrast the nearest neighbor
search is costly.
In some way the CLVQ algorithm can be seen as a Non-linear Monte Carlo
Simulation devised to design an optimal skeleton of the distribution of X .
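As an illustration, here is a minimal numpy sketch of the CLVQ recursion together with the companion recursions (6.32)–(6.33). The target distribution $\mathcal N(0;I_2)$, the level $N$, the step constants and the iteration count are arbitrary choices, not settings prescribed by the book.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n_iter = 8, 2, 100_000
x = rng.standard_normal((N, d))     # x^[0]: pairwise distinct a.s.
p = np.zeros(N)                     # weight estimates, p^[0]_i = 0   (6.32)
Q = 0.0                             # distortion estimate, Q^[0] = 0  (6.33)

for n in range(n_iter):
    gamma = 1.0 / (100.0 + n + 1)   # (0,1)-valued step of the form c/(b+n)
    xi = rng.standard_normal(d)     # fresh independent copy of X ~ N(0; I_d)
    # competition phase: nearest neighbor (winner) search
    i_win = int(((x - xi) ** 2).sum(axis=1).argmin())
    # companion updates (6.32)-(6.33), re-using the winner and its distance
    Q = (1.0 - gamma) * Q + gamma * ((x[i_win] - xi) ** 2).sum()
    p *= 1.0 - gamma
    p[i_win] += gamma
    # learning phase: x_i <- Dilatation(xi, 1-gamma)(x_i) = xi + (1-gamma)(x_i - xi)
    x[i_win] = xi + (1.0 - gamma) * (x[i_win] - xi)

# after many iterations the weight estimates form (almost) a probability vector
assert abs(p.sum() - 1.0) < 0.01 and 0.0 < Q < 2.0
```

Note that the companion updates only re-use quantities already produced by the competition phase, which is why they come at (almost) no extra cost.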
For partial theoretical results on the convergence of the CLVQ algorithm, we refer to [224] when $X$ has a compactly supported distribution. To the best of our knowledge, no satisfactory theoretical convergence results are available when the distribution of $X$ has unbounded support. As for the on-line adaptive companion procedures, the convergence proof relies on classical martingale arguments; we refer again to [224], but also to [20] (see also the exercise below).

▷ Exercise. (a) Replace the step sequence $(\gamma_n)$ in (6.32) and (6.33) by $\tilde\gamma_n=\frac1n$, without modifying anything in the CLVQ procedure itself. Show that, if $x^{[n]}\to x^{(N)}$ a.s., the resulting new procedures both converge toward their targets. [Hint: follow the lines of the convergence of the adaptive estimator of the CVaR in Sect. 6.3.4.]
(b) Extend this result to prove that the a.s. convergence holds on the event $\{x^{[n]}\to x^{(N)}\}$.
The CLVQ algorithm is recommended to obtain accurate results for small or medium levels $N$ (less than 20) and medium dimensions $d$ (less than 10).
The Randomized Lloyd I procedure
The randomized Lloyd I procedure described below is recommended when $N$ is large and $d$ is medium. We start again from the fact that if a function is differentiable at one of its local or global minima, then its gradient is zero at this point. Any global minimizer $x^{(N)}$ of the quadratic distortion function $Q_{2,N}$ has pairwise distinct components and $\mathbb{P}$-negligible Voronoi cell boundaries, as mentioned above. Consequently, owing to Proposition 6.6, the gradient of the quadratic distortion at such an $x=x^{(N)}$ must be zero, i.e.
$$\frac{\partial Q_{2,N}}{\partial x_i}(x)=2\,\mathbb{E}\big[(x_i-X)\,\mathbf{1}_{\{\mathrm{Proj}_x(X)=x_i\}}\big]=0,\qquad 1\le i\le N.$$
Moreover, we know from Theorem 5.1.1(b) that $\mathbb{P}\big(X\in C_i(x)\big)>0$ for every $i\in\{1,\dots,N\}$, so that the above equation equivalently reads
$$x_i=\mathbb{E}\big[X\,|\,\{\widehat X^{x}=x_i\}\big],\qquad i=1,\dots,N.\qquad(6.34)$$

This identity can also be rewritten more synthetically as a fixed point equality for the mapping $\widehat X^x\mapsto \mathbb{E}\big(X\,|\,\widehat X^x\big)$ since
$$\widehat X^{x}=\mathbb{E}\big(X\,|\,\widehat X^{x}\big)\qquad(6.35)$$
owing to $\mathbb{E}\big(X\,|\,\widehat X^{x}\big)=\sum_{i=1}^N \mathbb{E}\big(X\,|\,\{\widehat X^{x}=x_i\}\big)\,\mathbf{1}_{\{\widehat X^{x}=x_i\}}$. Or, equivalently but in a more tractable form, as a fixed point equality for the mapping
$$\Gamma\ \longmapsto\ \mathbb{E}\big(X\,|\,\widehat X^{\Gamma}\big)(\Omega)\qquad(6.36)$$
defined on the set of subsets (grids) of $\mathbb{R}^d$ with (at most) $N$ elements.

▷ Regular Lloyd I procedure (definition). The Lloyd I procedure is simply the recursive procedure associated with the fixed point identity (6.34) (or (6.36)). In its generic form it reads, keeping the notation $x^{[n]}$ for the running $N$-quantizer at iteration $n\ge0$:
$$x^{[n+1]}_i=\begin{cases}\mathbb{E}\big[X\,|\,\{\widehat X^{x^{[n]}}=x^{[n]}_i\}\big]&\text{if }\ \mathbb{P}\big(\widehat X^{x^{[n]}}=x^{[n]}_i\big)>0,\\[1mm] x^{[n]}_i&\text{if }\ \mathbb{P}\big(\widehat X^{x^{[n]}}=x^{[n]}_i\big)=0,\end{cases}\qquad i=1,\dots,N,\qquad(6.37)$$
starting from an $N$-tuple $x^{[0]}\in(\mathbb{R}^d)^N$ with pairwise distinct components in $\mathbb{R}^d$.
We leave it as an exercise to show that this procedure is entirely determined by the distribution of the random vector $X$.

▷ Exercise. Prove that this recursive procedure only involves the distribution $\mu=\mathbb{P}_X$ of the random vector $X$.
The Lloyd I algorithm can be viewed as a two-step procedure acting on random vectors as follows:
$$\begin{cases}\text{(i) Grid updating: } x^{[n+1]}=X^{[n+1]}(\Omega)\ \text{ with }\ X^{[n+1]}=\mathbb{E}\big(X\,|\,\widehat X^{x^{[n]}}\big),\\[1mm]\text{(ii) Voronoi cells (weights) updating: } \widehat X^{x^{[n+1]}}\leftarrow X^{[n+1]}.\end{cases}$$
The first step updates the grid; the second step re-assigns to each element of the grid its Voronoi cell, which can also be interpreted as a weight updating.

Proposition 6.6 The Lloyd I algorithm makes the mean quadratic quantization error decrease, i.e.
$$n\ \longmapsto\ \big\|X-\widehat X^{x^{[n]}}\big\|_2\ \text{ is non-increasing.}$$

Proof. It follows from the above decomposition of the procedure and the very definitions of the nearest neighbor projection and of conditional expectation as an orthogonal projector in $L^2(\mathbb{P})$ that, for every $n\in\mathbb{N}$,
$$\big\|X-\widehat X^{x^{[n+1]}}\big\|_2=\big\|\mathrm{dist}\big(X,x^{[n+1]}\big)\big\|_2\le \big\|X-X^{[n+1]}\big\|_2=\big\|X-\mathbb{E}\big(X\,|\,\widehat X^{x^{[n]}}\big)\big\|_2\le \big\|X-\widehat X^{x^{[n]}}\big\|_2.\qquad\diamond$$
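This monotonicity can be observed on a minimal one-dimensional sketch of the regular Lloyd iteration (6.37) for $X\sim\mathcal N(0,1)$, where the Voronoi cells are intervals delimited by midpoints and the cell means are available in closed form, $\mathbb{E}[X\,|\,a<X\le b]=(\varphi(a)-\varphi(b))/(\Phi(b)-\Phi(a))$. The initial grid, the level and the crude Riemann estimator of the distortion are arbitrary choices.

```python
import math

def Phi(t):   # standard normal cdf (math.erf handles t = +/-inf)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def phi(t):   # standard normal density (phi(+/-inf) = 0.0)
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def lloyd_step(x):
    """One iteration of (6.37) for X ~ N(0,1): replace each x_i by the
    conditional mean of X over its Voronoi interval."""
    x = sorted(x)
    mids = [-math.inf] + [0.5 * (u + v) for u, v in zip(x, x[1:])] + [math.inf]
    return [(phi(a) - phi(b)) / (Phi(b) - Phi(a)) for a, b in zip(mids, mids[1:])]

def distortion(x, n_grid=10_000):
    # crude Riemann estimate of E min_i |X - x_i|^2 over [-8, 8]
    h = 16.0 / n_grid
    return sum(min((t - u) ** 2 for u in x) * phi(t) * h
               for t in (-8.0 + (k + 0.5) * h for k in range(n_grid)))

x = [-3.0, -0.5, 0.1, 2.5]            # x^[0] with pairwise distinct components
Qs = [distortion(x)]
for _ in range(30):
    x = lloyd_step(x)
    Qs.append(distortion(x))

assert all(b <= a + 1e-6 for a, b in zip(Qs, Qs[1:]))   # monotone decrease
```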

Though attractive, this proposition is far from a convergence result for the Lloyd procedure, since components of the running quantizer $x^{[n]}$ can a priori escape to infinity. In spite of a huge literature on the Lloyd I algorithm, also known as $k$-means in Statistics and Data Science, the question of its a.s. convergence toward a stationary, hopefully optimal, $N$-quantizer has so far only received partial answers. Let us cite [168], where the procedure is introduced and investigated, probably for the first time, in a one-dimensional setting for unimodal distributions. More recently, as far as strongly continuous distributions are concerned (4), let us cite [79, 80] or [85], where a.s. convergence is established when $X$ has a compactly supported density, and [238], where the convergence is proved for unbounded strongly continuous distributions under an appropriate initialization of the procedure at level $N$ depending on the lowest quantization error at level $N-1$ (see the splitting method in the Practitioner's corner below).

▷ The Randomized Lloyd I procedure. It relies on the computation of $\mathbb{E}\big(X\,|\,\{\widehat X^{x}=x_i\}\big)$, $1\le i\le N$, by a Monte Carlo simulation: if $\xi_1,\dots,\xi_M,\dots$ are independent copies of $X$,
$$\mathbb{E}\big(X\,|\,\widehat X^{x^{[n]}}=x^{[n]}_i\big)\simeq \frac{\sum_{1\le m\le M}\xi_m\,\mathbf{1}_{\{\widehat\xi_m^{\,x^{[n]}}=x^{[n]}_i\}}}{\big|\{1\le m\le M:\ \widehat\xi_m^{\,x^{[n]}}=x^{[n]}_i\}\big|},$$
keeping in mind that the convergence holds as $M\to+\infty$. To be more precise, this amounts to setting, at every iteration $n\ge0$ and for every $i=1,\dots,N$,
$$x^{[n+1]}_i=\begin{cases}\dfrac{\sum_{1\le m\le M}\xi_m\,\mathbf{1}_{\{\widehat\xi_m^{\,x^{[n]}}=x^{[n]}_i\}}}{\big|\{1\le m\le M:\ \widehat\xi_m^{\,x^{[n]}}=x^{[n]}_i\}\big|}&\text{if }\ \{1\le m\le M:\ \widehat\xi_m^{\,x^{[n]}}=x^{[n]}_i\}\ne\varnothing,\\[2mm] x^{[n]}_i&\text{otherwise,}\end{cases}$$
starting from $x^{[0]}\in(\mathbb{R}^d)^N$ chosen with pairwise distinct components in $\mathbb{R}^d$.


This randomized Lloyd procedure simply amounts to replacing the distribution $\mathbb{P}_X$ of $X$ by the (random) empirical measure
$$\mu_M(\omega,d\xi)=\frac1M\sum_{m=1}^M \delta_{\xi_m(\omega)}(d\xi).$$
In particular, if we use the same sample $(\xi_m(\omega))_{1\le m\le M}$ at each iteration of the procedure, we still have the property that the procedure decreases a quantization error modulus (at level $N$) related to the distribution $\mu_M$.
This suggests that the random i.i.d. sample $(\xi_m)_{m\ge1}$ can also be replaced by deterministic copies obtained through a QMC procedure based on a representation of $X$ of the form $X\stackrel{d}{=}\psi(U)$, $U\sim\mathcal U([0,1]^r)$.
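A minimal numpy sketch of the randomized Lloyd procedure with a frozen sample, which is then exactly a $k$-means pass on the empirical measure $\mu_M$, so that the sample distortion is non-increasing. The sample size, level and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, d = 50_000, 10, 2
xi = rng.standard_normal((M, d))                      # frozen sample: mu_M
x = xi[rng.choice(M, size=N, replace=False)].copy()   # x^[0] drawn from the sample

def assign(x, xi):
    d2 = ((xi[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    win = d2.argmin(axis=1)                           # nearest neighbor indices
    return win, d2[np.arange(M), win].mean()          # winners, empirical distortion

dist_hist = []
for _ in range(25):
    win, Q = assign(x, xi)
    dist_hist.append(Q)
    for i in range(N):
        sel = win == i
        if sel.any():                                 # x_i unchanged on empty cells
            x[i] = xi[sel].mean(axis=0)

# with a frozen sample, each sweep is one Lloyd step for mu_M: Q never increases
assert all(b <= a + 1e-10 for a, b in zip(dist_hist, dist_hist[1:]))
```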

ℵ Practitioner’s corner
▷ Splitting Initialization method II. When computing quantizers of larger and larger sizes for the same distribution, a significant improvement of the method is to initialize the randomized Lloyd procedure or the CLVQ at level $N+1$ by adding one component to the $N$-quantizer resulting from the previous execution of the procedure at level $N$. To be more precise, one should initialize the procedure with the $(N+1)$-tuple $(x^{(N)},\xi)\in(\mathbb{R}^d)^{N+1}$, where $x^{(N)}$ denotes the limiting value of the procedure at level $N$ (assumed to exist, which is the case in practice). Such a protocol is known as the splitting method.

(4) A Borel distribution $\mu$ on $\mathbb{R}^d$ is strongly continuous if $\mu(H)=0$ for every hyperplane $H\subset\mathbb{R}^d$.
▷ Fast nearest neighbor search procedure(s) in $\mathbb{R}^d$.
This is the key step in all stochastic procedures which intend to compute optimal (or at least "good") quantizers in higher dimensions. Speeding it up, especially as $d$ increases, is one of the major challenges of computer science.
– The Partial Distance Search paradigm (see [62]).
The nearest neighbor search in a Euclidean vector space can be reduced to the simpler problem of checking whether a vector $u=(u^1,\dots,u^d)\in\mathbb{R}^d$ is closer to $0$, with respect to the canonical Euclidean distance, than a given former "minimal record distance" $\delta_{\mathrm{rec}}>0$. The elementary "trick" is the following chain of implications:
$$(u^1)^2\ge\delta_{\mathrm{rec}}^2\ \Longrightarrow\ |u|\ge\delta_{\mathrm{rec}},\quad\dots,\quad (u^1)^2+\cdots+(u^\ell)^2\ge\delta_{\mathrm{rec}}^2\ \Longrightarrow\ |u|\ge\delta_{\mathrm{rec}},\quad\dots$$
so that the scan of the coordinates of a candidate can be aborted as soon as the running sum of squares exceeds $\delta_{\mathrm{rec}}^2$. This is the simplest and easiest idea to implement but it seems that it is also the only one that still works well as $d$ increases.
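A minimal sketch of the Partial Distance Search trick in plain Python (the grid and query point below are arbitrary):

```python
def nearest_neighbor_pds(u, grid):
    """Nearest-neighbor search with Partial Distance Search: the running sum
    of squared coordinate differences is compared with the current squared
    record distance, and the scan of a candidate is aborted as soon as the
    partial sum already exceeds it."""
    best_i, best_d2 = -1, float("inf")
    for i, x in enumerate(grid):
        s = 0.0
        for uk, xk in zip(u, x):
            s += (uk - xk) ** 2
            if s >= best_d2:         # partial sum already too large: abort
                break
        else:                        # full distance computed and s < best_d2
            best_i, best_d2 = i, s
    return best_i, best_d2

grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
i, d2 = nearest_neighbor_pds((0.9, 0.2), grid)
assert i == 1
```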
– The K-d tree (Friedman, Bentley, Finkel, 1977, see [99]): the principle is to store the $N$ points of $\mathbb{R}^d$ in a tree of depth $O(\log N)$ based on their coordinates in the canonical basis of $\mathbb{R}^d$.
– Further improvements are due to McNames (see [207]): the idea is to perform a pre-processing of the dataset of $N$ points using a principal component analysis (PCA) and then implement the K-d tree method in the new orthogonal basis induced by the PCA.
Numerical optimization of quantizers for the normal distributions $\mathcal N(0;I_d)$ on $\mathbb{R}^d$, $d\ge1$
The procedures that minimize the quantization error are usually stochastic (except in one dimension). The most famous ones are undoubtedly the so-called Competitive Learning Vector Quantization algorithm (see [231] or [229]) and the Lloyd I procedure (see [106, 226, 231]), which have just been described and briefly analyzed above. More algorithmic details are also available on the website

www.quantize.maths-fi.com

For normal distributions, a large scale optimization has been carried out based on a mixed CLVQ–Lloyd procedure. To be precise, grids have been computed for $d=1$ up to $10$ and $N=1$ up to $5\,000$. Furthermore, several companion parameters have also been computed (still by simulation): weights, $L^1$-quantization error, (squared) $L^2$-quantization error (also known as distortion), local $L^1$- and $L^2$-pseudo-inertia of each Voronoi cell. All these grids can be downloaded from the above website.
Thus, Fig. 6.3 depicts an optimal quadratic $N$-quantization of the bi-variate normal distribution $\mathcal N(0;I_2)$ with $N=500$.

6.4 Further Results on Stochastic Approximation

This section is devoted to more advanced results on Stochastic Approximation and


can be skipped on a first reading. Its first part deals with the connection between the
asymptotic behavior of a stochastic algorithm with mean function h and that of the
ordinary differential equation ODEh ≡ ẏ = −h(y) already introduced in the intro-
duction. The second part is devoted to the main results about the rate of convergence
of stochastic algorithms in both their original and averaged forms.

6.4.1 The Ordinary Differential Equation (ODE) Method

Toward the ODE


The starting idea – which goes back to Ljung in [200] – of the so-called ODE method is to consider a stochastic algorithm with mean function $h$ as a perturbed discrete-time Euler scheme with decreasing step of ODE$_h$.

Fig. 6.3 An optimal quantization of the bi-variate normal distribution with size N = 500 (with J. Printems)

In this section, we will again mainly deal with Markovian stochastic algorithms associated with an i.i.d. sequence $(Z_n)_{n\ge1}$ of $\mathbb{R}^q$-valued innovations of the form (6.3), namely

$$Y_{n+1}=Y_n-\gamma_{n+1}H(Y_n,Z_{n+1}),$$
where $Y_0$ is independent of the innovation sequence (all being defined on the same probability space $(\Omega,\mathcal A,\mathbb{P})$), $H:\mathbb{R}^d\times\mathbb{R}^q\to\mathbb{R}^d$ is a Borel function such that $H(y,Z_1)\in L^2(\mathbb{P})$ for every $y\in\mathbb{R}^d$ and $(\gamma_n)_{n\ge1}$ is a sequence of step parameters. We saw that this algorithm can be represented in a canonical way as follows:
$$Y_{n+1}=Y_n-\gamma_{n+1}h(Y_n)-\gamma_{n+1}\Delta M_{n+1},\qquad(6.38)$$
where $h(y)=\mathbb{E}\,H(y,Z_1)$ is the mean function of the algorithm. We define the filtration $\mathcal F_n=\sigma(Y_0,Z_1,\dots,Z_n)$, $n\ge0$. The sequence $(Y_n)_{n\ge0}$ is $(\mathcal F_n)$-adapted and $\Delta M_n=H(Y_{n-1},Z_n)-\mathbb{E}\big(H(Y_{n-1},Z_n)\,|\,\mathcal F_{n-1}\big)=H(Y_{n-1},Z_n)-h(Y_{n-1})$, $n\ge1$, is a sequence of $(\mathcal F_n)$-martingale increments.
In the preceding sections we established that, under the assumptions of the Robbins–Siegmund Lemma (based on the existence of a "Lyapunov function", see Theorem 6.1 (ii) and (v)), the sequence $(Y_n)_{n\ge0}$ is a.s. bounded and the martingale
$$M^\gamma_n=\sum_{k=1}^n\gamma_k\,\Delta M_k\quad\text{is a.s. convergent in }\mathbb{R}^d.$$

At this stage, to derive the a.s. convergence of the algorithm itself in various settings (Robbins–Monro, stochastic gradient, one-dimensional stochastic gradient), we used direct pathwise arguments based on elementary topology. The main improvement provided by the ODE method is to supply more powerful tools from functional analysis, derived from a further investigation of the asymptotics of the "tail" sequences $(Y_k(\omega))_{k\ge n}$, $n\ge0$, assuming a priori that the whole sequence $(Y_n(\omega))_{n\ge0}$ is bounded. The idea is to represent these tail sequences as stepwise constant càdlàg functions of the cumulative function of the steps. We will also need an additional assumption on the paths of the martingale $(M^\gamma_n)_{n\ge1}$, however significantly less stringent than the above a.s. convergence property. To keep working in this direction, it is more convenient to temporarily abandon our stochastic framework and focus on a discrete-time deterministic dynamics.
ODE method I
Let us be more specific: first we consider a recursively defined sequence
$$y_{n+1}=y_n-\gamma_{n+1}\big(h(y_n)+\pi_{n+1}\big),\qquad y_0\in\mathbb{R}^d,\qquad(6.39)$$
where $(\pi_n)_{n\ge1}$ is a sequence of $\mathbb{R}^d$-valued vectors and $h:\mathbb{R}^d\to\mathbb{R}^d$ is a continuous Borel function.
We set $\Gamma_0=0$ and, for every integer $n\ge1$,
$$\Gamma_n=\sum_{k=1}^n\gamma_k.$$
Then we define the stepwise constant càdlàg function $(y^{(0)}_t)_{t\in\mathbb{R}_+}$ by
$$y^{(0)}_t=y_n\ \text{ if }\ t\in[\Gamma_n,\Gamma_{n+1})$$
and the sequence of time-shifted functions by
$$y^{(n)}_t=y^{(0)}_{\Gamma_n+t},\qquad t\in\mathbb{R}_+.$$
Finally, we set, for every $t\in\mathbb{R}_+$,
$$N(t)=\max\big\{k:\Gamma_k\le t\big\}=\min\big\{k:\Gamma_{k+1}>t\big\},$$
so that $N(t)=n$ if and only if $t\in[\Gamma_n,\Gamma_{n+1})$ (in particular, $N(\Gamma_n)=n$).
Expanding the recursive Equation (6.39), we get
$$y_n=y_0-\sum_{k=1}^n\gamma_k h(y_{k-1})-\sum_{k=1}^n\gamma_k\pi_k,$$
which can be rewritten, for every $t\in[\Gamma_n,\Gamma_{n+1})$, as
$$y^{(0)}_t=y^{(0)}_0-\sum_{k=1}^n\int_{\Gamma_{k-1}}^{\Gamma_k}h\big(\underbrace{y^{(0)}_s}_{=\,y_{k-1}}\big)\,ds-\sum_{k=1}^n\gamma_k\pi_k
= y^{(0)}_0-\int_0^{\Gamma_n}h\big(y^{(0)}_s\big)\,ds-\sum_{k=1}^n\gamma_k\pi_k.$$
As a consequence, for every $t\in\mathbb{R}_+$,
$$y^{(0)}_t=y^{(0)}_0-\int_0^{\Gamma_{N(t)}}h\big(y^{(0)}_s\big)\,ds-\sum_{k=1}^{N(t)}\gamma_k\pi_k
= y^{(0)}_0-\int_0^{t}h\big(y^{(0)}_s\big)\,ds+h\big(y_{N(t)}\big)\big(t-\Gamma_{N(t)}\big)-\sum_{k=1}^{N(t)}\gamma_k\pi_k.\qquad(6.40)$$
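The bookkeeping behind (6.40) can be verified numerically on a toy one-dimensional example, writing $\Gamma_n=\gamma_1+\cdots+\gamma_n$ and $N(t)=\max\{k:\Gamma_k\le t\}$; the mean function, the steps and the perturbations below are arbitrary choices.

```python
import math

gammas = [1.0 / (k + 2) for k in range(200)]               # gamma_{k+1}
pis = [math.sin(3.0 * k) / (k + 1) for k in range(200)]    # pi_{k+1}
h = lambda y: -y                                           # a toy mean function

ys = [1.0]                                                 # y_0
for g, p in zip(gammas, pis):                              # recursion (6.39)
    ys.append(ys[-1] - g * (h(ys[-1]) + p))

Gamma = [0.0]                                              # Gamma_0 = 0
for g in gammas:
    Gamma.append(Gamma[-1] + g)

def N_of(t):                                               # N(t) = max{k : Gamma_k <= t}
    k = 0
    while k + 1 < len(Gamma) and Gamma[k + 1] <= t:
        k += 1
    return k

t = 3.141
n = N_of(t)
# right-hand side of (6.40), using y^{(0)}_s = y_k on [Gamma_k, Gamma_{k+1})
integral = sum(gammas[k] * h(ys[k]) for k in range(n)) + (t - Gamma[n]) * h(ys[n])
rhs = (ys[0] - integral + h(ys[n]) * (t - Gamma[n])
       - sum(gammas[k] * pis[k] for k in range(n)))
assert abs(ys[n] - rhs) < 1e-9     # y^{(0)}_t = y_{N(t)} coincides with (6.40)
```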

Then, using the very definition of the shifted function $y^{(n)}$ and taking advantage of the fact that $N(\Gamma_n)=n$, we derive, by writing (6.40) at times $\Gamma_n+t$ and $\Gamma_n$ and subtracting, that for every $t\in\mathbb{R}_+$,
$$y^{(n)}_t=y^{(0)}_{\Gamma_n}+\big(y^{(0)}_{\Gamma_n+t}-y^{(0)}_{\Gamma_n}\big)
= y^{(0)}_{\Gamma_n}-\int_{\Gamma_n}^{\Gamma_n+t}h\big(y^{(0)}_s\big)\,ds+h\big(y_{N(\Gamma_n+t)}\big)\big(t+\Gamma_n-\Gamma_{N(\Gamma_n+t)}\big)-\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\pi_k$$
$$=y^{(n)}_0-\int_0^t h\big(y^{(n)}_s\big)\,ds+R^{(n)}_t$$
with
$$R^{(n)}_t=h\big(y_{N(\Gamma_n+t)}\big)\big(t+\Gamma_n-\Gamma_{N(\Gamma_n+t)}\big)-\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\pi_k,$$
keeping in mind that $y^{(n)}_0=y^{(0)}_{\Gamma_n}\,(=y_n)$. The term $R^{(n)}_t$ is intended to behave as a remainder term as $n$ goes to infinity. The next proposition establishes a first connection between the asymptotic behavior of the sequence of vectors $(y_n)_{n\ge0}$ and that of the sequence of functions $(y^{(n)})_{n\ge0}$.

Proposition 6.7 (ODE method I) Assume that:
• $H_1\ \equiv$ both sequences $(y_n)_{n\ge0}$ and $(h(y_n))_{n\ge0}$ are bounded,
• $H_2\ \equiv\ \forall\, n\ge1,\ \gamma_n>0,\ \lim_n\gamma_n=0$ and $\sum_{n\ge1}\gamma_n=+\infty$,
• $H_3\ \equiv\ \forall\, T\in(0,+\infty),\ \displaystyle\lim_n\ \sup_{t\in[0,T]}\Big|\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\pi_k\Big|=0$.
Then:
(a) The set $Y^\infty:=\{\text{limiting points of }(y_n)_{n\ge0}\}$ is a compact connected set.
(b) The sequence $(y^{(n)})_{n\ge0}$ is sequentially relatively compact (5) with respect to the topology of uniform convergence on compact sets on the space $\mathcal B(\mathbb{R}_+,\mathbb{R}^d)$ of bounded functions from $\mathbb{R}_+$ to $\mathbb{R}^d$ (6), and all its limiting points lie in $\mathcal C_b(\mathbb{R}_+,Y^\infty)$.

Proof. Let $M=\sup_{n\in\mathbb{N}}|h(y_n)|$.
(a) Let $T_0=\sup_{n\in\mathbb{N}}\gamma_n<+\infty$. Taking $t=\gamma_{n+1}\in[0,T_0]$ in $H_3$ yields
$$|\gamma_{n+1}\pi_{n+1}|\le\sup_{t\in[0,T_0]}\Big|\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\pi_k\Big|\longrightarrow0\quad\text{as }n\to+\infty.$$
It then follows from (6.39) and $H_2$ that
$$|y_{n+1}-y_n|\le\gamma_{n+1}M+|\gamma_{n+1}\pi_{n+1}|\to0\quad\text{as }n\to+\infty.$$

(5) In a metric space $(E,d)$, a sequence $(x_n)_{n\ge0}$ is sequentially relatively compact if from any subsequence one can extract a convergent subsequence with respect to the distance $d$.
(6) This topology is defined by the metric $d(f,g)=\sum_{k\ge1}2^{-k}\min\big(\sup_{t\in[0,k]}|f(t)-g(t)|,1\big)$.
As a consequence, the set $Y^\infty$ is compact and "bien enchaîné" (7), hence connected.
(b) The sequence $\big(\int_0^{\cdot}h(y^{(n)}_s)\,ds\big)_{n\ge0}$ is uniformly Lipschitz continuous with Lipschitz coefficient $M$ since, for every $s,t\in\mathbb{R}_+$, $s\le t$,
$$\Big|\int_0^t h\big(y^{(n)}_u\big)\,du-\int_0^s h\big(y^{(n)}_u\big)\,du\Big|\le \int_s^t\big|h\big(y^{(n)}_u\big)\big|\,du\le M(t-s),$$
and $(y^{(n)}_0)_{n\ge0}=(y_n)_{n\ge0}$ is bounded; hence it follows from the Arzelà–Ascoli Theorem that the sequence of continuous functions $\big(y^{(n)}_0-\int_0^{\cdot}h(y^{(n)}_s)\,ds\big)_{n\ge0}$ is relatively compact in $\mathcal C(\mathbb{R}_+,\mathbb{R}^d)$ endowed with the topology, denoted by $\mathcal U_K$, of uniform convergence on compact intervals. On the other hand, for every $T\in(0,+\infty)$,
$$\sup_{t\in[0,T]}\Big|y^{(n)}_t-y^{(n)}_0+\int_0^t h\big(y^{(n)}_s\big)\,ds\Big|=\sup_{t\in[0,T]}\big|R^{(n)}_t\big|
\le M\sup_{k\ge n+1}\gamma_k+\sup_{t\in[0,T]}\Big|\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\pi_k\Big|\to0\quad\text{as }n\to+\infty,$$
owing to $H_2$ and $H_3$. Equivalently, this reads
$$y^{(n)}-\Big(y^{(n)}_0-\int_0^{\cdot}h\big(y^{(n)}_s\big)\,ds\Big)\ \overset{\mathcal U_K}{\longrightarrow}\ 0\quad\text{as }n\to+\infty,\qquad(6.41)$$
which implies that the sequence $\big(y^{(n)}\big)_{n\ge0}$ is $\mathcal U_K$-relatively compact. Moreover, both sequences $\big(y^{(n)}_0-\int_0^{\cdot}h(y^{(n)}_s)\,ds\big)_{n\ge0}$ and $\big(y^{(n)}\big)_{n\ge0}$ have the same $\mathcal U_K$-limiting values. In particular, these limiting functions are continuous with values in $Y^\infty$. ♦

ODE method II: flow invariance
To state the first significant theorem of the so-called ODE method, we introduce the reverse differential equation
$$\mathrm{ODE}^*_h\ \equiv\ \dot y=h(y).$$
Theorem 6.5 (ODE II) Assume that $H_i$, $i=1,2,3$, hold and that the mean function $h$ is continuous. Let $y_0\in\mathbb{R}^d$ and let $Y^\infty$ be the set of limiting values of the sequence $(y_n)_{n\ge0}$ recursively defined by (6.39).
(a) Any limiting function of the sequence $(y^{(n)})_{n\ge0}$ is a $Y^\infty$-valued solution of $\mathrm{ODE}_h\equiv\dot y=-h(y)$.

(7) A set $A$ in a metric space $(E,d)$ is "bien enchaîné" if for every $a,a'\in A$ and every $\varepsilon>0$, there exist an integer $n\ge1$ and $a_0,\dots,a_n$ such that $a_0=a$, $a_n=a'$ and $d(a_i,a_{i+1})\le\varepsilon$ for every $i=0,\dots,n-1$. Any connected set $C$ in $E$ is bien enchaîné. The converse is true if $C$ is compact, see e.g. [13] for more details.
(b) Assume that $\mathrm{ODE}_h$ has a flow $\Phi_t(\xi)$ (8) and assume the existence of a flow for $\mathrm{ODE}^*_h$. Then the set $Y^\infty$ is a compact, connected set, flow-invariant for both $\mathrm{ODE}_h$ and $\mathrm{ODE}^*_h$.

Proof. (a) Given the above Proposition 6.7, the conclusion follows if we prove that any limiting function $y^{(\infty)}=\mathcal U_K\text{-}\lim_n y^{(\varphi(n))}$ ($\varphi(n)\to+\infty$) is a solution to $\mathrm{ODE}_h$. For every $t\in\mathbb{R}_+$, $y^{(\varphi(n))}_t\to y^{(\infty)}_t$, hence $h\big(y^{(\varphi(n))}_t\big)\to h\big(y^{(\infty)}_t\big)$ since the function $h$ is continuous. Then, by the Lebesgue dominated convergence theorem, one derives that for every $t\in\mathbb{R}_+$,
$$\int_0^t h\big(y^{(\varphi(n))}_s\big)\,ds\longrightarrow \int_0^t h\big(y^{(\infty)}_s\big)\,ds.$$
One also has $y^{(\varphi(n))}_0\to y^{(\infty)}_0$, so that finally, letting $\varphi(n)\to+\infty$ in (6.41), we obtain
$$y^{(\infty)}_t=y^{(\infty)}_0-\int_0^t h\big(y^{(\infty)}_s\big)\,ds.$$
One concludes by time differentiation, since $h\circ y^{(\infty)}$ is continuous.


(b) Any y∞ ∈ Y ∞ is the limit of a subsequence yϕ(n) with ϕ(n) → +∞. Up to a
new extraction, still denoted by ϕ(n) for convenience, we may assume that y(ϕ(n)) →
y(∞) ∈ C(R+ , Rd ) as n → +∞, uniformly on compact sets of R+ . The function y(∞)
is a Y ∞ -valued solution to ODEh by (a) and y(∞) = (y0(∞) ) owing to the uniqueness
assumption. This implies the invariance of Y ∞ under the flow of ODEh .
For every p ∈ N, we consider for large enough n, say n ≥ np , the sequence
(yN (ϕ(n) −p) )n≥np . It is clear by mimicking the proof of Proposition 6.7 that sequences
of functions (y(N (ϕ(n) −p)) )n≥np are UK -relatively compact. By a diagonal extraction
procedure (still denoted by ϕ(n)), we may assume that, for every p ∈ N,

UK
y(N (ϕ(n) −p)) −→ y(∞),p as n → +∞.
(N (ϕ(n) −p−1)) (N (ϕ(n) −p))
Since yt+1 = yt for every t ∈ R+ and every n ≥ np+1 , one has

(∞),p+1 (∞),p
∀ p ∈ N, ∀ t ∈ R+ , yt+1 = yt .

Furthermore, it follows from (a) that the functions y(∞),p are Y ∞ -valued solutions
to ODEh . One defines the function y(∞) by
(∞),p
yt(∞) = yp−t , t ∈ [p − 1, p],

8 For every ξ ∈ Y ∞ , ODEh admits a unique solution (t (ξ))t∈R+ starting at 0 (ξ) = ξ.
6.4 Further Results on Stochastic Approximation 231

(∞),p
which satisfies a posteriori, for every p ∈ N, yt(∞) = yp−t , t ∈ [0, p]. This implies
that y(∞) is a Y ∞ -valued solution to ODEh∗ starting from y∞ on ∪p≥0 [0, p] = R+ .
Uniqueness implies that y(∞) = ∗t (y∞ ), which completes the proof. ♦

Remark. If uniqueness fails either for $\mathrm{ODE}_h$ or for $\mathrm{ODE}^*_h$, one still has that $Y^\infty$ is left invariant by $\mathrm{ODE}_h$ and $\mathrm{ODE}^*_h$ in the weaker sense that, for every $y_\infty\in Y^\infty$, there exist $Y^\infty$-valued solutions of $\mathrm{ODE}_h$ and $\mathrm{ODE}^*_h$ starting from $y_\infty$.

This property is the first step toward the deep connection between the asymptotic behavior of a recursive stochastic algorithm and its associated mean field $\mathrm{ODE}_h$. Claim (b) can be seen as a first criterion to screen possible candidates for the set of limiting values of the algorithm. Thus, any zero $y_*$ of $h$ or, equivalently, any equilibrium point of $\mathrm{ODE}_h$, satisfies the requested invariance condition since $\Phi_t(y_*)=y_*$ for every $t\in\mathbb{R}_+$. No other single point can satisfy this invariance property. More generally, we have the following result.
Corollary 6.2 If there are finitely many compact connected sets $X_i$, $i\in I$ ($I$ finite), two-sided invariant for $\mathrm{ODE}_h$ and $\mathrm{ODE}^*_h$, then the sequence $(y_n)_{n\ge0}$ converges toward one of these sets, i.e. there is an index $i_0\in I$ such that $\mathrm{dist}(y_n,X_{i_0})\to0$ as $n\to+\infty$.
As an elementary example, let us consider the ODE
$$\dot y=(1-|y|)\,y+\varsigma\, y^{\perp},\qquad y_0\in\mathbb{R}^2,$$
where $y=(y^1,y^2)$, $y^{\perp}=(-y^2,y^1)$ and $\varsigma$ is a positive real constant. Then the unit circle $\mathcal C(0;1)$ is clearly a connected compact set invariant under the ODE and ODE$^*$. The singleton $\{0\}$ also satisfies this invariance property. In fact, $\mathcal C(0;1)$ is an attractor of the ODE and $\{0\}$ is a repeller, in the sense that the flow $\Phi$ of this ODE converges uniformly toward $\mathcal C(0;1)$ on every compact set $K\subset\mathbb{R}^2\setminus\{0\}$.
We know that any recursive procedure with mean function $h$ of the form $h(y)=-\big((1-|y|)\,y+\varsigma\,y^{\perp}\big)$ satisfying $H_i$, $i=1,2,3$, will converge either toward $\mathcal C(0;1)$ or toward $\{0\}$. But at this stage, we cannot eliminate the repulsive equilibrium $\{0\}$ (what happens if $y_0=0$ and the perturbation term sequence $\pi_n\equiv0$?).
Sharper characterizations of the possible sets of limiting points of the sequence $(y_n)_{n\ge0}$ have been established, in close connection with the theory of perturbed dynamical systems. To be slightly more precise, it has been shown that the set $Y^\infty$ of limiting points of the sequence $(y_n)_{n\ge0}$ is internally chain recurrent or, equivalently, contains no strict attractor for the dynamics of the ODE, i.e. no subset $A\subset Y^\infty$, $A\ne Y^\infty$, such that $\Phi_t(\xi)$ converges to $A$ uniformly in $\xi\in Y^\infty$. Such results are beyond the scope of this monograph and we refer to [33] (see also [94] when uniqueness fails and the ODE has no flow) for an introduction to internal chain recurrence.
Unfortunately, such refined results are still not able to discriminate between these two candidates for the limiting set, though, as soon as $\pi_n$ behaves like a non-degenerate noise when $y_n$ is close to $0$, it seems more likely that the algorithm converges toward the unit circle, as the flow of $\mathrm{ODE}_h$ does (except when starting from $0$). At this point, probability comes back into the game: this intuition can be confirmed under an additional non-degeneracy assumption on the noise at $0$ for the algorithm (the notion of "noisy trap"). Thus, if the procedure (6.39) is a generic path of a Markovian algorithm of the form (6.3) satisfying at $0$
$$\mathbb{E}\big[H(0,Z_1)H(0,Z_1)^t\big]\ \text{ is symmetric positive definite},$$
then this generic path cannot converge toward $0$ and consequently converges to $\mathcal C(0;1)$.
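This discussion can be illustrated by a small simulation, shown as a hedged sketch below: a Robbins–Monro procedure (6.39) whose mean function is $h(y)=-\big((1-|y|)y+\varsigma y^{\perp}\big)$ (so that $\mathrm{ODE}_h$ is the ODE above), driven by i.i.d. Gaussian innovations; the step sequence, starting point and $\varsigma$ are arbitrary choices. With a non-degenerate noise, the iterates leave the neighborhood of the repeller $\{0\}$ and settle near the unit circle.

```python
import numpy as np

rng = np.random.default_rng(3)
varsigma = 0.5

def h(y):
    # mean function chosen so that ODE_h, i.e. y' = -h(y), is the example ODE:
    # y' = (1 - |y|) y + varsigma * y_perp
    r = float(np.linalg.norm(y))
    return -((1.0 - r) * y + varsigma * np.array([-y[1], y[0]]))

y = np.array([0.1, 0.0])              # start close to the repeller {0}
for n in range(1, 100_001):
    gamma = 1.0 / (10.0 + n)
    noise = rng.standard_normal(2)    # non-degenerate noise at 0 ("noisy trap")
    y = y - gamma * (h(y) + noise)

assert abs(float(np.linalg.norm(y)) - 1.0) < 0.1   # settles near the unit circle
```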
Practical aspects of assumption H3 .
To make the connection with the original form of stochastic algorithms, we come
back to hypothesis H3 in the following proposition. In particular, it emphasizes that
this condition is less stringent than a standard convergence assumption on the series.

Proposition 6.8 Assumption $H_3$ is satisfied in the two following situations (or their combination):
(a) Remainder term: if $\pi_n=r_n$ with $r_n\to0$ and $\gamma_n\to0$ as $n\to+\infty$, then
$$\sup_{t\in[0,T]}\Big|\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\pi_k\Big|\le \sup_{k\ge n+1}|\pi_k|\,\big(\Gamma_{N(\Gamma_n+T)}-\Gamma_n\big)\le T\sup_{k\ge n+1}|\pi_k|\to0\quad\text{as }n\to+\infty.$$
(b) Martingale perturbation term: if $\pi_n=\Delta M_n$, $n\ge1$, is a sequence of martingale increments and if the martingale $M^\gamma_n=\sum_{k=1}^n\gamma_k\Delta M_k$ a.s. converges in $\mathbb{R}^d$, then it satisfies a Cauchy condition, so that
$$\sup_{t\in[0,T]}\Big|\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\Delta M_k\Big|\le \sup_{\ell\ge n+1}\Big|\sum_{k=n+1}^{\ell}\gamma_k\Delta M_k\Big|\to0\quad\text{a.s. as }n\to+\infty.$$
(c) Mixed perturbation: in practice, one often meets the combination of these two situations:
$$\pi_n=\Delta M_n+r_n,$$
where $r_n$ is a remainder term which goes to $0$ a.s. and $M^\gamma_n$ is an a.s. convergent martingale.
The a.s. convergence of the martingale $(M^\gamma_n)_{n\ge0}$ follows, for instance, from $\sup_{n\ge1}\mathbb{E}\big((\Delta M_n)^2\,|\,\mathcal G_{n-1}\big)<+\infty$ a.s. combined with $\sum_n\gamma_n^2<+\infty$, where $\mathcal G_n=\sigma(\Delta M_k,\,k=1,\dots,n)$, $n\ge1$, and $\mathcal G_0=\{\varnothing,\Omega\}$. The a.s. convergence of $(M^\gamma_n)_{n\ge0}$ can even be relaxed for the martingale term. Thus we have the following classical results where $H_3$ is satisfied while the martingale $M^\gamma_n$ may diverge.

Proposition 6.9 (a) Métivier–Priouret criterion (see [213]). Let $(\Delta M_n)_{n\ge1}$ be a sequence of martingale increments and let $(\gamma_n)_{n\ge1}$ be a sequence of non-negative steps satisfying $\sum_n\gamma_n=+\infty$. Then $H_3$ is a.s. satisfied with $\pi_n=\Delta M_n$ as soon as there exists a pair of Hölder conjugate exponents $(p,q)\in(1,+\infty)^2$ (i.e. $\frac1p+\frac1q=1$) such that
$$\sup_n \mathbb{E}\,|\Delta M_n|^p<+\infty\quad\text{and}\quad \sum_n\gamma_n^{1+\frac q2}<+\infty.$$
This allows for the use of steps of the form $\gamma_n\sim c_1\, n^{-a}$, $a>\frac{2}{2+q}=\frac{2(p-1)}{3(p-1)+1}$.
(b) Exponential criterion (see e.g. [33]). Assume that there exists a real number $c>0$ such that
$$\forall\,\lambda\in\mathbb{R},\qquad \mathbb{E}\,e^{\lambda \Delta M_n}\le e^{c\frac{\lambda^2}{2}}.$$
Then, for every sequence $(\gamma_n)_{n\ge1}$ such that $\sum_{n\ge1}e^{-c/\gamma_n}<+\infty$, Assumption $H_3$ is satisfied with $\pi_n=\Delta M_n$. This allows for the use of steps of the form $\gamma_n\sim c_1\, n^{-a}$, $a>0$, and $\gamma_n=c_1(\log n)^{-(1+a)}$, $a>0$.
▷ Examples. Typical examples where the sub-Gaussian assumption is satisfied are the following:
• $|\Delta M_n|\le K\in\mathbb{R}_+$: owing to Hoeffding's Inequality, $\mathbb{E}\,e^{\lambda \Delta M_n}\le e^{\frac{\lambda^2(2K)^2}{8}}=e^{\frac{\lambda^2 K^2}{2}}$ (see e.g. [193]; see also the exercise that follows the proof of Theorem 7.4).
• $\Delta M_n\overset{d}{=}\mathcal N(0;\sigma_n^2)$ with $\sigma_n\le K$, so that $\mathbb{E}\,e^{\lambda \Delta M_n}\le e^{\frac{\lambda^2 K^2}{2}}$.
The first case is very important since, in many situations, the perturbation term is a martingale term and is structurally bounded.
Application to an extended Lyapunov approach and pseudo-gradients
By relying on claim (a) in Proposition 6.7, one can also derive a.s. convergence results for an algorithm directly.
Proposition 6.10 (G-Lemma, see [94]) Assume $H_i$, $i=1,2,3$. Let $G:\mathbb{R}^d\to\mathbb{R}_+$ be a function satisfying
$$(G)\ \equiv\ \Big[\lim_n y_n=y_\infty\ \text{ and }\ \lim_n G(y_n)=0\Big]\ \Longrightarrow\ G(y_\infty)=0.$$
Assume that the sequence $(y_n)_{n\ge0}$ satisfies
$$\sum_{n\ge0}\gamma_{n+1}G(y_n)<+\infty.\qquad(6.42)$$
Then, there exists a connected component $X^*$ of the set $\{G=0\}$ such that $\mathrm{dist}(y_n,X^*)\to0$ as $n\to+\infty$.
Remark. Any non-negative lower semi-continuous (l.s.c.) function $G:\mathbb{R}^d\to\mathbb{R}_+$ satisfies $(G)$ (9).

(9) A function $f:\mathbb{R}^d\to\mathbb{R}$ is lower semi-continuous if, for every $x\in\mathbb{R}^d$ and every sequence $x_n\to x$, $f(x)\le \liminf_n f(x_n)$.

Proof. First, it follows from Proposition 6.7 that the sequence $(y^{(n)})_{n\ge0}$ is $\mathcal U_K$-relatively compact with limiting functions lying in $\mathcal C(\mathbb{R}_+,Y^\infty)$, where $Y^\infty$ still denotes the compact connected set of limiting values of $(y_n)_{n\ge0}$.
Set, for every $y\in\mathbb{R}^d$,
$$\underline G(y)=\liminf_{x\to y}G(x)=\inf\Big\{\liminf_n G(x_n):\ x_n\to y\Big\},$$
so that $0\le \underline G\le G$. The function $\underline G$ is the l.s.c. envelope of the function $G$, i.e. the highest l.s.c. function not greater than $G$. In particular, under Assumption $(G)$,
$$\{\underline G=0\}=\{G=0\}\ \text{ is closed.}$$
First note that Assumption (6.42) reads
$$\int_0^{+\infty}G\big(y^{(0)}_s\big)\,ds<+\infty.$$
Let $y_\infty\in Y^\infty$. Up to at most two extractions of subsequences, one may assume that $y^{(\varphi(n))}\to y^{(\infty)}$ for the $\mathcal U_K$ topology, where $y^{(\infty)}_0=y_\infty$. It follows that
$$\begin{aligned}
0\le\int_0^{+\infty}\underline G\big(y^{(\infty)}_s\big)\,ds&=\int_0^{+\infty}\underline G\Big(\lim_n y^{(\varphi(n))}_s\Big)\,ds\\
&\le\int_0^{+\infty}\liminf_n \underline G\big(y^{(\varphi(n))}_s\big)\,ds\quad\text{since }\underline G\text{ is l.s.c.}\\
&\le \liminf_n\int_0^{+\infty}\underline G\big(y^{(\varphi(n))}_s\big)\,ds\quad\text{owing to Fatou's Lemma}\\
&\le \liminf_n\int_0^{+\infty}G\big(y^{(\varphi(n))}_s\big)\,ds
=\liminf_n\int_{\Gamma_{\varphi(n)}}^{+\infty}G\big(y^{(0)}_s\big)\,ds=0.
\end{aligned}$$
Consequently, $\int_0^{+\infty}\underline G\big(y^{(\infty)}_s\big)\,ds=0$, which implies that $\underline G\big(y^{(\infty)}_s\big)=0$ $ds$-a.s. Now, as the function $s\mapsto y^{(\infty)}_s$ is continuous, it follows that $\underline G\big(y^{(\infty)}_0\big)=0$ since $\underline G$ is l.s.c. This in turn implies $G\big(y^{(\infty)}_0\big)=0$, i.e. $G(y_\infty)=0$. As a consequence, $Y^\infty\subset\{G=0\}$, which yields the result since, on the other hand, $Y^\infty$ is a connected set. ♦

Now we are in a position to prove the convergence of stochastic pseudo-gradient procedures in the multi-dimensional case.
6.4 Further Results on Stochastic Approximation 235

Proof of Theorem 6.3 (Pseudo-gradient). Each path of the stochastic algorithm fits into the ODE formalism (6.39) by setting y_n = Y_n(ω) and the perturbation π_n = H(Y_{n−1}(ω), Z_n(ω)) − h(Y_{n−1}(ω)). Under the assumptions of the Robbins–Siegmund Lemma, we know that (Y_n(ω))_{n≥0} is P(dω)-a.s. bounded and L(Y_n(ω)) → L_∞(ω). Combined with claim (v) of Theorem 6.1, this implies that, P(dω)-a.s., Assumptions Hi, i = 1, 2, 3, are satisfied by (y_n)_{n≥0}. Consequently, Proposition 6.7 implies that the set 𝒴^∞(ω) of limiting values of (y_n)_{n≥0} is connected and compact. The former Proposition 6.10 applied to the l.s.c. function G = (∇L | h) implies that 𝒴^∞(ω) ⊂ {(∇L | h) = 0} as it is also contained in {L = ℓ(ω)} with ℓ(ω) = L_∞(ω). The conclusion follows. ♦

6.4.2 L²-Rate of Convergence and Application to Convex Optimization

Proposition 6.11 Let (Y_n)_{n≥1} be a stochastic algorithm defined by (6.3) where the function H satisfies the linear quadratic growth assumption (6.6), namely

∀ y ∈ ℝ^d, ‖H(y, Z)‖₂ ≤ C(1 + |y|).

Assume Y₀ ∈ L²(P) is independent of the i.i.d. sequence (Z_n)_{n≥1}. Assume there exist y* ∈ ℝ^d and α > 0 such that the strong mean-reverting assumption

∀ y ∈ ℝ^d, (y − y* | h(y)) ≥ α |y − y*|² (6.43)

holds. Finally, assume that the step sequence (γ_n)_{n≥1} satisfies the usual decreasing step assumption (6.7) and the additional assumption

(G_α) ≡ lim_n a_n = lim_n (1/γ_{n+1}) [ (γ_n/γ_{n+1})(1 − 2 α γ_{n+1}) − 1 ] = −κ* < 0. (6.44)

Then

Y_n → y* a.s. and ‖Y_n − y*‖₂ = O(√γ_n).

ℵ Practitioner's corner. • If γ_n = γ₁/n^ϑ, 1/2 < ϑ < 1, then (G_α) is satisfied for any α > 0.
• If γ_n = γ₁/n, Condition (G_α) reads (1 − 2 α γ₁)/γ₁ = −κ* < 0, or equivalently γ₁ > 1/(2α).
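As a hedged numerical sketch (the toy procedure below is an assumption for illustration, not an example from the book), the ‖Y_n − y*‖₂ = O(√γ_n) rate can be checked by Monte Carlo on H(y, z) = y − z with uniform innovations, for which h(y) = y − 1/2, y* = 1/2 and (6.43) holds with α = 1:

```python
import numpy as np

# Toy Robbins-Monro procedure (illustrative assumption): H(y, z) = y - z,
# Z ~ U(0, 1), so h(y) = y - 1/2, y* = 1/2, alpha = 1.
rng = np.random.default_rng(0)
M, n_iter, gamma1 = 2000, 2000, 1.0      # gamma1 > 1/(2*alpha) = 1/2
Y = np.zeros(M)                          # M independent runs, Y_0 = 0
for n in range(1, n_iter + 1):
    Z = rng.random(M)                    # innovations Z_n
    Y -= (gamma1 / n) * (Y - Z)          # Y_n = Y_{n-1} - gamma_n H(Y_{n-1}, Z_n)
l2_err = np.sqrt(np.mean((Y - 0.5) ** 2))    # estimate of ||Y_n - y*||_2
print(l2_err, np.sqrt(gamma1 / n_iter))      # L2 error vs the sqrt(gamma_n) scale
```

With these choices the L² error stays well below the √γ_n scale, as the proposition predicts.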
Proof of Proposition 6.11. The fact that Y_n → y* a.s. is a straightforward consequence of Corollary 6.1 (Robbins–Monro framework). We also know that (|Y_n − y*|)_{n≥0} is L²-bounded so that, in particular, (Y_n)_{n≥0} is L²-bounded. As concerns the quadratic rate of convergence, we re-start from the classical proof of Robbins–Siegmund's Lemma. Let F_n = σ(Y₀, Z₁, …, Z_n), n ≥ 0. Then

|Y_{n+1} − y*|² = |Y_n − y*|² − 2 γ_{n+1} (H(Y_n, Z_{n+1}) | Y_n − y*) + γ²_{n+1} |H(Y_n, Z_{n+1})|².

Since Y_n is F_n-measurable, hence independent of Z_{n+1}, we get

E|H(Y_n, Z_{n+1})|² = E[ E( |H(Y_n, Z_{n+1})|² | F_n ) ] ≤ 2 C² (1 + E|Y_n|²) < +∞.

Now, as Y_n − y* is F_n-measurable, one has

E(H(Y_n, Z_{n+1}) | Y_n − y*) = E[ ( E(H(Y_n, Z_{n+1}) | F_n) | Y_n − y* ) ] = E( h(Y_n) | Y_n − y* ).

This implies

E|Y_{n+1} − y*|² = E|Y_n − y*|² − 2 γ_{n+1} E( h(Y_n) | Y_n − y* ) + γ²_{n+1} E|H(Y_n, Z_{n+1})|²
≤ E|Y_n − y*|² − 2 γ_{n+1} E( h(Y_n) | Y_n − y* ) + 2 γ²_{n+1} C² (1 + E|Y_n|²)
≤ E|Y_n − y*|² − 2 α γ_{n+1} E|Y_n − y*|² + γ²_{n+1} C' (1 + E|Y_n − y*|²)

owing successively to the linear quadratic growth and the strong mean-reverting assumptions. Finally,

E|Y_{n+1} − y*|² ≤ E|Y_n − y*|² (1 − 2 α γ_{n+1} + C' γ²_{n+1}) + C' γ²_{n+1}.

If we set, for every n ≥ 1,

u_n = E|Y_n − y*|² / γ_n,

the above inequality can be rewritten, using the expression for a_n,

u_{n+1} ≤ u_n (γ_n/γ_{n+1}) (1 − 2 α γ_{n+1} + C' γ²_{n+1}) + C' γ_{n+1}
       = u_n ( 1 + γ_{n+1} (a_n + C' γ_n) ) + C' γ_{n+1}.

Let n₀ be an integer such that, for every n ≥ n₀, a_n ≤ −(3/4) κ* and C' γ_n ≤ κ*/4. For these integers n, 1 − (κ*/2) γ_{n+1} > 0 and

u_{n+1} ≤ u_n ( 1 − (κ*/2) γ_{n+1} ) + C' γ_{n+1}.

Then one derives by induction that

∀ n ≥ n₀, 0 ≤ u_n ≤ max( u_{n₀}, 2C'/κ* ),

which completes the proof. ♦

 Exercise. Prove a similar result (under appropriate assumptions) for an algorithm of the form

Y_{n+1} = Y_n − γ_{n+1} ( h(Y_n) + ΔM_{n+1} ),

where h : ℝ^d → ℝ^d is a Borel continuous function and (ΔM_n)_{n≥1} is a sequence of F_n-martingale increments satisfying

|h(y)| ≤ C(1 + |y − y*|) and E( |ΔM_{n+1}|² | F_n ) ≤ C(1 + |Y_n|²).

Application to α-convex optimization

Let L : ℝ^d → ℝ₊ be a twice differentiable convex function with D²L ≥ α I_d, where α > 0 (in the sense that D²L(x) − α I_d is a positive symmetric matrix for every x ∈ ℝ^d). Such a function is sometimes called α-strictly convex. Then it follows from Lemma 6.1(b) that lim_{|y|→+∞} L(y) = +∞ and that, for every x, y ∈ ℝ^d,

( ∇L(x) − ∇L(y) | x − y ) ≥ α |x − y|².

Hence L, being non-negative, attains its minimum at a point y* ∈ ℝ^d. In fact, the above inequality straightforwardly implies that y* is unique since {∇L = 0} is clearly reduced to {y*}.
Moreover, if we assume that ∇L is Lipschitz continuous, then for every y, u ∈ ℝ^d,

0 ≤ u* D²L(y) u = lim_{t→0} ( ∇L(y + tu) − ∇L(y) | u ) / t ≤ [∇L]_Lip |u|² < +∞.

In that framework, the following proposition shows that the “value function” L(Y_n) of the stochastic gradient converges in L¹ at a O(1/n)-rate. This is an easy consequence of the above Proposition 6.11 and a well-known method in Statistics called the Δ-method. Note that, if L is α-strictly convex with a Lipschitz continuous gradient, then L(y) ≍ |y|².

Proposition 6.12 Let L : ℝ^d → ℝ₊ be a twice differentiable α-strictly convex function with a Lipschitz continuous gradient, hence satisfying α I_d ≤ D²L ≤ [∇L]_Lip I_d (α > 0) in the sense of symmetric matrices.
Let (Y_n)_{n≥0} be a stochastic gradient descent associated to L, i.e. a stochastic algorithm defined by (6.3) with mean function h = ∇L. Assume that the L²-linear growth assumption (6.6) on the state function H, the independence assumptions on Y₀ and the innovation sequence (Z_n)_{n≥1} from Proposition 6.11 all hold true. Finally, assume that the decreasing step assumption (6.7) and (G_α) are both satisfied by the sequence (γ_n)_{n≥1}. Then

E L(Y_n) − L(y*) = ‖L(Y_n) − L(y*)‖₁ = O(γ_n).

Proof. It is clear that such a stochastic algorithm satisfies the assumptions of the above Proposition 6.11, especially the strong mean-reverting assumption (6.43), owing to the preliminaries on L that precede the proposition. By the fundamental theorem of Calculus, for every n ≥ 1, there exists a ξ_n ∈ (y*, Y_n) (geometric interval in ℝ^d) such that

L(Y_n) − L(y*) = ( ∇L(y*) | Y_n − y* ) + (1/2) (Y_n − y*)* D²L(ξ_n) (Y_n − y*)
             ≤ (1/2) [∇L]_Lip |Y_n − y*|²,

where we used in the second line that ∇L(y*) = 0 and the above upper bound on D²L. One concludes by taking the expectation in the above inequality and applying Proposition 6.11. ♦
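A minimal numerical sketch of this O(γ_n) rate (the quadratic potential below is a toy assumption, not the book's example): with L(y) = (y − 1)²/2, h = L' and Gaussian innovations, multiplying n by 4 should divide E L(Y_n) − L(y*) by about 4.

```python
import numpy as np

# Stochastic gradient on the toy potential L(y) = 0.5*(y - 1)^2 (alpha = 1),
# with H(y, z) = (y - 1) + z, z ~ N(0, 1): h = L' and y* = 1, L(y*) = 0.
rng = np.random.default_rng(1)
M, gamma1 = 4000, 1.0                    # gamma1 > 1/(2*alpha)
Y = np.zeros(M)
errs = {}
for n in range(1, 4001):
    Y -= (gamma1 / n) * ((Y - 1.0) + rng.standard_normal(M))
    if n in (1000, 4000):
        errs[n] = np.mean(0.5 * (Y - 1.0) ** 2)  # estimate of E L(Y_n) - L(y*)
ratio = errs[1000] / errs[4000]
print(ratio)  # O(gamma_n) = O(1/n): the ratio should be close to 4
```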

 Exercise. Prove a similar result for a pseudo-stochastic gradient under appropriate assumptions on the mean function h.

6.4.3 Weak Rate of Convergence: CLT

We showed in Proposition 6.11 that, under natural assumptions including a strong mean-reverting assumption, a stochastic algorithm converges in L² to its target at a √γ_n-rate. Then we saw that this suggests using steps of the form γ_n = c/(n+b), provided c is not too small (see the Practitioner's corner that follows the proof of the proposition). Such a result corresponds to the Law of Large Numbers in quadratic mean and one may reasonably guess that a Central Limit Theorem can be established. In fact, the mean-reverting (or coercivity) assumption can be localized at the target y*, leading to a CLT at a √γ_n-rate, namely that (Y_n − y*)/√γ_n converges in distribution to some normal distribution involving the dispersion matrix Σ_H(y*) = E[ H(y*, Z) H(y*, Z)ᵗ ].
The CLT for Stochastic Approximation algorithms has given rise to an extensive literature, starting from the pioneering work by Kushner [179] in the late 1970s (see also [49] for a result with Markovian innovations). We give here a result established by Pelletier in [241], whose main originality is its “locality” in the following sense: a CLT is shown to hold for the stable weak convergence, locally on the convergence set(s) of the algorithm to its equilibrium points. In particular, it covers the case of multi-target algorithms, which often arises, e.g. for stochastic gradient descents associated with non-convex potentials. It could also be of significant help to elucidate the rate of convergence of algorithms with constraints or repeated projections like those introduced by Chen. Such is the case for Arouna's original adaptive variance reduction procedure, for which a CLT has been established in [197] by a direct approach (see also [196] for a rigorous proof of the convergence).

Theorem 6.6 (Pelletier [241]) We consider the stochastic procedure (Y_n)_{n≥0} defined by (6.3). Let y* ∈ {h = 0} be an equilibrium point. We make the following assumptions.
(i) y* is an attractor for ODE_h: y* is a “locally uniform attractor” for ODE_h ≡ ẏ = −h(y) in the following sense: h is differentiable at y* and all the eigenvalues of Dh(y*) have positive real parts.
(ii) Regularity and growth control of H: the function H satisfies the following regularity and growth control properties:

y ↦ E[ H(y, Z) H(y, Z)ᵗ ] is continuous at y*

and

y ↦ E|H(y, Z)|^{2+β} is locally bounded on ℝ^d

for some β > 0.
(iii) Non-degenerate asymptotic variance: assume that the covariance matrix of H(y*, Z),

Σ_H(y*) := E[ H(y*, Z) H(y*, Z)ᵗ ], is positive definite in S(d, ℝ). (6.45)

(iv) Specification of the step sequence: assume that the step γ_n is of the form

γ_n = c/(n^ϑ + b), 1/2 < ϑ ≤ 1, b ≥ 0, c > 0, (6.46)

with

c > 1/(2 ℜe(λ_min)) if ϑ = 1, (6.47)

where λ_min denotes the eigenvalue of Dh(y*) with the lowest real part.
Then the a.s. convergence is ruled on the event A* = {Y_n → y*} by the following stable Central Limit Theorem:

√(n^ϑ) (Y_n − y*) −→ N(0; c Σ) stably in distribution on A*, (6.48)

where

Σ := ∫₀^{+∞} e^{−s ( Dh(y*)ᵗ − I_d/(2c*) )} Σ_H(y*) e^{−s ( Dh(y*) − I_d/(2c*) )} ds,

and c* = c if ϑ = 1 and c* = +∞ otherwise.
Note that the optimal rate is obtained when ϑ = 1, provided c satisfies (6.47).

The stable convergence in distribution – denoted by “stably” in (6.48) – means that there exists an extension (Ω̃, Ã, P̃) of (Ω, A, P) and Z̃ : (Ω̃, Ã, P̃) → ℝ^d with N(0; I_d) distribution such that, for every bounded continuous function f and every A ∈ A,

E[ 1_{A*∩A} f( √n (Y_n − y*) ) ] −→ Ẽ[ 1_{A*∩A} f( √c Σ^{1/2} Z̃ ) ] as n → +∞.

In fact, when the algorithm a.s. converges toward a unique target y*, it has been shown in [203] (see Sect. 11.4) that one may assume that Z̃ and A are independent, so that, in particular, for every A ∈ A with P(A) > 0,

E[ f( √n (Y_n − y*) ) | A ] −→ Ẽ[ f( √c Σ^{1/2} Z̃ ) ] as n → +∞.

The proof is detailed for scalar algorithms but its extension to higher-dimensional procedures is standard, though slightly more technical. In the possibly multi-target setting, like in the above statement, it is most likely that such an improvement still holds true (though no proof is known to us). As A* ∈ A is independent of Z̃, this would read, with the notation of the original theorem: if P(A*) > 0, for every A ∈ A such that P(A ∩ A*) > 0,

E[ f( √n (Y_n − y*) ) | A ∩ A* ] −→ Ẽ[ f( √c Σ^{1/2} Z̃ ) ] as n → +∞.

Remarks. • It is clear that the best rate of convergence is obtained when ϑ = 1, namely √n, with the restriction for practical implementation that the choice of the coefficient c is subject to a constraint involving an unknown lower bound (see the Practitioner's corner below and Sect. 6.4.4 devoted to Ruppert and Polyak's averaging principle).
• When ϑ = 1 and 0 < c ≤ 1/(2 ℜe(λ_min)), other rates are obtained. Thus, if c = 1/(2 ℜe(λ_min)) and the maximal order of the Jordan blocks of λ_min is 1, the above weak convergence rate switches to √(n/log n), with a variance which can again be made explicit.
When 0 < c < 1/(2 ℜe(λ_min)) and λ_min is real, still with Jordan blocks of order 1, under an additional assumption on the “differentiability rate” of h at y*, one can show that there exists a non-zero ℝ^d-valued random vector Ξ such that

Y_n = y* + n^{−c λ_min} ( Ξ + o(1) ) a.s. as n → +∞.

Similar results hold when λ_min is complex, possibly with Jordan blocks of order greater than 1, or when γ_n = o(1/n) but still satisfies the usual decreasing step Assumption (6.7) (like γ_n = c/( log(n+1)(n+b) )).

ℵ Practitioner's corner
 Optimal choice of the step sequence. As mentioned in the theorem itself, the optimal weak rate is obtained for ϑ = 1, provided the step sequence is of the form γ_n = c/(n+b), with c > 1/(2 ℜe(λ_min)). The explicit expression for Σ – which depends on c as well – suggests that there exists an optimal choice of this parameter c minimizing the asymptotic variance of the algorithm.
We will deal with the one-dimensional case to get rid of some technicalities. If d = 1, then λ_min = h'(y*), Σ_H(y*) = Var(H(y*, Z)) and a straightforward computation shows that

c Σ = Var( H(y*, Z) ) c² / ( 2c h'(y*) − 1 ).

This simple function of c attains its minimum on (1/(2h'(y*)), +∞) at

c_opt = 1/h'(y*)

with a resulting asymptotic variance

Var( H(y*, Z) ) / h'(y*)².

One shows that this is the lowest possible variance in such a procedure. Consequently, the best choice for the step sequence (γ_n)_{n≥1} is

γ_n := 1/( h'(y*) n ) or, more generally, γ_n := 1/( h'(y*)(n + b) ),

where b can be tuned to “control” the step at the beginning of the simulation when n is small.
At this stage, one encounters the same difficulties as with deterministic procedures since, y* being unknown, h'(y*) is even “more” unknown. One can imagine estimating this quantity with a companion procedure of the algorithm, but this turns out to be not very efficient. A more efficient approach, although not completely satisfactory in practice, is to implement the algorithm in its averaged version (see Sect. 6.4.4 below).
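The one-dimensional variance formula can be probed by simulation. The sketch below is a hedged illustration (the procedure H(y, z) = y − z is an assumption, not the book's example): here h(y) = y, y* = 0, h'(y*) = 1, Var H(y*, Z) = 1, so c_opt = 1 and the asymptotic variance of √n Y_n should be close to c²/(2c − 1).

```python
import numpy as np

# Empirical check of the asymptotic variance c^2 Var(H(y*,Z)) / (2c h'(y*) - 1)
# on the toy procedure H(y, z) = y - z, z ~ N(0, 1): h(y) = y, y* = 0.
rng = np.random.default_rng(2)
M, n_iter = 20000, 2000
emp_var = {}
for c in (1.0, 2.0):                     # c_opt = 1/h'(y*) = 1 minimizes the variance
    Y = np.zeros(M)
    for n in range(1, n_iter + 1):
        Y -= (c / n) * (Y - rng.standard_normal(M))
    emp_var[c] = n_iter * np.mean(Y ** 2)    # empirical Var(sqrt(n) * Y_n)
print(emp_var)  # predictions: 1.0 for c = 1, 4/3 for c = 2
```

Taking c larger than c_opt (here c = 2) visibly inflates the asymptotic variance, in line with the formula.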
 Exploration vs convergence rate. However, one must keep in mind that this tuning of the step sequence is intended to optimize the rate of convergence of the algorithm during its final convergence phase. In real applications, this class of recursive procedures spends most of its time “exploring” the state space before getting trapped in some attracting basin (which can be the basin of a local minimum in the case of multiple critical points). The CLT rate occurs once the algorithm is trapped.
 Simulated annealing. An alternative to these procedures is to implement a simulated annealing procedure which “super-excites” the algorithm using an exogenous simulated noise in order to improve the efficiency of the exploring phase (see [103, 104, 241]). Thus, when the mean function h is a gradient (h = ∇L), it finally converges – but only in probability – to the true minimum of the potential/Lyapunov function L. However, the final convergence rate is worse owing to the additional exciting noise, which slows down the procedure in its convergence phase. Practitioners often use the above Robbins–Monro or stochastic gradient procedure with a sequence of steps (γ_n)_{n≥1} which decreases to a positive limit γ.
Proving a CLT
We now prove this CLT in the 1D-framework when the algorithm a.s. converges toward a unique “target” y*. Our method of proof is the so-called SDE method, which heavily relies on functional weak convergence arguments. We will have to admit a few important results about weak convergence of processes, for which we will provide precise references. An alternative proof is possible, based on the CLT for triangular arrays of martingale increments (see [142], see also Theorem 12.8 for a statement in the Miscellany Chapter). Such an alternative proof – in a one-dimensional setting – can be found in [203].
We propose below a proof of Theorem 6.6, dealing only with the regular weak convergence in the case of an a.s. convergence toward a unique equilibrium point y*. The extension to a multi-target algorithm is not much more involved and we refer to the original paper [241]. Before getting onto the proof, we need to recall the discrete time Burkholder–Davis–Gundy (B.D.G.) Inequality and the Marcinkiewicz–Zygmund Inequality for discrete time martingales. We refer to [263] (p. 499) for a proof and various developments.

Burkholder–Davis–Gundy Inequality (discrete time) and Marcinkiewicz–Zygmund Inequality. Let p ∈ [1, +∞). There exist universal real constants c_p and C_p > 0 such that, for every sequence (ΔM_n)_{n≥1} of (F_n)_{n≥1}-martingale increments and every n ≥ 1,

c_p ‖ ( Σ_{k=1}^{n} (ΔM_k)² )^{1/2} ‖_p ≤ ‖ max_{k=1,…,n} | Σ_{ℓ=1}^{k} ΔM_ℓ | ‖_p ≤ C_p ‖ ( Σ_{k=1}^{n} (ΔM_k)² )^{1/2} ‖_p. (6.49)

If p > 1, one also has, for every n ≥ 1,

c_p ‖ ( Σ_{k=1}^{n} (ΔM_k)² )^{1/2} ‖_p ≤ ‖ Σ_{k=1}^{n} ΔM_k ‖_p ≤ C_p ‖ ( Σ_{k=1}^{n} (ΔM_k)² )^{1/2} ‖_p. (6.50)

The left inequality in (6.50) remains true for p = 1 if the random variables (ΔM_n)_{n≥1} are independent. Then the inequality takes the name of Marcinkiewicz–Zygmund Inequality.
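For p = 2, the two sides of (6.50) collapse to the orthogonality identity E(Σ_k ΔM_k)² = Σ_k E(ΔM_k)² (with c₂ = C₂ = 1), since martingale increments are uncorrelated. A quick Monte Carlo check on i.i.d. centered increments (an illustrative assumption):

```python
import numpy as np

# For p = 2, (6.50) reduces to the orthogonality identity
#   E (sum_k DeltaM_k)^2 = sum_k E (DeltaM_k)^2;
# here 20 i.i.d. N(0,1) increments per path, so the right-hand side is 20.
rng = np.random.default_rng(3)
X = rng.standard_normal((100000, 20))
lhs = np.mean(X.sum(axis=1) ** 2)    # Monte Carlo estimate of E (sum)^2
print(lhs)                           # should be close to 20
```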

Proof of Theorem 6.6. We first introduce the augmented filtration of the innovations defined by F_n = σ(Y₀, Z₁, …, Z_n), n ≥ 0. It is clear by induction that (Y_n)_{n≥0} is (F_n)_{n≥0}-adapted (in what follows we will sometimes use a random variable Z ᵈ= Z₁). We first rewrite the recursion satisfied by the algorithm in its canonical form

Y_n = Y_{n−1} − γ_n ( h(Y_{n−1}) + ΔM_n ), n ≥ 1,

where

ΔM_n = H(Y_{n−1}, Z_n) − h(Y_{n−1}), n ≥ 1,

is a sequence of F_n-martingale increments. The so-called SDE method is based on the same principle as the ODE method but with the quantity of interest

ϒ_n := (Y_n − y*)/√γ_n, n ≥ 1.

This normalization is strongly suggested by the above L²-convergence rate theorem. The underlying idea is to write a recursion on ϒ_n which appears as an Euler scheme with decreasing step γ_n of an SDE having N(0; Σ) as a stationary/steady regime.
Step 1 (Toward the SDE). As announced, we assume that

Y_n −→ y* ∈ {h = 0} a.s.

We may assume (up to the change of variable resulting from the translation y ← y − y*) that

y* = 0.

The differentiability of h at y* = 0 reads

h(Y_n) = Y_n ( h'(0) + η(Y_n) ) with lim_{y→0} η(y) = η(0) = 0.

Moreover, the function η is locally bounded on the real line owing to the growth assumption made on ‖H(y, Z)‖₂, which implies that h is locally bounded too owing to Jensen's inequality. For every n ≥ 1, we have

ϒ_{n+1} = √(γ_n/γ_{n+1}) ϒ_n − √γ_{n+1} ( h(Y_n) + ΔM_{n+1} )
       = √(γ_n/γ_{n+1}) ϒ_n − √(γ_{n+1} γ_n) ϒ_n ( h'(0) + η(Y_n) ) − √γ_{n+1} ΔM_{n+1}
       = ϒ_n − γ_{n+1} [ √(γ_n/γ_{n+1}) ϒ_n ( h'(0) + η(Y_n) ) − (1/γ_{n+1}) ( √(γ_n/γ_{n+1}) − 1 ) ϒ_n ] − √γ_{n+1} ΔM_{n+1}.

Assume that the sequence (γ_n)_{n≥1} is such that there exists a c ∈ (0, +∞] satisfying

lim_n a_n = lim_n (1/γ_{n+1}) ( √(γ_n/γ_{n+1}) − 1 ) = 1/(2c).

Note that this implies lim_n γ_n/γ_{n+1} = 1. One easily checks that this condition is satisfied by our two families of step sequences of interest since
– if γ_n = c/(b+n), c > 0, b ≥ 0, then lim_n a_n = 1/(2c) > 0,
– if γ_n = c/n^ϑ, c > 0, 1/2 < ϑ < 1, then lim_n a_n = 0, i.e. c = +∞.
Consequently, for every n ≥ 1,

ϒ_{n+1} = ϒ_n − γ_{n+1} ϒ_n ( h'(0) − 1/(2c) + α_n¹ + α_n² η(Y_n) ) − √γ_{n+1} ΔM_{n+1},

where (α_n^i)_{n≥1}, i = 1, 2, are two deterministic sequences such that α_n¹ → 0 and α_n² → 1 as n → +∞.
Step 2 (Localization(s)). Since Y_n → 0 a.s., one can write the scenarii space Ω as follows:

∀ ε > 0, Ω = ∪_{N≥1} Ω_{ε,N} a.s., where Ω_{ε,N} := { sup_{n≥N} |Y_n| ≤ ε }.

Let ε > 0 and N ≥ 1 be temporarily free parameters. We define the function h̃ = h̃_ε by

∀ y ∈ ℝ, h̃(y) = h(y) 1_{|y|≤ε} + K y 1_{|y|>ε}

(K = K(ε) is also a parameter to be specified further on) and

Y_N^{ε,N} = Y_N 1_{|Y_N|≤ε},
Y_{n+1}^{ε,N} = Y_n^{ε,N} − γ_{n+1} ( h̃_ε(Y_n^{ε,N}) + 1_{|Y_n^{ε,N}|≤ε} ΔM_{n+1} ), n ≥ N.

It is straightforward to show by induction that, for every ω ∈ Ω_{ε,N},

∀ n ≥ N, Y_n^{ε,N}(ω) = Y_n(ω).

To alleviate notation, we will drop the exponent ε,N in what follows and write Y_n instead of Y_n^{ε,N}.
The mean function and F_n-martingale increments associated to this new algorithm are

h̃ and ΔM̃_{n+1} = 1_{|Y_n|≤ε} ΔM_{n+1}, n ≥ N,

which satisfy

sup_{n≥N} E( |ΔM̃_{n+1}|^{2+β} | F_n ) ≤ 2^{1+β} sup_{|θ|≤ε} E|H(θ, Z)|^{2+β} ≤ A(ε) < +∞ a.s. (6.51)

In what follows, we will study the normalized error defined by

ϒ_n := Y_n/√γ_n, n ≥ N.

Step 3 (Specification of ε and K = K(ε)). We start again from the differentiability of h at 0 (and h'(0) > 0):

h(y) = y ( h'(0) + η(y) ) with lim_{y→0} η(y) = η(0) = 0.

– If γ_n = c/(n+b) with c > 1/(2h'(0)) (and b ≥ 0), we may choose ρ = ρ(h) > 0 small enough so that

c > 1/( 2 h'(0)(1 − ρ) ).

– If γ_n = c/(n+b)^ϑ, 1/2 < ϑ < 1 and, more generally, as soon as lim_n a_n = 0, any choice of ρ ∈ (0, 1) is possible.
Now let ε(ρ) > 0 be such that |y| ≤ ε(ρ) implies |η(y)| ≤ ρ h'(0). It follows that

y h(y) = y² ( h'(0) + η(y) ) ≥ y² (1 − ρ) h'(0) if |y| ≤ ε(ρ).

Now we specify, for the rest of the proof,

ε = ε(ρ) and K = (1 − ρ) h'(0) > 0.

As a consequence, the function h̃ satisfies

∀ y ∈ ℝ, y h̃(y) ≥ K y².

Consequently, since (γ_n)_{n≥1} satisfies (G_α) with α = K and κ* = 2K − 1/c > 0, one derives, following the lines of the proof of Proposition 6.11 (established for Markovian algorithms), that

‖Y_n‖₂ = O(√γ_n).

Step 4 (The SDE method). First we apply Step 1 to our framework and write

ϒ_{n+1} = ϒ_n − γ_{n+1} ϒ_n ( h'(0) − 1/(2c) + α_n¹ + α_n² η(Y_n) ) − √γ_{n+1} ΔM̃_{n+1}, (6.52)

where α_n¹ → 0 and α_n² → 1 are two deterministic sequences and η is now a bounded function.
At this stage, we want to re-write the above recursive equation in continuous time, exactly like we did for the ODE method. To this end, we first set

Γ_n = γ₁ + ⋯ + γ_n, n ≥ N,

and

∀ t ∈ ℝ₊, N(t) = min{ k : Γ_{k+1} ≥ t }, t̲ = Γ_{N(t)}.

To alleviate the notation we also set a = h'(0) − 1/(2c) > 0. We first define the càdlàg function ϒ^{(0)} on [Γ_N, +∞) by setting

ϒ^{(0)}_{(t)} = ϒ_n, t ∈ [Γ_n, Γ_{n+1}), n ≥ N,

so that, in particular, ϒ^{(0)}_{(t)} = ϒ^{(0)}_{(t̲)}. Following the strategy adopted for the ODE method, one expands the recursion (6.52) in order to obtain

ϒ^{(0)}_{(Γ_n)} = ϒ_N − ∫_{Γ_N}^{Γ_n} ϒ^{(0)}_{(s̲)} ( a + α¹_{N(s)} + α²_{N(s)} η(Y_{N(s)}) ) ds − Σ_{k=N+1}^{n} √γ_k ΔM̃_k.

As a consequence, one has, for every t ≥ Γ_N,

ϒ^{(0)}_{(t)} = ϒ_N − ∫_{Γ_N}^{t̲} ϒ^{(0)}_{(s̲)} ( a + α¹_{N(s)} + α²_{N(s)} η(Y_{N(s)}) ) ds − Σ_{k=N+1}^{N(t)} √γ_k ΔM̃_k. (6.53)

Still like for the ODE, we are interested in the functional asymptotics at infinity of ϒ^{(0)}_{(t)}, this time in a weak sense. To this end, we introduce, for every n ≥ N and every t ≥ 0, the sequence of time-shifted functions, defined this time on ℝ₊:

ϒ^{(n)}_{(t)} = ϒ^{(0)}_{(Γ_n + t)}, t ≥ 0, n ≥ N.

It follows from (6.53) that these processes satisfy, for every t ≥ 0,

ϒ^{(n)}_{(t)} = ϒ_n − ∫_{Γ_n}^{Γ_n + t} ϒ^{(0)}_{(s̲)} ( a + α¹_{N(s)} + α²_{N(s)} η(Y_{N(s)}) ) ds − Σ_{k=n+1}^{N(Γ_n + t)} √γ_k ΔM̃_k
        =: ϒ_n − A^{(n)}_{(t)} − M^{(n)}_{(t)}. (6.54)
Step 5 (Functional tightness). At this stage, we need two fundamental results about functional weak convergence. The first is a criterion which implies the functional tightness of the distributions of a sequence of càdlàg processes X^{(n)} (viewed as probability measures on the space D(ℝ₊, ℝ) of càdlàg functions from ℝ₊ to ℝ). The second is an extension of Donsker's Theorem to sequences of martingales.
We recall the definition of the uniform continuity modulus, defined for every function f : ℝ₊ → ℝ and δ, T > 0 by

w(f, δ, T) = sup_{ s,t ∈ [0,T], |s−t| ≤ δ } | f(t) − f(s) |.

The terminology comes from the seminal property of this modulus: f is (uniformly) continuous over [0, T] if and only if lim_{δ→0} w(f, δ, T) = 0.

Theorem 6.7 (C-tightness criterion, see [45], Theorem 15.5, p. 127) Let (X_t^n)_{t≥0}, n ≥ 1, be a sequence of càdlàg processes null at t = 0.
(a) If, for every T > 0 and every ε > 0,

lim_{δ→0} lim sup_n P( w(X^{(n)}, δ, T) ≥ ε ) = 0, (6.55)

then the sequence (X^n)_{n≥1} is C-tight in the following sense: from any subsequence (X^{n'})_{n'} one may extract a further subsequence (X^{n''})_{n''} such that X^{n''} converges in distribution toward a process X^∞ with respect to the weak topology on the space D(ℝ₊, ℝ) induced by the topology of uniform convergence on compact sets (10), such that P( X^∞ ∈ C(ℝ₊, ℝ) ) = 1.
(b) A criterion (see [45], proof of Theorem 8.3, p. 56): if, for every T > 0 and every ε > 0,

lim_{δ→0} lim sup_n sup_{s∈[0,T]} (1/δ) P( sup_{s≤t≤s+δ} | X_t^{(n)} − X_s^{(n)} | ≥ ε ) = 0, (6.56)

then the above condition (6.55) in (a) is satisfied.

The second theorem below provides a tightness criterion for a sequence of martingales based on the sequence of their bracket processes.

Theorem 6.8 (Weak functional limit of a sequence of martingales, see [154]) Let (M^n_{(t)})_{t≥0}, n ≥ 1, be a C-tight sequence of càdlàg (local) martingales, null at 0, with (existing) predictable bracket processes ⟨M^n⟩. If

∀ t ≥ 0, ⟨M^n⟩_{(t)} −→ σ² t a.s. as n → +∞, σ > 0,

then

M^n −→ σ W in distribution on D(ℝ₊, ℝ),

where W denotes a standard Brownian motion (11).

10 Although this topology is not standard on this space, it is simply defined sequentially by: X^n ⇒ X^∞ if, for every bounded functional F : D(ℝ₊, ℝ) → ℝ continuous for the ‖·‖_sup-norm, E F(X^n) → E F(X^∞).


Now we can apply these results to the processes A^{(n)}_{(t)} and M^{(n)}_{(t)}.
First, we aim at showing that the sequence of continuous processes (A^{(n)})_{n≥N} is C-tight. Since sup_{n≥N, t≥0} |η(Y_{N(t)})| ≤ ‖η‖_sup < +∞, there exists a real constant C = C_{‖η‖_sup, ‖α¹‖_sup, ‖α²‖_sup} > 0 such that the sequence (A^{(n)})_{n≥N} of time integrals satisfies, for every s ≥ 0 and every δ > 0,

sup_{s≤t≤s+δ} | A^{(n)}_{(t)} − A^{(n)}_{(s)} | ≤ C ∫_{s+Γ_n}^{s+δ+Γ_n} | ϒ^{(0)}_{(u̲)} | du.

Hence, owing to the Schwarz Inequality,

sup_{s≤t≤s+δ} | A^{(n)}_{(t)} − A^{(n)}_{(s)} |² ≤ C² ( δ + γ_{N(s+δ+Γ_n)+1} ) ∫_{s+Γ_n}^{s+δ+Γ_n} | ϒ^{(0)}_{(u̲)} |² du

so that

E[ sup_{s≤t≤s+δ} | A^{(n)}_{(t)} − A^{(n)}_{(s)} |² ] ≤ C² sup_{n≥N} ‖ϒ_n‖₂² × ( δ + γ_{N(s+δ+Γ_n)+1} )².

Hence, for every n ≥ N and every s ∈ [0, T],

(1/δ) P( sup_{s≤t≤s+δ} | A^{(n)}_{(t)} − A^{(n)}_{(s)} | ≥ ε ) ≤ C² sup_{n≥N} ‖ϒ_n‖₂² ( δ + γ_{N(T+δ+Γ_n)+1} )² / ( δ ε² ).

Noting that lim_{n→+∞} γ_{N(T+δ+Γ_n)+1} = 0, one derives that Criterion (6.56) is satisfied. Hence, the sequence (A^{(n)})_{n≥N} is C-tight by applying the above Theorem 6.7.
Now, we deal with the martingales M^{(n)}, n ≥ N. Let us consider the filtration F_t^{(0)} = F_n, t ∈ [Γ_n, Γ_{n+1}). We define M^{(0)} by

M^{(0)}_{(t)} = 0 if t ∈ [0, Γ_N] and M^{(0)}_{(t)} = Σ_{k=N+1}^{N(t)} √γ_k ΔM̃_k if t ∈ [Γ_N, +∞).

It is clear that (M^{(0)}_{(t)})_{t≥0} is an (F_t^{(0)})_{t≥0}-martingale. Moreover, we know from (6.51) that sup_n E|ΔM̃_n|^{2+β} ≤ A(ε) < +∞.

11 This means that, for every bounded functional F : D(ℝ₊, ℝ) → ℝ, measurable with respect to the σ-field spanned by the finite projections α ↦ α(t), t ∈ ℝ₊, and continuous at every α ∈ C(ℝ₊, ℝ), one has E F(M^n) → E F(σW) as n → +∞. In fact, this remains true for measurable functionals F which are P_{σW}(dα)-a.s. continuous on C(ℝ₊, ℝ) and such that (F(M^n))_{n≥1} is uniformly integrable.
It follows from the Burkholder–Davis–Gundy Inequality (6.49) that, for every s ∈ [Γ_N, +∞),

E[ sup_{s≤t≤s+δ} | M^{(0)}_{(t)} − M^{(0)}_{(s)} |^{2+β} ] ≤ C_β E( Σ_{k=N(s)+1}^{N(s+δ)} γ_k (ΔM̃_k)² )^{1+β/2}

≤ C_β ( Σ_{k=N(s)+1}^{N(s+δ)} γ_k )^{1+β/2} E[ ( Σ_{k=N(s)+1}^{N(s+δ)} γ_k (ΔM̃_k)² / Σ_{k=N(s)+1}^{N(s+δ)} γ_k )^{1+β/2} ]

≤ C_β ( Σ_{k=N(s)+1}^{N(s+δ)} γ_k )^{1+β/2} E[ Σ_{k=N(s)+1}^{N(s+δ)} γ_k |ΔM̃_k|^{2+β} / Σ_{k=N(s)+1}^{N(s+δ)} γ_k ]

≤ C_β ( Σ_{k=N(s)+1}^{N(s+δ)} γ_k )^{β/2} Σ_{k=N(s)+1}^{N(s+δ)} γ_k E|ΔM̃_k|^{2+β},

where C_β is a positive real constant and where we used Jensen's Inequality applied to the convex function u ↦ u^{1+β/2} and the weights γ_k / Σ γ_k. One finally derives that, for every s ∈ [Γ_N, +∞),

E[ sup_{s≤t≤s+δ} | M^{(0)}_{(t)} − M^{(0)}_{(s)} |^{2+β} ] ≤ C_β A(ε) ( Σ_{k=N(s)+1}^{N(s+δ)} γ_k )^{1+β/2}
≤ C_β A(ε) ( δ + sup_{k≥N(s)+1} γ_k )^{1+β/2}.

Noting that M^{(n)}_{(t)} = M^{(0)}_{(Γ_n+t)} − M^{(0)}_{(Γ_n)}, t ≥ 0, n ≥ N, we derive

∀ n ≥ N, ∀ s ≥ 0, E[ sup_{s≤t≤s+δ} | M^{(n)}_{(t)} − M^{(n)}_{(s)} |^{2+β} ] ≤ C_β A(ε) ( δ + sup_{k≥n+1} γ_k )^{1+β/2}.

Then, by Markov's Inequality, we have, for every ε > 0 and T > 0,

lim sup_n sup_{s∈[0,T]} (1/δ) P( sup_{s≤t≤s+δ} | M^{(n)}_{(t)} − M^{(n)}_{(s)} | ≥ ε ) ≤ C_β A(ε) δ^{β/2} / ε^{2+β}.

The C-tightness of the sequence (M (n) )n≥N follows again from Theorem 6.7(b).
Furthermore, for every n ≥ N,

⟨M^{(n)}⟩_t = Σ_{k=n+1}^{N(Γ_n+t)} γ_k E( (ΔM̃_k)² | F_{k−1} )
          = Σ_{k=n+1}^{N(Γ_n+t)} γ_k ( E[ H(y, Z)² ]_{|y=Y_{k−1}} − h̃(Y_{k−1})² )
          ∼ ( E[H(0, Z)²] − h(0)² ) Σ_{k=n+1}^{N(Γ_n+t)} γ_k as n → +∞,

since y ↦ E H(y, Z)² and h are both continuous at y* = 0 and Y_k → y* as k → +∞. Using that h(0) = 0 and that Σ_{k=n+1}^{N(Γ_n+t)} γ_k → t, it follows that

⟨M^{(n)}⟩_t −→ E[H(0, Z)²] × t as n → +∞.

Setting σ² = E[H(0, Z)²], Theorem 6.8 then implies

M^{(n)} −→ σ W^{(∞)} in distribution on C(ℝ₊, ℝ),

where W^{(∞)} is a standard Brownian motion.


Step 6 (Synthesis and conclusion). The sequence of processes (ϒ^{(n)}_{(t)})_{t≥0}, n ≥ N, satisfies, for every n ≥ N,

∀ t ≥ 0, ϒ^{(n)}_{(t)} = ϒ_n − A^{(n)}_{(t)} − M^{(n)}_{(t)}.

The sequence of random variables (ϒ_n)_{n≥N} is tight since it is L²-bounded. Consequently, since C-tightness is stable under addition and (A^{(n)})_{n≥N}, (M^{(n)})_{n≥N} are both C-tight, the sequence of processes (ϒ^{(n)}, M^{(n)})_{n≥N} is C-tight as well.
Now let us elucidate the limit of (A^{(n)}_{(t)})_{n≥N}:

sup_{t∈[0,T]} | A^{(n)}_t − a ∫_{Γ_n}^{Γ_n+t} ϒ^{(0)}_{(s̲)} ds | ≤ ã_n ∫_0^T | ϒ^{(n)}_{(s̲)} | ds,

where ã_n = sup_{k≥n} |α_k¹| + ‖α²‖_sup sup_{k≥n} |η(Y_k)|. As η(Y_n) → 0 a.s., we derive that ã_n → 0 a.s., whereas ∫_0^T |ϒ^{(n)}_{(s̲)}| ds is L¹-bounded since

E ∫_0^T | ϒ^{(n)}_{(s̲)} | ds ≤ sup_{n≥N} ‖ϒ_n‖₂ T < +∞

(we used that ‖·‖₁ ≤ ‖·‖₂). Hence we obtain by Slutsky's Theorem that

sup_{t∈[0,T]} | A^{(n)}_t − a ∫_{Γ_n}^{Γ_n+t} ϒ^{(0)}_{(s̲)} ds | → 0 in probability.

On the other hand,

sup_{t∈[0,T]} | ∫_{(Γ_n+t)̲}^{Γ_n+t} ϒ^{(0)}_{(s̲)} ds |² ≤ sup_{t∈[0,T]} ( Γ_n+t − (Γ_n+t)̲ ) ∫_{Γ_n}^{Γ_n+T} | ϒ^{(0)}_{(s̲)} |² ds ≤ sup_{k≥n} γ_k ∫_0^T | ϒ^{(n)}_{(s̲)} |² ds,

which implies likewise that sup_{t∈[0,T]} | ∫_{(Γ_n+t)̲}^{Γ_n+t} ϒ^{(0)}_{(s̲)} ds | → 0 in probability as n → +∞. Consequently, for every T > 0,

sup_{t∈[0,T]} | A^{(n)}_t − a ∫_0^t ϒ^{(n)}_{(s)} ds | → 0 in probability as n → +∞. (6.57)

Let (ϒ^{(∞)}_{(t)}, M^{(∞)}_{(t)})_{t≥0} be a weak functional limiting value of (ϒ^{(n)}, M^{(n)})_{n≥N}, i.e. a weak limit along a subsequence (ϕ(n)). It follows from (6.54) that

ϒ^{(n)} − ( ϒ^{(n)}_{(0)} − a ∫_0^{·} ϒ^{(n)}_{(s)} ds ) = −M^{(n)} − ( A^{(n)} − a ∫_0^{·} ϒ^{(n)}_{(s)} ds ).

Using that the mapping (x, y) ↦ ( x − x(0) + a ∫_0^{·} x(s) ds, y ) is clearly continuous for the ‖·‖_sup-norm, and (6.57), it follows that

ϒ^{(∞)}_{(t)} − ϒ^{(∞)}_{(0)} + a ∫_0^t ϒ^{(∞)}_{(s)} ds = σ W_t^{(∞)}, t ≥ 0

(up to replacing W^{(∞)} by −W^{(∞)}, still a standard Brownian motion). This means that ϒ^{(∞)} is a solution to the Ornstein–Uhlenbeck SDE

dϒ^{(∞)}_{(t)} = −a ϒ^{(∞)}_{(t)} dt + σ dW_t^{(∞)} (6.58)

starting from a random variable ϒ^{(∞)}_{(0)} ∈ L² such that

‖ ϒ^{(∞)}_{(0)} ‖₂ ≤ sup_{n≥N} ‖ϒ_n‖₂.

This follows from the weak Fatou Lemma for convergence in distribution (see Theorem 12.6(v)) since ϒ_{ϕ(n)} → ϒ^{(∞)}_{(0)} in distribution. Let ν₀ be a weak limiting value of (ϒ_n)_{n≥N}, i.e. such that ϒ_{ψ(n)} ⇒ ν₀.
For every t > 0, one considers the sequence of integers ψ_t(n) defined by

ψ_t(n) := N( Γ_{ψ(n)} − t ).

Up to a new extraction, we may assume that we simultaneously have the convergence

ϒ^{(ψ(n))} −→ ϒ^{(∞,0)} in distribution, starting from ϒ^{(∞,0)}_{(0)} ᵈ= ν₀,

and

ϒ^{(ψ_t(n))} −→ ϒ^{(∞,−t)} in distribution, starting from ϒ^{(∞,−t)}_{(0)} ᵈ= ν_{−t}.

One checks by strong uniqueness of solutions of the above Ornstein–Uhlenbeck SDE (6.58) that

ϒ^{(∞,−t)}_{(t)} ᵈ= ϒ^{(∞,0)}_{(0)}.

Now, let (Pt )t≥0 denote the semi-group of the Ornstein–Uhlenbeck process defined
on bounded Borel functions f : R → R by Pt f (x) = E f (Xtx ). From the preceding,
for every t ≥ 0,
ν0 = ν−t Pt .

Moreover, $(\nu_{-t})_{t\ge0}$ is tight since it is $L^2$-bounded. Let $\nu_{-\infty}$ be a weak limiting value of $\nu_{-t}$ as $t\to+\infty$.
Let $\Upsilon^{\mu}_{(t)}$ denote a solution to (6.58) starting from a $\mu$-distributed random variable independent of $W$. It is straightforward that its paths satisfy the confluence property
$$\big|\Upsilon^{\mu}_t-\Upsilon^{\mu'}_t\big|\le\big|\Upsilon^{\mu}_0-\Upsilon^{\mu'}_0\big|\,e^{-at}.$$
For every Lipschitz continuous function $f$ with compact support,
$$\begin{aligned}
\big|\nu_{-\infty}P_t(f)-\nu_{-t}P_t(f)\big|&=\big|\mathbb E\,f\big(\Upsilon^{\nu_{-\infty}}_{(t)}\big)-\mathbb E\,f\big(\Upsilon^{\nu_{-t}}_{(t)}\big)\big|\\
&\le[f]_{\rm Lip}\,\mathbb E\,\big|\Upsilon^{\nu_{-\infty}}_{(t)}-\Upsilon^{\nu_{-t}}_{(t)}\big|\\
&\le[f]_{\rm Lip}\,e^{-at}\,\mathbb E\,\big|\Upsilon^{\nu_{-\infty}}_{(0)}-\Upsilon^{\nu_{-t}}_{(0)}\big|\\
&\le[f]_{\rm Lip}\,e^{-at}\,\big\|\Upsilon^{\nu_{-\infty}}_{(0)}-\Upsilon^{\nu_{-t}}_{(0)}\big\|_2\\
&\le2\,[f]_{\rm Lip}\,e^{-at}\sup_{n\ge N}\|\Upsilon_n\|_2\longrightarrow0\ \text{as }t\to+\infty,
\end{aligned}$$
where we used in the penultimate line that $\sup_{n\ge N}\|\Upsilon_n\|_2<+\infty$.
Consequently,
$$\nu_0=\lim_{t\to+\infty}\nu_{-\infty}P_t=\mathcal N\Big(0;\frac{\sigma^2}{2a}\Big).$$
We have proved that the distribution $\mathcal N\big(0;\frac{\sigma^2}{2a}\big)$ is the only possible limiting value, hence
$$\Upsilon_n\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{\sigma^2}{2a}\Big)\quad\text{as }n\to+\infty.$$
Now we return to $\Upsilon_n$ (prior to the localization). We just proved that for $\varepsilon=\varepsilon(\rho)$ and for every $N\ge1$,
$$\Upsilon^{\varepsilon,N}_n\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{\sigma^2}{2a}\Big)\quad\text{as }n\to+\infty.\tag{6.59}$$
On the other hand, we already saw that $Y_n\to0$ a.s. implies that $\Omega=\bigcup_{N\ge1}\Omega_{\varepsilon,N}$ a.s., where $\Omega_{\varepsilon,N}=\big\{Y^{\varepsilon,N}_n=Y_n,\ n\ge N\big\}=\big\{\Upsilon^{\varepsilon,N}_n=\Upsilon_n,\ n\ge N\big\}$. Moreover, the events $\Omega_{\varepsilon,N}$ are non-decreasing as $N$ increases, so that
$$\lim_{N\to+\infty}\mathbb P\big(\Omega_{\varepsilon,N}\big)=1.$$
Owing to the localization principle, for every bounded Borel function $f$,
$$\forall\,n\ge N,\quad\big|\mathbb E\,f(\Upsilon_n)-\mathbb E\,f(\Upsilon^{\varepsilon,N}_n)\big|\le2\,\|f\|_\infty\,\mathbb P\big(\Omega^c_{\varepsilon,N}\big).$$
Combined with (6.59), if $f$ is continuous and bounded, we get, for every $N\ge1$,
$$\limsup_n\Big|\mathbb E\,f(\Upsilon_n)-\mathbb E\,f\Big(\frac{\sigma}{\sqrt{2a}}\,\zeta\Big)\Big|\le2\,\|f\|_\infty\,\mathbb P\big(\Omega^c_{\varepsilon,N}\big),$$
where $\zeta\stackrel{d}{=}\mathcal N(0;1)$. The result follows by letting $N$ go to infinity and observing that for every bounded continuous function $f$,
$$\lim_n\mathbb E\,f(\Upsilon_n)=\mathbb E\,f\Big(\frac{\sigma}{\sqrt{2a}}\,\zeta\Big),$$
i.e. $\Upsilon_n\xrightarrow{\mathcal L}\mathcal N\big(0;\frac{\sigma^2}{2a}\big)$ as $n\to+\infty$. ♦

6.4.4 The Averaging Principle for Stochastic Approximation

Practical implementations of recursive stochastic algorithms show that the convergence, although ruled by a CLT, is chaotic, even in the final convergence phase, except if the step is optimized to produce the lowest asymptotic variance. Of course, this optimal choice is not realistic in practice since it requires an a priori knowledge of what we are trying to compute.

The original motivation to introduce the averaging principle was to “smoothen”


the behavior of a converging stochastic algorithm by considering the arithmetic mean
of the past values up to the n-th iteration rather than the computed value at the n-th
iteration. In fact, we will see that, if this averaging procedure is combined with the
use of a “slowly decreasing” step parameter γn , one attains for free the best possible
rate of convergence!
To be precise: let $(\gamma_n)_{n\ge1}$ be a step sequence satisfying
$$\gamma_n=\frac{c}{n^\vartheta+b},\qquad\vartheta\in(1/2,1),\ c>0,\ b\ge0.$$
Then, we implement the standard recursive stochastic algorithm (6.3) and set
$$\bar Y_n:=\frac{Y_0+\cdots+Y_{n-1}}{n},\quad n\ge1.$$
Note that, of course, this empirical mean itself satisfies a recursive formula:
$$\forall\,n\ge0,\quad\bar Y_{n+1}=\bar Y_n-\frac{1}{n+1}\big(\bar Y_n-Y_n\big),\qquad\bar Y_0=0.$$
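In pseudo-code terms, this averaging is easy to graft onto any Robbins–Monro recursion. The sketch below is a minimal illustration, not taken from the book: the toy mean field $h(y)=y-y_*$, the Gaussian innovations and all numerical values are assumptions made for the example.

```python
import random

def averaged_robbins_monro(n_iter, y_star=1.0, c=1.0, theta=2.0 / 3.0, seed=0):
    """Robbins-Monro recursion Y_{n+1} = Y_n - gamma_{n+1} H(Y_n, Z_{n+1})
    with slowly decreasing step gamma_n = c / n**theta, together with the
    recursively updated empirical mean bar Y_n (Ruppert-Polyak averaging)."""
    rng = random.Random(seed)
    y = 0.0        # Y_0
    y_bar = 0.0    # bar Y_0 = 0, updated as bar Y_{n+1} = bar Y_n - (bar Y_n - Y_n)/(n+1)
    for n in range(n_iter):
        y_bar -= (y_bar - y) / (n + 1)
        gamma = c / (n + 1) ** theta
        z = rng.gauss(0.0, 1.0)
        # toy noisy mean field: H(y, z) = (y - y_star) + z, hence h(y) = y - y_star
        y -= gamma * ((y - y_star) + z)
    return y, y_bar

y_n, y_bar_n = averaged_robbins_monro(200_000)
print(y_n, y_bar_n)
```

On such a toy example one typically observes that the averaged iterate fluctuates much less than the raw one, in accordance with the CLTs discussed in this section.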

By Cesàro’s averaging principle it is clear that, under the assumptions which ensure that $Y_n\to y_*$, one has
$$\bar Y_n\xrightarrow{a.s.}y_*\quad\text{as }n\to+\infty$$
as well. This is even true on the event $\{Y_n\to y_*\}$ (e.g. in the case of multiple targets). What is more unexpected is that, under natural assumptions (see Theorem 6.9 hereafter), the (weak) rate of this convergence is ruled by a CLT
$$\sqrt n\,\big(\bar Y_n-y_*\big)\xrightarrow{\mathcal L_{\rm stably}}\mathcal N(0;\Sigma_*)\quad\text{on the event }\{Y_n\to y_*\},$$
where $\Sigma_*$ is the minimal possible asymptotic variance-covariance matrix. Thus, if $d=1$, $\Sigma_*=\frac{\operatorname{Var}(H(y_*,Z))}{h'(y_*)^2}$, corresponding to the optimal choice of the constant $c$ in the step sequence $\gamma_n=\frac{c}{n+b}$ in the CLT satisfied by the algorithm itself.

As we did for the CLT of the algorithm itself, we again state – and prove – this CLT for the averaged procedure for Markovian algorithms, though it can be done for more general recursive procedures of the form
$$Y_{n+1}=Y_n-\gamma_{n+1}\big(h(Y_n)+\Delta M_{n+1}\big),$$
where $(\Delta M_n)_{n\ge1}$ is an $L^{2+\eta}$-bounded sequence of martingale increments. However, the adaptation of what follows to such a more general setting is an easy exercise.

Theorem 6.9 (Ruppert and Polyak, see [245, 259], see also [81, 241, 246]) Let $H:\mathbb R^d\times\mathbb R^q\to\mathbb R^d$ be a Borel function. Let $(Z_n)_{n\ge1}$ be a sequence of i.i.d. $\mathbb R^q$-valued random vectors defined on a probability space $(\Omega,\mathcal A,\mathbb P)$, independent of $Y_0\in L^2_{\mathbb R^d}(\Omega,\mathcal A,\mathbb P)$. Then, we define the recursive procedure by
$$Y_{n+1}=Y_n-\gamma_{n+1}H(Y_n,Z_{n+1}),\quad n\ge0.$$
Assume that, for every $y\in\mathbb R^d$, $H(y,Z)\in L^1(\mathbb P)$ so that the mean vector field $h(y)=\mathbb E\,H(y,Z)$ is well-defined (this is implied by (iii) below).
We make the following assumptions:
(i) The function $h$ is zero at $y_*$ and is “fast” differentiable at $y_*$, in the sense that
$$\forall\,y\in\mathbb R^d,\quad h(y)=J_h(y_*)(y-y_*)+O\big(|y-y_*|^2\big),$$
where all eigenvalues of the Jacobian matrix $J_h(y_*)$ of $h$ at $y_*$ have a (strictly) positive real part (hence $J_h(y_*)$ is invertible).
(ii) The algorithm $Y_n$ converges toward $y_*$ with positive probability.
(iii) There exists an $\eta>0$ such that
$$\forall\,K>0,\quad\sup_{|y|\le K}\mathbb E\,|H(y,Z)|^{2+\eta}<+\infty.\tag{6.60}$$
(iv) The mapping $y\mapsto\mathbb E\big[H(y,Z)H(y,Z)^t\big]$ is continuous at $y_*$.
Set $\Sigma_*=\mathbb E\big[H(y_*,Z)H(y_*,Z)^t\big]$.
Then, if the step sequence is slowly decreasing of the form $\gamma_n=\frac{c}{n^\vartheta+b}$, $n\ge1$, with $1/2<\vartheta<1$ and $c>0$, $b\ge0$, the empirical mean sequence defined by
$$\bar Y_n=\frac{Y_0+\cdots+Y_{n-1}}{n}$$
satisfies the CLT with the optimal asymptotic variance on the event $\{Y_n\to y_*\}$, namely
$$\sqrt n\,\big(\bar Y_n-y_*\big)\xrightarrow{\mathcal L_{\rm stably}}\mathcal N\Big(0;J_h(y_*)^{-1}\Sigma_*\big(J_h(y_*)^{-1}\big)^t\Big)\quad\text{on }\{Y_n\to y_*\}.$$

Proof (partial). We will prove this theorem in the case of a scalar algorithm, that is we assume $d=1$. Beyond this dimensional limitation, only adopted for notational convenience, we will consider a more restrictive setting than the one proposed in the above statement of the theorem. We refer, for example, to [81] for the general case. In addition to (or instead of) the above assumptions, we assume that:
– the function $H$ satisfies the linear growth assumption
$$|H(y,z)|^2\le C\big(1+|y|^2\big);$$
– the mean function $h$ satisfies the coercivity assumption (6.43) from Proposition 6.11 for some $\alpha>0$ and has a Lipschitz continuous derivative.
At this point, note that the step sequences $(\gamma_n)_{n\ge1}$ under consideration are non-increasing and all satisfy Condition $(G_\alpha)$ of Proposition 6.11 (i.e. (6.44)) with $\alpha$ from (6.43). In particular, this implies that $Y_n\to y_*$ a.s. and $\|Y_n-y_*\|_2=O(\sqrt{\gamma_n})$.
Without loss of generality we may assume that $y_*=0$, by replacing $Y_n$ by $Y_n-y_*$, $H(y,z)$ by $H(y_*+y,z)$, etc. We start from the canonical decomposition
$$\forall\,n\ge0,\quad Y_{n+1}=Y_n-\gamma_{n+1}h(Y_n)-\gamma_{n+1}\Delta M_{n+1},$$
where $\Delta M_{n+1}=H(Y_n,Z_{n+1})-h(Y_n)$, $n\ge0$, is a sequence of $\mathcal F_n$-martingale increments with $\mathcal F_n=\sigma(Y_0,Z_1,\ldots,Z_n)$, $n\ge0$. As $h(0)=0$ and $h'$ is Lipschitz continuous, one has for every $y\in\mathbb R$, $h(y)-h'(0)y=y^2\kappa(y)$ with $|\kappa(y)|\le[h']_{\rm Lip}$. Consequently, for every $k\ge0$,
$$h'(0)Y_k=\frac{Y_k-Y_{k+1}}{\gamma_{k+1}}-\Delta M_{k+1}-Y_k^2\kappa(Y_k),$$
which in turn implies, by summing from $k=0$ up to $n-1$,
$$h'(0)\sqrt n\,\bar Y_n=-\frac1{\sqrt n}\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}-\frac{M_n}{\sqrt n}-\frac1{\sqrt n}\sum_{k=0}^{n-1}Y_k^2\kappa(Y_k),$$
where $M_n=\Delta M_1+\cdots+\Delta M_n$.

We will successively inspect the three sums on the right-hand side of this equation. First, by an Abel transform, we get
$$\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}=\frac{Y_n}{\gamma_n}-\frac{Y_0}{\gamma_1}+\sum_{k=2}^nY_{k-1}\Big(\frac1{\gamma_{k-1}}-\frac1{\gamma_k}\Big).$$
Hence, using that the sequence $\big(\frac1{\gamma_n}\big)_{n\ge1}$ is non-decreasing, we derive
$$\Big|\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}\Big|\le\frac{|Y_n|}{\gamma_n}+\frac{|Y_0|}{\gamma_1}+\sum_{k=2}^n|Y_{k-1}|\Big(\frac1{\gamma_k}-\frac1{\gamma_{k-1}}\Big).$$
Taking expectation and using that $\mathbb E\,|Y_k|\le\|Y_k\|_2\le C\sqrt{\gamma_k}$, $k\ge1$, for some real constant $C>0$, we get
$$\begin{aligned}
\mathbb E\,\Big|\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}\Big|&\le\frac{\mathbb E\,|Y_n|}{\gamma_n}+\frac{\mathbb E\,|Y_0|}{\gamma_1}+\sum_{k=2}^n\mathbb E\,|Y_{k-1}|\Big(\frac1{\gamma_k}-\frac1{\gamma_{k-1}}\Big)\\
&\le\frac{C}{\sqrt{\gamma_n}}+\frac{\mathbb E\,|Y_0|}{\gamma_1}+C\sum_{k=2}^n\sqrt{\gamma_{k-1}}\Big(\frac1{\gamma_k}-\frac1{\gamma_{k-1}}\Big)\\
&=\frac{C}{\sqrt{\gamma_n}}+\frac{\mathbb E\,|Y_0|}{\gamma_1}+C\sum_{k=2}^n\frac1{\sqrt{\gamma_{k-1}}}\Big(\frac{\gamma_{k-1}}{\gamma_k}-1\Big).
\end{aligned}$$

Now, for every $k\ge2$,
$$\frac1{\sqrt{\gamma_{k-1}}}\Big(\frac{\gamma_{k-1}}{\gamma_k}-1\Big)=c^{-\frac12}\big((k-1)^\vartheta+b\big)^{\frac12}\Big(\frac{k^\vartheta+b}{(k-1)^\vartheta+b}-1\Big)\sim c^{-\frac12}\,\vartheta\,k^{\frac\vartheta2-1}\quad\text{as }k\to+\infty,$$
so that
$$\sum_{k=2}^n\frac1{\sqrt{\gamma_{k-1}}}\Big(\frac{\gamma_{k-1}}{\gamma_k}-1\Big)=O\big(n^{\frac\vartheta2}\big).$$
As a consequence, it follows from the obvious facts that $\lim_n n\gamma_n=+\infty$ and $\lim_n n^{\frac{\vartheta-1}2}=0$ that
$$\lim_n\frac1{\sqrt n}\,\mathbb E\,\Big|\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}\Big|=0,\quad\text{i.e.}\quad\frac1{\sqrt n}\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}\xrightarrow{L^1}0\ \text{as }n\to+\infty.$$
The second term involves the martingale $M_n$ obtained by summing the increments $\Delta M_k$. It is this martingale that rules the global weak convergence rate. To analyze it, we rely on Lindeberg’s Theorem 12.8 from the Miscellany Chapter, applied with $a_n=n$. First, if we set $\widetilde H(y)=\mathbb E\big[\big(H(y,Z)-h(y)\big)^2\big]$, it is straightforward that condition (i) of Theorem 12.8 involving the martingale increments is satisfied since
$$\frac{\langle M\rangle_n}{n}=\frac1n\sum_{k=1}^n\widetilde H(Y_{k-1})\xrightarrow{a.s.}\widetilde H(y_*),$$
owing to Cesàro’s Lemma and the fact that $Y_n\to y_*$ a.s. It remains to check condition (ii) of Theorem 12.8, known as Lindeberg’s condition. Let $\varepsilon>0$. Then
$$\mathbb E\big[(\Delta M_k)^2\mathbf 1_{\{|\Delta M_k|\ge\varepsilon\sqrt n\}}\big]\le\varsigma\big(Y_{k-1},\varepsilon\sqrt n\big),$$
where $\varsigma(y,a)=\mathbb E\big[\big(H(y,Z)-h(y)\big)^2\mathbf 1_{\{|H(y,Z)-h(y)|\ge a\}}\big]$, $a>0$. One shows, owing to Assumption (6.60), that $h$ is bounded on compact sets which, in turn, implies that $\big((H(y,Z)-h(y))^2\big)_{|y|\le K}$ makes up a uniformly integrable family. Hence $\varsigma(y,a)\to0$ as $a\to+\infty$ for every $y\in\mathbb R^d$, uniformly on compact sets of $\mathbb R^d$. Since $Y_n\to y_*$ a.s., $(Y_n)_{n\ge0}$ is a.s. bounded, hence
$$\frac1n\sum_{k=1}^n\varsigma\big(Y_{k-1},\varepsilon\sqrt n\big)\le\max_{0\le k\le n-1}\varsigma\big(Y_k,\varepsilon\sqrt n\big)\to0\ \text{a.s. as }n\to+\infty.$$
Hence Lindeberg’s condition is satisfied. As a consequence,
$$\frac1{h'(0)}\,\frac{M_n}{\sqrt n}\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{\widetilde H(y_*)}{h'(0)^2}\Big).$$
The third term can be handled as follows, at least in our strengthened framework. Under Assumption (6.43) of Proposition 6.11, we know that $\mathbb E\,Y_n^2\le C\gamma_n$ for some real constant $C>0$ since the class of step sequences we consider satisfies condition $(G_\alpha)$ (see the Practitioner’s corner after Proposition 6.11). Consequently,
$$\begin{aligned}
\frac1{\sqrt n}\sum_{k=0}^{n-1}\mathbb E\big(Y_k^2|\kappa(Y_k)|\big)&\le\frac{[h']_{\rm Lip}}{\sqrt n}\sum_{k=0}^{n-1}\mathbb E\,Y_k^2\le\frac{C[h']_{\rm Lip}}{\sqrt n}\Big(\mathbb E\,Y_0^2+c\sum_{k=1}^{n-1}k^{-\vartheta}\Big)\\
&\le\frac{C[h']_{\rm Lip}}{\sqrt n}\Big(\mathbb E\,Y_0^2+\frac{c\,n^{1-\vartheta}}{1-\vartheta}\Big)\sim\frac{C\,c\,[h']_{\rm Lip}}{1-\vartheta}\,n^{\frac12-\vartheta}\to0
\end{aligned}$$
as $n\to+\infty$. This implies that $\dfrac1{\sqrt n}\displaystyle\sum_{k=0}^{n-1}Y_k^2|\kappa(Y_k)|\xrightarrow{L^1}0$. Slutsky’s Lemma completes the proof. ♦

Remark. As far as the step sequence is concerned, we only used in the above proof that the step sequence $(\gamma_n)_{n\ge1}$ is non-increasing and satisfies the following three conditions:
$$(G_\alpha),\qquad\sum_n\frac{\gamma_n}{\sqrt n}<+\infty\qquad\text{and}\qquad\lim_n\,n\,\gamma_n=+\infty.$$

Indeed, we have seen in the former section that, in one dimension, the asymptotic variance $\frac{\widetilde H(y_*)}{h'(0)^2}$ obtained in the Ruppert–Polyak theorem is the lowest possible asymptotic variance in the CLT when specifying the step parameter in an optimal way ($\gamma_n=\frac{c_{\rm opt}}{n+b}$). In fact, this discussion and its conclusions can be easily extended to higher dimensions (if one considers some matrix-valued step sequences) as emphasized, for example, in [81].
So, the Ruppert and Polyak averaging principle performs as fast as the “fastest” regular stochastic algorithm with no need to optimize the step sequence: the optimal asymptotic variance is attained for free!

ℵ Practitioner’s corner.  How to choose $\vartheta$? If we carefully inspect the two remainder non-martingale terms in the above proof, we see that they converge in $L^1$ to zero at explicit rates:
$$\frac1{\sqrt n}\sum_{k=1}^n\frac{Y_k-Y_{k-1}}{\gamma_k}=O_{L^1}\big(n^{\frac{\vartheta-1}2}\big)\quad\text{and}\quad\frac1{\sqrt n}\sum_{k=0}^{n-1}Y_k^2|\kappa(Y_k)|=O_{L^1}\big(n^{\frac12-\vartheta}\big).$$
Hence the balance between these two terms is obtained by equalizing the two exponents $\frac12-\vartheta$ and $\frac{\vartheta-1}2$, i.e.
$$\frac12-\vartheta=\frac{\vartheta-1}2\iff\vartheta_{\rm opt}=\frac23.$$
See also [101].
 When to start averaging? In practice, one should not start the averaging at the
true beginning of the procedure but rather wait for its stabilization, ideally once the
“exploration/search” phase is finished. On the other hand, the compromise consisting
in using a moving window (typically of length n after 2n iterations) does not yield
the optimal asymptotic variance, as pointed out in [196].

 Exercises. 1. Test the above averaging principle on the former exercises and “numerical illustrations” by considering $\gamma_n=\gamma_1\,n^{-\frac23}$, $n\ge1$, as suggested in the first item of the above Practitioner’s corner. Compare with a direct approach with a step of the form $\gamma_n=\frac{c'}{n+b'}$, with $c'>0$ “large enough but not too large…”, and $b'\ge0$.
2. Show that, under the (stronger) assumptions that we considered in the proof of the former theorem, Proposition 6.12 holds true with the averaged algorithm $(\bar Y_n)_{n\ge1}$, namely that
$$\mathbb E\,L(\bar Y_n)-L(y_*)=O\Big(\frac1n\Big).$$
6.4.5 Traps (A Few Words About)

In the presence of multiple equilibrium points, i.e. of multiple zeros of the mean
function h, some of them turn out to be parasitic. This is the case for saddle points
or local maxima of the potential function L in the framework of stochastic gradi-
ent descent. More generally any zero of h whose Jacobian Jh (y∗ ) has at least one
eigenvalue with non-positive real part is parasitic.
There is a wide literature on this problem which says, roughly speaking, that
a noisy enough parasitic equilibrium point is a.s. not a possible limit point for a
stochastic approximation procedure. Although natural and expected, such a conclu-
sion is far from being straightforward to establish, as testified by the various works
on the topic (see [191], see also [33, 81, 95, 242], etc.). If the equilibrium is not noisy, many situations may occur, as illustrated by the two-armed bandit algorithm, whose zeros are all noiseless (see [184]).
To some extent local minima are parasitic too but this is another story and stan-
dard stochastic approximation does not provide satisfactory answers to this “second
order problem” for which specific procedures like simulated annealing should be
implemented, with the drawback of degrading the (nature and) rate of convergence.
Going deeper in this direction is beyond the scope of this monograph so we refer
to the literature mentioned above and the references therein for more insight on this
aspect of Stochastic Approximation.

6.4.6 (Back to) VaRα and CVaRα Computation (II): Weak Rate

We can apply both above CLTs to the VaRα and CVaRα$(X)$ algorithms (6.28) and (6.30). Since
$$h(\xi)=\frac1{1-\alpha}\big(F(\xi)-\alpha\big)\quad\text{and}\quad\mathbb E\,H(\xi,X)^2=\frac1{(1-\alpha)^2}\,F(\xi)\big(1-F(\xi)\big),$$
where $F$ is the c.d.f. of $X$, one easily derives from Theorems 6.6 and 6.9 the following results.

Theorem 6.10 Assume that $\mathbb P_X=f(x)\,dx$, where $f$ is a probability density function continuous at $\xi^*_\alpha=\operatorname{VaR}_\alpha(X)$.
(a) If $\gamma_n=\frac{c}{n^\vartheta+b}$, $\frac12<\vartheta<1$, $c>0$, $b\ge0$, then
$$n^{\frac\vartheta2}\big(\xi_n-\xi^*_\alpha\big)\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{c\,\alpha(1-\alpha)}{2f(\xi^*_\alpha)}\Big).$$
(b) If $\gamma_n=\frac{c}{n+b}$, $b\ge0$, and $c>\frac{1-\alpha}{2f(\xi^*_\alpha)}$, then
$$\sqrt n\,\big(\xi_n-\xi^*_\alpha\big)\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{c^2\alpha}{2c\,f(\xi^*_\alpha)-(1-\alpha)}\Big),$$
so that the minimal asymptotic variance is attained with $c^*_\alpha=\frac{1-\alpha}{f(\xi^*_\alpha)}$, with an asymptotic variance equal to $\frac{\alpha(1-\alpha)}{f(\xi^*_\alpha)^2}$.
(c) Ruppert and Polyak’s averaging principle: If $\gamma_n=\frac{c}{n^\vartheta+b}$, $\frac12<\vartheta<1$, $c>0$, $b\ge0$, then
$$\sqrt n\,\big(\bar\xi_n-\xi^*_\alpha\big)\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{\alpha(1-\alpha)}{f(\xi^*_\alpha)^2}\Big),$$
where $\bar\xi_n$ denotes the empirical mean of the $\xi_k$.
The algorithm for the CVaRα$(X)$ satisfies the same kind of CLT.
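For intuition, the quantile-search recursion analyzed above can be sketched as follows. This is a minimal stand-alone illustration, not the book's code: the Gaussian loss $X$, the step choice and all numerical values are assumptions made for the example.

```python
import random

def recursive_var(alpha, n_iter, c=1.0, b=0.0, seed=1):
    """Naive Robbins-Monro quantile search:
    xi_{n+1} = xi_n - gamma_{n+1}/(1 - alpha) * (1_{X_{n+1} <= xi_n} - alpha),
    with step gamma_n = c/(n**0.75 + b); its target is the alpha-quantile
    VaR_alpha(X) of the simulated loss X."""
    rng = random.Random(seed)
    xi = 0.0
    for n in range(1, n_iter + 1):
        gamma = c / (n ** 0.75 + b)
        x = rng.gauss(0.0, 1.0)          # illustrative loss sample X_{n+1}
        indicator = 1.0 if x <= xi else 0.0
        xi -= gamma / (1.0 - alpha) * (indicator - alpha)
    return xi

xi_hat = recursive_var(0.95, 400_000)
print(xi_hat)   # should settle near the N(0;1) 95% quantile, about 1.645
```

Running it for $\alpha$ much closer to 1 makes the "rare event" effect discussed next directly visible: the updates pushing $\xi_n$ upward become very infrequent.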
This result is not satisfactory because the asymptotic variance remains huge since $f(\xi^*_\alpha)$ is usually very close to 0 when $\alpha$ is close to 1. Thus if $X$ has a normal distribution $\mathcal N(0;1)$, then it is clear that $\xi^*_\alpha\to+\infty$ as $\alpha\to1$. Consequently,
$$1-\alpha=\mathbb P(X\ge\xi^*_\alpha)\sim\frac{f(\xi^*_\alpha)}{\xi^*_\alpha}\quad\text{as }\alpha\to1,$$
so that
$$\frac{\alpha(1-\alpha)}{f(\xi^*_\alpha)^2}\sim\frac1{\xi^*_\alpha\,f(\xi^*_\alpha)}\to+\infty\quad\text{as }\alpha\to1.$$
This simply illustrates the “rare event” effect which implies that, when $\alpha$ is close to 1, the event $\{X_{n+1}>\xi_n\}$ is rare, especially when $\xi_n$ gets close to its limit $\xi^*_\alpha=\operatorname{VaR}_\alpha(X)$.
The way out is to add an importance sampling procedure to somewhat “re-center” the distribution around its VaRα$(X)$. To proceed, we will take advantage of our recursive variance reduction by importance sampling described and analyzed in Sect. 6.3.1. This is the object of the next section.

6.4.7 VaRα and CVaRα Computation (III)

As emphasized in the previous section, the asymptotic variances of our “naive” algorithms for VaRα and CVaRα computation are not satisfactory, in particular when $\alpha$ is close to 1. To improve them, the idea is to mix the recursive data-driven variance reduction procedure introduced in Sect. 6.3.1 with the above algorithms.
First we make the (not so) restrictive assumption that the r.v. $X$, representative of a loss, can be represented as a function of a standard Gaussian vector $Z\stackrel{d}{=}\mathcal N(0;I_d)$, namely
$$X=\varphi(Z),\qquad\varphi:\mathbb R^d\to\mathbb R\ \text{a Borel function}.$$
Hence, for a level $\alpha\in(0,1)$, in a (temporarily) static framework (i.e. fixed $\xi\in\mathbb R$), the function of interest for variance reduction is defined by
$$\varphi_{\alpha,\xi}(z)=\frac1{1-\alpha}\big(\mathbf 1_{\{\varphi(z)\le\xi\}}-\alpha\big),\quad z\in\mathbb R^d.$$
So, still following Sect. 6.3.1 and taking advantage of the fact that $\varphi_{\alpha,\xi}$ is bounded, we design the following data-driven procedure for the adaptive variance reducer (using the notations from this section):
$$\theta_{n+1}=\theta_n-\gamma_{n+1}\,\varphi_{\alpha,\xi}(Z_{n+1}-\theta_n)^2\,(2\theta_n-Z_{n+1}),\quad n\ge0,$$
so that $\mathbb E\,\varphi_{\alpha,\xi}(Z)$ can be computed adaptively by
$$\mathbb E\,\varphi_{\alpha,\xi}(Z)=e^{-\frac{|\theta|^2}2}\,\mathbb E\Big[\varphi_{\alpha,\xi}(Z+\theta)\,e^{-(\theta|Z)}\Big]=\lim_n\frac1n\sum_{k=1}^ne^{-\frac{|\theta_{k-1}|^2}2}\,\varphi_{\alpha,\xi}(Z_k+\theta_{k-1})\,e^{-(\theta_{k-1}|Z_k)}.$$
Considering now a dynamical version of these procedures in order to adapt $\xi$ recursively leads us to design the following procedure:
$$\xi_{n+1}=\xi_n-\frac{\gamma_{n+1}}{1-\alpha}\,e^{-\frac{|\theta_n|^2}2}\,e^{-(\theta_n|Z_{n+1})}\big(\mathbf 1_{\{\varphi(Z_{n+1}+\theta_n)\le\xi_n\}}-\alpha\big)\tag{6.61}$$
$$\theta_{n+1}=\theta_n-\gamma_{n+1}\big(\mathbf 1_{\{\varphi(Z_{n+1})\le\xi_n\}}-\alpha\big)^2\,(2\theta_n-Z_{n+1}),\quad n\ge1,\tag{6.62}$$
with an appropriate initialization (see the remark below). This procedure is a.s. convergent toward its target, denoted by $(\theta_\alpha,\xi_\alpha)$, and the averaged component $(\bar\xi_n)_{n\ge0}$ of $(\xi_n)_{n\ge0}$ satisfies a CLT (see [29]).

Theorem 6.11 (Adaptive VaR computation with importance sampling) (a) If the step sequence satisfies the decreasing step assumption (6.7), then
$$(\xi_n,\theta_n)\xrightarrow{n\to+\infty}(\xi_\alpha,\theta_\alpha)\quad\text{a.s. with }\xi_\alpha=\operatorname{VaR}_\alpha(X)$$
and
$$\theta_\alpha=\operatorname{argmin}_{\theta\in\mathbb R^d}V_{\alpha,\xi_\alpha}(\theta),\quad\text{where}\quad V_{\alpha,\xi}(\theta)=e^{-|\theta|^2}\,\mathbb E\Big[\big(\mathbf 1_{\{\varphi(Z+\theta)\le\xi\}}-\alpha\big)^2e^{-2(\theta|Z)}\Big].$$
Note that $V_{\alpha,\xi}(0)=F(\xi)\big(1-F(\xi)\big)$.
(b) If the step sequence satisfies $\gamma_n=\frac{c}{n^\vartheta+b}$, $\frac12<\vartheta<1$, $b\ge0$, $c>0$, then
$$\sqrt n\,\big(\bar\xi_n-\xi_\alpha\big)\xrightarrow{\mathcal L}\mathcal N\Big(0;\frac{V_{\alpha,\xi_\alpha}(\theta_\alpha)}{f(\xi_\alpha)^2}\Big)\quad\text{as }n\to+\infty.$$

Remark. In practice it may be useful, as noted in [29], to make the level α slowly
vary with n in (6.61) and (6.62), e.g. from 0.5 up to the requested level, usually close
to 1. Otherwise, the procedure may freeze. The initialization of the procedure should
be set accordingly.
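A compact one-dimensional sketch of the coupled recursions (6.61)–(6.62), including the slowly varying level suggested in the remark above, may look as follows. Everything numerical here — the loss function $\varphi(z)=e^z$, the steps, the ramp speed — is an illustrative assumption, not the book's implementation.

```python
import math
import random

def adaptive_var_is(alpha_target, n_iter, seed=2):
    """Coupled recursions in the spirit of (6.61)-(6.62): xi_n tracks
    VaR_alpha while theta_n shifts the Gaussian samples toward the tail
    (importance sampling). The level alpha is slowly increased toward its
    target, as suggested in the remark above, to avoid freezing."""
    rng = random.Random(seed)
    xi, theta = 0.0, 0.0
    phi = math.exp                     # illustrative loss: X = exp(Z)
    for n in range(1, n_iter + 1):
        alpha = min(alpha_target, 0.5 + n / 10_000.0)
        gamma = 1.0 / n ** 0.75
        z = rng.gauss(0.0, 1.0)
        weight = math.exp(-0.5 * theta * theta - theta * z)
        ind_shift = 1.0 if phi(z + theta) <= xi else 0.0
        xi -= gamma / (1.0 - alpha) * weight * (ind_shift - alpha)      # (6.61)
        ind_plain = 1.0 if phi(z) <= xi else 0.0
        theta -= gamma * (ind_plain - alpha) ** 2 * (2.0 * theta - z)   # (6.62)
    return xi, theta

xi_n, theta_n = adaptive_var_is(0.95, 200_000)
print(xi_n, theta_n)
```

For this lognormal toy loss the true 95% VaR is $e^{q_{0.95}}\approx5.18$; one observes on this example that $\xi_n$ approaches it while $\theta_n$ stabilizes at a positive value, i.e. the sampling is indeed re-centered toward the tail.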

6.5 From Quasi-Monte Carlo to Quasi-Stochastic Approximation

Plugging quasi-random numbers into a recursive stochastic approximation procedure instead of pseudo-random numbers is a rather natural idea, given the performance of QMC methods for numerical integration. To the best of our knowledge, it goes back to the early 1990s in [186]. As expected, various numerical tests showed that it may significantly accelerate the convergence of the procedure, as in Monte Carlo simulations.
In [186], this question is mostly investigated from a theoretical viewpoint. The
main results are based on an extension of uniformly distributed sequences on unit
hypercubes called averaging systems. The two main results are based, on the one
hand, on a contraction assumption and on the other hand on a monotonicity assump-
tion, which both require some stringent conditions on the function H . In the first set-
ting, some a priori error bounds emphasize that quasi-stochastic approximation does
accelerate the convergence rate of the procedure. Both results are one-dimensional,
though the contracting setting could be easily extended to multi-dimensional proce-
dures. Unfortunately, it turns out to be of little interest for practical applications.
In this section, we want to propose a more natural multi-dimensional framework
in view of applications. First, we give the counterpart of the Robbins–Siegmund
Lemma established in Sect. 6.2. It relies on a pathwise Lyapunov function, which
remains a rather restrictive assumption. It emphasizes what kind of assumption is
needed to establish theoretical results when using deterministic uniformly distributed
sequences. A more detailed version, including several examples of applications, is
available in [190].

Theorem 6.12 (Robbins–Siegmund Lemma, QMC framework)
(a) Let $h:\mathbb R^d\to\mathbb R^d$ and let $H:\mathbb R^d\times[0,1]^q\to\mathbb R^d$ be a Borel function such that
$$\forall\,y\in\mathbb R^d,\quad h(y)=\mathbb E\big[H(y,U)\big],\quad\text{where }U\stackrel{d}{=}\mathcal U([0,1]^q).$$
Suppose that
$$\{h=0\}=\{y_*\}$$
and that there exists a differentiable function $L:\mathbb R^d\to\mathbb R_+$ with a Lipschitz continuous gradient $\nabla L$, satisfying
$$|\nabla L|\le C_L\sqrt{1+L},$$
such that $H$ fulfills the following pathwise mean-reverting assumption: the function $\underline H$ defined for every $y\in\mathbb R^d$ by
$$\underline H(y):=\inf_{u\in[0,1]^q}\big(\nabla L(y)\,|\,H(y,u)-H(y_*,u)\big)\ \text{is l.s.c. and positive on }\mathbb R^d\setminus\{y_*\}.\tag{6.63}$$
Furthermore, assume that
$$\forall\,y\in\mathbb R^d,\ \forall\,u\in[0,1]^q,\quad|H(y,u)|\le C_H\sqrt{1+L(y)}\tag{6.64}$$
(which implies that $h$ is bounded) and that the function
$$u\mapsto H(y_*,u)\ \text{has finite variation (in the measure sense).}$$
Let $\xi:=(\xi_n)_{n\ge1}$ be a uniformly distributed sequence over $[0,1]^q$ with low discrepancy, hence satisfying
$$\Delta_n:=\max_{1\le k\le n}k\,D^*_k(\xi)=O\big((\log n)^q\big).$$
Let $\gamma=(\gamma_n)_{n\ge1}$ be a non-increasing sequence of gain parameters satisfying
$$\sum_{n\ge1}\gamma_n=+\infty,\quad\gamma_n(\log n)^q\to0\quad\text{and}\quad\sum_{n\ge1}\max\big(\gamma_n-\gamma_{n+1},\gamma_n^2\big)(\log n)^q<+\infty.\tag{6.65}$$
Then, the recursive procedure defined by
$$\forall\,n\ge0,\quad y_{n+1}=y_n-\gamma_{n+1}H(y_n,\xi_{n+1}),\quad y_0\in\mathbb R^d,$$
satisfies
$$y_n\longrightarrow y_*\quad\text{as }n\to+\infty.$$
(b) If $(y,u)\mapsto H(y,u)$ is continuous, then Assumption (6.63) reads
$$\forall\,y\in\mathbb R^d\setminus\{y_*\},\ \forall\,u\in[0,1]^q,\quad\big(\nabla L(y)\,|\,H(y,u)-H(y_*,u)\big)>0.\tag{6.66}$$

Proof. (a) Step 1 (Regular step). The beginning of the proof is rather similar to the “regular” stochastic case except that we will use as a Lyapunov function
$$\Phi=\sqrt{1+L}.$$
First note that $\nabla\Phi=\frac{\nabla L}{2\sqrt{1+L}}$ is bounded (by the constant $C_L$) so that $\Phi$ is $C_L$-Lipschitz continuous. Furthermore, for every $y,y'\in\mathbb R^d$,
$$\begin{aligned}
\big|\nabla\Phi(y)-\nabla\Phi(y')\big|&\le\frac{|\nabla L(y)-\nabla L(y')|}{\sqrt{1+L(y)}}+|\nabla L(y')|\,\Big|\frac1{\sqrt{1+L(y)}}-\frac1{\sqrt{1+L(y')}}\Big|\qquad\qquad(6.67)\\
&\le[\nabla L]_{\rm Lip}\,\frac{|y-y'|}{\sqrt{1+L(y)}}+\frac{C_L}{\sqrt{1+L(y)}}\,\Big|\sqrt{1+L(y)}-\sqrt{1+L(y')}\Big|\\
&\le[\nabla L]_{\rm Lip}\,\frac{|y-y'|}{\sqrt{1+L(y)}}+\frac{C_L^2}{\sqrt{1+L(y)}}\,|y-y'|\\
&\le C\,\frac{|y-y'|}{\sqrt{1+L(y)}}\qquad\qquad(6.68)
\end{aligned}$$
with $C=[\nabla L]_{\rm Lip}+C_L^2$.
It follows, by using successively the fundamental theorem of Calculus applied to $\Phi$ between $y_n$ and $y_{n+1}$ and Hölder’s Inequality, that there exists $\zeta_{n+1}\in(y_n,y_{n+1})$ (geometric interval) such that
$$\begin{aligned}
\Phi(y_{n+1})&=\Phi(y_n)-\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_n,\xi_{n+1})\big)+\gamma_{n+1}\big(\nabla\Phi(y_n)-\nabla\Phi(\zeta_{n+1})\,|\,H(y_n,\xi_{n+1})\big)\\
&\le\Phi(y_n)-\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_n,\xi_{n+1})\big)+\gamma_{n+1}\big|\nabla\Phi(y_n)-\nabla\Phi(\zeta_{n+1})\big|\,\big|H(y_n,\xi_{n+1})\big|.
\end{aligned}$$
Now, the above inequality (6.68) applied with $y=y_n$ and $y'=\zeta_{n+1}$ yields, knowing that $|\zeta_{n+1}-y_n|\le|y_{n+1}-y_n|$,
$$\begin{aligned}
\Phi(y_{n+1})&\le\Phi(y_n)-\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_n,\xi_{n+1})\big)+\gamma_{n+1}^2\,\frac{C}{\sqrt{1+L(y_n)}}\,\big|H(y_n,\xi_{n+1})\big|^2\\
&\le\Phi(y_n)-\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_n,\xi_{n+1})-H(y_*,\xi_{n+1})\big)-\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_*,\xi_{n+1})\big)+\gamma_{n+1}^2\,\frac{C}{\sqrt{1+L(y_n)}}\,\big|H(y_n,\xi_{n+1})\big|^2.
\end{aligned}$$
Then, using (6.64), we get
$$\Phi(y_{n+1})\le\Phi(y_n)\big(1+C'\gamma_{n+1}^2\big)-\gamma_{n+1}\,\underline H(y_n)-\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_*,\xi_{n+1})\big).\tag{6.69}$$
Set, for every $n\ge0$,
$$s_n:=\frac{\Phi(y_n)+\sum_{k=1}^n\gamma_k\,\underline H(y_{k-1})}{\prod_{k=1}^n\big(1+C'\gamma_k^2\big)}$$
with the usual convention $\sum_{\emptyset}=0$. It follows from (6.63) that the sequence $(s_n)_{n\ge0}$ is non-negative since all the terms involved in its numerator are non-negative. Now (6.69) reads
$$\forall\,n\ge0,\quad0\le s_{n+1}\le s_n-\tilde\gamma_{n+1}\big(\nabla\Phi(y_n)\,|\,H(y_*,\xi_{n+1})\big)\tag{6.70}$$
where $\tilde\gamma_n=\dfrac{\gamma_n}{\prod_{k=1}^n(1+C'\gamma_k^2)}$, $n\ge1$.

Step 2 (QMC step). Set for every $n\ge1$,
$$m_n:=\sum_{k=1}^n\tilde\gamma_k\big(\nabla\Phi(y_{k-1})\,|\,H(y_*,\xi_k)\big)\quad\text{and}\quad S^*_n=\sum_{k=1}^nH(y_*,\xi_k).$$
First note that (6.65) combined with the Koksma–Hlawka Inequality (see Proposition 4.3) imply
$$|S^*_n|\le C_\xi\,V\big(H(y_*,\,\cdot\,)\big)\,(\log n)^q,\tag{6.71}$$
where $V\big(H(y_*,\,\cdot\,)\big)$ denotes the variation in the measure sense of $H(y_*,\,\cdot\,)$. An Abel transform yields (with the convention $S^*_0=0$)
$$\begin{aligned}
m_n&=\tilde\gamma_n\big(\nabla\Phi(y_{n-1})\,|\,S^*_n\big)-\sum_{k=1}^{n-1}\big(\tilde\gamma_{k+1}\nabla\Phi(y_k)-\tilde\gamma_k\nabla\Phi(y_{k-1})\,\big|\,S^*_k\big)\\
&=\underbrace{\tilde\gamma_n\big(\nabla\Phi(y_{n-1})\,|\,S^*_n\big)}_{(a)}-\underbrace{\sum_{k=1}^{n-1}\tilde\gamma_k\big(\nabla\Phi(y_k)-\nabla\Phi(y_{k-1})\,|\,S^*_k\big)}_{(b)}-\underbrace{\sum_{k=1}^{n-1}\big(\tilde\gamma_{k+1}-\tilde\gamma_k\big)\big(\nabla\Phi(y_k)\,|\,S^*_k\big)}_{(c)}.
\end{aligned}$$
We aim at showing that $m_n$ converges in $\mathbb R$ toward a finite limit by inspecting the above three terms.
One gets, using that $\tilde\gamma_n\le\gamma_n$,
$$|(a)|\le\tilde\gamma_n\,\|\nabla\Phi\|_{\sup}\,O\big((\log n)^q\big)=O\big(\gamma_n(\log n)^q\big)\to0\quad\text{as }n\to+\infty.$$
Owing to (6.68), the partial sums in $(b)$ satisfy
$$\sum_{k=1}^{n-1}\tilde\gamma_k\,\big|\big(\nabla\Phi(y_k)-\nabla\Phi(y_{k-1})\,|\,S^*_k\big)\big|\le C\sum_{k=1}^{n-1}\tilde\gamma_k\,\gamma_k\,\frac{|H(y_{k-1},\xi_k)|}{\sqrt{1+L(y_{k-1})}}\,|S^*_k|\le C\,C_H\,C_\xi\,V\big(H(y_*,\,\cdot\,)\big)\sum_{k=1}^{n-1}\gamma_k^2\,(\log k)^q,$$
where we used Inequality (6.71) in the second inequality. Consequently, the series $\sum_{k\ge1}\tilde\gamma_k\big(\nabla\Phi(y_k)-\nabla\Phi(y_{k-1})\,|\,S^*_k\big)$ is (absolutely) convergent owing to Assumption (6.65).
Finally, one deals with term $(c)$. First notice that
$$\big|\tilde\gamma_{n+1}-\tilde\gamma_n\big|\le\gamma_n-\gamma_{n+1}+C'\gamma_{n+1}^2\,\tilde\gamma_n\le C''\max\big(\gamma_n^2,\gamma_n-\gamma_{n+1}\big)$$
for some real constant $C''$. One checks that the series $(c)$ is also (absolutely) convergent owing to the boundedness of $\nabla\Phi$, Assumption (6.65) and the upper bound (6.71) for $S^*_n$.

Then mn converges toward a finite limit m∞ . This induces that the sequence
(sn + mn )n is bounded below since (sn )n is non-negative. Now, we know from (6.70)
that (sn + mn ) is also non-increasing, hence convergent in R, which in turn implies
that the sequence (sn )n≥0 itself is convergent toward a finite limit. The same argu-
ments as in the regular stochastic case yield

L(yn ) −→ L∞ as n → +∞ and γn H (yn−1 ) < +∞.
n≥1

One concludes, still like in the stochastic case, that (yn ) is bounded and eventually
converges toward the unique zero of H , i.e. y∗ .
(b) is obvious. ♦

ℵ Practitioner’s corner • The step assumption (6.65) includes all the step sequences of the form $\gamma_n=\frac{c}{n^\alpha}$, $\alpha\in(0,1]$. Note that as soon as $q\ge2$, the condition $\gamma_n(\log n)^q\to0$ is redundant (it follows from the convergence of the series on the right owing to an Abel transform).
• One can replace the (slightly unrealistic) assumption on $H(y_*,\,\cdot\,)$ by a more natural Lipschitz continuity assumption, provided one strengthens the step assumption (6.65) into
$$\sum_{n\ge1}\gamma_n=+\infty,\qquad\gamma_n(\log n)\,n^{1-\frac1q}\to0$$
and
$$\sum_{n\ge1}\max\big(\gamma_n-\gamma_{n+1},\gamma_n^2\big)(\log n)\,n^{1-\frac1q}<+\infty.$$
This is a straightforward consequence of Proïnov’s Theorem (Theorem 4.3), which implies that
$$|S^*_n|\le C\,(\log n)\,n^{1-\frac1q}.$$
Note that the above new assumptions are satisfied by the step sequences $\gamma_n=\frac{c}{n^\rho}$, $1-\frac1q<\rho\le1$.
• It is clear that the mean-reverting assumption on $H$ is much more stringent in the QMC setting.
• It remains that the theoretical spectrum of application of the above theorem is dramatically narrower than the original one. However, from a practical viewpoint, one observes on simulations a very satisfactory behavior of such quasi-stochastic procedures, including an improvement of the rate of convergence with respect to the regular MC implementation.

 Exercise. We assume now that the recursive procedure satisfied by the sequence $(y_n)_{n\ge0}$ is given by
$$\forall\,n\ge0,\quad y_{n+1}=y_n-\gamma_{n+1}\big(H(y_n,\xi_{n+1})+r_{n+1}\big),\quad y_0\in\mathbb R^d,$$
where the sequence $(r_n)_{n\ge1}$ is a perturbation term. Show that, if $\sum_{n\ge1}\gamma_n\,r_n$ is a convergent series, then the conclusion of the above theorem remains true.

Numerical experiment: We reproduce here (without even trying to check any kind of assumption) the implicit correlation search recursive procedure tested in Sect. 6.3.2, implemented this time with a sequence of quasi-random normal numbers, namely
$$\big(\zeta^1_n,\zeta^2_n\big)=\Big(\sqrt{-2\log(\xi^1_n)}\,\sin(2\pi\xi^2_n),\ \sqrt{-2\log(\xi^1_n)}\,\cos(2\pi\xi^2_n)\Big),\quad n\ge1,$$
where $\xi_n=(\xi^1_n,\xi^2_n)$, $n\ge1$, is simply a regular 2-dimensional Halton sequence (see Table 6.2 and Fig. 6.4).

Table 6.2 B-S Best-of-Call option. $T=1$, $r=0.10$, $\sigma_1=\sigma_2=0.30$, $X_0^1=X_0^2=100$, $K=100$. Convergence of $\rho_n=\cos(\theta_n)$ toward $\rho^*=-0.5$ (up to $n=100\,000$)

n ρn := cos(θn )
1000 −0.4964
10000 −0.4995
25000 −0.4995
50000 −0.4994
75000 −0.4996
100000 −0.4998

[Figure: two panels, each plotting ρn = cos(θn) against n ∈ [0, 10⁵], with values ranging between −0.7 and −0.25.]

Fig. 6.4 B-S Best-of-Call option. $T=1$, $r=0.10$, $\sigma_1=\sigma_2=0.30$, $X_0^1=X_0^2=100$, $K=100$. Convergence of $\rho_n=\cos(\theta_n)$ toward $\rho^*=-0.5$ (up to $n=100\,000$). Left: MC implementation. Right: QMC implementation
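The quasi-random normal numbers used in this experiment can be generated as in the following minimal sketch (the Halton/Box–Muller construction is standard, but this is not the book's code and the sanity check at the end is an addition of ours):

```python
import math

def halton(n, base):
    """n-th term (n >= 1) of the van der Corput sequence in the given base;
    the coordinates of a Halton sequence use distinct prime bases."""
    x, f = 0.0, 1.0
    while n > 0:
        f /= base
        x += f * (n % base)
        n //= base
    return x

def quasi_gaussian_pair(n):
    """Box-Muller transform applied to the 2-dimensional Halton sequence
    (bases 2 and 3), as in the numerical experiment above."""
    u1, u2 = halton(n, 2), halton(n, 3)
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.sin(2.0 * math.pi * u2), r * math.cos(2.0 * math.pi * u2)

# quick sanity check: first two empirical moments close to those of N(0; 1)
zs = [quasi_gaussian_pair(n)[0] for n in range(1, 100_001)]
m1 = sum(zs) / len(zs)
m2 = sum(z * z for z in zs) / len(zs)
print(m1, m2)
```

Note that `halton(n, 2)` never returns 0 for $n\ge1$, so the logarithm in the Box–Muller transform is always well defined.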

6.6 Concluding Remarks

From a probabilistic viewpoint, many other moment or regularity assumptions on the


Lyapunov function entail some a.s. convergence results. From the dynamical point
of view, stochastic approximation is rather closely connected to the dynamics of the
autonomous Ordinary Differential Equation (ODE ) of its mean function, namely
ẏ = −h(y). However, the analysis of a stochastic algorithm cannot be completely
“reduced” to that of its mean ODE as emphasized by several authors (see e.g. [33, 94]).
There is a huge literature on stochastic approximation, motivated by several fields:
optimization, robotics, statistics, artificial neural networks and machine learning,
self-organization and unsupervised learning, etc. For further insight on stochastic
approximation, the main textbooks are probably [39, 81, 180] for prominently prob-
abilistic aspects. One may read [33] for a dynamical system oriented point of view.
For an occupation measure approach, one may also see [95].
The case of non-i.i.d. Markovian innovation with or without feedback is not treated
in this chapter. This is an important topic for applications for which we mainly refer
to [39], see also, among other more recent references on this framework, [93] which
deals with a discontinuous mean dynamics.
Chapter 7
Discretization Scheme(s) of a Brownian Diffusion

The aim of this chapter is to investigate several discretization schemes of the (adapted) solution $(X_t)_{t\in[0,T]}$ to a $d$-dimensional Brownian Stochastic Differential Equation ($SDE$) formally reading
$$(SDE)\ \equiv\ dX_t=b(t,X_t)\,dt+\sigma(t,X_t)\,dW_t,\tag{7.1}$$

where $b:[0,T]\times\mathbb R^d\to\mathbb R^d$, $\sigma:[0,T]\times\mathbb R^d\to\mathcal M(d,q,\mathbb R)$ are continuous functions (see the remark below), $W=(W_t)_{t\in[0,T]}$ denotes a $q$-dimensional standard Brownian motion defined on a probability space $(\Omega,\mathcal A,\mathbb P)$ and $X_0:(\Omega,\mathcal A,\mathbb P)\to\mathbb R^d$ is a random vector, independent of $W$. We assume that $b$ and $\sigma$ are Lipschitz continuous in $x$ uniformly with respect to $t\in[0,T]$, i.e.
$$\forall\,t\in[0,T],\ \forall\,x,y\in\mathbb R^d,\quad\big|b(t,x)-b(t,y)\big|+\big\|\sigma(t,x)-\sigma(t,y)\big\|\le K\,|x-y|.\tag{7.2}$$

Note that $|\cdot|$ and $\|\cdot\|$ may denote any norm on $\mathbb R^d$ and on $\mathcal M(d,q,\mathbb R)$, respectively, in the above condition. However, unless otherwise mentioned, we will consider the canonical Euclidean and Frobenius norms in what follows.
We consider the so-called augmented filtration of the SDE generated by X 0 and
σ(Ws , 0 ≤ s ≤ t), i.e.
 
Ft := σ X 0 , NP , Ws , 0 ≤ s ≤ t , t ∈ [0, T ], (7.3)

where $\mathcal N_{\mathbb P}$ denotes the class of $\mathbb P$-negligible sets of $\mathcal A$ (i.e. all negligible sets if the $\sigma$-algebra $\mathcal A$ is $\mathbb P$-complete). When $X_0=x_0\in\mathbb R^d$ is deterministic, $\mathcal F_t=\mathcal F^W_t=\sigma\big(\mathcal N_{\mathbb P},W_s,\ 0\le s\le t\big)$ is simply the augmented filtration of the Brownian motion $W$. One shows using Kolmogorov’s 0-1 law (see [251] or [162]) that $(\mathcal F^W_t)$ is right continuous, i.e. $\mathcal F^W_t=\cap_{s>t}\mathcal F^W_s$ for every $t\in[0,T)$. The same holds for $(\mathcal F_t)$. Such

© Springer International Publishing AG, part of Springer Nature 2018 271


G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_7

a combination of completeness and right continuity of a filtration is described as the “usual conditions”.
The following theorem shows both the existence and uniqueness of such an $(\mathcal F_t)$-adapted solution starting from $X_0$. Such stochastic processes are usually called Brownian diffusion processes or, more simply, Brownian diffusions. We refer to [162], Theorem 2.9, p. 289, for a proof (among many other references).
Theorem 7.1 (Strong solution of (SDE )) (see e.g. [162], Theorem 2.9, p. 289)
Under the above assumptions on b, σ, X 0 and W , the SDE (7.1) has a unique
(Ft )-adapted solution X = (X t )t∈[0,T ] starting from X 0 at time 0, defined on the
probability space (, A, P), in the following (integral) sense:
 t  t
P-a.s. ∀ t ∈ [0, T ], X t = X 0 + b(s, X s )ds + σ(s, X s )dWs .
0 0

This solution has P-a.s. continuous paths.


Notation. When X_0 = x ∈ R^d, one denotes the solution of (SDE) on [0, T] by X^x or (X_t^x)_{t∈[0,T]}.

Remarks. • A solution as described in the above theorem is known as a strong solution in the sense that it is defined on the probability space on which W lives.
• The global continuity assumption on b and σ can be relaxed to Borel measurability if we add the linear growth assumption

∀ t ∈ [0, T], ∀ x ∈ R^d,  |b(t, x)| + ‖σ(t, x)‖ ≤ K (1 + |x|).

In fact, if b and σ are continuous, this condition follows from (7.2) applied with (t, x) and (t, 0), given the fact that t ↦ (b(t, 0), σ(t, 0)) is bounded on [0, T] by continuity.
Moreover, still under this linear growth assumption, the Lipschitz assumption (7.2) can be relaxed to a local Lipschitz condition on b and σ, namely, for every N ∈ N*,

∃ K_N ∈ (0, +∞),  ∀ t ∈ [0, T], ∀ x, y ∈ B(0, N),
|b(t, x) − b(t, y)| + ‖σ(t, x) − σ(t, y)‖ ≤ K_N |x − y|.

• By adding the 0-th component t to X, i.e. by setting Y_t := (t, X_t), one may sometimes assume that the (SDE) is autonomous, i.e. that the coefficients b and σ only depend on the space variable. This is often enough for applications, though it may induce overly stringent assumptions on the time variable in many theoretical results. Furthermore, when some ellipticity assumptions are required, this way of considering the equation no longer works since the equation dt = 1 dt + 0 dW_t is completely degenerate (in terms of noise).
The above theorem admits an easy extension which allows us to define, like for ODEs, the flow of an SDE. Under the assumptions of Theorem 7.1, for every t ∈ [0, T] and every x ∈ R^d, there exists a unique (F_s^W)_{s∈[t,T]}-adapted solution (X_s^{t,x})_{s∈[t,T]} to the above SDE (7.1) in the sense that

P-a.s.  ∀ s ∈ [t, T],  X_s^{t,x} = x + ∫_t^s b(u, X_u^{t,x}) du + ∫_t^s σ(u, X_u^{t,x}) dW_u.   (7.4)

In fact, (X_s^{t,x})_{s∈[t,T]} is adapted to the augmented filtration of the Brownian motion W_s^{(t)} = W_{t+s} − W_t, s ∈ [0, T − t].

7.1 Euler–Maruyama Schemes

Except for some very specific equations, it is impossible to devise an exact simulation of the process X, even at a fixed time T. By exact simulation, we mean writing X_T = χ(U), U ~ U([0, 1]^r), where r ∈ N* ∪ {+∞} and χ is an explicit function, defined on [0, 1]^r if r < +∞ and on [0, 1]^(N) when r = +∞. In fact, such an exact simulation has been shown to be possible when d = 1 and σ ≡ 1, see [43], by an appropriate acceptance-rejection method. For a brief discussion on recent developments in this direction, we refer to the introduction of Chap. 9. Consequently, to approximate E f(X_T), or more generally E F((X_t)_{t∈[0,T]}), by a Monte Carlo method, one needs to approximate X by a process that can be simulated (at least at a fixed number of instants).

To this end, we will introduce three types of Euler schemes with step T/n (n ∈ N*) associated to the SDE (7.1): the discrete time Euler scheme X̄ = (X̄_{kT/n})_{0≤k≤n} with step T/n, its càdlàg stepwise constant extension, known as the stepwise constant (Brownian) Euler scheme, and the continuous or genuine (Brownian) Euler scheme.

7.1.1 The Discrete Time and Stepwise Constant Euler Schemes

The idea, like for ODEs in the deterministic framework, is to freeze the solution of the SDE between the regularly spaced discretization instants kT/n.
Discrete time Euler scheme

The discrete time Euler scheme with step T/n is defined by

X̄_{t_{k+1}^n} = X̄_{t_k^n} + (T/n) b(t_k^n, X̄_{t_k^n}) + √(T/n) σ(t_k^n, X̄_{t_k^n}) Z_{k+1}^n,  X̄_0 = X_0,  k = 0, …, n − 1,   (7.5)

where t_k^n = kT/n, k = 0, …, n, and (Z_k^n)_{1≤k≤n} denotes a sequence of i.i.d. N(0; I_q)-distributed random vectors given by

Z_k^n := √(n/T) ( W_{t_k^n} − W_{t_{k−1}^n} ),  k = 1, …, n.

We will often drop the superscript n in Z_k^n and write Z_k.
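For concreteness, the recursion (7.5) can be sketched in a few lines in dimension d = q = 1. The code below is an illustrative implementation, not taken from the book; the function name and the Ornstein–Uhlenbeck test coefficients are our own choices for the sketch.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, T, n, n_paths, rng):
    """Discrete time Euler scheme (7.5) for dX_t = b(t,X_t)dt + sigma(t,X_t)dW_t,
    in dimension d = q = 1, simulated over n_paths independent Brownian paths."""
    h = T / n                               # step T/n
    x = np.full(n_paths, float(x0))
    path = np.empty((n + 1, n_paths))
    path[0] = x
    for k in range(n):
        z = rng.standard_normal(n_paths)    # Z_{k+1} ~ N(0; 1), i.i.d.
        x = x + b(k * h, x) * h + sigma(k * h, x) * np.sqrt(h) * z
        path[k + 1] = x
    return path

# Toy example (our choice): Ornstein-Uhlenbeck dynamics b(t,x) = -x,
# sigma(t,x) = 1, for which E X_T = x0 * exp(-T) is known in closed form.
rng = np.random.default_rng(0)
path = euler_maruyama(lambda t, x: -x, lambda t, x: np.ones_like(x),
                      x0=1.0, T=1.0, n=100, n_paths=10_000, rng=rng)
print(path[-1].mean())   # close to exp(-1) ~ 0.368
```

The same loop simulates any (b, σ) pair; only one Gaussian vector per step is needed, which is what makes the scheme directly implementable, unlike the exact simulation discussed above.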



Example (Black–Scholes). The discrete time Euler scheme of a standard Black–Scholes dynamics dX_t = X_t ( r dt + σ dW_t ), X_0 = x_0 > 0, reads with these notations

X̄_{t_{k+1}^n} = X̄_{t_k^n} ( 1 + r T/n + σ √(T/n) Z_{k+1}^n ),  k = 0, …, n − 1,  X̄_0 = X_0.   (7.6)

The stepwise constant Euler scheme

To alleviate the notation, from now on we write

t̲ := t_k^n if t ∈ [t_k^n, t_{k+1}^n).

The stepwise constant Euler scheme, denoted by (X̃_t)_{t∈[0,T]} for convenience, is defined by

X̃_t = X̄_{t̲},  t ∈ [0, T].   (7.7)

7.1.2 The Genuine (Continuous) Euler Scheme

At this stage it is natural to extend the definition (7.5) of the Euler scheme at every
instant t ∈ [0, T ] by interpolating the drift and the diffusion term in their own scale,
that is, with respect to time for the drift and with respect to the Brownian motion for
the diffusion coefficient, namely

∀ k ∈ {0, …, n − 1}, ∀ t ∈ [t_k^n, t_{k+1}^n),
X̄_t = X̄_{t̲} + (t − t̲) b(t̲, X̄_{t̲}) + σ(t̲, X̄_{t̲}) (W_t − W_{t̲}),  X̄_0 = X_0.   (7.8)

It is clear that lim_{t→t_{k+1}^n, t<t_{k+1}^n} X̄_t = X̄_{t_{k+1}^n} since W has continuous paths. Consequently, thus defined, (X̄_t)_{t∈[0,T]} is an (F_t)-adapted process with continuous paths.


The following proposition is the key property of the genuine (or continuous) Euler
scheme.
Proposition 7.1 Assume that b and σ are continuous functions on [0, T] × R^d. The genuine Euler scheme is a (continuous) Itô process (¹) satisfying the pseudo-SDE with frozen coefficients

X̄_t = X_0 + ∫_0^t b(s̲, X̄_{s̲}) ds + ∫_0^t σ(s̲, X̄_{s̲}) dW_s,  t ∈ [0, T].   (7.9)

Proof. It is clear from (7.8), the recursive definition (7.5) at the discretization dates t_k^n and the continuity of b and σ that X̄_t → X̄_{t_{k+1}^n} as t → t_{k+1}^n. Consequently, for every

1 in the sense of Sect. 12.8 of the Miscellany Chapter.



t ∈ [t_k^n, t_{k+1}^n],

X̄_t = X̄_{t_k^n} + ∫_{t_k^n}^t b(s̲, X̄_{s̲}) ds + ∫_{t_k^n}^t σ(s̲, X̄_{s̲}) dW_s,

so that the conclusion follows by just concatenating the above identities between t_0^n = 0, t_1^n, …, t_k^n = t̲ and t. ♦

Example (Black–Scholes). The continuous time Euler scheme of a standard Black–Scholes dynamics reads, with this notation,

∀ t ∈ [0, T],  X̄_t = X̄_{t̲} ( 1 + r (t − t̲) + σ (W_t − W_{t̲}) ),  X̄_0 = X_0.   (7.10)

Notation: In the main statements, we will write X̄^n instead of X̄ to recall the dependence of the Euler scheme on its step T/n. Idem for X̃, etc.

Then, the main (classical) result is that, under the assumptions on the coefficients b and σ mentioned above, sup_{t∈[0,T]} |X_t − X̄_t| goes to zero in every L^p(P), 0 < p < +∞, as n → +∞. Let us be more specific on this topic by providing error rates under slightly more stringent assumptions.
How to use this genuine scheme for practical simulation is not obvious, at least not as obvious as for the stepwise constant Euler scheme. However, it turns out to be an important method for improving the convergence rate of MC simulations, e.g. for option pricing. Using this scheme in simulations is possible for specific functionals F. It relies on the so-called diffusion bridge method and will be detailed further on in Chap. 8.

7.2 The Strong Error Rate and Polynomial Moments (I)

7.2.1 Main Results and Comments

We consider the SDE (7.1) and its Euler–Maruyama scheme(s) as defined by (7.5),
(7.8). The first version of Theorem 7.2 below (including the second remark that fol-
lows) is mainly due to O. Faure in his PhD thesis (see [90]). An important preliminary
step is to establish the existence of finite L p moments of the sup-norm of solutions
when b and σ have linear growth, whenever X 0 itself lies in L p .
Polynomial moments control
It is often useful to have at hand the following uniform bounds for the solution(s) of (SDE) and its Euler schemes, which first appear as a step in the proof of the rate but have many other applications. They are also an important step to prove the existence of global strong solutions to (SDE) (7.1) when b and σ are only locally Lipschitz continuous but both satisfy a linear growth assumption in the space variable x uniformly in t ∈ [0, T].

Proposition 7.2 (Polynomial moments) Assume that the coefficients b and σ of


the SDE (7.1) are Borel functions simply satisfying the following linear growth
assumption:

∀ t ∈ [0, T], ∀ x ∈ R^d,  |b(t, x)| + ‖σ(t, x)‖ ≤ C (1 + |x|)   (7.11)

for some real constant C = C_T > 0 and a “horizon” T > 0. Then, for every p ∈ (0, +∞), there exists a universal positive real constant κ_{p,d} > 0 such that every strong solution (X_t)_{t∈[0,T]} of (7.1) (if any) satisfies

‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ κ_{p,d} e^{κ_{p,d} C(C+1)T} ( 1 + ‖X_0‖_p )   (7.12)

and, for every n ≥ 1, the Euler scheme with step T/n satisfies

‖ sup_{t∈[0,T]} |X̄_t^n| ‖_p ≤ κ_{p,d} e^{κ_{p,d} C(C+1)T} ( 1 + ‖X_0‖_p ).   (7.13)

Remarks. • The universal constant κ_{p,d} is a numerical function of the BDG real constant C^{BDG}_{p∨2,d}.
• Less synthetic but more precise bounds can be obtained. For example, if p ≥ 2, Inequality (7.55) established in the proof of Proposition 7.6 reads, at time t = T,

‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ √2 e^{(2+C(C^{BDG}_{d,p})²) C T} ( ‖X_0‖_p + C(T + C^{BDG}_{d,p} √T) ),

which holds as an equality if C = 0 and X_0 = 0, unlike the above more synthetic general upper-bound (7.12).

One interesting consequence of this proposition is that, if b and σ are defined on R_+ × R^d and satisfy (7.11) with the same real constant C for every T > 0, and if there exists a global solution (X_t)_{t≥0} of (SDE), then the above exponential control in T of its sup-norm over [0, T], established in (7.12), holds true for every T > 0.
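Proposition 7.2 can be probed numerically. The sketch below (our own illustrative code, with arbitrarily chosen Black–Scholes coefficients b(x) = r x, σ(x) = s x) estimates the L^p-norm of the running supremum of the Euler scheme for several steps T/n; in accordance with (7.13), the estimates stay bounded as n grows.

```python
import numpy as np

def sup_norm_moment(r, s, x0, T, n, n_paths, p, rng):
    """Empirical L^p-norm of max_{k<=n} |Xbar_{t_k}| for the Euler scheme (7.6)
    of the Black-Scholes SDE dX_t = X_t (r dt + s dW_t)."""
    h = T / n
    x = np.full(n_paths, float(x0))
    running_max = np.abs(x).copy()
    for _ in range(n):
        z = rng.standard_normal(n_paths)
        x = x * (1.0 + r * h + s * np.sqrt(h) * z)
        np.maximum(running_max, np.abs(x), out=running_max)
    return (running_max ** p).mean() ** (1.0 / p)

rng = np.random.default_rng(1)
for n in (10, 100, 1000):
    print(n, sup_norm_moment(0.05, 0.2, 1.0, 1.0, n, 20_000, 4, rng))
    # the estimates are roughly constant in n, uniformly bounded as (7.13) predicts
```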

7.2.2 Uniform Convergence Rate in L p (P)

First we introduce the following condition (H_T^β), which strengthens Assumption (7.2) by adding a time regularity assumption of Hölder type, namely that there exists a β ∈ (0, 1] such that

(H_T^β) ≡ ∃ C_{b,σ,T} > 0 such that ∀ s, t ∈ [0, T], ∀ x, y ∈ R^d,
  |b(t, x) − b(s, x)| + ‖σ(t, x) − σ(s, x)‖ ≤ C_{b,σ,T} (1 + |x|) |t − s|^β,
  |b(t, x) − b(t, y)| + ‖σ(t, x) − σ(t, y)‖ ≤ C_{b,σ,T} |y − x|.   (7.14)

Theorem 7.2 (Strong Rate for the Euler scheme) (a) Genuine Euler scheme. Suppose the coefficients b and σ of the SDE (7.1) satisfy the above regularity condition (H_T^β) for a real constant C_{b,σ,T} > 0 and an exponent β ∈ (0, 1]. Then the genuine Euler scheme (X̄_t^n)_{t∈[0,T]} converges toward (X_t)_{t∈[0,T]} in every L^p(P), p > 0, such that X_0 ∈ L^p, at a O(n^{−(½∧β)})-rate. To be precise, there exists a universal constant κ_p > 0, only depending on p, such that, for every n ≥ T,

‖ sup_{t∈[0,T]} |X_t − X̄_t^n| ‖_p ≤ K(p, b, σ, T) ( 1 + ‖X_0‖_p ) (T/n)^{β∧½},   (7.15)

where
K(p, b, σ, T) = κ_p C̃_{b,σ,T} e^{κ_p (1+C̃_{b,σ,T})² T}

and
C̃_{b,σ,T} = C_{b,σ,T} + sup_{t∈[0,T]} |b(t, 0)| + sup_{t∈[0,T]} ‖σ(t, 0)‖ < +∞.   (7.16)

(a′) Discrete time Euler scheme. In particular, (7.15) is satisfied when the supremum is restricted to the discretization instants t_k^n, namely

‖ max_{0≤k≤n} |X_{t_k^n} − X̄_{t_k^n}^n| ‖_p ≤ K(p, b, σ, T) ( 1 + ‖X_0‖_p ) (T/n)^{β∧½}.   (7.17)

(a″) If b and σ are defined on the whole R_+ × R^d and satisfy (H_T^β) with the same real constant C_{b,σ} not depending on T, and if b(·, 0) and σ(·, 0) are bounded on R_+, then C̃_{b,σ,T} does not depend on T.
This will be the case in the autonomous case, i.e. if b(t, x) = b(x) and σ(t, x) = σ(x), t ∈ R_+, x ∈ R^d, with b and σ Lipschitz continuous on R^d.
(b) Stepwise constant Euler scheme. As soon as b and σ satisfy the linear growth assumption (7.11) with a real constant L_{b,σ,T} > 0, then, for every p ∈ (0, +∞) and every n ≥ T,

‖ sup_{t∈[0,T]} |X̄_t^n − X̄_{t̲}^n| ‖_p ≤ κ̃_p e^{κ̃_p L_{b,σ,T} T} ( 1 + ‖X_0‖_p ) √( T (1 + log n) / n ) = O( √( (1 + log n)/n ) ),

where κ̃_p > 0 is a positive real constant only depending on p (and increasing in p).
In particular, if b and σ satisfy the assumption (H_T^β) like in item (a), then the stepwise constant Euler scheme (X̃_t^n)_{t∈[0,T]} converges toward (X_t)_{t∈[0,T]} in every L^p(P), p > 0, such that X_0 ∈ L^p, and for every n ≥ T,

‖ sup_{t∈[0,T]} |X_t − X̃_t^n| ‖_p ≤ K(p, b, σ, T) ( 1 + ‖X_0‖_p ) [ (T/n)^{β∧½} + √( T (1 + log n) / n ) ]
  = O( (1/n)^β + √( (1 + log n)/n ) ),

where
K(p, b, σ, T) = κ̃_p (1 + C̃_{b,σ,T}) e^{κ̃_p (1+C̃_{b,σ,T})² T},

κ̃_p > 0 is a positive real constant only depending on p (increasing in p) and C̃_{b,σ,T} is given by (7.16).

Warning! The complete and detailed proof of this theorem in its full generality,
i.e. including the tracking of the constants, is postponed to Sect. 7.8. It makes use
of stochastic calculus. A first version of the proof in the one-dimensional quadratic
case is proposed in Sect. 7.2.3. However, owing to its importance for applications,
the optimality of the upper-bound for the stepwise constant Euler scheme will be
discussed right after the remarks below.
Remarks. • When n ≤ T, the above explicit bounds still hold true with the same constants provided one replaces

(T/n)^{β∧½} by ½ ( (T/n)^β + (T/n)^{½} )

and

√( (T/n)(1 + log n) ) by ½ ( (T/n)(1 + log n) + T/n ).

• As a consequence, note that the time regularity exponent β rules the convergence rate of the scheme as soon as β < 1/2. In fact, the method of proof itself will emphasize this fact: the idea is to use a Gronwall Lemma to upper-bound the error X − X̄ in L^p(P) by the L^p(P)-norm of the increments X_s − X_{s̲} or, equivalently, X̄_s − X̄_{s̲}, as we will see later.
• If b(t, x) and σ(t, x) are globally Lipschitz continuous on R_+ × R^d with Lipschitz coefficient C_{b,σ}, one may consider the time t as a (d + 1)-th spatial component of X and apply item (a″) of the above theorem directly.

The following corollary is a straightforward consequence of claim (a′) of the above theorem (to be precise, of (7.17)): it yields a (first) convergence rate for the pricing of “vanilla” European options (i.e. payoffs of the form ϕ(X_T)).

Corollary 7.1 Let ϕ : R^d → R be an α-Hölder function for an exponent α ∈ (0, 1], i.e. a function such that [ϕ]_α := sup_{x≠y} |ϕ(x) − ϕ(y)| / |x − y|^α < +∞. Then, there exists a real constant C_{b,σ,T} ∈ (0, ∞) such that, for every n ≥ 1,

| E ϕ(X_T) − E ϕ(X̄_T^n) | ≤ [ϕ]_α E |X_T − X̄_T^n|^α ≤ C_{b,σ,T} [ϕ]_α (T/n)^{α/2}.

We will see further on that this weak error (²) rate can be dramatically improved when b, σ and ϕ share higher regularity properties.
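On the Black–Scholes model the exact solution X_T = x_0 exp((r − σ²/2)T + σW_T) is available in closed form, so the strong error at time T can be measured directly by driving the exact solution and the Euler scheme with the same Brownian increments. The sketch below (our own illustrative code, parameters chosen arbitrarily) checks that halving the step divides the L²-error by about √2, i.e. the empirical order is close to 1/2, in line with (7.17).

```python
import numpy as np

def strong_error(r, s, x0, T, n, n_paths, rng):
    """Empirical L^2-error at time T between the exact Black-Scholes solution
    and its Euler scheme with step T/n, both driven by the same increments."""
    h = T / n
    x_bar = np.full(n_paths, float(x0))
    w_T = np.zeros(n_paths)
    for _ in range(n):
        dw = np.sqrt(h) * rng.standard_normal(n_paths)
        x_bar = x_bar * (1.0 + r * h + s * dw)     # scheme (7.6)
        w_T += dw
    x_exact = x0 * np.exp((r - 0.5 * s * s) * T + s * w_T)
    return np.sqrt(np.mean((x_exact - x_bar) ** 2))

rng = np.random.default_rng(2)
err = {n: strong_error(0.05, 0.4, 1.0, 1.0, n, 100_000, rng) for n in (25, 50, 100, 200)}
for n in (25, 50, 100):
    print(n, np.log2(err[n] / err[2 * n]))   # empirical order, close to 0.5
```

Note that the coupling (same increments for X and X̄^n) is essential: the strong error involves the joint distribution, as the footnote on weak versus strong error recalls.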
On the universality of the rate for the stepwise constant Euler scheme (p ≥ 2)
Note that the rate in claim (b) of the theorem is universal since, as we will see now, it holds as a sharp rate for the Brownian motion itself (here we deal with the case d = 1). Indeed, since W is its own genuine Euler scheme, W̄_t^n − W̃_t^n = W_t − W_{t̲} for every t ∈ [0, T]. Now,

‖ sup_{t∈[0,T]} |W_t − W_{t̲}| ‖_p = ‖ max_{k=1,…,n} sup_{t∈[t_{k−1}^n, t_k^n)} |W_t − W_{t_{k−1}^n}| ‖_p
  = √(T/n) ‖ max_{k=1,…,n} sup_{t∈[k−1,k)} |B_t − B_{k−1}| ‖_p,

where B_t := √(n/T) W_{(T/n) t} is a standard Brownian motion owing to the scaling property. Hence

‖ sup_{t∈[0,T]} |W_t − W_{t̲}| ‖_p = √(T/n) ‖ max_{k=1,…,n} ζ_k ‖_p,

where the random variables ζ_k := sup_{t∈[k−1,k)} |B_t − B_{k−1}| are i.i.d.

Lower bound. Note that, for every k ≥ 1,

ζ_k ≥ Z_k := |B_k − B_{k−1}|

since the Brownian motion (B_t) has continuous paths. The sequence (Z_k)_{k≥1} is i.i.d. as well, with the same distribution as |W_1|. Hence, the random variables Z_k² are still i.i.d. with a χ²(1)-distribution so that (see Exercises 1 and 2 below)

 
² The word “weak” refers here to the fact that the error E ϕ(X_T) − E ϕ(X̄_T^n) is related to the convergence of the distributions P_{X̄_T^n} toward P_{X_T} as n → ∞ (e.g. weakly or in variation), whereas strong L^p(P)-convergence involves the joint distributions P_{(X̄_T^n, X_T)}.
    
 
∀ p ≥ 2,  ‖ max_{k=1,…,n} |Z_k| ‖_p = ‖ max_{k=1,…,n} Z_k² ‖_{p/2}^{1/2} ≥ c_p √(log n).

Finally, one has

∀ p ≥ 2,  ‖ sup_{t∈[0,T]} |W_t − W_{t̲}| ‖_p ≥ c_p √( (T/n) log n ).

Upper bound. To establish the upper-bound, we proceed as follows. First, note that

ζ_1 = max( sup_{t∈[0,1)} B_t , sup_{t∈[0,1)} (−B_t) ).

We also know that

sup_{t∈[0,1)} B_t  has the same distribution as  |W_1|

(see e.g. [251], Reflection principle, p. 105). Hence, using that, for every a, b ≥ 0, e^{(a∨b)²} ≤ e^{a²} + e^{b²}, that sup_{t∈[0,1)} B_t ≥ 0 and that −B is also a standard Brownian motion, we derive that

E e^{θ ζ_1²} ≤ E e^{θ (sup_{t∈[0,1)} B_t)²} + E e^{θ (sup_{t∈[0,1)} (−B_t))²}
  = 2 E e^{θ (sup_{t∈[0,1)} B_t)²}
  = 2 E e^{θ W_1²} = 2 ∫_R e^{θ u²} e^{−u²/2} du/√(2π) = 2/√(1 − 2θ) < +∞

as long as θ ∈ (0, ½). Consequently, it follows from Lemma 7.1 below, applied to the sequence (ζ_n²)_{n≥1}, that

∀ p ∈ (0, +∞),  ‖ max_{k=1,…,n} ζ_k ‖_p = ‖ max_{k=1,…,n} ζ_k² ‖_{p/2}^{1/2} ≤ C_{W,p} √(1 + log n),

i.e. for every p ∈ (0, +∞),

‖ sup_{t∈[0,T]} |W_t − W_{t̲}| ‖_p ≤ C_{W,p} √( (T/n)(1 + log n) ).   (7.18)
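The √(log n) growth that drives both bounds can be observed directly: for i.i.d. standard Gaussian Z_k, the ratio E max_{k≤n} |Z_k| / √(2 log n) approaches 1 slowly from below. The short Monte Carlo experiment below is our own illustration, not part of the book's argument.

```python
import numpy as np

# Monte Carlo estimate of E max_{k<=n} |Z_k| for i.i.d. N(0;1) variables,
# compared with the sqrt(2 log n) benchmark governing the bounds above.
rng = np.random.default_rng(3)
ratios = {}
for n in (100, 1000, 10_000):
    z = np.abs(rng.standard_normal((500, n)))        # 500 replications
    ratios[n] = z.max(axis=1).mean() / np.sqrt(2 * np.log(n))
    print(n, ratios[n])   # slowly increases toward 1
```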

Lemma 7.1 Let Y_1, …, Y_n be non-negative random variables with the same distribution satisfying E(e^{λ Y_1}) < +∞ for some λ > 0. Then,

∀ p ∈ (0, +∞),  ‖ max(Y_1, …, Y_n) ‖_p ≤ (1/λ) ( log n + C_{p,Y_1,λ} ).

Proof: We may assume without loss of generality that p ≥ 1 since the ‖ · ‖_p-norm is non-decreasing in p. First, assume λ = 1. Let p ≥ 1. One sets

ϕ_p(x) = ( log(e^{p−1} + x) )^p − (p − 1)^p,  x > 0.

The function ϕ_p is continuous, increasing, concave and one-to-one from R_+ onto R_+ (the term e^{p−1} is introduced to ensure the concavity). It follows that ϕ_p^{−1}(y) = e^{((p−1)^p + y)^{1/p}} − e^{p−1} ≤ e^{p−1+y^{1/p}} for every y ≥ 0, owing to the elementary inequality (u + v)^{1/p} ≤ u^{1/p} + v^{1/p}, u, v ≥ 0. Hence,

E max_{k=1,…,n} Y_k^p = E max_{k=1,…,n} ϕ_p ∘ ϕ_p^{−1}(Y_k^p)
  = E ϕ_p( max( ϕ_p^{−1}(Y_1^p), …, ϕ_p^{−1}(Y_n^p) ) )

since ϕ_p is non-decreasing. Then Jensen's Inequality implies

E max_{k=1,…,n} Y_k^p ≤ ϕ_p( E max_{k=1,…,n} ϕ_p^{−1}(Y_k^p) )
  ≤ ϕ_p( Σ_{k=1}^n E ϕ_p^{−1}(Y_k^p) ) = ϕ_p( n E ϕ_p^{−1}(Y_1^p) )
  ≤ ϕ_p( n E e^{p−1+Y_1} ) ≤ ( p − 1 + log( 1 + n E e^{Y_1} ) )^p.

Hence

‖ max_{k=1,…,n} Y_k ‖_p ≤ log( 1 + n E e^{Y_1} ) + p − 1 = log n + log( E e^{Y_1} + 1/n ) + p − 1
  ≤ log n + C_{p,1,Y_1},

where C_{p,λ,Y_1} = log E e^{λ Y_1} + e^{p−1}.
Let us return to the general case, i.e. E e^{λ Y_1} < +∞ for some λ > 0. Then

‖ max(Y_1, …, Y_n) ‖_p = (1/λ) ‖ max(λY_1, …, λY_n) ‖_p ≤ (1/λ) ( log n + C_{p,λ,Y_1} ). ♦
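Lemma 7.1 can be illustrated on exponential variables, for which the left-hand side is explicit: if the Y_k are i.i.d. Exp(1) (so that E e^{λY_1} < +∞ for every λ < 1), then E max(Y_1, …, Y_n) equals the harmonic number H_n ≈ log n + γ, in line with the (1/λ)(log n + C) upper-bound. The code below is our own illustration of this log n growth.

```python
import numpy as np

# E max(Y_1,...,Y_n) for i.i.d. Exp(1) variables equals the harmonic number
# H_n = 1 + 1/2 + ... + 1/n ~ log n + gamma, exhibiting the log n growth
# predicted by Lemma 7.1.
rng = np.random.default_rng(5)
gamma = 0.5772156649  # Euler-Mascheroni constant
for n in (10, 100, 1000):
    y = rng.exponential(1.0, size=(2000, n))
    est = y.max(axis=1).mean()
    print(n, est, "~", np.log(n) + gamma)
```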

Exercises. 1. (Completion of the proof of the above lower bound I). Let Z be a non-negative random variable with distribution function F(z) = P(Z ≤ z) and a continuous probability density function f. Assume that the survival function F̄(z) := P(Z > z) satisfies: there exist a c ∈ (0, +∞) and an a ≥ 0 such that

∀ z ≥ a,  F̄(z) ≥ c f(z).   (7.19)

Show that, if (Z_n)_{n≥1} is i.i.d. with distribution P_Z(dz),

∀ p ≥ 1,  ‖ max(Z_1, …, Z_n) ‖_p ≥ c Σ_{k=1}^n (1 − F^k(a))/k
  ≥ c ( log(n + 1) + log(1 − F(a)) ).

[Hint: one may assume p = 1. Establish the classical representation formula

E U = ∫_0^{+∞} P(U ≥ u) du

for any non-negative random variable U and use some basic facts about the Stieltjes integral, like dF(z) = f(z) dz, etc.]
2. (Completion of the proof of the above lower bound II). Show that the χ²(1) distribution, defined by its density on the real line f(u) := e^{−u/2}/√(2πu) 1_{u>0}, satisfies the above inequality (7.19). [Hint: use an integration by parts and usual comparison theorems for integrals to show that

F̄(z) = 2 e^{−z/2}/√(2πz) − ∫_z^{+∞} e^{−u/2}/(u√(2πu)) du ∼ 2 e^{−z/2}/√(2πz)  as z → +∞.]

3. (Euler scheme of the martingale Black–Scholes model). Let (Z_n)_{n≥1} be an i.i.d. sequence of N(0; 1)-distributed random variables defined on a probability space (Ω, A, P).
(a) Compute, for every real number a > 0, the quantity E (1 + a Z_1) e^{a Z_1} as a function of a.
(b) Compute, for every integer n ≥ 1,

‖ e^{a(Z_1+···+Z_n) − n a²/2} − Π_{k=1}^n (1 + a Z_k) ‖_2.

(c) Let σ > 0. Show the existence of two positive real constants c_i, i = 1, 2, such that

‖ e^{σ(Z_1+···+Z_n)/√n − σ²/2} − Π_{k=1}^n ( 1 + σ Z_k/√n ) ‖_2 = (σ² e^{σ²/2}/√(2n)) ( 1 − (c_1 σ² + c_2)/n + O(1/n²) )^{1/2}.

(d) Deduce the exact convergence rate, as n → +∞, of the Euler scheme with step T/n of the martingale Black–Scholes model

dX_t = σ X_t dW_t,  X_0 = x_0 > 0.

Conclude as to the optimality of the rate established in Theorem 7.2 (a′) as a universal convergence rate of the Euler scheme (in a Lipschitz framework).
A.s. convergence rate(s)
The last important result of this section is devoted to the a.s. convergence of the
Euler schemes toward the diffusion process with a first (elementary) approach to its
rate of convergence.

Theorem 7.3 If b and σ satisfy (H_T^β) for a β ∈ (0, 1] and if X_0 is a.s. finite, the continuous Euler scheme X̄^n = (X̄_t^n)_{t∈[0,T]} a.s. converges toward the diffusion X for the sup-norm over [0, T]. Furthermore, for every α ∈ (0, β ∧ ½),

n^α sup_{t∈[0,T]} |X_t − X̄_t^n| → 0  a.s.

The proof follows from the L^p-convergence theorem by an approach “à la Borel–Cantelli”. The details are deferred to Sect. 7.8.6.

7.2.3 Proofs in the Quadratic Lipschitz Case for Autonomous Diffusions

We provide below a proof of both Proposition 7.2 and Theorem 7.2 in a simplified one-dimensional, autonomous and quadratic (p = 2) setting. This means that b(t, x) = b(x) and σ(t, x) = σ(x) are defined as Lipschitz continuous functions on the real line. Then (SDE) admits a unique strong solution starting from X_0 on every interval [0, T], which means that there exists a unique strong solution (X_t)_{t≥0} starting from X_0.
Furthermore, we will not care about the structure of the real constants that come
out, in particular no control in T is provided in this concise version of the proof. The
complete and detailed proofs are postponed to Sect. 7.8.

Lemma 7.2 (Gronwall’s Lemma) Let f : R+ → R+ be a Borel non-negative locally


bounded function and let ψ : R+ → R+ be a non-decreasing function satisfying
 t
(G) ≡ ∀ t ≥ 0, f (t) ≤ α f (s) ds + ψ(t)
0

for a real constant α > 0. Then

∀ t ≥ 0, sup f (s) ≤ eαt ψ(t).


0≤s≤t

Proof. It is clear that the non-decreasing (finite) function ϕ(t) := sup_{0≤s≤t} f(s) satisfies (G) instead of f. Now the function e^{−αt} ∫_0^t ϕ(s) ds has a right derivative at every t ≥ 0 and

( e^{−αt} ∫_0^t ϕ(s) ds )′_r = e^{−αt} ( ϕ(t+) − α ∫_0^t ϕ(s) ds ) ≤ e^{−αt} ψ(t+),

where ϕ(t+) and ψ(t+) denote the right limits of ϕ and ψ at t. Then, it follows from the fundamental theorem of Calculus that the function

t ↦ e^{−αt} ∫_0^t ϕ(s) ds − ∫_0^t e^{−αs} ψ(s+) ds  is non-increasing and null at 0,

hence non-positive, so that

∀ t ≥ 0,  ∫_0^t ϕ(s) ds ≤ e^{αt} ∫_0^t e^{−αs} ψ(s+) ds.

Plugging this into the above inequality implies

ϕ(t) ≤ α e^{αt} ∫_0^t e^{−αs} ψ(s+) ds + ψ(t) = α e^{αt} ∫_0^t e^{−αs} ψ(s) ds + ψ(t)
  ≤ ( α e^{αt} (1 − e^{−αt})/α + 1 ) ψ(t) = e^{αt} ψ(t),

where we used successively that a monotone function is ds-a.s. continuous and that ψ is non-decreasing. ♦

Now we recall the quadratic Doob Inequality, which is needed in the proof (instead of the more sophisticated Burkholder–Davis–Gundy Inequality required in the non-quadratic case).
Doob's Inequality (L² case) (see e.g. [183]).
(a) Let M = (M_t)_{t≥0} be a continuous martingale with M_0 = 0. Then, for every T > 0,

E sup_{t∈[0,T]} M_t² ≤ 4 E M_T² = 4 E ⟨M⟩_T.

(b) If M is simply a continuous local martingale with M_0 = 0, then, for every T > 0,

E sup_{t∈[0,T]} M_t² ≤ 4 E ⟨M⟩_T.

Proof of Proposition 7.2 (a first partial proof). We may assume without loss of generality that E X_0² < +∞ (otherwise the inequality is trivially fulfilled). Let τ_L := min{ t : |X_t − X_0| ≥ L }, L ∈ N \ {0} (with the usual convention min ∅ = +∞). It is an F-stopping time as the hitting time of a closed set by a process with continuous paths (see Sect. 7.8.2). Furthermore, for every t ∈ [0, T],

|X_t^{τ_L}| ≤ L + |X_0|,

where X_t^{τ_L} := X_{t∧τ_L} denotes the stopped process. In particular, this implies that

E sup_{t∈[0,T]} |X_t^{τ_L}|² ≤ 2 ( L² + E X_0² ) < +∞.

Then,

X_t^{τ_L} = X_0 + ∫_0^{t∧τ_L} b(X_s) ds + ∫_0^{t∧τ_L} σ(X_s) dW_s
  = X_0 + ∫_0^{t∧τ_L} b(X_s^{τ_L}) ds + ∫_0^{t∧τ_L} σ(X_s^{τ_L}) dW_s,

owing to the local feature of both regular and stochastic integrals. The stochastic integral

M_t^{(L)} := ∫_0^{t∧τ_L} σ(X_s^{τ_L}) dW_s

is a continuous local martingale, null at zero, with bracket process defined by

⟨M^{(L)}⟩_t = ∫_0^{t∧τ_L} σ²(X_s^{τ_L}) ds.

Now, using that t ∧ τ_L ≤ t, we derive that

|X_t^{τ_L}| ≤ |X_0| + ∫_0^t |b(X_s^{τ_L})| ds + sup_{s∈[0,t]} |M_s^{(L)}|,

which in turn immediately implies that

sup_{s∈[0,t]} |X_s^{τ_L}| ≤ |X_0| + ∫_0^t |b(X_s^{τ_L})| ds + sup_{s∈[0,t]} |M_s^{(L)}|.

The elementary inequality (a + b + c)² ≤ 3(a² + b² + c²), a, b, c ≥ 0, combined with the Schwarz Inequality, successively yields

sup_{s∈[0,t]} (X_s^{τ_L})² ≤ 3 ( X_0² + ( ∫_0^t |b(X_s^{τ_L})| ds )² + sup_{s∈[0,t]} |M_s^{(L)}|² )
  ≤ 3 ( X_0² + t ∫_0^t |b(X_s^{τ_L})|² ds + sup_{s∈[0,t]} |M_s^{(L)}|² ).

We know that the functions b and σ, as Lipschitz continuous functions, satisfy a linear growth assumption

|b(x)| + |σ(x)| ≤ C_{b,σ} (1 + |x|),  x ∈ R.

Then, taking expectation and using Doob's Inequality for the local martingale M^{(L)} yields, for an appropriate real constant C_{b,σ,T} > 0 (that may vary from line to line),

E sup_{s∈[0,t]} (X_s^{τ_L})² ≤ 3 ( E X_0² + T C_{b,σ}² ∫_0^t E (1 + |X_s^{τ_L}|)² ds + 4 E ∫_0^{t∧τ_L} σ²(X_s^{τ_L}) ds )
  ≤ C_{b,σ,T} ( E X_0² + ∫_0^t E (1 + |X_s^{τ_L}|)² ds )
  ≤ C_{b,σ,T} ( E X_0² + ∫_0^t ( 1 + E |X_s^{τ_L}|² ) ds ),

where we used again (in the first inequality) that τ_L ∧ t ≤ t. Finally, this can be rewritten as

E sup_{s∈[0,t]} (X_s^{τ_L})² ≤ C_{b,σ,T} ( 1 + E X_0² + ∫_0^t E |X_s^{τ_L}|² ds )

for a new real constant C_{b,σ,T}. Then the Gronwall Lemma 7.2, applied to the bounded function f_L(t) := E sup_{s∈[0,t]} (X_s^{τ_L})² at time t = T, implies

E sup_{s∈[0,T]} (X_s^{τ_L})² ≤ C_{b,σ,T} ( 1 + E X_0² ) e^{C_{b,σ,T} T}.

This holds for every L ≥ 1. Now τ_L ↑ +∞ a.s. as L ↑ +∞ since sup_{0≤s≤t} |X_s| < +∞ a.s. for every t ≥ 0. Consequently,

lim_{L→+∞} sup_{s∈[0,T]} |X_s^{τ_L}| = sup_{s∈[0,T]} |X_s|.

Then Fatou’s Lemma implies


 
   
E sup X s2 ≤ Cb,σ,T 1 + E X 02 eCb,σ,T T = Cb,σ,T 1 + E X 02 .
s∈[0,T ]

As for the Euler scheme, the proof follows closely the above lines, once we replace the stopping time τ_L by

τ̄_L = min{ t : |X̄_t − X_0| ≥ L }.

It suffices to note that, for every s ∈ [0, T], sup_{u∈[0,s]} |X̄_{u̲}| ≤ sup_{u∈[0,s]} |X̄_u|. Then one shows that

sup_{n≥1} E sup_{s∈[0,T]} (X̄_s^n)² ≤ C_{b,σ,T} ( 1 + E X_0² ) e^{C_{b,σ,T} T}. ♦

Proof of Theorem 7.2 (simplified setting). (a) (Convergence rate of the continuous Euler scheme). Combining the equations satisfied by X and its (continuous) Euler scheme yields

X_t − X̄_t = ∫_0^t ( b(X_s) − b(X̄_{s̲}) ) ds + ∫_0^t ( σ(X_s) − σ(X̄_{s̲}) ) dW_s.

Consequently, using that b and σ are Lipschitz continuous, the Schwarz and Doob Inequalities lead to

E sup_{s∈[0,t]} |X_s − X̄_s|²
  ≤ 2 E ( [b]_Lip ∫_0^t |X_s − X̄_{s̲}| ds )² + 2 E sup_{s∈[0,t]} | ∫_0^s ( σ(X_u) − σ(X̄_{u̲}) ) dW_u |²
  ≤ 2 [b]²_Lip E ( ∫_0^t |X_s − X̄_{s̲}| ds )² + 8 E ∫_0^t ( σ(X_u) − σ(X̄_{u̲}) )² du
  ≤ 2 t [b]²_Lip ∫_0^t E |X_s − X̄_{s̲}|² ds + 8 [σ]²_Lip ∫_0^t E |X_u − X̄_{u̲}|² du
  = C_{b,σ,T} ∫_0^t E |X_s − X̄_{s̲}|² ds
  ≤ C_{b,σ,T} ∫_0^t E |X_s − X̄_s|² ds + C_{b,σ,T} ∫_0^t E |X̄_s − X̄_{s̲}|² ds
  ≤ C_{b,σ,T} ∫_0^t E sup_{u∈[0,s]} |X_u − X̄_u|² ds + C_{b,σ,T} ∫_0^t E |X̄_s − X̄_{s̲}|² ds,

where C_{b,σ,T} varies from line to line.
The function f(t) := E sup_{s∈[0,t]} |X_s − X̄_s|² is locally bounded owing to Step 1. Consequently, it follows from Gronwall's Lemma 7.2 applied at t = T that

E sup_{s∈[0,T]} |X_s − X̄_s|² ≤ e^{C_{b,σ,T} T} C_{b,σ,T} ∫_0^T E |X̄_s − X̄_{s̲}|² ds.

Now

X̄_s − X̄_{s̲} = b(X̄_{s̲}) (s − s̲) + σ(X̄_{s̲}) (W_s − W_{s̲})   (7.20)

so that, using Step 1 (for the Euler scheme) and the fact that W_s − W_{s̲} and X̄_{s̲} are independent,

E |X̄_s − X̄_{s̲}|² ≤ 2 ( E b²(X̄_{s̲}) (T/n)² + E σ²(X̄_{s̲}) E (W_s − W_{s̲})² )
  ≤ C_{b,σ} ( 1 + E sup_{t∈[0,T]} |X̄_t|² ) ( (T/n)² + T/n )
  ≤ C_{b,σ,T} ( 1 + E X_0² ) (T/n).

(b) Stepwise constant Euler scheme. We assume here, for pure convenience, that X_0 ∈ L⁴. One derives from (7.20) and the linear growth assumption satisfied by b and σ (since they are Lipschitz continuous) that

sup_{t∈[0,T]} |X̄_t − X̄_{t̲}| ≤ C_{b,σ} ( 1 + sup_{t∈[0,T]} |X̄_t| ) ( T/n + sup_{t∈[0,T]} |W_t − W_{t̲}| )

so that

‖ sup_{t∈[0,T]} |X̄_t − X̄_{t̲}| ‖_2 ≤ C_{b,σ} ‖ ( 1 + sup_{t∈[0,T]} |X̄_t| ) ( T/n + sup_{t∈[0,T]} |W_t − W_{t̲}| ) ‖_2.

Now, note that if U and V are two real-valued random variables, the Schwarz Inequality implies

‖U V‖_2² = ‖U² V²‖_1 ≤ ‖U²‖_2 ‖V²‖_2 = ‖U‖_4² ‖V‖_4².

Combining this with the Minkowski inequality in both resulting terms yields

‖ sup_{t∈[0,T]} |X̄_t − X̄_{t̲}| ‖_2 ≤ C_{b,σ} ( 1 + ‖ sup_{t∈[0,T]} |X̄_t| ‖_4 ) ( T/n + ‖ sup_{t∈[0,T]} |W_t − W_{t̲}| ‖_4 ).

Now, as already established in (7.18),

‖ sup_{t∈[0,T]} |W_t − W_{t̲}| ‖_4 ≤ C_W √( (T/n)(1 + log n) ),

which completes the proof… if we admit that ‖ sup_{t∈[0,T]} |X̄_t| ‖_4 ≤ C_{b,σ,T} ( 1 + ‖X_0‖_4 ). ♦

Remarks. • The proof in the general L^p framework follows exactly the same lines, except that one replaces Doob's Inequality for a continuous (local) martingale (M_t)_{t≥0} by the so-called Burkholder–Davis–Gundy Inequality (see e.g. [251]), which holds for every exponent p > 0 (only in the continuous setting):

∀ t ≥ 0,  c_p ‖ ⟨M⟩_t^{1/2} ‖_p ≤ ‖ sup_{s∈[0,t]} |M_s| ‖_p ≤ C_p ‖ ⟨M⟩_t^{1/2} ‖_p = C_p ‖ ⟨M⟩_t ‖_{p/2}^{1/2},

where c_p, C_p are positive real constants only depending on p. This general setting is developed in full detail in Sect. 7.8 (in the one-dimensional case to alleviate notation).
• In some so-called mean-reverting situations one may even get boundedness over t ∈ (0, +∞).

7.3 Non-asymptotic Deviation Inequalities for the Euler Scheme

The aim of this section is to establish non-asymptotic deviation inequalities for the
Euler scheme in order to provide confidence intervals for the Monte Carlo method. We
 1
recall for convenience several important notations. Let A = Tr(A A∗ ) 2 denote
the Fröbenius norm of A ∈ M(d, q, R) (i.e. the canonical Euclidean norm on Rdq )
and let |||A||| = sup|x|=1 |Ax| be the operator norm of A (with respect to the canonical
Euclidean norms on Rd and Rq ). Note that |||A||| ≤ A.
We still consider the Brownian diffusion process solution to (7.1) on a prob-
ability space (, A, P) with the same Lipschitz regularity assumptions made on
the drift b and the diffusion coefficient σ. The q-dimensional driving Brownian
motion is still denoted by W and the (augmented) natural filtration (Ft )t∈[0,T ]
is still defined by (7.3). Furthermore, as the functions b : [0, T ] × Rd → Rd and
σ : [0, T ] × Rd → M(d, q, R) satisfy a Lipschitz continuous assumption in x uni-
formly in t ∈ [0, T ], we define

|b(t, x) − b(t, y)|


[b]Lip = sup < +∞ (7.21a)
t∈[0,T ], x= y |x − y|

and
σ(t, x) − σ(t, y)
[σ]Lip = sup < +∞. (7.21b)
t∈[0,T ], x= y |x − y|

The definition of the Brownian Euler scheme with step T/n, starting at X_0, is unchanged but, to alleviate notation in this section, we will temporarily write X̄_k^n instead of X̄_{t_k^n}^n. So we have X̄_0^n = X_0 and

X̄_{k+1}^n = X̄_k^n + (T/n) b(t_k^n, X̄_k^n) + √(T/n) σ(t_k^n, X̄_k^n) Z_{k+1},  k = 0, …, n − 1,

where t_k^n = kT/n, k = 0, …, n, and Z_k = √(n/T) ( W_{t_k^n} − W_{t_{k−1}^n} ), k = 1, …, n, is an i.i.d. sequence of N(0; I_q)-distributed random vectors. When X_0 = x ∈ R^d, we may occasionally denote by (X̄_k^{n,x})_{0≤k≤n} the Euler scheme starting at x.
The Euler scheme defines an $\mathbb R^d$-valued Markov chain with transitions $P_k(x, dy) = \mathbb P\big(\bar X^n_{k+1}\in dy \mid \bar X^n_k = x\big)$, $k = 0,\dots,n-1$, reading on bounded or non-negative Borel functions $f:\mathbb R^d\to\mathbb R$,
$$P_k(f)(x) = \mathbb E\big[f(\bar X^n_{k+1}) \mid \bar X^n_k = x\big] = \mathbb E\, f\big(\mathcal E_k(x, Z)\big), \quad k = 0,\dots,n-1,$$
where
$$\mathcal E_k(x, z) = x + \frac Tn\, b\big(t^n_k, x\big) + \sqrt{\frac Tn}\,\sigma\big(t^n_k, x\big)\, z, \quad x\in\mathbb R^d,\ z\in\mathbb R^q,\ k = 0,\dots,n-1,$$
denotes the Euler scheme operator and $Z \overset{d}{=} \mathcal N(0; I_q)$. Then set, for every $k, \ell \in \{0,\dots,n\}$, $k < \ell$,
$$P_{k,\ell} = P_k \circ \cdots \circ P_{\ell-1} \quad\text{and}\quad P_{\ell,\ell}(f) = f,$$
so that we have, still for bounded or non-negative Borel functions $f$,
$$P_{k,\ell}(f)(x) = \mathbb E\big[f(\bar X^n_\ell) \mid \bar X^n_k = x\big].$$

We will need the following property satisfied by the Euler transitions.

Proposition 7.3 Under the above Lipschitz assumptions (7.21a) and (7.21b) on $b$ and $\sigma$, the transitions $P_k$, $k = 0,\dots,n-1$, satisfy
$$[P_k f]_{\mathrm{Lip}} \le [P_k]_{\mathrm{Lip}}\, [f]_{\mathrm{Lip}}, \quad k = 0,\dots,n-1,$$
with
$$[P_k]_{\mathrm{Lip}} \le \Big(\big[I_d + \tfrac Tn\, b(t^n_k, \cdot)\big]_{\mathrm{Lip}}^2 + \tfrac Tn\,[\sigma(t^n_k, \cdot)]_{\mathrm{Lip}}^2\Big)^{\frac12} \le \Big(1 + \frac Tn\Big(C_{b,\sigma} + \kappa_b\,\frac Tn\Big)\Big)^{\frac12},$$
where
$$C_{b,\sigma} = 2[b]_{\mathrm{Lip}} + [\sigma]_{\mathrm{Lip}}^2 \quad\text{and}\quad \kappa_b = [b]_{\mathrm{Lip}}^2. \tag{7.22}$$

Proof. It is a straightforward consequence of the fact that
$$|P_k f(x) - P_k f(y)| \le [f]_{\mathrm{Lip}}\, \Big\|\, x - y + \tfrac Tn\big(b(t^n_k,x) - b(t^n_k,y)\big) + \sqrt{\tfrac Tn}\,\big(\sigma(t^n_k,x) - \sigma(t^n_k,y)\big) Z \,\Big\|_1$$
$$\le [f]_{\mathrm{Lip}}\, \Big\|\, x - y + \tfrac Tn\big(b(t^n_k,x) - b(t^n_k,y)\big) + \sqrt{\tfrac Tn}\,\big(\sigma(t^n_k,x) - \sigma(t^n_k,y)\big) Z \,\Big\|_2.$$
Now, one completes the proof by noting that
$$\Big\|\, x - y + \tfrac Tn\big(b(t^n_k,x) - b(t^n_k,y)\big) + \sqrt{\tfrac Tn}\,\big(\sigma(t^n_k,x) - \sigma(t^n_k,y)\big) Z \,\Big\|_2^2 = \Big|\, x - y + \tfrac Tn\big(b(t^n_k,x) - b(t^n_k,y)\big)\Big|^2 + \tfrac Tn\,\big\|\sigma(t^n_k,x) - \sigma(t^n_k,y)\big\|^2. \qquad\diamondsuit$$

The key property is the following classical exponential inequality for the Gaussian measure (the proof below is due to Ledoux in [195]).

Proposition 7.4 Let $g:\mathbb R^q\to\mathbb R$ be a Lipschitz continuous function (with respect to the canonical Euclidean norm on $\mathbb R^q$) and let $Z$ be an $\mathcal N(0; I_q)$-distributed random vector. Then
$$\forall\,\lambda\in\mathbb R,\quad \mathbb E\, e^{\lambda\left(g(Z) - \mathbb E\, g(Z)\right)} \le e^{\frac{\lambda^2}{2}\,[g]_{\mathrm{Lip}}^2}. \tag{7.23}$$

Proof. Step 1 (Preliminaries). We consider a standard Ornstein–Uhlenbeck process starting at $x\in\mathbb R^q$, being a solution to the stochastic differential equation
$$d\Xi^x_t = -\tfrac12\, \Xi^x_t\, dt + dW_t, \quad \Xi^x_0 = x,$$
where $W$ is a standard $q$-dimensional Brownian motion. One easily checks that this equation has a (unique) explicit solution given by
$$\Xi^x_t = x\, e^{-\frac t2} + e^{-\frac t2}\int_0^t e^{\frac s2}\, dW_s, \quad t\in\mathbb R_+.$$
This shows that $(\Xi^x_t)_{t\ge0}$ is a Gaussian process such that
$$\mathbb E\, \Xi^x_t = x\, e^{-\frac t2}.$$
Using the Wiener isometry, we derive that the covariance matrix $\Sigma^x_t$ of $\Xi^x_t$ is given by
$$\Sigma^x_t = e^{-t}\, \mathbb E\left[\left(\int_0^t e^{\frac s2}\, dW^k_s \int_0^t e^{\frac s2}\, dW^\ell_s\right)_{1\le k,\ell\le q}\right] = e^{-t}\int_0^t e^{s}\, ds\; I_q = (1 - e^{-t})\, I_q.$$
(The time covariance structure of the process can be computed likewise but is of no use in this proof.) As a consequence, for every Borel function $g:\mathbb R^q\to\mathbb R$ with polynomial growth,
$$Q_t g(x) := \mathbb E\, g(\Xi^x_t) = \mathbb E\, g\big(x\, e^{-t/2} + \sqrt{1 - e^{-t}}\; Z\big), \quad\text{where } Z \overset{d}{=} \mathcal N(0; I_q). \tag{7.24}$$

Hence, owing to the Lebesgue dominated convergence theorem,
$$\lim_{t\to+\infty} Q_t g(x) = \mathbb E\, g(Z).$$
Moreover, if $g$ is differentiable with a gradient $\nabla g$ having polynomial growth,
$$\nabla_x Q_t g(x) = e^{-\frac t2}\, \mathbb E\, \nabla g(\Xi^x_t). \tag{7.25}$$

If $g$ is twice continuously differentiable with a Hessian $D^2 g$ having polynomial growth, it follows from Itô's formula (see Sect. 12.8) and Fubini's Theorem that, for every $t\ge0$,
$$Q_t g(x) = g(x) + \mathbb E\int_0^t (Lg)(\Xi^x_s)\, ds = g(x) + \int_0^t Q_s(Lg)(x)\, ds, \tag{7.26}$$
where $L$ is the Ornstein–Uhlenbeck operator (infinitesimal generator of the above Ornstein–Uhlenbeck stochastic differential equation), which maps $g$ to the function
$$Lg : \xi \longmapsto Lg(\xi) = \frac12\Big(\Delta g(\xi) - \big(\xi \,|\, \nabla g(\xi)\big)\Big),$$
where $\Delta g(\xi) = \sum_{1\le i\le q} g''_{x_i^2}(\xi)$ denotes the Laplacian of $g$ and $\nabla g(\xi) = \big(g'_{x_1}(\xi),\dots,g'_{x_q}(\xi)\big)^*$. Now, if $h$ and $g$ are both twice differentiable with partial derivatives having polynomial growth, one has
$$\mathbb E\,\big(\nabla g(Z) \,|\, \nabla h(Z)\big) = \sum_{k=1}^q \int_{\mathbb R^q} g'_{x_k}(z)\, h'_{x_k}(z)\, e^{-\frac{|z|^2}{2}}\, \frac{dz}{(2\pi)^{\frac q2}}.$$
After noting that
$$\forall\, z = (z_1,\dots,z_q)\in\mathbb R^q, \quad \frac{\partial}{\partial z_k}\Big(h'_{x_k}(z)\, e^{-\frac{|z|^2}{2}}\Big) = e^{-\frac{|z|^2}{2}}\,\big(h''_{x_k^2}(z) - z_k\, h'_{x_k}(z)\big), \quad k = 1,\dots,q,$$
integrations by parts in each of the above $q$ integrals yield the following identity:
$$\mathbb E\,\big(\nabla g(Z) \,|\, \nabla h(Z)\big) = -2\,\mathbb E\,\big(g(Z)\, Lh(Z)\big). \tag{7.27}$$
One also derives from (7.26) and the continuity of $s \mapsto \mathbb E\,(Lg)(\Xi^x_s)$ that
$$\frac{\partial}{\partial t} Q_t g(x) = Q_t Lg(x).$$
In particular, $\frac{\partial}{\partial t} Q_t g_{|t=0}(x) = Q_0 Lg(x) = Lg(x)$. On the other hand, as $Q_t g(x)$ has partial derivatives with polynomial growth,
$$\lim_{s\to0} \frac{Q_s(Q_t g)(x) - Q_t g(x)}{s} = \frac{\partial}{\partial s} Q_s(Q_t g)(x)_{|s=0} = L Q_t g(x),$$
so that, finally, we get the classical identity
$$L Q_t g(x) = Q_t Lg(x). \tag{7.28}$$

Step 2 (The smooth case). Let $g:\mathbb R^q\to\mathbb R$ be a twice continuously differentiable function with bounded partial derivatives such that $\mathbb E\, g(Z) = 0$. Let $\lambda\in\mathbb R$ be fixed. We define the function $H_\lambda:\mathbb R_+\to\mathbb R_+$ by
$$H_\lambda(t) = \mathbb E\, e^{\lambda Q_t g(Z)}.$$
Let us first check that $H_\lambda$ is well-defined by the above equality. Note that the function $g$ is Lipschitz continuous since its gradient is bounded, with $[g]_{\mathrm{Lip}} \le \|\nabla g\|_{\sup} = \sup_{\xi\in\mathbb R^q} |\nabla g(\xi)|$. It follows from (7.24) that, for every $z\in\mathbb R^q$,
$$|Q_t g(z)| \le [g]_{\mathrm{Lip}}\, e^{-t/2}|z| + |Q_t g(0)| \le [g]_{\mathrm{Lip}}|z| + \mathbb E\,\big|g\big(\sqrt{1-e^{-t}}\, Z\big)\big| \le [g]_{\mathrm{Lip}}|z| + |g(0)| + [g]_{\mathrm{Lip}}\,\mathbb E\,|Z| \le [g]_{\mathrm{Lip}}|z| + |g(0)| + [g]_{\mathrm{Lip}}\sqrt q.$$
This ensures the existence of $H_\lambda$ since $\mathbb E\, e^{a|Z|} < +\infty$ for every $a\ge0$. One shows likewise that, for every $z\in\mathbb R^q$ and every $t\in\mathbb R_+$,
$$|Q_t g(z)| = |Q_t g(z) - \mathbb E\, g(Z)| \le [g]_{\mathrm{Lip}}\Big(e^{-t/2}|z| + \big(1 - \sqrt{1 - e^{-t}}\big)\,\mathbb E\,|Z|\Big) \to 0 \quad\text{as } t\to+\infty.$$
One concludes by the Lebesgue dominated convergence theorem that
$$\lim_{t\to+\infty} H_\lambda(t) = e^0 = 1.$$

Furthermore, one shows, still using the same arguments, that $H_\lambda$ is differentiable over $\mathbb R_+$ with a derivative given for every $t\in\mathbb R_+$ by
$$H_\lambda'(t) = \lambda\,\mathbb E\,\big(e^{\lambda Q_t g(Z)}\, Q_t Lg(Z)\big) = \lambda\,\mathbb E\,\big(e^{\lambda Q_t g(Z)}\, L Q_t g(Z)\big) \quad\text{by (7.28)}$$
$$= -\frac\lambda2\,\mathbb E\,\Big(\nabla_z\big(e^{\lambda Q_t g(z)}\big)_{|z=Z} \,\Big|\, \nabla Q_t g(Z)\Big) \quad\text{by (7.27)}$$
$$= -\frac{\lambda^2}{2}\, e^{-\frac t2}\, e^{-\frac t2}\,\mathbb E\,\Big(e^{\lambda Q_t g(Z)}\,\big|Q_t\nabla g(Z)\big|^2\Big) \quad\text{by (7.25)}.$$
Consequently, as $\lim_{t\to+\infty} H_\lambda(t) = 1$,
$$H_\lambda(t) = 1 - \int_t^{+\infty} H_\lambda'(s)\, ds \le 1 + \frac{\lambda^2}{2}\,\|\nabla g\|_{\sup}^2 \int_t^{+\infty} e^{-s}\,\mathbb E\, e^{\lambda Q_s g(Z)}\, ds = 1 + K\int_t^{+\infty} e^{-s}\, H_\lambda(s)\, ds \quad\text{with } K = \frac{\lambda^2}{2}\,\|\nabla g\|_{\sup}^2.$$

One derives by induction, using that $H_\lambda$ is non-increasing since its derivative is non-positive, that, for every integer $m\in\mathbb N^*$,
$$H_\lambda(t) \le \sum_{k=0}^m K^k\,\frac{e^{-kt}}{k!} + H_\lambda(0)\, K^{m+1}\,\frac{e^{-(m+1)t}}{(m+1)!}.$$
Letting $m\to+\infty$ finally yields
$$H_\lambda(t) \le e^{K e^{-t}} \le e^K = e^{\frac{\lambda^2}{2}\|\nabla g\|_{\sup}^2}.$$
One completes this step of the proof by applying the above inequality to the function $g - \mathbb E\, g(Z)$.
Step 3 (The Lipschitz continuous case). This step relies on an approximation technique which is closely related to sensitivity computation for options attached to non-regular payoffs (but in a situation where the Brownian motion plays the role of a pseudo-asset). Let $g:(\mathbb R^q,|\cdot|)\to(\mathbb R,|\cdot|)$ be a Lipschitz continuous function with Lipschitz coefficient $[g]_{\mathrm{Lip}}$ and let $\zeta \overset{d}{=} \mathcal N(0; I_q)$. One considers, for every $\varepsilon>0$,
$$g_\varepsilon(z) = \mathbb E\, g(z + \sqrt\varepsilon\,\zeta) = \int_{\mathbb R^q} g(u)\, e^{-\frac{|u-z|^2}{2\varepsilon}}\,\frac{du}{(2\pi\varepsilon)^{\frac q2}}.$$
It is clear that $g_\varepsilon$ converges uniformly toward $g$ on $\mathbb R^q$ since $|g_\varepsilon(z) - g(z)| \le [g]_{\mathrm{Lip}}\sqrt\varepsilon\,\mathbb E\,|\zeta|$ for every $z\in\mathbb R^q$. One shows likewise that $g_\varepsilon$ is Lipschitz continuous with $[g_\varepsilon]_{\mathrm{Lip}} \le [g]_{\mathrm{Lip}}$. Moreover, the function $g_\varepsilon$ is differentiable with gradient given by
$$\nabla g_\varepsilon(z) = \frac1\varepsilon\int_{\mathbb R^q} g(u)\, e^{-\frac{|u-z|^2}{2\varepsilon}}\,(u-z)\,\frac{du}{(2\pi\varepsilon)^{\frac q2}} = \frac1{\sqrt\varepsilon}\,\mathbb E\,\big(g(z+\sqrt\varepsilon\,\zeta)\,\zeta\big).$$
It is always true that if a function $h:\mathbb R^q\to\mathbb R$ is differentiable and Lipschitz continuous, then $\|\nabla h\|_{\sup} = [h]_{\mathrm{Lip}}$ ($^3$). Consequently,
$$\|\nabla g_\varepsilon\|_{\sup} = [g_\varepsilon]_{\mathrm{Lip}} \le [g]_{\mathrm{Lip}}.$$
The Hessian of $g_\varepsilon$ is also bounded (by a constant depending on $\varepsilon$, but this has no consequence on the inequality of interest). One concludes by Fatou's Lemma:
$$\mathbb E\, e^{\lambda\left(g(Z)-\mathbb E\, g(Z)\right)} \le \varliminf_{\varepsilon\to0}\,\mathbb E\, e^{\lambda\left(g_\varepsilon(Z)-\mathbb E\, g_\varepsilon(Z)\right)} \le \varliminf_{\varepsilon\to0}\, e^{\frac{\lambda^2}{2}\|\nabla g_\varepsilon\|_{\sup}^2} \le e^{\frac{\lambda^2}{2}[g]_{\mathrm{Lip}}^2}. \qquad\diamondsuit$$
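Inequality (7.23) can be sanity-checked numerically. The snippet below is our own illustration: it estimates the left-hand side by Monte Carlo for the 1-Lipschitz function $g(z) = |z|$ on $\mathbb R^q$ and compares it with the Gaussian bound.

```python
import numpy as np

rng = np.random.default_rng(42)
q, n_samples, lam = 3, 200_000, 0.5
Z = rng.standard_normal((n_samples, q))    # Z ~ N(0; I_q)
g = np.linalg.norm(Z, axis=1)              # g(z) = |z|, so [g]_Lip = 1
lhs = np.exp(lam * (g - g.mean())).mean()  # estimate of E exp(lam (g(Z) - E g(Z)))
rhs = np.exp(lam**2 / 2.0)                 # bound exp(lam^2 [g]_Lip^2 / 2)
```

With these hypothetical values one observes `lhs` comfortably below `rhs`, as the proposition predicts.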

We are now in a position to state the main result of this section and its application to the design of confidence intervals (see also [100], where this result was proved by a slightly different method).

Theorem 7.4 Assume $|||\sigma|||_{\sup} := \sup_{(t,x)\in[0,T]\times\mathbb R^d} |||\sigma(t,x)||| < +\infty$. Then there exists a positive decreasing sequence $K(b,\sigma,T,n)\in(0,+\infty)$, $n\ge1$, of real numbers such that, for every $n\ge1$ and every Lipschitz continuous function $f:\mathbb R^d\to\mathbb R$,
$$\forall\,\lambda\in\mathbb R,\quad \mathbb E\, e^{\lambda\left(f(\bar X^n_T)-\mathbb E f(\bar X^n_T)\right)} \le e^{\frac{\lambda^2}{2}\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T,n)}. \tag{7.29}$$
The choice $K(b,\sigma,T,n) = \dfrac{e^{\left(C_{b,\sigma}+\kappa_b\frac Tn\right)T}}{C_{b,\sigma}}$ is admissible, so that $\lim_n \downarrow K(b,\sigma,T,n) = \dfrac{e^{C_{b,\sigma}T}}{C_{b,\sigma}}$, where the real constant $C_{b,\sigma}$ is defined in (7.22) of Proposition 7.3.

$^3$ In fact, for every $z, u\in\mathbb R^q$, $|u| = 1$, one has $|(\nabla h(z)|u)| = \lim_{\varepsilon\to0}\varepsilon^{-1}|h(z+\varepsilon u) - h(z)| \le [h]_{\mathrm{Lip}}$, whereas $|\nabla h(z)| = \sup_{|u|=1}|(\nabla h(z)|u)|$. Hence $\|\nabla h\|_{\sup} = \sup_{z\in\mathbb R^q}|\nabla h(z)| \le [h]_{\mathrm{Lip}}$. The reverse inequality is obvious since, for every $z, z'\in\mathbb R^q$, $|h(z') - h(z)| = |(\nabla h(\zeta)|z'-z)| \le \|\nabla h\|_{\sup}\,|z'-z|$ for some $\zeta$ on the segment $[z,z']$.

Application to the design of non-asymptotic confidence intervals


Let us briefly recall that such exponential inequalities yield deviation inequalities in the strong law of large numbers. Let $(\bar X^{n,\ell})_{\ell\ge1}$ be independent copies of the discrete time Euler scheme $\bar X^n = (\bar X^n_{t^n_k})_{k=0,\dots,n}$ with step $\frac Tn$ starting at $\bar X^n_0 = X_0$. Then, for every $\varepsilon>0$ and every $\lambda>0$, the Markov inequality and independence imply, for every integer $n\ge1$,
$$\mathbb P\Big(\frac1M\sum_{\ell=1}^M f\big(\bar X^{n,\ell}_T\big) - \mathbb E f(\bar X^n_T) > \varepsilon\Big) = \mathbb P\Big(e^{\lambda\sum_{\ell=1}^M\left(f(\bar X^{n,\ell}_T)-\mathbb E f(\bar X^n_T)\right)} > e^{\lambda\varepsilon M}\Big)$$
$$\le e^{-\lambda\varepsilon M}\,\mathbb E\, e^{\lambda\sum_{\ell=1}^M\left(f(\bar X^{n,\ell}_T)-\mathbb E f(\bar X^n_T)\right)} = e^{-\lambda\varepsilon M}\,\Big(\mathbb E\, e^{\lambda\left(f(\bar X^n_T)-\mathbb E f(\bar X^n_T)\right)}\Big)^M$$
$$\le e^{-\lambda\varepsilon M + \frac{\lambda^2}{2}M\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T,n)}.$$
The function $\lambda\mapsto -\lambda\varepsilon + \frac{\lambda^2}{2}\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T,n)$ attains its minimum at
$$\lambda_{\min} = \frac{\varepsilon}{|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T,n)}$$
so that, finally,
$$\mathbb P\Big(\frac1M\sum_{\ell=1}^M f\big(\bar X^{n,\ell}_T\big) - \mathbb E f(\bar X^n_T) > \varepsilon\Big) \le e^{-\frac{\varepsilon^2 M}{2\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T,n)}}.$$

Applying the above inequality to $-f$ yields, by an obvious symmetry argument, the two-sided deviation inequality
$$\forall\, n\ge1,\quad \mathbb P\Big(\Big|\frac1M\sum_{\ell=1}^M f\big(\bar X^{n,\ell}_T\big) - \mathbb E f(\bar X^n_T)\Big| > \varepsilon\Big) \le 2\, e^{-\frac{\varepsilon^2 M}{2\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T,n)}}. \tag{7.30}$$

One easily derives from (7.30) confidence intervals, provided (upper-bounds of) $|||\sigma|||_{\sup}$, $[f]_{\mathrm{Lip}}$ and $K(b,\sigma,T,n)$ are known.
The main feature of the above inequality (7.30), beyond the fact that it holds for possibly unbounded Lipschitz continuous functions $f$, is that the right-hand upper-bound does not depend on the time step $\frac Tn$ of the Euler scheme. Consequently, we can design confidence intervals for Monte Carlo simulations based on the Euler scheme uniformly in the time discretization step $\frac Tn$.
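Setting the right-hand side of (7.30) equal to a target level $\alpha$ and solving for $\varepsilon$ gives the half-width of a level-$(1-\alpha)$ non-asymptotic confidence interval. The helper below is our own sketch; the constants $|||\sigma|||_{\sup}$, $[f]_{\mathrm{Lip}}$ and $K(b,\sigma,T,n)$ (or upper-bounds thereof) are assumed to be known, and the numerical values are purely illustrative.

```python
import numpy as np

def ci_half_width(alpha, M, sigma_sup, f_lip, K):
    """Half-width eps solving 2 exp(-eps^2 M / (2 sigma_sup^2 f_lip^2 K)) = alpha,
    i.e. a level-(1 - alpha) confidence half-width derived from (7.30)."""
    return np.sqrt(2.0 * sigma_sup**2 * f_lip**2 * K * np.log(2.0 / alpha) / M)

# hypothetical constants, for illustration only
eps = ci_half_width(alpha=0.05, M=100_000, sigma_sup=0.3, f_lip=1.0, K=2.0)
```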
Doing so, we can design non-asymptotic confidence intervals when computing $\mathbb E f(X_T)$ by a Monte Carlo simulation. We know that the bias is due to the discretization scheme and only depends on the step $\frac Tn$: under appropriate assumptions on $b$, $\sigma$ and $f$ (see Sect. 7.6 for the first order expansion of the weak error), one has
$$\mathbb E f(\bar X^n_T) = \mathbb E f(X_T) + \frac{c_1}{n} + o(1/n).$$

Remark. Under the assumptions we make on $b$ and $\sigma$, the Euler scheme converges a.s. and in every $L^p$ space (provided $X_0$ lies in $L^p$) for the sup norm over $[0,T]$ toward the diffusion process $X$, so that we deduce a similar result for independent copies $X^\ell$ of the diffusion itself, namely
$$\mathbb P\Big(\Big|\frac1M\sum_{\ell=1}^M f\big(X^\ell_T\big) - \mathbb E f(X_T)\Big| > \varepsilon\Big) \le 2\, e^{-\frac{\varepsilon^2 M}{2\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\, K(b,\sigma,T)}}$$
with $K(b,\sigma,T) = \frac{e^{C_{b,\sigma}T}}{C_{b,\sigma}}$ and $C_{b,\sigma} = 2[b]_{\mathrm{Lip}} + [\sigma]_{\mathrm{Lip}}^2$.
Proof of Theorem 7.4. It follows from Proposition 7.4 that, for every Lipschitz continuous function $f:\mathbb R^d\to\mathbb R$, the Euler scheme operator satisfies
$$\forall\,\lambda\in\mathbb R,\quad \mathbb E\, e^{\lambda\left(f(\mathcal E_k(x,Z)) - P_k f(x)\right)} \le e^{\frac{\lambda^2}{2}\,\frac Tn\,|||\sigma(t^n_k,x)|||^2\,[f]_{\mathrm{Lip}}^2}$$
since the function $z\mapsto f\big(\mathcal E_k(x,z)\big)$ is Lipschitz continuous from $\mathbb R^q$ to $\mathbb R$ with respect to the canonical Euclidean norm, with a Lipschitz coefficient upper-bounded by $\sqrt{\frac Tn}\,[f]_{\mathrm{Lip}}\,|||\sigma(t^n_k,x)|||$, where $|||\sigma(t^n_k,x)|||$ denotes the operator norm of $\sigma(t^n_k,x)$. Consequently, as $|||\sigma|||_{\sup} := \sup_{(t,x)\in[0,T]\times\mathbb R^d}|||\sigma(t,x)|||$,
$$\forall\,\lambda\in\mathbb R,\quad \mathbb E\Big(e^{\lambda f(\bar X^n_{t^n_{k+1}})}\,\Big|\,\mathcal F_{t^n_k}\Big) \le e^{\lambda P_k f(\bar X^n_{t^n_k}) + \frac{\lambda^2}{2}\,|||\sigma|||_{\sup}^2\,[f]_{\mathrm{Lip}}^2\,\frac Tn}.$$
Applying this inequality to $P_{k+1,n} f$ and taking expectation on both sides then yields
$$\forall\,\lambda\in\mathbb R,\quad \mathbb E\, e^{\lambda P_{k+1,n} f(\bar X^n_{t^n_{k+1}})} \le \mathbb E\, e^{\lambda P_{k,n} f(\bar X^n_{t^n_k})}\; e^{\frac{\lambda^2}{2}\,\frac Tn\,|||\sigma|||_{\sup}^2\,[P_{k+1,n} f]_{\mathrm{Lip}}^2}$$
since $P_{k,n} = P_k\circ P_{k+1,n}$. By a straightforward backward induction from $k = n-1$ down to $k = 0$, combined with the fact that $P_{0,n} f(X_0) = \mathbb E\big(f(\bar X^n_T)\,|\,X_0\big)$, we obtain
$$\forall\,\lambda\in\mathbb R,\quad \mathbb E\, e^{\lambda f(\bar X^n_T)} \le \mathbb E\, e^{\lambda\,\mathbb E\left(f(\bar X^n_T)|X_0\right)}\; e^{\frac{\lambda^2}{2}\,|||\sigma|||_{\sup}^2\,\frac Tn\sum_{k=0}^{n-1}[P_{k+1,n} f]_{\mathrm{Lip}}^2}.$$

First, note that by Jensen’s Inequality applied to the convex function eλ . ,


     
E eλ E f ( X̄ T ) |X 0 ≤ E E eλ f ( X̄ T ) |X 0 = E eλ f ( X̄ T ) .
n n n

On the other hand, it is clear from their very definition that


298 7 Discretization Scheme(s) of a Brownian Diffusion


n−1
[Pk,n ]Lip ≤ [P ]Lip
=k

(with the consistent convention that an empty product is equal to 1). Hence, owing to Proposition 7.3,
$$[P_{k,n}]_{\mathrm{Lip}} \le \Big(1 + C^{(n)}_{b,\sigma}\,\frac Tn\Big)^{\frac{n-k}{2}} \quad\text{with}\quad C^{(n)}_{b,\sigma} = C_{b,\sigma} + \kappa_b\,\frac Tn,$$
so that
$$\frac Tn\sum_{k=1}^n [P_{k,n} f]_{\mathrm{Lip}}^2 = \frac Tn\sum_{k=0}^{n-1}[P_{n-k,n} f]_{\mathrm{Lip}}^2 \le \frac{\big(1 + C^{(n)}_{b,\sigma}\frac Tn\big)^n - 1}{C^{(n)}_{b,\sigma}}\,[f]_{\mathrm{Lip}}^2 \le \frac1{C_{b,\sigma}}\, e^{\left(C_{b,\sigma}+\kappa_b\frac Tn\right)T}\,[f]_{\mathrm{Lip}}^2 = K(b,\sigma,T,n)\,[f]_{\mathrm{Lip}}^2. \qquad\diamondsuit$$

$\rhd$ Exercise (Hoeffding Inequality and applications). Let $Y:(\Omega,\mathcal A,\mathbb P)\to\mathbb R$ be a bounded centered random variable satisfying $|Y|\le A$ for some $A\in(0,+\infty)$.
(a) Show that, for every $\lambda>0$,
$$e^{\lambda Y} \le \frac12\Big(1 + \frac YA\Big)e^{\lambda A} + \frac12\Big(1 - \frac YA\Big)e^{-\lambda A}.$$
(b) Deduce that, for every $\lambda>0$,
$$\mathbb E\, e^{\lambda Y} \le \cosh(\lambda A) \le e^{\frac{\lambda^2 A^2}{2}}.$$
(c) Let $(X_k)_{k\ge1}$ be an i.i.d. sequence of random vectors and $f:\mathbb R^d\to\mathbb R$ a bounded Borel function. Show that
$$\forall\,\varepsilon>0,\ \forall\, M\ge1,\quad \mathbb P\Big(\Big|\frac1M\sum_{k=1}^M f(X_k) - \mathbb E f(X_1)\Big| > \varepsilon\Big) \le 2\, e^{-\frac{\varepsilon^2 M}{2\|f\|_{\sup}^2}}.$$
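The chain of bounds in parts (a) and (b) can be checked numerically. The snippet below is our own illustration for $Y$ uniform on $[-A, A]$, which is bounded and centered.

```python
import numpy as np

rng = np.random.default_rng(1)
A, lam, M = 1.0, 1.3, 500_000
Y = rng.uniform(-A, A, size=M)        # |Y| <= A and E Y = 0
mgf = np.exp(lam * Y).mean()          # Monte Carlo estimate of E e^{lam Y}
hoeffding = np.cosh(lam * A)          # Hoeffding bound from (b)
gaussian = np.exp(lam**2 * A**2 / 2)  # sub-Gaussian bound from (b)
```

Here the exact value is $\sinh(\lambda A)/(\lambda A)$, which indeed lies below both bounds.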

The Monte Carlo method does depend on the dimension d


As a second step, it may seem natural to try to establish deviation inequalities for the supremum of the Monte Carlo error over all Lipschitz continuous functions whose Lipschitz coefficients are bounded by 1, that is, for the $L^1$-Wasserstein distance, owing to the Monge–Kantorovich representation (see Sect. 7.2.2). For such results concerning the Euler scheme, we refer to [87].

At this point we want to emphasize that introducing this supremum inside the
probability radically modifies the behavior of the Monte Carlo error and highlights
a strong dependence on the structural dimension of the simulation. To establish this
behavior, we will rely heavily on an argument from optimal vector quantization
theory developed in Chap. 5.
Let $\mathrm{Lip}_1(\mathbb R^d,\mathbb R)$ be the set of Lipschitz continuous functions from $\mathbb R^d$ to $\mathbb R$ with Lipschitz coefficient $[f]_{\mathrm{Lip}}\le1$.
Let $X^\ell:(\Omega,\mathcal A,\mathbb P)\to\mathbb R^d$, $\ell\ge1$, be independent copies of an integrable $d$-dimensional random vector $X$ with distribution denoted by $\mathbb P_X$, independent of $X$ for convenience. For every $\omega\in\Omega$ and every $M\ge1$, let $\mu^X_M(\omega) = \frac1M\sum_{\ell=1}^M\delta_{X^\ell(\omega)}$ denote the empirical measure associated to $\big(X^\ell(\omega)\big)_{\ell\ge1}$. Then
$$\mathcal W_1\big(\mu^X_M(\omega),\mathbb P_X\big) = \sup_{f\in\mathrm{Lip}_1(\mathbb R^d,\mathbb R)}\Big|\frac1M\sum_{\ell=1}^M f\big(X^\ell(\omega)\big) - \int_{\mathbb R^d} f(\xi)\,\mathbb P_X(d\xi)\Big|$$

(the absolute values can be removed without damage since $f$ and $-f$ simultaneously belong to $\mathrm{Lip}_1(\mathbb R^d,\mathbb R)$). Now let us introduce the function defined for every $\xi\in\mathbb R^d$ by
$$f_{\omega,M}(\xi) = \min_{\ell=1,\dots,M}|X^\ell(\omega) - \xi| \ge 0.$$
It is clear from its very definition that $f_{\omega,M}\in\mathrm{Lip}_1(\mathbb R^d,\mathbb R)$, owing to the elementary inequality
$$\Big|\min_{1\le i\le M} a_i - \min_{1\le i\le M} b_i\Big| \le \max_{1\le i\le M}|a_i - b_i|.$$

Then, for every $\omega\in\Omega$,
$$\mathcal W_1\big(\mu^X_M(\omega),\mathbb P_X\big) \ge \Big|\underbrace{\frac1M\sum_{\ell=1}^M f_{\omega,M}\big(X^\ell(\omega)\big)}_{=0} - \int_{\mathbb R^d} f_{\omega,M}(\xi)\,\mathbb P_X(d\xi)\Big| = \int_{\mathbb R^d} f_{\omega,M}(\xi)\,\mathbb P_X(d\xi)$$
$$= \int_\Omega \min_{\ell=1,\dots,M}|X(\omega') - X^\ell(\omega)|\,d\mathbb P(\omega') \ge \inf_{(x_1,\dots,x_M)\in(\mathbb R^d)^M}\mathbb E\min_{1\le i\le M}|X - x_i| = \inf_{x\in(\mathbb R^d)^M}\big\|X - \widehat X^x\big\|_1.$$

The lower bound in the last line is just the optimal $L^1$-quantization error of the (distribution of the) random vector $X$ at level $M$. It follows from Zador's Theorem (see the remark that follows Zador's Theorem 5.1.2 in Chap. 5 or [129]) that
$$\varliminf_M\, M^{\frac1d}\inf_{x\in(\mathbb R^d)^M}\big\|X - \widehat X^x\big\|_1 \ge J_{1,d}\,\big\|\varphi_X\big\|_{\frac{d}{d+1}},$$
where $\varphi_X$ denotes the density of the nonsingular part of the distribution $\mathbb P_X$ of $X$ with respect to the Lebesgue measure on $\mathbb R^d$, if it exists. The constant $J_{1,d}\in(0,+\infty)$ is a universal constant and the pseudo-norm $\|\varphi_X\|_{\frac{d}{d+1}}$ is finite as soon as $X\in L^{1+} = \cup_{\eta>0} L^{1+\eta}$. Furthermore, it is clear that, as soon as the support of $\mathbb P_X$ is infinite, $e_{1,M}(X) > 0$. Combining these two inequalities, we deduce that, for non-purely singular distributions (i.e. such that $\varphi_X\not\equiv0$),
$$\varliminf_M\, M^{\frac1d}\sup_{f\in\mathrm{Lip}_1(\mathbb R^d,\mathbb R)}\Big|\frac1M\sum_{\ell=1}^M f(X^\ell) - \mathbb E f(X)\Big| > 0.$$
This illustrates that the strong law of large numbers/Monte Carlo method is not as "dimension free" as is commonly admitted.
For recent results about the (non-asymptotic) behavior of $\mathbb E\,\mathcal W_1(\mu^X_M,\mathbb P_X)$, we refer to [97]. It again emphasizes that the $L^1$-Wasserstein distance is not dimension free: in fact, the generic behavior of $\mathbb E\,\mathcal W_1(\mu^X_M,\mathbb P_X)$ is $M^{-\frac1d}$, at least when $d\ge3$.
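The quantity $\mathbb E\min_{\ell\le M}|X - X^\ell|$ appearing in the lower bound above is easy to estimate by simulation. The sketch below, our own illustration with hypothetical sample sizes, shows its dimension dependence for a standard Gaussian: the nearest-neighbour distance decays roughly like $M^{-1/d}$, hence much more slowly in higher dimension.

```python
import numpy as np

def mean_nn_dist(d, M, reps, rng):
    """Monte Carlo estimate of E min_{l<=M} |X - X^l| for i.i.d. N(0, I_d) samples."""
    total = 0.0
    for _ in range(reps):
        cloud = rng.standard_normal((M, d))   # the M points X^1, ..., X^M
        x = rng.standard_normal(d)            # an independent copy X
        total += np.linalg.norm(cloud - x, axis=1).min()
    return total / reps

rng = np.random.default_rng(7)
e1 = mean_nn_dist(d=1, M=1000, reps=200, rng=rng)   # behaves like M^{-1}
e5 = mean_nn_dist(d=5, M=1000, reps=200, rng=rng)   # behaves like M^{-1/5}
```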

7.4 Pricing Path-Dependent Options (I) (Lookback, Asian, etc.)

Let us recall the notation $\mathbb D([0,T],\mathbb R^d) := \big\{\xi:[0,T]\to\mathbb R^d,\ \text{càdlàg}\big\}$. As a direct
consequence of Theorem 7.2, if F : D([0, T ], R) → R is a Lipschitz continuous
functional with respect to the sup norm, i.e. satisfies
 
$$|F(\xi) - F(\xi')| \le [F]_{\mathrm{Lip}}\sup_{t\in[0,T]}|\xi(t) - \xi'(t)|,$$
then, for the stepwise constant Euler scheme,
$$\Big|\mathbb E\, F\big((X_t)_{t\in[0,T]}\big) - \mathbb E\, F\big((\widetilde X^n_t)_{t\in[0,T]}\big)\Big| \le [F]_{\mathrm{Lip}}\, C_{b,\sigma,T}\,\sqrt{\frac{1+\log n}{n}}$$
and, for the genuine (continuous) Euler scheme,
$$\Big|\mathbb E\, F\big((X_t)_{t\in[0,T]}\big) - \mathbb E\, F\big((\bar X^n_t)_{t\in[0,T]}\big)\Big| \le C\, n^{-\frac12}.$$

Typical example in option pricing. Assume that a one-dimensional diffusion process $X = (X_t)_{t\in[0,T]}$ models the dynamics of a single risky asset (we do not take into account here the consequences on the drift and diffusion coefficient term induced by the preservation of non-negativity, nor the martingale property under a risk-neutral probability for the discounted asset…).
– The (partial) Lookback payoffs:
$$h_T := \Big(X_T - \lambda\min_{t\in[0,T]} X_t\Big)_+$$
where $\lambda = 1$ in the regular Lookback case and $\lambda > 1$ in the so-called "partial Lookback" case.
– Vanilla payoffs on the supremum (like Calls and Puts) of the form
$$h_T = \varphi\Big(\sup_{t\in[0,T]} X_t\Big) \quad\text{or}\quad \varphi\Big(\inf_{t\in[0,T]} X_t\Big)$$
where $\varphi$ is Lipschitz continuous on $\mathbb R_+$.


– Asian payoffs of the form
$$h_T = \varphi\Big(\frac1{T-T_0}\int_{T_0}^T \psi(X_s)\, ds\Big), \quad 0 \le T_0 < T,$$
where $\varphi$ and $\psi$ are Lipschitz continuous on $\mathbb R_+$. In fact, such Asian payoffs are even continuous with respect to the pathwise $L^2$-norm $\|f\|_{L^2_T} := \big(\int_0^T f(s)^2\, ds\big)^{\frac12}$. Note that the positivity of the payoffs plays no role here.
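For instance, a crude Monte Carlo price of the ($\lambda = 1$) Lookback payoff along the stepwise constant Euler scheme can be sketched as follows. This is our own illustration with hypothetical Black–Scholes-type dynamics and parameters; discounting is omitted.

```python
import numpy as np

def lookback_mc(x0, r, sigma, T, n, M, rng):
    """Monte Carlo estimate of E (X_T - min_t X_t)_+ along the stepwise
    constant Euler scheme of dX_t = r X_t dt + sigma X_t dW_t."""
    h = T / n
    X = np.full(M, x0)
    running_min = np.full(M, x0)
    for _ in range(n):
        Z = rng.standard_normal(M)
        X = X + r * X * h + sigma * X * np.sqrt(h) * Z
        running_min = np.minimum(running_min, X)
    return np.maximum(X - running_min, 0.0).mean()

rng = np.random.default_rng(3)
price = lookback_mc(x0=100.0, r=0.05, sigma=0.2, T=1.0, n=50, M=20_000, rng=rng)
```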

See Chap. 8 for improvements based on a weak error approach.

7.5 The Milstein Scheme (Looking for Better Strong Rates…)

Throughout this section we will consider an autonomous diffusion, just for notational convenience ($b(t,\cdot) = b$ and $\sigma(t,\cdot) = \sigma$). The extension to general non-autonomous SDEs of the form (7.1) is straightforward (in particular, it adds no further terms to the discretization scheme). The Milstein scheme was originally designed (see [214]) to produce a $O(1/n)$-error (in $L^p$), like… the Euler scheme for deterministic ODEs. However, in the framework of SDEs, such a scheme is of higher order compared to the Euler scheme. In one dimension, its definition is simple and it can easily be implemented, provided the diffusion coefficient $\sigma$ is differentiable (see below). In higher dimensions, several problems arise, of both a theoretical and practical (simulability) nature, making its use more questionable, especially when its performances are compared to the weak error rate satisfied by the Euler scheme (see Sect. 7.6).

7.5.1 The One-Dimensional Setting

Let $X^x = (X^x_t)_{t\in[0,T]}$ denote the (autonomous) diffusion process starting at $x\in\mathbb R$ at time 0, which is the solution to the following SDE written in its integrated form:
$$X^x_t = x + \int_0^t b(X^x_s)\, ds + \int_0^t \sigma(X^x_s)\, dW_s, \tag{7.31}$$
where we assume that the functions $b$ and $\sigma$ are twice differentiable with bounded derivatives (hence Lipschitz continuous).
The starting idea is to expand the solution $X^x_t$ for small $t$ in order to "select" the terms which go to zero at most as fast as $t$ when $t\to0$ (with respect to the $L^2(\mathbb P)$-norm). Let us inspect the two integral terms successively. First,
$$\int_0^t b(X^x_s)\, ds = b(x)\, t + o(t) \quad\text{as } t\to0$$
since $X^x_t\to x$ as $t\to0$ and $b$ is continuous. Furthermore, as $b$ is Lipschitz continuous and $\mathbb E\sup_{s\in[0,t]}|X^x_s - x|^2\to0$ as $t\to0$ (see e.g. Proposition 7.7 further on), this expansion holds for the $L^2(\mathbb P)$-norm as well (i.e. $o(t) = o_{L^2}(t)$).
As concerns the stochastic integral term, we keep in mind that $\mathbb E\,(W_{t+\Delta t} - W_t)^2 = \Delta t$, so that one may consider heuristically that, in a scheme, a Brownian increment $\Delta W_t := W_{t+\Delta t} - W_t$ between $t$ and $t+\Delta t$ behaves like $\sqrt{\Delta t}$. Then, by Itô's Lemma (see Sect. 12.8), one has for every $s\in[0,t]$,
$$\sigma(X^x_s) = \sigma(x) + \int_0^s\Big(\sigma'(X^x_u)\, b(X^x_u) + \frac12\,\sigma''(X^x_u)\,\sigma^2(X^x_u)\Big) du + \int_0^s \sigma'(X^x_u)\,\sigma(X^x_u)\, dW_u$$
so that
$$\int_0^t \sigma(X^x_s)\, dW_s = \sigma(x)\, W_t + \int_0^t\!\!\int_0^s \sigma(X^x_u)\,\sigma'(X^x_u)\, dW_u\, dW_s + O_{L^2}(t^{3/2}) \tag{7.32}$$
$$= \sigma(x)\, W_t + \sigma\sigma'(x)\int_0^t W_s\, dW_s + o_{L^2}(t) + O_{L^2}(t^{3/2}) \tag{7.33}$$
$$= \sigma(x)\, W_t + \frac12\,\sigma\sigma'(x)\,\big(W_t^2 - t\big) + o_{L^2}(t) \tag{7.34}$$
since, by Itô's formula, $\int_0^t W_s\, dW_s = \frac12(W_t^2 - t)$.
The $O_{L^2}(t^{3/2})$ in (7.32) comes from the fact that $u\mapsto \sigma'(X^x_u)\, b(X^x_u) + \frac12\,\sigma''(X^x_u)\,\sigma^2(X^x_u)$ is $L^2(\mathbb P)$-bounded in the neighborhood of 0 (note that $b$ and $\sigma$ have at most linear growth and use Proposition 7.2). Consequently, using Itô's fundamental isometry, the Fubini–Tonelli Theorem and Proposition 7.2,
$$\mathbb E\left(\int_0^t\!\!\int_0^s\Big(\sigma'(X^x_u)\, b(X^x_u) + \frac12\,\sigma''(X^x_u)\,\sigma^2(X^x_u)\Big) du\; dW_s\right)^2 = \mathbb E\int_0^t\left(\int_0^s\Big(\sigma'(X^x_u)\, b(X^x_u) + \frac12\,\sigma''(X^x_u)\,\sigma^2(X^x_u)\Big) du\right)^2 ds$$
$$\le C\,\big(1 + x^4\big)\int_0^t s^2\, ds = \frac C3\,\big(1 + x^4\big)\, t^3.$$
The $o_{L^2}(t)$ in Eq. (7.33) also follows from the combination of Itô's fundamental isometry (applied twice) and the Fubini–Tonelli Theorem, which yields
$$\mathbb E\left(\int_0^t\!\!\int_0^s\big(\sigma\sigma'(X^x_u) - \sigma\sigma'(x)\big)\, dW_u\, dW_s\right)^2 = \int_0^t\!\!\int_0^s \varepsilon(u)\, du\, ds,$$
where $\varepsilon(u) = \mathbb E\big(\sigma\sigma'(X^x_u) - \sigma\sigma'(x)\big)^2\to0$ as $u\to0$ by the Lebesgue dominated convergence Theorem. Finally, note that, by scaling and using that $\mathbb E\, W_1^2 = 1$ and $\mathbb E\, W_1^4 = 3$,
$$\big\|W_t^2 - t\big\|_2 = t\,\big\|W_1^2 - 1\big\|_2 = \sqrt2\, t,$$
so that the second term in the right-hand side of (7.34) is exactly of order one. Then $X^x_t$ expands as follows:
$$X^x_t = x + b(x)\, t + \sigma(x)\, W_t + \frac12\,\sigma\sigma'(x)\,\big(W_t^2 - t\big) + o_{L^2}(t).$$
Using the Markov property of the diffusion, one can reproduce the above reasoning on each time step $[t^n_k, t^n_{k+1})$, given the value of the scheme at time $t^n_k$. This suggests defining the discrete time Milstein scheme $(\bar X^{mil,n}_{t^n_k})_{k=0,\dots,n}$ with step $\frac Tn$ as follows: $\bar X^{mil,n}_0 = X_0$ and
$$\bar X^{mil,n}_{t^n_{k+1}} = \bar X^{mil,n}_{t^n_k} + \frac Tn\, b\big(\bar X^{mil,n}_{t^n_k}\big) + \sigma\big(\bar X^{mil,n}_{t^n_k}\big)\sqrt{\frac Tn}\; Z^n_{k+1} + \frac12\,\sigma\sigma'\big(\bar X^{mil,n}_{t^n_k}\big)\,\frac Tn\,\Big(\big(Z^n_{k+1}\big)^2 - 1\Big), \tag{7.35}$$
where
$$Z^n_k = \sqrt{\frac nT}\,\big(W_{t^n_k} - W_{t^n_{k-1}}\big), \quad k = 1,\dots,n.$$

Or, equivalently, by grouping the drift terms together, $\bar X^{mil,n}_0 = X_0$ and
$$\bar X^{mil,n}_{t^n_{k+1}} = \bar X^{mil,n}_{t^n_k} + \Big(b\big(\bar X^{mil,n}_{t^n_k}\big) - \frac12\,\sigma\sigma'\big(\bar X^{mil,n}_{t^n_k}\big)\Big)\frac Tn + \sigma\big(\bar X^{mil,n}_{t^n_k}\big)\sqrt{\frac Tn}\; Z^n_{k+1} + \frac12\,\sigma\sigma'\big(\bar X^{mil,n}_{t^n_k}\big)\,\frac Tn\,\big(Z^n_{k+1}\big)^2. \tag{7.36}$$
Like for the Euler scheme, the stepwise constant Milstein scheme is defined as
$$\widetilde X^{mil,n}_t = \bar X^{mil,n}_{\underline t}, \quad t\in[0,T]. \tag{7.37}$$
In what follows, when no ambiguity arises, we will often drop the superscript $n$ in the notation of the Milstein scheme(s).
By interpolating the above scheme between the discretization times, i.e. freezing the coefficients of the scheme, we define the continuous or genuine Milstein scheme with step $\frac Tn$, with our standard notation $\underline t$, by $\bar X^{mil}_0 = X_0$ and, for every $t\in[0,T]$,
$$\bar X^{mil}_t = \bar X^{mil}_{\underline t} + \Big(b\big(\bar X^{mil}_{\underline t}\big) - \frac12\,\sigma\sigma'\big(\bar X^{mil}_{\underline t}\big)\Big)(t - \underline t) + \sigma\big(\bar X^{mil}_{\underline t}\big)\big(W_t - W_{\underline t}\big) + \frac12\,\sigma\sigma'\big(\bar X^{mil}_{\underline t}\big)\big(W_t - W_{\underline t}\big)^2. \tag{7.38}$$
$\rhd$ Example (Black–Scholes model). The discrete time Milstein scheme of a Black–Scholes model starting at $x_0 > 0$, with interest rate $r$ and volatility $\sigma > 0$ over $[0,T]$, reads
$$\bar X^{mil}_{t^n_{k+1}} = \bar X^{mil}_{t^n_k}\left(1 + \Big(r - \frac{\sigma^2}{2}\Big)\frac Tn + \sigma\sqrt{\frac Tn}\; Z^n_{k+1} + \frac{\sigma^2}{2}\,\frac Tn\,\big(Z^n_{k+1}\big)^2\right). \tag{7.39}$$
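Recursion (7.39) translates directly into code. The sketch below is our own illustration with hypothetical parameters; note that they satisfy $\sigma^2 \le 2r$, which (see the exercises at the end of this section) guarantees that every one-step multiplier, hence the whole path, is positive.

```python
import numpy as np

def milstein_bs_path(x0, r, sigma, T, n, rng):
    """One path of the discrete-time Milstein scheme (7.39) for the
    Black-Scholes SDE dX_t = r X_t dt + sigma X_t dW_t."""
    h = T / n
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        z = rng.standard_normal()
        x[k + 1] = x[k] * (1.0 + (r - 0.5 * sigma**2) * h
                           + sigma * np.sqrt(h) * z
                           + 0.5 * sigma**2 * h * z**2)
    return x

rng = np.random.default_rng(5)
# here sigma^2 = 0.09 <= 2 r = 0.12, so positivity is preserved
path = milstein_bs_path(x0=100.0, r=0.06, sigma=0.3, T=1.0, n=250, rng=rng)
```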

The following theorem gives the rate of strong pathwise convergence of the Milstein scheme under slightly less stringent hypotheses than the usual $C^1_{\mathrm{Lip}}$-assumptions on $b$ and $\sigma$.

Theorem 7.5 (Strong $L^p$-rate for the Milstein scheme) (See e.g. [170]) Assume that $b$ and $\sigma$ are $C^1$ on $\mathbb R$ with bounded, $\alpha_b$- and $\alpha_\sigma$-Hölder continuous first derivatives, respectively, $\alpha_b, \alpha_\sigma\in(0,1]$. Let $\alpha = \min(\alpha_b, \alpha_\sigma)$.
(a) Discrete time and genuine Milstein scheme. For every $p\in(0,+\infty)$, there exists a real constant $C_{b,\sigma,T,p} > 0$ such that, for every $X_0\in L^p(\mathbb P)$ independent of the Brownian motion $W$, one has
$$\Big\|\max_{0\le k\le n}\big|X_{t^n_k} - \bar X^{mil,n}_{t^n_k}\big|\Big\|_p \le \Big\|\sup_{t\in[0,T]}\big|X_t - \bar X^{mil,n}_t\big|\Big\|_p \le C_{b,\sigma,T,p}\,\big(1 + \|X_0\|_p\big)\Big(\frac Tn\Big)^{\frac{1+\alpha}{2}}.$$
In particular, if $b'$ and $\sigma'$ are Lipschitz continuous,
$$\Big\|\max_{0\le k\le n}\big|X_{t^n_k} - \bar X^{mil,n}_{t^n_k}\big|\Big\|_p \le \Big\|\sup_{t\in[0,T]}\big|X_t - \bar X^{mil,n}_t\big|\Big\|_p \le C_{p,b,\sigma,T}\,\big(1 + \|X_0\|_p\big)\,\frac Tn.$$
(b) Stepwise constant Milstein scheme. As concerns the stepwise constant Milstein scheme $(\widetilde X^{mil,n}_t)_{t\in[0,T]}$ defined by (7.37), one has (like for the Euler scheme!)
$$\Big\|\sup_{t\in[0,T]}\big|X_t - \widetilde X^{mil,n}_t\big|\Big\|_p \le C_{b,\sigma,T}\,\big(1 + \|X_0\|_p\big)\sqrt{\frac Tn\,(1 + \log n)}.$$

A detailed proof is provided in Sect. 7.8.8 in the case $X_0 = x\in\mathbb R$ and $p\in[2,+\infty)$.
Remarks. • As soon as the derivatives of $b$ and $\sigma$ are Hölder continuous, the genuine Milstein scheme converges faster than the Euler scheme. Furthermore, if $b$ and $\sigma$ are $C^1$ with Lipschitz continuous derivatives, the $L^p$-convergence rate of the genuine Milstein scheme is, as expected, $O\big(\frac1n\big)$.
• This $O\big(\frac1n\big)$-rate obtained when $b$ and $\sigma$ are (bounded and) Lipschitz continuous should be compared to the weak rate investigated in Sect. 7.6, which is also $O\big(\frac1n\big)$ for the approximation of $\mathbb E f(X_T)$ by $\mathbb E f(\bar X^n_T)$ (under slightly more stringent assumptions). Comparing the performances of both approaches should rely on numerical evidence and depends on the specified diffusion, function or step parameter.
• Claim (b) of the theorem shows that the stepwise constant Milstein scheme does not converge faster than the stepwise constant Euler scheme (except, of course, at the discretization times $t^n_k$). To convince yourself of this, just think of the Brownian motion itself: in that case, $b\equiv0$ and $\sigma\equiv1$ (so that $\sigma\sigma'\equiv0$), hence the stepwise constant Milstein and Euler schemes coincide and consequently converge at the same rate! As a consequence, since it is the only simulable version of the Milstein scheme when dealing with path-dependent functionals, its use for the approximate computation (by Monte Carlo simulation) of expectations of the form $\mathbb E\, F\big((X_t)_{t\in[0,T]}\big)$ should not provide better results than implementing the standard stepwise constant Euler scheme, as briefly described in Sect. 7.4.
By contrast, some functionals of the continuous Euler scheme can be simulated in an exact way: this is the purpose of Chap. 8, devoted to diffusion bridges. This is not the case for the Milstein scheme.
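The strong-rate comparison of the first remark can be illustrated numerically on the Black–Scholes SDE, whose exact solution $X_T = x_0\exp\big((r-\sigma^2/2)T + \sigma W_T\big)$ is available in closed form. In our own sketch below, both schemes are driven by the same Brownian increments; all parameter values are illustrative.

```python
import numpy as np

def strong_errors(x0, r, sigma, T, n, M, rng):
    """L^2 errors at time T of the Euler and Milstein schemes for
    dX_t = r X_t dt + sigma X_t dW_t, against the exact solution."""
    h = T / n
    dW = np.sqrt(h) * rng.standard_normal((M, n))   # shared Brownian increments
    eu = np.full(M, x0)
    mi = np.full(M, x0)
    for k in range(n):
        z = dW[:, k]
        eu = eu + r * eu * h + sigma * eu * z
        mi = mi + r * mi * h + sigma * mi * z + 0.5 * sigma**2 * mi * (z**2 - h)
    exact = x0 * np.exp((r - 0.5 * sigma**2) * T + sigma * dW.sum(axis=1))
    err_euler = np.sqrt(np.mean((eu - exact) ** 2))
    err_milstein = np.sqrt(np.mean((mi - exact) ** 2))
    return err_euler, err_milstein

rng = np.random.default_rng(11)
e_euler, e_milstein = strong_errors(100.0, 0.05, 0.4, 1.0, 100, 20_000, rng)
```

At the discretization dates the Milstein error is markedly smaller, consistent with its $O(1/n)$ versus $O(n^{-1/2})$ strong rate.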

$\rhd$ Exercises. 1. A.s. convergence of the Milstein scheme. Derive from these $L^p$-rates an a.s. rate of convergence for the Milstein scheme.
2. Euler scheme of the Ornstein–Uhlenbeck process. We consider on the one hand the sequence of random variables recursively defined by
$$Y_{k+1} = Y_k\,(1 + \mu\Delta) + \sigma\sqrt\Delta\; Z_{k+1}, \quad k\ge0, \quad Y_0 = 0,$$
where $\mu>0$, $\Delta>0$ are positive real numbers, and the Ornstein–Uhlenbeck process solution to the SDE
$$dX_t = \mu X_t\, dt + \sigma\, dW_t, \quad X_0 = 0.$$
Set $t_k = k\Delta$, $k\ge0$, and $Z_k = \frac{W_{t_k} - W_{t_{k-1}}}{\sqrt\Delta}$, $k\ge1$.
(a) Show that, for every $k>0$,
$$\mathbb E\,|Y_k|^2 = \frac{\sigma^2}{\mu}\;\frac{(1+\mu\Delta)^{2k} - 1}{2 + \mu\Delta}.$$
(b) Show that, for every $k\ge0$, $X_{t_{k+1}} = e^{\mu\Delta}X_{t_k} + \sigma\, e^{\mu t_{k+1}}\int_{t_k}^{t_{k+1}} e^{-\mu s}\, dW_s$.
(c) Show that, for every $k\ge0$ and every $\eta>0$,
$$\mathbb E\,|Y_{k+1} - X_{t_{k+1}}|^2 \le (1+\eta)\, e^{2\mu\Delta}\,\mathbb E\,|Y_k - X_{t_k}|^2 + (1 + 1/\eta)\,\mathbb E\,|Y_k|^2\,\big(e^{\mu\Delta} - 1 - \mu\Delta\big)^2 + \sigma^2\int_0^\Delta\big(e^{\mu u} - 1\big)^2\, du.$$
In what follows we assume that $\Delta = \Delta_n = \frac Tn$, where $T$ is a positive real number and $n$ is a non-zero integer which may vary. However, we keep on using the notation $Y_k$ rather than $Y^{(n)}_k$.
(d) Show that, for every $k\in\{0,\dots,n\}$, $\mathbb E\,|Y_k|^2 \le \dfrac{\sigma^2 e^{2\mu T}}{2\mu}$.
(e) Deduce the existence of a real constant $C = C_{\mu,\sigma,T} > 0$ such that, for every $k\in\{0,\dots,n-1\}$,
$$\mathbb E\,|Y_{k+1} - X_{t_{k+1}}|^2 \le (1 + \Delta_n)\, e^{2\mu\Delta_n}\,\mathbb E\,|Y_k - X_{t_k}|^2 + C\,\Delta_n^3.$$
Conclude that $\mathbb E\,|Y_{k+1} - X_{t_{k+1}}|^2 \le C'\big(\frac Tn\big)^2$ for some real constant $C' = C'_{\mu,\sigma,T} > 0$.
3. Revisit the above exercise when $\mu\in\big(-\frac2\Delta, 0\big)$. What can be said about the dependence of the real constants on $T$ when $\mu\in\big(-\frac2\Delta, -\frac1\Delta\big)$?
4. Milstein scheme may preserve positivity. We consider, with the usual notations, the scalar SDE
$$dX_t = b(t, X_t)\, dt + \sigma(t, X_t)\, dW_t, \quad X_0 = x > 0, \quad t\in[0,T],$$
with drift $b:[0,T]\times\mathbb R\to\mathbb R$ and diffusion coefficient $\sigma:[0,T]\times\mathbb R\to\mathbb R$, both assumed continuously differentiable in $x$ on $[0,T]\times(0,+\infty)$. We do not prejudge the existence of a strong solution.
(a) Show that, if, for every $t\in[0,T]$ and every $\xi\in(0,+\infty)$,
$$\text{(i)}\ \ \sigma(t,\xi) > 0,\ \sigma'_x(t,\xi) > 0, \qquad \text{(ii)}\ \ \frac{\sigma}{2\sigma'_x}(t,\xi) \le \xi, \qquad \text{(iii)}\ \ \frac{\sigma\sigma'_x(t,\xi)}{2} \le b(t,\xi), \tag{7.40}$$
then the discrete time Milstein scheme with step $\frac Tn$ starting at $x>0$ defined by (7.36) satisfies
$$\forall\, k\in\{0,\dots,n\},\quad \bar X^{mil}_{t^n_k} > 0 \quad\text{a.s.}$$
[Hint: decompose the right-hand side of the Milstein scheme (7.36) as the sum of three terms, one being a square.]
(b) Show that, under Assumption (7.40), the genuine Milstein scheme starting at $X_0 > 0$ also satisfies $\bar X^{mil}_t > 0$ for every $t\in[0,T]$.
(c) Show that the Milstein scheme of a Black–Scholes model is positive if and only if $\sigma^2 \le 2r$.
5. Milstein scheme of a positive CIR process. We consider a CIR process solution to the SDE
$$dX_t = \kappa(a - X_t)\, dt + \vartheta\sqrt{X_t}\, dW_t, \quad X_0 = x > 0, \quad t\ge0,$$
where $\kappa, a, \vartheta > 0$ and $\frac{\vartheta^2}{2\kappa a} \le 1$. Such an SDE has a unique strong solution $(X^x_t)_{t\ge0}$ living in $(0,+\infty)$ (see [183], Proposition 6.2.4, p. 130). We set $Y_t = e^{\kappa t} X^x_t$, $t\ge0$.
(a) Show that the Milstein scheme of the process $(Y_t)_{t\ge0}$ is a.s. positive.
(b) Deduce a way to devise a positive simulable discretization scheme for the CIR process (the convergence properties of this scheme are not requested).
6. Extension. We return to the setting of the above Exercise 4. We want to relax the assumption on the drift $b$. We still assume that conditions (i)–(ii) in (7.40) on the function $\sigma$ are satisfied and we formally set $Y_t = e^{\rho t} X^x_t$, $t\in[0,T]$, for some $\rho > 0$. Show that $(X^x_t)_{t\in[0,T]}$ is a solution to the above SDE if and only if $(Y_t)_{t\in[0,T]}$ is a solution to a stochastic differential equation
$$dY_t = \tilde b(t, Y_t)\, dt + \tilde\sigma(t, Y_t)\, dW_t, \quad Y_0 = x > 0,$$
where $\tilde b, \tilde\sigma:[0,T]\times\mathbb R\to\mathbb R$ are functions depending on $b$, $\sigma$ and $\rho$ to be determined.
(a) Show that $\tilde\sigma$ still satisfies (i)–(ii) in (7.40).
(b) Deduce that if $\rho_0 = \sup_{t\in[0,T],\,\xi>0}\frac1\xi\Big(\frac{\sigma\sigma'_x(t,\xi)}{2} - b(t,\xi)\Big) < +\infty$, then for every $\rho\ge\rho_0$ the Milstein scheme of $Y$ is a.s. positive.

These exercises highlight a major asset of the one-dimensional Milstein scheme, keeping in mind financial applications to the pricing and hedging of derivatives: it may be used to preserve positivity, with obvious applications to the CIR model in fixed income (interest rates) or the Heston model for Equity derivatives.
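The positivity criterion of Exercise 4(c) can be made concrete: the one-step multiplier of the Black–Scholes Milstein scheme (7.39) equals $(1 + \sigma\sqrt h\, z)^2/2 + 1/2 + (r - \sigma^2/2)h$, so it stays positive for every $z$ as soon as $\sigma^2 \le 2r$, while for $\sigma^2 > 2r$ and a coarse enough step it takes negative values. The numbers below are our own illustration of both regimes.

```python
import numpy as np

def bs_milstein_multiplier(r, sigma, h, z):
    """One-step multiplier of the Black-Scholes Milstein scheme (7.39):
    1 + (r - sigma^2/2) h + sigma sqrt(h) z + (sigma^2/2) h z^2."""
    u = sigma * np.sqrt(h) * z
    return 1.0 + (r - 0.5 * sigma**2) * h + u + 0.5 * u**2

z = np.linspace(-10.0, 10.0, 2001)
m_ok = bs_milstein_multiplier(r=0.05, sigma=0.3, h=0.01, z=z)  # sigma^2 <= 2 r
m_bad = bs_milstein_multiplier(r=0.0, sigma=1.5, h=1.0, z=z)   # sigma^2 > 2 r, coarse h
```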

7.5.2 Higher-Dimensional Milstein Scheme

Let us examine what the above expansion in small time becomes if the SDE (7.31) is modified to be driven by a 2-dimensional standard Brownian motion $W = (W^1, W^2)$ (still with $d = 1$). It reads
$$dX^x_t = b(X^x_t)\, dt + \sigma_1(X^x_t)\, dW^1_t + \sigma_2(X^x_t)\, dW^2_t, \quad X^x_0 = x.$$
The same reasoning as that carried out above shows that the first order term $\sigma\sigma'(x)\int_0^t W_s\, dW_s$ in (7.33) becomes
$$\sum_{i,j=1,2}\sigma'_i(x)\,\sigma_j(x)\int_0^t W^i_s\, dW^j_s.$$

In particular, when $i\neq j$, this term involves the two Lévy areas $\int_0^t W^1_s\, dW^2_s$ and $\int_0^t W^2_s\, dW^1_s$, linearly combined with, a priori, different coefficients.
If we return to the general setting of a d-dimensional diffusion driven by a q-dimensional standard Brownian motion, with (differentiable) drift b : R^d → R^d and diffusion coefficient σ = [σ_{ij}] : R^d → M(d, q, R), elementary though tedious computations lead us to define the (discrete time) Milstein scheme with step T/n as follows:

X̄^mil_0 = X_0,

X̄^mil_{t^n_{k+1}} = X̄^mil_{t^n_k} + b(X̄^mil_{t^n_k}) T/n + σ(X̄^mil_{t^n_k}) ΔW_{t^n_{k+1}}
                  + Σ_{1≤i,j≤q} ∂σ.i σ.j (X̄^mil_{t^n_k}) ∫_{t^n_k}^{t^n_{k+1}} (W^i_s − W^i_{t^n_k}) dW^j_s,   (7.41)

k = 0, …, n − 1,

where ΔW_{t^n_{k+1}} := W_{t^n_{k+1}} − W_{t^n_k} = √(T/n) Z^n_{k+1}, σ.i(x) denotes the i-th column of the matrix σ and, for every i, j ∈ {1, …, q},

∀ x = (x¹, …, x^d) ∈ R^d,   ∂σ.i σ.j (x) := Σ_{ℓ=1}^d (∂σ.i/∂x^ℓ)(x) σ_{ℓj}(x) ∈ R^d.   (7.42)

Remark. A more synthetic way to memorize this quantity is to note that it is the
Jacobian matrix of the vector σ.i (x) applied to the vector σ. j (x).

The ability to simulate such a scheme entirely relies on the exact simulation of

( W¹_{t^n_k} − W¹_{t^n_{k−1}}, …, W^q_{t^n_k} − W^q_{t^n_{k−1}}, ∫_{t^n_{k−1}}^{t^n_k} (W^i_s − W^i_{t^n_{k−1}}) dW^j_s, i, j = 1, …, q, i ≠ j ),   k = 1, …, n,

i.e. of identical copies of the q²-dimensional random vector


7.5 The Milstein Scheme (Looking for Better Strong Rates…) 309
( W¹_t, …, W^q_t, ∫_0^t W^i_s dW^j_s, 1 ≤ i, j ≤ q, i ≠ j )

(at t = T/n). To the best of our knowledge, no convincing method (i.e. with a reasonable computational cost) to achieve this has been proposed so far in the literature (see, however, [102]).
The discrete time Milstein scheme can be successfully simulated when the tensor terms ∂σ.i σ.j commute since, in that case, the Lévy areas disappear, as shown in the following proposition.
Proposition 7.5 (Commuting case) (a) If the tensor terms ∂σ.i σ. j commute, i.e. if

∀ i, j ∈ {1, . . . , q}, ∂σ.i σ. j = ∂σ. j σ.i ,

then the discrete time Milstein scheme reduces to

X̄^mil_{t^n_{k+1}} = X̄^mil_{t^n_k} + ( b(X̄^mil_{t^n_k}) − (1/2) Σ_{i=1}^q ∂σ.i σ.i (X̄^mil_{t^n_k}) ) T/n + σ(X̄^mil_{t^n_k}) ΔW_{t^n_{k+1}}
                  + (1/2) Σ_{1≤i,j≤q} ∂σ.i σ.j (X̄^mil_{t^n_k}) ΔW^i_{t^n_{k+1}} ΔW^j_{t^n_{k+1}},   X̄^mil_0 = X_0.   (7.43)

(b) When q = 1, the commutation property is trivially satisfied.


Proof. Let i ≠ j in {1, …, q}. As ∂σ.i σ.j = ∂σ.j σ.i, both Lévy areas involving W^i and W^j only appear through their sum in the scheme. Now, an integration by parts shows that

∫_{t^n_k}^{t^n_{k+1}} (W^i_s − W^i_{t^n_k}) dW^j_s + ∫_{t^n_k}^{t^n_{k+1}} (W^j_s − W^j_{t^n_k}) dW^i_s = ΔW^i_{t^n_{k+1}} ΔW^j_{t^n_{k+1}}

since (W^i_t)_{t∈[0,T]} and (W^j_t)_{t∈[0,T]} are independent if i ≠ j. The result follows noting that, for every i ∈ {1, …, q},

∫_{t^n_k}^{t^n_{k+1}} (W^i_s − W^i_{t^n_k}) dW^i_s = (1/2)( (ΔW^i_{t^n_{k+1}})² − T/n ).  ♦

 Example (Multi-dimensional Black–Scholes model). We consider the standard multi-dimensional Black–Scholes model

dX^i_t = X^i_t ( r dt + Σ_{j=1}^q σ_{ij} dW^j_t ),   X^i_0 = x^i_0 > 0,   i = 1, …, d,

where W = (W¹, …, W^q) is a q-dimensional Brownian motion. Then, elementary computations yield the following expression for the tensor terms ∂σ.i σ.j:

∀ i, j ∈ {1, …, q},   ∂σ.i σ.j (x) = ( σ_{ℓi} σ_{ℓj} x^ℓ )_{ℓ=1,…,d},

which obviously commute.
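For instance, scheme (7.43) can be implemented directly in this setting. Below is a minimal Python sketch (our own illustration; the volatility matrix Sig = [σ_{ij}], the rate r and the horizon are arbitrary choices) for d = q = 2, which simulates one path of the commuting Milstein scheme and compares its terminal value with the closed-form Black–Scholes solution built from the same Brownian increments.

```python
import numpy as np

rng = np.random.default_rng(0)
r, T, n = 0.05, 1.0, 512
h = T / n
Sig = np.array([[0.2, 0.1], [0.05, 0.3]])    # volatility matrix [sigma_ij], d = q = 2
x0 = np.array([100.0, 100.0])

dW = rng.normal(0.0, np.sqrt(h), size=(n, 2))

# Commuting Milstein scheme (7.43): here b(x) = r*x, sigma(x)_{l j} = x_l * Sig[l, j]
# and the tensor terms are (d_sigma.i sigma.j)(x)_l = Sig[l, i] * Sig[l, j] * x_l.
x = x0.copy()
for k in range(n):
    s = Sig @ dW[k]                            # component l: sum_j Sig[l, j] dW^j
    corr = 0.5 * (Sig ** 2).sum(axis=1) * x    # (1/2) sum_i d_sigma.i sigma.i (x)
    x = x + (r * x - corr) * h + x * s + 0.5 * x * s ** 2

# Exact Black-Scholes solution along the same Brownian path, for comparison
WT = dW.sum(axis=0)
x_exact = x0 * np.exp((r - 0.5 * (Sig ** 2).sum(axis=1)) * T + Sig @ WT)
print(x, x_exact)
```

With n = 512 steps the two terminal values should agree closely on a given path, in line with the O(1/n) strong rate discussed next.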

The rate of convergence of the Milstein scheme is formally the same in higher
dimension as it is in one dimension: Theorem 7.5 remains true with a d-dimensional
diffusion driven by a q-dimensional Brownian motion provided b : Rd → Rd and
σ : Rd → M(d, q, R) are C 2 with bounded existing partial derivatives.

Theorem 7.6 (Multi-dimensional discrete time Milstein scheme) (See e.g. [170]) Assume that b and σ are C¹ on R^d with bounded, α_b- and α_σ-Hölder continuous existing partial derivatives, respectively. Let α = min(α_b, α_σ). Then, for every p ∈ (0, +∞), there exists a real constant C_{p,b,σ,T} > 0 such that, for every X_0 ∈ L^p(P) independent of the q-dimensional Brownian motion W, the error bound established in Theorem 7.5(a) remains valid.

However, one should keep in mind that this strong rate result does not prejudge
the ability to simulate this scheme. In a way, the most important consequence of this
theorem concerns the Euler scheme.

Corollary 7.2 (Discrete time Euler scheme with constant diffusion coefficient) If the drift b ∈ C²(R^d, R^d) with bounded existing partial derivatives and if σ(x) = Σ is constant, then the discrete time Euler and Milstein schemes coincide. As a consequence, the strong rate of convergence of the discrete time Euler scheme is, in that very specific case, O(1/n). Namely, for every p ∈ (0, +∞), there exists a real constant C_{p,b,σ,T} > 0 such that

‖ max_{0≤k≤n} |X_{t^n_k} − X̄^n_{t^n_k}| ‖_p ≤ C_{p,b,σ,T} (T/n) ( 1 + ‖X_0‖_p ).

7.6 Weak Error for the Discrete Time Euler Scheme (I)

In many situations, like the pricing of “vanilla” European options, a discretization scheme X̄^n = (X̄^n_t)_{t∈[0,T]} of a d-dimensional diffusion process X = (X_t)_{t∈[0,T]} is introduced in order to compute by a Monte Carlo simulation an approximation E f(X̄_T) of E f(X_T), i.e. only at a fixed (terminal) time. If one relies on the former strong rates of convergence, we get, as soon as f : R^d → R is Lipschitz continuous and b, σ satisfy (H^β_T) for β ≥ 1/2,

| E f(X_T) − E f(X̄^n_T) | ≤ [f]_Lip E| X_T − X̄^n_T | ≤ [f]_Lip E sup_{t∈[0,T]} | X_t − X̄^n_t | = O(1/√n).

In fact, the first inequality in this chain turns out to be highly non-optimal since it switches from a weak error (the difference only depending on the respective (marginal) distributions of X_T and X̄^n_T) to a pathwise approximation X_T − X̄^n_T (⁴). To improve asymptotically the other two inequalities is hopeless since it has been shown (see the remark and comments in Sect. 7.8.6 for a brief introduction) that, under appropriate assumptions, X_T − X̄^n_T satisfies a central limit theorem at rate n^{−1/2} with non-zero asymptotic variance. In fact, one even has a functional form of this central limit theorem for the whole process (X_t − X̄^n_t)_{t∈[0,T]} (see [155, 178]), still at rate n^{−1/2}. As a consequence, a rate faster than n^{−1/2} in an L¹ sense would not be consistent with this central limit result.


Furthermore, numerical experiments confirm that the weak rate of convergence between the above two expectations is usually much faster than n^{−1/2}. This fact has long been known and has been extensively investigated in the literature, starting with the two seminal papers [270] by Talay–Tubaro and [24] by Bally–Talay, leading to an expansion of the time discretization error at an arbitrary accuracy when b and σ are smooth enough as functions. Two main settings have been investigated: when the function f is itself smooth and when the diffusion is “regularizing”, i.e. propagates the regularizing effect of the driving Brownian motion (⁵) thanks to a non-degeneracy assumption on the diffusion coefficient σ, typically uniform ellipticity for σ (see below) or a weaker assumption such as the parabolic Hörmander (hypo-ellipticity) assumption (see [139] (⁶)).
The same kind of question has been investigated for specific classes of path-
dependent functionals F of the diffusion X with some applications to exotic option
pricing (see Chap. 8). These results, though partial and specific, often emphasize that
the resulting weak error rate is the same as the strong rate derived from the Milstein
scheme for these types of functionals, especially when they are Lipschitz continuous
with respect to the sup-norm.
As a second step, we will show how the Richardson–Romberg extrapolation method provides a first procedure to take advantage of such weak rates, before fully exploiting higher-order weak error expansions in Chap. 9 with the multilevel paradigm.

⁴ A priori, X̄^n_T and X_T could be defined on different probability spaces: recall the approximation of the Black–Scholes model by binomial models (see [59]).
⁵ The regularizing property of the Brownian motion should be understood in its simplest form as follows: if f is a bounded Borel function on R^d, then f_σ(x) := E( f(x + σW₁) ) is a C^∞ function for every σ > 0 and converges towards f as σ → 0 in every L^p space, p > 0. This result is just a classical convolution result with a Gaussian kernel, rewritten in a probabilistic form.
⁶ Giving this condition in detail is beyond the scope of this monograph. However, let us mention that if the column vectors (σ.j(x))_{1≤j≤q} span R^d for every x ∈ R^d, this condition is satisfied. If not, the same spanning property is requested after adding enough iterated Lie brackets of the coefficients of the SDE (rewritten in the Stratonovich sense), including the drift this time.

7.6.1 Main Results for E f(X_T): the Talay–Tubaro and Bally–Talay Theorems

We adopt the notations of the former Sect. 7.1, except that we still consider, for convenience, an autonomous version of the SDE, with initial condition x ∈ R^d,

dX^x_t = b(X^x_t) dt + σ(X^x_t) dW_t,   X^x_0 = x.

The notations (X^x_t)_{t∈[0,T]} and (X̄^{n,x}_t)_{t∈[0,T]} respectively denote the diffusion and its Euler scheme with step T/n, starting at x at time 0 (the superscript n will often be dropped).
The first result is the simplest one on the weak error, obtained under less stringent assumptions on b and σ.

Theorem 7.7 (see [270]) Assume b and σ are four times continuously differentiable
on Rd with bounded existing partial derivatives (this implies that b and σ are Lipschitz
continuous). Assume f : Rd → R is four times differentiable with polynomial growth
as well as its existing partial derivatives. Then, for every x ∈ R^d,

E f(X^x_T) − E f(X̄^{n,x}_T) = O(1/n)   as n → +∞.   (7.44)

Proof (partial). Assume d = 1 for notational convenience. We also assume that


b ≡ 0, σ is bounded and f has bounded first four derivatives, for simplicity. The
diffusion (X tx )t≥0,x∈R is a homogeneous Markov process with transition semi-group
(Pt )t≥0 (see e.g. [162, 251] among other references) reading on Borel test functions
g (i.e. bounded or non-negative)

Pt g(x) := E g(X tx ), t ≥ 0, x ∈ R.

On the other hand, the Euler scheme with step T/n starting at x ∈ R, denoted by (X̄^x_{t^n_k})_{0≤k≤n}, is a discrete time homogeneous Markov chain with transition reading on Borel test functions g

P̄ g(x) = E g( x + σ(x)√(T/n) Z ),   Z ∼ N(0; 1).

To be more precise, this means for the diffusion process that, for any Borel test function g,

∀ s, t ≥ 0,   P_t g(x) = E[ g(X_{s+t}) | X_s = x ] = E g(X^x_t)

and for its Euler scheme (still with t^n_k = kT/n)

P̄ g(x) = E[ g(X̄^x_{t^n_{k+1}}) | X̄^x_{t^n_k} = x ] = E g(X̄^x_{T/n}),   k = 0, …, n − 1.

Now, let us consider the four times differentiable function f. One gets, by the semi-group property satisfied by both P_t and P̄,

E f(X^x_T) = P_T f(x) = P^n_{T/n} f(x)   and   E f(X̄^x_T) = P̄^n f(x).

Then, by writing the difference E f(X^x_T) − E f(X̄^x_T) in a telescopic way, switching from P^n_{T/n} f to P̄^n f, we obtain

E f(X^x_T) − E f(X̄^x_T) = Σ_{k=1}^n [ P^k_{T/n}(P̄^{n−k} f)(x) − P^{k−1}_{T/n}(P̄^{n−(k−1)} f)(x) ]
                        = Σ_{k=1}^n P^{k−1}_{T/n}( (P_{T/n} − P̄)(P̄^{n−k} f) )(x).   (7.45)

This “domino” sum suggests two tasks to be accomplished:
– the first is to estimate precisely the asymptotic behavior of

P_{T/n} f(x) − P̄ f(x)

with respect to the step T/n and the first (four) derivatives of the function f;
– the second is to control the (first four) derivatives of the functions P̄^ℓ f with respect to the sup norm, uniformly in ℓ ∈ {1, …, n} and n ≥ 1, in order to propagate the above local error bound.
Let us deal with the first task. First, Itô’s formula (see Sect. 12.8) yields

P_t f(x) := E f(X^x_t) = f(x) + E ∫_0^t (f′σ)(X^x_s) dW_s + (1/2) E ∫_0^t f″(X^x_s)σ²(X^x_s) ds,

where the first expectation vanishes: we use that f′σ is bounded to ensure that the stochastic integral is a true martingale.
A Taylor expansion of f( x + σ(x)√(T/n) Z ) at x yields, for the transition of the Euler scheme (after taking expectation),

P̄ f(x) = E f(X̄^x_{T/n})
       = f(x) + f′(x)σ(x)√(T/n) E Z + (1/2)(f″σ²)(x)(T/n) E Z² + (f^(3)σ³)(x)/3! (T/n)^{3/2} E Z³ + E[ f^(4)(ξ) ( σ(x)√(T/n) Z )⁴ / 4! ]

(for some ξ between x and X̄^x_{T/n})

       = f(x) + (T/(2n)) (f″σ²)(x) + E[ f^(4)(ξ) σ⁴(x)(T/n)² Z⁴ / 4! ]
       = f(x) + (T/(2n)) (f″σ²)(x) + (σ⁴(x)T²)/(4! n²) c_n(f),

where |c_n(f)| ≤ 3‖f^(4)‖_sup. This follows from the well-known facts that E Z = E Z³ = 0, E Z² = 1 and E Z⁴ = 3. Consequently,

P_{T/n} f(x) − P̄ f(x) = (1/2) ∫_0^{T/n} E[ (f″σ²)(X^x_s) − (f″σ²)(x) ] ds − (σ⁴(x)T²)/(4! n²) c_n(f).   (7.46)

Applying again Itô’s formula to the C² function γ := f″σ² yields

E[ (f″σ²)(X^x_s) − (f″σ²)(x) ] = (1/2) E ∫_0^s (γ″σ²)(X^x_u) du

so that

∀ s ≥ 0,   sup_{x∈R} | E[ (f″σ²)(X^x_s) − (f″σ²)(x) ] | ≤ (s/2) ‖γ″σ²‖_sup.

Elementary computations show that

‖γ″‖_sup ≤ C_σ max_{k=2,3,4} ‖f^(k)‖_sup,

where C_σ depends on ‖σ^(k)‖_sup, k = 0, 1, 2, but not on f (with the standard convention σ^(0) = σ).
Consequently, we derive from (7.46) that

| P_{T/n}(f)(x) − P̄(f)(x) | ≤ C_{σ,T} max_{k=2,3,4} ‖f^(k)‖_sup (T/n)².

The fact that the first derivative f′ is not involved in these bounds is an artificial consequence of our assumption that b ≡ 0.
Now we switch to the second task. In order to plug this estimate into (7.45), we need to control the first four derivatives of P̄^ℓ f, ℓ = 1, …, n, uniformly with respect to ℓ and n. In fact, we do not directly need to control the first derivative since b ≡ 0 but we will do it as a first example, illustrating the method in a simpler case.
Let us consider again the generic function f and its four bounded derivatives. One has

(P̄ f)′(x) = E[ f′( x + σ(x)√(T/n) Z )( 1 + σ′(x)√(T/n) Z ) ]

so that

|(P̄ f)′(x)| ≤ ‖f′‖_sup E| 1 + σ′(x)√(T/n) Z | ≤ ‖f′‖_sup ‖ 1 + σ′(x)√(T/n) Z ‖₂
           = ‖f′‖_sup √( E[ 1 + 2σ′(x)√(T/n) Z + σ′(x)²(T/n)Z² ] )
           = ‖f′‖_sup √( 1 + σ′(x)² T/n ) ≤ ‖f′‖_sup ( 1 + σ′(x)² T/(2n) )

since √(1 + u) ≤ 1 + u/2, u ≥ 0. Hence, we derive by induction that, for every n ≥ 1 and every ℓ ∈ {1, …, n},

∀ x ∈ R,   |(P̄^ℓ f)′(x)| ≤ ‖f′‖_sup ( 1 + σ′(x)² T/(2n) )^ℓ ≤ ‖f′‖_sup e^{σ′(x)² T/2},

where we used that (1 + u)^ℓ ≤ e^{ℓu}, u ≥ 0. This yields

‖(P̄^ℓ f)′‖_sup ≤ ‖f′‖_sup e^{‖σ′‖²_sup T/2}.

Let us deal now with the second derivative:

(P̄ f)″(x) = d/dx E[ f′( x + σ(x)√(T/n) Z )( 1 + σ′(x)√(T/n) Z ) ]
          = E[ f″( x + σ(x)√(T/n) Z )( 1 + σ′(x)√(T/n) Z )² ] + E[ f′( x + σ(x)√(T/n) Z ) σ″(x)√(T/n) Z ].

Now,

| E[ f″( x + σ(x)√(T/n) Z )( 1 + σ′(x)√(T/n) Z )² ] | ≤ ‖f″‖_sup ( 1 + σ′(x)² T/n )

and, using that f′( x + σ(x)√(T/n) Z ) = f′(x) + f″(ζ)σ(x)√(T/n) Z for some ζ, owing to the fundamental theorem of Calculus, we get

| E[ f′( x + σ(x)√(T/n) Z ) σ″(x)√(T/n) Z ] | ≤ ‖f″‖_sup ‖σσ″‖_sup (T/n) E(Z²)

since E Z = 0. Hence

∀ x ∈ R,   |(P̄ f)″(x)| ≤ ‖f″‖_sup ( 1 + ( ‖σσ″‖_sup + ‖(σ′)²‖_sup ) T/n ),

which implies the boundedness of |(P̄^ℓ f)″(x)|, ℓ = 0, …, n − 1, n ≥ 1.


The same reasoning yields the boundedness of all the derivatives (P̄^ℓ f)^(i), i = 1, 2, 3, 4, ℓ = 1, …, n, n ≥ 1.
Now we can combine our local error bound with the control of the derivatives. Plugging these estimates into each term of (7.45) finally yields

| E f(X^x_T) − E f(X̄^x_T) | ≤ C_{σ,T} max_{1≤ℓ≤n, i=1,…,4} ‖(P̄^ℓ f)^(i)‖_sup Σ_{k=1}^n (T/n)² ≤ C_{σ,T,f} T²/n,

which completes the proof. ♦

 Exercises. 1. Complete the above proof by inspecting the case of higher-order derivatives (k = 3, 4).
2. Extend the proof to a (bounded) non-zero drift b.
If one assumes more regularity on the coefficients or some uniform ellipticity on
the diffusion coefficient σ it is possible to obtain an expansion of the error at any
order.

Theorem 7.8 (Weak error expansions) (a) Smooth function f (Talay–Tubaro’s Theorem, see [270]). Assume b and σ are infinitely differentiable with bounded partial derivatives. Assume f : R^d → R is infinitely differentiable with partial derivatives having polynomial growth. Then, for every order R ∈ N*, the expansion

(E_{R+1}) ≡ E f(X̄^{n,x}_T) − E f(X^x_T) = Σ_{k=1}^R c_k / n^k + O(n^{−(R+1)})   (7.47)

holds as n → +∞, where the real-valued coefficients c_k = c_k(f, T, b, σ) depend on f, T, b and σ.
(b) Uniformly elliptic diffusion (Bally–Talay’s Theorem, see [24]). If b and σ are bounded, infinitely differentiable with bounded partial derivatives and if σ is uniformly elliptic, i.e.

∀ x ∈ R^d,   σσ*(x) ≥ ε₀ I_d   for some ε₀ > 0,

then the conclusion of (a) holds true for any bounded Borel function f.

Method of proof for (a). The idea is to rely on the PDE method, i.e. considering the solution of the parabolic partial differential equation

( ∂/∂t + L )(u)(t, x) = 0,   u(T, ·) = f,

where L, defined by

(Lg)(x) = g′(x)b(x) + (1/2) g″(x)σ²(x),

denotes the infinitesimal generator of the diffusion. It follows from the Feynman–Kac formula that (under some appropriate regularity assumptions)

u(0, x) = E f(X^x_T).

Formally (in one dimension), the Feynman–Kac formula can be established as follows (see Theorem 7.11 for a more rigorous proof). Assuming that u is regular enough, i.e. C^{1,2}([0, T] × R), to apply Itô’s formula (see Sect. 12.8), then

f(X^x_T) = u(T, X^x_T)
         = u(0, x) + ∫_0^T ( ∂/∂t + L )(u)(t, X^x_t) dt + ∫_0^T ∂_x u(t, X^x_t) σ(X^x_t) dW_t
         = u(0, x) + ∫_0^T ∂_x u(t, X^x_t) σ(X^x_t) dW_t

since u satisfies the above parabolic PDE. Assuming that ∂_x u has polynomial growth, so that the stochastic integral is a true martingale, we can take expectations. Then, we introduce domino differences based on the Euler scheme as follows:

E f(X̄^{n,x}_T) − E f(X^x_T) = E[ u(T, X̄^{n,x}_T) ] − u(0, x) = Σ_{k=1}^n E[ u(t^n_k, X̄^{n,x}_{t^n_k}) − u(t^n_{k−1}, X̄^{n,x}_{t^n_{k−1}}) ].

The core of the proof consists in applying Itô’s formula (to u, b and σ) to show that

E[ u(t^n_k, X̄^{n,x}_{t^n_k}) − u(t^n_{k−1}, X̄^{n,x}_{t^n_{k−1}}) ] = (1/n²) E φ(t^n_k, X^x_{t^n_k}) + o(1/n²)

for some continuous function φ. Then, one derives (after new computations) that

E f(X^x_T) − E f(X̄^{n,x}_T) = (1/n) ∫_0^T E φ(t, X^x_t) dt + o(1/n).

This approach will be developed in full detail in Sect. 7.8.9, where the theorem is rigorously proved.
Remarks. • The weak error expansion, alone or combined with strong error rates (in quadratic mean), is a major tool to fight the bias induced by discretization schemes. This aspect is briefly illustrated below, where we first introduce the standard Richardson–Romberg extrapolation for diffusions. Wide classes of multilevel estimators, especially designed to efficiently “kill” the bias while controlling the variance, are introduced and analyzed in Chap. 9.
• A parametrix approach is presented in [172] which naturally leads to the higher-
order expansion stated in Claim (b) of Theorem 7.8. The expansion is derived, in
a uniformly elliptic framework, from an approximation result of the density of the
diffusion by that of the Euler scheme.
• For extensions to less regular f —namely tempered distributions—in the uniformly
elliptic case, we refer to [138].
• The last important piece of information about the weak error from the practitioner’s viewpoint is that the weak error induced by the Milstein scheme has exactly the same order as that of the Euler scheme, i.e. O(1/n). So the Milstein scheme seems at first glance to be of little interest as long as one wishes to compute E f(X_T). However, we will see in Chap. 9 that, combined with its fast strong convergence, it leads to unbiased-like multilevel estimators.
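The first-order term in the expansion (7.47) can be observed in closed form in a toy case. For the Black–Scholes SDE dX_t = rX_t dt + σX_t dW_t and f(x) = x, no Monte Carlo error interferes: the Euler scheme satisfies E X̄^n_T = x₀(1 + rT/n)^n while E X_T = x₀e^{rT}, so the bias is exactly x₀( (1 + rT/n)^n − e^{rT} ) = c₁/n + O(n^{−2}) with c₁ = −x₀e^{rT}(rT)²/2. A short Python check (our own illustration; the parameter values are arbitrary):

```python
import numpy as np

x0, r, T = 100.0, 0.15, 1.0
exact = x0 * np.exp(r * T)                     # E X_T = x0 * e^{rT}
c1 = -x0 * np.exp(r * T) * (r * T) ** 2 / 2    # predicted first-order coefficient

biases = {}
for n in (10, 20, 40, 80):
    # For f(x) = x the Euler expectation is closed-form: E X_bar^n_T = x0 (1 + rT/n)^n
    biases[n] = x0 * (1.0 + r * T / n) ** n - exact
    print(n, biases[n], n * biases[n])         # n * bias stabilizes near c1
```

Doubling n halves the bias, exactly the O(1/n) behavior of (7.44) and the starting point of the extrapolation below.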

7.7 Bias Reduction by Richardson–Romberg Extrapolation (First Approach)

This section is a first introduction to bias reduction. Chapter 9 is entirely devoted to this topic and introduces more advanced methods like multilevel methods.

7.7.1 Richardson–Romberg Extrapolation with Consistent Brownian Increments

Bias-variance decomposition of the quadratic error in a Monte Carlo simulation

Let V be a vector space of continuous functions with linear growth satisfying (E₂) (the case of non-continuous functions is investigated in [225]). Let f ∈ V. For notational convenience, in view of what follows, we set W^(1) = W and X^(1) = X (including X^(1)_0 = X_0 ∈ L²(Ω, A, P) throughout this section). A regular Monte Carlo simulation based on M independent copies (X̄^(1)_T)^m, m = 1, …, M, of the Euler scheme X̄^(1)_T with step T/n induces the following global (squared) quadratic error:

‖ E f(X_T) − (1/M) Σ_{m=1}^M f( (X̄^(1)_T)^m ) ‖₂² = ( E f(X_T) − E f(X̄^(1)_T) )² + ‖ E f(X̄^(1)_T) − (1/M) Σ_{m=1}^M f( (X̄^(1)_T)^m ) ‖₂²
                                                = c₁²/n² + Var( f(X̄^(1)_T) )/M + O(n^{−3}).   (7.48)
The above formula is the bias-variance decomposition of the approximation error
of the Monte Carlo estimator. The resulting quadratic error bound (7.48) emphasizes
that this estimator does not take full advantage of the above expansion (E2 ).
Richardson–Romberg extrapolation
To take advantage of the expansion, we will perform a Richardson–Romberg extrapolation. In this framework (originally introduced in the seminal paper [270]), one considers the strong solution X^(2) of a “copy” of Eq. (7.1), driven by a second Brownian motion W^(2) and starting from X^(2)_0 (independent of W^(2), with the same distribution as X^(1)_0), both defined on the same probability space (Ω, A, P) on which W^(1) and X^(1)_0 are defined. One may always consider such a Brownian motion by enlarging the probability space Ω if necessary.
Then we consider the Euler scheme with a twice smaller step T/(2n), denoted by X̄^(2), associated to X^(2), i.e. starting from X^(2)_0 with Brownian increments built from W^(2).
We assume from now on that (E₃) (as defined in (7.47)) holds for f to get more precise estimates, but the principle would work with a function simply satisfying (E₂). Then, combining the two time discretization error expansions related to X̄^(1) and X̄^(2), respectively, we get

E f(X_T) = E[ 2 f(X̄^(2)_T) − f(X̄^(1)_T) ] + c₂/(2n²) + O(n^{−3}).
Then, the new global (squared) quadratic error becomes

‖ E f(X_T) − (1/M) Σ_{m=1}^M ( 2 f( (X̄^(2)_T)^m ) − f( (X̄^(1)_T)^m ) ) ‖₂² = c₂²/(4n⁴) + Var( 2 f(X̄^(2)_T) − f(X̄^(1)_T) )/M + O(n^{−5}).   (7.49)
The structure of this quadratic error suggests the following natural question: is it possible to reduce the (asymptotic) time discretization error without increasing the Monte Carlo error (at least asymptotically in n…)? Or, put differently: to what extent is it possible to control the variance term Var( 2 f(X̄^(2)_T) − f(X̄^(1)_T) )?
– Lazy simulation. If one adopts a somewhat “lazy” approach by using the pseudo-random number generator purely sequentially to simulate the two Euler schemes, this corresponds from a theoretical point of view to considering independent Gaussian white noises (Z^(1)_k)_k and (Z^(2)_k)_k to simulate the Brownian increments in both schemes and independent starting values or, equivalently, to assuming that W^(1) and W^(2) are two independent Brownian motions and that X^(1)_0 and X^(2)_0 are i.i.d. (square integrable) random variables. Then

Var( 2 f(X̄^(2)_T) − f(X̄^(1)_T) ) = 4 Var( f(X̄^(2)_T) ) + Var( f(X̄^(1)_T) ) ≈ 5 Var( f(X̄_T) ) −→ 5 Var( f(X_T) )   as n → +∞.

In this approach the gain of one order on the bias (switching from c₁n^{−1} to c₂n^{−2}) induces an increase of the variance by a factor 5 and of the complexity by (approximately) 3.
– Consistent simulation (of the Brownian increments). If W^(i) = W and X^(i)_0 = X_0 ∈ L²(P), i = 1, 2, then

Var( 2 f(X̄^(2)_T) − f(X̄^(1)_T) ) −→ Var( 2 f(X_T) − f(X_T) ) = Var( f(X_T) )   as n → +∞

since the Euler schemes X̄^(i), i = 1, 2, both converge in L²(P) to X. This time, the same gain in terms of bias has no impact on the variance, at least for a refined enough scheme (n large). Of course, the complexity remains 3 times higher, but the pseudo-random number generator is less solicited, by a factor 2/3.
In fact, it is shown in [225] that this choice W^(i) = W, X^(i)_0 = X_0, i = 1, 2, leading to consistent Brownian increments for the two schemes, is asymptotically optimal among all possible choices of Brownian motions W^(1) and W^(2). This result can be extended to Borel functions f when the diffusion is uniformly elliptic (and b, σ bounded, infinitely differentiable with bounded partial derivatives, see [225]).

ℵ Practitioner’s corner

From a practical viewpoint, one first simulates an Euler scheme with step T/(2n) using a Gaussian white noise (Z^(2)_k)_{k≥1}, then one simulates the Gaussian white noise (Z^(1)_k)_{k≥1} of the Euler scheme with step T/n by setting

Z^(1)_k = ( Z^(2)_{2k} + Z^(2)_{2k−1} ) / √2,   k ≥ 1.
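This recipe can be sketched in a few lines of Python (our own illustration; it reuses the Black–Scholes call example of the numerical illustration below, whose reference price is C₀^BS = 42.9571). The fine scheme uses the white noise (Z^(2)_k), the coarse one the consistent noise Z^(1)_k = (Z^(2)_{2k} + Z^(2)_{2k−1})/√2, and the Richardson–Romberg estimator averages 2f(X̄^(2)_T) − f(X̄^(1)_T).

```python
import numpy as np

rng = np.random.default_rng(1)
x0, r, sig, T, K = 100.0, 0.15, 1.0, 1.0, 100.0
n, M = 10, 400_000
h, h2 = T / n, T / (2 * n)

Z2 = rng.normal(size=(2 * n, M))              # white noise of the fine Euler scheme
Z1 = (Z2[0::2] + Z2[1::2]) / np.sqrt(2.0)     # consistent coarse white noise

Xf = np.full(M, x0)                           # Euler scheme with step T/(2n)
for k in range(2 * n):
    Xf *= 1.0 + r * h2 + sig * np.sqrt(h2) * Z2[k]
Xc = np.full(M, x0)                           # Euler scheme with step T/n
for k in range(n):
    Xc *= 1.0 + r * h + sig * np.sqrt(h) * Z1[k]

disc = np.exp(-r * T)
rr = disc * np.mean(2.0 * np.maximum(Xf - K, 0.0) - np.maximum(Xc - K, 0.0))
crude = disc * np.mean(np.maximum(Xc - K, 0.0))
print(crude, rr)    # rr should land much closer to the reference 42.9571
```

Sharing the underlying noise is what keeps the variance of the combination close to Var f(X_T), as discussed above.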

 Numerical illustration. We wish to illustrate the efficiency of the Richardson–Romberg (RR) extrapolation in a somewhat extreme situation where the time discretization induces an important bias. To this end, we consider the Euler scheme of the Black–Scholes SDE

dX_t = X_t ( r dt + σ dW_t )

with the following values for the parameters

X 0 = 100, r = 0.15, σ = 1.0, T = 1.

Note that such a volatility σ = 100% per year is equivalent to a 4-year maturity with volatility 50% (or 16 years with volatility 25%). A high interest rate is chosen accordingly. We consider the Euler scheme of this SDE with step h = T/n, namely

X̄_{t_{k+1}} = X̄_{t_k} ( 1 + r h + σ√h Z_{k+1} ),   X̄_0 = X_0,

where t_k = kh, k = 0, …, n, and (Z_k)_{1≤k≤n} is a Gaussian white noise. We purposefully choose a coarse discretization step n = 10, so that h = 1/10. One should keep in
. One should keep in
mind that, in spite of its virtues in terms of closed forms, both coefficients of the
Black–Scholes SDE have linear growth so that it is quite a demanding benchmark,
especially when the discretization step is coarse. We want to price a vanilla call option with strike K = 100, i.e. to compute

C₀ = e^{−rT} E (X_T − K)₊

using a crude Monte Carlo simulation and an RR extrapolation with consistent Brownian increments as described in the above practitioner’s corner. The Black–Scholes reference premium is C₀^BS = 42.9571 (see Sect. 12.2). To equalize the complexity of the crude simulation and its RR-extrapolated counterpart, we use M sample paths, M = 2^k, k = 14, …, 26, for the RR-extrapolated simulation (2^14 ≈ 32 000 and 2^26 ≈ 67 000 000) and 3M for the crude Monte Carlo simulation. Figure 7.1 depicts the obtained results. The simulation is large enough so that, at its end, the observed error is approximately representative of the residual bias. The blue line (crude MC) shows the magnitude of the theoretical bias (close to 1.5) for such a coarse step whereas the red line highlights the improvement brought by the Richardson–
[Figure 7.1: estimated premium versus simulation complexity 3M, M = 2^k, k = 14, …, 26, for the reference price, the crude Monte Carlo (Euler based) and the Richardson–Romberg extrapolation (Euler based).]

Fig. 7.1 Call option in a B–S model priced by an Euler scheme. σ = 1.00, r = 0.15, T = 1, K = X₀ = 100. Step h = 1/10 (n = 10). Black line: reference price; red line: (consistent) Richardson–Romberg extrapolation of the Euler scheme of size M; blue line: crude Monte Carlo simulation of size 3M of the Euler scheme (equivalent complexity)

Romberg extrapolation: the residual bias is approximately equal to 0.07, i.e. the bias
is divided by more than 20.

 Exercises. 1. Let X, Y ∈ L²(Ω, A, P).
(a) Show that

cov(X, Y) ≤ σ(X)σ(Y)   and   σ(X + Y) ≤ σ(X) + σ(Y),

where σ(X) = √Var(X) = ‖X − E X‖₂ denotes the standard deviation of X.
(b) Show that |σ(X) − σ(Y)| ≤ σ(X − Y) and, for every λ ∈ R, σ(λX) = |λ|σ(X).
2. Let X and Y ∈ L²(Ω, A, P) have the same distribution.
(a) Show that |σ(X) − σ(Y)| ≤ σ(X − Y). Deduce that, for every α ∈ (−∞, 0] ∪ [1, +∞),

Var( αX + (1 − α)Y ) ≥ Var(X).

(b) Deduce that consistent Brownian increments produce the Richardson–Romberg meta-scheme with the lowest asymptotic variance as n goes to infinity.
3. (ℵ Practitioner’s corner…). (a) In the above numerical illustration, carry on testing the Richardson–Romberg extrapolation based on Euler schemes versus crude Monte Carlo simulation with steps T/n and T/(2n), n = 5, 10, 20, 50, respectively, with

– independent Brownian increments,


– consistent Brownian increments.
(b) Compute an estimator of the variance of the estimators in both settings and
compare the obtained results.

7.8 Further Proofs and Results

Throughout this section, we recall that |·| always denotes the canonical Euclidean norm on R^d and ‖A‖ = √(Tr(AA*)) = √(Σ_{ij} a²_{ij}) denotes the Fröbenius norm of A = [a_{ij}] ∈ M(d, q, R). We will extensively use that, for every u = (u¹, …, u^d) ∈ R^d, |Au| ≤ ‖A‖|u|, which is an easy consequence of the Schwarz Inequality (in particular, |||A||| ≤ ‖A‖). To alleviate notation, we will drop the exponent n in t^n_k = kT/n.

7.8.1 Some Useful Inequalities

In the non-quadratic case, Doob’s Inequality is not sufficient to carry out the proof: we need the more general Burkhölder–Davis–Gundy Inequality. Furthermore, to get some real constants having the announced behavior as a function of T, we will also need to use the generalized Minkowski Inequality (see [143]), established below in a probabilistic framework.
 The generalized Minkowski Inequality: for any (bi-measurable) process X = (X_t)_{t≥0} and for every p ∈ [1, ∞),

∀ T ∈ [0, +∞],   ‖ ∫_0^T X_t dt ‖_p ≤ ∫_0^T ‖X_t‖_p dt.   (7.50)

Proof. First note that, owing to the triangle inequality |∫_0^T X_s ds| ≤ ∫_0^T |X_s| ds, one may assume without loss of generality that X is a non-negative process. If p = 1 the inequality is obvious. Assume now p ∈ (1, +∞). Let T ∈ (0, +∞) and let Y be a non-negative random variable defined on the same probability space as (X_t)_{t∈[0,T]}. Let M > 0. It follows from Fubini’s Theorem and Hölder’s Inequality that

E[ ( ∫_0^T (X_s ∧ M) ds ) Y ] = ∫_0^T E[ (X_s ∧ M)Y ] ds ≤ ∫_0^T ‖X_s ∧ M‖_p ‖Y‖_q ds,  with q = p/(p−1),
                             = ‖Y‖_q ∫_0^T ‖X_s ∧ M‖_p ds.

The above inequality applied with Y := ( ∫_0^T (X_s ∧ M) ds )^{p−1} yields

E( ∫_0^T (X_s ∧ M) ds )^p ≤ ( E( ∫_0^T (X_s ∧ M) ds )^p )^{1−1/p} ∫_0^T ‖X_s‖_p ds.

If E( ∫_0^T (X_s ∧ M_n) ds )^p = 0 for any sequence M_n ↑ +∞, the inequality is obvious since, by the Beppo Levi monotone convergence Theorem, ∫_0^T X_s ds = 0 P-a.s. Otherwise, there is a sequence M_n ↑ +∞ such that all these integrals are non-zero (and finite since X ∧ M_n is bounded by M_n and T is finite). Consequently, one can divide both sides of the former inequality to obtain

∀ n ≥ 1,   ( E( ∫_0^T (X_s ∧ M_n) ds )^p )^{1/p} ≤ ∫_0^T ‖X_s‖_p ds.

Now letting M_n ↑ +∞ yields exactly the expected result, owing to two successive applications of the Beppo Levi monotone convergence Theorem, first with respect to the Lebesgue measure ds, then with respect to dP. When T = +∞, the result follows by Fatou’s Lemma, letting T go to infinity in the inequality obtained for finite T. ♦

 The Burkhölder–Davis–Gundy Inequality (continuous time): for every p ∈ (0, +∞), there exist two real constants c^BDG_p > 0 and C^BDG_p > 0 such that, for every continuous local martingale (X_t)_{t∈[0,T]} null at 0,

c^BDG_p ‖ √⟨X⟩_T ‖_p ≤ ‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ C^BDG_p ‖ √⟨X⟩_T ‖_p.   (7.51)

For a detailed proof based on a stochastic calculus approach, we refer to [251], p. 160. As, in this section, we are concerned with multi-dimensional continuous local martingales (stochastic integrals of matrix-valued processes against a q-dimensional Brownian motion), we need the following easy extension of the right-hand inequality to d-dimensional local martingales X_t = (X¹_t, …, X^d_t): set ⟨X⟩_t = Σ_{i=1}^d ⟨X^i⟩_t. Let p ∈ [1, +∞) and C^BDG_{d,p} = d C^BDG_p. Then

‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ C^BDG_{d,p} ‖ √⟨X⟩_T ‖_p.   (7.52)

In particular, if W is an (F_t)_{t∈[0,T]} standard Brownian motion on a filtered probability space (Ω, A, (F_t)_{t∈[0,T]}, P) and (H_t)_{t∈[0,T]} = ([H^{ij}_t])_{t∈[0,T]} is an (F_t)_{t∈[0,T]}-progressively measurable process having values in M(d, q, R) such that ∫_0^T ‖H_t‖² dt < +∞ P-a.s., then the d-dimensional local martingale ∫_0^· H_s dW_s satisfies

‖ sup_{t∈[0,T]} | ∫_0^t H_s dW_s | ‖_p ≤ C^BDG_{d,p} ‖ ( ∫_0^T ‖H_t‖² dt )^{1/2} ‖_p.   (7.53)

7.8.2 Polynomial Moments (II)

Proposition 7.6 (a) For every $p\in(0,+\infty)$, there exists a positive real constant $\kappa_p>0$ (increasing in $p$) such that, if $b$, $\sigma$ satisfy
$$
\forall\, t\in[0,T],\ \forall\, x\in\mathbb R^d,\qquad |b(t,x)| + \|\sigma(t,x)\| \le C\big(1+|x|\big), \qquad (7.54)
$$
then every strong solution of Equation (7.1) starting from the finite random vector $X_0$ (if any) satisfies
$$
\forall\, p\in(0,+\infty),\qquad \Big\|\sup_{s\in[0,T]}|X_s|\Big\|_p \le 2\, e^{\kappa_p C T}\big(1+\|X_0\|_p\big).
$$
(b) The same conclusion holds under the same assumptions for the continuous Euler scheme with step $\frac Tn$, $n\ge1$, as defined by (7.8), with the same constant $\kappa_p$ (which does not depend on $n$), i.e.
$$
\forall\, p\in(0,+\infty),\ \forall\, n\ge1,\qquad \Big\|\sup_{s\in[0,T]}|\bar X^n_s|\Big\|_p \le 2\, e^{\kappa_p C T}\big(1+\|X_0\|_p\big).
$$
Remarks. • Note that this proposition makes no assumption on the existence of strong solutions to (7.1), nor any (strong) uniqueness assumption, on a time interval or the whole real line. Furthermore, the inequality is meaningless when $X_0\notin L^p(\mathbb P)$.
• The case $p\in(0,2)$ will be discussed at the end of the proof.

Proof. To alleviate the notation we assume from now on that $d=q=1$.

(a) Step 1 (The process: first reduction). Assume $p\in[2,\infty)$. First we introduce for every integer $N\ge1$ the stopping time $\tau_N := \inf\{t\in[0,T] : |X_t-X_0|>N\}$ (convention $\inf\emptyset = +\infty$). This is a stopping time since, for every $t\in\mathbb R_+$, $\{\tau_N<t\} = \bigcup_{r\in[0,t)\cap\mathbb Q}\{|X_r-X_0|>N\}\in\mathcal F_t$. Moreover, $\{\tau_N\le t\} = \bigcap_{k\ge k_0}\big\{\tau_N< t+\frac1k\big\}$ for every $k_0\ge1$, hence $\{\tau_N\le t\}\in\bigcap_{k_0\ge1}\mathcal F_{t+\frac1{k_0}} = \mathcal F_{t+} = \mathcal F_t$ since the filtration is càd ($^7$). Furthermore, $\sup_{t\in[0,T]}|X_t^{\tau_N}|\le N+|X_0|$ so that the non-decreasing function $f_N$ defined by $f_N(t) := \big\|\sup_{s\in[0,t]}|X_{s\wedge\tau_N}|\big\|_p$, $t\in[0,T]$, is bounded by $N+\|X_0\|_p$. On the other hand,
$$
\sup_{s\in[0,t]}|X_{s\wedge\tau_N}| \le |X_0| + \int_0^{t\wedge\tau_N}|b(s,X_s)|\,ds + \sup_{s\in[0,t]}\Big|\int_0^{s\wedge\tau_N}\sigma(u,X_u)\,dW_u\Big|.
$$
It follows from successive applications of both the regular and the generalized Minkowski Inequalities (7.50) and of the BDG Inequality (7.53) that
$$
\begin{aligned}
f_N(t) &\le \|X_0\|_p + \int_0^t\big\|\mathbf 1_{\{s\le\tau_N\}}b(s,X_s)\big\|_p\,ds + C_{d,p}^{BDG}\,\Big\|\Big(\int_0^{t\wedge\tau_N}\sigma(s,X_s)^2\,ds\Big)^{\frac12}\Big\|_p\\
&\le \|X_0\|_p + \int_0^t\big\|b(s\wedge\tau_N,X_{s\wedge\tau_N})\big\|_p\,ds + C_{d,p}^{BDG}\,\Big\|\Big(\int_0^{t}\sigma(s\wedge\tau_N,X_{s\wedge\tau_N})^2\,ds\Big)^{\frac12}\Big\|_p\\
&\le \|X_0\|_p + C\int_0^t\big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C_{d,p}^{BDG}\,C\,\Big\|\Big(\int_0^{t}\big(1+|X_{s\wedge\tau_N}|\big)^2\,ds\Big)^{\frac12}\Big\|_p\\
&\le \|X_0\|_p + C\int_0^t\big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C_{d,p}^{BDG}\,C\,\Big(\sqrt t + \Big\|\Big(\int_0^{t}|X_{s\wedge\tau_N}|^2\,ds\Big)^{\frac12}\Big\|_p\Big),
\end{aligned}
$$
where we used in the last line the Minkowski inequality on $L^2([0,T],dt)$ endowed with its usual Hilbert norm. Hence, the $L^p(\mathbb P)$-Minkowski Inequality and the obvious identity $\|\sqrt{\,\cdot\,}\|_p = \|\cdot\|_{p/2}^{1/2}$ yield
$$
f_N(t) \le \|X_0\|_p + C\int_0^t\big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C_{d,p}^{BDG}\,C\,\Big(\sqrt t + \Big\|\int_0^{t}|X_{s\wedge\tau_N}|^2\,ds\Big\|_{\frac p2}^{\frac12}\Big).
$$
Now, as $\frac p2\ge1$, the generalized $L^{\frac p2}(\mathbb P)$-Minkowski Inequality (7.50) yields
7 This holds true for any hitting time of an open set by an Ft -adapted càd process.
$$
\begin{aligned}
f_N(t) &\le \|X_0\|_p + C\int_0^t\big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C_{d,p}^{BDG}\,C\,\Big(\sqrt t + \Big(\int_0^{t}\big\||X_{s\wedge\tau_N}|^2\big\|_{\frac p2}\,ds\Big)^{\frac12}\Big)\\
&= \|X_0\|_p + C\int_0^t\big(1+\|X_{s\wedge\tau_N}\|_p\big)\,ds + C_{d,p}^{BDG}\,C\,\Big(\sqrt t + \Big(\int_0^{t}\|X_{s\wedge\tau_N}\|_p^2\,ds\Big)^{\frac12}\Big).
\end{aligned}
$$
Consequently, the function $f_N$ satisfies
$$
f_N(t) \le C\int_0^t f_N(s)\,ds + C\,C_{d,p}^{BDG}\Big(\int_0^t f_N^2(s)\,ds\Big)^{\frac12} + \psi(t),
$$
where
$$
\psi(t) = \|X_0\|_p + C\big(t + C_{d,p}^{BDG}\sqrt t\big).
$$

Step 2 (“À la Gronwall” Lemma).

Lemma 7.3 (“À la Gronwall” Lemma) Let $f:[0,T]\to\mathbb R_+$ and $\psi:[0,T]\to\mathbb R_+$ be two non-negative non-decreasing functions satisfying
$$
\forall\, t\in[0,T],\qquad f(t)\le A\int_0^t f(s)\,ds + B\Big(\int_0^t f^2(s)\,ds\Big)^{\frac12} + \psi(t),
$$
where $A$, $B$ are two positive real constants. Then
$$
\forall\, t\in[0,T],\qquad f(t)\le 2\,e^{(2A+B^2)t}\,\psi(t).
$$
Proof. First, it follows from the elementary inequality $\sqrt{xy}\le\frac12\big(\frac xB+By\big)$, $x,y\ge0$, $B>0$, that
$$
\Big(\int_0^t f^2(s)\,ds\Big)^{\frac12} \le \Big(f(t)\int_0^t f(s)\,ds\Big)^{\frac12} \le \frac{f(t)}{2B} + \frac B2\int_0^t f(s)\,ds.
$$
Plugging this into the original inequality yields
$$
f(t)\le (2A+B^2)\int_0^t f(s)\,ds + 2\,\psi(t).
$$
Gronwall's Lemma 7.2 finally yields the announced result. ♦



Step 3 (Conclusion when $p\in[2,+\infty)$). Applying the above generalized Gronwall Lemma to the functions $f_N$ and $\psi$ defined in Step 1 leads to
$$
\forall\, t\in[0,T],\qquad \Big\|\sup_{s\in[0,t]}|X_{s\wedge\tau_N}|\Big\|_p = f_N(t) \le 2\,e^{(2+C(C_{d,p}^{BDG})^2)Ct}\Big(\|X_0\|_p + C\big(t+C_{d,p}^{BDG}\sqrt t\big)\Big).
$$
The sequence of stopping times $\tau_N$ is non-decreasing and converges toward $\tau_\infty$, taking values in $[0,T]\cup\{\infty\}$. On the event $\{\tau_\infty\le T\}$, $|X_{\tau_N}-X_0|\ge N$ so that $|X_{\tau_\infty}-X_0| = \lim_{N\to+\infty}|X_{\tau_N}-X_0| = +\infty$ since $X$ has continuous paths. This is a.s. impossible since $[0,T]$ is compact and $(X_t)_{t\ge0}$ a.s. has continuous paths. As a consequence, $\tau_\infty=+\infty$ a.s., which in turn implies that
$$
\lim_N \sup_{s\in[0,t]}|X_{s\wedge\tau_N}| = \sup_{s\in[0,t]}|X_s|\quad \mathbb P\text{-a.s.}
$$
Then Fatou's Lemma implies, by letting $N$ go to infinity, that, for every $t\in[0,T]$,
$$
\Big\|\sup_{s\in[0,t]}|X_s|\Big\|_p \le \lim_N\Big\|\sup_{s\in[0,t]}|X_{s\wedge\tau_N}|\Big\|_p \le 2\,e^{(2+C(C_{d,p}^{BDG})^2)Ct}\Big(\|X_0\|_p + C\big(t+C_{d,p}^{BDG}\sqrt t\big)\Big), \qquad (7.55)
$$
which yields, using that $\max(\sqrt u,u)\le e^u$, $u\ge0$,
$$
\begin{aligned}
\forall\, t\in[0,T],\qquad \Big\|\sup_{s\in[0,t]}|X_s|\Big\|_p &\le 2\,e^{(2+C(C_{d,p}^{BDG})^2)Ct}\Big(\|X_0\|_p + e^{Ct} + e^{C(C_{d,p}^{BDG})^2t}\Big)\\
&\le 2\,e^{(2+C(C_{d,p}^{BDG})^2)Ct}\big(e^{Ct}+e^{C(C_{d,p}^{BDG})^2t}\big)\big(1+\|X_0\|_p\big).
\end{aligned}
$$
One derives the existence of a positive real constant $\kappa_{p,d}>0$, only depending on the BDG real constant $C_{d,p}^{BDG}$, such that
$$
\forall\, t\in[0,T],\qquad \Big\|\sup_{s\in[0,t]}|X_s|\Big\|_p \le \kappa_{p,d}\, e^{\kappa_{p,d} C(C+1)t}\big(1+\|X_0\|_p\big).
$$

Step 4 (Conclusion when $p\in(0,2)$). The extension can be carried out as follows: for every $x\in\mathbb R^d$, the diffusion process starting at $x$, denoted by $(X_t^x)_{t\in[0,T]}$, satisfies the following two obvious facts:
– the process $X^x$ is $\mathcal F_t^W$-adapted, where $\mathcal F_t^W := \sigma(\mathcal N_{\mathbb P}, W_s, s\le t)$, $t\in[0,T]$;
– if $X_0$ is an $\mathbb R^d$-valued random vector defined on $(\Omega,\mathcal A,\mathbb P)$, independent of $W$, then the process $X=(X_t)_{t\in[0,T]}$ starting from $X_0$ satisfies $X_t = X_t^{X_0}$.

Consequently, using that $p\mapsto\|\cdot\|_p$ is non-decreasing, it follows that
$$
\Big\|\sup_{s\in[0,t]}|X_s^x|\Big\|_p \le \Big\|\sup_{s\in[0,t]}|X_s^x|\Big\|_2 \le \kappa_{2,d}\,e^{\kappa_{2,d}C(C+1)t}\big(1+|x|\big).
$$
Now
$$
\mathbb E\Big[\sup_{t\in[0,T]}|X_t|^p\Big] = \int_{\mathbb R^d}\mathbb P_{X_0}(dx)\,\mathbb E\Big[\sup_{t\in[0,T]}|X_t^x|^p\Big] \le 2^{(p-1)_+}(\kappa_{2,d})^p\,e^{p\kappa_{2,d}C(C+1)T}\big(1+\mathbb E\,|X_0|^p\big)
$$
(where we used that $(u+v)^p\le 2^{(p-1)_+}(u^p+v^p)$, $u,v\ge0$) so that
$$
\Big\|\sup_{t\in[0,T]}|X_t|\Big\|_p \le 2^{(1-\frac1p)_+}\kappa_{2,d}\,e^{\kappa_{2,d}C(C+1)T}\,2^{(\frac1p-1)_+}\big(1+\|X_0\|_p\big) = 2^{|1-\frac1p|}\kappa_{2,d}\,e^{\kappa_{2,d}C(C+1)T}\big(1+\|X_0\|_p\big) \le 2\,\kappa_{2,d}\,e^{\kappa_{2,d}C(C+1)T}\big(1+\|X_0\|_p\big).
$$

As concerns the SDE (7.1) itself, the same reasoning can be carried out only
if (7.1) satisfies an existence and uniqueness assumption for any starting value X 0 .
(b) (Euler scheme) The proof follows the same lines as above. One starts from the integral form (7.9) of the continuous Euler scheme and one introduces for every $n, N\ge1$ the stopping times
$$
\bar\tau_N = \bar\tau_N^n := \inf\big\{t\in[0,T] : |\bar X_t^n - X_0|>N\big\}.
$$
To adapt the above proof to the continuous Euler scheme, we just need to note that
$$
\forall\, s\in[0,t],\quad 0\le \underline s\le s \quad\text{and}\quad \big\|\bar X_{\underline s}\big\|_p \le \Big\|\sup_{s\in[0,t\wedge\bar\tau_N]}|\bar X_s|\Big\|_p. \qquad ♦
$$

7.8.3 $L^p$-Pathwise Regularity

Lemma 7.4 Let $p\ge1$ and let $(Y_t)_{t\in[0,T]}$ be an ($\mathbb R^d$-valued) Itô process defined on $(\Omega,\mathcal A,\mathbb P)$ by
$$
Y_t = Y_0 + \int_0^t G_s\,ds + \int_0^t H_s\,dW_s,\qquad t\in[0,T],
$$
where $G$ and $H$ are $(\mathcal F_t)$-progressively measurable, having values in $\mathbb R^d$ and $\mathcal M(d,q,\mathbb R)$, respectively, and satisfying $\int_0^T\big(|G_s|+\|H_s\|^2\big)\,ds<+\infty$ $\mathbb P$-a.s.
(a) For every $p\ge2$, writing $\|H_t\|_p = \big\|\,\|H_t\|\,\big\|_p$ to alleviate notation,
$$
\begin{aligned}
\forall\, s,t\in[0,T],\qquad \|Y_t-Y_s\|_p &\le C_{d,p}^{BDG}\sup_{t\in[0,T]}\|H_t\|_p\,|t-s|^{\frac12} + \sup_{t\in[0,T]}\|G_t\|_p\,|t-s|\\
&\le \Big(C_{d,p}^{BDG}\sup_{t\in[0,T]}\|H_t\|_p + \sqrt T\sup_{t\in[0,T]}\|G_t\|_p\Big)|t-s|^{\frac12}.
\end{aligned}
$$
In particular, if $\sup_{t\in[0,T]}\|H_t\|_p + \sup_{t\in[0,T]}\|G_t\|_p<+\infty$, the process $t\mapsto Y_t$ is Hölder with exponent $\frac12$ from $[0,T]$ into $L^p(\mathbb P)$.
(b) If $p\in[1,2)$, then, for every $s,t\in[0,T]$,
$$
\begin{aligned}
\|Y_t-Y_s\|_p &\le C_{d,p}^{BDG}\Big\|\sup_{t\in[0,T]}\|H_t\|\Big\|_p|t-s|^{\frac12} + \sup_{t\in[0,T]}\|G_t\|_p\,|t-s|\\
&\le \Big(C_{d,p}^{BDG}\Big\|\sup_{t\in[0,T]}\|H_t\|\Big\|_p + \sqrt T\sup_{t\in[0,T]}\|G_t\|_p\Big)|t-s|^{\frac12}.
\end{aligned}
$$

Proof. (a) Let $0\le s\le t\le T$. It follows from the standard and generalized Minkowski Inequalities and the BDG Inequality (7.53) applied to the stochastic integral $\big(\int_s^{s+u}H_r\,dW_r\big)_{u\ge0}$ that
$$
\begin{aligned}
\|Y_t-Y_s\|_p &\le \Big\|\int_s^t|G_u|\,du\Big\|_p + \Big\|\int_s^t H_u\,dW_u\Big\|_p\\
&\le \int_s^t\|G_u\|_p\,du + C_p^{BDG}\Big\|\Big(\int_s^t\|H_u\|^2\,du\Big)^{\frac12}\Big\|_p\\
&\le \sup_{t\in[0,T]}\|G_t\|_p\,(t-s) + C_p^{BDG}\Big\|\int_s^t\|H_u\|^2\,du\Big\|_{\frac p2}^{\frac12}\\
&\le \sup_{t\in[0,T]}\|G_t\|_p\,(t-s) + C_p^{BDG}\Big(\sup_{u\in[0,T]}\big\|\,\|H_u\|^2\big\|_{\frac p2}\Big)^{\frac12}(t-s)^{\frac12}\\
&= \sup_{t\in[0,T]}\|G_t\|_p\,(t-s) + C_p^{BDG}\sup_{u\in[0,T]}\|H_u\|_p\,(t-s)^{\frac12}.
\end{aligned}
$$
The second inequality simply follows from $|t-s|\le\sqrt T\,|t-s|^{\frac12}$.
(b) If $p\in[1,2)$, one simply uses that
$$
\Big\|\Big(\int_s^t\|H_u\|^2\,du\Big)^{\frac12}\Big\|_p \le |t-s|^{\frac12}\Big\|\Big(\sup_{u\in[0,T]}\|H_u\|^2\Big)^{\frac12}\Big\|_p = |t-s|^{\frac12}\Big\|\sup_{u\in[0,T]}\|H_u\|\Big\|_p
$$
and one concludes likewise. ♦

Remark. If $H$, $G$ and $Y$ are defined on the non-negative real line $\mathbb R_+$ and $\sup_{t\in\mathbb R_+}\big(\|G_t\|_p+\|H_t\|_p\big)<+\infty$, then $t\mapsto Y_t$ is locally $\frac12$-Hölder on $\mathbb R_+$. If $H\equiv0$, the process is Lipschitz continuous on $[0,T]$.

Combining the above result for Itô processes (see Sect. 12.8) with those of Proposition 7.6 leads to the following result on the pathwise regularity of the diffusion solution to (7.1) (when it exists) and the related Euler schemes.

Proposition 7.7 If the coefficients $b$ and $\sigma$ satisfy the linear growth assumption (7.54) over $[0,T]\times\mathbb R^d$ with a real constant $C>0$, then the Euler scheme with step $\frac Tn$ and any strong solution of (7.1) satisfy, for every $p\ge1$,
$$
\forall\, n\ge1,\ \forall\, s,t\in[0,T],\qquad \|X_t-X_s\|_p + \|\bar X_t^n-\bar X_s^n\|_p \le \kappa_{p,d}\,C\,e^{\kappa_{p,d}C(C+1)T}\big(1+\sqrt T\big)\big(1+\|X_0\|_p\big)|t-s|^{\frac12}
$$
where $\kappa_{p,d}\in(0,+\infty)$ is a real constant only depending on $C_{d,p}^{BDG}$ (increasing in $p$).

Proof. As concerns the process $X$, this is a straightforward consequence of the above Lemma 7.4 by setting
$$
G_t = b(t,X_t)\quad\text{and}\quad H_t = \sigma(t,X_t)
$$
since
$$
\max\Big(\Big\|\sup_{t\in[0,T]}|G_t|\Big\|_p,\ \Big\|\sup_{t\in[0,T]}\|H_t\|\Big\|_p\Big) \le C\Big(1+\Big\|\sup_{t\in[0,T]}|X_t|\Big\|_p\Big).
$$
One specifies the real constant $\kappa_{p,d}$ using Proposition 7.6. ♦

7.8.4 $L^p$-Convergence Rate (II): Proof of Theorem 7.2

Step 1 ($p\ge2$). Set
$$
\varepsilon_t := X_t - \bar X_t^n = \int_0^t\big(b(s,X_s)-b(\underline s,\bar X_{\underline s})\big)\,ds + \int_0^t\big(\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big)\,dW_s,\qquad t\in[0,T],
$$
so that
$$
\sup_{s\in[0,t]}|\varepsilon_s| \le \int_0^t\big|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big|\,ds + \sup_{s\in[0,t]}\Big|\int_0^s\big(\sigma(u,X_u)-\sigma(\underline u,\bar X_{\underline u})\big)\,dW_u\Big|.
$$
One sets for every $t\in[0,T]$,
$$
f(t) := \Big\|\sup_{s\in[0,t]}|\varepsilon_s|\Big\|_p.
$$
It follows from the (regular) Minkowski Inequality on $\big(L^p(\mathbb P),\|\cdot\|_p\big)$, the BDG Inequality (7.53) and the generalized Minkowski Inequality (7.50) that
$$
\begin{aligned}
f(t) &\le \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C_{d,p}^{BDG}\Big\|\Big(\int_0^t\big(\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big)^2\,ds\Big)^{\frac12}\Big\|_p\\
&= \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C_{d,p}^{BDG}\Big\|\int_0^t\big(\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big)^2\,ds\Big\|_{\frac p2}^{\frac12}\\
&\le \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C_{d,p}^{BDG}\Big(\int_0^t\big\|\big(\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big)^2\big\|_{\frac p2}\,ds\Big)^{\frac12}\\
&= \int_0^t\big\|b(s,X_s)-b(\underline s,\bar X_{\underline s})\big\|_p\,ds + C_{d,p}^{BDG}\Big(\int_0^t\big\|\sigma(s,X_s)-\sigma(\underline s,\bar X_{\underline s})\big\|_p^2\,ds\Big)^{\frac12}.
\end{aligned}
$$
Let us temporarily set $\tau_t^X = \big(1+\big\|\sup_{s\in[0,t]}|X_s|\big\|_p\big)\,t$, $t\in[0,T]$. Using Assumption $(\mathcal H_T^\beta)$ (see (7.14)) and the Minkowski Inequality on the $\big(L^2([0,T],dt),|\,\cdot\,|_{L^2(dt)}\big)$ spaces, we get
$$
\begin{aligned}
f(t) &\le C_{b,\sigma,T}\Bigg(\int_0^t\Big[\big(1+\|X_s\|_p\big)(s-\underline s)^\beta + \big\|X_s-\bar X_{\underline s}\big\|_p\Big]\,ds\\
&\qquad\qquad + C_{d,p}^{BDG}\Big(\int_0^t\Big[\big(1+\|X_s\|_p\big)(s-\underline s)^\beta + \big\|X_s-\bar X_{\underline s}\big\|_p\Big]^2\,ds\Big)^{\frac12}\Bigg)\\
&\le C_{b,\sigma,T}\Bigg(\int_0^t\Big[\big(1+\|X_s\|_p\big)(s-\underline s)^\beta + \big\|X_s-\bar X_{\underline s}\big\|_p\Big]\,ds\\
&\qquad\qquad + C_{d,p}^{BDG}\Big[\Big(1+\Big\|\sup_{s\in[0,T]}|X_s|\Big\|_p\Big)\Big(\int_0^t(s-\underline s)^{2\beta}\,ds\Big)^{\frac12} + \Big(\int_0^t\big\|X_s-\bar X_{\underline s}\big\|_p^2\,ds\Big)^{\frac12}\Big]\Bigg).
\end{aligned}
$$
Now, using that $0\le s-\underline s\le\frac Tn$, we obtain
$$
f(t) \le C_{b,\sigma,T}\Bigg(\big(1+C_{d,p}^{BDG}\big)\Big(\frac Tn\Big)^\beta\tau_t^X + \int_0^t\big\|X_s-\bar X_{\underline s}\big\|_p\,ds + C_{d,p}^{BDG}\Big(\int_0^t\big\|X_s-\bar X_{\underline s}\big\|_p^2\,ds\Big)^{\frac12}\Bigg).
$$

Now, noting that
$$
\big\|X_s-\bar X_{\underline s}\big\|_p \le \big\|X_s-X_{\underline s}\big\|_p + \big\|X_{\underline s}-\bar X_{\underline s}\big\|_p = \big\|X_s-X_{\underline s}\big\|_p + \|\varepsilon_{\underline s}\|_p \le \big\|X_s-X_{\underline s}\big\|_p + f(s),
$$
we derive
$$
f(t) \le C_{b,\sigma,T}\Big(\int_0^t f(s)\,ds + \sqrt2\,C_{d,p}^{BDG}\Big(\int_0^t f(s)^2\,ds\Big)^{\frac12} + \psi(t)\Big) \qquad (7.56)
$$
where
$$
\psi(t) := \big(1+C_{d,p}^{BDG}\big)\,\tau_t^X\Big(\frac Tn\Big)^\beta + \int_0^t\big\|X_s-X_{\underline s}\big\|_p\,ds + \sqrt2\,C_{d,p}^{BDG}\Big(\int_0^t\big\|X_s-X_{\underline s}\big\|_p^2\,ds\Big)^{\frac12}. \qquad (7.57)
$$

Step 2. It follows from Lemma 7.3 that
$$
f(t) \le 2\,C_{b,\sigma,T}\,e^{2C_{b,\sigma,T}(1+C_{b,\sigma,T}(C_{d,p}^{BDG})^2)t}\,\psi(t). \qquad (7.58)
$$
Now, we will use the $L^p(\mathbb P)$-path regularity of the diffusion process $X$ established in Proposition 7.7 to provide an upper-bound for the function $\psi$. We first note that, as $b$ and $\sigma$ satisfy $(\mathcal H_T^\beta)$ with a positive real constant $C_{b,\sigma,T}$, they also satisfy the linear growth assumption (7.54) with
$$
C_{b,\sigma,T} := C_{b,\sigma,T} + \sup_{t\in[0,T]}\big(|b(t,0)|+\|\sigma(t,0)\|\big) < +\infty
$$
since $b(\,\cdot\,,0)$ and $\sigma(\,\cdot\,,0)$ are $\beta$-Hölder, hence bounded, on $[0,T]$. Set for convenience $C_{b,\sigma,T} = C_{b,\sigma,T}(C_{b,\sigma,T}+1)$. It follows from (7.57) and Proposition 7.7 that
$$
\begin{aligned}
\psi(t) &\le \big(1+C_{d,p}^{BDG}\big)\,\tau_t^X\Big(\frac Tn\Big)^\beta + \kappa_p C_{b,\sigma}\,e^{\kappa_p C_{b,\sigma,T}t}\big(1+\|X_0\|_p\big)\big(1+\sqrt t\big)\Big(\frac Tn\Big)^{\frac12}\big(t+\sqrt2\,C_{d,p}^{BDG}\sqrt t\big)\\
&\le \big(1+C_{d,p}^{BDG}\big)\,\tau_t^X\Big(\frac Tn\Big)^\beta + 2\big(1+\sqrt2\,C_{d,p}^{BDG}\big)(1+t)^2\,\kappa_{p,d} C_{b,\sigma}\,e^{\kappa_{p,d} C_{b,\sigma,T}t}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\frac12},
\end{aligned}
$$
where we used the inequality $(1+\sqrt t)\big(t+\sqrt2\,C_{d,p}^{BDG}\sqrt t\big)\le 2\big(1+\sqrt2\,C_{d,p}^{BDG}\big)(1+t)^2$, which can be established by inspecting the cases $0\le t\le1$ and $t\ge1$.

Moreover, this time using Proposition 7.6(a), we derive that
$$
\tau_t^X \le \Big(1+\kappa_{p,d}\,e^{\kappa_{p,d}C_{b,\sigma,T}T}\big(1+\|X_0\|_p\big)\Big)t \le \tilde\kappa_{p,d}\,e^{\tilde\kappa_{p,d}C_{b,\sigma,T}T}\big(1+\|X_0\|_p\big)\,t \qquad (7.59)
$$
where $\tilde\kappa_{p,d} = 1+\kappa_{p,d}$. Hence, plugging the right-hand side of (7.59) into the above inequality satisfied by $\psi$, we derive the existence of a real constant $\tilde\kappa_{p,d}>0$, only depending on $C_{d,p}^{BDG}$, such that
$$
\begin{aligned}
\psi(t) &\le \tilde\kappa_{p,d}\,e^{\tilde\kappa_p C_{b,\sigma,T}T}\big(1+\|X_0\|_p\big)\,t\Big(\frac Tn\Big)^\beta + \tilde\kappa_{p,d}\,C_{b,\sigma}\,e^{\tilde\kappa_p C_{b,\sigma}t}\big(1+\|X_0\|_p\big)(1+t)^2\Big(\frac Tn\Big)^{\frac12}\\
&\le \kappa_{p,d}\,C_{b,\sigma}\,e^{\kappa_{p,d}(1+C_{b,\sigma})^2 t}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\beta\wedge\frac12},
\end{aligned}
$$
where we used $(1+u)^2\le 2e^u$, $u\ge0$, in the second line. The real constant $\kappa_{p,d}$ only depends on $\tilde\kappa_{p,d}$, hence on $C_{d,p}^{BDG}$. Finally, one plugs this bound into (7.58) at time $T$ to get the announced upper-bound.
Step 3 ($p\in(0,2)$). It remains to deal with the case $p\in(0,2)$. In fact, once we observe that Assumption $(\mathcal H_T^\beta)$ ensures the global existence and uniqueness of the solution $X$ of (7.1) starting from a given random variable $X_0$ (independent of $W$), this case can be handled following the approach developed in Step 4 of the proof of Proposition 7.6. We leave the details to the reader. ♦

Corollary 7.3 (Lipschitz continuous framework) If $b$ and $\sigma$ satisfy Condition $(\mathcal H_T^1)$, i.e.
$$
\forall\, s,t\in[0,T],\ \forall\, x,y\in\mathbb R^d,\qquad |b(s,x)-b(t,y)| + \|\sigma(s,x)-\sigma(t,y)\| \le C_{b,\sigma,T}\big(|t-s|+|x-y|\big),
$$
then for every $p\in[1,\infty)$, there exists a real constant $\kappa_{p,d}>0$ such that
$$
\forall\, n\ge1,\qquad \Big\|\sup_{t\in[0,T]}\big|X_t-\bar X_t^n\big|\Big\|_p \le \kappa_{p,d}\,C_{b,\sigma}\,e^{\kappa_{p,d}(1+C_{b,\sigma})^2 T}\big(1+\|X_0\|_p\big)\Big(\frac Tn\Big)^{\frac12}
$$
where $C_{b,\sigma,T} := C_{b,\sigma,T} + \sup_{t\in[0,T]}\big(|b(t,0)|+\|\sigma(t,0)\|\big) < +\infty$.
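Corollary 7.3 predicts a strong rate of order $(T/n)^{1/2}$ in the Lipschitz framework, and this is easy to observe numerically. Below is a minimal Monte Carlo sketch (not the book's code; the test SDE $dX_t = -X_t\,dt + X_t\,dW_t$, whose exact solution $X_t = \exp(-\tfrac32 t + W_t)$ is available in closed form, and all numerical parameters are illustrative assumptions):

```python
import numpy as np

def euler_strong_error(n, M=20_000, T=1.0, seed=1):
    """Monte Carlo estimate of E[ max_k |X_{t_k} - Xbar^n_{t_k}| ] for the SDE
    dX = -X dt + X dW, X_0 = 1, whose exact solution is X_t = exp(-1.5 t + W_t)."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), (M, n))
    W = np.cumsum(dW, axis=1)
    t = dt * np.arange(1, n + 1)
    exact = np.exp(-1.5 * t + W)                  # exact solution on the grid
    euler = np.cumprod(1.0 - dt + dW, axis=1)     # Euler: Xbar_{k+1} = Xbar_k (1 - dt + dW_k)
    return np.mean(np.max(np.abs(exact - euler), axis=1))

e_coarse, e_fine = euler_strong_error(50), euler_strong_error(200)
ratio = e_coarse / e_fine   # an n^{-1/2} rate suggests a ratio near sqrt(200/50) = 2
```

Halving behavior of the error when the number of steps is multiplied by 4 is the practical signature of the $\frac12$-order strong rate.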

7.8.5 The Stepwise Constant Euler Scheme

The aim of this section is to prove in full generality Claim (b) of Theorem 7.2. We recall that the stepwise constant Euler scheme is defined by
$$
\forall\, t\in[0,T],\qquad \widetilde X_t := \bar X_{\underline t},\quad\text{i.e.}\quad \widetilde X_t = \bar X_{t_k}\ \text{ if }t\in[t_k,t_{k+1}).
$$
We saw in Sect. 7.2.1 that when $X=W$, a $\log n$ factor comes out in the a priori error bound. One must again keep in mind that this question is quite crucial, at least in higher dimensions, since the simulation of (functionals of) the genuine/continuous Euler scheme is not always possible (see Chap. 8), whereas the simulation of the stepwise constant Euler scheme is generally straightforward in any dimension, provided $b$ and $\sigma$ are known.
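Simulating $\widetilde X$ only requires the $n$ Euler iterates, frozen between grid points. A minimal sketch (illustrative helper, not from the book; the drift and diffusion functions used in the usage line are arbitrary assumptions):

```python
import numpy as np

def stepwise_euler(b, sigma, x0, T, n, t_eval, rng):
    """Simulate the Euler scheme Xbar at the grid t_k = kT/n, then return the
    stepwise constant scheme  X~_t = Xbar_{t_k}  for t in [t_k, t_{k+1})."""
    dt = T / n
    xbar = np.empty(n + 1)
    xbar[0] = x0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))
        xbar[k + 1] = xbar[k] + b(xbar[k]) * dt + sigma(xbar[k]) * dW
    # index k such that t in [t_k, t_{k+1}); t = T is mapped to the last node
    idx = np.minimum((np.asarray(t_eval) / dt).astype(int), n)
    return xbar[idx]

rng = np.random.default_rng(42)
path = stepwise_euler(lambda x: -x, lambda x: 1.0 + 0.1 * x,
                      x0=1.0, T=1.0, n=100, t_eval=[0.0, 0.123, 0.5, 1.0], rng=rng)
```

Only the $n$ Gaussian increments are needed, whichever (and however many) dates $t$ the scheme is evaluated at.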
Proof of Theorem 7.2(b). Step 1 ($X_0=x\in\mathbb R^d$). We may assume without loss of generality that $p\in[1,\infty)$ owing to the monotonicity of $L^p$-norms. Then
$$
\bar X_t^n - \widetilde X_t^n = \bar X_t^n - \bar X_{\underline t}^n = \int_{\underline t}^t b(\underline s,\bar X_{\underline s})\,ds + \int_{\underline t}^t \sigma(\underline s,\bar X_{\underline s})\,dW_s.
$$
One derives that
$$
\sup_{t\in[0,T]}\big|\bar X_t^n-\widetilde X_t^n\big| \le \frac Tn\sup_{t\in[0,T]}\big|b(\underline t,\bar X_{\underline t})\big| + \sup_{t\in[0,T]}\big|\sigma(\underline t,\bar X_{\underline t})(W_t-W_{\underline t})\big|. \qquad (7.60)
$$
Now, it follows from Proposition 7.6(b) that
$$
\Big\|\sup_{t\in[0,T]}\big|b(\underline t,\bar X_{\underline t}^n)\big|\Big\|_p \le 2\,e^{\kappa_p C_{b,\sigma,T}T}\big(1+|x|\big).
$$
On the other hand, using the extended Hölder Inequality: for every $p\in(0,+\infty)$,
$$
\forall\, r,s\ge1,\ \frac1r+\frac1s=1,\qquad \|fg\|_p \le \|f\|_{rp}\,\|g\|_{sp},
$$
with $r=s=2$ (other choices are possible), leads to
$$
\Big\|\sup_{t\in[0,T]}\big|\sigma(\underline t,\bar X_{\underline t})(W_t-W_{\underline t})\big|\Big\|_p \le \Big\|\sup_{t\in[0,T]}\big\|\sigma(\underline t,\bar X_{\underline t})\big\|\sup_{t\in[0,T]}\big|W_t-W_{\underline t}\big|\Big\|_p \le \Big\|\sup_{t\in[0,T]}\big\|\sigma(\underline t,\bar X_{\underline t})\big\|\Big\|_{2p}\Big\|\sup_{t\in[0,T]}\big|W_t-W_{\underline t}\big|\Big\|_{2p}.
$$
Now, like for the drift $b$, one has
$$
\Big\|\sup_{t\in[0,T]}\big\|\sigma(\underline t,\bar X_{\underline t})\big\|\Big\|_{2p} \le 2\,e^{\kappa_{2p}C_{b,\sigma,T}T}\big(1+|x|\big).
$$
As concerns the Brownian term, one has
$$
\Big\|\sup_{t\in[0,T]}\big|W_t-W_{\underline t}\big|\Big\|_{2p} \le C_{W,2p}\sqrt{\frac Tn\big(1+\log n\big)}
$$
owing to (7.18) in Sect. 7.2.1. Finally, plugging these estimates into (7.60) yields
$$
\begin{aligned}
\Big\|\sup_{t\in[0,T]}\big|\bar X_t^n-\widetilde X_t^n\big|\Big\|_p &\le 2\,e^{\kappa_p C_{b,\sigma,T}T}\big(1+|x|\big)\frac Tn + 2\,e^{\kappa_{2p}C_{b,\sigma,T}T}\big(1+|x|\big)\times C_{W,2p}\sqrt{\frac Tn\big(1+\log n\big)}\\
&\le 2\big(C_{W,2p}+1\big)e^{\kappa_{2p}C_{b,\sigma,T}T}\big(1+|x|\big)\Big(\sqrt{\frac Tn\big(1+\log n\big)}+\frac Tn\Big).
\end{aligned}
$$
The result follows by noting that $\sqrt{\frac Tn\big(1+\log n\big)}+\frac Tn \le \big(1+\sqrt T\big)\sqrt{\frac Tn\big(1+\log n\big)}$ for every integer $n\ge1$ and by setting $\tilde\kappa_p := 2\max\big((1+\sqrt T)(C_{W,2p}+1),\kappa_{2p}\big)$.

Step 2 (Random $X_0$). When $X_0$ is no longer deterministic, one uses that $X_0$ and $W$ are independent so that, with obvious notations,
$$
\mathbb E\Big[\sup_{t\in[0,T]}\big|\bar X_t^{n,X_0}-\widetilde X_t^{n,X_0}\big|^p\Big] = \int_{\mathbb R^d}\mathbb P_{X_0}(dx_0)\,\mathbb E\Big[\sup_{t\in[0,T]}\big|\bar X_t^{n,x_0}-\widetilde X_t^{n,x_0}\big|^p\Big],
$$
which yields the announced result.

Step 3 (Combination of the upper-bounds). This is a straightforward consequence of Claims (a) and (b). ♦

7.8.6 Application to the a.s.-Convergence of the Euler Schemes and its Rate

One can derive from the above $L^p$-rate of convergence an a.s.-convergence result. The main result is given in the following theorem (which extends Theorem 7.3 stated in the homogeneous Lipschitz continuous case).

Theorem 7.9 If (HTβ ) holds and if X 0 is a.s. finite, the continuous Euler scheme
X̄ n = ( X̄ tn )t∈[0,T ] a.s. converges toward the diffusion X for the sup-norm over [0, T ].
Furthermore, for every α ∈ [0, β ∧ 21 ),

a.s.
n α sup |X t − X̄ tn | −→ 0.
t∈[0,T ]

The same convergence rate holds with the stepwise constant Euler scheme ( X tn )t∈[0,T ] .

Proof. We make no a priori integrability assumption on $X_0$. We rely on the localization principle at the origin. Let $N\in\mathbb N^*$; set $X_0^{(N)} := X_0\mathbf 1_{\{|X_0|\le N\}} + N\frac{X_0}{|X_0|}\mathbf 1_{\{|X_0|>N\}}$ so that $|X_0^{(N)}|\le N$. Stochastic integration being a local operator, the solutions $(X_t^{(N)})_{t\in[0,T]}$ and $(X_t)_{t\in[0,T]}$ of the SDE (7.1) are equal on $\{X_0=X_0^{(N)}\}$, namely on $\{|X_0|\le N\}$. The same property is obvious for the Euler schemes $\bar X^n$ and $\bar X^{n,(N)}$ starting from $X_0$ and $X_0^{(N)}$, respectively. For a fixed $N$, we know from Theorem 7.2(a) that, for every $p\ge1$,
$$
\exists\, C_{p,b,\sigma,\beta,T}>0\ \text{such that},\ \forall\, n\ge1,\qquad \mathbb E\Big[\sup_{t\in[0,T]}\big|\bar X_t^{n,(N)}-X_t^{(N)}\big|^p\Big] \le C_{p,b,\sigma,\beta,T}^p\big(1+\|X_0^{(N)}\|_p\big)^p\Big(\frac Tn\Big)^{p(\beta\wedge\frac12)}.
$$
In particular,
$$
\begin{aligned}
\mathbb E\Big[\mathbf 1_{\{|X_0|\le N\}}\sup_{t\in[0,T]}\big|\bar X_t^{n}-X_t\big|^p\Big] &= \mathbb E\Big[\mathbf 1_{\{|X_0|\le N\}}\sup_{t\in[0,T]}\big|\bar X_t^{n,(N)}-X_t^{(N)}\big|^p\Big]\\
&\le \mathbb E\Big[\sup_{t\in[0,T]}\big|\bar X_t^{n,(N)}-X_t^{(N)}\big|^p\Big]\\
&\le C_{p,b,\sigma,\beta,T}^p\big(1+\|X_0^{(N)}\|_p\big)^p\Big(\frac Tn\Big)^{p(\beta\wedge\frac12)}.
\end{aligned}
$$
Let $\alpha\in(0,\beta\wedge\frac12)$ and let $p>\frac1{\beta\wedge\frac12-\alpha}$. Then $\displaystyle\sum_{n\ge1}\frac1{n^{p(\beta\wedge\frac12-\alpha)}}<+\infty$. Consequently, Beppo Levi's monotone convergence Theorem for series with non-negative terms implies
$$
\mathbb E\Big[\mathbf 1_{\{|X_0|\le N\}}\sum_{n\ge1}n^{p\alpha}\sup_{t\in[0,T]}\big|\bar X_t^n-X_t\big|^p\Big] \le C_{p,b,\sigma,\beta,T}^p\,(1+N)^p\,T^{p(\beta\wedge\frac12)}\sum_{n\ge1}n^{-p(\beta\wedge\frac12-\alpha)} < +\infty.
$$
Hence
$$
\sum_{n\ge1}n^{p\alpha}\sup_{t\in[0,T]}\big|\bar X_t^n-X_t\big|^p < +\infty\quad \mathbb P\text{-a.s.}
$$
on the event $\big\{|X_0|\le N\big\}$, hence on $\bigcup_{N\ge1}\big\{|X_0|\le N\big\} = \big\{X_0\in\mathbb R^d\big\} = \Omega$ a.s. Finally, one gets:
$$
\mathbb P\text{-a.s.}\qquad \sup_{t\in[0,T]}\big|\bar X_t^n-X_t\big| = o\Big(\frac1{n^\alpha}\Big)\quad\text{as }n\to+\infty.
$$
The proof for the stepwise constant Euler scheme follows exactly the same lines since an additional $\log n$ term plays no role in the convergence of the above series. ♦

Remarks and comments. • The above rate result strongly suggests that the critical index for the a.s. rate of convergence is $\beta\wedge\frac12$. The question is then: what happens when $\alpha=\beta\wedge\frac12$? It is shown in [155, 178] that (when $\beta=1$), $\sqrt n\,(X_t-\bar X_t^n)\xrightarrow{\;\mathcal L\;}\Xi_t$, where $\Xi=(\Xi_t)_{t\in[0,T]}$ is a diffusion process driven by a Brownian motion $\widetilde W$ independent of $W$. This weak convergence holds in a functional sense, namely for the topology of uniform convergence on $\mathcal C([0,T],\mathbb R^d)$. The process $\Xi$ is not $\mathbb P$-a.s. $\equiv0$ if $\sigma\not\equiv0$, and is even a.s. non-zero if $\sigma$ never vanishes. The “weak functional” feature means first that we consider the processes as random variables taking values in their natural path space, namely the separable Banach space $\big(\mathcal C([0,T],\mathbb R^d),\|\cdot\|_{\sup}\big)$. Then, one may consider the weak convergence of probability measures defined on the Borel $\sigma$-field of this space (see [45], Chap. 2 for an introduction). In particular, $\|\cdot\|_{\sup}$ being trivially continuous,
$$
\sqrt n\sup_{t\in[0,T]}\big|X_t-\bar X_t^n\big| \xrightarrow{\;\mathcal L\;} \sup_{t\in[0,T]}|\Xi_t|,
$$
which implies that, if $\Xi$ is a.s. non-zero, $\mathbb P$-a.s., for every $\varepsilon>0$,
$$
\lim_n n^{\frac{1+\varepsilon}2}\sup_{t\in[0,T]}\big|X_t-\bar X_t^n\big| = +\infty.
$$
(This follows, e.g., from the Skorokhod representation theorem; a direct approach is also possible.)
Exercise. One considers the geometric Brownian motion $X_t = e^{-\frac t2+W_t}$, solution to
$$
dX_t = X_t\,dW_t,\qquad X_0=1.
$$
(a) Show that for every $n\ge1$ and every $k\ge0$,
$$
\bar X_{t_k^n} = \prod_{\ell=1}^k\big(1+\Delta W_{t_\ell^n}\big),\quad\text{where}\quad t_\ell^n = \frac{\ell T}n,\ \ \Delta W_{t_\ell^n} = W_{t_\ell^n}-W_{t_{\ell-1}^n},\ \ell\ge1.
$$
(b) Show that
$$
\forall\,\varepsilon>0,\qquad \lim_n n^{\frac{1+\varepsilon}2}\big|X_T-\bar X_T^n\big| = +\infty\quad \mathbb P\text{-a.s.}
$$
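Part (a) of the exercise can be checked mechanically: iterating the Euler recursion $\bar X_{t_{k+1}} = \bar X_{t_k}\big(1+\Delta W_{t_{k+1}}\big)$ reproduces the closed-form product. A minimal sketch (illustrative check, not from the book; the seed and step count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 64, 1.0
dW = rng.normal(0.0, np.sqrt(T / n), n)   # Brownian increments Delta W_{t_l}, l = 1..n

# Euler recursion for dX = X dW, X_0 = 1: Xbar_{t_{k+1}} = Xbar_{t_k} (1 + Delta W_{t_{k+1}})
xbar = np.empty(n + 1)
xbar[0] = 1.0
for k in range(n):
    xbar[k + 1] = xbar[k] * (1.0 + dW[k])

# Closed form of part (a): Xbar_{t_k} = prod_{l=1}^{k} (1 + Delta W_{t_l})
closed = np.concatenate([[1.0], np.cumprod(1.0 + dW)])
assert np.allclose(xbar, closed)
```

The identity holds path by path, for any realization of the increments.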

7.8.7 The Flow of an SDE, Lipschitz Continuous Regularity

If Assumption (7.2) holds, then for every $x\in\mathbb R^d$, there exists a unique solution, denoted by $(X_t^x)_{t\in[0,T]}$, to the SDE (7.1) defined on $[0,T]$ and starting from $x$. The mapping $(x,t)\mapsto X_t^x$ defined on $[0,T]\times\mathbb R^d$ is called the flow of the SDE (7.1). One defines likewise the flow of the Euler scheme (which always exists). We will now elucidate the regularity of these flows when Assumption $(\mathcal H_T^\beta)$ holds.

Theorem 7.10 If the coefficients $b$ and $\sigma$ of (7.1) satisfy Assumption $(\mathcal H_T^\beta)$ for a real constant $C>0$, then the unique strong solution $(X_t^x)_{t\in[0,T]}$ starting from $x\in\mathbb R^d$ on $[0,T]$ and the continuous Euler scheme $(\bar X_t^{n,x})_{t\in[0,T]}$ satisfy, for every $x,y\in\mathbb R^d$ and every $n\ge1$,
$$
\Big\|\sup_{t\in[0,T]}\big|X_t^x-X_t^y\big|\Big\|_p + \Big\|\sup_{t\in[0,T]}\big|\bar X_t^{n,x}-\bar X_t^{n,y}\big|\Big\|_p \le 2\,e^{\kappa_3(p,CT)}|x-y|,
$$
where $\kappa_3(p,u) = \big(2+C(C_{d,p}^{BDG})^2\big)u$, $u\ge0$.

Proof. We focus on the diffusion process $(X_t)_{t\in[0,T]}$. First note that if the above bound holds for some $p>0$, then it holds true for any $p'\in(0,p)$ since the $\|\cdot\|_p$-norm is non-decreasing in $p$. Starting from
$$
X_t^x - X_t^y = (x-y) + \int_0^t\big(b(s,X_s^x)-b(s,X_s^y)\big)\,ds + \int_0^t\big(\sigma(s,X_s^x)-\sigma(s,X_s^y)\big)\,dW_s
$$
one gets
$$
\sup_{s\in[0,t]}\big|X_s^x-X_s^y\big| \le |x-y| + \int_0^t\big|b(s,X_s^x)-b(s,X_s^y)\big|\,ds + \sup_{s\in[0,t]}\Big|\int_0^s\big(\sigma(u,X_u^x)-\sigma(u,X_u^y)\big)\,dW_u\Big|.
$$
Then, setting for every $p\ge2$, $f_p(t) := \big\|\sup_{s\in[0,t]}|X_s^x-X_s^y|\big\|_p$, it follows from the BDG Inequality (7.53) and the generalized Minkowski Inequality (7.50) that
$$
\begin{aligned}
f_p(t) &\le |x-y| + C\int_0^t\big\|X_s^x-X_s^y\big\|_p\,ds + C_{d,p}^{BDG}\Big\|\Big(\int_0^t\big(\sigma(s,X_s^x)-\sigma(s,X_s^y)\big)^2\,ds\Big)^{\frac12}\Big\|_p\\
&\le |x-y| + C\int_0^t\big\|X_s^x-X_s^y\big\|_p\,ds + C_{d,p}^{BDG}\,C\,\Big\|\Big(\int_0^t\big|X_s^x-X_s^y\big|^2\,ds\Big)^{\frac12}\Big\|_p\\
&\le |x-y| + C\int_0^t\big\|X_s^x-X_s^y\big\|_p\,ds + C_{d,p}^{BDG}\,C\,\Big(\int_0^t\big\|X_s^x-X_s^y\big\|_p^2\,ds\Big)^{\frac12}.
\end{aligned}
$$
Consequently, the function $f_p$ satisfies
$$
f_p(t) \le |x-y| + C\int_0^t f_p(s)\,ds + C\,C_{d,p}^{BDG}\Big(\int_0^t f_p^2(s)\,ds\Big)^{\frac12}.
$$
One concludes by the “à la Gronwall” Lemma 7.3 that
$$
\forall\, t\in[0,T],\qquad f_p(t) \le 2\,e^{C(2+C(C_{d,p}^{BDG})^2)t}|x-y|.
$$
The proof for the Euler scheme follows the same lines once we observe that $\underline s\in[0,s]$. ♦

7.8.8 The Strong Error Rate for the Milstein Scheme: Proof of Theorem 7.5

In this section, we return to the scalar case $d=q=1$ and we prove Theorem 7.5. Throughout this section, $C_{b,\sigma,p,T}$ and $K_{b,\sigma,p,T}$ are positive real constants that may vary from line to line.
First we note that the genuine (or continuous) Milstein scheme $(\bar X_t^{mil,n})_{t\in[0,T]}$, as defined by (7.38), can be written in integral form as follows:
$$
\bar X_t^{mil,n} = X_0 + \int_0^t b\big(\bar X_{\underline s}^{mil,n}\big)\,ds + \int_0^t \sigma\big(\bar X_{\underline s}^{mil,n}\big)\,dW_s + \int_0^t\int_{\underline s}^s(\sigma\sigma')\big(\bar X_{\underline u}^{mil,n}\big)\,dW_u\,dW_s \qquad (7.61)
$$
with our usual notation $\underline t = t_k^n = \frac{kT}n$ if $t\in[t_k^n,t_{k+1}^n)$ (so that $\underline u=\underline s$ if $u\in[\underline s,s]$). For notational convenience, we will also drop throughout this section the superscript $n$ when no ambiguity arises.
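In simulation, the scheme is evaluated at the grid points only, where the double stochastic integral reduces to $\frac12(\sigma\sigma')(\bar X_{t_k})\big(\Delta W^2-\Delta t\big)$. A minimal sketch of a scalar Milstein path (illustrative helper, not the book's code; the geometric Brownian test case in the usage line is an assumption):

```python
import numpy as np

def milstein_path(b, sigma, dsigma, x0, T, n, rng):
    """One path of the Milstein scheme on the grid t_k = kT/n:
    X_{k+1} = X_k + b(X_k) dt + sigma(X_k) dW + (1/2)(sigma*sigma')(X_k) (dW^2 - dt)."""
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))
        x[k + 1] = (x[k] + b(x[k]) * dt + sigma(x[k]) * dW
                    + 0.5 * sigma(x[k]) * dsigma(x[k]) * (dW * dW - dt))
    return x

rng = np.random.default_rng(0)
# dX = X dW (b = 0, sigma(x) = x, sigma'(x) = 1), X_0 = 1
path = milstein_path(lambda x: 0.0, lambda x: x, lambda x: 1.0,
                     x0=1.0, T=1.0, n=200, rng=rng)
```

Compared with the Euler scheme, the only extra ingredient is the derivative $\sigma'$, which is what buys the first-order strong rate proved below.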
(a) Step 1 (Moment control). Our first aim is to prove that the Milstein scheme has uniformly controlled moments at any order, namely that, for every $p\in(0,+\infty)$, there exists a real constant $C_{p,b,\sigma,T}>0$ such that
$$
\sup_{n\ge1}\Big\|\sup_{t\in[0,T]}\big|\bar X_t^{mil,n}\big|\Big\|_p \le C_{b,\sigma,T}\big(1+\|X_0\|_p\big). \qquad (7.62)
$$
We may assume without loss of generality that $X_0\in L^p$ throughout this step. Set
$$
\bar H_s = \sigma\big(\bar X_{\underline s}^{mil}\big) + \int_{\underline s}^s(\sigma\sigma')\big(\bar X_{\underline u}^{mil}\big)\,dW_u = \sigma\big(\bar X_{\underline s}^{mil}\big) + (\sigma\sigma')\big(\bar X_{\underline s}^{mil}\big)\big(W_s-W_{\underline s}\big)
$$
so that
$$
\bar X_t^{mil} = X_0 + \int_0^t b\big(\bar X_{\underline s}^{mil}\big)\,ds + \int_0^t\bar H_s\,dW_s.
$$
It follows from the boundedness of $b'$ and $\sigma'$ that $b$ and $\sigma$ satisfy a linear growth assumption.

We will follow the lines of the proof of Proposition 7.6, the specificity of the Milstein framework being that the diffusion coefficient is replaced by the process $\bar H_s$. So, our task is to control the term
$$
\sup_{s\in[0,t]}\Big|\int_0^{s\wedge\bar\tau_N}\bar H_u\,dW_u\Big|
$$
in $L^p$, where $\bar\tau_N = \bar\tau_N^n := \inf\big\{t\in[0,T] : |\bar X_t^{mil,n}-X_0|>N\big\}$, $n,N\ge1$.

First assume that $p\in[2,\infty)$. Since $\int_0^{t\wedge\bar\tau_N}\bar H_s\,dW_s$ is a continuous local martingale, it follows from the BDG Inequality (7.51) that
$$
\Big\|\sup_{s\in[0,t]}\Big|\int_0^{s\wedge\bar\tau_N}\bar H_u\,dW_u\Big|\Big\|_p \le C_p^{BDG}\Big\|\Big(\int_0^{t\wedge\bar\tau_N}\bar H_s^2\,ds\Big)^{\frac12}\Big\|_p.
$$
Consequently, using the generalized Minkowski Inequality (7.50),
$$
\Big\|\sup_{s\in[0,t]}\Big|\int_0^{s\wedge\bar\tau_N}\bar H_u\,dW_u\Big|\Big\|_p \le C_p^{BDG}\Big(\int_0^t\big\|\mathbf 1_{\{s\le\bar\tau_N\}}\bar H_s\big\|_p^2\,ds\Big)^{\frac12} = C_p^{BDG}\Big(\int_0^t\big\|\mathbf 1_{\{s\le\bar\tau_N\}}\bar H_{s\wedge\bar\tau_N}\big\|_p^2\,ds\Big)^{\frac12}.
$$
Now, for every $s\in[0,t]$,

     
1{s≤τ̄ } H̄s∧τ̄  ≤ σ( X̄ mil ) + 1{s≤τ̄ } (σσ )( X̄ mil )(Ws∧τ̄ N − Ws∧τ̄ )
N N p s∧τ̄ N p N s∧τ̄ N N p
 
 
≤ σ( X̄ s∧τ̄ N ) p + (σσ )( X̄ s∧τ̄ N )(Ws − Ws ) p
mil mil
     
= σ( X̄ s∧
mil   mil   
τ̄ N ) p + (σσ )( X̄ s∧τ̄ N ) p Ws − Ws p ,

where we used that (σσ )( X̄ s∧τ̄ N ) is Fs -measurable, hence independent of Ws − Ws .


mil

Using that σ and σσ have at most linear growth since σ is bounded, we derive
     
1{s≤τ̄ } H̄s∧τ̄  ≤ Cb,σ,T 1 +  X̄ mil  .
N N p s∧τ̄ N p

Finally, following the lines of the first step of the proof of Proposition 7.6 leads to
⎛   21 ⎞
  s∧τ̄ N  √ t 2
   
 sup  B DG ⎝
H̄u dWu  ≤ Cb,σ,T C p t+ sup | X̄ mil | ds
u p
⎠.
s∈[0,t] 0 p 0 u∈[0,s∧τ̄ N ]

Still following the lines of the proof Proposition 7.6, we include the step 4 to deal
with the case p ∈ (0, 2).
Moreover, we get as a by-product that, for every p > 0 and every n ≥ 1,
   
 
 sup | H̄t | ≤ K b,σ,T, p 1 + X 0  p < +∞, (7.63)
t∈[0,T ] p

where K b,σ,T, p does not depend on the discretization step n. As a matter of fact, this
follows from
 
sup | H̄t | ≤ Cb,σ 1 + sup | X̄ tmil,n | 1 + 2 sup |Wt |
t∈[0,T ] t∈[0,T ] t∈[0,T ]

so that, by the Schwarz Inequality when p ≥ 1/2,


   
 
 sup | H̄t | ≤ Cb,σ 1 + sup  sup | X̄ tmil,n |2 p 1 + 2 sup |Wt |2 p ,
t∈[0,T ] p n≥1 t∈[0,T ] t∈[0,T ]

where we used that  · 2 p is a norm in the right-hand side of the inequality. A similar
1  
bound holds when p ∈ (0, 1/2) since 1 + V 2 p ≤ 2 2 p 1 + V 2 p for any random
variable V .
Now, by Lemma 7.4 devoted to the L p -regularity of Itô processes, one derives the
existence of a real constant κb,σ, p,T ∈ (0, +∞) (not depending on n ≥ 1) such that
 1
  T 2
∀ t ∈ [0, T ], ∀ n ≥ 1,  X̄ tmil,n − X̄ tmil,n  p ≤ κb,σ, p,T 1 + X 0  p . (7.64)
n

Step 2 (Decomposition and analysis of the error when $p\in[2,+\infty)$, $X_0=x\in\mathbb R$). Set $\varepsilon_t := X_t-\bar X_t^{mil}$, $t\in[0,T]$, and
$$
f_p(t) := \Big\|\sup_{s\in[0,t]}|\varepsilon_s|\Big\|_p,\qquad t\in[0,T].
$$
Using the diffusion equation and the continuous Milstein scheme one gets
$$
\begin{aligned}
\varepsilon_t &= \int_0^t\big(b(X_s)-b(\bar X_{\underline s}^{mil})\big)\,ds + \int_0^t\big(\sigma(X_s)-\sigma(\bar X_{\underline s}^{mil})\big)\,dW_s - \int_0^t\int_{\underline s}^s(\sigma\sigma')\big(\bar X_{\underline u}^{mil}\big)\,dW_u\,dW_s\\
&= \int_0^t\big(b(X_s)-b(\bar X_s^{mil})\big)\,ds + \int_0^t\big(\sigma(X_s)-\sigma(\bar X_s^{mil})\big)\,dW_s\\
&\quad + \int_0^t\big(b(\bar X_s^{mil})-b(\bar X_{\underline s}^{mil})\big)\,ds + \int_0^t\Big(\sigma(\bar X_s^{mil})-\sigma(\bar X_{\underline s}^{mil})-(\sigma\sigma')\big(\bar X_{\underline s}^{mil}\big)\big(W_s-W_{\underline s}\big)\Big)\,dW_s.
\end{aligned}
$$
First, one derives that
$$
\begin{aligned}
\sup_{s\in[0,t]}|\varepsilon_s| &\le \|b'\|_{\sup}\int_0^t\sup_{u\in[0,s]}|\varepsilon_u|\,ds + \sup_{s\in[0,t]}\Big|\int_0^s\big(\sigma(X_u)-\sigma(\bar X_u^{mil})\big)\,dW_u\Big|\\
&\quad + \sup_{s\in[0,t]}\Big|\int_0^s\big(b(\bar X_u^{mil})-b(\bar X_{\underline u}^{mil})\big)\,du\Big|\\
&\quad + \sup_{s\in[0,t]}\Big|\int_0^s\Big(\sigma(\bar X_u^{mil})-\sigma(\bar X_{\underline u}^{mil})-(\sigma\sigma')\big(\bar X_{\underline u}^{mil}\big)\big(W_u-W_{\underline u}\big)\Big)\,dW_u\Big|
\end{aligned}
$$
so that, using twice the generalized Minkowski Inequality (7.50) and the BDG Inequality (7.51), one gets classically
$$
\begin{aligned}
f_p(t) &\le \|b'\|_{\sup}\int_0^t f_p(s)\,ds + C_p^{BDG}\|\sigma'\|_{\sup}\Big(\int_0^t f_p(s)^2\,ds\Big)^{\frac12}\\
&\quad + \underbrace{\Big\|\sup_{s\in[0,t]}\Big|\int_0^s\big(b(\bar X_u^{mil})-b(\bar X_{\underline u}^{mil})\big)\,du\Big|\Big\|_p}_{(B)}\\
&\quad + \underbrace{\Big\|\sup_{s\in[0,t]}\Big|\int_0^s\Big(\sigma(\bar X_u^{mil})-\sigma(\bar X_{\underline u}^{mil})-(\sigma\sigma')\big(\bar X_{\underline u}^{mil}\big)\big(W_u-W_{\underline u}\big)\Big)\,dW_u\Big|\Big\|_p}_{(C)}.
\end{aligned}
$$

Now using that $b'$ is $\alpha_b$-Hölder yields, for every $u\in[0,T]$,
$$
\begin{aligned}
b(\bar X_u^{mil})-b(\bar X_{\underline u}^{mil}) &= b'\big(\bar X_{\underline u}^{mil}\big)\big(\bar X_u^{mil}-\bar X_{\underline u}^{mil}\big) + \rho_b(u)\big|\bar X_u^{mil}-\bar X_{\underline u}^{mil}\big|^{1+\alpha_b}\\
&= (bb')\big(\bar X_{\underline u}^{mil}\big)(u-\underline u) + b'\big(\bar X_{\underline u}^{mil}\big)\int_{\underline u}^u\bar H_v\,dW_v + \rho_b(u)\big|\bar X_u^{mil}-\bar X_{\underline u}^{mil}\big|^{1+\alpha_b},
\end{aligned}
$$
where $\rho_b(u)$ is defined by the above equation on the event $\{\bar X_u^{mil}\neq\bar X_{\underline u}^{mil}\}$ and is equal to $0$ otherwise. This defines an $(\mathcal F_u)$-adapted process, bounded by the Hölder coefficient $[b']_{\alpha_b}$ of $b'$. Using that for every $\xi\in\mathbb R$, $|(bb')(\xi)|\le\|b'\|_{\sup}\big(\|b'\|_{\sup}+|b(0)|\big)\big(1+|\xi|\big)$ and (7.64) yields
$$
\begin{aligned}
(B) &\le \|b'\|_{\sup}\big(\|b'\|_{\sup}+|b(0)|\big)\Big(1+\Big\|\sup_{t\in[0,T]}\big|\bar X_t^{mil}\big|\Big\|_p\Big)\frac Tn + [b']_{\alpha_b}K_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_b}2}\\
&\quad + \Big\|\sup_{s\in[0,t]}\Big|\int_0^s b'\big(\bar X_{\underline u}^{mil}\big)\int_{\underline u}^u\bar H_v\,dW_v\,du\Big|\Big\|_p.
\end{aligned}
$$
The last term on the right-hand side of the above inequality needs a specific treatment: a naive approach would yield a $\sqrt{\frac Tn}$ term that would make the whole proof crash down. So we will transform the regular Lebesgue integral into a stochastic integral (hence a local martingale). This can be done either by a stochastic Fubini theorem or, in a more elementary way, by an integration by parts.

Lemma 7.5 Let $G:\Omega\times[0,T]\to\mathbb R$ be an $(\mathcal F_t)_{t\in[0,T]}$-progressively measurable process such that $\int_0^T G_s^2\,ds<+\infty$ a.s. Set $\bar s := \frac{kT}n$ if $s\in\big[\frac{(k-1)T}n,\frac{kT}n\big)$. Then for every $t\in[0,T]$,
$$
\int_0^t\int_{\underline s}^s G_u\,dW_u\,ds = \int_0^t\big(\bar s\wedge t - s\big)\,G_s\,dW_s.
$$
Proof. For every $k=1,\ldots,n$, an integration by parts yields
$$
\int_{\frac{(k-1)T}n}^{\frac{kT}n}\Big(\int_{\underline s}^s G_u\,dW_u\Big)\,ds = \int_{\frac{(k-1)T}n}^{\frac{kT}n}\Big(\int_{\frac{(k-1)T}n}^s G_u\,dW_u\Big)\,ds = \int_{\frac{(k-1)T}n}^{\frac{kT}n}\Big(\frac{kT}n-s\Big)G_s\,dW_s.
$$
Likewise, if $t\in\big[\frac{(\ell-1)T}n,\frac{\ell T}n\big)$, then $\displaystyle\int_{\frac{(\ell-1)T}n}^t\Big(\int_{\underline s}^s G_u\,dW_u\Big)\,ds = \int_{\frac{(\ell-1)T}n}^t(t-s)\,G_s\,dW_s$, which completes the proof by summing the terms for $k=1,\ldots,\ell-1$ together with the last one. ♦

We apply this lemma to the continuous adapted process $G_t = b'\big(\bar X_{\underline t}^{mil}\big)\bar H_t$. We derive by standard arguments that
$$
\begin{aligned}
\Big\|\sup_{s\in[0,t]}\Big|\int_0^s b'\big(\bar X_{\underline u}^{mil}\big)\int_{\underline u}^u\bar H_v\,dW_v\,du\Big|\Big\|_p &\le C_p^{BDG}\Big(\int_0^T\big\|(\bar t\wedge T-t)\,b'\big(\bar X_{\underline t}^{mil}\big)\bar H_t\big\|_p^2\,dt\Big)^{\frac12}\\
&\le C_p^{BDG}\|b'\|_{\sup}\frac Tn\Big(\int_0^T\|\bar H_t\|_p^2\,dt\Big)^{\frac12}\\
&\le C_{b,\sigma,p,T}\big(1+\|X_0\|_p\big)\frac Tn,
\end{aligned}
$$
where we used first that $0\le\bar t\wedge T-t\le\frac Tn$ and then (7.63). Finally, one gets that
$$
(B) \le C_{b,\sigma,p,T}\big(1+|x|\big)\Big(\frac Tn\Big)^{\frac{1+\alpha_b}2}.
$$

We adopt a similar approach for the term (C). Elementary computations show
that

σ( X̄ umil ) − σ( X̄ umil ) − (σσ )( X̄ umil )(Wu − Wu )


T 1 
= σ b( X̄ umil ) + σ(σ )2 ( X̄ umil ) (Wu − Wu )2 − (u − u)
n 2
 1+α
+ρσ (u) X̄ mil − X̄ mil  σ
u u

where ρσ (u) is an (Fu )-adapted process bounded by the Hölder coefficient [σ ]ασ of
σ . Consequently, for every p ≥ 1,
 
 
σ( X̄ umil ) − σ( X̄ umil ) − (σσ )( X̄ umil )(Wu − Wu )
p

  1   
≤ σ b( X̄ umil ) p (u − u) + σ(σ )2 ( X̄ umil ) p (Wu − Wu )2 − (u − u) p
2
 1+α 1+ασ
+ [σ ]ασ  X̄ umil − X̄ umil  σ  p(1+α
σ )
  1+ασ
≤ Cb,σ, p,T 1 + |x| (u − u) + Z − 1 p (u − u) + [σ ]ασ (u − u) 2
2

  1+α2 σ
  T
≤ Cb,σ, p,T 1 + |x| .
n
346 7 Discretization Scheme(s) of a Brownian Diffusion

Now, owing to the BDG inequality, we derive, for every $p\ge2$, that
\[
(C)\le C^{BDG}_p\Big\|\Big(\int_0^t\big|\sigma(\bar X^{mil}_u)-\sigma(\bar X^{mil}_{\underline u})-(\sigma\sigma')(\bar X^{mil}_{\underline u})(W_u-W_{\underline u})\big|^2du\Big)^{\frac12}\Big\|_p
\le C^{BDG}_p\,C_{b,\sigma,p,T}\,(1+|x|)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma}{2}}.
\]

Finally, combining the upper-bounds for (B) and (C) leads to
\[
f_p(t)\le\|b'\|_{\sup}\int_0^t f_p(s)\,ds+C^{BDG}_p\,\|\sigma'\|_{\sup}\Big(\int_0^t f_p(s)^2\,ds\Big)^{\frac12}+C_{b,\sigma,p,T}(1+|x|)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma\wedge\alpha_b}{2}}
\]
so that, owing to the "à la Gronwall" Lemma 7.3, there exists a real constant $C_{b,\sigma,p,T}$ such that
\[
f_p(T)\le C_{b,\sigma,p,T}(1+|x|)\Big(\frac Tn\Big)^{\frac{1+\alpha_\sigma\wedge\alpha_b}{2}}.
\]

Step 2 (Extension to $p\in(0,2)$ and random starting values $X_0$). First one uses that $p\mapsto\|\cdot\|_p$ is non-decreasing to extend the above bound to $p\in(0,2)$. Then, one uses that, if $X_0$ and $W$ are independent, for any non-negative functional $\Phi:C([0,T],\mathbb R^d)^2\to\mathbb R_+$, one has with obvious notations
\[
\mathbb E\,\Phi\big(X,\bar X^{mil}\big)=\int_{\mathbb R^d}\mathbb P_{X_0}(dx_0)\,\mathbb E\,\Phi\big(X^{x_0},\bar X^{mil,x_0}\big).
\]
Applying this identity with $\Phi(x,\bar x)=\sup_{t\in[0,T]}|x(t)-\bar x(t)|^p$, $x,\bar x\in C([0,T],\mathbb R)$, completes the proof.
(b) This second claim follows from the error bound established for the Brownian
motion itself: as concerns the Brownian motion, both stepwise constant and continu-
ous versions of the Milstein and the Euler scheme coincide. So a better convergence
rate is hopeless. ♦
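The difference between the strong rates of the Euler and Milstein schemes can be observed on a toy example. The sketch below (the geometric Brownian motion test case and all parameters are illustrative choices, not taken from the text) estimates the strong $L^2$-errors at time $T$ of both schemes, using the closed-form solution of $dX=\mu X\,dt+\sigma X\,dW$ as the reference.

```python
import numpy as np

def strong_errors(mu=0.1, sigma=0.5, x0=1.0, T=1.0, n=64, n_mc=20000, seed=0):
    """Strong L^2 errors at time T of the Euler and Milstein schemes for
    geometric Brownian motion dX = mu*X dt + sigma*X dW, whose exact solution
    X_T = x0*exp((mu - sigma^2/2)*T + sigma*W_T) serves as reference."""
    rng = np.random.default_rng(seed)
    h = T / n
    dW = rng.normal(0.0, np.sqrt(h), size=(n_mc, n))
    xe = np.full(n_mc, x0)  # Euler scheme
    xm = np.full(n_mc, x0)  # Milstein scheme
    for k in range(n):
        dw = dW[:, k]
        xe = xe + mu * xe * h + sigma * xe * dw
        # Milstein correction: (1/2)*sigma*sigma'(x)*(dW^2 - h), here sigma(x) = sigma*x
        xm = xm + mu * xm * h + sigma * xm * dw + 0.5 * sigma**2 * xm * (dw**2 - h)
    x_exact = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * dW.sum(axis=1))
    return (np.sqrt(np.mean((xe - x_exact) ** 2)),
            np.sqrt(np.mean((xm - x_exact) ** 2)))
```

Doubling $n$ divides the Euler error by roughly $\sqrt2$ and the Milstein error by roughly $2$, in line with the $n^{-1/2}$ and $n^{-1}$ strong rates.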

7.8.9 The Feynman–Kac Formula and Application to the Weak Error Expansion by the PDE Method

In this section we return to the purely scalar case d = q = 1, mainly for notational
convenience, namely we consider the scalar version of (7.1), i.e.

d X t = b(t, X t )dt + σ(t, X t )dWt ,



where W is a standard (scalar) Brownian motion defined on a probability space $(\Omega,\mathcal A,\mathbb P)$ and $X_0$, defined on the same space, is independent of W. We make the following regularity assumption on b and σ:

 
\[
(\mathcal C_\infty)\ \equiv\ b,\sigma\in C^\infty([0,T]\times\mathbb R)\ \text{and}\ \forall\,k_1,k_2\in\mathbb N,\ k_1+k_2\ge1,\ \ \sup_{(t,x)\in[0,T]\times\mathbb R}\Big(\Big|\frac{\partial^{k_1+k_2}b}{\partial t^{k_1}\partial x^{k_2}}(t,x)\Big|+\Big|\frac{\partial^{k_1+k_2}\sigma}{\partial t^{k_1}\partial x^{k_2}}(t,x)\Big|\Big)<+\infty.
\]

In particular, b and σ are Lipschitz continuous in $(t,x)\in[0,T]\times\mathbb R$. Thus, for every $t,t'\in[0,T]$ and every $x,x'\in\mathbb R$,
\[
|b(t',x')-b(t,x)|\le\sup_{(s,\xi)\in[0,T]\times\mathbb R}\Big|\frac{\partial b}{\partial x}(s,\xi)\Big|\,|x'-x|+\sup_{(s,\xi)\in[0,T]\times\mathbb R}\Big|\frac{\partial b}{\partial t}(s,\xi)\Big|\,|t'-t|.
\]

Consequently, the SDE (7.1) always has a unique strong solution (X t )t∈[0,T ] starting
from any Rd -valued random vector X 0 , independent of the Brownian motion W on
(, A, P). Furthermore, as

|b(t, x)| + |σ(t, x)| ≤ sup (|b(t, 0)| + |σ(t, 0)|) + C|x| ≤ C (1 + |x|),
t∈[0,T ]

any such strong solution (X t )t∈[0,T ] satisfies (see Proposition 7.6):


 
\[
\forall\,p\ge1,\quad X_0\in L^p(\mathbb P)\Longrightarrow\ \mathbb E\sup_{t\in[0,T]}|X_t|^p+\sup_n\mathbb E\sup_{t\in[0,T]}|\bar X^n_t|^p<+\infty.\tag{7.65}
\]

We recall that the infinitesimal generator L of the diffusion reads, on every function $g\in C^{1,2}([0,T]\times\mathbb R)$,
\[
Lg(t,x)=b(t,x)\frac{\partial g}{\partial x}(t,x)+\frac12\,\sigma^2(t,x)\frac{\partial^2g}{\partial x^2}(t,x).
\]
As for the source function f appearing in the Feynman–Kac formula, which will
also be the function of interest for the weak error expansion, we make the following
regularity and growth assumption:

\[
(\mathcal F_\infty)\ \equiv\ f\in C^\infty(\mathbb R,\mathbb R)\ \text{and}\ \forall\,k\in\mathbb N,\ \exists\,r_k\in\mathbb N,\ \exists\,C_k\in(0,+\infty),\ \ |f^{(k)}(x)|\le C_k\big(1+|x|^{r_k}\big),
\]
where $f^{(k)}$ denotes the k-th derivative of f.



The first result of the section is the fundamental link between Stochastic Differ-
ential Equations and Partial Differential Equations, namely the representation of the
solution of the above parabolic PDE as the expectation of a marginal function of
the SDE at its terminal time T whose infinitesimal generator is the second-order
differential operator of the PDE. This representation is known as the Feynman–Kac
formula.
 
Theorem 7.11 (Feynman–Kac formula) Assume C∞ and (F∞ ) hold.
(a) The parabolic PDE

∂u
+ Lu = 0, u(T, . ) = f (7.66)
∂t

has a unique solution $u\in C^\infty([0,T]\times\mathbb R,\mathbb R)$. This solution satisfies
\[
\forall\,k\ge0,\ \exists\,r(k,T)\in\mathbb N\ \text{such that}\ \sup_{t\in[0,T]}\Big|\frac{\partial^ku}{\partial x^k}(t,x)\Big|\le C_{k,T}\big(1+|x|^{r(k,T)}\big).\tag{7.67}
\]
(b) Feynman–Kac formula: If X 0 ∈ L r0 (P), the solution u admits the following rep-
resentation
 
∀ t ∈ [0, T ], u(t, x) = E f (X T ) | X t = x = E f (X Tt,x ), (7.68)

where (X st,x )s∈[t,T ] denotes the unique solution of the SDE (7.1) starting at x at time
t. If, furthermore, the SDE is autonomous—namely $b(t,x)=b(x)$ and $\sigma(t,x)=\sigma(x)$—then
\[
\forall\,t\in[0,T],\quad u(t,x)=\mathbb E\,f\big(X^x_{T-t}\big).
\]

Notation: To alleviate notation, we will use throughout this section the notations $\partial_xf$, $\partial_tf$, $\partial_{xt}f$, etc., for the partial derivatives instead of $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial t}$, $\frac{\partial^2f}{\partial x\partial t}$, …
 Exercise. Combining the above bound for the spatial partial derivatives with
$\partial_tu=-Lu$, show that
\[
\forall\,k=(k_1,k_2)\in\mathbb N^2,\ \exists\,r(k,T)\in\mathbb N\ \text{such that}\ \sup_{t\in[0,T]}\Big|\frac{\partial^{k_1+k_2}u}{\partial t^{k_1}\partial x^{k_2}}(t,x)\Big|\le C_{k,T}\big(1+|x|^{r(k,T)}\big).
\]

Proof. (a) For this result, we refer to [2].


(b) Let u be the solution to the parabolic PDE (7.66). For every $t\in[0,T]$,
\[
u(T,X_T)=u(t,X_t)+\int_t^T\partial_tu(s,X_s)\,ds+\int_t^T\partial_xu(s,X_s)\,dX_s+\frac12\int_t^T\partial_{xx}u(s,X_s)\,d\langle X\rangle_s
\]
\[
=u(t,X_t)+\int_t^T\big(\partial_tu+Lu\big)(s,X_s)\,ds+\int_t^T\partial_xu(s,X_s)\,\sigma(s,X_s)\,dW_s
\]
\[
=u(t,X_t)+\int_t^T\partial_xu(s,X_s)\,\sigma(s,X_s)\,dW_s\tag{7.69}
\]
since u satisfies the PDE (7.66). Now the local martingale $M_t:=\int_0^t\partial_xu(s,X_s)\,\sigma(s,X_s)\,dW_s$ is a true martingale since
\[
\langle M\rangle_T=\int_0^T\big(\partial_xu(s,X_s)\big)^2\sigma^2(s,X_s)\,ds\le C\Big(1+\sup_{s\in[0,T]}|X_s|^\theta\Big)\in L^1(\mathbb P)
\]

for an exponent θ ≥ 0. The above inequality follows from the assumptions on σ


and the resulting growth properties of ∂x u. The integrability is a consequence of
Eq. (7.54). Consequently (Mt )t∈[0,T ] is a true martingale and, using the assumption
u(T, . ) = f , we deduce that
 
∀ t ∈ [0, T ], E f (X T ) | Ft = u(t, X t ).

The chain rule for conditional expectation implies,


 
∀ t ∈ [0, T ], E f (X T ) | X t = u(t, X t ) P-a.s.

since X t is Ft -measurable. This shows that u(t, . ) is a regular version of the condi-
tional expectation on the left-hand side.
If we rewrite (7.69) with the solution $(X^{t,x}_s)_{s\in[t,T]}$, the same reasoning shows that
\[
u(T,X^{t,x}_T)=u(t,x)+\int_t^T\partial_xu(s,X^{t,x}_s)\,\sigma(s,X^{t,x}_s)\,dW_s
\]
and taking expectation yields $u(t,x)=\mathbb E\,u(T,X^{t,x}_T)=\mathbb E\,f(X^{t,x}_T)$.


When b and σ do not depend on t, one checks that $(X^{t,x}_{t+s})_{s\in[0,T-t]}$ is a solution to (7.1) starting at x at time 0, where W is replaced by the standard Brownian motion $W^{(t)}_s=W_{t+s}-W_t$, $s\in[0,T-t]$, whereas $(X^x_s)_{s\in[0,T-t]}$ is a solution to (7.1) starting at x at time 0, driven by the original Brownian motion W. As b and σ are Lipschitz continuous, (7.1) satisfies a strong (i.e. pathwise) existence-uniqueness property. Following e.g. [251] (see Theorem (1.7), Chap. IX, p. 368), this implies weak uniqueness, i.e. that $(X^{t,x}_{t+s})_{s\in[0,T-t]}$ and $(X^x_s)_{s\in[0,T-t]}$ have the same distribution. Hence $\mathbb E\,f(X^{t,x}_T)=\mathbb E\,f(X^x_{T-t})$, which completes the proof. ♦

Remarks. • The proof of the Feynman–Kac formula itself (Claim (b)) only needs u
to be C 1,2 and b and σ to be continuous on [0, T ] × R and Lipschitz in x uniformly
in t ∈ [0, T ].
• In the time homogeneous case $b(t,x)=b(x)$ and $\sigma(t,x)=\sigma(x)$, one can proceed by verification. Under smoothness assumptions on b and σ, say $C^2$ with bounded existing derivatives and Hölder second-order partial derivatives, one shows, using the tangent process of the diffusion, that the function $u(t,x)$ defined by $u(t,x)=\mathbb E\,f(X^x_{T-t})$ is $C^{1,2}$ in $(t,x)$. Then, the above claim (b) shows the existence of a solution to the parabolic PDE (7.66).

Exercise (0-order term). If u is a $C^{1,2}$-solution of the PDE $\partial_tu+Lu+ru=0$, $u(T,\,.\,)=f$, where $r:[0,T]\times\mathbb R\to\mathbb R$ is a bounded continuous function, show that, for every $t\in[0,T]$,
\[
u(t,X_t)=\mathbb E\Big(e^{\int_t^Tr(s,X_s)\,ds}f(X_T)\,\Big|\,\mathcal F_t\Big)\quad\mathbb P\text{-a.s.}
\]
or, equivalently, that $u(t,x)$ is a regular version of the conditional expectation
\[
\mathbb E\Big(e^{\int_t^Tr(s,X_s)\,ds}f(X_T)\,\Big|\,X_t=x\Big)=\mathbb E\Big(e^{\int_t^Tr(s,X^{t,x}_s)\,ds}f(X^{t,x}_T)\Big).
\]
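The Feynman–Kac representation lends itself to a quick Monte Carlo sanity check on a case where the PDE solution is explicit. The sketch below (all parameters are illustrative choices, not from the text) takes the driftless SDE $dX=\sigma X\,dW$ and $f(x)=x^2$, for which $u(t,x)=x^2e^{\sigma^2(T-t)}$, and compares a Monte Carlo estimate of $\mathbb E\,f(\bar X^n_T)$ with $u(0,x_0)$.

```python
import numpy as np

def feynman_kac_check(sigma=0.3, x0=1.0, T=1.0, n=50, n_mc=200000, seed=1):
    """Monte Carlo check of u(0,x0) = E f(X_T^{x0}) for the driftless SDE
    dX = sigma*X dW with f(x) = x^2; the PDE solution is explicit:
    u(t,x) = x^2 * exp(sigma^2*(T-t))."""
    rng = np.random.default_rng(seed)
    h = T / n
    x = np.full(n_mc, x0)
    for _ in range(n):  # Euler scheme of the driftless SDE
        x = x + sigma * x * rng.normal(0.0, np.sqrt(h), size=n_mc)
    return np.mean(x ** 2), x0 ** 2 * np.exp(sigma ** 2 * T)
```

The two returned values agree up to the (small) time-discretization bias and Monte Carlo error.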

We may now pass to the second result of this section, the Talay–Tubaro weak error
expansion theorem, stated here in the non-homogeneous case (but in one dimension).

Theorem 7.12 (Smooth case, Talay–Tubaro [270]) Assume that b and σ satisfy $(\mathcal C_\infty)$, that f satisfies $(\mathcal F_\infty)$ and that $X_0\in L^{r_0}(\mathbb P)$. Then, the weak error can be expanded at any order, namely
\[
\forall\,R\in\mathbb N^*,\quad\mathbb E\,f(\bar X^n_T)-\mathbb E\,f(X_T)=\sum_{k=1}^R\frac{c_k}{n^k}+O\Big(\frac1{n^{R+1}}\Big).
\]

Remarks. • The result at a given order R also holds under weaker smoothness
assumptions on b, σ and f (say CbR+5 , see [134]).
• Standard arguments show that the coefficients ck in the above expansion are the
first R terms of a sequence (ck )k≥1 .

Proof (R = 1) Following the original approach developed in [270], we rely on the


PDE method.

Step 1 (Representing and estimating $\mathbb E\,f(X_T)-\mathbb E\,f(\bar X^n_T)$). It follows from the Feynman–Kac formula (7.68) and the terminal condition $u(T,.)=f$ that
\[
\mathbb E\,f(X_T)=\int_{\mathbb R^d}\mathbb E\,f(X^x_T)\,\mathbb P_{X_0}(dx)=\int_{\mathbb R^d}u(0,x)\,\mathbb P_{X_0}(dx)=\mathbb E\,u(0,X_0)
\]
and $\mathbb E\,f(\bar X^n_T)=\mathbb E\,u(T,\bar X^n_T)$. It follows that
\[
\mathbb E\big(f(\bar X^n_T)-f(X_T)\big)=\mathbb E\big(u(T,\bar X^n_T)-u(0,\bar X^n_0)\big)=\sum_{k=1}^n\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big).
\]
In order to evaluate the increment $u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})$, we apply Itô's formula (see Sect. 12.8) between $t_{k-1}$ and $t_k$ to the function u and use that the Euler scheme satisfies the pseudo-SDE with "frozen" coefficients
\[
d\bar X^n_t=b(\underline t,\bar X^n_{\underline t})\,dt+\sigma(\underline t,\bar X^n_{\underline t})\,dW_t.
\]

Doing so, we obtain
\[
u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})=\int_{t_{k-1}}^{t_k}\partial_tu(s,\bar X^n_s)\,ds+\int_{t_{k-1}}^{t_k}\partial_xu(s,\bar X^n_s)\,d\bar X^n_s+\frac12\int_{t_{k-1}}^{t_k}\partial_{xx}u(s,\bar X^n_s)\,d\langle\bar X^n\rangle_s
\]
\[
=\int_{t_{k-1}}^{t_k}\big(\partial_t+\bar L\big)u(s,\underline s,\bar X^n_s,\bar X^n_{\underline s})\,ds+\int_{t_{k-1}}^{t_k}\sigma(\underline s,\bar X^n_{\underline s})\,\partial_xu(s,\bar X^n_s)\,dW_s,
\]
where $\bar L$ is the "frozen" infinitesimal generator defined on functions $g\in C^{1,2}([0,T]\times\mathbb R)$ by
\[
\bar Lg(s,\underline s,x,\underline x)=b(\underline s,\underline x)\,\partial_xg(s,x)+\frac12\,\sigma^2(\underline s,\underline x)\,\partial_{xx}g(s,x)
\]
and $\partial_tg(s,\underline s,x,\underline x)=\partial_tg(s,x)$. The bracket process of the local martingale $M_t=\int_0^t\partial_xu(s,\bar X^n_s)\,\sigma(\underline s,\bar X^n_{\underline s})\,dW_s$ satisfies
\[
\langle M\rangle_T=\int_0^T\big(\partial_xu(s,\bar X^n_s)\big)^2\sigma^2(\underline s,\bar X^n_{\underline s})\,ds.
\]

Consequently, using that σ has (at most) linear growth in x, uniformly in $t\in[0,T]$, and the control (7.67) of $\partial_xu(s,x)$, we have
\[
\langle M\rangle_T\le C\Big(1+\sup_{t\in[0,T]}|\bar X^n_t|^2+\sup_{t\in[0,T]}|\bar X^n_t|^{r(1,T)+2}\Big)\in L^1(\mathbb P)
\]
since $\sup_{t\in[0,T]}|\bar X^n_t|$ lies in every $L^p(\mathbb P)$. Hence, $(M_t)_{t\in[0,T]}$ is a true martingale, so that
\[
\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big)=\mathbb E\int_{t_{k-1}}^{t_k}(\partial_t+\bar L)u(s,\underline s,\bar X^n_s,\bar X^n_{\underline s})\,ds
\]

(the integrability of the integral term follows from $\partial_tu=-Lu$, which ensures the polynomial growth of $(\partial_t+\bar L)u(s,\underline s,x,\underline x)$ in x and $\underline x$, uniformly in $s,\underline s$). At this stage, the idea is to expand the above expectation into a term $\mathbb E\,\bar\phi(\underline s,\bar X^n_{\underline s})\,\frac Tn+O\big((\frac Tn)^2\big)$. To this end we will again apply Itô's formula to $\partial_tu(s,\bar X^n_s)$, $\partial_xu(s,\bar X^n_s)$ and $\partial_{xx}u(s,\bar X^n_s)$, taking advantage of the regularity of u.
– Term 1. The function $\partial_tu$ being $C^{1,2}([0,T]\times\mathbb R)$, Itô's formula between $\underline s=t_{k-1}$ and s yields
\[
\partial_tu(s,\bar X^n_s)=\partial_tu(\underline s,\bar X^n_{\underline s})+\int_{\underline s}^s\Big(\partial_{tt}u(r,\bar X^n_r)+\bar L(\partial_tu)(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})\Big)dr+\int_{\underline s}^s\sigma(\underline r,\bar X^n_{\underline r})\,\partial_{tx}u(r,\bar X^n_r)\,dW_r.
\]

First, let us show that the local martingale term is the increment between $\underline s$ and s of a true martingale, denoted by $(M^{(1)}_t)_{t\in[0,T]}$ from now on. Note that $\partial_tu=-Lu$, so that $\partial_{xt}u=-\partial_xLu$ is clearly a function with polynomial growth in x uniformly in $t\in[0,T]$ since
\[
\Big|\partial_x\Big(b(t,x)\,\partial_xu+\frac12\,\sigma^2(t,x)\,\partial_{xx}u\Big)(t,x)\Big|\le C\big(1+|x|^{\theta_0}\big)
\]
owing to (7.67). Multiplying this term by the function $\sigma(t,x)$, with linear growth, preserves its polynomial growth. Consequently, $\big(M^{(1)}_t\big)_{t\in[0,T]}$ is a true martingale since $\mathbb E\,\langle M^{(1)}\rangle_T<+\infty$. On the other hand, using that $\partial_tu=-Lu$ leads to
\[
\partial_{tt}u(r,\bar X^n_r)+\bar L(\partial_tu)(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})=\partial_{tt}u(r,\bar X^n_r)-\bar L\circ Lu(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})=:\bar\phi^{(1)}(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}),
\]
where $\bar\phi^{(1)}$ satisfies, for every $x,\underline x\in\mathbb R$ and every $t,\underline t\in[0,T]$,
\[
\big|\bar\phi^{(1)}(t,\underline t,x,\underline x)\big|\le C_1\big(1+|x|^{\theta_1}+|\underline x|^{\theta_1}\big).
\]

This follows from the fact that $\bar\phi^{(1)}$ is defined as a linear combination of products of $b,\partial_tb,\partial_xb,\partial_{xx}b,\sigma,\partial_t\sigma,\partial_x\sigma,\partial_{xx}\sigma,\partial_xu,\partial_{xx}u$ at $(t,x)$ or $(\underline t,\underline x)$ (with "$x=\bar X^n_r$" and "$\underline x=\bar X^n_{\underline r}$").
– Term 2. The function $\partial_xu$ being $C^{1,2}$, Itô's formula yields
\[
\partial_xu(s,\bar X^n_s)=\partial_xu(\underline s,\bar X^n_{\underline s})+\int_{\underline s}^s\Big(\partial_{xt}u(r,\bar X^n_r)+\bar L(\partial_xu)(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})\Big)dr+\int_{\underline s}^s\partial_{xx}u(r,\bar X^n_r)\,\sigma(\underline r,\bar X^n_{\underline r})\,dW_r.
\]
The stochastic integral is the increment of a true martingale (denoted by $(M^{(2)}_t)_{t\in[0,T]}$ in what follows) and, using that $\partial_{xt}u=\partial_x(-Lu)$, one shows likewise that
\[
\partial_{tx}u(r,\bar X^n_r)+\bar L(\partial_xu)(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})=\big(\bar L(\partial_xu)-\partial_x(Lu)\big)(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})=:\bar\phi^{(2)}(r,\underline r,\bar X^n_r,\bar X^n_{\underline r}),
\]
where $(t,\underline t,x,\underline x)\mapsto\bar\phi^{(2)}(t,\underline t,x,\underline x)$ has a polynomial growth in $(x,\underline x)$ uniformly in $t,\underline t\in[0,T]$.
– Term 3. Following the same lines, one shows that
\[
\partial_{xx}u(s,\bar X^n_s)=\partial_{xx}u(\underline s,\bar X^n_{\underline s})+\int_{\underline s}^s\bar\phi^{(3)}(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})\,dr+M^{(3)}_s-M^{(3)}_{\underline s},
\]
where $(M^{(3)}_t)_{t\in[0,T]}$ is a martingale and $\bar\phi^{(3)}$ has a polynomial growth in $(x,\underline x)$ uniformly in $(t,\underline t)$.
Step 2 (A first bound). Collecting all the results obtained in Step 1 yields
\[
u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})=(\partial_t+L)(u)(t_{k-1},\bar X^n_{t_{k-1}})\,\frac Tn+\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^s\bar\phi(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})\,dr\,ds
\]
\[
+\int_{t_{k-1}}^{t_k}\Big(M^{(1)}_s-M^{(1)}_{t_{k-1}}+b(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\big)+\frac12\,\sigma^2(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(3)}_s-M^{(3)}_{t_{k-1}}\big)\Big)ds+M_{t_k}-M_{t_{k-1}},
\]
where
\[
\bar\phi(r,\underline r,x,\underline x)=\bar\phi^{(1)}(r,\underline r,x,\underline x)+b(\underline r,\underline x)\,\bar\phi^{(2)}(r,\underline r,x,\underline x)+\frac12\,\sigma^2(\underline r,\underline x)\,\bar\phi^{(3)}(r,\underline r,x,\underline x).
\]

Hence, the function $\bar\phi$ satisfies a polynomial growth assumption: there exist $\theta,\underline\theta\in\mathbb N$ such that
\[
\forall\,t,\underline t\in[0,T],\ \forall\,x,\underline x\in\mathbb R,\quad|\bar\phi(t,\underline t,x,\underline x)|\le C_{\bar\phi}(T)\big(1+|x|^\theta+|\underline x|^{\underline\theta}\big),
\]
where $C_{\bar\phi}(T)$ can be chosen to be non-decreasing in T (if b and σ are defined on $[0,T']\times\mathbb R$, $T'\ge T$, and satisfy $(\mathcal C_\infty)$ on it). The first term $(\partial_t+L)(u)(t_{k-1},\bar X^n_{t_{k-1}})$ on the right-hand side of the above decomposition vanishes since $\partial_tu+Lu=0$.
As concerns the third term, let us show that it has zero expectation. One can use Fubini's Theorem since $\sup_{t\in[0,T]}|\bar X^n_t|\in L^p(\mathbb P)$ for every $p>0$ (this ensures the integrability of the integrand). Consequently,
\[
\mathbb E\int_{t_{k-1}}^{t_k}\Big(M^{(1)}_s-M^{(1)}_{t_{k-1}}+b(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\big)+\frac12\,\sigma^2(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(3)}_s-M^{(3)}_{t_{k-1}}\big)\Big)ds
\]
\[
=\int_{t_{k-1}}^{t_k}\Big(\mathbb E\big(M^{(1)}_s-M^{(1)}_{t_{k-1}}\big)+\mathbb E\Big(b(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\big)\Big)+\frac12\,\mathbb E\Big(\sigma^2(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(3)}_s-M^{(3)}_{t_{k-1}}\big)\Big)\Big)ds.
\]

Now, all three expectations inside the integral are zero since the $M^{(k)}$, $k=1,2,3$, are true martingales. Thus
\[
\mathbb E\Big(b(t_{k-1},\bar X^n_{t_{k-1}})\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\big)\Big)=\mathbb E\Big(\underbrace{b(t_{k-1},\bar X^n_{t_{k-1}})}_{\mathcal F_{t_{k-1}}\text{-measurable}}\ \underbrace{\mathbb E\big(M^{(2)}_s-M^{(2)}_{t_{k-1}}\,\big|\,\mathcal F_{t_{k-1}}\big)}_{=0}\Big)=0,
\]
etc.

etc. Finally, the original expansion is reduced to


 
  tk s
E u(tk , X̄ tnk ) − u(tk−1 , X̄ tnk−1 ) = E φ̄(r, r , X̄ rn , X̄ rn ) dr ds (7.70)
tk−1 tk−1

so that
 
   tk s  
E u(tk , X̄ n ) − u(tk−1 , X̄ n )  ≤ ds dr E |φ̄(r, r , X̄ rn , X̄ rn )|
tk tk−1
tk−1 tk−1
  (tk − tk−1 )2
≤ Cφ̄ (T ) 1 + 2 E sup | X̄ tn |θ∨θ
t∈[0,T ] 2
 2
T
≤ Cb,σ, f (T ) ,
n

where, owing to Proposition 6.6, the function $C_{b,\sigma,f}(\cdot)$ only depends on T (in a non-decreasing manner). Summing over the terms for $k=1,\dots,n$ yields, keeping in mind that $\bar X^n_0=X_0$,
\[
\big|\mathbb E\,f(\bar X^n_T)-\mathbb E\,f(X_T)\big|=\big|\mathbb E\,u(T,\bar X^n_T)-\mathbb E\,u(0,\bar X^n_0)\big|\le\sum_{k=1}^n\big|\mathbb E\,u(t_k,\bar X^n_{t_k})-\mathbb E\,u(t_{k-1},\bar X^n_{t_{k-1}})\big|\le n\,C_{b,\sigma,f}(T)\Big(\frac Tn\Big)^2=\widetilde C_{b,\sigma,f}(T)\,\frac Tn,
\]
with $\widetilde C_{b,\sigma,f}(T)=T\,C_{b,\sigma,f}(T)$. It follows from the preceding and the obvious equality $\frac{t_k}{k}=\frac Tn$, $k=1,\dots,n$, that
\[
\forall\,k\in\{0,\dots,n\},\quad\big|\mathbb E\,f(\bar X^n_{t_k})-\mathbb E\,f(X_{t_k})\big|\le\widetilde C_{b,\sigma,f}(t_k)\,\frac{t_k}{k}\le\widetilde C_{b,\sigma,f}(T)\,\frac Tn.
\]

 Exercise. Compute an explicit (closed) form for the function φ̄.


Step 3 (First-order expansion). To obtain an expansion at order 1, one must return to the identity (7.70), namely
\[
\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big)=\mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\bar\phi(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})\,dr\,ds.
\]
The function $\bar\phi$ can be written explicitly as a polynomial of b, σ, u and (some of) their partial derivatives at $(t,x)$ or $(\underline t,\underline x)$. Consequently, if b, σ and f satisfy $(\mathcal C_\infty)$ and $(\mathcal F_\infty)$, respectively, one shows that the function $\bar\phi$ satisfies:
(i) $\bar\phi$ is continuous in $(t,\underline t,x,\underline x)$;
(ii) $\big|\partial^{m+m'}_{x^m\underline x^{m'}}\bar\phi(t,\underline t,x,\underline x)\big|\le C_T\big(1+|x|^{\chi(m,m',T)}+|\underline x|^{\chi'(m,m',T)}\big)$, $t,\underline t\in[0,T]$;
(iii) $\big|\partial^{m+m'}_{t^m\underline t^{m'}}\bar\phi(t,\underline t,x,\underline x)\big|\le C_T\big(1+|x|^{\theta(m,m',T)}+|\underline x|^{\theta'(m,m',T)}\big)$, $t,\underline t\in[0,T]$.
In fact, as above, a $C^{1,2}$-regularity in $(t,x)$ is sufficient to get a second-order expansion. We associate to $\bar\phi$ the function φ defined by
\[
\phi(t,x):=\bar\phi(t,t,x,x),
\]
which is (at least) a $C^{1,2}$-function whose time and space partial derivatives have polynomial growth in x uniformly in $t\in[0,T]$, owing to the above properties of $\bar\phi$.

The idea is once again to apply Itô's formula, this time to $\bar\phi(\,.\,,\underline r,\,.\,,\bar X^n_{\underline r})$. Let $r\in[t_{k-1},t_k)$ (so that $\underline r=t_{k-1}$). Then,
\[
\bar\phi(r,\underline r,\bar X^n_r,\bar X^n_{\underline r})=\phi(\underline r,\bar X^n_{\underline r})+\int_{t_{k-1}}^r\Big(\partial_t\bar\phi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})+\bar L\bar\phi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\Big)dv+\int_{t_{k-1}}^r\partial_x\bar\phi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\,\sigma(\underline v,\bar X^n_{\underline v})\,dW_v,
\]
where we used that the mute variable v satisfies $\underline v=\underline r=t_{k-1}$. The stochastic integral turns out to be the increment of a true square integrable martingale since
\[
\sup_{v\in[0,T]}\big|\partial_x\bar\phi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\,\sigma(\underline v,\bar X^n_{\underline v})\big|\le C\Big(1+\sup_{s\in[0,T]}|\bar X^n_s|^\theta\Big)\in L^2(\mathbb P)
\]

where $\theta\in\mathbb N$, owing to the above (ii) and the linear growth of σ. Then, Fubini's Theorem yields
\[
\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big)=\frac12\Big(\frac Tn\Big)^2\mathbb E\,\phi(t_{k-1},\bar X^n_{t_{k-1}})\tag{7.71}
\]
\[
+\ \mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\!\!\int_{t_{k-1}}^{r}\Big(\partial_t\bar\phi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})+\bar L\bar\phi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\Big)dv\,dr\,ds
\ +\ \underbrace{\mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\big(N_r-N_{t_{k-1}}\big)dr\,ds}_{=0}.
\]

Now,
\[
\mathbb E\sup_{v,r\in[0,T]}\big|\bar L\bar\phi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\big|<+\infty
\]
owing to (7.65) and the polynomial growth of b, σ, u and its partial derivatives. The same holds for $\partial_t\bar\phi(v,\underline r,\bar X^n_v,\bar X^n_{\underline r})$, so that
\[
\Big|\mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\!\!\int_{t_{k-1}}^{r}\Big(\partial_t\bar\phi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})+\bar L\bar\phi(\,.\,,\underline v,\,.\,,\bar X^n_{\underline v})(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\Big)dv\,dr\,ds\Big|\le\frac{C_{b,\sigma,f,T}}{3}\Big(\frac Tn\Big)^3.
\]

Summing from $k=1$ to n yields
\[
\mathbb E\big(f(\bar X^n_T)-f(X_T)\big)=\sum_{k=1}^n\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big)=\frac{T}{2n}\,\mathbb E\int_0^T\phi(\underline s,\bar X^n_{\underline s})\,ds+O\Big(\Big(\frac Tn\Big)^2\Big).
\]

In turn, for every $k\in\{0,\dots,n\}$, the function $\phi(t_k,\,.\,)$ satisfies Assumption $(\mathcal F_\infty)$ with some bounds not depending on k (this in turn follows from the fact that the space partial derivatives of φ have polynomial growth in x uniformly in $t\in[0,T]$). Consequently, by Step 2,
\[
\max_{0\le k\le n}\big|\mathbb E\,\phi(t_k,\bar X^n_{t_k})-\mathbb E\,\phi(t_k,X_{t_k})\big|\le C_{b,\sigma,f}(T)\,\frac Tn
\]
so that
\[
\max_{0\le k\le n}\Big|\mathbb E\int_0^{t_k}\phi(\underline s,\bar X^n_{\underline s})\,ds-\mathbb E\int_0^{t_k}\phi(\underline s,X_{\underline s})\,ds\Big|=\max_{0\le k\le n}\Big|\int_0^{t_k}\big(\mathbb E\,\phi(\underline s,\bar X^n_{\underline s})-\mathbb E\,\phi(\underline s,X_{\underline s})\big)ds\Big|\le C_{b,\sigma,f}(T)\,\frac{T^2}{n}.
\]

Applying Itô's formula to $\phi(u,X_u)$ between $\underline s$ and s shows that
\[
\phi(s,X_s)=\phi(\underline s,X_{\underline s})+\int_{\underline s}^s(\partial_t+L)(\phi)(r,X_r)\,dr+\int_{\underline s}^s\partial_x\phi(r,X_r)\,\sigma(r,X_r)\,dW_r,
\]
which implies
\[
\sup_{s\in[0,T]}\big|\mathbb E\,\phi(\underline s,X_{\underline s})-\mathbb E\,\phi(s,X_s)\big|\le C_{f,b,\sigma}(T)\,\frac Tn.
\]
Hence
\[
\max_{0\le k\le n}\Big|\mathbb E\int_0^{t_k}\phi(\underline s,X_{\underline s})\,ds-\mathbb E\int_0^{t_k}\phi(s,X_s)\,ds\Big|\le\frac{C_{b,\sigma,f}(T)}{n}.\tag{7.72}
\]

Finally, combining all these bounds yields
\[
\mathbb E\,u(t_k,\bar X^n_{t_k})-\mathbb E\,u(t_{k-1},\bar X^n_{t_{k-1}})=\frac{T}{2n}\,\mathbb E\int_{t_{k-1}}^{t_k}\phi(s,X_s)\,ds+O\Big(\frac1{n^3}\Big)
\]
uniformly in $k\in\{0,\dots,n\}$. One concludes by summing from $k=1$ to n that


\[
\mathbb E\,f(\bar X^n_T)-\mathbb E\,f(X_T)=\frac{T}{2n}\int_0^T\mathbb E\,\phi(s,X_s)\,ds+O\Big(\frac1{n^2}\Big),\tag{7.73}
\]
which completes the proof of the first-order expansion. Note that the function φ can be made explicit. ♦
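The first-order expansion is easy to visualize on a toy case where both expectations are available in closed form, so the weak error carries no Monte Carlo noise. In the sketch below (a hypothetical illustration, not taken from the text), $f(x)=x$ and $dX=\mu X\,dt+\sigma X\,dW$: only the drift matters for the mean, the Euler scheme has mean $x_0(1+\mu T/n)^n$, and the bias can be evaluated exactly.

```python
import numpy as np

def euler_weak_bias(mu=0.5, x0=1.0, T=1.0, n=10):
    """Exact weak error E f(Xbar^n_T) - E f(X_T) for f(x) = x and
    dX = mu*X dt + sigma*X dW: E Xbar^n_T = x0*(1 + mu*T/n)^n while
    E X_T = x0*exp(mu*T), so the bias is available in closed form."""
    return x0 * ((1.0 + mu * T / n) ** n - np.exp(mu * T))
```

Halving the step roughly halves the bias, i.e. bias$(n)/$bias$(2n)\to2$, consistent with a $c_1/n+O(1/n^2)$ behavior.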

One step beyond (toward R = 2). To expand at a higher order, say R = 2, one can proceed as follows: first we go back to (7.71), which can be re-written as
\[
\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big)=\frac12\Big(\frac Tn\Big)^2\mathbb E\,\phi(t_{k-1},\bar X^n_{t_{k-1}})+\mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\!\!\int_{t_{k-1}}^{r}\bar\chi(v,\underline v,\bar X^n_v,\bar X^n_{\underline v})\,dv\,dr\,ds,
\]
where $\bar\chi$ and $\chi(t,x)=\bar\chi(t,t,x,x)$ and their partial derivatives satisfy similar smoothness and growth properties as $\bar\phi$ and φ, respectively. Then, one gets by mimicking the above computations
\[
\mathbb E\big(u(t_k,\bar X^n_{t_k})-u(t_{k-1},\bar X^n_{t_{k-1}})\big)=\frac12\Big(\frac Tn\Big)^2\mathbb E\,\phi(t_{k-1},\bar X^n_{t_{k-1}})+\frac13\Big(\frac Tn\Big)^3\mathbb E\,\chi(t_{k-1},\bar X^n_{t_{k-1}})+\mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\!\!\int_{t_{k-1}}^{r}\!\!\int_{t_{k-1}}^{v}\bar\psi(w,\underline w,\bar X^n_w,\bar X^n_{\underline w})\,dw\,dv\,dr\,ds,
\]

where $\bar\psi$ still has the same structure, so that
\[
\mathbb E\int_{t_{k-1}}^{t_k}\!\!\int_{t_{k-1}}^{s}\!\!\int_{t_{k-1}}^{r}\!\!\int_{t_{k-1}}^{v}\bar\psi(w,\underline w,\bar X^n_w,\bar X^n_{\underline w})\,dw\,dv\,dr\,ds=O\Big(\Big(\frac Tn\Big)^4\Big)
\]
uniformly in $k\in\{0,\dots,n-1\}$. Consequently, still summing from $k=1$ to n,
\[
\mathbb E\,f(\bar X^n_T)-\mathbb E\,f(X_T)=\frac{c_{1,n}}{n}+\frac{c_{2,n}}{n^2}+O\Big(\Big(\frac Tn\Big)^3\Big),
\]
where
\[
c_{1,n}=\frac T2\,\mathbb E\int_0^T\phi(\underline s,\bar X^n_{\underline s})\,ds\quad\text{and}\quad c_{2,n}=\frac{T^2}{3}\,\mathbb E\int_0^T\chi(\underline s,\bar X^n_{\underline s})\,ds.
\]
It is clear, since the functions φ and χ are both continuous, that $c_{1,n}\to\frac T2\,\mathbb E\int_0^T\phi(s,X_s)\,ds$ and $c_{2,n}\to\frac{T^2}{3}\,\mathbb E\int_0^T\chi(s,X_s)\,ds$. Moreover, extending to χ the rate results established with φ, we even have that
\[
c_{1,n}-\frac T2\int_0^T\mathbb E\,\phi(s,X_s)\,ds=O\Big(\frac1n\Big)\quad\text{and}\quad c_{2,n}-\frac{T^2}{3}\int_0^T\mathbb E\,\chi(s,X_s)\,ds=O\Big(\frac1n\Big).
\]
To get a second-order expansion it remains to show that the term $O\big(\frac1n\big)$ in the convergence rate of $c_{1,n}$ is of the form $\frac cn+O\big(\frac1{n^2}\big)$. This can be obtained by applying the results obtained for f in the proof of the first-order expansion to the functions $\phi(t_k,\,.\,)$ with some uniformity in k (which is the technical point at this stage).
Further comments. In fact, to get higher-order expansions, this elementary approach,
close to the original proof from [270], is not the most convenient. Other approaches
have been developed since this seminal work on weak error expansions. Among
them, let us cite [64], based on an elegant duality argument. A parametrix approach
is presented in [172], which naturally leads to the higher-order expansion stated in
the theorem. In fact, this paper is more connected with the Bally–Talay theorem since
it relies, in a uniformly elliptic framework, on an approximation of the density of the
diffusion by that of the Euler scheme.

ℵ Practitioner’s corner. The original application of the first-order expansion,


already mentioned in the seminal paper [270], is of course the Richardson–Romberg
extrapolation introduced in Sect. 7.7. More recently, the first order expansion turned
out to be one of the two key ingredients of the Multilevel Monte Carlo method
(MLMC ) introduced by M. Giles in [107] (see Sect. 9.5.2 in Chap. 9) and devised to
efficiently kill the bias of a simulation while keeping the variance under control.
In turn, higher-order expansions are the key, first to Multistep Richardson–
Romberg extrapolation (see [225]), then, recently to the weighted multilevel methods
introduced in [198]. These two methods are also exposed in Chap. 9.
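The way the first-order expansion feeds the Richardson–Romberg extrapolation can be made concrete on a toy case with closed-form means (a hypothetical illustration, not from the text): for $f(x)=x$ and $dX=\mu X\,dt+\sigma X\,dW$, $\mathbb E\,f(\bar X^n_T)=x_0(1+\mu T/n)^n$ exactly, and the extrapolated estimator $2\,\mathbb E f(\bar X^{2n}_T)-\mathbb E f(\bar X^n_T)$ cancels the $c_1/n$ term, leaving an $O(1/n^2)$ residual bias.

```python
import numpy as np

def biases(mu=0.5, x0=1.0, T=1.0, n=10):
    """Plain Euler bias vs Richardson-Romberg extrapolated bias for f(x) = x,
    dX = mu*X dt + sigma*X dW, using the exact mean E Xbar^m_T = x0*(1+mu*T/m)^m."""
    mean_euler = lambda m: x0 * (1.0 + mu * T / m) ** m
    exact = x0 * np.exp(mu * T)
    plain = mean_euler(n) - exact                              # O(1/n) bias
    extrap = 2.0 * mean_euler(2 * n) - mean_euler(n) - exact   # O(1/n^2) bias
    return plain, extrap
```

Doubling n divides the extrapolated bias by roughly 4, confirming the second-order behavior.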

7.9 The Non-globally Lipschitz Case (A Few Words On)

The Lipschitz continuity assumptions made on b and σ to ensure the existence of


a unique strong solution to (7.1) over the whole non-negative real line R+ , hence
on any interval [0, T ], can be classically replaced by a local Lipschitz assumption
combined with a linear growth assumption. These assumptions read, respectively:
(i) $(Lip_{loc})$ Local Lipschitz continuity assumption. Assume that b and σ are Borel functions, locally Lipschitz continuous in (t, x) in the sense that
\[
\forall\,N\in\mathbb N^*,\ \exists\,L_N>0\ \text{such that}\ \forall\,x,y\in B(0,N),\ \forall\,t\in[0,N],\ \ |b(t,x)-b(t,y)|+\|\sigma(t,x)-\sigma(t,y)\|\le L_N|x-y|
\]
(where $B(0,N)$ denotes the ball centered at 0 with radius N and $\|\cdot\|$ denotes, e.g., the Frobenius norm of matrices).
(ii) $(LinG)$ Linear growth:
\[
\exists\,C>0,\ \forall\,t\in\mathbb R_+,\ \forall\,x\in\mathbb R^d,\quad|b(t,x)|\vee\|\sigma(t,x)\|\le C\big(1+|x|^2\big)^{\frac12}.
\]

However, although this less stringent condition makes it possible to consider typically oscillating coefficients, it does not make it possible to relax their linear growth behavior at infinity. In particular, it does not take into account the possible mean-reverting effect induced by the drift, typically of the form $b(x)=-\mathrm{sign}(x)|x|^p$, $p>1$.
In the autonomous case (b(t, x) = b(x) and σ(t, x) = σ(x)), this phenomenon is
investigated by Has’minskiı̆ in [145] (Theorem 4.1, p. 84) where (ii) LinG is replaced
by a weak Lyapunov assumption.

Theorem 7.13 If $b:\mathbb R^d\to\mathbb R^d$ and $\sigma:\mathbb R^d\to M(d,q,\mathbb R)$ satisfy $(Lip_{loc})$ (i.e. are locally Lipschitz continuous) and the following weak Lyapunov assumption:
$(WLyap)$ Weak Lyapunov assumption: there exists a twice differentiable function $V:\mathbb R^d\to\mathbb R_+$, $C^2$, with $\lim_{|x|\to+\infty}V(x)=+\infty$ and a real number $\lambda\in\mathbb R$ such that
\[
\forall\,x\in\mathbb R^d,\quad LV(x):=\big(b(x)\,|\,\nabla V(x)\big)+\frac12\,\mathrm{Tr}\big(\sigma(x)D^2V(x)\sigma^*(x)\big)\le\lambda V(x),\tag{7.74}
\]
then, for every $x\in\mathbb R^d$, the SDE (7.1) has a unique (strong) solution $(X^x_t)_{t\ge0}$ on the whole non-negative real line, starting from x at time 0.

Remark. If λV (x) is replaced in (7.74) by λV (x) + μ with λ < 0 and μ ∈ R, the


assumption becomes a standard mean-reverting or Lyapunov assumption and V is
called a Lyapunov function. Then (X t )t≥0 admits a stationary regime such that ν =
P X t , t ∈ R+ , satisfies ν(V ) < +∞.

A typical example where $(Lip_{loc})$ and $(WLyap)$ are satisfied is the following. Let $p\in\mathbb R$ and set
\[
b(x)=\kappa\big(1+|x|^2\big)^{\frac p2}x,\quad x\in\mathbb R^d.
\]
The function b is clearly locally Lipschitz continuous but not Lipschitz continuous when $p>0$. Assume σ is locally Lipschitz continuous in x with linear growth in the sense of $(LinG)$; then one easily checks, with $V(x)=|x|^2+1$, that
\[
LV(x)=2\big(b(x)\,|\,x\big)+\|\sigma(x)\|^2\le2\kappa\big(1+|x|^2\big)^{\frac p2}|x|^2+C^2\big(1+|x|^2\big).
\]
7.9 The Non-globally Lipschitz Case (A Few Words On) 361

• If $\kappa\ge0$ and $p\le0$, then $(Lip_{loc})$ and $(LinG)$ are satisfied, as well as (7.74) (with $\lambda=2\kappa+C^2$).
• If $\kappa<0$, then (7.74) is satisfied with $\lambda=C^2$ for every p, whereas $(LinG)$ is not satisfied.

Remark. If $\kappa<0$ and $p>0$, then the mean-reverting assumption mentioned in the previous remark is satisfied for every $\lambda\in(2\kappa,0)$, with
\[
\mu_\lambda=\sup_{x\in\mathbb R^d}\Big((C^2-\lambda)\big(1+|x|^2\big)+2\kappa\big(1+|x|^2\big)^{\frac p2}|x|^2\Big)<+\infty.
\]
In fact, in that case the SDE admits a stationary regime.

This allows us to deal with SDEs like perturbed dissipative gradient equations, equations coming from Hamiltonian mechanics (see e.g. [267]), etc., where the drift shares mean-reverting properties but has a (non-linear) polynomial growth at infinity. Unfortunately, this stability, or more precisely the non-explosive property induced by the existence of a weak Lyapunov function as described above, cannot be transferred to the regular Euler or Milstein schemes, which usually have an explosive behavior, depending on the step. Actually, this phenomenon can already be observed in a deterministic framework with ODEs but is more systematic with (true) SDEs since, unlike ODEs, no stability region usually exists depending on the starting value of the scheme.
A natural idea, inspired by the numerical analysis of ODEs, leads to the introduction of fully and partially drift-implicit Euler schemes, but also of new classes of explicit approximations scarcely more complex than the regular Euler scheme. We will not go further in this direction but refer, for example, to [152] for an in-depth study of the approximation of such SDEs, including moment bounds and convergence rates.
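As an illustration of both the explosion phenomenon and of one of the explicit fixes scarcely more complex than the Euler scheme, the sketch below implements the "tamed" Euler scheme of Hutzenthaler, Jentzen and Kloeden, in which the drift increment $h\,b(x)$ is replaced by $h\,b(x)/(1+h|b(x)|)$. The test equation $dX=-X^3\,dt+\sigma\,dW$ and all parameters are illustrative choices, not from the text.

```python
import numpy as np

def simulate(x0=4.0, T=1.0, n=5, sigma=1.0, n_mc=10000, tamed=True, seed=3):
    """Plain vs tamed Euler for dX = -X^3 dt + sigma dW. Taming caps each
    drift step at magnitude 1 (in units of 1/h), which suppresses the
    explosive oscillations of the plain scheme when h*x0^2 is not small."""
    rng = np.random.default_rng(seed)
    h = T / n
    x = np.full(n_mc, x0)
    for _ in range(n):
        b = -x ** 3
        drift = b / (1.0 + h * np.abs(b)) if tamed else b
        x = x + drift * h + sigma * rng.normal(0.0, np.sqrt(h), size=n_mc)
    return x
```

With this (coarse) step, the plain scheme overshoots and oscillates with rapidly growing amplitude, while the tamed paths relax toward the mean-reverting equilibrium.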
Chapter 8
The Diffusion Bridge Method:
Application to Path-Dependent Options
(II)

8.1 Theoretical Results About Time Discretization of Path-Dependent Functionals

In this section we deal with some “path-dependent” (European) options. Such con-
tracts are characterized by the fact that their payoffs depend on the whole past of
the underlying asset(s) between the origin t = 0 of the contract
 and its maturity T .
This means that these payoffs are of the form F (X t )t∈[0,T ] , where F is a functional
usually naturally defined from D([0, T ], Rd ) → R+ (where D([0, T ], Rd ) is the set
of càdlàg functions x : [0, T ] → Rd (1 ) and X = (X t )t∈[0,T ] denotes the dynamics of
the underlying asset). We still assume from now on that X = (X t )t∈[0,T ] is a solution
to an Rd -valued SDE of type (7.1):

d X t = b(t, X t )dt + σ(t, X t )dWt ,

where X 0 is independent of the Brownian motion W .


The question of establishing weak error expansions, even at the first order, for
families of functionals of a Brownian diffusion is even more challenging than for time
marginal functions f (α(T )), α ∈ D([0, T ], Rd ). In the recent years, several papers
have provided such expansions for specific families of functionals F. These works
were essentially motivated by the pricing of European path-dependent options, like
Asian or Lookback options in one dimension, corresponding to functionals defined
for every α ∈ D([0, T ], Rd ) by
\[
F(\alpha):=f\Big(\int_0^T\alpha(s)\,ds\Big),\qquad F(\alpha):=f\Big(\alpha(T),\,\sup_{t\in[0,T]}\alpha(t),\,\inf_{t\in[0,T]}\alpha(t)\Big)
\]

1 We need to define F on càdlàg functions in view of the stepwise constant Euler scheme, not to
speak of jump diffusion driven by Lévy processes.
© Springer International Publishing AG, part of Springer Nature 2018 363
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_8
364 8 The Diffusion Bridge Method: Application …

and, in higher dimensions, to barrier options with functionals of the form


 
\[
F(\alpha)=f\big(\alpha(T)\big)\,\mathbf 1_{\{\tau_D(\alpha)>T\}},
\]
where D is an open domain of $\mathbb R^d$ and $\tau_D(\alpha):=\inf\{s\in[0,T]:\alpha(s)\ \text{or}\ \alpha(s-)\notin D\}$ is the exit time from D by the generic càdlàg path α (²). In both frameworks, f is usually at least Lipschitz continuous. Let us quote from the literature two well-known examples of results (in a homogeneous framework, i.e. $b(t,x)=b(x)$ and $\sigma(t,x)=\sigma(x)$).
– The following theorem is established in [120].

Theorem 8.1 (a) If the domain D is bounded and has a smooth enough boundary (in fact $C^3$), if $b\in C^3(\mathbb R^d,\mathbb R^d)$, $\sigma\in C^3\big(\mathbb R^d,M(d,q,\mathbb R)\big)$, σ uniformly elliptic on D (i.e. $\sigma\sigma^*(x)\ge\varepsilon_0I_d$, $\varepsilon_0>0$), then, for every bounded Borel function f vanishing in a neighborhood of $\partial D$,
\[
\mathbb E\big(f(X_T)\mathbf 1_{\{\tau_D(X)>T\}}\big)-\mathbb E\big(f(\widetilde X^n_T)\mathbf 1_{\{\tau_D(\widetilde X^n)>T\}}\big)=O\Big(\frac1{\sqrt n}\Big)\ \text{as }n\to+\infty\tag{8.1}
\]
where $\widetilde X^n$ denotes the stepwise constant Euler scheme.
(b) If, furthermore, b and σ are $C^5$ and D is a half-space, then the genuine Euler scheme $\bar X^n$ satisfies
\[
\mathbb E\big(f(X_T)\mathbf 1_{\{\tau_D(X)>T\}}\big)-\mathbb E\big(f(\bar X^n_T)\mathbf 1_{\{\tau_D(\bar X^n)>T\}}\big)=O\Big(\frac1n\Big)\ \text{as }n\to+\infty.\tag{8.2}
\]

Note, however, that these assumptions are unfortunately not satisfied by standard
barrier options (see below).
– It is suggested in [264] (with a rigorous proof when $X=W$) that if $b,\sigma\in C^4_b(\mathbb R,\mathbb R)$, σ is uniformly elliptic and $f\in C^{4,2}_{pol}(\mathbb R^2)$ (existing partial derivatives with polynomial growth), then
\[
\mathbb E\,f\Big(X_T,\max_{t\in[0,T]}X_t\Big)-\mathbb E\,f\Big(\widetilde X^n_T,\max_{0\le k\le n}\widetilde X^n_{t_k}\Big)=O\Big(\frac1{\sqrt n}\Big)\ \text{as }n\to+\infty.\tag{8.3}
\]
A similar improvement – an $O(\frac1n)$ rate – as above can be expected (but still remains a conjecture) when replacing the stepwise constant scheme $\widetilde X^n$ by the continuous Euler scheme $\bar X^n$, namely
\[
\mathbb E\,f\Big(X_T,\max_{t\in[0,T]}X_t\Big)-\mathbb E\,f\Big(\bar X^n_T,\max_{t\in[0,T]}\bar X^n_t\Big)=O\Big(\frac1n\Big)\ \text{as }n\to+\infty.
\]

² When α is continuous or stepwise constant and càdlàg, $\tau_D(\alpha):=\inf\{s\in[0,T]:\alpha(s)\notin D\}$.
8.1 Theoretical Results About Time Discretization of Path-Dependent Functionals 365

More recently, relying on new techniques based on transport inequalities and the Wasserstein metric, a partial result toward this conjecture has been established in [5], with an error of $O\big(n^{-(\frac23-\eta)}\big)$ for every $\eta>0$.

If we forget about the regularity assumptions, the formal intersection between these two classes of path-dependent functionals (or non-negative payoffs) is not empty, since the payoff of a barrier option with domain $D=(-\infty,L)$ can be written
\[
f(X_T)\,\mathbf 1_{\{\tau_D(X)>T\}}=F\Big(X_T,\sup_{t\in[0,T]}X_t\Big)\quad\text{with}\quad F(x,y)=f(x)\,\mathbf 1_{\{y<L\}}.
\]
Unfortunately, such a function F is never a smooth function, so that even if the second result were true, it would not solve the first one.
For other results concerning these weak error expansions for functionals, we refer to [17, 122, 123, 160] and the references therein. By contrast with the "vanilla case", these results are somewhat disappointing since they point out that the weak error obtained with the stepwise constant Euler scheme is not significantly better than the strong error, the only gain being a $\sqrt{1+\log n}$ factor. The positive side is that we can reasonably hope that using the genuine Euler scheme will again yield the $O(1/n)$-rate in the first-order expansion of the time discretization error, provided we know how to simulate the functional of interest of this scheme.

8.2 From Brownian to Diffusion Bridge: How to Simulate Functionals of the Genuine Euler Scheme

To take advantage of the above rates and the reasonable guess we may have about higher-order expansions, we do not need to simulate the genuine Euler scheme itself (which is meaningless) but only some specific functionals of this scheme, like the maximum, the minimum, time integrals, etc., between two time discretization instants $t^n_k$ and $t^n_{k+1}$, given the (simulated) values of the discrete time Euler scheme. This means bridging this discrete time Euler scheme into its genuine extension. First, we deal with the standard Brownian motion itself.

8.2.1 The Brownian Bridge Method

We still denote by (FtW )t≥0 the (completed) natural filtration of a standard Brownian
motion W . We begin with a quick study of the standard Brownian bridge between 0
and T .

Proposition 8.1 Let W = (Wt )t≥0 be a standard Brownian motion.


(a) Let $T>0$. Then the so-called standard Brownian bridge on $[0,T]$, defined by

$$Y^{W,T}_t:=W_t-\frac tT\,W_T,\quad t\in[0,T],\tag{8.4}$$

is an $\mathcal F^W_T$-measurable centered Gaussian process, independent of $(W_{T+s})_{s\ge0}$, whose distribution is characterized by its covariance structure

$$\mathbb E\,\big[Y^{W,T}_sY^{W,T}_t\big]=s\wedge t-\frac{st}T=\frac{(s\wedge t)(T-s\vee t)}T,\quad 0\le s,t\le T.$$
(b) Let $T_0,T_1\in(0,+\infty)$, $T_0<T_1$. Then

$$\mathcal L\big((W_t)_{t\in[T_0,T_1]}\mid W_s,\ s\notin(T_0,T_1)\big)=\mathcal L\big((W_t)_{t\in[T_0,T_1]}\mid W_{T_0},W_{T_1}\big),$$

so that $(W_t)_{t\in[T_0,T_1]}$ and $(W_s)_{s\notin(T_0,T_1)}$ are independent given $(W_{T_0},W_{T_1})$. Moreover, this conditional distribution is given by

$$\mathcal L\big((W_t)_{t\in[T_0,T_1]}\mid W_{T_0}=x,\,W_{T_1}=y\big)=\mathcal L\Big(\Big(x+\frac{t-T_0}{T_1-T_0}(y-x)+Y^{B,T_1-T_0}_{t-T_0}\Big)_{t\in[T_0,T_1]}\Big),$$

where $B$ is a generic standard Brownian motion.

Proof. (a) The process $Y^{W,T}$ is clearly centered since $W$ is. Elementary computations based on the covariance structure of the standard Brownian motion,

$$\mathbb E\,W_sW_t=s\wedge t,\quad s,t\in[0,T],$$

show that, for every $s,t\in[0,T]$,

$$\begin{aligned}\mathbb E\,\big(Y^{W,T}_tY^{W,T}_s\big)&=\mathbb E\,W_tW_s-\frac sT\,\mathbb E\,W_tW_T-\frac tT\,\mathbb E\,W_sW_T+\frac{st}{T^2}\,\mathbb E\,W_T^2\\&=s\wedge t-\frac{st}T-\frac{ts}T+\frac{ts}T\\&=s\wedge t-\frac{ts}T=\frac{(s\wedge t)(T-s\vee t)}T.\end{aligned}$$

Let $\mathcal G_W=\overline{\operatorname{span}}^{L^2(\mathbb P)}\{W_t,\ t\ge0\}$ be the closed vector subspace of $L^2(\Omega,\mathcal A,\mathbb P)$ spanned by the Brownian motion $W$. Since $W$ is a centered Gaussian process, it is a well-known fact that independence and absence of correlation coincide in $\mathcal G_W$. The process $Y^{W,T}$ belongs to this space by construction. Likewise one shows that, for every $u\ge T$, $\mathbb E\,\big(Y^{W,T}_tW_u\big)=0$, so that $Y^{W,T}_t\perp\operatorname{span}(W_u,\,u\ge T)$. Consequently, $Y^{W,T}$ is independent of $(W_{T+u})_{u\ge0}$.
(b) First note that, for every $t\in[T_0,T_1]$,

$$W_t=W_{T_0}+W^{(T_0)}_{t-T_0},$$

where $W^{(T_0)}_t=W_{T_0+t}-W_{T_0}$, $t\ge0$, is a standard Brownian motion, independent of $\mathcal F^W_{T_0}$. Rewriting (8.4) for $W^{(T_0)}$ leads to

$$W^{(T_0)}_s=\frac s{T_1-T_0}\,W^{(T_0)}_{T_1-T_0}+Y^{W^{(T_0)},\,T_1-T_0}_s.$$

Plugging this identity at time $s=t-T_0$ into the above equality leads to

$$W_t=W_{T_0}+\frac{t-T_0}{T_1-T_0}\big(W_{T_1}-W_{T_0}\big)+Y^{W^{(T_0)},\,T_1-T_0}_{t-T_0}.\tag{8.5}$$

It follows from (a) that the process $Y:=\big(Y^{W^{(T_0)},\,T_1-T_0}_{t-T_0}\big)_{t\in[T_0,T_1]}$ is a Gaussian process, measurable with respect to $\mathcal F^{W^{(T_0)}}_{T_1-T_0}$ by (a), hence independent of $\mathcal F^W_{T_0}$ since $W^{(T_0)}$ is. Consequently, $Y$ is independent of $(W_t)_{t\in[0,T_0]}$. Furthermore, $Y$ is independent of $\big(W^{(T_0)}_{T_1-T_0+u}\big)_{u\ge0}$ by (a), hence $L^2(\mathbb P)$-orthogonal to $W_{T_1+u}-W_{T_0}$, $u\ge0$, in $\mathcal G_W$. As it is also orthogonal to $W_{T_0}$ in $\mathcal G_W$, since $W_{T_0}$ is $\mathcal F^W_{T_0}$-measurable, it is orthogonal to $W_{T_1+u}=W^{(T_0)}_{T_1-T_0+u}+W_{T_0}$, $u\ge0$. This in turn implies the independence of $Y$ and $(W_{T_1+u})_{u\ge0}$ since all these random variables lie in $\mathcal G_W$. Finally, the same argument (in $\mathcal G_W$, no correlation implies independence) shows that $Y$ is independent of $(W_s)_{s\in\mathbb R_+\setminus(T_0,T_1)}$, i.e. of the $\sigma$-field $\sigma(W_s,\,s\notin(T_0,T_1))$. The end of the proof follows from the above identity (8.5) and the exercises below. ♦
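The definition (8.4) is directly simulable: sample a Brownian path on a grid, subtract $\frac tT W_T$, and the empirical covariance must match $(s\wedge t)(T-s\vee t)/T$. A minimal numerical check (our illustration; the grid and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
T, n, M = 1.0, 100, 50_000             # horizon, grid steps, number of paths
t = np.linspace(0.0, T, n + 1)

# Brownian paths on the grid: cumulative sums of independent N(0, T/n) increments
dW = rng.normal(0.0, np.sqrt(T / n), size=(M, n))
W = np.concatenate([np.zeros((M, 1)), np.cumsum(dW, axis=1)], axis=1)

# standard Brownian bridge (8.4): Y_t = W_t - (t/T) W_T
Y = W - (t / T) * W[:, -1:]

# empirical vs. theoretical covariance E[Y_s Y_t] at s = 0.3, t = 0.7
i, j = 30, 70
emp = float(np.mean(Y[:, i] * Y[:, j]))
theo = t[i] * (T - t[j]) / T           # here s < t, so s∧t = t[i] and s∨t = t[j]
```

Note that $Y$ vanishes at $T$ by construction, which the simulation reproduces exactly.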

⊳ Exercises. 1. Let $X$, $Y$, $Z$ be three random vectors defined on a probability space $(\Omega,\mathcal A,\mathbb P)$, taking values in $\mathbb R^{k_X}$, $\mathbb R^{k_Y}$ and $\mathbb R^{k_Z}$, respectively. Assume that $Y$ and $(X,Z)$ are independent. Show that, for every bounded Borel function $f:\mathbb R^{k_Y}\to\mathbb R$ and every Borel function $g:\mathbb R^{k_X}\to\mathbb R^{k_Y}$,

$$\mathbb E\,\big[f(g(X)+Y)\mid(X,Z)\big]=\mathbb E\,\big[f(g(X)+Y)\mid X\big].$$

Deduce that $\mathcal L\big(g(X)+Y\mid(X,Z)\big)=\mathcal L\big(g(X)+Y\mid X\big)$.
2. Deduce from the previous exercise that

$$\mathcal L\big((W_t)_{t\in[T_0,T_1]}\mid (W_t)_{t\notin(T_0,T_1)}\big)=\mathcal L\big((W_t)_{t\in[T_0,T_1]}\mid(W_{T_0},W_{T_1})\big).$$

[Hint: consider the finite-dimensional marginal distributions of $(W_{t_1},\dots,W_{t_n})$ given $(X,W_{s_1},\dots,W_{s_p})$, where $t_i\in[T_0,T_1]$, $i=1,\dots,n$, $X=(W_{T_0},W_{T_1})$ and $s_j\in[T_0,T_1]^c$, $j=1,\dots,p$, then use the decomposition (8.5).]
3. The conditional distribution of $(W_t)_{t\in[T_0,T_1]}$ given $W_{T_0},W_{T_1}$ is that of a Gaussian process, hence it can also be characterized by its expectation and covariance structure. Show that these are given respectively by

$$\mathbb E\,\big[W_t\mid W_{T_0}=x,\,W_{T_1}=y\big]=\frac{T_1-t}{T_1-T_0}\,x+\frac{t-T_0}{T_1-T_0}\,y,\quad t\in[T_0,T_1],$$

and

$$\operatorname{Cov}\big(W_t,W_s\mid W_{T_0}=x,\,W_{T_1}=y\big)=\frac{(T_1-t)(s-T_0)}{T_1-T_0},\quad s\le t,\ s,t\in[T_0,T_1].$$

8.2.2 The Diffusion Bridge (Bridge of the Genuine Euler Scheme)

Now we return to the (continuous) Euler scheme defined by (7.8).

Proposition 8.2 (Bridge of the Euler scheme) Assume that $\sigma(t,x)\ne0$ for every $t\in[0,T]$, $x\in\mathbb R$.

(a) The processes $(\bar X^n_t)_{t\in[t^n_k,t^n_{k+1}]}$, $k=0,\dots,n-1$, are conditionally independent given the $\sigma$-field $\sigma\big(\bar X^n_{t^n_k},\,k=0,\dots,n\big)$.

(b) Furthermore, for every $k\in\{0,\dots,n-1\}$, the conditional distribution satisfies

$$\mathcal L\big((\bar X^n_t)_{t\in[t^n_k,t^n_{k+1}]}\mid \bar X^n_{t^n_\ell}=x_\ell,\ \ell=0,\dots,n\big)=\mathcal L\big((\bar X^n_t)_{t\in[t^n_k,t^n_{k+1}]}\mid \bar X^n_{t^n_k}=x_k,\ \bar X^n_{t^n_{k+1}}=x_{k+1}\big)$$
$$=\mathcal L\Big(\Big(x_k+\frac{n(t-t^n_k)}T\,(x_{k+1}-x_k)+\sigma(t^n_k,x_k)\,Y^{B,T/n}_{t-t^n_k}\Big)_{t\in[t^n_k,t^n_{k+1}]}\Big),$$

where $(Y^{B,T/n}_s)_{s\in[0,T/n]}$ is a Brownian bridge (related to a generic Brownian motion $B$) as defined by (8.4). The distribution of this Gaussian process (sometimes called a diffusion bridge) is entirely characterized by:

– its expectation function $\Big(x_k+\frac{n(t-t^n_k)}T\,(x_{k+1}-x_k)\Big)_{t\in[t^n_k,t^n_{k+1}]}$ and

– its covariance operator $\sigma^2(t^n_k,x_k)\,\dfrac{(s\wedge t-t^n_k)(t^n_{k+1}-s\vee t)}{t^n_{k+1}-t^n_k}$, $s,t\in[t^n_k,t^n_{k+1}]$.

Proof. Elementary computations show that, for every $t\in[t^n_k,t^n_{k+1}]$,

$$\bar X^n_t=\bar X^n_{t^n_k}+\frac{t-t^n_k}{t^n_{k+1}-t^n_k}\big(\bar X^n_{t^n_{k+1}}-\bar X^n_{t^n_k}\big)+\sigma(t^n_k,\bar X^n_{t^n_k})\,Y^{W^{(t^n_k)},\,T/n}_{t-t^n_k}$$

(keeping in mind that $t^n_{k+1}-t^n_k=T/n$). Consequently, the conditional independence claim will follow if the processes $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, are independent given $\sigma\big(\bar X^n_{t^n_\ell},\ \ell=0,\dots,n\big)$. Now, it follows from the assumption on the diffusion coefficient $\sigma$ that

$$\sigma\big(\bar X^n_{t^n_\ell},\ \ell=0,\dots,n\big)=\sigma\big(X_0,W_{t^n_\ell},\ \ell=1,\dots,n\big).$$

So we have to establish the conditional independence of the processes $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, given $\sigma(X_0,W_{t^n_k},\,k=1,\dots,n)$ or, equivalently, given $\sigma(W_{t^n_k},\,k=1,\dots,n)$, since $X_0$ and $W$ are independent (note that all the above bridges are $\mathcal F^W_T$-measurable). First observe that all the bridges $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, and $W$ live in the same Gaussian space $\mathcal G_W$. We know from Proposition 8.1(a) that each bridge $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$ is independent of both $\mathcal F^W_{t^n_k}$ and $\sigma(W_{t^n_{k+1}+s}-W_{t^n_k},\,s\ge0)$. Hence it is independent, in particular, of $\sigma(\{W_{t^n_\ell},\,\ell=1,\dots,n\})$, again because $\mathcal G_W$ is a Gaussian space. On the other hand, all the bridges are independent of each other since they are built from the independent Brownian motions $(W^{(t^n_k)}_t)_{t\in[0,T/n]}$. Hence, the bridges $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, are i.i.d. and independent of $\sigma\big(W_{t^n_k},\,k=1,\dots,n\big)$ and, consequently, of $\sigma\big(X_0,W_{t^n_k},\,k=1,\dots,n\big)$.

Now $\bar X^n_{t^n_k}$ is $\sigma\big(\{W_{t^n_\ell},\,\ell=1,\dots,k\}\big)$-measurable, consequently one has

$$\sigma\big(\bar X^n_{t^n_k},W_{t^n_k},\bar X^n_{t^n_{k+1}}\big)\subset\sigma\big(\{W_{t^n_\ell},\,\ell=1,\dots,n\}\big),$$

so that $\big(Y^{W^{(t^n_k)},\,T/n}_t\big)_{t\in[0,T/n]}$ is independent of $\big(\bar X^n_{t^n_k},W_{t^n_k},\bar X^n_{t^n_{k+1}}\big)$. The conclusion follows. ♦
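Concretely, Proposition 8.2 says that once the discrete-time Euler scheme has been simulated, the genuine scheme can be filled in on each interval $[t^n_k,t^n_{k+1}]$ by drawing an independent Brownian bridge, scaled by $\sigma(t^n_k,\bar X^n_{t^n_k})$, on top of the linear interpolation of the knots. A sketch under these assumptions (function and parameter names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def fill_bridge(xk, xk1, tk, dt, sigma, m):
    """Sample the genuine Euler scheme at m inner points of [tk, tk + dt], given
    the simulated knot values xk, xk1 (Proposition 8.2): linear interpolation of
    the knots plus sigma(tk, xk) times an independent Brownian bridge on [0, dt]."""
    s = np.linspace(0.0, dt, m + 2)[1:-1]                   # inner times s = t - tk
    grid = np.concatenate([[0.0], s, [dt]])
    B = np.cumsum(rng.normal(0.0, np.sqrt(np.diff(grid))))  # Brownian path on the grid
    bridge = B[:-1] - (s / dt) * B[-1]                      # B_s - (s/dt) B_dt
    return xk + (s / dt) * (xk1 - xk) + sigma(tk, xk) * bridge

# illustrative local volatility sigma(t, x) = 0.2 x (Black-Scholes-like); our parameters
path = fill_bridge(xk=100.0, xk1=101.5, tk=0.0, dt=0.02,
                   sigma=lambda t, x: 0.2 * x, m=9)
```

In practice one rarely needs the whole inner path; the next sections show that the functionals of interest (maximum, minimum, time integral) can be sampled directly from their conditional laws.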

We now know the distribution of the genuine Euler scheme between two successive discretization times $t^n_k$ and $t^n_{k+1}$, conditionally on the values of the Euler scheme at its discretization times. We are thus in a position to simulate some functionals of the continuous Euler scheme, namely its supremum.
Proposition 8.3 The distribution of the supremum of the Brownian bridge starting at 0 and arriving at $y$ at time $T$, defined by $Y^{W,T,y}_t=\frac tT\,y+W_t-\frac tT\,W_T$ on $[0,T]$, is given by

$$\mathbb P\Big(\sup_{t\in[0,T]}Y^{W,T,y}_t\le z\Big)=\begin{cases}1-\exp\big(-\frac2T\,z(z-y)\big)&\text{if }z\ge\max(y,0),\\[1mm]0&\text{if }z\le\max(y,0).\end{cases}$$

Proof. The key is to have in mind that the distribution of $Y^{W,T,y}$ is the conditional distribution of $W$ given $W_T=y$. So we can derive the result from an expression of the joint distribution of $\big(\sup_{t\in[0,T]}W_t,\,W_T\big)$, e.g. from

$$\mathbb P\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big).$$

It is well known from the symmetry principle that, for every $z\ge\max(y,0)$,

$$\mathbb P\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big)=\mathbb P(W_T\ge2z-y).$$

We briefly reproduce the proof for the reader's convenience. If $z=0$, the result is obvious since $W_T\stackrel d=-W_T$. If $z>0$, one introduces the hitting time $\tau_z:=\inf\{s>0:W_s=z\}$ of $[z,+\infty)$ by $W$ (with the convention $\inf\emptyset=+\infty$). This is an $(\mathcal F^W_t)$-stopping time since $[z,+\infty)$ is a closed set and $W$ is a continuous process (this uses that $z>0$). Furthermore, $\tau_z$ is a.s. finite since $\limsup_{t\to+\infty}W_t=+\infty$ a.s. Consequently, still by continuity of its paths, $W_{\tau_z}=z$ a.s. and $(W_{\tau_z+t}-W_{\tau_z})_{t\ge0}$ is independent of $\mathcal F^W_{\tau_z}$. As a consequence, for every $z\ge\max(y,0)$, using that $W_{\tau_z}=z$ on the event $\{\tau_z\le T\}$,

$$\begin{aligned}\mathbb P\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big)&=\mathbb P\big(\tau_z\le T,\ W_T-W_{\tau_z}\le y-z\big)\\&=\mathbb P\big(\tau_z\le T,\ -(W_T-W_{\tau_z})\le y-z\big)\\&=\mathbb P\big(\tau_z\le T,\ W_T\ge2z-y\big)\\&=\mathbb P\big(W_T\ge2z-y\big)\end{aligned}$$

since $2z-y\ge z$. Consequently, one may write, for every $z\ge\max(y,0)$,

$$\mathbb P\Big(\sup_{t\in[0,T]}W_t\ge z,\ W_T\le y\Big)=\int_{2z-y}^{+\infty}h_T(\xi)\,d\xi\quad\text{with}\quad h_T(\xi)=\frac{e^{-\frac{\xi^2}{2T}}}{\sqrt{2\pi T}}.$$

Hence, since the involved functions are differentiable, one has

$$\mathbb P\Big(\sup_{t\in[0,T]}W_t\ge z\mid W_T=y\Big)=\lim_{\eta\to0}\frac{\big[\mathbb P(W_T\ge2z-(y+\eta))-\mathbb P(W_T\ge2z-y)\big]/\eta}{\big[\mathbb P(W_T\le y+\eta)-\mathbb P(W_T\le y)\big]/\eta}=\frac{h_T(2z-y)}{h_T(y)}=e^{-\frac{(2z-y)^2-y^2}{2T}}=e^{-\frac{2z(z-y)}T}.\ ♦$$
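The closed form of Proposition 8.3 can be sanity-checked by brute force: simulate finely discretized bridges pinned at $y$ and compare the empirical distribution of their maximum with $1-\exp\big(-\frac2T z(z-y)\big)$. A sketch (our parameters; the finite grid slightly underestimates the true supremum, so the agreement is only up to a small discretization bias):

```python
import numpy as np

rng = np.random.default_rng(1)
T, y, z = 1.0, 0.5, 1.0
n, M = 500, 10_000                           # fine grid, number of simulated bridges

t = np.linspace(0.0, T, n + 1)
dW = rng.normal(0.0, np.sqrt(T / n), size=(M, n))
W = np.concatenate([np.zeros((M, 1)), np.cumsum(dW, axis=1)], axis=1)
Yb = (t / T) * y + W - (t / T) * W[:, -1:]   # bridge from 0 to y (Proposition 8.3)

emp = float(np.mean(Yb.max(axis=1) <= z))    # grid approximation of P(sup <= z)
theo = 1.0 - np.exp(-2.0 * z * (z - y) / T)  # closed form, valid since z >= max(y, 0)
```

Since the grid maximum only undercounts the supremum, `emp` lies slightly above `theo`, and the gap shrinks as the grid is refined.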

Corollary 8.1 Let $\lambda>0$ and let $x,y\in\mathbb R$. If $Y^{W,T}$ denotes the standard Brownian bridge of $W$ between 0 and $T$, then, for every $z\in\mathbb R$,

$$\mathbb P\Big(\sup_{t\in[0,T]}\Big(x+\frac tT(y-x)+\lambda\,Y^{W,T}_t\Big)\le z\Big)=\begin{cases}1-\exp\big(-\frac2{T\lambda^2}(z-x)(z-y)\big)&\text{if }z\ge\max(x,y),\\[1mm]0&\text{if }z<\max(x,y).\end{cases}\tag{8.6}$$

Proof. First note that

$$x+\frac tT(y-x)+\lambda\,Y^{W,T}_t=\lambda x'+\lambda\Big(\frac tT\,y'+Y^{W,T}_t\Big)$$

with $x'=x/\lambda$, $y'=(y-x)/\lambda$. Then the result follows from the previous proposition, using that, for any real-valued random variable $\xi$, every $\alpha\in\mathbb R$ and every $\beta\in(0,+\infty)$,

$$\mathbb P(\alpha+\beta\,\xi\le z)=\mathbb P\Big(\xi\le\frac{z-\alpha}\beta\Big)=1-\mathbb P\Big(\xi>\frac{z-\alpha}\beta\Big).\ ♦$$
⊳ Exercise. Show, using $-W\stackrel d=W$, that

$$\mathbb P\Big(\inf_{t\in[0,T]}\Big(x+\frac tT(y-x)+\lambda\,Y^{W,T}_t\Big)\le z\Big)=\begin{cases}\exp\big(-\frac2{T\lambda^2}(z-x)(z-y)\big)&\text{if }z\le\min(x,y),\\[1mm]1&\text{if }z>\min(x,y).\end{cases}$$

8.2.3 Application to Lookback Style Path-Dependent Options

In this section, we focus on Lookback style options (including general barrier options), i.e. exotic options related to payoffs of the form $h_T:=f\big(X_T,\sup_{t\in[0,T]}X_t\big)$. We want to compute an approximation of $e^{-rT}\,\mathbb E\,h_T$ using a Monte Carlo simulation based on the continuous-time Euler scheme, i.e. we want to compute $e^{-rT}\,\mathbb E\,f\big(\bar X^n_T,\sup_{t\in[0,T]}\bar X^n_t\big)$. We first note, owing to the chaining rule for conditional expectations, that

$$\mathbb E\,f\Big(\bar X^n_T,\sup_{t\in[0,T]}\bar X^n_t\Big)=\mathbb E\,\Big[\mathbb E\,\Big(f\Big(\bar X^n_T,\sup_{t\in[0,T]}\bar X^n_t\Big)\,\Big|\,\bar X^n_{t^n_\ell},\ \ell=0,\dots,n\Big)\Big].$$

We derive from Proposition 8.2 that

$$\mathbb E\,\Big(f\Big(\bar X^n_T,\sup_{t\in[0,T]}\bar X^n_t\Big)\,\Big|\,\bar X^n_{t^n_\ell}=x_\ell,\ \ell=0,\dots,n\Big)=f\Big(x_n,\max_{0\le k\le n-1}M^{n,k}_{x_k,x_{k+1}}\Big),$$

where, owing to Proposition 8.3, the random variables

$$M^{n,k}_{x,y}:=\sup_{t\in[0,T/n]}\Big(x+\frac{nt}T\,(y-x)+\sigma(t^n_k,x)\,Y^{W^{(t^n_k)},\,T/n}_t\Big)$$

are independent. This can also be interpreted as saying that the random variables $M^{n,k}_{\bar X^n_{t^n_k},\bar X^n_{t^n_{k+1}}}$, $k=0,\dots,n-1$, are conditionally independent given $\sigma\big(\bar X^n_{t^n_k},\,k=0,\dots,n\big)$. Following Corollary 8.1, the distribution function $G^{n,k}_{x,y}$ of $M^{n,k}_{x,y}$ is given by

$$G^{n,k}_{x,y}(z)=\Big(1-\exp\Big(-\frac{2n}{T\,\sigma^2(t^n_k,x)}\,(z-x)(z-y)\Big)\Big)\,\mathbf 1_{\{z\ge\max(x,y)\}},\quad z\in\mathbb R.$$

Then the inverse distribution function simulation rule (see Proposition 1.1) yields that

$$\sup_{t\in[0,T/n]}\Big(x+\frac{nt}T\,(y-x)+\sigma(t^n_k,x)\,Y^{W^{(t^n_k)},\,T/n}_t\Big)\stackrel d=\big(G^{n,k}_{x,y}\big)^{-1}(U)\stackrel d=\big(G^{n,k}_{x,y}\big)^{-1}(1-U),\quad U\stackrel d=\mathcal U([0,1]),\tag{8.7}$$

where we used that $U\stackrel d=1-U$. To determine $\big(G^{n,k}_{x,y}\big)^{-1}$ (at $1-u$), it remains to solve the equation $G^{n,k}_{x,y}(z)=1-u$ under the constraint $z\ge\max(x,y)$, i.e.

$$1-\exp\Big(-\frac{2n}{T\,\sigma^2(t^n_k,x)}\,(z-x)(z-y)\Big)=1-u,\quad z\ge\max(x,y),$$

or, equivalently,

$$z^2-(x+y)\,z+xy+\frac T{2n}\,\sigma^2(t^n_k,x)\log(u)=0,\quad z\ge\max(x,y).$$

This quadratic equation has two solutions, the larger of which satisfies the constraint. Consequently,

$$\big(G^{n,k}_{x,y}\big)^{-1}(1-u)=\frac12\Big(x+y+\sqrt{(x-y)^2-2\,T\,\sigma^2(t^n_k,x)\log(u)/n}\Big).$$

Finally,

$$\mathcal L\Big(\max_{t\in[0,T]}\bar X^n_t\,\Big|\,\big\{\bar X^n_{t^n_k}=x_k,\ k=0,\dots,n\big\}\Big)=\mathcal L\Big(\max_{0\le k\le n-1}\big(G^{n,k}_{x_k,x_{k+1}}\big)^{-1}(1-U_{k+1})\Big),$$

where $(U_k)_{1\le k\le n}$ are i.i.d. random variables, uniformly distributed over the unit interval.
Pseudo-code for Lookback style options

We assume for the sake of simplicity that the interest rate $r$ is 0. By Lookback style options we mean the class of options whose payoff possibly involves both $\bar X^n_T$ and the maximum of $(X_t)$ over $[0,T]$, i.e.

$$\mathbb E\,f\Big(\bar X^n_T,\ \sup_{t\in[0,T]}\bar X^n_t\Big).$$

The regular Call on maximum is obtained by setting $f(x,y)=(y-K)_+$, the (maximum) Lookback option by setting $f(x,y)=y-x$ and the (maximum) partial Lookback option by $f_\lambda(x,y)=(y-\lambda x)_+$, $\lambda>1$. We want to compute a Monte Carlo approximation of $\mathbb E\,f\big(\bar X^n_T,\sup_{t\in[0,T]}\bar X^n_t\big)$ using the continuous Euler scheme. We reproduce below a pseudo-script to illustrate how to use the above result on the conditional distribution of the maximum of the Brownian bridge.

• Set $S^f_0=0$.

for $m=1$ to $M$

• Simulate a path of the discrete-time Euler scheme and set $x_k:=\bar X^{n,(m)}_{t^n_k}$, $k=0,\dots,n$.

• Simulate $\Lambda^{(m)}:=\max_{0\le k\le n-1}\big(G^{n,k}_{x_k,x_{k+1}}\big)^{-1}\big(1-U^{(m)}_{k+1}\big)$, where $(U^{(m)}_k)_{1\le k\le n}$ are i.i.d. with $\mathcal U([0,1])$-distribution.

• Compute $f\big(\bar X^{n,(m)}_T,\Lambda^{(m)}\big)$.

• Compute $S^f_m:=f\big(\bar X^{n,(m)}_T,\Lambda^{(m)}\big)+S^f_{m-1}$.

end(m)

• Eventually,

$$\mathbb E\,f\Big(\bar X^n_T,\sup_{t\in[0,T]}\bar X^n_t\Big)\simeq\frac{S^f_M}M$$

for large enough $M$ (³).
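The pseudo-code above can be sketched as follows in Python, for an illustrative Black–Scholes-type dynamics $dX_t=rX_t\,dt+\mathrm{vol}\,X_t\,dW_t$, so that $\sigma(t,x)=\mathrm{vol}\cdot x$ (our modeling choice; the payoff $f$ is passed as a parameter):

```python
import numpy as np

rng = np.random.default_rng(2)

def lookback_mc(x0, r, vol, T, n, M, f):
    """Undiscounted Monte Carlo estimate of E f(X_T^n, sup_t X_t^n), combining the
    discrete-time Euler scheme with the conditional bridge maxima of (8.7)."""
    dt = T / n
    x = np.empty((M, n + 1))
    x[:, 0] = x0
    for k in range(n):                      # Euler scheme for dX = r X dt + vol X dW
        dw = rng.normal(0.0, np.sqrt(dt), M)
        x[:, k + 1] = x[:, k] * (1.0 + r * dt + vol * dw)
    # (G^{n,k})^{-1}(1-U) = (x_k + x_{k+1} + sqrt((x_k - x_{k+1})^2 - 2 dt s_k^2 log U))/2
    u = rng.uniform(1e-12, 1.0, size=(M, n))
    sig2 = (vol * x[:, :-1]) ** 2           # sigma(t_k, x_k)^2 for this local volatility
    m_k = 0.5 * (x[:, :-1] + x[:, 1:]
                 + np.sqrt((x[:, :-1] - x[:, 1:]) ** 2 - 2.0 * dt * sig2 * np.log(u)))
    payoff = f(x[:, -1], m_k.max(axis=1))
    return payoff.mean(), 1.96 * payoff.std(ddof=1) / np.sqrt(M)

# maximum Lookback option f(x, y) = y - x, with illustrative parameters
price, half_ci = lookback_mc(x0=100.0, r=0.05, vol=0.2, T=1.0, n=50, M=20_000,
                             f=lambda x, y: y - x)
```

The returned half-width of the 95% confidence interval implements the recommendation of footnote 3.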


Once one can simulate supt∈[0,T ] X̄ tn (and its minimum, see exercise below), it
is easy to price by simulation the exotic options mentioned in the former section
(Lookback, options on maximum) but also the barrier options, since one can decide
whether or not the continuous Euler scheme strikes a barrier (up or down). The
Brownian bridge is also involved in the methods designed for pricing Asian options.
⊳ Exercise. (a) Show that the distribution of the infimum of the Brownian bridge $(Y^{W,T,y}_t)_{t\in[0,T]}$, starting at 0 and arriving at $y$ at time $T$, is given by

$$\mathbb P\Big(\inf_{t\in[0,T]}Y^{W,T,y}_t\le z\Big)=\begin{cases}\exp\big(-\frac2T\,z(z-y)\big)&\text{if }z\le\min(y,0),\\[1mm]1&\text{if }z\ge\min(y,0).\end{cases}$$

³ …Of course, one needs to compute the empirical variance, (approximately) given by

$$\frac1M\sum_{m=1}^M\big(f^{(m)}\big)^2-\Big(\frac1M\sum_{m=1}^Mf^{(m)}\Big)^2,$$

where $f^{(m)}$ denotes the $m$-th simulated payoff, in order to design a confidence interval, without which the method is simply nonsense….

(b) Derive a formula similar to (8.7) for the conditional distribution of the minimum of the continuous Euler scheme, using now the inverse distribution functions

$$\big(F^{n,k}_{x,y}\big)^{-1}(u)=\frac12\Big(x+y-\sqrt{(x-y)^2-2\,T\,\sigma^2(t^n_k,x)\log(u)/n}\Big),\quad u\in(0,1),$$

of the random variables $\displaystyle\inf_{t\in[0,T/n]}\Big(x+\frac{nt}T\,(y-x)+\sigma(t^n_k,x)\,Y^{W,T/n}_t\Big)$.

Warning! The above method is not appropriate for simulating the joint distribution of the $(n+3)$-tuple $\big(\bar X^n_{t^n_k},\ k=0,\dots,n,\ \inf_{t\in[0,T]}\bar X^n_t,\ \sup_{t\in[0,T]}\bar X^n_t\big)$.

8.2.4 Application to Regular Barrier Options: Variance Reduction by Pre-conditioning

By regular barrier options we mean barrier options having a constant level as a barrier. An up-and-out Call is a typical example of such options, with a payoff given by

$$h_T=(X_T-K)_+\,\mathbf 1_{\{\sup_{t\in[0,T]}X_t\le L\}},$$

where $K$ denotes the strike price of the option and $L$ ($L>K$) its barrier.

In practice, the “Call” part is activated at $T$ only if the process $(X_t)$ does not hit the barrier $L$ between 0 and $T$. In fact, as far as simulation is concerned, this “Call part” can be replaced by any Borel function $f$ such that both $f(X_T)$ and $f(\bar X^n_T)$ are integrable (which is always true if $f$ has polynomial growth, owing to Proposition 7.2). Note that these so-called barrier options are in fact a sub-class of generalized maximum Lookback options, having the specificity that the maximum only shows up through an indicator function.

Then one may derive a general weighted formula for $\mathbb E\,\big[f(\bar X^n_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\big]$, which is an approximation of $\mathbb E\,\big[f(X_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}X_t\le L\}}\big]$.

Proposition 8.4 (Up-and-Out Call option)

$$\mathbb E\,\Big[f(\bar X^n_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\Big]=\mathbb E\,\Big[f(\bar X^n_T)\,\mathbf 1_{\{\max_{0\le k\le n}\bar X^n_{t^n_k}\le L\}}\prod_{k=0}^{n-1}\Big(1-e^{-\frac{2n}T\,\frac{(\bar X^n_{t^n_k}-L)(\bar X^n_{t^n_{k+1}}-L)}{\sigma^2(t^n_k,\bar X^n_{t^n_k})}}\Big)\Big].\tag{8.8}$$

Proof of Equation (8.8). This formula is a typical application of the pre-conditioning method described in Sect. 3.4. We start from the chaining rule for conditional expectations:

$$\mathbb E\,\big[f(\bar X^n_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\big]=\mathbb E\,\Big[\mathbb E\,\big(f(\bar X^n_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\mid \bar X^n_{t^n_k},\ k=0,\dots,n\big)\Big].$$

Then we use the conditional independence of the bridges of the genuine Euler scheme given the values $\bar X^n_{t^n_k}$, $k=0,\dots,n$, established in Proposition 8.2. It follows that

$$\begin{aligned}\mathbb E\,\big[f(\bar X^n_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\big]&=\mathbb E\,\Big[f(\bar X^n_T)\,\mathbb P\Big(\sup_{t\in[0,T]}\bar X^n_t\le L\,\Big|\,\bar X^n_{t^n_k},\ k=0,\dots,n\Big)\Big]\\&=\mathbb E\,\Big[f(\bar X^n_T)\prod_{k=0}^{n-1}G^{n,k}_{\bar X^n_{t^n_k},\bar X^n_{t^n_{k+1}}}(L)\Big]\\&=\mathbb E\,\Big[f(\bar X^n_T)\,\mathbf 1_{\{\max_{0\le k\le n}\bar X^n_{t^n_k}\le L\}}\prod_{k=0}^{n-1}\Big(1-e^{-\frac{2n}T\,\frac{(\bar X^n_{t^n_k}-L)(\bar X^n_{t^n_{k+1}}-L)}{\sigma^2(t^n_k,\bar X^n_{t^n_k})}}\Big)\Big].\ ♦\end{aligned}$$

Furthermore, we know that the random variable inside the expectation on the right-hand side always has a lower variance, since it is a conditional expectation of the random variable on the left-hand side, namely

$$\operatorname{Var}\Big(f(\bar X^n_T)\,\mathbf 1_{\{\max_{0\le k\le n}\bar X^n_{t^n_k}\le L\}}\prod_{k=0}^{n-1}\Big(1-e^{-\frac{2n}T\,\frac{(\bar X^n_{t^n_k}-L)(\bar X^n_{t^n_{k+1}}-L)}{\sigma^2(t^n_k,\bar X^n_{t^n_k})}}\Big)\Big)\le\operatorname{Var}\Big(f(\bar X^n_T)\,\mathbf 1_{\{\sup_{t\in[0,T]}\bar X^n_t\le L\}}\Big).$$
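A sketch of formula (8.8) at work, again for an illustrative Black–Scholes-type dynamics with parameters of our choosing: only the discrete-time Euler scheme is simulated, and the barrier indicator is replaced by the product of conditional survival probabilities:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, r, vol, T, K, L = 100.0, 0.05, 0.2, 1.0, 100.0, 130.0
n, M = 50, 20_000
dt = T / n

# discrete-time Euler scheme for dX = r X dt + vol X dW
x = np.empty((M, n + 1))
x[:, 0] = x0
for k in range(n):
    x[:, k + 1] = x[:, k] * (1.0 + r * dt + vol * rng.normal(0.0, np.sqrt(dt), M))

# weights of (8.8): conditional probability that each bridge stays below L, i.e.
# 1 - exp(-2n (x_k - L)(x_{k+1} - L)/(T sigma_k^2)), and 0 as soon as a knot exceeds L
sig2 = (vol * x[:, :-1]) ** 2
below = (x[:, :-1] < L) & (x[:, 1:] < L)
expo = -2.0 * (x[:, :-1] - L) * (x[:, 1:] - L) / (dt * sig2)
surv = np.where(below, 1.0 - np.exp(np.minimum(expo, 0.0)), 0.0)

payoff = np.maximum(x[:, -1] - K, 0.0) * surv.prod(axis=1)
price = np.exp(-r * T) * payoff.mean()
```

Each weight lies in $[0,1)$, which is exactly why this pre-conditioned estimator has a smaller variance than the crude indicator of the simulated supremum.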

⊳ Exercises. 1. Down-and-Out option. Show likewise that, for every Borel function $f\in L^1(\mathbb R,\mathbb P_{\bar X^n_T})$,

$$\mathbb E\,\Big[f(\bar X^n_T)\,\mathbf 1_{\{\inf_{t\in[0,T]}\bar X^n_t\ge L\}}\Big]=\mathbb E\,\Big[f(\bar X^n_T)\,\mathbf 1_{\{\min_{0\le k\le n}\bar X^n_{t^n_k}\ge L\}}\prod_{k=0}^{n-1}\Big(1-e^{-\frac{2n}T\,\frac{(\bar X^n_{t^n_k}-L)(\bar X^n_{t^n_{k+1}}-L)}{\sigma^2(t^n_k,\bar X^n_{t^n_k})}}\Big)\Big]\tag{8.9}$$

and that the expression inside the second expectation has a lower variance.
2. Extend the above results to barriers of the form $L(t):=e^{at+b}$, $a,b\in\mathbb R$.

Remark Formulas like (8.8) and (8.9) can be used to produce quantization-based
cubature formulas (see [260]).

8.2.5 Asian Style Options

The family of Asian options is related to payoffs of the form

$$h_T:=h\Big(\int_0^TX_s\,ds\Big),$$

where $h:\mathbb R_+\to\mathbb R_+$ is a non-negative Borel function. This class of options may benefit from a specific treatment to improve the rate of convergence of its time discretization. This may be viewed as a consequence of the continuity of the functional $f\mapsto\int_0^Tf(s)\,ds$ defined on $\big(L^1([0,T],dt),\|\cdot\|_1\big)$. What follows mostly comes from [187], where this problem has been extensively investigated.
⊳ Approximation phase: Let

$$X^x_t=x\,\exp(\mu t+\sigma W_t),\quad \mu=r-\frac{\sigma^2}2,\quad x>0.$$

Then

$$\int_0^TX^x_s\,ds=\sum_{k=0}^{n-1}\int_{t^n_k}^{t^n_{k+1}}X^x_s\,ds=\sum_{k=0}^{n-1}X^x_{t^n_k}\int_0^{T/n}\exp\big(\mu s+\sigma W^{(t^n_k)}_s\big)\,ds.$$

So we need to approximate the time integrals appearing on the right-hand side of the above equation. Let $B$ be a standard Brownian motion. It is clear, by a scaling argument already used in Sect. 7.2.2, that $\sup_{s\in[0,T/n]}|B_s|$ is proportional to $\sqrt{T/n}$ in $L^p$, $p\in[1,\infty)$. Although this is not true in the a.s. sense, owing to a missing $\log\log$ term coming from the LIL, we may write “almost” rigorously

$$\forall\,s\in[0,T/n],\quad \exp(\mu s+\sigma B_s)=1+\mu s+\sigma B_s+\frac{\sigma^2}2\,B_s^2+\text{“}O(n^{-\frac32})\text{”}.$$

Hence

$$\begin{aligned}\int_0^{T/n}\exp(\mu s+\sigma B_s)\,ds&=\frac Tn+\frac\mu2\,\frac{T^2}{n^2}+\sigma\int_0^{T/n}B_s\,ds+\frac{\sigma^2}2\,\frac{T^2}{2n^2}+\frac{\sigma^2}2\int_0^{T/n}\big(B_s^2-s\big)\,ds+\text{“}O(n^{-\frac52})\text{”}\\&=\frac Tn+\frac r2\,\frac{T^2}{n^2}+\sigma\int_0^{T/n}B_s\,ds+\frac{\sigma^2}2\Big(\frac Tn\Big)^2\int_0^1\big(\widetilde B_u^2-u\big)\,du+\text{“}O(n^{-\frac52})\text{”},\end{aligned}$$

where $\widetilde B_u=\sqrt{\frac nT}\,B_{\frac Tn u}$, $u\in[0,1]$, is a standard Brownian motion on the unit interval since, combining a scaling and a change of variable,

$$\int_0^{T/n}\big(B_s^2-s\big)\,ds=\Big(\frac Tn\Big)^2\int_0^1\big(\widetilde B_u^2-u\big)\,du.$$

Owing to the fact that the random variable $\int_0^1(\widetilde B_u^2-u)\,du$ is centered and that, when $B$ is replaced successively by $(W^{(t^n_k)}_t)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, the resulting random variables are independent, hence i.i.d., one can in fact consider that the contribution of this term is $O(n^{-\frac52})$. To be more precise, the random variable $\int_0^{T/n}\big((W^{(t^n_k)}_s)^2-s\big)\,ds$ is independent of $\mathcal F^W_{t^n_k}$, $k=0,\dots,n-1$, so that

$$\Big\|\sum_{k=0}^{n-1}X^x_{t^n_k}\int_0^{T/n}\big((W^{(t^n_k)}_s)^2-s\big)\,ds\Big\|_2^2=\sum_{k=0}^{n-1}\big\|X^x_{t^n_k}\big\|_2^2\,\Big\|\int_0^{T/n}\big((W^{(t^n_k)}_s)^2-s\big)\,ds\Big\|_2^2\le n\Big(\frac Tn\Big)^4\big\|X^x_T\big\|_2^2\,\Big\|\int_0^1\big(\widetilde B_u^2-u\big)\,du\Big\|_2^2$$

since $(X^x_t)^2$ is a sub-martingale. As a consequence,

$$\Big\|\sum_{k=0}^{n-1}X^x_{t^n_k}\int_0^{T/n}\big((W^{(t^n_k)}_s)^2-s\big)\,ds\Big\|_2=O(n^{-\frac32}),$$

which justifies “conventionally” considering the contribution of each of these $n$ terms to be $O(n^{-\frac52})$, i.e. negligible. This leads us to use the following approximation:

$$\int_0^{T/n}\exp\big(\mu s+\sigma W^{(t^n_k)}_s\big)\,ds\simeq I^n_k:=\frac Tn+\frac r2\,\frac{T^2}{n^2}+\sigma\int_0^{T/n}W^{(t^n_k)}_s\,ds.$$

⊳ Simulation phase: Now it follows from Proposition 8.2, applied to the Brownian motion (which is its own genuine Euler scheme), that the $n$-tuple of processes $\big(W^{(t^n_k)}_t\big)_{t\in[0,T/n]}$, $k=0,\dots,n-1$, are conditionally independent given $\sigma(W_{t^n_k},\,k=1,\dots,n)$, with conditional distribution given by

$$\mathcal L\big((W^{(t^n_\ell)}_t)_{t\in[0,T/n]}\mid W_{t^n_k}=w_k,\ k=1,\dots,n\big)=\mathcal L\big((W^{(t^n_\ell)}_t)_{t\in[0,T/n]}\mid W_{t^n_\ell}=w_\ell,\ W_{t^n_{\ell+1}}=w_{\ell+1}\big)=\mathcal L\Big(\Big(\frac{nt}T\,(w_{\ell+1}-w_\ell)+Y^{W,T/n}_t\Big)_{t\in[0,T/n]}\Big)$$

for every $\ell\in\{0,\dots,n-1\}$.

Consequently, the random variables $\int_0^{T/n}W^{(t^n_\ell)}_s\,ds$, $\ell=0,\dots,n-1$, are conditionally independent given $\sigma(W_{t^n_k},\,k=1,\dots,n)$, with a Gaussian conditional distribution $\mathcal N\big(m(W_{t^n_\ell},W_{t^n_{\ell+1}});\,\bar\sigma^2\big)$ having

$$m(w_\ell,w_{\ell+1})=\int_0^{T/n}\frac{nt}T\,(w_{\ell+1}-w_\ell)\,dt=\frac T{2n}\,(w_{\ell+1}-w_\ell)$$

(with $w_0=0$) as a mean and a (deterministic) variance $\bar\sigma^2=\mathbb E\,\Big(\displaystyle\int_0^{T/n}Y^{W,T/n}_s\,ds\Big)^2$. We can use the exercise below for a quick computation of this quantity based on stochastic calculus. The computation that follows is more elementary and relies on the distribution of the Brownian bridge between 0 and $\frac Tn$:

$$\begin{aligned}\mathbb E\,\Big(\int_0^{T/n}Y^{W,T/n}_s\,ds\Big)^2&=\int_{[0,T/n]^2}\mathbb E\,\big(Y^{W,T/n}_sY^{W,T/n}_t\big)\,ds\,dt\\&=\frac nT\int_{[0,T/n]^2}(s\wedge t)\Big(\frac Tn-(s\vee t)\Big)\,ds\,dt\\&=\Big(\frac Tn\Big)^3\int_{[0,1]^2}(u\wedge v)\big(1-(u\vee v)\big)\,du\,dv\\&=\frac1{12}\Big(\frac Tn\Big)^3.\end{aligned}$$

⊳ Exercise. Use stochastic calculus to show directly that

$$\mathbb E\,\Big(\int_0^TY^{W,T}_s\,ds\Big)^2=\mathbb E\,\Big(\int_0^TW_s\,ds-\frac T2\,W_T\Big)^2=\int_0^T\Big(\frac T2-s\Big)^2ds=\frac{T^3}{12}.$$

Now we are in a position to write a pseudo-code for a Monte Carlo simulation.

Pseudo-code for the pricing of an Asian option with payoff $h\big(\int_0^TX^x_s\,ds\big)$

for $m:=1$ to $M$

• Simulate the Brownian increments $\Delta W^{(m)}_{t^n_k}\stackrel d=\sqrt{\frac Tn}\,Z^{(m)}_k$, $k=1,\dots,n$, $Z^{(m)}_k$ i.i.d. with distribution $\mathcal N(0;1)$; set

– $w^{(m)}_k:=\sqrt{\frac Tn}\,\big(Z^{(m)}_1+\cdots+Z^{(m)}_k\big)$,

– $x^{(m)}_k:=x\,\exp\big(\mu t^n_k+\sigma w^{(m)}_k\big)$, $k=0,\dots,n$.

• Simulate independently $\zeta^{(m)}_k$, $k=1,\dots,n$, i.i.d. with distribution $\mathcal N(0;1)$ and set

$$I^{n,(m)}_k\stackrel d=\frac Tn+\frac r2\,\frac{T^2}{n^2}+\sigma\Big(\frac T{2n}\,\big(w^{(m)}_{k+1}-w^{(m)}_k\big)+\frac1{\sqrt{12}}\Big(\frac Tn\Big)^{\frac32}\zeta^{(m)}_{k+1}\Big),\quad k=0,\dots,n-1.$$

• Compute

$$h^{(m)}_T:=h\Big(\sum_{k=0}^{n-1}x^{(m)}_kI^{n,(m)}_k\Big).$$

end(m)

$$\text{Premium}\simeq e^{-rT}\,\frac1M\sum_{m=1}^Mh^{(m)}_T.$$
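The pseudo-code above can be sketched as follows (the Black–Scholes parameters and the fixed-strike Asian call payoff $h$ are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(7)
x0, r, vol, T, K = 100.0, 0.05, 0.2, 1.0, 100.0
n, M = 50, 20_000
dt, mu = T / n, r - 0.5 * vol ** 2

z = rng.normal(size=(M, n))
w = np.concatenate([np.zeros((M, 1)), np.sqrt(dt) * np.cumsum(z, axis=1)], axis=1)
tk = dt * np.arange(n + 1)
x = x0 * np.exp(mu * tk + vol * w)            # exact Black-Scholes values at the knots

# I_k = T/n + (r/2)(T/n)^2 + vol ((T/(2n))(w_{k+1}-w_k) + (1/sqrt(12))(T/n)^{3/2} zeta)
zeta = rng.normal(size=(M, n))
Ik = (dt + 0.5 * r * dt ** 2
      + vol * (0.5 * dt * (w[:, 1:] - w[:, :-1]) + dt ** 1.5 / np.sqrt(12.0) * zeta))
A_bar = np.sum(x[:, :-1] * Ik, axis=1)        # approximates the integral of X over [0, T]

payoff = np.maximum(A_bar / T - K, 0.0)       # illustrative fixed-strike Asian call h
price = np.exp(-r * T) * payoff.mean()
```

As a sanity check, the mean of `A_bar` should be close to $\mathbb E\int_0^T X^x_s\,ds=x\,(e^{rT}-1)/r$.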
⊳ Time discretization error estimates: Set

$$A_T=\int_0^TX^x_s\,ds\quad\text{and}\quad \bar A^n_T=\sum_{k=0}^{n-1}X^x_{t^n_k}I^n_k.$$

This scheme, which is clearly simulable but closely dependent on the Black–Scholes model in its present form, induces the following time discretization error, established in [187].

Proposition 8.5 (Time discretization error) For every $p\ge2$,

$$\|A_T-\bar A^n_T\|_p=O(n^{-\frac32}),$$

so that, if $g:\mathbb R^2\to\mathbb R$ is Lipschitz continuous, then

$$\|g(X^x_T,A_T)-g(X^x_T,\bar A^n_T)\|_p=O\big(n^{-\frac32}\big).$$

Remark. The main reason for not considering higher-order expansions of $\exp(\mu t+\sigma B_t)$ is that we are not able to simulate at a reasonable cost the triplet $\big(B_t,\int_0^tB_s\,ds,\int_0^tB_s^2\,ds\big)$, which is no longer a Gaussian vector, nor, consequently, the conditional distribution of $\big(\int_0^tB_s\,ds,\int_0^tB_s^2\,ds\big)$ given $B_t$.
Chapter 9
Biased Monte Carlo Simulation, Multilevel Paradigm

Warning: Even more than in other chapters, we recommend that the reader carefully read the “Practitioner's corner” sections to get more information on practical aspects in view of implementation. In particular, many structure parameters of the multilevel estimators are specified a priori in various tables, but these specifications admit variants which are discussed and detailed in the Practitioner's corner sections.

9.1 Introduction

In the first chapters of this monograph, we explored how to efficiently implement the Monte Carlo simulation method to compute various quantities of interest which can be represented as the expectation of a random variable. So far, we mostly focused on the numerous variance reduction techniques.
One major field of application of Monte Carlo simulation, in Quantitative Finance, but also in Physics, Biology and the Engineering Sciences, is to take advantage of the representation of solutions of various classes of PDEs as expectations of functionals of solutions of Stochastic Differential Equations (SDEs). The most famous (and simplest) example of such a situation is provided by the Feynman–Kac representation of the solution of the parabolic PDE

$$\frac{\partial u}{\partial t}+\mathcal Lu=0,\quad u(T,\cdot)=f,\quad\text{where}\quad \mathcal Lu(x)=\big(b\,|\,\nabla_xu\big)(x)+\frac12\operatorname{Tr}\big(a(x)D^2u(x)\big),$$

with $b:\mathbb R^d\to\mathbb R^d$ and $a:\mathbb R^d\to\mathcal M(d,\mathbb R)$, established in Theorem 7.11 of Chap. 7. Under appropriate assumptions on $a$, $b$ and $f$, a solution $u$ exists and we can represent $u$ by $u(t,x)=\mathbb E\,f(X^x_{T-t})$, where $(X^x_t)_{t\in[0,T]}$ is the Brownian diffusion, i.e. the unique solution to $dX^x_t=b(X^x_t)\,dt+\sigma(X^x_t)\,dW_t$, $X^x_0=x$, where $a(\xi)=\sigma\sigma^*(\xi)$. This connection can be extended to some specific path-dependent functionals of the diffusion, typically of the form $f\big(X^x_T,\sup_{t\in[0,T]}X^x_t\big)$, $f\big(X^x_T,\inf_{t\in[0,T]}X^x_t\big)$ or $f\big(X^x_T,\int_0^T\varphi(X^x_t)\,dt\big)$.

© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext, https://doi.org/10.1007/978-3-319-90276-0_9

Thus, in the latter case, one can introduce the 2-dimensional degenerate SDE satisfied by the pair $(X^x_t,Y_t)$, where $Y_t=\int_0^t\varphi(X^x_s)\,ds$:

$$dX_t=b(t,X_t)\,dt+\sigma(t,X_t)\,dW_t,\quad X_0=x,\qquad dY_t=\varphi(X_t)\,dt,\quad Y_0=0.$$

Then one can solve the “companion” parabolic PDE associated to its infinitesimal generator. See e.g. [2], where other examples in connection with the pricing and hedging of financial derivatives under a risk-neutral assumption are connected with the corresponding PDE derived by the Feynman–Kac formula (see Theorem 7.11), and where numerical methods based on schemes adapted to these PDEs are proposed. However, these PDE-based approaches are efficient only in low dimension, say $d\le3$. In higher dimensions, the Feynman–Kac representation is used in the reverse sense: one computes $u(0,x)=\mathbb E\,f(X^x_T)$ by a Monte Carlo simulation rather than trying to numerically solve the PDE. As far as derivative pricing and hedging are concerned, this means that PDE methods are efficient, i.e. they outperform Monte Carlo simulation, up to dimension $d=3$; beyond that, the Monte Carlo (or quasi-Monte Carlo) method has no credible alternative, except for very specific models. When dealing with risk estimation, where the structural dimension is always high, there are even fewer alternatives.
As we saw in Chap. 7, the exact simulation of such a diffusion process, at a fixed horizon $T$ or “through” functionals of its paths, is not possible in general. A noticeable exception is the purely one-dimensional setting ($d=q=1$ with our notations), where Beskos and Roberts proposed in [42] (see also [43]) such an exact procedure for vectors $(X_{t_1},\dots,X_{t_n})$, $t_k=\frac{kT}n$, $k=1,\dots,n$, based on an acceptance-rejection method involving a Poisson process. Unfortunately, this method admits no significant extension to higher dimensions.
More recently, various groups (first in [9, 18], then in [149] and [240]) developed a kind of “weak exact” simulation method for random variables of the form $f(X_T)$, in the following sense: they exhibit a simulable random variable $\Theta$ satisfying $\mathbb E\,\Theta=\mathbb E\,f(X_T)$ and $\operatorname{Var}(\Theta)<+\infty$. Beyond the assumptions made on $b$ and $\sigma$, the approach is essentially limited to this type of “vanilla” payoff (depending on the diffusion at one fixed time $T$) and seems, from this point of view, less flexible than the multilevel methods developed in what follows. These methods, as will be seen further
on, rely, in the case of diffusions, on the simulation of discretization schemes which
are intrinsically biased. However, we have at hand two kinds of information: the
strong convergence rate of such schemes toward the diffusion itself in the L p -spaces
(see the results for the Euler and Milstein schemes in Chap. 7) on the one hand and
some expansions of the weak error, i.e. of the bias on the other hand. We briefly saw
in Sect. 7.7 how to use such a weak error expansion to devise a Richardson–Romberg
extrapolation to (partially) kill this bias. We will first improve this approach by intro-
ducing the multistep Richardson–Romberg extrapolation, which takes advantage of
higher-order expansions of the weak error. Then we will present two avatars of the
multilevel paradigm. The multilevel paradigm introduced by M. Giles in [107] (see
also [148]) combines both weak error expansions and a strong approximation rate
to get rid of the bias as quickly as possible. We present two natural frameworks for

multilevel methods, the original one due to Giles, when a first order expansion of
the weak error is available, and one combined with multistep Richardson–Romberg
extrapolation, which takes advantage of a higher-order expansion.
The first section develops a convenient abstract framework for biased Monte
Carlo simulation which will allow us to present and analyze in a unified way various
approaches and improvements, starting from crude Monte Carlo simulation, standard
and multistep Richardson–Romberg extrapolation, the multilevel paradigm in both
its weighted and unweighted forms and, finally, the randomized multilevel methods.
On our way, we illustrate how this framework can be directly applied to time stepping
problems – mainly for Brownian diffusions at this stage – but also to optimize the
so-called Nested Monte Carlo method. The object of the Nested Monte Carlo method is to compute, by simulation, functions of conditional expectations, relying on the nested combination of inner and outer simulations. This is an important field
of investigation of modern simulation, especially in actuarial sciences to compute
quantities like the Solvency Capital Requirement (SCR) (see e.g. [53, 76]). Numerical
tests are proposed to highlight the global efficiency of these multilevel methods.
Several exercises will hopefully help to familiarize the reader with the basic principles
of multilevel simulation but also some recent more sophisticated improvements like
antithetic meta-schemes. At the very end of the chapter, we return to the weak exact
simulation and its natural connections with randomized multilevel simulation.

9.2 An Abstract Framework for Biased Monte Carlo Simulation

The quantity of interest is

I_0 = E Y_0,

where Y_0 : (Ω, A, P) → R is a square integrable real random variable whose simulation cannot be performed at a reasonable computational cost (or complexity). We
assume that, nevertheless, it can be approximated, in a sense made more precise later
by a family of random variables (Yh )h∈H as h → 0, defined on the same probability
space, where H ⊂ (0, +∞) is a set of bias parameters such that H ∪ {0} is a compact
set and

∀ n ∈ N^*,  H/n ⊂ H.
Note that H being nonempty, it contains a sequence hn → 0 so that 0 is a limiting
point of H. As far as approximation of Y0 by Yh is concerned, the least we can ask
is that the induced “weak error” converges to 0, i.e.
E Y_h → E Y_0  as h → 0.    (9.1)
However, stronger assumptions will be needed in the sequel, in terms of weak but also
strong (L 2 ) approximation. Our ability to specify the simulation complexity of Yh for
every h ∈ H will be crucial when selecting such a family for practical implementation.
Pending more precise specifications, we will simply assume that

Y_h can be simulated at a computational cost κ(h),

which we suppose to be “acceptable” by contrast with that of Y_0, where κ(h) is more
or less representative of the number of floating point operations performed to simulate
one (pseudo-)realization of Yh from (finitely many pseudo-)random numbers. We will
assume that h → κ(h) is a decreasing function of h and that lim h→0 κ(h) = +∞ in
order to model the fact that the closer Yh is to Y0 , the higher the simulation complexity
of Yh is, keeping in mind that Y0 cannot be simulated at a reasonable cost. It is often
convenient to specify h based on this complexity by assuming that κ(h) is inverse
linear in h, i.e. κ(h) = κ̄/h.
To make such an approximate simulation method acceptable, we need to control
at least the error induced by the computation of E Yh by a Monte Carlo simulation, i.e.
by considering (Ŷ_h)_M = (1/M) Σ_{k=1}^M Y_h^k, where (Y_h^k)_{k≥1} is a sequence of independent
copies of Yh . For that purpose, all standard ways to measure this error (quadratic
norm, CLT to design confidence intervals or LIL) rely on the existence of a variance
for Yh , that is, Yh ∈ L 2 . Furthermore, except in very specific situations, one further
natural constraint on the (Yh )h∈H is that (Var(Yh ))h∈H remains bounded as h → 0
in H but, in practice, this is not enough and we often require the more precise

Var(Y_h) → Var(Y_0)  as h → 0    (9.2)

or, equivalently, under (9.1) (convergence of expectations),

E Y_h^2 → E Y_0^2  as h → 0.

In some sense, using the term “approximation” to simply denote the convergence of the first two moments is a language abuse, but we will see that Y_0 is usually a functional F(X) of an underlying “structural stochastic process” X and Y_h is defined as the same functional of a discretization scheme, F(X̃^n), so that the above two convergences are usually the consequence of the weak convergence of X̃^n toward X in distribution with respect to the topology on the functional space of the paths of the processes X and X̃^n, typically the space D([0, T], R^d) of right continuous functions with left limits on [0, T].
Let us give two typical examples where this framework will apply.
 Examples.
1. Discretization scheme of a diffusion (Euler). Let X = (X t )t∈[0,T ] be a Brownian
diffusion with Lipschitz continuous drift b(t, x) and diffusion coefficient σ(t, x),
both defined on [0, T ] × Rd and having values in Rd and M(d, q, R), respectively,
associated to a q-dimensional Brownian motion W defined on a probability space
(Ω, A, P), independent of X_0. Let (X̃_t^n)_{t∈[0,T]} and (X̄_t^n)_{t∈[0,T]} denote the càdlàg stepwise constant and genuine (continuous) Euler schemes with step T/n, respectively, as defined in Chap. 7. Let F : D([0, T], R^d) → R be a (P_X-a.s.) continuous functional. Then, set

H = {T/n, n ∈ N^*},  h = T/n,  Y_h = F((X̃_t^n)_{t∈[0,T]}) and F((X̄_t^n)_{t∈[0,T]}),
respectively. Note that the complexity of the simulation is proportional to the number
n of time discretization steps of the scheme, i.e. κ(h) = κ̄/h.
2. Nested Monte Carlo. The aim of the so-called Nested Monte Carlo simulation is
to compute expectations involving a conditional expectation, i.e. of the form
 
I_0 = E Y_0  with  Y_0 = f(E(F(X, Z) | X)),

where X, Z are two independent random vectors defined on a probability space (Ω, A, P) having values in R^d and R^q, respectively, and F : R^d × R^q → R is a Borel
function such that F(X, Z ) ∈ L 1 (P). We assume that both X and Z can be simulated
at a reasonable computational cost. However, except in very specific situations, one
cannot directly simulate exact independent copies of Y0 due to the presence of a
conditional expectation. Taking advantage of the independence of X and Z , one has
 
Y_0 = f([E F(x, Z)]_{|x=X}).

This suggests to first simulate M independent copies X^m, m = 1, …, M, of X and, for each such copy, to estimate [E F(x, Z)]_{|x=X^m} by an inner Monte Carlo simulation of size K ∈ N^*,

(1/K) Σ_{k=1}^K F(X^m, Z_k^m),  where (Z_k^m)_{k=1,…,K} are i.i.d. copies of Z, independent of X,

finally leading us to consider the estimator

Î_{M,K} = (1/M) Σ_{m=1}^M f((1/K) Σ_{k=1}^K F(X^m, Z_k^m)).

The usual terminology used by practitioners to discriminate between the two
sources of simulations in the above naive nested estimator is the following: the
simulations of the random variables Z km indexed by k running from 1 to K for each
fixed m are called inner simulations whereas the simulations of the random variables
X m indexed by m running from 1 to M are called the outer simulations.
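As an illustration, the naive nested estimator Î_{M,K} can be sketched as follows. All model choices here are hypothetical, not from the text: f(y) = max(y, 0), F(x, z) = x + z with X, Z standard Gaussian, so that E(F(X, Z) | X) = X and I_0 = E max(X, 0) = 1/√(2π) ≈ 0.3989 is known in closed form.

```python
import random

def nested_estimator(M, K, f, F, sample_X, sample_Z, rng):
    """Naive nested Monte Carlo estimator: M outer simulations of X,
    K inner simulations of Z for each outer draw."""
    total = 0.0
    for _ in range(M):
        x = sample_X(rng)
        # inner Monte Carlo estimate of E F(x, Z)
        inner = sum(F(x, sample_Z(rng)) for _ in range(K)) / K
        total += f(inner)
    return total / M

rng = random.Random(42)
gauss = lambda r: r.gauss(0.0, 1.0)
est = nested_estimator(M=5000, K=50,
                       f=lambda y: max(y, 0.0),   # hypothetical outer function
                       F=lambda x, z: x + z,      # hypothetical inner payoff
                       sample_X=gauss, sample_Z=gauss, rng=rng)
# est is close to E max(X, 0) = 1/sqrt(2*pi), up to inner bias and MC noise
```

Note the bias: since f is convex here, Jensen's inequality makes E f(inner estimate) slightly overestimate I_0 for any finite K, consistently with the estimator being biased at finite range.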
This can be inserted in the above biased framework by setting, an integer K_0 ≥ 1 being fixed,

H = {1/(K_0 k), k ∈ N^*},  Y_0 = f([E F(x, Z)]_{|x=X})    (9.3)

and

Y_h = f((1/K) Σ_{k=1}^K F(X, Z_k)),  h = 1/K ∈ H,  (Z_k)_{k=1,…,K} i.i.d. copies of Z    (9.4)

independent of X.
If f is continuous with linear growth, it is clear by the SLLN under the above assumptions that Y_h → Y_0 in L^1, so that E Y_h → E Y_0. If, furthermore, F(X, Z) ∈ L^2(P), then Var(Y_h) → Var(Y_0) as h → 0 in H (i.e. as K → +∞). Consequently, Î_{M,K} converges toward its target I_0 = E f(E(F(X, Z) | X)) as M, K → +∞ in a proper way, but it is clearly a biased estimator at finite range.
Note that the complexity of the simulation of Y_h as defined above is proportional to the length K of this inner Monte Carlo simulation, i.e. again of the form κ(h) = κ̄/h.
Two main settings are extensively investigated in nested Monte Carlo simulation:
the case of smooth functions f (at least Hölder) on the one hand and that of indicator
functions of intervals on the other. This second setting is very important as it corresponds to the
computation of quantiles of a conditional expectation, and, in a second stage, of
quantities like Value-at-Risk which is the related inverse problem. Thus, it is a major
issue in actuarial sciences to compute the Solvency Capital Requirement (SCR) which
appears, technically speaking, exactly as a quantile of a conditional expectation. To
be more precise, this conditional expectation is the value of the available capital at a
maturity T (T = 30 years or even more) given a short term future (say 1 year). For a
first example of computation of SCR based on a nested Monte Carlo simulation, we
refer to [31]. See also [76].
Our aim in this chapter is to introduce methods which reduce the bias efficiently
while keeping the variance under control so as to satisfy a prescribed global quadratic
error with the smallest possible “budget”, that is, with the lowest possible complexity.
This leads us to model a Monte Carlo simulation as a constrained optimization
problem. This amounts to minimizing the (computational) cost of the simulation for
a prescribed quadratic error. We first develop this point of view for a crude Monte
Carlo simulation to make it clear and, more importantly, to have a reference. To this
end, we will also need to switch to more precise versions of (9.1) and (9.2), as will
be emphasized below. Other optimization approaches can be adopted which turn out
to be equivalent a posteriori.
9.3 Crude Monte Carlo Simulation
The starting point is the bias-variance decomposition of the error induced by using
independent copies (Yhk )k≥1 of Yh to approximate the quantity of interest I0 = E Y0
by the Monte Carlo estimator

M
 1
IM = Yhk ,
M k=1

namely

‖E Y_0 − Î_M‖_2^2 = (E Y_0 − E Y_h)^2 + Var(Y_h)/M,    (9.5)

where the first term is the squared bias and the second the Monte Carlo variance.

This decomposition is a straightforward consequence of the identity

E Y_0 − (1/M) Σ_{k=1}^M Y_h^k = (E Y_0 − E Y_h) + (E Y_h − (1/M) Σ_{k=1}^M Y_h^k).

The two terms are orthogonal in L^2 since E[E Y_h − (1/M) Σ_{k=1}^M Y_h^k] = 0.

Definition 9.1. The squared quadratic error ‖I_0 − Î_M‖_2^2 = E(I_0 − Î_M)^2 is called the Mean Squared Error (MSE) of the estimator, whereas the quadratic error ‖I_0 − Î_M‖_2 itself is called the Root Mean Squared Error (RMSE).

Our aim in what follows is to minimize the cost of a Monte Carlo simulation for
a prescribed (upper-bound of the) R M S E = ε > 0, assumed to be small, and in any
case in (0, 1] in non-asymptotic results. The idea is to make a balance between the
bias term and the variance term by an appropriate choice of the parameter h and the
size M of the simulation in order to achieve this prescribed error bound at a minimal
cost. To this end, we will strengthen our bias assumption (9.1) by assuming a weak
rate of convergence, or first-order weak error expansion,

(WE_1^α) ≡ E Y_h = E Y_0 + c_1 h^α + o(h^α)    (9.6)

where α > 0. We assume that this first-order expansion is consistent, i.e. that c_1 ≠ 0.
Remark. Note that under the assumptions of Theorem 7.8, (a) and (b), the above assumption holds true in the framework of diffusion approximation by the Euler scheme with α = 1 when Y_h = f(X̄_T^n) (with h = T/n). For path-dependent functionals, we saw in Theorem 8.1 that α drops down (at least) to α = 1/2 in situations involving the indicator function of an exit time, like for barrier options and the genuine Euler scheme.
As for the variance, we make no significant additional assumption at this stage, except that, if Var(Y_h) → Var(Y_0), we may reasonably assume that

σ̄^2 = sup_{h∈H} Var(Y_h) < +∞,

keeping in mind that, furthermore, we can replace σ̄^2 by Var(Y_0) in asymptotic results when h → 0.
Consequently, we can upper-bound the MSE of Î_M by

MSE(Î_M) ≤ c_1^2 h^{2α} + σ̄^2/M + o(h^{2α}).
On the other hand, we make the natural assumption (see the above examples) that
the simulation of (one copy of) Yh has complexity

κ(h) = κ̄/h.    (9.7)

We denote by Cost(Î_M) the global complexity – or computational cost – of the simulation of the estimator Î_M. It is clearly given by

Cost(Î_M) = M κ(h)

since a Monte Carlo estimator is linear. Consequently, our minimization problem reads

inf_{RMSE(Î_M)≤ε} Cost(Î_M) = inf_{RMSE(Î_M)≤ε} M κ(h).    (9.8)

If we replace MSE(Î_M) by its above upper-bound and neglect the term o(h^{2α}), this problem can be rewritten as the following approximate one:

inf {M κ̄/h : h ∈ H, M ∈ N^*, c_1^2 h^{2α} + σ̄^2/M ≤ ε^2}.    (9.9)
Now, note that

M κ̄/h = (M κ̄/h) · (σ̄^2/M)/(σ̄^2/M) = κ̄ σ̄^2 / (h (σ̄^2/M)),

where the MSE constraint reads σ̄^2/M ≤ ε^2 − c_1^2 h^{2α}. It is clear that, in order to minimize the above ratio, this constraint should be saturated, so that
inf {M κ̄/h : h ∈ H, M ∈ N^*, c_1^2 h^{2α} + σ̄^2/M ≤ ε^2} ≤ inf_{0<h<(ε/|c_1|)^{1/α}} κ̄ σ̄^2 / (h(ε^2 − c_1^2 h^{2α}))    (9.10)
 = κ̄ σ̄^2 / sup_{0<h<(ε/|c_1|)^{1/α}} h(ε^2 − c_1^2 h^{2α})
 = κ̄ σ̄^2 (1 + 2α)^{1+1/(2α)} |c_1|^{1/α} / (2α ε^{2+1/α}),    (9.11)

where the infimum in the right-hand side of (9.10) is in fact attained at

h_min(ε) = (ε / (√(1 + 2α) |c_1|))^{1/α}.

Then one sets

h^*(ε) = nearest lower neighbor of h_min(ε) in H,

which lies in H but is possibly slightly suboptimal. Thus, if H is of the form H = {h/n, n ≥ 1},

h^*(ε) = h ⌈h/h_min(ε)⌉^{−1}.    (9.12)
We derive the simulation size by plugging h_min(ε) into the saturated constraint c_1^2 h^{2α} + σ̄^2/M = ε^2, which yields after elementary computations

M = M^*(ε) = ⌈(1 + 1/(2α)) σ̄^2/ε^2⌉.
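These formulas translate directly into a small design helper. A minimal sketch (the values of α, c_1 and σ̄² are hypothetical inputs that would come from the model at hand, and H = {h/n, n ≥ 1} with root h):

```python
import math

def crude_mc_design(eps, alpha, c1, sigma2, h_root=1.0):
    """Optimal parameters of a crude biased MC simulation for RMSE <= eps:
    h_min(eps), its nearest lower neighbor h*(eps) in H = {h_root/n, n >= 1}
    as in (9.12), and the sample size M*(eps)."""
    h_min = (eps / (math.sqrt(1.0 + 2.0 * alpha) * abs(c1))) ** (1.0 / alpha)
    h_star = h_root / math.ceil(h_root / h_min)   # formula (9.12)
    M_star = math.ceil((1.0 + 1.0 / (2.0 * alpha)) * sigma2 / eps ** 2)
    return h_star, M_star

h_star, M_star = crude_mc_design(eps=0.01, alpha=1.0, c1=1.0, sigma2=1.0)
# the saturated constraint c1^2 h^{2 alpha} + sigma^2 / M <= eps^2 then holds
```

For ε = 0.01 and α = 1 this gives M* = 15000, illustrating the ε^{−(2+1/α)} = ε^{−3} overall cost growth once the cost κ̄/h* of each sample is factored in.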
Taking advantage of the fact that Var(Yh ) converges to Var(Y0 ), one may improve
the above minimal complexity as ε → 0 (and make more rigorous the above opti-
mization by taking into account the term o(h 2α ) that we neglected). In fact, one easily
checks that

inf_{RMSE(Î_M)≤ε} Cost(Î_M) ∼ κ̄ Var(Y_0) |c_1|^{1/α} ((1 + 2α)^{1+1/(2α)}/(2α)) (1/ε^{2+1/α})  as ε → 0.    (9.13)
This optimization procedure should be compared to a virtual unbiased simulation by averaging independent copies of Y_0 if the simulation cost κ_0 of Y_0 is finite. In such a case, MSE(Î_M) = Var(Y_0)/M, so that M^*(ε) = Var(Y_0)/ε^2 and the resulting complexity reads

inf_{RMSE(Î_M)≤ε} Cost = κ_0 M^*(ε) = κ_0 Var(Y_0)/ε^2  and  M^*(ε) = Var(Y_0)/ε^2.    (9.14)
The main conclusion of this introductory section is the following: the challenge of reducing (killing…) the bias is to switch from the exponent 2 + 1/α to an exponent as close to 2 as possible. This is quite an important question, as highlighted by the following simple illustration based on (9.11) or (9.13):

• To increase the accuracy of an unbiased simulation by a factor of 2, one needs a 4 times longer simulation.
• If the simulation is biased and (WE_1^α) holds with α = 1, one needs an 8 times longer simulation to increase the accuracy by a factor of 2.
• If the simulation is biased and (WE_1^α) holds with α = 1/2, one needs a 16 times longer simulation to increase the accuracy by a factor of 2.

9.4 Richardson–Romberg Extrapolation (II)

Richardson–Romberg extrapolation based on weak error expansion has already been
briefly introduced in the framework of diffusions and their discretization schemes in
Sect. 7.7.

9.4.1 General Framework

We assume that the weak error expansion of E Y_h in the (h^{αr})_{r≥1} scale holds at the second order, i.e.

(WE_2^α) ≡ E Y_h = E Y_0 + c_1 h^α + c_2 h^{2α} + o(h^{2α}),    (9.15)

and that (9.2) holds, i.e. Var(Y_h) → Var(Y_0), as well as σ̄^2 = sup_{h∈H} Var(Y_h) < +∞.
Remark. The implementation of Richardson–Romberg extrapolation only requires (WE_1^α). We assumed (WE_2^α) to assess its best possible performance. The existence of such an expansion (already tackled in the diffusion framework) will be discussed further on in Sect. 9.5.1, e.g. for nested Monte Carlo simulation.
One deduces by linearly combining (9.15) with h and h/2 that

2^α E Y_{h/2} − E Y_h = (2^α − 1) E Y_0 + (2^{−α} − 1) c_2 h^{2α} + o(h^{2α}).    (9.16)

This suggests a natural way to modify the estimator of I_0 = E Y_0 to reduce the bias: set

Ỹ_h = (2^α Y_{h/2} − Y_h)/(2^α − 1),  h ∈ H.
It is clear from (9.16) that (Ỹ_h)_{h∈H} satisfies (WE_1^{2α}). Without additional assumptions, Ỹ_h does not satisfy Assumption (9.2) (convergence of the variance); however, its variance remains bounded if that of Y_h is, since

Var(Ỹ_h) = σ(Ỹ_h)^2 ≤ (1/(2^α − 1)^2) (2^α σ(Y_{h/2}) + σ(Y_h))^2
 ≤ ((2^α + 1)/(2^α − 1))^2 σ̄^2    (9.17)

with the notations of the former section. Also note that the simulation cost κ̃(h) of
h satisfies κ̃(h) ≤ κ(h) + 2κ(h) = 3κ̄ ; note that for some applications like nested
Y h
Monte Carlo one may obtain a slower increase of the complexity since κ̃(h) =
κ(h) ∨ (2κ(h)) = 2hκ̄ . This leads us to introduce, for every h ∈ H, the Richardson–
Romberg estimator

Î_M^{RR} = Î_M^{RR}(h, M) = (1/M) Σ_{k=1}^M Ỹ_h^k = (1/(M(2^α − 1))) Σ_{k=1}^M (2^α Y_{h/2}^k − Y_h^k),  M ≥ 1,

where (Y_h^k, Y_{h/2}^k)_{k≥1} are independent copies of (Y_h, Y_{h/2}). Applying the results of the former section to the family (Ỹ_h)_{h∈H} with α̃ = 2α, κ̃(h), c̃_1 = −(1 − 2^{−α}) c_2 and the upper-bound (9.17) of σ(Ỹ_h), we deduce from (9.11) that

inf_{‖Î_M^{RR} − I_0‖_2 ≤ ε} Cost(Î_M^{RR}) ≤ ((1 + 4α)^{1+1/(4α)}/(4α)) · (3(2^α + 1)^2/(2^α − 1)^2) · [(1 − 2^{−α})|c_2|]^{1/(2α)} κ̄ σ̄^2 / ε^{2+1/(2α)}.

In view of the exponent 2 + 1/(2α) of ε (the prescribed RMSE) in the above minimal cost function, it turns out that we are halfway toward an unbiased simulation.
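A generic implementation of Î_M^{RR} only needs a routine returning one coupled pair (Y_h, Y_{h/2}). A minimal sketch — the deterministic toy pair below, with E Y_h = 1 + h + h², is a hypothetical model used only to check that the first-order bias term is killed:

```python
def rr_estimator(sample_pair, M, alpha=1.0):
    """Richardson-Romberg estimator:
    (1/(M (2^alpha - 1))) * sum_k (2^alpha * Y_{h/2}^k - Y_h^k)."""
    two_a = 2.0 ** alpha
    total = 0.0
    for _ in range(M):
        y_h, y_h2 = sample_pair()      # one coupled realization (Y_h, Y_{h/2})
        total += (two_a * y_h2 - y_h) / (two_a - 1.0)
    return total / M

h = 0.2
pair = lambda: (1.0 + h + h * h, 1.0 + h / 2 + (h / 2) ** 2)
est = rr_estimator(pair, M=10)
# the c1*h term cancels: est = 1 - h^2/2 exactly here, so the bias is now O(h^2)
```

In a real application, sample_pair would simulate the two Euler schemes with steps h and h/2 from the same Brownian path, or the two nested estimators from the same inner sample.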
However, the brute force upper-bound established for Var(Ỹ_h) is not satisfactory since it relies on the triangle inequality for the standard deviation, so we will try to improve it by investigating the internal structure of the family (Y_h)_{h∈H}. (In what follows, all lim_{h→0}, lim inf_{h→0}, lim sup_{h→0}, etc., should always be understood for h ∈ H.)
If we apply the reverse triangle inequality to Ỹ_h, we know that

Var(Ỹ_h) ≥ (1/(2^α − 1)^2) (2^α σ(Y_h) − σ(Y_{h/2}))^2

so that, under Assumption (9.2),

lim inf_{h→0} Var(Ỹ_h) ≥ ((2^α − 1)^2/(2^α − 1)^2) σ(Y_0)^2 = Var(Y_0).    (9.18)
On the other hand, by Schwarz's Inequality, Cov(Y_h, Y_{h/2}) ≤ √(Var(Y_h) Var(Y_{h/2})), which implies

lim sup_{h→0} Cov(Y_h, Y_{h/2}) ≤ Var(Y_0).    (9.19)

Bi-linearity of the covariance operator yields

Var(2^α Y_{h/2} − Y_h) = 2^{2α} Var(Y_{h/2}) − 2^{1+α} Cov(Y_h, Y_{h/2}) + Var(Y_h)

so that

lim_{h→0} Var(Ỹ_h) = ((2^{2α} + 1)/(2^α − 1)^2) Var(Y_0) − (2^{1+α}/(2^α − 1)^2) lim_{h→0} Cov(Y_h, Y_{h/2}).

Combining the lower bound (9.18) and the upper bound (9.19) with the former equality shows that:

lim_{h→0} Var(Ỹ_h) = Var(Y_0)  iff  lim sup_{h→0} Var(Ỹ_h) ≤ Var(Y_0)
 iff  lim inf_{h→0} Cov(Y_h, Y_{h/2}) ≥ Var(Y_0)
 iff  lim_{h→0} Cov(Y_h, Y_{h/2}) = Var(Y_0).

As a consequence, the best choice, if possible, is to select a family (Y_h)_{h∈H} of approximators satisfying

lim_{h→0} Cov(Y_h, Y_{h/2}) = Var(Y_0).

But, then

‖Y_h − Y_{h/2}‖_2^2 = Var(Y_h − Y_{h/2}) + (E(Y_h − Y_{h/2}))^2
 = Var(Y_h) − 2 Cov(Y_h, Y_{h/2}) + Var(Y_{h/2}) + (E(Y_h − Y_{h/2}))^2
 → Var(Y_0)(1 − 2 + 1) + 0^2 = 0  as h → 0 in H,

i.e.

Y_h − Y_{h/2} → 0 in L^2  as h → 0 in H.    (9.20)

This strongly suggests, in order to better control the variance of the estimator, to consider a family (Y_h)_{h∈H} which strongly converges toward Y_0 in quadratic norm (which in turn implies that ‖Y_h − Y_{h/2}‖_2 → 0), that is

Y_h → Y_0 in L^2  as h → 0 in H.    (9.21)
Warning (the temptation of laziness)! It should be emphasized that the lazy approach consisting in considering Y_h and Y_{h/2} to be independent leads to

Var(Ỹ_h) = (2^{2α} Var(Y_h) + Var(Y_{h/2}))/(2^α − 1)^2 → ((2^{2α} + 1)/(2^α − 1)^2) Var(Y_0)  as h → 0.

Thus the variance of Ỹ_h is approximately multiplied by (2^{2α} + 1)/(2^α − 1)^2 compared to Var(Y_h) when h is small! This corresponds to a factor of 5 when α = 1 and a factor of 3/(3 − 2√2) ≈ 17.5 when α = 1/2. This huge increase of the variance becomes a major obstacle to the implementation of the method, except possibly when ε is very small. In any case, this choice is never competitive with that of a coherent family of approximators (Y_h)_{h∈H}.
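The variance penalty of the lazy choice is easy to observe on a toy model (hypothetical: Y_h = Y_0 + √h·G with G standard Gaussian, Var(Y_0) = 1, α = 1). The coupled estimator's variance stays near Var(Y_0) = 1, while the independent ("lazy") one is close to 5 Var(Y_0):

```python
import random

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
h, alpha, n_samples = 0.01, 1.0, 20000
two_a = 2.0 ** alpha

coupled, lazy = [], []
for _ in range(n_samples):
    y0 = rng.gauss(0.0, 1.0)                          # shared limit variable Y_0
    y_h = y0 + h ** 0.5 * rng.gauss(0.0, 1.0)         # coupled Y_h
    y_h2 = y0 + (h / 2) ** 0.5 * rng.gauss(0.0, 1.0)  # coupled Y_{h/2}
    coupled.append((two_a * y_h2 - y_h) / (two_a - 1.0))
    # lazy version: Y_h and Y_{h/2} built from independent copies of Y_0
    z_h = rng.gauss(0.0, 1.0) + h ** 0.5 * rng.gauss(0.0, 1.0)
    z_h2 = rng.gauss(0.0, 1.0) + (h / 2) ** 0.5 * rng.gauss(0.0, 1.0)
    lazy.append((two_a * z_h2 - z_h) / (two_a - 1.0))

var_coupled, var_lazy = sample_variance(coupled), sample_variance(lazy)
```

Here the exact variances are 1 + 3h for the coupled version and 5 + 3h for the lazy one, matching the factor (2^{2α} + 1)/(2^α − 1)^2 = 5 at α = 1.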
 Examples. 1. In the framework of Euler discretization schemes for Brownian diffusions, when Y_h = F((X̃_t^n)_{t∈[0,T]}) or Y_h = F((X̄_t^n)_{t∈[0,T]}) with F a ‖·‖_sup-Lipschitz continuous functional, the strong convergence assumption (9.21) amounts to the convergence in L^2 of sup_{t∈[0,T]} |X_t − X̃_t^n| or sup_{t∈[0,T]} |X_t − X̄_t^n| to 0, established in Chap. 7 (Theorem 7.2).

2. In the nested Monte Carlo framework as described by (9.3) and (9.4) at the begin-
ning of the chapter, (9.21) is an easy consequence of the SLLN and Fubini’s Theorem
when the function f is Lipschitz continuous and F(X, Z ) ∈ L 2 .
 Exercise. Prove the statement about the nested Monte Carlo method in the preceding example. Extend the result to the case where f is ρ-Hölder under appropriate assumptions on F(X, Z).

As a conclusion to this section, let us mention that it may happen that (9.20) holds while (9.21) fails. This is possible if the rate of convergence in (9.20) is not fast enough. Such a situation is observed with the (weak) approximation of the geometric Brownian motion by a binomial tree (see [32]).

9.4.2 ℵ Practitioner’s Corner

Brownian diffusions
Let X = (X_t)_{t∈[0,T]} be the Brownian diffusion solution to the SDE with Lipschitz continuous drift b(t, x) and diffusion coefficient σ(t, x), driven by a q-dimensional Brownian motion W on a probability space (Ω, A, P) (with X_0 independent of W). The Euler scheme with step h = T/n – so that H = {T/n, n ≥ 1} – reads:

X̄^n_{t^n_{k+1}} = X̄^n_{t^n_k} + (T/n) b(t^n_k, X̄^n_{t^n_k}) + √(T/n) σ(t^n_k, X̄^n_{t^n_k}) U^n_{k+1},  X̄^n_0 = X_0,
where t^n_k = kT/n and U^n_k = √(n/T) (W_{t^n_k} − W_{t^n_{k−1}}), k = 1, …, n. We now consider the Euler scheme with step T/(2n), designed with the same Brownian motion. We have

X̄^{2n}_{t^{2n}_{k+1}} = X̄^{2n}_{t^{2n}_k} + (T/(2n)) b(t^{2n}_k, X̄^{2n}_{t^{2n}_k}) + √(T/(2n)) σ(t^{2n}_k, X̄^{2n}_{t^{2n}_k}) U^{2n}_{k+1},  X̄^{2n}_0 = X_0,

where t^{2n}_k = kT/(2n) and U^{2n}_k = √(2n/T) (W_{t^{2n}_k} − W_{t^{2n}_{k−1}}), k = 1, …, 2n. Hence, it is clear by their very definition that

U^n_k = (U^{2n}_{2k−1} + U^{2n}_{2k})/√2,  k = 1, …, n.    (9.22)
The joint simulation of these two schemes can be simply performed as follows:
• First simulate 2n independent pseudo-random numbers U^{2n}_k, k = 1, …, 2n, with distribution N(0; I_q);
• then compute the U^n_k, k = 1, …, n, using (9.22).
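A minimal sketch of this joint simulation of the normalized increments (one-dimensional, q = 1, for readability):

```python
import math
import random

def coupled_increments(n, rng):
    """Simulate the 2n fine normalized increments U^{2n} ~ N(0, 1) and aggregate
    them via (9.22): U^n_k = (U^{2n}_{2k-1} + U^{2n}_{2k}) / sqrt(2), k = 1..n."""
    u_fine = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
    u_coarse = [(u_fine[2 * k] + u_fine[2 * k + 1]) / math.sqrt(2.0)
                for k in range(n)]
    return u_fine, u_coarse

rng = random.Random(1)
u_fine, u_coarse = coupled_increments(4, rng)
# both families then drive the two Euler schemes: the Brownian increment over
# [t^n_k, t^n_{k+1}] is sqrt(T/n) * U^n_{k+1} = sqrt(T/(2n)) * (U^{2n}_{2k+1} + U^{2n}_{2k+2})
```

Each coarse increment is a deterministic function of the two fine ones it spans, so the two schemes share the same underlying Brownian path by construction.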
An alternative method is to:
• first simulate the Brownian increments of the coarse scheme with step T/n,
• then use the recursive simulation of the Brownian motion to simulate the increments of the refined scheme with step T/(2n). To be precise, a straightforward application of the Brownian bridge method (see Proposition 8.1, applied with W^{(T/n)}_t = W_{T/n+t} − W_{T/n} between 0 and T/n) yields that

L(W_{(2k+1)T/(2n)} − W_{kT/n} | W_{kT/n} = x, W_{(k+1)T/n} = y) = N((y − x)/2; T/(4n))

so that

L(W_{(2k+1)T/(2n)} | W_{kT/n} = x, W_{(k+1)T/n} = y) = N((x + y)/2; T/(4n)).
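This conditional law gives a one-line midpoint sampler (sketch, one-dimensional): given the coarse Brownian values x and y at times kT/n and (k+1)T/n, the refined value at the midpoint is drawn from N((x + y)/2; T/(4n)).

```python
import math
import random

def bridge_midpoint(x, y, T, n, rng):
    """Sample W_{(2k+1)T/(2n)} given W_{kT/n} = x and W_{(k+1)T/n} = y,
    i.e. one draw from N((x + y)/2; T/(4n))."""
    return 0.5 * (x + y) + math.sqrt(T / (4.0 * n)) * rng.gauss(0.0, 1.0)

rng = random.Random(3)
draws = [bridge_midpoint(0.0, 1.0, T=1.0, n=1, rng=rng) for _ in range(40000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
# empirically mean ~ (0 + 1)/2 = 0.5 and var ~ T/(4n) = 0.25
```

Applied interval by interval, this refines a coarse Brownian path into a path on the grid of step T/(2n) consistently with the already simulated coarse values.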
Note that the joint simulation of (functionals of) two genuine (continuous) Euler schemes with respective steps T/n and T/(2n) cannot be performed in as elementary a way: the joint simulation of the diffusion bridge method for both schemes remains, to our knowledge, an open problem.
Nested Monte Carlo
– Weak error expansion ( f smooth). In the nested Monte Carlo framework defined
in (9.4) and (9.3), the question of the existence of a first- or second-order weak
error expansion, with α = 1, remains reasonably elementary when the function f is
smooth enough. Thus, if f is five times differentiable with bounded existing partial derivatives and F(X, Z) ∈ L^5, then (WE_2^α) holds with α = 1. This is a special case of an expansion at order R established in [198].
– Weak error expansion (f indicator function). When f = 1_{[a,+∞)} or, more generally, any indicator function of an interval, establishing the existence of a first-order weak error expansion is a much more involved task, first achieved in [128] to our knowledge. It relies on smoothness assumptions on the joint law of (Y_0, Y_h). In [113], a weak error expansion at order R ≥ 2, that is, (WE_R^α), still with α = 1, is established by a duality method (see also Sect. 9.6.2 for a more precise statement).
– Complexity. Temporarily assume Kh = 1 – i.e. K_0 = 1 – for simplicity. Given the form of Y_h = f((1/K) Σ_{k=1}^K F(X, Z_k)), (Z_k)_{k≥1} i.i.d., the expression of the complexity of Ỹ_h differs from the generic case and can be slightly improved (if the computation of one value of the function f is neglected), since one just needs to simulate (F(X, Z_k))_{k=1,…,2K} to simulate both Y_h and Y_{h/2}. Consequently, the complexity of the computation of Ỹ_h in this framework is

κ̃(h) = κ̄ 2K = 2κ̄/h  (rather than 3κ̄/h).

9.4.3 Going Further in Killing the Bias: The Multistep Approach

In this section, we extend the above Richardson–Romberg method by introducing a
multistep approach based on a higher-order extension of the weak error E Yh − E Y0 .
The technique described in this section should be understood as a foundation for the
multilevel methods rather than a direct approach since it will be outperformed by
the weighted multilevel methods. In particular, the weights introduced in this section
will be used in the next one devoted to the multilevel paradigm.
Throughout this section, we assume that the following holds for some integer R ≥ 2:

(WE_R^α) ≡ E Y_h = E Y_0 + Σ_{r=1}^R c_r h^{αr} + o(h^{αR}).    (9.23)
As for all Taylor-like expansions of this type, the coefficients cr are unique and
do not depend on R ≥ r .
We also make the strong convergence assumption (9.21) (Yh → Y0 in L 2 ) as well,
though it is not mandatory in this setting.
Definition 9.2. An increasing R-tuple n = (n 1 , . . . , n R ) of positive integers satis-
fying 1 = n 1 < n 2 < · · · < n R is called a vector of refiners. For every h ∈ H, the
resulting sub-family of approximators is denoted by Y_{n,h} := (Y_{h/n_i})_{i=1,…,R}.

The driving idea is, a vector n of refiners being fixed, to extend the Richardson–Romberg estimator by searching for a linear combination of the components of E Y_{n,h} which kills the resulting bias up to order R − 1, relying on the expansion (WE_R^α), and
then deriving the multistep estimator accordingly. To determine this linear combina-
tion, whose coefficients hopefully will not depend on h ∈ H, we consider a generic
R-tuple of weights w^R = (w_1^R, …, w_R^R) ∈ R^R.
To alleviate notation, we will drop the superscript R and write w instead of w^R when no ambiguity arises. Idem for the components w_i. Then, owing to (WE_R^α),

Σ_{j=1}^R w_j E Y_{h/n_j} = (Σ_{j=1}^R w_j) E Y_0 + Σ_{j=1}^R w_j Σ_{r=1}^R c_r h^{αr}/n_j^{αr} + o(h^{αR})
 = (Σ_{j=1}^R w_j) E Y_0 + Σ_{r=1}^R c_r h^{αr} [Σ_{j=1}^R w_j/n_j^{αr}] + o(h^{αR}),    (9.24)
where we interchanged the two sums in the second line. If w is a solution to the
system

Σ_{j=1}^R w_j = 1,
Σ_{j=1}^R w_j/n_j^{αr} = 0,  r = 1, …, R − 1,

or, equivalently,

Σ_{j=1}^R w_j/n_j^{α(r−1)} = 0^{r−1},  r = 1, …, R,    (9.25)

then (9.24) reads


Σ_{j=1}^R w_j E Y_{h/n_j} = E Y_0 + W̃_{R+1}^{(R)} c_R h^{αR} + o(h^{αR}),

where W̃_{R+1}^{(R)} is defined by
W̃_{R+1}^{(R)} = Σ_{j=1}^R w_j/n_j^{αR}.    (9.26)

Like for w, we will often write W̃_{R+1} instead of W̃_{R+1}^{(R)} in what follows. The linear system (9.25) is obviously of Vandermonde type, namely

Vand(x_1, …, x_R) w = [c^{i−1}]_{i=1:R},

where the Vandermonde matrix attached to an R-tuple (x_1, …, x_R) ∈ (0, +∞)^R is defined by Vand(x_1, …, x_R) = [x_j^{i−1}]_{i,j=1:R} and c ∈ R. In our case, x_i = n_i^{−α}, i = 1, …, R, and c = 0. As a consequence, Cramer's formulas yield
w_i = det[Vand(n_1^{−α}, …, n_{i−1}^{−α}, 0, n_{i+1}^{−α}, …, n_R^{−α})] / det[Vand(n_1^{−α}, …, n_i^{−α}, …, n_R^{−α})],  i = 1, …, R.

As a consequence, the weight vector solution to (9.25) has a closed form since it is classical background that

∀ x_1, …, x_R > 0,  det[Vand(x_1, …, x_i, …, x_R)] = Π_{1≤i<j≤R} (x_j − x_i).

Synthetic formulas are given in the following proposition for w and W̃_{R+1}.

Proposition 9.1 (a) For every fixed integer R ≥ 2, the weight vector w admits a closed form given by

∀ i ∈ {1, …, R},  w_i = n_i^{α(R−1)} / Π_{j≠i} (n_i^α − n_j^α) = (−1)^{R−i} Π_{j≠i} 1/|1 − (n_j/n_i)^α|.    (9.27)

(b) Furthermore,

W̃_{R+1} = (−1)^{R−1}/(n!)^α,  where n! = Π_{i=1}^R n_i.    (9.28)

Proof. (a) Temporarily set x_i = n_i^{−α}, i = 1, …, R. The above Cramer formula reads, after canceling the terms not containing the index i which appear simultaneously in both products at the top and at the bottom of the ratio,

w_i = (Π_{1≤k<i} (x_k − 0) × Π_{i<ℓ≤R} (0 − x_ℓ)) / (Π_{1≤k<i} (x_k − x_i) × Π_{i<ℓ≤R} (x_i − x_ℓ))
 = Π_{1≤k≤R, k≠i} x_k/(x_k − x_i) = x_i^{−(R−1)} / Π_{k≠i} (x_i^{−1} − x_k^{−1})
 = n_i^{α(R−1)} / Π_{k≠i} (n_i^α − n_k^α) = (−1)^{R−i} n_i^{α(R−1)} / Π_{k≠i} |n_i^α − n_k^α|,
where we used in the last line that the refiners are increasing.
Furthermore, setting X = 0 in the elementary decomposition of the rational fraction

1/Π_{i=1}^R (X − 1/x_i) = Σ_{i=1}^R 1/(Π_{k≠i} (1/x_i − 1/x_k) · (X − 1/x_i))

yields the elementary identity

(−1)^{R−1} Π_{i=1}^R x_i = Σ_{i=1}^R x_i / Π_{k≠i} (1/x_i − 1/x_k) = Σ_{i=1}^R x_i^R Π_{k≠i} x_k / Π_{k≠i} (x_k − x_i) = Σ_{i=1}^R x_i^R w_i.

Replacing x_i by its value n_i^{−α} for every index i leads to the announced result since, by the very definition (9.26) of W̃_{R+1},

W̃_{R+1} = Σ_{i=1}^R w_i/n_i^{αR} = Σ_{i=1}^R x_i^R w_i = (−1)^{R−1} Π_{i=1}^R x_i = (−1)^{R−1}/(n!)^α.  ♦

As a straightforward consequence of the former proposition, we get, with the weights given by (9.27),

Σ_{i=1}^R w_i E Y_{h/n_i} = E Y_0 + (−1)^{R−1} (c_R/(n!)^α) h^{αR} + o(h^{αR}).    (9.29)

Remark. The main point to be noticed is the universality of these weights, which do
not depend on h ∈ H, but only on the chosen vector of refiners n and the exponent α
of the weak error expansion.
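These weights are straightforward to tabulate from the closed form (9.27) and to check against the Vandermonde conditions (9.25) and the value (9.28) of W̃_{R+1}; a short sketch:

```python
def multistep_weights(refiners, alpha):
    """Closed-form weights (9.27):
    w_i = n_i^{alpha (R-1)} / prod_{j != i} (n_i^alpha - n_j^alpha)."""
    R = len(refiners)
    w = []
    for i, ni in enumerate(refiners):
        denom = 1.0
        for j, nj in enumerate(refiners):
            if j != i:
                denom *= ni ** alpha - nj ** alpha
        w.append(ni ** (alpha * (R - 1)) / denom)
    return w

n, alpha = [1, 2, 3, 4], 1.0
R = len(n)
w = multistep_weights(n, alpha)
# conditions (9.25): sum_j w_j = 1 and sum_j w_j / n_j^{alpha r} = 0, r = 1..R-1
moments = [sum(wj / nj ** (alpha * r) for wj, nj in zip(w, n)) for r in range(R)]
# residual coefficient (9.26): equals (-1)^{R-1} / (n_1 ... n_R)^alpha by (9.28)
W_next = sum(wj / nj ** (alpha * R) for wj, nj in zip(w, n))
```

For n = (1, 2, 3, 4) and α = 1 this reproduces the weights of (9.32) with R = 4, and W_next = −1/(n!)^α = −1/24.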
These computations naturally suggest the following definition for a family of
multistep estimators of depth R.
Definition 9.3. (Multistep estimator) The family of multistep estimators of depth R
associated to the vector of refiners n is defined for every h ∈ H and every simulation
size M ≥ 1 by

Î_M^{RR} = Î_M^{RR}(R, n, h, M) = (1/M) Σ_{k=1}^M Σ_{i=1}^R w_i Y_{h/n_i}^k    (9.30)
 = Σ_{i=1}^R w_i (1/M) Σ_{k=1}^M Y_{h/n_i}^k,    (9.31)

where (Y_{n,h}^k)_{k≥1} are independent copies of Y_{n,h} = (Y_{h/n_i})_{i=1,…,R} and the weight vector w is given by (9.27).
The parameter R is called the depth of the estimator.
Remark. If R = 2 and n = (1, 2) the multistep estimator is just the regular
Richardson–Romberg estimator introduced in the previous section.
 Exercises. 1. Weights of interest. Keep in mind that the formulas below, especially
those in (b), will be extensively used in what follows.
(a) Show that, if n i = i and α = 1, then, the depth R ≥ 2 being fixed,
w_i = (−1)^{R−i} i^R/(i!(R − i)!),  i = 1, …, R,  and  W̃_{R+1} = (−1)^{R−1}/R!.    (9.32)
(b) Show that if n_i = N^{i−1} (N integer, N ≥ 2) and α ∈ (0, +∞), then

w_i = (−1)^{R−i} Π_{1≤j≤R, j≠i} 1/|1 − N^{−α(i−j)}|
 = (−1)^{R−i} N^{−(α/2)(R−i)(R−i+1)} / (Π_{j=1}^{i−1} (1 − N^{−αj}) Π_{j=1}^{R−i} (1 − N^{−αj}))    (9.33)

and

W̃_{R+1} = (−1)^{R−1}/N^{αR(R−1)/2}.
2. When c_1 = 0. Assume that (WE_R^α) holds with c_1 = 0:

(WE_R^α) ≡ E Y_h = E Y_0 + Σ_{r=2}^R c_r h^{αr} + o(h^{αR}).

Prove the existence of, and determine, an (R − 1)-tuple of weights (w_1^†, …, w_{R−1}^†) and a coefficient W̃_R^† such that

Σ_{i=1}^{R−1} w_i^† E Y_{h/n_i} = E Y_0 + W̃_R^† c_R h^{αR} + o(h^{αR}).

[Hint: One can proceed either by mimicking the general case or by an appropriate
re-scaling of the regular weights at order R − 1.]

Let us briefly analyze the basic properties of this estimator.


– Bias. The preceding shows that, if (WE_R^α) holds, then the bias of Î_M^{RR} is independent of M and reads

E Î_M^{RR} − E Y_0 = (−1)^{R−1} c_R h^{αR}/(n!)^α + o(h^{αR}).

– Complexity. The (unitary) complexity (i.e. for M = 1) is given by

κ_{RR}(h) = κ̄ (n_1/h + ⋯ + n_r/h + ⋯ + n_R/h) = κ̄ |n|_1/h,

where |n|_1 = n_1 + ⋯ + n_R is the ℓ^1-norm of the refiner vector n.
Note that we neglect the multiplication by the weights in this evaluation. The
first reason is that they are pre-computed off-line, the second is that the computation
of Î_M^{RR} defined in (9.30) is performed in practice via (9.31), which only requires R multiplications.
– Variance. As expected,

Var(Î_M^{RR}) = Var(Σ_{1≤i≤R} w_i Y_{h/n_i})/M.
 Exercises. 1. Analysis of the multistep Richardson–Romberg estimator. Let I_0 = E Y_0. Show, by mimicking the analysis of the crude Monte Carlo simulation (carried out under assumption (WE_2^α)), that if H = {h/n, n ≥ 1},
– (WE_R^α) holds and
– ‖Y_h − Y_0‖_2 → 0 as h → 0 in H,
then the multistep Richardson–Romberg estimator of I_0 satisfies

inf_{h∈H, ‖Î_M^{RR} − I_0‖_2 < ε} Cost(Î_M^{RR}) ∼ ((1 + 2αR)^{1+1/(2αR)}/(2αR)) |c_R|^{1/(αR)} κ̄ Var(Σ_{i=1}^R w_i Y_{h/n_i}) (|n|_1/(n!)^{1/R}) (1/ε^{2+1/(αR)})
 ∼ ((1 + 2αR)^{1+1/(2αR)}/(2αR)) |c_R|^{1/(αR)} κ̄ Var(Y_0) (|n|_1/(n!)^{1/R}) (1/ε^{2+1/(αR)})  as ε → 0,

where the optimal parameters h^*(ε) and M^*(ε) (simulation size) are given by

h^*(ε) = h ⌈h/h_opt(ε)⌉^{−1} ∈ H  with  h_opt(ε) = ε^{1/(αR)} / ((1 + 2αR)^{1/(2αR)} |c_R|^{1/(αR)})

and

M^*(ε) = ⌈(1 + 1/(2αR)) Var(Y_{h^*(ε)})/ε^2⌉.

(As defined, h^*(ε) is the nearest lower neighbor of h_opt(ε) in H.)
2. Show that

(a) $\dfrac{|n|_1}{(n!)^{1/R}} \ge R$.

(b) If $n_i = i$, $i=1,\dots,R$, then $|n|_1 = \dfrac{R(R+1)}{2}$ and $(n!)^{\frac1R} = (R!)^{\frac1R} \sim \dfrac{R}{e}$ as $R\uparrow\infty$, so that $\dfrac{|n|_1}{(n!)^{1/R}} \sim \dfrac{e(R+1)}{2}$.

(c) If $n_i = N^{i-1}$ ($N\in\mathbb N$, $N\ge2$), then $\dfrac{|n|_1}{(n!)^{1/R}} \sim \dfrac{N}{N-1}\,N^{\frac{R-1}{2}}$.

(d) Show that the choice $n_i = i$, $i=1,\dots,R$, for the refiners may not be optimal to minimize $\dfrac{|n|_1}{(n!)^{1/R}}$ [Hint: test $n_i = i+1$, $i = 2,\dots,R$].
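A quick numerical experiment (a Python sketch of ours, not from the book) illustrates points (a)–(c): the ratio $|n|_1/(n!)^{1/R}$ grows roughly like $e(R+1)/2$ for $n_i = i$ and like $N^{(R-1)/2}$ for $n_i = N^{i-1}$.

```python
from math import prod

def ratio(n):
    """|n|_1 / (n!)^(1/R) for a refiner vector n = (n_1, ..., n_R)."""
    R = len(n)
    return sum(n) / prod(n) ** (1.0 / R)

# ratio for linear refiners n_i = i and geometric refiners n_i = 2^(i-1)
linear = [ratio(list(range(1, R + 1))) for R in (2, 5, 10)]
geometric = [ratio([2 ** i for i in range(R)]) for R in (2, 5, 10)]
```

By the arithmetic–geometric mean inequality, `ratio` is always at least $R$, which is point (a).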
The conclusions that can be drawn from the above exercises are somewhat contradictory concerning the asymptotic cost of the Multistep estimator $\big(\widehat I_M^{RR}\big)_{M\ge1}$ needed to achieve an $RMSE$ of $\varepsilon$:

• At first glance, it does fill the gap between the crude Monte Carlo rate – $\varepsilon^{-2-\frac1\alpha}$ – and the virtual unbiased Monte Carlo simulation – $\varepsilon^{-2}$ – since $\frac{1}{\alpha R}\to0$ as $R$ grows to infinity.

• When looking more carefully at the formula, when implemented with explicit refiners (see Exercise 1), it appears (see Exercise 2) that this asymptotic cost contains the factor $\frac{|n|_1}{(n!)^{1/R}}$, which increases at least linearly in $R$ as $R$ grows to infinity. This suggests that increasing the depth of the multistep simulation will strongly impact the complexity of the global simulation, making it of little interest when $\varepsilon$ is not dramatically small. This is confirmed by numerical experiments.

The multilevel paradigm described below will allow us to overcome this problem.

ℵ Practitioner's corner

Much of the information provided in this section will be useful in Sect. 9.5.1, which follows, devoted to weighted multilevel methods.

▶ The refiners and the weights. As suggested in the preceding Exercise 2, the choice $n_i = i$ is close to optimal for minimizing $\frac{|n|_1}{(n!)^{1/R}}$. The other, somewhat hidden, condition needed to obtain the announced asymptotic behavior of the complexity as $\varepsilon\to0$ is the strong convergence assumption $Y_h\to Y_0$ in $L^2$.
▶ Two examples of weights.

(a) $\alpha = 1$, $n_i = i$, $i = 1,\dots,R$:

$R=2:\ w_1^{(R)} = -1,\quad w_2^{(R)} = 2.$

$R=3:\ w_1^{(R)} = \dfrac12,\quad w_2^{(R)} = -4,\quad w_3^{(R)} = \dfrac92.$

$R=4:\ w_1^{(R)} = -\dfrac16,\quad w_2^{(R)} = 4,\quad w_3^{(R)} = -\dfrac{27}{2},\quad w_4^{(R)} = \dfrac{32}{3}.$

(b) $\alpha = \frac12$, $n_i = i$, $i = 1,\dots,R$:

$R=2:\ w_1^{(R)} = -(1+\sqrt2),\quad w_2^{(R)} = \sqrt2\,(1+\sqrt2).$

$R=3:\ w_1^{(R)} = \dfrac{\sqrt3-\sqrt2}{2\sqrt2-\sqrt3-1},\quad w_2^{(R)} = -2\,\dfrac{\sqrt3-1}{2\sqrt2-\sqrt3-1},\quad w_3^{(R)} = 3\,\dfrac{\sqrt2-1}{2\sqrt2-\sqrt3-1}.$

$R=4:\ w_1^{(R)} = -\dfrac{(1+\sqrt2)(1+\sqrt3)}{2},\quad w_2^{(R)} = \sqrt2\,(4+3\sqrt2)(\sqrt3+\sqrt2),$
$\qquad w_3^{(R)} = -\dfrac32(\sqrt3+\sqrt2)(2+\sqrt3)(3+\sqrt3),\quad w_4^{(R)} = 4(2+\sqrt2)(2+\sqrt3).$
▶ Nested Monte Carlo. Once again, given the form of the approximator $Y_h$, the complexity of the procedure is significantly smaller in this case than announced in the general framework. In fact,

$\kappa^{RR}(h) = \bar\kappa\,\dfrac{n_R}{h}$

so that, revisiting Exercise 1,

$\displaystyle\inf_{h\in\mathcal H,\ \|\widehat I_M^{RR}-I_0\|_2<\varepsilon}\mathrm{Cost}\big(\widehat I_M^{RR}\big) \sim |c_R|^{\frac{1}{\alpha R}}\,\frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,\mathrm{Var}(Y_0)\,\frac{n_R}{(n!)^{\frac1R}}\,\frac{1}{\varepsilon^{2+\frac{1}{\alpha R}}}$ as $\varepsilon\to0$.

What should be noticed is that, when $n_i = i$, $i=1,\dots,R$, the ratio $\frac{|n|_1}{(n!)^{1/R}}$ (which grows at least like $R$) is replaced in that framework by $\frac{n_R}{(n!)^{1/R}} \sim e$ as $R\uparrow+\infty$.

▶ Brownian diffusions. Among the practical aspects to be dealt with, the most important one is undoubtedly to simulate (independent copies of) the vector $Y_{n,h}$ when $Y_0 = F(X)$, $X = (X_t)_{t\in[0,T]}$ is a Brownian diffusion and $Y_h = F(\bar X^n)$, where $F$ is a functional defined on $\mathbb D([0,T],\mathbb R^d)$ and $\bar X^n = (\bar X^n_t)_{t\in[0,T]}$ is the (càdlàg) stepwise constant Euler (or Milstein) scheme with step $\frac Tn$ of the underlying $SDE$, especially when $n = (1,\dots,R)$.

As for standard Richardson–Romberg extrapolation, one has the choice to simulate the Brownian increments in a consistent way, as required, either by using an abacus starting from the most refined scheme or by a recursive simulation starting from the coarsest increment. The structure of the refiners ($n_i = i$, $i=1,\dots,R$) makes the task significantly more involved than for the standard Romberg extrapolation.

We refer to [225] for an algorithm adapted to these refined schemes which saves time and complexity in the simulation of these Brownian increments.

9.5 The Multilevel Paradigm

The multilevel paradigm was originally introduced by M. Giles in [107] (see also [148]) in a framework based on a first-order expansion of the weak error. This original approach is developed in Sect. 9.5.2 but, to ensure continuity with the previous section devoted to multistep estimators, we have made the choice to first introduce the weighted version of this multilevel paradigm as developed in [198]. It relies on higher-order expansions $(WE)^\alpha_R$ of the weak error. In a second step, we will discuss Giles' regular version, which only requires $(WE)^\alpha_1$, and compare the respective performances of these two classes of estimators.

Throughout this section, unless explicitly mentioned (e.g. in exercises), the refiners have a geometric structure, namely

$n_i = N^{i-1},\quad i = 1,\dots,R.$  (9.34)

Nevertheless, we will still use the synthetic notation $n_i$.

We will also assume, though it is not mandatory, that the parameter set $\mathcal H$ is of the form

$\mathcal H = \Big\{\dfrac{h}{n},\ n\ge1\Big\}$ where $h\in(0,+\infty)$.

9.5.1 Weighted Multilevel Setting

The basic principle of the multilevel paradigm is to split the Monte Carlo simulation into two parts: a first part, made of a coarse level based on an estimator of $Y_h = Y_{\frac{h}{n_1}}$ with a not too small $h$, that will do most of the job of computing $I_0 = \mathbb E\,Y_0$, with a bias $\mathbb E\,Y_h$; and a second part, made of several correcting refined levels relying on increments $Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}$ to correct the former bias. These increments, being small – since both $Y_{\frac{h}{n_i}}$ and $Y_{\frac{h}{n_{i-1}}}$ are close to $Y_0$ – have small variance and need smaller simulated samples to perform the expected bias correction. By combining these increments in an appropriate way, one can almost "kill" the bias while keeping the variance and the complexity of the simulation under control.

Step 1 (Killing the bias). Let us assume

$(WE)^\alpha_R \equiv \mathbb E\,Y_h = \mathbb E\,Y_0 + \displaystyle\sum_{r=1}^R c_r h^{\alpha r} + o\big(h^{\alpha R}\big).$  (9.35)

Then, starting again from the weighted combination (9.24) of the elements of $Y_{n,h}$ with the weight vector $w$ given by (9.27), one gets by an Abel transform

$\displaystyle\sum_{i=1}^R w_i\,\mathbb E\,Y_{\frac{h}{n_i}} = W_1\,\mathbb E\,Y_h + \sum_{i=2}^R W_i\big(\mathbb E\,Y_{\frac{h}{n_i}} - \mathbb E\,Y_{\frac{h}{n_{i-1}}}\big)$

$\displaystyle\qquad = W_1\,\mathbb E\,Y_h + \sum_{i=2}^R W_i\,\mathbb E\big(Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}\big)$

$\displaystyle\qquad = \mathbb E\Big[\underbrace{Y_h^{(1)}}_{\text{coarse level}} + \sum_{i=2}^R W_i\underbrace{\big(Y^{(i)}_{\frac{h}{n_i}} - Y^{(i)}_{\frac{h}{n_{i-1}}}\big)}_{\text{refined level } i\ge2}\Big]$  (9.36)
where

$W_i = W_i^{(R)} = w_i + \cdots + w_R,\quad i = 1,\dots,R,$  (9.37)

(note that $W_1 = w_1 + \cdots + w_R = 1$) and, in the last line, the families $\big(Y^{(i)}_{n,h}\big)_{i=1,\dots,R}$ are independent copies of $Y_{n,h}$. This last point may seem meaningless at this stage but, in doing so, we draw the reader's attention to the future variance computations (1).

As the weights satisfy the Vandermonde system (9.25), we derive from (9.36) that

$\mathbb E\,Y_h^{(1)} + \displaystyle\sum_{i=2}^R W_i\,\mathbb E\big(Y^{(i)}_{\frac{h}{n_i}} - Y^{(i)}_{\frac{h}{n_{i-1}}}\big) = \mathbb E\,Y_0 + \widetilde W_{R+1}\,c_R h^{\alpha R} + o\big(h^{\alpha R}\big),$  (9.38)

where $\widetilde W_{R+1} = \dfrac{(-1)^{R-1}}{N^{\frac{\alpha}{2}R(R-1)}}$ since $n_i = N^{i-1}$.

Step 2 (Multilevel estimator). As already noted, $\mathbb E\,Y_h^{(1)} \simeq I_0$ and $Y_h^{(1)}$ is close to (a copy of) $Y_0$, so that it has no reason to be small, whereas $Y^{(i)}_{\frac{h}{n_i}} - Y^{(i)}_{\frac{h}{n_{i-1}}} \simeq 0$ since both quantities are supposed to be close to (copies of) $Y_0$. The underlying idea used to devise the (weighted) multilevel estimator is to calibrate the $R$ different sizes $M_i$ of the sample paths assigned to each level $i = 1,\dots,R$ so that $M_1 + \cdots + M_R = M$. This leads to a first definition of Multilevel Richardson–Romberg estimators, or weighted Multilevel estimators, attached to $(Y_h)_{h\in\mathcal H}$.
They are defined, for every $h\in\mathcal H$ and every $M\in\mathbb N$, $M\ge R$, where $M_1,\dots,M_R\in\mathbb N^*$ and $M_1+\cdots+M_R = M$, by

$\widehat I_M^{ML2R,M_1,\dots,M_R} = \dfrac{1}{M_1}\displaystyle\sum_{k=1}^{M_1} Y_h^{(1),k} + \sum_{i=2}^R \dfrac{W_i}{M_i}\sum_{k=1}^{M_i}\big(Y^{(i),k}_{\frac{h}{n_i}} - Y^{(i),k}_{\frac{h}{n_{i-1}}}\big),$

where the $\big(Y^{(i),k}_{n,h}\big)$, $i=1,\dots,R$, $k\ge1$, are i.i.d. copies of $Y_{n,h} := \big(Y_{\frac{h}{n_i}}\big)_{i=1,\dots,R}$ and the weight vector $(W_i)_{i=1,\dots,R}$ is given by (9.37).

However, in view of the optimization of the $M_i$, it is more convenient to introduce the formal framework of stratification: we re-write the above multilevel estimator as a stratified estimator, i.e. we set $q_i = \frac{M_i}{M}$, $i=1,\dots,R$, so that

$\widehat I_M^{ML2R} := \dfrac1M\Big[\dfrac{1}{q_1}\displaystyle\sum_{k=1}^{M_1} Y_h^{(1),k} + \sum_{i=2}^R \dfrac{W_i}{q_i}\sum_{k=1}^{M_i}\big(Y^{(i),k}_{\frac{h}{n_i}} - Y^{(i),k}_{\frac{h}{n_{i-1}}}\big)\Big].$

1 Other choices for the correlation structure of the families $Y^{(i)}_{n,h}$ could a priori be considered, e.g. some negative correlations between successive levels, but this would cause huge simulation problems since the control of the correlation between two families seems difficult to monitor a priori.
Conversely, if

$q = (q_i)_{1\le i\le R} \in \mathcal S_R := \Big\{(u_i)_{1\le i\le R}\in(0,1)^R : \displaystyle\sum_{1\le i\le R} u_i = 1\Big\}$

(so $\mathcal S_R$ denotes here the "open" simplex of $[0,1]^R$), the above estimator is well defined as soon as $M \ge \frac{1}{\min_i q_i}\ (\ge R)$ by setting $M_i = \lfloor q_i M\rfloor \ge 1$, $i=1,\dots,R$. Then $M_1+\cdots+M_R$ is not equal to $M$ but lies in $\{M-R+1,\dots,M\}$. This leads us to the following slightly different formal definition.

Definition 9.4. (Weighted Multilevel estimators) The family of weighted – or Richardson–Romberg – multilevel (ML2R) estimators attached to $(Y_h)_{h\in\mathcal H}$ is defined as follows: for every $h\in\mathcal H$, every $q = (q_i)_{1\le i\le R}\in\mathcal S_R$ and every integer $M\ge1$,

$\widehat I_M^{ML2R} = \widehat I_M^{ML2R}(q,h,R,n) := \dfrac{1}{M_1}\displaystyle\sum_{k=1}^{M_1} Y_h^{(1),k} + \sum_{i=2}^R \dfrac{W_i}{M_i}\sum_{k=1}^{M_i}\big(Y^{(i),k}_{\frac{h}{n_i}} - Y^{(i),k}_{\frac{h}{n_{i-1}}}\big),$  (9.39)

(with the convention $\frac10\sum_{k=1}^0 = 0$) where $M_i = \lfloor q_i M\rfloor$, $i=1,\dots,R$, the $\big(Y^{(i),k}_{n,h}\big)$, $i=1,\dots,R$, $k\ge1$, are i.i.d. copies of $Y_{n,h} := \big(Y_{\frac{h}{n_i}}\big)_{i=1,\dots,R}$ and the weight vector $(W_i)_{i=1,\dots,R}$ is given by (9.37).

Note that, as soon as $M \ge M(q) := \big(\min_i q_i\big)^{-1}$, all the $M_i = \lfloor q_i M\rfloor \ge 1$, $i = 1,\dots,R$. This condition on $M$ will be implicitly assumed in what follows, so that no level is empty.

Remark. The short notation $\widehat I_M^{ML2R}$ is convenient but clearly abusive, since these estimators depend on the whole set of parameters $(h,q,n,R)$ (coarse bias parameter, vector of path allocations across the levels, vector of refiners and depth). Note that, owing to our parametrization of $n$, we could replace $n$ by its root $N$ in the list of parameters of the estimator.

We are now in a position to evaluate the basic characteristics of this estimator.
▶ Bias. For every $M \ge M(q) := \frac{1}{\min_i q_i}$, each $M_i\ge1$, so that the estimator satisfies, owing to (9.38),

$\mathrm{Bias}\big(\widehat I_M^{ML2R}\big) = \mathrm{Bias}\big(\widehat I_1^{RR}\big) = (-1)^{R-1} c_R\Big(\dfrac{h^R}{n!}\Big)^{\alpha} + o\Big(\Big(\dfrac{h^R}{n!}\Big)^{\alpha}\Big).$  (9.40)

Of course, by linearity, this bias does not depend on $M$.


▶ Variance. Taking advantage of the mutual independence of the families $\big(Y^{(i)}_{n,h}\big)_{i=1,\dots,R}$ across the levels, one straightforwardly computes the variance of $\widehat I_M^{ML2R}$: for every $M\ge M(q)$, we have

$\mathrm{Var}\big(\widehat I_M^{ML2R}\big) = \dfrac{\mathrm{Var}\big(Y_h^{(1)}\big)}{M_1} + \displaystyle\sum_{i=2}^R W_i^2\,\dfrac{\mathrm{Var}\big(Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}\big)}{M_i}$

$\qquad = \dfrac1M\Big(\dfrac{\mathrm{Var}(Y_h)}{q_1} + \displaystyle\sum_{i=2}^R W_i^2\,\dfrac{\mathrm{Var}\big(Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}\big)}{q_i}\Big)$  (9.41)

$\qquad = \dfrac1M\displaystyle\sum_{i=1}^R \dfrac{\sigma_W^2(i,h)}{q_i},$  (9.42)

where we set

$\sigma_W^2(1,h) = \mathrm{Var}(Y_h)$ and $\sigma_W^2(i,h) = W_i^2\,\mathrm{Var}\big(Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}\big),\quad i = 2,\dots,R.$

▶ Complexity. The complexity of such an estimator is a priori given – or at least dominated – by

$\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) = \bar\kappa\Big(\dfrac{M_1}{h} + \displaystyle\sum_{i=2}^R M_i\Big(\dfrac{n_i}{h} + \dfrac{n_{i-1}}{h}\Big)\Big)$

$\qquad \le \dfrac{\bar\kappa M}{h}\Big(q_1 + \displaystyle\sum_{i=2}^R q_i(n_i+n_{i-1})\Big)$

$\qquad = \dfrac{\bar\kappa M}{h}\Big(q_1 + \big(1+N^{-1}\big)\displaystyle\sum_{i=2}^R q_i n_i\Big),$  (9.43)

where we used that $M_i = \lfloor Mq_i\rfloor \le Mq_i$ and the specified form $n_i = N^{i-1}$, $i=1,\dots,R$ (see (9.34)), of the refiners in the penultimate line (and below). If we set

$\kappa_1 = 1,\qquad \kappa_i = \big(1+N^{-1}\big)n_i,\quad i = 2,\dots,R,$

then, if we neglect the error induced by replacing $M_i = \lfloor Mq_i\rfloor$ by $Mq_i$, we may set

$\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) = \dfrac{\bar\kappa M}{h}\displaystyle\sum_{i=1}^R \kappa_i q_i.$  (9.44)

▶ Exercise. Show that in the case of nested Monte Carlo, where $Y_h = f\big(\frac1K\sum_{k=1}^K F(X,Z_k)\big)$, the complexity reads (with the same approximation $M_i = Mq_i$ as above)

$\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) = \bar\kappa\Big(\dfrac{M_1}{h} + \displaystyle\sum_{i=2}^R M_i\,\dfrac{n_i}{h}\Big) = \dfrac{\bar\kappa M}{h}\Big(q_1 + \displaystyle\sum_{i=2}^R q_i n_i\Big)$  (9.45)

so that, for such a nested simulation, one may set $\kappa_i = n_i$, $i = 1,\dots,R$.
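To fix ideas, here is a minimal, self-contained sketch (Python; the toy model and all names are ours, not the book's) of the estimator (9.39) in the nested Monte Carlo setting. We take $Y_h = \exp(m_K)$, with $m_K$ the empirical mean of $K = 1/h$ i.i.d. $\mathcal N(0,1)$ inner samples, so that $I_0 = \mathbb E\,Y_0 = 1$ and the weak error expansion holds with $\alpha = 1$. The levels are coupled by letting the coarse mean reuse a subset of the fine inner samples, and the $W_i$ are the cumulative sums (9.37) of the $w_i$ of (9.33); the allocation $M_i = M/N^{i-1}$ is a crude stand-in for the optimized $q^*$.

```python
import numpy as np
from math import prod

def cumulative_weights(R, N, alpha):
    """W_i = w_i + ... + w_R (9.37), with w_i from (9.33)."""
    w = [(-1) ** (R - i) / prod(abs(1.0 - N ** (-alpha * (i - j)))
                                for j in range(1, R + 1) if j != i)
         for i in range(1, R + 1)]
    return np.cumsum(w[::-1])[::-1]          # W_1 = 1 by construction

def ml2r_nested(h, R, N, M, rng):
    """Weighted multilevel estimator (9.39), crude allocation M_i = M / N**(i-1)."""
    W = cumulative_weights(R, N, alpha=1.0)
    K1 = int(round(1.0 / h))                 # inner sample size of the coarse level
    est = np.exp(rng.standard_normal((M, K1)).mean(axis=1)).mean()
    for i in range(2, R + 1):
        Kf, Kc = K1 * N ** (i - 1), K1 * N ** (i - 2)
        Mi = max(M // N ** (i - 1), 1)       # fewer paths on refined levels
        z = rng.standard_normal((Mi, Kf))
        fine = np.exp(z.mean(axis=1))
        coarse = np.exp(z[:, :Kc].mean(axis=1))   # coupled: same inner samples
        est += W[i - 1] * (fine - coarse).mean()
    return est
```

With $h = 1/10$, $R = 3$, $N = 2$, the crude coarse estimator alone carries a bias of about $e^{h/2}-1\approx 0.05$, while the weighted multilevel combination returns a value close to $I_0 = 1$.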


Step 3 (Keeping the variance and the complexity under control). As for the analysis of the crude Monte Carlo and multistep estimators, the aim at this stage is to minimize the simulation complexity (or cost, for short) of the whole simulation for a prescribed $RMSE$ $\varepsilon>0$. This amounts to solving, at least approximately, the following optimization problem:

$\displaystyle\inf_{RMSE(\widehat I_M^{ML2R})\le\varepsilon}\mathrm{Cost}\big(\widehat I_M^{ML2R}\big).$  (9.46)

The same formal manipulations as in the multistep framework show that

$\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) = \dfrac{\mathrm{Cost}\big(\widehat I_M^{ML2R}\big)\,\mathrm{Var}\big(\widehat I_M^{ML2R}\big)}{\mathrm{Var}\big(\widehat I_M^{ML2R}\big)} = \dfrac{\mathrm{Effort}\big(\widehat I_M^{ML2R}\big)}{\mathrm{Var}\big(\widehat I_M^{ML2R}\big)}.$  (9.47)

Plugging the respective expressions (9.42) and (9.44) of the variance and the complexity into their product yields the following expression for the effort (after simplifying by $M$):

$\mathrm{Effort}\big(\widehat I_M^{ML2R}\big) = \mathrm{Cost}\big(\widehat I_M^{ML2R}\big)\,\mathrm{Var}\big(\widehat I_M^{ML2R}\big) = \dfrac{\bar\kappa}{h}\Big(\displaystyle\sum_{i=1}^R \kappa_i q_i\Big)\Big(\displaystyle\sum_{i=1}^R \dfrac{\sigma_W^2(i,h)}{q_i}\Big)$  (9.48)

so that the effort does not depend upon the size $M$ of the simulation, in the sense that

$\mathrm{Effort}\big(\widehat I_M^{ML2R}\big) = \mathrm{Cost}\big(\widehat I_{M_q}^{ML2R}\big)\,\mathrm{Var}\big(\widehat I_{M_q}^{ML2R}\big) = \mathrm{Effort}\big(\widehat I_{M_q}^{ML2R}\big),$

where $M_q = \lceil 1/\min_i q_i\rceil$. Such a property is universal among Monte Carlo estimators.
The bias–variance decomposition of the weighted multilevel estimator $\widehat I_M^{ML2R}$ and the fact that the $RMSE$ constraint should be saturated to maximize the denominator of the ratio in (9.47) allow us to reformulate our minimization problem as

$\displaystyle\inf_{\|\widehat I_M^{ML2R}-I_0\|_2\le\varepsilon}\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) = \inf_{|\mathrm{Bias}(\widehat I_M^{ML2R})|<\varepsilon}\dfrac{\mathrm{Effort}\big(\widehat I_M^{ML2R}\big)}{\varepsilon^2 - \mathrm{Bias}\big(\widehat I_M^{ML2R}\big)^2}$  (9.49)

keeping in mind that the right-hand side does not depend on $M$ when $M\ge M_q$.

This reformulation suggests considering the above optimization problem as a function of two of the three free parameters, namely $q$ and $h$, the depth $R$ being fixed (the root $N$ has a special status). Note that the right-hand side of (9.49) seemingly no longer depends on the simulation size $M$. In fact, this size is determined further on in (9.53) as a function of the $RMSE$ $\varepsilon$ and the optimized parameters $q^*(\varepsilon)$ and $h^*(\varepsilon)$. We propose a sub-optimal solution for (9.49), divided into two steps:

– first, minimizing the effort for a fixed $h\in\mathcal H$, i.e. the numerator of the above ratio,
– then, minimizing the cost in the bias parameter $h\in\mathcal H$, that is, maximizing the denominator of the ratio (and plugging the resulting optimized $h$ into the numerator).

In doing so, it is clear that we will only get sub-optimal solutions to the original optimization problem since the numerator depends on $h$. Nevertheless, the computations become tractable and lead to closed forms for our optimized parameters $q$ and $h$.
▶ Minimizing the effort.

This phase follows from the elementary lemma below, a straightforward application of the Schwarz Inequality and its equality case (see [52]).

Lemma 9.1. Let $a_i>0$, $b_i>0$ and $q\in\mathcal S_R$. Then

$\Big(\displaystyle\sum_{i=1}^R \dfrac{a_i}{q_i}\Big)\Big(\displaystyle\sum_{i=1}^R b_i q_i\Big) \ge \Big(\displaystyle\sum_{i=1}^R \sqrt{a_i b_i}\Big)^2$

and equality holds if and only if $q_i = \dfrac{\sqrt{a_i b_i^{-1}}}{q^\dagger}$, $i = 1,\dots,R$, with $q^\dagger = q^\dagger(a,b) = \displaystyle\sum_{i=1}^R \sqrt{a_i b_i^{-1}}$.
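Lemma 9.1 is elementary to check numerically. The sketch below (Python; illustrative only) compares the allocation $q_i \propto \sqrt{a_i/b_i}$ of the lemma with arbitrary allocations on the simplex:

```python
import numpy as np

def product_form(q, a, b):
    """(sum a_i/q_i) * (sum b_i q_i) from Lemma 9.1."""
    return (a / q).sum() * (b * q).sum()

def optimal_q(a, b):
    """Equality case of Lemma 9.1: q_i proportional to sqrt(a_i / b_i)."""
    q = np.sqrt(a / b)
    return q / q.sum()

rng = np.random.default_rng(1)
a, b = rng.uniform(0.5, 2.0, 6), rng.uniform(0.5, 2.0, 6)
lower_bound = np.sqrt(a * b).sum() ** 2
q_star = optimal_q(a, b)
```

Any other $q$ drawn from the simplex yields a strictly larger value of the product, which is exactly the mechanism exploited in (9.50) below.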

Applying this to (9.48), we derive the solution to the effort minimization problem:

$\mathrm{argmin}_{q\in\mathcal S_R}\,\mathrm{Effort}\big(\widehat I_1^{ML2R}\big) = q^* = \dfrac{1}{q^{*,\dagger}(h)}\Big(\dfrac{\sigma_W(i,h)}{\sqrt{\kappa_i}}\Big)_{i=1,\dots,R}$  (9.50)

with $q^{*,\dagger}(h) = \displaystyle\sum_{1\le i\le R}\dfrac{\sigma_W(i,h)}{\sqrt{\kappa_i}}$. The resulting minimal effort reads

$\min_{q\in\mathcal S_R}\mathrm{Effort}\big(\widehat I_1^{ML2R}\big) = \dfrac{\bar\kappa}{h}\Big(\displaystyle\sum_{i=1}^R \sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2.$

However, this formal approach cannot be implemented as such in practice since the true values of the $\sigma_W(i,h)$ are usually unknown and should be replaced by an upper bound (see (9.58) and (9.61) further on).
▶ Minimizing the resulting cost.

If we now compile our results on the above effort minimization, the cost formulation (9.44) and the bias expansion (9.40), the global optimization problem $\inf_{RMSE(\widehat I_M^{ML2R})\le\varepsilon}\mathrm{Cost}\big(\widehat I_M^{ML2R}\big)$ is "dominated", if we neglect the second-order term in the bias, by

$\displaystyle\inf_{h\in\mathcal H:\ c_R^2\big(\frac{h^R}{n!}\big)^{2\alpha}<\varepsilon^2}\ \dfrac{\bar\kappa}{h}\,\dfrac{\Big(\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2}{\varepsilon^2 - c_R^2\big(\frac{h^R}{n!}\big)^{2\alpha}}.$
We adopt the suboptimal strategy consisting in maximizing the denominator of this ratio and then plugging the resulting $h^*(\varepsilon)$ into the whole ratio. Temporarily forgetting the constraint $h\in\mathcal H$, this reads

$\displaystyle\sup_{h\in\mathcal H:\ 0<h<\big(\frac{\varepsilon\,(n!)^{\alpha}}{|c_R|}\big)^{\frac{1}{\alpha R}}} h\Big(\varepsilon^2 - c_R^2\Big(\dfrac{h^R}{n!}\Big)^{2\alpha}\Big) \le \sup_{0<h<\big(\frac{\varepsilon\,(n!)^{\alpha}}{|c_R|}\big)^{\frac{1}{\alpha R}}} h\Big(\varepsilon^2 - c_R^2\Big(\dfrac{h^R}{n!}\Big)^{2\alpha}\Big).$

Elementary optimization in $h$ shows that

$\displaystyle\sup_{h:\ c_R^2\big(\frac{h^R}{n!}\big)^{2\alpha}<\varepsilon^2} h\Big(\varepsilon^2 - c_R^2\Big(\dfrac{h^R}{n!}\Big)^{2\alpha}\Big) = \dfrac{2\alpha R}{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}\,\varepsilon^2\Big(\dfrac{\varepsilon}{|c_R|}\Big)^{\frac{1}{\alpha R}}(n!)^{\frac1R}$

is in fact a maximum, attained at

$\widetilde h(\varepsilon) = \Big(\dfrac{\varepsilon}{|c_R|}\Big)^{\frac{1}{\alpha R}}\dfrac{(n!)^{\frac1R}}{(1+2\alpha R)^{\frac{1}{2\alpha R}}}.$  (9.51)

This leads us to set

$h^*(\varepsilon) = $ lower nearest neighbor of $\widetilde h(\varepsilon)$ in $\mathcal H = h\big\lceil h/\widetilde h(\varepsilon)\big\rceil^{-1}.$

When $\varepsilon\to0$, $\widetilde h(\varepsilon)\to0$ and $h^*(\varepsilon)\sim\widetilde h(\varepsilon)$ but, of course, it remains a priori suboptimal at finite range, so that

$\displaystyle\inf_{RMSE(\widehat I_M^{ML2R})\le\varepsilon}\mathrm{Cost}^{ML2R}(h,M,q) \le \bar\kappa\,\dfrac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,\dfrac{|c_R|^{\frac{1}{\alpha R}}}{(n!)^{\frac1R}}\,\dfrac{\Big(\sum_{1\le i\le R}\sqrt{\kappa_i}\,\sigma_W\big(i,h^*(\varepsilon)\big)\Big)^2}{\varepsilon^{2+\frac{1}{\alpha R}}}.$  (9.52)
▶ Computing the size $M^*(\varepsilon)$ of the simulation.

The global size $M^*(\varepsilon)$ of the simulation is obtained by saturating the constraint $\|\widehat I_M^{ML2R} - I_0\|_2 \le \varepsilon$ with $h = \widetilde h(\varepsilon)$, using that $\|\widehat I_M^{ML2R} - I_0\|_2^2 = \dfrac{\mathrm{Var}\big(\widehat I_1^{ML2R}\big)}{M^*(\varepsilon)} + \mathrm{Bias}\big(\widehat I_M^{ML2R}\big)^2$. We get

$M^*(\varepsilon) = \dfrac{\mathrm{Var}\big(\widehat I_1^{ML2R}\big)}{\varepsilon^2 - \mathrm{Bias}\big(\widehat I_M^{ML2R}\big)^2}.$

Plugging the value of the variance (9.42) and the bias (9.40) into this equation finally yields

$M^*(\varepsilon) = \Big\lceil\Big(1+\dfrac{1}{2\alpha R}\Big)\dfrac{\mathrm{Var}\big(\widehat I_1^{ML2R}\big)}{\varepsilon^2}\Big\rceil = \Big\lceil\Big(1+\dfrac{1}{2\alpha R}\Big)\dfrac{q^{*,\dagger}\sum_{1\le i\le R}\sqrt{\kappa_i}\,\sigma_W\big(i,h^*(\varepsilon)\big)}{\varepsilon^2}\Big\rceil.$  (9.53)
One may re-write this formula in a more tractable or convenient way (see Table 9.1 in the Practitioner's corner further on) by replacing $\kappa_i$, $\sigma_W(i,h)$ and $h^*(\varepsilon)$ by their available expressions or bounds.

Remark. If we assume furthermore that $Y_h\to Y_0$ in $L^2$ as $h\to0$, then $h^*(\varepsilon)\to0$, so that $\Big(\sum_{1\le i\le R}\sqrt{\kappa_i}\,\sigma_W\big(i,h^*(\varepsilon)\big)\Big)^2\to\mathrm{Var}(Y_0)$ since $\sigma_W\big(i,h^*(\varepsilon)\big)\to0$ for $i=2,\dots,R$, $\sigma_W\big(1,h^*(\varepsilon)\big)\to\sqrt{\mathrm{Var}(Y_0)}$ and $W_1 = 1$. Finally, we get the following asymptotic result:

$\displaystyle\inf_{\|\widehat I_M^{ML2R}-I_0\|_2\le\varepsilon}\mathrm{Cost}^{ML2R}(h,M,q) \lesssim \dfrac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\,\dfrac{|c_R|^{\frac{1}{\alpha R}}\,\mathrm{Var}(Y_0)}{(n!)^{\frac1R}}\,\dfrac{1}{\varepsilon^{2+\frac{1}{\alpha R}}}$ as $\varepsilon\to0$.  (9.54)
Temporary conclusions

– The first piece of good news that can be drawn from this asymptotic result is that we still have the term $\varepsilon^{-(2+\frac{1}{\alpha R})}$, which shows that the larger the depth $R$ is, the closer we get to an unbiased simulation.

– The second piece of good news is that the ratio $\frac{|n|_1}{(n!)^{1/R}}\ (\ge R)$ that appeared in the multistep framework and caused problems at fixed $RMSE$ $\varepsilon>0$ is now replaced by $\frac{1}{(n!)^{1/R}} = \frac{1}{N^{\frac{R-1}{2}}}$, which goes to $0$ as $R\to+\infty$.

– However, this last result (9.54) remains purely asymptotic as $\varepsilon\to0$. In view of practical implementation, it is not yet satisfactory since we plan to design an estimator $\widehat I_M^{ML2R}$ for a fixed prescribed $RMSE$ $\varepsilon$. In particular, at this stage, the estimator seems not to be impacted by the convergence rate of the variance of $Y_h$ to that of $Y_0$, but this is only an artifact induced by the asymptotic approach.

Fortunately, we have not yet taken advantage of the last free parameter of the problem, namely the depth $R$ of the estimator.
Step 4 (Final optimization ($R = R(\varepsilon)\to+\infty$)). In this last phase of the optimization process, we start from the non-asymptotic optimized upper bound (9.52) of $\mathrm{Cost}\big(\widehat I_M^{ML2R}\big)$. Our aim at this stage is to specify the depth $R$ of the estimator as a function of the $RMSE$ $\varepsilon$, once the above optimized specifications of both the effort and the step have been performed. To achieve this final task we need a more precise non-asymptotic control of the optimized effort.

▶ Non-asymptotic control of the effort (or when strong convergence comes back into the game).

We temporarily return to a generic parameter $h\in\mathcal H$ (to alleviate notation). To control the effort of the estimator w.r.t. the depth $R$, the (strong) $L^2$-convergence of $Y_h$ to $Y_0$ (combined with $\mathbb E\,Y_h\to\mathbb E\,Y_0$) as $h\to0$ is not precise enough: we need a rate of convergence of $\mathrm{Var}(Y_h - Y_{h'})$ as $h,h'\to0$ in $\mathcal H$, in order to control the weighted standard deviation terms $\sigma_W(i,h) = |W_i|\,\sigma\big(Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}\big)$.

To this end, we will make the following "optimistic" (or "demanding") hypothesis: we assume from now on that there exist $\beta>0$ and $V_1>0$ such that

$(VE)_\beta \equiv \forall\, h,h'\in\mathcal H\cup\{0\},\quad \mathrm{Var}(Y_h - Y_{h'}) \le V_1|h-h'|^\beta.$  (9.55)


This choice is justified by the following facts: in the diffusion framework ($Y_h = f(\bar X_T)$ or $F\big((\bar X^n_t)_{t\in[0,T]}\big)$, $\bar X^n$ a discretization scheme and $h = \frac Tn$), Assumption $(VE)_\beta$ is consistent with the asymptotic variance in the Central Limit Theorem established in [34], namely a weak convergence $h^{-\frac\beta2}\big(Y_h - Y_{\frac hN}\big)\to\big(\frac{N-1}{N}\big)^{\frac\beta2}Z$, $Z \overset{d}{=} \mathcal N(0;1)$ (when $\beta = 1$). This is confirmed by various empirical evidence, not reproduced here, for other values of $\beta$. In the nested Monte Carlo framework, $(VE)_\beta$ is also satisfied (see Proposition 9.2 further on) through the elementary inequality $\mathrm{Var}(Y_h - Y_{h'}) \le \|Y_h - Y_{h'}\|_2^2 \le V_1|h-h'|^\beta$. However, it is also clear that this bound is not the most accessible one in many situations: thus, in a diffusion framework, one more naturally has access to a control of $Y_h - Y_0$ in quadratic norm, relying on results like those established in Chap. 7, devoted to time discretization schemes of $SDE$s. In fact, what follows can easily be adapted to a less sharp framework where one only has access to a control of the form

$\forall\, h\in\mathcal H,\quad \|Y_h - Y_0\|_2^2 \le V_1 h^\beta.$

This is, for example, the framework adopted in [198] (see also the next exercise).

Let us briefly discuss this assumption from a technical viewpoint. First, note that $(VE)_\beta$ is clearly implied by

$(SE)_\beta \equiv \forall\, h,h'\in\mathcal H\cup\{0\},\quad \|Y_h - Y_{h'}\|_2^2 \le V_1|h-h'|^\beta$  (9.56)

which is more directly connected to the quadratic strong rate of convergence of $Y_h$ toward $Y_0$. It even holds, possibly with the same $V_1$, since $\mathrm{Var}(Y_h - Y_{h'}) \le \|Y_h - Y_{h'}\|_2^2$.

Moreover, if both $(WE)^\alpha_1$ and $(VE)_\beta$ hold, then there exists a real constant $\widetilde V_1\in(0,+\infty)$ such that

$\|Y_h - Y_{h'}\|_2^2 = \mathrm{Var}(Y_h - Y_{h'}) + \big(\mathbb E\,Y_h - \mathbb E\,Y_{h'}\big)^2 \le \widetilde V_1\, h^{(2\alpha)\wedge\beta},$

i.e. $(SE)_{(2\alpha)\wedge\beta}$ holds with $\widetilde V_1$. Conversely, if $(SE)_\beta$ and $(WE)^\alpha_1$ hold, it follows from the Schwarz Inequality that $2\alpha\ge\beta$.
The control of the effort under this new hypothesis is straightforward: it follows from Assumption $(VE)_\beta$ that, for every $i\in\{2,\dots,R\}$,

$\sigma_W(i,h) = |W_i|\,\sigma\big(Y_{\frac{h}{n_i}} - Y_{\frac{h}{n_{i-1}}}\big)$  (9.57)

$\qquad \le |W_i|\,\sqrt{V_1}\,\Big|\dfrac{h}{n_i} - \dfrac{h}{n_{i-1}}\Big|^{\frac\beta2}$

$\qquad = |W_i|\,\sqrt{V_1}\,h^{\frac\beta2}\Big|\dfrac{1}{n_i} - \dfrac{1}{n_{i-1}}\Big|^{\frac\beta2}$  (9.58)

$\qquad \le |W_i|\,\sqrt{V_1}\,h^{\frac\beta2}\,N^{-(i-1)\frac\beta2}(N-1)^{\frac\beta2},$

where we used in the last line the specific form of the refiners $n_i = N^{i-1}$. On the coarse level,

$\sigma_W(1,h) = \sigma(Y_h) = \sqrt{\mathrm{Var}(Y_h)}.$

As

$\kappa_i = n_i + n_{i-1} = n_i\Big(1+\dfrac1N\Big),\quad i\in\{2,\dots,R\},$  (9.59)

we finally obtain, after setting

$\theta = \theta_h = \sqrt{\dfrac{V_1}{\mathrm{Var}(Y_h)}},$  (9.60)

$q_1^* = \dfrac{1}{q^{\dagger,*}},\qquad q_i^* = \dfrac{1}{q^{\dagger,*}}\,\dfrac{\theta\,|W_i|\,\big|\frac{h}{n_i}-\frac{h}{n_{i-1}}\big|^{\frac\beta2}}{\sqrt{n_i+n_{i-1}}},\quad i = 2,\dots,R$  (9.61)

and

$\Big(\displaystyle\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2 \le \mathrm{Var}(Y_h)\Big(1+\theta\,h^{\frac\beta2}\displaystyle\sum_{i=2}^R |W_i|\big(n_{i-1}^{-1}-n_i^{-1}\big)^{\frac\beta2}\sqrt{n_i+n_{i-1}}\Big)^2$  (9.62)

$\qquad = \mathrm{Var}(Y_h)\Big(1+\theta\,h^{\frac\beta2}\big(1+N^{-1}\big)^{\frac12}(N-1)^{\frac\beta2}\displaystyle\sum_{i=2}^R |W_i|\,N^{(i-1)\frac{1-\beta}{2}}\Big)^2,$

keeping in mind that $(W_i)_{1\le i\le R} = \big(W_i^{(R)}\big)_{1\le i\le R}$ is an $R$-tuple, not the first $R$ terms of a sequence.
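The bounds above turn directly into an implementable allocation rule. The following sketch (Python; illustrative only, with $\theta$, $h$, $\beta$, the cumulative weights and the root $N$ as inputs) computes the $q_i^*$ of (9.61):

```python
import numpy as np

def allocation(W, theta, h, beta, N):
    """Path allocation q* of (9.61) for refiners n_i = N**(i-1)."""
    R = len(W)
    n = np.array([float(N) ** i for i in range(R)])      # n_1, ..., n_R
    q = np.empty(R)
    q[0] = 1.0                                           # coarse level
    for i in range(1, R):
        incr = abs(h / n[i] - h / n[i - 1]) ** (beta / 2.0)
        q[i] = theta * abs(W[i]) * incr / np.sqrt(n[i] + n[i - 1])
    return q / q.sum()                                   # normalize by q^{dagger,*}
```

As expected, the refined levels receive geometrically fewer paths: up to the weight factor $|W_i|$, the unnormalized $q_i^*$ decays like $N^{-(i-1)\frac{1+\beta}{2}}$.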
Remarks. • If the two processes $Y_{\frac{h}{n_{i-1}}}$ and $Y_{\frac{h}{n_i}}$ do not remain "close" in a pathwise sense, the multilevel estimator may "diverge": while it maintains its performance as a bias killer, it loses all its variance control abilities, as observed, for example, in [153] with the regular multilevel estimator presented in the next section. The same effect can be reproduced with this weighted estimator. Typically, with diffusions, this means that the two discretization schemes involved in each level $i$ are based on the same driving Brownian motion $W^i$ (these $W^i$ being independent across levels).

• However, in some specific situations, one may even get rid of the strong convergence itself; see e.g. [32] for such an approach for diffusion processes, where an approximation of the underlying diffusion process is performed by a kind of binomial tree which preserves the performance of the multilevel paradigm, though not converging strongly. This idea is applied to deal with Lévy-driven diffusion processes whose discretization schemes involve a wienerization of the small jumps.
▶ Exercise (Effort control under quadratic convergence, see [198]). In practice, the simplest or most commonly known information available about a family $(Y_h)_{h\in\mathcal H}$ of approximations of $Y_0$ concerns the quadratic rate of convergence of $Y_h$ toward $Y_0$, namely that there exist $\beta>0$, $\bar V_1>0$, such that the following property holds:

$(SE)^0_\beta \equiv \forall\, h\in\mathcal H,\quad \|Y_h - Y_0\|_2^2 \le \bar V_1 h^\beta.$  (9.63)

(a) Show that, under $(SE)^0_\beta$ (and for refiners of the form $n_i = N^{i-1}$),

$\sigma_W(i,h) \le |W_i|\,\sqrt{\bar V_1}\,h^{\frac\beta2}\,N^{-(i-1)\frac\beta2}\big(1+N^{\frac\beta2}\big),\quad i = 2,\dots,R.$

[Hint: use the sub-additivity of the standard deviation.]

(b) Show that, if $\kappa_i = n_i\big(1+\frac1N\big)$ for $i\in\{2,\dots,R\}$ (and $\kappa_1 = 1$), one finally obtains

$\Big(\displaystyle\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2 \le \mathrm{Var}(Y_h)\Big(1+\bar\theta_h\,h^{\frac\beta2}\big(1+N^{-1}\big)^{\frac12}\big(N^{\frac\beta2}+1\big)\displaystyle\sum_{i=2}^R |W_i|\,N^{(i-1)\frac{1-\beta}{2}}\Big)^2$

where $\bar\theta_h = \sqrt{\dfrac{\bar V_1}{\mathrm{Var}(Y_h)}}$.

(c) Show that on the coarse level $\sigma_W(1,h) \le \sigma(Y_0) + \sqrt{\bar V_1}\,h^{\frac\beta2}$. Deduce the less sharp inequality

$\Big(\displaystyle\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W(i,h)\Big)^2 \le \mathrm{Var}(Y_0)\Big(1+\bar\theta_0\,h^{\frac\beta2}\big(1+N^{-1}\big)^{\frac12}\big(N^{\frac\beta2}+1\big)\displaystyle\sum_{i=1}^R |W_i|\,N^{(i-1)\frac{1-\beta}{2}}\Big)^2$

where $\bar\theta_0 = \sqrt{\dfrac{\bar V_1}{\mathrm{Var}(Y_0)}}$.
▶ Optimization of the depth $R = R(\varepsilon)$.

First, we need to make one last additional assumption, namely that the weak error expansion holds true at any order and that the coefficients $c_r$ do not go too fast toward infinity as $r\to+\infty$, namely

$(WE)^\alpha_\infty \equiv \begin{cases}(WE)^\alpha_R \text{ holds for every depth } R\ge2\\ \text{and}\\ \widetilde c_\infty = \lim_{R\to+\infty}|c_R|^{\frac1R}\in(0,+\infty)\end{cases}$  (9.64)

(the case $\widetilde c_\infty = 0$ can be handled in what follows by considering $\widetilde c_\infty = \eta>0$ arbitrarily small, though this will provide suboptimal results).

The key point at this stage is to note that, for a fixed $RMSE$ $\varepsilon>0$, the complexity formally goes to $0$ as $R\to+\infty$, owing to (9.52), since $\frac{(1+2\alpha R)^{1+\frac{1}{2\alpha R}}}{2\alpha R}\to1$ and $|c_R|^{\frac{1}{\alpha R}}$ remains bounded, whereas $(n!)^{\frac1R} = N^{\frac{R-1}{2}}\uparrow+\infty$. Moreover, the complexity depends on $\varepsilon$ as $\varepsilon^{-(2+\frac{1}{\alpha R})}$, so that the larger $R$ is, the more this complexity behaves like that of an unbiased estimator (in $\varepsilon^{-2}$). These facts strongly suggest taking $R$ as large as possible, under the constraint that $h^*(\varepsilon)$ lies in $\mathcal H$ and remains close to $\widetilde h(\varepsilon)$, which is equivalent to $\widetilde h(\varepsilon)\le h$.
This argument is all the more convincing if $\varepsilon$ is small, but it remains heuristic at this stage and cannot be considered as a rigorous optimization: such a procedure may turn out to be suboptimal, especially if the prescribed $RMSE$ is not small enough. An alternative for practical implementation is to numerically minimize the upper bound of the cost on the right-hand side of (9.52), once the parameter $\theta$ has been estimated (see the Practitioner's corner further on). One reason for computing a closed formula for the depth $R$ and the other parameters of the Richardson–Romberg estimator is to obtain asymptotic bounds on the complexity (see Theorem 9.1).

The inequality $\widetilde h(\varepsilon)\le h$ reads, owing to (9.51) and after taking the log and temporarily considering $R$ as a real number,

$R(R-1) - \dfrac{2}{\alpha\log N}\log\dfrac1\varepsilon - \dfrac{2R}{\log N}\log\big(h\,|c_R|^{\frac{1}{\alpha R}}\big) - \dfrac{\log(1+2\alpha R)}{\alpha\log N} \le 0.$

At this stage, we need to consider approximations if we want to get a closed form at any order $R$, so as to be able to carry out an asymptotic study of the behavior of our estimator.

First, we assume that $|c_R|^{\frac{1}{\alpha R}} \simeq \widetilde c_\infty^{\,1/\alpha}$ (which is true if $R$ is large enough under our assumptions…). Then, the above inequality reads

$R^2 - R\Big(1 + \dfrac{2\log\big(h\,\widetilde c_\infty^{\,1/\alpha}\big)}{\log N}\Big) - \dfrac{2}{\alpha\log N}\log\dfrac{\sqrt{1+2\alpha R}}{\varepsilon} \le 0.$

Solving this inequality as if $\frac{2}{\alpha\log N}\log\frac{\sqrt{1+2\alpha R}}{\varepsilon}$ were a constant term yields

$R \le \varphi_\varepsilon(R) := \dfrac12 + \dfrac{\log\big(h\,\widetilde c_\infty^{\,1/\alpha}\big)}{\log N} + \sqrt{\Big(\dfrac12 + \dfrac{\log\big(h\,\widetilde c_\infty^{\,1/\alpha}\big)}{\log N}\Big)^2 + \dfrac{2}{\alpha\log N}\log\dfrac{\sqrt{1+2\alpha R}}{\varepsilon}}.$

Note that $\varphi_\varepsilon$ is increasing in $R$. The highest admissible depth $R^*(\varepsilon)$ is the highest integer $R$ which satisfies the above inequality. If we set $R_0 = 1$ and $R_{k+1} = \varphi_\varepsilon(R_k)$, then $R_k\uparrow R_\infty$ and

$R^*(\varepsilon) = \lfloor R_\infty\rfloor.$  (9.65)

Although the above convergence is almost instantaneous (two or three iterations are enough), a simpler "closed form" is to set $R^*(\varepsilon) := \lceil\varphi_\varepsilon(2)\rceil$, i.e.

$R^*(\varepsilon) = \Bigg\lceil\dfrac12 + \dfrac{\log\big(h\,\widetilde c_\infty^{\,1/\alpha}\big)}{\log N} + \sqrt{\Big(\dfrac12 + \dfrac{\log\big(h\,\widetilde c_\infty^{\,1/\alpha}\big)}{\log N}\Big)^2 + \dfrac{2}{\alpha\log N}\log\dfrac{\sqrt{1+4\alpha}}{\varepsilon}}\ \Bigg\rceil.$  (9.66)
On the way to an asymptotic analysis of the optimized weighted multilevel estimators, the first quantity to be investigated is clearly the family of weight vectors $\mathbf W^{(R)} = \big(W_i^{(R)}\big)_{i=1,\dots,R}$ as $R$ grows. The following lemma shows that they remain uniformly bounded as $R$ grows.

Lemma 9.2 Let $\alpha>0$ and let the associated weights $\big(W_j^{(R)}\big)_{j=1,\dots,R}$ be defined by (9.37) with $n_i = N^{i-1}$.

(a) Let $a_\ell = \displaystyle\prod_{1\le k\le\ell-1}\big(1-N^{-k\alpha}\big)^{-1}$, $\ell\ge1$, and $b_\ell = (-1)^\ell N^{-\frac\alpha2\ell(\ell+1)}$. Then

$W_j^{(R)} = \displaystyle\sum_{\ell=j}^R a_\ell\, a_{R-\ell+1}\, b_{R-\ell} = \displaystyle\sum_{\ell=0}^{R-j} a_{R-\ell}\, a_{\ell+1}\, b_\ell.$

(b) The sequence $a_\ell\uparrow a_\infty = \displaystyle\prod_{k\ge1}\big(1-N^{-k\alpha}\big)^{-1}$ as $\ell\uparrow+\infty$ and $\widetilde B_\infty = \displaystyle\sum_{\ell=0}^{+\infty}|b_\ell|<+\infty$, so that the weights $W_j^{(R)}$ are uniformly bounded and

$\forall\, R\in\mathbb N^*,\ \forall\, j\in\{1,\dots,R\},\quad \big|W_j^{(R)}\big| \le a_\infty^2\,\widetilde B_\infty < +\infty.$  (9.67)

Proof. (a) By re-writing (9.33), we get $w_j = w_j^{(R)} = a_j\, b_{R-j}\, a_{R-j+1}$, so that

$W_j^{(R)} = \displaystyle\sum_{\ell=j}^R a_\ell\, b_{R-\ell}\, a_{R-\ell+1} = \displaystyle\sum_{\ell=0}^{R-j} a_{R-\ell}\, b_\ell\, a_{\ell+1}.$

(b) It is clear that $a_\ell\uparrow a_\infty<+\infty$ and that the series with general term $b_\ell$ is absolutely convergent, since $\widetilde B_\infty = \displaystyle\sum_{\ell\ge0} N^{-\frac\alpha2\ell(\ell+1)}<+\infty$. Hence

$\forall\, R\in\mathbb N^*,\ \forall\, j\in\{1,\dots,R\},\quad \big|W_j^{(R)}\big| \le a_\infty^2\,\widetilde B_\infty.\qquad\diamond$
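Lemma 9.2 is easy to corroborate numerically: the sketch below (Python; function names are ours) computes the $W_j^{(R)}$ for increasing $R$ and checks them against the bound $a_\infty^2\widetilde B_\infty$ of (9.67), truncating the infinite product and series (harmless, given their very fast convergence).

```python
import numpy as np
from math import prod

def cumulative_weights(R, N, alpha):
    """W_j^{(R)} = w_j + ... + w_R, with w_i given by (9.33)."""
    w = [(-1) ** (R - i) / prod(abs(1.0 - N ** (-alpha * (i - j)))
                                for j in range(1, R + 1) if j != i)
         for i in range(1, R + 1)]
    return np.cumsum(w[::-1])[::-1]

def bound_967(N, alpha, trunc=200):
    """Truncated a_inf**2 * B_inf of (9.67)."""
    a_inf = prod(1.0 / (1.0 - N ** (-k * alpha)) for k in range(1, trunc))
    B_inf = sum(N ** (-0.5 * alpha * l * (l + 1)) for l in range(trunc))
    return a_inf ** 2 * B_inf
```

For $N = 2$, $\alpha = 1$, the bound is about $20$, and the computed weights indeed stay well below it as $R$ grows, while $W_1^{(R)} = 1$ for every $R$.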

Now, we have to inspect the behavior of $\sum_{i=1}^R\sqrt{\kappa_i}\,\sigma_W\big(i,h^*(\varepsilon)\big)$ depending on the value of $\beta$. Given (9.48), there are three different cases:

• $\beta\in(1,+\infty)$: fast strong approximation (corresponding, for example, to the use of the Milstein scheme for vanilla payoffs in a local volatility model),

• $\beta = 1$: regular strong approximation (corresponding, for example, to the use of the Euler scheme for vanilla or lookback payoffs in a local volatility model),

• $\beta\in(0,1)$: slow strong approximation (corresponding, for example, to the use of the Euler scheme for payoffs including barriers in a local volatility model, or the computation of quantiles for risk measure purposes).

Combining all of the preceding with elementary, though slightly technical, computations (see [198] for details) leads to the following theorem.
Theorem 9.1 (Weighted ML2R estimator) Let $n_i = N^{i-1}$, $i\ge1$. Assume $(VE)_\beta$ and $(WE)^\alpha_\infty$ (see (9.64)), let $h = h$ and let $\theta = \theta_h = \sqrt{\frac{V_1}{\mathrm{Var}(Y_h)}}$ (see (9.60)).

(a) The family of ML2R estimators satisfies, as $\varepsilon\to0$,

$\displaystyle\inf_{\substack{h\in\mathcal H,\ q\in\mathcal S_R,\ R,\ M\ge1,\\ \|\widehat I_M^{ML2R}-I_0\|_2\le\varepsilon}}\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) \lesssim K^{ML2R}\,\dfrac{\mathrm{Var}(Y_h)}{\varepsilon^2}\times\begin{cases}1 & \text{if } \beta>1,\\ \log(1/\varepsilon) & \text{if } \beta=1,\\ e^{\frac{1-\beta}{\sqrt\alpha}\sqrt{2\log(1/\varepsilon)\log N}} & \text{if } \beta<1,\end{cases}$

where $K^{ML2R} = K^{ML2R}(\alpha,\beta,N,h,\widetilde c_\infty,V_1)$.

(b) When $\beta<1$, the optimal rate is achieved with $N = 2$.

(c) The rates on the right-hand side of (a) are achieved by setting, in the definition (9.39) of $\widehat I_M^{ML2R}$, $h = h$, $q = q^*(\varepsilon)$, $R = R^*(\varepsilon)$ and $M = M^*(\varepsilon)$. Closed forms are available for $q^*(\varepsilon)$, $R^*(\varepsilon)$, $M^*(\varepsilon)$ (see Table 9.1 in the Practitioner's corner hereafter and the comments that follow for variants).

Comments. • If $\beta>1$ (fast strong approximation), the weighted Multilevel estimator (ML2R) behaves in terms of rate like an unbiased estimator, namely $\varepsilon^{-2}$. Any further gain will rely on the reduction of the (optimal) constant $K^{ML2R}(\alpha,\beta,N,\theta,h,\widetilde c_\infty,V_1)$.

• If $\beta\in(0,1]$, the exact unbiased rate cannot be achieved. However, it is almost attained, at least on a polynomial scale, since

$\displaystyle\inf_{\substack{h\in\mathcal H,\ q\in\mathcal S_R,\ R,\ M\ge1,\\ \|\widehat I_M^{ML2R}-I_0\|_2\le\varepsilon}}\mathrm{Cost}\big(\widehat I_M^{ML2R}\big) = o\big(\varepsilon^{-(2+\eta)}\big),\quad \forall\,\eta>0.$

We can sum up Theorem 9.1 by noting that

ML2R estimators always make it possible to carry out quasi-unbiased simulations, whatever the "strong error" rate parameter $\beta$ is.

• Closed forms for the constants $K^{ML2R}(\alpha,\beta,N,\theta,h,\widetilde c_\infty,V_1)$ are given in [198] in the slightly different setting where $(VE)_\beta$ is replaced by $(SE)_\beta$ (see (9.56)). Under $(VE)_\beta$, such formulas are left as an exercise (see Exercise 1 after the proof of the theorem).

• Under a "sharp" version of $(VE)_\beta$, a refined version of Lemma 9.2 (see Exercise 2.(a)–(b) after the proof of the theorem) and a few additional conditions, a strong law of large numbers as well as a CLT can be established for the optimized weighted Multilevel Richardson–Romberg estimator (see [112]). Based on this refined version of Lemma 9.2, one may also obtain better constants $K^{ML2R}(\alpha,\beta,N,\theta,h,\widetilde c_\infty,V_1)$ than those obtained in [198]; see [113] and Exercise 2.(c).

• The theorem could be stated using $\mathrm{Var}(Y_0)$ by simply checking that
9.5 The Multilevel Paradigm 417
\[
\operatorname{Var}(Y_{\mathbf h})\le\operatorname{Var}(Y_0)\big(1+\theta_0\,\mathbf h^{\beta/2}\big)^2\quad\text{with}\quad\theta_0=\sqrt{\frac{V_1}{\operatorname{Var}(Y_0)}}\le\frac{\theta}{1-\theta\,\mathbf h^{\beta/2}}.
\]

Proof of Theorem 9.1. The starting point of the proof is the upper-bound (9.52) of the complexity of the ML2R estimator. We will analyze its behaviour as $\varepsilon\to0$ when the parameters $h$ and $R$ are optimized (see Table 9.1), i.e. set at $h=\mathbf h$ and $R=R^*(\varepsilon)$.

First note that, with our choice of geometric refiners $n_i=N^{i-1}$, $\check n_R:=\big(\prod_{i=1}^Rn_i\big)^{\frac1R}=N^{\frac{R-1}2}$. Then, if we denote $W_*=\sup_{R\ge1}\max_{1\le i\le R}\big|W_i^{(R)}\big|$, we know that $W_*<+\infty$ owing to Lemma 9.2. Hence, if we denote the infimum of the complexity of the ML2R estimators as defined in the theorem statement by
\[
\operatorname{Cost}_{opt}=\inf_{\substack{h\in\mathcal H,\,q\in\mathcal S_R,\,R,\,M\ge1\\ \|\widehat I^{ML2R}_M-I_0\|_2\le\varepsilon}}\operatorname{Cost}\big(\widehat I^{ML2R}_M\big),
\]
it follows from (9.52) that
\[
\varepsilon^2\operatorname{Cost}_{opt}\le\varepsilon^2\operatorname{Cost}^{ML2R}\big(\mathbf h,M^*(\varepsilon),q^*(\varepsilon),N\big)\le\bar\kappa\,\Big(1+\frac1{2\alpha R}\Big)^{1+\frac1{2\alpha R}}\,\frac{|c_R|^{\frac1{\alpha R}}}{\check n_R\,\varepsilon^{\frac1{\alpha R}}}\,\operatorname{Effort}^{ML2R}_{opt},
\]
where $\operatorname{Effort}^{ML2R}_{opt}$ is given by setting $h=\mathbf h$ in (9.48), namely
\[
\operatorname{Effort}^{ML2R}_{opt}=\operatorname{Var}(Y_{\mathbf h})\,K_1(\beta,\theta,\mathbf h,V_1)\Bigg(\sum_{i=1}^R|W_i|\,\big(n_{i-1}^{-1}-n_i^{-1}\big)^{\frac\beta2}\sqrt{n_{i-1}+n_i}\Bigg)^2.
\]
Now, specifying in this inequality the values of the refiners $n_i$ yields
\[
\operatorname{Effort}^{ML2R}_{opt}\le\operatorname{Var}(Y_{\mathbf h})\,K_1(\beta,N,\theta,\mathbf h,V_1)\Bigg(\sum_{i=1}^R|W_i|\,N^{\frac{1-\beta}2(i-1)}\Bigg)^2.
\]
Keeping in mind that $\lim_{\varepsilon\to0}R^*(\varepsilon)=+\infty$ owing to (9.66), one deduces that
\[
\lim_{R\to+\infty}\Big(1+\frac1{2\alpha R}\Big)^{1+\frac1{2\alpha R}}=1\quad\text{and, using }(WE)^\alpha_\infty,\quad\lim_{R\to+\infty}|c_R|^{\frac1{\alpha R}}=\widetilde c_\infty^{\frac1\alpha}<+\infty.
\]
Moreover, note that
\[
R^*(\varepsilon)=\Bigg\lceil\sqrt{\frac{2\log(1/\varepsilon)}{\alpha\log N}}+r^*+O\Big(\frac1{\sqrt{\log(1/\varepsilon)}}\Big)\Bigg\rceil\quad\text{as }\varepsilon\to0,
\]
where $r^*=r^*(\alpha,N,\mathbf h)$. In particular, one deduces that
\[
N^{-\frac{R^*(\varepsilon)-1}2}\,\varepsilon^{-\frac1{\alpha R^*(\varepsilon)}}=\sqrt N\,\exp\Big(-R^*(\varepsilon)\,\frac{\log N}2+\frac{\log(1/\varepsilon)}{\alpha R^*(\varepsilon)}\Big)
\]
\[
\le\sqrt N\,\exp\Big(-\sqrt{\frac{\log N\,\log(1/\varepsilon)}{2\alpha}}+\sqrt{\frac{\log N\,\log(1/\varepsilon)}{2\alpha}}-r^*\,\frac{\log N}2+O\big(\log(1/\varepsilon)^{-\frac12}\big)\Big)
\]
\[
=K_2(\alpha,\beta,N,\theta,\mathbf h,\widetilde c_\infty)\,\varpi(\varepsilon)\quad\text{with }\varpi(\varepsilon)\to1\text{ as }\varepsilon\to0.
\]

At this stage, it is clear that the announced rate results will follow from the asymptotic behaviour of
\[
\Bigg(\sum_{i=1}^{R^*(\varepsilon)}|W_i|\,N^{\frac{1-\beta}2(i-1)}\Bigg)^2\quad\text{as }\varepsilon\to0,
\]
depending on the value of $\beta$. Elementary computations yield
\[
\sum_{i=1}^R|W_i|\,N^{\frac{1-\beta}2(i-1)}\ \le\ W_*\times
\begin{cases}
\dfrac{N^{\frac{1-\beta}2R}}{N^{\frac{1-\beta}2}-1} & \text{if }\beta<1,\\[2mm]
R & \text{if }\beta=1,\\[2mm]
\dfrac{N^{\frac{1-\beta}2}}{1-N^{\frac{1-\beta}2}} & \text{if }\beta>1.
\end{cases}
\]

Now we are in a position to inspect the three possible settings.

– $\beta>1$: It follows from what precedes that $\limsup_{\varepsilon\to0}\varepsilon^2\operatorname{Cost}_{opt}<+\infty$.

– $\beta=1$: As $R^*(\varepsilon)\sim\sqrt{\frac{2\log(1/\varepsilon)}{\alpha\log N}}$ when $\varepsilon\to0$, it follows that $R^*(\varepsilon)^2\sim\frac{2\log(1/\varepsilon)}{\alpha\log N}$, so that
\[
\limsup_{\varepsilon\to0}\,\frac{\varepsilon^2}{\log(1/\varepsilon)}\,\operatorname{Cost}_{opt}<+\infty.
\]

– $\beta<1$: The conclusion follows from the fact that
\[
N^{(R^*(\varepsilon)-1)(1-\beta)}\le N^{\beta-1}\exp\Big((1-\beta)\log N\Big(\sqrt{\frac{2\log(1/\varepsilon)}{\alpha\log N}}+O\big(\log(1/\varepsilon)^{-\frac12}\big)\Big)\Big)\le K_1(\theta,\mathbf h,\alpha,\beta,N)\,\exp\Big(\frac{1-\beta}{\sqrt\alpha}\sqrt{2\log N\,\log(1/\varepsilon)}\,\lambda(\varepsilon)\Big)
\]
with $\lim_{\varepsilon\to0}\lambda(\varepsilon)=1$. One concludes that
\[
\limsup_{\varepsilon\to0}\ \varepsilon^2\exp\Big(-\frac{1-\beta}{\sqrt\alpha}\sqrt{2\log N\,\log(1/\varepsilon)}\Big)\operatorname{Cost}_{opt}<+\infty.\qquad\diamondsuit
\]

▷ Exercises. 1. Explicit constants. Determine explicit values for the constant $K^{ML2R}(\alpha,\beta,N,\theta,\mathbf h,\widetilde c_\infty)$ depending on $\beta$. [Hint: Inspect carefully the above proof and use Lemma 9.2.]

2. Quest for better constants. With the notation of Lemma 9.2, let $\alpha>0$ and let the associated weights $\big(W^{(R)}_j\big)_{j=1,\dots,R}$ be as given in (9.37).

(a) For every $\gamma>0$ and every integer $N\ge2$,
\[
\lim_{R\to+\infty}\sum_{j=2}^R\big|W^{(R)}_j\big|\,N^{-\gamma(j-1)}=\frac1{N^\gamma-1}.
\]

(b) Let $(v_j)_{j\ge1}$ be a bounded sequence of positive real numbers and let $\gamma\in\mathbb R$; assume that $\lim_jv_j=1$ when $\gamma\ge0$. Then the following asymptotics hold: for every $N\ge2$,
\[
\sum_{j=2}^R\big|W^{(R)}_j\big|\,N^{\gamma(j-1)}v_j\ \underset{R\to+\infty}{\sim}\
\begin{cases}
\displaystyle\sum_{j\ge2}N^{\gamma(j-1)}v_j<+\infty & \text{for }\gamma<0,\\[1mm]
R & \text{for }\gamma=0,\\[1mm]
N^{\gamma R}\,a_\infty\ \text{for some constant }a_\infty\ne0 & \text{for }\gamma>0.
\end{cases}
\]
(c) Use these results to improve the values of the constants $K^{ML2R}(\alpha,\beta,N,\theta,\mathbf h,\widetilde c_\infty,V_1)$ in Theorem 9.1 compared to those derived by relying on Lemma 9.2 in the former exercise.

Remark. These sharper results on the weights are needed to establish the CLT sat-
isfied by the ML2R estimators as ε → 0 (see [112]).

ℵ Practitioner's corner I: Specification and calibration of the estimator parameters

We assume in this section that Assumptions $(WE)^\alpha_\infty$ and $(SE)_\beta$ are in force. The coarsest parameter $\mathbf h$ is also assumed to be specified by exogenous considerations or constraints. The specification and calibration phase of the parameters of the ML2R estimators for a prescribed RMSE $\varepsilon>0$ (quadratic error) is two-fold:

– As a first step, for a given root $N\ge2$, we specify or calibrate all the parameters from the theoretical knowledge we have of the pair of characteristics $(\alpha,\beta)$. The details are given further on. One must keep in mind that we first determine $R=R^*(\varepsilon)$, then the weights $W^{(R)}$ and the optimal bias parameter $h=h^*(\varepsilon)$. These only depend on $(\alpha,\beta)$ (except for $\widetilde c_\infty$, see the discussion further on). Then, we calibrate $V_1$ and $\operatorname{Var}(Y_{\mathbf h})$ (hence $\theta=\theta_{\mathbf h}$), which are more "payoff"-dependent. Finally, we derive the optimal allocation policy $q(\varepsilon)=(q_i(\varepsilon))_{1\le i\le R}$ and the global size, or budget, $M^*(\varepsilon)$ of the simulation.

All these parameters are listed in Table 9.1 (where all superscripts $*$ have been dropped to alleviate notation).

Since our way of determining the optimal depth $R^*(\varepsilon)$ consists in saturating the constraint $h^*(\varepsilon)\le\mathbf h$ into an equality, it follows that the analytic expression for $h^*(\varepsilon)$

as a projection of $h_{\max}(\varepsilon)$ on $\mathcal H$, namely
\[
h^*(\varepsilon)=\mathbf h\,\Big\lceil\mathbf h\,(1+2\alpha R)^{\frac1{2\alpha R}}\,\widetilde c_\infty^{\frac1\alpha}\,\varepsilon^{-\frac1{\alpha R}}\,N^{-\frac{R-1}2}\Big\rceil^{-1},
\]
eventually boils down to $h^*(\varepsilon)=\mathbf h$.


– As a second step, we roughly perform a minimization of the complexity based on the upper-bound (9.54) of $\operatorname{Cost}\big(\widehat I^{ML2R}_M\big)$.

Remark. Our specifications, especially those of the allocation policy and of the global size of the estimator (obtained from formula (9.52)), rely on the fact that the complexity of the simulation of $Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}$ is dominated by $\bar\kappa\,h^{-1}(n_i+n_{i-1})$, which typically corresponds to the case of diffusions. See the comments further on for variants, especially for nested Monte Carlo, where the complexity is lower (see the end of the paragraph devoted to nested Monte Carlo in Practitioner's corner III).
Let us start with the structure parameters, namely $\alpha$, the resulting weights $W^{(R)}$, $\beta$ and the coarsest bias parameter $\mathbf h$.

▷ Structure parameters of the ML2R estimator: root, weights and bias parameter $\mathbf h$

– For a weak error characteristic $\alpha>0$, the resulting weights $\big(W_i^{(R)}\big)_{i=1,\dots,R}$ are given by
\[
W_i^{(R)}=w_i^{(R)}+\cdots+w_R^{(R)}\quad\text{where}\quad w_i^{(R)}=\frac{(-1)^{R-i}}{\prod_{1\le j\le R,\,j\ne i}\big|1-N^{-\alpha(i-j)}\big|},\qquad i=1,\dots,R
\]
(the formula for $w_i^{(R)}$ is established in (9.33)). Keep in mind that $W_1^{(R)}=1$.
– For a given root $N\in\mathbb N$, $N\ge2$, the refiners are fixed as follows: $n_i=N^{i-1}$, $i=1,\dots,R$.

– When $\beta\in(0,1)$, the optimal choice for the root is $N=2$.

– The choice of the coarsest bias parameter $\mathbf h$ is usually naturally suggested by the model under consideration. As all the approaches so far have been developed for an absolute error, it seems natural to choose $\mathbf h$ lower than or close to one in practice: in particular, in a diffusion framework with horizon $T\ge1$, we will not choose $\mathbf h=T$ but rather $T/2$ or $T/3$, etc.; likewise, for nested Monte Carlo, if the variance of the inner simulation is too large, the parameter $K_0=1/\mathbf h$ will be chosen strictly greater than 1 (say, equal to a few units). See further on in this section for other comments.
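For illustration, the weights can be evaluated numerically from the closed form above. The sketch below (the function name `ml2r_weights` is ours) computes $w_i^{(R)}$ and the cumulated weights $W_i^{(R)}$ for a given root $N$ and weak error exponent $\alpha$:

```python
def ml2r_weights(R, N, alpha):
    """Richardson-Romberg weights w_i^(R) (formula (9.33) above) and
    cumulated weights W_i^(R) = w_i^(R) + ... + w_R^(R)."""
    w = []
    for i in range(1, R + 1):
        prod = 1.0
        for j in range(1, R + 1):
            if j != i:
                prod *= abs(1.0 - N ** (-alpha * (i - j)))
        w.append((-1) ** (R - i) / prod)
    # W_1 = w_1 + ... + w_R = 1 by the normalization of the weights
    W = [sum(w[i:]) for i in range(R)]
    return w, W

w, W = ml2r_weights(R=3, N=2, alpha=1.0)
print(W[0])  # W_1 ≈ 1.0
```

For $R=2$, $N=2$, $\alpha=1$ one recovers the classical Richardson–Romberg extrapolation weights $w=(-1,2)$.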
▷ How to calibrate $\widetilde c_\infty$ (and its connections with $\mathbf h$ and $R^*(\varepsilon)$)

First, it is not possible to include an estimation of $\widetilde c_\infty$ in a pre-processing phase – beyond the fact that $R^*(\varepsilon)$ depends on $\widetilde c_\infty$ – since the natural statistical estimators are far too unstable. On the other hand, to the best of our knowledge, under $(WE)^\alpha_\infty$, there are no significant results about the behavior of the coefficients $c_r$ as $r\to+\infty$, whatever the assumptions on the model are.

– If $\mathbf h=1$, then every $h\in\mathcal H$ can be considered as small, and we observe that, if the coefficients $c_r$ of the weak error expansion have polynomial growth, i.e. $|c_r|=O(r^a)$ as $r\to\infty$, then $\widetilde c_\infty=\lim_{r\to+\infty}|c_r|^{\frac1r}=1=\mathbf h^{-\alpha}$, so that, from a practical point of view, $|c_R|^{\frac1R}\simeq1$ as $R=R(\varepsilon)$ grows.

– If $\mathbf h\ne1$, it is natural to perform a change of unit to return to the former situation, i.e. to express everything with respect to $\mathbf h$. It reads
\[
\mathbb E\Big[\frac{Y_h}{\mathbf h}\Big]=\mathbb E\Big[\frac{Y_0}{\mathbf h}\Big]+\sum_{r=1}^R\big(c_r\,\mathbf h^{\alpha r-1}\big)\Big(\frac h{\mathbf h}\Big)^{\alpha r}+\mathbf h^{\alpha R-1}\,o\Big(\Big(\frac h{\mathbf h}\Big)^{\alpha R}\Big).
\]
If we consider that this expansion should behave like the above normalized one, it is natural to "guess" that $\big|c_R\,\mathbf h^{\alpha R-1}\big|^{\frac1R}\to1$ as $R\to+\infty$, i.e.
\[
\widetilde c_\infty=\mathbf h^{-\alpha}.\qquad(9.68)
\]
Note that this implies $\widetilde c_\infty^{\frac1\alpha}\,\mathbf h=1$, which dramatically simplifies the expression of $R(\varepsilon)$ in Table 9.1 since the terms involving $\log\big(\widetilde c_\infty^{\frac1\alpha}\mathbf h\big)$ vanish. The resulting formula reads
\[
R^*(\varepsilon)=\Bigg\lceil\frac12+\sqrt{\frac14+2\,\frac{\log\big(\sqrt{1+4\alpha}\,/\varepsilon\big)}{\alpha\log N}}\;\Bigg\rceil,\qquad(9.69)
\]
which still grows as $O\big(\sqrt{\log(1/\varepsilon)}\big)$ as $\varepsilon\to0$.

Various numerical experiments (see [113, 114]) show that the ML2R estimator, implemented with this choice $|c_R|^{\frac1R}\simeq\mathbf h^{-\alpha}$, is quite robust to variations of the coefficients $c_r$.

Table 9.1 Parameters of the ML2R estimator (standard framework):

• Refiners: $n_i=N^{i-1}$, $i=1,\dots,R$ (convention: $n_0=n_0^{-1}=0$).

• Depth:
\[
R=R^*(\varepsilon)=\Bigg\lceil\frac12+\frac{\log\big(\widetilde c_\infty^{1/\alpha}\mathbf h\big)}{\log N}+\sqrt{\Big(\frac12+\frac{\log\big(\widetilde c_\infty^{1/\alpha}\mathbf h\big)}{\log N}\Big)^2+2\,\frac{\log\big(\sqrt{1+4\alpha}\,/\varepsilon\big)}{\alpha\log N}}\;\Bigg\rceil
\]
(see also (9.66) and (9.69)).

• Bias parameter: $h=h^*(\varepsilon)=\mathbf h$ (does not depend upon $\varepsilon$).

• Allocation $q=q^*(\varepsilon)$:
\[
q_1(\varepsilon)=\frac1{q^\dagger_\varepsilon},\qquad q_j(\varepsilon)=\frac{\theta\,\mathbf h^{\frac\beta2}}{q^\dagger_\varepsilon}\,\big|W^{(R)}_j(N)\big|\,\frac{\big(n_{j-1}^{-1}-n_j^{-1}\big)^{\frac\beta2}}{\sqrt{n_{j-1}+n_j}},\quad j=2,\dots,R,
\]
with $q^\dagger_\varepsilon$ such that $\sum_{j=1}^Rq_j(\varepsilon)=1$ and $\theta=\theta_{\mathbf h}=\sqrt{\frac{V_1}{\operatorname{Var}(Y_{\mathbf h})}}$.

• Global budget:
\[
M=M^*(\varepsilon)=\Bigg\lceil\Big(1+\frac1{2\alpha R}\Big)\,\frac{\operatorname{Var}(Y_{\mathbf h})\,q^\dagger_\varepsilon}{\varepsilon^2}\Bigg(1+\theta\,\mathbf h^{\frac\beta2}\sum_{j=2}^R\big|W^{(R)}_j(N)\big|\big(n_{j-1}^{-1}-n_j^{-1}\big)^{\frac\beta2}\sqrt{n_{j-1}+n_j}\Bigg)\Bigg\rceil.
\]
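Formula (9.69) is straightforward to implement; the helper below (our naming) evaluates $R^*(\varepsilon)$ under the calibration $\widetilde c_\infty^{1/\alpha}\mathbf h=1$:

```python
import math

def R_star(eps, alpha, N):
    """Optimal ML2R depth R*(eps) from (9.69), assuming the calibration
    c_inf^(1/alpha) * h = 1 of (9.68)."""
    inner = 0.25 + 2.0 * math.log(math.sqrt(1.0 + 4.0 * alpha) / eps) / (alpha * math.log(N))
    return math.ceil(0.5 + math.sqrt(inner))

print(R_star(1e-2, alpha=1.0, N=2))  # returns 5; grows like sqrt(log(1/eps))
```

Consistently with the remark in the next subsection, the depth stays in the low two digits even for very small prescribed errors (e.g. $R^*(10^{-8})=8$ for $\alpha=1$, $N=2$).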

At this stage, for a prescribed RMSE $\varepsilon$, $R(\varepsilon)$ can be computed. Note that, given the value of $R(\varepsilon)$ (and the way it has been specified), one has
\[
h^*(\varepsilon)=\mathbf h.
\]

▷ Payoff-dependent simulation parameters: estimation of $\operatorname{Var}(Y_{\mathbf h})$, $V_1$ and $\theta$

Such an estimation is crucial since $\operatorname{Var}(Y_{\mathbf h})$ and $\theta=\sqrt{\frac{V_1}{\operatorname{Var}(Y_{\mathbf h})}}$ both appear in the computation of the optimal allocation policy $\big(q_i(\varepsilon)\big)_{1\le i\le R(\varepsilon)}$ and in the global size of the ML2R estimator. A (rough) statistical pre-processing is subsequently mandatory.

– Estimation of $\operatorname{Var}(Y_{\mathbf h})$. Taking into account our optimization procedure, one always has $h^*(\varepsilon)=\mathbf h$ (see Table 9.1). One can without damage perform a rough estimation of $\operatorname{Var}(Y_{\mathbf h})$, namely
\[
\operatorname{Var}(Y_{\mathbf h})\simeq\frac1m\sum_{k=1}^m\big(Y^k_{\mathbf h}-\overline Y_{\mathbf h,m}\big)^2\quad\text{where}\quad\overline Y_{\mathbf h,m}=\frac1m\sum_{k=1}^mY^k_{\mathbf h}.\qquad(9.70)
\]

– Estimation of $V_1$. Under $(VE)_\beta$, for every $h_0,h_0'\in\mathcal H$, $h_0'<h_0$,
\[
\operatorname{Var}\big(Y_{h_0}-Y_{h_0'}\big)\le V_1\,\big|h_0-h_0'\big|^\beta.
\]
This suggests choosing not too small $h_0,h_0'$ in $\mathcal H$ to estimate $V_1$ based on the formula
\[
V_1\simeq\widehat V_1(h_0,h_0')=\big|h_0-h_0'\big|^{-\beta}\operatorname{Var}\big(Y_{h_0}-Y_{h_0'}\big).
\]
It remains to estimate the variance by its usual estimator, which yields
\[
V_1\simeq\widehat V_1(h_0,h_0',m)=\big|h_0-h_0'\big|^{-\beta}\,\frac1m\sum_{k=1}^m\Big(Y^k_{h_0}-Y^k_{h_0'}-\overline Y_{h_0,h_0',m}\Big)^2,\quad\text{where}\quad\overline Y_{h_0,h_0',m}=\frac1m\sum_{k=1}^m\Big(Y^k_{h_0}-Y^k_{h_0'}\Big).\qquad(9.71)
\]
We make a rather conservative choice by giving priority to fulfilling the prescribed error $\varepsilon$ at the cost of a possible small increase in complexity. This led us to consider, in all the numerical experiments that follow,
\[
h_0=\frac{\mathbf h}5\quad\text{and}\quad h_0'=\frac{h_0}2=\frac{\mathbf h}{10},
\]
so that $|h_0-h_0'|^{-\beta}=10^\beta\,\mathbf h^{-\beta}$.

These choices turn out to be quite robust and provide satisfactory results in all investigated situations. They have been adopted in all the numerical experiments reproduced in this chapter.
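As an illustration of the pre-processing (9.71), here is a minimal sketch on a toy nested Monte Carlo model; the names and the model ($F(X,Z)=X+Z$, $f=\mathrm{id}$) are our own assumptions, chosen so that $(VE)_\beta$ holds with $\beta=1$ and $V_1=1$ exactly:

```python
import random

def estimate_V1(sample_pair, h0, h0p, beta, m, rng):
    """Rough estimator (9.71): V1_hat = |h0 - h0'|^(-beta) times the
    empirical variance of the coupled difference Y_{h0} - Y_{h0'}."""
    diffs = []
    for _ in range(m):
        y_h0, y_h0p = sample_pair(h0, h0p, rng)
        diffs.append(y_h0 - y_h0p)
    mean = sum(diffs) / m
    var = sum((d - mean) ** 2 for d in diffs) / m
    return abs(h0 - h0p) ** (-beta) * var

def sample_pair(h0, h0p, rng):
    # Toy nested MC: Y_h = X + (1/K) sum_k Z_k with K = 1/h and f = identity.
    # The two levels reuse the same inner draws Z_k, so that
    # Var(Y_{h0} - Y_{h0'}) = h0 - h0', i.e. beta = 1 and V1 = 1 exactly.
    K, Kp = round(1 / h0), round(1 / h0p)
    x = rng.gauss(0.0, 1.0)
    z = [rng.gauss(0.0, 1.0) for _ in range(Kp)]
    return x + sum(z[:K]) / K, x + sum(z) / Kp

rng = random.Random(42)
v1_hat = estimate_V1(sample_pair, h0=1 / 5, h0p=1 / 10, beta=1.0, m=5000, rng=rng)
# v1_hat is close to the true value V1 = 1
```

The choice $h_0=\mathbf h/5$, $h_0'=\mathbf h/10$ mirrors the conservative calibration described above (here with $\mathbf h=1$).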

▷ Exercise. Determine a way to estimate $\bar V_1$ in (9.63) when $(SE)^0_\beta$ is the only available strong error rate.

▷ Calibration of the root $N$

This corresponds to the second phase described at the beginning of this section.

– First, let us recall that if $\beta\in(0,1)$, then the best root is always $N=2$.

– When $\beta\ge1$ – keeping in mind that $R$ will never go beyond 10 or 12 for common values of the prescribed error $\varepsilon$, and once the parameters $\operatorname{Var}(Y_0)$ and $V_1$ have been estimated – it is possible to compute the upper-bound (9.54) of $\operatorname{Cost}\big(\widehat I^{ML2R}_M\big)$ for various values of $N$ and select the minimizing root.

ℵ Practitioner’s corner II: Design of a confidence interval at level α


We temporarily adopt in this section the slight abuse of notation  M L2R
I M(ε) to denote the
ML2R estimator designed to satisfy the prescribed RMSE ε.
The asymptotic behavior of ML2R estimators as the RMSE ε → 0, especially the
CLT , has been investigated in several papers (see [34, 35, 75], among others) in
various frameworks (Brownian diffusions, Lévy-driven diffusions,…). In [112], a
general approach is proposed. It is established, under a sharp version of (V E)β and,
when β ≤ 1, an additional uniform integrability condition that both weighted and
regular Multilevel estimators are ruled by CLT at rate ε as the RMSE ε → 0. By a
sharp version of (V E)β , we mean
 

−β 
2
 1 β
lim h Yh − Y 2 = v1 1 −
h
h→0, h∈H N N

where v1 ∈ (0, V1 ].
As a consequence, it is possible to design a confidence interval under the assumption that this asymptotic Gaussian behavior holds true. The reasoning is as follows: we start from the fact that
\[
\big\|\widehat I^{ML2R}_{M(\varepsilon)}-I_0\big\|_2^2=\operatorname{Bias}^2\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)+\operatorname{Var}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)=\varepsilon^2,
\]
keeping in mind that $I_0=\mathbb E\,Y_0$. Then, both $\big|\operatorname{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|$ and the standard deviation $\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ are dominated by $\varepsilon$. Let $q_c$ denote the quantile at confidence level $c\in(0,1)$ of the normal distribution (²). If $\varepsilon$ is small enough,
\[
\mathbb P\Big(\widehat I^{ML2R}_{M(\varepsilon)}\in\big[\mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}-q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big),\ \mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big]\Big)\simeq c.
\]
On the other hand, $\mathbb E\,\widehat I^{ML2R}_{M(\varepsilon)}=I_0+\operatorname{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$, so that (writing $\widehat I$ for $\widehat I^{ML2R}_{M(\varepsilon)}$)
\[
\big[\mathbb E\,\widehat I-q_c\,\sigma(\widehat I),\ \mathbb E\,\widehat I+q_c\,\sigma(\widehat I)\big]\subset\Big[I_0-\big(\big|\operatorname{Bias}(\widehat I)\big|+q_c\,\sigma(\widehat I)\big),\ I_0+\big(\big|\operatorname{Bias}(\widehat I)\big|+q_c\,\sigma(\widehat I)\big)\Big].\qquad(9.72)
\]

(²) $c$ is used here to avoid confusion with the exponent $\alpha$ of the weak error expansion.

Now it is clear that the maximum of $u+q_cv$ under the constraints $u,v\ge0$, $u^2+v^2=\varepsilon^2$ is equal to $\sqrt{1+q_c^2}\,\varepsilon$ (by the Cauchy–Schwarz Inequality). Combining these facts eventually leads to the following (conservative) confidence interval:
\[
\mathbb P\Big(\widehat I^{ML2R}_{M(\varepsilon)}\in\big[I_0-\sqrt{1+q_c^2}\,\varepsilon,\ I_0+\sqrt{1+q_c^2}\,\varepsilon\big]\Big)\gtrsim c.\qquad(9.73)
\]
Note that in general $1-c$ is small, so that $q_c>1$. A sharper (i.e. narrower) confidence interval at a given confidence level $c$ can be obtained by estimating on-line the empirical variance of the estimator based on Eq. (9.41), namely
\[
\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\simeq\Bigg[\frac1{M_1^2}\sum_{k=1}^{M_1}\Big(Y^{(1),k}_{\mathbf h}-\overline Y^{(1)}_{\mathbf h,M_1}\Big)^2+\sum_{i=2}^R\frac{W_i^2}{M_i^2}\sum_{k=1}^{M_i}\Big(\big(Y^{(i),k}_{\frac{\mathbf h}{n_i}}-Y^{(i),k}_{\frac{\mathbf h}{n_{i-1}}}\big)-\overline{\big(Y_{\frac{\mathbf h}{n_i}}-Y_{\frac{\mathbf h}{n_{i-1}}}\big)}_{M_i}\Big)^2\Bigg]^{\frac12},
\]
where $\overline Y_M=\frac1M\sum_{k=1}^MY^k$ denotes the empirical mean of i.i.d. copies $Y^k$ of the inner variable $Y$, $M_i=\lceil q_i(\varepsilon)M(\varepsilon)\rceil$, etc. Then, one evaluates the (squared) bias by setting
\[
\big|\widehat{\operatorname{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|\simeq\Big(\varepsilon^2-\widehat{\operatorname{Var}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\Big)^{\frac12}.
\]
Then, plugging these estimates into (9.72), i.e. replacing the bias $\operatorname{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ and the standard deviation $\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ by their above estimates $\widehat{\operatorname{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ and $\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$, respectively, in the magnitude $\big|\operatorname{Bias}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)$ of the theoretical confidence interval yields the expected empirical confidence interval at level $c$:
\[
I_{\varepsilon,c}=\Big[I_0-\big(\big|\widehat{\operatorname{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big),\ I_0+\big(\big|\widehat{\operatorname{Bias}}\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big|+q_c\,\widehat\sigma\big(\widehat I^{ML2R}_{M(\varepsilon)}\big)\big)\Big].
\]
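A minimal sketch of this empirical confidence interval, centered in practice at the computed estimate (the name `empirical_ci` is ours), reads:

```python
import math
from statistics import NormalDist

def empirical_ci(point, eps, var_hat, c):
    """Empirical confidence interval at level c: half-width
    |Bias_hat| + q_c * sigma_hat, with Bias_hat^2 = eps^2 - Var_hat
    (clipped at 0), as described above."""
    q_c = NormalDist().inv_cdf((1.0 + c) / 2.0)   # two-sided normal quantile
    sigma_hat = math.sqrt(var_hat)
    bias_hat = math.sqrt(max(eps ** 2 - var_hat, 0.0))
    half = bias_hat + q_c * sigma_hat
    return point - half, point + half

lo, hi = empirical_ci(point=1.0, eps=0.01, var_hat=0.5e-4, c=0.95)
```

The interval is always narrower than the conservative one of (9.73), since $|\widehat{\operatorname{Bias}}|+q_c\widehat\sigma\le\sqrt{1+q_c^2}\,\varepsilon$ whenever $\widehat{\operatorname{Bias}}{}^2+\widehat\sigma^2=\varepsilon^2$.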

ℵ Practitioner's corner III: The assumptions $(VE)_\beta$ and $(WE)^\alpha_R$

In practice, one often checks $(VE)_\beta$ as a consequence of $(SE)_\beta$, as illustrated below.

▷ Brownian diffusions (discretization schemes)

Our task here is essentially to simulate two discretization schemes with steps $h$ and $h/N$ in a coherent way, i.e. based on increments of the same Brownian motion. A discretization scheme with step $h=\frac Tn$ formally reads
\[
\bar X^{(n)}_{kh}=\Psi\big(\bar X^{(n)}_{(k-1)h},\sqrt h\,U^{(n)}_k\big),\quad k=1,\dots,n,\qquad\bar X^n_0=X_0,
\]
where $U^{(n)}_k=\frac{W_{kh}-W_{(k-1)h}}{\sqrt h}\overset{d}{=}\mathcal N(0;I_q)$ are i.i.d. The most elementary way to jointly simulate the coarse scheme $\bar X^{(n)}$ and the fine scheme $\bar X^{(Nn)}$ is to start by simulating a path $\big(\bar X^{(Nn)}_{k\frac hN}\big)_{k=0,\dots,nN}$ of the fine scheme using $\big(U^{(Nn)}_k\big)_{k=1,\dots,nN}$, and then simulate the path $\big(\bar X^{(n)}_{kh}\big)_{k=0,\dots,n}$ using the sequence $\big(U^{(n)}_k\big)_{k=1,\dots,n}$ defined by
\[
U^{(n)}_k=\frac1{\sqrt N}\sum_{\ell=1}^NU^{(Nn)}_{(k-1)N+\ell},\qquad k=1,\dots,n.
\]
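This joint simulation can be sketched for a scalar scheme as follows (the drift/diffusion coefficients and all names below are our own toy choices, not the book's):

```python
import random, math

def coupled_euler(x0, b, sigma, T, n, N, rng):
    """Simulate one path of the coarse Euler scheme (step h = T/n) and of the
    fine scheme (step h/N) driven by the SAME Brownian motion: the coarse
    normalized increments aggregate the N fine ones, divided by sqrt(N)."""
    h = T / n
    u_fine = [rng.gauss(0.0, 1.0) for _ in range(n * N)]
    u_coarse = [sum(u_fine[k * N:(k + 1) * N]) / math.sqrt(N) for k in range(n)]
    x_f = x0
    for u in u_fine:
        x_f = x_f + b(x_f) * h / N + sigma(x_f) * math.sqrt(h / N) * u
    x_c = x0
    for u in u_coarse:
        x_c = x_c + b(x_c) * h + sigma(x_c) * math.sqrt(h) * u
    return x_c, x_f

rng = random.Random(0)
# Sanity check with b = 0, sigma = 1: both schemes are then exact
# (X_T = x0 + W_T), so the coarse and fine terminal values coincide.
xc, xf = coupled_euler(0.0, lambda x: 0.0, lambda x: 1.0, T=1.0, n=10, N=4, rng=rng)
```

The additive-noise check works because the coarse terminal value $\sqrt h\sum_kU^{(n)}_k$ equals the fine one $\sqrt{h/N}\sum_kU^{(Nn)}_k$ by construction of the coupling.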

– Condition $(VE)_\beta$ or $(SE)^0_\beta$. For discretization schemes of Brownian diffusions, the bias parameter is given by $h=\frac Tn\in\mathcal H=\big\{\frac Tn,\ n\ge1\big\}$ and $Y_h=F\big((\bar X^n_t)_{t\in[0,T]}\big)$, $F\big((\widetilde X^n_t)_{t\in[0,T]}\big)$ (Euler schemes) or $F\big((\widetilde X^{n,mil}_t)_{t\in[0,T]}\big)$ (genuine Milstein scheme).

The easiest condition to check is $(SE)^0_\beta$, since it can usually be established by combining the (Hölder/Lipschitz) regularity of the functional $F$ with respect to the $\|\cdot\|_{\sup}$-norm with the $L^2$-rate of convergence of the (simulable) discretization scheme under consideration ($\sqrt h$ for the discrete-time Euler scheme at time $T$, $\sqrt{h\log(1/h)}$ for the sup-norm and $h$ for the Milstein scheme).

The condition $(VE)_\beta$ is not as easy to establish. However, as far as the Euler scheme is concerned, when $h\in\mathcal H$ and $h'=\frac hN$, it can be derived from a functional Central Limit Theorem satisfied by the difference $\bar X^n-\bar X^{nN}$, which, normalized at rate $\sqrt{\frac{nN}{N-1}}$, weakly converges toward a diffusion process free of $N$ as $n\to+\infty$. Combined with appropriate regularity and growth assumptions on the function $f$ or the functional $F$, this yields
\[
\big\|Y_h-Y_{\frac hN}\big\|_2^2\le V_1\Big|h-\frac hN\Big|^{\beta}\quad\text{with }\beta=1.
\]
For more details, we refer to [34] and to [112] for the application to path-dependent functionals.
– Condition $(WE)^\alpha_R$. As for the Euler scheme(s), we refer to Theorem 7.8 (for 1-marginals) and Theorem 8.1 for functionals. For 1-marginals $F(x)=f\big(x(T)\big)$, the property $(\mathcal E_{R+1})$ implies $(WE)^1_R$. When $F$ is a "true" functional, the available results are partial since no higher-order expansion of the weak error has been established, and $(WE)^\alpha_R$, usually with $\alpha=\frac12$ (as for barrier options) or $1$ (as for lookback payoffs), should still be considered as a conjecture, though massively confirmed by numerical experiments.
▷ Nested Monte Carlo

We retain the notation introduced at the beginning of the chapter. For convenience, we write
\[
\Phi_0=\mathbb E\big(F(X,Z)\,|\,X\big)\quad\text{and}\quad\Phi_h=\frac1K\sum_{k=1}^KF(X,Z_k),\qquad h=\frac1K\in\mathcal H=\Big\{\frac1K,\ K\in K_0\mathbb N^*\Big\},
\]
so that $Y_0=f(\Phi_0)$ and $Y_h=f(\Phi_h)$.
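The coupled simulation of two consecutive nested levels, reusing the inner draws $Z_k$, can be sketched as follows; the test function $f(u)=u^2$ and the inner payoff $F(x,z)=x+z$ are illustrative choices of ours:

```python
import random

def nested_level_diff(f, F, x_sampler, z_sampler, K, N, rng):
    """One coupled sample of Y_coarse = f((1/K) sum F(X, Z_k)) and of the
    level difference Y_fine - Y_coarse with fine size N*K, reusing the first
    K inner draws Z_k (the same Z_k appear inside each instance of f)."""
    x = x_sampler(rng)
    z = [z_sampler(rng) for _ in range(N * K)]
    vals = [F(x, zk) for zk in z]
    phi_coarse = sum(vals[:K]) / K
    phi_fine = sum(vals) / (N * K)
    return f(phi_coarse), f(phi_fine) - f(phi_coarse)

rng = random.Random(7)
f = lambda u: u * u                      # smooth test function (our choice)
F = lambda x, z: x + z                   # toy inner payoff (our choice)
m = 4000
diffs = [nested_level_diff(f, F, lambda r: r.gauss(0, 1), lambda r: r.gauss(0, 1),
                           K=10, N=2, rng=rng)[1] for _ in range(m)]
mean_d = sum(diffs) / m
var_d = sum((d - mean_d) ** 2 for d in diffs) / m
```

The sample variance `var_d` of the level difference is an order of magnitude below that of the coarse variable itself, which is precisely what the multilevel allocation exploits.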
– Condition $(VE)_\beta$ or $(SE)_\beta$. The following proposition yields criteria for $(SE)_\beta$ to be fulfilled.

Proposition 9.2 (Strong error) (a) Lipschitz continuous function $f$, quadratic case. Assume $F(X,Z)\in L^2$. If $f$ is Lipschitz continuous, then
\[
\forall\,h,h'\in\mathcal H\cup\{0\},\qquad\big\|Y_h-Y_{h'}\big\|_2^2\le V_1\,|h-h'|,
\]
where $V_1=[f]^2_{\mathrm{Lip}}\,\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|_2^2\le[f]^2_{\mathrm{Lip}}\operatorname{Var}\big(F(X,Z)\big)$, so that $(SE)_\beta$ (hence $(VE)_\beta$) is satisfied with $\beta=1$ by setting $h'=0$.

(b) Lipschitz continuous function $f$, $L^p$-case. If, furthermore, $F(X,Z)\in L^p$, $p\ge2$, then
\[
\forall\,h,h'\in\mathcal H\cup\{0\},\qquad\big\|Y_h-Y_{h'}\big\|_p^p\le V_{\frac p2}\,|h-h'|^{\frac p2},\qquad(9.74)
\]
where $V_{\frac p2}=\big(2\,[f]_{\mathrm{Lip}}\,C^{MZ}_p\big)^p\,\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|_p^p$ and $C^{MZ}_p$ is the universal constant in the right-hand side of the Marcinkiewicz–Zygmund (or B.D.G.) Inequality (6.49).

(c) Indicator function $f=\mathbf 1_{(-\infty,a]}$, $a\in\mathbb R$. Assume $F(X,Z)\in L^p$ for some $p\ge2$. Assume that the distributions of $\Phi_0$, $\Phi_h$, $h\in(0,h_0]\cap\mathcal H$ ($h_0\in\mathcal H$), are absolutely continuous with uniformly bounded densities $g_h$. Then, for every $h,h'\in(0,h_0]\cap\mathcal H$,
\[
\big\|\mathbf 1_{\{\Phi_h\le a\}}-\mathbf 1_{\{\Phi_{h'}\le a\}}\big\|_2^2\le C_{X,Y,p}\,|h-h'|^{\frac p{2(p+1)}},
\]
where
\[
C_{X,Y,p}=\big(p^{\frac1{p+1}}+p^{-\frac p{p+1}}\big)\Big(2\sup_{h\in(0,h_0]\cap\mathcal H}\|g_h\|_{\sup}\Big)^{\frac p{p+1}}\Big(2\,C^{MZ}_p\,\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|_p\Big)^{\frac p{p+1}}.
\]
Hence, Assumption $(SE)_\beta$ holds with $\beta=\frac p{2(p+1)}\in\big(0,\frac12\big)$.

Remark. The boundedness assumption made on the densities $g_h$ of $\Phi_h$ in the above Claim (c) may look unrealistic. However, if the density $g_0$ of $\Phi_0$ is bounded and the assumptions of Theorem 9.2(b) further on – to be precise, in the item devoted to the weak error expansion $(WE)^\alpha_R$ – are satisfied for $R=0$, then
\[
g_h(\xi)\le g_0(\xi)+o\big(h^{\frac12}\big)\quad\text{uniformly in }\xi\in\mathbb R^d.
\]
Consequently, there exists an $h_0=\frac1{\bar K_0}\in\mathcal H$ such that, for every $h\in(0,h_0]$,
\[
\forall\,\xi\in\mathbb R^d,\qquad g_h(\xi)\le g_0(\xi)+1.
\]


Proof. (a) Except for the constant, this claim is a special case of Claim (b) when $p=2$. We leave the direct proof as an exercise.

(b) Set $\widetilde F(\xi,z)=F(\xi,z)-\mathbb E\,F(\xi,Z)$, $\xi\in\mathbb R^d$, $z\in\mathbb R^q$, and assume that $h=\frac1K$ and $h'=\frac1{K'}>0$, $1\le K\le K'$, $K,K'\in\mathbb N^*$. First note that, $X$ and $Z$ being independent, Fubini's theorem implies
\[
\big\|Y_h-Y_{h'}\big\|_p^p\le[f]^p_{\mathrm{Lip}}\int_{\mathbb R^d}\mathbb P_X(d\xi)\,\mathbb E\,\Big|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)-\frac1{K'}\sum_{k=1}^{K'}\widetilde F(\xi,Z_k)\Big|^p.
\]
Then, for every $\xi\in\mathbb R^d$, it follows from Minkowski's Inequality that
\[
\Big\|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)-\frac1{K'}\sum_{k=1}^{K'}\widetilde F(\xi,Z_k)\Big\|_p\le|h-h'|\,\Big\|\sum_{k=1}^K\widetilde F(\xi,Z_k)\Big\|_p+h'\,\Big\|\sum_{k=K+1}^{K'}\widetilde F(\xi,Z_k)\Big\|_p.
\]
Applying the Marcinkiewicz–Zygmund Inequality to both terms on the right-hand side of the above inequality yields
\[
\Big\|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)-\frac1{K'}\sum_{k=1}^{K'}\widetilde F(\xi,Z_k)\Big\|_p\le|h-h'|\,C^{MZ}_p\Big\|\Big(\sum_{k=1}^K\widetilde F(\xi,Z_k)^2\Big)^{\frac12}\Big\|_p+h'\,C^{MZ}_p\Big\|\Big(\sum_{k=K+1}^{K'}\widetilde F(\xi,Z_k)^2\Big)^{\frac12}\Big\|_p
\]
\[
\le C^{MZ}_p\Big(|h-h'|\,K^{\frac12}+h'\,(K'-K)^{\frac12}\Big)\big\|\widetilde F(\xi,Z)\big\|_p,
\]
where we again used (twice) Minkowski's Inequality in the last line. Finally, for every $\xi\in\mathbb R^d$,
\[
\Big\|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)-\frac1{K'}\sum_{k=1}^{K'}\widetilde F(\xi,Z_k)\Big\|_p\le C^{MZ}_p\,\big\|\widetilde F(\xi,Z)\big\|_p\Big((h-h')\frac1{\sqrt h}+h'\Big(\frac1{h'}-\frac1h\Big)^{\frac12}\Big)
\]
\[
=C^{MZ}_p\,\big\|\widetilde F(\xi,Z)\big\|_p\,(h-h')^{\frac12}\Big(\Big(1-\frac{h'}h\Big)^{\frac12}+\Big(\frac{h'}h\Big)^{\frac12}\Big)\le2\,C^{MZ}_p\,\big\|\widetilde F(\xi,Z)\big\|_p\,(h-h')^{\frac12}.
\]
Plugging this bound into the above inequality yields
\[
\big\|Y_h-Y_{h'}\big\|_p^p\le\big(2\,C^{MZ}_p\,[f]_{\mathrm{Lip}}\big)^p\int_{\mathbb R^d}\mathbb P_X(d\xi)\,\big\|\widetilde F(\xi,Z)\big\|_p^p\,(h-h')^{\frac p2}
=\big(2\,C^{MZ}_p\,[f]_{\mathrm{Lip}}\big)^p\,\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|_p^p\,(h-h')^{\frac p2}
\]
\[
\le\big(4\,C^{MZ}_p\,[f]_{\mathrm{Lip}}\big)^p\,\big\|F(X,Z)\big\|_p^p\,(h-h')^{\frac p2}.
\]
If $h'=0$, we get
\[
\mathbb E\,|Y_h-Y_0|^p\le[f]^p_{\mathrm{Lip}}\,\mathbb E\,\Big|\frac1K\sum_{k=1}^KF(X,Z_k)-\mathbb E\big(F(X,Z)\,|\,X\big)\Big|^p=[f]^p_{\mathrm{Lip}}\int_{\mathbb R^d}\mathbb E\,\Big|\frac1K\sum_{k=1}^K\widetilde F(\xi,Z_k)\Big|^p\,\mathbb P_X(d\xi)
\]
since $\mathbb E\big(F(X,Z)\,|\,X\big)=\mathbb E\,F(\xi,Z)\big|_{\xi=X}$ a.s., and the result follows as above.
(c) The proof of this claim relies on Claim (b) and the following elementary lemma.

Lemma 9.3 Let $\Phi$ and $\Phi'$ be two real-valued random variables lying in $L^p(\Omega,\mathcal A,\mathbb P)$, $p\ge1$, with densities $g_\Phi$ and $g_{\Phi'}$, respectively. Then, for every $a\in\mathbb R$,
\[
\big\|\mathbf 1_{\{\Phi'\le a\}}-\mathbf 1_{\{\Phi\le a\}}\big\|_2^2\le\big(p^{\frac1{p+1}}+p^{-\frac p{p+1}}\big)\big(\|g_\Phi\|_{\sup}+\|g_{\Phi'}\|_{\sup}\big)^{\frac p{p+1}}\,\big\|\Phi-\Phi'\big\|_p^{\frac p{p+1}}.\qquad(9.75)
\]

Proof. Let $\lambda>0$ be a free parameter. Note that
\[
\big\|\mathbf 1_{\{\Phi\le a\}}-\mathbf 1_{\{\Phi'\le a\}}\big\|_2^2=\mathbb P\big(\Phi\le a<\Phi'\big)+\mathbb P\big(\Phi'\le a<\Phi\big)
\]
\[
\le\mathbb P\big(\Phi\le a,\,\Phi'\ge a+\lambda\big)+\mathbb P\big(\Phi\le a\le\Phi'\le a+\lambda\big)+\mathbb P\big(\Phi'\le a,\,\Phi\ge a+\lambda\big)+\mathbb P\big(\Phi'\le a\le\Phi\le a+\lambda\big)
\]
\[
\le\mathbb P\big(\Phi'-\Phi\ge\lambda\big)+\mathbb P\big(\Phi-\Phi'\ge\lambda\big)+\mathbb P\big(\Phi\in[a,a+\lambda]\big)+\mathbb P\big(\Phi'\in[a,a+\lambda]\big)
\]
\[
=\mathbb P\big(|\Phi-\Phi'|\ge\lambda\big)+\mathbb P\big(\Phi\in[a,a+\lambda]\big)+\mathbb P\big(\Phi'\in[a,a+\lambda]\big)\le\frac{\mathbb E\,|\Phi-\Phi'|^p}{\lambda^p}+\lambda\big(\|g_\Phi\|_{\sup}+\|g_{\Phi'}\|_{\sup}\big).
\]
A straightforward optimization in $\lambda$ yields the announced result. ♦

At this stage, Claim (c) becomes straightforward: plugging the upper-bound of the densities and (9.74) into Inequality (9.75) of Lemma 9.3 applied with $\Phi=\Phi_h$ and $\Phi'=\Phi_{h'}$ completes the proof. ♦

▷ Exercise. (a) Prove Claim (a) of the above proposition with the announced constant.

(b) Show that if $f$ is $\rho$-Hölder, $\rho\in(0,1]$, and $p\ge\frac2\rho$, then
\[
\forall\,h,h'\in\mathcal H,\qquad\big\|Y_h-Y_{h'}\big\|_p\le[f]_{\rho,\mathrm{H\ddot ol}}\,\big\|F(X,Z)-\mathbb E\big(F(X,Z)\,|\,X\big)\big\|_{p\rho}^{\rho}\,|h-h'|^{\frac\rho2}.
\]
– Condition $(WE)^\alpha_R$. For this expansion we again need to distinguish between the smooth case and the case of indicator functions. In the non-regular case where $f$ is an indicator function, we will need a smoothness assumption on the joint distribution of $(\Phi_0,\Phi_h)$, which will be formulated as a smoothness assumption on the pair $(\Phi_0,\Phi_h-\Phi_0)$. The first expansion result in that direction was established in [128]. The result in Claim (b) below is an extension of this result in the sense that if $R=1$, our assumptions coincide with theirs.

We denote by $\varphi_0$ the natural regular (Borel) version of the conditional mean function of $F(X,Z)$ given $X$, namely
\[
\varphi_0(\xi)=\mathbb E\,F(\xi,Z)=\mathbb E\big(F(X,Z)\,|\,X=\xi\big),\qquad\xi\in\mathbb R.
\]
In particular, $\Phi_0=\varphi_0(X)$ $\mathbb P$-a.s.

Theorem 9.2 (Weak error expansion) (a) Smooth setting. Let $R\in\mathbb N^*$. Assume $F(X,Z)\in L^{2R+1}(\mathbb P)$ and let $f:\mathbb R\to\mathbb R$ be a $2R+1$ times differentiable function with bounded derivatives $f^{(k)}$, $k=R+1,\dots,2R+1$. Then, there exist real coefficients $c_1(f),\dots,c_R(f)$ such that
\[
\forall\,h\in\mathcal H,\qquad\mathbb E\,Y_h=\mathbb E\,Y_0+\sum_{r=1}^Rc_r(f)\,h^r+O\big(h^{R+\frac12}\big).\qquad(9.76)
\]

(b) Density function and smooth joint density. Assume that $F(X,Z)\in L^{2R+1}(\mathbb P)$ for some $R\in\mathbb N$ and that $d=1$. Assume the distribution of $(\Phi_0,\Phi_h-\Phi_0)$ has a smooth density with respect to the Lebesgue measure on $\mathbb R^2$. Let $g_0$ be the density of $\Phi_0$ and let $g_{\Phi_0|\Phi_h-\Phi_0=\tilde\xi}$ be (a regular version of) the conditional density of $\Phi_0$ given $\Phi_h-\Phi_0=\tilde\xi$. Assume that the functions $\varphi_0$, $g_0$, $g_{\Phi_0|\Phi_h-\Phi_0=\tilde\xi}$, $\tilde\xi\in\mathbb R$, are $2R+1$ times differentiable. Assume that $\varphi_0$ is monotone (hence one-to-one) with a derivative which never vanishes. If $\sup_{h\in\mathcal H,\,\tilde\xi,\xi\in\mathbb R}\big|g^{(2R+1)}_{\Phi_0|\Phi_h-\Phi_0=\tilde\xi}(\xi)\big|<+\infty$, then
\[
g_h(\xi)=g_0(\xi)+\sum_{r=1}^R\frac{h^r}{r!}\,g_0(\xi)\,P_r(\xi)+O\big(h^{R+\frac12}\big)\quad\text{uniformly in }\xi\in\mathbb R,\qquad(9.77)
\]
where the functions $P_r$ are $2(R-r)+1$ times differentiable, $r=1,\dots,R$.

(c) Indicator function and smooth joint density. Let $G_h$ and $G_0$ denote the c.d.f.s of $\Phi_h$ and $\Phi_0$, respectively. Under the assumptions of (b), if, furthermore, $\sup_{h\in\mathcal H,\,\tilde\xi,\xi\in\mathbb R}\big|g^{(2R)}_{\Phi_0|\Phi_h-\Phi_0=\tilde\xi}(\xi)\big|<+\infty$ and $\lim_{\xi\to-\infty}g^{(2R)}_{\Phi_0|\Phi_h-\Phi_0=\tilde\xi}(\xi)=0$ for every $\tilde\xi\in\mathbb R$, then one also has, for every $\xi\in\mathbb R$,
\[
G_h(\xi)=G_0(\xi)+\sum_{r=1}^R\frac{h^r}{r!}\,\mathbb E\big(P_r(\Phi_0)\mathbf 1_{\{\Phi_0\le\xi\}}\big)+O\big(h^{R+\frac12}\big).\qquad(9.78)
\]
If $f=f_a=\mathbf 1_{(-\infty,a]}$, $a\in\mathbb R$, then $\mathbb E\,Y_h=G_h(a)$, $h\in\mathcal H\cup\{0\}$, so that
\[
\mathbb E\,Y_h=\mathbb E\,Y_0+\sum_{r=1}^R\frac{h^r}{r!}\,\mathbb E\big(P_r(\Phi_0)\mathbf 1_{\{\Phi_0\le a\}}\big)+O\big(h^{R+\frac12}\big).
\]

We will admit these results, whose proofs turn out to be too specific and technical for this textbook. A first proof of Claim (a) can be found in [198]. The whole theorem is established in [113, 114], with closed forms for the coefficients $c_r(f)$ and the functions $P_r$, involving the derivatives of $f$ (for the coefficients $c_r(f)$), the conditional mean $\varphi_0$, the densities $g_h$, $g_{\Phi_0|\Phi_h-\Phi_0}$, their derivatives and the so-called partial Bell polynomials (see [69], p. 307). Claim (c) derives from (b) by integrating with respect to $\xi$, and it can be extended to any bounded Borel function instead of $\mathbf 1_{(-\infty,a]}$.
– Complexity of nested Monte Carlo. When computing the ML2R estimator in nested Monte Carlo simulation, one has, at each fine level $i\ge2$,
\[
Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}=f\Big(\frac1{n_iK}\sum_{k=1}^{n_iK}F(X,Z_k)\Big)-f\Big(\frac1{n_{i-1}K}\sum_{k=1}^{n_{i-1}K}F(X,Z_k)\Big)
\]
with the same terms $Z_k$ inside each instance of the function $f$. Hence, if we neglect the computational cost of $f$ itself, the complexity of the fine level $i$ reads
\[
\bar\kappa\,n_iK=\bar\kappa\,\frac{n_i}h.
\]
As a consequence, it is easily checked that $\sqrt{n_{j-1}+n_j}$ should be replaced by $\sqrt{n_j}$ in the above Table 9.1.

9.5.2 Regular Multilevel Estimator (Under First-Order Weak Error Expansion)

What is called here a "regular" multilevel estimator is the original multilevel estimator introduced by M. Giles in [107] (see also [148] in a numerical integration framework and [167] for the statistical Romberg extrapolation). The bias reduction is obtained by an appropriate stratification based on a simple "domino" or cascade property described hereafter. This simpler, weightless structure makes its analysis possible under a first-order weak error expansion (and the $\beta$-control of the strong convergence similar to that introduced for ML2R estimators). We will see later, in Theorem 9.3, the impact on the performance of this family of estimators compared to the ML2R family detailed in Theorem 9.1 of the previous section.

The regular Multilevel estimator

We assume that there exists a first-order weak error expansion, namely $(WE)^\alpha_1$:
\[
(WE)^\alpha_1\quad\equiv\quad\mathbb E\,Y_h=\mathbb E\,Y_0+c_1h^\alpha+o\big(h^\alpha\big).\qquad(9.79)
\]

Like for the ML2R estimator, we assume the refiners are powers of a root $N$: $n_i=N^{i-1}$, $i=1,\dots,L$, where $L$ denotes throughout this section the depth of the regular multilevel estimator, following [107]. Note that, except for the final step – the depth optimization in $L$ – most of what follows is similar to what we did in the weighted framework, provided one sets $R=L$ and $W_i^{(L)}=1$, $i=1,\dots,L$.

Step 1 (Killing the bias). Let $L\ge2$ be an integer, which will be the depth of the regular multilevel estimator. Then
\[
\mathbb E\,Y_{\frac h{n_L}}-\mathbb E\,Y_0=c_1\Big(\frac h{n_L}\Big)^\alpha+o\Big(\Big(\frac h{n_L}\Big)^\alpha\Big).
\]
Now, introducing artificially a telescopic sum, we get
\[
\mathbb E\,Y_{\frac h{n_L}}=\mathbb E\,Y_h+\Big(\mathbb E\,Y_{\frac h{n_2}}-\mathbb E\,Y_h\Big)+\cdots+\Big(\mathbb E\,Y_{\frac h{n_L}}-\mathbb E\,Y_{\frac h{n_{L-1}}}\Big)=\mathbb E\,Y_h+\sum_{i=2}^L\mathbb E\Big(Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}\Big)
\]
\[
=\mathbb E\Bigg[\underbrace{Y_h}_{\text{coarse level}}+\sum_{i=2}^L\underbrace{\Big(Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}\Big)}_{\text{refined level }i}\Bigg].
\]

We will again assume that the variables $Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}$ at each level are sampled independently, i.e.
\[
\mathbb E\,Y_0=\mathbb E\Bigg[Y^{(1)}_h+\sum_{i=2}^L\Big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\Big)\Bigg]-c_1\Big(\frac h{n_L}\Big)^\alpha+o\Big(\Big(\frac h{n_L}\Big)^\alpha\Big),
\]
where the families $Y^{(i)}_{n,h}$, $i=1,\dots,L$, are independent.
Step 2 (Regular Multilevel estimator). As already noted in the previous section, $Y^{(1)}_h$ has no reason to be small since it is close to a copy of $Y_0$, whereas, by contrast, $Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}$ is approximately $0$ as the difference of two variables approximating $Y_0$. The seminal idea of the multilevel paradigm, introduced by Giles in [107], is to dispatch $L$ different simulation sizes $M_i$ across the levels $i=1,\dots,L$ so that $M_1+\cdots+M_L=M$. This leads to the following family of (regular) multilevel estimators attached to $(Y_h)_{h\in\mathcal H}$ and the refiners vector $n$:
\[
\widehat I^{MLMC}_{M,M_1,\dots,M_L}:=\frac1{M_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^L\frac1{M_i}\sum_{k=1}^{M_i}\Big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\Big),
\]
where $M_1,\dots,M_L\in\mathbb N^*$, $M_1+\cdots+M_L=M$, and $\big(Y^{(i),k}_{n,h}\big)_{i=1,\dots,L,\,k\ge1}$ are i.i.d. copies of $Y_{n,h}:=\big(Y_{\frac h{n_i}}\big)_{i=1,\dots,L}$.
As for weighted ML2R estimators, it is more convenient to rewrite the above multilevel estimator as a stratified estimator by setting $q_i=\frac{M_i}M$, $i=1,\dots,L$. This leads us to the following formal definition.

Definition 9.5 The family of (regular) multilevel estimators attached to $(Y_h)_{h\in\mathcal H}$ and the refiners vector $n$ is defined as follows: for every $h\in\mathcal H$, every $q=(q_i)_{1\le i\le L}\in\mathcal S_L$ and every integer $M\ge1$,
\[
\widehat I^{MLMC}_M=\widehat I^{MLMC}_M(q,h,L,n)=\frac1{M_1}\sum_{k=1}^{M_1}Y^{(1),k}_h+\sum_{i=2}^L\frac1{M_i}\sum_{k=1}^{M_i}\Big(Y^{(i),k}_{\frac h{n_i}}-Y^{(i),k}_{\frac h{n_{i-1}}}\Big)\qquad(9.80)
\]
(with the convention $\frac10\sum_{k=1}^0=0$), where $M_i=\lceil q_iM\rceil$, $i=1,\dots,L$, and $\big(Y^{(i),k}_{n,h}\big)$, $i=1,\dots,L$, $k\ge1$, are i.i.d. copies of $Y_{n,h}:=\big(Y_{\frac h{n_i}}\big)_{i=1,\dots,L}$.

This definition is similar to that of the ML2R estimators in which all the weights $W_i$ would have been set equal to $1$. Note that, as soon as $M\ge M(q):=\big(\min_iq_i\big)^{-1}$, all $M_i=\lceil q_iM\rceil\ge1$, $i=1,\dots,L$. This condition on $M$ will be implicitly assumed in what follows, so that no level is empty.
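Definition 9.5 translates directly into code. The sketch below reuses the toy nested Monte Carlo model introduced for Practitioner's corner III (the model and all names are illustrative assumptions of ours, with $f=\mathrm{id}$, $F(X,Z)=X+Z$, so that $\mathbb E\,Y_0=\mathbb E\,X=0$ and the estimator is even unbiased):

```python
import random, math

def mlmc_estimate(level_sampler, q, M, rng):
    """Regular MLMC estimator (9.80): allocation q = (q_1, ..., q_L) in the
    simplex, total budget M; level_sampler(i, rng) returns one sample of the
    coarse variable (i = 1) or of the level-i difference (i >= 2)."""
    est = 0.0
    for i, qi in enumerate(q, start=1):
        Mi = max(1, math.ceil(qi * M))
        est += sum(level_sampler(i, rng) for _ in range(Mi)) / Mi
    return est

def level_sampler(i, rng, N=2):
    # Toy nested MC: level i uses K_i = N^(i-1) inner samples, consecutive
    # levels being coupled through the same inner draws z.
    K_f = N ** (i - 1)
    x = rng.gauss(0.0, 1.0)
    z = [rng.gauss(0.0, 1.0) for _ in range(K_f)]
    fine = x + sum(z) / K_f
    if i == 1:
        return fine
    coarse = x + sum(z[:K_f // N]) / (K_f // N)
    return fine - coarse

rng = random.Random(123)
est = mlmc_estimate(level_sampler, q=[0.7, 0.2, 0.1], M=20000, rng=rng)
```

The decreasing allocation $(0.7,0.2,0.1)$ mimics the optimal policy derived below: most of the budget goes to the cheap coarse level, whose variance dominates.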
▷ Bias. The telescopic/domino structure of the estimator implies, regardless of the sizes $M_i\ge1$, that
\[
\operatorname{Bias}\big(\widehat I^{MLMC}_M\big)=\operatorname{Bias}\big(\widehat I^{MLMC}_1\big)=\mathbb E\,Y_{\frac h{n_L}}-\mathbb E\,Y_0=c_1\Big(\frac h{n_L}\Big)^\alpha+o\Big(\Big(\frac h{n_L}\Big)^\alpha\Big).\qquad(9.81)
\]

▷ Variance. Taking advantage of the independence of the levels, one straightforwardly computes the variance of $\widehat I^{MLMC}_M$:
\[
\operatorname{Var}\big(\widehat I^{MLMC}_M\big)=\frac{\operatorname{Var}\big(Y^{(1)}_h\big)}{M_1}+\sum_{i=2}^L\frac{\operatorname{Var}\Big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\Big)}{M_i}=\frac1M\Bigg(\frac{\operatorname{Var}\big(Y^{(1)}_h\big)}{q_1}+\sum_{i=2}^L\frac{\operatorname{Var}\Big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\Big)}{q_i}\Bigg)=\frac1M\sum_{i=1}^L\frac{\sigma^2(i,h)}{q_i},\qquad(9.82)
\]
where $\sigma^2(1,h)=\operatorname{Var}\big(Y^{(1)}_h\big)$ and $\sigma^2(i,h)=\operatorname{Var}\Big(Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}\Big)$, $i=2,\dots,L$.
 Complexity. The complexity, or simulation cost, of the MLMC estimator is clearly the same as that of its weighted counterpart investigated in the previous section, i.e. it is a priori given, or at least dominated, by

$$
\mathrm{Cost}\big(\widehat I^{\,MLMC}_M\big) = \frac{\bar\kappa M}{h}\sum_{i=1}^{L}\kappa_iq_i, \tag{9.83}
$$

where $\kappa_1=1$ and $\kappa_i=(1+N^{-1})\,n_i$, $i=2,\dots,L$.
As usual, at this stage, the aim is to minimize the cost or complexity of the whole simulation for a prescribed RMSE $\varepsilon>0$, i.e. solving

$$
\inf_{\mathrm{RMSE}(\widehat I^{\,MLMC}_M)\le\varepsilon}\mathrm{Cost}\big(\widehat I^{\,MLMC}_M\big). \tag{9.84}
$$
Following the lines of what was done for the weighted multilevel ML2R estimator, we check that this problem is equivalent to solving

$$
\inf_{|\mathrm{Bias}(\widehat I^{\,MLMC}_M)|<\varepsilon}\frac{\mathrm{Effort}\big(\widehat I^{\,MLMC}_M\big)}{\varepsilon^2-\mathrm{Bias}\big(\widehat I^{\,MLMC}_M\big)^2},
$$

where, as soon as $M\ge M(q)$,

$$
\mathrm{Effort}\big(\widehat I^{\,MLMC}_M\big) = \kappa\big(\widehat I^{\,MLMC}_M\big)\,\mathrm{Var}\big(\widehat I^{\,MLMC}_M\big)
= \frac{\bar\kappa}{h}\Bigg(\sum_{i=1}^{L}\kappa_iq_i\Bigg)\Bigg(\sum_{i=1}^{L}\frac{\sigma^2(i,h)}{q_i}\Bigg). \tag{9.85}
$$
Note that neither the effort nor the bias depends on $M$, provided $M\ge M(q)$.
We still proceed by first minimizing the effort for fixed $h$ (and $L$) as a function of the vector $q\in\mathcal S_L$, and then by minimizing the denominator of the above ratio in the bias parameter $h\in\mathcal H$.
 Minimizing the effort. Lemma 9.1 yields the solution to the minimization of the effort, namely

$$
\mathrm{argmin}_{q\in\mathcal S_L}\,\mathrm{Effort}\big(\widehat I^{\,MLMC}_M\big) = q^* = \frac1{q^{*,\dagger}(h)}\Big(\frac{\sigma(i,h)}{\sqrt{\kappa_i}}\Big)_{i=1,\dots,L}
\quad\text{with}\quad q^{*,\dagger}(h)=\sum_{1\le i\le L}\frac{\sigma(i,h)}{\sqrt{\kappa_i}},
$$

and a resulting minimal effort reading

$$
\min_{q\in\mathcal S_L}\mathrm{Effort}\big(\widehat I^{\,MLMC}_M\big) = \frac{\bar\kappa}{h}\Bigg(\sum_{i=1}^{L}\sqrt{\kappa_i}\,\sigma(i,h)\Bigg)^2.
$$
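The allocation of Lemma 9.1 is a one-line computation; the sketch below (a minimal illustration, not the book's code) also checks numerically that at $q^*$ the effort (9.85) collapses to the closed form above.

```python
import math

def optimal_allocation(sigma, kappa):
    """Effort-minimizing allocation of Lemma 9.1:
    q_i* proportional to sigma(i,h)/sqrt(kappa_i), normalized to sum to 1."""
    w = [s / math.sqrt(k) for s, k in zip(sigma, kappa)]
    t = sum(w)
    return [wi / t for wi in w]

def effort(q, sigma, kappa, kbar=1.0, h=1.0):
    """Effort (9.85): (kbar/h) * (sum_i kappa_i q_i) * (sum_i sigma_i^2 / q_i)."""
    return (kbar / h) * sum(k * qi for k, qi in zip(kappa, q)) \
                      * sum(s * s / qi for s, qi in zip(sigma, q))

sigma, kappa = [2.0, 1.0], [1.0, 4.0]
q_star = optimal_allocation(sigma, kappa)      # weights 2 and 0.5 -> [0.8, 0.2]
# At q*, the effort equals (kbar/h) (sum_i sqrt(kappa_i) sigma_i)^2:
closed_form = sum(math.sqrt(k) * s for s, k in zip(sigma, kappa)) ** 2
```

Any other allocation (e.g. the uniform one) yields a strictly larger effort, as the Cauchy–Schwarz argument behind Lemma 9.1 predicts.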
 Minimizing the resulting cost. Compiling the above results, the cost minimization

$$
\inf_{\mathrm{RMSE}(\widehat I^{\,MLMC}_M)\le\varepsilon}\mathrm{Cost}\big(\widehat I^{\,MLMC}_M\big)
$$

reads, if we neglect the second-order term in the bias expansion (9.81),

$$
\inf_{|\mathrm{Bias}(\widehat I^{\,MLMC}_M)|\le\varepsilon}\frac{\bar\kappa\Big(\sum_{1\le i\le L}\sqrt{\kappa_i}\,\sigma(i,h)\Big)^2}{h\Big(\varepsilon^2-c_1^2\big(\frac h{n_L}\big)^{2\alpha}\Big)}.
$$
As in the weighted framework, we adopt a (slightly) suboptimal strategy, that is, we solve

$$
\sup_{|\mathrm{Bias}(\widehat I^{\,MLMC}_M)|\le\varepsilon}\Big\{h\Big(\varepsilon^2-c_1^2\big(\tfrac h{n_L}\big)^{2\alpha}\Big)\Big\}
= \sup_{h\in\mathcal H,\;h\le n_L\left(\frac{\varepsilon}{|c_1|}\right)^{1/\alpha}}\Big\{h\Big(\varepsilon^2-c_1^2\big(\tfrac h{n_L}\big)^{2\alpha}\Big)\Big\}
\le \sup_{0<h\le n_L\left(\frac{\varepsilon}{|c_1|}\right)^{1/\alpha}}\Big\{h\Big(\varepsilon^2-c_1^2\big(\tfrac h{n_L}\big)^{2\alpha}\Big)\Big\}.
$$
This equivalence can be made rigorous (see [198] for more details), as well as the fact that it suffices (at least asymptotically as $\varepsilon\to0$) to maximize the denominator. First note that

$$
\sup_{0<h<n_L\left(\frac{\varepsilon}{|c_1|}\right)^{1/\alpha}}h\Big(\varepsilon^2-c_1^2\big(\tfrac h{n_L}\big)^{2\alpha}\Big) = \varepsilon^2\,\frac{2\alpha}{(1+2\alpha)^{1+\frac1{2\alpha}}}\Big(\frac{\varepsilon}{|c_1|}\Big)^{\frac1\alpha}n_L,
$$

the supremum being attained at

$$
\widehat h(\varepsilon) = \Big(\frac{\varepsilon}{|c_1|}\Big)^{\frac1\alpha}\frac{n_L}{(1+2\alpha)^{\frac1{2\alpha}}}, \tag{9.86}
$$
so that we are led to set

$$
h^*(\varepsilon) = \text{lower nearest neighbor of }\widehat h(\varepsilon)\text{ in }\mathcal H = \frac{\mathbf h}{\big\lceil\mathbf h\,\widehat h(\varepsilon)^{-1}\big\rceil}. \tag{9.87}
$$

When $\varepsilon\to0$, $\widehat h(\varepsilon)\to0$ and $h^*(\varepsilon)\sim\widehat h(\varepsilon)$, but of course it remains a priori suboptimal at finite range, so that
$$
\inf_{\mathrm{RMSE}(\widehat I^{\,MLMC}_M)\le\varepsilon}\mathrm{Cost}\big(\widehat I^{\,MLMC}_M\big) \le \bar\kappa\,\frac{(1+2\alpha)^{1+\frac1{2\alpha}}|c_1|^{\frac1\alpha}}{2\alpha\,n_L}\,\frac{\Big(\sum_{1\le i\le L}\sqrt{\kappa_i}\,\sigma\big(i,h^*(\varepsilon)\big)\Big)^2}{\varepsilon^{2+\frac1\alpha}}. \tag{9.88}
$$
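Formulas (9.86)–(9.87) can be coded directly; here is a minimal sketch (the numerical values in the usage line are arbitrary placeholders).

```python
import math

def h_hat(eps, c1, alpha, n_L):
    """Unconstrained maximizer (9.86) of h -> h (eps^2 - c1^2 (h/n_L)^(2 alpha))."""
    return (eps / abs(c1)) ** (1 / alpha) * n_L / (1 + 2 * alpha) ** (1 / (2 * alpha))

def h_star(eps, c1, alpha, n_L, h_bar):
    """Projection (9.87) on H = {h_bar/n : n >= 1}: the lower nearest neighbor."""
    return h_bar / math.ceil(h_bar / h_hat(eps, c1, alpha, n_L))

hh = h_hat(0.1, 1.0, 1.0, 4)          # = 0.4/sqrt(3), about 0.2309
hs = h_star(0.1, 1.0, 1.0, 4, 1.0)    # = 1/5, lower neighbor of hh in H
```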
Step 3 (Final optimization $L=L(\varepsilon)\to+\infty$). This step is devoted to the minimization of the upper bound (9.88) with respect to the depth $L$ under the additional assumption $(VE)_\beta$ (see (9.55)), which provides a non-asymptotic control of the variance of the $Y_h$.
 Non-asymptotic control of the effort. Following the lines of the weighted framework, after replacing all the weights $\mathbf W_i$ by $1$, we get for every $h\in\mathcal H$, keeping in mind that $\kappa_i=\big(1+\frac1N\big)n_i$ for $i\in\{2,\dots,L\}$,
$$
\Bigg(\sum_{i=1}^{L}\sqrt{\kappa_i}\,\sigma(i,h)\Bigg)^2 \le \mathrm{Var}(Y_h)\Bigg(1+\theta h^{\frac\beta2}(1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}\sum_{i=2}^{L}N^{(i-1)\frac{1-\beta}2}\Bigg)^2
= \mathrm{Var}(Y_h)\Bigg(1+\theta h^{\frac\beta2}a_N\,\frac{1-N^{\frac{1-\beta}2(L-1)}}{1-N^{\frac{1-\beta}2}}\Bigg)^2 \tag{9.89}
$$

with

$$
a_N = (1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}N^{\frac{1-\beta}2} \quad\text{and the convention}\quad \frac{1-1^{L-1}}{1-1}=L-1 \tag{9.90}
$$

corresponding to the case $\beta=1$ ($^3$).
 Optimization of the depth $L=L(\varepsilon)$. One checks that the upper bound (9.88) of the complexity formally goes to $0$ as $L\uparrow+\infty$ for fixed $\varepsilon>0$, since $n_L=N^{L-1}\uparrow+\infty$. As in the weighted setting, the depth $L$ is limited by the constraint on the bias parameter $h^*(\varepsilon)\le\mathbf h$. This constraint implies, owing to (9.87), that

$$
L\le L_{\max}(\varepsilon) = 1+\Bigg\lceil\frac{\log\Big((1+2\alpha)^{\frac1{2\alpha}}\big(\frac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}\mathbf h\Big)}{\log N}\Bigg\rceil.
$$
However, unlike for the ML2R estimator, the exponent $2+\frac1\alpha$ of the (prescribed) RMSE $\varepsilon$ appearing in the upper bound of the complexity does not depend on the depth $L$, so it seems natural to try to go deeper in the optimization. This can be achieved in different manners (see e.g. [141]), depending on what is assumed to be a variable or a fixed parameter. We shall carry on with our approach, which lets the bias parameter $h$ vary in $\mathcal H$.
Based on the formula (9.88) for the complexity, the expression (9.89) for the effort and the formula (9.86) for $\widehat h(\varepsilon)$ (before projection on $\mathcal H$), the minimization problem
$^3$ Note that in the nested Monte Carlo framework, $\kappa_i=n_i$, so that $a_N=(N-1)^{\frac\beta2}N^{\frac{1-\beta}2}$.
$$
\min_{1\le L\le L_{\max}(\varepsilon)}\frac{\Big(\sum_{1\le i\le L}\sqrt{\kappa_i}\,\sigma\big(i,\widehat h(\varepsilon)\big)\Big)^2}{n_L} \tag{9.91}
$$
is dominated by

$$
\sup_{h\in\mathcal H}\mathrm{Var}(Y_h)\;\min_{1\le L\le L_{\max}}\Bigg(N^{-\frac{L-1}2}+\theta\Big(\frac{\varepsilon}{|c_1|\sqrt{1+2\alpha}}\Big)^{\frac\beta{2\alpha}}a_N\,\frac{1-N^{\frac{1-\beta}2(L-1)}}{1-N^{\frac{1-\beta}2}}\,N^{-\frac{1-\beta}2(L-1)}\Bigg)^2, \tag{9.92}
$$

still with our convention when $\beta=1$, where $a_N$ is given by (9.90).
In (9.92), the function of $L$ to be minimized (where $L$ is viewed as a real variable) is not decreasing, so that saturating the constraint on $L$, e.g. by the condition $\widehat h(\varepsilon)\le\mathbf h$, may not be optimal.
Lemma 9.4 The function of $L$, viewed as a $[1,+\infty)$-valued real variable, to be minimized in (9.92) attains its minimum at

$$
\ell^*(\varepsilon) = 1+\frac1{\log N}\Bigg[\frac1\alpha\log\Big(\frac{|c_1|\sqrt{1+2\alpha}}{\varepsilon}\Big)+\frac2\beta\Bigg(\log\Bigg(\frac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}\cdot\frac1{2\theta(1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}N^{\frac{1-\beta}2}}\Bigg)\Bigg)_{\!+}\Bigg] \tag{9.93}
$$

with the convention $\dfrac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}=\log N$ if $\beta=1$.
 Exercise. Prove the lemma. [Hint: consider directly the function between brackets and inspect the three cases $\beta\in(0,1)$, $\beta=1$ and $\beta>1$.]

At this stage we define the optimized depth of the MLMC simulation for a prescribed RMSE $\varepsilon$ by

$$
L^*(\varepsilon) = \big\lceil\ell^*(\varepsilon)\big\rceil\wedge L_{\max}(\varepsilon). \tag{9.94}
$$
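The depth selection (9.93)–(9.94) can be sketched as follows (a plain transcription of the formulas above, with the $\beta=1$ convention handled explicitly; the parameter values in the tests are arbitrary).

```python
import math

def ell_star(eps, alpha, beta, theta, c1, N):
    """Unconstrained optimal depth (9.93); for beta = 1 the ratio
    (N^((1-beta)/2) - 1)/((1-beta)/2) is replaced by log N by convention."""
    t1 = math.log(abs(c1) * math.sqrt(1 + 2 * alpha) / eps) / alpha
    ratio = math.log(N) if beta == 1 else \
        (N ** ((1 - beta) / 2) - 1) / ((1 - beta) / 2)
    denom = 2 * theta * (1 + 1 / N) ** 0.5 * (N - 1) ** (beta / 2) \
        * N ** ((1 - beta) / 2)
    t2 = (2 / beta) * max(0.0, math.log(ratio / denom))
    return 1 + (t1 + t2) / math.log(N)

def L_max(eps, alpha, c1, h_bar, N):
    """Cap on the depth enforcing the bias constraint h_hat(eps) <= h_bar."""
    x = (1 + 2 * alpha) ** (1 / (2 * alpha)) * (abs(c1) / eps) ** (1 / alpha) * h_bar
    return 1 + math.ceil(math.log(x) / math.log(N))

def L_star(eps, alpha, beta, theta, c1, h_bar, N):
    """Optimized MLMC depth (9.94)."""
    return min(math.ceil(ell_star(eps, alpha, beta, theta, c1, N)),
               L_max(eps, alpha, c1, h_bar, N))
```

As expected, the optimized depth increases (logarithmically) as the prescribed RMSE shrinks.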
 Optimal size of the simulation. The global size $M^*(\varepsilon)$ of the simulation is obtained by saturating the RMSE constraint $\big\|\widehat I^{\,MLMC}_M-I_0\big\|_2\le\varepsilon$ with $h=\widehat h(\varepsilon)$, using that $\big\|\widehat I^{\,MLMC}_M-I_0\big\|_2^2=\frac{\mathrm{Var}(\widehat I^{\,MLMC}_1)}{M}+\mathrm{Bias}\big(\widehat I^{\,MLMC}_M\big)^2$. We get

$$
M^*(\varepsilon) = \frac{\mathrm{Var}\big(\widehat I^{\,MLMC}_1\big)}{\varepsilon^2-\mathrm{Bias}\big(\widehat I^{\,MLMC}_M\big)^2}.
$$

Plugging the value of the variance (9.82) and of the bias (9.81) into this equation finally yields

$$
M^*(\varepsilon) = \Bigg\lceil\Big(1+\frac1{2\alpha}\Big)\frac{\mathrm{Var}\big(\widehat I^{\,MLMC}_1\big)}{\varepsilon^2}\Bigg\rceil = \Bigg\lceil\Big(1+\frac1{2\alpha}\Big)\frac{q^{*,\dagger}\sum_{1\le i\le L(\varepsilon)}\sqrt{\kappa_i}\,\sigma\big(i,h^*(\varepsilon)\big)}{\varepsilon^2}\Bigg\rceil \tag{9.95}
$$

which can in turn be rewritten in an implementable form (see Table 9.2 in the Practitioner's corner) with the available formulas or bounds for $\kappa_i$, $\sigma(i,h)$ and $h^*(\varepsilon)$.
At this stage, it remains to inspect again the same three cases, depending on the
parameter β (β > 1, β = 1 and β ∈ (0, 1)) to obtain the following theorem.
Theorem 9.3 (MLMC estimator, see [107, 198]) Let $n_i=N^{i-1}$, $i=1,\dots,L$, $N\in\mathbb N\setminus\{0,1\}$. Assume $(SE)_\beta$ and $(WE)^\alpha_1$. Let $\theta=\theta_h=\sqrt{\frac{V_1}{\mathrm{Var}(Y_h)}}$.

(a) The MLMC estimator $\big(\widehat I^{\,MLMC}_M\big)_{M\ge1}$ satisfies, as $\varepsilon\to0$,

$$
\inf_{\substack{h\in\mathcal H,\,q\in\mathcal S_L,\,L,\,M\ge1\\ \|\widehat I^{\,MLMC}_M-I_0\|_2\le\varepsilon}}\mathrm{Cost}\big(\widehat I^{\,MLMC}_M\big) \;\lesssim\; K^{MLMC}\times\frac{\mathrm{Var}(Y_h)}{\varepsilon^2}\times
\begin{cases}
1 & \text{if }\beta>1,\\
\big(\log(1/\varepsilon)\big)^2 & \text{if }\beta=1,\\
\varepsilon^{-\frac{1-\beta}\alpha} & \text{if }\beta<1,
\end{cases}
$$

where $K^{MLMC}=K^{MLMC}(\alpha,\beta,\theta,\mathbf h,c_1,V_1)$.
(b) The rates in the right-hand side of (a) are achieved by setting, in the definition (9.80) of the estimator $\widehat I^{\,MLMC}_M$: $h=h^*(\varepsilon,L^*(\varepsilon))$, $q=q^*(\varepsilon)$, $L=L^*(\varepsilon)$ and $M=M^*(\varepsilon)$.

Closed forms for the depth $L^*(\varepsilon)$, the bias parameter $h^*=h^*(\varepsilon,L^*(\varepsilon))\in\mathcal H$, the optimal allocation vector $q^*(\varepsilon)$ and the global size $M^*(\varepsilon)$ of the simulation are reported in Table 9.2 below (see the Practitioner's corner hereafter for these formulas and their variants).
Remarks. • The MLMC estimator achieves the rate $\varepsilon^{-2}$ of an unbiased simulation as soon as $\beta>1$ (fast strong rate) under lighter conditions than ML2R, since it only requires a first-order weak error expansion.
• When $\beta=1$, the cost of the MLMC estimator for a prescribed RMSE is $o(\varepsilon^{-2-\eta})$ for every $\eta>0$, but it is slower by a factor $\log\big(\frac1\varepsilon\big)$ than the weighted ML2R multilevel estimator. Keep in mind that for $\varepsilon=1\,\mathrm{bp}=10^{-4}$, $\log\big(\frac1\varepsilon\big)\simeq9.2$.
• When $\beta\in(0,1)$, the simulation cost is no longer close to that of an unbiased simulation, since $\varepsilon^{-\frac{1-\beta}\alpha}\to+\infty$ as $\varepsilon\to0$ at a polynomial rate, while the weighted ML2R estimator remains close to the unbiased framework, see Theorem 9.1 and the comments that follow. In that setting, which is of great interest in derivative pricing since it corresponds, for example, to barrier options or $\gamma$-hedging, the weighted ML2R estimator clearly outperforms the regular MLMC estimator.
Proof of Theorem 9.3. The announced rates can be obtained by saturating the value of $h$ at $\mathbf h$, i.e. by replacing $\ell^*(\varepsilon)$ in (9.94) by $+\infty$. Then the proof follows the lines of that of Theorem 9.1. At some place one has to use, owing to $(VE)_\beta$, that

$$
\mathrm{Var}(Y_{h^*(\varepsilon)}) \le \big(\sigma(Y_{h^*(\varepsilon)}-Y_{\mathbf h})+\sigma(Y_{\mathbf h})\big)^2
\le \Big(V_1^{\frac12}\big(\mathbf h-h^*(\varepsilon)\big)^{\frac\beta2}+\sigma(Y_{\mathbf h})\Big)^2 \le \Big(V_1^{\frac12}\,\mathbf h^{\frac\beta2}+\sigma(Y_{\mathbf h})\Big)^2.
$$

Details are left to the reader. ♦
Table 9.2 Parameters of the MLMC estimator (standard framework)

– $n$: $n_i=N^{i-1}$, $i=1,\dots,L$ (convention $n_0=n_0^{-1}=0$).
– $L=L^*(\varepsilon)$: $\min\Bigg(1+\Bigg\lceil\dfrac{\log\big((1+2\alpha)^{\frac1{2\alpha}}\big(\frac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}\mathbf h\big)}{\log N}\Bigg\rceil,\;\big\lceil\ell(\varepsilon)\big\rceil\Bigg)$.
– $\ell(\varepsilon)$: $1+\dfrac{\log\big((1+2\alpha)^{\frac1{2\alpha}}\big(\frac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}\big)}{\log N}+\dfrac2{\beta\log N}\Bigg(\log\Bigg(\dfrac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}\cdot\dfrac1{2\theta N^{\frac{1-\beta}2}(1+N^{-1})^{\frac12}(N-1)^{\frac\beta2}}\Bigg)\Bigg)_{\!+}$, with the convention $\frac{N^{\frac{1-\beta}2}-1}{\frac{1-\beta}2}=\log N$ when $\beta=1$.
– $h=h^*(\varepsilon)$: $\mathbf h\Big/\Big\lceil\mathbf h\,(1+2\alpha)^{\frac1{2\alpha}}\big(\tfrac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}N^{-(L-1)}\Big\rceil$.
– $q=q^*(\varepsilon)$: $q_1(\varepsilon)=\dfrac1{q^\dagger_\varepsilon}$, $q_j(\varepsilon)=\dfrac{\theta h^{\frac\beta2}}{q^\dagger_\varepsilon}\,\dfrac{\big(n^{-1}_{j-1}-n^{-1}_j\big)^{\frac\beta2}}{\sqrt{n_{j-1}+n_j}}$, $j=2,\dots,L$, with $q^\dagger_\varepsilon$ s.t. $\sum_{j=1}^Lq_j(\varepsilon)=1$ and $\theta=\sqrt{\dfrac{V_1}{\mathrm{Var}(Y_h)}}$.
– $M=M^*(\varepsilon)$: $\Bigg\lceil\Big(1+\dfrac1{2\alpha}\Big)\dfrac{\mathrm{Var}(Y_h)\,q^\dagger_\varepsilon\Big(1+\theta h^{\frac\beta2}\sum_{j=2}^{L}\big(n^{-1}_{j-1}-n^{-1}_j\big)^{\frac\beta2}\sqrt{n_{j-1}+n_j}\Big)}{\varepsilon^2}\Bigg\rceil$.
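The allocation and sample-size rows of Table 9.2 can be transcribed directly; the sketch below uses the cost convention $\kappa_1=1$, $\kappa_j=n_{j-1}+n_j$, and the numerical inputs in the usage line are rough placeholders borrowed from the Euler/Call orders of magnitude quoted in Sect. 9.7.2, not calibrated values.

```python
import math

def mlmc_parameters(eps, var_Yh, V1, h, alpha, beta, L, N):
    """Allocation q*(eps) and global size M*(eps) of Table 9.2, under the
    cost convention kappa_1 = 1, kappa_j = n_{j-1} + n_j for j >= 2."""
    theta = math.sqrt(V1 / var_Yh)
    n = [N ** j for j in range(L)]                 # n_1 = 1, ..., n_L = N^(L-1)
    s = [1.0] + [theta * h ** (beta / 2)
                 * (1 / n[j - 1] - 1 / n[j]) ** (beta / 2)
                 for j in range(1, L)]             # sigma(j,h)/sigma(1,h) proxy
    kap = [1.0] + [n[j - 1] + n[j] for j in range(1, L)]
    q_dag = sum(sj / math.sqrt(kj) for sj, kj in zip(s, kap))
    q = [sj / math.sqrt(kj) / q_dag for sj, kj in zip(s, kap)]
    M = math.ceil((1 + 1 / (2 * alpha)) * var_Yh * q_dag
                  * sum(sj * math.sqrt(kj) for sj, kj in zip(s, kap)) / eps ** 2)
    return q, M

q, M = mlmc_parameters(eps=0.01, var_Yh=137.289, V1=40.2669, h=0.2,
                       alpha=1, beta=1, L=5, N=2)
```

Note that the coarse level always receives the largest fraction of the budget, as the decreasing level variances dictate.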
ℵ Practitioner's corner
Most recommendations are similar to those made for the ML2R estimator. We focus in what follows on specific features of the regular MLMC estimators. The parameters have been designed for a complexity of fine level $i$ given by $\bar\kappa(n_i+n_{i-1})$. When this complexity is of the form $\bar\kappa\,n_i$ (like in nested Monte Carlo), the terms $\sqrt{n_j+n_{j-1}}$ should be replaced by $\sqrt{n_j}$.
 Payoff-dependent parameters for the MLMC estimator
We set $n_i=N^{i-1}$, $i=1,\dots,L$. The values of the (optimized) parameters are listed in Table 9.2 ($*$ superscripts are often removed for simplicity).
Let us make a few additional remarks:
• Note that, by contrast with the ML2R framework, there is no a priori reason to choose $N=2$ as a root when $\beta\in(0,1)$.
• The value of the depth $L^*(\varepsilon)$ proposed in Table 9.2 is sharper than the value $\widetilde L^*(\varepsilon)=1+\Big\lceil\log\big((1+2\alpha)^{\frac1{2\alpha}}\big(\frac{|c_1|}{\varepsilon}\big)^{\frac1\alpha}\mathbf h\big)/\log N\Big\rceil$ given in [198]. However, the impact of this sharper optimized depth on the numerical simulations remains limited, and $\widetilde L^*(\varepsilon)$ is a satisfactory choice in practice (it corresponds to setting $\ell^*(\varepsilon)=+\infty$): in fact, in most situations, one observes that $L^*(\varepsilon)=\widetilde L^*(\varepsilon)$. In any case, the optimized depth grows as $O\big(\log(1/\varepsilon)\big)$ when $\varepsilon\to0$.
 Numerical estimation of the parameters $\mathrm{Var}(Y_h)$ (with $h=h^*(\varepsilon)$), $V_1$ and $\theta$
– The same remark about possible variants of the complexity of the simulation of $Y_{\frac h{n_i}}-Y_{\frac h{n_{i-1}}}$ can be made for the regular MLMC estimator, with the same impact on the computation of the allocation vector $q(\varepsilon)$.
– Like with the ML2R estimator, one needs to estimate $\mathrm{Var}(Y_h)$, this time with $h=h^*(\varepsilon)$, and the constant $V_1$ in $(VE)_\beta$ in order to have a proxy of $\mathrm{Var}(Y_{h^*(\varepsilon)})$ and of $\theta=\sqrt{\frac{V_1}{\mathrm{Var}(Y_{h^*(\varepsilon)})}}$, both used to specify the parameters reported in Table 9.2. The estimation procedures are the same as those developed in Sect. 9.5.1, Practitioner's corner I (see (9.70) and (9.71)).
 Calibration of the parameter $c_1$
This is an opportunity for improvement offered by the MLMC estimator: if the coefficient $\alpha$ is known, e.g. by theoretical means, it is possible to estimate the coefficient $c_1$ in the first-order weak error expansion $(WE)^\alpha_1$. First we approximate $c_1$ by considering the following proxies, defined for $h_0,h_0'\in\mathcal H$, $0<h_0'<h_0$, by

$$
\widetilde c_1(h_0,h_0') = \frac{\mathbb E\,Y_{h_0}-\mathbb E\,Y_{h_0'}}{h_0^\alpha-(h_0')^\alpha}\simeq c_1
$$

if $h_0$ and $h_0'$ are small (but not too small). Nevertheless, as for the estimation of $V_1$ (see Practitioner's corner I in Sect. 9.5.1), we adopt a conservative strategy which naturally leads us to consider not too small $h_0$ and $h_0'$, namely, as for the ML2R estimator,

$$
h_0=\frac{\mathbf h}5 \quad\text{and}\quad h_0'=\frac{h_0}2=\frac{\mathbf h}{10}.
$$

Then, it remains to replace $\mathbb E\big[Y_{h_0}-Y_{\frac{h_0}2}\big]$ by its usual estimator to obtain the (obviously biased) estimator of the form

$$
\widehat c_1(h_0,h_0',m) = \frac{h_0^{-\alpha}}{1-2^{-\alpha}}\,\frac1m\sum_{k=1}^m\Big(Y^k_{h_0}-Y^k_{\frac{h_0}2}\Big),
$$

where $\big(Y^k_{h_0},Y^k_{\frac{h_0}2}\big)_{k\ge1}$ are i.i.d. copies of $\big(Y_{h_0},Y_{\frac{h_0}2}\big)$ and $m\ll M$ ($m$ is a priori small with respect to the size $M$ of the master simulation).
One must be aware that this opportunity to estimate $c_1$ (which has no counterpart for ML2R) is also a factor of instability for the MLMC estimators if $c_1$ is not estimated with enough accuracy. From this point of view, the MLMC estimators are less robust than the ML2R estimators.
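The pilot estimation of $c_1$ can be sketched as follows on a hypothetical toy family with a known first-order bias, $Y_h=I_0+c_1h+\text{noise}$ (our own illustration; the coupled pair shares part of its noise, mimicking an imperfect coupling).

```python
import random

def estimate_c1(sample_pair, h0, alpha, m):
    """Pilot estimator of c1: h0^(-alpha)/(1 - 2^(-alpha)) times the
    empirical mean of Y_{h0} - Y_{h0/2} over m i.i.d. coupled pairs."""
    acc = sum(a - b for a, b in (sample_pair() for _ in range(m)))
    return h0 ** (-alpha) / (1 - 2 ** (-alpha)) * acc / m

I0_true, c1_true, h0 = 3.0, 2.0, 0.2   # hypothetical toy parameters

def sample_pair():
    g = random.gauss(0.0, 0.1)         # shared noise of the coupled pair
    return (I0_true + c1_true * h0 + g + random.gauss(0.0, 0.05),
            I0_true + c1_true * h0 / 2 + g + random.gauss(0.0, 0.05))

random.seed(7)
c1_hat = estimate_c1(sample_pair, h0, alpha=1.0, m=2000)   # should be near 2
```

The shared noise is what keeps the pilot sample size $m$ small: only the residual, uncoupled noise enters the variance of $\widehat c_1$.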
 Design of confidence intervals
The conditions of application and the way to design confidence intervals for the MLMC estimators as the RMSE $\varepsilon\to0$ are the same as those described in Practitioner's corner II for weighted multilevel estimators in Sect. 9.5.1 (see Eq. (9.73)).
9.5.3 Additional Comments and Provisional Remarks

 Heuristic recommendations for Brownian diffusions. We can give a few additional recommendations on how to implement the above multilevel methods for Brownian diffusions. It is clear, in view of the results of Chaps. 7 and 8, that:
• for vanilla Lipschitz payoffs, one may apply the method with $\alpha=1$ and $\beta=1$;
• for path-dependent payoffs of lookback type (without barrier), the same holds true;
• for barrier-like payoffs, $\beta=\frac12$ is appropriate in practice, if the diffusion has polynomial moments at any order, and $\alpha=\frac12$.
But these recommendations are partly heuristic (or correspond to critical cases), especially for path-dependent functionals.
 Extension to Lévy-driven diffusions. The weak error expansion at order 1 for Lévy-driven diffusion processes discretized by an Euler scheme was established in [74] in view of the implementation of a multilevel simulation. For other results about the discretization schemes of Lévy-driven diffusion processes, we refer to classical papers like [157, 250].
 More about the strong convergence assumption $(VE)_\beta$. A careful reading of the chapter emphasizes that the assumptions

$$
(VE)_\beta \equiv \mathrm{Var}(Y_h-Y_{h'})\le V_1|h-h'|^\beta \quad\text{or}\quad (SE)_\beta \equiv \|Y_h-Y_{h'}\|_2^2\le V_1|h-h'|^\beta
$$

are the key to controlling the effort of the multilevel estimator after the optimization of the allocation vector $(q_i)_i$, see e.g. (9.57), still in the weighted multilevel section. Using the seemingly more natural $(SE)^0_\beta:\;\|Y_h-Y_0\|_2^2\le\bar V_1h^\beta$ (see [198]) yields a less efficient design of the multilevel estimators, not to speak of the fact that $V_1$ is easier to estimate in practice than $\bar V_1$. But the virtue of this choice goes beyond this calibration aspect. In fact, as emphasized in [32], assumptions $(VE)_\beta$ and $(SE)_\beta$ may be satisfied in situations where $Y_h$ does not converge in $L^2$ toward $Y_0$ because the underlying scheme itself does not converge toward the continuous-time process. This is the case for the convergence of the binomial tree (see e.g. [59]) toward the Black–Scholes model. More interestingly, the author takes advantage of this situation to show that the Euler discretization scheme of a Lévy-driven diffusion can be wienerized and still satisfy $(SE)_\beta$ for a larger $\beta$: the wienerization of the small jumps of a Lévy process consists in replacing the increments of the compensated small-enough-jump process over a time interval $[t,t+\delta t]$ by increments $c\,(B_{t+\delta t}-B_t)$, where $B$ is a standard Brownian motion and $c$ is chosen to equalize the variances ($B$ is independent of the existing Brownian and large-jump components of the Lévy process, if any). The wienerization trick was introduced in [14, 66] and makes this modified Euler scheme simulable, which is usually not the case for the standard Euler scheme if the process jumps infinitely often on each nontrivial time interval with non-summable jumps.
 Nested Monte Carlo. As for the strong convergence assumption $(SE)_\beta$ and the weak error expansion $(WE)^\alpha_R$, we refer to Practitioner's corner II, Sect. 9.5.1.
 Limiting behavior. The asymptotics of multilevel estimators, especially the CLT, have been investigated in various frameworks (Brownian diffusions, Lévy-driven diffusions, etc.) in several papers (see [34, 35, 75]). See also [112] for a general approach where both weighted and regular multilevel estimators are analyzed: a SLLN and a CLT are established for multilevel estimators as the RMSE $\varepsilon\to0$.
9.6 Antithetic Schemes (a Quest for β > 1)

9.6.1 The Antithetic Scheme for Brownian Diffusions: Definition and Results
We consider our standard Brownian diffusion SDE (7.1) with drift $b$ and diffusion coefficient $\sigma$, driven by a $q$-dimensional standard Brownian motion $W$ defined on a probability space $(\Omega,\mathcal A,\mathbb P)$, with $q\ge2$. In such a framework, we saw that the Milstein scheme cannot be simulated efficiently due to the presence of the Lévy areas induced by the rectangular terms coming out in the second-order term of the scheme. This seems to make it impossible to reach the unbiased setting "$\beta>1$", since the Euler scheme corresponds, as we saw above, to the critical case $\beta=1$ (scheme of order $\frac12$).
First we introduce a truncated Milstein scheme with step $h=\frac Tn$. We start from the definition of the discrete-time Milstein scheme as defined by (7.41). The scheme is not considered as simulable when $q\ge2$ because of the Lévy areas

$$
\int_{t^n_k}^{t^n_{k+1}}\big(W^i_s-W^i_{t^n_k}\big)\,dW^j_s
$$
for which no efficient simulation method is known when $i\ne j$. On the other hand, a simple integration by parts shows that, for every $i\ne j$ and every $k=0,\dots,n-1$,

$$
\int_{t^n_k}^{t^n_{k+1}}\big(W^i_s-W^i_{t^n_k}\big)\,dW^j_s + \int_{t^n_k}^{t^n_{k+1}}\big(W^j_s-W^j_{t^n_k}\big)\,dW^i_s = \big(W^i_{t^n_{k+1}}-W^i_{t^n_k}\big)\big(W^j_{t^n_{k+1}}-W^j_{t^n_k}\big),
$$

whereas

$$
\int_{t^n_k}^{t^n_{k+1}}\big(W^i_s-W^i_{t^n_k}\big)\,dW^i_s = \frac12\Big(\big(W^i_{t^n_{k+1}}-W^i_{t^n_k}\big)^2-T/n\Big),\quad i=1,\dots,q.
$$
The idea is to symmetrize the "rectangular" Lévy areas, i.e. to replace each of them by half of its sum with its symmetric counterpart, namely $\frac12\big(W^i_{t^n_{k+1}}-W^i_{t^n_k}\big)\big(W^j_{t^n_{k+1}}-W^j_{t^n_k}\big)$. This yields the truncated Milstein scheme defined as follows:

$$
\breve X^n_0=X_0,\qquad
\breve X^n_{t^n_{k+1}} = \breve X^n_{t^n_k}+h\,b\big(t^n_k,\breve X^n_{t^n_k}\big)+\sigma\big(t^n_k,\breve X^n_{t^n_k}\big)\Delta W_{t^n_{k+1}}
+\sum_{1\le i\le j\le q}\breve\sigma_{ij}\big(\breve X^n_{t^n_k}\big)\Big(\Delta W^i_{t^n_{k+1}}\Delta W^j_{t^n_{k+1}}-\mathbf 1_{\{i=j\}}h\Big), \tag{9.96}
$$

where $t^n_k=\frac{kT}n$, $\Delta W_{t^n_{k+1}}=W_{t^n_{k+1}}-W_{t^n_k}$, $k=0,\dots,n-1$, and

$$
\breve\sigma_{ij}(x)=\frac12\big(\partial\sigma_{.i}\sigma_{.j}(x)+\partial\sigma_{.j}\sigma_{.i}(x)\big),\ 1\le i<j\le q,\qquad
\breve\sigma_{ii}(x)=\frac12\,\partial\sigma_{.i}\sigma_{.i}(x),\ 1\le i\le q
$$

(where $\partial\sigma_{.i}\sigma_{.j}$ is defined by (7.42)).
This scheme can clearly be simulated, but the analysis of its strong convergence rate, e.g. in $L^2$ when $X_0\in L^2$, shows a behavior quite similar to the Euler scheme, i.e.

$$
\Big\|\max_{k=0,\dots,n}\big|\breve X^n_{t^n_k}-X_{t^n_k}\big|\Big\|_2 \le C_{b,\sigma,T}\big(1+\|X_0\|_2\big)\sqrt{\frac Tn}
$$

if $b$ and $\sigma$ are $\mathcal C^1_{\mathrm{Lip}}$. In terms of weak error, it also behaves like the Euler scheme: under similar smoothness assumptions on $b$, $\sigma$ and $f$, or ellipticity of $\sigma$, it satisfies a first-order expansion in $h=\frac Tn$: $\mathbb E\,f(X_T)-\mathbb E\,f\big(\breve X^n_T\big)=c_1h+O(h^2)$, i.e. $(WE)^1_1$ (see [110]). Under additional assumptions, it is expected that a higher-order weak error expansion $(WE)^1_R$ holds true.
Rather than trying to search for a higher-order simulable scheme, the idea introduced in [110] is to combine several such truncated Milstein schemes at different scales in order to make the fine level $i$ behave as if the scheme satisfied $(SE)_\beta$ for a $\beta>1$. To achieve this, the fine scheme of the level is duplicated into two fine schemes with the same step but based on swapped Brownian increments. Let us be more specific: the above scheme with step $h=\frac Tn$ can be described as a homogeneous Markov chain associated to a mapping $\breve M=\breve M_{b,\sigma}:\mathbb R^d\times\mathcal H\times\mathbb R^q\to\mathbb R^d$ as follows:

$$
\breve X^n_{t^n_{k+1}} = \breve M\big(\breve X^n_{t^n_k},h,\Delta W_{t^n_{k+1}}\big),\quad k=0,\dots,n-1,\qquad \breve X^n_0=X_0.
$$

We consider a first scheme $\big(\breve X^{2n,[1]}_{t^{2n}_k}\big)_{k=0,\dots,2n}$ with step $\frac h2=\frac T{2n}$, which reads over two time steps
$$
\breve X^{2n,[1]}_{t^{2n}_{2k+1}} = \breve M\Big(\breve X^{2n,[1]}_{t^{2n}_{2k}},\frac h2,\Delta W_{t^{2n}_{2k+1}}\Big),\qquad
\breve X^{2n,[1]}_{t^{2n}_{2(k+1)}} = \breve M\Big(\breve X^{2n,[1]}_{t^{2n}_{2k+1}},\frac h2,\Delta W_{t^{2n}_{2(k+1)}}\Big),\quad k=0,\dots,n-1.
$$

We consider a second scheme with step $\frac T{2n}$, denoted by $\big(\breve X^{2n,[2]}_{t^{2n}_k}\big)_{k=0,\dots,2n}$, identical to the first one except that the Brownian increments are pairwise swapped, i.e. the increments $\Delta W_{t^{2n}_{2k+1}}$ and $\Delta W_{t^{2n}_{2(k+1)}}$ are swapped. It reads

$$
\breve X^{2n,[2]}_{t^{2n}_{2k+1}} = \breve M\Big(\breve X^{2n,[2]}_{t^{2n}_{2k}},\frac h2,\Delta W_{t^{2n}_{2(k+1)}}\Big),\qquad
\breve X^{2n,[2]}_{t^{2n}_{2(k+1)}} = \breve M\Big(\breve X^{2n,[2]}_{t^{2n}_{2k+1}},\frac h2,\Delta W_{t^{2n}_{2k+1}}\Big),\quad k=0,\dots,n-1.
$$
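The pairwise swap is easy to experiment with. The sketch below (our own illustration, with a hypothetical toy 2-d SDE $dX^1=\mu\,dt+dW^1$, $dX^2=\sigma X^1\,dW^2$, whose only nonzero symmetrized coefficient in (9.96) is $\breve\sigma_{12}=(0,\sigma/2)$) compares the regular fine-coarse gap with the antithetic one.

```python
import math
import random

MU, SIG = 1.0, 1.0   # hypothetical toy parameters

def step(x, s, d1, d2):
    """One step of the truncated Milstein map for the toy SDE."""
    x1, x2 = x
    return (x1 + MU * s + d1, x2 + SIG * x1 * d2 + 0.5 * SIG * d1 * d2)

def coarse_and_antithetic_fine(n, T=1.0):
    """Coarse scheme with step T/n and the two fine schemes with step T/(2n),
    built from the same Brownian increments, pairwise swapped for scheme [2]."""
    h = T / n
    xc = f1 = f2 = (0.0, 0.0)
    for _ in range(n):
        sd = math.sqrt(h / 2)
        d1a, d2a = random.gauss(0, sd), random.gauss(0, sd)
        d1b, d2b = random.gauss(0, sd), random.gauss(0, sd)
        f1 = step(step(f1, h / 2, d1a, d2a), h / 2, d1b, d2b)   # fine [1]
        f2 = step(step(f2, h / 2, d1b, d2b), h / 2, d1a, d2a)   # fine [2], swapped
        xc = step(xc, h, d1a + d1b, d2a + d2b)                  # coarse
    return xc, f1, f2

random.seed(1)
plain, anti = [], []
for _ in range(4000):
    xc, f1, f2 = coarse_and_antithetic_fine(8)
    plain.append(f1[1] - xc[1])                 # regular fine-coarse gap
    anti.append(0.5 * (f1[1] + f2[1]) - xc[1])  # antithetic gap
mse_plain = sum(d * d for d in plain) / len(plain)
mse_anti = sum(d * d for d in anti) / len(anti)
```

The swap cancels the Lévy-area contribution at first order, so `mse_anti` comes out markedly smaller than `mse_plain`, which is the whole point of the construction.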
The following theorem, established in [110] in the autonomous case ($b(t,x)=b(x)$, $\sigma(t,x)=\sigma(x)$), makes precise the fact that this produces a meta-scheme satisfying $(SE)_\beta$ with $\beta=2$ (and a root $N=2$) in a multilevel scheme.
Theorem 9.4 Assume $b\in\mathcal C^2(\mathbb R^d,\mathbb R^d)$, $\sigma\in\mathcal C^2(\mathbb R^d,\mathbb M(d,q,\mathbb R))$ with bounded existing partial derivatives of $b$, $\sigma$, $\breve\sigma$ (up to order 2).
(a) Smooth payoff. Let $f\in\mathcal C^2(\mathbb R^d,\mathbb R)$, also with bounded existing partial derivatives (up to order 2). Then, there exists a real constant $C_{b,\sigma,T}>0$ such that

$$
\Big\|f\big(\breve X^n_T\big)-\frac12\Big(f\big(\breve X^{2n,[1]}_T\big)+f\big(\breve X^{2n,[2]}_T\big)\Big)\Big\|_2 \le C_{b,\sigma,T}\big(1+\|X_0\|_2\big)\frac Tn, \tag{9.97}
$$

i.e. $(SE)_\beta$ is satisfied with $\beta=2$.
(b) Almost smooth payoff. Assume that $f$ is Lipschitz continuous on $\mathbb R^d$, that its first two order partial derivatives exist outside a Lebesgue-negligible set $N_0$ of $\mathbb R^d$ and are uniformly bounded. Assume moreover that the diffusion $(X_t)_{t\in[0,T]}$ satisfies

$$
\lim_{\varepsilon\to0}\varepsilon^{-1}\,\mathbb P\Big(\inf_{z\in N_0}|X_T-z|\le\varepsilon\Big)<+\infty. \tag{9.98}
$$

Then

$$
\Big\|f\big(\breve X^n_T\big)-\frac12\Big(f\big(\breve X^{2n,[1]}_T\big)+f\big(\breve X^{2n,[2]}_T\big)\Big)\Big\|_2 \le C_{b,\sigma,T,\eta}\big(1+\|X_0\|_2\big)\Big(\frac Tn\Big)^{\frac34-\eta} \tag{9.99}
$$

for every $\eta\in\big(0,\frac34\big)$, i.e. $(SE)_\beta$ is satisfied for every $\beta\in\big(0,\frac32\big)$.

For detailed proofs, we refer to Theorems 4.10 and 5.2 in [110]. At least in a one-dimensional framework, the above condition (9.98) is satisfied if the distribution of $X_T$ has a bounded density.
ℵ Practitioner's corner: Application to multilevel estimators with root $N=2$
The principle is rather simple at this stage. Let us deal with the simplest case of a vanilla payoff (or 1-marginal) $Y_0=f(X_T)$, and assume that $f$ is such that $(SE)_\beta$ is satisfied for some $\beta\in(1,2]$.
 Coarse level. On the coarse (first) level $i=1$, we simply set $Y^{(1)}_h=f\big(\breve X^{n,(1)}_T\big)$, where the truncated Milstein scheme is driven by a $q$-dimensional Brownian motion $W^{(1)}$. Keep in mind that this scheme is of order $\frac12$, so that coarse and fine levels will not be ruled by the same $\beta$.
 Fine levels. At each fine level $i\ge2$, the regular difference $Y^{(i)}_{\frac h{n_i}}-Y^{(i)}_{\frac h{n_{i-1}}}$ (with $n_i=2^{i-1}$ and $h\in\mathcal H$) is replaced by

$$
\frac12\Big(Y^{[1],(i)}_{\frac h{n_i}}+Y^{[2],(i)}_{\frac h{n_i}}\Big)-Y^{(i)}_{\frac h{n_{i-1}}},
$$

where $Y^{(i)}_{\frac h{n_{i-1}}}=f\big(\breve X^{nN^{i-2},(i)}_T\big)$ and $Y^{[r],(i)}_{\frac h{n_i}}=f\big(\breve X^{nN^{i-1},[r],(i)}_T\big)$, $r=1,2$. At each level, the three truncated antithetic Milstein schemes are driven by the same $q$-dimensional Brownian motion $W^{(i)}$, $i=1,\dots,R$ (or $L$, depending on the implemented type of multilevel meta-scheme). Owing to Theorem 9.4, it is clear that, under appropriate assumptions,

$$
\Big\|\frac12\Big(Y^{[1],(i)}_{\frac h{n_i}}+Y^{[2],(i)}_{\frac h{n_i}}\Big)-Y^{(i)}_{\frac h{n_{i-1}}}\Big\|_2^2 \le V_1\Big(\frac h{n_i}\Big)^{\beta} \quad\text{with}\quad \beta=\frac32-\eta\ (\text{for arbitrarily small }\eta>0)\ \text{ or }\ \beta=2.
$$
As a second step, the optimization of the simulation parameters can be carried out following the lines of the case $\alpha=1$ and $\beta=\frac32$ or $\beta=2$. However, the first level is ruled by a strong rate $\beta=1$, which slightly modifies the definition of the sample allocations $q_i$ in Table 9.1 for the weighted multilevel estimator. Moreover, the complexity of fine level $i$ is now of the form $\bar\kappa(2n_i+n_{i-1})/h$ instead of $\bar\kappa(n_i+n_{i-1})/h$, so that $\sqrt{n_i+n_{i-1}}$ should be replaced mutatis mutandis by $\sqrt{2n_i+n_{i-1}}$ in the same table. Table 9.2 for the regular multilevel estimator should of course be modified accordingly.
 Parameter modifications for the antithetic estimator:
1. New value for $V_0$: now $V_0=\mathbb E\,Y_h=\mathbb E\,f\big(\breve X^n_T\big)$ (where $h=h(\varepsilon)$ has been optimized, of course).
2. Updated parameters of the estimator: see Table 9.3.
9.6.2 Antithetic Scheme for Nested Monte Carlo (Smooth Case)

For the sake of simplicity, we assume in this section that the root is $N=2$, but what follows can be extended to any integer $N\ge2$ (see e.g. [113]). In a regular multilevel nested Monte Carlo simulation, the fine level $i\ge2$ (with or without weight) relies

Table 9.3 Parameters of the ML2R estimator (standard antithetic diffusion framework)
N 2 (in our described setting)
⎡ 9 ⎤
: 2  √1+4α 
1
: 1 1
⎢ 1 log(
c α h)
; log(
c α h) log ε ⎥
R = R ∗ (ε) ⎢ + ∞
+ + ∞
+2 ⎥
⎢2 log(N ) 2 log(N ) α log(N ) ⎥
⎢ ⎥

h = h ∗ (ε) h
n n i = N i−1 , i = 1, . . . , R (convention: n 0 = n −1 0 := 0)
β 7 7 −1  2
β
(R)
1 θh 2 7W j (N )7 n −1 j−1 − n j
q1 (ε) = † , q j (ε) = , j = 2, . . . , R,
qε qε† n j−1 + 2n j
q = q ∗ (ε) 8
R
V1

with qε s.t. q j (ε) = 1, θ = and V̆h = Var(Y̆h )
j=1
V̆h
⎡ ⎛ ⎞⎤
β
R
7 7  β
⎢ V̆ q † ⎝1 + θh 2 7W(R) (N )7 n −1 − n −1 2 n j−1 + 2 n j ⎠ ⎥
⎢  h ε j j−1 j ⎥
⎢ 1 j=2 ⎥
M = M ∗ (ε) ⎢ 1+ ⎥
⎢ ε 2 ⎥
⎢ 2αR ⎥
⎢ ⎥
⎢ ⎥
⎢ ⎥

on quantities of the form

$$
Y_{\frac h2}-Y_h = f\Bigg(\frac1{2K}\sum_{k=1}^{2K}F(X,Z_k)\Bigg)-f\Bigg(\frac1K\sum_{k=1}^KF(X,Z_k)\Bigg),
$$

where $h=\frac1K=h_i=\frac1{2^{i-1}K_0}\in\mathcal H$. If $f$ is Lipschitz continuous, we know (see Proposition 9.2(a)) that $\big\|Y_h-Y_{\frac h2}\big\|_2^2\le[f]^2_{\mathrm{Lip}}\,\mathrm{Var}\big(F(X,Z_1)\big)\,\frac h2$, so that $(SE)_\beta$ and, subsequently, $(VE)_\beta$ are satisfied with $\beta=1$.
The antithetic version of the fine level $Y_{\frac h2}-Y_h$ that we present below seems to have been introduced independently in [55, 61, 140]; see also [109, 114]. It consists in replacing the estimator $Y_h$, in a smart enough way, by a random variable $\widetilde Y_h$ with the same expectation as $Y_h$, designed from the same $X$ and the same innovations $Z_k$, but closer to $Y_{\frac h2}$ in $L^2(\mathbb P)$. This leads us to set

$$
\widetilde Y_h = \frac12\Bigg[f\Bigg(\frac1K\sum_{k=1}^KF(X,Z_k)\Bigg)+f\Bigg(\frac1K\sum_{k=K+1}^{2K}F(X,Z_k)\Bigg)\Bigg]. \tag{9.100}
$$

It is obvious that $\mathbb E\,\widetilde Y_h=\mathbb E\,Y_h$, $h\in\mathcal H$. Now we can substitute, in the fine levels of the multilevel estimators (with or without weights), the quantities $Y_{\frac h2}-\widetilde Y_h$ in place of the quantities of the form $Y_{\frac h2}-Y_h$ without modifying their bias.
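The gain brought by (9.100) is easy to observe numerically. The test case below is a hypothetical illustration of ours: $f(y)=y^2$ (so $\rho=1$) and $F(X,Z)=X+Z$ with Gaussian outer and inner noise.

```python
import random

def nested_triplet(K, f, z):
    """One coupled draw of (Y_h, Y_{h/2}, Y_tilde_h), h = 1/K, from a list z
    of 2K inner samples F(X, Z_k): Y_{h/2} averages all 2K samples, Y_h the
    first K, and the antithetic Y_tilde_h of (9.100) averages f over the
    two half-blocks."""
    m1 = sum(z[:K]) / K
    m2 = sum(z[K:]) / K
    return f(m1), f(0.5 * (m1 + m2)), 0.5 * (f(m1) + f(m2))

random.seed(3)
f = lambda y: y * y        # smooth payoff, rho = 1
K = 16
plain, anti = [], []
for _ in range(4000):
    x = random.gauss(0.0, 1.0)                             # outer scenario X
    z = [x + random.gauss(0.0, 1.0) for _ in range(2 * K)] # F(X, Z_k) samples
    y_h, y_half, y_tilde = nested_triplet(K, f, z)
    plain.append(y_half - y_h)
    anti.append(y_half - y_tilde)
v_plain = sum(d * d for d in plain) / len(plain)
v_anti = sum(d * d for d in anti) / len(anti)
```

Here `v_plain` scales like $h$ while `v_anti` scales like $h^{1+\rho}=h^2$: the antithetic fine level upgrades $\beta$ from $1$ to $1+\rho$, as Proposition 9.3 below states.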
Proposition 9.3 Assume that $f$ is differentiable with a $\rho$-Hölder derivative $f'$, $\rho\in(0,1]$, and that $F(X,Z_1)\in L^{2(1+\rho)}(\mathbb P)$. Then, there exists a real constant $V_1=V_1(f,\rho,F,X,Z)$ such that

$$
\forall\,h\in\mathcal H,\qquad \big\|\widetilde Y_h-Y_{\frac h2}\big\|_2^2\le V_1h^{1+\rho}. \tag{9.101}
$$

As a consequence, weighted and unweighted multilevel estimators (where $Y_h$ is replaced by $\widetilde Y_h$) can be implemented with $\beta=1+\rho>1$.
Proof. One derives from the elementary inequality

$$
\Big|f\Big(\frac{a+b}2\Big)-\frac12\big(f(a)+f(b)\big)\Big| \le [f']_{\rho,\mathrm{H\ddot ol}}\,\frac{|b-a|^{1+\rho}}{2^\rho}
$$

that

$$
\big\|\widetilde Y_h-Y_{\frac h2}\big\|_2^2 \le \frac{[f']^2_{\rho,\mathrm{H\ddot ol}}}{4^\rho}\,\mathbb E\,\Bigg|\frac1K\sum_{k=1}^KF(X,Z_k)-\frac1K\sum_{k=K+1}^{2K}F(X,Z_k)\Bigg|^{2(1+\rho)}
= \frac{[f']^2_{\rho,\mathrm{H\ddot ol}}}{4^\rho K^{2(1+\rho)}}\,\mathbb E\,\Bigg|\sum_{k=1}^K\big(F(X,Z_k)-F(X,Z_{K+k})\big)\Bigg|^{2(1+\rho)}
$$

$$
\le \frac{[f']^2_{\rho,\mathrm{H\ddot ol}}\,C_{2(1+\rho)}}{4^\rho K^{2(1+\rho)}}\,\mathbb E\Bigg(\sum_{k=1}^K\big(F(X,Z_k)-F(X,Z_{K+k})\big)^2\Bigg)^{1+\rho}
= \frac{[f']^2_{\rho,\mathrm{H\ddot ol}}\,C_{2(1+\rho)}}{4^\rho K^{2(1+\rho)}}\,\Bigg\|\sum_{k=1}^K\big(F(X,Z_k)-F(X,Z_{K+k})\big)^2\Bigg\|_{1+\rho}^{1+\rho}
$$

owing to the discrete-time B.D.G. inequality (see (6.49)) applied to the sequence $\big(F(X,Z_k)-F(X,Z_{K+k})\big)_{k\ge1}$ of $\sigma(X,Z_1,\dots,Z_n)$-adapted martingale increments. Then, Minkowski's inequality yields

$$
\big\|\widetilde Y_h-Y_{\frac h2}\big\|_2^2 \le \frac{[f']^2_{\rho,\mathrm{H\ddot ol}}\,C_{2(1+\rho)}}{4^\rho K^{2(1+\rho)}}\,K^{1+\rho}\,\big\|F(X,Z_1)-F(X,Z_{K+1})\big\|_{2(1+\rho)}^{2(1+\rho)} = \frac{V_1}{K^{1+\rho}}.
$$

This completes the proof. ♦
 Exercises. 1. Extend the above antithetic multilevel nested Monte Carlo method to locally $\rho$-Hölder continuous functions $f:\mathbb R^d\to\mathbb R$ satisfying

$$
\forall\,x,y\in\mathbb R^d,\qquad \big|f(x)-f(y)\big|\le[f]_{\rho,loc}\,|x-y|^\rho\big(1+|x|^r+|y|^r\big).
$$

2. Update Table 9.3 to take into account the specific complexity of the antithetic nested Monte Carlo method.
3. Extend the antithetic approach to any root $N\ge2$ by replacing $Y_h$ in the generic expression $Y_{\frac hN}-Y_h$ of a fine level with root $N$ by

$$
\widetilde Y_h = \frac1N\sum_{n=1}^Nf\Bigg(\frac1K\sum_{k=1}^KF\big(X,Z_{(n-1)K+k}\big)\Bigg).
$$
Remark (on convexity). If furthermore $f$ is convex, it follows from the preceding that

$$
\mathbb E\,Y_h = \mathbb E\,\widetilde Y_h \ge \mathbb E\,Y_{\frac hN}.
$$

We know that $Y_h\to Y_0=f\big(\mathbb E\big[F(X,Z_1)\,|\,X\big]\big)$ in $L^2(\mathbb P)$ as $h\to0$ in $\mathcal H$. It follows that, for every fixed $h\in\mathcal H$,

$$
\mathbb E\,Y_h \ge \mathbb E\,Y_{\frac h{N^k}} \to \mathbb E\,Y_0 \quad\text{as }k\to+\infty.
$$

Consequently, for the regular MLMC estimator (with $h=h^*(\varepsilon)$, optimized bias parameter), one has (see (9.81)):

$$
\mathbb E\,\widehat I^{\,MLMC}_M = \mathbb E\,Y_{\frac h{N^{L-1}}} \ge \mathbb E\,Y_0 = I_0,
$$

i.e. the regular MLMC estimator is upper-biased. This allows us to produce asymmetrical, narrower confidence intervals (see Practitioner's corner in Sect. 9.5.2), since the bias $\mathbb E\,\widehat I^{\,MLMC}_M-I_0$ is always non-negative.
9.7 Examples of Simulation

9.7.1 The Clark–Cameron System

We consider the 2-dimensional SDE (in its integral form)

$$
X^1_t = \mu t+W^1_t,\qquad X^2_t = \sigma\int_0^tX^1_s\,dW^2_s,\quad t\in[0,T], \tag{9.102}
$$
where $W=(W^1,W^2)$ is a 2-dimensional standard Brownian motion. This SDE, known as the Clark–Cameron oscillator, is an example of a diffusion for which the strong convergence rate of the Euler scheme with step $\frac Tn$ is exactly $O\big(\sqrt{\frac Tn}\big)$. It means that the choice $\beta=1$ is optimal. For this reason, it is part of the numerical experiments carried out in [110] on weighted and unweighted multilevel estimators (see also [6], where other experiments are carried out with this model).

Fig. 9.1 Clark–Cameron oscillator: CPU-time ratio w.r.t. CPU-time of ML2R (in seconds) as a function of the prescribed RMSE ε (with D. Giorgi)

We want to compute

$$
P(\mu,\sigma,T) = 10\,\mathbb E\cos\big(X^2_T\big)
$$
for the following values of the parameters: $\mu=1$, $\sigma=1$, $T=1$. Actually, a closed form exists for $\mathbb E\cos\big(X^2_T\big)$ for any admissible values of $\mu$, $\sigma>0$ and $T>0$, namely

$$
\mathbb E\cos\big(X^2_T\big) = \frac{e^{-\frac{\mu^2T}2\big(1-\frac{\tanh(\sigma T)}{\sigma T}\big)}}{\sqrt{\cosh(\sigma T)}}. \tag{9.103}
$$

This yields the reference value

$$
P(1,1,1)\simeq7.14556.
$$

We defer the proof of this identity (also used in [113] as a reference value) to Sect. 12.11 in the Miscellany chapter.
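The reference value is a one-line computation from (9.103):

```python
import math

def clark_cameron_reference(mu, sigma, T):
    """Reference value P(mu, sigma, T) = 10 E[cos(X_T^2)], where X^2 is the
    second coordinate of the Clark-Cameron system, via the closed form (9.103)."""
    expo = math.exp(-mu ** 2 * T / 2 * (1 - math.tanh(sigma * T) / (sigma * T)))
    return 10.0 * expo / math.sqrt(math.cosh(sigma * T))

P_ref = clark_cameron_reference(1.0, 1.0, 1.0)   # about 7.14556
```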
We implemented the following schemes with their correspondences in terms of characteristic exponents $(\alpha,\beta)$:
– Crude Euler scheme: $\alpha=\beta=1$ ($\widehat I^{\,EU}_{CRUDE}$),
– MLMC with Euler scheme: $\alpha=\beta=1$ ($\widehat I^{\,EU,MLMC}_M$),
– ML2R with Euler scheme: $\alpha=\beta=1$ ($\widehat I^{\,EU,ML2R}_M$),
– ML2R with the Giles–Szpruch antithetic Milstein scheme: $\alpha=\beta=2$ ($\widehat I^{\,GS,ML2R}_M$).
We reproduce in Fig. 9.1 the graph $\varepsilon\longmapsto$ CPU-time $\times\,\varepsilon^2$, where $\varepsilon$ denotes the theoretical (prescribed) RMSE (log-scale).
This graphic highlights that multilevel estimators make it possible to achieve accuracies which are simply out of reach with regular Monte Carlo estimators.
9.7.2 Option Pricing

In this paragraph we first propose a graphical comparison between crude Monte Carlo and multilevel methods, then an internal comparison between weighted and unweighted multilevel estimators, based on two types of options: a vanilla option (a Call) and a barrier option, both in a Black–Scholes model. We recall that the Black–Scholes dynamics of a traded risky asset reads:

$$
dX_t = X_t\big(r\,dt+\sigma\,dW_t\big),\qquad X_0=x_0. \tag{9.104}
$$

This model can be simulated in an exact way at a fixed time $t\in[0,T]$ since the above SDE has the closed-form solution $X_t=x_0\,e^{(r-\frac{\sigma^2}2)t+\sigma W_t}$.
All the simulations of this section have been performed by processing a C++ script on a MacBook Pro (2.7 GHz Intel Core i5).
Vanilla Call ($\alpha=\beta=1$)
The (discounted) vanilla Call payoff is given by $\varphi(X_T)=e^{-rT}\big(X_T-K\big)_+$. As for the model parameters, we set:

$$
x_0=50,\quad r=0.5,\quad \sigma=0.4,\quad T=1,\quad K=40.
$$

Note that, in order to produce a significant bias in the implemented discretization schemes, the interest rate has purposefully been set at the unrealistic value of 50%. The premium of this option (see Sect. 12.2 in the Miscellany chapter) has a closed form given by the Black–Scholes formula, namely

$$
I_0 = \mathrm{Call}_0 \simeq 25.9308.
$$
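This reference value is quickly checked with the standard Black–Scholes formula (via the error function for the Gaussian cdf):

```python
import math

def bs_call(x0, K, r, sigma, T):
    """Black-Scholes call premium, used here as the reference value I_0."""
    sq = sigma * math.sqrt(T)
    d1 = (math.log(x0 / K) + (r + sigma ** 2 / 2) * T) / sq
    d2 = d1 - sq
    Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return x0 * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

I0_ref = bs_call(50.0, 40.0, 0.5, 0.4, 1.0)   # about 25.9308
```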

– The selected time discretization schemes are the Euler scheme (see (7.6)), for which β = 1, and the Milstein scheme (see (7.39)), for which β = 2. Both are simulable since we are in one dimension, and they share the weak error exponent α = 1.
– The constant c∞ is set at 1 in the ML2R estimator and c1 is roughly estimated
in a preliminary phase to design the MLMC estimator (see Practitioner’s corner of
Sect. 9.5.2).
Though of no practical interest, one must have in mind that the Black–Scholes
SDE is quite a demanding benchmark to test the efficiency of discretization schemes
because both its drift and diffusion coefficients go to infinity as fast as possible under
450 9 Biased Monte Carlo Simulation, Multilevel Paradigm
a linear growth control assumption. This is all the more true given the high interest
rate.
All estimators are designed following the specifications detailed in the Practi-
tioner’s corners sections of this chapter (including the crude unbiased Monte Carlo
simulation based on the exact simulation of X T ). As for (weighted and unweighted)
multilevel estimators, we set h = 1 and estimate the “payoff-dependent” parameters
following the recommendations made in Practitioner's corner I (see (9.70) and (9.71) with h_0 = h/5 and h'_0 = h_0/2):
Euler scheme: Var(Y_h) ≈ 137.289, V_1 ≈ 40.2669 and θ ≈ 0.5416.
Milstein scheme: Var(Y_h) ≈ 157.536, V_1 ≈ 1130.899 and θ ≈ 2.6793.
The parameters of the multilevel estimators have been settled following the values in
Tables 9.1 and 9.2. The crude Monte Carlo simulation has been implemented using
the parameter values given in Sect. 9.3.
We present below tables of numerical results for crude Monte Carlo, MLMC and ML2R estimators for both Euler and Milstein schemes. Crude Monte Carlo has been run for target RMSE ε = 2^{−k}, k = 1, . . . , 6, and multilevel estimators for ε = 2^{−k}, k = 1, . . . , 8. We observe globally that our estimators fulfill the prescribed RMSE, so that it is relevant to compare the CPU computation times.
Crude versus Multilevel estimators: See Tables 9.4, 9.5, 9.6, 9.7, 9.8 and 9.9. We observe at the 2^{−6} level for the Euler scheme that MLMC is 45 times faster than crude MC and ML2R is 130 times faster than crude MC. As for the Milstein scheme these ratios attain 166 and 255, respectively. This illustrates in a striking way that multilevel methods simply make possible simulations that would be unreasonable to undertake otherwise (in fact, that is why we limit ourselves to this lower range of RMSE).
ML2R versus MLMC estimators: See Tables 9.6, 9.7, 9.8 and 9.9. This time we make the comparison at an RMSE level of ε = 2^{−8}. As for the Euler scheme we observe that ML2R is 4.62 times faster than the MLMC estimator whereas, with the Milstein scheme, this ratio is still 1.58. Note that, being in a setting where β = 2 > 1, both estimators are expected to behave like unbiased estimators, which means that the constant C in the asymptotic rate Cε^{−2} of the complexity seemingly remains lower for ML2R than for MLMC.
In the following figures, the performances (ordinates) of the estimators are depicted against the empirical RMSE ε̃ (abscissas). The label above each point of the graph indicates the target RMSE ε. The empirical RMSE, based on the bias-variance decomposition, has been computed a posteriori by performing 250 independent trials for each estimator. The performances of the estimators are measured and compared in various ways, mixing CPU time and ε̃. Details are provided for each figure.
Table 9.4 MC Euler
ls Epsilon CPU Time (s) Emp. RMSE
1 0.50000 0.00339 0.50518
2 0.25000 0.02495 0.28324
3 0.12500 0.19411 0.14460
4 0.06250 1.50731 0.06995
5 0.03125 11.81181 0.03611
6 0.01562 93.40342 0.01733
Table 9.5 MC Milstein
ls Epsilon CPU Time (s) Emp. RMSE
1 0.50000 0.00454 0.46376
2 0.25000 0.03428 0.25770
3 0.12500 0.26948 0.12421
4 0.06250 2.12549 0.06176
5 0.03125 16.92783 0.03180
6 0.01562 135.14195 0.01561
Table 9.6 MLMC Euler
ls Epsilon CPU Time (s) Emp. RMSE
1 0.50000 0.00079 0.55250
2 0.25000 0.00361 0.27192
3 0.12500 0.01729 0.15152
4 0.06250 0.08959 0.06325
5 0.03125 0.41110 0.03448
6 0.01562 2.05944 0.01644
7 0.00781 13.08539 0.00745
8 0.00391 60.58989 0.00351
Table 9.7 MLMC Milstein
ls Epsilon CPU Time (s) Emp. RMSE
1 0.50000 0.00079 0.42039
2 0.25000 0.00296 0.19586
3 0.12500 0.01193 0.10496
4 0.06250 0.04888 0.05278
5 0.03125 0.19987 0.02871
6 0.01562 0.81320 0.01296
7 0.00781 3.29153 0.00721
8 0.00391 13.25764 0.00331
Table 9.8 ML2R Euler
ls Epsilon CPU Time (s) Emp. RMSE
1 0.50000 0.00048 0.61429
2 0.25000 0.00183 0.31070
3 0.12500 0.00798 0.12511
4 0.06250 0.03559 0.06534
5 0.03125 0.16060 0.03504
6 0.01562 0.71820 0.01680
7 0.00781 3.17699 0.00855
8 0.00391 13.09290 0.00363
Table 9.9 ML2R Milstein
ls Epsilon CPU Time (s) Emp. RMSE
1 0.50000 0.00070 0.36217
2 0.25000 0.00231 0.20908
3 0.12500 0.00819 0.10206
4 0.06250 0.03171 0.05154
5 0.03125 0.13266 0.02421
6 0.01562 0.52949 0.01269
7 0.00781 2.11587 0.00696
8 0.00391 8.45961 0.00332
The graphics are produced with the following conventions:
– Each method has its distinctive label: × for MLMC, + for ML2R, ◦ for crude MC
on discretization schemes and %-dashed line for exact simulation (when possible).
– Euler scheme-based estimators are depicted with solid lines whereas Milstein
scheme-based estimators are depicted with dashed lines and the exact simulation
with a dotted line.
It is important to keep in mind that an estimator satisfies the prescribed RMSE ε = 2^{−ℓ} if its labels are aligned vertically above the corresponding abscissa 2^{−ℓ}.
In Fig. 9.2 the performance measure is simply the average CPU time of one esti-
mator (in seconds) for each of the five tested estimators, depicted in log-log scale.
These graphics confirm that both multilevel estimators satisfy the assigned RMSE constraint ε in the sense that the empirical RMSE ε̃ ≤ ε.
It also stresses the tremendous gain provided by multilevel estimators (note that when the absolute accuracy is 2^{−8}, it corresponds to a relative accuracy of 0.015%).
Figure 9.3 is based on the same simulations as Fig. 9.2. The performances are now measured through the empirical MSE × CPU time = ε̃² × CPU time. Figure 9.3 highlights, in the critical case β = 1, that the computational cost of MLMC estimators increases faster, by a log(1/ε) factor, than that of ML2R multilevel estimators.
Fig. 9.2 Call option pricing: CPU-time as a function of the empirical RMSE ε̃. Top: log-regular scale; Bottom: log-log scale (with D. Giorgi and V. Lemaire)
Fig. 9.3 Call option pricing by Euler and Milstein schemes: MSE × CPU-time as a function of the empirical RMSE ε̃, log-log scale. (With D. Giorgi and V. Lemaire.)
Figure 9.3 illustrates for the Milstein scheme the “virtually” unbiased setting (β =
2) obtained here with the Multilevel Milstein scheme. We verify that multilevel curves
remain approximately flat but lie significantly above the true unbiased simulation
which, as expected, remains unrivaled. The weighted multilevel seems to be still
faster in practice.
For more simulations involving path-dependent options, we refer to [107, 109]
for regular multilevel and to [112, 198] for a comparison between weighted and
regular Multilevel estimators on other options like Lookback options (note that the
specifications are slightly more “conservative” in these last two papers than those
given here, whereas in Giles’ papers, the estimators designed are in an adaptive
form). Now we will briefly explore a path-dependent option, namely a barrier option for which the characteristic exponent β is lower than 1, namely β = 0.5 (in (SE)^0_β), and α is most likely also equal to 0.5. We still consider a Black–Scholes dynamics
to have access to a closed form for the option premium.
Barrier option (α = β = 0.5)
We consider an Up-and-Out Call option to illustrate the case β = 0.5 < 1 and α =
0.5. This path-dependent option with strike K and barrier B > K is defined by its
functional payoff
ϕ(x) = e^{−rT}(x(T) − K)_+ 1_{{max_{t∈[0,T]} x(t) ≤ B}},  x ∈ C([0, T], R).
The parameters of the Black–Scholes model are set as follows:

x_0 = 100, r = 0, σ = 0.15, and T = 1.

With K = 100 and B = 120, the price obtained by the standard closed-form solution is

I_0 = Up-and-Out Call_0 = 1.855225.
We consider here a simple (and highly biased) approximation of max_{t∈[0,T]} X_t by max_{k∈{1,...,n}} X̄_{kh} (h = T/n), without any "help" from a Brownian bridge approximation, as investigated in Sects. 8.2.1 and 8.2.3.
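This crude, discretely monitored estimator can be sketched as follows (our own Python transcription; the grid size n and sample size M are illustration choices). The Black–Scholes increments are simulated exactly on the grid, so the only bias comes from monitoring the maximum at the dates kh:

```python
import math
import random

def up_out_call_discrete(x0, r, sigma, T, K, B, n, M, seed=1):
    """Biased estimator of the Up-and-Out Call: the running maximum of X is only
    monitored on the grid kh, h = T/n (no Brownian bridge correction), so the
    barrier is hit less often than in continuous time."""
    rng = random.Random(seed)
    h = T / n
    drift = (r - 0.5 * sigma * sigma) * h
    vol = sigma * math.sqrt(h)
    disc = math.exp(-r * T)
    acc = 0.0
    for _ in range(M):
        x, alive = x0, True
        for _ in range(n):
            x *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            if x > B:            # barrier crossed at a grid date: knocked out
                alive = False
                break
        if alive and x > K:
            acc += disc * (x - K)
    return acc / M

est = up_out_call_discrete(100.0, 0.0, 0.15, 1.0, 100.0, 120.0, n=50, M=50_000)
print(est)  # noticeably above the continuous-barrier value 1.855225
```

Since the discrete-time maximum lies pathwise below the continuous one, this estimator systematically overestimates I_0, which is precisely the strong bias the multilevel estimators have to fight here.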
By adapting the computations carried out in Lemma 9.3 with the convergence rate for the sup norm of the discrete time Euler scheme, we obtain that β = 0.5⁻ for (SE)^0_β (i.e. (SE)^0_β holds for every β ∈ (0, 1/2) but a priori not for β = 0.5, due to the log h term coming out in the quadratic strong convergence rate). The design of the multilevel estimators is adapted to (SE)^0_β for both estimators.
As for the weak error rate, the first order with α = 0.5 has been established in [17]. But assuming that R = +∞ with α = 0.5, i.e. that the weak error expansion holds at any order in that scale, stands as a pure conjecture at higher orders.
The rough estimation of payoff-dependent parameters yields, with h = 1, Var(Y_h) ≈ 30.229 and V_1 ≈ 11.789 (with h_0 = h/5, h'_0 = h_0/2), so that θ ≈ 0.6245.
Figure 9.4 highlights that, since β = 0.5, the function (CPU-time) × ε² increases much faster for MLMC than for ML2R as ε goes to 0. This agrees with the respective theoretical asymptotic rates for both estimators since the computational cost of ML2R estimators remains o(ε^{−(2+η)}) for every η > 0, even for β < 1, whereas the performance of the MLMC estimator behaves like O(ε^{−(2+(1−β)/α)}).
In fact, in this highly biased example with slow strong convergence, the ratio MLMC/ML2R in terms of CPU time, as a function of the prescribed ε = 2^{−k}, attains approximately 45 when k = 6.
Remark. The performances of the ML2R estimator on this option tend to support the conjecture that an expansion of the weak error at higher orders exists.
9.7.3 Nested Monte Carlo

As an example we will consider two settings: first, a compound option (⁴) (here a Put-on-Call option) and, as a second example, a quantile of a Call option (both in a Black–Scholes model). In the first example we will implement both regular and antithetic multilevel nested estimators.
Compound option (Put-on-Call) (α = β = 1)

⁴ More generally, a compound option is an option on (the premium of) an option.
Fig. 9.4 Barrier option in a Black–Scholes model. Top: ε̃² × CPU-time (y-axis, log-scale) as a function of ε̃ (x-axis, log₂ scale) for MLMC and ML2R estimators. Bottom: CPU-time (y-axis, log-scale) as a function of ε̃ (x-axis, log₂ scale). (With V. Lemaire.)
A compound option being simply an option on an option, its payoff involves the
value of another option. We consider here the example of a European style Put-on-
Call, still in a Black–Scholes model, with parameters (r, σ) (see (9.104) above). We
consider a Call of maturity T2 and strike K 2 . At date T1 , T1 < T2 , the holder of the
option has the right to sell it at (strike) price K 1 . The payoff of such a Put-on-Call
option reads

e^{−rT_1} ( K_1 − e^{−r(T_2−T_1)} E[(X_{T_2} − K_2)_+ | X_{T_1}] )_+ .
To comply with the nested multilevel framework (⁵), we set here H = {1/k, k ∈ k_0 N*} with k_0 = 2 and

Y_0 = f( E[(X_{T_2} − K_2)_+ | X_{T_1}] ),   Y_{1/k} = f( (1/k) Σ_{ℓ=1}^{k} (F(X_{T_1}, Z_ℓ) − K_2)_+ ),  1/k ∈ H,

where f(x) = e^{−rT_1}(K_1 − e^{−r(T_2−T_1)} x)_+, (Z_k)_{k≥1} is an i.i.d. sequence of N(0; 1)-distributed random variables and F is such that

X_{T_2} = F(X_{T_1}, Z) = X_{T_1} e^{(r−σ²/2)(T_2−T_1)+σ√(T_2−T_1) Z},  Z ∼ N(0; 1).
Note that, in these experiments, the underlying process (X t )t∈[0,T2 ] is not dis-
cretized in time since it can be simulated in an exact way. The bias error is exclusively
due to the inner Monte Carlo estimator of the conditional expectation.
The parameters of the Black–Scholes dynamics (X_t)_{t∈[0,T_2]} are

x_0 = 100, r = 0.03, σ = 0.2,
whereas those of the Put-on-Call payoff are T_1 = 1/12, T_2 = 1/2 and K_1 = 6.5, K_2 = 100.
To compute a reference price in spite of the absence of a closed form we proceed as follows: we first note that, by the (homogeneous) Markov property, the Black–Scholes formula at time T_1 reads (see Sect. 12.2)

e^{−r(T_2−T_1)} E[(X_{T_2} − K_2)_+ | X_{T_1}] = Call_BS(X_{T_1}, K_2, r, σ, T_2 − T_1)

so that Y_0 = g(X_{T_1}), where

g(x) = e^{−rT_1} ( K_1 − Call_BS(x, K_2, r, σ, T_2 − T_1) )_+ .
This can be computed by a standard unbiased Monte Carlo simulation. To reduce the variance one may introduce the control variate

Ξ = ( K_1 − e^{−r(T_2−T_1)} (X_{T_2} − K_1)_+ )_+ − E[ ( K_1 − e^{−r(T_2−T_1)} (X_{T_2} − K_1)_+ )_+ ].

This leads, with the numerical values of the current example, to the reference value E Y_0 ≈ 1.39456 and |E Y_0 − 1.39456| ≤ 2^{−15} at a 95%-confidence level.

⁵ Due to a notational conflict with the strike prices K_1, K_2, etc., we temporarily modify our standard notations for denoting inner simulations and the bias parameter set H.
 Exercise. Check the above reference value using a large enough Monte Carlo sim-
ulation with the above control variate. How would you still speed up this simulation?
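A possible sketch for this check (our own transcription, without the control variate): simulate X_{T_1} exactly and average g(X_{T_1}), the inner conditional expectation being given in closed form by the Black–Scholes formula, so the outer Monte Carlo estimator is unbiased:

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_bs(x, K, r, sigma, tau):
    """Black–Scholes Call premium with maturity tau."""
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    return x * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d1 - sigma * math.sqrt(tau))

x0, r, sigma = 100.0, 0.03, 0.2
T1, T2, K1, K2 = 1.0 / 12.0, 0.5, 6.5, 100.0

rng = random.Random(7)
M = 200_000
acc = 0.0
for _ in range(M):
    # exact draw of X_{T1}
    x = x0 * math.exp((r - 0.5 * sigma ** 2) * T1 + sigma * math.sqrt(T1) * rng.gauss(0.0, 1.0))
    # Y_0 = g(X_{T1}) = e^{-r T1} (K1 - Call_BS(X_{T1}, K2, r, sigma, T2 - T1))_+
    acc += math.exp(-r * T1) * max(K1 - call_bs(x, K2, r, sigma, T2 - T1), 0.0)
ref = acc / M
print(ref)  # should be close to the reference value 1.39456
```

A further speed-up, in the spirit of the second question, could come from the control variate introduced above or from antithetic transformations of the Gaussian draw.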
We tested the following four estimators:
• Regular MLMC and ML2R,
• Antithetic MLMC and ML2R.
Theorem 9.2 (a) suggests, though f is simply Lipschitz continuous, that the weak
error expansion holds with α = 1 and Proposition 9.2(a) shows that the strong error
rate is satisfied with β = 1 for both regular multilevel estimators.
As for the implementation of the antithetic nested estimator on this example, it seems a priori totally inappropriate since the derivative f′ of f is not even defined everywhere and consequently has no chance to be (Hölder-)continuous. However, empirical tests suggest that we may assume that (9.101) (the variant of (SE)_β involving the antithetic Y term defined in (9.100)) holds with β = 1.5, which led us to implement this specification for both multilevel estimators.
For the implementation we set K_0 = 2 (at least two inner simulations), i.e. h = 1/2. A rough computation of the payoff-dependent parameters yields:
– Regular nested: Var(Y_h) ≈ 7.938, V_1 ≈ 23.943 (still with h_0 = h/5 and h'_0 = h_0/2) so that θ ≈ 1.74.
– Antithetic: Var(Y_h) ≈ 7.867, V_1 ≈ 14.689 so that θ ≈ 1.3664.
Note in Fig. 9.5 that, for a prescribed RMSE of 2^{−8}, ML2R is faster than MLMC as a function of the empirical RMSE ε̃ by a factor of approximately 4 for regular nested simulations (8.24 s versus 33.98 s) and 2 for antithetic simulations (2.26 s versus 4.37 s). The gain with respect to a crude Monte Carlo simulation at the same accuracy (68.39 s) is beyond a factor of 30.
Digital option on a call option (β = 0.5− , α = 1)
Now we are interested in a quantile of the premium at time T_1 of a call option with maturity T_2 in the same model, namely to compute

P( e^{−r(T_2−T_1)} E[(X_{T_2} − K_2)_+ | X_{T_1}] ≥ K_1 ) = E[ 1_{{e^{−r(T_2−T_1)} E[(X_{T_2}−K_2)_+ | X_{T_1}] ≥ K_1}} ].

Such a value can still be seen as a compound option, namely the value of a digital option on the premium at time T_1 of a call of maturity T_2. We kept the above parameters for the Black–Scholes model and the underlying call option but also for the digital component, namely K_1 = 6.5, T_1 = 1/2; K_2 = 100 and T_2 = 1/2.
Fig. 9.5 Compound option in a Black–Scholes model. Top: ε̃² × CPU-time (y-axis, regular scale) as a function of the empirical RMSE ε̃ (x-axis, log scale). Bottom: CPU-time in log-log scale. (With D. Giorgi and V. Lemaire)
Fig. 9.6 Digital option on a Call: CPU-time (y-axis, log scale) as a function of the empirical RMSE ε̃ (x-axis, log₂ scale) (with V. Lemaire)

We only tested the first two regular MLMC and ML2R estimators since antithetic
pseudo-schemes are not efficient with non-smooth functions (here an indicator func-
tion).
Theorem 9.2(c) strongly suggests that weak error expansion holds with α = 1
and Proposition 9.2(c) suggests that the strong error rate is satisfied with β = 0.5− ,
so we adopt these values.
The multilevel estimators have been implemented for this simulation with the
parameters from [198], which are slightly more conservative than those prescribed
in this chapter, which explains why the crosses are slightly ahead of their optimal
positions (the resulting empirical RMSE is lower than the prescribed one).
Note in Fig. 9.6 that the ML2R estimator is faster than the MLMC estimator as a
function of the empirical RMSE  ε by a factor approximately equal to 5 within the
range of our simulations. This is expected since we are in a framework where β < 1.
Note that this toy example is close in spirit to the computation of Solvency Capital
Requirement which consists by its very principle in computing a kind of quantile – or
a value-at-risk – of a conditional expectation.

9.7.4 Multilevel Monte Carlo Research Worldwide

 The webpage below, created and maintained by Mike Giles, is an attempt to list
the research groups working on multilevel Monte Carlo methods and their main
contributions:
people.maths.ox.ac.uk/gilesm/mlmc_community.html
 For an interactive ride through the world of multilevel, have a look at the website
simulations@lpsm at the url:
simulations.lpsm.paris/

9.8 Randomized Multilevel Monte Carlo (Unbiased Simulation)
We conclude this chapter with a section devoted to randomized multilevel estimators, which is a new name for a rather old idea going back to the very beginning of Monte Carlo simulation, brought back to light and developed more recently by McLeish [206], then by Rhee and Glynn in [252].
We briefly expose in the next section a quite general abstract version. As a sec-
ond step, we make the connection with our multilevel framework developed in this
chapter.

9.8.1 General Paradigm of Unbiased Simulation

Let (Z_n)_{n≥1} be a sequence of square integrable random variables defined on a probability space (Ω, A, P) and let τ : (Ω, A, P) → N*, independent of (Z_n)_{n≥1}, be such that

π_n = P(τ ≥ n) > 0 for every n ∈ N*.

We also set p_n = P(τ = n) = π_n − π_{n+1} for every n ∈ N*.
We assume that Σ_{n≥1} ||Z_n||_2 < +∞ so that, classically (⁶),

Z_1 + · · · + Z_n −→ Y_0 = Σ_{n≥1} Z_n  in L² and a.s.
Our aim is to compute I_0 = E Y_0, taking advantage of the assumption that the random variables Z_n are simulable (see further on) and, to be more precise, to devise an unbiased estimator of I_0. To this end, we will use the random time τ (with the disadvantage of adding an exogenous variance to our estimator).
Let

I_τ = Σ_{i=1}^{τ} Z_i/π_i = Σ_{i≥1} (Z_i/π_i) 1_{{i≤τ}}

⁶ As the L¹-norm is dominated by the L²-norm, Σ_{n≥1} ||Z_n||_1 < +∞ so that E Σ_{n≥1} |Z_n| < +∞, which in turn implies that Σ_{n≥1} Z_n is P-a.s. absolutely convergent.
and, for every n ≥ 1,

I_{τ∧n} = Σ_{i=1}^{n} (Z_i/π_i) 1_{{i≤τ}} = Σ_{i=1}^{τ∧n} Z_i/π_i.
The following proposition provides a first answer to our problem.
Proposition 9.4 (a) If the sequence (Z_n)_{n≥1} satisfies

Σ_{n≥1} ||Z_n||_2/√π_n < +∞,   (9.105)

then

I_{τ∧n} −→ I_τ  in L² and a.s.

(and Z_1 + · · · + Z_n −→ Y_0 in L² and a.s. as n → +∞). Then, the random variable I_τ is unbiased with respect to Y_0 in the sense that

E I_τ = E[ Σ_{i=1}^{τ} Z_i/π_i ] = E Y_0

and

Var(I_τ) = ||I_τ||²_2 − I_0²  with  ||I_τ||²_2 ≤ ( Σ_{n≥1} ||Z_n||_2/√π_n )².   (9.106)

(b) The L²-norm of I_τ also satisfies

||I_τ||²_2 = Σ_{n≥1} p_n || Σ_{i=1}^{n} Z_i/π_i ||²_2 = Σ_{n≥1} π_n ( || Σ_{i=1}^{n} Z_i/π_i ||²_2 − || Σ_{i=1}^{n−1} Z_i/π_i ||²_2 ).
(c) Assume the random variables (Z_n)_{n≥1} are independent. If

Σ_{n≥1} Var(Z_n)/π_n < +∞  and  Σ_{n≥1} |E Z_n|/√π_n < +∞,   (9.107)

then I_{τ∧n} −→ I_τ in L² and a.s. as n → +∞, and

||I_τ||_2 ≤ ( Σ_{n≥1} Var(Z_n)/π_n )^{1/2} + Σ_{n≥1} |E Z_n|/√π_n.   (9.108)
 
Remark. When the random variables Z n are independent and E Z n = o Z n 2 , the
assumption (9.107) in (c) is strictly less stringent than that in (a) (under theindepen-
dence assumption). Furthermore, if one may neglect the additional term n≥1 |E√πZnn |
involving expectations in the upper-bound (9.108) of Iτ 2 , it is clear that the variance
in the independent case is most likely lower than the general upper-bound established
in (9.106)) since an2 = o(an ) when an → 0.
Proof of Proposition 9.4. (a) The space L²(Ω, A, P) endowed with the L²-norm ||·||_2 is complete, so it suffices to show that the series with general term (Z_n/π_n) 1_{{n≤τ}} is ||·||_2-normally convergent. Now, using that τ and the sequence (Z_n)_{n≥1} are mutually independent, we have for every n ≥ 1,

|| (Z_n/π_n) 1_{{n≤τ}} ||_2 = (||Z_n||_2/π_n) ||1_{{n≤τ}}||_2 = (||Z_n||_2/π_n) √π_n = ||Z_n||_2/√π_n.

Then, the sequence I_{τ∧n} converges in L²(P), hence toward I_τ, as I_τ is clearly its a.s. limit since τ < +∞.
The L²-convergence of Z_1 + · · · + Z_n toward Y_0 is straightforward under our assumptions, since 1 ≤ 1/√π_i, by similar ||·||_2-normal convergence arguments.
On the other hand, E[(Z_n/π_n) 1_{{n≤τ}}] = (E Z_n/π_n) π_n = E Z_n, n ≥ 1. Now, as a consequence of the L² (hence L¹) convergence of I_{τ∧n} toward I_τ, their expectations converge, i.e. E I_τ = lim_n E I_{τ∧n} = lim_n Σ_{i=1}^{n} E Z_i = E Y_0. Finally,

||I_τ||_2 ≤ Σ_{n≥1} ||Z_n||_2/√π_n.
(b) It is obvious, after preconditioning by τ, that

||I_τ||²_2 = E( Σ_{i=1}^{τ} Z_i/π_i )² = Σ_{i≥1} p_i E( Σ_{ℓ=1}^{i} Z_ℓ/π_ℓ )².

Now, an Abel transform yields

Σ_{i=1}^{n} p_i E( Σ_{ℓ=1}^{i} Z_ℓ/π_ℓ )² = −π_{n+1} E( Σ_{i=1}^{n} Z_i/π_i )² + Σ_{i=1}^{n} π_i [ E( Σ_{ℓ=1}^{i} Z_ℓ/π_ℓ )² − E( Σ_{ℓ=1}^{i−1} Z_ℓ/π_ℓ )² ].

Now

√π_{n+1} || Σ_{i=1}^{n} Z_i/π_i ||_2 ≤ √π_n Σ_{i=1}^{n} ||Z_i||_2/(√π_i √π_i) −→ 0 as n → +∞

owing to Kronecker's Lemma (see Lemma 12.1, applied with a_n = ||Z_n||_2/√π_n and b_n = 1/√π_n).
(c) First we decompose I_{τ∧n} into

I_{τ∧n} = Σ_{i=1}^{n} (Z̃_i/π_i) 1_{{i≤τ}} + Σ_{i=1}^{n} (E Z_i/π_i) 1_{{i≤τ}},

where the random variables Z̃_i = Z_i − E Z_i are independent and centered.
Then, using that τ and (Z_n)_{n≥1} are independent, we obtain for every n, m ≥ 1,

E| Σ_{i=n+1}^{n+m} (Z̃_i/π_i) 1_{{i≤τ}} |² = Σ_{i=n+1}^{n+m} π_i E Z̃_i²/π_i² + 2 Σ_{n+1≤i<j≤n+m} π_{i∨j} (E Z̃_i E Z̃_j)/(π_i π_j) = Σ_{i=n+1}^{n+m} Var(Z_i)/π_i

since E Z̃_i Z̃_j = E Z̃_i E Z̃_j = 0 if i ≠ j. Hence, for all integers n, m ≥ 1,

|| Σ_{i=n+1}^{n+m} (Z̃_i/π_i) 1_{{i≤τ}} ||_2 ≤ ( Σ_{i=n+1}^{n+m} Var(Z_i)/π_i )^{1/2},

which implies, under the assumption Σ_{n≥1} Var(Z_n)/π_n < +∞, that the sequence Σ_{i=1}^{τ∧n} Z̃_i/π_i = Σ_{i=1}^{n} (Z̃_i/π_i) 1_{{i≤τ}} is a Cauchy sequence, hence convergent in the complete space (L²(P), ||·||_2). Its limit is necessarily Σ_{i=1}^{τ} Z̃_i/π_i since that is its a.s. limit.
On the other hand, as

Σ_{i≥1} || (E Z_i/π_i) 1_{{i≤τ}} ||_2 ≤ Σ_{i≥1} (|E Z_i|/π_i) √π_i = Σ_{i≥1} |E Z_i|/√π_i < +∞,

one deduces that Σ_{i=1}^{n} (E Z_i/π_i) 1_{{i≤τ}} is convergent in L²(P). Its limit is clearly Σ_{i=1}^{τ} E Z_i/π_i.
Finally, I_{τ∧n} → I_τ in L²(P) (and P-a.s.) and the estimate of Var(I_τ) is a straightforward consequence of Minkowski's Inequality. ♦
Unbiased Multilevel estimator
The resulting unbiased multilevel estimator reads, for every integer M ≥ 1,
I_M^{RML} = (1/M) Σ_{m=1}^{M} I_{τ^{(m)}}^{(m)} = (1/M) Σ_{m=1}^{M} Σ_{i=1}^{τ^{(m)}} Z_i^{(m)}/π_i,   (9.109)

where (τ^{(m)})_{m≥1} and (Z_i^{(m)})_{i≥1}, m ≥ 1, are independent sequences, distributed as τ and (Z_i)_{i≥1}, respectively.

Definition 9.6 (Randomized Multilevel estimator) The unbiased multilevel estimator (9.109) associated to the random variables (Z_i)_{i∈N*} and the random time τ is called a Randomized Multilevel estimator and will be denoted in short by RML from now on.

However, for a practical implementation, we need to specify the random time τ so that the estimator has both a finite mean complexity and a finite variance or, equivalently, a finite L²-norm, and such a choice is closely connected to the balance between the rate of decay of the variances of the random variables Z_i and the growth of their computational complexity. This is the purpose of the following section, where we will briefly investigate a geometric setting close to that of the multilevel framework with geometric refiners.
ℵ Practitioner’s corner
In view of practical implementation, we make the following “geometric” assumptions
on the complexity and variance (which are consistent with a multilevel approach,
see later) and specify a distribution for τ accordingly. See the exercise in the next
section for an analysis of an alternative non-geometric framework.
• Distribution of τ: we assume a priori that τ ∼ G*(p), p ∈ (0, 1), so that p_n = p(1 − p)^{n−1} and π_n = (1 − p)^{n−1}, n ≥ 1.
This choice is motivated by the assumptions below.
• Complexity. The complexity of simulating Z_n is of the form

κ(Z_n) = κ N^n, n ≥ 1.

Then, the mean complexity of the estimator I_τ is clearly

κ̄(I_τ) = κ Σ_{n≥1} p_n N^n.

As a consequence this mean complexity is finite if and only if

(1 − p)N < 1,  and then  κ̄(I_τ) = κpN/(1 − (1 − p)N).
• Variance. We make the assumption that there exists some β > 0 such that

E Z_n² ≤ V_1 N^{−nβ}, n ≥ 1,

for the same N as for the complexity. Then, the finiteness of the variance is ensured if Assumption (9.105) holds true. With the above assumptions and choice, we have

||Z_n||_2/√π_n ≤ √V_1 N^{−βn/2}/(1 − p)^{(n−1)/2} = √V_1 √(1 − p) ( N^β (1 − p) )^{−n/2}.

Consequently, the series in (9.105) converges if and only if N^β(1 − p) > 1.
Now we are in a position to specify the parameters of the randomized multilevel estimator:
• Choice of the geometric parameter p. The two antagonistic conditions on p induced by the finiteness of the variance and of the mean complexity read

1/N^β < 1 − p < 1/N,

which may hold if and only if β > 1. A reasonable choice is then to set 1 − p to be the geometric mean of these bounds, namely

p = 1 − N^{−(1+β)/2}.   (9.110)
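A quick numerical check of Formula (9.110) (anticipating the illustration of Sect. 9.8.3 below, where a Milstein scheme is used, so that β = 2, with N = 6):

```python
def geometric_p(N, beta):
    """1 - p is the geometric mean of the bounds 1/N**beta and 1/N,
    i.e. p = 1 - N**(-(1 + beta)/2)."""
    return 1.0 - N ** (-(1.0 + beta) / 2.0)

p = geometric_p(6, 2.0)
print(round(p, 6))  # 0.931959, the value p* used in Sect. 9.8.3
```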
• Size of the simulation. To specify our estimator for a given RMSE, we take advantage of its unbiased feature and note that

|| I_M^{RML} − I_0 ||²_2 = Var(I_M^{RML}) = Var(I_τ)/M.

The prescription || I_M^{RML} − I_0 ||_2 = ε reads, like for any unbiased estimator,

M*(ε) = ε^{−2} Var(I_τ).

In view of practical implementation a short pre-processing is necessary to roughly calibrate Var(I_τ).
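The mechanics of I_τ can be illustrated on a toy example where everything is explicit (our own construction, not one of the benchmarks of this chapter): take deterministic "levels" Z_n = 2^{−n}, so that Y_0 = Σ_{n≥1} 2^{−n} = 1, and τ geometric with parameter p, for which condition (9.105) amounts to 1 − p > 1/4:

```python
import random

def sample_i_tau(p, rng, n_max=60):
    """One copy of I_tau = sum_{i <= tau} Z_i / pi_i with Z_i = 2**(-i) (deterministic)
    and tau ~ geometric(p), so that pi_i = P(tau >= i) = (1 - p)**(i - 1)."""
    total, i = 0.0, 1
    while i <= n_max:                 # n_max caps the (a.s. finite) random sum
        total += 2.0 ** (-i) / (1.0 - p) ** (i - 1)
        if rng.random() < p:          # tau = i: stop here
            break
        i += 1
    return total

rng = random.Random(0)
p, M = 0.5, 100_000                   # 1 - p = 0.5 > 1/4, so (9.105) holds here
mean = sum(sample_i_tau(p, rng) for _ in range(M)) / M
print(mean)  # E I_tau = Y_0 = 1 exactly; the empirical mean should be close to 1
```

With p = 1/2 each increment Z_i/π_i equals 1/2, so I_τ = τ/2 and Var(I_τ) = Var(τ)/4: the exogenous variance injected by the random time τ is visible in closed form.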
9.8.2 Connection with Former Multilevel Frameworks

The transcription to our multilevel framework (see Sect. 9.2) is straightforward: with our usual notations, we set, for every h ∈ H = {h/n, n ≥ 1}, h ∈ (0, +∞),

Z_1 = Y_h^{(1)}  and  Z_i = Y_{h/n_i}^{(i)} − Y_{h/n_{i−1}}^{(i)},  i ≥ 2.   (9.111)
However, we insist upon the fact that this unbiased simulation method can only be implemented if the family (Y_h)_{h∈H} satisfies (SE)_β with β > 1 (fast quadratic approximation). By contrast, at first glance, no assumption on the weak error expansion is needed. This observation should be tempered by the result stated in Claim (c) of the former proposition, where the Z_n are assumed independent. This setting is actually the closest to the multilevel approach, and the second series in (9.107) involves E Z_n, which implies a control of the weak error to ensure its convergence.
 Exercise (Linear refiners). Set n_i = i, i ≥ 1, in the multilevel framework. Assume H = {h/n, n ≥ 1} and that the sequence (Z_n)_{n≥1} is made of independent random variables (see Claim (c) of the above Proposition 9.4).
• Inverse linear complexity: there exists a κ > 0 such that κ(Y_h) = κ/h, h ∈ H.
• (SE)_β ≡ Var(Y_h − Y_{h/i}) ≤ V_1 h^β (1 − 1/i)^β, h ∈ H, i ≥ 2.
• (WE)¹_α ≡ E Y_h = E Y_0 + c_1 h + o(h) as h → 0 in H.
(a) Devise a new version of randomized Monte Carlo based on these assumptions (including a new distribution for τ).
(b) Show that, once again, it is unfortunately only efficient (by efficient we mean that it is possible to choose the distribution of τ so that the mean cost and the variance of the unbiased randomized multilevel estimator both remain finite) if β > 1.
Application to weak exact simulation of SDEs
In particular, in an SDE framework, a multilevel implementation of the Milstein
scheme provides an efficient unbiased method of simulation of the expectation of a marginal function of a diffusion, E f(X_T). This is always possible in one dimension
(scalar Brownian motion, i.e. q = 1 with our usual notations) under appropriate
smoothness assumptions on the drift and the diffusion coefficient of the SDE.
In higher dimensions (q ≥ 2), an unbiased simulation method can be devised by
considering the antithetic (meta-)scheme, see Sect. 9.6.1,
Z_1 = Y_h^{(1)}  and  Z_i = (1/2)( Y_{h/n_i}^{[1],(i)} + Y_{h/n_i}^{[2],(i)} ) − Y_{h/n_{i−1}}^{(i)},  i ≥ 2,

(with the notations of this section). Such an approach provides a rather simple weak
exact simulation method of a diffusion among those mentioned in the introduction
of this chapter.

9.8.3 Numerical Illustration

We tested the performances of the Randomized Multilevel estimator described above, with consistent and independent levels, on a regular Black–Scholes Call in one dimension, discretized by a Milstein scheme in order to guarantee that β > 1. The levels have
been designed following the multilevel framework described above. The parameters
of the Call are
x0 = 100, K = 80, σ = 0.4.

As for the simulation, we set N = 6 so that

p* = 0.931959

owing to Formula (9.110) (further numerical experiments show that this choice is close to optimal).
First we compared the Randomized Multilevel estimator (see the former subsection) with consistent levels (the same underlying Brownian increments are used for the simulation at all levels) and the Randomized Multilevel estimator with independent levels (in the spirit of the multilevel paradigm described at the beginning
of Sect. 9.5). Figure 9.7 depicts the respective performances of the two estimators.
Together with the computations carried out in Proposition 9.4, it confirms the intu-
ition that the Randomized Multilevel estimator with independent levels outperforms
that with constant levels. To be more precise the variance of Iτ is approximately equal
to 1617.05 in the consistent levels case and 1436.61 in the independent levels case.
As a second step, with the same toy benchmark, we tested the Randomized Multi-
level estimator versus the weighted and regular Multilevel estimators described in the
former sections. Figure 9.8 highlights that even for not so small prescribed RMSE
($\varepsilon = 2^{-k}$, $k = 1, \ldots, 5$ in this simulation), the Randomized Multilevel estimator
(with independent levels) is outperformed by both weighted and regular multilevel
estimators. This phenomenon can be attributed to exogenous variance induced by
the random times $\tau_i$.

Fig. 9.7 RML estimators: Constant vs independent levels (on a Black–Scholes Call). CPU-time
(y-axis, log scale) as a function of the empirical RMSE ε̃ (x-axis, log scale) (with D. Giorgi)

Fig. 9.8 RML estimator vs ML2R and MLMC estimators: CPU-time as a function of the
empirical RMSE ε̃ (x-axis, log scale) (with D. Giorgi)
As a conclusion to this section, numerical tests carried out with the Milstein scheme
for vanilla options in a Black–Scholes model (see, e.g., [113] for more
details and further simulations) share the following features:
• The expected theoretical behavior as ε → 0 is observed, corresponding to its counterpart β > 1 in the multilevel framework, as well as the absence of bias.
• The procedure shows better performance when implemented with independent
levels than with fully consistent levels.
• This RML estimator is always slower than the weighted and regular multilevel
estimators introduced in former sections. This is undoubtedly due to the exogenous
noise induced by the random times τi used to determine the number of levels on
each path.
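To make the tested procedure concrete, here is a minimal, illustrative sketch of a single-term randomized multilevel estimator with consistent levels, built on the Milstein scheme for the Black–Scholes Call used above ($x_0 = 100$, $K = 80$, $\sigma = 0.4$). The level distribution, its truncation at a maximal level (which reintroduces a tiny bias that the genuine estimator, randomizing over all levels, avoids), the interest rate and the remaining parameters are our own assumptions, not the settings behind Figs. 9.7–9.8.

```python
import numpy as np

rng = np.random.default_rng(0)
x0, K, sigma, r, T = 100.0, 80.0, 0.4, 0.0, 1.0   # r = 0 is an assumption

def milstein_call(n_steps, dW):
    # Milstein scheme for dX = r X dt + sigma X dW_t, then discounted Call payoff.
    h = T / n_steps
    x = np.full(dW.shape[0], x0)
    for k in range(n_steps):
        x = x + r * x * h + sigma * x * dW[:, k] \
              + 0.5 * sigma ** 2 * x * (dW[:, k] ** 2 - h)
    return np.exp(-r * T) * np.maximum(x - K, 0.0)

def rml_sample(M, i_max=6):
    # Random level tau with geometric-like weights p_i (an illustrative choice).
    p = 2.0 ** (-1.5 * np.arange(i_max + 1))
    p /= p.sum()
    taus = rng.choice(i_max + 1, size=M, p=p)
    out = np.empty(M)
    for i in range(i_max + 1):
        idx = np.where(taus == i)[0]
        if idx.size == 0:
            continue
        n = 2 ** i
        # Consistent levels: the SAME Brownian increments drive the fine scheme
        # (step T/2^i) and the coarse one (step T/2^(i-1)) on each path.
        dW = rng.normal(0.0, np.sqrt(T / n), size=(idx.size, n))
        delta = milstein_call(n, dW)
        if i > 0:
            delta -= milstein_call(n // 2, dW[:, 0::2] + dW[:, 1::2])
        out[idx] = delta / p[i]   # reweighting by 1/p_i removes the level bias
    return out

sample = rml_sample(200_000)
print(sample.mean(), sample.std())
```

The sample mean is then close to the Black–Scholes price of this Call, while the sample variance gives the quantity compared between the consistent and independent variants in the tests above.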
 Exercise. Compute by any means at your disposal the value of the constant
$C = \mathbb{E}\,\|W\|_{L^2([0,1],dt)}$ with an accuracy of $\pm 3\cdot 10^{-7}$ at a 95% confidence level.
(This constant C was introduced in Sect. 5.3.1 – Exercise 4 of the Practitioner's corner,
to be precise – in order to produce optimal quantizers of the Lévy area.)
Chapter 10
Back to Sensitivity Computation

Let (¹) $Z : (\Omega, \mathcal{A}, \mathbb{P}) \to (E, \mathcal{E})$ be an E-valued random variable where $(E, \mathcal{E})$ is an
abstract measurable space, let I be a nonempty open interval of $\mathbb{R}$ and let $F : I \times E \to \mathbb{R}$ be a $\mathcal{B}or(I) \otimes \mathcal{E}$-measurable function such that, for every $x \in I$, $F(x, Z) \in L^2(\mathbb{P})$ (²). Then set
$$f(x) = \mathbb{E}\, F(x, Z).$$

Assume that the function f is regular, at least at some points. Our aim is to devise a
method to compute by simulation $f'(x)$, or higher derivatives $f^{(k)}(x)$, $k \ge 1$, at such
points x. If the functional F(x, z) is differentiable in its first variable at x for $\mathbb{P}_Z$-almost every z, if a domination or uniform integrability property holds (like (10.1)
or (10.5) below), and if the partial derivative $\frac{\partial F}{\partial x}(x, z)$ can be computed and Z can be
simulated, both at a reasonable computational cost, it is natural to compute $f'(x)$
using a Monte Carlo simulation based on the representation formula
$$f'(x) = \mathbb{E}\Big[\frac{\partial F}{\partial x}(x, Z)\Big].$$

1 Although this chapter comes right after the presentation of multilevel methods, we decided to
expose the main available approaches for simulation-based sensitivity computation on their own,
in order not to be polluted by technicalities induced by a mix (or a merge). However, the multilevel
paradigm may be applied efficiently to improve the performances of the Monte Carlo-based methods
that we will present: in most cases we have at hand a strong error rate and a weak error expansion.
This can be easily checked on the finite difference method, for example. For an application of the
multilevel approach to Greek computation, we refer to [56].
2 In fact, E can be replaced by the probability space Ω itself: Z becomes the canonical variable/process on this probability space (endowed with the distribution $\mathbb{P} = \mathbb{P}_Z$ of the process).
In particular, Z can be the Brownian motion, or any process at time T starting at x, or its entire path,
etc. The notation is essentially formal and could be replaced by the more general F(x, ω).
© Springer International Publishing AG, part of Springer Nature 2018 471
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_10

This approach has already been introduced in Chap. 2 and will be more deeply developed further on, in Sect. 10.2, mainly devoted to the tangent process method for
diffusions.
Otherwise, when $\frac{\partial F}{\partial x}(x, z)$ does not exist or cannot be computed easily (whereas
F can), a natural idea is to introduce a stochastic finite difference approach. Other
methods, based on the introduction of an appropriate weight, will be introduced in the
last two sections of this chapter.

10.1 Finite Difference Method(s)

The finite difference method is in some way the most elementary and natural method
for computing sensitivity parameters, known as Greeks when dealing with financial
derivatives, although it is an approximate method in its standard form. This is also
known in financial Engineering as the “Bump Method” or “Shock Method”. It can be
described in a very general setting which corresponds to its wide field of application.
Finite difference methods were originally investigated in [117, 119, 194].

10.1.1 The Constant Step Approach

We consider the framework described in the introduction. We will distinguish two


cases: in the first one – called the “regular setting” – the function x → F(x, Z (ω))
is “not far” from being pathwise differentiable whereas in the second one – called
the “singular setting” – f remains smooth but F becomes “singular”.
The regular setting

Proposition 10.1 Let $x \in \mathbb{R}$. Assume that F satisfies the following local mean
quadratic Lipschitz continuity assumption at x:
$$\exists\, \varepsilon_0 > 0,\ \forall\, x' \in (x - \varepsilon_0, x + \varepsilon_0),\quad \big\| F(x, Z) - F(x', Z) \big\|_2 \le C_{F,Z}\, |x - x'|. \qquad (10.1)$$

Assume the function f is twice differentiable with a Lipschitz continuous second


derivative on (x − ε0 , x + ε0 ). Let (Z k )k≥1 be a sequence of i.i.d. random vectors
with the same distribution as Z . Then for every ε ∈ (0, ε0 ), the mean quadratic error
or Root Mean Square Error (RMSE) satisfies
$$\Big\| f'(x) - \frac{1}{M}\sum_{k=1}^{M} \frac{F(x+\varepsilon, Z_k) - F(x-\varepsilon, Z_k)}{2\varepsilon} \Big\|_2
\le \sqrt{\Big(\frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}\Big)^2 + \frac{C_{F,Z}^2 - \big(f'(x) - \frac{\varepsilon^2}{2}[f'']_{\rm Lip}\big)_+^2}{M}} \qquad (10.2)$$
$$\le \sqrt{\Big(\frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}\Big)^2 + \frac{C_{F,Z}^2}{M}} \qquad (10.3)$$
$$\le \frac{\varepsilon^2}{2}\,[f'']_{\rm Lip} + \frac{C_{F,Z}}{\sqrt{M}}.$$

Furthermore, if f is three times differentiable on $(x-\varepsilon_0, x+\varepsilon_0)$ with a bounded
third derivative, then one can replace $[f'']_{\rm Lip}$ by $\frac{1}{3}\sup_{|\xi-x|\le \varepsilon_0} |f^{(3)}(\xi)|$.

Remark. In the above bound, $\frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}$ represents the bias and $\frac{C_{F,Z}}{\sqrt{M}}$ the statistical
error.

Proof. Let ε ∈ (0, ε₀). It follows from the Taylor formula applied to f between x
and x ± ε, respectively, that
$$\Big| f'(x) - \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big| \le \frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}. \qquad (10.4)$$
On the other hand,
$$\mathbb{E}\Big[\frac{F(x+\varepsilon,Z) - F(x-\varepsilon,Z)}{2\varepsilon}\Big] = \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}$$
and
$$\begin{aligned}
\mathrm{Var}\Big(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\Big)
&= \mathbb{E}\Big[\Big(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\Big)^2\Big] - \Big(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big)^2\\
&= \frac{\mathbb{E}\big[(F(x+\varepsilon,Z)-F(x-\varepsilon,Z))^2\big]}{4\varepsilon^2} - \Big(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big)^2\\
&\le C_{F,Z}^2 - \Big(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big)^2 \le C_{F,Z}^2. \qquad (10.5)
\end{aligned}$$

Using the bias-variance decomposition (sometimes called Huygens' Theorem in
Statistics (³)), we derive the following upper-bound for the mean squared error

3 which formally simply reads $\mathbb{E}\,|X-a|^2 = (a - \mathbb{E}\,X)^2 + \mathrm{Var}(X)$.



$$\begin{aligned}
\Big\| f'(x) - \frac{1}{M}\sum_{k=1}^{M}\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\Big\|_2^2
&= \Big(f'(x) - \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big)^2 + \mathrm{Var}\Big(\frac{1}{M}\sum_{k=1}^{M}\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\Big)\\
&= \Big(f'(x) - \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big)^2 + \frac{1}{M}\,\mathrm{Var}\Big(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\Big)\\
&\le \Big(\frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}\Big)^2 + \frac{1}{M}\Big(C_{F,Z}^2 - \Big(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big)^2\Big),
\end{aligned}$$
where we combined (10.4) and (10.5) to derive the last inequality, i.e. (10.3). To get
the improved bound (10.2), we first derive from (10.4) that
$$\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} \ge f'(x) - \frac{\varepsilon^2}{2}\,[f'']_{\rm Lip},$$
so that
$$\Big|\mathbb{E}\Big[\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\Big]\Big| = \Big|\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\Big| \ge \Big(f'(x) - \frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}\Big)_+.$$
Plugging this lower bound into the above expression of the variance yields the announced
inequality. ♦

ℵ Practitioner's corner. The above result suggests choosing M = M(ε) (and ε) so as to
equalize the bias and the statistical error terms by setting
$$M(\varepsilon) = \Big(\frac{2\,C_{F,Z}}{\varepsilon^2\,[f'']_{\rm Lip}}\Big)^2 = \Big(\frac{2\,C_{F,Z}}{[f'']_{\rm Lip}}\Big)^2\,\varepsilon^{-4}.$$

However, this is not realistic since $C_{F,Z}$ is only an upper-bound of the variance term
and $[f'']_{\rm Lip}$ is usually unknown. An alternative is to choose M(ε) such that
$$\varepsilon^{-4} = o\big(M(\varepsilon)\big)$$
to be sure that the statistical error becomes smaller than the bias. However, it is of course useless
to carry on the simulation too far since the bias error is not impacted. Note that
such specification of the size M of the simulation breaks the recursive feature of the
estimator. Another way to use such an error bound is to keep in mind that, in order
to reduce the error by a factor of 2, we need to reduce ε and increase M as follows:

$$\varepsilon \rightsquigarrow \varepsilon/\sqrt{2} \quad\text{and}\quad M \rightsquigarrow 4\,M.$$

Warning (what should never be done)! Imagine that we are using two independent
samples $(Z_k)_{k\ge1}$ and $(\tilde Z_k)_{k\ge1}$ to simulate copies of $F(x-\varepsilon, Z)$ and $F(x+\varepsilon, Z)$. Then,
$$\mathrm{Var}\Big(\frac{1}{M}\sum_{k=1}^{M}\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,\tilde Z_k)}{2\varepsilon}\Big)
= \frac{1}{4M\varepsilon^2}\big(\mathrm{Var}\,F(x+\varepsilon,Z) + \mathrm{Var}\,F(x-\varepsilon,Z)\big)
\simeq \frac{\mathrm{Var}\,F(x,Z)}{2M\varepsilon^2}.$$
Note that the asymptotic variance of this estimator of $\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}$ explodes as ε → 0
and the resulting quadratic error reads approximately
$$\frac{\varepsilon^2}{2}\,[f'']_{\rm Lip} + \frac{\sigma\big(F(x,Z)\big)}{\varepsilon\sqrt{2M}},$$
where $\sigma(F(x,Z)) = \sqrt{\mathrm{Var}(F(x,Z))}$ is the standard deviation of F(x, Z). This leads
to the unrealistic constraint $M(\varepsilon) \propto \varepsilon^{-6}$ to keep the balance between the
bias term and the variance term; or, equivalently, to switching $\varepsilon \rightsquigarrow \varepsilon/\sqrt{2}$ and $M \rightsquigarrow 8\,M$
to reduce the error by a factor of 2.
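The variance blow-up above is easy to witness on a toy example. The sketch below (all parameter values are our own, illustrative choices) takes a Black–Scholes Call and compares the variance of the difference quotient simulated with the same sample Z (as in Proposition 10.1) and with two independent samples:

```python
import numpy as np

rng = np.random.default_rng(1)
x, K, r, sigma, T, M = 100.0, 100.0, 0.05, 0.2, 1.0, 100_000

def F(xi, z):
    # F(x, z) = e^{-rT} h(x e^{(r - sigma^2/2) T + sigma sqrt(T) z}), h = Call payoff
    s = xi * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(s - K, 0.0)

for eps in (1.0, 0.1, 0.01):
    z = rng.normal(size=M)
    z_tilde = rng.normal(size=M)
    crn = (F(x + eps, z) - F(x - eps, z)) / (2 * eps)           # same sample Z
    indep = (F(x + eps, z) - F(x - eps, z_tilde)) / (2 * eps)   # what not to do
    print(eps, crn.var(), indep.var())   # indep.var() grows like 1/eps^2
```

The first variance stays bounded (F is Lipschitz in x in $L^2$), while the second one blows up as ε decreases, exactly as predicted above.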
 Examples (Greeks computation). 1. Sensitivity in a Black–Scholes model. Vanilla
payoffs viewed as functions of a normal distribution correspond to functions F of
the form
$$F(x,z) = e^{-rT}\,h\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt{T}\,z}\big),\quad z\in\mathbb{R},\ x\in(0,+\infty),$$
where $h:\mathbb{R}_+\to\mathbb{R}$ is a Borel function with linear growth. If h is Lipschitz continuous, then
$$|F(x,Z)-F(x',Z)| \le [h]_{\rm Lip}\,|x-x'|\,e^{-\frac{\sigma^2}{2}T+\sigma\sqrt{T}\,Z},$$
so that elementary computations show, using that $Z \overset{d}{=} \mathcal{N}(0;1)$, that
$$\|F(x,Z)-F(x',Z)\|_2 \le [h]_{\rm Lip}\,|x-x'|\,e^{\frac{\sigma^2}{2}T}.$$
The regularity of f follows from the following easy change of variable
$$f(x) = e^{-rT}\int_{\mathbb{R}} h\big(xe^{\mu T+\sigma\sqrt{T}z}\big)\,e^{-\frac{z^2}{2}}\,\frac{dz}{\sqrt{2\pi}} = e^{-rT}\int_0^{+\infty} h(y)\,e^{-\frac{(\log(x/y)+\mu T)^2}{2\sigma^2 T}}\,\frac{dy}{y\,\sigma\sqrt{2\pi T}},$$
where $\mu = r-\frac{\sigma^2}{2}$. This change of variable makes the integral appear as a "log-convolution" on $(0,+\infty)$ (⁴), with similar regularizing effects as the standard convolution on the whole real line. Under an appropriate growth assumption on the
function h (say, polynomial growth), one shows from the above identity that the
function f is in fact infinitely differentiable over $(0,+\infty)$. In particular, it is twice
differentiable with a Lipschitz continuous second derivative over any compact interval
included in $(0,+\infty)$.
2. Diffusion model with Lipschitz continuous coefficients. Let $X^x = (X^x_t)_{t\in[0,T]}$
denote the Brownian diffusion solution to the SDE
$$dX_t = b(X_t)\,dt + \vartheta(X_t)\,dW_t,\quad X_0 = x,$$
where b and ϑ are locally Lipschitz continuous functions (on the real line) with at
most linear growth (which implies the existence and uniqueness of a strong solution
$(X^x_t)_{t\in[0,T]}$ starting from $X^x_0 = x$). In such a case, one should instead write
$$F(x,\omega) = h\big(X^x_T(\omega)\big).$$
The Lipschitz continuity of the flow of the above SDE (see Theorem 7.10) shows
that
$$\|F(x,\,\cdot\,) - F(x',\,\cdot\,)\|_2 \le C_{b,\vartheta}\,[h]_{\rm Lip}\,|x-x'|\,e^{C_{b,\vartheta}T},$$
where $C_{b,\vartheta}$ is a positive constant only depending on the Lipschitz coefficients of b and ϑ. In fact, this also holds for multi-dimensional diffusion processes
and for path-dependent functionals.
The regularity of the function f is a less straightforward question. But the answer
is positive in two situations: either h, b and ϑ are regular enough to apply results on
the flow of the SDE, which allow pathwise differentiation of $x \mapsto F(x,\omega)$ (see Theorem 10.1 further on in Sect. 10.2.2), or ϑ satisfies a uniform ellipticity assumption
$\vartheta \ge \varepsilon_0 > 0$.
3. Euler scheme of a Brownian diffusion model with Lipschitz continuous coefficients. The same holds for the Euler scheme. Furthermore, Assumption (10.1) holds
uniformly with respect to n if T/n is the step size of the Euler scheme.
4. F can also be a functional of the whole path of a diffusion, provided F is Lipschitz
continuous with respect to the sup-norm over [0, T ].

4 The convolution on $(0,+\infty)$ is defined between two non-negative functions f and g on $(0,+\infty)$
by $f * g(x) = \int_0^{+\infty} f(x/y)\,g(y)\,dy$.

As emphasized in the section devoted to the tangent process below, the generic
parameter x can be the maturity T (in practice the residual maturity T − t, also
known as seniority), or any finite-dimensional parameter on which the diffusion
coefficients depend since they can always be seen as an additional component or a
starting value of the diffusion.
 Exercises. 1. Adapt the results of this section to the case where $f'(x)$ is estimated
by its "forward" approximation
$$f'(x) \simeq \frac{f(x+\varepsilon)-f(x)}{\varepsilon}.$$

2. Richardson–Romberg extrapolation. Assume f is C³ with a Lipschitz continuous third derivative in a neighborhood of x. Starting from the expansion
$$\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} = f'(x) + \frac{\varepsilon^2}{6}\,f^{(3)}(x) + O(\varepsilon^3),$$
propose a Richardson–Romberg extrapolation method (see Sect. 9.4) based on the
estimators $\widehat{f'(x)}_{\varepsilon,M} = \frac{1}{M}\sum_{1\le k\le M}\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}$ computed with steps ε and ε/2.
3. Multilevel estimators (see Sect. 9.5). (a) Assume f is C⁴ and refine the expansion given in the previous exercise. Give a condition on the random function
F(x, ω) which makes it possible to design and implement a regular multilevel estimator for $f'(x)$.
(b) Under which assumptions is it possible to design and implement a weighted
multilevel estimator (ML2R)?
(c) What conclusions can be drawn from (a) and (b)?
4. Apply the above method(s) to approximate the γ-parameter by considering that
$$f''(x) \simeq \frac{f(x+\varepsilon)+f(x-\varepsilon)-2f(x)}{\varepsilon^2}$$
under suitable assumptions on f and its derivatives.
The singular setting
In the setting described in Proposition 10.1, we are close to a framework in which one can interchange differentiation and expectation: the (local) Lipschitz continuity assumption on the random function $x' \mapsto F(x',Z)$ implies that the family
$$\Big(\frac{F(x',Z)-F(x,Z)}{x'-x}\Big)_{x'\in(x-\varepsilon_0,x+\varepsilon_0)\setminus\{x\}}$$
is uniformly integrable. Hence, as soon as
$x' \mapsto F(x',Z)$ is P-a.s. pathwise differentiable at x (or even simply differentiable in $L^2(\mathbb{P})$), one
has $f'(x) = \mathbb{E}\,F'_x(x,Z)$.
Consequently, it is important to investigate the singular setting in which f is
differentiable at x and F( . , Z ) is not Lipschitz continuous in L 2 . This is the purpose
of the next proposition (whose proof is quite similar to the Lipschitz continuous
setting and is subsequently left to the reader as an exercise).

Proposition 10.2 Let $x \in \mathbb{R}$. Assume that F satisfies, in an open neighborhood
$(x-\varepsilon_0, x+\varepsilon_0)$, $\varepsilon_0>0$, of x, the following local mean quadratic θ-Hölder assumption, θ ∈ (0,1], at x: there exists a positive real constant $C_{Hol,F,Z}$ such that
$$\forall\, x', x'' \in (x-\varepsilon_0, x+\varepsilon_0),\quad \big\|F(x',Z) - F(x'',Z)\big\|_2 \le C_{Hol,F,Z}\,|x'-x''|^{\theta}.$$
Assume the function f is twice differentiable with a Lipschitz continuous second
derivative on $(x-\varepsilon_0, x+\varepsilon_0)$. Let $(Z_k)_{k\ge1}$ be a sequence of i.i.d. random vectors
with the same distribution as Z. Then, for every $\varepsilon \in (0,\varepsilon_0)$, the RMSE satisfies
$$\Big\| f'(x) - \frac{1}{M}\sum_{k=1}^{M}\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\Big\|_2 \le \sqrt{\Big(\frac{\varepsilon^2}{2}\,[f'']_{\rm Lip}\Big)^2 + \frac{C_{Hol,F,Z}^2}{(2\varepsilon)^{2(1-\theta)}\,M}}$$
$$\le \frac{\varepsilon^2}{2}\,[f'']_{\rm Lip} + \frac{C_{Hol,F,Z}}{(2\varepsilon)^{1-\theta}\sqrt{M}}. \qquad (10.6)$$

The variance of the finite difference estimator explodes as ε → 0 as soon as
θ < 1. As a consequence, in such a framework, to divide the quadratic error by a
factor of 2, we need to switch
$$\varepsilon \rightsquigarrow \varepsilon/\sqrt{2}\quad\text{and}\quad M \rightsquigarrow 2^{1-\theta}\times 4\,M.$$

A dual point of view in this singular case is to (roughly) optimize the parameter
ε = ε(M), given a simulation size M, in order to minimize the quadratic error, or
at least its natural upper-bounds. Such an optimization performed on (10.6) yields
$$\varepsilon_{\rm opt} = \left(\frac{2^{\theta}\,C_{Hol,F,Z}}{[f'']_{\rm Lip}\,\sqrt{M}}\right)^{\frac{1}{3-\theta}},$$
which of course depends on M, so that it breaks the recursiveness of the estimator.
Moreover, its sensitivity to $[f'']_{\rm Lip}$ (and to $C_{Hol,F,Z}$) makes its use rather unrealistic in
practice.
The resulting rate of decay of the RMSE is $O\big(M^{-\frac{1}{3-\theta}}\big)$. This rate shows
that when θ ∈ (0, 1), the lack of $L^2$-regularity of $x \mapsto F(x, Z)$ slows down the convergence of the finite difference method, by contrast with the Lipschitz continuous
case where the standard rate of convergence of the Monte Carlo method is preserved.
 Example of the digital option. A typical example of such a situation is the pricing
of digital options (or, equivalently, the computation of the δ-hedge of Call or Put
options).
Let us consider, still in the standard risk neutral Black–Scholes model, a digital
Call option with strike price K > 0 defined by its payoff

$$h(\xi) = \mathbf{1}_{\{\xi \ge K\}}$$
and set $F(x,z) = e^{-rT}\,h\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt{T}\,z}\big)$, $z\in\mathbb{R}$, $x\in(0,+\infty)$ (r denotes the constant interest rate, as usual). We know that the premium of this option is given for
every initial price x > 0 of the underlying risky asset by
$$f(x) = \mathbb{E}\,F(x,Z)\quad\text{with}\quad Z \overset{d}{=} \mathcal{N}(0;1).$$
Set $\mu = r-\frac{\sigma^2}{2}$. Since $Z \overset{d}{=} -Z$, it is clear that
$$f(x) = e^{-rT}\,\mathbb{P}\big(x\,e^{\mu T+\sigma\sqrt{T}Z} \ge K\big) = e^{-rT}\,\mathbb{P}\Big(Z \ge -\frac{\log(x/K)+\mu T}{\sigma\sqrt{T}}\Big) = e^{-rT}\,\Phi_0\Big(\frac{\log(x/K)+\mu T}{\sigma\sqrt{T}}\Big),$$
where $\Phi_0$ denotes the c.d.f. of the $\mathcal{N}(0;1)$ distribution. Hence the function f is
infinitely differentiable on $(0,+\infty)$.
On the other hand, still using $Z \overset{d}{=} -Z$, for every $x, x' \in (0,+\infty)$,
$$\begin{aligned}
\|F(x,Z)-F(x',Z)\|_2^2 &= e^{-2rT}\,\Big\|\mathbf{1}_{\{Z \ge -\frac{\log(x/K)+\mu T}{\sigma\sqrt T}\}} - \mathbf{1}_{\{Z \ge -\frac{\log(x'/K)+\mu T}{\sigma\sqrt T}\}}\Big\|_2^2\\
&= e^{-2rT}\,\mathbb{E}\,\Big|\mathbf{1}_{\{Z \le \frac{\log(x/K)+\mu T}{\sigma\sqrt T}\}} - \mathbf{1}_{\{Z \le \frac{\log(x'/K)+\mu T}{\sigma\sqrt T}\}}\Big|^2\\
&= e^{-2rT}\Big(\Phi_0\Big(\frac{\log(\max(x,x')/K)+\mu T}{\sigma\sqrt T}\Big) - \Phi_0\Big(\frac{\log(\min(x,x')/K)+\mu T}{\sigma\sqrt T}\Big)\Big).
\end{aligned}$$
Using that $\Phi_0'$ is bounded by $\kappa_0 = \frac{1}{\sqrt{2\pi}}$, we derive that
$$\|F(x,Z)-F(x',Z)\|_2^2 \le \frac{\kappa_0\,e^{-2rT}}{\sigma\sqrt{T}}\,\big|\log x - \log x'\big|.$$
Consequently, for every interval $I \subset (0,+\infty)$ bounded away from 0, there exists a
real constant $C_{r,\sigma,T,I} > 0$ such that
$$\forall\, x, x' \in I,\quad \|F(x,Z)-F(x',Z)\|_2 \le C_{r,\sigma,T,I}\,|x-x'|^{\frac12},$$
i.e. the functional F is ½-Hölder in $L^2(\mathbb{P})$ and the above proposition applies.
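This ½-Hölder modulus can be checked numerically. The sketch below (all parameter values are our own, illustrative choices) estimates $\|F(x+\Delta,Z)-F(x,Z)\|_2$ for several bumps Δ and verifies that its ratio to $\sqrt{\Delta}$ is roughly constant:

```python
import numpy as np

rng = np.random.default_rng(2)
x, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
z = rng.normal(size=1_000_000)

def F(xi):
    # F(x, Z) = e^{-rT} 1_{{x e^{(r - sigma^2/2)T + sigma sqrt(T) Z} >= K}}
    s = xi * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * (s >= K).astype(float)

for dx in (4.0, 1.0, 0.25):
    mod = np.sqrt(np.mean((F(x + dx) - F(x)) ** 2))   # empirical L^2 modulus
    print(dx, mod, mod / np.sqrt(dx))                 # last column ~ constant
```

The near-constant last column is the numerical counterpart of the bound $\|F(x,Z)-F(x',Z)\|_2 \le C\,|x-x'|^{1/2}$ established above.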

 Exercises. 1. Prove Proposition 10.2 above.
2. Digital option. (a) Consider, in the risk-neutral Black–Scholes model, a digital
option defined by its payoff
$$h(\xi) = \mathbf{1}_{\{\xi\ge K\}}$$
and set $F(x,z) = e^{-rT}\,h\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt{T}\,z}\big)$, $z\in\mathbb{R}$, $x\in(0,+\infty)$ (r is a constant
interest rate, as usual). We still consider the computation of $f(x) = \mathbb{E}\,F(x,Z)$ where
$Z \overset{d}{=} \mathcal{N}(0;1)$.
Verify on a numerical simulation that the variance of the finite difference estimator introduced in Proposition 10.1 explodes as ε → 0 at the rate expected from the
preceding computations.
(b) Derive from the preceding a way to “synchronize” the step ε and the size M of
the simulation.

10.1.2 A Recursive Approach: Finite Difference


with Decreasing Step

In the former finite difference method with constant step, the bias never fades. Consequently, to increase the accuracy of the sensitivity computation, the simulation has to be resumed
from the beginning with a new ε. In fact, it is easy to propose a recursive version of
the above finite difference procedure by considering some variable steps ε which go
to 0. This can be seen as an application of the Kiefer–Wolfowitz principle originally
developed for Stochastic Approximation purposes.
We will focus on the "regular setting" (F Lipschitz continuous in $L^2$) in this
section; the singular setting is proposed as an exercise. Let $(\varepsilon_k)_{k\ge1}$ be a sequence of
positive real numbers decreasing to 0. With the notations and the assumptions of the
former section, consider the estimator

$$\widehat{f'(x)}_M := \frac{1}{M}\sum_{k=1}^{M}\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k}. \qquad (10.7)$$

It can be computed in a recursive way since
$$\widehat{f'(x)}_{M+1} = \widehat{f'(x)}_M + \frac{1}{M+1}\Big(\frac{F(x+\varepsilon_{M+1},Z_{M+1})-F(x-\varepsilon_{M+1},Z_{M+1})}{2\varepsilon_{M+1}} - \widehat{f'(x)}_M\Big).$$
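A minimal sketch of this recursive procedure on the Black–Scholes Call delta (the step sequence $\varepsilon_k = k^{-0.3}$, which is $o(k^{-1/4})$, and all parameter values are our own illustrative choices; the closed-form delta $\Phi_0(d_1)$ only serves as a benchmark):

```python
import numpy as np

rng = np.random.default_rng(3)
x, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0

def F(xi, z):
    # F(x, z) = e^{-rT} (x e^{(r - sigma^2/2) T + sigma sqrt(T) z} - K)^+
    s = xi * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * max(s - K, 0.0)

est = 0.0
for k in range(1, 100_001):
    eps, z = k ** -0.3, rng.normal()
    quotient = (F(x + eps, z) - F(x - eps, z)) / (2 * eps)
    est += (quotient - est) / k        # recursive mean update, as above

print(est)   # close to the Black-Scholes delta Phi(d1) = 0.6368...
```

Unlike the constant-step version, this estimator can be refined indefinitely by simply appending new terms, the bias fading as the steps decrease.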

Elementary computations show that the mean squared error satisfies
$$\begin{aligned}
\big\| f'(x) - \widehat{f'(x)}_M\big\|_2^2 &= \Big(\frac{1}{M}\sum_{k=1}^M\Big(f'(x) - \frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\Big)\Big)^2 + \frac{1}{M^2}\sum_{k=1}^M\frac{\mathrm{Var}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)}{4\varepsilon_k^2}\\
&\le \frac{[f'']_{\rm Lip}^2}{4M^2}\Big(\sum_{k=1}^M\varepsilon_k^2\Big)^2 + \frac{C_{F,Z}^2}{M} \qquad (10.8)\\
&= \frac{1}{M}\left(\frac{[f'']_{\rm Lip}^2}{4M}\Big(\sum_{k=1}^M\varepsilon_k^2\Big)^2 + C_{F,Z}^2\right),
\end{aligned}$$
where we again used (10.4) to get Inequality (10.8). As a consequence, the RMSE
satisfies
$$\big\| f'(x) - \widehat{f'(x)}_M\big\|_2 \le \frac{1}{\sqrt{M}}\left(\frac{[f'']_{\rm Lip}^2}{4M}\Big(\sum_{k=1}^M\varepsilon_k^2\Big)^2 + C_{F,Z}^2\right)^{\frac12}. \qquad (10.9)$$

$L^2$-rate of convergence (erasing the asymptotic bias)

An efficient way to prove a $\frac{1}{\sqrt{M}}$ $L^2$-rate like in a standard Monte Carlo simulation
is to erase the bias by an appropriate choice of the sequence $(\varepsilon_k)_{k\ge1}$. The bias term
will fade as M → +∞ if
$$\Big(\sum_{k=1}^M \varepsilon_k^2\Big)^2 = o(M).$$
This leads us to choose $\varepsilon_k$ of the form
$$\varepsilon_k = o\big(k^{-\frac{1}{4}}\big)\quad\text{as } k \to +\infty,$$
since, then, $\sum_{k=1}^M \varepsilon_k^2 = o\Big(\sum_{k=1}^M \frac{1}{\sqrt{k}}\Big)$ as M → +∞ and
$$\sum_{k=1}^M \frac{1}{\sqrt{k}} = \sqrt{M}\;\frac{1}{M}\sum_{k=1}^M \Big(\frac{k}{M}\Big)^{-\frac12} \sim \sqrt{M}\int_0^1 \frac{dx}{\sqrt{x}} = 2\sqrt{M}\quad\text{as } M \to +\infty.$$
However, the choice of too small steps $\varepsilon_k$ may introduce numerical instability in
the computations, so we recommend choosing the $\varepsilon_k$ of the form
$$\varepsilon_k = c\,k^{-(\frac14+\delta)}\quad\text{with } \delta>0 \text{ small enough.}$$

One can refine the bound obtained in (10.9): note that
$$\sum_{k=1}^M \frac{\mathrm{Var}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)}{4\varepsilon_k^2} = \sum_{k=1}^M\Big\|\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k}\Big\|_2^2 - \sum_{k=1}^M\Big(\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\Big)^2.$$
Now, since $\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k} \to f'(x)$ as $k \to +\infty$,
$$\frac{1}{M}\sum_{k=1}^M\Big(\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\Big)^2 \to f'(x)^2 \quad\text{as } M \to +\infty.$$
Plugging this into the above computations yields the refined asymptotic upper-bound
$$\varlimsup_{M\to+\infty} \sqrt{M}\,\big\| f'(x) - \widehat{f'(x)}_M\big\|_2 \le \Big(C_{F,Z}^2 - f'(x)^2\Big)^{\frac12}.$$

This approach has the same quadratic rate of convergence as a regular Monte
Carlo simulation (e.g. a simulation carried out with $\frac{\partial F}{\partial x}(x,Z)$ if it exists).
Now, we show that the estimator $\widehat{f'(x)}_M$ is consistent, i.e. convergent toward
$f'(x)$.

Proposition 10.3 Under the assumptions of Proposition 10.1 and if $\varepsilon_k$ goes to zero
as k goes to infinity, the estimator $\widehat{f'(x)}_M$ P-a.s. converges to its target $f'(x)$.

Proof. This amounts to showing that
$$\frac{1}{M}\sum_{k=1}^M\Big(\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k} - \frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\Big) \overset{\rm a.s.}{\longrightarrow} 0 \qquad (10.10)$$
as M → +∞.

as M → +∞. This is (again) a straightforward consequence of the a.s. convergence


of L 2 -bounded martingales combined with the Kronecker Lemma (see Lemma 12.1):
first define the martingale

$$L_M = \sum_{k=1}^M \frac{1}{k}\;\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k) - \big(f(x+\varepsilon_k)-f(x-\varepsilon_k)\big)}{2\varepsilon_k},\quad M \ge 1.$$

One checks that
$$\begin{aligned}
\mathbb{E}\,L_M^2 = \sum_{k=1}^M \mathbb{E}\,(\Delta L_k)^2 &= \sum_{k=1}^M \frac{1}{4k^2\varepsilon_k^2}\,\mathrm{Var}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)\\
&\le \sum_{k=1}^M \frac{1}{4k^2\varepsilon_k^2}\,\mathbb{E}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)^2\\
&\le \sum_{k=1}^M \frac{1}{4k^2\varepsilon_k^2}\,C_{F,Z}^2\,4\varepsilon_k^2 = C_{F,Z}^2\sum_{k=1}^M\frac{1}{k^2},
\end{aligned}$$
so that
$$\sup_M \mathbb{E}\,L_M^2 < +\infty.$$

Consequently, L M a.s. converges to a square integrable (hence a.s. finite) random


variable L ∞ as M → +∞. The announced a.s. convergence in (10.10) follows from
the Kronecker Lemma (see Lemma 12.1). ♦

 Exercises. 1. Central Limit Theorem. Assume that $x \mapsto F(x,Z)$ is Lipschitz continuous from $\mathbb{R}$ to $L^{2+\eta}(\mathbb{P})$ for some η > 0. Show that the finite
difference estimator with decreasing step $\widehat{f'(x)}_M$ defined in (10.7) satisfies the following property: from every subsequence (M′) of (M) one may extract a further subsequence
(M″) such that
$$\sqrt{M''}\,\big(\widehat{f'(x)}_{M''} - f'(x)\big) \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0; v),\quad v \in [0,\bar v],\quad\text{as } M'' \to +\infty,$$
where $\bar v = C_{F,Z}^2 - f'(x)^2$.
 
[Hint: Note that the sequence $\mathrm{Var}\Big(\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k}\Big)$, k ≥ 1, is bounded and
use the following Central Limit Theorem: if $(Y_n)_{n\ge1}$ is a sequence of independent centered random
variables such that there exists an η > 0 satisfying
$$\sup_n \mathbb{E}\,|Y_n|^{2+\eta} < +\infty \quad\text{and}\quad \exists\, N_n \to +\infty \ \text{ with }\ \frac{1}{N_n}\sum_{k=1}^{N_n}\mathrm{Var}(Y_k) \overset{n\to+\infty}{\longrightarrow} \sigma^2 > 0,$$
then
$$\frac{1}{\sqrt{N_n}}\sum_{k=1}^{N_n} Y_k \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0;\sigma^2)\quad\text{as } n \to +\infty.]$$

2. Hölder framework. Assume that $x \mapsto F(x,Z)$ is only θ-Hölder from $\mathbb{R}$ to $L^2(\mathbb{P})$
with θ ∈ (0,1), as in Proposition 10.2.
(a) Show that a natural upper-bound for the quadratic error induced by the symmetric
finite difference estimator with decreasing step $\widehat{f'(x)}_M$ defined in (10.7) is given by
$$\frac{1}{M}\left(\frac{[f'']_{\rm Lip}^2}{4M}\Big(\sum_{k=1}^M \varepsilon_k^2\Big)^2 + \frac{C_{F,Z}^2}{2^{2(1-\theta)}\,M}\sum_{k=1}^M \frac{1}{\varepsilon_k^{2(1-\theta)}}\right).$$
(b) Show that the resulting estimator $\widehat{f'(x)}_M$ a.s. converges to its target $f'(x)$ as
soon as
$$\sum_{k\ge1} \frac{1}{k^2\,\varepsilon_k^{2(1-\theta)}} < +\infty.$$
(c) Assume that $\varepsilon_k = \frac{c}{k^a}$, k ≥ 1, where c is a positive real constant and a ∈ (0,1).
Show that the exponent a corresponds to an admissible step if and only if $a \in \big(0, \frac{1}{2(1-\theta)}\big)$. Justify the choice of $a^* = \frac{1}{2(3-\theta)}$ for the exponent a and deduce that
the resulting rate of decay of the RMSE is $O\big(M^{-\frac{1}{3-\theta}}\big)$.
The above exercise shows that the (lack of) regularity of x → F(x, Z ) in L 2 (P)
impacts the rate of convergence of the finite difference method.

10.2 Pathwise Differentiation Method

10.2.1 (Temporary) Abstract Point of View

We retain the notation of the former section. We assume that there exists a p ∈
[1, +∞) such that
$$\forall\, \xi \in (x - \varepsilon_0, x + \varepsilon_0),\quad F(\xi, Z) \in L^p(\mathbb{P}).$$

Definition 10.1 The function $\xi \mapsto F(\xi,Z)$ from $(x-\varepsilon_0, x+\varepsilon_0)$ to $L^p(\mathbb{P})$ is $L^p$-differentiable at x if there exists a random variable denoted by $\partial_x F(x,\,\cdot\,) \in L^p(\mathbb{P})$
such that
$$\lim_{\substack{\xi \to x\\ \xi \ne x}}\ \Big\| \frac{F(x,Z) - F(\xi,Z)}{x - \xi} - \partial_x F(x,\,\cdot\,) \Big\|_p = 0.$$

Proposition 10.4 If $\xi \mapsto F(\xi,Z)$ is $L^p$-differentiable at x, then the function f
defined by $f(\xi) = \mathbb{E}\,F(\xi,Z)$ is differentiable at x and
$$f'(x) = \mathbb{E}\,\partial_x F(x,\,\cdot\,).$$

Proof. This is a straightforward consequence of the inequality $|\mathbb{E}\,Y| \le \|Y\|_p$ (which
holds for any $Y \in L^p(\mathbb{P})$) applied to $Y(\omega) = \frac{F(x,Z(\omega))-F(\xi,Z(\omega))}{x-\xi} - \partial_x F(x,\omega)$ by letting ξ converge to x. ♦

ℵ Practitioner's corner. • As soon as $\partial_x F(x,\omega)$ can be simulated at a reasonable computational cost, it is usually preferable to compute $f'(x) = \mathbb{E}\,\partial_x F(x,\omega)$ by a direct
Monte Carlo simulation rather than by a finite difference method. The performances
are similar in terms of sample size but the complexity of each path is lower.
In a Brownian diffusion framework, it depends on the smoothness of the functional
of the diffusion and on the existence of the tangent process introduced in
Sect. 10.2.2 below.
• When the functional F(x, ω) is not pathwise differentiable (nor $L^p$-differentiable),
whereas the function $f(x) = \mathbb{E}\,F(x,\omega)$ is differentiable, the finite difference method
is outperformed by weighted estimators, whose variance appears significantly lower.
This phenomenon is illustrated in Sect. 10.4.4.
• In this section, we deal with functions F(x, z) of a real variable x, but the results extend
to functionals depending on $x \in \mathbb{R}^d$ by replacing the notion of derivative by that of
differential (or partial derivatives).
The usual criterion to establish $L^p$-differentiability, especially when the underlying source of randomness comes from a diffusion (see below), is to establish the
pathwise differentiability of $\xi \mapsto F(\xi, Z(\omega))$ combined with an $L^p$-uniform integrability property of the ratios $\frac{F(x,Z)-F(\xi,Z)}{x-\xi}$ (see Theorem 12.4 and the corollary that
follows for a short background on uniform integrability).
Usually, this is applied with p = 2 since one needs $\partial_x F(x,\omega)$ to be in $L^2(\mathbb{P})$ to
ensure that the Central Limit Theorem applies and rules the rate of convergence of
the Monte Carlo simulation.
This can be summed up in the following proposition, whose proof is obvious.

Proposition 10.5 Let $p \in [1,+\infty)$. If
(i) there exists a random variable $\partial_x F(x,\,\cdot\,)$ such that $\mathbb{P}(d\omega)$-a.s., $\xi \mapsto F(\xi, Z(\omega))$
is differentiable at x with derivative $\partial_x F(x,\omega)$,
(ii) there exists an $\varepsilon_0 > 0$ such that the family $\Big(\frac{F(x,Z)-F(\xi,Z)}{x-\xi}\Big)_{\xi\in(x-\varepsilon_0,x+\varepsilon_0)\setminus\{x\}}$ is
$L^p$-uniformly integrable,
then $\partial_x F(x,\,\cdot\,) \in L^p(\mathbb{P})$ and $\xi \mapsto F(\xi,Z)$ is $L^p$-differentiable at x with derivative
$\partial_x F(x,\,\cdot\,)$.
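In the Black–Scholes setting of the previous section, for a Call payoff, $x \mapsto F(x,Z)$ is P-a.s. differentiable (differentiability only fails on the null event $\{S_T = K\}$) and the difference quotients are dominated, so Proposition 10.5 applies with p = 2. A minimal illustration (parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x, K, r, sigma, T, M = 100.0, 100.0, 0.05, 0.2, 1.0, 400_000

z = rng.normal(size=M)
s = x * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
# Pathwise derivative of F(x, Z) = e^{-rT}(S_T - K)^+ with S_T = x e^{...}:
#   d_x F(x, Z) = e^{-rT} 1_{{S_T >= K}} S_T / x   (a.s.)
pathwise_delta = np.exp(-r * T) * (s >= K) * s / x

print(pathwise_delta.mean())   # Monte Carlo estimate of f'(x) = E d_x F(x, Z)
```

Compared with the finite difference estimators of Sect. 10.1, each path here requires a single evaluation and carries no discretization bias in ε.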

10.2.2 The Tangent Process of a Diffusion and Application


to Sensitivity Computation

In a diffusion framework, it is important to keep in mind that, like for ODEs in a
deterministic setting, the solution of an SDE is usually smooth when viewed as a
(random) function of its starting value. This smoothness even holds in a pathwise
deterministic setting, the solution of an SDE is usually smooth when viewed as a
(random) function of its starting value. This smoothness even holds in a pathwise
sense with an (almost) explicit differential which appears as the solution to a linear
SDE. The main result in this direction is due to Kunita (see [176], Theorem 3.3,
p. 223).

This result must be understood as follows: when a sensitivity (the δ-hedge and the
γ parameter but also other “greek parameters”, as will be seen further on) related to
the premium E h(X Tx ) of an option cannot be computed by “simply” interchanging
differentiation and expectation, this lack of differentiability comes from the payoff
function h. This also holds true for path-dependent options.
Let us come to a precise statement of Kunita’s Theorem on the regularity of the
flow of an SDE.

Theorem 10.1 (Smoothness of the flow (Kunita's Theorem))
(a) Let $b : \mathbb{R}_+\times\mathbb{R}^d \to \mathbb{R}^d$ and $\vartheta : \mathbb{R}_+\times\mathbb{R}^d \to \mathbb{M}(d,q,\mathbb{R})$ be of class $C_b^1$, i.e.
with bounded α-Hölder partial derivatives for some α > 0. Let $X^x = (X^x_t)_{t\ge0}$ denote
the unique $(\mathcal{F}_t^W)$-adapted strong solution of the SDE (on the whole non-negative
real line)
$$dX_t = b(t,X_t)\,dt + \vartheta(t,X_t)\,dW_t,\quad X_0 = x \in \mathbb{R}^d, \qquad (10.11)$$
where $W = (W^1,\dots,W^q)$ is a q-dimensional Brownian motion defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$. Then, at every $t \in \mathbb{R}_+$, the mapping $x \mapsto X^x_t$ is a.s. continuously differentiable and its Jacobian $Y_t(x) := \nabla_x X^x_t = \Big(\frac{\partial (X^x_t)^i}{\partial x^j}\Big)_{1\le i,j\le d}$ satisfies the
linear stochastic differential system
$$\forall\, t \in \mathbb{R}_+,\quad Y_t^{ij}(x) = \delta_{ij} + \sum_{\ell=1}^d\int_0^t \frac{\partial b^i}{\partial y^\ell}(s,X^x_s)\,Y_s^{\ell j}(x)\,ds + \sum_{\ell=1}^d\sum_{k=1}^q\int_0^t \frac{\partial \vartheta^{ik}}{\partial y^\ell}(s,X^x_s)\,Y_s^{\ell j}(x)\,dW_s^k,\quad 1\le i,j\le d,$$
where $\delta_{ij}$ denotes the Kronecker symbol.


(b) Furthermore, the tangent process Y(x) takes values in the set $GL(d,\mathbb{R})$ of invertible square matrices. As a consequence, for every t ≥ 0, the mapping $x \mapsto X^x_t$ is
a.s. a homeomorphism of $\mathbb{R}^d$.

We will admit this theorem, which is beyond the scope of this monograph. We
refer to Theorem 3.3, p. 223, from [176] for Claim (a) and Theorem 4.4 (and its
proof), p. 227, for Claim (b).
Remarks. • If b and ϑ are $C_b^{n+\alpha}$ for some $n \in \mathbb{N}$, then, a.s., the flow $x \mapsto X^x_t$ is $C_b^{n+\beta}$ for every
0 < β < α (see again Theorem 3.3 in [176]).
• One easily derives from the above theorem the slightly more general result about
the tangent process of the solution $(X^{t,x}_s)_{s\in[t,T]}$ starting from x at time t. This process, denoted by $(Y_s(t,x))_{s\ge t}$, can be deduced from Y(x) by the following inverted
transition formula
$$\forall\, s \ge t,\quad Y_s(t,x) = Y_s(x)\,[Y_t(x)]^{-1}.$$

This is a consequence of the uniqueness of the solution of a linear SDE.



• Higher-order differentiability properties hold true if b and ϑ are smoother. For a
more precise statement, see Sect. 10.2.2 below.

 Example. If d = q = 1, the above SDE reads
$$dY_t(x) = Y_t(x)\big(b_x'(t,X^x_t)\,dt + \vartheta_x'(t,X^x_t)\,dW_t\big),\quad Y_0(x) = 1,$$
and elementary computations show that
$$Y_t(x) = \exp\Big(\int_0^t \Big(b_x'(s,X^x_s) - \frac{1}{2}\big(\vartheta_x'(s,X^x_s)\big)^2\Big)\,ds + \int_0^t \vartheta_x'(s,X^x_s)\,dW_s\Big) > 0, \qquad (10.12)$$
so that, in the Black–Scholes model ($b(t,x) = rx$, $\vartheta(t,x) = \vartheta x$), one retrieves that
$$\frac{d}{dx}X^x_t = Y_t(x) = \frac{X^x_t}{x}.$$
 Exercise. Let d = q = 1. Show that, under the assumptions of Theorem 10.1, the
tangent process at x, $Y_t(x) = \frac{d}{dx}X^x_t$, satisfies
$$\sup_{s,t\in[0,T]} \frac{Y_t(x)}{Y_s(x)} \in L^p(\mathbb{P}),\quad p \in (0,+\infty).$$

Applications to δ-hedging
The tangent process and the δ-hedge are closely related. Assume for convenience
that the interest rate is 0 and that a basket is made up of d risky assets whose price
dynamics $(X^x_t)_{t\in[0,T]}$, $X^x_0 = x \in (0,+\infty)^d$, is a solution to (10.11).
Then the premium of the payoff $h(X^x_T)$ on the basket is given by
$$f(x) := \mathbb{E}\,h\big(X^x_T\big).$$

The δ-hedge vector of this option (at time 0) at $x = (x^1,\dots,x^d) \in (0,+\infty)^d$
is given by $\nabla f(x)$.
We have the following proposition, which establishes the existence of $\nabla f(x)$ and its representation as an expectation (with its computation by a Monte Carlo
simulation in view). It is a straightforward application of Theorem 2.2(b).

Proposition 10.6 (δ-hedge of vanilla European options) If a Borel function $h :
\mathbb{R}^d \to \mathbb{R}$ satisfies the following assumptions:
(i) a.s. differentiability: $\nabla h(y)$ exists $\mathbb{P}_{X^x_T}(dy)$-a.s.;
(ii) uniform integrability condition (⁵): there exists a neighborhood $(x-\varepsilon_0, x+\varepsilon_0)$
($\varepsilon_0>0$) of x such that the family
$$\left(\frac{|h(X^{x'}_T) - h(X^x_T)|}{|x-x'|}\right)_{|x'-x|<\varepsilon_0,\ x'\ne x}$$
is uniformly integrable;
then f is differentiable at x and its gradient $\nabla f(x) = \big(\frac{\partial f}{\partial x^i}(x)\big)_{1\le i\le d}$ admits the
following representation as an expectation:
$$\frac{\partial f}{\partial x^i}(x) = \mathbb{E}\Big[\Big(\nabla h(X^x_T)\,\Big|\,\frac{\partial X^x_T}{\partial x^i}\Big)\Big],\quad i = 1,\dots,d. \qquad (10.13)$$

Remark. One can also consider the payoff of a forward start option $h(X_{T_1}^x,\dots,X_{T_N}^x)$, $0<T_1<\cdots<T_N$, where $h:(\mathbb R^d)^N\to\mathbb R_+$. Then, under similar a.s. differentiability assumptions on $h$ (with respect to $\mathbb P_{(X_{T_1},\dots,X_{T_N})}$) and uniform integrability, its premium $f(x)$ is differentiable and
$$\frac{\partial f}{\partial x^i}(x)=\mathbb E\Big[\sum_{1\le j\le N}\Big(\nabla_{\!y_j}h(X_{T_1}^x,\dots,X_{T_N}^x)\,\Big|\,\frac{\partial X_{T_j}^x}{\partial x^i}\Big)\Big].$$

Toward computation by Monte Carlo simulation
These formulas can be used to compute various sensitivities by Monte Carlo simulation since the sensitivity parameters are functions of the pair made up of the diffusion process $X^x$, the solution to the SDE starting at $x$, and its tangent process $\nabla_xX^x$ at $x$: it suffices to consider the Euler scheme of this pair $(X_t^x,\nabla_xX_t^x)$ over $[0,T]$ with step $\frac Tn$.
Thus, if $d=1$ (for notational convenience), and setting $Y_t=\nabla_xX_t^x$, we see that the pair $(X^x,Y)$ satisfies the stochastic differential system
$$dX_t^x=b(X_t^x)\,dt+\vartheta(X_t^x)\,dW_t,\qquad X_0^x=x\in\mathbb R,$$
$$dY_t=Y_t\big(b'(X_t^x)\,dt+\vartheta'(X_t^x)\,dW_t\big),\qquad Y_0=1.$$
In one dimension, one can also take advantage of the semi-closed formula (10.12) obtained above for the tangent process.
Extension to an exogenous parameter θ
All the theoretical results obtained for the δ, i.e. for the differentiation of the flow of an SDE with respect to its initial value, can be extended, at least formally, to any parameter provided no ellipticity is required. This follows from the remark that, if

5 If $h$ is Lipschitz continuous, or even locally Lipschitz with polynomial growth ($|h(x)-h(y)|\le C|x-y|(1+|x|^r+|y|^r)$, $r>0$), and $X^x$ is a solution to an SDE with Lipschitz continuous coefficients $b$ and $\vartheta$ in the sense of (7.2), this uniform integrability condition is always satisfied: it follows from Theorem 7.10 applied with $p>1$.

the coefficient(s) $b$ and/or $\vartheta$ of a diffusion depend(s) on a parameter $\theta$, then the pair $(X_t,\theta_t)_{t\in[0,T]}$ is still a Brownian diffusion process, namely
$$dX_t=b(\theta_t,X_t)\,dt+\vartheta(\theta_t,X_t)\,dW_t,\qquad X_0=x,$$
$$d\theta_t=0,\qquad\theta_0=\theta.$$

Set $\tilde x=(x,\theta)$ and $\tilde X_t:=(X_t,\theta_t)$. Thus, following Theorems 3.1 and 3.3 of Sect. 3 in [176], if $(\theta,x)\mapsto b(\theta,x)$ and $(\theta,x)\mapsto\vartheta(\theta,x)$ are $C_b^{k+\alpha}$ ($0<\alpha<1$) with respect to $x$ and $\theta$ (6), then the solution of the SDE at a given time $t$ will be $C^{k+\beta}$ (for every $\beta\in(0,\alpha)$) as a function of $(x,\theta)$. A more specific approach would show that some regularity in the sole variable $\theta$ is enough, but this refinement does not follow for free from the general theorem on the differentiability of flows.
Assume $b=b(\theta,\,\cdot\,)$ and $\vartheta=\vartheta(\theta,\,\cdot\,)$ and that the initial value reads $x=x(\theta)$, $\theta\in\Theta$, $\Theta\subset\mathbb R^q$ an open set. One can also differentiate an SDE with respect to this parameter $\theta$. We may assume that $q=1$ (by considering a partial derivative if necessary). Then one gets
$$\frac{\partial X_t(\theta)}{\partial\theta}=\frac{\partial x(\theta)}{\partial\theta}+\int_0^t\Big(\frac{\partial b}{\partial\theta}\big(\theta,X_s(\theta)\big)+\frac{\partial b}{\partial x}\big(\theta,X_s(\theta)\big)\,\frac{\partial X_s(\theta)}{\partial\theta}\Big)\,ds$$
$$\hphantom{\frac{\partial X_t(\theta)}{\partial\theta}=}+\int_0^t\Big(\frac{\partial\vartheta}{\partial\theta}\big(\theta,X_s(\theta)\big)+\frac{\partial\vartheta}{\partial x}\big(\theta,X_s(\theta)\big)\,\frac{\partial X_s(\theta)}{\partial\theta}\Big)\,dW_s.$$

ℵ Practitioner’s corner
Once coupled with the original diffusion process X x , this yields some expres-
sions for the sensitivity with respect to the parameter θ, possibly closed, but usu-
ally computable
  by a Monte Carlo simulation of the Euler scheme of the couple
X t (θ), ∂ X∂θt (θ) .

As a conclusion, let us mention that this tangent process approach is close to the finite difference method applied to $F(x,\omega)=h(X_T^x(\omega))$ in the "regular setting": it appears as a limit case of the finite difference method. Their behaviors are similar, except that, for a given size of Monte Carlo simulation, the tangent process method has a lower computational complexity. The only constraint is that it is more demanding in terms of "preliminary" human (hence expensive) calculations. This is no longer true since the recent emergence (in quantitative finance) of Automatic Differentiation (AD) as described by Griewank in [135–137], Naumann in [215] and Hascoet in [144], which, once incorporated into the variable types, spares much time prior to and during the simulation itself (at least for the adjoint version of AD).

6 A function $g$ has $C_b^{k+\alpha}$ regularity if $g$ is $C^k$ with $k$-th order partial derivatives globally $\alpha$-Hölder and if all its partial derivatives up to $k$-th order are bounded.

10.3 Sensitivity Computation for Non-smooth Payoffs: The Log-Likelihood Approach (II)

The approach based on the tangent process clearly requires smooth payoff functions, typically almost everywhere differentiable, as emphasized above (and in Sect. 2.2). Unfortunately, this assumption is not fulfilled by many of the usual payoff functions, like the digital payoff
$$h_T=h(X_T)\quad\text{with}\quad h(x):=\mathbf 1_{\{x\ge K\}},$$
whose δ-hedge parameter cannot be computed by the tangent process method. In fact, $\partial_x\,\mathbb E\,h(X_T^x)$ is the probability density of $X_T^x$ evaluated at the strike $K$. For the same reason, this is also the case for the γ-sensitivity parameter of a vanilla Call option.
We also saw in Sect. 2.2 that, in the Black–Scholes model, this problem can be overcome, since an integration by parts or a differentiation of the log-likelihood leads to sensitivity formulas for non-smooth payoffs. Is it possible to extend this idea to more general models?

10.3.1 A General Abstract Result

In Sect. 2.2, we considered a family of random vectors $X(\theta)$ indexed by a parameter $\theta\in\Theta$, $\Theta$ an open interval of $\mathbb R$, all having an a.e. positive probability density $p(\theta,y)$ with respect to a reference non-negative measure $\mu$ on $\mathbb R^d$. Assume this density is positive on a domain $D$ of $\mathbb R^d$ for every $\theta\in\Theta$, where $D$ does not depend on $\theta$. Then one can derive the sensitivity with respect to $\theta$ of functions of the form
$$f(\theta):=\mathbb E\,\varphi\big(X(\theta)\big)=\int_D\varphi(y)\,p(\theta,y)\,\mu(dy),$$
where $\varphi$ is a Borel function with appropriate integrability properties, provided the density function $p$ is smooth enough as a function of $\theta$, regardless of the regularity of $\varphi$. This idea was briefly developed in a one-dimensional Black–Scholes framework in Sect. 2.2.3. The extension to a more general framework is actually straightforward and yields the following result.

Proposition 10.7 If the parametrized probability density $p(\theta,y):\Theta\times D\to\mathbb R_+$ satisfies:
(i) $\theta\longmapsto p(\theta,y)$ is differentiable on $\Theta$, $\mu(dy)$-a.e.,
(ii) there exists a Borel function $g:\mathbb R^d\to\mathbb R_+$ such that $g\,\varphi\in L^1(\mu)$ and, for every $\theta\in\Theta$, $|\partial_\theta p(\theta,y)|\le g(y)$ $\mu(dy)$-a.e.,
then
$$\forall\,\theta\in\Theta,\qquad f'(\theta)=\mathbb E\Big[\varphi\big(X(\theta)\big)\,\frac{\partial\log p}{\partial\theta}\big(\theta,X(\theta)\big)\Big].$$
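In the Black–Scholes model the density $p(\theta,\cdot)$ is known in closed form, which makes Proposition 10.7 directly implementable. The Python sketch below (illustrative, not from the book; all parameter values are assumptions) takes $\theta=x$, the initial value, for which $\partial_x\log p(x,X_T)$ reduces to $Z/(x\sigma\sqrt T)$, and estimates the δ of a digital option, a payoff for which the tangent process method fails:

```python
import numpy as np

rng = np.random.default_rng(1)
r, sigma, x, K, T = 0.04, 0.2, 100.0, 100.0, 1.0
M = 1_000_000
m = (r - 0.5 * sigma**2) * T

Z = rng.normal(size=M)
XT = x * np.exp(m + sigma * np.sqrt(T) * Z)

# Log-likelihood weight: d/dx log p(x, X_T) for the lognormal density of X_T,
# which boils down to Z / (x * sigma * sqrt(T)); no smoothness of phi is needed.
score = Z / (x * sigma * np.sqrt(T))

phi = np.exp(-r * T) * (XT >= K)        # discounted digital payoff
delta = np.mean(phi * score)
print(delta)    # close to exp(-rT) * N'(d2) / (x * sigma * sqrt(T))
```

The same weight works for any Borel payoff $\varphi$, which is precisely the point of the log-likelihood approach.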

10.3.2 The Log-Likelihood Method for the Discrete Time Euler Scheme

As presented in this abstract framework, the approach looks attractive but, unfortunately, in most situations we have no explicitly computable form for the density $p(\theta,y)$, even when its existence is proved. However, if one considers a diffusion approximation by a discretization scheme like the Euler scheme in a non-degenerate setting, the application of the log-likelihood method becomes much more realistic: it allows one to compute by Monte Carlo simulation some proxies of the Greek parameters.
As a matter of fact, under a light ellipticity assumption, the (constant step) Euler scheme of a diffusion does have a probability density at each time $t>0$ which can be made explicit in some way (see below). This is a straightforward consequence of the fact that it is a discrete time Markov process with conditionally Gaussian increments.
The principle is the following: we consider an autonomous diffusion $(X_t^x(\theta))_{t\in[0,T]}$ depending on a parameter $\theta\in\Theta$, say
$$dX_t^x(\theta)=b\big(\theta,X_t^x(\theta)\big)\,dt+\vartheta\big(\theta,X_t^x(\theta)\big)\,dW_t,\qquad X_0(\theta)=x.\qquad(10.14)$$
We consider an autonomous SDE mainly for simplicity and to get nice formulas, but the extension to general SDEs is just a matter of notation.
Let $p_T(\theta,x,y)$ and $\bar p_T(\theta,x,y)$ denote the densities at time $T$ of $X_T^x(\theta)$ and of its discrete time Euler scheme $\big(\bar X_{t_k^n}^x(\theta)\big)_{0\le k\le n}$ with step size $\frac Tn$, respectively (the superscript $n$ is dropped until the end of the section). Then one may naturally propose the following naive approximation
$$f'(\theta)=\mathbb E\Big[\varphi\big(X_T^x(\theta)\big)\,\frac{\partial\log p_T}{\partial\theta}\big(\theta,x,X_T^x(\theta)\big)\Big]\simeq\mathbb E\Big[\varphi\big(\bar X_T^x(\theta)\big)\,\frac{\partial\log\bar p_T}{\partial\theta}\big(\theta,x,\bar X_T^x(\theta)\big)\Big].$$

It consists in making the risky hypothesis that the derivative of an approximation is an approximation of the derivative (supported by various theoretical results about the convergence of the densities of the discrete time Euler scheme and of their derivatives). In fact, even at this stage, the story is not as straightforward as expected because only the density of the whole $n$-tuple $(\bar X_{t_1^n}^x,\dots,\bar X_{t_k^n}^x,\dots,\bar X_{t_n^n}^x)$ (with $t_n^n=T$) can be made explicit and tractable.

Proposition 10.8 (Density of the Euler scheme) Let $q\ge d$. Assume $\vartheta\vartheta^*(\theta,x)\in GL(d,\mathbb R)$ for every $x\in\mathbb R^d$ and $\theta\in\Theta$.
(a) Then the distribution $\mathbb P_{\bar X_{T/n}^x}(dy)$ of $\bar X_{T/n}^x$ is absolutely continuous with respect to the Lebesgue measure, with probability density given by
$$\bar p_{\frac Tn}(\theta,x,y)=\frac{\exp\Big(-\frac n{2T}\Big(\big(\vartheta\vartheta^*(\theta,x)\big)^{-1}\big(y-x-\tfrac Tn b(\theta,x)\big)\ \Big|\ y-x-\tfrac Tn b(\theta,x)\Big)\Big)}{\big(2\pi\tfrac Tn\big)^{\frac d2}\sqrt{\det\vartheta\vartheta^*(\theta,x)}}.$$
(b) The distribution $\mathbb P_{(\bar X_{t_1^n}^x,\dots,\bar X_{t_k^n}^x,\dots,\bar X_{t_n^n}^x)}(dy_1,\dots,dy_n)$ of the $n$-tuple $(\bar X_{t_1^n}^x,\dots,\bar X_{t_k^n}^x,\dots,\bar X_{t_n^n}^x)$ has a probability density given by
$$\bar p_{t_1^n,\dots,t_n^n}(\theta,x,y_1,\dots,y_n)=\prod_{k=1}^n\bar p_{\frac Tn}(\theta,y_{k-1},y_k)\qquad(\text{convention: }y_0=x).$$
k=1

Proof. (a) is a straightforward consequence of the first step of the definition (7.5) of the Euler scheme at time $\frac Tn$ applied to the diffusion (10.14) and of the standard formula for the density of a Gaussian vector, since $\vartheta(\theta,x)\vartheta^*(\theta,x)$ is invertible. Claim (b) follows from an easy induction based on the Markov property satisfied by the Euler scheme. First, it is clear, again from the recursion satisfied by the discrete time Euler scheme of (10.14) (adapted from (7.5)), that
$$\mathcal L\big(\bar X_{t_{k+1}^n}^x(\theta)\,\big|\,\mathcal F_{t_k^n}^W\big)=\mathcal L\big(\bar X_{t_{k+1}^n}^x(\theta)\,\big|\,\bar X_{t_k^n}^x\big),\qquad k=0,\dots,n-1,$$
and that
$$\mathcal L\big(\bar X_{t_{k+1}^n}^x(\theta)\,\big|\,\bar X_{t_k^n}^x(\theta)=y_k\big)=\bar p_{\frac Tn}(\theta,y_k,y)\,\lambda_d(dy).$$
One concludes by an easy induction that, for every bounded Borel function $g:(\mathbb R^d)^k\to\mathbb R_+$,
$$\mathbb E\,g\big(\bar X_{t_1^n}^x,\dots,\bar X_{t_k^n}^x\big)=\int_{(\mathbb R^d)^k}g(y_1,\dots,y_k)\prod_{\ell=1}^k\bar p_{\frac Tn}(\theta,y_{\ell-1},y_\ell)\,dy_1\cdots dy_k,$$
with the convention $y_0=x$, using the Markov property satisfied by the discrete time Euler scheme. ♦
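In dimension $d=1$ the one-step density of Proposition 10.8 is just a Gaussian density, and the density of the $n$-tuple is a product of such factors. A minimal Python sketch (the helper names and the Black–Scholes-like coefficients are illustrative assumptions, not from the book):

```python
import numpy as np

def euler_transition_density(x, y, h, b, vartheta):
    """One-step Euler transition density pbar_h(x, y) for d = 1: a Gaussian with
    mean x + h*b(x) and variance h*vartheta(x)**2 (Proposition 10.8(a))."""
    m, v = x + h * b(x), h * vartheta(x) ** 2
    return np.exp(-(y - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def euler_path_log_density(x, ys, h, b, vartheta):
    """Log-density of the n-tuple (Xbar_{t_1}, ..., Xbar_{t_n}): the sum of the
    one-step log-densities, with the convention y_0 = x (Proposition 10.8(b))."""
    ys = np.asarray(ys, dtype=float)
    prev = np.concatenate(([x], ys[:-1]))
    return np.log(euler_transition_density(prev, ys, h, b, vartheta)).sum()

# Illustrative Black-Scholes-like coefficients
b = lambda u: 0.04 * u
vartheta = lambda u: 0.2 * u
print(euler_path_log_density(100.0, [101.0, 99.5, 102.0], 1 / 50, b, vartheta))
```

Working with the log-density of the whole $n$-tuple, as done next, avoids the intractable marginal densities.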

The above proposition proves that every marginal $\bar X_{t_k^n}^x$ has a density which, unfortunately, cannot be made explicit. However, to take advantage of the above closed form for the density of the $n$-tuple, we can write
$$f'(\theta)\simeq\mathbb E\Big[\varphi\big(\bar X_T^x(\theta)\big)\,\frac{\partial\log\bar p_{t_1^n,\dots,t_n^n}}{\partial\theta}\big(\theta,x,\bar X_{t_1^n}^x(\theta),\dots,\bar X_{t_n^n}^x(\theta)\big)\Big]$$
$$\hphantom{f'(\theta)}=\mathbb E\Big[\varphi\big(\bar X_T^x(\theta)\big)\sum_{k=1}^n\frac{\partial\log\bar p_{\frac Tn}}{\partial\theta}\big(\theta,\bar X_{t_{k-1}^n}^x(\theta),\bar X_{t_k^n}^x(\theta)\big)\Big].$$

At this stage it is clear that the method also works for path-dependent options, i.e. when considering $F(\theta)=\mathbb E\,\Phi\big((X_t(\theta))_{t\in[0,T]}\big)$ instead of $f(\theta)=\mathbb E\,\varphi\big(X_T(\theta)\big)$ (at least for specific functionals $\Phi$ involving time averaging, a finite number of instants, supremum, infimum, etc.). This raises new difficulties, in connection with the Brownian bridge method for diffusions, that need to be overcome.
Finally, let us mention that evaluating the rate of convergence of these approximations from a theoretical point of view is quite a challenging problem, since it involves not only the rate of convergence of the Euler scheme itself, but also that of the probability density functions of the scheme toward that of the diffusion (see [25], where this problem is addressed).
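To make the sum of one-step scores concrete, the Python sketch below (illustrative, not from the book) differentiates the Euler log-likelihood with respect to a drift parameter $\theta$ in the hypothetical model $b(\theta,x)=\theta x$, $\vartheta(x)=\sigma x$, for a digital payoff; since the variance of the one-step Gaussian does not depend on $\theta$ here, the one-step score is $(y-x-hb)\,\partial_\theta b/\vartheta(x)^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model: b(theta, x) = theta * x, vartheta(x) = sigma * x,
# digital payoff 1_{x >= K}; all parameter values are illustrative.
theta, sigma, x0, K, T = 0.04, 0.2, 100.0, 100.0, 1.0
n, M = 100, 1_000_000
h = T / n

X = np.full(M, x0)
score = np.zeros(M)          # running sum of the one-step scores
for _ in range(n):
    dW = rng.normal(0.0, np.sqrt(h), size=M)
    X_new = X + theta * X * h + sigma * X * dW
    # d/dtheta log pbar_h(theta, x, y) = (y - x - h*b) * d_theta b / vartheta(x)^2
    score += (X_new - X - h * theta * X) * X / (sigma * X) ** 2
    X = X_new

dfdtheta = np.mean((X >= K) * score)
print(dfdtheta)    # sensitivity of E[1_{X_T >= K}] to the drift level theta
```

No derivative of the payoff ever appears: the whole $\theta$-dependence is carried by the accumulated score.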
 Exercise. Apply the preceding to the case θ = x (starting value) when d = 1.
Comments. In fact, there is a way to get rid of non-smooth payoffs by regularizing them over one time step before maturity and then applying the tangent process method. Such an approach has been developed by M. Giles under the name of Vibrato Monte Carlo, which appears as a kind of degenerate nested Monte Carlo combined with a pathwise differentiation based on the tangent process (see [108]).

10.4 Flavors of Stochastic Variational Calculus

A subtitle could be “from Bismut’s formula to Malliavin calculus”.

10.4.1 Bismut’s Formula

In this section, for the sake of simplicity, we assume that $d=1$ and $q=1$ (scalar Brownian motion).

Theorem 10.2 (Bismut's formula) Let $W=(W_t)_{t\in[0,T]}$ be a standard Brownian motion on a probability space $(\Omega,\mathcal A,\mathbb P)$ and let $\mathcal F^W:=(\mathcal F_t^W)_{t\in[0,T]}$ be its augmented natural filtration. Let $X^x=(X_t^x)_{t\in[0,T]}$ be a diffusion process, the unique $(\mathcal F_t^W)_{t\in[0,T]}$-adapted solution to the autonomous SDE
$$dX_t=b(X_t)\,dt+\vartheta(X_t)\,dW_t,\qquad X_0=x,$$
where $b$ and $\vartheta$ are $C_b^1$ (hence Lipschitz continuous). Let $f:\mathbb R\to\mathbb R$ be a continuously differentiable function such that
$$\mathbb E\big[f(X_T^x)^2+f'(X_T^x)^2\big]<+\infty.$$

Let $(H_t)_{t\in[0,T]}$ be an $\mathcal F^W$-progressively measurable (7) process lying in $L^2([0,T]\times\Omega,\,dt\otimes d\mathbb P)$, i.e. satisfying $\mathbb E\int_0^TH_s^2\,ds<+\infty$. Then
$$\mathbb E\Big[f(X_T^x)\int_0^TH_s\,dW_s\Big]=\mathbb E\Big[f'(X_T^x)\,Y_T\int_0^T\frac{\vartheta(X_s^x)H_s}{Y_s}\,ds\Big],$$
where $Y_t=\frac{dX_t^x}{dx}$ is the tangent process of $X^x$ at $x$.

7 This means that for every $t\in[0,T]$, $(H_s(\omega))_{(s,\omega)\in[0,t]\times\Omega}$ is $\mathcal Bor([0,t])\otimes\mathcal F_t$-measurable.

Proof (Sketch). To simplify the arguments, we will assume in the proof that $|H_t|\le C<+\infty$, where $C$ is a real constant, and that $f$ and $f'$ are bounded functions. Let $\varepsilon\ge0$. Set, on the probability space $(\Omega,\mathcal F_T,\mathbb P)$,
$$\mathbb P^{(\varepsilon)}=L_T^{(\varepsilon)}\cdot\mathbb P,$$

where
$$L_t^{(\varepsilon)}=\exp\Big(-\varepsilon\int_0^tH_s\,dW_s-\frac{\varepsilon^2}2\int_0^tH_s^2\,ds\Big),\qquad t\in[0,T],$$
is a $\mathbb P$-martingale since $H$ is bounded. It follows from Girsanov's Theorem (see e.g. Sect. 3.5, Theorem 5.1, p. 191, in [162]) that
$$\widetilde W_t^{\,\varepsilon}:=W_t+\varepsilon\int_0^tH_s\,ds,\quad t\in[0,T],\quad\text{is a }\big(\mathbb P^{(\varepsilon)},(\mathcal F_t)_{t\in[0,T]}\big)\text{-Brownian motion}.$$

Now it follows from the definition of $L_T^{(\varepsilon)}$ and the differentiation Theorem 2.2 that
$$\mathbb E\Big[f(X_T^x)\int_0^TH_s\,dW_s\Big]=-\frac{\partial}{\partial\varepsilon}\,\mathbb E\big[f(X_T^x)\,L_T^{(\varepsilon)}\big]_{\big|\varepsilon=0}.$$

On the other hand,
$$\frac{\partial}{\partial\varepsilon}\,\mathbb E\big[f(X_T)\,L_T^{(\varepsilon)}\big]_{\big|\varepsilon=0}=\frac{\partial}{\partial\varepsilon}\,\mathbb E_{\mathbb P^{(\varepsilon)}}\big[f(X_T)\big]_{\big|\varepsilon=0}.$$

Now we can rewrite the SDE satisfied by $X$ as follows:
$$dX_t^x=b(X_t^x)\,dt+\vartheta(X_t^x)\,dW_t=\big(b(X_t^x)-\varepsilon H_t\,\vartheta(X_t^x)\big)\,dt+\vartheta(X_t^x)\,d\widetilde W_t^{(\varepsilon)}.$$

Consequently (see [251], Theorem 1.11, p. 372), the process $X^x$ has the same distribution under $\mathbb P^{(\varepsilon)}$ as $X^{(\varepsilon)}$, the solution to
$$dX_t^{(\varepsilon)}=\big(b(X_t^{(\varepsilon)})-\varepsilon H_t\,\vartheta(X_t^{(\varepsilon)})\big)\,dt+\vartheta(X_t^{(\varepsilon)})\,dW_t,\qquad X_0^{(\varepsilon)}=x.$$

Now we can write
$$\mathbb E\Big[f(X_T)\int_0^TH_s\,dW_s\Big]=-\frac{\partial}{\partial\varepsilon}\,\mathbb E\big[f(X_T^{(\varepsilon)})\big]_{\big|\varepsilon=0}=-\mathbb E\Big[f'(X_T)\,\frac{\partial X_T^{(\varepsilon)}}{\partial\varepsilon}\Big|_{\varepsilon=0}\Big],$$

where we used once again Theorem 2.2 and the obvious fact that $X^{(0)}=X$.
Using the tangent process method with $\varepsilon$ as an auxiliary variable, one derives that the process $U_t:=\frac{\partial X_t^{(\varepsilon)}}{\partial\varepsilon}\big|_{\varepsilon=0}$ satisfies
$$dU_t=U_t\big(b'(X_t^x)\,dt+\vartheta'(X_t^x)\,dW_t\big)-H_t\,\vartheta(X_t^x)\,dt.$$

Plugging the regular tangent process $Y$ into this equation yields
$$dU_t=\frac{U_t}{Y_t}\,dY_t-H_t\,\vartheta(X_t^x)\,dt.\qquad(10.15)$$

We know that $Y_t$ is never 0, hence (up to some localization if necessary) we can apply Itô's formula to the ratio $\frac{U_t}{Y_t}$: elementary computations of the partial derivatives of the function $(u,y)\mapsto\frac uy$ on $\mathbb R\times(0,+\infty)$, combined with Eq. (10.15), show that
$$d\Big(\frac{U_t}{Y_t}\Big)=\frac{dU_t}{Y_t}-\frac{U_t\,dY_t}{Y_t^2}+\frac12\Big(-2\,\frac{d\langle U,Y\rangle_t}{Y_t^2}+2\,\frac{U_t\,d\langle Y\rangle_t}{Y_t^3}\Big)$$
$$\hphantom{d\Big(\frac{U_t}{Y_t}\Big)}=-\frac{H_t\,\vartheta(X_t^x)}{Y_t}\,dt+\frac12\Big(-2\,\frac{d\langle U,Y\rangle_t}{Y_t^2}+2\,\frac{U_t\,d\langle Y\rangle_t}{Y_t^3}\Big).$$

Then we derive from (10.15) that
$$d\langle U,Y\rangle_t=\frac{U_t}{Y_t}\,d\langle Y\rangle_t,$$
which yields
$$d\Big(\frac{U_t}{Y_t}\Big)=-\frac{\vartheta(X_t^x)H_t}{Y_t}\,dt.$$

Noting that $U_0=\frac{dX_0^{(\varepsilon)}}{d\varepsilon}=\frac{dx}{d\varepsilon}=0$ finally leads to
$$U_t=-Y_t\int_0^t\frac{\vartheta(X_s)H_s}{Y_s}\,ds,\qquad t\in[0,T],$$
which completes this step of the proof.


The extension to more general processes $H$ can be done by introducing, for every $n\ge1$,
$$H_t^{(n)}(\omega):=H_t(\omega)\,\mathbf 1_{\{|H_t(\omega)|\le n\}}.$$
It is clear by the Lebesgue dominated convergence theorem that $H^{(n)}$ converges to $H$ in $L^2([0,T]\times\Omega,\,dt\otimes d\mathbb P)$. Then one checks that both sides of Bismut's identity are continuous with respect to this topology (using Hölder's Inequality).
The extension to unbounded functions $f$ with unbounded derivatives follows by approximation of $f$ by bounded $C_b^1$ functions, e.g. by "smooth" truncations. ♦

Application to the computation of the δ-parameter
Assume $b$ and $\vartheta$ are $C_b^1$. If $f$ is continuous with polynomial growth and satisfies
$$\mathbb E\Big[f(X_T^x)^2+\int_0^T\Big(\frac{Y_t}{\vartheta(X_t^x)}\Big)^2dt\Big]<+\infty,$$
then $\frac\partial{\partial x}\,\mathbb E\,f(X_T^x)$ appears as a weighted expectation of $f(X_T)$, namely
$$\frac\partial{\partial x}\,\mathbb E\,f(X_T^x)=\mathbb E\Big[f(X_T^x)\underbrace{\frac1T\int_0^T\frac{Y_s}{\vartheta(X_s^x)}\,dW_s}_{\text{weight}}\Big].\qquad(10.16)$$

Proof. We proceed as we did with the Black–Scholes model in Sect. 2.2: we first assume that $f$ is regular, namely bounded and differentiable with a bounded derivative. Then, using the tangent process approach,
$$\frac\partial{\partial x}\,\mathbb E\,f(X_T^x)=\mathbb E\big[f'(X_T^x)\,Y_T\big],$$
still with $Y_t=\frac{dX_t^x}{dx}$. Then, we set
$$H_t=\frac{Y_t}{\vartheta(X_t^x)}.$$
Under the above assumption, we can apply Bismut's formula to get
$$T\,\mathbb E\big[f'(X_T^x)\,Y_T\big]=\mathbb E\Big[f(X_T^x)\int_0^T\frac{Y_t}{\vartheta(X_t^x)}\,dW_t\Big],$$
which yields the announced result. The extension to continuous functions with polynomial growth relies on an approximation argument (e.g. by convolution). ♦

Remarks. • One retrieves, in the case of a Black–Scholes model, the formula (2.8) for the δ obtained in Sect. 2.2 by an elementary integration by parts, since $Y_t=\frac{X_t^x}x$ and $\vartheta(x)=\sigma x$.
• Note that the assumption
$$\int_0^T\Big(\frac{Y_t}{\vartheta(X_t^x)}\Big)^2dt<+\infty$$
is essentially an ellipticity assumption. Thus, if $\vartheta^2(x)\ge\varepsilon_0>0$, one checks that it is always satisfied.
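A Python sketch of (10.16) (illustrative, not from the book): both the pair $(X,Y)$ and the weight are simulated by an Euler scheme, and the payoff may be discontinuous. In the Black–Scholes case below (an assumption made so that the exact δ is available), $Y_s/\vartheta(X_s^x)=1/(x\sigma)$ and the weight collapses to $W_T/(Tx\sigma)$, i.e. the weight of formula (2.8):

```python
import numpy as np

rng = np.random.default_rng(3)
r, sigma, x0, K, T = 0.04, 0.2, 100.0, 100.0, 1.0
n, M = 50, 1_000_000
h = T / n

X, Y = np.full(M, x0), np.ones(M)
weight = np.zeros(M)
for _ in range(n):
    dW = rng.normal(0.0, np.sqrt(h), size=M)
    weight += Y / (sigma * X) * dW           # int_0^T Y_s / vartheta(X_s) dW_s
    X, Y = X + r * X * h + sigma * X * dW, Y * (1 + r * h + sigma * dW)
weight /= T

# Digital payoff f(xi) = 1_{xi >= K}: f' is useless here, the weight does the job
delta = np.mean((X >= K) * weight)
print(delta)    # close to N'(d2) / (x0 * sigma * sqrt(T)) for these parameters
```

Note that the weight increment uses the left-endpoint values of $X$ and $Y$, as required by the Itô integral.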

 Exercises. 1. Apply the preceding to get a formula for the γ-parameter in a general
diffusion model. [Hint: Apply the above “derivative free” formula to the δ-hedge
formula obtained using the tangent process method.]

2. Show that if $b'-b\,\frac{\vartheta'}{\vartheta}-\frac12\,\vartheta''\vartheta=c\in\mathbb R$, then
$$\frac\partial{\partial x}\,\mathbb E\,f(X_T^x)=\frac{c\,e^{cT}}{(e^{cT}-1)\,\vartheta(x)}\,\mathbb E\big[f(X_T^x)\,W_T\big].$$

10.4.2 The Haussmann–Clark–Ocone Formula: Toward Malliavin Calculus

In this section we state an elementary version of the so-called Haussmann–Clark–Ocone formula, following the seminal paper by Haussmann [147]. We still consider the standard SDE
$$dX_t=b(X_t)\,dt+\vartheta(X_t)\,dW_t,\qquad X_0=x,\quad t\in[0,T],$$
with Lipschitz continuous coefficients $b$ and $\vartheta$, and $X^x=(X_t^x)_{t\in[0,T]}$ its unique $(\mathcal F_t^W)$-adapted solution starting at $x$, where $(\mathcal F_t^W)_{t\in[0,T]}$ is the (augmented) filtration of the Brownian motion $W$. We admit the following theorem, stated in a one-dimensional setting for (at least notational) convenience.

Theorem 10.3 (Haussmann(–Clark–Ocone) formula) Let $F:\big(\mathcal C([0,T],\mathbb R),\|\cdot\|_{\sup}\big)\to\mathbb R$ be a differentiable functional with differential $DF$. Then
$$F(X^x)=\mathbb E\,F(X^x)+\int_0^T\mathbb E\big[DF(X)\cdot\big(\mathbf 1_{[t,T]}Y^{(t)}_{\cdot}\big)\,\big|\,\mathcal F_t\big]\,\vartheta(X_t^x)\,dW_t,$$
where $Y^{(t)}$ is the tangent process of $X^x$ at time $t$, the solution to the SDE
$$dY_s^{(t)}=Y_s^{(t)}\big(b'(X_s^x)\,ds+\vartheta'(X_s^x)\,dW_s\big),\qquad Y_t^{(t)}=1,\quad s\in[t,T].$$
Note that $Y^{(t)}$ also reads
$$Y_s^{(t)}=\frac{Y_s}{Y_t},\quad s\in[t,T],\qquad\text{where }Y_t=Y_t^{(0)}\text{ is the tangent process of }X^x\text{ at the origin}.$$

Remarks. • The starting point to understand this formula is to regard it as a more explicit version of the classical representation formula for Brownian martingales (see Proposition 3.2, p. 191, in [251]). Indeed, $M_t=\mathbb E\big(F(X)\,|\,\mathcal F_t\big)$ admits a formal representation as a Brownian stochastic integral
$$M_t=M_0+\int_0^tH_s\,dW_s,$$
where $(H_t)_{t\in[0,T]}$ is an $(\mathcal F_t^W)$-progressively measurable process satisfying $\int_0^TH_s^2\,ds<+\infty$ $\mathbb P$-a.s. So, the Haussmann–Clark–Ocone formula provides a kind of closed form for the process $H$.
• The differential $DF(\xi)$ of the functional $F$ at an element $\xi\in\mathcal C([0,T],\mathbb R)$ is a continuous linear form on $\mathcal C([0,T],\mathbb R)$. Hence, following the Riesz representation Theorem (see e.g. [52]), it can be represented by a finite signed measure, say $\mu_{DF(\xi)}(ds)$, so that the term $DF(\xi)\cdot(\mathbf 1_{[t,T]}Y^{(t)}_{\cdot})$ reads
$$DF(\xi)\cdot\big(\mathbf 1_{[t,T]}Y^{(t)}_{\cdot}\big)=\int_0^T\mathbf 1_{[t,T]}(s)\,Y_s^{(t)}\,\mu_{DF(\xi)}(ds)=\int_t^TY_s^{(t)}\,\mu_{DF(\xi)}(ds).$$

Toward the Malliavin derivative
Assume $F(x)=f\big(x(t_0)\big)$, $x\in\mathcal C([0,T],\mathbb R)$, with $f:\mathbb R\to\mathbb R$ differentiable with derivative $f'$. Then $\mu_{DF(x)}(ds)=f'(x(t_0))\,\delta_{t_0}(ds)$, where $\delta_{t_0}(ds)$ denotes the Dirac mass at time $t_0$. Consequently,
$$DF(X)\cdot\big(\mathbf 1_{[t,T]}Y^{(t)}_{\cdot}\big)=f'(X_{t_0}^x)\,Y_{t_0}^{(t)}\,\mathbf 1_{[0,t_0]}(t),$$

whence one derives that
$$f(X_{t_0}^x)=\mathbb E\,f(X_{t_0}^x)+\int_0^{t_0}\mathbb E\big[f'(X_{t_0})\,Y_{t_0}^{(t)}\,\big|\,\mathcal F_t\big]\,\vartheta(X_t^x)\,dW_t.\qquad(10.17)$$

This leads us to introduce the Malliavin derivative $D_tF(X^x)$ of $F(X^x)$ at time $t$ by
$$D_tF(X^x):=\begin{cases}DF(X^x)\cdot\big(\mathbf 1_{[t,T]}Y^{(t)}_{\cdot}\big)\,\vartheta(X_t^x)&\text{if }t\le t_0,\\[1mm]0&\text{if }t>t_0.\end{cases}$$

The simplest interpretation (and original motivation!) is that the Malliavin derivative is a derivative with respect to the Brownian path of $W$ (viewed at time $t$). It can be easily understood owing to the following formal chain rule of differentiation:
$$D_tF(X^x)=\frac{\partial F(X^x)}{\partial X_t^x}\times\frac{\partial X_t^x}{\partial W}.$$
If one notes that, for any $s\ge t$, $X_s^x=X_s^{X_t^x,t}$, the first term $\frac{\partial F(X^x)}{\partial X_t^x}$ in the above product is clearly equal to $DF(X)\cdot(\mathbf 1_{[t,T]}Y^{(t)}_{\cdot})$, whereas the second term is the result of a formal differentiation of the SDE at time $t$ with respect to $W$, namely $\vartheta(X_t^x)$.
An interesting feature of this derivative in practice is that it satisfies the usual chain rules like
$$D_tF^2(X^x)=2F(X^x)\,D_tF(X^x)$$
and more generally
$$D_t\big(\Phi(F(X^x))\big)=\Phi'\big(F(X^x)\big)\,D_tF(X^x),$$
etc.
What is called Malliavin calculus is a way to extend this notion of differentiation to more general functionals using functional analysis arguments (closure of operators, etc.) involving, for example, the domain of the operator $D_t$ (see e.g. [16, 208]).
Using the Haussmann–Clark–Ocone formula to recover Bismut's formula
As a first conclusion we will show that the Haussmann–Clark–Ocone formula contains Bismut's formula. Let $X^x$, $H$, $f$ and $T$ be as in Sect. 10.4.1. We consider the two true martingales
$$M_t=\int_0^tH_s\,dW_s\quad\text{and}\quad N_t=\mathbb E\,f(X_T^x)+\int_0^t\mathbb E\big[f'(X_T)\,Y_T^{(s)}\,\big|\,\mathcal F_s\big]\,\vartheta(X_s^x)\,dW_s,\quad t\in[0,T],$$
and perform a (stochastic) integration by parts. Owing to (10.17), we get, under appropriate integrability conditions,
$$\mathbb E\Big[f(X_T^x)\int_0^TH_s\,dW_s\Big]=0+\mathbb E\int_0^T[\dots]\,dM_t+\mathbb E\int_0^T[\dots]\,dN_t+\mathbb E\int_0^T\mathbb E\big(f'(X_T)\,Y_T^{(s)}\,\big|\,\mathcal F_s\big)\,\vartheta(X_s^x)H_s\,ds$$
$$\hphantom{\mathbb E\Big[f(X_T^x)\int_0^TH_s\,dW_s\Big]}=\mathbb E\int_0^T\mathbb E\big(f'(X_T)\,Y_T^{(s)}\,\big|\,\mathcal F_s\big)\,\vartheta(X_s^x)H_s\,ds,$$
owing to Fubini’s Theorem. Finally, using the characterization of conditional expec-


tation to get rid of the conditioning, we obtain
  T   T  
E f (X T ) x
Hs dWs = E f  (X T )YT(s) ϑ(X sx )Hs ds.
0 0

Finally, a reverse application of Fubini's Theorem and the identity $Y_T^{(s)}=\frac{Y_T}{Y_s}$ leads to
$$\mathbb E\Big[f(X_T^x)\int_0^TH_s\,dW_s\Big]=\mathbb E\Big[f'(X_T)\,Y_T\int_0^T\frac{\vartheta(X_s^x)}{Y_s}\,H_s\,ds\Big],$$
which is simply Bismut's formula (Theorem 10.2).


Exercise. (a) Consider the functional $F:\xi\mapsto F(\xi)=f\Big(\int_0^T\xi(s)\,ds\Big)$, where $f:\mathbb R\to\mathbb R$ is a differentiable function. Show that
$$F(X^x)=\mathbb E\,F(X^x)+\int_0^T\mathbb E\Big[f'\Big(\int_0^TX_s^x\,ds\Big)\int_t^TY_s\,ds\ \Big|\ \mathcal F_t\Big]\,\frac{\vartheta(X_t^x)}{Y_t}\,dW_t.$$
(b) Derive, using the homogeneity of Eq. (10.11), that
$$\mathbb E\Big[f'\Big(\int_0^TX_s^x\,ds\Big)\int_t^T\frac{Y_s}{Y_t}\,ds\ \Big|\ \mathcal F_t\Big]=\mathbb E\Big[f'\Big(\bar x+\int_0^{T-t}X_s^x\,ds\Big)\int_0^{T-t}Y_s\,ds\Big]_{\big|x=X_t^x,\ \bar x=\int_0^tX_s^x\,ds}$$
$$=:\Phi\Big(T-t,\ X_t^x,\ \int_0^tX_s^x\,ds\Big).$$

10.4.3 Toward Practical Implementation: The Paradigm of Localization

For practical implementation, one should be aware that the weighted estimators of sensitivities, obtained by log-likelihood methods (exact, see Sect. 2.2.3, or approximate, see Sect. 10.3.1), by Bismut's formula or by more general Malliavin-inspired methods, often suffer from a high variance, especially for short maturities.
This phenomenon can be observed in particular when such a formula coexists with a pathwise differentiated formula involving the tangent process for smooth enough payoff functions. It can be easily understood from the formula for the δ-hedge when the maturity is small (see the toy example in Sect. 3.1).
Consequently, weighted formulas for sensitivities need to be sped up by efficient variance reduction methods. The usual approach, known as localization, is to isolate the singular part (where differentiation does not apply) from the smooth part.
Let us illustrate the principle of localization functions on a very simple toy example ($d=1$). Assume that, for every $\varepsilon>0$,
$$|F(x,z)-F(x',z)|\le C_{F,\varepsilon}|x-x'|,\qquad x,x',z\in\mathbb R,\quad|x-z|,\ |x'-z|\ge\varepsilon>0,$$
with $\lim_{\varepsilon\to0}C_{F,\varepsilon}=+\infty$. Assume furthermore that
$$\forall\,x,z\in\mathbb R^d,\ x\ne z,\qquad F_x'(x,z)\ \text{exists}$$
(hence is bounded by $C_{F,\varepsilon}$ if $|x-z|\ge\varepsilon$). On the other hand, assume that, e.g., $F(x,Z)=h(G(x,Z))$, where $G(x,Z)$ has a probability density which is regular in $x$, whereas $h$ is "highly" singular in the neighborhood of $G(z,z)$, $z\in\mathbb R$ (think of an indicator function). Then the function $f(x):=\mathbb E\,F(x,Z)$ is differentiable.
One then considers a function $\varphi\in\mathcal C^\infty(\mathbb R,[0,1])$ such that $\varphi\equiv1$ on $[-\varepsilon,\varepsilon]$ and $\mathrm{supp}(\varphi)\subset[-2\varepsilon,2\varepsilon]$, and one may decompose
$$F(x,Z)=\big(1-\varphi(x-Z)\big)F(x,Z)+\varphi(x-Z)F(x,Z):=F_1(x,Z)+F_2(x,Z).$$

Functions $\varphi$ can be obtained as mollifiers in convolution theory, but other choices are possible, like simply Lipschitz continuous functions (see the numerical illustration in Sect. 10.4.4). Set $f_i(x)=\mathbb E\,F_i(x,Z)$ so that $f(x)=f_1(x)+f_2(x)$. Then one may use a direct differentiation to compute
$$f_1'(x)=\mathbb E\Big[\frac{\partial F_1(x,Z)}{\partial x}\Big]$$
(or a finite difference method with constant or decreasing increments). As concerns $f_2(x)$, since $F_2(x,Z)$ is singular, it is natural to look for a weighted estimator of the form
$$f_2'(x)=\mathbb E\big[F_2(x,Z)\,\Lambda\big],$$
where $\Lambda$ is a random "weight" (possibly non-positive) obtained, for example, by the above-described method if we are in a diffusion framework.
When working with a vanilla payoff in a Brownian diffusion framework at a fixed time $T$, the above Bismut formula (10.16) does the job, in its multi-dimensional version if necessary. When working with path-dependent payoff functionals (like barriers, etc.) in local or stochastic volatility models, Malliavin calculus methods are often a powerful and elegant tool, even if, in many settings, more elementary approaches can often be used to derive explicit weights. For an example of weight computation by means of Malliavin calculus in the case of Lookback or barrier options, we refer to [125]. These weights are not unique (even prior to any variance reduction), see e.g. [121].
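The split above can be sketched in Python on the digital payoff in a Black–Scholes model with exact sampling (illustrative code, not from the book; the trapezoidal localization function is merely Lipschitz, one of the admissible choices mentioned above). The smooth part $F_1=(1-\varphi)F$ is differentiated pathwise, while the localized singular part $F_2=\varphi F$ gets the log-likelihood weight:

```python
import numpy as np

rng = np.random.default_rng(4)
r, sigma, x, K, T = 0.04, 0.2, 100.0, 100.0, 1.0
M, eps = 1_000_000, 2.0

Z = rng.normal(size=M)
XT = x * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
YT = XT / x                 # tangent process at T in the Black-Scholes model

u = XT - K
# Lipschitz localization: phi = 1 on [-eps, eps], 0 outside [-2*eps, 2*eps]
phi = np.clip((2 * eps - np.abs(u)) / eps, 0.0, 1.0)

# F1 = (1 - phi) 1_{XT >= K} is Lipschitz, with F1'(xi) = (1/eps) 1_{(K+eps, K+2eps)}
pathwise_part = np.mean(((u > eps) & (u < 2 * eps)) / eps * YT)

# F2 = phi 1_{XT >= K} is singular but localized: weighted (score) estimator
weighted_part = np.mean(phi * (u >= 0) * Z / (x * sigma * np.sqrt(T)))

delta = pathwise_part + weighted_part   # estimates d/dx E[1_{X_T^x >= K}]
print(delta)
```

The weight only acts on paths landing near the strike, which is exactly what keeps the variance under control.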

10.4.4 Numerical Illustration: What is Localization Useful for?

(written with V. Lemaire). Let us consider, in a standard Black–Scholes model $(S_t)_{t\in[0,T]}$ with interest rate $r>0$ and volatility $\sigma>0$, two binary options:
– a digital Call with strike $K>0$ and
– an Asset-or-Nothing Call with strike $K>0$,
defined respectively by their payoff functions
$$h_1(\xi)=\mathbf 1_{\{\xi\ge K\}}\qquad\text{and}\qquad h_S(\xi)=\xi\,\mathbf 1_{\{\xi\ge K\}},$$



Fig. 10.1 Payoff $h_1$ of the digital Call with strike $K=50$

Fig. 10.2 Payoff $h_S$ of the Asset-or-Nothing Call with strike $K=50$

reproduced in Figs. 10.1 and 10.2 (pay attention to the scales of the $y$-axis in each figure).
Let $F(x,z)=e^{-rT}h_1\big(x\,e^{(r-\frac{\sigma^2}2)T+\sigma\sqrt T\,z}\big)$ in the digital Call case and let $F(x,z)=e^{-rT}h_S\big(x\,e^{(r-\frac{\sigma^2}2)T+\sigma\sqrt T\,z}\big)$ in the Asset-or-Nothing Call case. We define $f(x)=\mathbb E\,F(x,Z)$, where $Z$ is a standard Gaussian variable. With both payoff functions, we are in the singular setting in which $F(\,\cdot\,,Z)$ is not Lipschitz continuous but only $\frac12$-Hölder in $L^2$ (see the singular setting in Sect. 10.1.1), whereas $f$ is $\mathcal C^\infty$ on $(0,+\infty)$. We are interested in computing the delta (δ-hedge) of the two options, i.e. $f'(x)$.
In such a singular case, the variance of the finite difference estimator explodes as $\varepsilon\to0$ (see Proposition 10.2) and $\xi\mapsto F(\xi,Z)$ is not $L^p$-differentiable for $p\ge1$, so the tangent process approach cannot be used (see Sect. 10.2.2).
10.4 Flavors of Stochastic Variational Calculus 503

Fig. 10.3 Variance of the two estimators as a function of ε (digital Call)

Fig. 10.4 Variance of the two estimators as a function of ε (Asset-or-Nothing Call)

We first illustrate the variance explosion in Figs. 10.3 and 10.4, where the parameters have been set to $r=0.04$, $\sigma=0.1$, $T=1/12$ (one month), $x_0=K=50$ and $M=10^6$.
To avoid the explosion of the variance, one considers a smooth (namely Lipschitz continuous) approximation of both payoffs. Given a small parameter $\eta>0$, one defines
$$h_{1,\eta}(\xi)=\begin{cases}h_1(\xi)&\text{if }|\xi-K|>\eta,\\[1mm]\dfrac\xi{2\eta}+\dfrac12\Big(1-\dfrac K\eta\Big)&\text{if }|\xi-K|\le\eta,\end{cases}$$
and
$$h_{S,\eta}(\xi)=\begin{cases}h_S(\xi)&\text{if }|\xi-K|>\eta,\\[1mm]\dfrac{K+\eta}{2\eta}\,\xi+\dfrac{K+\eta}2\Big(1-\dfrac K\eta\Big)&\text{if }|\xi-K|\le\eta.\end{cases}$$
We define $F_\eta(x,Z)$ and $f_\eta(x)$ similarly to the singular case.


In this numerical section, we introduce a Richardson–Romberg (RR) extrapolation of the finite difference estimators
$$\widehat{f'(x)}_{\varepsilon,M}=\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}$$
and
$$\widehat{f_\eta'(x)}_{\varepsilon,M}=\frac1M\sum_{k=1}^M\frac{F_\eta(x+\varepsilon,Z_k)-F_\eta(x-\varepsilon,Z_k)}{2\varepsilon}.$$
The extrapolation is done using a linear combination of the estimators with steps $\varepsilon$ and $\frac\varepsilon2$, respectively. As
$$\mathbb E\,\widehat{f'(x)}_{\varepsilon,M}=f'(x)+\frac{f^{(3)}(x)}6\,\varepsilon^2+\frac{f^{(5)}(x)}{5!}\,\varepsilon^4+O(\varepsilon^6),$$
we easily check that the combination that kills the first term of this bias (in $\varepsilon^2$) is $\frac43\,\widehat{f'(x)}_{\frac\varepsilon2,M}-\frac13\,\widehat{f'(x)}_{\varepsilon,M}$ (see Exercise 4 in Sect. 10.1.1). The same holds for $f_\eta$.
Then, as in the proofs of Propositions 10.1 (with $\theta=\frac12$) and 10.2, one proves that
$$\Big\|f'(x)-\Big(\frac43\,\widehat{f'(x)}_{\frac\varepsilon2,M}-\frac13\,\widehat{f'(x)}_{\varepsilon,M}\Big)\Big\|_2=O(\varepsilon^4)+O\Big(\frac1{\sqrt{\varepsilon M}}\Big)\qquad(10.18)$$
and
$$\Big\|f_\eta'(x)-\Big(\frac43\,\widehat{f_\eta'(x)}_{\frac\varepsilon2,M}-\frac13\,\widehat{f_\eta'(x)}_{\varepsilon,M}\Big)\Big\|_2=O(\varepsilon^4)+O\Big(\frac1{\sqrt M}\Big).\qquad(10.19)$$

 Exercise. Prove (10.18) and (10.19).

The control of the variance in the smooth case is illustrated in Figs. 10.5 and 10.6 when $\eta=2$, and in Figs. 10.7 and 10.8 when $\eta=0.5$. The variance increases when $\eta$ decreases to 0 but does not explode as $\varepsilon$ goes to 0.
For a given $\varepsilon$, note that the variance is usually higher using the standard RR extrapolation. However, in the Lipschitz continuous case the variance of the RR estimator and that of the crude finite difference converge toward the same value when $\varepsilon$ goes to 0. Moreover, from (10.18) we deduce the choice $\varepsilon=O(M^{-1/9})$ to keep the balance between the bias term of the RR estimator and the variance term. As a consequence, for a given level of the $L^2$-error, we can choose a bigger $\varepsilon$ with the RR estimator, which reduces the bias without increasing the variance.
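As a minimal illustration of these estimators (Python, not the code used for the experiments reported in this section; the parameter values below follow this section, with η = 1 and an illustrative ε), one can compare the crude finite difference and its RR extrapolation on the smoothed digital payoff, using common random numbers across all the shifted samples:

```python
import numpy as np

rng = np.random.default_rng(5)
r, sigma, x0, K, T, eta = 0.04, 0.3, 50.0, 50.0, 1 / 52, 1.0
M, eps = 1_000_000, 0.2

def F_eta(x, Z):
    """Discounted smoothed digital payoff h_{1,eta}(X_T^x), Black-Scholes model."""
    xi = x * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    h = np.where(np.abs(xi - K) <= eta,
                 xi / (2 * eta) + 0.5 * (1 - K / eta),
                 (xi >= K).astype(float))
    return np.exp(-r * T) * h

Z = rng.normal(size=M)                  # common random numbers for every shift

def fd(e):                              # central finite difference with step e
    return np.mean(F_eta(x0 + e, Z) - F_eta(x0 - e, Z)) / (2 * e)

crude = fd(eps)
rr = 4 / 3 * fd(eps / 2) - 1 / 3 * fd(eps)   # kills the eps^2 term of the bias
print(crude, rr)
```

Because the payoff is Lipschitz, both estimators stay stable as ε shrinks; the RR combination only trades a slightly larger variance for a much smaller bias.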
The parameters of the model are the following: $x_0=50$, $K=50$, $r=0.04$, $\sigma=0.3$ and $T=\frac1{52}$ (one week). The size of the Monte Carlo simulation is fixed at $M=10^6$. We now compare the following estimators in the two cases with two different maturities $T=1/12$ (one month) and $T=1/52$ (one week):



Fig. 10.5 Digital Call: variance of the two estimators as a function of ε, T = 1/52. Smoothed payoff with η = 1

Fig. 10.6 Asset-or-Nothing Call: variance of the two estimators as a function of ε, T = 1/52. Smoothed payoff with η = 1

• Finite difference estimator on the non-smooth payoffs $h_1$ and $h_S$ with $\varepsilon=M^{-1/5}\simeq0.0631$, since the RMSE has the form $f^{(3)}(x)\,\varepsilon^2+O\big(\frac1{\sqrt{\varepsilon M}}\big)$ (see Proposition 10.1 with $\theta=\frac12$).
• Finite difference estimator with RR extrapolation on the non-smooth payoffs with $\varepsilon=M^{-1/9}\simeq0.2154$ (based on (10.18)).
• Crude weighted estimator (with the standard Black–Scholes δ-weight (2.7)) on the non-smooth payoffs $h_1$ and $h_S$.
• Localization (smoothing): finite difference estimator on the smooth payoffs $h_{1,\eta}$ and $h_{S,\eta}$ with $\eta=1$ and $\varepsilon=M^{-1/4}$, combined with the weighted estimator on the (non-smooth) differences $h_1-h_{1,\eta}$ and $h_S-h_{S,\eta}$.


506 10 Back to Sensitivity Computation

[Figure] Fig. 10.7 Digital Call: Variance of the two estimators (crude estimator vs. RR extrapolation) as a function of ε, T = 1/52. Smoothed payoff with η = 0.5

[Figure] Fig. 10.8 Asset-or-Nothing Call: Variance of the two estimators (crude estimator vs. RR extrapolation) as a function of ε, T = 1/52. Smoothed payoff with η = 0.5

• Localization (smoothing): Finite difference estimator with RR-extrapolation on the smooth payoffs h_{1,η} and h_{S,η} with η = 1 and ε = M^{−1/8} combined with the weighted estimator on the (non-smooth) differences h_1 − h_{1,η} and h_S − h_{S,η} (based on (10.19)).

Table 10.1 Digital Call: results with T = 1/52 (one week)


Estimator Mean Variance 95% CI
Finite difference 0.19038 1.4735 [0.188; 0.19276]
FD + RR 0.19083 1.1967 [0.18869; 0.19297]
Weighted estimator 0.19196 0.07904 [0.19141; 0.19251]
Loc. FD + WE 0.1919 0.057459 [0.19141; 0.19239]
Loc. FD + WE + RR 0.19182 0.056925 [0.19133; 0.19231]

Table 10.2 Asset-or-Nothing Call: results with T = 1/52 (one week)
Estimator Mean Variance 95% CI
Finite difference 10.156 3736.2 [10.036; 10.276]
FD + RR 10.094 3001.1 [9.9865; 10.201]
Weighted estimator 10.101 227.98 [10.071; 10.131]
Loc. FD + WE 10.085 143.74 [10.061; 10.11]
Loc. FD + WE + RR 10.078 142.4 [10.054; 10.103]

The results are summarized in Table 10.1 for the delta of the digital Call option and in Table 10.2 for the delta of the Asset-or-Nothing Call option.
In practice, the variance of these estimators can be improved by appropriate control
variates, not implemented in this numerical experiment.
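As an illustration of how such finite-difference estimators can be implemented, here is a minimal Monte Carlo sketch (hypothetical code, not the book's implementation): the delta of a digital Call under Black–Scholes is estimated by a central finite difference with common random numbers, with and without a Richardson–Romberg-type extrapolation combining the steps ε and ε/2, which cancels the ε² bias term of the central difference. The parameters mimic those above, with the sample size reduced to M = 10^5 for speed.

```python
import numpy as np
from math import exp, log, pi, sqrt

# Hypothetical sketch: delta of a digital Call under Black-Scholes by a central
# finite difference with common random numbers, with and without a
# Richardson-Romberg-type extrapolation combining the steps eps and eps/2.
rng = np.random.default_rng(0)
x0, K, r, sigma, T, M = 50.0, 50.0, 0.04, 0.3, 1.0 / 52.0, 10**5
G = rng.standard_normal(M)                 # common Gaussians for all bumped spots

def digital_samples(x):
    """Discounted digital-Call payoff samples e^{-rT} 1_{S_T > K} for spot x."""
    ST = x * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * G)
    return np.exp(-r * T) * (ST > K)

def fd_delta(eps):
    """Central finite difference (h(x0 + eps) - h(x0 - eps)) / (2 eps)."""
    return np.mean((digital_samples(x0 + eps) - digital_samples(x0 - eps)) / (2 * eps))

delta_fd = fd_delta(M ** (-1 / 5))         # bias/variance balance of the crude FD
eps_rr = M ** (-1 / 9)                     # coarser step allowed by extrapolation
delta_rr = (4 * fd_delta(eps_rr / 2) - fd_delta(eps_rr)) / 3

# closed-form Black-Scholes digital delta e^{-rT} phi(d2) / (x0 sigma sqrt(T))
d2 = (log(x0 / K) + (r - 0.5 * sigma**2) * T) / (sigma * sqrt(T))
delta_exact = exp(-r * T) * exp(-0.5 * d2**2) / sqrt(2 * pi) / (x0 * sigma * sqrt(T))
print(delta_fd, delta_rr, delta_exact)
```

Both estimates should lie close to the closed-form value (≈ 0.19 here, consistent with Table 10.1), the extrapolated one tolerating the much coarser step ε = M^{−1/9}.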
Chapter 11
Optimal Stopping, Multi-asset
American/Bermudan Options

11.1 Introduction

11.1.1 Optimal Stopping in a Brownian Diffusion Framework

This chapter is devoted to probabilistic numerical methods for solving optimal stop-
ping problems, in particular the pricing of multi-asset American and Bermudan
options. However, we do not propose an in-depth presentation of optimal stopping
theory in discrete and continuous time, which would be far beyond the scope of this
monograph. When necessary, we will refer to reference books or papers on this topic,
and many important theoretical results will be admitted throughout this chapter. For
an introduction to optimal stopping theory in discrete time we refer to [217] or more
recently [183] and, for a comprehensive overview of the whole theory, including
continuous time models and its applications to the pricing of American derivatives,
we recommend Shiryaev’s book [182, 262], which we will often refer to. On our
side, we will focus on time and space discretization aspects in Markovian models,
usually in connection with Brownian diffusion processes.
We will first explain how one can approximate a continuous time optimal stop-
ping problem, namely for Brownian diffusions, by a discrete time optimal stopping
problem for Rd -valued Markov chains. The specificity of direct Markovian optimal
stopping problems is that the quantity of interest (the Snell envelope) satisfies a
backward Dynamical Programming Principle (BDPP) which is a major step toward
an operating numerical scheme. Once this time discretization phase is achieved, we
will present two different methods of space discretization which allow us to imple-
ment approximations of the BDPP: Least squares regression methods, also known
as American Monte Carlo and optimal quantization methods (quantization trees).
As it does not add any specific difficulty, we will consider a general diffusion
framework (7.1), namely

dX_t = b(t, X_t)dt + σ(t, X_t)dW_t,  t ∈ [0, T],  X_0 ∈ L^r_{R^d}(Ω, A, P)   (11.1)

© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_11

(r ≥ 1), where b : [0, T] × R^d → R^d, σ : [0, T] × R^d → M(d, q, R) are continuous functions, Lipschitz in x uniformly in t ∈ [0, T], and W is a q-dimensional standard Brownian motion defined on (Ω, A, P), independent of X_0. If we consider the augmented natural filtration of the Brownian motion F_t = σ(X_0, N_P, W_s, 0 ≤ s ≤ t), then this equation admits a unique (F_t)-adapted solution defined on (Ω, A, P). These assumptions are in force throughout the chapter.
Moreover, we will denote by X^x = (X_t^x)_{t∈[0,T]} the unique solution starting at X_0 = x ∈ R^d. More generally, under the above assumptions, for every t ∈ [0, T] and every x ∈ R^d, there exists a unique solution (X_s^{t,x})_{s∈[t,T]} to (11.1) living between t and T and starting from x at time t, i.e. such that X_t^{t,x} = x (note that X_s^x = X_s^{0,x}). The process ( (X_s^{t,x})_{s∈[t,T]} )_{t∈[0,T], x∈R^d} is called the flow of (11.1).
An optimal stopping problem in that framework can be presented as follows. The process (X_t^x)_{t∈[0,T]} is representative of a continuous time flow of information. Typically on a financial market, it can be the quotation of d traded risky assets (stocks, interest rates, currency exchange rates, quotation of various commodities, gas, oil, etc.). It can also contain a lot of non-tradable information of an economical nature (inflation, industrial activity, unemployment rate, …) or a meteorological nature (temperature), etc. If X entirely consists of traded risky assets, all its components X_t^i are assumed to be positive and, in a risk neutral world, among other assumptions, the drift b is of the form b(t, x) = r(t, x)x, where r : [0, T] × R^d_+ → R_+ is the instantaneous interest rate (1). Then, the above dynamics for the risky assets is known as a local volatility model, provided it satisfies some conditions on b and σ which preserve the non-negativity of the components X^i (see further on).
A stochastic game can be associated to the process X as follows: the player pays
at time t ∈ [0, T ] to enter the game. She may leave the game whenever she wants
at a [t, T ]-valued random time τ . Then she receives f (τ , X τ ) where the function
f : [0, T] × R^d → R_+ is Borel. Not every τ is authorized, only “honest” stopping strategies. By “honest”, we mean that the player makes her decision in real time from
her observations of the information flow (X t )t∈[0,T ] , that is, in a non-anticipative
way. That means that she “observes” the (augmented) filtration (FtX )t∈[0,T ] of the
process (X t )t∈[0,T ] (2 ). If σ degenerates at some places, the filtrations (Ft )t∈[0,T ] and
(FtX )t∈[0,T ] may differ, but of course FtX ⊂ Ft . However, as (X t )t∈[0,T ] is an (Ft )-
Markov process (see [256]), it turns out not to be a restriction to consider that the
player observes the filtration (Ft )t∈[0,T ] to make her decisions. In fact, a posteriori, the
decision only depends on the observed filtration (FtX )t∈[0,T ] . In what follows, we will
still denote by Ft any of these two filtrations when there is no necessity to distinguish

1 This statement should be considered with caution since the traded asset may not correspond to
what is quoted: thus in the case of Foreign Exchange, the quoted process is the exchange rate itself
whereas the risky asset in arbitrage theory is the foreign bond quoted in domestic currency, i.e. the
value of the foreign bond (of an appropriate maturity) multiplied by the exchange rate. Thus, if r F
denotes the foreign interest rate, r should be replaced by r − r F in the dynamics of the exchange
rate. A similar situation occurs with a stock continuously distributing dividends at an exponential
rate λ: r should be replaced by r − λ, or with a future contract where r should be replaced by 0 in
the dynamics of the quoted price.
2 F_t^X = σ( N_P, X_s, s ∈ [0, t] ).

them. In mathematical terms, this means that admissible stopping strategies τ are
those which satisfy that the decision to stop between 0 and t is Ft -measurable, i.e.

∀ t ∈ [0, T ], {τ ≤ t} ∈ Ft .

In other words, τ is a [0, T ]-valued (Ft )-stopping time. Note that simply asking
{τ = t} to lie in Ft (the player decides to stop exactly at time t) may seem more
natural as an assumption, but, if it is the right notion in discrete time (equivalent to
the above condition), in continuous time, this simpler condition turns out to be too
loose to devise a consistent theory.
 Exercise. Show that for any F-stopping time τ one has, for every t ∈ [0, T ],
{τ = t} ∈ Ft (the converse is not true in general in a continuous time setting).

For the sake of simplicity, we will assume that the instantaneous interest rate
is constant and invariant, i.e. the interest rate curve is constant equal to r so that
the discounting factor at time t is given by e−r t (this also has consequences on the
dynamics of risky assets under a risk neutral probability, see further on). At this stage,
the supremum of the mean gain over all admissible stopping rules at each time is the
best the player can hope for. Concerning measurability, it is not possible to consider
such a supremum in a naive way. It should be replaced by a P-essential supremum
(see Sect. 12.9 in the Miscellany Chapter). This quantity, defined below, is known as
the Snell envelope.
It is clear from Theorem 7.2 that, under the above assumptions made on b and σ, if X_0 ∈ L^r(P) for some r ≥ 1, then E[ sup_{t∈[0,T]} |X_t|^r ] < +∞, so that, as soon as the Borel function f : [0, T] × R^d → R_+ has r-polynomial growth in x uniformly in t ∈ [0, T], that is, 0 ≤ f(t, x) ≤ C(1 + |x|^r), then sup_{t∈[0,T]} f(t, X_t) ∈ L^r(P). In particular, for any random variable τ : (Ω, A) → [0, T], 0 ≤ f(τ, X_τ) ≤ sup_{t∈[0,T]} f(t, X_t) ∈ L^r(P).
Definition 11.1 (a) The (F_t)-adapted process f(t, X_t), t ∈ [0, T], is called the payoff process of the game (or the American payoff). Its discounted version, denoted by Z_t, reads (3)

Z_t = e^{−rt} f(t, X_t),  t ∈ [0, T].

(b) Assume that X_0 ∈ L^r(P) and f has r-polynomial growth. The (F, P)-Snell envelope of the discounted payoff (Z_t)_{t∈[0,T]} is defined for every t ∈ [0, T] by

U_t = P-esssup{ E( Z_τ | F_t ), τ ∈ T_{t,T}^F }
    = P-esssup{ E( e^{−rτ} f(τ, X_τ) | F_t ), τ ∈ T_{t,T}^F } < +∞ P-a.s.   (11.2)

where T_{t,T}^F denotes the set of (F_s)_{s∈[0,T]}-stopping times having values in [t, T].

3 In this chapter, to preserve the internal consistency of the notations, we will not use a tilde to denote discounted payoffs or prices.

In fact, since the Brownian diffusion process X is an F-Markov process, one


shows (see [263]) that the filtration F can be replaced by F X in the above definition
and characterizations of F(t, x). This is clearly more in accordance with the usual
models in Finance (and elsewhere) since X represents in most models the observable
structure process.

Proposition 11.1 (Value function, [84, 262]) Assume that f : [0, T] × R^d → R_+ is continuous with r-polynomial growth, r ≥ 1.
(a) There exists a continuous function F : [0, T] × R^d → R_+ defined by

F(t, x) = sup_{τ ∈ T_{t,T}^F} E( e^{−r(τ−t)} f(τ, X_τ^{t,x}) ),

where T_{t,T}^F denotes the set of [t, T]-valued (F_s)_{s∈[0,T]}-stopping times. Then, if X_0 = x, the Snell envelope of (Z_t)_{t∈[0,T]} satisfies

∀ t ∈ [0, T],  U_t = e^{−rt} F(t, X_t^x)  P-a.s.   (11.3)

This function F is called the value function. If, furthermore, f is Lipschitz continuous in x ∈ R^d uniformly in t ∈ [0, T], then so is the function F.
(b) Autonomous case. If b(t, x) = b(x) and σ(t, x) = σ(x), then

F(t, x) = sup_{τ ∈ T_{0,T−t}^F} E( e^{−rτ} f(t + τ, X_τ^x) ).

As a by-product of this intuitive, powerful (but quite demanding) result that we


will admit, we derive that the Snell envelope (Ut )t∈[0,T ] has a continuous modification
given by (11.3) (at least when X 0 is deterministic, since the diffusion is a.s. pathwise
continuous).
The major results on the Snell envelope that we will admit in this continuous time
framework (see [84, 182, 262]) are the following:

• (Ut )t∈[0,T ] is a continuous (P, F)-super-martingale dominating the (discounted)


payoff process (Z t )t∈[0,T ] (i.e. P-a.s. (Ut ≥ Z t , ∀ t ∈ [0, T ])) and
• If (Vt )t∈[0,T ] is a continuous (P, F)-super-martingale dominating (Z t )t∈[0,T ] , then
P-a.s. (Vt ≥ Ut , ∀ t ∈ [0, T ]).

Remark. • If the components X i of X are positive such that e−r t X ti are (P, Ft )-
martingales, and if the representation theorem of (FtW )-martingales as stochastic
integrals with respect to W can be reformulated as a representation theorem with
respect to the above (discounted) processes X ti , then the diffusion (X t )t∈[0,T ] is an
abstract model of complete financial market where X i are the quoted prices of the
risky assets. The last representation assumption is essentially satisfied under uniform
ellipticity of the diffusion coefficient of the SDE satisfied by the log-risky assets. In
such a model, one shows by hedging arguments that F(0, x) is the premium of an

American option with payoff f (t, X tx ), t ∈ [0, T ]. We refer to [182] for a detailed
presentation with a focus on a model of stock market distributing dividends (see
also Sect. 11.1.2 for few more technical aspects).

 Exercise. Prove the “Lipschitz claim” of the theorem: if f is Lipschitz continuous


in x ∈ Rd uniformly in t ∈ [0, T ], then so is the function F. [Hint: use that the flow
of the SDE is Lipschitz continuous in L 1 (P), see Theorem 7.10.]

The Lipschitz continuity assumption on the payoff function f will be in force


throughout the chapter, namely

(Lip) ≡ f is Lipschitz continuous in x ∈ Rd , uniformly in t ∈ [0, T ].

This assumption will sometimes be refined into a semi-convex condition:

(SC) ≡ f satisfies (Lip) and is semi-convex in the following sense: there exists δ_f : [0, T] × R^d → R^d, Borel and bounded, and ρ ∈ (0, +∞) such that

∀ t ∈ [0, T], ∀ x, y ∈ R^d,  f(t, y) − f(t, x) ≥ ( δ_f(t, x) | y − x ) − ρ|y − x|².   (11.4)

 Examples. 1. If f is Lipschitz continuous and convex, then f is semi-convex with

δ_f(t, x) = ( (∂f/∂x_i)_r (t, x) )_{1≤i≤d},

where (∂f/∂x_i)_r (t, x) denotes the right derivative with respect to x_i at (t, x).
2. If f(t, ·) ∈ C¹(R^d, R) and, for every t ∈ [0, T], ∇_x f(t, ·) is Lipschitz continuous in x, uniformly in t ∈ [0, T], then

f(t, y) − f(t, x) ≥ ( ∇_x f(t, x) | y − x ) − [∇_x f]_Lip |ξ − x||y − x|,  ξ ∈ [x, y],
              ≥ ( ∇_x f(t, x) | y − x ) − [∇_x f]_Lip |y − x|²,

where [∇_x f]_Lip = sup_{t∈[0,T]} [∇_x f(t, ·)]_Lip.
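A quick numerical sanity check of the semi-convexity inequality (11.4) in the convex case of Example 1 (illustrative code, not from the book): for the Call payoff f(x) = (x − K)^+ one can take δ_f(x) = 1_{x≥K} (its right derivative) and ρ = 0.

```python
import numpy as np

# Illustrative check: the convex Call payoff f(x) = (x - K)^+ satisfies (11.4)
# with delta_f(x) = 1_{x >= K} (its right derivative) and rho = 0, i.e.
# f(y) - f(x) >= delta_f(x) * (y - x) for all x, y.
rng = np.random.default_rng(1)
K = 50.0

def f(x):
    return np.maximum(x - K, 0.0)

def delta_f(x):
    return (x >= K).astype(float)          # right derivative of the Call payoff

x = rng.uniform(0.0, 100.0, 10**5)
y = rng.uniform(0.0, 100.0, 10**5)
gap = f(y) - f(x) - delta_f(x) * (y - x)   # must be non-negative (rho = 0)
print(gap.min())
```

The minimum of `gap` over the random pairs is exactly 0 on the event {x ≥ K, y ≥ K} and strictly positive otherwise, confirming the inequality with ρ = 0 for this convex payoff.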

11.1.2 Interpretation in Terms of American Options (Sketch)

As far as derivative pricing is concerned, the (non-negative) function f defined


on [0, T ] × Rd+ is an American payoff function. This means that the holder of the
contract exercises his/her option at time t and he/she will receive a monetary flow
equal to f (t, x) if the vector of market prices is equal to x ∈ Rd+ at time t. When

modeling a financial market, one usually assumes that the quoted prices of the d risky
assets X ti , i = 1, . . . , d, are non-negative, which has important consequences on the
possible choices for b and σ. As we assume that all components are observable,
our framework corresponds to a multi-asset local volatility model. To ensure the
positivity of the risky assets, a natural condition is to assume that

b(t, x) = Diag(x 1 , . . . , x d )β(t, x) and σ(t, x) = Diag(x 1 , . . . , x d )ϑ(t, x),

where β : [0, T ] × Rd → Rd and ϑ : [0, T ] × Rd → M(d, q, R) are bounded con-


tinuous functions, Lipschitz in x uniformly in t ∈ [0, T ] (other conditions are needed
to ensure that b and σ themselves satisfy a Lipschitz continuity assumption in x uni-
formly in t). In practice, if one assumes that the interest rate curve is flat and stays
invariant in time, the drift of the dynamics under a risk-neutral probability is simply
of the form b(t, x) = rx so (4) that, for every i ∈ {1, . . . , d},

∀ t ∈ [0, T],  X_t^i = x_0^i exp( rt − (1/2) ∫_0^t |ϑ_{i·}(s, X_s)|² ds + ∫_0^t ( ϑ_{i·}(s, X_s) | dW_s )_q ),  x_0^i > 0.
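For a constant scalar volatility ϑ, the exponential formula above reduces to the Black–Scholes dynamics and e^{−rt}X_t is a martingale under the risk-neutral probability. A small simulation (illustrative parameters, not from the book) confirms that E[e^{−rt}X_t] = x_0:

```python
import numpy as np

# Illustrative check: with a constant scalar volatility theta, the exponential
# solution X_t = x0 * exp((r - theta^2/2) t + theta W_t) makes e^{-rt} X_t a
# martingale, so its mean at any date t is x0.
rng = np.random.default_rng(2)
x0, r, theta, t, M = 50.0, 0.04, 0.3, 1.0, 10**6

W_t = np.sqrt(t) * rng.standard_normal(M)  # W_t ~ N(0; t)
X_t = x0 * np.exp((r - 0.5 * theta**2) * t + theta * W_t)
disc_mean = np.exp(-r * t) * X_t.mean()
print(disc_mean)
```

The Monte Carlo mean should sit within a few standard errors of x_0 = 50, the −ϑ²t/2 correction in the exponent being exactly what makes the discounted process driftless.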

Note that a necessary condition for such a market to be complete is that q ≥ d


and a sufficient condition is that ϑϑ∗ ≥ ε0 Id (uniform ellipticity). If this is the case,
the risk neutral probability (equivalent to the historical probability) is unique and one
shows (see [181] in a more general setting) that, for hedging purposes, one should
price all the American option contracts under this risk-neutral probability measure:
this premium is the minimal amount of cash that makes it possible to manage a
self-financed portfolio to face any early – optimal or not – exercise of the American
option contract during its entire life between 0 and T (see [182] or, in a discrete time model, [59]). Once again, one should be cautious in interpreting the preceding, the
first task being to determine what the risky assets really are (since it may differ from
the “quoted asset”).
However, as far as numerical aspects are concerned, this kind of restriction is of
little interest in the sense that it has no impact on the methods or on their performances.
This is the reason why, in what follows, we will keep a general form for the drift b of
our Brownian diffusion dynamics (and always denote by P the probability measure
on (, A)).

4 A risk neutral probability is a probability P∗ on (Ω, A), equivalent to P, such that the discounted values of the d risky assets are P∗-martingales. If b(t, x) = rx and ϑ is bounded, P itself is a risk-neutral probability.

11.2 Optimal Stopping for Discrete Time R^d-Valued Markov Chains

11.2.1 General Theory, the Backward Dynamic Programming Principle

Optimal stopping in discrete time can be developed in a more general framework, starting from a sequence (Z_k)_{k=0,...,n} of non-negative integrable payoffs defined on a filtered probability space (Ω, A, (F_k)_{k=0,...,n}, P). For such a presentation we refer to [217] or [183] (on a finite probability space Ω of scenarii). In view of the applications we have in mind – the approximation of a continuous time Snell envelope where the underlying process is a Brownian diffusion – we chose to develop our presentation in a Markovian framework. However, it remains very close to the abstract framework.
Let (X_k)_{0≤k≤n} be an R^d-valued (F_k)_{0≤k≤n}-Markov chain defined on a filtered probability space (Ω, A, (F_k)_{k=0,...,n}, P) with Markov transitions (5)

P_k(x, dy) = P( X_{k+1} ∈ dy | X_k = x ),  k = 0, . . . , n − 1,

and let Z = (Z_k)_{0≤k≤n} be an (F_k)_{0≤k≤n}-adapted payoff sequence of non-negative integrable random variables of the form

Z_k = f_k(X_k) ∈ L¹_{R_+}(Ω, F_k, P),  k = 0, . . . , n,   (11.5)

where f_k : R^d → R_+ are non-negative Borel functions. We want to compute the (P, (F_k)_{0≤k≤n})-Snell envelope U = (U_k)_{0≤k≤n} defined (6) by

U_k = P-esssup{ E( f_τ(X_τ) | F_k ), τ ∈ T_{k:n}^F },   (11.6)

where T_{k:n}^F = { τ : (Ω, A) → {k, . . . , n}, (F_ℓ)_{0≤ℓ≤n}-stopping time }, and its sequence of values (E U_k)_{0≤k≤n}.
Note that U_k ≥ Z_k a.s. for every k ∈ {0, . . . , n}.
Proposition 11.2 (Backward Dynamic Programming Principle (BDPP)) (a)
Pathwise Backward Dynamic Programming Principle. The Snell envelope (Uk )0≤k≤n
satisfies the following Backward Dynamic Programming Principle

5 Let us recall that this means that for every k ∈ {0, . . . , n − 1}, for every bounded or non-negative Borel function f : R^d → R,

E( f(X_{k+1}) | F_k ) = E( f(X_{k+1}) | X_k ) = P_k(f)(X_k)  P-a.s.

6 A random variable τ : (Ω, A) → {0, . . . , n} is an (F_k)_{0≤k≤n}-stopping time if {τ ≤ k} ∈ F_k for every k = 0, . . . , n, or, equivalently, if {τ = k} ∈ F_k for every k = 0, . . . , n.
  
U_n = Z_n and U_k = max( Z_k, E( U_{k+1} | X_k ) ),  k = 0, . . . , n − 1.   (11.7)

(b) Optimal stopping times. The sequence of stopping times (τ_k)_{0≤k≤n} defined by

τ_n = n and τ_k = k 1_{{Z_k ≥ E(Z_{τ_{k+1}}|X_k)}} + τ_{k+1} 1_{{Z_k < E(Z_{τ_{k+1}}|X_k)}},  k = 0, . . . , n − 1,   (11.8)

satisfies

τ_k = min{ ℓ ∈ {k, . . . , n}, U_ℓ = Z_ℓ } and U_k = E( Z_{τ_k} | X_k ) a.s.,  k = 0, . . . , n.   (11.9)

In particular, this means that, for every k ∈ {0, . . . , n}, the {k, . . . , n}-valued τ_k is an optimal stopping time for the optimal stopping problem starting at k.
(c) Functional Backward Dynamic Programming Principle. Furthermore, for every k ∈ {0, . . . , n} there exists a Borel function u_k : R^d → R such that

U_k = u_k(X_k),  k = 0, . . . , n,

where

u_n = f_n and u_k = max( f_k, P_k u_{k+1} ),  k = 0, . . . , n − 1.

Proof. Let us introduce (temporarily) the sequence

θ_n = n and θ_k = k 1_{{Z_k ≥ E(Z_{θ_{k+1}}|F_k)}} + θ_{k+1} 1_{{Z_k < E(Z_{θ_{k+1}}|F_k)}},  k = 0, . . . , n − 1.

The fact that θ_k is a stopping time is left as an exercise since {Z_k ≥ E(Z_{θ_{k+1}}|F_k)} ∈ F_k, {Z_k < E(Z_{θ_{k+1}}|F_k)} = {Z_k ≥ E(Z_{θ_{k+1}}|F_k)}^c ∈ F_k and θ_{k+1} is an (F_ℓ)-stopping time.
As a first step we will prove (a) and (b) where X_k is replaced by F_k in the conditioning.
We proceed by a backward induction on the discrete instant k. If k = n, T_{k:n}^F is reduced to {n} so that U_n = E(Z_n | F_n) = Z_n. Now, let k ∈ {0, . . . , n − 1} and let θ ∈ T_{k:n}^F. Set θ̄ = θ ∨ (k + 1) so that θ = k 1_{{θ=k}} + θ̄ 1_{{θ≥k+1}}. The random variable θ̄ is a {k + 1, . . . , n}-valued (F_ℓ)-stopping time since {θ ≥ k + 1} = {θ = k}^c ∈ F_k. Then, using the chain rule for conditional expectation in the third line and the definition of U_{k+1} in the fourth line, we obtain
   
E( f_θ(X_θ) | F_k ) = f_k(X_k) 1_{{θ=k}} + E( f_θ̄(X_θ̄) 1_{{θ≥k+1}} | F_k )
  = f_k(X_k) 1_{{θ=k}} + E( f_θ̄(X_θ̄) | F_k ) 1_{{θ≥k+1}}
  = f_k(X_k) 1_{{θ=k}} + E( E( f_θ̄(X_θ̄) | F_{k+1} ) | F_k ) 1_{{θ≥k+1}}
  ≤ f_k(X_k) 1_{{θ=k}} + E( U_{k+1} | F_k ) 1_{{θ≥k+1}} a.s.
  ≤ max( f_k(X_k), E( U_{k+1} | F_k ) ) a.s.

By the very definition of the Snell envelope, U_k ≥ E( Z_{θ_k} | F_k ) a.s., so that

U_k ≥ E( Z_{θ_k} | F_k ) = Z_k 1_{{θ_k=k}} + E( Z_{θ_{k+1}} | F_k ) 1_{{θ_k≥k+1}} a.s.
  = Z_k 1_{{θ_k=k}} + E( E( Z_{θ_{k+1}} | F_{k+1} ) | F_k ) 1_{{θ_k≥k+1}}
  = Z_k 1_{{θ_k=k}} + E( U_{k+1} | F_k ) 1_{{θ_k≥k+1}} a.s.

owing to the induction assumption. Going back to the definition of θ_k we check that {θ_k = k} = {Z_k ≥ E(Z_{θ_{k+1}}|F_k)} and {θ_k ≥ k + 1} = {Z_k < E(Z_{θ_{k+1}}|F_k)}. Hence

U_k ≥ E( Z_{θ_k} | F_k ) = max( Z_k, E( U_{k+1} | F_k ) ).

This completes the proof of (11.7) except for the conditioning and, as a consequence,

E( Z_{θ_k} | F_k ) = max( Z_k, E( U_{k+1} | F_k ) ) = U_k a.s.
(c) The result is straightforward by a backward verification procedure: one notices that, for every k ∈ {0, . . . , n − 1},

E( U_{k+1} | F_k ) = E( u_{k+1}(X_{k+1}) | F_k ) = E( u_{k+1}(X_{k+1}) | X_k ) = P_k u_{k+1}(X_k)

owing to the Markov property, so that U_k = max( f_k(X_k), P_k u_{k+1}(X_k) ) = u_k(X_k). Consequently, E( U_{k+1} | F_k ) = E( U_{k+1} | X_k ), which completes the proof of (11.7) and also shows that, for every k ∈ {0, . . . , n},

U_k = u_k(X_k) = E( U_k | X_k ) = E( E( Z_{θ_k} | F_k ) | X_k ) = E( Z_{θ_k} | X_k ).

Plugging this equality into the definition of the stopping times θ_k shows, again by a backward induction, that θ_k = τ_k a.s. since

E( Z_{τ_{k+1}} | F_k ) = E( E( Z_{τ_{k+1}} | F_{k+1} ) | F_k )
  = E( U_{k+1} | F_k ) = E( U_{k+1} | X_k )
  = E( E( Z_{τ_{k+1}} | F_{k+1} ) | X_k ) = E( Z_{τ_{k+1}} | X_k )

a.s. for every k ∈ {0, . . . , n − 1}. This completes the proof of (11.8).
Finally, we derive that

τ_k = k 1_{{Z_k ≥ E(U_{k+1}|F_k)}} + τ_{k+1} 1_{{Z_k < E(U_{k+1}|F_k)}}
  = k 1_{{U_k = Z_k}} + τ_{k+1} 1_{{U_k > Z_k}},  k = 0, . . . , n,

which implies, again by a backward induction, that

τ_k = min{ ℓ ∈ {k, . . . , n}, U_ℓ = Z_ℓ }

since U_n = Z_n. ♦

Remarks. • The optimal stopping time (starting at time k) is in general not unique
(see [182] or [59], Problem 21, p. 253 for an example dealing with the American Put
in a binomial model) but
 
τ_k = min{ ℓ ∈ {k, . . . , n}, U_ℓ = Z_ℓ }

is the lowest such optimal stopping time.


• It immediately follows from (11.7) that (U_k)_{k=0,...,n} is a P-super-martingale dominating the payoff process (Z_k)_{k=0,...,n} (i.e. U_k ≥ Z_k a.s.).
 Exercise. Show that any (F, P)-super-martingale (Vk )0≤k≤n which dominates
(Z k )0≤k≤n , dominates (Uk )0≤k≤n as well.

This proposition shows that two backward dynamic programming principles coexist: a first one based on the Snell envelope, which can be considered as the “regular” BDPP formula, and a second one, (11.8), based on a recursion on optimal stopping times depending on the effective date k of entry into the game, which can be seen as a dual form of the BDPP. It is on this second BDPP that the original paper [202] on least squares regression methods is based (see the next section).
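When the transitions P_k are known in closed form, the functional BDPP u_k = max(f_k, P_k u_{k+1}) of Proposition 11.2(c) can be implemented directly. A minimal illustration (hypothetical parameters, not from the book) on a binomial (CRR) Markov chain, pricing a Bermudan Put:

```python
import numpy as np

# Minimal illustration of the functional BDPP u_k = max(f_k, P_k u_{k+1}) on a
# binomial (CRR) Markov chain: the chain jumps from x to x*u with probability p
# and to x*d with probability 1-p, so (P_k g)(x) = p*g(x*u) + (1-p)*g(x*d).
x0, K, r, sigma, T, n = 50.0, 52.0, 0.04, 0.3, 1.0, 200
dt = T / n
u, d = np.exp(sigma * np.sqrt(dt)), np.exp(-sigma * np.sqrt(dt))
p = (np.exp(r * dt) - d) / (u - d)          # risk-neutral up-probability
disc = np.exp(-r * dt)                      # one-step discount factor

def payoff(x):
    """Put obstacle f_k(x) = (K - x)^+ (the same at every date here)."""
    return np.maximum(K - x, 0.0)

# terminal condition u_n = f_n on the states x0 * u^j * d^(n-j), j = 0..n
states = x0 * u ** np.arange(n + 1) * d ** np.arange(n, -1, -1)
u_k = payoff(states)
for k in range(n - 1, -1, -1):
    states = x0 * u ** np.arange(k + 1) * d ** np.arange(k, -1, -1)
    continuation = disc * (p * u_k[1:] + (1 - p) * u_k[:-1])  # discounted P_k u_{k+1}
    u_k = np.maximum(payoff(states), continuation)            # BDPP step

bermudan_put = u_k[0]                       # U_0 = u_0(x0)
print(bermudan_put)
```

The resulting value dominates both the intrinsic value (K − x_0)^+ and the corresponding European price, as the super-martingale property of the Snell envelope dictates.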

11.2.2 Time Discretization for Snell Envelopes Based


on a Diffusion Dynamics

We aim at discretizing the optimal stopping problem defined by (11.2). This dis-
cretization is two-fold: first we discretize the instants at which the payoffs can be
“exercised” and, as a second step, we will discretize the diffusion itself, if necessary,
in order to have at hand a simulable underlying structure process. In this first phase,
the Markov chain of interest is (X tkn )0≤k≤n and the payoff functions are f k = f (tkn , .),
k = 0, . . . , n.
First we note that if t_k^n = kT/n, k = 0, . . . , n, then (X_{t_k^n})_{0≤k≤n} is an (F_{t_k^n})_{0≤k≤n}-Markov chain with transitions

P_k(x, dy) = P( X_{t_{k+1}^n} ∈ dy | X_{t_k^n} = x ),  k = 0, . . . , n − 1.

Proposition 11.3 Let n ∈ N∗. Set

T_k^n = { τ : (Ω, A, P) → {t_ℓ^n, ℓ = k, . . . , n}, (F_{t_ℓ^n})-stopping time }

and, for every x ∈ R^d, the value function (for discrete time observations or exercise):

F_n(t_k^n, x) = sup_{τ ∈ T_k^n} E( e^{−rτ} f(τ, X_τ^{t_k^n, x}) ),   (11.10)

where (X_s^{t_k^n, x})_{s∈[t_k^n, T]} is the unique solution to the (SDE) starting from x at time t_k^n.
(a) For every x ∈ R^d, ( e^{−r t_k^n} F_n(t_k^n, X_{t_k^n}^x) )_{k=0,...,n} is the (P, (F_{t_k^n}))-Snell envelope of ( e^{−r t_k^n} f(t_k^n, X_{t_k^n}^x) )_{k=0,...,n}.
(b) For every x ∈ R^d, ( F_n(t_k^n, X_{t_k^n}^x) )_{k=0,...,n} satisfies the “pathwise” BDPP

F_n(T, X_T^x) = f(T, X_T^x)

and

F_n(t_k^n, X_{t_k^n}^x) = max( f(t_k^n, X_{t_k^n}^x), e^{−rT/n} E( F_n(t_{k+1}^n, X_{t_{k+1}^n}^x) | F_{t_k^n} ) ),  k = 0, . . . , n − 1.   (11.11)

(c) The functions F_n(t_k^n, ·) satisfy the “functional” BDPP

F_n(T, x) = f(T, x) and F_n(t_k^n, x) = max( f(t_k^n, x), e^{−rT/n} P_k( F_n(t_{k+1}^n, ·) )(x) ),  k = 0, . . . , n − 1.

(d) If the function f satisfies the Lipschitz assumption (Lip), then the functions F_n(t_k^n, ·) are all Lipschitz continuous, uniformly in k ∈ {0, . . . , n}, n ≥ 1.

Proof. We proceed by a backward induction, starting from claim (c). We consider the functions F_n(t_k^n, ·), k = 0, . . . , n.
(c) ⇒ (b). The result follows from the Markov property, which implies, for every k = 0, . . . , n − 1,

E( F_n(t_{k+1}^n, X_{t_{k+1}^n}) | F_{t_k^n} ) = P_k( F_n(t_{k+1}^n, ·) )(X_{t_k^n}).

(b) ⇒ (a). This is a trivial consequence of the (BDPP) since ( e^{−r t_k^n} F_n(t_k^n, X_{t_k^n}) )_{k=0,...,n} is the (P, (F_{t_k^n})_{0≤k≤n})-Snell envelope associated to the obstacle sequence ( e^{−r t_k^n} f(t_k^n, X_{t_k^n}) )_{k=0,...,n}.
Applying the preceding to the case X_0 = x, we derive from the general theory on optimal stopping that

F_n(0, x) = sup{ E( e^{−rτ} f(τ, X_τ) ), τ ∈ T_0^n }.

The extension to times t_k^n, k = 1, . . . , n, follows likewise from the same reasoning carried out with (X_{t_ℓ^n}^{t_k^n, x})_{ℓ=k,...,n}.
520 11 Optimal Stopping, Multi-asset American/Bermudan Options

(d) A slight adaptation of Theorem 7.10 shows that there exists a real constant C_{[b]_Lip,[σ]_Lip,T} such that

‖ sup_{0≤t≤s≤T} |X_s^{t,x} − X_s^{t,y}| ‖_p ≤ C_{[b]_Lip,[σ]_Lip,T} |x − y|.

Using the Lipschitz continuity of f, the conclusion follows from the definition (11.10) of the functions F_n(t_k^n, ·). ♦

The following rates of convergence of both the Snell envelope related to the optimal stopping problem with payoff ( f(t_k^n, X_{t_k^n}) )_{k=0,...,n} and its value function toward the continuous time Snell envelope and its value function at the discretization times were established in [21]. Note that this rate can be significantly improved when f is semi-convex.
Theorem 11.1 (Diffusion at discrete times) (a) Discretization of the stopping rules for the structure process X: If f satisfies (Lip), i.e. is Lipschitz continuous in x uniformly in t ∈ [0, T], then so are the value functions F_n(t_k^n, ·) and F(t_k^n, ·), uniformly with respect to t_k^n, k = 0, . . . , n, n ≥ 1.
Furthermore, F(t_k^n, ·) ≥ F_n(t_k^n, ·) and

‖ max_{0≤k≤n} ( F(t_k^n, X_{t_k^n}) − F_n(t_k^n, X_{t_k^n}) ) ‖_p ≤ C_{b,σ,f,T} / √n

and, for every compact set K ⊂ R^d,

0 ≤ sup_{x∈K} max_{0≤k≤n} ( F(t_k^n, x) − F_n(t_k^n, x) ) ≤ C_{b,σ,f,T,K} / √n.

(b) Semi-convex payoffs. If f is semi-convex, then there exist real constants C_{b,σ,f,T} and C_{b,σ,f,T,K} > 0 such that

‖ max_{0≤k≤n} | F_n(t_k^n, X_{t_k^n}) − F(t_k^n, X_{t_k^n}) | ‖_p ≤ C_{b,σ,f,T} / n

and, for every compact set K ⊂ R^d,

0 ≤ sup_{x∈K} max_{0≤k≤n} ( F(t_k^n, x) − F_n(t_k^n, x) ) ≤ C_{b,σ,f,T,K} / n.

If the diffusion process is simulable at the instants t_k^n, as may happen if X_{t_k^n} = ϕ(X_0, t_k^n, W_{t_k^n}), where ϕ has an explicit (computable) form, the time discretization can be stopped here. Otherwise, we need to perform a second time discretization of the underlying diffusion process in order to be able to simulate the dynamics.
Now we pass to the second phase: we approximate the (usually not simulable) Markov chain (X_{t_k^n})_{0≤k≤n} by the discrete time Euler scheme (X̄_{t_k^n}^n)_{0≤k≤n} as defined by

Eq. (7.5) in Chap. 7. We recall its definition for convenience (to emphasize a change of notation concerning the Gaussian noise):

X̄_{t_{k+1}^n}^n = X̄_{t_k^n}^n + (T/n) b(t_k^n, X̄_{t_k^n}^n) + σ(t_k^n, X̄_{t_k^n}^n) √(T/n) ξ_{k+1}^n,  X̄_0^n = X_0,  k = 0, . . . , n − 1,   (11.12)

where (ξ_k^n)_{1≤k≤n} denotes a sequence of i.i.d. N(0; I_q)-distributed random vectors given by

ξ_k^n := √(n/T) ( W_{t_k^n} − W_{t_{k−1}^n} ),  k = 1, . . . , n.

In the absence of ambiguity, we will drop the superscript n in X̄^n and ξ_k^n to write X̄ and ξ_k. Also note that we temporarily gave up the notation Z_k^n for the Gaussian innovations in favor of ξ_k^n since Z is often representative of the reward process in this chapter.
this chapter.
Proposition 11.4 (Euler scheme) The above proposition remains true when replacing the sequence (X_{t_k^n})_{0≤k≤n} by its Euler scheme (X̄_{t_k^n}^n)_{0≤k≤n} with step T/n, still with the filtration F_{t_k^n} = σ(X_0, F_{t_k^n}^W), k = 0, . . . , n. In both cases one just has to replace the transitions P_k(x, dy) of the original process by those of its Euler scheme with step T/n, namely P̄_k^n(x, dy), defined for every k ∈ {0, . . . , n − 1} by

P̄_k^n f(x) = E[ f( x + (T/n) b(t_k^n, x) + √(T/n) σ(t_k^n, x) ξ ) ],  ξ ∼ N(0; I_q),

and F_n by F̄_n.
Remark. In fact, the result also holds true with the (smaller) innovation filtration F_k^{X_0,ξ} = σ(X_0, ξ_1, . . . , ξ_k), k = 0, . . . , n, since the Euler scheme is still a Markov chain with the same transitions with respect to this filtration.
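The Euler transitions P̄_k^n above are directly simulable. A minimal sketch (illustrative parameters, not from the book) of the scheme (11.12) in a scalar Black–Scholes-type case b(t, x) = rx, σ(t, x) = σx, checking the martingale property of the discounted terminal value:

```python
import numpy as np

# Minimal sketch of the Euler scheme (11.12) for dX = r X dt + sigma X dW.
# Since b(t,x) = r x here, E[e^{-rT} X_T] = x0 for the exact solution, and the
# Euler scheme reproduces this up to Monte Carlo and discretization errors.
rng = np.random.default_rng(3)
x0, r, sigma, T, n, M = 50.0, 0.04, 0.3, 1.0, 50, 10**5
dt = T / n

X = np.full(M, x0)
for k in range(n):
    xi = rng.standard_normal(M)            # i.i.d. N(0; 1) innovations xi_{k+1}^n
    X = X + r * X * dt + sigma * X * np.sqrt(dt) * xi

disc_mean = np.exp(-r * T) * X.mean()
print(disc_mean)
```

Each pass of the loop implements one application of the transition P̄_k^n; replacing the drift and diffusion by general functions b(t_k^n, x), σ(t_k^n, x) gives the scheme in full generality.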

The rate of convergence between the Snell envelopes and their value functions when switching mutatis mutandis from the diffusion “sampled” at discrete times (X_{t_k^n})_{k=0,...,n} to the Euler scheme (X̄_{t_k^n})_{k=0,...,n} is also established in [21].
Theorem 11.2 (Euler scheme approximation) Under the assumptions of the former Theorem 11.1(a), there exist real constants C_{b,σ,f,T} and C_{b,σ,f,T,K} > 0 such that

‖ max_{0≤k≤n} | F_n(t_k^n, X_{t_k^n}) − F̄_n(t_k^n, X̄_{t_k^n}^n) | ‖_p ≤ C_{b,σ,f,T} / √n

and, for every compact set K ⊂ R^d,

sup_{x∈K} max_{0≤k≤n} | F_n(t_k^n, x) − F̄_n(t_k^n, x) | ≤ C_{b,σ,f,T,K} / √n.

ℵ Practitioner’s corner
 If the diffusion process X is simulable at times tkn (in an exact way), there is no
need to introduce the Euler scheme and it will be possible to take advantage of the
semi-convexity of the payoff/obstacle function f to get a time discretization error of
the value function F by Fn at a O(1/n)-rate.
A typical example of this situation is provided by the multi-dimensional Black–
Scholes model (and its avatars for Foreign Exchange (Garman–Kohlhagen) or future
contracts (Black)) and, more generally, by models where the process (X tx )t∈[0,T ] can
be written at each time t ∈ [0, T ] as an explicit function of Wt , namely

∀ t ∈ [0, T ], X t = ϕ(t, Wt ),

where ϕ(t, x) can be computed at a very low cost. When d = 1 (although of smaller
interest for application in view of the available analytical methods based on vari-
ational inequalities), one can also rely on the exact simulation method of one-
dimensional diffusions (see [43]).
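For instance, in a one-dimensional Black–Scholes model (with purely illustrative, hypothetical parameters) the function ϕ is explicit, which is exactly what makes exact simulation of (X_{t_k^n}) from Brownian increments possible. A minimal sketch:

```python
import numpy as np

# Black-Scholes: X_t = phi(t, W_t) with
# phi(t, w) = x0 * exp((r - sigma^2/2) * t + sigma * w),
# computable at a very low cost (hypothetical parameters below)
x0, r, sigma = 100.0, 0.05, 0.2

def phi(t, w):
    """Explicit flow of the Black-Scholes SDE as a function of W_t."""
    return x0 * np.exp((r - 0.5 * sigma**2) * t + sigma * w)
```

Sampling the chain at times t_k^n then only requires simulating the Gaussian increments of W and evaluating ϕ, with no time-discretization error.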
 In the general case, we will rely on these two successive steps of discretization:
one for the optimal stopping rules and one for the underlying process to make it
simulable. However, in both cases, as far as numerics are concerned, we will rely on
the BDPP, which itself requires the computation of conditional expectations.
In both cases we now have access to a simulable Markov chain (either the Euler
scheme or the process itself at times tkn ). This task requires a second discretization
phase, making possible the computation of the discrete time Snell envelope (and its
value function).

11.3 Numerical Methods

11.3.1 The Regression Methods

The Longstaff–Schwartz approach

Regression methods are often generically known as Longstaff–Schwartz method(s)


in reference to the paper [202] (see also [60]). In this paper, however, the authors present a
particular and original approach relying on the dual BDPP based on running optimal
stopping times.
Assume that all the random variables Z_k = f_k(X_k), k = 0, …, n, as defined in (11.5), are square integrable; then so are the payoffs Z_{τ_k}.
The idea is to replace the conditional expectation operator E(· | X_k) by a linear regression on the first N elements of a Hilbert basis of ( L²(Ω, σ(X_k), P), ⟨·, ·⟩_{L²(P)} ).
This is a very natural idea to approximate conditional expectation (see e.g. [48],
Chap. 2.D. for an introduction in a general framework).
11.3 Numerical Methods 523


Mostly for convenience, we will consider a sequence e_i : R^d → R, i ∈ N*, of Borel functions such that, for every k ∈ {0, …, n}, ( e_i(X_k) )_{i∈N*} is a complete system of L²(Ω, σ(X_k), P), i.e.

{ e_ℓ(X_k), ℓ ∈ N* }^⊥ = {0},   k = 0, …, n.

In practice, one may choose different functions at every time k, i.e. families (e_{i,k}) so that ( e_{i,k}(X_k) )_{i≥1} makes up a Hilbert basis of L²_R(Ω, σ(X_k), P).
 Example. If X_k = W_{t_k^n} or, equivalently after normalization, if X_0 = 0 and X_k = √(n/(kT)) W_{t_k^n}, k = 1, …, n, the family of Hermite polynomials provides an orthonormal basis of L²_R(Ω, σ(X_k), P) at each time step t_k^n, by setting e_ℓ = H_{ℓ−1}, ℓ ≥ 1 (see e.g. [162], Chap. 3, p. 167).

Meta-script of Longstaff–Schwartz' regression procedure.


• Approximation 1: Dimension Truncation

At every time k ∈ {0, …, n}, truncate at level N_k:

e^{[N_k]}(X_k) := ( e_1(X_k), e_2(X_k), …, e_{N_k}(X_k) )

and set

τ_n^{[N_n]} := n,
τ_k^{[N_k]} := k 1_{{ Z_k > (α_k^{[N_k]} | e^{[N_k]}(X_k)) }} + τ_{k+1}^{[N_{k+1}]} 1_{{ Z_k ≤ (α_k^{[N_k]} | e^{[N_k]}(X_k)) }},

where

α_k^{[N_k]} := argmin { E( Z_{τ_{k+1}^{[N_{k+1}]}} − (α | e^{[N_k]}(X_k)) )²,  α ∈ R^{N_k} },   k = 0, …, n−1.

(Keep in mind that, at each step k, (· | ·) denotes the canonical inner product on R^{N_k}.)
In fact, this finite-dimensional optimization problem has a well-known solution given by

α_k^{[N_k]} = Gram( e^{[N_k]}(X_k) )^{−1} ( ⟨ Z_{τ_{k+1}^{[N_{k+1}]}} | e_ℓ(X_k) ⟩_{L²(P)} )_{1≤ℓ≤N_k},

where Gram( e^{[N_k]}(X_k) ) is the Gram matrix of e^{[N_k]}(X_k), defined by

Gram( e^{[N_k]}(X_k) ) = ( ⟨ e_ℓ(X_k) | e_{ℓ'}(X_k) ⟩_{L²(P)} )_{1≤ℓ,ℓ'≤N_k}.

• Approximation 2: Monte Carlo approximation


This second approximation phase can itself be decomposed into two successive
phases:
– a forward MC simulation of the underlying “structure” Markov process
(X k )0≤k≤n , followed by
– a backward approximation of τk[N ] , k = 0, . . . , n. In a more formal way, the idea
is to replace the true distribution of the Markov chain (X k )0≤k≤n by the empirical
measure of a simulated sample of size M of the chain.
 Forward Monte Carlo simulation phase: Simulate (and store) M independent
copies X (1) , . . . , X (m) , . . . , X (M) of X = (X k )0≤k≤n in order to have access to the
empirical measure
(1/M) Σ_{m=1}^{M} δ_{X^{(m)}}.

 Backward phase:
– At time n: For every m ∈ {1, …, M},

τ_n^{[N_n],m,M} := n.

– For k = n−1 down to 0:
Compute

α_k^{[N_k],M} := argmin_{α∈R^{N_k}} (1/M) Σ_{m=1}^{M} ( Z^{(m)}_{τ_{k+1}^{[N_{k+1}],m,M}} − (α | e^{[N_k]}(X_k^{(m)})) )²

using the closed-form formula

α_k^{[N_k],M} = [ (1/M) Σ_{m=1}^{M} e_ℓ(X_k^{(m)}) e_{ℓ'}(X_k^{(m)}) ]^{−1}_{1≤ℓ,ℓ'≤N_k} [ (1/M) Σ_{m=1}^{M} Z^{(m)}_{τ_{k+1}^{[N_{k+1}],m,M}} e_ℓ(X_k^{(m)}) ]_{1≤ℓ≤N_k}.

– For every m ∈ {1, …, M}, set

τ_k^{[N_k],m,M} := k 1_{{ Z_k^{(m)} > (α_k^{[N_k],M} | e^{[N_k]}(X_k^{(m)})) }} + τ_{k+1}^{[N_{k+1}],m,M} 1_{{ Z_k^{(m)} ≤ (α_k^{[N_k],M} | e^{[N_k]}(X_k^{(m)})) }}.

Finally, the resulting approximation of the mean value at the origin of the Snell envelope reads

E U_0 = E( Z_{τ_0} ) ≈ E( Z_{τ_0^{[N_0]}} ) ≈ (1/M) Σ_{m=1}^{M} Z^{(m)}_{τ_0^{[N_0],m,M}}   as M → +∞.

Note that when F_0 = {∅, Ω} (so that X_0 = x_0 ∈ R^d), U_0 = E U_0 = u_0(x_0).
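The two approximation phases above can be condensed into a few lines of code. The following is a hedged illustration, not the implementation of [202]: it prices a Bermudan put in a one-dimensional Black–Scholes model (all numerical parameters are hypothetical), uses the monomial family (1, x, x²) as e^{[N_k]} with N_k = 3, and solves the least-squares problem with numpy's `lstsq` rather than by explicit inversion of the empirical Gram matrix.

```python
import numpy as np

def longstaff_schwartz(payoff, paths, step_discount):
    """Backward phase of the Longstaff-Schwartz meta-script.

    paths:  array (M, n+1) of simulated paths X_k^(m) of the chain,
    payoff: function (k, x) -> obstacle value f_k(x) (undiscounted here;
            discounting is applied step by step while rolling back).
    Returns a Monte Carlo estimate of E U_0 = E Z_{tau_0}.
    """
    M, n = paths.shape[0], paths.shape[1] - 1
    cash = payoff(n, paths[:, n])               # payoff at tau_n^{m,M} = n
    for k in range(n - 1, 0, -1):               # k = 0 skipped: X_0 deterministic
        x, z = paths[:, k], payoff(k, paths[:, k])
        basis = np.column_stack([np.ones_like(x), x, x * x])  # e^{[N_k]}(X_k)
        alpha, *_ = np.linalg.lstsq(basis, step_discount * cash, rcond=None)
        continuation = basis @ alpha            # (alpha_k | e^{[N_k]}(X_k^(m)))
        cash = np.where(z > continuation, z, step_discount * cash)
    return max(payoff(0, paths[0, 0]), step_discount * cash.mean())

# Toy run on hypothetical Black-Scholes parameters (risk-neutral paths)
rng = np.random.default_rng(0)
S0, K, r, sigma, T, n, M = 100.0, 100.0, 0.05, 0.2, 1.0, 10, 100_000
dt = T / n
inc = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal((M, n))
S = S0 * np.exp(np.concatenate([np.zeros((M, 1)), inc.cumsum(axis=1)], axis=1))
price = longstaff_schwartz(lambda k, s: np.maximum(K - s, 0.0), S, np.exp(-r * dt))
```

A common refinement of the original paper restricts the regression to in-the-money paths; it is omitted here to stay close to the displayed formulas.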



Remarks. • One may formally rewrite the second approximation phase by simply
replacing the distribution of the chain (X k )0≤k≤n by the empirical measure

(1/M) Σ_{m=1}^{M} δ_{X^{(m)}}

where X (m) are i.i.d. copies of X .


  
• The Gram matrices ( E[ e_i(X_k) e_j(X_k) ] )_{1≤i,j≤N_k} can be computed off-line in the sense that they are not payoff dependent, by contrast with the inner product term ( ⟨ Z_{τ_{k+1}^{[N_{k+1}]}} | e_ℓ(X_k) ⟩_{L²(P)} )_{1≤ℓ≤N_k}.
In various situations, it is even possible for this Gram matrix to have a closed form, e.g. when ( e_i(X_k) )_{i≥1} happens to be an orthonormal basis of L²(Ω, σ(X_k), P). In that case the Gram matrix is reduced to the identity matrix. This is the case, for example, when X_0 = 0, X_k = √(n/(kT)) W_{kT/n}, k = 1, …, n, and e_i = H_{i−1}, where (H_i)_{i≥0} is the orthonormal basis of Hermite polynomials given by

H_i(x) = (−1)^i e^{x²/2} (d^i/dx^i) e^{−x²/2},   i ≥ 0.
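As a quick numerical sanity check: these polynomials satisfy the classical three-term recurrence H_{j+1}(x) = x H_j(x) − j H_{j−1}(x) (with H_0 = 1, H_1 = x), and E[H_i(G) H_j(G)] = i! 1_{{i=j}} for G ∼ N(0; 1), so that H_i/√(i!) is an orthonormal system. A minimal sketch:

```python
import numpy as np

def hermite(i, x):
    """Probabilists' Hermite polynomial H_i evaluated at x via the
    three-term recurrence H_{j+1}(x) = x H_j(x) - j H_{j-1}(x)."""
    h_prev, h = np.ones_like(x), x
    if i == 0:
        return h_prev
    for j in range(1, i):
        h_prev, h = h, x * h - j * h_prev
    return h

# Monte Carlo check of orthogonality under N(0, 1):
# E[H_i(G) H_j(G)] = i! if i = j, and 0 otherwise
rng = np.random.default_rng(1)
g = rng.standard_normal(2_000_000)
m22 = np.mean(hermite(2, g) ** 2)              # close to 2! = 2
m23 = np.mean(hermite(2, g) * hermite(3, g))   # close to 0
```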
• The algorithmic analysis of the above described procedure shows that its imple-
mentation requires
– a forward simulation of M paths of the Markov chain,
– a backward non-linear optimization phase in which all the (stored) paths have
to interact through the computation at every time k of αk[N ],M , which depends on all
the simulated values X k(m) , m = 1, . . . , M.
However, still in very specific situations, the forward phase can be skipped if a
backward simulation method for the Markov chain (X k )0≤k≤n is available. Such is
the case for the Brownian motion at times kT/n, using a backward recursive simulation
method, or, equivalently, the Brownian bridge method (see Chap. 8).
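Such a backward simulation of the Brownian motion at the times kT/n can be sketched as follows (a minimal illustration of the idea, not the Chap. 8 algorithm itself): draw W_T first, then go backwards using the Brownian bridge conditional law W_{t_k} | W_{t_{k+1}} = w ∼ N( (t_k/t_{k+1}) w, t_k(t_{k+1}−t_k)/t_{k+1} ).

```python
import numpy as np

def brownian_backward(T, n, M, rng):
    """Simulate M paths of W at times kT/n backwards in time.

    Draw W_T ~ N(0, T), then for k = n-1, ..., 1 use the Brownian bridge
    law W_{t_k} | W_{t_{k+1}} = w ~ N(t_k/t_{k+1} * w,
    t_k * (t_{k+1} - t_k) / t_{k+1}); finally W_0 = 0.
    """
    t = np.linspace(0.0, T, n + 1)
    W = np.empty((M, n + 1))
    W[:, n] = np.sqrt(T) * rng.standard_normal(M)
    for k in range(n - 1, 0, -1):
        mean = t[k] / t[k + 1] * W[:, k + 1]
        var = t[k] * (t[k + 1] - t[k]) / t[k + 1]
        W[:, k] = mean + np.sqrt(var) * rng.standard_normal(M)
    W[:, 0] = 0.0
    return W

W = brownian_backward(1.0, 10, 200_000, np.random.default_rng(4))
```

Generated this way, the paths have the correct finite-dimensional marginals, so the backward optimization phase can consume them on the fly without storing a forward sweep.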
The rate of convergence of the Monte Carlo phase of the procedure is ruled by a
Central Limit Theorem due to Clément–Lamberton–Protter in [64], stated below.
Theorem 11.3 (CLT, see [64] (2003)) The Monte Carlo approximation satisfies a
Central Limit Theorem, namely
 
√M ( (1/M) Σ_{m=1}^{M} Z^{(m)}_{τ_k^{[N_k],m,M}} − E Z_{τ_k^{[N_k]}} ,  α_k^{[N_k],M} − α_k^{[N_k]} )_{0≤k≤n−1}  →^(L)  N(0; Σ)   as M → +∞,

where Σ is a non-zero covariance matrix.



Regression on the continuation function

Another approach to introducing regression methods is to directly apply them to the continuation function E( U_{k+1} | F_k ) = E( u_{k+1}(X_{k+1}) | X_k ). The – maybe more natural or straightforward – idea is again to project this function onto the finite-dimensional vector space span{ e^{[N_k]}(X_k) } = span{ e_1(X_k), e_2(X_k), …, e_{N_k}(X_k) } as a first step and then to plug this projection into the regular BDPP (11.7). This second approach was proposed and developed by Tsitsiklis and van Roy in [269]. We briefly describe it with the notation used for Longstaff–Schwartz' approach.
Meta-script of Tsitsiklis–van Roy's regression procedure.
– At time n:

U_n^{(m)} = Z_n^{(m)} = f_n( X_n^{(m)} ),   m ∈ {1, …, M}.

– For every k = n−1, …, 0, one defines in a backward way

α_k^{[N_k],M} = argmin_{α∈R^{N_k}} Σ_{m=1}^{M} ( U_{k+1}^{(m)} − (α | e^{[N_k]}(X_k^{(m)})) )²
            = [ (1/M) Σ_{m=1}^{M} e_ℓ(X_k^{(m)}) e_{ℓ'}(X_k^{(m)}) ]^{−1}_{1≤ℓ,ℓ'≤N_k} [ (1/M) Σ_{m=1}^{M} U_{k+1}^{(m)} e_ℓ(X_k^{(m)}) ]_{1≤ℓ≤N_k}

and

U_k^{(m)} = max( f_k(X_k^{(m)}), (α_k^{[N_k],M} | e^{[N_k]}(X_k^{(m)})) ),   m = 1, …, M.

– In particular, when k = 0, one obtains

E U_0 ≈ (1/M) Σ_{m=1}^{M} U_0^{(m)} = (1/M) Σ_{m=1}^{M} max( f_0(X_0^{(m)}), (α_0^{[N_0],M} | e^{[N_0]}(X_0^{(m)})) ).   (11.13)
 Exercise. Justify the above algorithm starting from the BDPP.
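Mirroring the Longstaff–Schwartz sketch given earlier, here is a minimal, hypothetical implementation of the Tsitsiklis–van Roy recursion on the same toy Black–Scholes put (parameters are illustrative): the regression now bears on the value U_{k+1} itself rather than on the payoff along a stopping rule, and the max with the obstacle is taken at every step.

```python
import numpy as np

def tsitsiklis_van_roy(payoff, paths, step_discount):
    """Tsitsiklis-van Roy backward recursion with the basis (1, x, x^2).

    paths: (M, n+1) simulated paths; payoff: (k, x) -> f_k(x);
    step-by-step discounting is applied when rolling U_{k+1} back.
    """
    M, n = paths.shape[0], paths.shape[1] - 1
    u = payoff(n, paths[:, n])                        # U_n^(m) = Z_n^(m)
    for k in range(n - 1, 0, -1):
        x = paths[:, k]
        basis = np.column_stack([np.ones_like(x), x, x * x])
        alpha, *_ = np.linalg.lstsq(basis, step_discount * u, rcond=None)
        u = np.maximum(payoff(k, x), basis @ alpha)   # U_k = max(f_k, (alpha|e))
    return max(payoff(0, paths[0, 0]), step_discount * u.mean())

rng = np.random.default_rng(0)
S0, K, r, sigma, T, n, M = 100.0, 100.0, 0.05, 0.2, 1.0, 10, 100_000
dt = T / n
inc = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal((M, n))
S = S0 * np.exp(np.concatenate([np.zeros((M, 1)), inc.cumsum(axis=1)], axis=1))
price = tsitsiklis_van_roy(lambda k, s: np.maximum(K - s, 0.0), S, np.exp(-r * dt))
```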

Pros and Cons of the regression method(s) (ℵ Practitioner’s corner)

Pros
• The method is “natural”: it involves the approximation of a conditional expectation
by an affine regression operator performed on a truncated generating system of
L 2 (σ(X k ), P).
• The method appears to be “flexible”: there is an opportunity to change or adapt the
(truncated) basis of L 2 (σ(X k ), P) at each step of the procedure, e.g. by including the
payoff function in the (truncated) basis, at each step at least during the first backward
iterations of the induction.

Cons
• From a purely theoretical point of view, the regression approach does not provide error bounds or a rate of approximation for the convergence of E Z_{τ_0^{[N_0]}} toward E Z_{τ_0} = E U_0, which is mostly ruled by the rate at which the family e^{[N_k]}(X_k) "fills" L²(Ω, σ(X_k), P) when N_k goes to infinity. This point is discussed, for example, in [118] in connection with the size of the Monte Carlo sample. However, in practice this information may be difficult to exploit, especially in higher dimensions, where the possible choices for the size N_k are limited by the storage capacity of the computing device.
• Most computations are performed on-line since they are payoff dependent. However, note that the Gram matrix of ( e_ℓ(X_k) )_{1≤ℓ≤N_k} can be computed off-line since it only depends on the structure process.
• Due to the sub-optimality of the stopping times involved in the regression procedure, the method tends to produce lower bounds for u_0(x_0) = E U_0 since it has a methodological negative bias.
• The choice of the functions ( e_ℓ(x) )_{ℓ∈N*} is crucial and needs much care (and intuition). In practical implementations, it may vary at each time step (our choice of a unique family is mostly motivated by notational simplicity). Furthermore, it may have a biased effect for options deep in- or out-of-the-money since the coordinates of the functions e_i(X_k) are computed "locally" from simulated data which, of course, lie most of the time where things happen (around the mean). This has an impact on the prices of options with long maturity and/or deep in- or out-of-the-money since the calibration of the coordinates of e^{[N_k]}(X_k), mainly performed "at-the-money", induces their behavior in the wings. On the other hand, this choice of the functions, if they are smooth, has a smoothing effect which can be interesting to users (provided it does not induce any hidden arbitrage…). To overcome the first problem one may choose local functions like indicator functions of a Voronoi diagram (see the next Sect. 11.3.2 devoted to quantization tree methods, or Chap. 5), with the counterpart that a smoothing effect can no longer be expected.
When there is a family of distributions “related” to the underlying Markov struc-
ture process, a natural idea can be to consider an orthonormal basis of L 2 (μ0 ), where
μ0 is a normalized distribution of the family. A typical example is the sequence of
Hermite polynomials for the normal distribution N (0; 1).
When no simple solution is available, considering the simple basis (t^ℓ)_{ℓ≥0} remains a quite natural and efficient choice in one dimension.
In higher dimensions (in fact the only case of interest in practice, since the one-dimensional setting is usually solved through the associated variational inequality by specific numerical schemes, see [2, 158]), this choice becomes more and more influenced by the payoff itself.
• A huge RAM capacity is needed to store all the paths of the simulated Markov
chain (forward phase) except when a backward simulation procedure is available.
This induces a stringent limitation on the size M of the simulation, even with recent
devices, to prevent a swapping effect which would dramatically slow down the pro-
cedure. By a swapping effect we mean that, when the quantity of data to be stored

becomes too large, the computer uses its hard disk to store it, but access to this disk memory is incredibly slow compared to access to RAM.
• Regression methods are strongly payoff-dependent in the sense that a significant
part of the procedure (the product of the inverted Gram matrix by the projection of
the payoff at every time k) has to be done for each payoff.
 Exercise. Write a regression algorithm based on the “primal” BDPP.

11.3.2 Quantization Methods II: Non-linear Problems


(Quantization Tree)

Approximation by a quantization tree

In this section, we continue to deal with the simple discrete time Markovian optimal
stopping problem introduced in the former section. The underlying idea of the quan-
tization tree method is to approximate the whole Markovian dynamics of the chain
(X k )0≤k≤n using a skeleton of the distribution supported by a tree. In some sense, the
underlying idea is to design some optimized tree methods in higher dimensions with
a procedure optimally fitted to the underlying marginal distributions of the chain in
order to prevent an explosion of their size (number of nodes per level).
For every k ∈ {0, …, n}, we replace the marginal X_k by a function X̂_k of X_k taking values in a grid Γ_k, namely X̂_k = π_k(X_k), where π_k : R^d → Γ_k is a Borel function. The grid Γ_k = π_k(R^d) (also known as a codebook in Signal Processing or Information Theory) will always be assumed finite in practice, with size |Γ_k| = N_k ∈ N*.
However, note that all the error bounds established below still hold if the grids are infinite, provided π_k is sub-linear, i.e.

|π_k(x)| ≤ C( 1 + |x| ),

so that X̂_k has at least as many finite moments as X_k. In such a situation, all integrability assumptions on X_k and X̂_k should be coupled, i.e. X_k, X̂_k ∈ L^r(P) instead of X_k ∈ L^r(P).

We saw in Chap. 5 an optimal way to specify the function π_k (including Γ_k) when trying to minimize the induced L^p-mean quantization error ‖X_k − X̂_k‖_p. This is the purpose of optimal quantization theory. We will return to these aspects later.
The starting point, being aware that the sequence (X̂_k)_{0≤k≤n} has no reason to share a Markov property, is to force this Markov property in the BDPP. This means defining by induction a quantized pseudo-Snell envelope of ( f_k(X_k) )_{0≤k≤n} (assumed to lie at least in L¹), namely

Û_n = f_n(X̂_n),   Û_k = max( f_k(X̂_k), E( Û_{k+1} | X̂_k ) ).   (11.14)

The forced Markov property results from the conditioning by X̂_k rather than by the σ-field F̂_k := σ( X̂_ℓ, 0 ≤ ℓ ≤ k ).
It is straightforward by induction that, for every k ∈ {0, …, n},

Û_k = û_k( X̂_k ),   û_k : R^d → R_+ a Borel function.

See Subsection “Implementation of a quantization tree descent” for the detailed


implementation.

Error bounds

The following theorem establishes the control on the approximation of the true Snell envelope (U_k)_{0≤k≤n} by the quantized pseudo-Snell envelope (Û_k)_{0≤k≤n} using the L^p-mean approximation errors ‖X_k − X̂_k‖_p.

Theorem 11.4 (see [20] (2001), [235] (2011)) Assume that all functions f_k : R^d → R_+ are Lipschitz continuous and that all the transitions P_k(x, dy) = P( X_{k+1} ∈ dy | X_k = x ) are Lipschitz continuous in the following sense:

[P_k]_Lip := sup_{[g]_Lip ≤ 1} [P_k g]_Lip < +∞,   k = 0, …, n−1.

Set [P]_Lip = max_{0≤k≤n−1} [P_k]_Lip and [f]_Lip = max_{0≤k≤n} [f_k]_Lip.

Let p ∈ [1, +∞). We assume that Σ_{k=0}^{n} ( ‖X_k‖_p + ‖X̂_k‖_p ) < +∞.
(a) For every k ∈ {0, …, n},

‖U_k − Û_k‖_p ≤ 2 [f]_Lip Σ_{ℓ=k}^{n} ( [P]_Lip ∨ 1 )^{n−ℓ} ‖X_ℓ − X̂_ℓ‖_p.

(b) If p = 2, for every k ∈ {0, …, n},

‖U_k − Û_k‖_2 ≤ √2 [f]_Lip ( Σ_{ℓ=k}^{n} ( [P]_Lip ∨ 1 )^{2(n−ℓ)} ‖X_ℓ − X̂_ℓ‖²_2 )^{1/2}.

Proof. Step 1. First, we control the Lipschitz continuous constants of the functions u_k. It follows from the classical inequality

∀ a_i, b_i ∈ R,   | sup_{i∈I} a_i − sup_{i∈I} b_i | ≤ sup_{i∈I} |a_i − b_i|

that
 
[u k ]Lip ≤ max [ f k ]Lip , [Pk u k+1 ]Lip
 
≤ max [ f ]Lip , [Pk ]Lip [u k+1 ]Lip

with the convention [u n+1 ]Lip = 0. An easy backward induction yields [u k ]Lip ≤
 n−k
[ f ]Lip [P]Lip ∨ 1 .
Step 2. We focus on claim (b), when p = 2. It follows from both Backward Dynamic Programming formulas (original and quantized) and the above elementary inequality that

|U_k − Û_k| ≤ max( | f_k(X_k) − f_k(X̂_k) |, | E( U_{k+1} | X_k ) − E( Û_{k+1} | X̂_k ) | )

so that

|U_k − Û_k|² ≤ | f_k(X_k) − f_k(X̂_k) |² + ( E( U_{k+1} | X_k ) − E( Û_{k+1} | X̂_k ) )².   (11.15)

Keeping in mind that U_{k+1} = u_{k+1}(X_{k+1}) and Û_{k+1} = û_{k+1}(X̂_{k+1}), we derive from the quantized approximation bounds for conditional expectation (5.16), applied with F = u_{k+1}, G = û_{k+1}, X = X_{k+1} and Y = X_k, that

‖ E( U_{k+1} | X_k ) − E( Û_{k+1} | X̂_k ) ‖²_2 = ‖ P_k u_{k+1}(X_k) − E( û_{k+1}(X̂_{k+1}) | X̂_k ) ‖²_2
   ≤ [P_k u_{k+1}]²_Lip ‖ X_k − X̂_k ‖²_2 + ‖ u_{k+1}(X_{k+1}) − û_{k+1}(X̂_{k+1}) ‖²_2.   (11.16)
Plugging this inequality into (11.15) and taking the expectation yields, for every k ∈ {0, …, n},

‖U_k − Û_k‖²_2 ≤ ( [f]²_Lip + [P]²_Lip [u_{k+1}]²_Lip ) ‖X_k − X̂_k‖²_2 + ‖U_{k+1} − Û_{k+1}‖²_2,

still with the conventions [u_{n+1}]_Lip = 0 and Û_{n+1} = U_{n+1} = 0. Now,

[f]²_Lip + [P]²_Lip [u_{k+1}]²_Lip ≤ [f]²_Lip + [P]²_Lip ( 1 ∨ [P]_Lip )^{2(n−(k+1))} [f]²_Lip ≤ 2 [f]²_Lip ( 1 ∨ [P]_Lip )^{2(n−k)}.

Consequently,

‖U_k − Û_k‖²_2 ≤ 2 [f]²_Lip Σ_{ℓ=k}^{n−1} ( 1 ∨ [P]_Lip )^{2(n−ℓ)} ‖X_ℓ − X̂_ℓ‖²_2 + [f]²_Lip ‖X_n − X̂_n‖²_2
       ≤ 2 [f]²_Lip Σ_{ℓ=k}^{n} ( 1 ∨ [P]_Lip )^{2(n−ℓ)} ‖X_ℓ − X̂_ℓ‖²_2,

which completes the proof. ♦



Remark. The above control emphasizes the interest of minimizing the L^p-mean quantization error ‖X_k − X̂_k‖_p at each time step of the Markov chain to reduce the final resulting error.
 Exercise. Prove claim (a) starting from the L p -error bound (5.18).
Example of application: the Euler scheme. Let (X̄_{t_k^n}^n)_{0≤k≤n} be the Euler scheme with step T/n of the d-dimensional diffusion solution to the SDE (11.1). It defines a homogeneous Markov chain with transitions

P̄_k^n g(x) = E g( x + (T/n) b(t_k^n, x) + √(T/n) σ(t_k^n, x) Z ),   Z ∼ N(0; I_q).

If g is Lipschitz continuous, then

| P̄_k^n g(x) − P̄_k^n g(x′) |² ≤ [g]²_Lip E| x − x′ + (T/n)( b(t_k^n, x) − b(t_k^n, x′) ) + √(T/n)( σ(t_k^n, x) − σ(t_k^n, x′) ) Z |²
   ≤ [g]²_Lip ( | x − x′ + (T/n)( b(t_k^n, x) − b(t_k^n, x′) ) |² + (T/n) ‖ σ(t_k^n, x) − σ(t_k^n, x′) ‖² )
   ≤ [g]²_Lip | x − x′ |² ( 1 + (T/n) [σ]²_Lip + (2T/n) [b]_Lip + (T²/n²) [b]²_Lip ).

As a consequence, setting C_{b,σ} = [b]_Lip + [σ]²_Lip/2 and C_{b,σ,T} = T C_{b,σ} yields

[ P̄_k^n g ]_Lip ≤ ( 1 + C_{b,σ,T}/n ) [g]_Lip,   k = 0, …, n−1,

i.e.

[ P̄^n ]_Lip ≤ 1 + C_{b,σ,T}/n.

Applying the control established in claim (b) of the above theorem yields, with obvious notations,

‖U_k − Û_k‖_2 ≤ √2 [f]_Lip ( Σ_{ℓ=k}^{n} ( 1 + C_{b,σ,T}/n )^{2(n−ℓ)} ‖ X̄_{t_ℓ^n} − X̂̄_{t_ℓ^n} ‖²_2 )^{1/2}
   ≤ √2 e^{C_{b,σ,T}} [f]_Lip ( Σ_{ℓ=k}^{n} ‖ X̄_{t_ℓ^n} − X̂̄_{t_ℓ^n} ‖²_2 )^{1/2}.

 Exercise. Derive a result in the case p = 2 based on Claim (a) of the theorem.



Background on optimal quantization

For some background on optimal quantization, we refer to Chap. 5.

Implementation of a quantization tree descent

 Quantization tree. The pathwise Quantized Backward Dynamic Programming Principle (11.14) can be rewritten in distribution as follows. Let, for every k ∈ {0, …, n},

Γ_k = { x_1^k, …, x_{N_k}^k }.

Keeping in mind that Û_k = û_k(X̂_k), k = 0, …, n, we first get

⎧ û_n = f_n on Γ_n,
⎨ û_k(x_i^k) = max( f_k(x_i^k), E( û_{k+1}(X̂_{k+1}) | X̂_k = x_i^k ) ),   (11.17)
⎩ i = 1, …, N_k,  k = 0, …, n−1,

which finally leads to

⎧ û_n(x_i^n) = f_n(x_i^n),  i = 1, …, N_n,
⎨ û_k(x_i^k) = max( f_k(x_i^k), Σ_{j=1}^{N_{k+1}} p_{ij}^k û_{k+1}(x_j^{k+1}) ),   (11.18)
⎩ i = 1, …, N_k,  k = 0, …, n−1,

where the transition weight "super-matrix" [p_{ij}^k] is defined by

p_{ij}^k = P( X̂_{k+1} = x_j^{k+1} | X̂_k = x_i^k ),   1 ≤ i ≤ N_k,  1 ≤ j ≤ N_{k+1},  k = 0, …, n−1.   (11.19)
Although the above super-matrix defines a family of Markov transitions, the sequence (X̂_k)_{0≤k≤n} is definitely not a Markov chain since there is no reason why

P( X̂_{k+1} = x_j^{k+1} | X̂_k = x_i^k )  and  P( X̂_{k+1} = x_j^{k+1} | X̂_k = x_i^k, X̂_ℓ = x_{i_ℓ}^ℓ, ℓ = 0, …, k−1 )

should be equal.
In fact, one should rather view the quantized transitions

P̂_k( x_i^k, dy ) = Σ_{j=1}^{N_{k+1}} p_{ij}^k δ_{x_j^{k+1}}(dy),   x_i^k ∈ Γ_k,  k = 0, …, n−1,

as spatial discretizations of the transitions P_k(x, dy) of the original Markov chain.

Definition 11.2 The family of grids (Γ_k), 0 ≤ k ≤ n, and the transition super-matrix [p_{ij}^k] defined by (11.19) define a quantization tree of size N = N_0 + ⋯ + N_n.

Remark. A quantization tree in the sense of the above definition does not characterize the distribution of the sequence (X̂_k)_{0≤k≤n}.
 Implementation of the quantization tree descent. The implementation of the
whole quantization tree method relies on the computation of this transition super-
matrix. Once the grids (optimal or not) have been specified and the weights of the
super-matrix have been computed or, to be more precise, have been estimated, the
computation of the approximate value function E U #0 at time 0 amounts to an almost
instantaneous “backward descent” of the quantization tree based on (11.18).
If we can simulate M independent copies of the Markov chain (X k )0≤k≤n denoted
by X (1) , . . . , X (M) , then the weights pikj can be estimated by a standard Monte Carlo
estimator
 
p_{ij}^{(M),k} = #{ m ∈ {1, …, M} : X̂_{k+1}^{(m)} = x_j^{k+1} and X̂_k^{(m)} = x_i^k } / #{ m ∈ {1, …, M} : X̂_k^{(m)} = x_i^k }  −→  p_{ij}^k   as M → +∞.
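To make the descent concrete, here is a self-contained sketch (again on a hypothetical one-dimensional Black–Scholes put): the grids are built from empirical quantiles — a sub-optimal but simple stand-in for truly optimal grids — the weights p_ij^k are estimated by the counting estimator above, and the backward recursion (11.18) is then run on the tree (with step-by-step discounting, the obstacle here being undiscounted).

```python
import numpy as np

def nearest(grid, x):
    """Nearest-neighbour projection pi_k: index of the closest grid point."""
    return np.abs(x[:, None] - grid[None, :]).argmin(axis=1)

def quantization_tree(payoff, paths, grids, step_discount):
    """Estimate the transition super-matrix by counting, then run the
    quantized BDPP (11.18). paths: (M, n+1); grids: list of n+1 1-d arrays."""
    n = paths.shape[1] - 1
    idx = [nearest(grids[k], paths[:, k]) for k in range(n + 1)]
    u = payoff(n, grids[n])                        # u_n = f_n on Gamma_n
    for k in range(n - 1, -1, -1):
        counts = np.zeros((len(grids[k]), len(grids[k + 1])))
        np.add.at(counts, (idx[k], idx[k + 1]), 1.0)
        p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
        u = np.maximum(payoff(k, grids[k]), step_discount * (p @ u))
    return u                                       # value function u_0 on Gamma_0

# Toy run: hypothetical Black-Scholes Bermudan put
rng = np.random.default_rng(2)
S0, K, r, sigma, T, n, M = 100.0, 100.0, 0.05, 0.2, 1.0, 10, 100_000
dt = T / n
inc = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal((M, n))
S = S0 * np.exp(np.concatenate([np.zeros((M, 1)), inc.cumsum(axis=1)], axis=1))
# N_0 = 1 (X_0 deterministic), then quantile grids of size 50 (sub-optimal)
grids = [np.array([S0])] + [np.quantile(S[:, k], np.linspace(0.01, 0.99, 50))
                            for k in range(1, n + 1)]
price = quantization_tree(lambda k, s: np.maximum(K - s, 0.0), S,
                          grids, np.exp(-r * dt))[0]
```

Once the weights have been stored, re-running the descent for another payoff costs only the n matrix-vector products, which is the "almost instantaneous" part mentioned above.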

Remark. By contrast with the regression methods, for which theoretical results are mostly focused on the rate of convergence of the Monte Carlo phase, here we will not analyze this part of the procedure, for which we refer to [17], where the estimation of the transition super-matrix and its impact on the quantization tree are deeply analyzed.
Application. We can apply the preceding, still within the framework of an Euler scheme, to our original optimal stopping problem. We assume that all random vectors X̄_{t_k^n} lie in L^{p'}(P) for a real exponent p' > 2 and that they have been optimally quantized (in L²) by grids of size N_k, k = 0, …, n. Then, relying on the non-asymptotic Zador Theorem (claim (b) from Theorem 5.1.2), we get, with obvious notations,

‖ Ū_0^n − Û̄_0^n ‖_2 ≤ √2 e^{C_{b,σ,T}} [f]_Lip C_{p',d} ( Σ_{k=0}^{n} σ_{p'}( X̄_{t_k^n} )² N_k^{−2/d} )^{1/2}.

 Optimizing the quantization tree. At this stage, one can process an optimization of the quantization tree. To be precise, one can optimize the sizes of the grids Γ_k subject to a "budget" (or total allocation) constraint, typically

min { Σ_{k=0}^{n} e^{2C_{b,σ,T}} σ_{p'}( X̄_{t_k^n} )² N_k^{−2/d} :  N_k ≥ 1,  N_0 + ⋯ + N_n = N }.

 Exercise. (a) Solve (more) rigorously the constrained optimization problem

min { Σ_{k=0}^{n} e^{2C_{b,σ,T}} σ_{p'}( X̄_{t_k^n} )² x_k^{−2/d} :  x_k ∈ R_+,  x_0 + ⋯ + x_n = N }.

(b) Derive an asymptotically optimal choice for the grid size allocation (as N →
+∞).

This optimization turns out to have a significant numerical impact, even if, in terms of rate, the uniform choice N_k = N̄ = N/n (doing so we implicitly assume that X_0 = x_0 ∈ R^d, so that X̂_0 = X_0 = x_0, N_0 = 1 and Û_0 = û_0(x_0)) leads to a quantization error of the form

| ū_0^n(x_0) − û̄_0^n(x_0) | ≤ ‖ Ū_0^n − Û̄_0^n ‖_2 ≤ κ_{b,σ,T} [f]_Lip max_{0≤k≤n} σ_{p'}( X̄_{t_k^n} ) × √n / N̄^{1/d} ≤ [f]_Lip κ_{b,σ,T,p'} √n / N̄^{1/d}

since we know (see Proposition 7.2 in Chap. 7) that sup_n E( max_{0≤k≤n} | X̄_{t_k^n} |^{p'} ) < +∞. If we plug this into the global estimate obtained in Theorem 11.2, we obtain the typical error bound

| u_0(x_0) − û̄_0^n(x_0) | ≤ C_{b,σ,T,f,d} ( 1/n^{1/2} + n^{1/2}/N̄^{1/d} ).   (11.20)

Remark. If we can directly simulate the sampled diffusion (X_{t_k^n})_{0≤k≤n} instead of its Euler scheme and if the obstacle/payoff function is semi-convex in the sense of Condition (SC), then we get as a typical error bound

| u_0(x_0) − û̄_0^n(x_0) | ≤ C_{b,σ,T,f,d} ( 1/n + n^{1/2}/N̄^{1/d} ).

Remark. • The rate of decay N̄^{−1/d} obviously becomes bad as the spatial dimension d of the underlying Markov process increases, but this phenomenon cannot be overcome by such tree methods. This is a consequence of Zador's Theorem. This rate degradation is known as the curse of dimensionality.
• These rates can be significantly improved by introducing a Romberg-like extrapo-
lation method and/or some martingale corrections to the quantization tree (see [235]).

ℵ Practitioner’s corner

 Tree optimization vs BDPP complexity


If, n being fixed, we set

N_k = ⌈ ( σ_{2+δ}(X_k) )^{2d/(d+2)} / Σ_{0≤ℓ≤n} ( σ_{2+δ}(X_ℓ) )^{2d/(d+2)} · N ⌉ ∨ 1,   k = 0, …, n

(see the exercise in the previous section) we (asymptotically) minimize the resulting quantization error induced by the BDPP descent.
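This allocation rule can be sketched in a few lines (the dispersion coefficients σ_{2+δ}(X_k) must be supplied; below we plug in the √t_k scaling of the Brownian example that follows, any multiplicative constant cancelling in the normalization):

```python
import numpy as np

def grid_sizes(dispersions, N, d):
    """Dispersion-based budget allocation: N_k proportional to
    sigma_{2+delta}(X_k)^(2d/(d+2)), rounded up and floored at 1."""
    w = np.asarray(dispersions, dtype=float) ** (2.0 * d / (d + 2.0))
    Nk = np.ceil(w / w.sum() * N).astype(int)
    return np.maximum(Nk, 1)

# Brownian motion sampled at t_k = kT/n: sigma_{2+delta}(W_{t_k}) ~ sqrt(t_k)
n, N, d = 10, 1000, 2
Nk = grid_sizes(np.sqrt(np.arange(n + 1) / n), N, d)
```

With this input the sizes are automatically non-decreasing in k and N_0 = 1, in line with the Brownian example below.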

 Examples: 1. Brownian motion X_k = W_{t_k}. Then Ŵ_0 = 0 and

‖ W_{t_k} ‖_{2+δ} = C_δ √t_k,   k = 0, …, n.

Hence N_0 = 1,

N_k ≃ ( 2(d+1)/(d+2) ) (k/n)^{d/(d+2)} (N/n),   k = 1, …, n,

and

| V_0 − v̂_0(0) | ≤ C_{W,δ} ( 2(d+1)/(d+2) )^{1−1/d} n^{1/2+1/d} / N^{1/d} = O( n^{1/2} / N̄^{1/d} )

with N̄ = N/n.
Theoretically this choice may not look crucial since it has no impact on the
convergence rate, but in practice, it does influence the numerical performances.
2. Stationary processes. The process (X_k)_{0≤k≤n} is stationary and X_0 ∈ L^{2+δ} for some δ > 0. A typical example in the Gaussian world is, as expected, the stationary Ornstein–Uhlenbeck process

dX_t = −B X_t dt + Σ dW_t,   B, Σ ∈ M(d, R),

where all eigenvalues of B have positive real parts, W is a d-dimensional Brownian motion defined on (Ω, A, P) and X_0 is independent of W. Then it is a classical result that, if X_0 is distributed according to the distribution

ν_0 = N( 0; ∫_0^{+∞} e^{−tB} Σ Σ* e^{−tB*} dt ),

then the process (X_t)_{t≥0} is stationary, i.e. (X_{t+s})_{s≥0} and (X_s)_{s≥0} have the same distribution as processes, with common marginal distribution ν_0. If we "sample" (or observe) this process at times t_k^n = kT/n, k = 0, …, n, between 0 and T, it takes the form of a Gaussian autoregressive process of order 1:

X_{t_{k+1}^n} = ( I_d − (T/n) B ) X_{t_k^n} + √(T/n) Σ Z_{k+1}^n,   (Z_k^n)_{1≤k≤n} i.i.d., N(0; I_d)-distributed.

The key feature of such a setting is that the quantization tree only relies on one optimal N̄-grid Γ = Γ^{0,(N̄)} = { x_1^0, …, x_{N̄}^0 } (say, L²-optimal for the distribution of X_0 at level N̄) and one quantized transition matrix ( P( X̂_1 = x_j^0 | X̂_0 = x_i^0 ) )_{1≤i,j≤N̄}.
For every k ∈ {0, …, n}, ‖X_k‖_{2+δ} = ‖X_0‖_{2+δ}, hence N_k = ⌊ N/(n+1) ⌋, k = 0, …, n, and

‖ V_0 − v̂_0(X̂_0) ‖_2 ≤ C_{X,δ} n^{(2+d)/(2d)} / N^{1/d} ≤ C_{X,δ} √n / N̄^{1/d}   with N̄ = N/(n+1).
N̄ d n+1

Numerical optimization of the grids: Gaussian and non-Gaussian vectors


We refer to Sect. 5.3 in Chap. 5 devoted to optimal quantization and Sect. 6.3.5 in
Chap. 6 devoted to Stochastic approximation and optimization.
 Richardson–Romberg extrapolation(s). Richardson–Romberg extrapolation in
this framework is based on a heuristic guess: there exists a “sharp rate” of convergence
of the quantization tree method as the total budget N goes to infinity and this rate of
convergence is given by (11.20) when replacing N̄ by N /n.
One can perform a Romberg extrapolation in N for fixed n or even a full Romberg
extrapolation involving both the time discretization step n and the size N of the
quantization tree.
 Exercise (temporary). Make the assumption that the error in a quantization scheme admits a first-order expansion of the form

Err(n, N) = c_1 / n^α + c_2 n^{(2+d)/(2d)} / N^{1/d}.

Devise a Richardson–Romberg extrapolation based on two quantization trees with sizes N^{(1)} and N^{(2)}, respectively.
 Martingale correction. When (X_k)_{0≤k≤n} is a martingale, one can force this martingale property on the quantized chain by freezing the transition weight super-matrix and by moving the grids Γ_k in a backward way, so that the resulting grids Γ̃_k satisfy

E( X̂_k^{Γ̃_k} | X̂_{k−1}^{Γ̃_{k−1}} ) = X̂_{k−1}^{Γ̃_{k−1}}.

In fact, this identity defines the new grids Γ̃_k by a backward induction. As a final step, one translates these new grids so that Γ̃_0 and Γ_0 have the same mean. Of course, such a procedure is entirely heuristic.

Pros and Cons of quantization method(s) (ℵ Practitioner’s corner)

Pros
• The quantization tree, once optimized, appears as a skeleton of the distribution
of the underlying Markov chain. This optimization phase can be performed off-line
whereas the payoff dependent part, the Quantized BDPP, is almost instantaneous.
• In many situations, including the multi-dimensional Black–Scholes model, the
quantization tree can be designed on the Brownian motion itself, for which pre-computed grids are available (website quantize.maths-fi.com). It only remains to compute the transitions.
• Several natural and easy-to-implement methods are available to improve its crude
performances when the dimension d of the state space of the underlying Markov chain
increases: use the premium of a European option with payoff the terminal value of an
American payoff as a control variate, Richardson–Romberg extrapolation, martingale
correction as described above, etc.
• A new approach has been recently developed to dramatically speed up the construction of performing quantization trees, so far in one dimension (see [234]) but a
higher-dimensional extension is in progress (see [91]). This new quantization method
preserves the Markov property, but it is slightly less accurate in its approximation of
the distribution.
Cons
• The method clearly suffers from the curse of dimensionality as emphasized by its a
priori error bounds. Indeed, all methods suffer from this curse: the regression method through the "filling" rate of L²(σ(X_k), P) by the basis e^{[N]}(X_k) as N goes to infinity,
the Malliavin-Monte Carlo method (not developed here), e.g. through the variance
of their estimator of conditional expectations (usually exploding like (timestep)−d
in d dimensions).
• Quantization-based numerical schemes are local by construction: they rely on the
indicator functions of Voronoi cells and, owing to this feature, they do not propagate
regularity properties.
• The method may appear to be less flexible than its competitors, a counterpart of
the off-line calibration phase. In particular, it may seem inappropriate to perform
calibration of financial models where at each iteration of the procedure the model
parameters, and subsequently the resulting quantization tree, are modified. This point
may be discussed with respect to competing methods and current performance of
modern computers. However, to take this objection into account, an alternative fast
quantization method has been proposed and analyzed in [234] to design perform-
ing (though sub-optimal) quantization trees. They have been successfully tested on
calibration problems in [57, 58] in local and stochastic volatility problems.
• The method is limited to medium dimensions, say up to d = 10 or 12, but this
is also the case for other methods. For higher dimensions the complete paradigm
should be modified.

11.4 Dual Form of the Snell Envelope (Discrete Time)

We rely in this brief section on the notations introduced in Sect. 11.2.1. So far we have seen that the (P, (F_k)_{k=0,…,n})-Snell envelope (U_k)_{k=0,…,n} of an (F_k)_{0≤k≤n}-adapted sequence (Z_k)_{k=0,…,n} of integrable non-negative random variables defined on a filtered probability space (Ω, A, P, (F_k)_{k=0,…,n}) is defined as a P-essential supremum (see Sect. 12.9) and satisfies a Backward Dynamic Programming Principle
538 11 Optimal Stopping, Multi-asset American/Bermudan Options

established in Proposition 11.7. The numerical methods (regression and optimal


quantization) described in the former section relied on this BDDP (quantization) or
its backward running optimal stopping times-based variant (regression). This second
method is supposed to provide lower bounds for the Snell envelope (up to the Monte
Carlo error). In the early 2000s a dual representation of the Snell envelope has
been established by Rogers in [255] (see also [146]) where the Snell envelope is
represented as an essential infimum (see Proposition 12.4 and Eq. (12.7) in Sect. 12.9)
with the hope that numerical methods based on such a representation will provide
some upper-bounds.
This dual representation reads as follows (we will not use the Markov property).
Theorem 11.5 (Dual form of the Snell envelope (Rogers)) Let (Z_k)_{k=0,...,n} be as
above and let (U_k)_{k=0,...,n} be its (P, (F_k)_{0≤k≤n})-Snell envelope. Let

M_k = { M = (M_ℓ)_{ℓ=0,...,n} : M is a (P, (F_ℓ)_{0≤ℓ≤n})-martingale, M_k = 0 }.

Then, for every k ∈ {0, ..., n},

U_k = P-essinf { E( sup_{ℓ∈{k,...,n}} (Z_ℓ − M_ℓ) | F_k ), M ∈ M_k }.
The proof relies on Doob's decomposition of a super-martingale:

Proposition 11.5 (Doob's decomposition) Let (S_k)_{k=0,...,n} be a (P, (F_k)_{0≤k≤n})-
super-martingale, i.e. a sequence of integrable (F_k)_{0≤k≤n}-adapted random variables
satisfying

∀ k ∈ {0, ..., n−1},  E(S_{k+1} | F_k) ≤ S_k.

Then there exists a pair (M, A), unique up to P-a.s. equality, where M = (M_k)_{k=0,...,n}
is a (P, (F_k)_{0≤k≤n})-martingale null at 0 and A = (A_k)_{k=0,...,n} is a non-decreasing
sequence of random variables with A_0 = 0 and A_k F_{k−1}-measurable for every
k ∈ {1, ..., n}, such that

∀ k ∈ {0, ..., n},  S_k = S_0 + M_k − A_k.
Exercise. Prove this proposition. [Hint: First establish uniqueness by showing that
A_k = − Σ_{ℓ=1}^{k} E( S_ℓ − S_{ℓ−1} | F_{ℓ−1} ), k = 1, ..., n.]
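For intuition, the hint can be checked on a toy example (an illustration of ours, not from the text): for a symmetric ±1 random walk W with W_0 = 0, the process S_k = −W_k² is a super-martingale with E(S_{k+1} | F_k) = S_k − 1, so the hint yields A_k = k and hence M_k = S_k − S_0 + A_k = k − W_k². A minimal sketch verifying E M_n = 0 by exact enumeration of all 2^n paths:

```python
from itertools import product

# Toy super-martingale: S_k = -(W_k)^2 for a symmetric +/-1 random walk W.
# Doob decomposition per the hint: A_k = k (predictable, here deterministic),
# M_k = k - W_k^2, which should be a martingale null at 0.
n = 12
total = 0
for steps in product((-1, 1), repeat=n):
    W = sum(steps)                # W_n along this path
    total += n - W * W            # M_n along this path

# Summing M_n over the 2^n equally likely paths gives 0 exactly,
# i.e. E[M_n] = 0, as the martingale property requires.
print(total)   # 0
```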

Proof of Theorem 11.5 Step 1 (k = 0). Let M ∈ M_0 and let τ be an (F_k)_{0≤k≤n}-
stopping time. Then, by the optional stopping theorem (see [217]), E M_τ = 0, so
that

E( Z_τ | F_0 ) = E( Z_τ − M_τ | F_0 ) ≤ E( sup_{k∈{0,...,n}} (Z_k − M_k) | F_0 )  P-a.s.

Hence, by the very definition of a P-essinf,

E( Z_τ | F_0 ) ≤ P-essinf { E( sup_{k∈{0,...,n}} (Z_k − M_k) | F_0 ), M ∈ M_0 }  P-a.s.

Then, by the definition (11.6) of the Snell envelope and of the P-esssup,

U_0 ≤ P-essinf { E( sup_{k∈{0,...,n}} (Z_k − M_k) | F_0 ), M ∈ M_0 }  P-a.s.

Conversely, it is clear from the BDPP (11.7) satisfied by the Snell envelope that
(U_k)_{k=0,...,n} is a (P, (F_k)_{0≤k≤n})-super-martingale which dominates (Z_k)_{k=0,...,n} (i.e.
U_k ≥ Z_k P-a.s. for every k = 0, ..., n). Hence, it admits a Doob decomposition
(M*, A*) such that U_k = U_0 + M*_k − A*_k, k = 0, ..., n, with M* and A* as in the
above Proposition 11.5. Consequently, for every k = 0, ..., n,

Z_k − M*_k = (Z_k − U_k) − A*_k + U_0 ≤ −A*_k + U_0 ≤ U_0,

where we used that Z_k − U_k ≤ 0, which in turn implies
E( sup_{0≤k≤n} (Z_k − M*_k) | F_0 ) ≤ E( U_0 | F_0 ) = U_0.
Step 2 (Generic k). For a generic k ∈ {0, ..., n}, one adapts the proof of the above
step by considering for τ a {k, ..., n}-valued stopping time and, in the second part
of the proof, by considering the martingale M^{*,k}_ℓ = M*_{ℓ∨k} − M*_k, ℓ = 0, ..., n. Con-
ditioning by F_k completes the proof. ♦

Various papers take advantage of this representation to devise alternative American-
style option pricing algorithms, starting with [146, 255]. Let us also cite, among
other references, [8, 36, 38], with extensions to multiple stopping time problems and
to Backward Stochastic Differential Equations.
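To make the duality concrete, here is a minimal numerical sketch (an illustration of ours, not an algorithm from the text): in a Cox–Ross–Rubinstein binomial tree, the Snell envelope of a discounted Bermudan put payoff is computed by the BDPP, and the dual expectation is evaluated with M chosen as the Doob martingale M* of the Snell envelope, for which Theorem 11.5 at k = 0 (with trivial F_0) yields equality. All model parameters below are illustrative assumptions.

```python
import math
from itertools import product

# Illustrative CRR parameters (assumptions, not from the text)
S0, K, r, sigma, T, n = 100.0, 100.0, 0.05, 0.2, 1.0, 8
dt = T / n
u = math.exp(sigma * math.sqrt(dt)); d = 1.0 / u
p = (math.exp(r * dt) - d) / (u - d)       # risk-neutral up-probability
disc = math.exp(-r * dt)

def payoff(k, j):
    """Discounted put payoff Z_k at the node reached with j up-moves in k steps."""
    return disc ** k * max(K - S0 * u ** j * d ** (k - j), 0.0)

# Backward Dynamic Programming Principle: U_k = max(Z_k, E[U_{k+1} | F_k])
U = [[0.0] * (k + 1) for k in range(n + 1)]
U[n] = [payoff(n, j) for j in range(n + 1)]
for k in range(n - 1, -1, -1):
    for j in range(k + 1):
        cont = p * U[k + 1][j + 1] + (1 - p) * U[k + 1][j]
        U[k][j] = max(payoff(k, j), cont)

# Dual representation with M = M*, the Doob martingale of U:
# E[ max_k (Z_k - M*_k) ] then equals U_0 up to round-off.
dual = 0.0
for path in product((0, 1), repeat=n):     # enumerate all 2^n scenarios
    prob, j, M, best = 1.0, 0, 0.0, payoff(0, 0)
    for k, up in enumerate(path):
        cont = p * U[k + 1][j + 1] + (1 - p) * U[k + 1][j]
        j += up
        prob *= p if up else 1.0 - p
        M += U[k + 1][j] - cont            # Doob martingale increment of U
        best = max(best, payoff(k + 1, j) - M)
    dual += prob * best

print(U[0][0], dual)
```

With any other martingale in M_0 the bracketed expectation only upper-bounds U_0, which is what makes this representation useful for Monte Carlo upper bounds.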
Chapter 12
Miscellany

12.1 More on the Normal Distribution

12.1.1 Characteristic Function
Proposition 12.1 If Z =ᵈ N(0; 1), then its characteristic function χ_Z, defined for
every u ∈ R by χ_Z(u) = E e^{ı̃uZ}, is given by

∀ u ∈ R,  χ_Z(u) := ∫_{−∞}^{+∞} e^{ı̃ux} e^{−x²/2} dx/√(2π) = e^{−u²/2}.

Proof. Differentiating under the integral sign yields

χ′_Z(u) = ı̃ ∫_{−∞}^{+∞} e^{ı̃ux} e^{−x²/2} x dx/√(2π).

Now the integration by parts { e^{ı̃ux} → ı̃u e^{ı̃ux},  x e^{−x²/2} → −e^{−x²/2} } yields

χ′_Z(u) = ı̃² u ∫_{−∞}^{+∞} e^{ı̃ux} e^{−x²/2} dx/√(2π) = −u χ_Z(u)

so that
χ_Z(u) = χ_Z(0) e^{−u²/2} = e^{−u²/2}. ♦

Corollary 12.1 If Z = (Z¹, ..., Z^d) =ᵈ N(0; I_d) is a multivariate standard normal
vector, then its characteristic function χ_Z(u) = E e^{ı̃(u|Z)}, u ∈ R^d, is given by

χ_Z(u) = e^{−|u|²/2}.
© Springer International Publishing AG, part of Springer Nature 2018
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0_12

Proof. For every u ∈ R^d, (u | Z) = Σ_{i=1}^{d} u_i Z^i has a Gaussian distribution with
variance Σ_{i=1}^{d} (u_i)² = |u|² since the components Z^i are independent and N(0; 1)-
distributed. Consequently,

(u | Z) =ᵈ |u| ζ,  ζ =ᵈ N(0; 1),

so that the characteristic function χ_Z of Z, defined by χ_Z(u) = E e^{ı̃(u|Z)}, is given by

∀ u ∈ R^d,  χ_Z(u) = e^{−|u|²/2}. ♦
Remark. An alternative argument is that the C-valued random variables e^{ı̃u_i Z^i}, i =
1, ..., d, are independent, so that

χ_Z(u) = E ∏_{i=1}^{d} e^{ı̃u_i Z^i} = ∏_{i=1}^{d} E e^{ı̃u_i Z^i} = ∏_{i=1}^{d} e^{−(u_i)²/2} = e^{−|u|²/2}.
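As a quick sanity check (an illustration of ours, not part of the text), the identity χ_Z(u) = e^{−|u|²/2} can be verified by Monte Carlo simulation in dimension d = 2; the test point u, the seed and the sample size below are arbitrary assumptions.

```python
import cmath
import math
import random

random.seed(12345)                 # fixed seed for reproducibility
u = (0.7, -1.1)                    # arbitrary test point in R^2
N = 50_000
acc = 0j
for _ in range(N):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    acc += cmath.exp(1j * (u[0] * z1 + u[1] * z2))   # e^{i(u|Z)}
estimate = acc / N
exact = math.exp(-(u[0] ** 2 + u[1] ** 2) / 2.0)     # e^{-|u|^2/2}
print(abs(estimate - exact))       # O(N^{-1/2}) Monte Carlo error
```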

12.1.2 Numerical Approximation of the Cumulative Distribution Function Φ_0

To compute the c.d.f. (cumulative distribution function) Φ_0 of the normal distribution,
one usually relies on the fast approximation formula obtained by continued fraction
expansion techniques (see [1]):

∀ x ∈ R_+,  Φ_0(x) = 1 − (e^{−x²/2}/√(2π)) (a_1 t + a_2 t² + a_3 t³ + a_4 t⁴ + a_5 t⁵) + O( e^{−x²/2} t⁶ ),

where t := 1/(1 + px), p := 0.231 6419 and

a_1 := 0.319 381 530,  a_2 := −0.356 563 782,  a_3 := 1.781 477 937,
a_4 := −1.821 255 978,  a_5 := 1.330 274 429,

inducing a maximal error of the form O( e^{−x²/2} t⁶ ) ≤ 7.5·10⁻⁸.

12.1.3 Table of the Distribution Function of the Normal Distribution

The distribution function of the N(0; 1) distribution is given for every real number
t by

Φ_0(t) := (1/√(2π)) ∫_{−∞}^{t} e^{−x²/2} dx.

Since the probability density is even, one easily checks that

Φ_0(t) − Φ_0(−t) = 2 Φ_0(t) − 1.

The following tables give the values of Φ_0(t) for t = x_0.x_1x_2, where x_0 ∈ {0, 1, 2}
and x_1, x_2 ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
For example, if t = 1.23 (i.e. row 1.2 and column 0.03) one has Φ_0(t) ≈ 0.8907.

t 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7290 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9779 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

One notes that Φ_0(t) = 0.9986 for t = 2.99. This comes from the fact that the
mass of the normal distribution is mainly concentrated on the interval [−3, 3], as
emphasized by the table of the "large" values hereafter (for instance, we observe that
P(|X| ≤ 4.5) ≥ 0.99999!).
t      3.0    3.1    3.2    3.3    3.4    3.5    3.6     3.8     4.0     4.5
Φ_0(t) .99865 .99904 .99931 .99952 .99966 .99976 .999841 .999928 .999968 .999997

12.2 Black–Scholes Formula(s) (To Compute Reference Prices)

In a risk-neutral Black–Scholes model, the quoted price of a risky asset is a solution
to the SDE dX_t = X_t (r dt + σ dW_t), X_0 = x_0 > 0, where r is the interest rate,
σ > 0 is the volatility and W is a standard Brownian motion. Itô's formula (see
Sect. 12.8) yields that

X_t^{x_0} = x_0 e^{(r − σ²/2)t + σW_t},  W_t =ᵈ √t · N(0; 1).

A vanilla (European) payoff of maturity T > 0 is of the form h_T = ϕ(X_T). A
European option contract written on the payoff h_T is the right to receive h_T at
the maturity T. Its price – or premium – at time t = 0 is given by e^{−rT} E ϕ(X_T^{x_0})
and, more generally, at time t ∈ [0, T], it is given by e^{−r(T−t)} E( ϕ(X_T^{x_0}) | X_t^{x_0} ) =
e^{−r(T−t)} E ϕ(X_{T−t}^{x}) |_{x = X_t^{x_0}}. In the case where ϕ(x) = (x − K)_+ (call with strike price K),
this premium at time t has a closed form given by

Call_t(x_0, K, r, σ, T) = Call_0(x_0, K, r, σ, T − t),

where
Call_0(x_0, K, r, σ, τ) = x_0 Φ_0(d_1) − e^{−rτ} K Φ_0(d_2),  τ > 0,  (12.1)

with
d_1 = ( log(x_0/K) + (r + σ²/2) τ ) / ( σ√τ ),  d_2 = d_1 − σ√τ.  (12.2)

As for the put option written on the payoff h_T = (K − X_T^{x_0})_+, the premium is

Put_t(x_0, K, r, σ, T) = Put_0(x_0, K, r, σ, T − t),

where
Put_0(x_0, K, r, σ, τ) = e^{−rτ} K Φ_0(−d_2) − x_0 Φ_0(−d_1).  (12.3)

The avatars of the regular Black–Scholes formulas can be obtained as follows:
• Stock without dividend (Black–Scholes): the risky asset is X.
• Stock with continuous yield λ > 0 of dividends: the risky asset is e^{λt} X_t and one
has to replace x_0 by e^{−λτ} x_0 in the right-hand sides of (12.1), (12.2) and (12.3).
• Foreign exchange (Garman–Kohlhagen): though X_t^{x_0} is quoted, the risky asset is
e^{r_F t} X_t, where r_F denotes the foreign interest rate, and one has to replace x_0 by
e^{−r_F τ} x_0 in the right-hand sides of (12.1), (12.2) and (12.3).
• Future contract (Black): the risky asset is the underlying asset of the future contract
(with maturity L > T), i.e. it is e^{r(L−t)} X_t, and one has to replace x_0 by e^{rτ} x_0 in the
right-hand sides of (12.1), (12.2) and (12.3).
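For reference-price computations, formulas (12.1)–(12.3) can be implemented in a few lines (a sketch under the model above; Φ_0 is evaluated here through the error function rather than the approximation of Sect. 12.1.2):

```python
import math

def phi0(x: float) -> float:
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def d1_d2(x0, K, r, sigma, tau):
    """d_1 and d_2 from (12.2)."""
    d1 = (math.log(x0 / K) + (r + sigma ** 2 / 2.0) * tau) / (sigma * math.sqrt(tau))
    return d1, d1 - sigma * math.sqrt(tau)

def call0(x0, K, r, sigma, tau):
    """Call premium (12.1)."""
    d1, d2 = d1_d2(x0, K, r, sigma, tau)
    return x0 * phi0(d1) - math.exp(-r * tau) * K * phi0(d2)

def put0(x0, K, r, sigma, tau):
    """Put premium (12.3)."""
    d1, d2 = d1_d2(x0, K, r, sigma, tau)
    return math.exp(-r * tau) * K * phi0(-d2) - x0 * phi0(-d1)

# Sanity check: call-put parity, Call_0 - Put_0 = x_0 - K e^{-r tau},
# with illustrative parameter values.
x0, K, r, sigma, tau = 100.0, 95.0, 0.03, 0.25, 1.0
gap = call0(x0, K, r, sigma, tau) - put0(x0, K, r, sigma, tau) \
      - (x0 - K * math.exp(-r * tau))
print(gap)   # ~0 up to round-off
```

The avatars above are then obtained by substituting x_0 as indicated, e.g. call0(math.exp(-lam * tau) * x0, K, r, sigma, tau) for a continuous dividend yield lam.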

12.3 Measure Theory

Theorem 12.1 (Baire σ-field Theorem) Let (S, d_S) be a metric space. Then

Bor(S, d_S) = σ( C(S, R) ),

where C(S, R) denotes the set of continuous functions from (S, d_S) to R. When S is
σ-compact (i.e. is a countable union of compact sets), one may replace the space
C(S, R) by the space C_K(S, R) of continuous functions with compact support.

Theorem 12.2 (Functional monotone class Theorem) Let (S, S) be a measurable
space. Let V be a vector space of real-valued bounded measurable functions defined
on (S, S). Let C be a subset of V, stable under the product of two functions. Assume
furthermore that V satisfies
(i) 1 ∈ V,
(ii) V is closed under uniform convergence,
(iii) V is closed under "bounded non-decreasing convergence": if ϕ_n ∈ V, n ≥ 1,
ϕ_n ≤ ϕ_{n+1}, |ϕ_n| ≤ K (real constant) and ϕ_n(x) → ϕ(x) for every x ∈ S, then
ϕ ∈ V.
Then V contains the vector subspace of all σ(C)-measurable bounded functions.

We refer to [216] for a proof of this result.

12.4 Uniform Integrability as a Domination Property

In this section, we present a brief background on uniform integrability for random
variables taking values in R^d. Mostly for notational convenience, all these random
variables are defined on the same probability space (Ω, A, P), although this is abso-
lutely not mandatory. We leave it as an exercise to check in what follows that each
random variable X_i can be defined on its own probability space (Ω_i, A_i, P_i) with a
straightforward adaptation of the statements.

Theorem 12.3 (Equivalent definitions of uniform integrability I) A family (X_i)_{i∈I}
of R^d-valued random vectors, defined on a probability space (Ω, A, P), is said to be
uniformly integrable if it satisfies one of the following equivalent properties:
(i) lim_{R→+∞} sup_{i∈I} E( |X_i| 1_{{|X_i|≥R}} ) = 0.
(ii) (α) sup_{i∈I} E |X_i| < +∞, and
(β) ∀ ε > 0, ∃ η = η(ε) > 0 such that,
∀ A ∈ A,  P(A) ≤ η ⟹ sup_{i∈I} ∫_A |X_i| dP ≤ ε.

Remarks • All norms being strongly equivalent on R^d, claims (i) and (ii) do not
depend on the selected norm on R^d.
• L¹-uniform integrability of a family of probability distributions (μ_i)_{i∈I} defined on
(R^d, Bor(R^d)) can be defined accordingly by

lim_{R→+∞} sup_{i∈I} ∫_{{|x|≥R}} |x| μ_i(dx) = 0.

All of the following can be straightforwardly "translated" in terms of probability
distributions. Thus, more generally, let f : R^d → R_+ be a non-zero Borel function
such that lim_{|x|→+∞} f(x) = +∞. A family of probability distributions (μ_i)_{i∈I} defined on
(R^d, Bor(R^d)) is f-uniformly integrable if

lim_{R→+∞} sup_{i∈I} ∫_{{f(x)≥R}} f(x) μ_i(dx) = 0.

Proof of Theorem 12.3. Assume first that (i) holds. It is clear that

sup_{i∈I} E |X_i| ≤ R + sup_{i∈I} E( |X_i| 1_{{|X_i|≥R}} ) < +∞

at least for large enough R ∈ (0, +∞). Now, for every i ∈ I and every A ∈ A,

∫_A |X_i| dP ≤ R P(A) + ∫_{{|X_i|≥R}} |X_i| dP.

Owing to (i) there exists a real number R = R(ε) > 0 such that sup_{i∈I} ∫_{{|X_i|≥R}} |X_i| dP ≤
ε/2. Then setting η = η(ε) = ε/(2R) yields (ii).
Conversely, for every real number R > 0, the Markov inequality implies

sup_{i∈I} P(|X_i| ≥ R) ≤ sup_{i∈I} E |X_i| / R.

Let η = η(ε) be given by (ii)(β). As soon as R > sup_{i∈I} E |X_i| / η, sup_{i∈I} P(|X_i| ≥ R) ≤ η
and (ii)(β) implies that

sup_{i∈I} E( |X_i| 1_{{|X_i|≥R}} ) ≤ ε,

which completes the proof. ♦

As a consequence, one easily derives that


P0. (X i )i∈I is uniformly integrable if and only if (|X i |)i∈I is.
P1. If X ∈ L 1 (P) then the family (X ) is uniformly integrable.
P2. If (X i )i∈I and (Yi )i∈I are two families of uniformly integrable Rd -valued random
vectors, then (X i + Yi )i∈I is uniformly integrable.
P3. If (X i )i∈I is a family of Rd -valued random vectors dominated by a uniformly
integrable family (Yi )i∈I of random variables in the sense that

∀ i ∈ I, |X i | ≤ Yi P-a.s.

then (X i )i∈I is uniformly integrable.

The four properties follow from characterization (i). To be precise, P1 is a con-
sequence of the Lebesgue Dominated Convergence Theorem, whereas P2 follows
from the elementary inequality

E( |X_i + Y_i| 1_{{|X_i+Y_i|≥R}} ) ≤ 2 E( |X_i| 1_{{|X_i|≥R/2}} ) + 2 E( |Y_i| 1_{{|Y_i|≥R/2}} ),

which holds since, on {|X_i + Y_i| ≥ R}, either |X_i| ≥ R/2 or |Y_i| ≥ R/2.

Now let us pass to a simple criterion of uniform integrability.

Corollary 12.2 (de La Vallée Poussin criterion) Let (X_i)_{i∈I} be a family of R^d-
valued random vectors defined on a probability space (Ω, A, P) and let Φ : R^d → R_+
satisfy lim_{|x|→+∞} Φ(x)/|x| = +∞. If

sup_{i∈I} E Φ(X_i) < +∞,

then the family (X_i)_{i∈I} is uniformly integrable.


Theorem 12.4 (Uniform integrability II) Let (X_n)_{n≥1} be a sequence of R^d-valued
random vectors defined on a probability space (Ω, A, P) and let X be an R^d-valued
random vector defined on the same probability space. If
(i) (X_n)_{n≥1} is uniformly integrable,
(ii) X_n → X in probability,
then
E |X_n − X| → 0 as n → +∞.

In particular, E X_n → E X. Moreover, the converse is true: L¹(P)-convergence
implies (i) and (ii).
Proof. One derives from (ii) the existence of a subsequence (X_{n′})_{n≥1} such that
X_{n′} → X a.s. Hence, by Fatou's Lemma,

E |X| ≤ liminf_{n′} E |X_{n′}| ≤ sup_{n} E |X_n| < +∞.

Consequently, X ∈ L¹(P) and, owing to P1 and P2, (X_n − X)_{n≥1} is a uniformly
integrable sequence. Now, for every integer n ≥ 1 and every M > 0,

E |X_n − X| ≤ E( |X_n − X| ∧ M ) + E( |X_n − X| 1_{{|X_n−X|≥M}} ).

The Lebesgue Dominated Convergence Theorem implies lim_n E( |X_n − X| ∧ M ) = 0,
so that

limsup_n E |X_n − X| ≤ lim_{M→+∞} sup_n E( |X_n − X| 1_{{|X_n−X|≥M}} ) = 0. ♦

Corollary 12.3 (L^p-uniform integrability) Let p ∈ [1, +∞). Let (X_n)_{n≥1} be a
sequence of R^d-valued random vectors defined on a probability space (Ω, A, P)
and let X be an R^d-valued random vector defined on the same probability space. If
(i) (X_n)_{n≥1} is L^p-uniformly integrable (i.e. (|X_n|^p)_{n≥1} is uniformly integrable),
(ii) X_n → X in probability,
then
‖X_n − X‖_p → 0 as n → +∞.

(In particular, E X_n → E X.) Moreover, the converse is true: L^p(P)-convergence
implies (i) and (ii).

Proof. By the same argument as above, X ∈ L^p(P), so that (|X_n − X|^p)_{n≥1} is uni-
formly integrable by P2 and P3. The result follows from the above theorem since
|X_n − X|^p → 0 in probability as n → +∞. ♦

12.5 Interchanging…

Theorem 12.5 (Interchanging continuity and expectation) (see e.g. [52], Chap. 8)
(a) Let (Ω, A, P) be a probability space, let I be a nontrivial interval of R and let
Φ : I × Ω → R be a Bor(I) ⊗ A-measurable function. Let x_0 ∈ I. If the function
Φ satisfies:
(i) for every x ∈ I, the random variable Φ(x, ·) ∈ L¹_R(P),
(ii) P(dω)-a.s., x ↦ Φ(x, ω) is continuous at x_0,
(iii) there exists Y ∈ L¹_{R_+}(P) such that for every x ∈ I,

P(dω)-a.s.  |Φ(x, ω)| ≤ Y(ω),

then the function ψ(x) := E Φ(x, ·) is defined at every x ∈ I and is continuous at x_0.
(b) The domination property (iii) in the above theorem can be replaced mutatis
mutandis by a uniform integrability assumption on the family (Φ(x, ·))_{x∈I}.

The same extension as Claim (b), based on uniform integrability, holds true for
the differentiation Theorem 2.2 (see the exercise following the theorem).

12.6 Weak Convergence of Probability Measures on a Polish Space

The main reference for this topic is [45]. See also [239].
The basic result of weak convergence theory is the so-called Portmanteau Theorem
stated below (the definition and notation of weak convergence of probability measures
are recalled in Sect. 4.1).

Theorem 12.6 Let (μ_n)_{n≥1} be a sequence of probability measures on a Polish (met-
ric) space (S, δ) equipped with its Borel σ-field S and let μ be a probability measure
on the same space. The following properties are equivalent:
(i) μ_n ⇒ μ as n → +∞.
(ii) For every open set O of (S, δ),

μ(O) ≤ liminf_n μ_n(O).

(ii′) For every bounded Lipschitz continuous function f : S → R,

lim_n ∫_S f dμ_n = ∫_S f dμ.

(iii) For every closed set F of (S, δ),

μ(F) ≥ limsup_n μ_n(F).

(iv) For every Borel set A ∈ S such that μ(∂A) = 0 (where ∂A = Ā \ Å is the
boundary of A),

lim_n μ_n(A) = μ(A).

(v) Weak Fatou Lemma: for every non-negative lower semi-continuous function
f : S → R_+,

0 ≤ ∫_S f dμ ≤ liminf_n ∫_S f dμ_n.

(vi) For every bounded Borel function f : S → R such that μ( Disc(f) ) = 0,

lim_n ∫_S f dμ_n = ∫_S f dμ,

where Disc(f) = { x ∈ S : f is not continuous at x }.
For a proof, we refer to [45], Chap. 1.
When dealing with unbounded functions, there is a kind of weak Lebesgue dom-
inated convergence theorem.
Proposition 12.2 Let (μ_n)_{n≥1} be a sequence of probability measures on a Polish
(metric) space (S, δ) weakly converging to μ.
(a) Let g : S → R_+ be a (non-negative) μ-integrable continuous function and let
f : S → R be a μ-a.s. continuous Borel function. If

0 ≤ |f| ≤ g  and  ∫_S g dμ_n → ∫_S g dμ as n → +∞,

then f ∈ L¹(μ) and ∫_S f dμ_n → ∫_S f dμ as n → +∞.
(b) The conclusion still holds if f is (μ_n)_{n≥1}-uniformly integrable, i.e.

lim_{R→+∞} sup_{n≥1} ∫_{{|f|≥R}} |f| dμ_n = 0.

Proof. Let R > 0 be such that μ(|f| = R) = μ(g = R) = 0. Set f_R = f 1_{{|f|≤R}} and g_R =
g 1_{{g≤R}}. The functions f_R and g_R are μ-a.s. continuous and bounded. Starting from

| ∫ f dμ_n − ∫ f dμ | ≤ | ∫ f_R dμ_n − ∫ f_R dμ | + ( ∫ g dμ_n − ∫ g_R dμ_n ) + ( ∫ g dμ − ∫ g_R dμ ),

we derive from the above theorem and the assumption made on g that, for every
such R,

limsup_n | ∫ f dμ_n − ∫ f dμ | ≤ 2 ∫ (g − g_R) dμ = 2 ∫ g 1_{{g>R}} dμ.

As there are at most countably many R which are μ-atoms for g and |f|, we may
let R go to infinity along such values, so that ∫ g 1_{{g>R}} dμ → 0 as R → +∞. This completes the
proof. ♦

If S = R^d, weak convergence is also characterized by the Fourier transform μ̂
defined on R^d by

μ̂(u) = ∫_{R^d} e^{ı̃(u|x)} μ(dx),  u ∈ R^d.
Proposition 12.3 Let (μ_n)_{n≥1} be a sequence of probability measures on (R^d,
Bor(R^d)) and let μ be a probability measure on the same space. Then

μ_n ⇒ μ  ⟺  ∀ u ∈ R^d, μ̂_n(u) → μ̂(u).

For a proof we refer, for example, to [156], or to any textbook presenting a first
course in Probability Theory.

Remark. The convergence in distribution of a sequence (X_n)_{n≥1} of random
variables taking values in a Polish space (S, S), defined on probability spaces
(Ω_n, A_n, P_n), is defined as the weak convergence of their distributions μ_n = P_n^{X_n} = P_n ∘ X_n^{−1}
on (S, S).

12.7 Martingale Theory

Theorem 12.7 Let (M_n)_{n≥0} be a square integrable discrete-time (F_n)-martingale
defined on a filtered probability space (Ω, A, P, (F_n)_{n≥0}).
(a) Then there exists a unique non-decreasing (F_n)-predictable process null at 0,
denoted by (⟨M⟩_n)_{n≥0}, such that

(M_n − M_0)² − ⟨M⟩_n  is an (F_n)-martingale.

This process reads

⟨M⟩_n = Σ_{k=1}^{n} E( (M_k − M_{k−1})² | F_{k−1} ).

(b) Set ⟨M⟩_∞ := lim_n ⟨M⟩_n. Then

M_n → M_∞ as n → +∞  on the event { ⟨M⟩_∞ < +∞ },

where M_∞ is a finite random variable. If, furthermore, E ⟨M⟩_∞ < +∞, then M_∞ ∈
L²(P).

We refer to [217] for a proof of this result.

Lemma 12.1 (Kronecker Lemma) Let (a_n)_{n≥1} be a sequence of real numbers and
let (b_n)_{n≥1} be a non-decreasing sequence of positive real numbers with lim_n b_n =
+∞. Then

( Σ_{n≥1} a_n/b_n converges in R as a series )  ⟹  ( (1/b_n) Σ_{k=1}^{n} a_k → 0 as n → +∞ ).
Proof. Set C_n = Σ_{k=1}^{n} a_k/b_k, n ≥ 1, C_0 = 0, and ΔC_k = C_k − C_{k−1},
Δb_k = b_k − b_{k−1} with the convention b_0 := 0. The assumption says that C_n → C_∞ =
Σ_{k≥1} a_k/b_k ∈ R as n → +∞. Now, for every n ≥ 1, b_n > 0 and

(1/b_n) Σ_{k=1}^{n} a_k = (1/b_n) Σ_{k=1}^{n} b_k ΔC_k
= (1/b_n) ( b_n C_n − Σ_{k=1}^{n} C_{k−1} Δb_k )
= C_n − (1/b_n) Σ_{k=1}^{n} Δb_k C_{k−1},

where we used an Abel transform for series. The result follows from the extended
Césaro theorem since Δb_k ≥ 0 and lim_n b_n = +∞. ♦
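A quick numerical illustration (a toy example of ours, not from the text): with a_k = cos(k) and b_k = k, the series Σ_k cos(k)/k converges (by Dirichlet's test, to −log(2 sin(1/2))), so the lemma predicts that the averages (1/n) Σ_{k≤n} cos(k) vanish:

```python
import math

# Kronecker Lemma check: a_k = cos(k), b_k = k.
# The series sum_k cos(k)/k converges, so (1/n) sum_{k<=n} cos(k) -> 0.
n = 100_000
series = sum(math.cos(k) / k for k in range(1, n + 1))   # converging series
cesaro = sum(math.cos(k) for k in range(1, n + 1)) / n   # should be ~0
print(series, cesaro)
```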

To establish Central Limit Theorems outside the "i.i.d." setting, we will rely on
Lindeberg's Central Limit Theorem for arrays of martingale increments (see Theorem
3.2 and its Corollary 3.1, p. 58 in [142]), stated below in a simple form involving
only one square integrable martingale.

Theorem 12.8 (Lindeberg’s Central Limit Theorem for martingale increments,


see [142]) Let (Mn )n≥1 be a square integrable martingale with respect to a filtration
(Fn )n≥1 and let (an )n≥1 be a non-decreasingsequence of positive real numbers going
to infinity as n goes to infinity.
If the following two conditions hold:
1 
n
(i) E (Mk )2 | Fk−1 −→ σ 2 ∈ [0, +∞) in probability,
an k=1
1 
n
(ii) ∀ ε > 0, E (Mk )2 1{|Mk |≥ε√an } | Fk−1 −→ 0 in probability,
an k=1
then
Mn L 
√ −→ N 0; σ 2 .
an

 Exercise. Derive from this result a d-dimensional theorem. [Hint: Consider a


linear combination of the d-dimensional martingale under consideration.]

12.8 Itô Formula for Itô Processes

12.8.1 Itô Processes

An R^d-valued stochastic process (X_t)_{t∈[0,T]} defined on a filtered probability space
(Ω, A, P, (F_t)_{t≥0}) of the form

X_t = X_0 + ∫_0^t K_s ds + ∫_0^t H_s dW_s,  t ∈ [0, T],  (12.4)

where
• X_0 is F_0-measurable,
• (H_t)_{t≥0} = ([H_t^{ij}]_{i=1:d, j=1:q})_{t≥0} and (K_t)_{t≥0} = ([K_t^i]_{i=1:d})_{t≥0} are (F_t)-
progressively measurable processes (1) having values in M(d, q, R) and R^d,
respectively,
• ∫_0^T |K_s| ds < +∞ P-a.s.,
• ∫_0^T ‖H_s‖² ds < +∞ P-a.s.,
• W is a q-dimensional (F_t)-standard Brownian motion, (2)
is called an Itô process (3).
Note that the processes K and H in (12.4) are P-a.s. essentially unique (since a
continuous (local) martingale null at zero with finite variation is indistinguishable
from the null process).
In particular, an Itô process is a local martingale (4) if and only if, P-a.s., K_t = 0
for every t ∈ [0, T].
If E ∫_0^t ‖H_s‖² ds = ∫_0^t E ‖H_s‖² ds < +∞, then the process ( ∫_0^t H_s dW_s )_{t≥0} is
an R^d-valued square integrable martingale with a bracket process

( ⟨ (∫_0^t H_s dW_s)^i, (∫_0^t H_s dW_s)^j ⟩ )_{1≤i,j≤d} = ( ∫_0^t (H_s H_s^*)^{ij} ds )_{1≤i,j≤d}

(H^* denotes here the transpose of H) and

1 A stochastic process (Y_t)_{t≥0} defined on (Ω, A, P) is (F_t)-progressively measurable if for every
t ∈ R_+, the mapping (s, ω) ↦ Y_s(ω) defined on [0, t] × Ω is Bor([0, t]) ⊗ F_t-measurable.
2 This means that W is (F_t)-adapted and, for every s, t ≥ 0, s ≤ t, W_t − W_s is independent of F_s.
3 The stochastic integral is defined by ( ∫_0^t H_s dW_s )^i = Σ_{j=1}^{q} ∫_0^t H_s^{ij} dW_s^j, 1 ≤ i ≤ d.
4 An (F_t)-adapted continuous process (M_t)_{t≥0} is a local martingale if there exists a sequence
(τ_n)_{n≥1} of (F_t)-stopping times, increasing to +∞, such that (M_{t∧τ_n} − M_0)_{t≥0} is an (F_t)-martingale
for every n ≥ 1.
(∫_0^t H_s dW_s)^i (∫_0^t H_s dW_s)^j − ⟨ (∫_0^t H_s dW_s)^i, (∫_0^t H_s dW_s)^j ⟩

are (P, F_t)-martingales for every i, j ∈ {1, ..., d}. Otherwise, these are only local
martingales.

12.8.2 The Itô Formula

The Itô formula, also known as Itô's Lemma, applies to this family of processes. Let
f ∈ C^{1,2}(R_+ × R^d, R) and let X be a diffusion process, i.e. an Itô process with
K_s = b(s, X_s) and H_s = σ(s, X_s). Then the process (f(t, X_t))_{t≥0} is still an Itô
process, reading

f(t, X_t) = f(0, X_0) + ∫_0^t ( ∂_t f(s, X_s) + L f(s, X_s) ) ds
+ ∫_0^t ( ∇_x f(s, X_s) | σ(s, X_s) dW_s ),  (12.5)

where
L f(t, x) = ( b(t, x) | ∇_x f(t, x) ) + (1/2) Tr( σ D²_x f σ^* )(t, x)

denotes the infinitesimal generator of the diffusion process.
For a comprehensive exposition of the theory of continuous-time martingales
and stochastic calculus, we refer to [162, 163, 249, 251, 256] among many other
references. For a more synthetic introduction with a view to applications to Finance,
we refer to [183].

12.9 Essential Supremum (and Infimum)

The essential supremum is an object which, for a.s. comparisons of random variables
defined on a probability space (Ω, A, P), plays a role similar to that played by the
regular supremum for real numbers on the real line.
Let us understand on a simple example why the regular supremum is not the right
notion in a framework where equalities and inequalities hold in an a.s. sense. First,
it is usually not measurable when taken over a non-countable infinite set of indices:
indeed, let (Ω, A, P) = ([0, 1], Bor([0, 1]), λ), where λ denotes the Lebesgue mea-
sure on [0, 1], let I ⊂ [0, 1] and X_i = 1_{{i}}, i ∈ I. Then sup_{i∈I} X_i = 1_I is measurable
if and only if I is a Borel subset of the interval [0, 1].
Furthermore, even when I = [0, 1], sup_{i∈I} X_i = 1 is certainly measurable but is
then constantly equal to 1, whereas X_i = 0 P-a.s. for every i ∈ I. We will see, how-
ever, that the notion of essential supremum defined below does satisfy esssup_{i∈I} X_i =
0 P-a.s. in this setting.

Proposition 12.4 (Essential supremum) (a) Let (Ω, A, P) be a probability space
and (X_i)_{i∈I} a family of random variables having values in R̄ = [−∞, +∞] defined
on (Ω, A). There exists a unique random variable on Ω, up to a P-a.s. equality,
denoted by P-esssup_{i∈I} X_i (or simply esssup_{i∈I} X_i when there is no ambiguity on the
underlying probability P), satisfying:
(i) ∀ i ∈ I,  X_i ≤ P-esssup_{i∈I} X_i P-a.s.,
(ii) for every random variable Z : (Ω, A) → R̄, if X_i ≤ Z P-a.s. for every i ∈ I,
then P-esssup_{i∈I} X_i ≤ Z P-a.s.
(b) A family (X_i)_{i∈I} of real-valued random variables defined on (Ω, A) is stable
under finite supremum (up to P-a.s. equality) if

∀ i, j ∈ I, ∃ k ∈ I such that X_k ≥ max(X_i, X_j) P-a.s.  (12.6)

If (X_i)_{i∈I} is stable under finite supremum, then there exists a sequence (X_{i_n})_{n∈N},
P-a.s. non-decreasing, such that

P-esssup_{i∈I} X_i = sup_{n∈N} X_{i_n} = lim_{n→+∞} X_{i_n} P-a.s.

(c) Let (Ω, P(Ω), P) be a finite probability space such that P({ω}) > 0 for every
ω ∈ Ω. Then sup_{i∈I} X_i is the unique essential supremum of the X_i.

We adopt the simplified notation esssup_{i∈I} (instead of P-esssup_{i∈I}) in the proof below
to alleviate notation.

Proof. (a) Let us begin with P-a.s. uniqueness. Assume there exists a random variable
Z satisfying (i) and (ii). Following (i) for Z and (ii) for esssup_{i∈I} X_i, we derive
that esssup_{i∈I} X_i ≤ Z P-a.s. Finally, (i) for esssup_{i∈I} X_i and (ii) for Z give the reverse
inequality.
As for the existence, first assume that the variables X_i are [0, 1]-valued. Set

M_X := sup { E( sup_{j∈J} X_j ), J ⊂ I, J countable } ∈ [0, 1].

This definition is consistent since, J being countable, sup_{j∈J} X_j is measurable and
[0, 1]-valued. As M_X ∈ [0, 1] is a finite real number, there exists a sequence (J_n)_{n≥1}
of countable subsets of I satisfying

E( sup_{j∈J_n} X_j ) ≥ M_X − 1/n.
Set

esssup_{i∈I} X_i := sup_{j∈J} X_j,  where J = ∪_{n≥1} J_n.

This defines a random variable since J is itself countable as the countable union of
countable sets. This yields

M_X ≥ E( esssup_{i∈I} X_i ) ≥ lim_n E( sup_{j∈J_n} X_j ) ≥ M_X,

whence M_X = E( esssup_{i∈I} X_i ). Then let i ∈ I be fixed. The subset J ∪ {i} of I is
countable, hence

E( sup_{j∈J∪{i}} X_j ) ≤ M_X = E( sup_{j∈J} X_j ).

Consequently,

E( sup_{j∈J∪{i}} X_j − sup_{j∈J} X_j ) ≤ 0,

where the integrand is ≥ 0, which implies sup_{j∈J∪{i}} X_j − sup_{j∈J} X_j = 0 P-a.s., i.e.

X_i ≤ esssup_{i∈I} X_i = sup_{j∈J} X_j P-a.s.

Moreover, if X_i ≤ Z P-a.s. for every i ∈ I, in particular this holds true for i ∈ J, so
that

esssup_{i∈I} X_i = sup_{j∈J} X_j ≤ Z P-a.s.

When the X_i have values in [−∞, +∞], one introduces an increasing bijection
Φ : R̄ → [0, 1] and sets

esssup_{i∈I} X_i := Φ^{−1}( esssup_{i∈I} Φ(X_i) ).

It is easy to check that, thus defined, esssup_{i∈I} X_i satisfies (i) and (ii). By uniqueness,
esssup_{i∈I} X_i does not depend P-a.s. on the selected function Φ.
(b) Following (a), there exists a sequence (j_n)_{n∈N} such that J = { j_n, n ≥ 0 } satisfies

esssup_{i∈I} X_i = sup_{n∈N} X_{j_n} P-a.s.

Now, owing to the stability property of the supremum (12.6), one may build a
sequence (i_n)_{n∈N} such that X_{i_0} = X_{j_0} and, for every n ≥ 1,

X_{i_n} ≥ max(X_{j_n}, X_{i_{n−1}}) P-a.s.

It is clear by induction that X_{i_n} ≥ max(X_{j_0}, ..., X_{j_n}), so that X_{i_n} increases toward
esssup_{i∈I} X_i, which completes the proof of this item.
(c) When Ω is endowed with the σ-field P(Ω) of all its subsets, the map-
ping sup_{i∈I} X_i is measurable. Now, sup_{i∈I} X_i obviously satisfies (i) and (ii) and is
subsequently P-a.s. equal to esssup_{i∈I} X_i. As the empty set is the only P-negligible set,
the former equality is true on the whole of Ω. ♦

We will not detail all the properties of the P-essential supremum, which are quite
similar to those of the regular supremum (up to an a.s. equality). Typically, one has,
with obvious notations and P-a.s.,

esssup_{i∈I} (X_i + Y_i) ≤ esssup_{i∈I} X_i + esssup_{i∈I} Y_i

and
∀ λ ≥ 0,  esssup_{i∈I} (λ X_i) = λ esssup_{i∈I} X_i,

etc.
As a conclusion, we may define the essential infimum by setting, with the notation
of the above proposition,

P-essinf_{i∈I} X_i = −P-esssup_{i∈I} (−X_i).  (12.7)

12.10 Halton Sequence Discrepancy (Proof of an Upper-Bound)

This section is devoted to the proof of Theorem 4.2. We focus on dimension 2, assuming that the one-dimensional result for the Vdc(p) sequence holds. This one-dimensional case can be established directly following the lines of the proof below: it is easier and does not require the Chinese Remainder Theorem. In fact, the general bound can be proved by induction on d by adapting the proof of the 2-dimensional setting hereafter.
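As a reminder of the objects involved, the radical inverse and the resulting two-dimensional Halton sequence can be sketched as follows (a minimal illustration; the function names are ours, not the book's):

```python
def radical_inverse(n, p):
    """Phi_p(n): reflect the base-p digits of n about the radix point."""
    x, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        x += digit * scale
        scale /= p
    return x

def halton_2d(n, p1=2, p2=3):
    """First n terms (xi_k^1, xi_k^2), k = 1..n, for mutually prime bases p1, p2."""
    return [(radical_inverse(k, p1), radical_inverse(k, p2)) for k in range(1, n + 1)]
```

For instance, 6 = 110 in base 2, so `radical_inverse(6, 2)` reflects the digits into 0.011 in base 2, i.e. 0.375; note that every term lies in (0, 1), as used below.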
Step 1. We deal with the two-dimensional case for simplicity: let p_1, p_2 be two mutually prime numbers and let ξ_n^i = Φ_{p_i}(n), i = 1, 2, n ≥ 1, where Φ_{p_i} is the radical inverse function associated with p_i. Note that ξ_n^i ∈ (0, 1) for every n ≥ 1 and every i = 1, 2. We set, for every integer n ≥ 1,


Δ_n(x, y) = Σ_{k=1}^n 1_{{ξ_k^1 ≤ x} ∩ {ξ_k^2 ≤ y}},   (x, y) ∈ [0, 1]²

and

Δ̃_n(x, y) = Δ_n(x, y) − n x y.


Let n ≥ 1 be a fixed integer and let r_i = ⌈log n / log p_i⌉, i = 1, 2. We consider the increasingly reordered n-tuples of both components, denoted by (ξ_k^{i,(n)})_{k=1,...,n}, i = 1, 2, with the two additional terms ξ_0^{i,(n)} = 0 and ξ_{n+1}^{i,(n)} = 1. Then, set

X_n^{1,2} = { (ξ_{k_1}^{1,(n)}, ξ_{k_2}^{2,(n)}), 0 ≤ k_1, k_2 ≤ n }.


Note that, for any pair (x, y) ∈ X_n^{1,2}, x and y read

x = x_1/p_1 + · · · + x_{r_1}/p_1^{r_1},   y = y_1/p_2 + · · · + y_{r_2}/p_2^{r_2}.


As Δ_n is constant over the rectangles [ξ_k^{1,(n)}, ξ_{k+1}^{1,(n)}) × [ξ_ℓ^{2,(n)}, ξ_{ℓ+1}^{2,(n)}), 0 ≤ k, ℓ ≤ n, the function Δ̃_n is decreasing on each such rectangle with respect to the componentwise partial order on [0, 1]² (the order induced by the inclusions of the boxes [[0, x]]). Furthermore, Δ̃_n is “right-continuous” in the sense that Δ̃_n(x, y) = lim_{(u,v)→(x,y), u>x, v>y} Δ̃_n(u, v). Moreover, as both sequences are (0, 1)-valued, Δ̃_n is continuous at (1, 1), Δ̃_n(1, 1) = 0 and, for x, y ∈ [0, 1),

Δ̃_n(x, 1) = lim_{(u,v)→(x,1), u>x, v<1} Δ̃_n(u, v),   Δ̃_n(1, y) = lim_{(u,v)→(1,y), u<1, v>y} Δ̃_n(u, v).

This shows that

sup_{(x,y)∈[0,1]²} Δ̃_n(x, y) = sup_{(x,y)∈[0,1)²} Δ̃_n(x, y) = max_{0≤k,ℓ≤n} Δ̃_n(ξ_k^{1,(n)}, ξ_ℓ^{2,(n)}) = max_{(x,y)∈X_n^{1,2}} Δ̃_n(x, y) ≥ 0.

Likewise, setting Δ_n(x−, y−) = lim_{(u,v)→(x,y), u<x, v<y} Δ_n(u, v), one shows that

inf_{(x,y)∈[0,1]²} Δ̃_n(x, y) = min_{1≤k,ℓ≤n+1} Δ̃_n(ξ_k^{1,(n)}−, ξ_ℓ^{2,(n)}−).


Note that, as ξ_k^2 ∈ (0, 1), k = 1, . . . , n,

Δ̃_n(1−, y−) = lim_{v→y, v<y} Δ̃_n(1, v) ≥ −n D_n*(ξ²)   and   Δ̃_n(x−, 1−) ≥ −n D_n*(ξ¹),

so that

inf_{(x,y)∈[0,1]²} Δ̃_n(x, y) ≥ min( min_{(x,y)∈X_n^{1,2}} Δ̃_n(x, y), −n D_n*(ξ¹), −n D_n*(ξ²) ).   (12.8)

Consequently,

sup_{(x,y)∈[0,1]²} |Δ̃_n(x, y)| ≤ max( n D_n*(ξ¹), n D_n*(ξ²), max_{(x,y)∈X_n^{1,2}} |Δ̃_n(x, y)| ).   (12.9)


Step 2. Let k ∈ {0, . . . , n} and let (x, y) ∈ X_n^{1,2}. Write k = k_0 + k_1 p_1 + · · · + k_{r_1} p_1^{r_1}, k_i ∈ {0, . . . , p_1 − 1}, k_{r_1} ≠ 0, in base p_1. Let us focus first on the inequality ξ_k^1 < x. It reads

ξ_k^1 = k_0/p_1 + · · · + k_{r_1}/p_1^{r_1+1} < x_1/p_1 + · · · + x_{r_1}/p_1^{r_1} = x,

which is equivalent to

∃ r ∈ {0, . . . , r_1} s.t. k_s = x_{s+1}, s = 0, . . . , r − 1, and k_r < x_{r+1}.

These conditions can be rewritten as

∃ r ∈ {0, . . . , r_1}, ∃ u_r ∈ {0, . . . , x_{r+1} − 1} s.t. k ≡ x_1 + x_2 p_1 + · · · + x_r p_1^{r−1} + u_r p_1^r  mod p_1^{r+1}.


Consequently, the joint conditions ξ_k^1 < x and ξ_k^2 < y read: there exist r ∈ {0, . . . , r_1}, s ∈ {0, . . . , r_2}, u_r ∈ {0, . . . , x_{r+1} − 1}, v_s ∈ {0, . . . , y_{s+1} − 1} such that

S_{r,s}(u_r, v_s) ≡ { k ≡ x_1 + · · · + x_r p_1^{r−1} + u_r p_1^r  mod p_1^{r+1}
                     k ≡ y_1 + · · · + y_s p_2^{s−1} + v_s p_2^s  mod p_2^{s+1}.

Since p_1^{r+1} and p_2^{s+1} are mutually prime, we know by the Chinese Remainder Theorem that the system S_{r,s}(u_r, v_s) has a unique solution in {0, . . . , p_1^{r+1} p_2^{s+1} − 1}, hence ⌊n/(p_1^{r+1} p_2^{s+1})⌋ + η_{r,s,u_r,v_s} solutions lying in {0, . . . , n}, with η_{r,s,u_r,v_s} ∈ {0, 1}. The solution k = 0 of the system S_{0,0}(0, 0) should be removed since it is not admissible. Consequently, as (x, y) ∈ X_n^{1,2},

Δ_n(x, y) = 1 + Σ_{k=1}^n 1_{{ξ_k^1 < x}} 1_{{ξ_k^2 < y}}
          = 1 − 1 + Σ_{r=0}^{r_1} Σ_{s=0}^{r_2} Σ_{u_r=0}^{x_{r+1}−1} Σ_{v_s=0}^{y_{s+1}−1} ( ⌊n/(p_1^{r+1} p_2^{s+1})⌋ + η_{r,s,u_r,v_s} )
          = Σ_{r=0}^{r_1} Σ_{s=0}^{r_2} x_{r+1} y_{s+1} ( n/(p_1^{r+1} p_2^{s+1}) + η_{r,s} ),

where η_{r,s} ∈ [−1, 1]. Owing to (12.8), (12.9) and the obvious fact that n x y = Σ_{0≤r≤r_1} Σ_{0≤s≤r_2} n x_{r+1} y_{s+1}/(p_1^{r+1} p_2^{s+1}), we derive that, for every (x, y) ∈ X_n^{1,2},

 
 
 
|Δ̃_n(x, y)| = | Σ_{r=0}^{r_1} Σ_{s=0}^{r_2} Σ_{u_r=0}^{x_{r+1}−1} Σ_{v_s=0}^{y_{s+1}−1} ( ⌊n/(p_1^{r+1} p_2^{s+1})⌋ − n/(p_1^{r+1} p_2^{s+1}) + η_{r,s,u_r,v_s} ) |,

where each term ⌊n/(p_1^{r+1} p_2^{s+1})⌋ − n/(p_1^{r+1} p_2^{s+1}) lies in [−1, 0] and each η_{r,s,u_r,v_s} in [0, 1], so that

|Δ̃_n(x, y)| ≤ Σ_{r=0}^{r_1} Σ_{s=0}^{r_2} x_{r+1} y_{s+1} ≤ (r_1 + 1)(r_2 + 1)(p_1 − 1)(p_2 − 1).
We conclude by noting that, on the one hand, r_i + 1 = ⌈log(p_i n)/log p_i⌉, i = 1, 2, and, on the other hand,

n D_n*(ξ) ≤ max( sup_{(x,y)∈[0,1]²} |Δ̃_n(x, y)|, n D_n*(ξ¹), n D_n*(ξ²) ).


Finally, following the above lines – in a simpler way – one shows that

n D_n*(ξ^i) ≤ (r_i + 1)(p_i − 1),   i = 1, 2.

This completes the proof since (r_i + 1)(p_i − 1) ≥ 1, i = 1, 2, and max(a, b) ≤ ab for every a, b ∈ N*. ♦
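The one-dimensional bound n D_n*(ξ^i) ≤ (r_i + 1)(p_i − 1) can be checked numerically. Below is a small sketch (our code, not the book's) using the exact star-discrepancy formula for a finite point set of [0, 1]:

```python
import math

def radical_inverse(n, p):
    """Phi_p(n): reflect the base-p digits of n about the radix point."""
    x, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        x += digit * scale
        scale /= p
    return x

def star_discrepancy_1d(points):
    """Exact star discrepancy of a finite set of points of [0,1]:
    D_n* = max_i max( i/n - x_(i), x_(i) - (i-1)/n ) over the sorted sample."""
    n = len(points)
    xs = sorted(points)
    return max(max((i + 1) / n - xs[i], xs[i] - i / n) for i in range(n))

def check_vdc_bound(p, n):
    """Compare n * D_n* of the VdC(p) sequence with (r+1)(p-1),
    where r + 1 = ceil(log(p*n)/log(p)) as in the proof above."""
    pts = [radical_inverse(k, p) for k in range(1, n + 1)]
    lhs = n * star_discrepancy_1d(pts)
    rhs = math.ceil(math.log(p * n) / math.log(p)) * (p - 1)
    return lhs, rhs
```

For instance, `check_vdc_bound(2, 100)` compares n D_n* with the bound (r+1)(p−1) = 8; in practice the left-hand side sits well below it.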

12.11 A Pitman–Yor Identity as a Benchmark

We aim at computing E cos(X_t²), where X² denotes the second component of a Clark–Cameron oscillator defined by (9.104), namely

X_t² = σ ∫_0^t (W_s¹ + μs) dW_s²,   t ∈ [0, T].

Conditional on the process (W_s¹)_{0≤s≤t}, X_t² has a centered Gaussian distribution with stochastic variance

σ² ∫_0^t (μs + W_s¹)² ds.


Let us compute E e^{ı̃ X_t²}. As W¹ and W² are independent,

E e^{ı̃ X_t²} = E e^{ı̃ σ ∫_0^t (μs+W_s¹) dW_s²} = E[ E( e^{ı̃ σ ∫_0^t (μs+w_s) dW_s²} )|_{w=W¹} ] = E e^{−(σ²/2) ∫_0^t (μs+W_s¹)² ds}.
We apply Girsanov's Theorem and make the change of probability dP* = e^{−μW_t¹ − (μ²/2)t} · dP, such that B_s = μs + W_s¹, s ∈ [0, t], is a P*-standard Brownian motion. Hence dP/dP* = e^{μB_t − (μ²/2)t}, which yields

E e^{ı̃ X_t²} = E*[ e^{μB_t − (μ²/2)t} e^{−(σ²/2) ∫_0^t B_s² ds} ]   (12.10)
             = E*[ e^{μB_t − (μ²/2)t} E*( e^{−(σ²/2) ∫_0^t B_s² ds} | B_t ) ]
             = e^{−(μ²/2)t} E*[ e^{μB_t} E*( e^{−(σ²/2) ∫_0^t B_s² ds} | (B_t)² ) ],   (12.11)

where we used in the third line that, as B has the same distribution as −B and ∫_0^t B_s² ds = ∫_0^t (−B_s)² ds,

E*( e^{−(σ²/2) ∫_0^t B_s² ds} | B_t ) = E*( e^{−(σ²/2) ∫_0^t B_s² ds} | (B_t)² ).   (12.12)

We recall now a formula established in [244] (p. 427):

E_x[ e^{−(σ²/2) ∫_0^t B_s² ds} | (B_t)² = y ] = √(σt/sinh σt) · e^{(y/(2t))(1 − σt coth σt)} · I_{−1/2}(σz/sinh σt) / I_{−1/2}(z/t),

with z = xy and where I_ν denotes the Bessel function which, for ν = −1/2, reads I_{−1/2}(z) = √(2/π) · cosh z/√z. Here x = 0, hence

E*[ e^{−(σ²/2) ∫_0^t (B_s)² ds} | (B_t)² = y ] = √(σt/sinh σt) · e^{(y/(2t))(1 − σt coth σt)}.

Plugging this into Equation (12.10) yields

E e^{ı̃ X_t²} = √(σt/sinh σt) · e^{−(μ²/2)t} E*[ e^{μB_t + (B_t²/(2t))(1 − σt coth σt)} ].

As B_t is N(0; t)-distributed,
E*[ e^{μB_t + (B_t²/(2t))(1 − σt coth σt)} ] = ∫_{−∞}^{+∞} e^{μx + (x²/(2t))(1 − σt coth σt)} e^{−x²/(2t)} dx/√(2πt)
                                             = (1/√t) ∫_{−∞}^{+∞} exp( −(1/2)(x² σ coth σt − 2μx) ) dx/√(2π).

We set a = √(σ coth σt) and b = μ/a, and we get

E*[ e^{μB_t + (B_t²/(2t))(1 − σt coth σt)} ] = (1/√t) e^{b²/2} ∫_{−∞}^{+∞} exp( −(ax − b)²/2 ) dx/√(2π) = e^{b²/2}/(a√t).

Hence,

E e^{ı̃ X_t²} = √(σt/sinh σt) · e^{−(μ²/2)t + b²/2} / (√t √(σ coth σt)) = e^{−(μ²t/2)(1 − tanh σt/(σt))} / √(cosh σt).

Since the right-hand side of the above equality has no imaginary part, we obtain the announced formula (9.105):

E cos(X_t²) = e^{−(μ²t/2)(1 − tanh σt/(σt))} / √(cosh σt).
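This closed formula is precisely what makes the Clark–Cameron oscillator a convenient benchmark: it can be confronted with a crude Monte Carlo estimator. Below is a minimal sketch (our code; the parameter values are arbitrary) exploiting the conditional Gaussian representation E cos(X_t²) = E e^{−(σ²/2)∫_0^t(μs+W_s¹)² ds}, so that only the first Brownian component needs to be simulated:

```python
import math, random

def closed_form(t, sigma, mu):
    """Right-hand side of the identity (9.105) established above."""
    st = sigma * t
    return math.exp(-0.5 * mu**2 * t * (1.0 - math.tanh(st) / st)) / math.sqrt(math.cosh(st))

def monte_carlo(t, sigma, mu, n_steps=200, n_paths=10000, seed=12345):
    """Crude MC estimator of E exp(-(sigma^2/2) * int_0^t (mu*s + W_s)^2 ds),
    with a left-point quadrature of the time integral."""
    rng = random.Random(seed)
    h = t / n_steps
    sqh = math.sqrt(h)
    acc = 0.0
    for _ in range(n_paths):
        w, integral = 0.0, 0.0
        for k in range(n_steps):
            integral += (mu * k * h + w) ** 2 * h  # left-point rule at s = k*h
            w += sqh * rng.gauss(0.0, 1.0)        # Brownian increment
        acc += math.exp(-0.5 * sigma**2 * integral)
    return acc / n_paths
```

With t = σ = μ = 1, `closed_form` gives ≈ 0.7146 and the Monte Carlo estimate agrees to about two decimal places (up to the time-discretization bias and the statistical error).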

Remark. Note that these computations are shortened when there is no drift term, i.e.

X_t¹ = W_t¹,   X_t² = σ ∫_0^t X_s¹ dW_s².


Indeed, owing to the Cameron–Martin formula (see [251] p. 445),

E e^{−σ ∫_0^1 (W_s¹)² ds} = ( cosh √(2σ) )^{−1/2},

and the scaling property of the Brownian motion yields

E e^{ı̃ X_t²} = E e^{−(σ²/2) ∫_0^t (W_s¹)² ds} = E e^{−(σ²t²/2) ∫_0^1 (W_s¹)² ds} = (cosh σt)^{−1/2}.

Exercise. Prove in detail the identity (12.12).


Bibliography

1. M. Abramowitz, I.A. Stegun, Handbook of Mathematical Functions (National Bureau of


Standards, Washington, 1964), pp. xiv+1046
2. Y. Achdou, O. Pironneau, Computational Methods for Option Pricing, Collection Frontiers
in Applied Mathematics. Society for Industrial and Applied Mathematics, vol. 30 (SIAM,
Philadelphia, 2005), pp. xviii+297. ISBN: 0-89871-573-3
3. A. Alfonsi, High order discretization schemes for the CIR process: application to affine term
structure and Heston models. Math. Comput. 79(269), 209–237 (2010)
4. A. Alfonsi, Affine Diffusions and Related Processes: Simulation, Theory and Applications.
Bocconi and Springer Series, vol. 6 (Springer, Cham; Bocconi University Press, Milan, 2015),
pp. xiv+252
5. A. Alfonsi, B. Jourdain, A. Kohatsu-Higa, Pathwise optimal transport bounds between a one-
dimensional diffusion and its Euler scheme. Ann. Appl. Probab. 24(3), 1049–1080 (2014)
6. A. Al Gerbi, B. Jourdain, E. Clément, Ninomiya-Victoir scheme: strong convergence, anti-
thetic version and application to multilevel estimators. Monte Carlo Methods Appl. 22(3),
197–228 (2016)
7. L. Andersen, Simple and efficient simulation of the Heston stochastic volatility model. J. Comput. Financ. 11(3), 1–42 (2008)
8. L. Andersen, M. Broadie, Primal-Dual simulation algorithm for pricing multi-dimensional
American options. Manag. Sci. 50(9), 1222–1234 (2004)
9. P. Andersson, A. Kohatsu-Higa, Exact simulation of stochastic differential equations using
parametrix expansions. Bernoulli 23(3), 2028–2057 (2017)
10. I.A. Antonov, V.M. Saleev, An economic method of computing LPτ-sequences. Zh. Vychisl. Mat. Mat. Fiz. 19, 243–245 (1979); English translation: U.S.S.R. Comput. Maths. Math. Phys. 19, 252–256 (1979)
11. D.G. Aronson, Bounds for the fundamental solution of a parabolic equation. Bull. Am. Math.
Soc. 73, 890–903 (1967)
12. B. Arouna, Adaptive Monte Carlo method, a variance reduction technique. Monte Carlo
Methods Appl. 10(1), 1–24 (2004)
13. M.D. Ašić, D.D. Adamović, Limit points of sequences in metric spaces. Am. Math. Monthly
77(6), 613–616 (1970)
14. S. Asmussen, J. Rosinski, Approximations of small jumps of Lévy processes with a view
towards simulation. J. Appl. Probab. 38, 482–493 (2001)
15. P. Baldi, Exact asymptotics for the probability of exit from a domain and applications to
simulation. Ann. Probab. 23(4), 1644–1670 (1995)

© Springer International Publishing AG, part of Springer Nature 2018 563


G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0

16. V. Bally, An elementary introduction to Malliavin calculus, technical report (Inria) (2003),
https://hal.inria.fr/inria-00071868/document
17. V. Bally, The central limit theorem for a non-linear algorithm based on quantization. Proc. R.
Soc. 460, 221–241 (2004)
18. V. Bally, A. Kohatsu-Higa, A probabilistic interpretation of the parametrix method. Ann.
Appl. Probab. 25(6), 3095–3138 (2015)
19. V. Bally, G. Pagès, J. Printems, A stochastic quantization method for non-linear problems.
Monte Carlo Methods Appl. 7(1), 21–34 (2001)
20. V. Bally, G. Pagès, A quantization algorithm for solving discrete time multi-dimensional
optimal stopping problems. Bernoulli 9(6), 1003–1049 (2003)
21. V. Bally, G. Pagès, Error analysis of the quantization algorithm for obstacle problems. Stoch.
Process. Appl. 106(1), 1–40 (2003)
22. V. Bally, G. Pagès, J. Printems, First order schemes in the numerical quantization method.
Math. Financ. 13(1), 1–16 (2003)
23. V. Bally, G. Pagès, J. Printems, A quantization tree method for pricing and hedging multi-
dimensional American options. Math. Financ. 15(1), 119–168 (2005)
24. V. Bally, D. Talay, The distribution of the Euler scheme for stochastic differential equations:
I. Convergence rate of the distribution function. Probab. Theory Relat. Fields 104(1), 43–60
(1996)
25. V. Bally, D. Talay, The law of the Euler scheme for stochastic differential equations. II.
Convergence rate of the density. Monte Carlo Methods Appl. 2(2), 93–128 (1996)
26. C. Barrera-Esteve, F. Bergeret, C. Dossal, E. Gobet, A. Meziou, R. Munos, D. Reboul-Salze,
Numerical methods for the pricing of Swing options: a stochastic control approach. Methodol.
Comput. Appl. Probab. 8(4), 517–540 (2006). https://doi.org/10.1007/s11009-006-0427-8
27. O. Bardou, S. Bouthemy, G. Pagès, Pricing swing options using optimal quantization. Appl.
Math. Financ. 16(2), 183–217 (2009)
28. O. Bardou, S. Bouthemy, G. Pagès, When are Swing options bang-bang? Int. J. Theor. Appl.
Financ. 13(6), 867–899 (2010)
29. O. Bardou, N. Frikha, G. Pagès, Computing VaR and CVaR using stochastic approximation
and adaptive unconstrained importance sampling. Monte Carlo Appl. J. 15(3), 173–210 (2009)
30. J. Barraquand, D. Martineau, Numerical valuation of high dimensional multivariate American
securities. J. Financ. Quant. Anal. 30, 383–405 (1995)
31. D. Bauer, A. Reuss, D. Singer, On the calculation of the solvency capital requirement based
on nested simulations. Astin Bull. 42(2), 453–499 (2012)
32. D. Belomestny, T. Nagapetyan, Multilevel path simulation for weak approximation schemes
with application to Lévy-driven SDEs. Bernoulli 23(2), 927–950 (2017)
33. M. Benaïm, Dynamics of stochastic approximation algorithms, in Séminaire de Probabilités
XXXIII, ed. by J. Azéma, M. Émery, M. Ledoux, M. Yor. LNM, vol. 1709 (1999), pp. 1–68
34. M. Ben Alaya, A. Kebaier, Central limit theorem for the multilevel Monte Carlo Euler method.
Ann. Appl. Probab. 25(1), 211–234 (2015)
35. M. Ben Alaya, A. Kebaier, Multilevel Monte Carlo for Asian options and limit theorems.
Monte Carlo Methods Appl. 20(3), 181–194 (2014)
36. C. Bender, Dual pricing of multi-exercise options under volume constraints. Financ. Stoch.
15(1), 1–26 (2011)
37. C. Bender, C. Gärtner, N. Schweizer, Iterative improvement of lower and upper bounds for
backward SDEs. SIAM J. Sci. Comput. 39(2), B442–B466 (2017)
38. C. Bender, J. Schoenmakers, J. Zhang, Dual representations for general multiple stopping
problems. Math. Financ. 25(2), 339–370 (2015)
39. A. Benveniste, M. Métivier, P. Priouret, Algorithmes Adaptatifs et Approximations Stochas-
tiques (Masson, Paris, 1987), pp. 367 (Adaptive Algorithms and Stochastic Approximations
(Springer, Berlin, English version, 1993))
40. J. Bertoin, Lévy Processes. Cambridge tracts in Mathematics, vol. 121 (Cambridge University
Press, Cambridge, 1996), pp. 262

41. A. Berkaoui, M. Bossy, A. Diop, Euler scheme for SDEs with non-Lipschitz diffusion coef-
ficient: strong convergence. ESAIM Probab. Stat. 12, 1–11 (2008)
42. A. Beskos, O. Papaspiliopoulos, G.O. Roberts, Retrospective exact simulation of diffusion
sample paths with applications. Bernoulli 12(6), 1077–1098 (2006)
43. A. Beskos, G.O. Roberts, Exact simulation of diffusions. Ann. Appl. Probab. 15(4), 2422–
2444 (2005)
44. P. Billingsley, Probability and Measure (First edition, 1979), 3rd edn. Wiley Series in Prob-
ability and Mathematical Statistics (A Wiley–Interscience Publication, Wiley, New York,
1995), pp. xiv+593
45. P. Billingsley, Convergence of Probability Measures, 1st edn. (Second edition 1999, 277pp.)
(Wiley, New York, 1968), pp. 253
46. J.-P. Borel, G. Pagès, Y.J. Xiao, Suites à discrépance faible et intégration numérique, in Probabilités Numériques, ed. by N. Bouleau, D. Talay (Coll. didactique, INRIA, 1992). ISBN-10: 2726107087
47. B. Bouchard, I. Ekeland, N. Touzi, On the Malliavin approach to Monte Carlo approximation
of conditional expectations. Financ. Stoch. 8(1), 45–71 (2004)
48. N. Bouleau, D. Lépingle, Numerical Methods for Stochastic Processes. Wiley Series in Prob-
ability and Mathematical Statistics: Applied Probability and Statistics (A Wiley-Interscience
Publication; Wiley, New York, 1994), pp. 359
49. C. Bouton, Approximation gaussienne d'algorithmes stochastiques à dynamique markovienne. Annales de l'I.H.P. Probabilités et Statistiques 24(1), 131–155 (1988)
50. P.P. Boyle, Options: a Monte Carlo approach. J. Financ. Econ. 4(3), 323–338 (1977)
51. O. Brandière, M. Duflo, Les algorithmes stochastiques contournent-ils les pièges? (French)
[Do stochastic algorithms go around traps?]. Ann. Inst. H. Poincaré Probab. Stat. 32(3), 395–
427 (1996)
52. M. Briane, G. Pagès, Théorie de l'intégration, convolution et transformée de Fourier, Cours et Exercices, 6th edn. (Vuibert, Paris, 2015), pp. 400
53. M. Broadie, Y. Du, C.C. Moallemei, Efficient risk estimation via nested sequential simulation.
Manag. Sci. 57(6), 1172–1194 (2011)
54. R. Buche, H.J. Kushner, Rate of convergence for constrained stochastic approximation algo-
rithm. SIAM J. Control Optim. 40(4), 1001–1041 (2002)
55. K. Bujok, B. Hambly, C. Reisinger, Multilevel simulation of functionals of Bernoulli random
variables with application to basket credit derivatives. Methodol. Comput. Appl. Probab.
17(3), 579–604 (2015)
56. S. Burgos, M. Giles, Computing Greeks using multilevel path simulation, in Monte Carlo and
Quasi-Monte Carlo Methods 2010 (Springer, Berlin, 2012), pp. 281–296
57. G. Callegaro, L. Fiorin, M. Grasselli, Quantized calibration in local volatility, in Risk (Cut-
ting Hedge: Derivatives Pricing), (2015), pp. 56–67, https://www.risk.net/risk-magazine/
technical-paper/2402156/quantized-calibration-in-local-volatility
58. G. Callegaro, L. Fiorin, M. Grasselli, Pricing via quantization in stochastic volatility models.
Quant. Financ. 17(6), 855–872 (2017)
59. L. Carassus, G. Pagès, Finance de marché, Modèles mathématiques à temps discret (Vuibert,
Paris, 2016), pp. xii+ 385. ISBN 978-2-311-40136-3
60. J.F. Carrière, Valuation of the early-exercise price for options using simulations and nonpara-
metric regression. Insur. Math. Econ. 19, 19–30 (1996)
61. N. Chen, Y. Liu, Estimating expectations of functionals of conditional expectations via multilevel nested simulation, presentation at the MCQMC'12 Conference, Sydney (2012)
62. D.-Y. Cheng, A. Gersho, B. Ramamurthi, Y. Shoham, Fast search algorithms for vector quantization and pattern matching. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1, 9.11.1–9.11.4 (1984)
63. K.L. Chung, An estimate concerning the Kolmogorov limit distribution. Trans. Am. Math.
Soc. 67, 36–50 (1949)
64. É. Clément, A. Kohatsu-Higa, D. Lamberton, A duality approach for the weak approximation
of stochastic differential equations. Ann. Appl. Probab. 16(3), 1124–1154 (2006)

65. É. Clément, D. Lamberton, P. Protter, An analysis of a least squares regression method for
American option pricing. Financ. Stoch. 6(2), 449–471 (2002)
66. S. Cohen, M.M. Meerschaert, J. Rosinski, Modeling and simulation with operator scaling.
Stoch. Process. Appl. 120(12), 2390–2411 (2010)
67. P. Cohort, Sur quelques problèmes de quantification, thèse de l’Université Paris 6, Paris (2000),
pp. 187
68. P. Cohort, Limit theorems for random normalized distortion. Ann. Appl. Probab. 14(1), 118–
143 (2004)
69. L. Comtet, Advanced Combinatorics. The Art of Finite and Infinite Expansions. Revised and enlarged edition (D. Reidel Publishing Co., Dordrecht, 1974), pp. xi+343. ISBN: 90-277-0441-4
70. S. Corlay, G. Pagès, Functional quantization-based stratified sampling methods. Monte Carlo
Methods Appl. J. 21(1), 1–32 (2015). https://doi.org/10.1515/mcma-2014-0010
71. R. Cranley, T.N.L. Patterson, Randomization of number theoretic methods for multiple inte-
gration. SIAM J. Numer. Anal. 13(6), 904–914 (1976)
72. D. Dacunha-Castelle, M. Duflo, Probabilités et Statistique II : Problèmes à temps mobile
(Masson, Paris, 1986). English version: Probability and Statistics II (Translated by D. M,
McHale) (Springer, New York, 1986), pp. 290
73. R.B. Davis, D.S. Harte, Tests for Hurst effect. Biometrika 74, 95–101 (1987)
74. S. Dereich, F. Heidenreich, A multilevel Monte Carlo algorithm for Lévy-driven stochastic
differential equations. Stoch. Process. Appl. 121, 1565–1587 (2011)
75. S. Dereich, S. Li, Multilevel Monte Carlo for Lévy-driven SDEs: central limit theorems for
adaptive Euler schemes. Ann. Appl. Probab. 26(1), 136–185 (2016)
76. L. Devineau, S. Loisel, Construction d'un algorithme d'accélération de la méthode des simulations dans les simulations pour le calcul du capital économique Solvabilité II. Bull. Français d'Actuariat, Institut des Actuaires 10(17), 188–221 (2009)
77. L. Devroye, Non Uniform Random Variate Generation (Springer, New York, 1986), 843pp.
First edition available at http://www.eirene.de/Devroye.pdf or http://www.nrbook.com/
devroye/
78. J. Dick, Higher order scrambled digital nets achieve the optimal rate of the root mean square
error for smooth integrands. Ann. Stat. 39(3), 1372–1398 (2011)
79. Q. Du, V. Faber, M. Gunzburger, Centroidal Voronoi tessellations: applications and algorithms.
SIAM Rev. 41, 637–676 (1999)
80. Q. Du, M. Emelianenko, L. Ju, Convergence of the Lloyd algorithm for computing centroidal
Voronoi tessellations. SIAM J. Numer. Anal. 44, 102–119 (2006)
81. M. Duflo, Algorithmes stochastiques, coll. Mathématiques et Applications, vol. 23 (Springer, Berlin, 1996), pp. 319
82. M. Duflo, Random Iterative Models, translated from the 1990 French original by Stephen S.
Wilson and revised by the author. Applications of Mathematics (New York), vol. 34 (Springer,
Berlin, 1996), pp. 385
83. D. Egloff, M. Leippold, Quantile estimation with adaptive importance sampling. Ann. Stat.
38(2), 1244–1278 (2010)
84. N. El Karoui, J.P. Lepeltier, A. Millet, A probabilistic approach of the réduite in optimal
stopping. Probab. Math. Stat. 13(1), 97–121 (1988)
85. M. Emelianenko, L. Ju, A. Rand, Nondegeneracy and weak global convergence of the Lloyd
algorithm in Rd . SIAM J. Numer. Anal. 46(3), 1423–1441 (2008)
86. P. Étoré, B. Jourdain, Adaptive optimal allocation in stratified sampling methods. Methodol.
Comput. Appl. Probab. 12(3), 335–360 (2010)
87. M. Fathi, N. Frikha, Transport-entropy inequalities and deviation estimates for stochastic
approximations schemes. Electron. J. Probab. 18(67), 36 (2013)
88. H. Faure, Suite à discrépance faible dans T^s, technical report, University of Limoges (France, 1986)
89. H. Faure, Discrépance associée à un système de numération (en dimension s). Acta Arith. 41,
337–361 (1982)

90. O. Faure, Simulation du mouvement brownien et des diffusions, Thèse de l'ENPC (Marne-la-Vallée, France, 1992), pp. 133
91. L. Fiorin, G. Pagès, A. Sagna, Markovian and product quantization of an Rd -valued Euler
scheme of a diffusion process with applications to finance, to appear in Methodology and
Computing in Applied Probability (2017), arXiv:1511.01758v3
92. H. Föllmer, A. Schied, in Stochastic Finance, An Introduction in Discrete Time (2nd edn. in
2004). De Gruyter Studies in Mathematics, vol. 27 (De Gruyter, Berlin, 2002), pp. 422
93. G. Fort, É. Moulines, A. Schreck, M. Vihola, Convergence of Markovian stochastic approxi-
mation with discontinuous dynamics. SIAM J. Control Optim. 54(2), 866–893 (2016)
94. J.C. Fort, G. Pagès, Convergence of stochastic algorithms: from the Kushner & Clark Theorem
to the Lyapunov functional. Adv. Appl. Probab. 28, 1072–1094 (1996)
95. J.C. Fort, G. Pagès, Decreasing step stochastic algorithms: a.s. behavior of weighted empirical
measures. Monte Carlo Methods Appl. 8(3), 237–270 (2002)
96. J.C. Fort, G. Pagès, Asymptotics of optimal quantizers for some scalar distributions. J. Comput.
Appl. Math. 146(2), 253–275 (2002)
97. N. Fournier, A. Guillin, On the rate of convergence in Wasserstein distance of the empirical
measure. Probab. Theory Relat. Fields 162(3–4), 707–738 (2015)
98. A. Friedman, Stochastic Differential Equations and Applications Volume 1–2. Probability and
Mathematical Statistics, vol. 28 (Academic Press [Harcourt Brace Jovanovich, Publishers],
New York, 1975), pp. 528
99. J.H. Friedman, J.L. Bentley, R.A. Finkel, An algorithm for finding best matches in logarithmic
expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
100. N. Frikha, S. Menozzi, Concentration bounds for stochastic approximations. Electron. Commun. Probab. 17(47), 1–15 (2012)
101. S. Gadat, F. Panloup, Optimal non-asymptotic bound of the Ruppert-Polyak averaging without
strong convexity, pre-print, (2017), arXiv:1709.03342v1
102. J.G. Gaines, T. Lyons, Random generation of stochastic area integrals. SIAM J. Appl. Math.
54(4), 1132–1146 (1994)
103. S.B. Gelfand, S.K. Mitter, Recursive stochastic algorithms for global optimization in Rd .
SIAM J. Control Optim. 29, 999–1018 (1991)
104. S.B. Gelfand, S.K. Mitter, Metropolis-type annealing algorithms for global optimization in
Rd . SIAM J. Control Optim. 31, 111–131 (1993)
105. A. Gersho, R.M. Gray, Special issue on quantization, I-II (A. Gersho and R.M. Gray eds.)
IEEE Trans. Inf. Theory 28, 139–148 (1988)
106. A. Gersho, R.M. Gray, Vector Quantization and Signal Compression (Kluwer, Boston, 1992),
pp. 732
107. M.B. Giles, Multilevel Monte Carlo path simulation. Oper. Res. 56(3), 607–617 (2008)
108. M.B. Giles, Vibrato Monte Carlo Sensitivities, Monte Carlo and quasi-Monte Carlo Methods
2008 (Springer, Berlin, 2009), pp. 369–382
109. M.B. Giles, Multilevel Monte Carlo methods. Acta Numer. 24, 259–328 (2015)
110. M.B. Giles, L. Szpruch, Antithetic multilevel Monte Carlo estimation for multi-dimensional
SDEs, in Monte Carlo and quasi-Monte Carlo methods 2012. Springer Proceedings in Math-
ematics and Statistics, vol. 65, (Springer, Heidelberg, 2013), pp. 367–384
111. M.B. Giles, L. Szpruch, Multilevel Monte Carlo methods for applications in finance, in Recent
Developments in Computational Finance, Interdisciplinary Mathematics Sciences, vol. 14
(World Scientific Publishing, Hackensack, 2013), pp. 3–47
112. D. Giorgi, V. Lemaire, G. Pagès, Limit theorems for weighted and regular Multilevel estima-
tors. Monte Carlo Methods Appl. 23(1), 43–70 (2017)
113. D. Giorgi, Théorèmes limites pour estimateurs Multilevel avec et sans poids. Comparaisons et applications, PhD thesis, UPMC (2017), pp. 123 (cf. [68])
114. D. Giorgi, V. Lemaire, G. Pagès, Weak error for nested Multilevel Monte Carlo, (2018),
arXiv:1806.07627.
115. P. Glasserman, Monte Carlo Methods in Financial Engineering (Springer, New York, 2003),
pp. 596

116. P. Glasserman, P. Heidelberger, P. Shahabuddin, Asymptotically optimal importance sampling


and stratification for pricing path-dependent options. Math. Financ. 9(2), 117–152 (1999)
117. P. Glasserman, D.D. Yao, Some guidelines and guarantees for common random numbers.
Manag. Sci. 36(6), 884–908 (1992)
118. P. Glasserman, B. Yu, Number of paths versus number of basis functions in American option
pricing. Ann. Appl. Probab. 14(4), 2090–2119 (2004)
119. P. Glynn, Optimization of stochastic systems via simulation, in Proceedings of the 1989 Winter
Simulation Conference, Society for Computer Simulation, San Diego (1989), pp. 90–105
120. E. Gobet, Weak approximation of killed diffusion using Euler schemes. Stoch. Proc. Appl.
87, 167–197 (2000)
121. E. Gobet, Revisiting the Greeks for European and American options, in Stochastic Processes
and Applications to Math. Finance (Proceedings of the Ritsumeikan International Symposium,
Kusatsu, Shiga, Japan 5–9 March 2003), ed. by S. Watanabe, J. Akahori, S. Ogawa (2003),
pp. 53–71. ISBN-13: 978-9812387783
122. E. Gobet, Advanced Monte Carlo methods for barrier and related exotic options, in Mathe-
matical Modeling and Numerical Methods in Finance, Handbook of Numerical Analysis, vol.
XV (North-Holland, Special Volume, Elsevier, Netherlands, 2009), pp. 497–528
123. E. Gobet, S. Menozzi, Exact approximation rate of killed hypo-elliptic diffusions using the
discrete Euler scheme. Stoch. Process. Appl. 112(2), 201–223 (2004)
124. E. Gobet, S. Menozzi, Stopped diffusion processes: boundary corrections and overshoot.
Stoch. Process. Appl. 120(2), 30–162 (2010)
125. E. Gobet, A. Kohatsu-Higa, Computation of Greeks for barrier and look-back options using
Malliavin calculus. Electron. Commun. Probab. 8, 51–62 (2003)
126. E. Gobet, R. Munos, Sensitivity analysis using Itô-Malliavin calculus and martingales. Appli-
cation to optimal control. SIAM J. Control Optim. 43(5), 1676–1713 (2005)
127. E. Gobet, G. Pagès, H. Pham, J. Printems, Discretization and simulation for a class of SPDE’s
with applications to Zakai and McKean-Vlasov equations. SIAM J. Numer. Anal. 44(6),
2505–2538 (2007)
128. M.B. Gordy, S. Juneja, Nested simulation in portfolio risk measurement. Manag. Sci. 56(10),
1833–1848 (2010)
129. S. Graf, H. Luschgy, Foundations of Quantization for Probability Distributions. LNM, vol.
1730 (Springer, Berlin, 2000), pp. 230
130. S. Graf, H. Luschgy, G. Pagès, Optimal quantizers for Radon random vectors in a Banach
space. J. Approx. 144, 27–53 (2007)
131. S. Graf, H. Luschgy, G. Pagès, Distortion mismatch in the quantization of probability mea-
sures. ESAIM P&S 12, 127–154 (2008)
132. S. Graf, H. Luschgy, G. Pagès, The local quantization behavior of absolutely continuous
probabilities. Ann. Probab. 40(4), 1795–1828 (2012)
133. C. Graham, D. Talay, Stochastic Simulation and Monte Carlo Methods. Mathematical Foundations of Stochastic Simulation. Stochastic Modelling and Applied Probability, vol. 68 (Springer, Heidelberg, 2013), pp. xvi+260
134. C. Graham, D. Talay, Analysis and Simulation of Hypoelliptic, Stopped, Reflected and Ergodic Diffusions, Mathematical Foundations of Stochastic Simulation. Stochastic Modelling and Applied Probability (Springer, Berlin, 2017) (to appear)
135. A. Griewank, On automatic differentiation, in Mathematical Programming: Recent Develop-
ments and Applications, ed. by M. Iri, K. Tanabe (Kluwer Academic Publishers, Dordrecht,
1989), pp. 83–108
136. A. Griewank, A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic
Differentiation. Frontiers in Applied Mathematics (SIAM, Philadelphia, 2008), pp. xxi+426.
ISBN 978-0-89871-659-7
137. A. Griewank, A. Walther, ADOL-C: A Package for the Automatic Differentiation of Algorithm
Written in C/C++ (University of Paderborn, Germany, 2010)
138. J. Guyon, Euler scheme and tempered distributions. Stoch. Process. Appl. 116(6), 877–904
(2006)

139. M. Hairer, On Malliavin’s proof of Hörmander’s theorem, Notes of a mini-course, Warwick


University (2010), pp. 19, arXiv:1103.1998v1
140. A.L. Haji-Ali, Pedestrian Flow in the Mean-Field Limit, MSc. thesis, KAUST (2012)
141. A.L. Haji-Ali, F. Nobile, E. von Schwerin, R. Tempone, Optimization of mesh hierarchies
in multilevel Monte Carlo samplers. Stoch. Partial Differ. Equ. Anal. Comput. 4(1), 76–112
(2016)
142. P. Hall, C.C. Heyde, Martingale Limit Theory and Its Applications (Academic Press, New
York, 1980), pp. 308
143. G.H. Hardy, J.E. Littlewood, G. Pólya, Inequalities. Reprint of the 1952 edition. Cambridge Mathematical Library (Cambridge University Press, Cambridge, 1988), pp. xii+324
144. L. Hascoët, V. Pascual, The Tapenade automatic differentiation tool: principles, model, and
specification. ACM Trans. Math. Softw. 39(3), 20:1–20:43 (2013)
145. R.Z. Has'minskiı̆, Stochastic Stability of Differential Equations, translated from the Russian by D. Louvish. Monographs and Textbooks on Mechanics of Solids and Fluids: Mechanics and Analysis, vol. 7 (Sijthoff & Noordhoff, Alphen aan den Rijn–Germantown, Md., 1980), pp. xvi+344
146. M.B. Haugh, L. Kogan, Pricing American options: a duality approach. Oper. Res. 52(2),
258–270 (2004)
147. U.G. Haussmann, On the integral representation of functionals of Itô Processes. Stochastics
3(1–4), 17–27 (1980)
148. S. Heinrich, Multilevel Monte Carlo methods, in In Large-Scale Scientific Computing
(Springer, Berlin, 2001), pp. 58–67
149. P. Henry-Labordère, X. Tan, N. Touzi, Exact simulation of multi-dimensional stochastic dif-
ferential equations (2015), arXiv:1504.06107
150. S.L. Heston, A closed-form solution for options with stochastic volatility with applications
to bond and currency options. Rev. Financ. Stud. 6(2), 327–343 (1993)
151. W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat.
Assoc. 58(301), 13–30 (1963)
152. M. Hutzenthaler, A. Jentzen, Numerical approximations of stochastic differential equations
with non-globally Lipschitz continuous coefficients. Mem. Am. Math. Soc. 236, 1112 (2015),
pp. v+99
153. M. Hutzenthaler, A. Jentzen, P.E. Kloeden, Divergence of the multilevel Monte Carlo Euler
method for non-linear stochastic differential equations. Ann. Appl. Probab. 23(5), 1913–1966
(2013)
154. J. Jacod, A.N. Shiryaev, Limit Theorems for Stochastic Processes, 2nd edn. (first edition
1987). Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of
Mathematical Sciences], vol. 288 (Springer, Berlin, 2003), pp. xx+661
155. J. Jacod, P. Protter, Asymptotic error distributions for the Euler method for stochastic differ-
ential equations. Ann. Probab. 26(1), 267–307 (1998)
156. J. Jacod, P. Protter, Probability Essentials, 2nd edn. (Universitext) (Springer, Berlin, 2003),
pp. x+254
157. J. Jacod, T. Kurtz, S. Méléard, P. Protter, The approximate Euler method for Lévy driven
stochastic differential equations. Ann. Inst. H. Poincaré Probab. Stat. 41(3), 523–558 (2005)
158. P. Jaillet, D. Lamberton, B. Lapeyre, Variational inequalities and the pricing of American
options. Acta Appl. Math. 21, 263–289 (1990)
159. H. Johnson, Options on the maximum or the minimum of several assets. J. Financ. Quant.
Anal. 22(3), 277–283 (1987)
160. B. Jourdain, A. Kohatsu-Higa, A review of recent results on approximation of solutions of
stochastic differential equations. Prog. Probab. 65, 141–164 (2011) (Springer Basel AG)
161. B. Jourdain, J. Lelong, Robust adaptive importance sampling for normal random vectors.
Ann. Appl. Probab. 19(5), 1687–1718 (2009)
162. I. Karatzas, S.E. Shreve, Brownian Motion and Stochastic Calculus. Graduate Texts in Math-
ematics (Springer, New York, 1998), pp. xxix+470
163. I. Karatzas, S.E. Shreve, Methods of Mathematical Finance. Applications of Mathematics,
vol. 39 (Springer, New York, 1998), pp. xvi+407
164. O. Kavian, Introduction à la théorie des points critiques et applications aux problèmes
elliptiques (French) [Introduction to critical point theory and applications to elliptic
problems]. Mathématiques et Applications [Mathematics and Applications], vol. 13
(Springer, Paris, 1993), pp. 325. ISBN: 2-287-00410-6
165. H.L. Keng, W. Yuan, Application of Number Theory to Numerical Analysis (Springer and
Science Press, Beijing, 1981), pp. 241
166. A.G. Kemna, A.C. Vorst, A pricing method for options based on average asset value. J. Bank.
Financ. 14, 113–129 (1990)
167. A. Kebaier, Statistical Romberg extrapolation: a new variance reduction method and applica-
tions to option pricing. Ann. Appl. Probab. 15(4), 2681–2705 (2005)
168. J.C. Kieffer, Exponential rate of convergence for Lloyd’s method I. IEEE Trans. Inf. Theory
(Special issue on quantization) 28(2), 205–210 (1982)
169. J. Kiefer, J. Wolfowitz, Stochastic estimation of the maximum of a regression function. Ann.
Math. Stat. 23, 462–466 (1952)
170. P.E. Kloeden, E. Platen, Numerical Solution of Stochastic Differential Equations. Applica-
tions of Mathematics, vol. 23 (Springer, Berlin, 1992), pp. 632
171. A. Kohatsu-Higa, R. Pettersson, Variance reduction methods for simulation of densities on
Wiener space. SIAM J. Numer. Anal. 40(2), 431–450 (2002)
172. V. Konakov, E. Mammen, Edgeworth type expansions for Euler schemes for stochastic dif-
ferential equations. Monte Carlo Methods Appl. 8(3), 271–285 (2002)
173. V. Konakov, S. Menozzi, S. Molchanov, Explicit parametrix and local limit theorems for some
degenerate diffusion processes. Ann. Inst. H. Poincaré Probab. Stat. (série B) 46(4), 908–923
(2010)
174. U. Krengel, Ergodic Theorems. De Gruyter Studies in Mathematics, vol. 6 (Springer, Berlin,
1987), pp. 357
175. L. Kuipers, H. Niederreiter, Uniform Distribution of Sequences (Wiley, New York, 1974),
pp. 390
176. H. Kunita, Stochastic differential equations and stochastic flows of diffeomorphisms, in Cours
d’école d’été de Saint-Flour XII–1982, LN-1097 (Springer, Berlin, 1984), pp. 143–303
177. H. Kunita, Stochastic flows and stochastic differential equations. Reprint of the 1990 orig-
inal. Cambridge Studies in Advanced Mathematics, vol. 24 (Cambridge University Press,
Cambridge, 1997), pp. xiv+346
178. T.G. Kurtz, P. Protter, Weak limit theorems for stochastic integrals and stochastic differential
equations. Ann. Probab. 19, 1035–1070 (1991)
179. H.J. Kushner, H. Huang, Rates of convergence for stochastic approximation type algorithms.
SIAM J. Control 17(5), 607–617 (1979)
180. H.J. Kushner, G.G. Yin, Stochastic Approximation and Recursive Algorithms and Applica-
tions, 2nd edn. Applications of Mathematics, Stochastic Modeling and Applied Probability,
vol. 35 (Springer, New York, 2003)
181. J.P. Lambert, Quasi-Monte Carlo, low-discrepancy, and ergodic transformations. J. Comput.
Appl. Math. 12–13, 419–423 (1985)
182. D. Lamberton, Optimal stopping and American options, a course at the Ljubljana
Summer School on Financial Mathematics (2009), https://www.fmf.uni-lj.si/finmath09/
ShortCourseAmericanOptions.pdf
183. D. Lamberton, B. Lapeyre, Introduction to Stochastic Calculus Applied to Finance (Chapman
& Hall, London, 1996), pp. 185
184. D. Lamberton, G. Pagès, P. Tarrès, When can the two-armed bandit algorithm be trusted?
Ann. Appl. Probab. 14(3), 1424–1454 (2004)
185. B. Lapeyre, G. Pagès, Familles de suites à discrépance faible obtenues par itération d’une
transformation de [0, 1]. Comptes rendus de l'Académie des Sciences de Paris, Série I, 308,
507–509 (1989)
186. B. Lapeyre, G. Pagès, K. Sab, Sequences with low discrepancy. Generalization and
application to the Robbins-Monro algorithm. Statistics 21(2), 251–272 (1990)
187. B. Lapeyre, E. Temam, Competitive Monte Carlo methods for the pricing of Asian options.
J. Comput. Financ. 5(1), 39–59 (2001)
188. S. Laruelle, C.-A. Lehalle, G. Pagès, Optimal split of orders across liquidity pools: a stochastic
algorithm approach. SIAM J. Financ. Math. 2, 1042–1076 (2011)
189. S. Laruelle, C.-A. Lehalle, G. Pagès, Optimal posting price of limit orders: learning by trading.
Math. Financ. Econ. 7(3), 359–403 (2013)
190. S. Laruelle, G. Pagès, Stochastic Approximation with averaging innovation. Monte Carlo
Methods Appl. J. 18, 1–51 (2012)
191. V.A. Lazarev, Convergence of stochastic approximation procedures in the case of a regression
equation with several roots (translated from the Russian). Problemy Peredachi Informatsii
28(1), 66–78 (1992)
192. A. Lejay, V. Reutenauer, A variance reduction technique using a quantized Brownian motion
as a control variate. J. Comput. Financ. 16(2), 61–84 (2012)
193. M. Ledoux, M. Talagrand, Probability in Banach Spaces. Isoperimetry and Processes. Reprint
of the 1991 edition. Classics in Mathematics (Springer, Berlin, 2011), pp. xii+480
194. P. L’Ecuyer, G. Perron, On the convergence rates of IPA and FDC derivatives estimators.
Oper. Res. 42, 643–656 (1994)
195. M. Ledoux, Concentration of measure and logarithmic Sobolev inequalities, technical report
(1997), http://www.math.univ-toulouse.fr/~ledoux/Berlin.pdf
196. J. Lelong, Almost sure convergence of randomly truncated stochastic algorithms under veri-
fiable conditions. Stat. Probab. Lett. 78(16), 2632–2636 (2008)
197. J. Lelong, Asymptotic normality of randomly truncated stochastic algorithms. ESAIM.
Probab. Stat. 17, 105–119 (2013)
198. V. Lemaire, G. Pagès, Multilevel Richardson-Romberg extrapolation. Bernoulli 23(4A),
2643–2692 (2017)
199. V. Lemaire, G. Pagès, Unconstrained recursive importance sampling. Ann. Appl. Probab.
20(3), 1029–1067 (2010)
200. L. Ljung, Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control 22(4),
551–575 (1977)
201. S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)
202. F.A. Longstaff, E.S. Schwartz, Valuing American options by simulation: a simple least-squares
approach. Rev. Financ. Stud. 14, 113–148 (2001)
203. H. Luschgy, Martingale in diskreter Zeit: Theorie und Anwendungen (Springer, Berlin, 2013),
pp. 452
204. H. Luschgy, G. Pagès, Functional quantization of Gaussian processes. J. Funct. Anal. 196(2),
486–531 (2001)
205. H. Luschgy, G. Pagès, Functional quantization rate and mean regularity of processes with an
application to Lévy processes. Ann. Appl. Probab. 18(2), 427–469 (2008)
206. D. McLeish, A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods
Appl. 17(4), 301–315 (2011)
207. J. McNames, A fast nearest-neighbor algorithm based on principal axis search tree. IEEE
Trans. Pattern Anal. Mach. Intell. 23(9), 964–976 (2001)
208. P. Malliavin, A. Thalmaier, Stochastic Calculus of Variations in Mathematical Finance.
Springer Finance (Springer, Berlin, 2006), pp. xii+142
209. S. Manaster, G. Koehler, The calculation of implied variance from the Black-Scholes model:
a note. J. Financ. 37(1), 227–230 (1982)
210. G. Marsaglia, T.A. Bray, A convenient method for generating normal variables. SIAM Rev.
6, 260–264 (1964)
211. M. Matsumoto, T. Nishimura, Mersenne twister: a 623-dimensionally equidistributed uniform
pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8(1), 3–30 (1998)
212. G. Marsaglia, W.W. Tsang, The ziggurat method for generating random variables. J. Stat.
Softw. 5(8), 363–372 (2000)
213. M. Métivier, P. Priouret, Théorèmes de convergence presque sûre pour une classe
d’algorithmes stochastiques à pas décroissant. (French. English summary) [Almost sure con-
vergence theorems for a class of decreasing-step stochastic algorithms]. Probab. Theory Relat.
Fields 74(3), 403–428 (1987)
214. G.N. Milstein, A method of second order accuracy for stochastic differential equations. Theor.
Probab. Appl. (USSR) 23, 396–401 (1976)
215. U. Naumann, The Art of Differentiating Computer Programs: An Introduction to Algorith-
mic Differentiation. Software, Environments and Tools (SIAM, RWTH Aachen University,
Aachen, Germany, 2012), pp. xviii+333
216. J. Neveu, Bases mathématiques du calcul des probabilités (Masson, Paris, 1964), pp. 213;
English translation: Mathematical Foundations of the Calculus of Probability (Holden-Day,
San Francisco, 1965)
217. J. Neveu, Martingales à temps discret (Masson, Paris, 1972), pp. 218; English translation:
Discrete-Parameter Martingales (North-Holland, New York, 1975), pp. 236
218. D.J. Newman, The Hexagon Theorem. IEEE Trans. Inf. Theory (A. Gersho and R.M. Gray
eds.) 28, 137–138 (1982)
219. H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods CBMS-NSF
regional conference series in Applied mathematics (SIAM, Philadelphia, 1992)
220. Numerical Recipes: textbook available at http://www.ma.utexas.edu/documentation/nr/
bookcpdf.html. See also [247]
221. A. Owen, Local antithetic sampling with scrambled nets. Ann. Stat. 36(5), 2319–2343 (2008)
222. G. Pagès, Sur quelques problèmes de convergence, Thèse, Université Paris 6, Paris (1987),
pp. 144
223. G. Pagès, Van der Corput sequences, Kakutani transform and one-dimensional numerical
integration. J. Comput. Appl. Math. 44, 21–39 (1992)
224. G. Pagès, A space vector quantization method for numerical integration. J. Comput. Appl.
Math. 89, 1–38 (1998). Extended version of "Voronoi tessellation, space quantization
algorithms and numerical integration", in Proceedings of ESANN'93, ed. by M. Verleysen
(Quorum Editions, Brussels, 1993), pp. 221–228
225. G. Pagès, Multistep Richardson-Romberg extrapolation: controlling variance and complexity.
Monte Carlo Methods Appl. 13(1), 37–70 (2007)
226. G. Pagès, Quadratic optimal functional quantization methods and numerical applications, in
Proceedings of MCQMC, Ulm’06 (Springer, Berlin, 2007), pp. 101–142
227. G. Pagès, Functional co-monotony of processes with an application to peacocks and barrier
options, in Séminaire de Probabilités XXVI, LNM, vol. 2078 (Springer, Cham, 2013), pp.
365–400
228. G. Pagès, Introduction to optimal quantization for numerics. ESAIM Proc. Surv. 48, 29–79
(2015)
229. G. Pagès, H. Pham, J. Printems, Optimal quantization methods and applications to numeri-
cal problems in finance, in Handbook on Numerical Methods in Finance, ed. by S. Rachev
(Birkhauser, Boston, 2004), pp. 253–298
230. G. Pagès, H. Pham, J. Printems, An optimal Markovian Quantization algorithm for multi-
dimensional stochastic control problems. Stoch. Dyn. 4(4), 501–545 (2004)
231. G. Pagès, J. Printems, Optimal quadratic quantization for numerics: the Gaussian case. Monte
Carlo Methods Appl. 9(2), 135–165 (2003)
232. G. Pagès, J. Printems (2005), http://www.quantize.maths-fi.com, website devoted to optimal
quantization
233. G. Pagès, J. Printems, Optimal quantization for finance: from random vectors to stochastic
processes, in Mathematical Modeling and Numerical Methods in Finance (special volume,
A. Bensoussan, Q. Zhang guest eds.), Handbook of Numerical Analysis, ed. by P.G. Ciarlet
(North-Holland, 2009), pp. 595–649
234. G. Pagès, A. Sagna, Recursive marginal quantization of the Euler scheme of a diffusion
process. Appl. Math. Financ. 22(5), 463–498 (2015)
235. G. Pagès, B. Wilbertz, Optimal Delaunay and Voronoi quantization methods for pricing Amer-
ican options, in Numerical Methods in Finance, ed. by R. Carmona, P. Hu, P. Del Moral, N.
Oudjane (Springer, New York, 2012), pp. 171–217
236. G. Pagès, B. Wilbertz, Sharp rate for the dual quantization problem, forthcoming, in Séminaire
de Probabilités XLIX (Springer, Berlin, 2018)
237. G. Pagès, Y.-J. Xiao, Sequences with low discrepancy and pseudo-random numbers: theoretical
results and numerical tests. J. Stat. Comput. Simul. 56, 163–188 (1997)
238. G. Pagès, J. Yu, Pointwise convergence of the Lloyd I algorithm in higher dimension. SIAM
J. Control Optim. 54(5), 2354–2382 (2016)
239. K.R. Parthasarathy, Probability measures on metric spaces, in Probability and Mathematical
Statistics, vol. 3 (Academic Press, Inc., New York-London, 1967), pp. xi+276
240. L. Paulot, Unbiased Monte Carlo Simulation of Diffusion Processes (2016), arXiv:1605.01998
241. M. Pelletier, Weak convergence rates for stochastic approximation with application to multiple
targets and simulated annealing. Ann. Appl. Probab. 8(1), 10–44 (1998)
242. R. Pemantle, Non-convergence to unstable points in urn models and stochastic approxima-
tions. Ann. Probab. 18(2), 698–712 (1990)
243. H. Pham, Some applications and methods of large deviations in finance and insurance, in
Paris-Princeton Lectures on Mathematical Finance 2004. LNM, vol. 1919 (Springer, New
York, 2007), pp. 191–244
244. J. Pitman, M. Yor, A decomposition of Bessel bridges. Z. Wahrsch. Verw. Gebiete 59(4),
425–457 (1982)
245. B.T. Polyak, A new method of stochastic approximation type (Russian). Avtomat. i Telemekh.
7, 98–107 (1991); translation in Autom. Remote Control 51, part 2, 937–946 (1991)
246. B.T. Polyak, A.B. Juditsky, Acceleration of stochastic approximation by averaging. SIAM J.
Control Optim. 30(4), 838–855 (1992). https://doi.org/10.1137/0330046
247. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C++. The Art
of Scientific Computing, 2nd edn. updated for C++ (Cambridge University Press, Cambridge,
2002), pp. 1002
248. P.D. Proïnov, Discrepancy and integration of continuous functions. J. Approx. Theory 52,
121–131 (1988)
249. P.E. Protter, Stochastic Integration and Differential Equations, 2nd edn., Version 2.1, Corrected
Third Printing. Stochastic Modelling and Applied Probability, vol. 21 (Springer, Berlin, 2004),
pp. xiv+419
250. P. Protter, D. Talay, The Euler scheme for Lévy driven stochastic differential equations. Ann.
Probab. 25(1), 393–423 (1997)
251. D. Revuz, M. Yor, Continuous martingales and Brownian motion, 3rd edn. Grundlehren der
Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol.
293 (Springer, Berlin, 1999), pp. 602
252. C. Rhee, P.W. Glynn, Unbiased estimation with square root convergence for SDE models.
Oper. Res. 63(5), 1026–1043 (2015)
253. H. Robbins, S. Monro, A stochastic approximation method. Ann. Math. Stat. 22, 400–407
(1951)
254. R.T. Rockafellar, S. Uryasev, Optimization of conditional value-at-risk. J. Risk 2(3), 21–41
(2000)
255. L.C.G. Rogers, Monte Carlo valuation of American options. Math. Financ. 12(3), 271–286
(2002)
256. L.C.G. Rogers, D. Williams, in Diffusions, Markov Processes, and Martingales, vol. 2 (Itô
calculus) (2000). Reprint of the second (1994) edition. Cambridge Mathematical Library
(Cambridge University Press, Cambridge 2000), pp. xiv+480
257. K.F. Roth, On irregularities of distributions. Mathematika 1, 73–79 (1954)
258. W. Rudin, Real and Complex Analysis (McGraw-Hill, New York, 1966), pp. xiii+416
259. D. Ruppert, Efficient Estimations from a Slowly Convergent Robbins-Monro Process, Cornell
University Operations Research and Industrial Engineering, Technical Report 781 Ithaca, New
York, 1988, pp. 29
260. A. Sagna, Pricing of barrier options by marginal functional quantization. Monte Carlo Methods
Appl. 17(4), 371–398 (2011)
261. K.-I. Sato, Lévy processes and Infinitely Divisible Distributions (Cambridge University Press,
Cambridge, 1999)
262. A.N. Shiryaev, Optimal Stopping Rules. Translated from the 1976 Russian second edition
by A.B. Aries. Reprint of the 1978 translation. Stochastic Modelling and Applied Probability,
vol. 8 (Springer, Berlin, 2008), pp. 217
263. A.N. Shiryaev, Probability. Graduate Texts in Mathematics, 2nd edn. (Springer, New York,
1995), pp. 621; Original Russian edition, 1980, original Russian second edition, 1989
264. P. Seumen-Tonou, Méthodes numériques probabilistes pour la résolution d'équations du
transport et pour l'évaluation d'options exotiques, Thèse de l'Université de Provence
(Marseille, 1997), pp. 116
265. I.M. Sobol’, Distribution of points in a cube and approximate evaluation of integrals, Zh.
Vych. Mat. Mat. Fiz., 7, 784–802 (1967) (in Russian); U.S.S.R Comput. Math. Math. Phys.
7, 86–112 (1967) (in English)
266. I.M. Sobol’, Y.L. Levitan, The production of points uniformly distributed in a multi-
dimensional cube. Technical Report 40, Institute of Applied Mathematics, USSR Academy
of Sciences (1976) (in Russian)
267. C. Soize, The Fokker–Planck Equation for Stochastic Dynamical Systems and Its Explicit
Steady State Solutions. Series on Advances in Mathematics for Applied Sciences, vol. 17
(World Scientific Publishing Co., Inc., River Edge, 1994), pp. xvi+321
268. M. Sussman, W. Crutchfield, M. Papakipos, Pseudo-random number generation on GPU,
in Graphics Hardware 2006, Proceedings of the Eurographics Symposium, Vienna, Austria,
September 3–4, 2006, ed. by M. Olano, P. Slusallek (A K Peters Ltd., 2006), pp. 86–94
269. J.N. Tsitsiklis, B. Van Roy, Regression methods for pricing complex American-style options.
IEEE Trans. Neural Netw. 12(4), 694–703 (2001)
270. D. Talay, L. Tubaro, Expansion of the global error for numerical schemes solving stochastic
differential equations. Stoch. Anal. Appl. 8, 94–120 (1990)
271. B. Tuffin, Randomization of Quasi-Monte Carlo Methods for error estimation: survey and
normal approximation. Monte Carlo Methods Appl. 10(3–4), 617–628 (2004)
272. C. Villani, Topics in Optimal Transportation. Graduate Studies in Mathematics, vol. 58 (Amer-
ican Mathematical Society, Providence, 2003), pp. xvi+370
273. S. Villeneuve, A. Zanette, Parabolic A.D.I. methods for pricing American option on two
stocks. Math. Oper. Res. 27(1), 121–149 (2002)
274. H.F. Walker, P. Ni, Anderson acceleration for fixed-point iterations. SIAM J. Numer. Anal.
49(4), 1715–1735 (2011)
275. D.S. Watkins, Fundamentals of Matrix Computations. Pure and Applied Mathematics (Hobo-
ken), 3rd edn. (Wiley, Hoboken, 2010), pp. xvi+644
276. M. Winiarski, Quasi-Monte Carlo Derivative valuation and Reduction of simulation bias,
Master thesis (Royal Institute of Technology (KTH), Stockholm (Sweden), 2006)
277. A. Wood, G. Chan, Simulation of stationary Gaussian processes in [0, 1]^d. J. Comput. Gr.
Stat. 3(4), 409–432 (1994)
278. Y.-J. Xiao, Contributions aux méthodes arithmétiques pour la simulation accélérée (Thèse
de l’ENPC, Paris, 1990), pp. 110
279. Y.-J. Xiao, Suites équiréparties associées aux automorphismes du tore. C.R. Acad. Sci. Paris
(Série I) 317, 579–582 (1990)
280. P.L. Zador, Development and evaluation of procedures for quantizing multivariate distribu-
tions. Ph.D. dissertation, Stanford University (1963), pp. 111
281. P.L. Zador, Asymptotic quantization error of continuous signals and the quantization dimen-
sion. IEEE Trans. Inf. Theory 28(2), 139–149 (1982)
Index

A
Acceptance-rejection method, 10
α-quantile, 209
α-strictly convex function, 237
American Monte Carlo, 509
Anderson's acceleration method, 157
Antithetic method (variance reduction), 59
Antithetic random variables, 62
Antithetic schemes, 441
Antithetic schemes (Brownian diffusions), 441
Antithetic schemes (nested Monte Carlo), 444
Asian Call, 75
Asian call-put parity equation, 73
Asian option, 56, 375
a.s. rate (Euler schemes), 283
Asset-or-Nothing Call, 501, 502
Automatic differentiation, 489
Automorphism of the torus, 101
Averaging principle, 194
Averaging principle (stochastic approximation), 253

B
Backward Dynamic Programming Principle (BDPP), 515
Backward Dynamic Programming Principle (BDPP) (functional), 516
Backward Dynamic Programming Principle (BDPP) (pathwise), 515
Baire σ-field, 545
Bally–Talay theorem, 317
Barrier option, 81
Basket option, 54
Bernoulli distribution, 8
Berry-Esseen Theorem, 30
Bertrand's conjecture, 114
Best-of-call, 34, 59, 198
Binomial distribution, 8
Birkhoff's pointwise ergodic theorem, 119
Bismut's formula, 493
Black formula, 545
Black–Scholes formula(s), 544
Black–Scholes formula (call), 34, 544
Black–Scholes formula (put), 544
Black–Scholes (Milstein scheme), 304
Black–Scholes with dividends, 544
Blackwell–Rao method, 80
Box, 99
Box–Muller method, 20
Bridge of a diffusion, 368
Brownian bridge, 365
Brownian bridge method (barrier), 371
Brownian bridge method (lookback), 371
Brownian bridge method (max/min), 371
Brownian motion (simulation), 24
Bump method, 472
Burkholder–Davis–Gundy (B.D.G.) inequality (continuous time), 324
Burkholder–Davis–Gundy (B.D.G.) inequality (discrete time), 242

C
Càdlàg, 5
Call option (European), 34
Call-put parity equation, 72
Call-put parity equation (Asian), 73
Cameron–Martin formula, 90
Central Limit Theorem (Lindeberg), 552
Cheng's partial distance search, 224
χ²-distribution, 31
© Springer International Publishing AG, part of Springer Nature 2018 575
G. Pagès, Numerical Probability, Universitext,
https://doi.org/10.1007/978-3-319-90276-0
Cholesky decomposition, 23
Cholesky matrix, 23
Chow Theorem, 216
Chung's CLT, 106
Chung's LIL, 108
CIR process, 307
Clark–Cameron oscillator, 447
Clark–Cameron oscillator (quantization), 160
Co-monotony, 60
Coboundary, 119
Competitive Learning Vector Quantization (CLVQ), 161, 216
Complexity, 51
Componentwise partial order, 99
Compound option, 455
Confidence interval, 32
Confidence interval (multilevel), 423
Confidence interval (theoretical), 30
Confidence level, 30
Continuation function, 526
Control variate (adaptive), 64
Control variate (static), 49
Convergence (weak), 96
Covariance matrix, 22
Cubature formula (quantization), 144
Curse of dimensionality, 534
Curse of dimensionality (discrepancy), 120
Curse of dimensionality (quantization), 143, 163

D
De La Vallée Poussin criterion, 547
Δ-method, 237
Depth (ML2R estimator), 405
Depth (multilevel estimator), 431
Depth (multistep estimator), 398
Deviation inequality (diffusion), 297
Deviation inequality (Euler scheme), 296
Diffusion process (Brownian), 272
Digital call, 502
Digital option, 478
Discrepancy at the origin, 99
Discrepancy (extreme), 99
Discrepancy (star), 99
Discrete time Euler scheme, 273
Distortion function (quadratic), 135
Distribution function, 5
Doob's decomposition, 538
Doob's inequality, 284

E
Effort, 51
Ergodic transform, 119
Essential infimum, 557
Essential supremum, 554
Euler indicator function, 2
Euler–Maruyama schemes, 273
Euler schemes, 273
Euler scheme (continuous), 274
Euler scheme (discrete time), 273
Euler scheme (genuine), 274
Euler scheme (stepwise constant), 274
Exchange spread option, 80
Exponential distribution, 7
Extreme discrepancy, 99

F
Faure sequences, 113
Feynman–Kac formula, 317, 348
Finite difference (constant step), 472
Finite difference (decreasing step), 480
Finite difference methods (greeks), 472
Finite variation function (Hardy and Krause sense), 104
Finite variation function (measure sense), 102
First order weak expansion error, 387
Flow of an SDE, 272, 339
Flow (of the Euler scheme), 339
Forward start option, 488
Fractional Brownian motion (simulation), 25
Functional monotone class theorem, 545

G
Gamma distribution, 15
Garman–Kohlhagen formula, 545
Generalized Minkowski inequality, 323
Geometric Brownian motion, 33
Geometric distribution(s), 8
Geometric mean option, 59
Glivenko–Cantelli's theorem, 96
Greek computation, 472

H
Hölder Inequality (extended), 335
Halton sequences, 109
Halton sequence (super-), 113
Hammersley procedure, 116
Haussmann–Clark–Ocone formula, 497
Hoeffding's inequality, 233, 298
Homogeneous random number generator, 2
Huygens theorem, 473

I
Implicitation (of a parameter), 179
Importance sampling, 86
Importance sampling (recursive), 188
Index option, 54
Infinitesimal generator, 554
Inner simulations, 385
Integrability (uniform), 39
Itô formula, 554
Itô process, 553
Itô's Lemma, 554

J
Jensen's inequality, 53

K
Kakutani's adding machine, 111
Kakutani sequences, 111
K-D tree, 224
Kemna–Vorst control variate, 56
k-means algorithm, 162
Koksma–Hlawka Inequality, 103
Kolmogorov–Smirnov, 107
Kronecker Lemma, 551

L
Lévy's area (quantized), 159
Lindeberg Central Limit Theorem, 552
Lloyd's algorithm, 156
Lloyd's algorithm (randomized), 161
Lloyd's map, 156
Lloyd's method I, 156
Lloyd's method I (randomized), 161
Local inertia, 82
Local martingale (continuous), 553
Local volatility model, 514
Log-likelihood method, 45
Lookback options, 372
Low discrepancy (sequence with), 109
Lower semi-continuity, 233
Lower Semi-Continuous (L.S.C.), 233
L^p-Wasserstein distance, 145
L^q-discrepancy, 105

M
Malliavin calculus, 499
Malliavin derivative, 498
Malliavin-Monte Carlo method, 42
Marcinkiewicz–Zygmund inequality, 242
Mark-to-market price, 199
Marsaglia's polar method, 22
Mean quadratic quantization error, 135
Mean-reversion, 360
Mean-reverting assumption (pathwise), 263
Mean Squared Error (MSE), 387
Median, 139
Milstein scheme, 301
Milstein scheme (Black–Scholes), 304
Milstein scheme (continuous), 304
Milstein scheme (discrete time), 303
Milstein scheme (genuine), 304
Milstein scheme (multi-dimensional), 308
Milstein scheme (truncated), 441, 442
MLMC estimator, 432
ML2R estimator, 405
Moneyness, 196
Monge–Kantorovich characterization, 145
Monte Carlo method, 27
Monte Carlo simulation, 27
Multilevel estimator (regular), 431, 432
Multilevel paradigm, 402
Multilevel Richardson–Romberg (ML2R) estimator, 391, 405
Multilevel Richardson–Romberg estimator, 404
Multistep estimator, 398

N
Nearest neighbor projection, 135
Negative binomial distribution(s), 9
Negatively correlated variables, 59
Nested Monte Carlo, 385
Newton–Raphson (zero search), 155
Niederreiter sequences, 115
Non-central χ²(1) distribution (quantization), 158
Normal distribution, 541
Numerical integration (quantization), 163

O
Optimal quantizer, 138
Option (forward start), 488
Option (lookback), 372
Option on a basket, 54
Option on an index, 54
Ornstein–Uhlenbeck operator, 292
Ornstein–Uhlenbeck process, 251, 291
Outer simulations, 385

P
p-adic rotation, 112
Parametrix, 318, 359
Pareto distribution, 7
Parity equation (call-put), 72
Parity variance reduction method, 72
Partial distance search, 224
Partial lookback option, 300, 373
Pathwise mean-reverting assumption, 263
Poisson distribution, 18
Poisson process, 18
Polar method, 15, 22
Polish space, 4
Portmanteau Theorem, 549
Pre-conditioning, 80
Principal Axis Tree, 168
Proïnov Theorem, 120
Put spread option, 164

Q
Quadratic distortion function, 135
Quantile, 85, 209
Quantile (two-sided α-), 30
Quantization error (quadratic mean), 135
Quantization tree, 509, 528, 532
Quantization (Voronoi), 135
Quantized backward dynamic programming principle, 532
Quasi-Monte Carlo (QMC) (randomized), 126
Quasi-Monte Carlo (QMC) method, 95
Quasi-Monte Carlo (QMC) method (unbounded dimension), 130
Quasi-stochastic approximation, 132, 262

R
Randomized Multi Level (RML) estimator, 465
Random number generator (homogeneous), 2
Regular p-adic expansion, 111
Residual maturity, 477
Richardson–Romberg extrapolation (Euler scheme), 318
Richardson–Romberg extrapolation (quantization), 163
Riemann integrable function, 98
Risk-neutral probability, 72, 514
Robbins–Monro algorithm, 184
Robbins–Siegmund Lemma, 180
Romberg extrapolation (quantization), 163
Root Mean Square Error (RMSE), 28, 387, 472
Rotations of the torus, 101
Roth's lower bound (discrepancy), 108
Ruppert and Polyak averaging principle, 253

S
Semi-convex function, 513
Seniority, 477
Sensitivity computation, 472
Sensitivity computation (localization), 500
Sensitivity computation (tangent process), 485
Sequence with low discrepancy, 109
Shock method, 472
Signed measure, 102
Simulation complexity, 51
Snell envelope, 509, 511
Snell envelope (dual representation), 538
Sobol' sequences, 115
Solvency Capital Requirement (SCR), 383, 386
Splitting initialization method, 157, 223
Splitting method, 224
Square root of a positive semidefinite matrix, 23
Stable convergence in distribution, 239
Stable weak convergence, 239
Standard-deviation, 322
Star discrepancy, 99
Static control variate, 49
Stationary quantizer, 139
Stepwise constant Euler scheme, 274
Stochastic algorithm, 177
Stochastic Differential Equation (SDE), 271
Stochastic Differential Equation (SDE) (flow), 339
Stochastic gradient, 185
Stopping time, 511
Strata, 81
Stratification, 81
Stratified sampling, 81
Stratified sampling (universal), 168
Strike price, 34
Strong L^p-rate (Euler schemes), 277
Strong L^p-rate (Milstein scheme), 304
Strongly continuous distribution, 135
Structural dimension, 122
Student distribution, 32
Super-Halton sequence, 113
Supremum of the Brownian bridge (quantization), 160
Survival function, 6

T
Talay–Tubaro Theorem, 312, 316
Tangent process method, 47
Theoretical confidence interval, 30
Truncated Milstein scheme, 441, 442
Two-sided α-quantile, 30

U
Unbiased multilevel estimator, 464
Uniform integrability, 39
Uniformly distributed sequence, 98
Unimodal distribution, 153
Uniquely ergodic transform, 119
Up-and-out call, 454
Usual conditions, 272

V
Value function, 512
Van der Corput sequence, 110
Vibrato Monte Carlo, 493
Von Neumann acceptance-rejection method, 10
Voronoi cell, 134
Voronoi partition, 134
Voronoi quantization, 135

W
Wasserstein distance (L^p), 145
Weak convergence, 96
Weak exact simulation (SDE), 467
Weak expansion error (first order), 387
Weak rate (Euler scheme), 310
Weak uniqueness, 350
Weighted multilevel estimator, 405
Weyl's criterion, 100
Wienerization, 440

Y
Yield (of a simulation), 6

Z
Ziggurat method, 18