Lecture Notes: Quantitative Risk Management
Lecture notes
First edition
Preface to the first edition
These lecture notes were written for the course Quantitative Risk Management (abbreviated
QRM ) at the University of Copenhagen in the winter of 2023/2024. They are based on the
lectures by Jeffrey F. Collamore with some of my own additions, mostly in the form of
supplementary examples, exercises and an appendix. I also added a few more results that
I felt were helpful when solving the mandatory exercises. The notes should be sufficient to
cover the entire syllabus, and if the reader wants to dive into certain topics in more detail,
the notes and comments at the end of each week provide further references.
The notes are built up in the following way: Each week in these notes corresponds roughly
to a week of teaching. Some subsections have been moved to make the presentation more
clear, but the reader can expect the notes to follow the teaching almost one to one. The ap-
pendix at the end deserves an explanation. The appendix contains three sections. The first
is very short and concerns generalised inverses. This section is essential for the course, and
it should be clear in the text when the reader should consult this part of the appendix. As
for the other two sections, probability theory and calculus, they are purely supplementary.
I wrote them with the intent of making it easier to look up some basic notions if necessary.
The course relies on a general understanding of probability theory, and sometimes tools
like conditioning arguments and integration by parts (with respect to functions) are used
without explanation. For students with less training in such computations, it may be a good
idea to take a brief look at the relevant subsections.
Feedback in general is very appreciated. The exercises are my own addition (well, most of
them), and any suggestions on how these can be improved to be more interesting etc. are
very welcome. Last but not least, there are likely a number of typos remaining and maybe
even a few mathematical errors. These are all due to me. Please don't hesitate to let me
know if you find any.
Table of contents
Week 4 - Copulas I 44
8 Copulas: Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
9 Examples of copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Week 5 - Copulas II 53
10 Archimedean copulas in higher dimensions . . . . . . . . . . . . . . . . . . . . 53
11 Fitting copulas to data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A Preliminaries 80
A.1 Generalised inverses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2 Probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.3 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography 102
Index 106
Week 1 - Risk measures
1 Introduction
What is quantitative risk management about?
Quantitative risk management aims at describing and understanding risk in a financial
context. To motivate our discussion, let us start with a basic example. Suppose we have a
stock with value Sn at time n (discrete time units). Assume the Bernoulli model
$$S_0 = 1, \qquad P(S_{n+1} = 2S_n) = \tfrac{2}{3} \quad \text{and} \quad P(S_{n+1} = 0.5\,S_n) = \tfrac{1}{3} \quad \text{for } n \geq 0.$$
In this model there are some basic questions we may ask. For example, what is the risk in
this investment policy? What are the expected returns? We can compute
$$E[S_n \mid S_{n-1}] = \frac{2}{3}\cdot 2S_{n-1} + \frac{1}{3}\cdot\frac{1}{2} S_{n-1} = \frac{3}{2} S_{n-1}$$
and it follows that
$$M_n = \left(\frac{2}{3}\right)^n S_n$$
is a martingale, and
$$1 = M_0 = E[M_n] = \left(\frac{2}{3}\right)^n E[S_n] \quad \text{implying} \quad E[S_n] = \left(\frac{3}{2}\right)^n \to \infty.$$
However, there is still a risk that we lose money in this investment policy. Note that we can
write
Sn = R1 · · · Rn
for Bernoulli variables Ri with P (Ri = 2) = 2/3 and P (Ri = 1/2) = 1/3. Taking log yields
$$\log S_n = \sum_{i=1}^n \log R_i$$
which is a risk process that can be studied using ruin theory. If E[e−α log S1 ] = 1 for some
α > 0, we obtain the Cramér-Lundberg estimate (for a threshold u > 0)
$$P\left(\inf_{n \geq 1} \log S_n \leq -u\right) \approx C e^{-\alpha u}$$
for a constant C. In Cramér-Lundberg theory, there is a tradeoff between profit and risk
and the same rule applies to risk management in finance. This tradeoff can be studied from
several viewpoints. Sometimes this tradeoff is handled using a utility function U and then
maximizing the expected utility E[U (Sn )]. In this course we follow another approach via
so-called risk measures. This approach allows us to compute loss probabilities directly. All
these terms will get a more rigorous definition later on.
Example 1.1. Barings Bank was founded in 1762 and was one of the UK’s oldest and
most respected banks. In 1995 the bank collapsed despite having more than $900 million in
capital. This was due to unauthorized trading by a single employee. He sold straddles
(writing both a call option and a put option), which in a stable market typically expire
worthless and provide a gain. However, a Japanese earthquake caused market instability, and
the bank suffered a loss of more than $1 billion, resulting in bankruptcy. ◦
Example 1.2. In 1994, the hedge fund LTCM (Long-Term Capital Management) was
founded. The employees were experienced traders and academics. $1.3 billion were invested
with returns after two years close to 40 %. Early in 1998 net assets were $4 billion, but
by the end of the year the fund was close to default. The U.S. Federal Reserve managed a
$3.5 billion rescue package to avoid a systemic crisis in the world financial system. The
triggering event for this disaster was the devaluation of the ruble by Russia. ◦
Example 1.3. In 2023, Silicon Valley Bank collapsed. The bank owned low interest bonds
and paid even lower interest to the depositors. When the depositors withdrew their capital,
this required selling the low-interest treasury bonds, whose market prices had decreased
sharply. ◦
Example 1.4. The final example concerns a particular individual, Jesse Livermore. Liver-
more is considered the greatest short-seller in history. He successfully shorted the 1929
market crash (through roughly 100 short positions, netting about $100 million). He also had
numerous successful ”long” investments. However, he went bankrupt a number of times: in
1901, in 1908 and again in 1934. His book on trading remains
a classic to this day. ◦
Definition 2.1. Let t0 , t1 , ... denote discrete timepoints (days, weeks, months or years for
example) and let $\Delta t_n = t_{n+1} - t_n$ denote the time passed between the timepoints $t_n$ and $t_{n+1}$.
We let $V_n$ denote the capital at time $t_n$, and we let $L_{n+1}$ denote the loss between $t_n$ and $t_{n+1}$, i.e.
$$L_{n+1} = -(V_{n+1} - V_n).$$
We let a general loss random variable be denoted by L. Many models consist of assumptions
on L. Let us consider some motivating examples. Note that in these examples, it makes
sense to think of Vn as a portfolio.
Example 2.2 (Stock investment). Let Vn = Sn with Sn the price of a certain stock at
time tn . We let Xn+1 = log Sn+1 − log Sn denote the log returns. Hence
Sn+1
eXn+1 = .
Sn
Historically, Xn+1 has often been given a normal distribution. This is motivated by the
Black-Scholes model, see the end of the examples for a concise explanation of this model. In
this model, the change of the stock price (in continuous time t) is described by the dynamics
$$\frac{dS(t)}{S(t)} = r\,dt + \sigma\,dW(t)$$
with W (t) denoting a Brownian motion. Solving the equation explicitly, assuming that we
are currently at time t so that S(t) is known, yields the expression for the stock price at
time T > t:
$$S(T) = S(t)\, e^{\left(r - \frac{\sigma^2}{2}\right)(T-t) + \sigma(W(T) - W(t))}.$$
The discrete time analogue is
$$S_{n+1} = S_n\, e^{\left(r - \frac{\sigma^2}{2}\right)(t_{n+1}-t_n) + \sigma\sqrt{t_{n+1}-t_n}\, Z}, \qquad Z \sim N(0,1),$$
so that
$$X_{n+1} = \log S_{n+1} - \log S_n = \left(r - \frac{\sigma^2}{2}\right)(t_{n+1}-t_n) + \sigma\sqrt{t_{n+1}-t_n}\, Z,$$
which is normally distributed. Note that the loss $L_{n+1}$ can be written as
$$L_{n+1} = -(S_{n+1} - S_n) = -S_n\left(e^{X_{n+1}} - 1\right).$$
If we are at time n, the value Sn is known. A key goal of this course is to model the unknown
part Xn+1 and thereby make inference about the behaviour of the process Vn at a future
time. Note also how this approach differs from the one in classical ruin theory where the
entire positive timeline is considered. Here we model the change in capital one step at a
time. ◦
Example 2.3 (Stock investment with more assets). The preceding example can be
generalised. Assume that we have d stocks. We can then form a portfolio at time n by
$$V_n = \sum_{i=1}^d \alpha_i S_n^{(i)}$$
with $\alpha_i$ the number of units bought of stock i and $S_n^{(i)}$ the value of stock i at time n. Using
the previous example, the loss is given by
$$L_{n+1} = -\sum_{i=1}^d \alpha_i S_n^{(i)}\left(e^{X^{(i)}_{n+1}} - 1\right), \qquad X^{(i)}_{n+1} = \log S^{(i)}_{n+1} - \log S^{(i)}_n.$$
Again we need to come up with a model for the log returns $X^{(1)}_{n+1}, ..., X^{(d)}_{n+1}$, where we write
$$\mathbf{X}_{n+1} = (X^{(1)}_{n+1}, ..., X^{(d)}_{n+1}).$$
Very often the variables in Xn+1 are dependent. Think for example of a portfolio of stocks
in the same type of companies. This dependence structure is crucial to understand and
capture when estimating risk in a portfolio.
◦
Example 2.4 (Bond investment). Consider a zero-coupon bond. Such an asset pays one
unit at a fixed time T . Let us briefly consider such a bond in continuous time. We have an
interest rate rt at time t, and we let Bt denote the price of the bond at time t (this price
will depend on T ). The price of the bond can be described by the dynamics
dBt = rt Bt dt,
and using the boundary condition BT = 1, we can solve the above equation and get
$$1 = B_T = B_t\, e^{\int_t^T r_s\,ds},$$
so that $B_t = e^{-\int_t^T r_s\,ds}$. Writing $B_t = e^{-(T-t)y(t,T)}$, the quantity y(t, T) is called the yield of the bond. Let us now consider discrete time, and say that the
current time is tn . The loss is
$$L_{n+1} = -(B_{t_{n+1}} - B_{t_n}) = -B_{t_n}\left(\frac{B_{t_{n+1}}}{B_{t_n}} - 1\right).$$
Let us fix some notation. Denote the yield at time n by Zn = y(tn , T ), and let Xn+1 =
Zn+1 − Zn . Note that Zn is known at time tn while Xn+1 again needs to be modelled. We
can rewrite the expression $\frac{B_{t_{n+1}}}{B_{t_n}}$ as
$$\frac{B_{t_{n+1}}}{B_{t_n}} = e^{-(T-t_{n+1})y(t_{n+1},T) + (T-t_n)y(t_n,T)} = e^{-(T-t_n-\Delta t_n)(Z_n + X_{n+1}) + (T-t_n)Z_n} = e^{\Delta t_n Z_n - (T-t_{n+1})X_{n+1}}.$$
This expression makes it clear how the unknown variable Xn+1 enters the loss.
◦
In the above examples, we ended up with an expression containing a term of the form
e(··· ) − 1. This makes it tempting to use a Taylor approximation since ex ≈ 1 + x for
small x, and historically, such an approximation was often considered to ease computations.
Taking the example with the stock portfolio, we let
$$L^{\Delta}_{n+1} = -\sum_{i=1}^d \alpha_i S_n^{(i)} X^{(i)}_{n+1}$$
denote the linearized loss. This approximation can often be problematic however
since it is of interest to consider large losses (which we will do next week).
We consider a market with a risk-free asset (bank account) $B_t$ with dynamics $dB_t = r_t B_t\,dt$. $r_t$ is called the interest rate and is assumed to be an adapted process. We model a risky
asset (such as a stock) with price process St by a stochastic differential equation (SDE) of
the form
dSt = µ(t, St )dt + σ(t, St )dWt
with deterministic functions µ and σ and a Brownian motion W . µ is called the local mean
rate of return for St while σ is called the volatility of St . For all the necessary results on
SDEs, consult chapter 4 and 5 of [6]. For our purposes, it suffices to know the model on an
intuitive basis. The Black-Scholes model is a special case of the above model.
Definition 2.5. The (Standard) Black-Scholes model consists of two assets, a bank account $B_t$ and a stock $S_t$, with dynamics
$$dB_t = r B_t\,dt, \qquad dS_t = \mu S_t\,dt + \sigma S_t\,dW_t,$$
where r, µ and σ > 0 are constants.
In the language of SDEs, a process with the dynamics of St is called a Geometric Brownian
motion (GBM). Such an SDE can be solved explicitly. In our case, we may write
$$S_t = S_0\, e^{\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma W_t}.$$
The Black-Scholes model can of course be extended to include more risky assets with the
same type of dynamics as St . Such a model is naturally referred to as a multidimensional
Black-Scholes model.
The goal of arbitrage theory is to price financial derivatives i.e. products based on the price
of underlying assets. We think intuitively of an arbitrage as a money machine/free lunch
i.e. as a portfolio of assets that costs nothing and produces a positive amount of money
with probability one. It turns out that the absence of arbitrage is the same as the existence
of a so called equivalent martingale measure (EMM) i.e. a measure Q equivalent to the
underlying measure P (Q and P have the same null sets) and such that the discounted price
processes
St
Bt
are martingales under Q. Q is also referred to as a risk neutral measure. Under the measure
Q, the dynamics of $S_t$ change to
$$dS_t = r S_t\,dt + \sigma S_t\,dW^Q_t,$$
where W Q is a Brownian motion under the Q measure. Note that the volatility is unchanged
while the local mean rate of return becomes the interest rate times St . Say we have a
derivative which expires at time T , the current time is t < T and that the derivative pays
X = Φ(ST ) at time T . Note that the payout is a function of the price of the risky asset at
time T (such a derivative is called simple). If Πt [X] denotes the (arbitrage free) price of X
at time t, arbitrage theory yields the following formula.
Theorem 2.6 (Risk Neutral Valuation). The arbitrage free price of Φ(ST ) at time t < T
is given by
Πt [X] = e−r(T −t) E Q [Φ(ST ) | Ft ].
By an arbitrage free price we mean a price process that doesn’t introduce an arbitrage into
the market. Note that the above formula says that the price is given by the discounted
expected value under the risk neutral measure (given the information we currently have
available).
Example 2.7 (European call option). A European call option gives the holder the right
(but not the obligation) to buy one stock at time T at price K (the strike price). The
payout is (ST − K)+ = max{ST − K, 0} since if ST > K, we get the payout ST − K while
if ST ≤ K, the option is worthless. In the Black-Scholes model, we can solve for ST as
$$S_T = S_t\, e^{\left(r - \frac{\sigma^2}{2}\right)(T-t) + \sigma\left(W^Q_T - W^Q_t\right)}$$
since we are free to choose the current starting value (so here we choose to consider the
start value at time t). If C(t, T ) denotes the price at time t, we have by the Risk Neutral
Valuation formula that
$$C(t,T) = e^{-r(T-t)}\, E^Q\left[(S_T - K)^+ \mid \mathcal{F}_t\right],$$
and this can be computed explicitly¹. The result is known as the Black-Scholes formula. It
says that
C(t, T ) = St Φ(u) − Ke−r(T −t) Φ(v)
where
$$u = \frac{\log(S_t/K) + (r + \sigma^2/2)(T-t)}{\sigma\sqrt{T-t}}, \qquad v = u - \sigma\sqrt{T-t}.$$
1 The brave reader can carry out this computation. It involves a lot of integration by substitution.
Say we have a portfolio consisting of one European call option, so that the value is Vn =
C(tn , T ). We note that the risk in this portfolio can be explained by three quantities, the stock price, the interest rate and the volatility,
since the volatility and interest rate are not constant in real life. One typically needs to
model the change in these factors (called risk factors, see the discussion below), namely
Xn+1 = Zn+1 − Zn . We can for example consider the linearized loss
$$L^{\Delta}_{n+1} = -\left(\frac{\partial C}{\partial t}\,\Delta t + \frac{\partial C}{\partial S}\, X^{(1)}_{n+1} + \frac{\partial C}{\partial r}\, X^{(2)}_{n+1} + \frac{\partial C}{\partial \sigma}\, X^{(3)}_{n+1}\right).$$
In mathematical finance, these first order derivatives have names. ∂C/∂t is called ”theta”,
∂C/∂S ”delta”, ∂C/∂r ”rho” and ∂C/∂σ ”vega”. Together these quantities are referred to
as the greeks, see chapter 10 in [6].
◦
In the previous examples we stressed that we only need to model the change in risk factors
Xn+1 since the value Zn is already known at time tn . Hence we can write the loss entirely
in terms of the change in risk factors. Explicitly, with the linearized loss
$$L^{\Delta}_{n+1} = -\frac{\partial f}{\partial t}(t_n, Z_n)\,\Delta t - \sum_{i=1}^d \frac{\partial f}{\partial z_i}(t_n, Z_n)\, X^{(i)}_{n+1}$$
obtained by applying a first order Taylor expansion (writing the portfolio value as $V_n = f(t_n, Z_n)$), we can similarly define the linearized loss operator
$$l^{\Delta}_{[n]}(x) := -\frac{\partial f}{\partial t}(t_n, Z_n)\,\Delta t - \sum_{i=1}^d \frac{\partial f}{\partial z_i}(t_n, Z_n)\, x^{(i)}.$$
3 Risk measures
We need some notion of the ”size” of a risk in order to quantify the risk of a loss.
Definition 3.1. Let L denote a loss. A risk measure ρ associates a real number to L
denoted by ρ(L).
A way to make the definition more formal is to let G denote the set of all measurable real-
valued functions on the background probability space. A risk measure is then a mapping
ρ : G → R. We will not worry about these details in this course. We now go through
some essential examples of risk measures. In the following, let L denote some loss random
variable.
Example 3.2. For α ∈ (0, 1), let ρ = inf{x ∈ R : P (L > x) ≤ 1 − α}. This risk measure
is called the Value at Risk at level α. One can intuitively think of VaRα (L) as the smallest
value of x such that P (L ≤ x) ≥ α. We can rewrite
VaRα = inf{x ∈ R : 1 − P (L ≤ x) ≤ 1 − α}
= inf{x ∈ R : FL (x) ≥ α}
= FL← (α) =: qα (FL )
with FL denoting the distribution function of L and FL← the generalised inverse of FL . Since
FL is a distribution function, the generalised inverse coincides with the quantile function
q(·) (L). So a statistician would simply call VaRα the α-quantile of L. See the appendix for
more information on generalised inverses and their properties. ◦
Example 3.3. For α ∈ (0, 1) and $F_L$ continuous and strictly increasing, we define
$$\mathrm{ES}_\alpha(L) = E[L \mid L \geq \mathrm{VaR}_\alpha(L)],$$
called the Expected Shortfall at level α. This is the expected loss given that the loss has
surpassed the Value at Risk. We immediately see that ESα (L) ≥ VaRα (L) and that ESα
takes into account the severity of the loss in comparison to VaRα . ◦
We want to generalize the Expected Shortfall to also be valid for non-continuous distribution
functions. We will need the following lemma.
Lemma 3.4. If U is uniformly distributed on (0, 1) and L is a random variable with con-
tinuous distribution function $F_L$, then $L \stackrel{d}{=} F_L^{\leftarrow}(U)$.
Proof. Let Y = FL← (U ). A property of generalised inverses (see the appendix) is that
$u \leq F_L(x)$ if and only if $F_L^{\leftarrow}(u) \leq x$. Hence
$$P(Y \leq x) = P(F_L^{\leftarrow}(U) \leq x) = P(U \leq F_L(x)) = F_L(x),$$
so Y and L have the same distribution function.
The following proposition tells us how to generalize the notion of Expected Shortfall.
Proposition 3.5. Let L be a loss variable with FL continuous and strictly increasing. Then
$$\mathrm{ES}_\alpha(L) = \frac{1}{1-\alpha}\int_\alpha^1 \mathrm{VaR}_u(L)\,du.$$
Proof. Because FL is continuous and strictly increasing, the generalised inverse is a proper
inverse. Hence
P (L ≥ VaRα (L)) = 1 − α
Most problems include distributions that have continuous and strictly increasing distribution functions, but it is also useful to have a definition of Expected Shortfall for general distributions. The following definition summarises the above considerations.
Definition 3.6. For a loss variable L with $E[|L|] < \infty$ and α ∈ (0, 1), the Expected Shortfall at level α is defined as
$$\mathrm{ES}_\alpha(L) = \frac{1}{1-\alpha}\int_\alpha^1 \mathrm{VaR}_u(L)\,du.$$
Definition 3.7. For a risk measure ρ and loss variables L, L1, L2, we consider the following axioms/properties:
(i) (Translation invariance). $\rho(L + c) = \rho(L) + c$ for all $c \in \mathbb{R}$.
(ii) (Subadditivity). $\rho(L_1 + L_2) \leq \rho(L_1) + \rho(L_2)$.
(iii) (Positive homogeneity). $\rho(\lambda L) = \lambda\rho(L)$ for all $\lambda \geq 0$.
(iv) (Monotonicity). If $L_1 \leq L_2$, then $\rho(L_1) \leq \rho(L_2)$.
The rationale for translation invariance is that adding a deterministic quantity to the loss
should increase the capital we need to set aside by exactly that amount. The rationale for
subadditivity is that diversification should reduce risk. Positive homogeneity makes sense
since if we invest more money into the same asset, the amount of capital we need to set
aside should be multiplied by the same factor. Monotonicity also clearly makes sense. Note
that a risk measure satisfying all of the above axioms is called a coherent risk measure.
Value at Risk and Expected Shortfall are the two most popular choices of risk measures.
One reason many prefer Expected Shortfall over Value at Risk is that Expected Shortfall in
general satisfies all the above axioms, while Value at Risk only satisfies three.
Value at Risk is not a coherent risk measure in general. A counterexample can be found in
[17], example 2.25. The reader can construct a continuous counterexample as an exercise.
To see that Expected Shortfall is subadditive when the losses have continuous distribution functions, recall that in this case
$$\mathrm{ES}_\alpha(L) = \frac{1}{1-\alpha}\, E\left[L\,\mathbf{1}_{\{L \geq \mathrm{VaR}_\alpha(L)\}}\right].$$
Define Ii := 1{Li ≥VaRα (Li )} for i = 1, 2 and I12 := 1{L1 +L2 ≥VaRα (L1 +L2 )} . We compute
(1 − α)(ESα (L1 ) + ESα (L2 ) − ESα (L1 + L2 )) = E[L1 I1 ] + E[L2 I2 ] − E[(L1 + L2 )I12 ]
= E[(L1 (I1 − I12 ))] + E[(L2 (I2 − I12 ))].
We now consider two cases for L1 . If L1 ≥ VaRα (L1 ), then I1 − I12 ≥ 0 and hence
L1 (I1 − I12 ) ≥ VaRα (L1 )(I1 − I12 ). If L1 < VaRα (L1 ), then I1 − I12 ≤ 0 so again we have
L1 (I1 − I12 ) ≥ VaRα (L1 )(I1 − I12 ). Applying the same reasoning to L2 , we get
(1 − α)(ESα (L1 ) + ESα (L2 ) − ESα (L1 + L2 )) ≥ E[VaRα (L1 )(I1 − I12 ) + VaRα (L2 )(I2 − I12 )]
= VaRα (L1 )E[I1 − I12 ] + VaRα (L2 )E[I2 − I12 ]
= VaRα (L1 )((1 − α) − (1 − α))
+ VaRα (L2 )((1 − α) − (1 − α))
=0
implying that ESα (L1 ) + ESα (L2 ) ≥ ESα (L1 + L2 ) which is the desired statement.
If we consider more stocks, the expression becomes a lot more complicated. It gets even
worse with a more diverse portfolio (with stocks, bonds and call options for example).
◦
Example 3.11. This is from Example 2.11 and 2.14 from [17]. Let α ∈ (0, 1) and assume
that the loss L is normally distributed with mean µ and variance σ². Since $F_L$ is continuous and strictly increasing, we have
$$\mathrm{VaR}_\alpha(L) = \mu + \sigma\,\Phi^{-1}(\alpha),$$
using the properties of Value at Risk from before. We can now compute the Expected
Shortfall, where we again use that FL is continuous and strictly increasing,
$$\mathrm{ES}_\alpha(L) = E[L \mid L \geq q_\alpha(L)] = \mu + \sigma\, E\left[\frac{L-\mu}{\sigma} \,\Big|\, \frac{L-\mu}{\sigma} \geq q_\alpha\!\left(\frac{L-\mu}{\sigma}\right)\right] = \mu + \sigma\, E\left[\frac{L-\mu}{\sigma} \,\Big|\, \frac{L-\mu}{\sigma} \geq \Phi^{-1}(\alpha)\right]$$
$$= \mu + \frac{\sigma}{1-\alpha}\int_{\Phi^{-1}(\alpha)}^{\infty} x\,\varphi(x)\,dx = \mu + \frac{\sigma}{1-\alpha}\left[-\varphi(x)\right]_{\Phi^{-1}(\alpha)}^{\infty} = \mu + \sigma\,\frac{\varphi(\Phi^{-1}(\alpha))}{1-\alpha}.$$
◦
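The closed-form expressions from Example 3.11 are easy to implement. Below is a minimal R sketch of my own (the function names, level and parameter values are illustrative choices, not part of the original notes):

var_normal <- function(alpha, mu, sigma) mu + sigma * qnorm(alpha)                       # VaR_alpha for L ~ N(mu, sigma^2)
es_normal  <- function(alpha, mu, sigma) mu + sigma * dnorm(qnorm(alpha)) / (1 - alpha)  # ES_alpha
var_normal(0.99, mu = 0, sigma = 1)   # approximately 2.33
es_normal(0.99, mu = 0, sigma = 1)    # approximately 2.67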
We now consider methods for computing VaR and ES directly from data. Up to time tn we
have the empirical observations L1 , ..., Ln . For this discussion, we make the (not so realistic)
assumption that L1, ..., Ln is an iid sample. Order the sample and let
$$L_{1,n} \geq L_{2,n} \geq \cdots \geq L_{n,n}$$
denote the corresponding order statistics. We define the empirical distribution function
$$F_L^{(n)}(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{[L_i,\infty)}(x)$$
that assigns equal weight to each observation. To estimate the risk measures of interest, an idea is to replace $F_L$ (which is unknown) with the empirical distribution function $F_L^{(n)}$. This idea is justified by e.g. the Strong Law of Large Numbers, which implies that
$$F_L^{(n)}(x) \to P(L_1 \leq x) \quad \text{a.s. as } n \to \infty$$
for each x, or the even stronger result known as the Glivenko-Cantelli Theorem, which states that
$$\sup_{x\in\mathbb{R}} \left|F_L^{(n)}(x) - F_L(x)\right| \to 0 \quad \text{a.s. as } n \to \infty.$$
A problem with this approach is that it is often very difficult to infer the tail behaviour of $F_L$ from $F_L^{(n)}$ since we often do not have enough data in the tail of $F_L$. However, one nice thing about this approach is that we can explicitly solve for the estimator. We have
$$\widehat{\mathrm{VaR}}_\alpha(L) = L_{[n(1-\alpha)]+1,\,n} = \hat{q}_\alpha(F_L),$$
i.e. the empirical quantile. Here $[n(1-\alpha)]$ denotes the largest integer less than or equal to $n(1-\alpha)$. Similarly, if $F_L$ is continuous and strictly increasing, we may compute an estimate for $\mathrm{ES}_\alpha(L)$, namely
$$\widehat{\mathrm{ES}}_\alpha(L) = \frac{\sum_{i=1}^{[n(1-\alpha)]+1} L_{i,n}}{[n(1-\alpha)]+1},$$
i.e. the average of the observations greater than or equal to $\widehat{\mathrm{VaR}}_\alpha(L)$. Again we stress that
this approximation is often insufficient since we usually do not have many observations in
the tails of FL . We now turn to the problem of computing confidence intervals where we
focus on the Value at Risk. Let β ∈ (0, 1) denote some ”small” value. Our approach is to
find $\hat{A}$ and $\hat{B}$ such that
$$P\left(\hat{A} < \mathrm{VaR}_\alpha(L) < \hat{B}\right) \geq 1 - \beta$$
by determining $\hat{A}$ and $\hat{B}$ in such a way that
$$P\left(\mathrm{VaR}_\alpha(L) \leq \hat{A}\right) \leq \frac{\beta}{2} \quad \text{and} \quad P\left(\mathrm{VaR}_\alpha(L) \geq \hat{B}\right) \leq \frac{\beta}{2}.$$
Assume that L has a density so that for the true VaRα (L) we have P (L > VaRα (L)) = 1−α.
For the observed data, each data point can land on either side of VaRα (L). Hence we have
a sequence of Bernoulli trials with probability 1 − α of landing to the right of VaRα (L)
(a success) and probability α of landing to the left of VaRα (L) (a failure). Formally, let
Z = 1{L>VaRα (L)} , then
q := P (Z = 0) = α, p := P (Z = 1) = 1 − α.
Let Y denote the number of successes in n trials, i.e. $Y = \sum_{i=1}^n Z_i$ for $Z_i = \mathbf{1}_{\{L_i > \mathrm{VaR}_\alpha(L)\}}$ where $L_i$ is the ith loss. Then $Y \sim \mathrm{Bin}(n, p)$ and
$$P(Y \geq j) = \sum_{k=j}^n \binom{n}{k} p^k q^{n-k}.$$
Note that Lj,n ≥ VaRα (L) if and only if Y ≥ j so using the above, we can find the smallest
j such that P (Y ≥ j) ≤ β/2. Then P (Lj,n ≥ VaRα (L)) = P (Y ≥ j) ≤ β/2 and so we
set $\hat{A} = L_{j,n}$. Conversely, we can find the largest i such that $P(Y \leq i) \leq \beta/2$. Then $P(L_{i,n} \leq \mathrm{VaR}_\alpha(L)) \leq \beta/2$, so set $\hat{B} = L_{i,n}$.
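As a small illustration of the empirical estimators and the order-statistic confidence bounds above, here is a hedged R sketch; the simulated loss sample, α and β are illustrative choices, not data from the course.

set.seed(1)
losses <- rexp(1000)                          # hypothetical iid loss sample
alpha <- 0.95; beta <- 0.05
n <- length(losses)
L_sorted <- sort(losses, decreasing = TRUE)   # L_{1,n} >= ... >= L_{n,n}
k <- floor(n * (1 - alpha)) + 1
VaR_hat <- L_sorted[k]                        # empirical alpha-quantile
ES_hat  <- mean(L_sorted[1:k])                # average of the k largest losses
# Binomial bounds: Y = number of exceedances of VaR_alpha(L) is Bin(n, 1 - alpha)
idx <- 1:n
j <- min(idx[1 - pbinom(idx - 1, n, 1 - alpha) <= beta / 2])   # smallest j with P(Y >= j) <= beta/2
i <- max(idx[pbinom(idx, n, 1 - alpha) <= beta / 2])           # largest i with P(Y <= i) <= beta/2
c(VaR = VaR_hat, ES = ES_hat, lower = L_sorted[j], upper = L_sorted[i])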
Exercises
Exercise 1.1:
Consider a loss variable L which is exponentially distributed, i.e. L ∼ Exp(λ) for λ > 0.
1) Compute VaR_α(L).
2) Compute ES_α(L).
Exercise 1.2:
Consider a loss variable L with distribution function F given by
$$F(x) = \begin{cases} 0, & x < 1 \\ 1 - \frac{1}{1+x}, & 1 \leq x < 3 \\ 1 - \frac{1}{x^2}, & x \geq 3 \end{cases}$$
Exercise 1.3:
Let L be a loss variable with distribution function
$$F(x) = \frac{1}{1 + e^{-(x-\mu)/s}},$$
where $\mu \in \mathbb{R}$ and $s > 0$ are parameters. This distribution is called the logistic distribution with location µ and scale s. Let α ∈ (0, 1).
1) Compute VaR_α(L).
2) Compute ES_α(L).
Exercise 1.4:
Consider the risk measure ρ(L) = E[L]. Show/convince yourself that ρ is a coherent risk
measure. Explain why ρ may still be a bad risk measure.
Exercise 1.5:
In this exercise, we will show that the stochastic process given by
$$S_t = s\, e^{\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma W_t}$$
solves the stochastic differential equation $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$ of the Black-Scholes model.
We will apply the Itô formula. The Itô formula says that if we have a continuous time stochastic process X with differential
$$dX_t = \mu(t, X_t)\,dt + \sigma(t, X_t)\,dW_t$$
and a $C^2$ function f, then the process $Z_t = f(t, X_t)$ has stochastic differential given by
$$dZ_t = \left(\frac{\partial f}{\partial t}(t, X_t) + \mu(t, X_t)\frac{\partial f}{\partial x}(t, X_t) + \frac{1}{2}\sigma(t, X_t)^2 \frac{\partial^2 f}{\partial x^2}(t, X_t)\right)dt + \sigma(t, X_t)\frac{\partial f}{\partial x}(t, X_t)\,dW_t.$$
1) Identify the function f such that $S_t = f(t, W_t)$. Compute $\frac{\partial f}{\partial t}$, $\frac{\partial f}{\partial x}$ and $\frac{\partial^2 f}{\partial x^2}$.
2) Apply the Itô formula to show that $S_t$ satisfies the stochastic differential equation.
Exercise 1.6:
Consider the Standard Black-Scholes model and the derivative that pays X = log ST at
time T . Determine the arbitrage free price of this derivative at time t < T . Assume the
natural filtration generated by the Brownian motion. Hint: It may be helpful to consult the
subsection in the appendix on stochastic processes.
Week 2 - Methods for computing VaR
and extreme value theory
(iv) Bootstrapping.
The motivating example to keep in mind is the one with d investments from last week.
Recall that the loss was given by
$$L_{n+1} = -\sum_{i=1}^d \alpha_i S_n^{(i)}\left(e^{X^{(i)}_{n+1}} - 1\right)$$
with $S_n^{(i)}$ the value of the ith asset at time n, $X^{(i)}_{n+1} = \log S^{(i)}_{n+1} - \log S^{(i)}_n$ the log return and $\alpha_i$ the number of assets bought of asset i. We now go through the different methods.
Assume that $\mathbf{X}_{n+1} \sim N_d(\mathbf{m}, \Sigma)$ and write $w_n^{(i)} = \alpha_i S_n^{(i)}$ for the portfolio weights. Then we may rewrite $L^{\Delta}_{n+1} = -\langle \mathbf{w}_n, \mathbf{X}_{n+1}\rangle = -\mathbf{w}_n^T \mathbf{X}_{n+1}$. By the properties of the multivariate normal distribution (see the appendix), we have
$$L^{\Delta}_{n+1} \sim N\left(-\mathbf{w}_n^T \mathbf{m},\ \mathbf{w}_n^T \Sigma\, \mathbf{w}_n\right).$$
Let $\mu_n = -\mathbf{w}_n^T \mathbf{m}$ and $\sigma_n^2 = \mathbf{w}_n^T \Sigma\, \mathbf{w}_n$, so that we may write $L^{\Delta}_{n+1} \stackrel{d}{=} \mu_n + \sigma_n Z$ for Z ∼ N(0, 1). From last week, we can compute $\mathrm{VaR}_\alpha(L^{\Delta}_{n+1})$ as
$$\mathrm{VaR}_\alpha(L^{\Delta}_{n+1}) = \mu_n + \sigma_n \Phi^{-1}(\alpha).$$
Finally, we can use data to obtain estimates of $\mu_n$ and $\sigma_n^2$, i.e. of $\mathbf{m}$ and Σ. A virtue of this method is how simple it is to use. We get an exact analytical expression for $\mathrm{VaR}_\alpha(L^{\Delta}_{n+1})$.
The problem is the approximation ex ≈ 1 + x behind the method. This is not very precise
for large losses, and often we are interested in the tail of the distribution where x is large.
Also, the normality assumption is often problematic with real data.
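A minimal R sketch of the Variance-Covariance computation above; the weight vector, mean vector and covariance matrix are illustrative values of my own, not data from the notes.

alpha <- 0.99
w     <- c(100, 50)                                        # w_i = alpha_i * S_n^(i)
m     <- c(0.001, 0.0005)                                  # assumed mean vector of X_{n+1}
Sigma <- matrix(c(4e-4, 1e-4, 1e-4, 2.25e-4), nrow = 2)    # assumed covariance matrix
mu_n    <- -sum(w * m)
sigma_n <- sqrt(drop(t(w) %*% Sigma %*% w))
mu_n + sigma_n * qnorm(alpha)                              # VaR_alpha of the linearized loss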
per the discussion last week. This method is a lot more precise compared to the Variance-
Covariance method since it does not rely on the approximation ex ≈ 1 + x. Another obvious
advantage is flexibility. The method works for any distribution (at least if we can efficiently
simulate large samples from that distribution). One problem is that the rate of convergence
of the estimate will be slow since we are working with the tail of the loss distribution. Hence
we often have to generate a very large number of samples to obtain a robust estimate. This
is especially problematic if we have a large number of assets.
As shown above, this estimator is consistent (it converges in probability to the true underlying parameter, in this case $p_x$). A natural question to ask is how fast $\hat{p}_x^{(N)}$ converges. If
$$S_N = \sum_{i=1}^N \mathbf{1}_{\{l_i > x\}},$$
then the CLT applies to show that
$$\frac{S_N - N p_x}{\sigma_x \sqrt{N}} \xrightarrow{d} Z \sim N(0, 1),$$
where $\sigma_x^2$ denotes the variance of $\mathbf{1}_{\{l_1 > x\}}$. Note that $N \hat{p}_x^{(N)} = S_N$, so for large N we get (in distribution) that
$$Z \approx \frac{S_N - N p_x}{\sigma_x \sqrt{N}} = \frac{\hat{p}_x^{(N)} - p_x}{\sigma_x / \sqrt{N}}.$$
We can rewrite this relation and obtain the approximation (in distribution)
$$\hat{p}_x^{(N)} \approx p_x + \frac{\sigma_x}{\sqrt{N}}\, Z.$$
We can form an asymptotic confidence interval for $p_x$ as follows. Let β ∈ (0, 1) be some ”small” value and let $z_{\beta/2}$ denote the $(1-\beta/2)$-quantile of N(0, 1), so that $P(Z > z_{\beta/2}) = \beta/2$. Thus, with the ”high” probability 1 − β, we have
$$p_x \in \left(\hat{p}_x^{(N)} - \frac{\sigma_x}{\sqrt{N}}\, z_{\beta/2},\ \hat{p}_x^{(N)} + \frac{\sigma_x}{\sqrt{N}}\, z_{\beta/2}\right).$$
While the error $\sigma_x/\sqrt{N}$ goes to zero for N → ∞, the probability $p_x$ is often also small, so it can happen that the error still dominates the estimate $\hat{p}_x^{(N)}$, even when N is very large. To make this precise, define the relative error
$$\mathrm{RE} = \frac{z_{\beta/2}\,\sigma_x}{\sqrt{N}\, p_x} = C(N)\,\frac{\sigma_x}{p_x}, \qquad C(N) = \frac{z_{\beta/2}}{\sqrt{N}}.$$
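To make the relative-error discussion concrete, here is a small R sketch of the crude Monte Carlo estimator with its CLT-based confidence interval; the exponential loss distribution, x, N and β are illustrative assumptions of mine.

set.seed(2)
N <- 1e5; x <- 7; beta <- 0.05
l <- rexp(N)                              # simulated losses l_1, ..., l_N
p_hat   <- mean(l > x)
sigma_x <- sd(l > x)                      # sample standard deviation of the indicator
z <- qnorm(1 - beta / 2)
c(estimate = p_hat,
  lower = p_hat - z * sigma_x / sqrt(N),
  upper = p_hat + z * sigma_x / sqrt(N),
  relative_error = z * sigma_x / (sqrt(N) * p_hat))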
Importance sampling
To remedy the issue of diverging relative errors, we briefly introduce the main ideas of im-
portance sampling. Consider the setup L = f (X) where X is an Rd -valued random variable
with distribution function F (see the appendix for a brief discussion on multidimensional dis-
tribution functions) and f is a deterministic function. We refer to F as the true distribution
of X. Define the moment-generating function (mgf) of X as
$$\kappa(\xi) = E\left[e^{\langle \xi, X\rangle}\right] \quad \text{for } \xi \in \mathbb{R}^d,$$
and define the shifted distribution $F_\xi$ by
$$dF_\xi(x_1, ..., x_d) = \frac{e^{\langle \xi, x\rangle}}{\kappa(\xi)}\, dF(x_1, ..., x_d)$$
for all ξ ∈ Rd such that the moment-generating function is finite. We note the following
properties of Fξ :
(ii) Eξ [X] = ∇Λ(ξ) where Λ(ξ) = log κ(ξ) is the cumulant-generating function of X and
Eξ indicates the expectation taken with respect to the shifted measure.
The idea is now to choose ξ in a good way such that L > x occurs frequently i.e. so that
Pξ (L > x) is large, where Pξ denotes the probability under the shifted distribution. To relate
simulations under the shifted distribution with parameter ξ to the original probability, we
apply a representation formula,
$$p_x = P(L > x) = \int_{\{y:\, f(y) > x\}} dF(y) = \int_{\{y:\, f(y) > x\}} \frac{dF}{dF_\xi}(y)\, dF_\xi(y) = E_\xi\left[\mathbf{1}_{\{f(X) > x\}}\,\frac{dF}{dF_\xi}(X)\right],$$
with $\frac{dF}{dF_\xi}$ the Radon-Nikodym derivative.
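The following R sketch illustrates the exponential tilting idea in the simplest possible setting, L = X with X ~ N(0, 1) (a toy example of my own, not from the notes): tilting by ξ shifts the normal mean to ξ, and the Radon-Nikodym derivative is dF/dF_ξ(y) = exp(−ξy + ξ²/2).

set.seed(3)
N  <- 1e4; x <- 4
xi <- x                                   # tilt so that E_xi[X] = x
y  <- rnorm(N, mean = xi)                 # samples from the shifted distribution F_xi
lr <- exp(-xi * y + xi^2 / 2)             # dF/dF_xi evaluated at the samples
c(importance_sampling = mean((y > x) * lr),
  true_value = pnorm(x, lower.tail = FALSE))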
Bootstrapping
The idea of bootstrapping is to use the existing data to generate new data by resampling.
We sample with replacement and if we have N data points x1 , ..., xN , we choose a member
with probability 1/N N times to get a new data set of the same size. We can do this
procedure b times to obtain the bootstrap samples
$$x_1^{(1)}, ..., x_N^{(1)}$$
$$x_1^{(2)}, ..., x_N^{(2)}$$
$$\vdots$$
$$x_1^{(b)}, ..., x_N^{(b)}.$$
If we have some parameter θ that we want to estimate, we can compute an empirical estimate $\hat\theta$ from the original data and estimates $\hat\theta_j$ from each of the b bootstrap samples. We can then use the $\hat\theta_j$ to say something about the distribution of $\hat\theta$. We may for example generate confidence intervals by computing empirical quantiles using the $\hat\theta_j$. Explicitly, define the residuals $R_j = \hat\theta - \hat\theta_j$ and order them from largest to smallest, $R_{1,b} \geq R_{2,b} \geq \cdots \geq R_{b,b}$. The confidence bounds are then given by
$$\left[\hat\theta + R_{[b(1-\beta/2)],\,b},\ \hat\theta + R_{[b(\beta/2)]+1,\,b}\right].$$
These bounds improve classical estimates based on the CLT which converge more slowly.
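A small R sketch of these bootstrap confidence bounds, taking the sample mean as the parameter of interest; the data, b and β are illustrative choices of mine.

set.seed(4)
x <- rexp(100)                             # original sample x_1, ..., x_N
b <- 1000; beta <- 0.05
theta_hat  <- mean(x)
theta_boot <- replicate(b, mean(sample(x, replace = TRUE)))
R <- sort(theta_hat - theta_boot, decreasing = TRUE)    # R_{1,b} >= ... >= R_{b,b}
c(lower = theta_hat + R[floor(b * (1 - beta / 2))],
  upper = theta_hat + R[floor(b * beta / 2) + 1])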
For small probabilities, one considers a modification of this idea, namely smoothed bootstrap.
In smoothed bootstrapping, one smoothes the data around the tail values. Namely, given
the original sample $x_1, ..., x_N$, sample from the density
$$g(x) = \frac{1}{N}\sum_{i=1}^N \frac{1}{h}\, K\!\left(\frac{x - x_i}{h}\right), \qquad K(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2},$$
where $h > 0$ is a bandwidth parameter and K is the Gaussian kernel.
One can then employ importance sampling to shift the smoothed distribution so that more
samples will be in the tail.
Data analysis
Let x1 , ..., xn be observed data. If we want to use this data to make predictions about future
values, it is natural to propose a distribution F and assume that x1 , ..., xn are realizations
of iid variables X1 , ..., Xn with distribution F . The QQ plot (quantile-quantile plot) can be
used to determine whether F is a reasonable choice of distribution. The QQ plot consists
of the points
$$\left\{\left(F^{\leftarrow}\!\left(\frac{n-k+1}{n+1}\right),\ x_{k,n}\right) : k = 1, ..., n\right\}$$
where as usual, $x_{1,n} \geq \cdots \geq x_{n,n}$ denotes the order statistics of the sample. Recall that
$$F_n^{\leftarrow}\!\left(\frac{n-k+1}{n+1}\right) = x_{k,n}.$$
Hence we expect the QQ plot to be a straight line through 0 and with slope 1 if the data
truly comes from the reference distribution. The plot tells a bit more however. If the true
underlying distribution is an affine transformation of F , the plot will still be a straight line.
If that is the case, we can estimate proper location and scale parameters. The plot also
indicates whether the reference distribution has lighter or heavier tails than the empirical
sample. See the plots below.
Figure 1: Four examples of QQ-plots. We have 500 simulated values from the following
distributions: Standard exponential (upper left), folded/truncated normal (i.e. |X| for
X ∼ N (0, 1)) (upper right), standard lognormal (i.e. log X ∼ N (0, 1), lower left) and
the Pareto distribution with κ = 1 and α = 2 (lower right). The reference distribution is
standard exponential. The green line has slope one and intercept zero.
Consider the plots for a moment. The plot in the upper left corner shows that the data fol-
lows a straight line with slope one and intercept zero which is expected, since the reference
distribution and empirical distribution are the same. For the truncated normal, we see that
the data curves downwards, indicating lighter tails than the exponential distribution. For
the lognormal and Pareto distributions, the data curves upwards, indicating heavier tails
than the exponential distribution.
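A QQ plot of the kind shown in Figure 1 can be produced with a few lines of base R; the lognormal sample below is an illustrative choice, and the reference distribution is the standard exponential.

set.seed(5)
x <- rlnorm(500)
n <- length(x)
theoretical <- qexp((n:1) / (n + 1))           # F^{<-}((n - k + 1)/(n + 1)), k = 1, ..., n
empirical   <- sort(x, decreasing = TRUE)      # x_{1,n} >= ... >= x_{n,n}
plot(theoretical, empirical, xlab = "Exponential quantiles", ylab = "Ordered sample")
abline(0, 1, col = "green")                    # line with slope one and intercept zero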
Standard distributions in statistics include the normal, exponential and gamma distribu-
tions. These all have in common that they are light-tailed in the sense that the mgf of
these variables exists in some neighbourhood around zero. This is not typical behaviour for
log returns in a financial situation. These distributions are usually a lot more heavy-tailed.
Heavy-tailed distributions include the regularly varying distributions such as the Pareto
along with moderately heavy-tailed distributions such as the lognormal. In these cases, the
mgf does not exist. We will now scratch the surface of the theory of regularly varying
distributions.
Definition 5.1. A positive, measurable function h is called regularly varying (at infinity) of index ρ ∈ ℝ, written h ∈ RV_ρ, if for all t > 0,
$$\lim_{x\to\infty} \frac{h(tx)}{h(x)} = t^{\rho}.$$
If ρ = 0, we call h slowly varying.
Example 5.2. Constant functions and log are examples of slowly varying functions. ◦
Proposition 5.3. h ∈ RVρ if and only if h(x) = L(x)xρ for L a slowly varying function.
Proof. Assume h ∈ RV_ρ. Define the function $L(x) = h(x)x^{-\rho}$. Then for t > 0, we have
$$\frac{L(tx)}{L(x)} = \frac{h(tx)(tx)^{-\rho}}{h(x)x^{-\rho}} = t^{-\rho}\,\frac{h(tx)}{h(x)} \to t^{-\rho}\, t^{\rho} = 1 \quad \text{as } x \to \infty,$$
so L is slowly varying and satisfies $h(x) = L(x)x^{\rho}$. The converse implication is also easy and is left to the reader.
The definition says that F (equivalently, a random variable X ∼ F) is regularly varying of index α > 0 if its tail $\overline{F} = 1 - F$ is regularly varying of index −α, i.e. if for all t > 0, it holds that
$$\frac{P(X > tx)}{P(X > x)} \to t^{-\alpha} \quad \text{as } x \to \infty.$$
Definition 5.5. The Generalised Pareto Distribution (GPD) with parameters γ, β > 0 has distribution function
$$G_{\gamma,\beta}(x) = 1 - \left(1 + \frac{\gamma x}{\beta}\right)^{-1/\gamma}, \qquad x > 0.$$
Lemma 5.6. The Generalised Pareto Distribution with parameters β, γ > 0 is regularly
varying with index 1/γ.
Proof. For t > 0,
$$\lim_{x\to\infty} \frac{\overline{G}_{\gamma,\beta}(tx)}{\overline{G}_{\gamma,\beta}(x)} = \lim_{x\to\infty}\left(\frac{1 + \frac{\gamma t x}{\beta}}{1 + \frac{\gamma x}{\beta}}\right)^{-1/\gamma} = \lim_{x\to\infty}\left(\frac{\frac{\beta}{\gamma x} + t}{\frac{\beta}{\gamma x} + 1}\right)^{-1/\gamma} = t^{-1/\gamma}.$$
Theorem 5.7 (Karamata’s Theorem). If L is slowly varying and β < −1, then
$$\int_u^\infty x^{\beta} L(x)\,dx \sim -\frac{1}{\beta+1}\, u^{\beta+1} L(u), \qquad u \to \infty.$$
In words, $\int_u^\infty x^{\beta} L(x)\,dx$ behaves asymptotically like the integral of a power function; the slowly varying function plays little role asymptotically. We can now derive the Hill estimator.
Let β = −α − 1 where α > 0. Karamata’s Theorem implies
$$\int_u^\infty x^{-\alpha-1} L(x)\,dx \sim \frac{1}{\alpha}\, u^{-\alpha} L(u), \qquad u \to \infty.$$
We rewrite the left hand side. Since $\overline{F}(x) = L(x)x^{-\alpha}$, the integrand equals $x^{-1}\overline{F}(x)$. First note that $(\log x - \log u)' = 1/x$. We can now apply integration by parts (see the appendix for a review) and obtain
$$\int_u^\infty x^{-1}\overline{F}(x)\,dx = \left[(\log x - \log u)\,\overline{F}(x)\right]_u^\infty - \int_u^\infty (\log x - \log u)\,d\overline{F}(x).$$
$\overline{F}(x)$ decays like $x^{-\alpha}$ and hence decays to zero faster than $\log x$ grows to ∞, and thus the first term is zero. By using that $d\overline{F}(x) = d(1 - F(x)) = -dF(x)$, we get
$$\int_u^\infty x^{-1}\overline{F}(x)\,dx = \int_u^\infty (\log x - \log u)\,dF(x)$$
and so
$$\frac{1}{\overline{F}(u)}\int_u^\infty (\log x - \log u)\,dF(x) \to \frac{1}{\alpha}, \qquad u \to \infty.$$
This is a theoretical result. To turn this into an estimator, we have to use the empirical
distribution function. Suppose we have iid data x1 , ..., xn distributed according to F . Order
the samples, x1,n ≥ ... ≥ xn,n . Let Fn denote the empirical distribution function. We can
then approximate F by Fn . Replacing F by Fn in the above expression yields
$$\frac{1}{\alpha} \approx \frac{1}{\overline{F}_n(u)}\int_u^\infty (\log x - \log u)\,dF_n(x)$$
for sufficiently large u. We want to simplify this expression. Let $N_u$ denote the number of observations greater than u, i.e. $N_u = \#\{i \leq n : x_i > u\}$; then $\overline{F}_n(u) = N_u/n$. Also recall that $\overline{F}_n(x_{k,n}) = (k-1)/n$. Now choose a ”small” k and set $u = x_{k,n}$. Then
$$\frac{1}{\alpha} \approx \frac{n}{k-1}\int_u^\infty (\log y - \log x_{k,n})\,dF_n(y) = \frac{n}{k-1}\sum_{j=1}^k \frac{1}{n}\left(\log x_{j,n} - \log x_{k,n}\right) = \frac{1}{k-1}\sum_{j=1}^k \left(\log x_{j,n} - \log x_{k,n}\right)$$
since each jump of Fn is of size 1/n and the jth jump of Fn occurs at xj,n (see the appendix).
Replacing k − 1 by k gives us the Hill estimator.
Definition 5.8. Let x1 , ..., xn be a sample from a regularly varying distribution with index
α. Let x1,n ≥ ... ≥ xn,n denote the order statistics. We call
$$\hat{\alpha}_k = \left(\frac{1}{k}\sum_{j=1}^k \left(\log x_{j,n} - \log x_{k,n}\right)\right)^{-1}$$
the Hill estimator of α.
Example 5.10. A very classical data set in extreme value theory is the Danish fire insurance
data. The data consists of large Danish fire insurance claims from 1980 to 1990. The data
is available in the R package evir which has functions to compute estimates and make plots
related to extreme value theory. We present some plots of the data below.
estimates we can choose to report. There is no single correct answer for the choice of k and
very often, real life data is much more ugly than this particular data set. This illustrates the
importance of applying different tools in extreme value theory before drawing a conclusion.
◦
Using the Hill estimator, we can compute the Value at Risk at level β, VaRβ (the letter α
is already used). We want to solve for x in the equation P (X > x) = 1 − β. Choose k, xk,n
by looking at the Hill plot. Since X is regularly varying,
$$\overline{F}(x) = \overline{F}\!\left(\frac{x}{x_{k,n}}\, x_{k,n}\right) \sim \left(\frac{x}{x_{k,n}}\right)^{-\alpha} \overline{F}(x_{k,n}) \qquad \text{as } x \to \infty.$$
A second approach is the POT (Peaks Over Threshold) method. For a threshold u > 0, define the excess distribution over u by its tail
$$\overline{F}_u(x) = P(X > u + x \mid X > u) = \frac{\overline{F}(u+x)}{\overline{F}(u)}, \qquad x \geq 0.$$
If F is regularly varying with index α, one can show that for large u the excess distribution is approximately a Generalised Pareto Distribution, $\overline{F}_u(x) \approx \overline{G}_{\gamma,\beta}(x)$, where γ = 1/α and β = β(u) = u/α (while β depends on u, one chooses a fixed u so that β is also fixed). We can now describe the POT method in two steps:
(i) Estimate $\overline{F}(u) \approx N_u/n$.
(ii) Fit the GPD parameters to the excesses over u by maximum likelihood: one writes down the GPD likelihood of the excesses, and maximising it numerically in terms of β and γ yields the maximum likelihood estimators $\hat\beta_n$ and $\hat\gamma_n$.
It is possible to construct asymptotic confidence intervals by using
the result (valid for γ > −1/2)
$$\sqrt{N_u}\left(\hat\gamma_n - \gamma,\ \frac{\hat\beta_n}{\beta} - 1\right) \xrightarrow{d} N\!\left(0, M^{-1}\right) \qquad \text{for } N_u \to \infty,$$
where
$$M^{-1} = (1+\gamma)\begin{pmatrix} 1+\gamma & -1 \\ -1 & 2 \end{pmatrix}.$$
We now return to the first problem of determining a good value of u. As usual there is a
tradeoff between getting sufficiently many datapoints and choosing a value large enough so
that the asymptotics ”kick in”. The main tool for this job is the mean-excess function.
Definition 5.12. Let X be an integrable random variable with distribution F. The mean-excess function of X is defined as
$$e(u) = E[X - u \mid X > u].$$
In the following we assume that the tail parameter α satisfies α > 1. We compute
$$e(u) = \int_0^\infty (x-u)\,dP(X \leq x \mid X > u) = \frac{1}{\overline{F}(u)}\int_u^\infty (x-u)\,dF(x) = \frac{1}{\overline{F}(u)}\int_u^\infty \overline{F}(x)\,dx,$$
where the last equality follows from integration by parts; the boundary term vanishes since $\overline{F}(x)$ decays to zero faster than $1/x$ by the assumption α > 1. Using Karamata’s
Theorem, Theorem 5.7, we get
$$\int_u^\infty \overline{F}(x)\,dx = \int_u^\infty L(x)x^{-\alpha}\,dx \sim -\frac{L(u)u^{-\alpha+1}}{-\alpha+1} = \frac{L(u)u^{-\alpha+1}}{\alpha-1} \qquad \text{for } u \to \infty,$$
so
$$e(u) \sim \frac{1}{L(u)u^{-\alpha}}\cdot\frac{L(u)u^{-\alpha+1}}{\alpha-1} = \frac{u}{\alpha-1} \qquad \text{as } u \to \infty,$$
which shows that e(u) becomes linear asymptotically. This is a crucial observation and
hence we state the above result as a proposition.
Proposition 5.13. If F is regularly varying with index α > 1, the mean-excess function
e(u) satisfies
$$e(u) \sim \frac{u}{\alpha-1} \qquad \text{as } u \to \infty.$$
To see how the mean-excess function helps in determining a suitable u, we consider the em-
pirical mean-excess function. The idea is to replace F and $\overline{F}$ by their empirical counterparts. The empirical mean-excess function is given by
$$e_n(u) = \frac{1}{N_u/n}\int_u^\infty (x-u)\,dF_n(x) = \frac{n}{N_u}\sum_{j=1}^n \frac{1}{n}\,(x_{j,n} - u)\,\mathbf{1}_{\{x_{j,n} > u\}} = \frac{1}{N_u}\sum_{j=1}^n (x_{j,n} - u)\,\mathbf{1}_{\{x_{j,n} > u\}}.$$
If we set $u = x_{k,n}$ for some k = 2, 3, ..., n (k = 1 is excluded since $e_n(u) = 0$ in this case), we can simplify the above expression to
$$e_n(x_{k,n}) = \frac{1}{k-1}\sum_{j=1}^{k-1} \left(x_{j,n} - x_{k,n}\right).$$
Using the empirical mean-excess function we can construct a mean-excess plot by plotting
the values
{(xk,n , en (xk,n )) : k = 2, 3, ..., n} .
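The empirical mean-excess plot can be sketched in R as follows; the Pareto-type sample is an illustrative choice of mine, and the plotted points follow the construction above.

set.seed(6)
x  <- runif(500)^(-1/2)                   # P(X > t) = t^{-2}, t > 1
xs <- sort(x, decreasing = TRUE)          # x_{1,n} >= ... >= x_{n,n}
n  <- length(xs)
e_n <- sapply(2:n, function(k) mean(xs[1:(k - 1)] - xs[k]))
plot(xs[2:n], e_n, xlab = "Threshold u", ylab = "Mean excess")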
If the values x1 , ..., xn come from a regularly varying distribution, the plot will roughly look
like a straight line for large thresholds. For distributions with lighter tails, the mean-excess
function will either decrease or remain roughly constant. It is left to the reader to compute
some examples of mean-excess functions in the exercises. Some examples of mean-excess
plots are below.
Figure 4: Examples of mean-excess plots based on 500 simulated values from the following
distributions: Standard exponential (upper left), standard normal (upper right), standard
lognormal (lower left) and the Pareto distribution with κ = 1 and α = 2 (lower right).
The plots should serve not just as an example but also as a warning. The behaviour of the
large values in the plots are very chaotic, and one should be cautious in the interpretation
of such plots. To illustrate this further, the following plots are made with the exact same
distributions but with a different simulated sample.
Figure 5: More mean-excess plots with the same distributions as before, but with a new
sample.
It often occurs in practice that the tail and center of the data have different distributions.
For example, the tail could have a regularly varying distribution while the center has a light-
tailed distribution such as a normal or gamma distribution. In that case, the mean-excess
plot will probably only become linear for large values of u. In any case, one should use the
plot to find a value of u where the points begin to form a straight line. We illustrate this
with an example.
Example 5.14. Let us again consider the Danish fire insurance data. We want to model
the excesses using the POT method. We wish to determine a proper threshold and therefore
make a mean-excess plot:
The plot looks very linear for all thresholds, indicating Pareto tails. A word of warning:
This is typically not the case for real data! The Danish fire insurance data is in a sense
too nice to illustrate this point. Based on the plot, we choose the threshold u = 4. While
the linear trend may begin for slightly larger values, this choice also gives us sufficient data
to work with. To fit the generalized Pareto distribution, we apply maximum likelihood
estimation using the gpd function in the evir package as follows:
data(danish)
u <- 4
gpdfit <- gpd(danish, threshold = u, method = "ml")
gpdfit$par.ests
Figure 7: Blue: The tail from the fitted generalized Pareto distribution. Red: The empirical
tail from the data.
The approximation looks very good. Usually, one is not so lucky. If one is interested in
a statistical goodness of fit, a possible way (which is also implemented in evir if one uses
plot around a fitted gpd object) is to consider the excesses $z_i = x_i - u$, which should
be approximately $G_{\hat\gamma,\hat\beta}$ distributed. Hence $-\log G_{\hat\gamma,\hat\beta}(z_i)$ (which we call the generalised
residuals) should be approximately standard exponential distributed. We make a residual
plot and a QQ-plot of the $-\log G_{\hat\gamma,\hat\beta}(z_i)$ against a standard exponential distribution and
get the following:
We see no clear tendencies of the generalised residuals. Some of them are quite large, but
otherwise the left plot looks good. The right plot also looks good. While some of the sample
quantiles are below the line with slope one and intercept zero, the residuals overall seem to
follow a standard exponential distribution. We conclude that the model is an adequate fit.
◦
Exercises
Exercise 2.1:
For the shifted distribution
$$dF_\xi(x_1, ..., x_d) = \frac{e^{\langle \xi, x\rangle}}{\kappa(\xi)}\, dF(x_1, ..., x_d),$$
show that
$$E_\xi[X] = \nabla\Lambda(\xi).$$
Exercise 2.2:
In this exercise, we will get more comfortable with the concept of regular and slow variation.
1) Prove the remaining implication in Proposition 5.3.
2) Verify that the function h(x) = log log x for x sufficiently large is slowly varying.
3) Show that the Pareto distribution with parameters κ > 0 and α > 0 is regularly varying. What is the index? Recall that the distribution function is given by
$$1 - \left(\frac{\kappa}{\kappa + x}\right)^{\alpha}, \qquad x > 0.$$
Exercise 2.3:
Consider the Student t distribution with ν > 0 degrees of freedom, i.e. the distribution with
density
$$g_\nu(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}.$$
Show that this distribution is regularly varying and determine the corresponding index.
Exercise 2.4:
Consider a regularly varying distribution function F supported on [0, ∞) of index α > 0.
Let X ∼ F. Prove that $E[X^{\beta}] < \infty$ if β < α. Hint: Apply Karamata’s Theorem. Recall also the formula $E[X] = \int_0^\infty P(X > t)\,dt$ for a positive random variable X.
One can use a different form of the Karamata theorem to show that E[X β ] = ∞ when
β > α.
Exercise 2.5:
Karamata’s Representation Theorem says that if L is slowly varying, then we can write
$$L(x) = c_0(x)\, e^{\int_{x_0}^x \frac{\varepsilon(t)}{t}\,dt}$$
where $c_0(x) \to c_0 > 0$ for x → ∞ and ε(x) → 0 for x → ∞, for some $x_0 \geq 0$. If L is written in this form, we call the above a Karamata representation for L.
1) Find a Karamata representation for log.
2) Prove that a function of the form above is slowly varying.
Exercise 2.6:
Consider the distribution F with tail F (x) = x−2 for x > 1. Then F ∈ RV−2 .
1) Implement the Hill estimator (in R for example) and a function that can simulate values from F.
2) Simulate 50 values of F and make a Hill plot using your function from before. Also plot the true value of the index as a line in the plot.
3) Repeat for 100, 250 and 1000 values. Comment on the results.
Exercise 2.7:
Recall that we proved the following formula for the mean-excess function of an integrable
random variable X:
$$e(u) = \frac{1}{\overline{F}(u)}\int_u^\infty \overline{F}(x)\,dx.$$
1) Let X ∼ Exp(λ). Prove that e(u) = 1/λ.
2) Let X be Pareto distributed with parameters κ > 0 and α > 1. Prove that
$$e(u) = \frac{\kappa + u}{\alpha - 1}.$$
Exercise 2.8:
The goal of this exercise is to prove that if X ≥ 0 is a random variable with distribution
function F and
$$\lim_{x\to\infty} \frac{\overline{F}(x-y)}{\overline{F}(x)} = e^{\gamma y}, \qquad y > 0,$$
for some γ ∈ (0, ∞), then
e(u) → γ −1 for u → ∞.
Consider again the ”canonical example” of stock returns where the risk factors are the log returns $X^{(i)}_{n+1}$ with $X^{(i)}_{n+1} = \log S^{(i)}_{n+1} - \log S^{(i)}_n$. What is the distribution of $\mathbf{X}_{n+1}$? Inspired
by the Black-Scholes model, we could assume Xn+1 ∼ N (µ, Σ). There are several problems
with the normal assumption, however. A typical problem is that assets are very often
correlated in such a way that high (low) returns for one asset correlates with high (low)
returns for another. The normal distribution is very light-tailed, so such correlations are
often not captured. The plots below illustrate this issue.
Figure 9: Log returns of foreign exchange rates quotes against the US dollar.
Figure 10: Simulated foreign exchange rates using a bivariate normal distribution with
estimated means and covariance matrix.
From the plots, it is evident that the normal distribution fails to capture the dependency
in the tails. Furthermore, the probability mass is too concentrated around the mean. The
dependency in tails is a very typical phenomenon in financial data as illustrated in the plot
below.
To remedy the issues with the normal distribution, we introduce a class of distributions that
in some way resembles the normal distribution and shares a lot of its properties while also
being more flexible in terms of modelling tail behaviour. This is the class of spherical and
elliptical distributions.
Let $X = (X_1, ..., X_d) \sim N(0, I_d)$. The density of X is
$$f(x) = \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}\sum_{i=1}^d x_i^2} = \frac{1}{(2\pi)^{d/2}}\, e^{-r^2/2}$$
with $r^2 = x_1^2 + \cdots + x_d^2$. Hence the density only depends on $\|x\| = r$. Graphically, the level
sets of f are spheres (or circles in two dimensions). One can say that N (0, Id ) is spherically
symmetric/rotationally invariant. Define the random variable R by $R^2 = X_1^2 + \cdots + X_d^2$; then $R^2 \sim \chi^2(d)$, i.e. $R^2$ is Chi-square distributed with d degrees of freedom. We call R the radial component of X. Intuitively, we can decompose X as $X \stackrel{d}{=} RS$ with S uniformly distributed on the unit sphere $S^{d-1} = \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i^2 = 1\}$. While this is an informal approach, it gives us the idea on how to proceed formally.
Definition 6.1. For a d-dimensional random vector X, we define the characteristic function
of X as
$$\Phi_X(t) = E\left[e^{i t^T X}\right], \qquad t \in \mathbb{R}^d.$$
Remark 6.2. Note the similarity to the moment-generating function $\kappa_X(t) = E[e^{t^T X}]$. These
two transforms satisfy similar properties. For example, two random variables have the same
distribution if and only if their characteristic functions are equal. See the appendix for more
background on these functions. The characteristic function has the advantage that it always
exists (since the integrand is bounded by one in norm) but it provides less information about
the tail behaviour than the moment-generating function.
Note that we can write ΦX (t) = ψ(ktk2 ) for the function ψ : R → R given by ψ(t) = e−t/2 .
This is a formal way of stating that ΦX doesn’t depend on the direction of t. ◦
Definition 6.3. A random vector X in $\mathbb{R}^d$ is said to have a spherical distribution if its characteristic function has the form $\Phi_X(t) = \psi(\|t\|^2)$ for some univariate function ψ. ψ is called the characteristic generator of X, and we write X ∼ $S_d(\psi)$.
which is true for all $t_\theta$. By the uniqueness of the characteristic function, $X_\theta \stackrel{d}{=} X$. This
gives a formal argument for the intuition of X being rotationally invariant. The following
result gives an equivalent formulation of spherical distributions.
Proposition 6.5. The following are equivalent:
(i) X has a spherical distribution.
(ii) $X \stackrel{d}{=} RS$ with S uniformly distributed on the unit sphere $S^{d-1} := \{x \in \mathbb{R}^d : \|x\| = 1\}$ and R a one-dimensional random variable independent of S.
Proof. We first prove that (ii) implies (i). We have
$$\Phi_X(t) = \Phi_{RS}(t) = E\left[e^{i\langle t, RS\rangle}\right] = E\left[E\left[e^{i\langle t, RS\rangle} \mid R\right]\right] = E\left[E\left[e^{i\langle Rt, S\rangle} \mid R\right]\right] = E[\Phi_S(Rt)],$$
and since S is uniformly distributed on the unit sphere, $\Phi_S$ only depends on the length and not the direction. Hence $\Phi_X(t)$ also only depends on the length of t and X has a spherical distribution. We now show that (i) implies (ii). We have $\Phi_X(t) = \psi(\|t\|^2)$. Set $s = t/\|t\|$, then
$$\Phi_X(t) = E\left[e^{i\|t\|\langle s, X\rangle}\right],$$
and by assumption, this does not depend on s, only on $\|t\|$. Let S be uniformly distributed on the unit sphere with distribution function $F_S$. Since $\Phi_X(t)$ is constant in s, we have
$$\Phi_X(t) = \int_{S^{d-1}} E\left[e^{i\|t\|\langle s, X\rangle}\right] dF_S(s) = \int_{S^{d-1}}\int_{\mathbb{R}^d} e^{i\|t\|\langle s, x\rangle}\, dF_X(x)\, dF_S(s)$$
$$= \int_{\mathbb{R}^d}\int_{S^{d-1}} e^{i\|t\|\langle s, x\rangle}\, dF_S(s)\, dF_X(x) = \int_{\mathbb{R}^d}\int_{S^{d-1}} e^{i\langle s, \|t\|x\rangle}\, dF_S(s)\, dF_X(x)$$
$$= \int_{\mathbb{R}^d} E\left[e^{i\langle \|t\|x, S\rangle}\right] dF_X(x) = \int_{\mathbb{R}^d} E\left[e^{i\langle \|x\|t, S\rangle}\right] dF_X(x)$$
$$= E\left[E\left[e^{i\langle \|X\|t, S\rangle} \mid X\right]\right] = E\left[e^{i\langle t, \|X\|S\rangle}\right] = E\left[e^{i\langle t, RS\rangle}\right] = \Phi_{RS}(t),$$
where we have defined $R := \|X\|$. Hence $X \stackrel{d}{=} RS$ where R and S have the desired properties.
While characterisation (ii) is more intuitive, it is easier to work with definition (i) when one
wants to prove properties of spherical distributions. The following corollary tells how to
compute R and S when we know that X is spherical.
Corollary 6.6. Let $X \stackrel{d}{=} RS$ be spherical. Then
$$\left(\|X\|,\ \frac{X}{\|X\|}\right) \stackrel{d}{=} (R, S).$$
Proof. The proof is from [17], see Corollary 6.22. Let $f_1(x) = \|x\|$ and $f_2(x) = x/\|x\|$. Since $X \stackrel{d}{=} RS$, we have
$$\left(\|X\|,\ \frac{X}{\|X\|}\right) = (f_1(X), f_2(X)) \stackrel{d}{=} (f_1(RS), f_2(RS)) = (R, S)$$
as desired.
There exist methods to find A such that AAT = Σ. One such method is the Cholesky
factorisation. This factorisation determines a lower triangular matrix A such that AAT = Σ.
In detail,
$$\begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd} \end{pmatrix}\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{d1} \\ 0 & a_{22} & \cdots & a_{d2} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{dd} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1d} \\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \Sigma_{d1} & \Sigma_{d2} & \cdots & \Sigma_{dd} \end{pmatrix}.$$
The algorithm can (somewhat informally) be described as follows: Since $\Sigma_{11} = a_{11}^2$, $\Sigma_{11}$ determines $a_{11}$. Since $\Sigma_{21} = a_{11}a_{21}$, $\Sigma_{21}$ determines $a_{21}$, and so on. Since we can go back and forth between the matrix A and the matrix Σ, the notation $E_d(\mu, \Sigma, \psi)$ makes sense.
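As a sketch of how the Cholesky factorisation is used in practice, the following R snippet simulates from N(µ, Σ) by writing X = µ + AY with Y ~ N(0, I_d); µ and Σ are illustrative values of mine. Note that R's chol() returns an upper triangular matrix R with RᵀR = Σ, so A = Rᵀ.

set.seed(7)
mu    <- c(1, 0)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
A <- t(chol(Sigma))                       # lower triangular, A %*% t(A) equals Sigma
Y <- matrix(rnorm(2 * 1000), nrow = 2)    # columns of Y are iid N(0, I_2)
X <- mu + A %*% Y                         # each column is one draw from N(mu, Sigma)
rowMeans(X); cov(t(X))                    # close to mu and Sigma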
Using the characterisation of spherical distributions in terms of a radial component, we can
also write
X = µ + RAS
for X ∼ Ed (µ, Σ, ψ). We now turn to some properties of elliptical distributions. Afterwards
we will look at some examples.
as desired.
We want to be able to relate covariances to the dispersion. The following proposition shows
how these are related for elliptical distributions.
Proposition 7.2. If X ∼ $E_d(\mu, \Sigma, \psi)$, the covariance between the components is given by
$$\mathrm{Cov}(X_j, X_l) = -2\psi'(0)\,\Sigma_{jl}.$$
Proof. Since covariance does not depend on the mean, we can without loss of generality assume that $E[X_j] = 0$ for all j = 1, ..., d. Consider j = 1 and l = 2, and write $w(t) = t^T\Sigma t$. Differentiating the characteristic function $\Phi_X(t) = \psi(t^T\Sigma t)$ twice, we have
$$\mathrm{Cov}(X_1, X_2) = E[X_1 X_2] = -\frac{\partial}{\partial t_1}\frac{\partial}{\partial t_2}\,\psi(t^T\Sigma t)\Big|_{t=0}$$
and thus
$$\frac{\partial}{\partial t_1}\frac{\partial}{\partial t_2}\,\psi(t^T\Sigma t)\Big|_{t=0} = \frac{\partial}{\partial t_1}\Big(\psi'(w(t))\,(2t_1\Sigma_{12} + 2t_2\Sigma_{22})\Big)\Big|_{t=0} = 2\psi'(0)\,\Sigma_{12}.$$
Example 7.3. If Y ∼ N(0, $I_d$), then $\psi(r) = e^{-r/2}$ as seen earlier. We see that $\psi'(r) = -\frac{1}{2}e^{-r/2}$, so $\psi'(0) = -\frac{1}{2}$. If X ∼ N(µ, Σ), the above proposition tells us that $\mathrm{Cov}(X_j, X_l) = -2\psi'(0)\Sigma_{jl} = \Sigma_{jl}$, as expected. ◦
We list some further properties of elliptical distributions.
Theorem 7.4. Let X = µ + AY ∼ Ed (µ, Σ, ψ).
(i) (Linear combinations). If B is a k × d matrix and $b \in \mathbb{R}^k$, then
$$BX + b \sim E_k\left(B\mu + b,\ B\Sigma B^T,\ \psi\right).$$
(ii) (Marginal distributions). The marginals X1 , ..., Xd also have elliptical distributions
with the same characteristic generator. Explicitly, Xi ∼ E1 (µi , Σii , ψ).
While elliptical distributions are quite flexible as seen in the examples earlier, the above
theorem also illustrates a drawback. The flexibility of elliptical distributions is limited by
the fact that if X is elliptical, so is any coordinate. In real data, it is often the case that
the marginals have very different types of distributions. This motivates the topic for next
week, namely copulas. We now consider some examples.
Thus, X is obtained by drawing from a collection of normal random variables with random
covariance W Σ. ◦
X ∼ td (ν, µ, Σ)
Consider a portfolio with return $R_p = \sum_{i=1}^d w_i X_i = w^T X$, where $w_i$ are the weights of the portfolio, i.e. $\sum_{i=1}^d w_i = 1$. If we fix the expected total return $E[R_p] = \sum_{i=1}^d w_i\mu_i = w^T\mu$, we want to minimize the risk in the sense of minimising the variance (note that this approach is different from the strategy in this course, where we focus on risk measures). If X = µ + AY where Y ∼ N(0, $I_d$), then
$$\mathrm{Var}(R_p) = \mathrm{Var}(w^T A\mathbf{Y}) = w^T A A^T w = \|A^T w\|^2,$$
and since Y is spherical, the variance is minimised whenever $\|A^T w\|$ is minimized with respect to the weights $w_i$.
To transfer these ideas to the setting in this course, let ρ be a risk measure which satisfies
monotonicity and translation invariance. Assume more generally that X = µ + AY is
elliptical. The goal now is to minimise ρ(L) where L = −wT µ − wT AY is the loss. We
have
ρ(L) = −wT µ + ρ(−wT AY).
As Y is spherical, we have $\mathbf{Y} \stackrel{d}{=} -\mathbf{Y}$, so we can remove the minus in the risk measure and
obtain
ρ(L) = −wT µ + ρ(wT AY) = −E[Rp ] + ρ(wT AY).
As E[Rp ] is fixed, ρ(L) is minimised whenever ρ(wT AY) is minimised. As Y is spherical,
ρ(wT AY) is minimised when kAT wk is minimised. Thus the answer remains the same as
in the classical case above.
Exercises
Exercise 3.1:
Prove Theorem 7.4. Hint: Characteristic functions!
Exercise 3.2:
Let X ∼ Γ(α, β) i.e. let X have a gamma distribution with parameters α, β > 0. Verify
that 1/X ∼ Ig(α, β). This provides an explanation for the name ”inverse gamma”.
Exercise 3.3:
1) Without using an R package, simulate 500 values of the multivariate t distribution with
$$\nu = 3, \qquad \mu = \begin{pmatrix} 2 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.$$
Make a plot of the result. Hint: Use the result of the previous exercise.
2) Now simulate 500 values of the multivariate normal distribution with the same µ and Σ and plot the result. Compare the plot to the one for the t distribution.
Exercise 3.4:
Without using an R package, write an R function to simulate from the multivariate normal
distribution with mean µ = (1, 0, 2) and covariance matrix
$$\Sigma = \begin{pmatrix} 1 & 0 & -2 \\ 0 & 3 & 5 \\ -2 & 5 & 1 \end{pmatrix}.$$
Exercise 3.5:
Let S be uniformly distributed on the unit circle S1 . Simulate 500 values of S and plot the
result. Hint: Corollary 6.6.
Week 4 - Copulas I
Recall that if $X_i$ has distribution function $F_i$, then $F_i^{\leftarrow}(U_i) \stackrel{d}{=} X_i$ with $U_i$ a Unif(0, 1) variable. This is the ”inverse transform method” used in simulation. Recall also that $F_i(X_i) \stackrel{d}{=} U_i$ whenever $F_i$ is continuous.
To study the problem of determining the joint distribution F , we assume that the marginal
distribution functions Fi are known. The transformation Ui := Fi (Xi ) in the continuous
case is illustrated below.
Figure 12: An illustration (in two dimensions) of the idea of a ”copula space”. The ”original
space” on the left is where the variables live, and we wish to transform them into a collection
of uniform variables with support on [0, 1]d .
Definition 8.1. A copula C is a distribution function on [0, 1]d such that all marginals are
Unif(0, 1) distributed.
To be able to go back and forth between the ”original space” and the ”copula space”, Sklar’s
Theorem is an essential tool.
Theorem 8.2 (Sklar’s Theorem). Let F be the joint distribution function of the random
vector X = (X1 , ..., Xd ) with marginal distribution functions F1 , ..., Fd . There exists a copula
C such that
F (x1 , ..., xd ) = C(F1 (x1 ), ..., Fd (xd )). (3.1)
If F1 , ..., Fd are continuous, C is unique. Conversely, given a copula C and marginal distri-
bution functions F1 , ..., Fd , then F as defined in (3.1) is a joint distribution function with
marginals F1 , ..., Fd .
Proof. For the sake of simplicity, assume that the Fi are continuous. Then Ui := Fi (Xi ) ∼
Unif(0, 1). Suppose X ∼ F with marginals Xi ∼ Fi . Let U = (F1 (X1 ), ..., Fd (Xd )) and let
C be the distribution function of U. By construction and the continuity assumption on the
$F_i$, C is a copula. We compute
$$C(F_1(x_1), ..., F_d(x_d)) = P(F_1(X_1) \leq F_1(x_1), ..., F_d(X_d) \leq F_d(x_d)) = P(X_1 \leq x_1, ..., X_d \leq x_d) = F(x_1, ..., x_d),$$
which shows that the copula C has the desired properties. As for uniqueness, by continuity
of the Fi , we have Fi (Fi← (ui )) = ui for all ui ∈ [0, 1]. Letting xi = Fi← (ui ) in the expression
above, we get
C(u1 , ..., ud ) = F (F1← (u1 ), ..., Fd← (ud ))
and any copula C̃ satisfying C̃(F1 (x1 ), ..., Fd (xd )) = F (x1 , ..., xd ) must satisfy the same
relation. Uniqueness now follows. To prove the converse statement, let C be a copula and
F1 , ..., Fd univariate distribution functions. Let U = (U1 , ..., Ud ) have distribution function
C and define Xi := Fi← (Ui ), X := (X1 , ..., Xd ). We know that Xi ∼ Fi , so the marginal
distributions are correct. Also,
which shows that F (x1 , ..., xd ) := C(F1 (x1 ), ..., Fd (xd )) is the distribution function for X.
Sklar’s Theorem provides a recipe for constructing copulas using a known joint distribution
function. We call such copulas implicit copulas. Different examples of copulas will be given
in the next section. We first consider some more theoretical properties. We start with the
following useful characterisation of copulas.
Proposition 8.3. A function C : [0, 1]d → [0, 1] is a copula if and only if
(i) C(u1 , ..., ud ) = 0 if ui = 0 for any i = 1, ..., d.
(ii) C(1, ..., 1, ui , 1, ..., 1) = ui for any i = 1, ..., d and ui ∈ [0, 1].
(iii) For all (a1, ..., ad), (b1, ..., bd) ∈ [0, 1]^d with ai ≤ bi, we have

    Σ_{i1=1}^{2} · · · Σ_{id=1}^{2} (−1)^{i1+···+id} C(u_{1,i1}, ..., u_{d,id}) ≥ 0,

where u_{j,1} = aj and u_{j,2} = bj.
The first two properties are self-explanatory. The third property is called the rectangle inequality and can be interpreted as follows: for a vector (U1, ..., Ud) with distribution function C, it states that P(a1 ≤ U1 ≤ b1, ..., ad ≤ Ud ≤ bd) ≥ 0.
Proposition 8.4 (Fréchet bounds). For every copula C, we have W(u) ≤ C(u) ≤ M(u) where

    W(u) = max{ Σ_{i=1}^{d} ui + 1 − d, 0 }   and   M(u) = min_{i=1,...,d} ui.

Proof. Let U have distribution function C. For every i and every u ∈ [0, 1]^d, we have

    C(u) = P(U1 ≤ u1, ..., Ud ≤ ud) ≤ P(Ui ≤ ui) = ui.

The bound C(u) ≤ M(u) now follows by minimising over all i. Conversely, for u ∈ [0, 1]^d,

    1 − C(u) = 1 − P(U1 ≤ u1, ..., Ud ≤ ud) = P( ∪_{i=1}^{d} {Ui > ui} )
             ≤ Σ_{i=1}^{d} P(Ui > ui) = Σ_{i=1}^{d} (1 − ui) = d − Σ_{i=1}^{d} ui.

Thus −C(u) ≤ d − 1 − Σ_{i=1}^{d} ui, implying C(u) ≥ Σ_{i=1}^{d} ui + 1 − d. Since C(u) ≥ 0 always holds, the lower bound follows.
What kinds of random vectors produce these upper and lower bounds? It turns out that
the function W is not a copula for d > 2, see Example 7.24 in [17]. We can however
produce M as a copula for any d and W for d = 2. To do so, we introduce the concepts of
comonotonicity and countermonotonicity.
Definition 8.5. We say that X1 , ..., Xd are comonotone if (X1 , ..., Xd ) = (α1 (Z), ..., αd (Z))
for some univariate variable Z and nondecreasing functions α1 , ..., αd . We say that (X1 , X2 )
are countermonotonic if (X1 , X2 ) = (α(Z), β(Z)) for some univariate variable Z, some
nondecreasing function α and some nonincreasing function β.
The following proposition shows that the lower and upper Fréchet bounds arise from coun-
termonotonic and comonotonic random vectors, respectively.
Proposition 8.6. A comonotone bundle (X1 , ..., Xd ) has M as a copula, and a counter-
monotonic bundle (X1 , X2 ) in d = 2 dimensions has W as a copula.
Proof. Consider a comonotone bundle (X1 , ..., Xd ) and assume for simplicity that the func-
tions αi are strictly increasing and continuous. If F is the joint distribution function, we
have
    F(x1, ..., xd) = P(X1 ≤ x1, ..., Xd ≤ xd) = P(α1(Z) ≤ x1, ..., αd(Z) ≤ xd)
                   = P(Z ≤ α1←(x1), ..., Z ≤ αd←(xd)) = P(Z ≤ min_{i=1,...,d} αi←(xi))
                   = min_{i=1,...,d} P(Z ≤ αi←(xi)) = min_{i=1,...,d} Fi(xi) = M(F1(x1), ..., Fd(xd)),

so M is indeed a copula for (X1, ..., Xd). The countermonotonic case in d = 2 is handled by an analogous computation.
9 Examples of copulas
We will study three types of copulas, namely fundamental copulas, implicit copulas and
explicit copulas.
Fundamental copulas
Fundamental copulas arise from theoretical considerations. We have already seen two ex-
amples from the Fréchet bounds.
    M(u) = min_{i=1,...,d} ui,      W(u1, u2) = max{u1 + u2 − 1, 0}.
The Fréchet bound copulas are not the only ”theoretical” copulas.
Example 9.2 (The independence copula). For any d > 1, the independence copula is given by

    C(u1, ..., ud) = Π_{i=1}^{d} ui.

It is the copula of a vector of independent variables: if X1, ..., Xd are independent with continuous distribution functions Fi and joint distribution function F, then

    C(F1(x1), ..., Fd(xd)) = Π_{i=1}^{d} Fi(xi) = Π_{i=1}^{d} P(Xi ≤ xi) = P(X1 ≤ x1, ..., Xd ≤ xd) = F(x1, ..., xd).   ◦
Implicit copulas
Implicit copulas arise from known joint distribution functions. Let F be a given joint
distribution function. If the marginal distribution functions Fi are continuous, we can use
Sklar's Theorem to construct a copula C via

    C(u1, ..., ud) = F(F1←(u1), ..., Fd←(ud)).

The canonical example is the Gaussian copula. Let X ∼ N(0, Σ) where Σ has unit diagonal, so that one can think of Σ as a correlation matrix. Note that the marginals Xi are standard normal, so that they have common distribution function Φ. We can then construct the Gaussian copula

    C_Σ^Ga(u1, ..., ud) = Φ_Σ(Φ^{−1}(u1), ..., Φ^{−1}(ud)),

where Φ_Σ denotes the joint distribution function of X.
What if we have a general mean and covariance matrix? Let Y = µ + BX be an affine
transformation of X where
    B = diag(σ1, σ2, ..., σd)
is a diagonal matrix with σi > 0 for all i. Then Y ∼ N (µ, BΣB T ), and any desired
covariance matrix can be written in the form BΣB T . Note that Yi = µi + σi Xi is a strictly
increasing and continuous transformation of Xi , so Proposition 8.8 implies that Y and X
have the same copula, namely C_Σ^Ga. Simulating from this copula is easy; the trick is to follow the construction in Sklar's Theorem. To simulate a sample from C_Σ^Ga, follow these steps:
1. Simulate X = (X1, ..., Xd) ∼ N(0, Σ), for example as X = AZ where Z has iid N(0, 1) entries and AA^T = Σ (e.g. A the Cholesky factor of Σ).
2. Set U = (Φ(X1), ..., Φ(Xd)); then U has distribution function C_Σ^Ga.
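As a small illustration, the following R sketch carries out these two steps for d = 2; the correlation 0.7 and the sample size are arbitrary choices for the illustration.

    set.seed(1)
    n <- 1000
    Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)   # illustrative correlation matrix
    A <- t(chol(Sigma))                        # lower triangular with A %*% t(A) = Sigma
    Z <- matrix(rnorm(2 * n), nrow = 2)        # iid N(0,1) variables
    X <- t(A %*% Z)                            # each row is a draw from N(0, Sigma)
    U <- pnorm(X)                              # transform the marginals; U has copula C_Sigma^Ga
    plot(U, xlab = "U1", ylab = "U2")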
Explicit copulas
Explicit copulas are given by a concrete formula. Some well-known examples are the fol-
lowing.
Example 9.4 (Gumbel copula). The Gumbel copula is given by

    C_θ^Gu(u1, u2) = exp( −[ (−log u1)^θ + (−log u2)^θ ]^{1/θ} ),   θ ≥ 1.   ◦

Recall that an Archimedean copula with generator ϕ has the form C(u1, u2) = ϕ^{−1}(ϕ(u1) + ϕ(u2)). We see that by choosing ϕ(t) = (−log t)^θ for θ ≥ 1, we obtain the Gumbel copula. Choosing ϕ(t) = (1/θ)(t^{−θ} − 1) gives the Clayton copula.
Example 9.7 (Generalised Clayton copula). Choosing ϕ(t) = θ^{−δ}(t^{−θ} − 1)^δ for θ > 0 and δ ≥ 1 gives the generalised Clayton copula. The special case δ = 1 corresponds to the Clayton copula from above. ◦
Example 9.8 (Frank copula). The Frank copula is the Archimedean copula with generator

    ϕ(t) = −log( (e^{−θt} − 1) / (e^{−θ} − 1) ),

where θ ∈ R \ {0} is a parameter. ◦
It should of course be verified that an Archimedean copula is a copula. We state the relevant
result without proof.
Theorem 9.9. Let ϕ : [0, 1] → [0, ∞] be a continuous, strictly decreasing function such that ϕ(1) = 0 and ϕ(0) = ∞. The function C : [0, 1]^2 → [0, 1] given by C(u1, u2) = ϕ^{−1}(ϕ(u1) + ϕ(u2)) is a copula if and only if ϕ is convex.
While this method of constructing Archimedean copulas is certainly practical, there are some limitations worth mentioning. An Archimedean copula is symmetric in its two arguments, which is clearly a constraint in modelling. Furthermore, the structure of an Archimedean copula is perhaps too simple, since a two-dimensional dependence structure is described by a one-dimensional generator.
Exercises
Exercise 4.1:
Write out the rectangle inequality (see Proposition 8.3) in the case where d = 2. Use it to verify that the Morgenstern copula given by

    C_θ(u1, u2) = u1 u2 (1 + θ(1 − u1)(1 − u2)),   θ ∈ [−1, 1],

is indeed a copula.
Exercise 4.2:
Let C and C̃ be copulas and θ ∈ [0, 1]. Verify that θC + (1 − θ)C̃ is also a copula.
Exercise 4.3:
Let (X, Y ) have joint distribution function
Exercise 4.4:
Let X and Y be random variables with continuous distribution functions FX and FY . Let
α, β be functions, C the copula of (X, Y ) and C̃ the copula of (α(X), β(Y )).
1) If α is strictly increasing and β strictly decreasing, prove that

    C̃(u1, u2) = u1 − C(u1, 1 − u2).

2) If α is strictly decreasing and β strictly increasing, prove that

    C̃(u1, u2) = u2 − C(1 − u1, u2).

3) If both α and β are strictly decreasing, prove that

    C̃(u1, u2) = u1 + u2 − 1 + C(1 − u1, 1 − u2).
Exercise 4.5:
Recall that the Frank copula has generator

    ϕ(t) = −log( (e^{−θt} − 1) / (e^{−θ} − 1) ).
Exercise 4.6:
Consider the function
ϕ(t) = log(1 − θ log t)
for θ ∈ (0, 1] a parameter.
1)Verify that ϕ is a valid generator for an Archimedean copula.
2)Write the corresponding copula in explicit form.
Exercise 4.7:
The d-dimensional t copula is given by

    C_{ν,Σ}^t(u1, ..., ud) = t_{ν,Σ}( t_ν^{−1}(u1), ..., t_ν^{−1}(ud) ),

with t_{ν,Σ} the distribution function of t(ν, 0, Σ) and t_ν the univariate t distribution function with ν degrees of freedom.
1)Describe a procedure to simulate from the t copula.
2) Simulate 1000 samples from the t copula with ν = 3 and

    Σ = ( 1    1/2
          1/2   1  ).
3)Simulate 1000 samples from the Gaussian copula with Σ above. Plot the two simulated
samples and compare. Try choosing different marginal distributions for these two copulas
and see what happens.
Week 5 - Copulas II
10 Archimedean copulas in higher dimensions
The natural candidate for a d-dimensional Archimedean copula is

    C(u1, ..., ud) = ϕ^{−1}( ϕ(u1) + · · · + ϕ(ud) ),      (4.2)

where ϕ : [0, 1] → [0, ∞] has the same properties as in the two-dimensional case, i.e. ϕ is
strictly decreasing, convex and satisfies ϕ(0) = ∞ and ϕ(1) = 0. Is C as constructed above a
copula? The answer is no in general. One issue is that C is not even a distribution function
in general for dimensions higher than two. In order to answer the question of when C is a
copula, we need the following definition.
Definition 10.1. A decreasing function f is completely monotonic on [a, b] if
    (−1)^k (d^k/dt^k) f(t) ≥ 0   for k = 1, 2, ... and t ∈ (a, b).
It turns out that the property of being completely monotonic determines whether C defined
by equation (4.2) is a copula.
Theorem 10.2. Let ϕ : [0, 1] → [0, ∞] be a continuous strictly decreasing function such
that ϕ(0) = ∞ and ϕ(1) = 0. C defined by (4.2) is a copula for all d ≥ 2 if and only if ϕ−1
is completely monotonic on [0, ∞).
Proof. See Theorem 4.6.2 in [19] and the references in the paragraph above the theorem.
In principle one should be able to check whether ϕ−1 is completely monotonic, but it is
a tedious procedure. As an alternative to verifying complete monotonicity, one can apply
Laplace transforms of distribution functions. We recall the definition.
Definition 10.3. Let G be a distribution function on [0, ∞) with G(0) = 0. The Laplace
transform of G is

    ψ(t) = ∫_0^∞ e^{−tx} dG(x).
Remark 10.4. Let G be a distribution function on [0, ∞) with G(0) = 0 and Y a random
variable distributed according to G. We then have the following relationship between the
Laplace transform ψ of G and the moment-generating function κY of Y:

    ψ(t) = κY(−t).
The proposition tells us that if we want to simulate from the copula C(u) = ϕ^{−1}(ϕ(u1) + · · · + ϕ(ud)), we should apply the following steps:
1. Find a distribution function G on [0, ∞) with G(0) = 0 whose Laplace transform ψ equals ϕ^{−1}.
2. Simulate V ∼ G.
3. Generate iid U1, ..., Ud ∼ Unif(0, 1) and apply the inverse transform method, i.e. Wi = F←_{Wi|V=v}(Ui) with v equal to the simulated value of V from step 2.
We can actually be more specific in step 3. F_{Wi|V=v} has a proper inverse which we solve for as follows:

    e^{−vϕ(F←_{Wi|V=v}(u))} = u  ⟺  −log u = vϕ(F←_{Wi|V=v}(u))  ⟺  F←_{Wi|V=v}(u) = ϕ^{−1}( −(log u)/v ).
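As an illustration, the following R sketch uses these steps to simulate from the bivariate Clayton copula. Here the generator is taken in the equivalent form ϕ(t) = t^{−θ} − 1 (a rescaling of the generator above, which gives the same copula), whose inverse ϕ^{−1}(s) = (1 + s)^{−1/θ} is the Laplace transform of the Γ(1/θ, 1) distribution; θ = 2 and the sample size are arbitrary choices.

    set.seed(1)
    n <- 1000; theta <- 2
    V <- rgamma(n, shape = 1/theta, rate = 1)   # step 2: V with Laplace transform (1 + t)^(-1/theta)
    U <- matrix(runif(2 * n), n, 2)             # step 3: iid Unif(0,1) variables
    W <- (1 - log(U) / V)^(-1/theta)            # W_i = phi^{-1}(-log(U_i)/V)
    plot(W, xlab = "W1", ylab = "W2")
    cor(W[, 1], W[, 2], method = "kendall")     # should be close to theta/(theta + 2) = 0.5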
11 Fitting copulas to data
Suppose we want to fit a parametric copula to observations that have been transformed to the copula space, i.e. to iid data u1, ..., un in [0, 1]^d, for example by maximising a likelihood over the copula parameters. This of course requires that we choose a specific copula.
To determine a proper copula for a data set, we should think of properties like correlation,
tail dependence and symmetry. The tail behaviour is especially relevant in a risk manage-
ment context since we want to model large losses. Large losses also tend to ”move together”.
When one loss is large, the other losses tend to be large as well. This can for example happen
in a portfolio with stocks in similar companies. When we make the transformation from the
original space of our data to the copula space, the size of the data is not preserved, only
the ordering. Hence we need notions of correlation that only depends on ordering. This
leads to different notions of rank correlation. We will study two types, namely Kendall’s τ
and Spearman’s ρ. Afterwards, we will consider tail dependence via the coefficient of upper
(respectively lower) tail dependence.
Kendall’s τ
Definition 11.1. Kendall's τ for (X1, X2) is defined as

    ρτ(X1, X2) = P( (X1 − Y1)(X2 − Y2) > 0 ) − P( (X1 − Y1)(X2 − Y2) < 0 ),

where (Y1, Y2) is an independent copy of (X1, X2).
Kendall's τ gives an indication of whether X1 and X2 get large together or if one tends to get larger when the other gets smaller. If X1 and X2 tend to move together, the event {(X1 − Y1)(X2 − Y2) > 0} will have high probability, resulting in a value of ρτ(X1, X2) close to 1, and if X1 and X2 move in opposite directions, {(X1 − Y1)(X2 − Y2) < 0} will have high probability so that ρτ(X1, X2) is close to −1. To make this precise, we can make the following definition.
Definition 11.2. Consider two points (x1, x2), (y1, y2) ∈ R^2. We say that (x1, x2) and (y1, y2) are concordant if (x1 − y1)(x2 − y2) > 0 and discordant if (x1 − y1)(x2 − y2) < 0.
Intuitively, ρτ (X1 , X2 ) = 1 should correspond to comonotonicity while ρτ (X1 , X2 ) = −1
should correspond to countermonotonicity. This is indeed the case. We want to be able to
estimate Kendall’s τ from data {(Xt,1 , Xt,2 ) : t = 1, ..., n}. To do so, we need to compare
all pairs (Xt,1 , Xt,2 ) and (Xs,1 , Xs,2 ) with one another. In the case of concordance, the pair
should yieldthe value 1 (indicating a positive sign) and -1 in the case of discordance. Since
there are n2 pairs in total, we obtain the estimator
−1 X
n
ρ̂τ = sign((Xt,1 − Xs,1 )(Xt,2 − Xs,2 ))
2
1≤t≤s≤n
where
1,
x>0
sign(x) = 0, x=0.
−1, x<0
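A direct (if slow, O(n²)) R implementation of this estimator could look as follows; R's built-in cor(x, y, method = "kendall") computes essentially the same quantity, and the data below are simulated purely for illustration.

    kendall_tau_hat <- function(x, y) {
      n <- length(x)
      s <- 0
      for (t in 1:(n - 1))
        for (u in (t + 1):n)
          s <- s + sign((x[t] - x[u]) * (y[t] - y[u]))   # +1 concordant, -1 discordant
      s / choose(n, 2)
    }
    x <- rnorm(200); y <- x + rnorm(200)                      # illustrative data
    c(kendall_tau_hat(x, y), cor(x, y, method = "kendall"))   # the two estimates agree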
We have the following results on Kendall’s τ for copulas. For their proofs, we refer to chapter
5 of [19].
Theorem 11.3. The following hold:
(i) Let (X1 , X2 ) be continuous variables with copula C. Then
    ρτ(X1, X2) = 4 ∫_{[0,1]^2} C(u1, u2) dC(u1, u2) − 1 = 4 E[C(U1, U2)] − 1.
We leave it as an exercise for the reader to verify that if (X1 , X2 ) have the Clayton copula,
ρτ (X1 , X2 ) = θ/(θ + 2) while for the Gumbel copula, ρτ (X1 , X2 ) = 1 − 1/θ.
Spearman’s ρ
Another measure of rank correlation is Spearman’s ρ defined as follows:
Definition 11.4. Consider a pair of variables (X1, X2) and introduce independent copies (Y1, Y2) and (Z1, Z2) of (X1, X2). Spearman's ρ of (X1, X2) is given by

    ρS(X1, X2) = 3( P((X1 − Y1)(X2 − Z2) > 0) − P((X1 − Y1)(X2 − Z2) < 0) ).

Remark 11.6. It is not hard to verify (see the exercises) that

    ρS(X1, X2) = ρ( F1(X1), F2(X2) ),

which is the ordinary (linear) correlation for U1 and U2 with joint distribution function C.
The following result is useful for relating the rank correlations for the Gaussian copula.
Theorem 11.7. Let X = (X1 , X2 ) have a bivariate Gaussian distribution with copula CΣGa
where

    Σ = ( 1   ρ
          ρ   1 ).

Then the ordinary linear correlation ρ of X1 and X2 is related to ρτ and ρS via

    ρτ(X1, X2) = (2/π) arcsin ρ,      ρS(X1, X2) = (6/π) arcsin(ρ/2).
Proof. See Theorem 7.42 in [17].
Tail dependence
Definition 11.8. The coefficient of upper tail dependence of X1 and X2 is given by

    λU(X1, X2) = lim_{u↑1} ( 1 − 2u + C(u, u) ) / (1 − u).

If the marginal distribution functions F1 and F2 are continuous, we can equivalently write

    λU(X1, X2) = lim_{u↑1} P( F2(X2) > u | F1(X1) > u ) = lim_{u↑1} P( U2 > u | U1 > u ).

To see this, it may help to draw a figure. We can write

    P(U1 > u, U2 > u) = 1 − P(U1 ≤ u) − P(U2 ≤ u) + P(U1 ≤ u, U2 ≤ u) = 1 − 2u + C(u, u),

so that

    P(U2 > u | U1 > u) = P(U1 > u, U2 > u) / P(U1 > u) = ( 1 − 2u + C(u, u) ) / (1 − u),

from which the claim follows.
In the exercises below and in the mandatory assignments, we will see examples of computing the coefficient of upper tail dependence. We now introduce the corresponding notion for the lower tail: the coefficient of lower tail dependence of X1 and X2 is

    λL(X1, X2) = lim_{u↓0} P( X2 ≤ F2←(u) | X1 ≤ F1←(u) ).

We have an analogous result to the one above for continuous marginal distribution functions, namely

    λL(X1, X2) = lim_{u↓0} C(u, u) / u.

Proof. The proof is essentially a simpler version of the one given for the coefficient of upper tail dependence. We compute

    λL(X1, X2) = lim_{u↓0} P(X2 ≤ F2←(u) | X1 ≤ F1←(u)) = lim_{u↓0} P(F2(X2) ≤ u | F1(X1) ≤ u)
               = lim_{u↓0} P(U2 ≤ u | U1 ≤ u) = lim_{u↓0} P(U2 ≤ u, U1 ≤ u) / P(U1 ≤ u) = lim_{u↓0} C(u, u) / u,
as desired.
Consider a bivariate standard normal vector (X1, X2) with correlation ρ, i.e. a pair with the Gaussian copula. Assume ρ ∈ (−1, 1) (the cases ρ = ±1 are left as exercises). We want to show that the coefficient of upper tail dependence λU(X1, X2) is zero. Consider a t > 1 such that tρ < 1 and make the bound

    P(X2 > F2←(u) | X1 > F1←(u)) ≤ P(X2 > F2←(u), X1 ≤ tF1←(u) | X1 > F1←(u)) + P(X1 > tF1←(u) | X1 > F1←(u)).
Consider the second term first and let v := F1←(u). We have by L'Hôpital's rule

    lim_{u↑1} P(X1 > tF1←(u) | X1 > F1←(u)) = lim_{v→∞} P(X1 > tv) / P(X1 > v)
        = lim_{v→∞} ( ∫_{tv}^{∞} e^{−x²/2} dx ) / ( ∫_{v}^{∞} e^{−x²/2} dx ) = lim_{v→∞} t e^{−(tv)²/2} / e^{−v²/2} = 0.

For the first term, note that conditional on X1 = x the variable X2 is N(ρx, 1 − ρ²). Since tρ < 1, the argument of

    P(X2 > v | X1 = x) = Φ̄( (v − ρx) / √(1 − ρ²) )

tends to infinity, and hence the probability tends to zero, uniformly for x ∈ [v, tv]. It follows that the limit of the first term goes to zero as well, which establishes λU(X1, X2) = 0. We refer to this property as asymptotic independence in the
tails. No matter how large the correlation ρ is in the range (−1, 1), if we go far enough into
the tails, extreme events occur independently in X1 and X2 . This illustrates a potential
problem with the Gaussian copula for modelling financial data. In such data, large losses in
different variables are often correlated, and the normal copula may fail to capture this. ◦
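The following R sketch illustrates this point numerically by comparing the empirical co-exceedance probability P(U2 > u | U1 > u) for a Gaussian and a t copula with the same correlation; the values of ρ, ν, u and the sample size are arbitrary choices for the illustration.

    set.seed(1)
    n <- 1e5; rho <- 0.7; nu <- 3
    Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
    Z <- matrix(rnorm(2 * n), n, 2) %*% chol(Sigma)   # N(0, Sigma) samples
    U_ga <- pnorm(Z)                                  # Gaussian copula sample
    W <- nu / rchisq(n, df = nu)                      # mixing variable for the t distribution
    X <- sqrt(W) * Z                                  # t(nu, 0, Sigma) via the normal mixture construction
    U_t <- pt(X, df = nu)                             # t copula sample
    u <- 0.99
    mean(U_ga[, 1] > u & U_ga[, 2] > u) / mean(U_ga[, 1] > u)   # close to 0: asymptotic independence
    mean(U_t[, 1] > u & U_t[, 2] > u) / mean(U_t[, 1] > u)      # clearly positive: tail dependence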
Exercises
Exercise 5.1:
Exercise 5.2:
Recall the Clayton copula

    C_θ^Cl(u1, u2) = ( u1^{−θ} + u2^{−θ} − 1 )^{−1/θ},   0 < θ < ∞.
Exercise 5.3:
Prove the claim in Remark 11.6.
Exercise 5.4:
Let (X1 , X2 ) be comonotone. Show that λU (X1 , X2 ) = 1.
Exercise 5.5:
Consider the Archimedean copula with generator
    ϕ(t) = log( (1 − θ(1 − t)) / t )
We now leave copulas and turn to credit risk. The basic setup is a loan from a lender to an obligor:

    Lender −→ Obligor

We make the following modelling assumptions:
We have a one-period model. Namely, we will consider the loss that occurs over a single timestep such as a year, a month etc.
We have n loans in total.
We have a probability of default pi for loan i. The pi will depend on external factors, and these should be incorporated into the model.
Definition 12.1. The total one-period loss (for the bank, say) is given by

    L = Σ_{i=1}^{n} Xi Li (1 − λi),

where Li is the size of the ith loan, λi is the recovery rate for the ith loan and

    Xi = 1 if loan i defaults,   Xi = 0 if it does not.
The recovery rate λi is a number in [0, 1] that describes how much the bank can
recover in case of default. If, for example, a third of the loan is already repaid by the time
of default, the bank will only lose two thirds of the loan. Before considering concrete models
for credit risk, we make two important remarks.
(i) Defaults will be dependent since defaults are often triggered by external factors, for
example increases in interest rates.
(ii) Large losses are usually not caused by one default but by a large number of defaults,
even though the individual losses are often not large.
The Merton model is an example of a structural model. The three others are examples of
reduced form models.
To estimate P (Default), we need to estimate DD. One problem is that VA is not observable.
It is hard to put a number on the value of every asset of a company (buildings, furniture,
machines, patents etc.). However, the equity VE is observable. We define the equity at time
T to be
VE (T ) = (VA (T ) − K)+
since if K > VA (T ) the company defaults and the equity is zero. We recognise the above as
a call option with strike K. If we also assume a constant interest rate r, we can apply the
Black-Scholes formula,
    VE(t) = VA(t) Φ(z) − K e^{−r(T−t)} Φ(y),

where

    z = ( log VA(t) − log K + (r + σA²/2)(T − t) ) / ( σA √(T − t) ),      y = z − σA √(T − t).
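In R, the pricing relation reads as follows (the numbers in the last line are made up purely for illustration):

    merton_equity <- function(VA, K, r, sigmaA, tau) {
      # Black-Scholes value of the equity: V_E(t) = V_A(t) Phi(z) - K exp(-r tau) Phi(y), tau = T - t
      z <- (log(VA) - log(K) + (r + sigmaA^2 / 2) * tau) / (sigmaA * sqrt(tau))
      y <- z - sigmaA * sqrt(tau)
      VA * pnorm(z) - K * exp(-r * tau) * pnorm(y)
    }
    merton_equity(VA = 100, K = 80, r = 0.02, sigmaA = 0.25, tau = 1)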
This formula relates VA to VE. A specific form of the Merton model used in the industry is the KMV model. In this model, one estimates the volatility σE of the observable equity and uses a relation of the form σE = g(VA(t), σA, r), together with the pricing formula above, to back out the unobservable VA and σA. Given µA and σA, the firm estimates the DD and hence the probability of default. All these
estimates may be very unreliable and so in the actual KMV procedure, further steps are
taken, namely:
Calculate DDi , i = 1, ..., n, where DDi is the distance to default for firm i.
Compare with past empirical observations with the same DDi . Use the empirical
frequency of default from these past loans and compare with the predictions from the
estimates.
The description of the KMV model above is vague on purpose. Since it is an industry model,
the exact procedures are not public and may change from firm to firm.
The Merton model we have considered so far only involves one firm, but the model is easily
extended to an arbitrary number of firms.
Assume that the asset value of firm i = 1, ..., n evolves as a geometric Brownian motion driven by m independent Brownian motions W1, ..., Wm,

    dVA,i(t) = µA,i VA,i(t) dt + VA,i(t) Σ_{j=1}^{m} σA,i,j dWj(t).

Note that all the Brownian motions appear in all the asset processes. This makes the VA,i dependent. One can think of the Brownian motions as underlying risk factors (for example fluctuations in interest rates). Letting σA,i² = Σ_{j=1}^{m} σA,i,j², we can explicitly solve for each VA,i and obtain

    VA,i(T) = VA,i(t) exp( (µA,i − σA,i²/2)(T − t) + Σ_{j=1}^{m} σA,i,j (Wj(T) − Wj(t)) ).
Defining

    Zi := Σ_{j=1}^{m} σA,i,j (Wj(T) − Wj(t)),

we have Zi ∼ N(0, σA,i²(T − t)). Like in the one-dimensional case, we can solve for Zi and get

    Zi = log VA,i(T) − log VA,i(t) + ( σA,i²/2 − µA,i )(T − t).

The ith firm defaults if VA,i(T) < K where K is some threshold which we think of as the debt of the company. We can rewrite VA,i(T) < K as

    Yi := Zi / ( σA,i √(T − t) ) < ( log K − log VA,i(t) + (σA,i²/2 − µA,i)(T − t) ) / ( σA,i √(T − t) ) =: −DDi.

DDi is the distance to default for the ith firm, and hence default for company i means Yi < −DDi. Since Yi ∼ N(0, 1), we have

    P(Firm i defaults) = P(Yi < −DDi) = Φ(−DDi).
Note that the Yi are dependent. We want to describe this dependence. Note that we can
write
    Yi = (1 / (σA,i √(T − t))) Σ_{j=1}^{m} σA,i,j (Wj(T) − Wj(t)) = Σ_{j=1}^{m} (σA,i,j / σA,i) · (Wj(T) − Wj(t)) / √(T − t),

so the Yi are weighted sums of the same iid N(0, 1)-variables. More generally, we can observe that the Yi have the form

    Yi = Σ_{j=1}^{m} c_{i,j} Rj
for constants ci,j specific to the ith loan and R1 , ..., Rm iid N (0, 1). This leads us to consider
models for the Yi called factor models. In a factor model, we assume that the Yi have the
form
    Yi = Σ_{j=1}^{k} a_{ij} Uj + bi Wi.
The Uj are common stochastic factors that affect all loans, and we assume that U1 , ..., Uk
are iid N (0, 1). The variable Wi is a firm specific stochastic factor, and the aij and bi are
constants. In order to estimate the default probability, we will need estimates of aij , bi and
DDi . This is not an easy task. These quantities are not observable from data, and hence we
cannot apply ordinary statistical methods to estimate them. Another approach is usually to
divide the loans into different ”classes” where for a specific class, aij and bi do not depend
on i but only on the class. This means that in a specific class, we have
    Yi = Σ_{j=1}^{k} aj Uj + b Wi.
A standard approach at this point is to normalize the constants so that Σ_{j=1}^{k} aj² + b² = 1. Then

    Σ_{j=1}^{k} aj Uj ∼ N(0, ||a||²),   a = (a1, ..., ak),

from which it follows that Yi has the same distribution as ||a|| Z + b Wi for Z ∼ N(0, 1). Since ||a||² + b² = 1, we can write ||a|| = √ρ and b = √(1 − ρ) for some ρ ∈ (0, 1). Yi can thus be written (in distribution) as

    Yi = √ρ Z + √(1 − ρ) Wi.
Conditional on the common factor Z, the default probability of loan i becomes

    pi(Z) := P(Yi < di | Z) = Φ( (di − √ρ Z) / √(1 − ρ) ) = Φ( ai − b Z ),

where di is the default threshold of loan i and

    ai = di / √(1 − ρ),      b = √ρ / √(1 − ρ).

Our goal is to compute VaRα for Nn, the number of defaults. To do so, we first compute the VaR for pi(Z). This amounts to solving the equation 1 − α = P(pi(Z) ≥ xi) for xi, which gives

    xi = Φ( (di + √ρ Φ^{−1}(α)) / √(1 − ρ) ) = VaRα(pi(Z)).
Let us make the simplifying assumption that di = d for all i, i.e. that the distances to default
are identical for all loans. This also implies that pi (Z) is the same for all i since the Yi are
iid conditional on Z. This allows us to write p(Z) without the subscript i. Furthermore, we
will write Nn (Z) to stress that Nn depends on Z. We claim that, conditional on Z,
    Nn(Z)/n −→^P p(Z),

where −→^P denotes convergence in probability. See the appendix for a review if necessary. To be precise, we claim that for every ε > 0,

    P( |Nn(Z)/n − p(Z)| > ε | Z ) → 0

uniformly in Z. Let ε > 0 be given. We start by noting that

    p(Z) = P(Xi = 1 | Z) = E[Xi | Z] = E[ Nn(Z)/n | Z ].
This comes from the fact that now,
    Xi = 1 if Yi < d,   Xi = 0 if Yi ≥ d,

and thus

    E[Nn(Z) | Z] = Σ_{j=1}^{n} E[Xj | Z] = n E[Xi | Z]
for any i. The rest of the proof is a straightforward application of Chebyshev’s inequality.
We get
    P( |Nn(Z)/n − p(Z)| > ε | Z ) = P( |Nn(Z) − E[Nn(Z) | Z]| > nε | Z )
                                  ≤ (1/(n²ε²)) Var(Nn(Z) | Z) = p(Z)(1 − p(Z)) / (nε²).
The final inequality is a consequence of the Xi being iid given Z (since this is true for
the Yi ) and from observing that conditional on Z, Nn (Z) has a binomial distribution with
parameters n and p(Z). The above converges to zero uniformly in Z which proves the
claim. We are now close to the goal of providing a formula for estimating the VaR. Assume
f(x) := P(p(Z) > x) is continuous. Then we have for α ∈ (0, 1) that

    1 − α = P( p(Z) > VaRα(p(Z)) ),

and using the result Nn(Z)/n −→^P p(Z), we have the approximation

    1 − α ≈ P( Nn(Z)/n > VaRα(p(Z)) ).
Exercises
Exercise 6.1:
Upgrade the statement Nn(Z)/n −→^P p(Z) to Nn(Z)/n → p(Z) almost surely (conditional on Z). Hint: Use the Markov inequality and A.2.27. The fourth central moment of the binomial distribution with parameters n and p is given by

    np(1 − p)( 1 + 3(n − 2)p(1 − p) ).
We will consider a simplified version of this model, namely a so-called one-factor model. In
such a model, we assume that Z is one-dimensional (so we write Z instead of Z) and that
pi (Z) = p(Z) is the same for all i. Note that this model is a generalisation of the probit
normal mixture model. Indeed, the probit normal mixture model has this exact setup but
with a specific Xi , namely
    Xi = 1 if √ρ Z + √(1 − ρ) Wi < d,   Xi = 0 if √ρ Z + √(1 − ρ) Wi ≥ d,
for Z, W1 , ..., Wn iid and standard normal as we saw last week. Returning to the one-factor
Bernoulli mixture model, we have
    P(Nn = k | Z) = (n choose k) p(Z)^k (1 − p(Z))^{n−k}.
What are choices of G and p(Z) that make the above expression mathematically tractable?
Consider the special case Z ∼ Beta(a, b) and p(Z) = Z. This model is called the Beta
mixture model. The density of Z is
    g(z) = (1/β(a, b)) z^{a−1} (1 − z)^{b−1},   0 ≤ z ≤ 1,   where   β(a, b) = ∫_0^1 z^{a−1}(1 − z)^{b−1} dz = Γ(a)Γ(b)/Γ(a + b).

With these assumptions, we can compute P(Nn = k) explicitly as follows:

    P(Nn = k) = (n choose k) ∫_0^1 z^k (1 − z)^{n−k} g(z) dz = (n choose k) (1/β(a, b)) ∫_0^1 z^{k+a−1} (1 − z)^{n−k+b−1} dz
              = (n choose k) β(a + k, b + n − k) / β(a, b).
Since we now have an explicit expression for the distribution of Nn, we can compute all sorts of
risk measures such as VaR, ES etc. explicitly as well. While this is a nice property of the
model, we stress that the motivation for choosing Z ∼ Beta(a, b) and p(Z) = Z is purely
mathematical. Nevertheless, we continue our study of the model. How do we estimate a
and b? While a and b are not directly observed from data, we can relate them to quantities that are observed. We know the number of defaults that have occurred at a given time, which
allows us to estimate the probability of default p, at least in a portfolio of homogeneous
loans (think for example of a portfolio consisting only of AAA rated loans, only B rated
loans etc.). We can also estimate the linear correlation ρL of the Xi . We now determine
the relations between these quantities and a and b in the Beta mixture model. We have
    p = P(Default) = E[P(Default | Z)] = E[p(Z)] = E[Z] = a/(a + b)

and

    ρL = Cov(Xi, Xj)/Var(Xi) = E[(Xi − p)(Xj − p)] / (p(1 − p)) = (E[Xi Xj] − p²) / (p(1 − p))
       = (E[Z²] − E[Z]²) / (p(1 − p)) = Var(Z) / (p(1 − p)) = 1/(a + b + 1),

where we used E[Xi Xj] = E[E[Xi Xj | Z]] = E[p(Z)²] = E[Z²]. We hence have two equations in two unknowns. Solving these yields the relations

    a = p (1 − ρL)/ρL,      b = (1 − p)(1 − ρL)/ρL.

To summarise: we can estimate a and b from data by first estimating the default probability and the linear correlation from the data and then applying the formulas above.
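A short R sketch of this calibration and of the resulting exact distribution of Nn; the estimates p and ρL and the portfolio size below are made-up numbers.

    p <- 0.02; rho_L <- 0.01; n <- 100                        # assumed estimates and portfolio size
    a <- p * (1 - rho_L) / rho_L
    b <- (1 - p) * (1 - rho_L) / rho_L
    k <- 0:n
    pk <- choose(n, k) * beta(a + k, b + n - k) / beta(a, b)  # P(Nn = k)
    sum(pk)                                                   # sanity check: the probabilities sum to 1
    min(k[cumsum(pk) >= 0.99])                                # e.g. a 99% quantile of the number of defaults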
In the Poisson mixture model, each Xi is, conditional on the factor vector Z = (Z1, ..., Zm), Poisson distributed with intensity λi(Z). We interpret Xi = 1 as default and Xi = 0 as no default. The other events, Xi = k for k ≥ 2, do not have a natural interpretation, but we choose λi small enough so that Xi > 1 happens with very low probability. We again consider Nn = Σ_{i=1}^{n} Xi, i.e. the number of defaults (at least for small λi). We will consider a special case of the Poisson mixture model which goes under the name CreditRisk+. In CreditRisk+, we assume the following:
Z1, ..., Zm are independent.
Zj ∼ Γ(αj, βj^{−1}) with αj βj = 1.
λi(Z) = λi Σ_{j=1}^{m} a_{ij} Zj with a_{ij} ≥ 0 and Σ_{j=1}^{m} a_{ij} = 1. Here λi > 0 is a constant for each i.
The assumption αj βj = 1 implies that E[Zj ] = 1. With the above assumptions, we have
E[λi (Z)] = λi which gives us control over the λi functions. We need to choose them small.
With the above assumptions, it is possible to derive the distribution of Nn which is also a
motivation for the model. In order to do so, we need to introduce some theory.
Definition 13.1. For a discrete random variable Y with values in {0, 1, ...}, the function

    gY(t) = E[t^Y]

(defined at least for |t| ≤ 1) is called the probability-generating function of Y.
For example, if N is Poisson distributed with parameter λ > 0, then

    gN(t) = E[t^N] = Σ_{k=0}^{∞} t^k e^{−λ} λ^k/k! = e^{λ(t−1)}.
◦
Example 13.4. Let N have a negative binomial distribution with parameters r and p, i.e.

    P(N = k) = ( (k + r − 1) choose k ) (1 − p)^k p^r,   k = 0, 1, 2, ...

We leave it as an exercise for the reader to verify that the probability-generating function of N is given by

    gN(t) = ( p / (1 − (1 − p)t) )^r   for |t| < 1/(1 − p).
◦
Theorem 13.5. With the assumptions of the CreditRisk+ model, we have

    gNn(t) = Π_{j=1}^{m} ( (1 − δj) / (1 − δj t) )^{αj},   where   δj = βj Σ_{i=1}^{n} λi a_{ij} / ( 1 + βj Σ_{i=1}^{n} λi a_{ij} ).
Proof. The proof is essentially a computational exercise in the assumptions of the CreditRisk+
model. We start by conditioning on Z. The conditional probability-generating function for
Nn given Z is
    g_{Nn|Z}(t) = E[t^{Nn} | Z] = E[t^{X1+···+Xn} | Z] = Π_{i=1}^{n} E[t^{Xi} | Z] = Π_{i=1}^{n} κ_{Xi|Z}(log t) = Π_{i=1}^{n} e^{λi(Z)(t−1)},
where we used the conditional independence and the above example for the Poisson distri-
bution. By applying the tower property, we can remove the conditioning on Z by taking an
expectation. Let fj denote the density of Zj. Then we have by independence of the Zj that

    gNn(t) = E[ E[t^{Nn} | Z] ] = E[ Π_{i=1}^{n} e^{λi(Z)(t−1)} ]
           = ∫_0^∞ · · · ∫_0^∞ Π_{i=1}^{n} e^{λi(z)(t−1)} f1(z1) · · · fm(zm) dz1 · · · dzm
           = ∫_0^∞ · · · ∫_0^∞ e^{(t−1) Σ_{i=1}^{n} λi Σ_{j=1}^{m} a_{ij} zj} f1(z1) · · · fm(zm) dz1 · · · dzm
           = ∫_0^∞ · · · ∫_0^∞ e^{(t−1) Σ_{j=1}^{m} ( Σ_{i=1}^{n} λi a_{ij} ) zj} f1(z1) · · · fm(zm) dz1 · · · dzm.
For the sake of simplicity, let µj = Σ_{i=1}^{n} λi a_{ij}. Then we can continue the computation as follows:

    gNn(t) = ∫_0^∞ · · · ∫_0^∞ e^{(t−1) Σ_{j=1}^{m} µj zj} f1(z1) · · · fm(zm) dz1 · · · dzm
           = ∫_0^∞ e^{(t−1)µ1 z1} f1(z1) dz1 · · · ∫_0^∞ e^{(t−1)µm zm} fm(zm) dzm
           = Π_{j=1}^{m} ∫_0^∞ e^{(t−1)µj z} fj(z) dz = Π_{j=1}^{m} ∫_0^∞ (1/(βj^{αj} Γ(αj))) z^{αj−1} e^{−z( βj^{−1} − (t−1)µj )} dz.

Each factor is a Gamma integral and evaluates to

    ( βj^{−1} / (βj^{−1} − (t − 1)µj) )^{αj} = ( 1 / (1 − (t − 1)βj µj) )^{αj} = ( (1 − δj) / (1 − δj t) )^{αj},   where   δj = βj µj / (1 + βj µj).
Plugging this expression back into the one for gNn (t) completes the proof.
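The distribution in Theorem 13.5 can also be checked by brute force simulation. The following R sketch simulates Nn in a small CreditRisk+ model (all parameter values are arbitrary) and compares the simulated mean with E[Nn] = Σi λi, which follows from E[λi(Z)] = λi.

    set.seed(1)
    m <- 2; n_loans <- 50
    alpha <- c(2, 5); beta <- 1 / alpha                # alpha_j * beta_j = 1, so E[Z_j] = 1
    a <- matrix(runif(n_loans * m), n_loans, m)
    a <- a / rowSums(a)                                # weights a_ij with rows summing to one
    lambda <- rep(0.01, n_loans)
    sim_Nn <- function() {
      Z <- rgamma(m, shape = alpha, scale = beta)      # independent gamma factors
      sum(rpois(n_loans, lambda * as.vector(a %*% Z))) # X_i | Z ~ Poisson(lambda_i(Z)); Nn is their sum
    }
    Nn <- replicate(10000, sim_Nn())
    c(mean(Nn), sum(lambda))                           # the two should be close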
14 Operational risk
Operational risk can be described as the risk of "loss from failed internal processes, people or systems or from external events". To elaborate a bit on this, we can roughly divide such risks into categories. One category is repetitive human errors or repetitive operational risks (repetitive
OR). These include IT failures, errors in settlements of transactions, litigation and the like.
Other types of losses include fraud and external events such as flooding, fires, earthquakes
and terrorism (although the latter is extremely hard to model). A difficulty in operational
risk is that we often have little data available, and the data is often heavy tailed. The claim
arrivals can also be hard to model since they often occur randomly in time (and often in
clusters) and since the frequency changes over time. One can for example imagine that a
large traded volume leads to a large number of back office errors.
Under the basic indicator (BI) approach, the capital requirement to cover OR (operational
risk) losses at time n is given by
    RCn^{BI}(OR) = α (1/Zn) Σ_{i=1}^{3} max(GI_{n−i}, 0),

where α > 0 is a fixed constant, Zn denotes the number of the three preceding years with positive gross income and GI_s denotes the "gross income" at time s. Under the advanced measurement (AM)
approach, we divide into K lines of business (typically K = 8 and lines include for example
corporate finance, trading and sales). The capital requirement to cover OR losses is given
by

    RCn^{AM}(OR) = Σ_{b=1}^{K} ρα(L_{n,b}),

where L_{n,b} is the one-period loss of business line b and ρα is a risk measure at level α (for example VaRα or ESα).
Mathematical estimates
In this subsection, we will investigate methods to analyze the loss via a stochastic process
approach. Let us start by recalling the definition of a Poisson process.
Definition 14.1. A stochastic process {Nt } is called a Poisson process with intensity λ > 0
if Nt takes values in {0, 1, 2, ...} and
(i) P (Nh ≥ 1) = λh + o(h),
(ii) P (Nh ≥ 2) = o(h) and
(iii) {Nt } has stationary and independent increments.
Recall that stationarity of increments means that for every s ≤ t, Nt − Ns has the same distribution as N_{t−s}. By independent increments, we mean that for every finite partition 0 < t1 < t2 < · · · < tk, the variables {N_{t_{i+1}} − N_{t_i}}_{i=1}^{k−1} are independent. Intuitively, we think of a Poisson process as a claim number process: Nt counts the number of claims that have arrived by time t, and in a short interval of length h a new claim arrives with probability approximately λh.
We will now model the loss of the company at time t by the process
    Lt = Σ_{i=1}^{Nt} Xi,
with {Nt } a Poisson process with intensity λ independent of the iid sequence {Xi }. We will
discuss the following ideas based on risk theory: the Laplace transform method, Panjer recursion and asymptotic approximations of Arwedson type.
Let us first discuss the Laplace transform method. Recall that the Laplace transform of a
random variable Y is given by ψY (s) = E[e−sY ]. We can compute the Laplace transform of
the loss using the tower property as follows:
    ψ_{Lt}(s) = E[e^{−sLt}] = E[ E[ e^{−s Σ_{i=1}^{Nt} Xi} | Nt ] ] = E[ ψX(s)^{Nt} ],

where we have defined ψX(s) = E[e^{−sX1}]. We continue the computation and get

    ψ_{Lt}(s) = Σ_{n=0}^{∞} ψX(s)^n ((λt)^n/n!) e^{−λt} = Σ_{n=0}^{∞} ((ψX(s)λt)^n/n!) e^{−λt} = e^{λt(ψX(s)−1)}.
We should note that all the Laplace transforms exist since we are working with non-negative
random variables. After obtaining the Laplace-transform, numerical inversion techniques
can be applied. The method is more flexible than this however. To illustrate this, consider
a Poisson intensity that changes over time. To make this concrete, assume we are consid-
ering the time interval [0, 2], and we have Poisson processes N1 and N2 on [0, 1] and (1, 2],
(1) (2)
respectively, along with two claim sequences {Xi } and {Xi } belonging to each interval.
The total loss is obtained by summing losses from each interval i.e. L = L1 + L2 , where L1
and L2 are assumed to be independent. Then
    ψL(s) = E[e^{−s(L1+L2)}] = E[e^{−sL1}] E[e^{−sL2}] = ψ_{L1}(s) ψ_{L2}(s),
and we can invert this function to obtain the distribution. For those familiar with mixture
distributions, the above Laplace transform is one of a compound Poisson sum
    Σ_{i=1}^{Ñt} Yi,   t ∈ [0, 1],

with {Ñt} a Poisson process with intensity λ1 + λ2 and Yi mixture distributed with distribution function

    G(y) = (λ1/(λ1 + λ2)) G1(y) + (λ2/(λ1 + λ2)) G2(y),

where X1^{(1)} ∼ G1 and X1^{(2)} ∼ G2. In practice, one often observes E[Nt] < Var(Nt), a
phenomenon called overdispersion. To remedy this, one can use a mixed Poisson process
{Nt } defined by Nt = ÑΛt where {Ñt } is a Poisson process with intensity 1 and Λ > 0 is a
random variable.
Example 14.2. Choosing Λ ∼ Γ(α, β) leads to the so-called negative binomial process. We
let the reader verify that Nt = ÑΛt is indeed negative binomial distributed. ◦
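A quick R sketch of the overdispersion phenomenon for the mixed Poisson variable Nt = Ñ_{Λt} with Λ gamma distributed; the parameter values are arbitrary and the gamma rate parametrisation is used explicitly in the code.

    set.seed(1)
    alpha <- 2; beta <- 2; t <- 10
    Lambda <- rgamma(1e5, shape = alpha, rate = beta)
    N <- rpois(1e5, Lambda * t)        # N_t | Lambda ~ Poisson(Lambda * t)
    c(mean(N), var(N))                 # the variance clearly exceeds the mean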
We now leave the world of Laplace transforms and discuss the next topic, namely Panjer
recursion. For this technique to be applicable, assume Nt satisfies the recursion
    qn := P(Nt = n) = ( a + b/n ) q_{n−1},   n = 1, 2, ...,
for constants a and b. We also assume that Xi ∈ {1, 2, ...}. Panjer recursion yields the exact
recursive formula for pn = P (Lt = n):
    pn = Σ_{i=1}^{n} ( a + b i/n ) P(X1 = i) p_{n−i},   p0 = q0.
The assumption Xi ∈ {1, 2, ...} is not as restrictive as it may seem. One can always scale
the values as necessary.
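A minimal R implementation of the recursion for the Poisson case (where a = 0 and b equals the Poisson mean of the claim number, cf. Exercise 7.3) could look as follows; the severity distribution below is an arbitrary example on {1, 2, 3}.

    panjer_poisson <- function(lambda, f, nmax) {
      # lambda: mean of the Poisson claim number; f[i] = P(X1 = i); returns p[k+1] = P(L = k)
      p <- numeric(nmax + 1)
      p[1] <- exp(-lambda)                              # p_0 = q_0, since P(X1 = 0) = 0
      for (k in 1:nmax) {
        i <- 1:min(k, length(f))
        p[k + 1] <- sum((lambda * i / k) * f[i] * p[k - i + 1])
      }
      p
    }
    f <- c(0.5, 0.3, 0.2)                               # assumed severity distribution on {1, 2, 3}
    p <- panjer_poisson(lambda = 2, f = f, nmax = 50)
    sum(p)                                              # probability mass captured up to nmax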
For the asymptotic approximations, we consider a "small" δ ∈ (0, 1) and the probability of a large loss over the small time interval [0, δu],

    ϕδ(u) := P( Lt > u for some 0 ≤ t ≤ δu ).
Ultimately, we will choose δu = 1. Those familiar with classical ruin theory will immediately
see the connection. But one should note that this situation is slightly different since we
have no premium payments, and we are working over a finite time interval [0, δu] and not
[0, ∞). The Arwedson approximation was originally developed in the study of finite-time
ruin theory. In ruin theory, one studies the Cramér-Lundberg process
    Ct = u + ct − Σ_{i=1}^{Nt} Xi,
with u ≥ 0 the initial capital of the company, c > 0 a constant premium rate and {Xi } the
insurance claim sizes. Arwedson considered the finite-time ruin probabilities
ΨK (u) = P (Ct < 0, for some 0 ≤ t ≤ Ku)
where 0 ≤ K < ∞. Under classical Cramér-Lundberg assumptions, Arwedson showed that
    ΨK(u) ∼ (C_K/√u) e^{−uI(K)}   if K ≤ ρ,      ΨK(u) ∼ C e^{−Ru}   if K > ρ.

Note that the case K > ρ corresponds to the ordinary Cramér-Lundberg estimate. I(K) would nowadays be called the "large deviation rate function", which describes the exponential decay of a probability as u → ∞. R solves the equation Λ(R) := log E[e^{−R(C1−u)}] = 0 (R is called the adjustment coefficient in ruin theory). Here, I(K) > R for K < ρ and ρ = (Λ′(R))^{−1} (one can show that ρu is the "most likely" time of ruin).
Returning to ϕδ(u), if Xi has exponential moments (i.e. is "light tailed"), one can show

    ϕδ(u) ∼ ( C(δ)/√u ) e^{−uJ(δ)}   as u → ∞,
which has connections to Arwedson’s original result as well as the exponentially shifted
measure from our discussion of importance sampling. If X1 is subexponential (think: heavy
tails), for example if X1 is regularly varying, one can prove that
    ϕδ(u) ∼ D u F̄_X( u(1 − δµ) )   as u → ∞.
The proof relies on the concept of "one large jump" in heavy tailed ruin problems, a concept that should be familiar to anyone who has studied ruin theory.
We now turn to the point of time-dependent intensity. It makes sense to assume that the
intensity of Nt changes over time.
Example 14.3. Assume Nt has intensity λn in the interval (n − 1, n], where λn is driven by a so-called AR(1) process Zn = φZ_{n−1} + ξn with {ξn} iid N(0, 1). Zn could represent traded volume, for example, which is linked to increases in operational risk. ◦
For time-dependent intensities of ARMA-type, one can likewise derive similar "Arwedson" approximations.
Similar insurance-based methods are potentially useful for market risk. We end this week
(and this course) with a brief discussion on stochastic processes for market risk. Throughout
the course, we only considered one-period models, and we assumed iid returns. Real-life data
is not iid! Hence stochastic processes (time-dependent models) are called for. This is very
complicated because multiple stochastic processes are usually dependent, and this is difficult
to model, so most current research either considers dependent processes in one dimension
or coordinatewise dependence in one-period models (but not both). A very classical model
for dependence is the ARMA(p, q) model:
    Xn − Σ_{j=1}^{p} φj X_{n−j} = Zn + Σ_{i=1}^{q} θi Z_{n−i},   n = 1, 2, ...,

where {Zn} is an iid N(0, 1) sequence and φj, θi constants. This is an example of a time series model. For "sufficiently small" φj and θi, we have that

    Xn −→^d X

for some limiting random variable X. A classical model for volatility clustering is the ARCH(1) model, in which the log-returns {Rn} satisfy Rn = σn Zn with σn² = α0 + α1 R²_{n−1}, where {Zn} is an iid N(0, 1) sequence. A more complicated model is the GARCH(1,1)
model introduced by Bollerslev, see [7]. Here the log-returns {Rn } satisfy
    Rn = σn Zn,   n = 1, 2, ...,
    σn² = α0 + β1 σ²_{n−1} + α1 R²_{n−1} = α0 + σ²_{n−1}( β1 + α1 Z²_{n−1} ).
Both ARCH(1) and GARCH(1,1) are examples of stochastic recursive sequences. Namely,
    Vn = An V_{n−1} + Bn,   n = 1, 2, ...,

where, in the GARCH(1,1) case, Vn = σn², An = β1 + α1 Z²_{n−1} and Bn = α0. More generally, {(An, Bn) : n = 1, 2, ...} is any iid sequence on (0, ∞) × R. Under certain reasonable conditions, Vn −→^d V, and it is natural to consider P(V > u) for large u. One can apply renewal theoretic methods (such as those presented in the course SkadeStok) to obtain

    P(V > u) ∼ C u^{−R}   as u → ∞,

where R > 0 solves Λ(R) = 0 with Λ(ξ) = log E[e^{ξ log A1}].
This shows that Pareto tails characterise the decay rate. For more complex models (e.g.
GARCH(p,q)), one needs to consider matrix recursions. This is currently an active research
area.
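To get a feeling for this result, one can simulate the GARCH(1,1) volatility recursion and inspect the tail of the stationary distribution; an approximately linear log-log tail plot indicates Pareto-type decay. The parameter values in the R sketch below are arbitrary, chosen so that a stationary solution exists.

    set.seed(1)
    n <- 1e5
    alpha0 <- 0.1; alpha1 <- 0.5; beta1 <- 0.4
    Z <- rnorm(n)
    V <- numeric(n)
    V[1] <- alpha0 / (1 - alpha1 - beta1)                # start near the stationary mean
    for (k in 2:n) V[k] <- alpha0 + V[k - 1] * (beta1 + alpha1 * Z[k - 1]^2)
    u <- quantile(V, probs = seq(0.90, 0.999, by = 0.001))
    plot(log(u), log(1 - ecdf(V)(u)))                    # roughly linear => Pareto-like tail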
Exercises
Exercise 7.1:
Verify that the probability-generating function for a negative binomial variable N with
parameters r and p is given by
    gN(t) = ( p / (1 − (1 − p)t) )^r   for |t| < 1/(1 − p).
Exercise 7.2:
Let {Ñt } be a Poisson process with intensity 1 and Λ ∼ Γ(α, β). Verify that the mixed Pois-
son process Nt = ÑΛt has a negative binomial distribution and determine the parameters.
Exercise 7.3:
Let N be a discrete random variable with N ∈ {0, 1, 2, ...}. Define qn := P (N = n) and
consider the relation
    qn = ( a + b/n ) q_{n−1},   n = 1, 2, ...
Prove that N satisfies this relation for proper choices of a and b in the following cases.
1)N Poisson distributed with intensity λ > 0.
2)N Binomial distributed with parameters k and p.
3)N negative binomial distributed with parameters p and r.
One can prove that these three distributions are the only distributions satisfying this recur-
sive relation.
Appendix A
Preliminaries
Proof. This proof is from [21]. Assume that tn ↑ t but H ← (t−) := limtn ↑t H ← (tn ) < H ← (t).
Then we can find x ∈ R and δ > 0 such that for all n,
Proof. Point (i) is left as an exercise. Consider (ii). h is non-decreasing, so any discontinuity
is a (positive) jump. Since a jump of h corresponds to a flat region for h← (make a drawing!),
the claim follows. One can make similar arguments for (iii) and (iv). For complete proofs
of these statements and many others concerning generalised inverses, consult [9].
(ii) If F is continuous, then F(X) ∼ Unif(0, 1).
Proof. (i) and (ii) are left as exercises. As for (iii), we note that X ≤ x implies F(X) ≤ F(x) since F is non-decreasing. Conversely, consider the event {F(X) ≤ F(x), X > x}. If X > x, we have F(X) ≥ F(x), and hence {F(X) ≤ F(x), X > x} ⊆ {F(X) = F(x), X > x}, so that F is flat on [x, X]. This implies P(F(X) ≤ F(x), X > x) = 0, completing the proof.
Exercises
Exercise A.1:
Prove (i) in Proposition A.1.3.
Exercise A.2:
Prove (i) and (ii) in Proposition A.1.4. Give a counterexample which shows that (ii) need
not hold for general distribution functions.
In this course, measurability is not an issue. We will also rarely worry about the background
space (Ω, F, P ). We now go through distribution functions in some detail since distribution
functions and quantile functions play a central role in the course.
The distribution function of a random variable X is defined by F(x) = P(X ≤ x). It has the following properties, which in fact characterise distribution functions:
1. F is right-continuous.
2. F is non-decreasing.
3. limx→∞ F (x) = 1.
4. limx→−∞ F (x) = 0.
Proof. Assume that F is a distribution function for the random variable X. For ε > 0, we
have {X ≤ x + ε} ↓ {X ≤ x} for ε ↓ 0. By continuity from above for measures, F (x + ε) →
F (x) for ε ↓ 0, showing that F has property 1. If x ≤ y, then {X ≤ x} ⊆ {X ≤ y} so that
F (x) = P (X ≤ x) ≤ P (X ≤ y) = F (y), proving 2. Properties 3 and 4 follow from the fact
that X is real-valued. Conversely, suppose F satisfies properties 1 - 4. Let X = F ← (U )
where U is uniformly distributed on (0, 1). Then
P (X ≤ x) = P (F ← (U ) ≤ x) = P (U ≤ F (x)) = F (x)
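The construction X = F←(U) is also how one simulates from a given distribution in practice. A small R sketch for the exponential distribution, where F(x) = 1 − e^{−λx} and F←(u) = −log(1 − u)/λ (λ = 2 is an arbitrary choice):

    set.seed(1)
    lambda <- 2
    U <- runif(1e5)
    X <- -log(1 - U) / lambda          # X = F^{-1}(U) ~ Exp(lambda)
    c(mean(X), 1 / lambda)             # sample mean vs. theoretical mean 1/lambda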
Recall that a right-continuous function with left-limits is called càdlàg (French: ”continue
à droite, limite à gauche”).
The distribution function of a random variable determines its distribution. Note also the
useful identity
P (a < X ≤ b) = F (b) − F (a)
for every a < b. This generalises to the two-dimensional case as follows: If (X, Y) has distribution function F, then

    P(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)

for a < b and c < d. This is best seen by making a drawing of the rectangle with coordinates (a, c), (a, d), (b, c) and (b, d). Multivariate distribution functions have similar properties as in the univariate case.
4. 0 ≤ F (x1 , ..., xd ) ≤ 1.
Note that the above proposition is not an if and only if statement as in the univariate
case. A counterexample is given in the exercises. We end this subsection about distribution
functions with two tables containing the most important examples for this course.
Table A.1: Densities and distribution functions of some common continuous distributions.
Table A.2: Densities and distribution functions of some common discrete distributions.
The characteristic function of a random variable X is ΦX(t) = E[e^{itX}], t ∈ R.
For the sake of brevity, we will often write cf for characteristic function and mgf for moment-
generating function. Note that the characteristic function of a random variable is always
defined. Indeed, the integrand is bounded by 1 in norm.
Example A.2.7. For the N (0, 1) distribution, the characteristic function is given by
    Φ(t) = e^{−t²/2}.

The case for the general normal distribution N(µ, σ²) is left as an exercise, see also the lemma below. ◦
The cf and mgf are easily generalised to a multivariate random variable X = (X1 , ..., Xd ) as
follows:

    Φ_X(t) = E[e^{i t^T X}],   κ_X(t) = E[e^{t^T X}],   t ∈ R^d,

where the mgf is only defined in the neighbourhood of the origin where it is finite.
Lemma A.2.8. Let X be an R^d-valued random variable, a ∈ R^n and B an n × d matrix. Then the R^n-valued random variable a + BX has cf

    Φ_{a+BX}(t) = e^{i a^T t} Φ_X(B^T t),   t ∈ R^n.
Conversely, if the cf factors, it follows immediately from the uniqueness theorem above that
the X1 , ..., Xd are independent.
Proof. See Theorem 6.34 in [13] and the paragraph following the theorem.
Maybe not surprisingly, these properties more or less carry over to the mgf. A discussion of
the result below can be found in [4], chapter 30.
Theorem A.2.12. If the mgfs of X and Y exist in a neighbourhood around zero and are
equal, then X and Y have the same distribution.
Corollary A.2.13. Let the variables X1, ..., Xd have moment-generating functions κ_{X1}, ..., κ_{Xd} that exist in a neighbourhood around zero. Then X1, ..., Xd are independent if and only if

    κ_{(X1,...,Xd)}(t1, ..., td) = Π_{i=1}^{d} κ_{Xi}(ti).
Proposition A.2.14. Let X be a one-dimensional random variable with mgf κX that exists in a neighbourhood (−c, c) around zero. Then X has moments of all orders and, for k ∈ N,

    κX^{(k)}(0) = E[X^k].
We end this subsection with tables containing the mgf and cf of the distributions from the
tables of distributions above.
Table A.3: Characteristic functions and moment-generating functions for the distributions
in table A.1.
Table A.4: Characteristic functions and moment-generating functions for the distributions
in table A.2.
In this case, Σ is the covariance matrix of X and µ is the mean vector i.e. E[Xi ] = µi and
Σij = Cov(Xi , Xj ) for all i, j = 1, ..., d.
Proof. See Corollary 16.2 in [15]. Note the error in equation (16.5): it should say (2π)^{n/2} and not 2π^{n/2}.
The following result will be used extensively in the discussion on spherical and elliptical
distributions.
Proposition A.2.19. Let X ∼ N(µ, Σ) be d-dimensional, let a ∈ R^n and let B be an n × d-matrix. Then Y := a + BX ∼ N(a + Bµ, BΣB^T).
and

    P(Yn = 1 for all n ∈ N) ≤ P(YN = 1) = p^N.

As this inequality holds for all N, we can take limits on both sides and obtain P(Yn = 1 for all n ∈ N) = 0, which yields P(Yn → 0) = 1 as desired. ◦
It is not immediately clear that almost sure convergence is strictly stronger than convergence
in probability. The difference is of a very technical nature. However, counterexamples exist,
and we encourage the reader to look them up. See for example [22]. It is useful to have
some tools to prove convergence almost surely and in probability. Such tools include the
Markov inequality and Chebyshev’s inequality.
Lemma A.2.24 (Markov’s inequality). Let X be a random variable. Then for any
ε > 0,
    P(|X| > ε) ≤ E[|X|] / ε.
Proof. Trivially, ε1{|X|>ε} ≤ |X|. Now take expectations on both sides and rearrange.
Corollary A.2.25 (Chebyshev’s inequality). Let X be a random variable with finite
expectation, E[|X|] < ∞. Then for any ε > 0,
    P(|X − E[X]| > ε) ≤ Var(X) / ε².
Proof. Left as an exercise for the reader.
Example A.2.26. Consider a sequence of non-negative variables X1 , X2 , ... with finite
expectation and E[Xn ] = 1/n. For any ε > 0, we have by the Markov inequality that
    P(|Xn − 0| > ε) ≤ E[Xn]/ε = 1/(nε) → 0,

so Xn −→^P 0. This result should not be surprising, considering the fact that a non-negative
random variable is zero almost surely if and only if it has mean zero. ◦
The above example of an application of the Markov inequality is not exactly interesting, but
we want to stress that the inequality, while a triviality, is extremely useful and flexible. The
following result shows that almost sure convergence follows if the probability P (|Xn −X| > ε)
goes to zero fast enough.
Proof. The proof is an immediate consequence of the Borel-Cantelli lemma, see Lemma 2.26
and Theorem 2.27 in [13].
Theorem A.2.28 (Strong Law of Large Numbers). Let {Xi } be an iid sequence of
random variables with E[|X1 |] < ∞. Then
    (1/n) Σ_{i=1}^{n} Xi → E[X1]   a.s.
We write Xn −→^d X.
We now state a version of the Central Limit Theorem, often abbreviated CLT.
Theorem A.2.33 (Central Limit Theorem). Let {Xi} be iid with E[X1²] < ∞, µ = E[X1] and σ² = Var(X1). Let Sn = Σ_{i=1}^{n} Xi. Then

    (Sn − nµ) / (σ√n) −→^d Z ∼ N(0, 1).
In this course, this version of the CLT suffices. It is, however, only a small part of the whole story. More perspectives and versions of the CLT can be found in chapter 7 of [13].
Conditional expectations
The presentation in this subsection follows chapter 9 of [13]. Conditional expectations are
essential in performing computations in probability theory and statistics. The (measure the-
oretic) definition of a conditional expectation is somewhat strange at first, but the definition
has the advantage that all the theoretic properties follow almost trivially.
Intuitively, we think of the conditional expectation E[X | G] as our best guess of the value of
X given the information in G. Try to keep this intuition in mind when reading the following
examples and theoretical properties.
so both (i) and (ii) are satisfied by E[X], proving the claim. ◦
The following proposition allows us to compute a plethora of interesting examples.
Proposition A.2.37. If D1 , D2 , ... are disjoint sets in F with ∪n Dn = Ω (such a collection
is called a partition), P (Di ) > 0 for all i, G = σ(D1 , D2 , ...) and X is an integrable random
variable, then
    E[X | G](ω) = (1/P(Di)) ∫_{Di} X dP   for ω ∈ Di,   i = 1, 2, ...
Proof. It is not hard to verify that the sigma-algebra G consists of the sets that are unions of the Di. Since E[X | G] is G-measurable, E[X | G] must be constant when restricted to one of the Di, i.e.

    E[X | G](ω) = ci   for ω ∈ Di,   i = 1, 2, ...

Since

    ∫_{Di} X dP = ∫_{Di} E[X | G] dP = ∫_{Di} ci dP = ci P(Di),

we get ci = (1/P(Di)) ∫_{Di} X dP, as desired.
Proof. This follows immediately from the previous proposition by letting G = σ(N ) and
noting that σ(N ) is generated by the partition {{N = 0}, {N = 1}, ...}.
Before computing some interesting examples, we state the following list of properties of
conditional expectations.
Theorem A.2.39. Let X and Y be random variables with finite expectation, G ⊆ F a
sub-sigma-algebra.
Proof. Consider (i). We have to show that aE[X | G] + bE[Y | G] satisfies the two properties of E[aX + bY | G]. Measurability is obvious. For A ∈ G, we have

    ∫_A ( aE[X | G] + bE[Y | G] ) dP = a ∫_A E[X | G] dP + b ∫_A E[Y | G] dP = a ∫_A X dP + b ∫_A Y dP = ∫_A (aX + bY) dP,
which proves the claim. As for (ii), simply choose A = Ω in the second property of condi-
tional expectations to obtain
    E[X] = ∫_Ω X dP = ∫_Ω E[X | G] dP = E[ E[X | G] ].
(iii) follows immediately from the monotonicity property of integrals. As for (iv), we obviously have −|X| ≤ X ≤ |X|, so from (iii) we get −E[|X| | G] ≤ E[X | G] ≤ E[|X| | G], i.e. |E[X | G]| ≤ E[|X| | G], which is the desired result. See the exercises for an outline of the proof of (v).
The tower property, (ii), has many names. It is also known as the law of iterated expectations
and the law of total expectation to name a few.
Example A.2.40. A typical situation in, for example, non-life insurance is to have a sum of the form

    S = Σ_{i=1}^{N} Xi,

where N is a claim number and {Xi} iid claim sizes independent of N. Conditioning on N and using that N and {Xi} are independent, it follows that E[S | N] = N E[X1]. From the tower property, it follows that

    E[S] = E[ E[S | N] ] = E[N] E[X1].
In practice, one does not proceed as formally as in the above example. The corollary that we
applied essentially says that when we condition on a variable, that variable can be treated
as a constant. If N was constant equal to n, we would say that E[S] = nE[X1 ]. Then we
just replace n by the random variable N to obtain E[S | N ]. Let us make this more precise
by first observing that E[X | Y ] is a function of Y .
(Diagram: Y maps Ω to R, φ maps R to R, and Z = φ ∘ Y maps Ω to R.)
By definition of a conditional expectation, E[X | Y ] is σ(Y )-measurable. Hence E[X | Y ] =
φ(Y ) for some function φ. While the Doob-Dynkin lemma does not provide an explicit
recipe for φ, it is possible to compute φ in many situations of interest. The following result
shows how to compute φ(y) = E[X | Y = y] in the (quite typical) case where (X, Y ) has a
density.
Theorem A.2.42. Let E[|X|] < ∞ and assume (X, Y) has density f(x, y). Then

    E[X | Y = y] = ∫_R x f(x, y)/g(y) dx,

where g(y) = ∫_R f(x, y) dx is the density of Y.
The function

    f_{X|Y=y}(x) = f(x, y)/g(y)

is called the conditional density of X given Y = y, which also makes sense intuitively: given densities, the conditional expectation can be computed as an ordinary expectation, but with respect to the conditional density.
Definition A.2.45 (Brownian motion). A stochastic process {Xt} which satisfies the properties
X0 = 0,
X_{t1}, X_{t2} − X_{t1}, ..., X_{tn} − X_{t_{n−1}} are independent for 0 < t1 < t2 < · · · < tn,
Xt − Xs ∼ N(0, t − s) for 0 ≤ s < t, and
{Xt} has continuous sample paths,
is called a (standard) Brownian motion.
Definition A.2.47. A stochastic process {Xt } is called adapted to the filtration {Ft } if Xt
is Ft -measurable for all t.
Example A.2.48. For any stochastic process {Xt }, we can create a filtration by letting
Ft = σ(Xs : s ≤ t). This is the smallest filtration such that {Xt } is adapted. The filtration
is sometimes called the natural filtration. ◦
Definition A.2.49. A stochastic process {Xt} is called a martingale with respect to the filtration {Ft} if the following hold:
{Xt} is adapted to {Ft},
E[|Xt|] < ∞ for all t, and
E[Xt | Fs] = Xs for all s ≤ t.
Martingales do not play a major role in this course. Nevertheless, some examples are
provided in the exercises.
Exercises
Exercise A.3:
If Y ∼ N (µ, σ 2 ), we say that X = exp(Y ) has a lognormal distribution with parameters µ
and σ. Using the density

    ϕ(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}

for the normal distribution, derive the density of the lognormal distribution.
Exercise A.4:
Prove Proposition A.2.5.
Exercise A.5:
Consider the function F : R2 → R given by
    F(x, y) = 0 if x < 0 or y < 0 or x + y < 1,   and   F(x, y) = 1 otherwise.

Show that F is not a joint distribution function, even though it satisfies the properties of multivariate distribution functions listed above.
Exercise A.6:
Let Y be N (0, 1) and let Z be Bernoulli distributed with success parameter 1/2. Assume
Y and Z are independent. Define X1 := Y and X2 := 1{Z=1} Y − 1{Z=0} Y .
1) Verify that X1 and X2 are both N(0, 1) variables.
2)Show that (X1 , X2 ) is not multivariate normal.
Exercise A.7:
Compute the moment-generating function for the Γ(α, β) distribution.
Exercise A.8:
Exercise A.9:
We know that the standard normal distribution has the characteristic function Φ(t) = e^{−t²/2}.
Exercise A.10:
Compute all moments of the exponential distribution.
Exercise A.11:
The Γ(λ, n) distribution for n ∈ N is called the Erlang distribution. Verify that if X ∼ Γ(λ, n), then X has the same distribution as

    Y1 + · · · + Yn,

where Y1, ..., Yn are iid exponentially distributed with parameter λ.
Exercise A.12:
Prove Proposition A.2.19.
Exercise A.13:
Prove Chebyshev’s inequality, Corollary A.2.25.
Exercise A.14:
Without using the Strong Law of Large Numbers, prove the Weak Law of Large Numbers:
If {Xi } is an iid sequence of random variables with E[X12 ] < ∞, then
    (1/n) Σ_{i=1}^{n} Xi −→^P E[X1].
Exercise A.15:
Assume Xn → X a.s. and that f is a continuous function. Prove that f(Xn) → f(X) a.s.
Exercise A.16:
Let {Xi} be iid variables with E[X1²] = 4 and E[X1] = 1. Show that

    lim_{n→∞} (X1² + · · · + Xn²) / (X1 + · · · + Xn)

exists almost surely and determine the limit.
Exercise A.17:
Let p ≥ 1. A sequence of random variables {Xi} with E[|Xi|^p] < ∞ for all i is said to converge to X in L^p if E[|X|^p] < ∞ and

    E[|Xn − X|^p] → 0   as n → ∞.

In that case, we write Xn −→^{Lp} X.
1) Prove that if Xn −→^{Lp} X, then Xn −→^P X.
2)Let {Xi } be a sequence of random variables with E[Xi ] = 0 and E[Xi2 ] < ∞ for all i.
Prove that (X1 + · · · + Xn )/n converges to zero in L2 and in probability.
3)Prove the following dominated convergence statement: If Xn → X a.s. and there exists
some variable Y with E[|Y |p ] < ∞ (p ≥ 1) such that |Xn | ≤ |Y | a.s. for all n, then
Xn −→^{Lp} X.
Exercise A.18:
Assume X is a random variable with finite moment-generating function κ in the neighbour-
hood (−c, c). Prove Chernoff ’s bound
    P(X > ε) ≤ inf_{α∈[0,c]} κ(α) / e^{αε}.
Exercise A.19:
Assume N is Poisson distributed with parameter λ > 0 and {Xi } is an iid sequence indepen-
dent of N where X1 has moment-generating function κ. Compute the moment-generating
function of
    Σ_{i=1}^{N} Xi.
Exercise A.20:
In this exercise, we will prove (v) in Theorem A.2.39.
1)Prove the result when X = 1A0 is an indicator function.
2) Prove the result when X is a simple function, e.g. X = Σ_{i=1}^{n} ci 1_{Ai} with A1, ..., An ∈ G.
3)One can show the following dominated convergence statement for conditional expecta-
tions: If Xn → X a.s. and |Xn | ≤ Y for a random variable Y with E[|Y |] < ∞, then
E[Xn | G] → E[X | G]. Using this result, prove (v) for a general G-measurable X. Hint:
Recall that there exists a sequence of simple functions {Xn } with Xn ↑ X.
Exercise A.21:
Prove the following extension of the tower property: If G ⊆ H ⊆ F, then

    E[ E[X | H] | G ] = E[X | G].
Exercise A.22:
In this exercise, we introduce the conditional variance. Assume E[X²] < ∞ and that G is a sub-sigma-algebra. The conditional variance is defined by

    Var(X | G) := E[ (X − E[X | G])² | G ].

Prove the law of total variance: Var(X) = E[Var(X | G)] + Var(E[X | G]).
Exercise A.23:
Consider Example A.2.40. Assume that the Xi have finite second moment. Show that

    Var(S) = E[N] Var(X1) + Var(N) E[X1]².

Hint: Use the law of total variance from the previous exercise.
Exercise A.24:
Let {Xt} be a Brownian motion.
1)Define Yt = −Xt . Show that {Yt } is a Brownian motion.
2)Let c > 0 and define Yt = cXt/c2 . Show that {Yt } is a Brownian motion.
Exercise A.25:
A continuous time process {Xt } satisfies continuity in probability if for every sequence {tn }
of non-negative real numbers, we have
P
tn → t ⇒ Xtn −→ Xt .
Exercise A.26:
Let {Xt } be a Brownian motion and Ft = σ(Xs : s ≤ t) the natural filtration.
1) Show that {Xt} is a martingale with respect to {Ft}.
2) Show that {Xt² − t} is a martingale with respect to {Ft}.
Exercise A.27:
A stochastic process {Nt} satisfying the properties
N0 = 0,
N_{t1}, N_{t2} − N_{t1}, ..., N_{tn} − N_{t_{n−1}} are independent for 0 < t1 < t2 < · · · < tn,
Nt − Ns ∼ Poisson(λ(t − s)) for 0 ≤ s < t, and
{Nt} has right-continuous sample paths and limits from the left,
is called a Poisson process with intensity λ > 0. Verify that {Nt − λt} is a martingale with respect to the natural filtration.
A.3 Calculus
During the course, we will occasionally integrate with respect to functions of bounded vari-
ation. Examples of such functions include functions that are monotone, in particular distri-
bution functions. Here we introduce the basic theory, following the lines of [20].
The variation of f over [0, t] is defined as

    V^f(t) := sup{ Σ_{i=1}^{n} |f(ti) − f(t_{i−1})| : 0 = t0 < t1 < · · · < tn = t, n ∈ N },

i.e. the supremum of sums of absolute differences over all finite partitions of [0, t]. If V^f(t) < ∞ for all t ≥ 0, we call f a function of finite variation.
i.e.

    Σ_{i=1}^{n} (f(ti) − f(t_{i−1}))^+ = Σ_{i=1}^{n} (f(ti) − f(t_{i−1}))^− + f(t) − f(0),

and taking supremum on both sides, we obtain the decomposition f(t) − f(0) = V_+^f(t) − V_−^f(t), which is a difference of two non-decreasing functions, as desired.
Remark A.3.6. Note that V^f has the decomposition V^f = V_+^f + V_−^f, since for any x ∈ R, |x| = x^+ + x^−.
We are almost ready to introduce integration. However, finite variation is not quite enough.
We also require the additional property of being càdlàg.
Definition A.3.7. A function f is called càdlàg (French: continue à droite, limite à gauche)
if f is right-continuous and has left limits.
The Jordan decomposition tells us how to proceed from here. If we define integration with
respect to an increasing function, we can use linearity of the integral to define integration
with respect to general functions of bounded variation. Let f be a non-decreasing càdlàg
function. The function µ_f, defined on intervals (a, b] by µ_f((a, b]) = f(b) − f(a), extends to a (positive) measure (called a Lebesgue-Stieltjes measure) on all Borel sets. Note
how this resembles the Lebesgue measure where f is just the identity. We can now define
integration in the same way as in basic measure theory.
Definition A.3.8. Let f : [0, ∞) → R be a non-decreasing càdlàg function. The Lebesgue-
Stieltjes integral of a measurable function g with respect to f is given by
∫_{(0,∞)} g(t) df(t) := ∫_0^∞ g(t) dµ_f(t),

given that ∫_0^∞ |g(t)| dµ_f(t) < ∞. For any Borel set B, we define

∫_B g(t) df(t) := ∫_{(0,∞)} 1_B(t) g(t) df(t).
Example A.3.9. Let F be the distribution function for the random variable X. If B = (a, b], then P(X ∈ B) = F(b) − F(a) = ∫_B dF. Since the intervals (a, b] generate the Borel sigma-algebra, we have P(X ∈ B) = ∫_B dF for all Borel sets B. ◦
Example A.3.10. Consider a positive random variable X with finite expectation and
distribution function F . We claim that
E[X] = ∫_0^∞ x dF(x).

Since P(X ∈ (a, b]) = F(b) − F(a) for all a < b, the image measure of X, P^X, coincides with the measure induced by F. Hence

E[X] = ∫_0^∞ x dP^X(x) = ∫_0^∞ x dF(x)
as desired. ◦
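As a small numerical illustration of Example A.3.10 (an approximation on a truncated domain, not an exact computation), take X exponentially distributed with rate 1, so that F(x) = 1 − e^{−x} and E[X] = 1. A Riemann-Stieltjes sum over a fine grid then recovers the mean.

import numpy as np

def F(x):
    # Distribution function of the Exp(1) distribution.
    return 1.0 - np.exp(-x)

# Approximate E[X] = \int_0^infty x dF(x) by sum_i x_i (F(x_i) - F(x_{i-1})),
# truncating the integral at x = 50 (the neglected tail mass is negligible).
x = np.linspace(0.0, 50.0, 500_001)
print(np.sum(x[1:] * np.diff(F(x))))     # close to 1, the mean of Exp(1)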
Let f be a non-decreasing càdlàg function, s < t and g a function which is integrable with respect to µ_f. Then:
(i) ∫_{(s,t]} df(u) = f(t) − f(s).
(ii) ∫_{{t}} g(u) df(u) = g(t)∆f(t), where ∆f(t) := f(t) − f(t−).
Remark A.3.13. It is not difficult to see that the result also works for other intervals such
as (a, b], (t, ∞) etc.
Exercises
Exercise A.28:
Prove Proposition A.3.4.
Exercise A.29:
Prove Proposition A.3.11.
Exercise A.30:
In this exercise, we will consider some classes of functions of bounded variation.
1) Let f : [0, ∞) → R be a function which is Lipschitz on every compact interval, i.e. for every [a, b] ⊆ [0, ∞), there exists a constant C such that

|f(x) − f(y)| ≤ C|x − y| for all x, y ∈ [a, b].

Show that f is of finite variation.
Exercise A.31:
Compute the integral ∫_{(0,4]} t df(t) in the following cases:
1) f(t) = k for k − 1 ≤ t < k, k = 1, 2, ....
2) f(t) = e^t.
3) f(t) = k + e^t for k − 1 ≤ t < k, k = 1, 2, ....
Bibliography
[1] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent mea-
sures of risk. Mathematical Finance, 9:203–228, 1999. doi: https://doi.org/10.1111/
1467-9965.00068.
[2] Søren Asmussen and Peter W. Glynn. Stochastic Simulation: Algorithms and Analysis.
Stochastic Modelling and Applied Probability. Springer New York, NY, 1 edition, 2007.
ISBN 978-0-387-30679-7.
[3] Sheldon Axler. Linear Algebra Done Right. Undergraduate Texts in Mathematics.
Springer Cham, 4 edition, 2023. ISBN 978-3-031-41025-3.
[4] Patrick Billingsley. Probability and Measure. Wiley Series in Probability and Statistics.
John Wiley & Sons Inc., anniversary edition, 2012. ISBN 978-1-118-12237-2.
[5] N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation. Encyclopedia of
Mathematics and its Applications. Cambridge University Press, 1987. doi: 10.1017/
CBO9780511721434.
[6] Tomas Björk. Arbitrage Theory in Continuous Time. Oxford University Press, 4 edition,
2020. ISBN 978-0-19-885161-5.
[7] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of
Econometrics, 31(3):307–327, 1986. URL https://public.econ.duke.edu/~boller/
Published_Papers/joe_86.pdf.
[8] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics,
7(1):1–26, 1979. URL http://www.jstor.org/stable/2958830.
[9] Paul Embrechts and Marius Hofert. A note on generalized inverses. 2014. URL
https://people.math.ethz.ch/~embrecht/ftp/generalized_inverse.pdf.
[10] Paul Embrechts, Claudia Klüppelberg, and Thomas Mikosch. Modelling Extremal
Events for Insurance and Finance. Stochastic Modelling and Applied Probability.
Springer Berlin, Heidelberg, 1 edition, 1997. ISBN 978-3-540-60931-5.
[11] Robert F. Engle. Autoregressive conditional heteroscedasticity with estimates of the
variance of united kingdom inflation. Econometrica, 50(4):987–1007, 1982. ISSN
00129682, 14680262. URL http://www.jstor.org/stable/1912773.
[12] Paul Glasserman. Monte Carlo Methods in Financial Engineering. Stochastic Modelling
and Applied Probability. Springer New York, NY, 1 edition, 2003. ISBN 978-0-387-
00451-8.
[13] Ernst Hansen. Stochastic Processes. Institut for Matematiske Fag Københavns Univer-
sitet, 4 edition, 2023. ISBN 978-87-71252-59-0.
[14] Henrik Hult and Filip Lindskog. Mathematical Modeling and Statistical Methods for
Risk Management. 2007.
[15] Jean Jacod and Philip Protter. Probability Essentials. Universitext. Springer Berlin,
Heidelberg, 2 edition, 2002. ISBN 978-3-642-55682-1.
[16] Rasmus Frigaard Lemvig. Stochastic Processes in Non-Life Insurance (SkadeStok) Lec-
ture Notes. URL https://rasmusfl.github.io/projects.html.
[17] Alexander J. McNeil, Rüdiger Frey, and Paul Embrechts. Quantitative Risk Manage-
ment - Concepts, Techniques and Tools. Princeton University Press, revised edition,
2015. ISBN 978-0-691-16627-8.
[18] Thomas Mikosch. Non-Life Insurance Mathematics. Universitext. Springer Berlin,
Heidelberg, 2 edition, 2009. ISBN 978-3-540-88232-9.
[19] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics. Springer
New York, NY, 2 edition, 2006. ISBN 978-0-387-28659-4.
[20] Jesper Lund Pedersen. Stochastic Processes in Life Insurance: The Dynamic Approach.
Department of Mathematical Sciences, University of Copenhagen.
[21] Sidney I. Resnick. Extreme Values, Regular Variation and Point Processes. Springer
Series in Operations Research and Financial Engineering. Springer New York, NY, 1
edition, 2007. ISBN 978-0-387-75952-4.
[22] Jordan M. Stoyanov. Counterexamples in Probability. Dover Books on Mathematics.
Dover Publications Inc., 3 edition, 2013. ISBN 978-0-486-49998-7.
Index