
Quantitative Risk Management (QRM) 2023/2024

Lecture notes

Rasmus Frigaard Lemvig ([email protected])

First edition
Preface to the first edition
These lecture notes were written for the course Quantitative Risk Management (abbreviated
QRM ) at the University of Copenhagen in the winter of 2023/2024. They are based on the
lectures by Jeffrey F. Collamore with some of my own additions, mostly in the form of
supplementary examples, exercises and an appendix. I also added a few more results that
I felt were helpful when solving the mandatory exercises. The notes should be sufficient to
cover the entire syllabus, and if the reader wants to dive into certain topics in more detail,
the notes and comments at the end of each week provide further references.

The notes are built up in the following way: Each week in these notes corresponds roughly
to a week of teaching. Some subsections have been moved to make the presentation more
clear, but the reader can expect the notes to follow the teaching almost one to one. The
appendix at the end deserves an explanation. The appendix contains three sections. The first
is very short and concerns generalised inverses. This section is essential for the course, and
it should be clear in the text when the reader should consult this part of the appendix. As
for the other two sections, probability theory and calculus, they are purely supplementary.
I wrote them with the intent of making it easier to look up some basic notions if necessary.
The course relies on a general understanding of probability theory, and sometimes tools
like conditioning arguments and integration by parts (with respect to functions) are used
without explanation. For students with less training in such computations, it may be a good
idea to take a brief look at the relevant subsections.

Feedback of any kind is greatly appreciated. The exercises are my own addition (well, most of
them), and any suggestions on how they can be made more interesting are very welcome.
Last but not least, there are likely a number of typos remaining and maybe even a few
mathematical errors. These are all due to me. Please don't hesitate to let me know if you
find any.

Rasmus Frigaard Lemvig


January 2024
Table of contents

Week 1 - Risk measures
  1 Introduction
  2 Loss random variables
  3 Risk measures

Week 2 - Methods for computing VaR and extreme value theory
  4 Computing Value at Risk
  5 Extreme value theory

Week 3 - Spherical and elliptical distributions
  6 Spherical and elliptical distributions
  7 Properties of elliptical distributions

Week 4 - Copulas I
  8 Copulas: Basic properties
  9 Examples of copulas

Week 5 - Copulas II
  10 Archimedean copulas in higher dimensions
  11 Fitting copulas to data

Week 6 - Credit risk I
  12 Portfolio credit risk

Week 7 - Credit risk II and further topics
  13 Portfolio credit risk continued
  14 Operational risk

A Preliminaries
  A.1 Generalised inverses
  A.2 Probability theory
  A.3 Calculus

Bibliography

Index
Week 1 - Risk measures

1 Introduction
What is quantitative risk management about?
Quantitative risk management aims at describing and understanding risk in a financial
context. To motivate our discussion, let us start with a basic example. Suppose we have a
stock with value Sn at time n (discrete time units). Assume the Bernoulli model
S_0 = 1,   P(S_{n+1} = 2S_n) = 2/3   and   P(S_{n+1} = 0.5S_n) = 1/3   for n ≥ 0.
In this model there are some basic questions we may ask. For example, what is the risk in
this investment policy? What are the expected returns? We can compute
E[S_n | S_{n−1}] = (2/3) · 2S_{n−1} + (1/3) · (1/2)S_{n−1} = (3/2) S_{n−1}

and it follows that

M_n = (2/3)^n S_n

is a martingale, and

1 = M_0 = E[M_n] = (2/3)^n E[S_n],   implying   E[S_n] = (3/2)^n → ∞.
However, there is still a risk that we lose money in this investment policy. Note that we can
write
Sn = R1 · · · Rn
for iid variables R_i with P(R_i = 2) = 2/3 and P(R_i = 1/2) = 1/3. Taking logs yields

log S_n = Σ_{i=1}^{n} log R_i

which is a risk process that can be studied using ruin theory. If E[e−α log S1 ] = 1 for some
α > 0, we obtain the Cramér-Lundberg estimate (for some threshold u > 0)

P (log Sn < −u for some n) ∼ Ce−αu

for a constant C. In Cramér-Lundberg theory, there is a tradeoff between profit and risk
and the same rule applies to risk management in finance. This tradeoff can be studied from


several viewpoints. Sometimes this tradeoff is handled using a utility function U and then
maximizing the expected utility E[U (Sn )]. In this course we follow another approach via
so-called risk measures. This approach allows us to compute loss probabilities directly. All
these terms will get a more rigorous definition later on.
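The profit/risk tradeoff in the Bernoulli model can be made concrete with a short simulation. The following Python sketch is not part of the original lectures; the path length, sample size and threshold u = log 4 (i.e. the stock dropping below 1/4) are arbitrary choices. It checks the martingale mean E[M_n] ≈ 1 and estimates the probability of a large drawdown:

```python
import random

random.seed(0)

# Simulate the Bernoulli stock model: S_0 = 1, and at each step the price
# doubles with probability 2/3 or halves with probability 1/3.
def path(n):
    s, running_min = 1.0, 1.0
    for _ in range(n):
        s *= 2.0 if random.random() < 2 / 3 else 0.5
        running_min = min(running_min, s)
    return s, running_min

n, n_paths = 20, 20000
results = [path(n) for _ in range(n_paths)]

# Martingale check: the average of M_n = (2/3)^n S_n should be close to M_0 = 1.
mean_M = sum((2 / 3) ** n * s for s, _ in results) / n_paths
# Despite E[S_n] -> infinity, there is downside risk: estimate P(min S < 1/4).
ruin = sum(m < 0.25 for _, m in results) / n_paths

print(f"E[M_{n}] ≈ {mean_M:.2f} (theory 1), P(min S < 1/4) ≈ {ruin:.3f}")
```

The estimated drawdown probability is substantial even though the expected value grows geometrically, which is exactly the tradeoff described above.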

Some historical remarks


An essential part of risk management is to determine the capital needed to withstand shocks
for financial institutions. History has many examples of banks and other institutions failing
due to insufficient financial coverage. We go through some of these examples to add more
context to the following (more mathematical) discussion.

Example 1.1. Barings Bank was founded in 1762 and was one of the UK’s oldest and
most respected banks. In 1995 the bank collapsed despite having more than $900 million in
capital. This was due to unauthorized trading by a single employee. He sold straddles
(selling both a call option and a put option), which in a stable market will usually expire
worthless and provide a gain. However, a Japanese earthquake caused market instability, and
the bank suffered a loss of more than $1 billion, resulting in a bankruptcy. ◦

Example 1.2. In 1994, the hedge fund LTCM (Long-Term Capital Management) was
founded. The employees were experienced traders and academics. $1.3 billion were invested
with returns after two years close to 40 %. Early in 1998 net assets were $4 billion, but
by the end of the year the fund was close to default. The U.S. Federal Reserve managed a
$3.5 billion rescue package to avoid a systemic crisis in the world financial system. The
triggering event for this disaster was the devaluation of the ruble by Russia. ◦

Example 1.3. In 2023, Silicon Valley Bank collapsed. The bank owned low interest bonds
and paid even lower interest to the depositors. When the depositors withdrew their capital,
this required selling the low-interest treasury bonds, whose market prices had decreased
sharply. ◦

Example 1.4. The final example concerns a particular individual, Jesse Livermore. Livermore
is considered the greatest short-seller in history. He successfully shorted:

• The 1906 earthquake (through a railway investment).

• The 1907 market crash (the "panic of 1907").

• The 1929 market crash (through ca. 100 shorts, netting about $100 million).

He also had numerous successful "long" investments. However, he also went bankrupt a
number of times: in 1901, 1908 and again in 1934. His book on trading remains a classic
to this day. ◦

2 Loss random variables


In describing risk, we focus on the loss instead of the profit. The following definition sets
the stage for our discussion in the first weeks of the course.
3

Definition 2.1. Let t_0, t_1, ... denote discrete timepoints (days, weeks, months or years for
example) and let ∆t_n = t_{n+1} − t_n denote the time passed between timepoints n and n + 1.
We let V_n denote the capital at time t_n, and we let L_{n+1} denote the loss between t_n and
t_{n+1}, i.e.

L_{n+1} = −(V_{n+1} − V_n).
We let a general loss random variable be denoted by L. Many models consist of assumptions
on L. Let us consider some motivating examples. Note that in these examples, it makes
sense to think of Vn as a portfolio.
Example 2.2 (Stock investment). Let Vn = Sn with Sn the price of a certain stock at
time tn . We let Xn+1 = log Sn+1 − log Sn denote the log returns. Hence
e^{X_{n+1}} = S_{n+1} / S_n.
Historically, Xn+1 has often been given a normal distribution. This is motivated by the
Black-Scholes model, see the end of the examples for a concise explanation of this model. In
this model, the change of the stock price (in continuous time t) is described by the dynamics
dS(t)/S(t) = r dt + σ dW(t)

with W(t) denoting a Brownian motion. Solving the equation explicitly, assuming that we
are currently at time t so that S(t) is known, yields the expression for the stock price at
time T > t:

S(T) = S(t) e^{(r − σ²/2)(T − t) + σ(W(T) − W(t))}.

The discrete time analogue is

S_{n+1} = S_n e^{(r − σ²/2)(t_{n+1} − t_n) + σ √(t_{n+1} − t_n) Z}

with Z ∼ N(0, 1). The log return becomes

X_{n+1} = log S_{n+1} − log S_n = (r − σ²/2)(t_{n+1} − t_n) + σ √(t_{n+1} − t_n) Z,

which is normally distributed. Note that the loss L_{n+1} can be written as

L_{n+1} = −(S_{n+1} − S_n) = −S_n (e^{X_{n+1}} − 1).

If we are at time n, the value Sn is known. A key goal of this course is to model the unknown
part Xn+1 and thereby make inference about the behaviour of the process Vn at a future
time. Note also how this approach differs from the one in classical ruin theory where the
entire positive timeline is considered. Here we model the change in capital one step at a
time. ◦
Example 2.3 (Stock investment with more assets). The preceding example can be
generalised. Assume that we have d stocks. We can then form a portfolio at time n by
V_n = Σ_{i=1}^{d} α_i S_n^{(i)}

with α_i the number of units bought of stock i and S_n^{(i)} the value of stock i at time t_n. Using
the previous example, the loss is given by

L_{n+1} = − Σ_{i=1}^{d} α_i S_n^{(i)} (e^{X^{(i)}_{n+1}} − 1),   where   X^{(i)}_{n+1} = log S^{(i)}_{n+1} − log S^{(i)}_n.

Again we need to come up with a model for the log returns X^{(1)}, ..., X^{(d)}, where we can
write

X_{n+1} = (X^{(1)}_{n+1}, ..., X^{(d)}_{n+1}).
Very often the variables in Xn+1 are dependent. Think for example of a portfolio of stocks
in the same type of companies. This dependence structure is crucial to understand and
capture when estimating risk in a portfolio.

Example 2.4 (Bond investment). Consider a zero-coupon bond. Such an asset pays one
unit at a fixed time T . Let us briefly consider such a bond in continuous time. We have an
interest rate rt at time t, and we let Bt denote the price of the bond at time t (this price
will depend on T ). The price of the bond can be described by the dynamics

dBt = rt Bt dt,

and using the boundary condition B_T = 1, we can solve the above equation and get

1 = B_T = B_t e^{∫_t^T r_s ds},

and we can rewrite this expression in terms of B_t as

B_t = e^{−∫_t^T r_s ds} = e^{−(T − t) y(t,T)},   where   y(t, T) = (1/(T − t)) ∫_t^T r_s ds.

y(t, T ) is called the yield of the bond. Let us now consider discrete time, and say that the
current time is tn . The loss is
 
L_{n+1} = −(B_{t_{n+1}} − B_{t_n}) = −B_{t_n} ( B_{t_{n+1}} / B_{t_n} − 1 ).

Let us fix some notation. Denote the yield at time t_n by Z_n = y(t_n, T), and let X_{n+1} =
Z_{n+1} − Z_n. Note that Z_n is known at time t_n while X_{n+1} again needs to be modelled.
We can rewrite the ratio B_{t_{n+1}} / B_{t_n} as

B_{t_{n+1}} / B_{t_n} = e^{−(T − t_{n+1}) y(t_{n+1}, T) + (T − t_n) y(t_n, T)}
                = e^{−(T − t_n − ∆t_n)(Z_n + X_{n+1}) + (T − t_n) Z_n}
                = e^{∆t_n Z_n − (T − t_{n+1}) X_{n+1}}.

This expression makes it clear how the unknown variable Xn+1 enters the loss.


In the above examples, we ended up with an expression containing a term of the form
e^{(···)} − 1. This makes it tempting to use a Taylor approximation since e^x ≈ 1 + x for
small x, and historically, such an approximation was often considered to ease computations.
Taking the example with the stock portfolio, we let

L^∆_{n+1} = − Σ_{i=1}^{d} α_i S_n^{(i)} X^{(i)}_{n+1}

denote the linearized loss. This approximation can often be problematic, however, since it is
of interest to consider large losses (which we will do next week), and the approximation
e^x ≈ 1 + x is poor precisely when x is large.

A brief rundown of the Black-Scholes model


Consider a risk free asset with price process denoted by B (a bank account) given by the
continuous dynamics
dBt = rt Bt dt
where it is often assumed that B0 = 1. We can explicitly solve for the price and obtain
B_t = B_0 e^{∫_0^t r_s ds}.

rt is called the interest rate and is assumed to be an adapted process. We model a risky
asset (such as a stock) with price process St by a stochastic differential equation (SDE) of
the form
dSt = µ(t, St )dt + σ(t, St )dWt
with deterministic functions µ and σ and a Brownian motion W . µ is called the local mean
rate of return for St while σ is called the volatility of St . For all the necessary results on
SDEs, consult chapter 4 and 5 of [6]. For our purposes, it suffices to know the model on an
intuitive basis. The Black-Scholes model is a special case of the above model.

Definition 2.5. The (Standard) Black-Scholes model consists of two assets with dynamics

dBt = rBt dt,


dSt = µSt dt + σSt dWt

with r, µ and σ constants.

In the language of SDEs, a process with the dynamics of St is called a Geometric Brownian
motion (GBM). Such an SDE can be solved explicitly. In our case, we may write
S_t = S_0 e^{(µ − σ²/2) t + σ W_t}.

The Black-Scholes model can of course be extended to include more risky assets with the
same type of dynamics as St . Such a model is naturally referred to as a multidimensional
Black-Scholes model.

The goal of arbitrage theory is to price financial derivatives i.e. products based on the price
of underlying assets. We think intuitively of an arbitrage as a money machine/free lunch
i.e. as a portfolio of assets that costs nothing and produces a positive amount of money

with probability one. It turns out that the absence of arbitrage is the same as the existence
of a so called equivalent martingale measure (EMM) i.e. a measure Q equivalent to the
underlying measure P (Q and P have the same null sets) and such that the discounted price
processes S_t / B_t
are martingales under Q. Q is also referred to as a risk neutral measure. Under the measure
Q, the dynamics of St change to

dS_t = r S_t dt + σ S_t dW_t^Q

where W^Q is a Brownian motion under the Q measure. Note that the volatility is unchanged
while the local mean rate of return becomes the interest rate r. Say we have a
derivative which expires at time T , the current time is t < T and that the derivative pays
X = Φ(ST ) at time T . Note that the payout is a function of the price of the risky asset at
time T (such a derivative is called simple). If Πt [X] denotes the (arbitrage free) price of X
at time t, arbitrage theory yields the following formula.

Theorem 2.6 (Risk Neutral Valuation). The arbitrage free price of Φ(ST ) at time t < T
is given by
Π_t[X] = e^{−r(T − t)} E^Q[Φ(S_T) | F_t].

By an arbitrage free price we mean a price process that doesn’t introduce an arbitrage into
the market. Note that the above formula says that the price is given by the discounted
expected value under the risk neutral measure (given the information we currently have
available).

Example 2.7 (European call option). A European call option gives the holder the right
(but not the obligation) to buy one stock at time T at price K (the strike price). The
payout is (ST − K)+ = max{ST − K, 0} since if ST > K, we get the payout ST − K while
if ST ≤ K, the option is worthless. In the Black-Scholes model, we can solve for ST as
S_T = S_t e^{(r − σ²/2)(T − t) + σ(W_T^Q − W_t^Q)}

since we are free to choose the current starting value (so here we choose to consider the
start value at time t). If C(t, T ) denotes the price at time t, we have by the Risk Neutral
Valuation formula that

C(t, T) = e^{−r(T − t)} E^Q[(S_T − K)^+ | F_t]

and this can be computed explicitly¹. The result is known as the Black-Scholes formula. It
says that
C(t, T) = S_t Φ(u) − K e^{−r(T − t)} Φ(v)

where

u = ( log(S_t / K) + (r + σ²/2)(T − t) ) / ( σ √(T − t) ),   v = u − σ √(T − t).
¹ The brave reader can carry out this computation. It involves a lot of integration by substitution.
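For concreteness, the formula can be implemented in a few lines of Python. This is my own sketch, not from the notes: the standard normal distribution function Φ is computed via math.erf, and the numerical inputs are hypothetical.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal distribution function Phi, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S: float, K: float, r: float, sigma: float, tau: float) -> float:
    """Black-Scholes price of a European call; tau = T - t is the time to maturity."""
    u = (math.log(S / K) + (r + sigma**2 / 2) * tau) / (sigma * math.sqrt(tau))
    v = u - sigma * math.sqrt(tau)
    return S * norm_cdf(u) - K * math.exp(-r * tau) * norm_cdf(v)

# Hypothetical example: at-the-money call, r = 5%, sigma = 20%, one year to maturity.
print(f"C(t, T) = {bs_call(100.0, 100.0, 0.05, 0.2, 1.0):.4f}")
```

As a sanity check, the price computed this way increases with the volatility and decreases with the strike, as one would expect from the payout (S_T − K)^+.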

Say we have a portfolio consisting of one European call option, so that the value is Vn =
C(tn , T ). We note that the risk in this portfolio can be explained by the three quantities

Z_n = (Z_n^{(1)}, Z_n^{(2)}, Z_n^{(3)}) := (log S_{t_n}, r_n, σ_n)

since the volatility and interest rate are not constant in real life. One typically needs to
model the change in these factors (called risk factors, see the discussion below), namely
Xn+1 = Zn+1 − Zn . We can for example consider the linearized loss
 
L^∆_{n+1} = − ( (∂C/∂t) ∆t + (∂C/∂S) X^{(1)}_{n+1} + (∂C/∂r) X^{(2)}_{n+1} + (∂C/∂σ) X^{(3)}_{n+1} ).

In mathematical finance, these first order derivatives have names: ∂C/∂t is called "theta",
∂C/∂S "delta", ∂C/∂r "rho" and ∂C/∂σ "vega". Together these quantities are referred to
as the Greeks, see chapter 10 in [6].

A general risk model


All examples considered above had the same form for Vn . We could write Vn = f (tn , Zn )
for some (measurable) function f and Zn suitable random variables.

Definition 2.8. For a portfolio of the form V_n = f(t_n, Z_n) with f : R_+ × R^d → R a
measurable function and Z_n = (Z_{n,1}, ..., Z_{n,d}) a random vector, we call the variables in Z_n
risk factors. We call X_{n+1} = Z_{n+1} − Z_n the risk-factor changes at time n + 1.

In the previous examples we stressed that we only need to model the change in risk factors
Xn+1 since the value Zn is already known at time tn . Hence we can write the loss entirely
in terms of the change in risk factors. Explicitly,

Ln+1 = −(Vn+1 − Vn ) = −(f (tn+1 , Zn + Xn+1 ) − f (tn , Zn )) =: l[n] (Xn+1 ).

We refer to l[n] as the loss operator. By considering the linearized loss

L^∆_{n+1} = − (∂f/∂t)(t_n, Z_n) ∆t − Σ_{i=1}^{d} (∂f/∂z_i)(t_n, Z_n) X^{(i)}_{n+1}

obtained by applying a first order Taylor expansion, we can similarly define the linearized
loss operator
l^∆_{[n]}(x) := − (∂f/∂t)(t_n, Z_n) ∆t − Σ_{i=1}^{d} (∂f/∂z_i)(t_n, Z_n) x^{(i)}.
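As an illustration of the loss operator, here is a minimal Python sketch of my own (the numbers are hypothetical) for the single-stock portfolio of Example 2.2, where f(t, z) = e^z and the risk factor is Z_n = log S_n:

```python
import math

# Single-stock portfolio: V_n = f(t_n, Z_n) with f(t, z) = e^z and Z_n = log S_n.
def f(t, z):
    return math.exp(z)  # portfolio value; time-independent in this example

def loss_operator(t_n, dt, z_n):
    """Return the loss operator l_[n] and its linearization l_[n]^Delta."""
    exact = lambda x: -(f(t_n + dt, z_n + x) - f(t_n, z_n))
    # For this f we have df/dt = 0 and df/dz = e^z, so the linearized loss is -e^{z_n} x.
    linear = lambda x: -math.exp(z_n) * x
    return exact, linear

exact, linear = loss_operator(0.0, 1 / 250, math.log(100.0))  # S_n = 100

# The approximation is good for small risk-factor changes but poor for large ones.
for x in (0.01, -0.05, -0.30):
    print(f"x = {x:+.2f}: l(x) = {exact(x):+8.3f}, linearized = {linear(x):+8.3f}")
```

Note how the gap between the two losses grows with |x|, in line with the warning about linearized losses above.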

3 Risk measures
We need some notion of the ”size” of a risk in order to quantify the risk of a loss.

Definition 3.1. Let L denote a loss. A risk measure ρ associates a real number to L
denoted by ρ(L).
8

A way to make the definition more formal is to let G denote the set of all measurable real-
valued functions on the background probability space. A risk measure is then a mapping
ρ : G → R. We will not worry about these details in this course. We now go through
some essential examples of risk measures. In the following, let L denote some loss random
variable.

Example 3.2. For α ∈ (0, 1), let ρ(L) = inf{x ∈ R : P(L > x) ≤ 1 − α}. This risk measure
is called the Value at Risk at level α and is denoted VaRα(L). One can intuitively think of
VaRα(L) as the smallest value of x such that P(L ≤ x) ≥ α. We can rewrite

VaRα(L) = inf{x ∈ R : 1 − P(L ≤ x) ≤ 1 − α}
        = inf{x ∈ R : FL(x) ≥ α}
        = FL←(α) =: qα(FL)

with FL denoting the distribution function of L and FL← the generalised inverse of FL . Since
FL is a distribution function, the generalised inverse coincides with the quantile function
q(·) (L). So a statistician would simply call VaRα the α-quantile of L. See the appendix for
more information on generalised inverses and their properties. ◦

Example 3.3. For α ∈ (0, 1) and FL continuous and strictly increasing, we define

ESα (L) = E[L | L ≥ VaRα (L)]

called the Expected Shortfall at level α. This is the expected loss given that the loss has
surpassed the Value at Risk. We immediately see that ESα (L) ≥ VaRα (L) and that ESα
takes into account the severity of the loss in comparison to VaRα . ◦

We want to generalize the Expected Shortfall to also be valid for non-continuous distribution
functions. We will need the following lemma.

Lemma 3.4. If U is uniformly distributed on (0, 1) and L is a random variable with
continuous distribution function FL, then L =d FL←(U), i.e. L and FL←(U) have the same
distribution.

Proof. Let Y = FL← (U ). A property of generalised inverses (see the appendix) is that
u ≤ FL (x) if and only if FL← (u) ≤ x. Hence

P (Y ≤ x) = P (FL← (U ) ≤ x) = P (U ≤ FL (x)) = FL (x)


since U is uniformly distributed on (0, 1). This proves Y =d L as desired. □
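The lemma is also the basis of inverse transform sampling: to simulate from F, apply F← to uniform variables. The following Python check is my own; the Pareto-type distribution F(x) = 1 − x⁻³ for x ≥ 1 is a hypothetical choice of example.

```python
import random

random.seed(5)

# F(x) = 1 - x^{-3} for x >= 1 has (generalised) inverse F^{-1}(u) = (1 - u)^{-1/3},
# so by the lemma F^{-1}(U) with U ~ Uniform(0,1) has distribution function F.
def F_inv(u: float) -> float:
    return (1.0 - u) ** (-1.0 / 3.0)

sample = [F_inv(random.random()) for _ in range(200000)]

mean = sum(sample) / len(sample)                    # theoretical mean: 3/2
frac = sum(x <= 2.0 for x in sample) / len(sample)  # theoretical F(2) = 1 - 1/8 = 0.875
print(f"sample mean = {mean:.3f}, empirical F(2) = {frac:.3f}")
```

Both sample quantities land close to their theoretical values, as the lemma predicts.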

The following proposition tells us how to generalize the notion of Expected Shortfall.

Proposition 3.5. Let L be a loss variable with FL continuous and strictly increasing. Then
ESα(L) = (1/(1 − α)) ∫_α^1 VaRu(L) du.

Proof. Because FL is continuous and strictly increasing, the generalised inverse is a proper
inverse. Hence
P (L ≥ VaRα (L)) = 1 − α

so by the previous lemma,

ESα(L) = (1 / P(L ≥ VaRα(L))) E[L 1_{L ≥ VaRα(L)}] = (1/(1 − α)) E[L; L ≥ VaRα(L)]
       = (1/(1 − α)) E[FL←(U); FL←(U) ≥ FL←(α)] = (1/(1 − α)) E[FL←(U); U ≥ α]
       = (1/(1 − α)) E[VaRU(L); U ≥ α] = (1/(1 − α)) ∫_α^1 VaRu(L) du.  □

Most problems include distributions that have continuous and strictly increasing distribu-
tion functions, but it is also useful to have a definition of Expected Shortfall for general
distributions. The following definition summarises the above considerations.

Definition 3.6. Let α ∈ (0, 1) and L a loss variable. We let

VaRα (L) = inf{x ∈ R : P (L > x) ≤ 1 − α}

denote the Value at Risk at level α. If E[|L|] < ∞, we define


ESα(L) = (1/(1 − α)) ∫_α^1 VaRu(L) du,

called the Expected Shortfall at level α.

Properties/axioms of risk measures


It is natural to ask what characterises a good risk measure. We think intuitively of a risk
measure as an amount of capital needed by a financial institution to withstand large shocks.
In the above examples with VaR and ES, we had a level α, and we often think of α as
large, for example 0.95, 0.99 or 0.995. In an article by Artzner, Delbaen, Eber and Heath
[1], certain desirable properties of risk measures are suggested. They are as follows.

Definition 3.7. For a risk measure ρ and loss variables L, L1 , L2 , we consider the following
axioms/properties:

1. Translation invariance: ρ(L + a) = ρ(L) + a for every constant a ∈ R.

2. Subadditivity: ρ(L1 + L2 ) ≤ ρ(L1 ) + ρ(L2 ).

3. Positive homogeneity: ρ(λL) = λρ(L) for all λ > 0.

4. Monotonicity: L1 ≤ L2 implies ρ(L1 ) ≤ ρ(L2 ).

A risk measure satisfying all these axioms is called coherent.

The rationale for translation invariance is that adding a deterministic quantity to the loss
should increase the capital we need to set aside by exactly that amount. The rationale for
subadditivity is that diversification should reduce risk. Positive homogeneity makes sense
since if we invest more money into the same asset, the amount of capital we need to set
aside should be multiplied by the same factor. Monotonicity also clearly makes sense. Note
also the similarity to the pricing principles from non-life insurance.

Value at Risk and Expected Shortfall are the two most popular choices of risk measures.
One reason many prefer Expected Shortfall over Value at Risk is that Expected Shortfall in
general satisfies all the above axioms, while Value at Risk only satisfies three.

Proposition 3.8. Let α ∈ (0, 1) and let L be a loss variable.

(i) For constants a > 0, b ∈ R, we have

VaRα (aL + b) = a VaRα (L) + b

so in particular, VaRα satisfies translation invariance and positive homogeneity.

(ii) VaRα satisfies monotonicity.

Proof. For (i), we compute

VaRα(aL + b) = inf{x ∈ R : P(aL + b > x) ≤ 1 − α}
            = inf{x ∈ R : P(L > (x − b)/a) ≤ 1 − α}
            = inf{ax + b : x ∈ R, P(L > x) ≤ 1 − α}
            = a inf{x ∈ R : P(L > x) ≤ 1 − α} + b = a VaRα(L) + b.

(ii) If L1 ≤ L2, then {x ∈ R : P(L2 > x) ≤ 1 − α} ⊆ {x ∈ R : P(L1 > x) ≤ 1 − α} and the
claim follows. □

Value at Risk is not a coherent risk measure in general. A counterexample can be found in
[17], example 2.25. The reader can construct a continuous counterexample as an exercise.

Theorem 3.9. Expected Shortfall is a coherent risk measure.

Proof. Translation invariance, positive homogeneity and monotonicity follow immediately
by the previous proposition and the definition of Expected Shortfall in terms of the Value
at Risk. Subadditivity is harder to prove. We only prove it in the case where L1 and L2
have continuous distribution functions. The proof of the general case can be found in [17],
see Theorem 8.14. The following proof is from Example 2.26 of the same book. Recall from
earlier computations that for L with a continuous distribution function,

ESα(L) = (1/(1 − α)) E[L 1_{L ≥ VaRα(L)}].

Define Ii := 1{Li ≥VaRα (Li )} for i = 1, 2 and I12 := 1{L1 +L2 ≥VaRα (L1 +L2 )} . We compute

(1 − α)(ESα (L1 ) + ESα (L2 ) − ESα (L1 + L2 )) = E[L1 I1 ] + E[L2 I2 ] − E[(L1 + L2 )I12 ]
= E[(L1 (I1 − I12 ))] + E[(L2 (I2 − I12 ))].

We now consider two cases for L1 . If L1 ≥ VaRα (L1 ), then I1 − I12 ≥ 0 and hence
L1 (I1 − I12 ) ≥ VaRα (L1 )(I1 − I12 ). If L1 < VaRα (L1 ), then I1 − I12 ≤ 0 so again we have

L1 (I1 − I12 ) ≥ VaRα (L1 )(I1 − I12 ). Applying the same reasoning to L2 , we get

(1 − α)(ESα (L1 ) + ESα (L2 ) − ESα (L1 + L2 )) ≥ E[VaRα (L1 )(I1 − I12 ) + VaRα (L2 )(I2 − I12 )]
= VaRα (L1 )E[I1 − I12 ] + VaRα (L2 )E[I2 − I12 ]
= VaRα (L1 )((1 − α) − (1 − α))
+ VaRα (L2 )((1 − α) − (1 − α))
=0

implying that ESα(L1) + ESα(L2) ≥ ESα(L1 + L2), which is the desired statement. □


Computing VaR and ES


We start by considering VaR and ES in some concrete examples. Afterwards, we consider
methods for computing these risk measures directly from data and how we can form confi-
dence intervals.
Example 3.10 (Stock investment). Recall the example on stock investments from earlier.
If S_n denotes the price of the stock at time n, we had the loss variable L_{n+1} = −S_n(e^{X_{n+1}} − 1)
where X_{n+1} = log S_{n+1} − log S_n denotes the log return. We assume X_{n+1} ∼ N(µ, σ²). In
this case, L_{n+1} has a strictly positive density, so we can compute VaRα(L_{n+1}) by solving
the equation P(L_{n+1} > x) = 1 − α:

1 − α = P(−S_n(e^{X_{n+1}} − 1) > x) = P(X_{n+1} < log(1 − x/S_n))
      = P( (X_{n+1} − µ)/σ < (log(1 − x/S_n) − µ)/σ )
      = Φ( (log(1 − x/S_n) − µ)/σ )

with Φ denoting the distribution function of a standard normal variable. Hence

(log(1 − x/S_n) − µ)/σ = Φ^{−1}(1 − α)

and we can solve for x explicitly as follows:

σΦ^{−1}(1 − α) + µ = log(1 − x/S_n)  ⟺  e^{σΦ^{−1}(1−α) + µ} = 1 − x/S_n
                                     ⟺  x = S_n − S_n e^{σΦ^{−1}(1−α) + µ}

If we consider more stocks, the expression becomes a lot more complicated. It gets even
worse with a more diverse portfolio (with stocks, bonds and call options for example).
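The closed-form expression above is easy to check by simulation. The following Python sketch is my own (the parameter values are hypothetical daily figures): it compares the formula x = S_n − S_n e^{σΦ^{−1}(1−α)+µ} with an empirical α-quantile of simulated losses.

```python
import math
import random

random.seed(4)

# Hypothetical daily parameters: S_n = 100, log return X ~ N(mu, sigma^2).
S_n, mu, sigma, alpha = 100.0, 0.0005, 0.02, 0.99

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Standard normal quantile Phi^{-1}(p), by bisection on the erf-based CDF."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Closed-form VaR from Example 3.10
var_exact = S_n - S_n * math.exp(sigma * norm_ppf(1 - alpha) + mu)

# Monte Carlo: simulate losses L = -S_n (e^X - 1) and take the empirical quantile
losses = sorted(-S_n * (math.exp(random.gauss(mu, sigma)) - 1) for _ in range(200000))
var_mc = losses[int(len(losses) * alpha)]

print(f"closed form VaR_0.99 = {var_exact:.3f}, Monte Carlo = {var_mc:.3f}")
```

The two values agree to within Monte Carlo error, confirming the derivation.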

Example 3.11. This is from Examples 2.11 and 2.14 of [17]. Let α ∈ (0, 1) and assume
that the loss L is normally distributed with mean µ and variance σ². Since FL
is continuous and strictly increasing, we have

VaRα (L) = µ + σΦ−1 (α)


12

using the properties of Value at Risk from before. We can now compute the Expected
Shortfall, where we again use that FL is continuous and strictly increasing,
  
ESα(L) = E[L | L ≥ qα(L)] = µ + σ E[ (L − µ)/σ | (L − µ)/σ ≥ qα( (L − µ)/σ ) ]
       = µ + σ E[ (L − µ)/σ | (L − µ)/σ ≥ Φ^{−1}(α) ]
       = µ + (σ/(1 − α)) ∫_{Φ^{−1}(α)}^{∞} x ϕ(x) dx

with ϕ denoting the density of a standard normal variable. Note that


 
ϕ′(x) = (d/dx) ( (1/√(2π)) e^{−x²/2} ) = −x ϕ(x)
so that xϕ(x) has −ϕ(x) as antiderivative. Hence

ESα(L) = µ + (σ/(1 − α)) [−ϕ(x)]_{Φ^{−1}(α)}^{∞} = µ + σ ϕ(Φ^{−1}(α)) / (1 − α).
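These two closed-form expressions are easy to verify numerically. The Python sketch below is my own; the parameters µ = 0, σ = 1, α = 0.95 are hypothetical choices, and the normal quantile is computed by bisection.

```python
import math
import random

random.seed(2)

mu, sigma, alpha = 0.0, 1.0, 0.95

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Standard normal quantile by bisection on the erf-based CDF."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Closed forms from Example 3.11
var_exact = mu + sigma * norm_ppf(alpha)
es_exact = mu + sigma * phi(norm_ppf(alpha)) / (1 - alpha)

# Monte Carlo check: empirical quantile and average loss beyond it
sample = sorted(random.gauss(mu, sigma) for _ in range(200000))
k = int(len(sample) * alpha)
var_mc = sample[k]
tail = sample[k:]
es_mc = sum(tail) / len(tail)

print(f"VaR: exact {var_exact:.3f}, MC {var_mc:.3f}")
print(f"ES:  exact {es_exact:.3f}, MC {es_mc:.3f}")
```

Both Monte Carlo estimates land close to the closed-form values, and ES is visibly larger than VaR, reflecting that it accounts for the severity of the tail.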

We now consider methods for computing VaR and ES directly from data. Up to time tn we
have the empirical observations L1 , ..., Ln . For this discussion, we make the (not so realistic)
assumption that L1 , ..., Ln is an iid sample. Order the sample and let

L1,n ≥ L2,n ≥ ... ≥ Ln,n

denote the corresponding order statistics. We define the empirical distribution function
F_L^{(n)}(x) = (1/n) Σ_{i=1}^{n} 1_{[L_i, ∞)}(x)

that assigns equal weight to each observation. To estimate the risk measures of interest,
an idea is to replace FL (which is unknown) with the empirical distribution function F_L^{(n)}.
This idea is justified by e.g. the Strong Law of Large Numbers, which implies that

F_L^{(n)}(x) → P(L1 ≤ x) a.s. as n → ∞

for each x or the even stronger result known as the Glivenko-Cantelli Theorem which states
that
sup_{x∈R} |F_L^{(n)}(x) − FL(x)| → 0 a.s. as n → ∞.

We can form the natural estimator of the Value at Risk given by


VaR̂α(L) = inf{x ∈ R : 1 − F_L^{(n)}(x) ≤ 1 − α}.

A problem with this approach is that it is often very difficult to infer the tail behaviour of
FL from F_L^{(n)} since we often do not have enough data in the tail of FL. However, one nice
thing about this approach is that we can explicitly solve for the estimator. We have

VaR̂α(L) = L_{[n(1−α)]+1, n} =: q̂α(FL)

i.e. the empirical quantile. Here [n(1 − α)] denotes the largest integer less than or equal to
n(1 − α). Similarly, if FL is continuous and strictly increasing, we may compute an estimate
for ESα(L), namely

ÊSα(L) = ( Σ_{i=1}^{[n(1−α)]+1} L_{i,n} ) / ( [n(1 − α)] + 1 ),

i.e. the average of the observations greater than or equal to VaR̂α(L). Again we stress that
this approximation is often insufficient since we usually do not have many observations in
the tails of FL. We now turn to the problem of computing confidence intervals, where we
focus on the Value at Risk. Let β ∈ (0, 1) denote some "small" value. Our approach is to
find Â and B̂ such that

P(Â < VaRα(L) < B̂) ≥ 1 − β

by determining Â and B̂ in such a way that

P(VaRα(L) ≤ Â) ≤ β/2   and   P(VaRα(L) ≥ B̂) ≤ β/2.
Assume that L has a density so that for the true VaRα (L) we have P (L > VaRα (L)) = 1−α.
For the observed data, each data point can land on either side of VaRα (L). Hence we have
a sequence of Bernoulli trials with probability 1 − α of landing to the right of VaRα (L)
(a success) and probability α of landing to the left of VaRα (L) (a failure). Formally, let
Z = 1{L>VaRα (L)} , then

q := P (Z = 0) = α, p := P (Z = 1) = 1 − α.
Let Y denote the number of successes in n trials, i.e. Y = Σ_{i=1}^{n} Z_i for Z_i = 1_{{L_i > VaRα(L)}},
where L_i is the ith loss. Then Y ∼ Bin(n, p) and

P(Y ≥ j) = Σ_{k=j}^{n} (n choose k) p^k q^{n−k}.

Note that Lj,n ≥ VaRα (L) if and only if Y ≥ j so using the above, we can find the smallest
j such that P (Y ≥ j) ≤ β/2. Then P (Lj,n ≥ VaRα (L)) = P (Y ≥ j) ≤ β/2 and so we
b = Lj,n . Conversely, we can find the largest i such that P (Y ≤ i) ≤ β/2. Then
set A
P (Li,n ≤ VaRα (L)) ≤ β/2 so set B
b = Li,n .

Notes and comments


Chapter 1 of [17] contains a longer informal introduction to the field of quantitative risk
management as well as more historical perspectives. Section 2.3 contains more perspectives
on different types of risk measurement. These include the notional-amount approach and
scenario-based risk measures. [6] is a good reference on the Black-Scholes model and a
very readable introduction to the pricing of financial derivatives.

Exercises
Exercise 1.1:
Consider a loss variable L which is exponential distributed i.e. L ∼ Exp(λ) for λ > 0.
1) Compute VaR_α(L).
2) Compute ES_α(L).

Exercise 1.2:
Consider a loss variable L with distribution function F given by

    F(x) = 0                  if x < 1,
           1 − 1/(1+x)        if 1 ≤ x < 3,
           1 − 1/x²           if x ≥ 3.

1) Compute VaR_0.85(L).
2) Compute VaR_0.95(L) and VaR_0.99(L).
3) Compute ES_0.85(L).
Hint: Draw the graph of F.

Exercise 1.3:
Let L be a loss variable with distribution function

    F(x) = 1 / (1 + e^{−(x−µ)/s})

where µ ∈ R and s > 0 are parameters. This distribution is called the logistic distribution with
location µ and scale s. Let α ∈ (0, 1).
1) Compute VaR_α(L).
2) Compute ES_α(L).

Exercise 1.4:
Consider the risk measure ρ(L) = E[L]. Show/convince yourself that ρ is a coherent risk
measure. Explain why ρ may still be a bad risk measure.

Exercise 1.5:
In this exercise, we will show that the stochastic process given by

    S_t = s · exp((µ − σ²/2)t + σW_t)

is a solution to the stochastic differential equation

dSt = µSt dt + σSt dWt , S0 = s.

We will apply the Itô formula. The Itô formula says that if we have a continuous time
stochastic process X with differential

dXt = µ(t, Xt )dt + σ(t, Xt )dWt ,


and a C² function f, then the process Z_t = f(t, X_t) has stochastic differential given by

    dZ_t = ( ∂f/∂t(t, X_t) + µ(t, X_t) ∂f/∂x(t, X_t) + (1/2) σ(t, X_t)² ∂²f/∂x²(t, X_t) ) dt
           + σ(t, X_t) ∂f/∂x(t, X_t) dW_t.

1) Identify the function f such that S_t = f(t, W_t). Compute ∂f/∂t, ∂f/∂x and ∂²f/∂x².
2) Apply the Itô formula to show that S_t satisfies the stochastic differential equation.

Exercise 1.6:
Consider the standard Black-Scholes model and the derivative that pays X = log S_T at
time T . Determine the arbitrage free price of this derivative at time t < T . Assume the
natural filtration generated by the Brownian motion. Hint: It may be helpful to consult the
subsection in the appendix on stochastic processes.
Week 2 - Methods for computing VaR and extreme value theory

4 Computing Value at Risk


Last week we introduced methods to obtain estimates for the Value at Risk and the Expected
Shortfall from data. We continue this discussion where we focus on the Value at Risk. We
will introduce four methods, namely

(i) The Variance-Covariance (Var-Cov) method,

(ii) Monte Carlo simulation,

(iii) Importance Sampling and

(iv) Bootstrapping.

The motivating example to keep in mind is the one with d investments from last week.
Recall that the loss was given by

    L_{n+1} = − Σ_{i=1}^d α_i S_n^(i) (e^{X_{n+1}^(i)} − 1)

with S_n^(i) the value of the ith asset at time n, X_{n+1}^(i) = log S_{n+1}^(i) − log S_n^(i) the log return and
α_i the number of assets bought of asset i. We now go through the different methods.

The Var-Cov method


Consider the linearized loss

    L^∆_{n+1} = − Σ_{i=1}^d α_i S_n^(i) X_{n+1}^(i)

obtained by the Taylor approximation e^x ≈ 1 + x. In the Variance-Covariance method, we
assume that X_{n+1} ∼ N(m, Σ), i.e. a multivariate normal distribution with mean vector
m ∈ R^d and covariance matrix Σ ∈ R^{d×d}. Letting

    X_{n+1} = (X_{n+1}^(1), ..., X_{n+1}^(d))^T,   w_n = (α_1 S_n^(1), ..., α_d S_n^(d))^T,
we may rewrite L^∆_{n+1} = −⟨w_n, X_{n+1}⟩ = −w_n^T X_{n+1}. By the properties of the multivariate
normal distribution (see the appendix), we have

    L^∆_{n+1} ∼ N(−w_n^T m, w_n^T Σ w_n).

Let µ_n = −w_n^T m and σ_n² = w_n^T Σ w_n, so that we may write L^∆_{n+1} = µ_n + σ_n Z in distribution, for Z ∼ N(0, 1).
From last week, we can compute VaR_α(L^∆_{n+1}) as

    VaR_α(L^∆_{n+1}) = µ_n + σ_n Φ^{−1}(α).

Finally, we can use data to obtain estimates of µ_n and σ_n², i.e. of m and Σ. A virtue of this
method is how simple it is to use. We get an exact analytical expression for VaR_α(L^∆_{n+1}).
The problem is the approximation e^x ≈ 1 + x behind the method. This is not very precise
for large losses, and often we are interested in the tail of the distribution where x is large.
Also, the normality assumption is often problematic with real data.
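The whole method fits in a few lines. Below is our own minimal Python illustration (the function name and the use of NumPy/SciPy are our choices, not part of the course material):

```python
import numpy as np
from scipy.stats import norm

def var_cov_var(alpha_shares, S_n, X_hist, level=0.99):
    """Variance-Covariance VaR for the linearized loss.

    alpha_shares: units held of each asset (alpha_i)
    S_n:          current asset prices S_n^(i)
    X_hist:       (T x d) array of historical log returns, used to estimate m and Sigma
    """
    w = np.asarray(alpha_shares) * np.asarray(S_n)    # weight vector w_n
    m = X_hist.mean(axis=0)                           # estimate of the mean vector m
    Sigma = np.cov(X_hist, rowvar=False)              # estimate of the covariance matrix
    mu_n = -w @ m
    sigma_n = np.sqrt(w @ Sigma @ w)
    return mu_n + sigma_n * norm.ppf(level)           # mu_n + sigma_n * Phi^{-1}(alpha)
```

Note that the only inputs beyond the portfolio are the estimated mean and covariance of the log returns, which is exactly why the method is so simple and why it inherits the weaknesses discussed above.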

Monte Carlo simulation


Monte Carlo simulation is a purely computational method. Suppose we simulate N samples
of Xn+1 and call these x1 , ..., xN . We work with these simulated values as if they were
empirical samples. From these we can form ”empirical” samples of Ln+1 . Call these l1 , ..., lN .
We use these to compute the empirical Value at Risk. Formally, start by ordering the samples
to obtain the order statistics
l1,N ≥ ... ≥ lN,N .
Then we can compute the estimate

    VaR_α(L_{n+1}) ≈ VaR̂_α(L_{n+1}) = l_{[N(1−α)]+1,N}

per the discussion last week. This method is a lot more precise compared to the Variance-
Covariance method since it does not rely on the approximation ex ≈ 1 + x. Another obvious
advantage is flexibility. The method works for any distribution (at least if we can efficiently
simulate large samples from that distribution). One problem is that the rate of convergence
of the estimate will be slow since we are working with the tail of the loss distribution. Hence
we often have to generate a very large number of samples to obtain a robust estimate. This
is especially problematic if we have a large number of assets.
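The Monte Carlo estimator can be sketched as follows (our own Python illustration; the loss model `sim` below is a hypothetical one-asset example, not from the notes):

```python
import numpy as np

def monte_carlo_var(simulate_loss, N, level=0.99, rng=None):
    """Monte Carlo VaR: simulate N losses and take the empirical quantile.

    simulate_loss: function (rng, N) -> array of simulated losses (model-specific)
    """
    rng = rng or np.random.default_rng()
    losses = np.sort(simulate_loss(rng, N))[::-1]     # l_{1,N} >= ... >= l_{N,N}
    return losses[int(N * (1 - level))]               # l_{[N(1-level)]+1, N}

# hypothetical loss model: one asset worth 100 with normal log returns
def sim(rng, N):
    x = rng.normal(0.0, 0.02, size=N)                 # simulated log returns
    return -100.0 * (np.exp(x) - 1.0)                 # loss L = -100(e^X - 1)
```

For this toy model the 99% VaR is known in closed form, which makes it easy to check how many simulations are needed before the empirical quantile stabilises.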

Rare event simulation


Before moving on to importance sampling, we take a brief detour and consider a problem
closely related to the one above for the Monte Carlo method, namely estimating px = P (L >
x) for large values of x. To estimate px , we generate a computational iid sample l1 , ..., lN of
L. We can then form the indicators 1_{l_1>x}, ..., 1_{l_N>x}. By the SLLN,

    (1/N) Σ_{i=1}^N 1_{l_i>x} → E[1_{L>x}] = P(L > x) = p_x  a.s.

Define the natural estimator

    p̂_x^(N) = (1/N) Σ_{i=1}^N 1_{l_i>x}.
As shown above, this estimator is consistent (it converges in probability to the true under-
lying parameter, in this case p_x). A natural question to ask is how fast p̂_x^(N) converges.
If

    S_N = Σ_{i=1}^N 1_{l_i>x},

then the CLT applies to show that

    (S_N − N p_x) / (σ_x √N) →d Z ∼ N(0, 1)

where σ_x denotes the standard deviation of 1_{l_1>x}. Note that N p̂_x^(N) = S_N, so for large N, we get (in
distribution) that

    Z ≈ (S_N − N p_x) / (σ_x √N) = (p̂_x^(N) − p_x) / (σ_x / √N).

We can rewrite this relation and obtain the approximation (in distribution)

    p̂_x^(N) ≈ p_x + (σ_x / √N) Z.

We can form an asymptotic confidence interval for p_x as follows. Let β ∈ (0, 1) be some
"small" value and let z_{β/2} denote the upper β/2-quantile of N(0, 1), i.e. P(Z > z_{β/2}) = β/2.
Thus, with the "high" probability 1 − β, we have

    p_x ∈ ( p̂_x^(N) − (σ_x/√N) z_{β/2} , p̂_x^(N) + (σ_x/√N) z_{β/2} ).

While the error σ_x/√N goes to zero for N → ∞, the probability p_x is often also small, so
it can happen that the error still dominates the estimate p̂_x^(N), even when N is very large.
To make this precise, define the relative error

    RE = z_{β/2} σ_x / (√N p_x) = C(N) σ_x / p_x

with C(N) some constant depending on N. We compute

    σ_x² = Var[1_{L>x}] = E[1_{L>x}²] − E[1_{L>x}]² = p_x − p_x².

We now have for the relative error that

    σ_x / p_x = √(p_x − p_x²) / p_x = √(1/p_x − 1) → ∞ as x → ∞.
So the relative error explodes as x gets large. Hence we need more sophisticated techniques
so that the relative error is bounded. We present one such method very briefly now.
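The exploding relative error is easy to observe numerically. A small sketch of our own, using Exp(1) losses as a hypothetical example (so p_x = e^{−x} is known exactly):

```python
import numpy as np

def naive_mc_relative_error(x, N, n_rep=200, rng=None):
    """Observed relative error of the naive Monte Carlo estimator of
    p_x = P(L > x) for L ~ Exp(1), based on n_rep independent repetitions."""
    rng = rng or np.random.default_rng(1)
    estimates = np.empty(n_rep)
    for r in range(n_rep):
        l = rng.exponential(1.0, size=N)
        estimates[r] = np.mean(l > x)          # p_hat_x^(N)
    p_true = np.exp(-x)                        # P(L > x) for Exp(1)
    return estimates.std() / p_true            # sigma_x / (sqrt(N) p_x), empirically
```

For fixed N, moving the threshold deeper into the tail makes the relative error grow roughly like 1/√(N p_x), in line with the computation above.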

Importance sampling
To remedy the issue of diverging relative errors, we briefly introduce the main ideas of im-
portance sampling. Consider the setup L = f (X) where X is an Rd -valued random variable

with distribution function F (see the appendix for a brief discussion on multidimensional dis-
tribution functions) and f is a deterministic function. We refer to F as the true distribution
of X. Define the moment-generating function (mgf) of X as
h i
κ(ξ) = E ehξ,Xi for ξ ∈ Rd .

Consider the shifted distribution given by

ehξ,Xi
dFξ (x1 , ..., xd ) = dF (x1 , ..., xd )
κ(ξ)

for all ξ ∈ Rd such that the moment-generating function is finite. We note the following
properties of Fξ :

(i) Fξ is a probability distribution.

(ii) Eξ [X] = ∇Λ(ξ) where Λ(ξ) = log κ(ξ) is the cumulant-generating function of X and
Eξ indicates the expectation taken with respect to the shifted measure.

Property (i) follows easily by the calculation

    ∫ dF_ξ(x) = (1/κ(ξ)) ∫ e^{⟨ξ,x⟩} dF(x) = κ(ξ)/κ(ξ) = 1.

The idea is now to choose ξ in a good way such that L > x occurs frequently, i.e. so that
P_ξ(L > x) is large, where P_ξ denotes the probability under the shifted distribution. To relate
simulations under the shifted distribution with parameter ξ to the original probability, we
apply a representation formula,

    p_x = P(L > x) = ∫_{y: f(y)>x} dF(y) = ∫_{y: f(y)>x} (dF/dF_ξ)(y) dF_ξ(y)
        = E_ξ[ 1_{f(X)>x} (dF/dF_ξ)(X) ]

with dF/dF_ξ the Radon-Nikodym derivative.
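To make the representation formula concrete, here is a minimal one-dimensional sketch of our own: for L ~ N(0, 1), exponential tilting by ξ gives the N(ξ, 1) distribution with likelihood ratio dF/dF_ξ(y) = κ(ξ)e^{−ξy} = e^{ξ²/2 − ξy}, and choosing ξ = x centres the shifted distribution at the threshold (this particular choice of tilt is our illustration, not a prescription from the notes):

```python
import numpy as np

def is_tail_prob(x, N=100_000, rng=None):
    """Importance-sampling estimate of p_x = P(L > x) for L ~ N(0, 1)."""
    rng = rng or np.random.default_rng(2)
    xi = x                                     # tilt so that E_xi[L] = x
    y = rng.normal(xi, 1.0, size=N)            # samples from the shifted distribution F_xi
    lr = np.exp(xi**2 / 2 - xi * y)            # Radon-Nikodym derivative dF/dF_xi
    return np.mean((y > x) * lr)               # E_xi[1_{L > x} dF/dF_xi]
```

Under the shifted distribution roughly half the samples land in the event {L > x}, so the relative error stays bounded where the naive estimator's explodes.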

Bootstrapping
The idea of bootstrapping is to use the existing data to generate new data by resampling.
We sample with replacement and if we have N data points x1 , ..., xN , we choose a member
with probability 1/N N times to get a new data set of the same size. We can do this
procedure b times to obtain the bootstrap samples
    x_1^(1), ..., x_N^(1)
    x_1^(2), ..., x_N^(2)
      ...
    x_1^(b), ..., x_N^(b).

If we have some parameter θ that we want to estimate, we can compute an empirical estimate
θ̂ from the original data and estimates θ̂_j from each of the b bootstrap samples. We can
then use the θ̂_j to say something about the distribution of θ̂. We may for example generate
confidence intervals by computing empirical quantiles using the θ̂_j. Explicitly, define the
residuals R_j = θ̂ − θ̂_j and order them from largest to smallest, R_{1,b} ≥ R_{2,b} ≥ ... ≥ R_{b,b}. The
confidence bounds are then given by

    [ θ̂ + R_{[b(1−β/2)],b} , θ̂ + R_{[b(β/2)]+1,b} ].

These bounds improve classical estimates based on the CLT which converge more slowly.
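The residual-based bounds above can be sketched as follows (our own Python illustration; names and defaults are our choices):

```python
import numpy as np

def bootstrap_ci(data, estimator, b=2000, beta=0.05, rng=None):
    """Bootstrap confidence interval for a parameter via the residual method."""
    rng = rng or np.random.default_rng(3)
    data = np.asarray(data)
    n = len(data)
    theta_hat = estimator(data)
    # b bootstrap estimates from resampling with replacement
    theta_boot = np.array([estimator(data[rng.integers(0, n, size=n)])
                           for _ in range(b)])
    # residuals R_j = theta_hat - theta_hat_j, ordered from largest to smallest
    r = np.sort(theta_hat - theta_boot)[::-1]
    lower = theta_hat + r[int(b * (1 - beta / 2)) - 1]   # theta_hat + R_{[b(1-beta/2)], b}
    upper = theta_hat + r[int(b * beta / 2)]             # theta_hat + R_{[b(beta/2)]+1, b}
    return lower, upper
```

The estimator is an arbitrary function of the data, which is the main appeal of the bootstrap: no distributional assumption enters the interval construction.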

For small probabilities, one considers a modification of this idea, namely smoothed bootstrap.
In smoothed bootstrapping, one smoothes the data around the tail values. Namely, given
the original sample x1 , ..., xN , sample from the density

    g(x) = (1/N) Σ_{i=1}^N (1/h) K((x − x_i)/h)

where K is a smooth function, e.g. the Gaussian kernel

    K(t) = (1/√(2π)) e^{−t²/2}.

One can then employ importance sampling to shift the smoothed distribution so that more
samples will be in the tail.
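Sampling from the kernel density g has a simple two-step form: pick a data point uniformly, then add kernel noise. A minimal sketch of our own, assuming the Gaussian kernel:

```python
import numpy as np

def smoothed_bootstrap_sample(data, size, h=0.1, rng=None):
    """Draw `size` values from the kernel density g built on `data`:
    resample a data point with replacement, then add N(0, h^2) noise."""
    rng = rng or np.random.default_rng(4)
    data = np.asarray(data)
    idx = rng.integers(0, len(data), size=size)     # uniform resampling with replacement
    return data[idx] + h * rng.normal(size=size)    # smooth around each chosen point
```

Unlike the plain bootstrap, the smoothed version can produce values beyond the observed maximum, which is precisely what is needed when the interest is in the tail.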

5 Extreme value theory


We now move on to the other topic for this week, namely extreme value theory. We start
by briefly discussing a method for assessing a distributional assumption on the underlying
data, namely the QQ plot. Afterwards, we briefly introduce distributions with regularly
varying tails and we discuss methods of doing statistics with such distributions.

Data analysis
Let x1 , ..., xn be observed data. If we want to use this data to make predictions about future
values, it is natural to propose a distribution F and assume that x1 , ..., xn are realizations
of iid variables X1 , ..., Xn with distribution F . The QQ plot (quantile-quantile plot) can be
used to determine whether F is a reasonable choice of distribution. The QQ plot consists
of the points

    { ( F^←((n−k+1)/(n+1)), x_{k,n} ) : k = 1, ..., n }

where as usual, x_{1,n} ≥ ... ≥ x_{n,n} denotes the order statistics of the sample. Recall that

    F_n^←((n−k+1)/(n+1)) = x_{k,n}
where F_n is the empirical distribution function. If F is the true underlying distribution
function for the data, F_n → F a.s., so that

    x_{k,n} = F_n^←((n−k+1)/(n+1)) ≈ F^←((n−k+1)/(n+1)).

Hence we expect the QQ plot to be a straight line through 0 and with slope 1 if the data
truly comes from the reference distribution. The plot tells a bit more however. If the true
underlying distribution is an affine transformation of F , the plot will still be a straight line.
If that is the case, we can estimate proper location and scale parameters. The plot also
indicates whether the reference distribution has lighter or heavier tails than the empirical
sample. See the plots below.

Figure 1: Four examples of QQ-plots. We have 500 simulated values from the following
distributions: Standard exponential (upper left), folded/truncated normal (i.e. |X| for
X ∼ N (0, 1)) (upper right), standard lognormal (i.e. log X ∼ N (0, 1), lower left) and
the Pareto distribution with κ = 1 and α = 2 (lower right). The reference distribution is
standard exponential. The green line has slope one and intercept zero.

Consider the plots for a moment. The plot in the upper left corner shows that the data follows
a straight line with slope one and intercept zero, which is expected, since the reference
distribution and empirical distribution are the same. For the truncated normal, we see that
the data curves downwards, indicating lighter tails than the exponential distribution. For
the lognormal and Pareto distributions, the data curves upwards, indicating heavier tails
than the exponential distribution.

Standard distributions in statistics include the normal, exponential and gamma distribu-
tions. These all have in common that they are light-tailed in the sense that the mgf of
these variables exists in some neighbourhood around zero. This is not typical behaviour for
log returns in a financial setting; their distributions are usually a lot more heavy-tailed.
Heavy-tailed distributions include the regularly varying distributions such as the Pareto,
along with moderately heavy-tailed distributions such as the lognormal. In these cases, the
mgf does not exist. We will now scratch the surface of the theory of regularly varying
distributions.

The regularly varying class


Definition 5.1. A measurable function h : (0, ∞) → (0, ∞) is called regularly varying if
there is a ρ ∈ R such that for all t > 0,

    lim_{x→∞} h(tx)/h(x) = t^ρ.

In this case, we write h ∈ RV_ρ. If h ∈ RV_0, we call h slowly varying.

Example 5.2. Constant functions and log are examples of slowly varying functions. ◦

We will often use the following characterisation of regular variation.

Proposition 5.3. h ∈ RVρ if and only if h(x) = L(x)xρ for L a slowly varying function.

Proof. Assume h ∈ RV_ρ. Define the function L(x) = h(x)x^{−ρ}. Then for t > 0, we have

    lim_{x→∞} L(tx)/L(x) = lim_{x→∞} h(tx)(tx)^{−ρ} / (h(x)x^{−ρ}) = lim_{x→∞} (h(tx)/h(x)) t^{−ρ} = t^ρ t^{−ρ} = 1

so L is slowly varying and satisfies h(x) = L(x)x^ρ. The converse implication is also easy
and is left to the reader. □

Definition 5.4. A distribution function F is regularly varying if F̄ = 1 − F ∈ RV_{−α} for
some α > 0. α is called the index of F.

The definition says that F is regularly varying of index α > 0 if for all t > 0, it holds that

    P(X > tx) / P(X > x) → t^{−α} as x → ∞

where X has distribution function F. Equivalently by the above proposition, P(X > x) =
L(x)x^{−α} for a slowly varying function L. Let us consider a very important example of a
regularly varying distribution.

Definition 5.5. The Generalised Pareto Distribution (GPD) has distribution function

    G_{γ,β}(x) = 1 − (1 + γx/β)^{−1/γ}, x > 0

where γ > 0 and β > 0 are parameters.



Lemma 5.6. The Generalised Pareto Distribution with parameters β, γ > 0 is regularly
varying with index 1/γ.

Proof. Simply note that

    Ḡ_{γ,β}(x) = (1 + γx/β)^{−1/γ}

so for t > 0, we have

    lim_{x→∞} Ḡ_{γ,β}(tx)/Ḡ_{γ,β}(x) = lim_{x→∞} ( (1 + γtx/β)/(1 + γx/β) )^{−1/γ} = lim_{x→∞} ( (β/(γx) + t)/(β/(γx) + 1) )^{−1/γ} = t^{−1/γ}. □

From a statistical perspective, it is of interest to determine the tail parameter α if the
underlying distribution of the data is assumed to have a regularly varying distribution. We
present two such methods. In the following, we assume F̄ ∈ RV_{−α} so that F̄(x) = L(x)x^{−α},
and the goal is to estimate α.

The Hill estimator


Deriving the Hill estimator is based on the following result by Karamata.

Theorem 5.7 (Karamata's Theorem). If L is slowly varying and β < −1, then

    ∫_u^∞ x^β L(x) dx ∼ −(1/(β+1)) u^{β+1} L(u), u → ∞.

In words, ∫_u^∞ x^β L(x) dx behaves asymptotically like the integral of a power function. The
slowly varying function plays little role asymptotically. We can now derive the Hill estimator.
Let β = −α − 1 where α > 0. Karamata's Theorem implies

    ∫_u^∞ x^{−α−1} L(x) dx ∼ (1/α) u^{−α} L(u), u → ∞

so for F̄(x) = L(x)x^{−α}, we have

    ∫_u^∞ x^{−1} F̄(x) dx ∼ (1/α) F̄(u), u → ∞.

We rewrite the left hand side. First note that (log x − log u)' = 1/x. We can now apply
integration by parts (see the appendix for a review) and obtain

    ∫_u^∞ x^{−1} F̄(x) dx = [(log x − log u)F̄(x)]_u^∞ − ∫_u^∞ (log x − log u) dF̄(x).

F̄(x) decays like x^{−α} and hence decays to zero faster than log x grows to ∞, and thus the
first term is zero. By using that dF̄(x) = d(1 − F(x)) = −dF(x), we get

    ∫_u^∞ x^{−1} F̄(x) dx = ∫_u^∞ (log x − log u) dF(x)
and so

    (1/F̄(u)) ∫_u^∞ (log x − log u) dF(x) → 1/α, u → ∞.

This is a theoretical result. To turn this into an estimator, we have to use the empirical
distribution function. Suppose we have iid data x_1, ..., x_n distributed according to F. Order
the samples, x_{1,n} ≥ ... ≥ x_{n,n}. Let F_n denote the empirical distribution function. We can
then approximate F by F_n. Replacing F by F_n in the above expression yields

    1/α ≈ (1/F̄_n(u)) ∫_u^∞ (log x − log u) dF_n(x)

for sufficiently large u. We want to simplify this expression. Let N_u denote the number of
observations greater than u, i.e.

    N_u = #{i : x_i > u},

then F̄_n(u) = N_u/n. Also recall that F̄_n(x_{k,n}) = (k − 1)/n. Now choose a "small" k and
set u = x_{k,n}. Then

    1/α ≈ (n/(k−1)) ∫_u^∞ (log y − log x_{k,n}) dF_n(y) = (n/(k−1)) Σ_{j=1}^k (1/n)(log x_{j,n} − log x_{k,n})
        = (1/(k−1)) Σ_{j=1}^k (log x_{j,n} − log x_{k,n})

since each jump of F_n is of size 1/n and the jth jump of F_n occurs at x_{j,n} (see the appendix).
Replacing k − 1 by k gives us the Hill estimator.

Definition 5.8. Let x_1, ..., x_n be a sample from a regularly varying distribution with index
α. Let x_{1,n} ≥ ... ≥ x_{n,n} denote the order statistics. We call

    α̂_k = ( (1/k) Σ_{j=1}^k (log x_{j,n} − log x_{k,n}) )^{−1}

the Hill estimator of α.

Remark 5.9. Note that α̂_k depends on k, i.e. the threshold.
It is natural to ask how we choose a good value of k. Choosing k small, we get few data points
which increases the variance of the estimator, so the estimate is not sufficiently robust. On
the other hand, choosing k large makes the approximation based on Karamata’s Theorem
imprecise. Furthermore, it often happens with real data that the center and the tail have
very different distributions, so taking too many data points close to the center makes the
estimator biased. One often chooses k based on a Hill plot. A Hill plot consists of the value
pairs

    {(k, α̂_k) : k = 2, ..., n}
and the value of k is chosen in a region where the estimator looks stable. We stress that
this does not always occur for real data.

Example 5.10. A very classical data set in extreme value theory is the Danish fire insurance
data. The data consists of large Danish fire insurance claims from 1980 to 1990. The data
is available in the R package evir which has functions to compute estimates and make plots
related to extreme value theory. We present some plots of the data below.

Figure 2: Exploratory plots of the Danish fire insurance data.

Below is a Hill plot of the data:

Figure 3: A Hill plot of the Danish fire insurance data.

The plot looks stable from around k = 300. We have α̂_300 = 1.4357, and this is one of many

estimates we can choose to report. There is no single correct answer for the choice of k and
very often, real life data is much more ugly than this particular data set. This illustrates the
importance of applying different tools in extreme value theory before drawing a conclusion.

Using the Hill estimator, we can compute the Value at Risk at level β, VaR_β (the letter α
is already used). We want to solve for x in the equation P(X > x) = 1 − β. Choose k, x_{k,n}
by looking at the Hill plot. Since X is regularly varying,

    F̄(x) = F̄( (x/x_{k,n}) x_{k,n} ) ∼ (x/x_{k,n})^{−α} F̄(x_{k,n}) as x → ∞.

Now replace F̄ by the empirical tail F̄_n. Since F̄_n(x_{k,n}) = (k − 1)/n, we obtain

    F̄(x) ≈ ((k−1)/n) (x/x_{k,n})^{−α̂_k}.

Now solve for x and set VaR_β(X) = x.
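Both the Hill estimator and this tail-based VaR fit in a few lines. A minimal sketch of our own in Python (the notes' own implementations are in R via evir; names here are ours):

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimator of the tail index alpha based on the k largest observations."""
    xs = np.sort(x)[::-1]                            # x_{1,n} >= ... >= x_{n,n}
    return 1.0 / np.mean(np.log(xs[:k]) - np.log(xs[k - 1]))

def hill_var(x, k, beta):
    """VaR at level beta via the Hill tail approximation
    F_bar(x) ~ ((k-1)/n) * (x / x_{k,n})^(-alpha_hat_k), solved for F_bar(x) = 1 - beta."""
    xs = np.sort(x)[::-1]
    n = len(xs)
    alpha_hat = hill_estimator(x, k)
    return xs[k - 1] * ((k - 1) / (n * (1 - beta))) ** (1.0 / alpha_hat)
```

Solving ((k−1)/n)(x/x_{k,n})^{−α̂_k} = 1 − β for x gives exactly the closed form in `hill_var`.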

The Peaks over threshold (POT) method


Real data often has two "components", namely a center component and a tail component.
Often the tail of the data is described by a different distribution than the values close to
the center. The following definition captures the idea of considering the tail component.
Definition 5.11. Given a distribution function F and a positive threshold u, we define the
excess distribution function F_u via the tail

    F̄_u(x) = P(X > u + x | X > u) = F̄(u + x)/F̄(u), x ≥ 0.

Rearranging, the above definition can be expressed as F̄(u + x) = F̄(u)F̄_u(x) for x ≥ 0. If
we have a sample x_1, ..., x_n and N_u = #{i : x_i > u}, then F̄(u) ≈ N_u/n. To estimate F̄_u(x),
we apply the generalized Pareto distribution. By the assumption of regular variation, we
have

    F̄(tu)/F̄(u) → t^{−α} for u → ∞.

Hence for large u,

    F̄_u(x) = F̄((1 + x/u)u)/F̄(u) ≈ (1 + x/u)^{−α}.

There is some cheating involved here. t = 1 + x/u depends on u, but since t → 1 as u → ∞,
it is not really an issue. Recall that the GPD with parameters β, γ > 0 has tail

    Ḡ_{γ,β}(x) = (1 + γx/β)^{−1/γ} = (1 + x/u)^{−α} for x ≥ 0

where γ = 1/α and β = β(u) = u/α (while β depends on u, one chooses a fixed u so that β
is also fixed). We can now describe the POT method in two steps:
(i) Estimate F̄(u) ≈ N_u/n.

(ii) Approximate F̄_u(x) by a GPD with parameters γ, β > 0.


There are two things we need to elaborate on concerning step (ii). First of all, we need to
choose a threshold u. Second, we need methods for estimating β and γ. Let us first address
the second issue. If we have a sample x_1, ..., x_n, we start by discarding all x_i ≤ u. From the
remaining N_u observations x_i > u we form the excesses z_1, ..., z_{N_u}, where z_i = x_i − u > 0.
We then estimate β and γ via maximum likelihood based on this subsample. The likelihood function is

    L(γ, β; z_1, ..., z_{N_u}) = Π_{i=1}^{N_u} g_{γ,β}(z_i) for g_{γ,β}(x) = (d/dx) G_{γ,β}(x).

We can be specific and compute

    g_{γ,β}(x) = (1/β)(1 + γx/β)^{−1/γ−1}, x > 0.
We can then consider the log-likelihood

    l(γ, β; z_1, ..., z_{N_u}) = log L(γ, β; z_1, ..., z_{N_u}) = Σ_{i=1}^{N_u} log g_{γ,β}(z_i)
        = Σ_{i=1}^{N_u} ( −log β − ((γ+1)/γ) log(1 + γz_i/β) )
        = −N_u log β − ((γ+1)/γ) Σ_{i=1}^{N_u} log(1 + γz_i/β)

and maximising this expression numerically in terms of β and γ yields the maximum likelihood
estimators β̂_n and γ̂_n. It is possible to construct asymptotic confidence intervals by using
the result (valid for γ > −1/2)

    √N_u ( γ̂_n − γ, β̂_n/β − 1 ) →d N(0, M^{−1}) for N_u → ∞

where

    M^{−1} = (1 + γ) ( 1+γ  −1
                        −1   2 ).
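The numerical maximisation can be sketched as follows (our own Python illustration of the log-likelihood above; the optimiser choice, starting values and parameter guards are our assumptions, not from the notes):

```python
import numpy as np
from scipy.optimize import minimize

def fit_gpd(z):
    """Fit the GPD to positive excesses z by maximising
    l(gamma, beta) = -N_u log(beta) - ((gamma+1)/gamma) sum(log(1 + gamma z_i / beta))."""
    z = np.asarray(z)
    n = len(z)

    def neg_loglik(params):
        gamma, beta = params
        if gamma <= 0 or beta <= 0:            # restrict to the heavy-tailed case gamma > 0
            return np.inf
        return n * np.log(beta) + (gamma + 1) / gamma * np.sum(np.log1p(gamma * z / beta))

    res = minimize(neg_loglik, x0=[0.5, np.mean(z)], method="Nelder-Mead")
    gamma_hat, beta_hat = res.x
    return gamma_hat, beta_hat
```

In practice one would of course use a tested implementation such as `gpd` in the evir package, as in Example 5.14 below; the sketch only shows that nothing more than a generic optimiser is required.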
We now return to the first problem of determining a good value of u. As usual there is a
tradeoff between getting sufficiently many datapoints and choosing a value large enough so
that the asymptotics ”kick in”. The main tool for this job is the mean-excess function.
Definition 5.12. Let X be an integrable random variable with distribution F . The mean-
excess function of X is defined as

e(u) = E[X − u | X > u].

In the following we assume that the tail parameter α satisfies α > 1. We compute

    e(u) = ∫_0^∞ (x − u) dP(X ≤ x | X > u) = (1/F̄(u)) ∫_u^∞ (x − u) dF(x)
and using integration by parts, we get

    ∫_u^∞ (x − u) dF(x) = −∫_u^∞ (x − u) dF̄(x) = [−(x − u)F̄(x)]_u^∞ + ∫_u^∞ F̄(x) dx = ∫_u^∞ F̄(x) dx

since F̄(x) decays to zero faster than 1/x by the assumption α > 1. Using Karamata's
Theorem, Theorem 5.7, we get

    ∫_u^∞ F̄(x) dx = ∫_u^∞ L(x)x^{−α} dx ∼ −(L(u)u^{−α+1})/(−α + 1) for u → ∞

so

    e(u) ∼ (1/(L(u)u^{−α})) · (L(u)u^{−α+1})/(α − 1) = u/(α − 1) as u → ∞

which shows that e(u) becomes linear asymptotically. This is a crucial observation and
hence we state the above result as a proposition.

Proposition 5.13. If F is regularly varying with index α > 1, the mean-excess function
e(u) satisfies

    e(u) ∼ u/(α − 1) as u → ∞.

To see how the mean-excess function helps in determining a suitable u, we consider the em-
pirical mean-excess function. The idea is to replace F and F̄ by their empirical counterparts.
The empirical mean-excess function is given by

    e_n(u) = (1/(N_u/n)) ∫_u^∞ (x − u) dF_n(x) = (n/N_u) Σ_{j=1}^n (1/n)(x_{j,n} − u) 1_{x_{j,n}>u}
           = (1/N_u) Σ_{j=1}^n (x_{j,n} − u) 1_{x_{j,n}>u}.

If we set u = x_{k,n} for some k = 2, 3, ..., n (k = 1 is excluded since e_n(u) = 0 in this case),
we can simplify the above expression to

    e_n(x_{k,n}) = (1/(k−1)) Σ_{j=1}^{k−1} (x_{j,n} − x_{k,n}).
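The empirical mean-excess function is straightforward to compute. A minimal sketch of our own in Python, using the descending order-statistic convention of these notes:

```python
import numpy as np

def mean_excess_plot_points(x):
    """Points (x_{k,n}, e_n(x_{k,n})) for k = 2, ..., n of the empirical
    mean-excess function."""
    xs = np.sort(x)[::-1]                    # x_{1,n} >= ... >= x_{n,n}
    n = len(xs)
    pts = []
    for k in range(2, n + 1):
        u = xs[k - 1]                        # threshold u = x_{k,n}
        e = np.mean(xs[:k - 1] - u)          # average excess over the k-1 larger values
        pts.append((u, e))
    return pts
```

Plotting these pairs gives exactly the mean-excess plot described next.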

Using the empirical mean-excess function we can construct a mean-excess plot by plotting
the values

    {(x_{k,n}, e_n(x_{k,n})) : k = 2, 3, ..., n}.

If the values x1 , ..., xn come from a regularly varying distribution, the plot will roughly look
like a straight line for large thresholds. For distributions with lighter tails, the mean-excess
function will either decrease or remain roughly constant. It is left to the reader to compute
some examples of mean-excess functions in the exercises. Some examples of mean-excess
plots are below.

Figure 4: Examples of mean-excess plots based on 500 simulated values from the following
distributions: Standard exponential (upper left), standard normal (upper right), standard
lognormal (lower left) and the Pareto distribution with κ = 1 and α = 2 (lower right).

The plots should serve not just as an example but also as a warning. The behaviour of the
large values in the plots is very chaotic, and one should be cautious in the interpretation
of such plots. To illustrate this further, the following plots are made with the exact same
distributions but with a different simulated sample.

Figure 5: More mean-excess plots with the same distributions as before, but with a new
sample.

It often occurs in practice that the tail and center of the data have different distributions.

For example, the tail could have a regularly varying distribution while the center has a light-
tailed distribution such as a normal or gamma distribution. In that case, the mean-excess
plot will probably only become linear for large values of u. In any case, one should use the
plot to find a value of u where the points begin to form a straight line. We illustrate this
with an example.

Example 5.14. Let us again consider the Danish fire insurance data. We want to model
the excesses using the POT method. We wish to determine a proper threshold and therefore
make a mean-excess plot:

Figure 6: Mean-excess plot of the Danish fire insurance data.

The plot looks very linear for all thresholds, indicating Pareto tails. A word of warning:
This is typically not the case for real data! The Danish fire insurance data is in a sense
too nice to illustrate this point. Based on the plot, we choose the threshold u = 4. While
the linear trend may begin for slightly larger values, this choice also gives us sufficient data
to work with. To fit the generalized Pareto distribution, we apply maximum likelihood
estimation using the gpd function in the evir package as follows:

data(danish)
u <- 4
gpdfit <- gpd(danish, threshold = u, method = "ml")
gpdfit$par.ests

This gives the estimates γ̂ = 0.7209 and β̂ = 2.6291. We now plot the empirical tail and
the tail from the POT approximation:

Figure 7: Blue: The tail from the fitted generalized Pareto distribution. Red: The empirical
tail from the data.

The approximation looks very good. Usually, one is not so lucky. If one is interested in
a statistical goodness of fit, a possible way (which is also implemented in evir if one uses
plot around a fitted gpd object) is to consider the excesses z_i = x_i − u, which should
be approximately G_{γ̂,β̂} distributed. Hence −log Ḡ_{γ̂,β̂}(z_i) (which we call the generalised
residuals) should be approximately standard exponential distributed. We make a residual
plot and a QQ-plot of the −log Ḡ_{γ̂,β̂}(z_i) against a standard exponential distribution and
get the following:

Figure 8: Diagnostic plots for the fitted GPD.



We see no clear tendencies of the generalised residuals. Some of them are quite large, but
otherwise the left plot looks good. The right plot also looks good. While some of the sample
quantiles are below the line with slope one and intercept zero, the residuals overall seem to
follow a standard exponential distribution. We conclude that the model is an adequate fit.

Notes and comments


The idea of bootstrapping was proposed in the famous paper by B. Efron, [8]. For more
information on importance sampling and Monte Carlo methods, consult [12] and [2]. [5]
contains all the information one could wish for concerning regular variation, including a
proof of Karamata’s Theorem. [10] is an excellent source for extreme value theory. Chapter
3.4 covers the GPD and chapter 6 covers statistical methods, including the Hill estimator
and the POT method.

Exercises
Exercise 2.1:
For the shifted distribution

    dF_ξ(x_1, ..., x_d) = (e^{⟨ξ,x⟩} / κ(ξ)) dF(x_1, ..., x_d)

as presented in the subsection on importance sampling above, prove that

Eξ [X] = ∇Λ(ξ).

Exercise 2.2:
In this exercise, we will get more comfortable with the concepts of regular and slow variation.
1) Prove the remaining implication in Proposition 5.3.
2) Verify that the function h(x) = log log x for x sufficiently large is slowly varying.
3) Show that the Pareto distribution with parameters κ > 0 and α > 0 is regularly varying.
What is the index? Recall that the distribution function is given by

    1 − (κ/(κ+x))^α, x > 0.

Exercise 2.3:
Consider the Student t distribution with ν > 0 degrees of freedom, i.e. the distribution with
density

    g_ν(x) = ( Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) ) (1 + x²/ν)^{−(ν+1)/2}.

Show that this distribution is regularly varying and determine the corresponding index.

Exercise 2.4:
Consider a regularly varying distribution function F supported on [0, ∞) of index α > 0.
Let X ∼ F. Prove that E[X^β] < ∞ if β < α. Hint: Apply Karamata's Theorem. Recall
also the formula E[X] = ∫_0^∞ P(X > t) dt for a positive random variable X.

One can use a different form of the Karamata theorem to show that E[X^β] = ∞ when
β > α.

Exercise 2.5:
Karamata's Representation Theorem says that if L is slowly varying, then we can write

    L(x) = c_0(x) · exp( ∫_{x_0}^x (ε(t)/t) dt )

where c_0(x) → c_0 > 0 for x → ∞ and ε(x) → 0 for x → ∞, for some x_0 ≥ 0. If L is written
in this form, we call the above a Karamata representation for L.
1) Find a Karamata representation for log.
2) Prove that a function of the form above is slowly varying.

Exercise 2.6:
Consider the distribution F with tail F̄(x) = x^{−2} for x > 1. Then F̄ ∈ RV_{−2}.
1) Implement the Hill estimator (in R for example) and a function that can simulate values
from F.
2) Simulate 50 values from F and make a Hill plot using your functions from before. Also plot
the true value of the index as a line in the plot.
3) Repeat for 100, 250 and 1000 values. Comment on the results.

Exercise 2.7:
Recall that we proved the following formula for the mean-excess function of an integrable
random variable X:

    e(u) = (1/F̄(u)) ∫_u^∞ F̄(x) dx.

1) Let X ∼ Exp(λ). Prove that e(u) = 1/λ.
2) Let X be Pareto distributed with parameters κ > 0 and α > 1. Prove that

    e(u) = (κ + u)/(α − 1).

3) Relate the results to the mean-excess plots in the discussion above.

Exercise 2.8:
The goal of this exercise is to prove that if X ≥ 0 is a random variable with distribution
function F and

    lim_{x→∞} F̄(x − y)/F̄(x) = e^{γy}, y > 0

for some γ ∈ (0, ∞), then

    e(u) → γ^{−1} for u → ∞.

1) Prove that F̄ ∘ log ∈ RV_{−γ}.
2) Prove the above result. Hint: Karamata's Theorem.
3) The above result also holds for γ ∈ {0, ∞}, and you can use this without proof. Prove
that if X is standard normal, then e(u) → 0. Relate this to the plots in the discussion
above.
Week 3 - Spherical and elliptical distributions

Multivariate random vectors: dependence

Consider again the "canonical example" of stock returns where the risk factors are the log
returns X_{n+1}^(i) with X_{n+1}^(i) = log S_{n+1}^(i) − log S_n^(i). What is the distribution of X_{n+1}? Inspired
by the Black-Scholes model, we could assume X_{n+1} ∼ N(µ, Σ). There are several problems
with the normal assumption, however. A typical problem is that assets are very often
correlated in such a way that high (low) returns for one asset correlate with high (low)
returns for another. The normal distribution is very light-tailed, so such correlations are
often not captured. The plots below illustrate this issue.

Figure 9: Log returns of foreign exchange rates quotes against the US dollar.


Figure 10: Simulated foreign exchange rates using a bivariate normal distribution with
estimated means and covariance matrix.

From the plots, it is evident that the normal distribution fails to capture the dependency
in the tails. Furthermore, the probability mass is too concentrated around the mean. The
dependency in tails is a very typical phenomenon in financial data as illustrated in the plot
below.

Figure 11: Log returns from BMW and Siemens stocks.

To remedy the issues with the normal distribution, we introduce a class of distributions that
in some way resembles the normal distribution and shares a lot of its properties while also

being more flexible in terms of modelling tail behaviour. This is the class of spherical and
elliptical distributions.

6 Spherical and elliptical distributions


To motivate the spherical and elliptical distributions, we first briefly consider the multivariate normal distribution. If X ∼ N(0, I_d) (I_d denotes the d-dimensional identity matrix), then X has density (see the appendix)

f(x) = (2π)^{-d/2} e^{-(1/2) ∑_{i=1}^d x_i^2} = (2π)^{-d/2} e^{-r^2/2}

with r^2 = x_1^2 + · · · + x_d^2. Hence the density only depends on ‖x‖ = r. Graphically, the level sets of f are spheres (or circles in two dimensions). One can say that N(0, I_d) is spherically symmetric/rotationally invariant. Define the random variable R by R^2 = X_1^2 + · · · + X_d^2, then R^2 ∼ χ^2(d), i.e. R^2 is Chi-square distributed with d degrees of freedom. We call R the radial component of X. Intuitively, we can decompose X as X =ᵈ RS with S uniformly distributed on the unit sphere S^{d-1} = {x ∈ R^d : ∑_{i=1}^d x_i^2 = 1}. While this is an informal approach, it gives us the idea on how to proceed formally.

Definition 6.1. For a d-dimensional random vector X, we define the characteristic function of X as

Φ_X(t) = E[e^{i t^T X}], t ∈ R^d.

Remark 6.2. Note the similarity to the moment-generating function κ_X(t) = E[e^{t^T X}]. These two transforms satisfy similar properties. For example, two random variables have the same distribution if and only if their characteristic functions are equal. See the appendix for more background on these functions. The characteristic function has the advantage that it always exists (since the integrand is bounded by one in norm), but it provides less information about the tail behaviour than the moment-generating function.

Example 6.3. If X ∼ N(0, I_d), simple calculations yield

Φ_X(t) = e^{-(1/2) t^T t}.

Note that we can write Φ_X(t) = ψ(‖t‖^2) for the function ψ : R → R given by ψ(t) = e^{-t/2}. This is a formal way of stating that Φ_X doesn’t depend on the direction of t. ◦

We can now introduce spherical distributions.

Definition 6.4. A random vector X in d dimensions has a spherical distribution if

Φ_X(t) = ψ(‖t‖^2) = ψ(t_1^2 + · · · + t_d^2), t ∈ R^d

for some univariate function ψ. ψ is called the characteristic generator of X, and we write X ∼ S_d(ψ).

If X ∼ S_d(ψ), t ∈ R^d and X^θ denotes X rotated by θ (and similarly for t^θ), we have

Φ_{X^θ}(t^θ) = E[e^{i⟨t^θ, X^θ⟩}] = E[e^{i⟨t, X⟩}] = Φ_X(t) = ψ(‖t‖^2) = ψ(‖t^θ‖^2) = Φ_X(t^θ)

which is true for all t^θ. By the uniqueness of the characteristic function, X^θ =ᵈ X. This gives a formal argument for the intuition of X being rotationally invariant. The following result gives an equivalent formulation of spherical distributions.
Proposition 6.5. The following are equivalent:
(i) X has a spherical distribution.
(ii) X =ᵈ RS with S uniformly distributed on the unit sphere S^{d-1} := {x ∈ R^d : ‖x‖ = 1} and R a one-dimensional random variable independent of S.
Proof. We first prove that (ii) implies (i). We have

Φ_X(t) = Φ_{RS}(t) = E[e^{i⟨t, RS⟩}] = E[E[e^{i⟨t, RS⟩} | R]] = E[E[e^{i⟨Rt, S⟩} | R]] = E[Φ_S(Rt)]

and since S is uniformly distributed on the unit sphere, Φ_S only depends on the length and not the direction. Hence Φ_X(t) also only depends on the length of t and X has a spherical distribution. We now show that (i) implies (ii). We have Φ_X(t) = ψ(‖t‖^2). Set s = t/‖t‖, then

Φ_X(t) = E[e^{i‖t‖⟨s, X⟩}],

and by assumption, this does not depend on s, only on ‖t‖. Let S be uniformly distributed on the unit sphere with distribution function F_S. Since Φ_X(t) is constant in s, we have

Φ_X(t) = ∫_{S^{d-1}} E[e^{i‖t‖⟨s, X⟩}] dF_S(s) = ∫_{S^{d-1}} ∫_{R^d} e^{i‖t‖⟨s, x⟩} dF_X(x) dF_S(s)
       = ∫_{R^d} ∫_{S^{d-1}} e^{i‖t‖⟨s, x⟩} dF_S(s) dF_X(x) = ∫_{R^d} ∫_{S^{d-1}} e^{i⟨s, ‖t‖x⟩} dF_S(s) dF_X(x)
       = ∫_{R^d} E[e^{i⟨‖t‖x, S⟩}] dF_X(x) = ∫_{R^d} E[e^{i⟨‖x‖t, S⟩}] dF_X(x)
       = E[E[e^{i⟨‖X‖t, S⟩} | X]] = E[e^{i⟨t, ‖X‖S⟩}] = E[e^{i⟨t, RS⟩}] = Φ_{RS}(t)

where we have defined R := ‖X‖. Hence X =ᵈ RS where R and S have the desired properties. □
While characterisation (ii) is more intuitive, it is easier to work with definition (i) when one
wants to prove properties of spherical distributions. The following corollary tells how to
compute R and S when we know that X is spherical.
Corollary 6.6. Let X =ᵈ RS be spherical. Then

(‖X‖, X/‖X‖) =ᵈ (R, S).

Proof. The proof is from [17], see Corollary 6.22. Let f_1(x) = ‖x‖ and f_2(x) = x/‖x‖. Since X =ᵈ RS, we have

(‖X‖, X/‖X‖) = (f_1(X), f_2(X)) =ᵈ (f_1(RS), f_2(RS)) = (R, S)

as desired. □
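Corollary 6.6 also suggests how to simulate S: draw X ∼ N(0, I_d), which is spherical, and normalise (this is also the idea behind Exercise 3.5). A minimal Python sketch; the function name is only illustrative:

```python
import numpy as np

def runif_sphere(n, d, rng=None):
    """Simulate n points uniformly on the unit sphere S^(d-1)
    by normalising standard normal vectors (Corollary 6.6)."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

s = runif_sphere(500, 2)   # 500 points on the unit circle S^1
```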

We now turn to a generalisation of spherical distributions, namely elliptical distributions.

Definition 6.7. A d-dimensional random vector X has an elliptical distribution if X = µ + AY with Y ∼ S_k(ψ) and A a d × k matrix. We write X ∼ E_d(µ, Σ, ψ) for Σ = AA^T. We call µ the location parameter and Σ the dispersion matrix.

As a motivation for this definition, suppose X ∼ N(µ, Σ) where Σ is positive definite. From linear algebra (see for example chapter 7 in [3]), we know that there exists some matrix A such that AA^T = Σ. If Y ∼ N(0, I_d), then

X =ᵈ µ + AY.
There exist methods to find A such that AA^T = Σ. One such method is the Cholesky factorisation. This factorisation determines a lower triangular matrix A such that AA^T = Σ. In detail,

( a_11  0     ⋯  0    ) ( a_11  a_21  ⋯  a_d1 )   ( Σ_11  Σ_12  ⋯  Σ_1d )
( a_21  a_22  ⋯  0    ) ( 0     a_22  ⋯  a_d2 ) = ( Σ_21  Σ_22  ⋯  Σ_2d )
(  ⋮     ⋮    ⋱  ⋮    ) ( ⋮      ⋮    ⋱   ⋮   )   (  ⋮     ⋮    ⋱   ⋮   )
( a_d1  a_d2  ⋯  a_dd ) ( 0     0     ⋯  a_dd )   ( Σ_d1  Σ_d2  ⋯  Σ_dd )

The algorithm can (somewhat informally) be described as follows: Since Σ_11 = a_11^2, Σ_11 determines a_11. Since Σ_21 = a_11 a_21, Σ_21 determines a_21 and so on. Since we can go back and forth between the matrix A and the matrix Σ, the notation E_d(µ, Σ, ψ) makes sense.
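The recursive scheme above can be sketched in a few lines of Python; this is only an illustration of the column-by-column idea, not a numerically robust implementation:

```python
import numpy as np

def cholesky_lower(sigma):
    """Lower triangular A with A @ A.T == sigma (sigma positive definite).
    Column by column: Sigma_jj fixes a_jj, then Sigma_ij fixes a_ij."""
    d = sigma.shape[0]
    a = np.zeros_like(sigma, dtype=float)
    for j in range(d):
        a[j, j] = np.sqrt(sigma[j, j] - np.sum(a[j, :j] ** 2))
        for i in range(j + 1, d):
            a[i, j] = (sigma[i, j] - np.sum(a[i, :j] * a[j, :j])) / a[j, j]
    return a

sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
a = cholesky_lower(sigma)        # agrees with np.linalg.cholesky(sigma)
```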
Using the characterisation of spherical distributions in terms of a radial component, we can
also write
X = µ + RAS
for X ∼ Ed (µ, Σ, ψ). We now turn to some properties of elliptical distributions. Afterwards
we will look at some examples.

7 Properties of elliptical distributions


We start by computing the characteristic function for an elliptical distribution.
Lemma 7.1. If X ∼ E_d(µ, Σ, ψ), then

Φ_X(t) = e^{i⟨t, µ⟩} ψ(t^T Σ t).

Proof. The proof is a straightforward computation. Write X = µ + AY with Y ∼ S_k(ψ). We have

Φ_X(t) = E[e^{i⟨t, X⟩}] = e^{i⟨t, µ⟩} E[e^{i⟨t, AY⟩}]

and since ⟨t, AY⟩ = t^T AY = (A^T t)^T Y = ⟨A^T t, Y⟩, we have

E[e^{i⟨t, AY⟩}] = E[e^{i⟨A^T t, Y⟩}] = ψ(‖A^T t‖^2) = ψ((A^T t)^T A^T t) = ψ(t^T AA^T t) = ψ(t^T Σ t)

as desired. □

We want to be able to relate covariances to the dispersion. The following proposition shows
how these are related for elliptical distributions.
Proposition 7.2. If X ∼ E_d(µ, Σ, ψ), the covariance between the components is given by

Cov(X_j, X_l) = −2ψ′(0)Σ_{jl}.

Proof. Since covariance does not depend on the mean, we can without loss of generality assume that E[X_j] = 0 for all j = 1, ..., d. We have

∂/∂t_j ∂/∂t_l Φ_X(t) |_{t=0} = ∂/∂t_j ∂/∂t_l E[e^{i⟨t, X⟩}] |_{t=0} = ∂/∂t_j ∂/∂t_l E[e^{i(t_1X_1 + ··· + t_dX_d)}] |_{t=0}
    = E[(iX_j)(iX_l) e^{i(t_1X_1 + ··· + t_dX_d)}] |_{t=0} = −E[X_j X_l e^{i(t_1X_1 + ··· + t_dX_d)}] |_{t=0}
    = −E[X_j X_l] = −Cov(X_j, X_l)

and thus by the previous lemma,

Cov(X_j, X_l) = −∂/∂t_j ∂/∂t_l Φ_X(t) |_{t=0} = −∂/∂t_j ∂/∂t_l ψ(t^T Σ t) |_{t=0}.

Let us for simplicity assume d = 2. Then

t^T Σ t = (t_1 t_2) ( Σ_11 Σ_12 ; Σ_21 Σ_22 ) (t_1 ; t_2) = t_1^2 Σ_11 + 2t_1t_2 Σ_12 + t_2^2 Σ_22 =: w(t)

and thus

∂/∂t_1 ∂/∂t_2 ψ(t^T Σ t) |_{t=0} = ∂/∂t_1 [ψ′(w(t))(2t_1 Σ_12 + 2t_2 Σ_22)] |_{t=0}
    = [ψ″(w(t))(2t_1 Σ_11 + 2t_2 Σ_12)(2t_1 Σ_12 + 2t_2 Σ_22) + ψ′(w(t)) · 2Σ_12] |_{t=0}
    = 2ψ′(0)Σ_12.

This calculation can be generalised so that ∂/∂t_j ∂/∂t_l ψ(t^T Σ t) |_{t=0} = 2ψ′(0)Σ_{jl}. We conclude that

Cov(X_j, X_l) = −2ψ′(0)Σ_{jl}. □
Example 7.3. If Y ∼ N(0, I_d), then ψ(r) = e^{-r/2} as seen earlier. We see that ψ′(r) = −(1/2)e^{-r/2}, so ψ′(0) = −1/2. If X ∼ N(µ, Σ), the above proposition tells us that Cov(X_j, X_l) = −2ψ′(0)Σ_{jl} = Σ_{jl} as expected. ◦
We list some further properties of elliptical distributions.
Theorem 7.4. Let X = µ + AY ∼ E_d(µ, Σ, ψ).

(i) (Linear combinations). If B is a k × d matrix and b ∈ R^k, then

BX + b ∼ E_k(Bµ + b, BΣB^T, ψ).

(ii) (Marginal distributions). The marginals X_1, ..., X_d also have elliptical distributions with the same characteristic generator. Explicitly, X_i ∼ E_1(µ_i, Σ_{ii}, ψ).

(iii) (Convolutions). If X̃ ∼ E_d(µ̃, Σ, ψ̃) is another elliptical distribution independent of X with the same dimension and dispersion matrix, then X + X̃ ∼ E_d(µ + µ̃, Σ, ψ̄) with ψ̄(u) = ψ(u)ψ̃(u).

(iv) (Quadratic forms). We have

R^2 = ‖Y‖^2 = (X − µ)^T Σ^{-1} (X − µ).

Here, R is called the Mahalanobis distance.

Proof. See the exercises. □

While elliptical distributions are quite flexible as seen in the examples earlier, the above theorem also illustrates a drawback. The flexibility of elliptical distributions is limited by the fact that if X is elliptical, so is every coordinate, with the same characteristic generator. In real data, it is often the case that the marginals have very different types of distributions. This motivates the topic for next week, namely copulas. We now consider some examples.

Example 7.5 (Normal Variance Mixture Models). Let Z ∼ N(0, I_k), W ≥ 0 a random variable independent of Z, and A a fixed d × k matrix. A normal variance mixture model is a model of the form

X = µ + √W AZ.

One can show that, conditional on W = w,

X ∼ N(µ, wΣ), Σ = AA^T.

Thus, X is obtained by drawing from a collection of normal random variables with random covariance WΣ. ◦

Example 7.6 (t-distribution). Here we consider a special case of a normal variance mixture model, namely if we let

W ∼ Ig(ν/2, ν/2),

where Ig(α, β) denotes the inverse gamma distribution with density

f(x) = (β^α / Γ(α)) (1/x)^{α+1} e^{-β/x}, x > 0.

In particular, we have ν/W ∼ χ^2_ν. For AA^T = Σ, we write

X ∼ t_d(ν, µ, Σ)

and we call this distribution the multivariate t distribution.
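Examples 7.5 and 7.6 translate directly into a simulation recipe (this is also the idea behind Exercise 3.3). A minimal Python sketch, using that W ∼ Ig(ν/2, ν/2) can be simulated as ν divided by a χ^2_ν variable; the function name is only illustrative:

```python
import numpy as np

def rmvt(n, nu, mu, sigma, rng=None):
    """Simulate n draws of X = mu + sqrt(W) A Z with W ~ Ig(nu/2, nu/2),
    Z ~ N(0, I) and A A^T = sigma (a normal variance mixture)."""
    rng = rng or np.random.default_rng()
    d = len(mu)
    a = np.linalg.cholesky(sigma)          # A with A A^T = sigma
    w = nu / rng.chisquare(nu, size=n)     # inverse gamma via nu / chi^2_nu
    z = rng.standard_normal((n, d))
    return mu + np.sqrt(w)[:, None] * (z @ a.T)

x = rmvt(500, nu=3, mu=np.array([2.0, 0.0]),
         sigma=np.array([[1.0, 0.5], [0.5, 1.0]]))
```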
We end this week with an application.



An application: Portfolio investment theory

Say we want to minimise the risk in a portfolio of d assets with returns X = (X_1, ..., X_d) and µ_i = E[X_i]. The following ideas go back to Markowitz. Let R_p denote the total return, i.e.

R_p = ∑_{i=1}^d w_i X_i = w^T X

where the w_i are the weights of the portfolio, i.e. ∑_{i=1}^d w_i = 1. If we fix the expected total return E[R_p] = ∑_{i=1}^d w_i µ_i = w^T µ, we want to minimise the risk in the sense of minimising the variance (note that this approach is different from the strategy in this course, where we focus on risk measures). If X = µ + AY where Y ∼ N(0, I_d), then

Var(R_p) = Var(w^T(µ + AY)) = Var(w^T AY)

and since Y is spherical, the variance is minimised whenever ‖A^T w‖ is minimised with respect to the weights w_i.

To transfer these ideas to the setting in this course, let ρ be a risk measure which satisfies monotonicity and translation invariance. Assume more generally that X = µ + AY is elliptical. The goal now is to minimise ρ(L) where L = −w^T µ − w^T AY is the loss. We have

ρ(L) = −w^T µ + ρ(−w^T AY).

As Y is spherical, we have Y =ᵈ −Y, so we can remove the minus in the risk measure and obtain

ρ(L) = −w^T µ + ρ(w^T AY) = −E[R_p] + ρ(w^T AY).

As E[R_p] is fixed, ρ(L) is minimised whenever ρ(w^T AY) is minimised. As Y is spherical, ρ(w^T AY) is minimised when ‖A^T w‖ is minimised. Thus the answer remains the same as in the classical case above.
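If we drop the constraint on E[R_p] and only require ∑ w_i = 1, minimising w^T Σ w has the classical closed form w* = Σ^{-1}1 / (1^T Σ^{-1} 1) (a Lagrange multiplier computation). A minimal Python sketch with an illustrative covariance matrix:

```python
import numpy as np

def min_variance_weights(sigma):
    """Global minimum-variance weights: argmin w^T sigma w s.t. sum(w) = 1,
    given by sigma^{-1} 1 normalised to sum to one."""
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)
    return w / w.sum()

sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])   # illustrative covariance matrix
w = min_variance_weights(sigma)
var_p = w @ sigma @ w                    # no other weights give lower variance
```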

Notes and comments


Chapter 6 of [17] discusses spherical and elliptical distributions. The chapter contains a
few more details and examples. Section 6.3.4 is dedicated to estimation of dispersion and
correlation in elliptical distributions.

Exercises
Exercise 3.1:
Prove Theorem 7.4. Hint: Characteristic functions!

Exercise 3.2:
Let X ∼ Γ(α, β) i.e. let X have a gamma distribution with parameters α, β > 0. Verify
that 1/X ∼ Ig(α, β). This provides an explanation for the name ”inverse gamma”.

Exercise 3.3:

1)Without using an R package, simulate 500 values of the multivariate t distribution with

ν = 3,  µ = (2, 0)^T,  Σ = ( 1    1/2
                             1/2  1   ).

Make a plot of the result. Hint: Use the result of the previous exercise.
2)Now simulate 500 values of the multivariate normal distribution with the same µ and Σ
and plot the result. Compare the plot to the one for the t distribution.

Exercise 3.4:
Without using an R package, write an R function to simulate from the multivariate normal
distribution with mean µ = (1, 0, 2) and covariance matrix

Σ = (  1  0  −2
       0  3   5
      −2  5   1 ).

Exercise 3.5:
Let S be uniformly distributed on the unit circle S1 . Simulate 500 values of S and plot the
result. Hint: Corollary 6.6.
Week 4 - Copulas I

8 Copulas: Basic properties


We need to be able to work with distributions that are less rigid than the elliptical distri-
butions. Let us set the stage. We have data of the form X = (X1 , ..., Xd ) (for example log
returns) and the goal is to find a suitable joint distribution function F for X. We don’t
want to exclude the possibility that the Xi have different types of distributions. Before
proceeding, we encourage the reader to recall the basic properties of generalised inverses as
described in the appendix.

Recall that if X_i has distribution function F_i, then F_i^←(U_i) =ᵈ X_i with U_i a Unif(0, 1) variable. This is the ”inverse transform method” used in simulation. Recall also that F_i(X_i) =ᵈ U_i whenever F_i is continuous.
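As a reminder, here is a minimal Python sketch of the inverse transform method for a case where F^← is explicit: if F(x) = 1 − e^{-λx} (exponential), then F^←(u) = −log(1 − u)/λ:

```python
import numpy as np

def rexp_inverse_transform(n, lam, rng=None):
    """Simulate Exp(lam) via X = F^{-1}(U) = -log(1 - U) / lam, U ~ Unif(0,1)."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=n)
    return -np.log(1.0 - u) / lam

x = rexp_inverse_transform(10_000, lam=2.0)   # sample mean should be near 1/2
```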

To study the problem of determining the joint distribution F , we assume that the marginal
distribution functions Fi are known. The transformation Ui := Fi (Xi ) in the continuous
case is illustrated below.

Figure 12: An illustration (in two dimensions) of the idea of a ”copula space”. The ”original
space” on the left is where the variables live, and we wish to transform them into a collection
of uniform variables with support on [0, 1]d .


Definition 8.1. A copula C is a distribution function on [0, 1]d such that all marginals are
Unif(0, 1) distributed.
To be able to go back and forth between the ”original space” and the ”copula space”, Sklar’s
Theorem is an essential tool.
Theorem 8.2 (Sklar’s Theorem). Let F be the joint distribution function of the random
vector X = (X1 , ..., Xd ) with marginal distribution functions F1 , ..., Fd . There exists a copula
C such that
F (x1 , ..., xd ) = C(F1 (x1 ), ..., Fd (xd )). (3.1)
If F1 , ..., Fd are continuous, C is unique. Conversely, given a copula C and marginal distri-
bution functions F1 , ..., Fd , then F as defined in (3.1) is a joint distribution function with
marginals F1 , ..., Fd .

Proof. For the sake of simplicity, assume that the Fi are continuous. Then Ui := Fi (Xi ) ∼
Unif(0, 1). Suppose X ∼ F with marginals Xi ∼ Fi . Let U = (F1 (X1 ), ..., Fd (Xd )) and let
C be the distribution function of U. By construction and the continuity assumption on the
Fi , C is a copula. We compute

C(F1 (x1 ), ..., Fd (xd )) = P (U1 ≤ F1 (x1 ), ..., Ud ≤ Fd (xd ))


= P (F1 (X1 ) ≤ F1 (x1 ), ..., Fd (Xd ) ≤ Fd (xd ))
= P (X1 ≤ x1 , ..., Xd ≤ xd ) = F (x1 , ..., xd )

which shows that the copula C has the desired properties. As for uniqueness, by continuity
of the Fi , we have Fi (Fi← (ui )) = ui for all ui ∈ [0, 1]. Letting xi = Fi← (ui ) in the expression
above, we get
C(u1 , ..., ud ) = F (F1← (u1 ), ..., Fd← (ud ))
and any copula C̃ satisfying C̃(F1 (x1 ), ..., Fd (xd )) = F (x1 , ..., xd ) must satisfy the same
relation. Uniqueness now follows. To prove the converse statement, let C be a copula and
F1 , ..., Fd univariate distribution functions. Let U = (U1 , ..., Ud ) have distribution function
C and define Xi := Fi← (Ui ), X := (X1 , ..., Xd ). We know that Xi ∼ Fi , so the marginal
distributions are correct. Also,

C(F1 (x1 ), ..., Fd (xd )) = P (U1 ≤ F1 (x1 ), ..., Ud ≤ Fd (xd ))


= P (F1← (U1 ) ≤ x1 , ..., Fd← (Ud ) ≤ xd )
= P (X1 ≤ x1 , ..., Xd ≤ xd )

which shows that F (x1 , ..., xd ) := C(F1 (x1 ), ..., Fd (xd )) is the distribution function for X.


Sklar’s Theorem provides a recipe for constructing copulas using a known joint distribution
function. We call such copulas implicit copulas. Different examples of copulas will be given
in the next section. We first consider some more theoretical properties. We start with the
following useful characterisation of copulas.
Proposition 8.3. A function C : [0, 1]d → [0, 1] is a copula if and only if
(i) C(u1 , ..., ud ) = 0 if ui = 0 for any i = 1, ..., d.
(ii) C(1, ..., 1, ui , 1, ..., 1) = ui for any i = 1, ..., d and ui ∈ [0, 1].

(iii) For all (a_1, ..., a_d), (b_1, ..., b_d) ∈ [0, 1]^d with a_i ≤ b_i, we have

∑_{i_1=1}^{2} · · · ∑_{i_d=1}^{2} (−1)^{i_1 + ··· + i_d} C(u_{1i_1}, ..., u_{di_d}) ≥ 0

where u_{j1} = a_j and u_{j2} = b_j for all j = 1, ..., d.

The first two properties are self-explanatory. The third property is called the rectangle inequality and can be interpreted as follows: for a vector (U_1, ..., U_d) with distribution function C, it states that P(a_1 ≤ U_1 ≤ b_1, ..., a_d ≤ U_d ≤ b_d) ≥ 0.

Proposition 8.4 (Fréchet bounds). For every copula C, we have W(u) ≤ C(u) ≤ M(u) where

W(u) = max{∑_{i=1}^d u_i + 1 − d, 0}  and  M(u) = min_{i=1,...,d} u_i.

Proof. Let U have distribution function C. For every u_i ∈ [0, 1], we have

C(u) = P(U_1 ≤ u_1, ..., U_d ≤ u_d) ≤ P(U_i ≤ u_i) = u_i.

The bound C(u) ≤ M(u) now follows by minimising over all i. Conversely, for u ∈ [0, 1]^d,

1 − C(u) = 1 − P(U_1 ≤ u_1, ..., U_d ≤ u_d) = P(⋃_{i=1}^d {U_i > u_i})
         ≤ ∑_{i=1}^d P(U_i > u_i) = ∑_{i=1}^d (1 − u_i) = d − ∑_{i=1}^d u_i.

Thus −C(u) ≤ d − 1 − ∑_{i=1}^d u_i, implying C(u) ≥ ∑_{i=1}^d u_i + 1 − d. Since C(u) ≥ 0 always holds, the lower bound follows. □

What kinds of random vectors produce these upper and lower bounds? It turns out that
the function W is not a copula for d > 2, see Example 7.24 in [17]. We can however
produce M as a copula for any d and W for d = 2. To do so, we introduce the concepts of
comonotonicity and countermonotonicity.

Definition 8.5. We say that X1 , ..., Xd are comonotone if (X1 , ..., Xd ) = (α1 (Z), ..., αd (Z))
for some univariate variable Z and nondecreasing functions α1 , ..., αd . We say that (X1 , X2 )
are countermonotonic if (X1 , X2 ) = (α(Z), β(Z)) for some univariate variable Z, some
nondecreasing function α and some nonincreasing function β.

The following proposition shows that the lower and upper Fréchet bounds arise from coun-
termonotonic and comonotonic random vectors, respectively.

Proposition 8.6. A comonotone bundle (X1 , ..., Xd ) has M as a copula, and a counter-
monotonic bundle (X1 , X2 ) in d = 2 dimensions has W as a copula.

Proof. Consider a comonotone bundle (X_1, ..., X_d) and assume for simplicity that the functions α_i are strictly increasing and continuous. If F is the joint distribution function, we have

F(x_1, ..., x_d) = P(X_1 ≤ x_1, ..., X_d ≤ x_d) = P(α_1(Z) ≤ x_1, ..., α_d(Z) ≤ x_d)
    = P(Z ≤ α_1^←(x_1), ..., Z ≤ α_d^←(x_d)) = P(Z ≤ min_{i=1,...,d} α_i^←(x_i))
    = min_{i=1,...,d} P(Z ≤ α_i^←(x_i)) = min_{i=1,...,d} P(α_i(Z) ≤ x_i)
    = min_{i=1,...,d} P(X_i ≤ x_i) = min_{i=1,...,d} F_i(x_i) = M(F_1(x_1), ..., F_d(x_d))

so M is a copula for (X_1, ..., X_d). Now consider a countermonotonic bundle (X_1, X_2) = (α(Z), β(Z)). Assume for simplicity that α is strictly increasing and continuous, β is strictly decreasing and continuous and that Z is continuous. If F is the distribution function of (X_1, X_2), we have

F(x_1, x_2) = P(α(Z) ≤ x_1, β(Z) ≤ x_2) = P(Z ≤ α^←(x_1), Z ≥ β^←(x_2))
    = P(Z ≤ α^←(x_1)) − P(Z ≤ α^←(x_1), Z < β^←(x_2))
    = F_1(x_1) − min{F_1(x_1), 1 − F_2(x_2)}
    = max{F_1(x_1) − F_1(x_1), F_1(x_1) − 1 + F_2(x_2)}
    = max{0, F_1(x_1) + F_2(x_2) − 1} = W(F_1(x_1), F_2(x_2))

so W is a copula for (X_1, X_2). □
Remark 8.7. The implications in the above proposition are in fact equivalences; see the notes and comments at the end of the chapter.
What happens when we take monotone transformations of a copula? While the distribution
itself may change, the copula does not change under strictly increasing transformations as
the following result shows.
Proposition 8.8. Consider a random vector (X1 , ..., Xd ) with continuous marginal distri-
butions Fi and copula C. Let T1 , ..., Td be strictly increasing continuous functions. Then
(T1 (X1 ), ..., Td (Xd )) also has copula C.
Proof. Let X̃_i := T_i(X_i), X̃_i ∼ F̃_i and (X̃_1, ..., X̃_d) ∼ F̃. Let C̃ be the copula of (X̃_1, ..., X̃_d). By Sklar’s Theorem,

C̃(F̃_1(x_1), ..., F̃_d(x_d)) = F̃(x_1, ..., x_d) = P(T_1(X_1) ≤ x_1, ..., T_d(X_d) ≤ x_d)
    = P(X_1 ≤ T_1^←(x_1), ..., X_d ≤ T_d^←(x_d))
    = F(T_1^←(x_1), ..., T_d^←(x_d)) = C(F_1(T_1^←(x_1)), ..., F_d(T_d^←(x_d))).

We now claim that F̃_i = F_i ◦ T_i^←. By definition,

F̃_i(x) = P(X̃_i ≤ x) = P(T_i(X_i) ≤ x) = P(X_i ≤ T_i^←(x)) = F_i(T_i^←(x))

as claimed. F_i is continuous by assumption and T_i^← is continuous since T_i is strictly increasing. Hence F̃_i is continuous. We have now proved that

C̃(F̃_1(x_1), ..., F̃_d(x_d)) = C(F̃_1(x_1), ..., F̃_d(x_d))

and so C̃ = C by the uniqueness part of Sklar’s Theorem. □
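Proposition 8.8 is easy to check empirically: rank statistics such as Spearman’s rho depend only on the copula, so they are invariant under strictly increasing transformations of the marginals. A minimal Python sketch, with Spearman’s rho computed directly from ranks:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    It depends only on the copula of (X, Y)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=1000)
x, y = z[:, 0], z[:, 1]
# exp and arctan are strictly increasing, so the ranks (hence rho) are unchanged
rho_before = spearman_rho(x, y)
rho_after = spearman_rho(np.exp(x), np.arctan(y))
```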

9 Examples of copulas
We will study three types of copulas, namely fundamental copulas, implicit copulas and
explicit copulas.

Fundamental copulas
Fundamental copulas arise from theoretical considerations. We have already seen two ex-
amples from the Fréchet bounds.

Definition 9.1. For any d > 1, we call

M (u) = min ui
i=1,...,d

the comonotonicity copula. For d = 2, we call

W (u1 , u2 ) = max{u1 + u2 − 1, 0}

the countermonotonicity copula.

The Fréchet bound copulas are not the only ”theoretical” copulas.

Example 9.2 (The independence copula). For any d > 1, the independence copula is given by

C(u_1, ..., u_d) = ∏_{i=1}^d u_i.

Unsurprisingly, C arises from independent variables. Suppose X_1, ..., X_d are independent with continuous distribution functions F_i. Then

C(F_1(x_1), ..., F_d(x_d)) = ∏_{i=1}^d F_i(x_i) = ∏_{i=1}^d P(X_i ≤ x_i)
    = P(X_1 ≤ x_1, ..., X_d ≤ x_d) = F(x_1, ..., x_d)

so by the uniqueness from Sklar’s Theorem, C is the copula for X_1, ..., X_d.


Implicit copulas
Implicit copulas arise from known joint distribution functions. Let F be a given joint
distribution function. If the marginal distribution functions Fi are continuous, we can use
Sklar’s Theorem to construct a copula C via

C(u1 , ..., ud ) = F (F1← (u1 ), ..., Fd← (ud )).

We give a concrete example.



Example 9.3 (The Gaussian copula). Let X ∼ N(0, Σ) be a d-dimensional normal vector with distribution function Φ_Σ where

Σ = ( 1     ρ_12  ⋯  ρ_1d
      ρ_12  1     ⋯   ⋮
       ⋮           ⋱   ⋮
      ρ_1d   ⋯    ⋯   1   ).

One can think of Σ as a correlation matrix. Note that the marginals X_i are standard normal, so that they have common distribution function Φ. We can then construct the Gaussian copula

C_Σ^Ga(u_1, ..., u_d) = Φ_Σ(Φ^{-1}(u_1), ..., Φ^{-1}(u_d)).

What if we have a general mean and covariance matrix? Let Y = µ + BX be an affine transformation of X where B = diag(σ_1, ..., σ_d) is a diagonal matrix with σ_i > 0 for all i. Then Y ∼ N(µ, BΣB^T), and any desired covariance matrix can be written in the form BΣB^T. Note that Y_i = µ_i + σ_i X_i is a strictly increasing and continuous transformation of X_i, so Proposition 8.8 implies that Y and X have the same copula, namely C_Σ^Ga. Simulating from this copula is easy. The trick is to follow the construction in Sklar’s Theorem. To simulate a sample from C_Σ^Ga, follow the steps:

(i) Simulate X from the multivariate Gaussian distribution N(0, Σ).
(ii) Set U_i = Φ(X_i) for i = 1, ..., d.
(iii) Now U = (U_1, ..., U_d) has the distribution function C_Σ^Ga.
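The three steps can be sketched in Python, using the standard normal distribution function Φ(x) = (1 + erf(x/√2))/2; the function name is only illustrative:

```python
import numpy as np
from math import erf, sqrt

def gaussian_copula_sample(n, sigma, rng=None):
    """Steps (i)-(iii): simulate X ~ N(0, sigma) and return U_i = Phi(X_i)."""
    rng = rng or np.random.default_rng()
    d = sigma.shape[0]
    x = rng.multivariate_normal(np.zeros(d), sigma, size=n)   # step (i)
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return phi(x)                                             # steps (ii)-(iii)

sigma = np.array([[1.0, 0.5], [0.5, 1.0]])   # a correlation matrix
u = gaussian_copula_sample(1000, sigma)      # marginals are Unif(0, 1)
```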

Explicit copulas
Explicit copulas are given by a concrete formula. Some well-known examples are the fol-
lowing.
Example 9.4 (Gumbel copula). The Gumbel copula is given by

C_θ^Gu(u_1, u_2) = exp(−[(−log u_1)^θ + (−log u_2)^θ]^{1/θ})

where 1 ≤ θ < ∞ is a parameter. ◦


Example 9.5 (Clayton copula). The Clayton copula is given by

C_θ^Cl(u_1, u_2) = (u_1^{-θ} + u_2^{-θ} − 1)^{-1/θ}

where 0 < θ < ∞ is a parameter. ◦



Both of the above examples are so-called Archimedean copulas.


Definition 9.6. Let ϕ : [0, 1] → [0, ∞] be continuous, strictly decreasing and convex with
ϕ(0) = ∞ and ϕ(1) = 0. Then

C(u_1, u_2) = ϕ^{-1}(ϕ(u_1) + ϕ(u_2))

is called the Archimedean copula with generator ϕ.

We see that by choosing ϕ(t) = (−log t)^θ for θ ≥ 1, we obtain the Gumbel copula. Choosing ϕ(t) = (1/θ)(t^{-θ} − 1) gives the Clayton copula.

Example 9.7 (Generalised Clayton copula). Choosing ϕ(t) = θ^{-δ}(t^{-θ} − 1)^δ for θ > 0 and δ ≥ 1 gives the Generalised Clayton copula. The special case δ = 1 corresponds to the Clayton copula from above. ◦

Example 9.8 (Frank copula). The Frank copula is the Archimedean copula with generator

ϕ(t) = −log((e^{-θt} − 1)/(e^{-θ} − 1))

where θ ∈ R \ {0} is a parameter. ◦
It should of course be verified that an Archimedean copula is a copula. We state the relevant
result without proof.
Theorem 9.9. Let ϕ : [0, 1] → [0, ∞] be a continuous, strictly decreasing function such that
ϕ(1) = 0, ϕ(0) = ∞. The function C : [0, 1]2 → [0, 1] given by

C(u_1, u_2) = ϕ^{-1}(ϕ(u_1) + ϕ(u_2))

is a copula if and only if ϕ is convex.

Proof. See [19], Theorem 4.1.4. 

While the method of constructing Archimedean copulas is certainly practical, there are some limitations worth mentioning. An Archimedean copula is symmetric in its two arguments, which is clearly a constraint in modelling. Furthermore, the structure of an Archimedean copula is perhaps too simple, since a two-dimensional dependence structure is driven by a one-dimensional generator.

Notes and comments


Chapter 7 in [17] covers copulas in more generality than here. In particular, see section
7.2.1 for a complete characterisation of the copulas of countermonotonic and comonotonic
random vectors. If the reader reads the section on Archimedean copulas (section 7.4),
they should be aware that ϕ and ϕ−1 are switched. For more details on copulas as a
subject, we refer to the book by Nelsen, [19]. Chapter 4 about Archimedean copulas contains
all the theoretical details on Archimedean copulas as well as a table of copulas with the
corresponding generators, see page 116-119. Some of the exercises below are also borrowed
from [19].

Exercises
Exercise 4.1:
Write out the rectangle inequality (see Proposition 8.3) in the case where d = 2. Use it to
verify that the Morgenstern copula given by

C(u1 , u2 ) = u1 u2 (1 + δ(1 − u1 )(1 − u2 )),

for δ ∈ [−1, 1] a parameter, is indeed a copula.
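In d = 2 the rectangle inequality reads C(b_1, b_2) − C(a_1, b_2) − C(b_1, a_2) + C(a_1, a_2) ≥ 0. A small Python sketch that checks this numerically for the Morgenstern copula on random rectangles (a sanity check, not a proof):

```python
import numpy as np

def morgenstern(u1, u2, delta):
    """Morgenstern copula C(u1, u2) = u1 u2 (1 + delta (1-u1)(1-u2))."""
    return u1 * u2 * (1.0 + delta * (1.0 - u1) * (1.0 - u2))

def rectangle_mass(a1, a2, b1, b2, delta):
    """C-measure of [a1,b1] x [a2,b2]; nonnegative for every copula."""
    return (morgenstern(b1, b2, delta) - morgenstern(a1, b2, delta)
            - morgenstern(b1, a2, delta) + morgenstern(a1, a2, delta))

rng = np.random.default_rng(0)
for delta in (-1.0, -0.3, 0.5, 1.0):
    for _ in range(1000):
        a1, b1 = np.sort(rng.uniform(size=2))
        a2, b2 = np.sort(rng.uniform(size=2))
        assert rectangle_mass(a1, a2, b1, b2, delta) >= -1e-12
```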

Exercise 4.2:
Let C and C̃ be copulas and θ ∈ [0, 1]. Verify that θC + (1 − θ)C̃ is also a copula.

Exercise 4.3:
Let (X, Y ) have joint distribution function

H(x, y) = (1 + e−x + e−y )−1

for all (x, y) ∈ R2 .


1)Verify that X and Y both have the logistic distribution i.e.

FX (x) = (1 + e−x )−1 , FY (y) = (1 + e−y )−1 .

2)Show that (X, Y ) has the copula

C(u_1, u_2) = u_1 u_2 / (u_1 + u_2 − u_1 u_2).

Exercise 4.4:
Let X and Y be random variables with continuous distribution functions FX and FY . Let
α, β be functions, C the copula of (X, Y ) and C̃ the copula of (α(X), β(Y )).
1)If α is strictly increasing and β strictly decreasing, prove that

C̃(u1 , u2 ) = u1 − C(u1 , 1 − u2 ).

2)If α is strictly decreasing and β is strictly increasing, prove that

C̃(u1 , u2 ) = u2 − C(1 − u1 , u2 ).

3)If α and β are strictly increasing, prove that

C̃(u1 , u2 ) = u1 + u2 − 1 + C(1 − u1 , 1 − u2 ).

Exercise 4.5:
Recall that the Frank copula has generator

ϕ(t) = −log((e^{-θt} − 1)/(e^{-θ} − 1))

with θ ∈ R \ {0} a parameter.


1)Verify that the Frank copula is indeed a copula using Theorem 9.9.
2)Write the Frank copula in explicit form.

Exercise 4.6:
Consider the function
ϕ(t) = log(1 − θ log t)
for θ ∈ (0, 1] a parameter.
1)Verify that ϕ is a valid generator for an Archimedean copula.
2)Write the corresponding copula in explicit form.

Exercise 4.7:
The d-dimensional t copula is given by

C_{ν,Σ}^t(u_1, ..., u_d) = t_{ν,Σ}(t_ν^{-1}(u_1), ..., t_ν^{-1}(u_d))

with t_{ν,Σ} the distribution function of t(ν, 0, Σ) and t_ν the univariate t distribution function with ν degrees of freedom.
1)Describe a procedure to simulate from the t copula.
2)Simulate 1000 samples from the t copula with ν = 3 and

Σ = ( 1    1/2
      1/2  1   ).

3)Simulate 1000 samples from the Gaussian copula with Σ above. Plot the two simulated
samples and compare. Try choosing different marginal distributions for these two copulas
and see what happens.
Week 5 - Copulas II

10 Archimedean copulas in higher dimensions


Last week we introduced Archimedean copulas as a recipe for constructing copulas in two dimensions. How can we generalise this construction to higher dimensions? The logical next step would be to propose

C(u_1, ..., u_d) = ϕ^{-1}(ϕ(u_1) + · · · + ϕ(u_d)) (4.2)

where ϕ : [0, 1] → [0, ∞] has the same properties as in the two-dimensional case, i.e. ϕ is strictly decreasing, convex and satisfies ϕ(0) = ∞ and ϕ(1) = 0. Is C as constructed above a copula? The answer is no in general. One issue is that C is not even a distribution function in general for dimensions higher than two. In order to answer the question of when C is a copula, we need the following definition.
Definition 10.1. A decreasing function f is completely monotonic on [a, b] if

(−1)^k (d^k/dt^k) f(t) ≥ 0 for k = 1, 2, ... and t ∈ (a, b).
It turns out that the property of being completely monotonic determines whether C defined
by equation (4.2) is a copula.
Theorem 10.2. Let ϕ : [0, 1] → [0, ∞] be a continuous strictly decreasing function such that ϕ(0) = ∞ and ϕ(1) = 0. C defined by (4.2) is a copula for all d ≥ 2 if and only if ϕ^{-1} is completely monotonic on [0, ∞).
Proof. See Theorem 4.6.2 in [19] and the references in the paragraph above the theorem. 
In principle one should be able to check whether ϕ^{-1} is completely monotonic, but it is
a tedious procedure. As an alternative to verifying complete monotonicity, one can apply
Laplace transforms of distribution functions. We recall the definition.
Definition 10.3. Let G be a distribution function on [0, ∞) with G(0) = 0. The Laplace transform of G is

ψ(t) = ∫_0^∞ e^{-tx} dG(x).

Remark 10.4. Let G be a distribution function on [0, ∞) with G(0) = 0 and Y a random
variable distributed according to G. We then have the following relationship between the
Laplace transform ψ of G and the moment-generating function κY given by

ψ(t) = κY (−t).


Lemma 10.5. A function ψ on [0, ∞) is the Laplace transform of a distribution function G if and only if ψ is completely monotonic and ψ(0) = 1.
The above results provide a strategy to verify that ϕ^{-1} is completely monotonic. It suffices to show that ϕ^{-1} is a Laplace transform of some distribution function G.
Example 10.6. Consider the Clayton copula with generator ϕ(t) = (t^{-θ} − 1)/θ. We can solve for ϕ^{-1} and get

ϕ^{-1}(t) = (θt + 1)^{-1/θ} = (1/θ)^{1/θ} / (t + 1/θ)^{1/θ}.

Now recall that the gamma distribution with parameters α, β > 0 has the moment-generating function

κ(t) = (β/(β − t))^α

so if we let α = β = 1/θ, we have that the Laplace transform of this distribution is given by ϕ^{-1}(t). We conclude that the Clayton copula extends to any dimension. Explicitly,

C_θ^Cl(u_1, ..., u_d) = (u_1^{-θ} + u_2^{-θ} + · · · + u_d^{-θ} − d + 1)^{-1/θ}

is a copula on [0, 1]^d for any d ≥ 2. ◦
Archimedean copulas constructed using Laplace transforms of distributions deserve their
own name.
Definition 10.7. An Archimedean copula is called an LT-Archimedean copula if its generator ϕ satisfies ϕ^{-1} = ψ, where ψ is the Laplace transform of some distribution function G on [0, ∞).
How do we simulate random vectors with a given copula? If we have an LT-Archimedean
copula, the following proposition provides a recipe.
Proposition 10.8. Let G be a distribution function on [0, ∞) and V ∼ G. Let ψ = ϕ^{−1}
denote the Laplace transform of G. Suppose we have variables W1, ..., Wd which are
conditionally independent given V with conditional distribution function

F_{Wi|V=v}(u) = e^{−vϕ(u)}.

Then the distribution function of W = (W1, ..., Wd) satisfies F_W(u) = C(u).
Proof. The proof is a straightforward computation using a conditioning argument and con-
ditional independence:

F_W(u) = P(W1 ≤ u1, ..., Wd ≤ ud) = ∫_0^∞ P(W1 ≤ u1, ..., Wd ≤ ud | V = v) dG(v)
= ∫_0^∞ ∏_{i=1}^d P(Wi ≤ ui | V = v) dG(v) = ∫_0^∞ ∏_{i=1}^d e^{−vϕ(ui)} dG(v)
= ∫_0^∞ e^{−v(ϕ(u1) + · · · + ϕ(ud))} dG(v) = ψ(ϕ(u1) + · · · + ϕ(ud))
= ϕ^{−1}(ϕ(u1) + · · · + ϕ(ud)) = C(u). □

The proposition tells us that if we want to simulate from the copula C(u) = ϕ^{−1}(ϕ(u1) +
· · · + ϕ(ud)), we should apply the following steps:

1. Identify the distribution G having the Laplace transform ψ = ϕ^{−1}.

2. Simulate V ∼ G.

3. Generate iid U1, ..., Ud ∼ Unif(0, 1) and apply the inverse transform method, i.e.
Wi = F←_{Wi|V=v}(Ui) with v equal to the simulated value of V from step 2.

We can actually be more specific in step 3. F_{Wi|V=v} has a proper inverse, which we solve for
as follows:

e^{−vϕ(F←_{Wi|V=v}(u))} = u ⇔ −log u = vϕ(F←_{Wi|V=v}(u)) ⇔ F←_{Wi|V=v}(u) = ϕ^{−1}(−log u / v).

Hence Wi in step 3 should be set to

Wi = ψ(−log Ui / V).

11 Fitting copulas to data

If a copula C has been chosen for the data, one can apply maximum likelihood methods in the
case where C depends on a parameter θ. If cθ denotes the density of C, then the maximum
likelihood estimator θ̂ maximises the log-likelihood

log L(θ; u1, ..., un) = Σ_{i=1}^n log cθ(ui)

for iid data u1, ..., un. This of course requires that we choose a specific copula.
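As an illustration, here is a hedged sketch (my own naming, scipy assumed available) of maximum likelihood fitting for the bivariate Clayton copula, whose density c_θ(u, v) = (1 + θ)(uv)^{−θ−1}(u^{−θ} + v^{−θ} − 1)^{−1/θ−2} follows by differentiating C_θ^{Cl} twice:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def clayton_loglik(theta, u, v):
    """Log-likelihood of the bivariate Clayton copula density
    c(u,v) = (1+theta)*(u*v)**(-theta-1)*(u**-theta + v**-theta - 1)**(-1/theta - 2)."""
    s = u**(-theta) + v**(-theta) - 1
    return np.sum(np.log1p(theta) - (theta + 1) * (np.log(u) + np.log(v))
                  - (1/theta + 2) * np.log(s))

def fit_clayton(u, v):
    """Numerically maximise the log-likelihood over theta > 0."""
    res = minimize_scalar(lambda th: -clayton_loglik(th, u, v),
                          bounds=(1e-3, 50), method="bounded")
    return res.x
```

On simulated Clayton data the estimate should land close to the true parameter.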

To determine a proper copula for a data set, we should think of properties like correlation,
tail dependence and symmetry. The tail behaviour is especially relevant in a risk manage-
ment context since we want to model large losses. Large losses also tend to "move together":
when one loss is large, the other losses tend to be large as well. This can for example happen
in a portfolio with stocks in similar companies. When we make the transformation from the
original space of our data to the copula space, the size of the data is not preserved, only
the ordering. Hence we need notions of correlation that depend only on the ordering. This
leads to different notions of rank correlation. We will study two types, namely Kendall's τ
and Spearman's ρ. Afterwards, we will consider tail dependence via the coefficient of upper
(respectively lower) tail dependence.

Kendall’s τ
Definition 11.1. Kendall's τ for (X1, X2) is defined as

ρτ(X1, X2) = P((X1 − Y1)(X2 − Y2) > 0) − P((X1 − Y1)(X2 − Y2) < 0)

where (Y1, Y2) is independent of (X1, X2) with (Y1, Y2) =d (X1, X2).

Kendall’s τ gives an indication of whether X1 and X2 get large together or if one tends
to get larger when the other gets smaller. If X1 and X2 tend to move together, the event
{(X1 − Y1 )(X2 − Y2 ) > 0} will have high probability, resulting in a value of ρτ (X1 , X2 ) close
to 1, and if X1 and X2 move in opposite directions, {(X1 − Y1 )(X2 − Y2 ) < 0} will have
high probability so that ρτ (X1 , X2 ) is close to -1. To make this precise, we can make the
following definition.
Definition 11.2. Consider two points (x1, x2), (y1, y2) ∈ R². We say that (x1, x2) and
(y1, y2) are concordant if (x1 − y1)(x2 − y2) > 0 and discordant if (x1 − y1)(x2 − y2) < 0.
Intuitively, ρτ (X1 , X2 ) = 1 should correspond to comonotonicity while ρτ (X1 , X2 ) = −1
should correspond to countermonotonicity. This is indeed the case. We want to be able to
estimate Kendall's τ from data {(Xt,1, Xt,2) : t = 1, ..., n}. To do so, we need to compare
all pairs (Xt,1, Xt,2) and (Xs,1, Xs,2) with one another. A concordant pair should contribute
the value 1 (indicating a positive sign) and a discordant pair the value −1. Since there are
(n choose 2) pairs in total, we obtain the estimator

ρ̂τ = (n choose 2)^{−1} Σ_{1≤t<s≤n} sign((Xt,1 − Xs,1)(Xt,2 − Xs,2))

where sign(x) = 1 for x > 0, sign(x) = 0 for x = 0 and sign(x) = −1 for x < 0.
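The estimator translates directly into code; this O(n²) sketch (my own naming) mirrors the pairwise sum above:

```python
import numpy as np

def kendall_tau_hat(x, y):
    """Pairwise-sign estimator of Kendall's tau: average of
    sign((x_t - x_s)(y_t - y_s)) over all pairs t < s."""
    n = len(x)
    total = 0.0
    for t in range(n):
        for s in range(t + 1, n):
            total += np.sign((x[t] - x[s]) * (y[t] - y[s]))
    return total / (n * (n - 1) / 2)
```

For perfectly concordant data the estimator returns 1, for perfectly discordant data −1.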

We have the following results on Kendall’s τ for copulas. For their proofs, we refer to chapter
5 of [19].
Theorem 11.3. The following hold:

(i) Let (X1, X2) be continuous variables with copula C. Then

ρτ(X1, X2) = 4 ∫_{[0,1]²} C(u1, u2) dC(u1, u2) − 1 = 4E[C(U1, U2)] − 1

for uniform variables U1, U2 on [0, 1] with joint distribution function C.

(ii) Let (X1, X2) be random variables with an Archimedean copula with generator ϕ. Then

ρτ(X1, X2) = 1 + 4 ∫_0^1 ϕ(t)/ϕ′(t) dt.

We leave it as an exercise for the reader to verify that if (X1, X2) have the Clayton copula,
ρτ(X1, X2) = θ/(θ + 2), while for the Gumbel copula, ρτ(X1, X2) = 1 − 1/θ.

Spearman's ρ
Another measure of rank correlation is Spearman's ρ, defined as follows:

Definition 11.4. Consider a pair of variables (X1, X2) and introduce independent copies
(Y1, Y2) =d (Z1, Z2) =d (X1, X2). Spearman's ρ of (X1, X2) is given by

ρS(X1, X2) = 3(P((X1 − Y1)(X2 − Z2) > 0) − P((X1 − Y1)(X2 − Z2) < 0)).

The intuition for Spearman’s ρ is the same as for Kendall’s τ .

Theorem 11.5. Let X1 and X2 be continuous variables with copula C. Then

ρS(X1, X2) = 12 ∫_{[0,1]²} C(u1, u2) du1 du2 − 3 = 12E[U1 U2] − 3

where (U1, U2) have distribution function C.

Remark 11.6. It is not hard to verify (see the exercises) that

12E[U1 U2] − 3 = E[(U1 − EU1)(U2 − EU2)] / (√Var(U1) √Var(U2)) = ρL(U1, U2),

which is the ordinary (linear) correlation of U1 and U2 with joint distribution function C.
The following result is useful for relating the rank correlations for the Gaussian copula.

Theorem 11.7. Let X = (X1, X2) have a bivariate Gaussian distribution with copula C_Σ^{Ga},
where Σ is the 2×2 matrix with unit diagonal and off-diagonal entries ρ. Then the ordinary
linear correlation ρ of X1 and X2 is related to ρτ and ρS via

ρτ(X1, X2) = (2/π) arcsin ρ,  ρS(X1, X2) = (6/π) arcsin(ρ/2).

Proof. See Theorem 7.42 in [17]. □

Tail dependence
Definition 11.8. The coefficient of upper tail dependence of X1 and X2 is given by

λU(X1, X2) = lim_{u↑1} P(X2 > F2^←(u) | X1 > F1^←(u))

where Fi denotes the distribution function of Xi, i = 1, 2.

λU essentially measures how X2 behaves when X1 gets large. We leave it as an exercise
for the reader to verify that in the extreme case where X1 and X2 are comonotone, then
λU(X1, X2) = 1. To compute the coefficient of upper tail dependence, the following result
is often useful.

Proposition 11.9. Assume X1 and X2 have continuous distribution functions F1 and F2
and unique copula C. Then

λU(X1, X2) = lim_{u↑1} (1 − 2u + C(u, u)) / (1 − u).

Proof. Since F1 and F2 are continuous, we have

λU(X1, X2) = lim_{u↑1} P(F2(X2) > u | F1(X1) > u) = lim_{u↑1} P(U2 > u | U1 > u).

Note that we have the following:

P(U1 > u) + P(U2 > u) + P(U1 ≤ u, U2 ≤ u) = 1 + P(U1 > u, U2 > u).

To see this, it may help to draw a figure. The equation can be rewritten as

(1 − u) + (1 − u) + C(u, u) = 1 + P(U1 > u, U2 > u)

so that

P(U2 > u | U1 > u) = P(U1 > u, U2 > u)/P(U1 > u) = (1 − 2u + C(u, u))/(1 − u)

from which the claim follows. □

In the exercises below and in the mandatory assignments, we will see examples of computing
the coefficient of upper tail dependence. We now introduce the corresponding notion for the
lower tail.

Definition 11.10. The coefficient of lower tail dependence of X1 and X2 is given by

λL(X1, X2) = lim_{u↓0} P(X2 ≤ F2^←(u) | X1 ≤ F1^←(u)).

We have an analogous result to the one above for continuous marginal distribution functions.

Proposition 11.11. Assume X1 and X2 have continuous distribution functions F1 and F2
and unique copula C. Then

λL(X1, X2) = lim_{u↓0} C(u, u)/u.

Proof. The proof is essentially a simpler version of the one given for the coefficient of upper
tail dependence. We compute

λL(X1, X2) = lim_{u↓0} P(X2 ≤ F2^←(u) | X1 ≤ F1^←(u)) = lim_{u↓0} P(F2(X2) ≤ u | F1(X1) ≤ u)
= lim_{u↓0} P(U2 ≤ u | U1 ≤ u) = lim_{u↓0} P(U2 ≤ u, U1 ≤ u)/P(U1 ≤ u) = lim_{u↓0} C(u, u)/u

as desired. □

Example 11.12. Let (X1, X2) ∼ N(0, Σ) with Σ the 2×2 matrix with unit diagonal and
off-diagonal entries ρ. Assume ρ ∈ (−1, 1) (the cases ρ = ±1 are left as exercises). We want
to show that the coefficient of upper tail dependence λU(X1, X2) is zero. Consider a t > 1
such that tρ < 1 and make the bound

P(X2 > F2^←(u) | X1 > F1^←(u)) ≤ P(X2 > F2^←(u), X1 ≤ tF1^←(u) | X1 > F1^←(u))
+ P(X1 > tF1^←(u) | X1 > F1^←(u)).

Consider the second term first and let v := F1^←(u). We have by L'Hôpital's rule

lim_{u↑1} P(X1 > tF1^←(u) | X1 > F1^←(u)) = lim_{v→∞} P(X1 > tv)/P(X1 > v)
= lim_{v→∞} ∫_{tv}^∞ e^{−x²/2} dx / ∫_v^∞ e^{−x²/2} dx = lim_{v→∞} t e^{−(tv)²/2} / e^{−v²/2} = 0

since t > 1. We now consider the limit of the first term, i.e.

lim_{u↑1} P(X2 > F2^←(u), X1 ≤ tF1^←(u) | X1 > F1^←(u)) = lim_{v→∞} ∫_v^{tv} P(X2 > v | X1 = x) dΦ(x) / ∫_v^∞ dΦ(x).

We now use that X2 | X1 = x ∼ N(ρx, 1 − ρ²). Letting Y ∼ N(ρx, 1 − ρ²) and setting
Z = (Y − ρx)/√(1 − ρ²) ∼ N(0, 1), we can rewrite

P(X2 > v | X1 = x) = P(Z > (v − ρx)/√(1 − ρ²)).

Consider x ∈ [v, tv]. If x = v, then v − ρx = (1 − ρ)v > 0. Similarly, if x = tv, we have (due
to the assumption tρ < 1) that v − ρx = (1 − tρ)v > 0. We conclude that

lim_{v→∞} P(Z > (v − ρx)/√(1 − ρ²)) = 0

uniformly for x ∈ [v, tv]. It follows that the limit of the first term goes to zero as well, which
establishes λU(X1, X2) = 0. We refer to this property as asymptotic independence in the
tails. No matter how large the correlation ρ is in the range (−1, 1), if we go far enough into
the tails, extreme events occur independently in X1 and X2. This illustrates a potential
problem with the Gaussian copula for modelling financial data. In such data, large losses in
different variables are often correlated, and the normal copula may fail to capture this. ◦
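The asymptotic independence is visible in simulation. The following sketch (my own, not from the notes) estimates the conditional exceedance probability P(X2 > q_u | X1 > q_u) for a bivariate normal and shows it shrinking as u ↑ 1:

```python
import numpy as np

def cond_exceedance(rho, u, n=200_000, seed=0):
    """Monte Carlo estimate of P(X2 > q_u | X1 > q_u), where q_u is the
    common u-quantile of the N(0,1) margins."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    q = np.quantile(z1, u)      # both margins are N(0,1)
    exceed = z1 > q
    return np.mean(z2[exceed] > q)

p90 = cond_exceedance(0.7, 0.90)
p99 = cond_exceedance(0.7, 0.99)   # smaller: the dependence dies out in the tail
```

Even with correlation 0.7, the conditional exceedance probability at the 99% level is noticeably smaller than at the 90% level, consistent with λU = 0.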

Notes and comments

The book by Nelsen, [19], contains all the information about copulas that we use in this
course. The interested reader can find many supplementary examples of computations of
the different rank correlations. Most of the proofs of the results this week can be found
there as well, and if not, references are given.

Exercises
Exercise 5.1:

1) Recall that the Clayton copula has generator

ϕ(t) = (t^{−θ} − 1)/θ.

Verify that ρτ = θ/(θ + 2).

2) Recall that the Gumbel copula has generator

ϕ(t) = (−log t)^θ.

Verify that ρτ = 1 − 1/θ.

Exercise 5.2:
Recall the Clayton copula,

C_θ^{Cl}(u1, u2) = (u1^{−θ} + u2^{−θ} − 1)^{−1/θ},  0 < θ < ∞,

and the Gumbel copula,

C_θ^{Gu}(u1, u2) = exp(−((−log u1)^θ + (−log u2)^θ)^{1/θ}),  1 ≤ θ < ∞.

1) Compute λU for the Clayton copula.

2) Compute λU for the Gumbel copula.

Exercise 5.3:
Prove the claim in Remark 11.6.

Exercise 5.4:
Let (X1, X2) be comonotone. Show that λU(X1, X2) = 1.

Exercise 5.5:
Consider the Archimedean copula with generator

ϕ(t) = log((1 − θ(1 − t))/t)

for a parameter θ ∈ [−1, 1).

1) Write the copula explicitly.

2) Compute λU for this copula.
Week 6 - Credit risk I

12 Portfolio credit risk

Introduction and setup
Credit risk is the risk associated with loans and other obligations, namely the risk that a
financial party is not able to pay what it owes to another party (for a loan, this is called
default). Simply put, we have a situation where a financial institution (such as a bank)
called the lender lends money to another party called the obligor:

Lender −→ Obligor

In this course, we consider a very simple case. We assume the following:

• We have a one-period model. Namely, we will consider the loss that occurs over a
single timestep such as a year, a month etc.
• We have n total loans.
• We have a probability of default pi for loan i. pi will depend on external factors, and
these should be incorporated into the model.

Definition 12.1. The total one-period loss (for the bank, say) is given by

L = Σ_{i=1}^n Xi Li (1 − λi)

where Li is the size of the ith loan, λi is the recovery rate for the ith loan and Xi = 1 if
default and Xi = 0 if no default.

The recovery rate λi is a number in [0, 1] that describes how much the bank can
recover in case of default. If, for example, a third of the loan is already repaid by the time
of default, the bank will only lose two thirds of the loan. Before considering concrete models
for credit risk, we make two important remarks.

(i) Defaults will be dependent since defaults are often triggered by external factors, for
example increases in interest rates.
(ii) Large losses are usually not caused by one default but by a large number of defaults,
even though the individual losses are often not large.


We will study the following models:

• The Merton model/KMV model.
• The probit normal mixture model.
• The Bernoulli mixture model.
• The Poisson mixture model.

The Merton model is an example of a structural model. The three others are examples of
reduced form models.

The Merton model

Merton's model considers a firm as an obligor. The model consists of a process VA in
continuous time describing the total value of the assets of the firm by time t and a fixed
number K called the debt to be paid at time T. According to the model, VA behaves like
the assets in the Black-Scholes model:

dVA(t) = μA VA(t)dt + σA VA(t)dW(t)

where W is a standard Brownian motion. The reader can consult the rundown of the Black-
Scholes model in week 1. Since VA is a geometric Brownian motion, we have the explicit
solution for the assets at time T in terms of the value at the current time t, namely

VA(T) = VA(t) exp((μA − σA²/2)(T − t) + σA(W(T) − W(t))).
Note that Z := W(T) − W(t) ∼ N(0, T − t). Default in this model means by definition that
VA(T) < K. We can rewrite this as follows:

VA(T) < K ⇔ log VA(t) + (μA − σA²/2)(T − t) + σA Z < log K
⇔ Z < (log K − log VA(t) + (σA²/2 − μA)(T − t)) / σA
⇔ Y < (log K − log VA(t) + (σA²/2 − μA)(T − t)) / (σA √(T − t))
where Y := Z/√(T − t) ∼ N(0, 1). Note that in the above expression, Y is the only source
of randomness since the value VA(t) is known at time t. We summarise our findings in the
following definition.

Definition 12.2. In the above setup of Merton's model, the quantity

DD := −(log K − log VA(t) + (σA²/2 − μA)(T − t)) / (σA √(T − t))

is called the distance to default. The probability of default can thus be written

P(Default) = P(Y < −DD)

with Y ∼ N(0, 1).
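The distance to default is a one-line computation; here is a small sketch (names my own, standard library only):

```python
from math import log, sqrt
from statistics import NormalDist

def merton_default_prob(V_t, K, mu_A, sigma_A, tau):
    """Distance to default and P(Default) = Phi(-DD) in Merton's model,
    where tau = T - t is the time to maturity."""
    dd = -(log(K) - log(V_t) + (sigma_A**2 / 2 - mu_A) * tau) / (sigma_A * sqrt(tau))
    return dd, NormalDist().cdf(-dd)
```

For example, a firm with assets 100, debt 80, drift 8% and asset volatility 20% over one year has a default probability of a few percent; raising the debt level raises it.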

To estimate P(Default), we need to estimate DD. One problem is that VA is not observable.
It is hard to put a number on the value of every asset of a company (buildings, furniture,
machines, patents etc.). However, the equity VE is observable. We define the equity at time
T to be

VE(T) = (VA(T) − K)^+

since if K > VA(T) the company defaults and the equity is zero. We recognise the above as
a call option with strike K. If we also assume a constant interest rate r, we can apply the
Black-Scholes formula,

VE(t) = VA(t)Φ(z) − K e^{−r(T−t)} Φ(y)

where

z = (log VA(t) − log K + (r + σA²/2)(T − t)) / (σA √(T − t)),  y = z − σA √(T − t).
This formula relates VA to VE. A specific form of the Merton model used in the industry
is the KMV model. In this model, one estimates the volatility of VE, σE = g(VA(t), σA, r).
Given μA and σA, the firm estimates the DD and hence the probability of default. All these
estimates may be very unreliable, and so in the actual KMV procedure, further steps are
taken, namely:

• Consider n other firms.
• Calculate DDi, i = 1, ..., n, where DDi is the distance to default for firm i.
• Compare with past empirical observations with the same DDi. Use the empirical
frequency of default from these past loans and compare with the predictions from the
estimates.

The description of the KMV model above is vague on purpose. Since it is an industry model,
the exact procedures are not public and may change from firm to firm.

The Merton model we have considered so far only involves one firm, but the model is easily
extended to an arbitrary number of firms.

The multivariate Merton model

In the multivariate Merton model, we consider n firms with asset processes

dVA,i(t) = μA,i VA,i(t)dt + VA,i(t) Σ_{j=1}^m σA,i,j dWj(t),  i = 1, ..., n

with m independent Brownian motions W1, ..., Wm. Note that all the Brownian motions
appear in all the asset processes. This makes the VA,i dependent. One can think of the
Brownian motions as underlying risk factors (for example fluctuations in interest rates).
Letting σA,i² = Σ_{j=1}^m σA,i,j², we can explicitly solve for each VA,i and obtain

VA,i(T) = VA,i(t) exp((μA,i − σA,i²/2)(T − t) + Σ_{j=1}^m σA,i,j(Wj(T) − Wj(t)))

which is very reminiscent of the univariate case. Letting

Zi = Σ_{j=1}^m σA,i,j(Wj(T) − Wj(t)),

we have Zi ∼ N(0, σA,i²(T − t)). Like in the one-dimensional case, we can solve for Zi and
get

Zi = log VA,i(T) − log VA,i(t) + (σA,i²/2 − μA,i)(T − t).

The ith firm defaults if VA,i(T) < K, where K is some threshold which we think of as the
debt of the company. We can rewrite VA,i(T) < K as

Yi := Zi/(σA,i √(T − t)) < (log K − log VA,i(t) + (σA,i²/2 − μA,i)(T − t)) / (σA,i √(T − t)) =: −DDi.

DDi is the distance to default for the ith firm, and hence default for company i means
Yi < −DDi. Since Yi ∼ N(0, 1), we have

P(Default for company i) = P(Yi < −DDi) = Φ(−DDi).

Note that the Yi are dependent. We want to describe this dependence. Note that we can
write

Yi = (1/(σA,i √(T − t))) Σ_{j=1}^m σA,i,j(Wj(T) − Wj(t)) = Σ_{j=1}^m (σA,i,j/σA,i) · (Wj(T) − Wj(t))/√(T − t)

so the Yi are weighted sums of the same iid N(0, 1)-variables. More generally, we can observe
that the Yi have the form

Yi = Σ_{j=1}^m c_{i,j} Rj

for constants c_{i,j} specific to the ith loan and R1, ..., Rm iid N(0, 1). This leads us to consider
models for the Yi called factor models. In a factor model, we assume that the Yi have the
form

Yi = Σ_{j=1}^k a_{ij} Uj + bi Wi.

The Uj are common stochastic factors that affect all loans, and we assume that U1, ..., Uk
are iid N(0, 1). The variable Wi is a firm-specific stochastic factor, and the a_{ij} and bi are
constants. In order to estimate the default probability, we will need estimates of a_{ij}, bi and
DDi. This is not an easy task. These quantities are not observable from data, and hence we
cannot apply ordinary statistical methods to estimate them. Instead, one usually divides
the loans into different "classes" where, for a specific class, a_{ij} and bi do not depend
on i but only on the class. This means that in a specific class, we have

Yi = Σ_{j=1}^k aj Uj + b Wi.

A standard approach at this point is to normalise the constants so that Σ_{j=1}^k aj² + b² = 1.
Then

Σ_{j=1}^k aj Uj ∼ N(0, ‖a‖²),  a = (a1, ..., ak),

from which it follows that Yi =d ‖a‖Z + bWi for Z ∼ N(0, 1). Since ‖a‖² + b² = 1, we can
write ‖a‖ = √ρ and b = √(1 − ρ) for some ρ ∈ (0, 1). Yi can thus be written as

Yi =d √ρ Z + √(1 − ρ) Wi.

A model with Yi of this form is called a probit normal mixture model.

Estimating VaR in the probit normal mixture model

Let Z, W1, ..., Wn be iid N(0, 1) and Yi = √ρ Z + √(1 − ρ) Wi for ρ ∈ (0, 1). Default for the
ith loan means Yi < di for some threshold di (earlier denoted by −DDi). Now let Xi = 1 if
Yi < di and Xi = 0 if Yi ≥ di. Then Nn = Σ_{i=1}^n Xi is the total number of defaults in the
portfolio of loans. We focus on the number of defaults and not the total loss, and the goal
is to compute/estimate the VaR. We carry out this computation in several steps. Define
pi(Z) = P(Xi = 1 | Z), i.e. the probability that the ith loan defaults given Z. Then

pi(Z) = P(Yi < di | Z) = P(√ρ Z + √(1 − ρ) Wi < di | Z)
= P(Wi < (di − √ρ Z)/√(1 − ρ) | Z) = Φ((di − √ρ Z)/√(1 − ρ))

since Wi ∼ N(0, 1). As Z =d −Z, we can rewrite the above to

Φ((di − √ρ Z)/√(1 − ρ)) =d Φ(di/√(1 − ρ) + (√ρ/√(1 − ρ)) Z) = Φ(ai + bZ)

where

ai = di/√(1 − ρ),  b = √ρ/√(1 − ρ).
Our goal is to compute VaRα for Nn. To do so, we first compute the VaR for pi(Z). This
amounts to solving the equation 1 − α = P(pi(Z) ≥ xi) for xi:

1 − α = P(pi(Z) ≥ xi) = P(Φ(ai + bZ) ≥ xi) = P(ai + bZ ≥ Φ^{−1}(xi))
= P(Z ≥ (Φ^{−1}(xi) − ai)/b) = 1 − Φ((Φ^{−1}(xi) − ai)/b),

hence we need to solve

α = Φ((Φ^{−1}(xi) − ai)/b) ⇔ Φ^{−1}(α) = (Φ^{−1}(xi) − ai)/b ⇔ ai + bΦ^{−1}(α) = Φ^{−1}(xi),

which yields the final answer xi = Φ(ai + bΦ^{−1}(α)). Hence

VaRα(pi(Z)) = Φ(ai + bΦ^{−1}(α)) = Φ(di/√(1 − ρ) + √(ρ/(1 − ρ)) Φ^{−1}(α)).

Let us make the simplifying assumption that di = d for all i, i.e. that the distances to default
are identical for all loans. This also implies that pi(Z) is the same for all i since the Yi are
iid conditional on Z. This allows us to write p(Z) without the subscript i. Furthermore, we
will write Nn(Z) to stress that Nn depends on Z. We claim that, conditional on Z,

Nn(Z)/n →P p(Z)

where →P denotes convergence in probability. See the appendix for a review if necessary.
To be precise, we claim that for every ε > 0,

P(|Nn(Z)/n − p(Z)| > ε | Z) → 0

uniformly in Z. Let ε > 0 be given. We start by noting that

p(Z) = P(Xi = 1 | Z) = E[Xi | Z] = E[Nn(Z)/n | Z].

This comes from the fact that now, Xi = 1 if Yi < d and Xi = 0 if Yi ≥ d, which implies
that the Xi are identically distributed given Z, so that

E[X1 | Z] = E[X2 | Z] = · · · = E[Xn | Z]

and thus

E[Nn(Z) | Z] = Σ_{j=1}^n E[Xj | Z] = nE[Xi | Z]

for any i. The rest of the proof is a straightforward application of Chebyshev's inequality.
We get

P(|Nn(Z)/n − p(Z)| > ε | Z) = P(|Nn(Z) − E[Nn(Z) | Z]| > nε | Z)
≤ Var(Nn(Z) | Z)/(n²ε²) = p(Z)(1 − p(Z))/(nε²).

The final inequality is a consequence of the Xi being iid given Z (since this is true for
the Yi) and from observing that, conditional on Z, Nn(Z) has a binomial distribution with
parameters n and p(Z). The above converges to zero uniformly in Z, which proves the
claim. We are now close to the goal of providing a formula for estimating the VaR. Assume
f(x) := P(p(Z) > x) is continuous. Then we have for α ∈ (0, 1) that

1 − α = P(p(Z) > VaRα(p(Z)))

and using the result Nn(Z)/n →P p(Z), we have the approximation

1 − α ≈ P(Nn(Z)/n > VaRα(p(Z)))

which yields the Basel formula

VaRα(Nn(Z)) ≈ n VaRα(p(Z)) = nΦ(d/√(1 − ρ) + √(ρ/(1 − ρ)) Φ^{−1}(α)).
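The Basel formula is one line of code. The sketch below (my own names, scipy assumed available) also checks it against a direct simulation of the probit normal mixture model, drawing Nn | Z ~ Bin(n, p(Z)):

```python
import numpy as np
from scipy.stats import norm

def basel_var(n, d, rho, alpha):
    """Asymptotic approximation VaR_alpha(N_n) ~ n * Phi(d/sqrt(1-rho)
    + sqrt(rho/(1-rho)) * Phi^{-1}(alpha))."""
    return n * norm.cdf(d / np.sqrt(1 - rho)
                        + np.sqrt(rho / (1 - rho)) * norm.ppf(alpha))

def mc_var_defaults(n, d, rho, alpha, sims=200_000, seed=0):
    """Simulate the model: p(Z) = Phi((d - sqrt(rho)*Z)/sqrt(1-rho)),
    then N_n | Z ~ Binomial(n, p(Z)); return the empirical alpha-quantile."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(sims)
    pZ = norm.cdf((d - np.sqrt(rho) * Z) / np.sqrt(1 - rho))
    Nn = rng.binomial(n, pZ)
    return np.quantile(Nn, alpha)
```

For a large portfolio the two numbers should be of the same order, the simulated VaR being slightly larger because the binomial fluctuations have not yet averaged out.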

Notes and comments

See section 10.3 in [17] for more on the Merton model. Section 10.1 gives an informal
introduction to credit risk.

Exercises
Exercise 6.1:
Upgrade the statement Nn(Z)/n →P p(Z) to Nn(Z)/n → p(Z) almost surely (conditional
on Z). Hint: Use the Markov inequality and A.2.27. The fourth central moment of the
binomial distribution with parameters n and p is given by

np(1 − p)(1 + (3n − 6)p(1 − p)).

Week 7 - Credit risk II and further topics

13 Portfolio credit risk continued

The Bernoulli mixture model
In this model, we assume common factors Z = (Z1, ..., Zm). These could for example be
economic factors such as interest rates. We again let pi(Z) = P(Xi = 1 | Z) with Xi = 1
when company i defaults and Xi = 0 otherwise, just like before. We assume that the defaults
(i.e. the Xi) are independent given Z. We again let Nn = Σ_{i=1}^n Xi and note that the Xi
are Bernoulli variables conditional on Z with success probabilities pi(Z). When the pi(Z)
coincide, as in the one-factor model below, this also implies that Nn is binomial with
parameters n and p(Z) conditional on Z.

We will consider a simplified version of this model, namely a so-called one-factor model. In
such a model, we assume that Z is one-dimensional (so we write Z instead of the vector Z)
and that pi(Z) = p(Z) is the same for all i. Note that this model is a generalisation of the
probit normal mixture model. Indeed, the probit normal mixture model has this exact setup
but with a specific Xi, namely Xi = 1 if √ρ Z + √(1 − ρ) Wi < d and Xi = 0 if
√ρ Z + √(1 − ρ) Wi ≥ d, for Z, W1, ..., Wn iid and standard normal, as we saw last week.
Returning to the one-factor Bernoulli mixture model, we have

P(Nn = k | Z) = (n choose k) p(Z)^k (1 − p(Z))^{n−k}

as was also noted before. If Z has distribution function G, we can write

P(Nn = k) = (n choose k) ∫_R p(z)^k (1 − p(z))^{n−k} dG(z).

What are choices of G and p(Z) that make the above expression mathematically tractable?
Consider the special case Z ∼ Beta(a, b) and p(Z) = Z. This model is called the Beta
mixture model. The density of Z is

g(z) = z^{a−1}(1 − z)^{b−1}/β(a, b),  0 ≤ z ≤ 1,  where

β(a, b) = ∫_0^1 z^{a−1}(1 − z)^{b−1} dz = Γ(a)Γ(b)/Γ(a + b).
With these assumptions, we can compute P(Nn = k) explicitly as follows:

P(Nn = k) = (n choose k) ∫_0^1 z^k (1 − z)^{n−k} g(z) dz = (n choose k) (1/β(a, b)) ∫_0^1 z^{k+a−1}(1 − z)^{n−k+b−1} dz
= (n choose k) β(a + k, b + n − k)/β(a, b).

Since we now have an explicit expression for the distribution of Nn, we can compute all sorts of
risk measures such as VaR, ES etc. explicitly as well. While this is a nice property of the
model, we stress that the motivation for choosing Z ∼ Beta(a, b) and p(Z) = Z is purely
mathematical. Nevertheless, we continue our study of the model. How do we estimate a
and b? While a and b are not directly observed from data, we can relate them to quantities
that are observed. We know the number of defaults that have occurred at a given time, which
allows us to estimate the probability of default p, at least in a portfolio of homogeneous
loans (think for example of a portfolio consisting only of AAA rated loans, only B rated
loans etc.). We can also estimate the linear correlation ρL of the Xi. We now determine
the relations between these quantities and a and b in the Beta mixture model. We have

p = P(Default) = E[P(Default | Z)] = E[p(Z)] = E[Z] = a/(a + b),
ρL = Cov(Xi, Xj)/Var(Xi) = E[(Xi − p)(Xj − p)]/(p(1 − p)) = (E[Xi Xj] − p²)/(p(1 − p))

and

E[Xi Xj] = P(Xi Xj = 1) = P(Xi = 1, Xj = 1) = E[P(Xi = 1, Xj = 1 | Z)]
= E[P(Xi = 1 | Z)P(Xj = 1 | Z)] = E[Z²] = a(a + 1)/((a + b)(a + b + 1)).

We hence have two equations in two unknowns. Solving these yields the relations

a = p (1 − ρL)/ρL,  b = (1 − p)(1 − ρL)/ρL.

To summarise: we can estimate a and b from data by first estimating the default probability
and the linear correlation from the data and then applying the formulas above.

The Poisson mixture model

The Poisson mixture model is different from the previous models in the sense that the
Xi can attain infinitely many values, namely Xi ∈ {0, 1, ...}. We assume that the Xi are
independent given Z and that they have a Poisson distribution conditionally on Z, namely

P(Xi = k | Z) = ((λi(Z))^k / k!) e^{−λi(Z)}

for some function λi. Just like earlier, Z is a collection of m variables which we interpret
as some underlying factors. We think of Xi = 1 as a default of the ith loan and Xi = 0 as

no default. The other events, Xi = k for k ≥ 2, do not have a natural interpretation, but
we choose λi small
Pn enough so that Xi > 1 happens with very low probability. We again
consider Nn = i=1 Xi , i.e. the number of defaults (at least for small λi ). We will consider
a special case of the Poisson mixture model which goes under the name CreditRisk+ . In
CreditRisk+ , we assume the following:
ˆ Z1 , ..., Zm are independent.
ˆ Zj ∼ Γ(αj , βj−1 ) with αj βj = 1.
Pm Pm
ˆ λi (Z) = λi j=1 aij Zj with aij ≥ 0 and j=1 aij = 1. Here λi > 0 is a constant for
each i.
The assumption αj βj = 1 implies that E[Zj ] = 1. With the above assumptions, we have
E[λi (Z)] = λi which gives us control over the λi functions. We need to choose them small.
With the above assumptions, it is possible to derive the distribution of Nn which is also a
motivation for the model. In order to do so, we need to introduce some theory.
Definition 13.1. For a discrete random variable Y with values in {0, 1, ...}, the function

gY (t) = E[tY ]

is called the probability-generating function of Y .
Remark 13.2. Notice the relation gY(t) = E[e^{Y log t}] = κY(log t) with κY the moment-
generating function of Y. This relation implies that the probability-generating function (if
it exists) determines the distribution of Y.

Example 13.3. Let N be Poisson distributed with parameter λ > 0. Then

κN(t) = e^{λ(e^t − 1)},  t ∈ R.

Hence the probability-generating function is

gN(t) = e^{λ(t−1)}.
Example 13.4. Let N have a negative binomial distribution with parameters r and p, i.e.

P(N = k) = (k + r − 1 choose k)(1 − p)^k p^r,  k = 0, 1, 2, ...

We leave it as an exercise for the reader to verify that the probability-generating function
of N is given by

gN(t) = (p/(1 − (1 − p)t))^r  for |t| < 1/(1 − p).

Theorem 13.5. With the assumptions of the CreditRisk+ model, we have

gNn(t) = ∏_{j=1}^m ((1 − δj)/(1 − δj t))^{αj},  where δj = βj Σ_{i=1}^n λi a_{ij} / (1 + βj Σ_{i=1}^n λi a_{ij}).

Proof. The proof is essentially a computational exercise in the assumptions of the CreditRisk+
model. We start by conditioning on Z. The conditional probability-generating function for
Nn given Z is

gNn|Z(t) = E[t^{Nn} | Z] = E[t^{X1+···+Xn} | Z] = ∏_{i=1}^n E[t^{Xi} | Z]
= ∏_{i=1}^n κ_{Xi|Z}(log t) = ∏_{i=1}^n e^{λi(Z)(t−1)}

where we used the conditional independence and the above example for the Poisson distri-
bution. By applying the tower property, we can remove the conditioning on Z by taking an
expectation. Let fj denote the density of Zj. Then we have by independence of the Zj that

gNn(t) = E[E[t^{Nn} | Z]] = E[∏_{i=1}^n e^{λi(Z)(t−1)}]
= ∫_0^∞ ··· ∫_0^∞ ∏_{i=1}^n e^{λi(z)(t−1)} f1(z1) ··· fm(zm) dz1 ··· dzm
= ∫_0^∞ ··· ∫_0^∞ e^{(t−1) Σ_{i=1}^n λi Σ_{j=1}^m a_{ij} zj} f1(z1) ··· fm(zm) dz1 ··· dzm
= ∫_0^∞ ··· ∫_0^∞ e^{(t−1) Σ_{j=1}^m Σ_{i=1}^n λi a_{ij} zj} f1(z1) ··· fm(zm) dz1 ··· dzm.
For the sake of simplicity, let μj = Σ_{i=1}^n λi a_{ij}. Then we can continue the computation as
follows:

gNn(t) = ∫_0^∞ ··· ∫_0^∞ e^{(t−1) Σ_{j=1}^m μj zj} f1(z1) ··· fm(zm) dz1 ··· dzm
= ∫_0^∞ e^{(t−1)μ1 z1} f1(z1) dz1 ··· ∫_0^∞ e^{(t−1)μm zm} fm(zm) dzm
= ∏_{j=1}^m ∫_0^∞ e^{(t−1)μj z} fj(z) dz = ∏_{j=1}^m ∫_0^∞ e^{(t−1)μj z} z^{αj−1} e^{−z/βj} / (βj^{αj} Γ(αj)) dz
= ∏_{j=1}^m ∫_0^∞ (1/(βj^{αj} Γ(αj))) z^{αj−1} e^{−z(βj^{−1} − (t−1)μj)} dz.

We now compute each integral (denoted by Ij) in the product:

Ij = ∫_0^∞ (1/(βj^{αj} Γ(αj))) z^{αj−1} e^{−z(βj^{−1} − (t−1)μj)} dz
= ((βj^{−1} − (t−1)μj)^{−αj} / βj^{αj}) ∫_0^∞ ((βj^{−1} − (t−1)μj)^{αj} / Γ(αj)) z^{αj−1} e^{−z(βj^{−1} − (t−1)μj)} dz
= 1/(βj^{αj} (βj^{−1} − (t−1)μj)^{αj}) = 1/(1 − (t−1)βj μj)^{αj} = ((1 − δj)/(1 − δj t))^{αj}

by noting that δj as defined in the theorem is given by

δj = βj μj / (1 + βj μj).

Plugging this expression back into the one for gNn(t) completes the proof. □

Remark 13.6. Note that the terms

((1 − δj)/(1 − δj t))^{αj}

are probability-generating functions for negative binomial variables with parameters p =
1 − δj and r = αj.
In principle, one can invert gNn. There is a whole literature dedicated to inverting moment-
and probability-generating functions. We will not pursue this here. We will only mention
that one can make a very crude approximation that relies on Markov's inequality, namely

P(Nn > k) ≤ E[t^{Nn}]/t^k = gNn(t)/t^k

for every t > 1 (within the domain of gNn). One can then minimise this expression over t.
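Theorem 13.5 and the Markov bound combine into a small numerical sketch (my own names); the bound is minimised over a grid of t in (1, 1/max_j δj), where gNn is finite:

```python
import numpy as np

def pgf_Nn(t, lam, a, alpha, beta):
    """g_{N_n}(t) = prod_j ((1-delta_j)/(1-delta_j*t))**alpha_j, with
    delta_j = beta_j*mu_j/(1+beta_j*mu_j) and mu_j = sum_i lam_i*a_ij."""
    mu = lam @ a                          # mu_j = sum_i lam_i * a_ij
    delta = beta * mu / (1 + beta * mu)
    return np.prod(((1 - delta) / (1 - delta * t)) ** alpha)

def markov_tail_bound(k, lam, a, alpha, beta, gridsize=400):
    """Minimise g_{N_n}(t)/t**k over t in (1, 1/max_j delta_j); this is an
    upper bound for P(N_n > k)."""
    mu = lam @ a
    delta = beta * mu / (1 + beta * mu)
    ts = np.linspace(1.0, 0.999 / delta.max(), gridsize)
    return min(pgf_Nn(t, lam, a, alpha, beta) / t**k for t in ts)
```

Sanity checks: gNn(1) = 1 for any parameters, and the bound decreases in k.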

14 Operational risk
Operational risk can be stated as "loss from failed internal processes, people or systems
or from external events". To elaborate a bit on this, we can roughly divide such risks into
categories. One category is repetitive human errors or repetitive operational risks (repetitive
OR). These include IT failures, errors in settlements of transactions, litigation and the like.
Other types of losses include fraud and external events such as flooding, fires, earthquakes
and terrorism (although the latter is extremely hard to model). A difficulty in operational
risk is that we often have little data available, and the data is often heavy-tailed. The claim
arrivals can also be hard to model since they often occur randomly in time (and often in
clusters) and since the frequency changes over time. One can for example imagine that a
large traded volume leads to a large number of back office errors.
Approaches in analyzing operational risk


We will now discuss the basics of two approaches in analyzing operational risk. These are

ˆ The basic indicator approach.

ˆ The advanced measurement approach.

Under the basic indicator (BI) approach, the capital requirement to cover OR (operational
risk) losses at time n is given by
3
1 X
RCnBI (OR) = α max(GIn−i , 0),
Zn i=1
74

where α ≈ 0.15 is a constant,

Zn = #years where GIn−i > 0, i = 1, 2, 3,

and GI_s denotes the "gross income" at time s. Under the advanced measurement (AM) approach, we divide into K lines of business (typically K = 8, and lines include for example corporate finance, trading and sales). The capital requirement to cover OR losses is given by

RCn^{AM}(OR) = Σ_{b=1}^{K} ρα(L_{n,b}),

where 0.99 ≤ α ≤ 0.999 and ρα is a risk measure such as VaRα or ESα .
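To make the basic indicator formula concrete, here is a minimal Python sketch. The function name and the convention of returning zero when no year has positive gross income are my own choices, not part of the regulatory text.

```python
def rc_bi(gross_incomes, alpha=0.15):
    """Basic indicator charge: alpha times the average of the positive gross
    incomes among the last three years [GI_{n-1}, GI_{n-2}, GI_{n-3}].
    Years with GI <= 0 drop out of both the sum and the divisor Z_n."""
    positive = [gi for gi in gross_incomes if gi > 0]
    if not positive:  # Z_n = 0: no positive years; the charge is taken as zero here
        return 0.0
    return alpha * sum(positive) / len(positive)

charge = rc_bi([120.0, -30.0, 90.0])  # Z_n = 2, charge = 0.15 * (120 + 90) / 2
```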

Mathematical estimates
In this subsection, we will investigate methods to analyze the loss via a stochastic process
approach. Let us start by recalling the definition of a Poisson process.
Definition 14.1. A stochastic process {Nt } is called a Poisson process with intensity λ > 0 if Nt takes values in {0, 1, 2, ...} and

(i) P(Nh ≥ 1) = λh + o(h),
(ii) P(Nh ≥ 2) = o(h) and
(iii) {Nt } has stationary and independent increments.

Recall that stationarity of the increments means that for every s ≤ t, Xt − Xs has the same distribution as Xt−s . By independent increments, we mean that for every finite partition 0 < t1 < t2 < · · · < tk , the variables {X_{t_{i+1}} − X_{t_i}}_{i=1}^{k−1} are independent. Intuitively, we think of a Poisson process as a claim number process which satisfies

P(1 claim in [t, t + h]) = λh + o(h),   P(≥ 2 claims in [t, t + h]) = o(h).

We will now model the loss of the company at time t by the process

Lt = Σ_{i=1}^{Nt} Xi

with {Nt } a Poisson process with intensity λ independent of the iid sequence {Xi }. We will discuss the following ideas based on risk theory:

• Laplace transform method.
• Panjer recursion.
• A sophisticated large deviation approach based on the "Arwedson approximation" from risk theory.
• Time-dependent intensity.
• Stochastic processes for market risk.

Let us first discuss the Laplace transform method. Recall that the Laplace transform of a random variable Y is given by ψY (s) = E[e^{−sY}]. We can compute the Laplace transform of the loss using the tower property as follows:

ψLt (s) = E[e^{−sLt}] = E[E[e^{−s Σ_{i=1}^{Nt} Xi} | Nt ]] = E[ψX (s)^{Nt}],

where we have defined ψX (s) = E[e^{−sX1}]. Using that Nt is Poisson(λt) distributed, we continue the computation and get

ψLt (s) = Σ_{n=0}^{∞} ψX (s)^n ((λt)^n/n!) e^{−λt} = e^{−λt} Σ_{n=0}^{∞} (ψX (s)λt)^n/n! = e^{λt(ψX (s)−1)}.

We should note that all the Laplace transforms exist since we are working with non-negative random variables. After obtaining the Laplace transform, numerical inversion techniques can be applied. The method is more flexible than this, however. To illustrate this, consider a Poisson intensity that changes over time. To make this concrete, assume we are considering the time interval [0, 2], and we have Poisson processes N1 with intensity λ1 on [0, 1] and N2 with intensity λ2 on (1, 2], along with two claim sequences {Xi^{(1)}} and {Xi^{(2)}} belonging to each interval. The total loss is obtained by summing the losses from each interval, i.e. L = L1 + L2, where L1 and L2 are assumed to be independent. Then

ψL (s) = E[e^{−s(L1+L2)}] = E[e^{−sL1}] E[e^{−sL2}] = ψL1 (s) ψL2 (s) = e^{λ1(ψ_{X^{(1)}}(s)−1)} e^{λ2(ψ_{X^{(2)}}(s)−1)},

and we can invert this function to obtain the distribution. For those familiar with mixture distributions, the above Laplace transform is that of a compound Poisson sum

Σ_{i=1}^{Ñt} Yi ,   t ∈ [0, 1],

with {Ñt } a Poisson process with intensity λ1 + λ2 and Yi mixture distributed with distribution function

G(y) = (λ1/(λ1 + λ2)) G1(y) + (λ2/(λ1 + λ2)) G2(y),

where X1^{(1)} ∼ G1 and X1^{(2)} ∼ G2. In practice, one often observes E[Nt ] < Var(Nt ), a phenomenon called overdispersion. To remedy this, one can use a mixed Poisson process {Nt } defined by Nt = Ñ_{Λt}, where {Ñt } is a Poisson process with intensity 1 and Λ > 0 is a random variable.

Example 14.2. Choosing Λ ∼ Γ(α, β) leads to the so-called negative binomial process. We
let the reader verify that Nt = ÑΛt is indeed negative binomial distributed. ◦
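A small simulation sketch (with illustrative parameters of my own choosing) shows the overdispersion that the Gamma mixing introduces: for Λ ∼ Γ(α, β) with rate β, one has E[Nt ] = αt/β but Var(Nt ) = αt/β + α(t/β)², so the variance clearly exceeds the mean.

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson sample via cdf inversion (fine for moderate lam)."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    s = p
    while s < u:
        k += 1
        p *= lam / k
        s += p
    return k

def mixed_poisson_draw(alpha, beta, t, rng):
    """N_t = tilde N_{Lambda t} with Lambda ~ Gamma(alpha, rate beta):
    given Lambda, N_t is Poisson(Lambda * t)."""
    lam = rng.gammavariate(alpha, 1.0 / beta)  # gammavariate takes the scale
    return poisson_draw(lam * t, rng)

rng = random.Random(0)
sample = [mixed_poisson_draw(2.0, 1.0, 3.0, rng) for _ in range(20000)]
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
# Theory here: E[N_t] = 6 and Var(N_t) = 24, so var should be about four times mean.
```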

We now leave the world of Laplace transforms and discuss the next topic, namely Panjer recursion. For this technique to be applicable, assume Nt satisfies the recursion

qn := P(Nt = n) = (a + b/n) q_{n−1},   n = 1, 2, ...

for constants a and b. We also assume that Xi ∈ {1, 2, ...}. Panjer recursion yields the exact recursive formula for pn = P(Lt = n):

pn = Σ_{i=1}^{n} (a + bi/n) P(X1 = i) p_{n−i},   p0 = q0.

The assumption Xi ∈ {1, 2, ...} is not as restrictive as it may seem. One can always scale the values as necessary.
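The recursion translates directly into a few lines of code. The sketch below checks itself against the Poisson case (a = 0, b = λ): taking Xi ≡ 1 makes Lt = Nt , so the recursion must reproduce the Poisson probabilities.

```python
import math

def panjer(a, b, q0, claim_pmf, n_max):
    """Panjer recursion: p_n = sum_{i=1}^{n} (a + b*i/n) P(X_1 = i) p_{n-i},
    with p_0 = q_0 and claim_pmf a dict {i: P(X_1 = i)} on {1, 2, ...}."""
    p = [q0]
    for n in range(1, n_max + 1):
        s = 0.0
        for i, f in claim_pmf.items():
            if 1 <= i <= n:
                s += (a + b * i / n) * f * p[n - i]
        p.append(s)
    return p

# Poisson(lam) claim numbers: a = 0, b = lam, q0 = exp(-lam).  With X_i = 1
# surely, L_t = N_t, so p_n must equal the Poisson pmf.
lam = 2.0
p = panjer(0.0, lam, math.exp(-lam), {1: 1.0}, 10)
```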

We now consider the Arwedson approximation. For the loss process

Lt = Σ_{i=1}^{Nt} Xi ,

we consider a "small" δ ∈ (0, 1) and the probability of a large loss over the small time interval [0, δu],

ϕδ (u) := P(Lt > u for some 0 ≤ t ≤ δu).

Ultimately, we will choose δu = 1. Those familiar with classical ruin theory will immediately see the connection. But one should note that this situation is slightly different since we have no premium payments, and we are working over a finite time interval [0, δu] and not [0, ∞). The Arwedson approximation was originally developed in the study of finite-time ruin theory. In ruin theory, one studies the Cramér-Lundberg process

Ct = u + ct − Σ_{i=1}^{Nt} Xi

with u ≥ 0 the initial capital of the company, c > 0 a constant premium rate and {Xi } the insurance claim sizes. Arwedson considered the finite-time ruin probabilities

ΨK (u) = P(Ct < 0 for some 0 ≤ t ≤ Ku),

where 0 ≤ K < ∞. Under classical Cramér-Lundberg assumptions, Arwedson showed that

ΨK (u) ∼ (C_K /√u) e^{−uI(K)}   if K ≤ ρ,
ΨK (u) ∼ C e^{−Ru}              if K > ρ.

Note that the case K > ρ corresponds to the ordinary Cramér-Lundberg estimate. I(K) would nowadays be called the "large deviation rate function", which describes the exponential decay of a probability as u → ∞. R solves the equation Λ(R) := log E[e^{R(u − C1)}] = 0, where u − C1 = Σ_{i=1}^{N1} Xi − c is the claim surplus over one period (R is called the adjustment coefficient in ruin theory). Here, I(K) > R for K < ρ and ρ = (Λ′(R))^{−1} (one can show ρu is the "most likely" time of ruin).

Returning to ϕδ (u): if Xi has exponential moments (i.e. is "light-tailed"), one can show

ϕδ (u) ∼ (C(δ)/√u) e^{−uJ(δ)}   as u → ∞,

which has connections to Arwedson's original result as well as to the exponentially shifted measure from our discussion of importance sampling. If X1 is subexponential (think: heavy tails), for example if X1 is regularly varying, one can prove that

ϕδ (u) ∼ D u F̄X (u(1 − δµ))   as u → ∞,

where F̄X = 1 − FX denotes the survival function of X1. The proof relies on the concept of "one large jump" in heavy-tailed ruin problems, a concept that should be familiar to anyone who has studied ruin theory.

We now turn to the topic of time-dependent intensity. It makes sense to assume that the intensity of Nt changes over time.

Example 14.3. Assume Nt has intensity λn in the interval (n − 1, n], where

λn = F(Zn),   Zn = cZ_{n−1} + ξn ,   |c| < 1,

{ξn } is iid N(0, 1) and F is a positive function; {Zn } is a so-called AR(1) process. Zn could represent traded volume, for example, which is linked to increases in operational risk. ◦
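A quick simulation sketch illustrates the effect. The exponential link F(z) = base·e^{min(z, 4)} (the truncation is only a numerical guard for the naive sampler), the parameter values and all names below are my own illustrative choices; the point is that a random, serially dependent intensity produces overdispersed yearly claim counts.

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson sample via cdf inversion (fine for moderate lam)."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    s = p
    while s < u:
        k += 1
        p *= lam / k
        s += p
    return k

def yearly_counts(c, years, rng, base=5.0):
    """Counts with intensity lambda_n = base * exp(min(Z_n, 4)) on (n-1, n],
    where Z_n = c Z_{n-1} + xi_n is a Gaussian AR(1) process."""
    z, counts = 0.0, []
    for _ in range(years):
        z = c * z + rng.gauss(0.0, 1.0)
        counts.append(poisson_draw(base * math.exp(min(z, 4.0)), rng))
    return counts

rng = random.Random(1)
counts = yearly_counts(0.5, 20000, rng)
mean = sum(counts) / len(counts)
var = sum((x - mean) ** 2 for x in counts) / len(counts)
# With a constant intensity we would have var close to mean; here var is far larger.
```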
For time-dependent intensities of ARMA-type, similar "Arwedson" approximations can be derived. Namely, one can likewise show

ϕδ (u) ∼ D u F̄X (u(1 − δµ))   as u → ∞.

Similar insurance-based methods are potentially useful for market risk. We end this week (and this course) with a brief discussion of stochastic processes for market risk. Throughout the course, we only considered one-period models, and we assumed iid returns. Real-life data is not iid! Hence stochastic processes (time-dependent models) are called for. This is very complicated because multiple stochastic processes are usually dependent, and this is difficult to model, so most current research either considers dependent processes in one dimension or coordinatewise dependence in one-period models (but not both). A very classical model for dependence is the ARMA(p, q) model:

Xn − Σ_{j=1}^{p} φj X_{n−j} = Zn + Σ_{i=1}^{q} θi Z_{n−i},   n = 1, 2, ...

where {Zn } is an iid N(0, 1) sequence and φj , θi are constants. This is an example of a time series model. For "sufficiently small" φj and θi , we have that

Xn →d X as n → ∞,

i.e. Xn converges to a stationary distribution, where X is normally distributed. In the evolution of log-returns of stocks, one typically observes the following:

• Log-returns contain many "large" values, i.e. the data is heavy-tailed.
• Exceedances of high thresholds occur in clusters, i.e. we have dependence in the tails.
• While dependent, returns show little serial correlation.
• Absolute (or squared) returns show strong serial correlation.
• Volatility varies over time.
To address these issues, the GARCH models were introduced. The first of these models, the ARCH(1) model, was introduced by Engle, see [11]. In this model, the log-returns {Rn } satisfy

Rn² = (φ0 + φ1 R_{n−1}²) Zn²,   n = 1, 2, ...

where {Zn } is an iid N(0, 1) sequence. A more complicated model is the GARCH(1,1) model introduced by Bollerslev, see [7]. Here the log-returns {Rn } satisfy

Rn = σn Zn ,   n = 1, 2, ...

where {Zn } is iid N(0, 1) and

σn² = α0 + β1 σ_{n−1}² + α1 R_{n−1}² = α0 + σ_{n−1}² (β1 + α1 Z_{n−1}²).

Both ARCH(1) and GARCH(1,1) are examples of stochastic recursive sequences. Namely,

Vn = An V_{n−1} + Bn ,   n = 1, 2, ...

where Vn = Rn² for ARCH(1) and Vn = σn² for GARCH(1,1). Here, {(An , Bn ) : n = 1, 2, ...} is an iid sequence on (0, ∞) × R. Under certain reasonable conditions, Vn →d V, and it is natural to consider P(V > u) for large u. One can apply renewal theoretic methods (such as those presented in the course SkadeStok) to obtain

P(V > u) ∼ C u^{−R}   as u → ∞,

where R solves

Λ(R) = 0,   Λ(ξ) = log E[e^{ξ log A}].

This shows that Pareto tails characterise the decay rate. For more complex models (e.g. GARCH(p,q)), one needs to consider matrix recursions. This is currently an active research area.
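The GARCH(1,1) recursion is easy to simulate. The sketch below uses arbitrary illustrative parameters with α1 + β1 < 1; in that case E[σn²] = α0/(1 − α1 − β1) in stationarity, and the returns have kurtosis above 3, i.e. heavier tails than the Gaussian innovations, in line with the Pareto-tail result above.

```python
import random

def garch_path(alpha0, alpha1, beta1, n, rng):
    """GARCH(1,1): R_n = sigma_n Z_n with
    sigma_n^2 = alpha0 + sigma_{n-1}^2 * (beta1 + alpha1 * Z_{n-1}^2),
    i.e. the stochastic recursion V_n = A_n V_{n-1} + B_n with V_n = sigma_n^2."""
    sig2 = alpha0 / (1.0 - alpha1 - beta1)  # start at the stationary mean
    z = rng.gauss(0.0, 1.0)
    rets, vols = [], []
    for _ in range(n):
        sig2 = alpha0 + sig2 * (beta1 + alpha1 * z * z)
        z = rng.gauss(0.0, 1.0)
        rets.append(sig2 ** 0.5 * z)
        vols.append(sig2)
    return rets, vols

rng = random.Random(7)
rets, vols = garch_path(0.1, 0.1, 0.8, 100000, rng)
mean_v = sum(vols) / len(vols)     # close to alpha0/(1 - alpha1 - beta1) = 1
m2 = sum(r * r for r in rets) / len(rets)
m4 = sum(r ** 4 for r in rets) / len(rets)
kurt = m4 / (m2 * m2)              # exceeds 3: heavier tails than the normal
```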

Notes and comments


The computation in the proof of Theorem 13.5 is inspired by the one in section 12.2 of [14].
For more information about Panjer recursion, we refer to [18], section 3.3.3. In the final
discussion on operational risk, many tools from ruin theory were discussed. Ruin theory
and related tools such as renewal theory were discussed in the course SkadeStok. See [16]
for lecture notes from the last run of the course.

Exercises
Exercise 7.1:
Verify that the probability-generating function for a negative binomial variable N with
parameters r and p is given by
gN (t) = (p/(1 − (1 − p)t))^r   for |t| < 1/(1 − p).

Exercise 7.2:
Let {Ñt } be a Poisson process with intensity 1 and Λ ∼ Γ(α, β). Verify that the mixed Pois-
son process Nt = ÑΛt has a negative binomial distribution and determine the parameters.

Exercise 7.3:
Let N be a discrete random variable with N ∈ {0, 1, 2, ...}. Define qn := P(N = n) and consider the relation

qn = (a + b/n) q_{n−1},   n = 1, 2, ...

Prove that N satisfies this relation for proper choices of a and b in the following cases.

1) N Poisson distributed with intensity λ > 0.
2) N binomial distributed with parameters k and p.
3) N negative binomial distributed with parameters p and r.

One can prove that these three distributions are the only distributions satisfying this recursive relation.
Appendix A

Preliminaries

A.1 Generalised inverses


Definition A.1.1. Let h : R → R be a non-decreasing function. We define the generalised
inverse of h as
h← (t) = inf{x ∈ R : h(x) ≥ t}.
We have the convention inf ∅ = ∞.

Proposition A.1.2. For a non-decreasing function h, h← is left-continuous.

Proof. This proof is from [21]. Assume that tn ↑ t but h←(t−) := lim_{tn↑t} h←(tn) < h←(t). Then we can find x ∈ R and δ > 0 such that for all n,

h←(tn) < x < h←(t) − δ.

Since h←(tn) < x and the set {y ∈ R : h(y) ≥ tn} is upward closed with infimum h←(tn), we have h(x) ≥ tn for all n. Letting n → ∞ gives h(x) ≥ t, so by definition of h←, h←(t) ≤ x. This contradicts x < h←(t) − δ. □

The following properties of generalised inverses will be useful.

Proposition A.1.3. Let h be non-decreasing.

(i) x ≥ h← (t) if and only if h(x) ≥ t.

(ii) h is continuous if and only if h← is strictly increasing.

(iii) h(h← (t)) = t for all t if and only if h is continuous.

(iv) h is strictly increasing if and only if h← (h(x)) = x for all x.

Proof. Point (i) is left as an exercise. Consider (ii). h is non-decreasing, so any discontinuity
is a (positive) jump. Since a jump of h corresponds to a flat region for h← (make a drawing!),
the claim follows. One can make similar arguments for (iii) and (iv). For complete proofs
of these statements and many others concerning generalised inverses, consult [9]. 

Proposition A.1.4. Let F be the distribution function of the random variable X.

(i) F←(U) =d X for U ∼ Unif(0, 1), where =d denotes equality in distribution.

(ii) If F is continuous, F(X) =d U for U ∼ Unif(0, 1).

(iii) P(X ≤ x) = P(F(X) ≤ F(x)).

Proof. (i) and (ii) are left as exercises. As for (iii), we note that X ≤ x implies F(X) ≤ F(x) since F is non-decreasing. Conversely, consider the event {F(X) ≤ F(x), X > x}. If X > x, we have F(X) ≥ F(x), and hence {F(X) ≤ F(x), X > x} ⊆ {F(X) = F(x), X > x}, so that F is flat on [x, X] on this event. This implies P(F(X) ≤ F(x), X > x) = 0, completing the proof. □
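Point (i) is precisely the inverse transform method used in simulation. Here is a quick sketch for the exponential distribution, where the generalised inverse has a closed form; the seed and sample size are arbitrary choices of mine.

```python
import math
import random

def exp_quantile(u, lam):
    """Generalised inverse of F(x) = 1 - exp(-lam x) on (0, 1):
    F^{<-}(u) = -log(1 - u)/lam."""
    return -math.log(1.0 - u) / lam

rng = random.Random(42)
lam = 2.0
sample = [exp_quantile(rng.random(), lam) for _ in range(100000)]
mean = sum(sample) / len(sample)                           # close to 1/lam = 0.5
ecdf_at_1 = sum(1 for x in sample if x <= 1.0) / len(sample)
target = 1.0 - math.exp(-2.0)                              # F(1), about 0.8647
```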

Exercises
Exercise A.1:
Prove (i) in Proposition A.1.3.

Exercise A.2:
Prove (i) and (ii) in Proposition A.1.4. Give a counterexample which shows that (ii) need
not hold for general distribution functions.

A.2 Probability theory


Distribution functions
Definition A.2.1. If (Ω, F, P ) is a probability space, a random variable is a measurable
map X : (Ω, F) → (R, B) where B is the Borel sigma-algebra on R.

In this course, measurability is not an issue. We will also rarely worry about the background
space (Ω, F, P ). We now go through distribution functions in some detail since distribution
functions and quantile functions play a central role in the course.

Definition A.2.2. For a random variable X, we define the distribution function F of X as

F (x) = P (X ≤ x).

Similarly, if X = (X1 , ..., Xd ) is Rd -valued, the distribution function F is given by

F (x1 , ..., xd ) = P (X1 ≤ x1 , ..., Xd ≤ xd ).

If F is a distribution function for X, we call F̄ = 1 − F the survival function of X.

In the univariate case, we have a nice characterisation of distribution functions.

Proposition A.2.3. A function F : R → R is the distribution function of some random variable if and only if the following properties hold:

1. F is right-continuous.

2. F is non-decreasing.

3. lim_{x→∞} F(x) = 1.

4. lim_{x→−∞} F(x) = 0.

Proof. Assume that F is a distribution function for the random variable X. For ε > 0, we
have {X ≤ x + ε} ↓ {X ≤ x} for ε ↓ 0. By continuity from above for measures, F (x + ε) →
F (x) for ε ↓ 0, showing that F has property 1. If x ≤ y, then {X ≤ x} ⊆ {X ≤ y} so that
F (x) = P (X ≤ x) ≤ P (X ≤ y) = F (y), proving 2. Properties 3 and 4 follow from the fact
that X is real-valued. Conversely, suppose F satisfies properties 1 - 4. Let X = F ← (U )
where U is uniformly distributed on (0, 1). Then

P (X ≤ x) = P (F ← (U ) ≤ x) = P (U ≤ F (x)) = F (x)

by Proposition A.1.3. Hence X has distribution function F . 

Recall that a right-continuous function with left-limits is called càdlàg (French: ”continue
à droite, limite à gauche”).

Corollary A.2.4. Every distribution function is càdlàg.

Proof. A non-decreasing function has left-limits. As a distribution function is right-continuous by the above result, the corollary follows. □

The distribution function of a random variable determines its distribution. Note also the
useful identity
P (a < X ≤ b) = F (b) − F (a)
for every a < b. This generalises to the two-dimensional case as follows: If (X, Y ) has
distribution function F , then

P (a < X ≤ b, c < Y ≤ d) = F (a, c) + F (b, d) − F (a, d) − F (b, c). (A.1)

This is best seen by making a drawing of the rectangle with coordinates (a, c), (a, d), (b, c)
and (b, d). Multivariate distribution functions have similar properties as in the univariate
case.

Proposition A.2.5. Any multivariate distribution function F : Rd → R satisfies the fol-


lowing properties:

1. F is non-decreasing in each variable.

2. F is right-continuous in each variable.

3. limx1 ,...,xd →∞ F (x1 , ..., xd ) = 1.

4. 0 ≤ F (x1 , ..., xd ) ≤ 1.

5. limxi →−∞ F (x1 , ..., xd ) = 0 for every i = 1, ..., d.

Proof. Left as an exercise for the reader. 

Note that the above proposition is not an if and only if statement as in the univariate
case. A counterexample is given in the exercises. We end this subsection about distribution
functions with two tables containing the most important examples for this course.

Normal, N(µ, σ²): density ϕ(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}, x ∈ R; distribution function Φ(x) = ∫_{−∞}^{x} ϕ(t) dt; parameters (µ, σ²) ∈ R × (0, ∞).

Exponential, Exp(λ): density λe^{−λx}, x > 0; distribution function 1 − e^{−λx}, x > 0; parameter λ ∈ (0, ∞).

Gamma, Γ(α, β): density f(x) = (β^α/Γ(α)) x^{α−1} e^{−βx}, x > 0; distribution function ∫_{0}^{x} f(t) dt, x > 0; parameters (α, β) ∈ (0, ∞)².

Student t: density f(x) = (Γ((ν+1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}, x ∈ R; distribution function tν(x) = ∫_{−∞}^{x} f(t) dt; parameter ν ∈ (0, ∞).

Lognormal(µ, σ²): density f(x) = (1/(x√(2πσ²))) e^{−(log x−µ)²/(2σ²)}, x > 0; distribution function ∫_{0}^{x} f(t) dt, x > 0; parameters (µ, σ²) ∈ R × (0, ∞).

Pareto: density ακ^α/(κ + x)^{α+1}, x > 0; distribution function 1 − (κ/(κ + x))^α, x > 0; parameters (α, κ) ∈ (0, ∞)².

Table A.1: Densities and distribution functions of some common continuous distributions.

Poisson(λ): P(N = k) = (λ^k/k!) e^{−λ}, k = 0, 1, 2, ...; distribution function Σ_{i=0}^{k} (λ^i/i!) e^{−λ}; parameter λ ∈ (0, ∞).

Binomial(n, p): P(N = k) = C(n, k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n; distribution function Σ_{i=0}^{k} C(n, i) p^i (1 − p)^{n−i}; parameters n ∈ N, p ∈ [0, 1].

Geometric(p): P(N = k) = (1 − p)^k p, k = 0, 1, 2, ...; distribution function 1 − (1 − p)^{k+1}; parameter p ∈ [0, 1].

Negative binomial(p, r): P(N = k) = C(k + r − 1, k) (1 − p)^k p^r, k = 0, 1, 2, ...; distribution function Σ_{i=0}^{k} C(i + r − 1, i) (1 − p)^i p^r; parameters p ∈ [0, 1], r ∈ N.

Here C(n, k) denotes the binomial coefficient n!/(k!(n − k)!).

Table A.2: Densities and distribution functions of some common discrete distributions.

Characteristic functions and moment-generating functions


An alternative characterization of distributions is via moment-generating functions and
characteristic functions.

Definition A.2.6. For a random variable X, the function

ΦX (t) = E[e^{itX}]

is called the characteristic function of X. If there exists a neighbourhood (−a, a) of zero (a > 0) such that

κX (t) = E[e^{tX}],   t ∈ (−a, a),

is finite, we call κX the moment-generating function of X.

For the sake of brevity, we will often write cf for characteristic function and mgf for moment-generating function. Note that the characteristic function of a random variable is always defined. Indeed, the integrand is bounded by 1 in absolute value.

Example A.2.7. For the N(0, 1) distribution, the characteristic function is given by

Φ(t) = e^{−t²/2}.

The case of the general normal distribution N(µ, σ²) is left as an exercise, see also the lemma below. ◦

The cf and mgf are easily generalised to a multivariate random variable X = (X1 , ..., Xd ) as follows:

ΦX (t) = E[e^{it^T X}],   κX (t) = E[e^{t^T X}],   t ∈ R^d,

where the mgf is only defined in the neighbourhood of the origin where it is finite.

Lemma A.2.8. Let X be an R^d-valued random variable, a ∈ R^n and B an n × d matrix. Then the R^n-valued random variable a + BX has cf

Φ_{a+BX}(t) = e^{ia^T t} ΦX (B^T t),   t ∈ R^n.

Similarly, whenever the mgfs exist,

κ_{a+BX}(t) = e^{a^T t} κX (B^T t).

Proof. The proof is left to the reader. □

The cf has the following important properties.


Theorem A.2.9. If X and Y are random variables with the same characteristic functions,
ΦX = ΦY , then X and Y have the same distribution.

Proof. See Theorem 14.1 in [15]. 

Corollary A.2.10. The variables X1 , ..., Xd are independent if and only if

ΦX (t1 , ..., td ) = Π_{i=1}^{d} Φ_{Xi}(ti)

for all t1 , ..., td , where X = (X1 , ..., Xd ).

Proof. Assume X1 , ..., Xd are independent. Then

ΦX (t1 , ..., td ) = E[e^{i(t1 X1 + ··· + td Xd)}] = E[e^{it1 X1}] ··· E[e^{itd Xd}] = Π_{i=1}^{d} Φ_{Xi}(ti).

Conversely, if the cf factors, it follows immediately from the uniqueness theorem above that X1 , ..., Xd are independent. □


Proposition A.2.11. Let X be a one-dimensional random variable with cf ΦX . If E[|X|^k] < ∞ for some k ∈ N, then ΦX is C^k (k times differentiable and the k'th derivative is continuous) and

ΦX^{(m)}(0) = i^m E[X^m],   m = 1, ..., k.

Proof. See Theorem 6.34 in [13] and the paragraph following the theorem. 

Maybe not surprisingly, these properties more or less carry over to the mgf. A discussion of
the result below can be found in [4], chapter 30.
Theorem A.2.12. If the mgfs of X and Y exist in a neighbourhood around zero and are
equal, then X and Y have the same distribution.

Corollary A.2.13. Let the variables X1 , ..., Xd have moment-generating functions κX1 , ..., κXd that exist in a neighbourhood around zero. Then X1 , ..., Xd are independent if and only if

κ_{(X1,...,Xd)}(t1 , ..., td ) = Π_{i=1}^{d} κ_{Xi}(ti).

Proposition A.2.14. Let X be a one-dimensional random variable with mgf κX that exists in a neighbourhood (−c, c) around zero. Then X has moments of all orders and for k ∈ N,

κX^{(k)}(0) = E[X^k].
We end this subsection with tables containing the mgf and cf of the distributions from the tables of distributions above.

Normal, N(µ, σ²): cf e^{iµt − σ²t²/2}; mgf e^{µt + σ²t²/2}, for t ∈ R.
Exponential, Exp(λ): cf λ/(λ − it); mgf λ/(λ − t), for t ∈ (−∞, λ).
Gamma, Γ(α, β): cf (β/(β − it))^α; mgf (β/(β − t))^α, for t ∈ (−∞, β).
Student t: no explicit form for the cf; the mgf does not exist.
Lognormal(µ, σ²): no explicit form for the cf; the mgf does not exist.
Pareto: no explicit form for the cf; the mgf does not exist.

Table A.3: Characteristic functions and moment-generating functions for the distributions in table A.1.

Poisson(λ): cf e^{λ(e^{it}−1)}; mgf e^{λ(e^{t}−1)}, for t ∈ R.
Binomial(n, p): cf (pe^{it} + 1 − p)^n; mgf (pe^{t} + 1 − p)^n, for t ∈ R.
Geometric(p): cf p/(1 − (1 − p)e^{it}); mgf p/(1 − (1 − p)e^{t}), for t < −log(1 − p).
Negative binomial(p, r): cf (p/(1 − (1 − p)e^{it}))^r; mgf (p/(1 − (1 − p)e^{t}))^r, for t < −log(1 − p).

Table A.4: Characteristic functions and moment-generating functions for the distributions in table A.2.

The multivariate normal distribution


Definition A.2.15. An Rd -valued random variable X = (X1 , ..., Xd ) is multivariate normal
if for every a ∈ Rd , the real-valued random variable aT X has a normal distribution.
The definition does not say that being multivariate normal is the same as all marginal
variables being normal. A counterexample is provided in the exercises. In order to prove
results with the multivariate normal distribution, the following theorem is essential.
Theorem A.2.16. X is multivariate normal of dimension d if and only if there exist a symmetric positive semi-definite matrix Σ ∈ R^{d×d} and a vector µ ∈ R^d such that

ΦX (t) = e^{it^T µ − (1/2) t^T Σt},   t ∈ R^d.

In this case, Σ is the covariance matrix of X and µ is the mean vector, i.e. E[Xi ] = µi and Σij = Cov(Xi , Xj ) for all i, j = 1, ..., d.

Proof. See Theorem 16.1 in [15]. □

The theorem allows us to define the following.


Definition A.2.17. For a multivariate normal vector X, we write X ∼ N(µ, Σ), where µ is the mean vector and Σ is the covariance matrix. If Σ is invertible (i.e. det Σ ≠ 0), we say that X has a regular multivariate normal distribution. Otherwise, X is called singular.

Theorem A.2.18. A regular multivariate normal variable X ∼ N(µ, Σ) in R^d has density

f(x) = (1/((2π)^{d/2} √(det Σ))) e^{−(1/2)(x−µ)^T Σ^{−1}(x−µ)},   x ∈ R^d,

with respect to Lebesgue measure on R^d.

Proof. See Corollary 16.2 in [15]. Note the error in equation (16.5). It should say (2π)^{n/2} and not 2π^{n/2}. □

The following result will be used extensively in the discussion on spherical and elliptical distributions.

Proposition A.2.19. Let X ∼ N(µ, Σ) be d-dimensional, let a ∈ R^n and let B be an n × d matrix. Then Y = a + BX ∼ N(a + Bµ, BΣB^T).

Proof. Left as an exercise for the reader. □

Convergence concepts and results


In this subsection, we will briefly touch upon the convergence concepts that we will use in
this course.
Definition A.2.20. Let X, X1 , X2 , ... be random variables. We say that the sequence {Xn }
converges almost surely to X for n → ∞ if the event {Xn → X for n → ∞} has probability
one. We write
P (Xn → X) = 1.
Example A.2.21. Let X1 , X2 , ... be iid Bernoulli distributed with success probability p ∈
(0, 1) i.e. P (Xi = 1) = p and P (Xi = 0) = 1 − p for all i. Consider the product process
Yn = X1 · · · Xn . We claim that Yn → 0 a.s. Indeed, note that Yn ∈ {0, 1} a.s. and

P (Yn → 0) = P (Yn = 0 for some n ∈ N) = 1 − P (Yn = 1 for all n ∈ N).

By independence, we have for any N ∈ N that

P(YN = 1) = P(X1 = 1, ..., XN = 1) = p^N

and

P(Yn = 1 for all n ∈ N) ≤ P(YN = 1) = p^N.

As this bound holds for all N, we can let N → ∞ and obtain P(Yn = 1 for all n ∈ N) = 0, which yields P(Yn → 0) = 1 as desired. ◦

Almost sure convergence is a strong form of convergence. A weaker type of convergence is


convergence in probability.
Definition A.2.22. Let X, X1 , X2 , ... be random variables. We say that the sequence {Xn } converges in probability to X for n → ∞ if for every ε > 0, we have

lim_{n→∞} P(|Xn − X| > ε) = 0.

We write Xn →P X.
Lemma A.2.23. Almost sure convergence implies convergence in probability.
Proof. This proof is from [13]. Let {Xn } be a sequence of random variables converging
almost surely to X, and let ε > 0 be given. Consider an ω such that Xn (ω) → X(ω). There
exists an N ∈ N (depending on ω) such that
|Xn (ω) − X(ω)| ≤ ε for n ≥ N,
implying that 1{|Xn −X|>ε} (ω) = 0 for n ≥ N . It follows that 1{|Xn −X|>ε} (ω) → 0 for
n → ∞ and since this holds for almost every ω, we have 1{|Xn −X|>ε} → 0 almost surely. As
this function is bounded by 1, dominated convergence implies

P(|Xn − X| > ε) = ∫ 1_{{|Xn − X| > ε}} dP → 0

as desired. □
It is not immediately clear that almost sure convergence is strictly stronger than convergence
in probability. The difference is of a very technical nature. However, counterexamples exist,
and we encourage the reader to look them up. See for example [22]. It is useful to have
some tools to prove convergence almost surely and in probability. Such tools include the
Markov inequality and Chebyshev’s inequality.
Lemma A.2.24 (Markov's inequality). Let X be a random variable. Then for any ε > 0,

P(|X| > ε) ≤ E[|X|]/ε.

Proof. Trivially, ε1_{{|X| > ε}} ≤ |X|. Now take expectations on both sides and rearrange. □
Corollary A.2.25 (Chebyshev's inequality). Let X be a random variable with finite expectation, E[|X|] < ∞. Then for any ε > 0,

P(|X − E[X]| > ε) ≤ Var(X)/ε².

Proof. Left as an exercise for the reader. □
Example A.2.26. Consider a sequence of non-negative variables X1 , X2 , ... with finite expectation and E[Xn ] = 1/n. For any ε > 0, we have by the Markov inequality that

P(|Xn − 0| > ε) ≤ E[Xn ]/ε = 1/(nε) → 0,

so Xn →P 0. This result should not be surprising, considering the fact that a non-negative random variable is zero almost surely if and only if it has mean zero. ◦

The above example of an application of the Markov inequality is not exactly interesting, but
we want to stress that the inequality, while a triviality, is extremely useful and flexible. The
following result shows that almost sure convergence follows if the probability P (|Xn −X| > ε)
goes to zero fast enough.

Proposition A.2.27 (Borel-Cantelli Criterion for Almost Sure Convergence). Let X, X1 , X2 , ... be random variables. If for every ε > 0,

Σ_{n=1}^{∞} P(|Xn − X| > ε) < ∞,

then Xn → X almost surely.
then Xn → X almost surely.

Proof. The proof is an immediate consequence of the Borel-Cantelli lemma, see Lemma 2.26
and Theorem 2.27 in [13]. 

The following result is a cornerstone of probability theory.

Theorem A.2.28 (Strong Law of Large Numbers). Let {Xi } be an iid sequence of random variables with E[|X1 |] < ∞. Then

(1/n) Σ_{i=1}^{n} Xi → E[X1 ] a.s.

Proof. A proof can be found in [13], see Theorem 4.25. 

An even weaker form of convergence than convergence in probability is convergence in


distribution.

Definition A.2.29. A sequence of random variables X1 , X2 , ... is said to converge in distribution to X if for every continuous and bounded function f : R → R,

∫ f(Xn) dP → ∫ f(X) dP.

We write Xn →d X.

Remark A.2.30. Convergence in distribution is also called weak convergence.


The above definition is difficult to check in practice. The following results provide much
easier ways to check convergence in distribution.

Theorem A.2.31 (Helly-Bray). Let X, X1 , X2 , ... be random variables with distribution functions F, F1 , F2 , .... Then Xn →d X if and only if there exists a dense subset A ⊆ R such that Fn (x) → F(x) for x ∈ A. In this case, A can be chosen to be the set of continuity points of F.

Proof. See Theorem 18.4 in [15] or Theorem 6.18 in [13]. 


Proposition A.2.32. If Xn →P X, then Xn →d X.

Proof. See Theorem 18.2 in [15] or Lemma 6.12 in [13]. 



We now state a version of the Central Limit Theorem, often abbreviated CLT.

Theorem A.2.33 (Central Limit Theorem). Let {Xi } be iid with E[X1²] < ∞, µ = E[X1 ] and σ² = Var(X1 ). Let Sn = Σ_{i=1}^{n} Xi . Then

(Sn − nµ)/(σ√n) →d Z ∼ N(0, 1).

Proof. See Theorem 21.1 in [15]. □

In this course, this version of the CLT suffices. It is, however, only a small part of the whole story. More perspectives and versions of the CLT can be found in chapter 7 of [13].
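As a quick numerical sanity check of the theorem, the sketch below standardises sums of iid Uniform(0, 1) variables and compares an empirical probability with the standard normal cdf; all names and parameter values are my own illustrative choices.

```python
import math
import random

def standardised_sum(n, rng):
    """(S_n - n mu)/(sigma sqrt(n)) for iid Uniform(0,1) summands,
    where mu = 1/2 and sigma^2 = 1/12."""
    s = sum(rng.random() for _ in range(n))
    return (s - 0.5 * n) / math.sqrt(n / 12.0)

rng = random.Random(3)
vals = [standardised_sum(50, rng) for _ in range(20000)]
emp = sum(1 for v in vals if v <= 1.0) / len(vals)      # close to Phi(1)
phi_1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))    # Phi(1), about 0.8413
```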

Conditional expectations
The presentation in this subsection follows chapter 9 of [13]. Conditional expectations are
essential in performing computations in probability theory and statistics. The (measure the-
oretic) definition of a conditional expectation is somewhat strange at first, but the definition
has the advantage that all the theoretic properties follow almost trivially.

Definition A.2.34. Let X be a random variable on (Ω, F, P ), E[|X|] < ∞ and G ⊆ F a


sub-sigma-algebra. The conditional expectation of X with respect to G, denoted by E[X | G],
is a random variable satisfying the following properties:

(i) E[X | G] is G-measurable.

(ii) For every A ∈ G,

∫_A X dP = ∫_A E[X | G] dP.

Intuitively, we think of the conditional expectation E[X | G] as our best guess of the value of
X given the information in G. Try to keep this intuition in mind when reading the following
examples and theoretical properties.

It is by no means trivial that the conditional expectation exists. An elegant construction


is via the Radon-Nikodym theorem, see [13] chapter 8 and Theorem 9.1. We also remark
that E[X | G] is only unique almost surely. To verify theoretical statements concerning
E[X | G], it suffices to verify the two properties above. If another variable Z satisfies the
above assumptions, we have E[X | G] = Z a.s. In the following, we will omit writing a.s.
when considering computations involving conditional expectations. Also, if G = σ(Y ) is the
smallest sigma-algebra making Y measurable (intuitively, the information Y contains), we
will write E[X | Y ] instead of E[X | σ(Y )].

Example A.2.35. Assume X is G-measurable. We claim that X = E[X | G]. X satisfies (i) by assumption, and for A ∈ G, we have

∫_A X dP = ∫_A E[X | G] dP

by definition of E[X | G], verifying (ii). ◦



Example A.2.36. Assume X is independent of G, i.e. P(A ∩ {X ∈ B}) = P(A)P(X ∈ B) for all Borel sets B and A ∈ G. We claim that E[X | G] = E[X]. E[X] is constant and thus trivially G-measurable. Also, we get for A ∈ G that

∫_A X dP = E[1_A X] = E[1_A ]E[X] = P(A)E[X] = ∫_A E[X] dP,

so both (i) and (ii) are satisfied by E[X], proving the claim. ◦
The following proposition allows us to compute a plethora of interesting examples.
Proposition A.2.37. If D1 , D2 , ... are disjoint sets in F with ∪_n Dn = Ω (such a collection is called a partition), P(Di ) > 0 for all i, G = σ(D1 , D2 , ...) and X is an integrable random variable, then

E[X | G](ω) = (1/P(Di )) ∫_{Di} X dP   for ω ∈ Di , i = 1, 2, ...

Proof. It is not hard to verify that the sigma-algebra G consists of the sets that are unions of the Di . Since E[X | G] is G-measurable, E[X | G] must be constant when restricted to each Di , i.e. E[X | G](ω) = ci for ω ∈ Di . Since

∫_{Di} X dP = ∫_{Di} E[X | G] dP = ∫_{Di} ci dP = ci P(Di ),

the claim follows. □

Corollary A.2.38. Let N be a random variable with N ∈ {0, 1, 2, ...} and P(N = n) > 0 for all n. If X is an integrable random variable, then

E[X | N] = (1/P(N = n)) ∫_{{N=n}} X dP   on {N = n}, for n = 0, 1, 2, ...

Proof. This follows immediately from the previous proposition by letting G = σ(N) and noting that σ(N) is generated by the partition {{N = 0}, {N = 1}, ...}. □

Before computing some interesting examples, we state the following list of properties of
conditional expectations.
Theorem A.2.39. Let X and Y be random variables with finite expectation, G ⊆ F a
sub-sigma-algebra.

(i) For a, b ∈ R, E[aX + bY | G] = aE[X | G] + bE[Y | G] (linearity).

(ii) E[X] = E[E[X | G]] (tower property).

(iii) If X ≤ Y then E[X | G] ≤ E[Y | G] (monotonicity).

(iv) |E[X | G]| ≤ E[|X| | G] (triangle inequality).

(v) If X is G-measurable, then E[XY | G] = XE[Y | G].

Proof. Consider (i). We have to show that aE[X | G] + bE[Y | G] satisfies the two properties
of E[aX + bY | G]. Measurability is obvious. For A ∈ G, we have

    ∫_A (aE[X | G] + bE[Y | G]) dP = a ∫_A E[X | G] dP + b ∫_A E[Y | G] dP
                                   = a ∫_A X dP + b ∫_A Y dP
                                   = ∫_A (aX + bY) dP,

which proves the claim. As for (ii), simply choose A = Ω in the second property of
conditional expectations to obtain

    E[X] = ∫_Ω X dP = ∫_Ω E[X | G] dP = E[E[X | G]].

(iii) follows immediately from the monotonicity property of integrals. As for (iv), we
obviously have −|X| ≤ X ≤ |X|, so from (iii), we get

    −E[|X| | G] ≤ E[X | G] ≤ E[|X| | G],

which is the desired result. See the exercises for an outline of the proof of (v). 

The tower property, (ii), has many names. It is also known as the law of iterated expectations
and the law of total expectation to name a few.

Example A.2.40. A typical situation in for example non-life insurance is to have a sum
of the form

    S = Σ_{i=1}^N Xi,

where {Xi} is an iid sequence independent of N, a random variable taking values in
{0, 1, 2, ...}. Assume both X1 and N have finite expectation. What is the expectation
of S? Using Corollary A.2.38, we have¹

    (1/P(N = n)) ∫_{N=n} S dP = (1/P(N = n)) ∫_{N=n} (X1 + · · · + Xn) dP
                              = (1/P(N = n)) E[1_{N=n}] E[X1 + · · · + Xn]
                              = nE[X1],

using that N and {Xi} are independent. It follows that E[S | N] = N E[X1]. From the
tower property, it follows that

    E[S] = E[E[S | N]] = E[N E[X1]] = E[N]E[X1].

¹ We should in principle first verify that E[|S|] < ∞.
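The identity E[S] = E[N]E[X1] (a special case of Wald's identity) is easy to check by simulation. A sketch assuming numpy; the Poisson and exponential choices below are illustrative, not prescribed by the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

# S = X_1 + ... + X_N with N ~ Poisson(3) independent of iid X_i ~ Exp(mean 2).
lam, mu, trials = 3.0, 2.0, 200_000
n = rng.poisson(lam, size=trials)

# Draw all X_i at once and sum them per trial.
owner = np.repeat(np.arange(trials), n)          # which trial each draw belongs to
draws = rng.exponential(mu, size=int(n.sum()))
s = np.bincount(owner, weights=draws, minlength=trials)

print(s.mean(), lam * mu)  # sample mean of S vs E[N]E[X_1] = 6
```

Trials with N = 0 contribute S = 0, which `bincount` with `minlength` handles automatically.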
In practice, one does not proceed as formally as in the above example. The corollary that we
applied essentially says that when we condition on a variable, that variable can be treated
as a constant. If N was constant equal to n, we would say that E[S] = nE[X1 ]. Then we
just replace n by the random variable N to obtain E[S | N ]. Let us make this more precise
by first observing that E[X | Y ] is a function of Y .

Theorem A.2.41 (Doob-Dynkin lemma). If a random variable Z is σ(Y)-measurable,
then there exists a measurable function φ such that Z = φ(Y).

Proof. See Theorem 9.23 in [13]. 

The situation can be pictured as a commutative diagram: Y maps Ω to R, φ maps R to R,
and Z = φ ◦ Y.
By definition of a conditional expectation, E[X | Y ] is σ(Y )-measurable. Hence E[X | Y ] =
φ(Y ) for some function φ. While the Doob-Dynkin lemma does not provide an explicit
recipe for φ, it is possible to compute φ in many situations of interest. The following result
shows how to compute φ(y) = E[X | Y = y] in the (quite typical) case where (X, Y ) has a
density.

Theorem A.2.42. Let E[|X|] < ∞ and assume (X, Y) has density f(x, y). Then

    E[X | Y = y] = ∫_R x (f(x, y)/g(y)) dx,

where g(y) = ∫_R f(x, y) dx is the density of Y.

Proof. This is Corollary 9.28 in [13]. 

Remark A.2.43. By recalling that the conditional density of X given Y = y is defined by

    f_{X|Y=y}(x) = f(x, y)/g(y),

we could also write the above result as

    E[X | Y = y] = ∫_R x f_{X|Y=y}(x) dx,

which also makes sense intuitively. Given densities, the conditional expectation can be
computed as an ordinary expectation, but with a conditional density.
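Theorem A.2.42 can be tested on a case where E[X | Y = y] is known in closed form. For (X, Y) bivariate standard normal with correlation ρ, the conditional distribution of X given Y = y is N(ρy, 1 − ρ²). A numerical sketch (numpy assumed; the unnormalised-density trick works because the normalising constants cancel in f(x, y)/g(y)):

```python
import numpy as np

rho, y = 0.6, 1.5
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]

# Joint density of (X, Y) at (x, y), up to a constant not depending on x;
# that constant cancels in f(x, y) / g(y).
f = np.exp(-(xs**2 - 2 * rho * xs * y) / (2 * (1 - rho**2)))

g = f.sum() * dx                      # ~ marginal density of Y at y (same constant)
cond_mean = (xs * f).sum() * dx / g   # ~ E[X | Y = y] = int x f(x, y)/g(y) dx

print(cond_mean, rho * y)  # both approximately 0.9
```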

Stochastic processes (including Brownian motions)


In this subsection, we provide a very brief review of stochastic processes in continuous time.

Definition A.2.44. A (continuous time) stochastic process is a collection of random
variables {Xt} indexed by t ∈ [0, ∞).

Definition A.2.45 (Brownian motion). A stochastic process {Xt} which satisfies the
properties

• X0 = 0,

• Xt − Xs ∼ N(0, t − s) for all 0 ≤ s < t and

• Xt1, Xt2 − Xt1, ..., Xtn − Xtn−1 are independent for 0 < t1 < t2 < · · · < tn

is called a (standard) Brownian motion.
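The three defining properties translate directly into a simulation scheme: fix a grid, draw independent N(0, ∆t) increments, and take cumulative sums. A minimal sketch (numpy assumed; grid sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

steps, paths, T = 500, 100_000, 1.0
dt = T / steps

# Independent N(0, dt) increments; cumulative sums give X_{t_1}, ..., X_{t_n}.
incr = rng.normal(0.0, np.sqrt(dt), size=(paths, steps))
x = np.cumsum(incr, axis=1)

# Sanity check of X_t - X_s ~ N(0, t - s) at s = 0, t = T: X_T should be N(0, 1).
m, v = x[:, -1].mean(), x[:, -1].var()
print(m, v)  # approximately 0 and 1
```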

To model a flow of information in continuous time, we need the notion of a filtration.

Definition A.2.46. A filtration is a sequence {Ft} of sigma-algebras indexed by t ∈ [0, ∞)
such that s ≤ t implies Fs ⊆ Ft. We have F0 = {∅, Ω} by convention.

Definition A.2.47. A stochastic process {Xt } is called adapted to the filtration {Ft } if Xt
is Ft -measurable for all t.

Example A.2.48. For any stochastic process {Xt }, we can create a filtration by letting
Ft = σ(Xs : s ≤ t). This is the smallest filtration such that {Xt } is adapted. The filtration
is sometimes called the natural filtration. ◦

A particularly nice type of stochastic process is the martingale.

Definition A.2.49. A stochastic process {Xt } is called a martingale with respect to the
filtration {Ft } if the following hold:

• {Xt} is adapted to {Ft}.

• E[|Xt|] < ∞ for each t.

• For every s ≤ t, E[Xt | Fs] = Xs.

Martingales do not play a major role in this course. Nevertheless, some examples are
provided in the exercises.

Exercises
Exercise A.3:
If Y ∼ N(µ, σ²), we say that X = exp(Y) has a lognormal distribution with parameters µ
and σ. Using the density

    ϕ(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}

for the normal distribution, derive the density of the lognormal distribution.

Exercise A.4:
Prove Proposition A.2.5.

Exercise A.5:
Consider the function F : R² → R given by

    F(x, y) = 0 if x < 0 or y < 0 or x + y < 1, and F(x, y) = 1 otherwise.

1) Verify that F satisfies all the properties in Proposition A.2.5.
2) Show that F cannot be a distribution function for a pair of random variables (X, Y).
Hint: Use equation (A.1). Now consider a = c = 1/3, b = d = 1.

Exercise A.6:
Let Y be N(0, 1) and let Z be Bernoulli distributed with success parameter 1/2. Assume
Y and Z are independent. Define X1 := Y and X2 := 1_{Z=1} Y − 1_{Z=0} Y.
1) Verify that X1 and X2 are both N(0, 1) variables.
2) Show that (X1, X2) is not multivariate normal.

Exercise A.7:
Compute the moment-generating function for the Γ(α, β) distribution.

Exercise A.8:
1) Compute the moment-generating function for the Bernoulli distribution, i.e. P(X = 1) = p
and P(X = 0) = 1 − p for p ∈ [0, 1].
2) Compute the moment-generating function for the Binomial(n, p) distribution. Hint: Use
the previous exercise.
3) Compute the moment-generating function for the Geometric(p) distribution.

Exercise A.9:
1) Prove Lemma A.2.8.
We know that the standard normal distribution has the characteristic function

    Φ(t) = e^{−t²/2}.

2) Compute the characteristic function for the N(µ, σ²) distribution.

Exercise A.10:
Compute all moments of the exponential distribution.

Exercise A.11:
The Γ(λ, n) distribution for n ∈ N is called the Erlang distribution. Verify that if X ∼ Γ(λ, n),
then

    X = Y1 + · · · + Yn   in distribution,

with Y1, ..., Yn iid exponentially distributed with parameter λ.

Exercise A.12:
Prove Proposition A.2.19.

Exercise A.13:
Prove Chebyshev’s inequality, Corollary A.2.25.

Exercise A.14:
Without using the Strong Law of Large Numbers, prove the Weak Law of Large Numbers:
If {Xi} is an iid sequence of random variables with E[X1²] < ∞, then

    (1/n) Σ_{i=1}^n Xi → E[X1]   in probability.

Exercise A.15:
Assume Xn → X a.s. and that f is a continuous function. Prove that f(Xn) → f(X) a.s.

Exercise A.16:
Let {Xi} be iid variables with E[X1²] = 4 and E[X1] = 1. Show that

    lim_{n→∞} (X1² + · · · + Xn²)/(X1 + · · · + Xn)

exists a.s. and determine the value.

Exercise A.17:
Let p ≥ 1. A sequence of random variables {Xi} with E[|Xi|^p] < ∞ for all i is said to
converge to X in L^p if E[|X|^p] < ∞ and

    lim_{n→∞} E[|Xn − X|^p] = 0.

In that case, we write Xn → X in L^p.
1) Prove that if Xn → X in L^p, then Xn → X in probability.
2) Let {Xi} be a sequence of independent random variables with E[Xi] = 0 and
sup_i E[Xi²] < ∞. Prove that (X1 + · · · + Xn)/n converges to zero in L² and in probability.
3) Prove the following dominated convergence statement: If Xn → X a.s. and there exists
some variable Y with E[|Y|^p] < ∞ (p ≥ 1) such that |Xn| ≤ |Y| a.s. for all n, then
Xn → X in L^p.

Exercise A.18:
Assume X is a random variable with finite moment-generating function κ in the
neighbourhood (−c, c). Prove Chernoff's bound

    P(X > ε) ≤ inf_{α∈[0,c]} κ(α)/e^{αε}.

Exercise A.19:
Assume N is Poisson distributed with parameter λ > 0 and {Xi} is an iid sequence
independent of N, where X1 has moment-generating function κ. Compute the
moment-generating function of

    Σ_{i=1}^N Xi.

Hint: Tower property.

Exercise A.20:
In this exercise, we will prove (v) in Theorem A.2.39.
1) Prove the result when X = 1_{A0} is an indicator function, A0 ∈ G.
2) Prove the result when X is a simple function, i.e. X = Σ_{i=1}^n ci 1_{Ai}, A1, ..., An ∈ G.
3) One can show the following dominated convergence statement for conditional expectations:
If Xn → X a.s. and |Xn| ≤ Y for a random variable Y with E[|Y|] < ∞, then
E[Xn | G] → E[X | G]. Using this result, prove (v) for a general G-measurable X. Hint:
Recall that there exists a sequence of simple functions {Xn} with Xn ↑ X.

Exercise A.21:
Prove the following extension of the tower property: If G ⊆ H ⊆ F, then

E[E[X | H] | G] = E[X | G].

Exercise A.22:
In this exercise, we introduce the conditional variance. Assume E[X²] < ∞ and that G is a
sub-sigma-algebra. The conditional variance is then defined by

    Var(X | G) = E[X² | G] − E[X | G]².

1) Prove the law of total variance,

    Var(X) = E[Var(X | G)] + Var(E[X | G]).

2) Assume X is G-measurable. Prove that Var(X | G) = 0.
3) Assume Y is G-measurable. Prove that Var(X + Y | G) = Var(X | G).
4) Assume X is independent of G. Prove that Var(X | G) = Var(X).
5) Give an intuitive interpretation of the previous three subproblems.

Exercise A.23:
Consider Example A.2.40. Assume that the Xi have finite second moment. Show that

    Var(S) = E[N] Var(X1) + Var(N) E[X1]².

Hint: Use the law of total variance from the previous exercise.

Exercise A.24:
Let {Xt} be a Brownian motion.
1) Define Yt = −Xt. Show that {Yt} is a Brownian motion.
2) Let c > 0 and define Yt = cX_{t/c²}. Show that {Yt} is a Brownian motion.

Exercise A.25:
A continuous time process {Xt} satisfies continuity in probability if for every sequence {tn}
of non-negative real numbers, we have

    tn → t  ⇒  X_{tn} → Xt in probability.

Show that a Brownian motion satisfies continuity in probability.

Exercise A.26:
Let {Xt} be a Brownian motion and Ft = σ(Xs : s ≤ t) the natural filtration.
1) Show that {Xt} is a martingale with respect to {Ft}.
2) Show that {Xt² − t} is a martingale with respect to {Ft}.

Exercise A.27:
A stochastic process {Nt} satisfying the properties

• N0 = 0,

• Nt − Ns is Poisson distributed with parameter λ(t − s) for 0 ≤ s < t,

• Nt1, Nt2 − Nt1, ..., Ntn − Ntn−1 are independent for 0 < t1 < t2 < · · · < tn,

• {Nt} has right-continuous sample paths and limits from the left

is called a Poisson process with intensity λ > 0. Verify that {Nt − λt} is a martingale
with respect to the natural filtration.

A.3 Calculus
During the course, we will occasionally integrate with respect to functions of bounded vari-
ation. Examples of such functions include functions that are monotone, in particular distri-
bution functions. Here we introduce the basic theory, following the lines of [20].

Definition A.3.1. Let f : [0, ∞) → R be a function. The variation of f on the interval
[0, t] is given by

    V^f(t) = sup { Σ_{i=1}^n |f(ti) − f(ti−1)| : 0 = t0 < t1 < · · · < tn = t },

i.e. the supremum of sums of absolute differences over all finite partitions of [0, t]. If
V^f(t) < ∞ for all t ≥ 0, we call f a function of finite variation.

Example A.3.2. Let f : [0, ∞) → R be monotone. We claim that f is of finite variation. If
f is non-decreasing, this follows immediately from the fact that if 0 = t0 < t1 < · · · < tn = t
is a partition of [0, t], we have

    Σ_{i=1}^n |f(ti) − f(ti−1)| = Σ_{i=1}^n (f(ti) − f(ti−1)) = f(t) − f(0)

by telescoping. Hence V^f(t) = f(t) − f(0) and f is of finite variation. An analogous
argument works for the case where f is non-increasing. ◦
Recall that for x ∈ R, x+ = max{x, 0} and x− = − min{x, 0}. It is easily seen that
x = x+ − x− .
Definition A.3.3. For a function f : [0, ∞) → R, we define the positive variation by

    V^f_+(t) = sup { Σ_{i=1}^n (f(ti) − f(ti−1))⁺ : 0 = t0 < t1 < · · · < tn = t }

and the negative variation by

    V^f_−(t) = sup { Σ_{i=1}^n (f(ti) − f(ti−1))⁻ : 0 = t0 < t1 < · · · < tn = t }.

Proposition A.3.4. Let f, g : [0, ∞) → R be functions of finite variation.

(i) V^f, V^f_+ and V^f_− are non-decreasing.

(ii) af + bg is a function of finite variation for any a, b ∈ R.

Proof. Left as an exercise for the reader. 


The motivation for introducing the negative and positive variation is the following central
result.
Theorem A.3.5 (Jordan decomposition). A function f : [0, ∞) → R is of finite
variation if and only if f can be written as the difference of two non-decreasing functions.
A possible decomposition is f(t) − f(0) = V^f_+(t) − V^f_−(t).

Proof. Assume f = g − h for non-decreasing functions g and h. g and h are of finite variation
as shown in the example above, and the previous proposition now implies that f is of finite
variation. Conversely, assume f is of finite variation. For any partition 0 = t0 < t1 < · · · <
tn = t of [0, t], we have

    f(t) − f(0) = Σ_{i=1}^n (f(ti) − f(ti−1))⁺ − Σ_{i=1}^n (f(ti) − f(ti−1))⁻,

i.e.

    Σ_{i=1}^n (f(ti) − f(ti−1))⁺ = Σ_{i=1}^n (f(ti) − f(ti−1))⁻ + f(t) − f(0),

and taking supremum on both sides, we obtain the decomposition f(t) − f(0) = V^f_+(t) − V^f_−(t),
which is a difference of two non-decreasing functions as desired. 

Remark A.3.6. Note that V^f has the decomposition V^f = V^f_+ + V^f_− since for any x ∈ R,
|x| = x⁺ + x⁻.
We are almost ready to introduce integration. However, finite variation is not quite enough.
We also require the additional property of being càdlàg.
Definition A.3.7. A function f is called càdlàg (French: continue à droite, limite à gauche)
if f is right-continuous and has left limits.
The Jordan decomposition tells us how to proceed from here. If we define integration with
respect to an increasing function, we can use linearity of the integral to define integration
with respect to general functions of bounded variation. Let f be a non-decreasing càdlàg
function. The function µf defined on the intervals (a, b] given by µf ((a, b]) = f (b) − f (a)
extends to a (positive) measure (called a Lebesgue-Stieltjes measure) on all Borel sets. Note
how this resembles the Lebesgue measure where f is just the identity. We can now define
integration in the same way as in basic measure theory.
Definition A.3.8. Let f : [0, ∞) → R be a non-decreasing càdlàg function. The Lebesgue-
Stieltjes integral of a measurable function g with respect to f is given by

    ∫_{(0,∞)} g(t) df(t) := ∫_0^∞ g(t) dµ_f(t),

given that ∫_0^∞ |g(t)| dµ_f(t) < ∞. For any Borel set B, we define

    ∫_B g(t) df(t) := ∫_{(0,∞)} 1_B(t) g(t) df(t).

If f is a càdlàg function of bounded variation, we have the Jordan decomposition f(t) −
f(0) = V^f_+(t) − V^f_−(t), and we define the Lebesgue-Stieltjes integral of g by

    ∫_0^∞ g(t) df(t) = ∫_0^∞ g(t) dV^f_+(t) − ∫_0^∞ g(t) dV^f_−(t).

The integral is well-defined whenever

    ∫_0^∞ |g(t)| dV^f(t) = ∫_0^∞ |g(t)| dV^f_+(t) + ∫_0^∞ |g(t)| dV^f_−(t) < ∞.

Example A.3.9. Let F be the distribution function for the random variable X. If B = (a, b],
then P(X ∈ B) = F(b) − F(a) = ∫_B dF. Since the intervals (a, b] generate the Borel
sigma-algebra, we have P(X ∈ B) = ∫_B dF for all Borel sets B. ◦

Example A.3.10. Consider a positive random variable X with finite expectation and
distribution function F. We claim that

    E[X] = ∫_0^∞ x dF(x).

Since P(X ∈ (a, b]) = F(b) − F(a) for all a < b, the image measure of X, P^X, coincides
with the measure induced by F. Hence

    E[X] = ∫_0^∞ x dP^X(x) = ∫_0^∞ x dF(x),

as desired. ◦
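For a concrete distribution, the claim E[X] = ∫_0^∞ x dF(x) can be checked with a Riemann-Stieltjes sum Σ x_i (F(x_{i+1}) − F(x_i)). A sketch for the exponential distribution (numpy assumed; grid size and truncation point are illustrative):

```python
import numpy as np

lam = 2.0
F = lambda x: 1.0 - np.exp(-lam * x)   # df of Exp(lam), with E[X] = 1 / lam

# Riemann-Stieltjes sum over a fine grid, truncating the (negligible) tail at 40.
grid = np.linspace(0.0, 40.0, 400_001)
approx = np.sum(grid[:-1] * np.diff(F(grid)))

print(approx, 1.0 / lam)  # both approximately 0.5
```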

We now go through some important properties of the Lebesgue-Stieltjes integral.

Proposition A.3.11. Let f be a càdlàg function of finite variation. Assume all integrals
below are well-defined.

(i) ∫_{(s,t]} df(u) = f(t) − f(s).

(ii) ∫_{{t}} g(u) df(u) = g(t)∆f(t), where ∆f(t) := f(t) − f(t−), with f(t−) = lim_{s↑t} f(s)
the limit from the left.

(iii) ∫_{(s,t]} g(u) df(u) = 0 if f is constant on (s, t].

Proof. Left as an exercise for the reader. 
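In particular, when f is a right-continuous step function, properties (ii) and (iii) reduce ∫ g df to a sum of g over the jump points, weighted by the jump sizes. For the empirical distribution function of a sample x1, ..., xn, which jumps by 1/n at each sample point, this gives ∫ g dFn = (1/n) Σ g(xi). A quick numerical sketch (numpy assumed; the sample and g are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.exponential(1.0, size=8)       # sample points; F_n jumps by 1/n at each
g = lambda t: t**2

# Sum over jumps of F_n: each jump contributes g(jump point) * (jump size).
jump_sum = sum(g(xi) * (1.0 / len(x)) for xi in x)
direct = g(x).mean()                    # (1/n) sum_i g(x_i)

print(jump_sum, direct)  # identical up to rounding
```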


These properties will allow us to compute integrals with respect to functions that are piece-
wise constant. An important example is the empirical distribution function as we shall see in
the first weeks of the course. We end this subsection (and the appendix) with the following
key result.
Theorem A.3.12 (Integration by parts). Let f and g be càdlàg functions of finite
variation. Then (assuming all integrals are well-defined)

    f(t)g(t) − f(0)g(0) = ∫_{(0,t]} g(s) df(s) + ∫_{(0,t]} f(s−) dg(s).

Proof. Note first that

    f(t)g(t) − f(0)g(0) − f(0)(g(t) − g(0)) − g(0)(f(t) − f(0)) = (f(t) − f(0))(g(t) − g(0)).

The result is now a direct consequence of the following computation based on Fubini's
theorem:

    (f(t) − f(0))(g(t) − g(0)) = ∫_{(0,t]} ∫_{(0,t]} df(u) dg(s)
        = ∫_{(0,t]} ∫_{(0,s]} df(u) dg(s) + ∫_{(0,t]} ∫_{(s,t]} df(u) dg(s)
        = ∫_{(0,t]} (f(s) − f(0)) dg(s) + ∫_{(0,t]} ∫_{(0,u)} dg(s) df(u)
        = ∫_{(0,t]} f(s) dg(s) − f(0)(g(t) − g(0)) + ∫_{(0,t]} (g(u−) − g(0)) df(u)
        = ∫_{(0,t]} f(s) dg(s) − f(0)(g(t) − g(0)) + ∫_{(0,t]} g(u−) df(u) − g(0)(f(t) − f(0)). 

Remark A.3.13. It is not difficult to see that the result also works for other intervals such
as (a, b], (t, ∞) etc.

Exercises
Exercise A.28:
Prove Proposition A.3.4.

Exercise A.29:
Prove Proposition A.3.11.

Exercise A.30:
In this exercise, we will consider some classes of functions of bounded variation.
1) Let f : [0, ∞) → R be a function which is Lipschitz on every compact interval, i.e. for
every [a, b] ⊆ [0, ∞), there exists a constant C such that

    |f(s) − f(u)| ≤ C|s − u|,   s, u ∈ [a, b].

Prove that f is of finite variation.
2) Let f : [0, ∞) → R be C¹. Prove that f is Lipschitz on every compact interval and
conclude that f is of bounded variation.

Exercise A.31:
Compute the integral ∫_{(0,4]} t df(t) in the following cases:
1) f(t) = k for k − 1 ≤ t < k, k = 1, 2, ....
2) f(t) = e^t.
3) f(t) = k + e^t for k − 1 ≤ t < k, k = 1, 2, ....
Bibliography

[1] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent
measures of risk. Mathematical Finance, 9:203–228, 1999. doi: https://doi.org/10.1111/1467-9965.00068.
[2] Søren Asmussen and Peter W. Glynn. Stochastic Simulation: Algorithms and Analysis.
Stochastic Modelling and Applied Probability. Springer New York, NY, 1 edition, 2007.
ISBN 978-0-387-30679-7.
[3] Sheldon Axler. Linear Algebra Done Right. Undergraduate Texts in Mathematics.
Springer Cham, 4 edition, 2023. ISBN 978-3-031-41025-3.
[4] Patrick Billingsley. Probability and Measure. Wiley Series in Probability and Statistics.
John Wiley & Sons Inc., anniversary edition, 2012. ISBN 978-1-118-12237-2.
[5] N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation. Encyclopedia of
Mathematics and its Applications. Cambridge University Press, 1987. doi: 10.1017/CBO9780511721434.
[6] Tomas Björk. Arbitrage Theory in Continuous Time. Oxford University Press, 4 edition,
2020. ISBN 978–0–19–885161–5.
[7] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of
Econometrics, (31):307–327, 1986. URL https://public.econ.duke.edu/~boller/Published_Papers/joe_86.pdf.
[8] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics,
7(1):1–26, 1979. URL http://www.jstor.org/stable/2958830.
[9] Paul Embrechts and Marius Hofert. A note on generalized inverses. 2014. URL
https://people.math.ethz.ch/~embrecht/ftp/generalized_inverse.pdf.
[10] Paul Embrechts, Claudia Klüppelberg, and Thomas Mikosch. Modelling Extremal
Events for Insurance and Finance. Stochastic Modelling and Applied Probability.
Springer Berlin, Heidelberg, 1 edition, 1997. ISBN 978-3-540-60931-5.
[11] Robert F. Engle. Autoregressive conditional heteroscedasticity with estimates of the
variance of United Kingdom inflation. Econometrica, 50(4):987–1007, 1982. ISSN
00129682, 14680262. URL http://www.jstor.org/stable/1912773.
[12] Paul Glasserman. Monte Carlo Methods in Financial Engineering. Stochastic Modelling
and Applied Probability. Springer New York, NY, 1 edition, 2003. ISBN 978-0-387-
00451-8.


[13] Ernst Hansen. Stochastic Processes. Institut for Matematiske Fag Københavns Univer-
sitet, 4 edition, 2023. ISBN 978-87-71252-59-0.
[14] Henrik Hult and Filip Lindskog. Mathematical Modeling and Statistical Methods for
Risk Management. 2007.
[15] Jean Jacod and Philip Protter. Probability Essentials. Universitext. Springer Berlin,
Heidelberg, 2 edition, 2002. ISBN 978-3-642-55682-1.
[16] Rasmus Frigaard Lemvig. Stochastic Processes in Non-Life Insurance (SkadeStok)
Lecture Notes. URL https://rasmusfl.github.io/projects.html.
[17] Alexander J. McNeil, Rüdiger Frey, and Paul Embrechts. Quantitative Risk Manage-
ment - Concepts, Techniques and Tools. Princeton University Press, revised edition,
2015. ISBN 978-0-691-16627-8.
[18] Thomas Mikosch. Non-Life Insurance Mathematics. Universitext. Springer Berlin,
Heidelberg, 2 edition, 2009. ISBN 978-3-540-88232-9.
[19] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics. Springer
New York, NY, 2 edition, 2006. ISBN 978-0-387-28659-4.
[20] Jesper Lund Pedersen. Stochastic Processes in Life Insurance: The Dynamic Approach.
Department of Mathematical Sciences University of Copenhagen.
[21] Sidney I. Resnick. Extreme Values, Regular Variation and Point Processes. Springer
Series in Operations Research and Financial Engineering. Springer New York, NY, 1
edition, 2007. ISBN 978-0-387-75952-4.
[22] Jordan M. Stoyanov. Counterexamples in Probability. Dover Books on Mathematics.
Dover Publications Inc., 3 edition, 2013. ISBN 978-0-486-49998-7.
