
Proc. Indian Acad. Sci. (Math. Sci.) Vol. 118, No. 2, May 2008, pp. 161–182.

© Printed in India

Large deviations: An introduction to 2007 Abel prize

S RAMASUBRAMANIAN
Stat.-Math. Unit, Indian Statistical Institute, 8th Mile, Mysore Road,
Bangalore 560 059, India
E-mail: [email protected]; [email protected]

MS received 24 July 2007

Abstract. The 2007 Abel prize has been awarded to S R S Varadhan for creating a unified
theory of large deviations. We attempt to give a flavour of this branch of probability
theory, highlighting the role of Varadhan.

Keywords. Large deviation principle (LDP); rate function; Cramer’s theorem; Sanov’s
theorem; Esscher transform/tilt; convex conjugates; Laplace’s method; Varadhan’s
lemma; weak convergence of probability measures; empirical distribution; Hamilton–
Jacobi equation; Burgers’ equation; variational formula; sample path LDP; Brownian
motion/diffusion; Markov processes; ergodicity; Wentzell–Freidlin theory; exit problem;
Feynman–Kac formula; occupation time; principal eigenvalue; Donsker–Varadhan
theory.

1. Introduction
The award of the prestigious Abel prize for 2007 to Professor S R S Varadhan has been
widely acclaimed, especially among the mathematical community in India. The Abel prize,
perhaps the highest award in mathematics, was instituted by the Norwegian Academy
of Science and Letters in 2003; this annual award is patterned roughly along the lines
of the Nobel prizes in the sciences. The citation says that Varadhan is being given the award “for
his fundamental contributions to probability theory and in particular for creating a unified
theory of large deviations”.
Large deviations is a part of probability theory; it provides asymptotic estimates for
probabilities of rare events. It may be pointed out that the strong law of large numbers
and the central limit theorem, the versatile classical limit theorems of probability theory,
concern typical events. As large deviation estimates deal with probabilities of rare events
the methods needed are more subtle. Moreover, context specific techniques play a major
role though there are quite a few general principles. In this write-up we attempt to give an
elementary introduction to large deviations, of course, highlighting the role of Varadhan.

2. Actuarial origin and theorems of Cramer and Sanov


Suppose X1, X2, . . . are independent identically distributed real-valued random variables,
real-valued i.i.d.’s for short. Let F(x) = P(Xi ≤ x), x ∈ R, denote their common
distribution function, and m = ∫_R x dF(x) their common mean (expectation or ‘average
value’). If m exists then the strong law of large numbers states that (1/n)(X1 + · · · + Xn) → m
with probability 1 as n → ∞. This forms a basis for the validity of many statistical and
scientific procedures of taking averages. If a > m then the above implies that

lim_{n→∞} P((1/n)(X1 + · · · + Xn) > a) = 0.   (2.1)
However one would like to know at what rate convergence in (2.1) takes place.
Denote Sn = X1 + · · · + Xn , n ≥ 1. As the central limit theorem (CLT) explains
the prevalence of Gaussian distribution in various aspects of nature, one wonders if CLT
can shed more light. By the classical CLT for sums of i.i.d.’s, assuming that the common
variance is 1, we have for a > m,

P((1/n) Sn > a) = P((1/√n)(Sn − nm) > √n(a − m)) ≈ 1 − Φ(√n(a − m)) → 0, as n → ∞,

where

Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy, x ∈ R.   (2.2)

Here Φ(·) is the standard normal (Gaussian) distribution function. So the CLT is not
powerful enough to discern the rate.
The event {Sn > na} is a typical ‘rare event’ of interest in insurance. For example, Xi
can denote the claim amount of policy holder i in a given year, and hence Sn denotes the
total claim amount of n policy holders. Assuming a big portfolio for the insurance company
(that is, n is very large), any estimate for P (Sn > na), where a > m, gives information
about the ‘right tail’ of the total claim amount payable by the company in a year.
As another illustration from insurance, Sn can be regarded as the cumulative net payout
(that is, claim payment minus income from premiums and interests) in n years. The initial
capital u0 of the company is generally quite large. If Sn exceeds u0 then the company is
ruined.
It is easy to see why actuaries would be interested in the tail behaviour of Sn . They
would like to have an idea of how bad an extremely bad year can be, and perhaps fine-
tune premium rates or reinsurance levels. It is no wonder that the problem attracted the
attention of the great Swedish probabilist Harald Cramer, a pioneer in putting statistics
as well as insurance modelling on firm mathematical foundations. However, F Esscher, a
Scandinavian actuary, may have been the first to look at the problem and come up with
some interesting ideas (in 1932) which were later sharpened/extended by Cramer.
To appreciate Cramer’s result let us first look at two examples.

Example 2.1. Let Xi be i.i.d. random variables such that P(Xi = 0) = P(Xi = 1) = 1/2;
that is, {Xi} is a sequence of i.i.d. Bernoulli random variables with parameter 1/2. Note that
m = 1/2. Let a ∈ (1/2, 1). It is easily seen that (as Sn has a binomial distribution)

2^{−n} Qn(a) ≤ P(Sn ≥ na) ≤ (n + 1) 2^{−n} Qn(a),

where Qn(a) = max_{k ≥ an} C(n, k), with C(n, k) denoting the binomial coefficient. As a > 1/2,
the maximum is attained at k0 = the smallest integer ≥ an. Using Stirling’s formula
n! = n^n e^{−n} √(2πn) (1 + O(1/n)) it is not difficult to
see that

lim_{n→∞} (1/n) log Qn(a) = −a log a − (1 − a) log(1 − a).

From this it follows that

lim_{n→∞} (1/n) log P(Sn ≥ na) = −[log 2 + a log a + (1 − a) log(1 − a)].
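The limit in Example 2.1 can be checked numerically. The following sketch (our illustration, not part of the original article; the helper names are made up) evaluates the exact binomial tail by a log-sum-exp and compares −(1/n) log P(Sn ≥ na) with log 2 + a log a + (1 − a) log(1 − a):

```python
import math

def log_binom_tail(n, a):
    """log P(S_n >= n*a) for S_n ~ Binomial(n, 1/2), via log-sum-exp."""
    k0 = math.ceil(n * a)
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2) for k in range(k0, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(L - m) for L in logs))

def cramer_rate(a):
    """The rate log 2 + a log a + (1 - a) log(1 - a) from Example 2.1."""
    return math.log(2) + a * math.log(a) + (1 - a) * math.log(1 - a)

a = 0.75
for n in (100, 1000, 10000):
    print(n, -log_binom_tail(n, a) / n, cramer_rate(a))
```

The two quantities agree more and more closely as n grows, consistent with the Stirling corrections being of order (log n)/n.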
Example 2.2. Let {Xi} be i.i.d. N(0, 1) random variables so that P(Xi ∈ A) = ∫_A n(x) dx,
A ⊆ R, where

n(x) = (1/√(2π)) exp(−x²/2), x ∈ R.   (2.3)

Note that P(Xi ≤ x) = Φ(x) where Φ is given by (2.2). In this case m = 0. Note that the
empirical mean (1/n) Sn = (1/n) Σ_{i=1}^{n} Xi has the N(0, 1/n) distribution. Hence by the properties
of normal distributions, for any a > 0,

P(|(1/n) Sn| ≥ a) = 2[1 − Φ(a√n)].   (2.4)

Now for any y > 0 clearly

(1 − 3/y⁴) n(y) < n(y) < (1 + 1/y²) n(y).

Integrating the above over [z, ∞), where z > 0,

(1/z − 1/z³) n(z) < [1 − Φ(z)] < (1/z) n(z).   (2.5)

From (2.4) and (2.5) it easily follows that

lim_{n→∞} (1/n) log P(|(1/n) Sn| ≥ a) = −a²/2.   (2.6)

Thus, the probability of the rare event {|(1/n) Sn| ≥ a} is of the order exp(−na²/2). This is
a typical large deviations statement, and a²/2 is an example of a rate function. □

Cramer’s theorem is about an analogue of the above for sums of i.i.d. random variables.
Assume that the moment generating function (or the Laplace transform) of X1 exists,
that is,

M(t) := E[e^{tX1}] = ∫_R e^{tx} dF(x) < ∞, ∀ t ∈ R.   (2.7)

For a > E(X1), by Chebyshev’s inequality

P((1/n) Sn ≥ a) = P(exp((θ/n) Sn) ≥ e^{θa}) ≤ e^{−θa} E[exp((θ/n) Sn)] = e^{−θa} [M(θ/n)]^n

for any θ > 0. Putting θ = nt we now see that (as θ is arbitrary)

lim sup_{n→∞} (1/n) log P((1/n) Sn ≥ a) ≤ inf_{t≥0} [−ta + log M(t)]
 = − sup_{t≥0} [ta − log M(t)]
 = − sup_{t∈R} [ta − log M(t)],

where the last step follows as a > E(X1). In Example 2.2 we have M(t) = e^{t²/2} and hence
a²/2 = sup_{t∈R} [ta − log M(t)]. A similar comment is true also of Example 2.1.
In fact we have the following theorem, which is the starting point in large deviations.

Theorem 2.3 (Cramer, 1938). Let {Xi} be real-valued i.i.d.’s having finite moment
generating function M(·). Then for any a > E(X1),

lim_{n→∞} (1/n) log P((1/n) Sn ≥ a) = −I(a),   (2.8)

where

I(a) := sup{at − log M(t): t ∈ R}.   (2.9)

Similarly, for a < E(X1),

lim_{n→∞} (1/n) log P((1/n) Sn ≤ a) = −I(a).   (2.10)   □

For a proof, see [H], [DZ] and [V2].


Note that log M(t) is a convex function. One can show that I (·) is also convex and that

log M(t) = sup{ta − I (a): a ∈ R}. (2.11)

Thus log M(t) and I (a) are convex conjugates. The rate function I (·) is also known as the
Fenchel–Legendre transform of the logarithmic moment generating function log M(·).
As seen above, the upper bound in (2.8) is an easy consequence of Chebyshev’s inequality.
The key idea in the proof of the lower bound is an ‘exponential tilting’ or Esscher
transform of the distribution, a device having its origins again in insurance problems. With
F(·) and M(·) as above, for each fixed t ∈ R the Esscher transform is defined by

dF̃t(x) = (1/M(t)) e^{tx} dF(x).

Under the tilted distribution the rare event {(1/n) Sn ≥ a} becomes a typical event, thereby
facilitating analysis (see [H]).
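As a small illustration of the tilt (a hypothetical sketch, not from the paper): for the Poisson(λ) distribution one has M(t) = exp(λ(e^t − 1)), and tilting by t = log(a/λ) produces exactly the Poisson(a) distribution, so a level a > λ that is rare under F is typical under F̃t.

```python
import math

lam = 2.0          # original Poisson mean
a = 5.0            # rare level, a > lam

def tilted_pmf(k, t):
    """Esscher tilt: dF~_t(k) = e^{t k} dF(k) / M(t), with F = Poisson(lam)."""
    log_M = lam * (math.exp(t) - 1.0)                 # log mgf of Poisson(lam)
    log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return math.exp(t * k + log_p - log_M)

t = math.log(a / lam)   # choose t so that the tilted mean equals a
mean = sum(k * tilted_pmf(k, t) for k in range(200))
print(mean)             # the tilted mean sits at the 'rare' level a
```

Here the tilted law is again Poisson, with mean λe^t = a; in general the tilt shifts the mean to the point where the large deviation occurs.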
See [F] for an account of Cramer’s theorem in the context of the central limit problem;
illustrations from risk theory are also sprinkled throughout [F]. For a detailed account of insur-
ance models, and for the role played by the Esscher transform in estimating ruin probabilities,
see [RSST]. According to Varadhan, variations of the Esscher transform are a recurring theme
in large deviations.
Under the hypothesis of Cramer’s theorem the rate function I has the following
properties:

(i) I has compact level sets, that is, I −1 ([0, c]) is compact for all c < ∞; in particular
I is lower semicontinuous;
(ii) I (z) ≥ 0 with equality if and only if z = E(X1 );
(iii) I is convex on R.
 
If Xi has the Bernoulli distribution with parameter 0 < p < 1, then I(a) = a log(a/p) +
(1 − a) log((1 − a)/(1 − p)), for a ∈ [0, 1], and I(a) = ∞, otherwise. Similarly, if Xi has the
Poisson distribution with parameter λ > 0, then I(a) = λ − a + a log(a/λ), for a ≥ 0, and
I(a) = ∞ otherwise.
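These closed forms can be cross-checked by evaluating the supremum in (2.9) on a grid (an illustrative sketch; `legendre` is our own hypothetical helper, not from the paper):

```python
import math

def legendre(a, log_M, t_lo=-20.0, t_hi=20.0, steps=40001):
    """Numerically evaluate I(a) = sup{a t - log M(t)} over a grid of t."""
    best = -math.inf
    for i in range(steps):
        t = t_lo + (t_hi - t_lo) * i / (steps - 1)
        best = max(best, a * t - log_M(t))
    return best

lam = 3.0
log_M = lambda t: lam * (math.exp(t) - 1.0)    # log mgf of Poisson(lam)

for a in (0.5, 3.0, 7.0):
    closed = lam - a + a * math.log(a / lam)   # closed form quoted above
    print(a, legendre(a, log_M), closed)
```

At a = λ the rate vanishes, reflecting property (ii) above: I(z) = 0 exactly at the mean.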
With the assumptions as in Theorem 2.3, if a > m note that I(z) ≥ I(a) for all z ≥ a.
So the result (2.8) can be rephrased as, denoting A = [a, ∞),

lim_{n→∞} (1/n) log P((1/n) Sn ∈ A) = − inf_{z∈A} I(z).   (2.12)

Such a statement holds also for nice subsets A. □


Even a reasonable formulation of Cramer’s theorem in Rd was perhaps achieved only
when the general framework for large deviations was given by Varadhan. Finer aspects of
convexity, like a minimax theorem, are needed in the proof (see [V2]).
In 1957, the Russian probabilist Sanov established an important extension of Cramer’s
theorem to empirical distributions of real-valued i.i.d.’s. We shall describe it briefly. Let
μ be a probability measure on R. For y ∈ R we shall denote by δy the Dirac measure
concentrated at y. Let {Yi: i ≥ 1} be a real-valued i.i.d. sequence defined on a probability
space (S, F, P) with common distribution μ. Set

N_n^Y(ω, ·) := (1/n) Σ_{i=1}^{n} δ_{Yi(ω)}(·), ω ∈ S, n ≥ 1.   (2.13)

For each n, ω note that N_n^Y(ω, ·) is a probability measure on R; {N_n^Y} is called the family
of empirical distributions of {Yi}.
Let M(R) denote the set of all probability measures on the real line R. This is a closed
convex subset of the topological vector space of all finite signed measures on R with the
topology of weak convergence of measures; that is, νn converges to ν, denoted νn ⇒ ν, if
and only if ∫ f dνn → ∫ f dν for all f ∈ Cb(R).
Denote Xi(ω) = δ_{Yi(ω)} ∈ M(R). Hence N_n^Y = (1/n) Σ_{i=1}^{n} Xi, n ≥ 1, is a family
of random variables taking values in M(R). For any n, ω note that N_n^Y(ω, (−∞, y]) =
(1/n) Σ_{i=1}^{n} I_{(−∞,y]}(Yi(ω)) =: Fn(y, ω) for all y ∈ R; so for fixed n, ω note that Fn(·, ω)
is the distribution function of the probability measure N_n^Y(ω, ·). By the law of large numbers
Fn(y, ω) → F(y) := μ((−∞, y]), for any y ∈ R as n → ∞ for a.e. ω. That is, N_n^Y(ω, ·)
converges, in the topology of M(R), to μ as n → ∞, for a.e. ω. So questions concerning
probabilities of rare events, like P(N_n^Y ∈ U) where U is a neighbourhood of μ, become
meaningful.
By analogy with Cramer’s theorem the rate would involve the logarithmic moment
generating function of Xi, and its convex conjugate. As Xi is an M(R)-valued random
variable, its logarithmic moment generating function is a function on Cb(R) (= space of
bounded continuous functions on R) given by

log M(g) = log E[exp⟨g, Xi⟩] = log E[exp⟨g, δ_{Yi}⟩] = log ∫_R e^{g(y)} dμ(y), g ∈ Cb(R).   (2.14)

We expect the rate function to be given by

I(ν) = sup{⟨g, ν⟩ − log M(g): g ∈ Cb(R)}, ν ∈ M(R).   (2.15)

(Here ⟨g, ν⟩ = ∫_R g(y) dν(y).) In fact a more tractable expression for the rate function
can be given. Define for ν ∈ M(R),

J(ν|μ) := ∫_R f(y)[log f(y)] dμ(y), if f := dν/dμ exists; J(ν|μ) := +∞, otherwise,   (2.16)

where dν/dμ is the Radon–Nikodym derivative of ν with respect to μ. In the case of the
Bernoulli distribution with parameter p ∈ (0, 1) it is easily seen that I(a) = J(ν|μ), 0 <
a < 1, with μ = Bernoulli(p), ν = Bernoulli(a). J(ν|μ) is referred to as the relative
entropy of ν with respect to μ; it is also called the Kullback–Leibler information in statistics.
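The Bernoulli claim above is easy to verify numerically (an illustrative sketch with made-up helper names, not from the paper): the Legendre transform of log M(t) = log(1 − p + p e^t) agrees with the relative entropy of Bernoulli(a) with respect to Bernoulli(p).

```python
import math

def relative_entropy(nu, mu):
    """J(nu|mu) = sum of nu(y) log(nu(y)/mu(y)) for finite supports."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

def cramer_rate(a, p, steps=40001):
    """sup_t [a t - log M(t)] for M(t) = 1 - p + p e^t (Bernoulli(p))."""
    best = -math.inf
    for i in range(steps):
        t = -20.0 + 40.0 * i / (steps - 1)
        best = max(best, a * t - math.log(1.0 - p + p * math.exp(t)))
    return best

p, a = 0.3, 0.6
I = cramer_rate(a, p)
J = relative_entropy([1 - a, a], [1 - p, p])
print(I, J)   # level-1 Cramer rate = relative entropy, as claimed
```

The same identity underlies the statement I(ν) = J(ν|μ) in Theorem 2.4.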
The preceding heuristics suggest the following:
Theorem 2.4 (Sanov, 1957). Let the empirical distributions {N_n^Y} be given by (2.13).
Then for any convex open set A ⊂ M(R),

lim_{n→∞} (1/n) log P(N_n^Y ∈ A) = − inf{I(ν): ν ∈ A},   (2.17)

where the rate function I is given by (2.15). Moreover I(ν) = J(ν|μ), ν ∈ M(R), with
J(·|μ) given by (2.16). □
For proof and extensions, see [DZ] and [DS].
Though the two results are equivalent, a difference between the setup of Cramer’s
theorem and that of Sanov’s theorem is worth pointing out. The former deals with deviations
away from a number (i.e. the mean) and the rate is a function on R, whereas the latter
concerns deviations away from a measure and hence the rate function is defined on the space
of probability measures. As Sanov’s theorem represents a greater conceptual sophistication,
it is sometimes referred to as an example of level 2 large deviations (see [V3], [SW] and
[DZ]). However, the importance of Sanov’s work was perhaps fully realised only when
the infinite dimensional version was proved by Donsker and Varadhan in the mid 70’s; see
[DS].
Another major development was initiated by Chernoff in 1952: a program
in which questions about asymptotic efficiency of statistical tests and performance of
estimators were analyzed using large deviations. Further developments/refinements of the
results of Cramer, Sanov and Chernoff were made in the late 50’s/early 60’s. These include
in particular the works of R R Bahadur, R Ranga Rao and J Sethuraman at the Indian
Statistical Institute.
These and related works ensured the statistical pedigree of large deviations. For accounts
of the above, see [B], [H], [DZ] and [DS]. Moreover these must have made an impression
on Varadhan, who was a graduate student at the Indian Statistical Institute, Kolkata during
1959–62.

3. “Asymptotic probabilities and differential equations”


The title of this section is borrowed from that of the landmark 1966 paper of
Varadhan [V1].
Perhaps the best known example of a convex conjugate pair is the Lagrangian and the
Hamiltonian from mechanics via calculus of variations. In mechanics, the Lagrangian
denotes the difference between the kinetic energy and the potential energy, while the
Hamiltonian is the sum of the two.
Here we consider a simplified Hamiltonian. We assume H: R → R is a uniformly
convex function; that is, H″(·) ≥ c > 0. This function plays the role of the Hamiltonian.
Let T > 0 be fixed. We consider the terminal value problem for the Hamilton–Jacobi
equation

Ut − H(Ux) = 0, in (0, T) × R;  U(T, x) = G(x), x ∈ R.   (3.1)

Here G(·) is a known continuous function, and Ut , Ux denote respectively the derivatives
with respect to t, x. Let L denote the corresponding Lagrangian, that is

L(z) = sup{zx − H (x): x ∈ R}, z ∈ R (3.2)

is the convex conjugate of H. It is known from calculus of variations that the weak solution
U of (3.1) is given by the ‘variational principle’

U(t, x) = sup{G(w(T)) − ∫_t^T L(ẇ(s)) ds: w(t) = x, w is C¹},   (3.3)

where ẇ(s) = dw(s)/ds.

Remarks.

(i) In calculus of variations one considers the initial value problem for Ut + H(Ux) = 0.
The quantity ∫_0^t L(ẇ(s)) ds is called an ‘action functional’. The analogue of (3.3) is
then an infimum, and hence is called the principle of least action. The reason for our
considering the ‘backward problem’ (3.1) is that the expression (3.3) can be readily
tied up with Varadhan’s lemma later.
(ii) In optimal control theory, the modern avatar of calculus of variations, a cost functional
is minimised/maximised as in (3.3), and a nonlinear PDE like (3.1) is derived via a
dynamic programming principle.
(iii) The PDE in (3.1) can also arise as a tool for solving initial/terminal value problems for
certain scalar conservation laws of the form ut − (H(u))x = 0. In fact this served
as the motivation for Varadhan [V1]. For example, if H(x) = x²/2 then the inviscid
Burgers’ equation ut − uux = 0 is transformed into an equation like (3.1) by taking
certain indefinite integrals. See Varadhan [V1] for a brief discussion on this, and Evans
[E] for a detailed account.
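As a concrete instance of (3.1)–(3.3) (a worked sketch added here for illustration, not part of the original paper): with H(x) = ½x² the variational principle collapses to a Hopf–Lax type formula.

```latex
\text{With } H(x) = \tfrac12 x^2:\quad
L(z) = \sup_{x\in\mathbb{R}}\{zx - \tfrac12 x^2\} = \tfrac12 z^2 .
\text{ For any } C^1 \text{ path with } w(t) = x,\ w(T) = y,
\text{ Jensen's inequality gives}
\int_t^T \tfrac12\, \dot{w}(s)^2 \, ds \;\ge\; \frac{(y-x)^2}{2(T-t)},
\text{with equality for the straight line joining } x \text{ to } y.
\text{ Maximizing first over paths with } w(T)=y \text{ and then over } y,
\text{ (3.3) becomes}
U(t,x) \;=\; \sup_{y\in\mathbb{R}} \Big[\, G(y) - \frac{(y-x)^2}{2(T-t)} \,\Big].
```

This is a terminal-value analogue of the Hopf–Lax formula associated with the work of Hopf, Lax and Oleinik mentioned at the end of this section.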

Let Du[0, T] = {w: w right continuous on [0, T] into R, and w(t−) exists for each t},
with the topology of uniform convergence. Let Dac = {w ∈ Du[0, T]: w(t) = w(0) +
∫_0^t ξ(s) ds, 0 ≤ t ≤ T, and ∫_0^T |L(ξ(s))| ds < ∞}. Define the function I: Du[0, T] →
[0, ∞] by

I(w) = ∫_0^T L(ẇ(s)) ds, if w ∈ Dac;  I(w) = +∞, otherwise.   (3.4)
Then it is not difficult to show that I (·) given by (3.4) has properties similar to the rate
function of Cramer’s theorem, albeit on a more complicated space.
An expression similar to the r.h.s. of (3.3) crops up naturally in Laplace’s method in classical
asymptotic analysis. Assuming appropriate integrability conditions, and writing ‖·‖_k
for the norm in L^k(R), note that

lim_{n→∞} (1/n) log ∫_R e^{nγ(x)} dx = lim_{n→∞} log ‖e^{γ(·)}‖_n = log ‖e^{γ(·)}‖_∞ = sup{γ(x): x ∈ R}   (3.5)

for any nice function γ(·) on R. (Use of ‖·‖_n → ‖·‖_∞ in (3.5) was suggested by R Bhatia
in place of an earlier argument.) In particular, if γ(x) = g(x) − I(x) where g is a bounded
continuous function and I(·) ≥ 0 is like a rate function, then

lim_{n→∞} (1/n) log ∫_R e^{ng(x)} e^{−nI(x)} dx = sup{g(x) − I(x): x ∈ R}.   (3.6)

Note the similarity between the right-hand sides of (3.3) and (3.6). In addition, for each
n suppose dPn(x) = e^{−nI(x)} dx is a probability measure. Then for large a, by a similar
analysis on [a, ∞),

lim_{n→∞} (1/n) log Pn([a, ∞)) = lim_{n→∞} (1/n) log ∫_a^∞ e^{−nI(x)} dx
 = sup{−I(x): x ≥ a} = − inf{I(x): x ≥ a}.   (3.7)
Note the resemblance between (3.7) and (2.12). Clearly (3.5)–(3.7) suggest that there
could be a close connection between suitable families of probability measures (like those
encountered in Cramer’s theorem), and approximation schemes for solving differential
equations (like (3.1)).
See [E] for an application of Laplace’s method in the asymptotics of the viscous Burgers’
equation.
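The limit (3.7) can be observed numerically; the following sketch (hypothetical helper names, not from the paper) takes I(x) = x²/2 and a = 3/2, so that −inf{I(x): x ≥ a} = −9/8, and evaluates (1/n) log ∫_a^∞ e^{−nI(x)} dx:

```python
import math

def log_int(n, a, b=30.0, steps=100001):
    """log of the integral of e^{-n x^2/2} over [a, b], trapezoidal rule
    carried out in log-space to avoid underflow."""
    h = (b - a) / (steps - 1)
    logs = []
    for i in range(steps):
        x = a + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        logs.append(math.log(w * h) - n * x * x / 2.0)
    m = max(logs)
    return m + math.log(sum(math.exp(L - m) for L in logs))

a = 1.5
for n in (10, 100, 1000):
    print(n, log_int(n, a) / n)   # approaches -a^2/2 = -1.125
```

The slow (log n)/n convergence visible here is exactly the prefactor that large deviation statements deliberately ignore.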
At this stage it is convenient to introduce Varadhan’s unifying framework for large
deviations. The idea is to characterize the limiting behaviour of a family {Pε} of probability
measures as ε ↓ 0, in terms of a rate function. Let (S, d) be a complete separable metric
space, and F denote its Borel σ-algebra. The required abstraction is contained in the
following two key definitions.

DEFINITION 3.1
A function I: S → [0, ∞] is called a rate function if I ≢ ∞ and if the level set
{x ∈ S: I(x) ≤ c} is compact in S for each c < ∞.
In particular, a rate function is lower semicontinuous; that is, I^{−1}([0, c]) is closed in S
for all c < ∞, which is equivalent to lim inf_{n→∞} I(xn) ≥ I(x) whenever xn → x in S.

DEFINITION 3.2
Let {Pε: ε > 0} be a family of probability measures on (S, F). The family {Pε} is said to
satisfy the large deviation principle (LDP) with rate function I if
(a) I is a rate function,
(b) for every closed set C ⊆ S,

lim sup_{ε→0} ε log Pε(C) ≤ − inf_{y∈C} I(y),   (3.8)

(c) for every open set A ⊆ S,

lim inf_{ε→0} ε log Pε(A) ≥ − inf_{y∈A} I(y).   (3.9)

Remarks.
(i) Let {Xi} be as in Cramer’s theorem and Pn denote the distribution of (1/n)(X1 + · · · + Xn),
for n ≥ 1. Then, with ε = n^{−1}, Cramer’s theorem says that {Pn} satisfies LDP with
rate function given by (2.9) (see (2.12)).
(ii) In Sanov’s theorem S = M(R); it is a complete separable metric space in the topology
of weak convergence; see [P]. Also ε = n^{−1}, Pn = distribution of N_n^Y, n ≥ 1. So
Sanov’s theorem says that {Pn} satisfies LDP with the relative entropy given by (2.16) as
the rate function.
(iii) In the place of (3.8), (3.9) the more intuitive stipulation that

lim_{ε→0} ε log Pε(M) = − inf_{y∈M} I(y)

turns out to be too strong to be useful. For example, if Pε is nonatomic for all ε, then
taking M = {x}, x ∈ S, the above can hold only if I(·) ≡ ∞. This would rule out
most of the interesting cases. It turns out that (3.8), (3.9) are enough to yield a rich
theory.
(iv) The framework of complete separable metric space is known to be optimal for a rich
theory of weak convergence; see [P]. In the case of large deviations too this seems to
be so.
(v) We can now formulate Cramer’s theorem in Rd. Let {Xi} be Rd-valued i.i.d.’s with
finite moment generating function M. Let I(z) = sup{⟨θ, z⟩ − log M(θ): θ ∈ Rd}
for z ∈ Rd. Let Pn denote the distribution of (1/n)(X1 + · · · + Xn), n = 1, 2, . . . .
Then {Pn} satisfies the LDP with rate function I (see [V2] or [DZ] for a
proof).

The following elementary result gives a way of getting new families satisfying LDP’s
through continuous maps. This is also a main reason for not insisting that the rate function
be convex.
Theorem 3.3 (Contraction principle). Let {Pε} satisfy the LDP with a rate function I(·).
Let (Ŝ, d̂) be a complete separable metric space, and π: S → Ŝ a continuous function.
Put P̂ε = Pε ∘ π^{−1}, ε > 0. Then {P̂ε} also satisfies the LDP with the rate function

Î(y) = inf{I(x): x ∈ π^{−1}(y)}, if π^{−1}(y) ≠ ∅;  Î(y) = ∞, otherwise.   □

Note. In the above π can also depend on ε, with some additional assumptions (see [V2]).
Recall that {Pε} converges weakly to P (denoted Pε ⇒ P) if

lim_{ε→0} ∫_S f(x) dPε(x) = ∫_S f(x) dP(x)

for any bounded continuous function f. Also Pε ⇒ P is equivalent to either of

lim sup_{ε→0} Pε(C) ≤ P(C), C ⊆ S, C closed,

lim inf_{ε→0} Pε(A) ≥ P(A), A ⊆ S, A open.

The formal similarity between (3.8), (3.9) and the above suggests that LDP may be suit-
able for handling convergence of integrals of exponential functionals. Indeed we have the
following fundamental result, which is the key to diverse applications.

Theorem 3.4 (Varadhan’s lemma (1966)). Let {Pε} satisfy the LDP with a rate function
I(·). Then for any bounded continuous function g on S,

lim_{ε→0} ε log ∫_S exp((1/ε) g(x)) dPε(x) = sup{g(x) − I(x): x ∈ S}.   (3.10)

Thus Varadhan’s lemma is an extension of Laplace’s method to an abstract setting (see
[V1], [V2] and [DZ] for proof and discussion). Moreover the factor exp((1/ε) g(·)) in (3.10)
is reminiscent of the Esscher tilt. This strategy highlights the contribution of ‘rare events’,
that is, sets with very small Pε-measure where g(·) may take large values.

Example 3.5. We now present a ‘toy example’, taken from den Hollander [H], to indicate
that probabilities of rare events can decisively influence asymptotic expectations. Let {Xi}
be i.i.d.’s such that P(Xi = 1/2) = P(Xi = 3/2) = 1/2. Let Pn denote the distribution of
(1/n)(X1 + · · · + Xn), n ≥ 1. By Cramer’s theorem {Pn} satisfies the LDP with rate function

I(z) = log 2 + (z − 1/2) log(z − 1/2) + (3/2 − z) log(3/2 − z), if 1/2 ≤ z ≤ 3/2;
I(z) = ∞, otherwise.

Now

E[((1/n) Σ_{i=1}^{n} Xi)^n] = ∫_{[1/2, 3/2]} exp(n log x) dPn(x) = ∫_R exp(n g(x)) dPn(x),

where g is a bounded continuous function on R such that g(x) = log x, 1/2 ≤ x ≤ 3/2. By
Varadhan’s lemma,

lim_{n→∞} (1/n) log E[((1/n) Σ_{i=1}^{n} Xi)^n] = sup{log x − I(x): x ∈ [1/2, 3/2]} =: b, say.   (3.11)

It can be shown easily that b > 0. By the law of large numbers (1/n)(X1 + · · · + Xn) → 1 with
probability 1. So one might naively expect the l.h.s. of (3.11) to be zero. However, as shown
above it is not so. Thus the asymptotic expectation is determined not by the typical (or
almost sure) behaviour but by the rare event when (1/n) Σ_{i=1}^{n} Xi takes values near x*, the
value where the supremum is attained in (3.11). □
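Both sides of (3.11) can be computed for this toy example (an illustrative sketch; the names are ours, not the paper’s): the left side exactly, by summing over the binomial distribution of Sn, and b by maximizing log x − I(x) on a grid.

```python
import math

def log_E_mean_pow_n(n):
    """log E[((X_1 + ... + X_n)/n)^n] exactly, X_i in {1/2, 3/2} w.p. 1/2."""
    logs = []
    for k in range(n + 1):                     # k = #{i : X_i = 3/2}
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        mean_val = (n + 2 * k) / (2.0 * n)     # corresponding value of S_n/n
        logs.append(log_binom - n * math.log(2) + n * math.log(mean_val))
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))

def rate(z):
    """Rate function of Example 3.5, for 1/2 < z < 3/2."""
    return (math.log(2) + (z - 0.5) * math.log(z - 0.5)
            + (1.5 - z) * math.log(1.5 - z))

# b = sup{log x - I(x)}; the endpoints give negative values, so an
# interior grid suffices
b = max(math.log(x) - rate(x)
        for x in (0.5 + i / 100000.0 for i in range(1, 100000)))
n = 20000
print(log_E_mean_pow_n(n) / n, b)   # the two are close; b is about 0.1
```

The maximizer x* sits strictly to the right of the almost-sure limit 1, confirming that the expectation is driven by a rare event.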

The exit problem, discussed in the next section, gives a more concrete example where a rare
event determines the quantity/characteristic of interest.
Now, as I (·) given by (3.4) is a rate function, the similarity between the r.h.s. of (3.3)
and (3.10) is quite striking. In fact, if we can have a family {Qn } of probability measures
on Du [0, T ] satisfying LDP with rate function given by (3.4), then (3.10) gives an approx-
imation scheme for the solution to (3.1).
Suppose H(θ) = c log MF(θ), θ ∈ R, where MF is the moment generating function
(Laplace transform) of a probability distribution F on R. Let {Xi} be an i.i.d. sequence with
distribution F, and Sn = X1 + · · · + Xn, n ≥ 1. For n = 1, 2, . . . define the stochastic
process Zn(t) = (1/n) S_{[nt]}, 0 ≤ t ≤ T (here [z] denotes the integer part of z). Then the
sample paths (or trajectories) of Zn(·) are in Du[0, T]. Let Qn denote the probability
measure induced by the process Zn on Du[0, T]; (Qn may be called the distribution of the
process Zn(·)). It can be proved that {Qn} satisfies LDP with rate function given by (3.4);
this is basically a functional version of Cramer’s theorem, proved in Varadhan [V1].
If the Hamiltonian is not a logarithmic moment generating function then the approxima-
tion scheme, though similar in spirit, is more involved. But once again, it uses processes
with independent increments. Hamilton–Jacobi equations (of the type (3.1)) with non-zero
right side can also be handled (see [V1]).
Even at the risk of repetition, it may be worth mentioning the following. Thanks to
the work of Hopf, Lax and Oleinik, it was shown only in the late 50’s/early 60’s that U
given by (3.3) is the weak solution, in a suitable sense, to (3.1). In more modern jargon
(3.3) gives the viscosity solution to (3.1) (see [E] for a detailed discussion on this circle of
ideas). Varadhan [V1] has given an approximation scheme for (3.3) in terms of probabilistic
objects. On the way, a unifying framework for large deviations has been synthesized, with
Varadhan’s lemma set to play a key role.

4. Sample path LDP, Wentzell–Freidlin theory and exit problem


We begin with LDP in connection with the best known stochastic process, viz. the Brownian
motion.
Let C0 [0, T ] = {w: w continuous on [0, T ] into R , and w(0) = 0}. Let X(t, w) =
w(t), 0 ≤ t ≤ T , w ∈ C0 [0, T ] denote the coordinate projections. Let P denote the
standard one-dimensional Wiener measure on C0 [0, T ]. So P (C0 [0, T ]) = 1, and under P ,
for 0 < t1 < t2 < · · · < tk < T the random variables X(t1 ), X(t2 ) − X(t1 ), . . . , X(tk ) −
X(tk−1 ) are independent having respectively N (0, t1 ), N (0, t2 − t1 ), . . . , N(0, tk − tk−1 )
distributions. That is, under P the stochastic process {X(t): 0 ≤ t ≤ T } is a standard
one-dimensional Brownian motion. √
For > 0, let P denote the distribution of  1the process { X(t): t ≥ 0}; so for any
Borel set A ⊆ C0 [0, T ] note that P (A) = P √ A . Clearly P ⇒ δ0 as ↓ 0 where δ0
is the probability measure concentrated on the function which is identically 0.
We look at an example, some aspects of which had been alluded to earlier, to justify
why an LDP for {P } may be useful.
172 S Ramasubramanian

Example 4.1. Let T > 0 and g be a continuous function on R. Consider the terminal value
problem for the viscous Burgers’ equation:

uε_t − uε uε_x + (ε/2) uε_xx = 0, in (0, T) × R;  uε(T, x) = g(x), x ∈ R.   (4.1)

Here ε > 0 is a parameter. As ε ↓ 0 we expect uε to converge to the solution of the equation

ut − (½ u²)x = 0, in (0, T) × R   (4.2)

with terminal value u(T, x) = g(x), x ∈ R.

The conservation law (4.2) is also called the inviscid Burgers’ equation. If we denote by
Uε, G the indefinite integrals (with respect to the space variable) of uε, g respectively,
then

Uε_t − ½ (Uε_x)² + (ε/2) Uε_xx = 0, in (0, T) × R   (4.3)

with terminal value Uε(T, x) = G(x), x ∈ R. If Uε solving (4.3) can be obtained, then
uε = Uε_x solves (4.1). The nonlinear equation (4.3) can be transformed into the heat
equation by the Cole–Hopf transformation Vε(t, x) = exp(−(1/ε) Uε(t, x)). Then

Vε_t + (ε/2) Vε_xx = 0, in (0, T) × R;  Vε(T, x) = exp(−(1/ε) G(x)).   (4.4)

Once the heat equation is encountered, can probability be far behind?


It is known that the solution to (4.4) can be written in terms of the heat kernel or
equivalently the Brownian motion; see [KS]. Indeed

Vε(t, x) = ∫_R exp(−(1/ε) G(y)) · (1/√(2πε(T − t))) exp(−(y − x)²/(2ε(T − t))) dy
 = ∫_R exp(−(1/ε) G(x + z)) · (1/√(2πε(T − t))) exp(−z²/(2ε(T − t))) dz
 = ∫_R exp(−(1/ε) G(x + z)) dPε X_{T−t}^{−1}(z)
 = ∫_{C0[0,T]} exp(−(1/ε) G(x + X(T − t, w))) dPε(w).   (4.5)

Inverting the Cole–Hopf transformation we see that

−Uε(t, x) = ε log Vε(t, x) = ε log ∫_{C0[0,T]} exp((1/ε)[−G(x + X(T − t, w))]) dPε(w).   (4.6)

Clearly the r.h.s. of (4.6) suggests that the limit of Uε as ε ↓ 0 can be handled using Varadhan’s
lemma, once it is shown that {Pε: ε > 0} satisfies the LDP and the rate function is identified.
Also as ε ↓ 0 we expect Uε to converge to the solution of the Hamilton–Jacobi equation

Ut − ½ (Ux)² = 0, in (0, T) × R   (4.7)

with the terminal value U(T, x) = G(x), x ∈ R. From this, the solution to (4.2) can be
obtained by differentiating with respect to x. Here the Hamiltonian is H(y) = ½y² and
hence the Lagrangian is L(z) = ½z². Note that the approximation scheme suggested here
is somewhat different from the one discussed in the preceding section. This problem was
considered by Donsker and his student Schilder at the Courant Institute around 1965,
serving as another motivation for [V1]. □

Now define IB: C0[0, T] → [0, ∞] by

IB(w) = ½ ∫_0^T |ẇ(s)|² ds, if w ∈ Dac ∩ C0[0, T];  IB(w) = ∞, otherwise,   (4.8)

where Dac is as in (3.4) with L(x) = ½x²; so IB is the restriction of I given by (3.4) to
C0[0, T] with L(x) = ½x².

Theorem 4.2 (Schilder, 1966). {Pε: ε > 0} satisfies the LDP with rate function IB given
by (4.8). An analogous result also holds for the d-dimensional Brownian motion. □

An important ingredient of the proof is the Cameron–Martin formula which gives the
Radon–Nikodym derivative of translation by an absolutely continuous function with
respect to the Wiener measure (see Varadhan [V2] for a proof). In view of Example 2.2 and
Cramer’s theorem the rate function IB may not be surprising. Theorem 4.2 is an example
of a sample path large deviations principle. This is a level 1 LDP like Cramer’s theorem.
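Theorem 4.2 can be probed in a simple case (an illustrative sketch, not from the paper): over paths exceeding level 1 on [0, 1], inf IB = 1/2, attained by the straight line w(t) = t. On the other hand, the reflection principle gives P(sup_{t≤1} X(t) ≥ b) = 2[1 − Φ(b)], and combining this with the Gaussian tail bounds (2.5) shows ε log Pε(sup_{t≤1} w(t) ≥ 1) → −1/2.

```python
import math

def log_phi(z):
    """log of the standard normal density n(z)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def log_tail_bounds(z):
    """log of the bounds (1/z - 1/z^3) n(z) <= 1 - Phi(z) <= (1/z) n(z)."""
    lo = math.log(1.0 / z - 1.0 / z ** 3) + log_phi(z)
    hi = math.log(1.0 / z) + log_phi(z)
    return lo, hi

# P_eps(sup_{t<=1} w(t) >= 1) = P(sup_{t<=1} X(t) >= 1/sqrt(eps))
#                             = 2 [1 - Phi(1/sqrt(eps))]  (reflection principle)
for eps in (0.1, 0.01, 0.0001):
    z = 1.0 / math.sqrt(eps)
    lo, hi = log_tail_bounds(z)
    print(eps, eps * (math.log(2.0) + lo), eps * (math.log(2.0) + hi))
    # both columns approach -1/2 = -inf{I_B(w): sup w >= 1}
```

Working with the tail bounds in log-space sidesteps the underflow of 1 − Φ(z) for z of order 100.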
A far reaching generalization of the above is the LDP for diffusion processes, again
a sample path LDP, due to Wentzell and Freidlin (1970); some special cases had been
considered earlier by Varadhan. A diffusion process can be represented as a solution to
a stochastic differential equation. Let {X(t): t ≥ 0} denote a standard d-dimensional
Brownian motion, where d ≥ 1 is an integer. Let σ (·), b(·) respectively be (d × d) matrix-
valued, Rd -valued functions on Rd . The stochastic differential equation

dZ(t) = σ (Z(t)) dX(t) + b(Z(t)) dt (4.9)

with initial value Z(0) = z0 is interpreted as the stochastic integral equation


Z(t) = z_0 + ∫_0^t σ(Z(s)) dX(s) + ∫_0^t b(Z(s)) ds.   (4.10)

Here expressions of the form ∫_0^t ξ(s) dX(s) denote Itô integrals. When σ, b are Lipschitz continuous, a unique solution to (4.10) can be obtained by Picard iteration. The diffusion
process given by (4.10) is a Markov process; that is, if the ‘present’ is known, then the
‘past’ and the ‘future’ of the process are independent; this is also called the memoryless
property (see [KS]). The sample paths of the diffusion are continuous; so the process
{Z(t): 0 ≤ t ≤ T } induces a probability measure on C([0, T ]: Rd ).
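A minimal way to simulate (4.10) is the Euler–Maruyama discretization (a standard scheme, not discussed in the paper); the concrete coefficients below are illustrative, and with σ ≡ 0 the scheme reduces to Euler's method for the ODE dz = b(z) dt, which gives an easy sanity check.

```python
# Euler--Maruyama sketch for the SDE (4.10): dZ = sigma(Z) dX + b(Z) dt.
# (Standard discretization; the particular sigma, b are illustrative.)
import math, random

def euler_maruyama(sigma, b, z0, T, n, rng=random):
    """Return [Z(0), Z(dt), ..., Z(T)] on a grid of n steps."""
    dt = T / n
    z, path = z0, [z0]
    for _ in range(n):
        dx = rng.gauss(0.0, math.sqrt(dt))     # Brownian increment
        z = z + sigma(z) * dx + b(z) * dt
        path.append(z)
    return path

random.seed(0)
# Ornstein--Uhlenbeck-type example: b(z) = -z, sigma(z) = 0.3
noisy = euler_maruyama(lambda z: 0.3, lambda z: -z, 1.0, 5.0, 5000)
# With sigma = 0 the scheme is just Euler for dz = -z dt, so the
# endpoint should be close to exp(-5).
clean = euler_maruyama(lambda z: 0.0, lambda z: -z, 1.0, 5.0, 5000)
print(clean[-1], math.exp(-5.0))
```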

Theorem 4.3 (Wentzell–Freidlin, 1970). Let σ, b be Lipschitz continuous. Assume that a(·) := σ(·)σ(·)† is uniformly positive definite; that is, ∃ λ₀ > 0 such that ⟨a(x)ξ, ξ⟩ ≥ λ₀|ξ|², for all x, ξ ∈ R^d. Let x ∈ R^d be fixed. For ε > 0 consider the diffusion process

dZ^{ε,x}(t) = √ε σ(Z^{ε,x}(t)) dX(t) + b(Z^{ε,x}(t)) dt   (4.11)

with initial value Z^{ε,x}(0) = x. Let Q^{ε,x} denote the probability measure induced on C([0, T]: R^d) by the process {Z^{ε,x}(t): 0 ≤ t ≤ T}. Then {Q^{ε,x} : ε > 0} satisfies the LDP with the rate function

I_x(w) = (1/2) ∫_0^T ⟨ẇ(t) − b(w(t)), a^{−1}(w(t)) (ẇ(t) − b(w(t)))⟩ dt  if w ∈ D_x,  and  I_x(w) = ∞ otherwise,   (4.12)

where

D_x = { w ∈ C([0, T]: R^d): w(t) = x + ∫_0^t ξ(s) ds, ξ ∈ L²[0, T] }.

Remark. If σ (·) ≡ identity matrix, b(·) ≡ 0 then the above reduces to Schilder’s theorem.
In fact, if σ (·) ≡ constant, then the above result is a simple consequence of Schilder’s
theorem and the contraction principle. So the expression (4.12) may not be surprising;
however the proof in the general case involves a delicate approximation (see [FW]
and [V2]).
We next indicate a connection between diffusions and second order elliptic/parabolic
PDE’s. With σ, a, b as in Theorem 4.3 define the elliptic differential operator L by

Lg(x) = (1/2) Σ_{i,j=1}^d a_{ij}(x) ∂²g(x)/∂x_i∂x_j + Σ_{i=1}^d b_i(x) ∂g(x)/∂x_i,   (4.13)

where a(·) = ((a_{ij}(·))). The operator L is called the infinitesimal generator of the diffusion process Z(·) given by (4.9) and (4.10). The probabilistic behaviour of the diffusion is characterized by L. In particular, the transition probability density function of Z(·) is the fundamental solution to the parabolic operator ∂/∂t + L. (For example, the generator corresponding to Brownian motion is the d-dimensional Laplacian (1/2)Δ := (1/2) Σ_{i=1}^d ∂²/∂x_i², and the heat kernel is the corresponding transition probability density function; see [KS].)
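The generator relation can be verified by hand for Brownian motion: for g(x) = cos x, a Gaussian computation gives E_x[g(X(h))] = e^{−h/2} cos x, so the difference quotient converges to (1/2)g″(x) = −(1/2) cos x. A tiny numerical check (the test function and evaluation point are arbitrary illustrative choices):

```python
# Generator of 1-d Brownian motion: L g = (1/2) g''.  For g(x) = cos x
# the semigroup acts exactly:
#   E_x[ g(X(h)) ] = E[ cos(x + sqrt(h) N(0,1)) ] = exp(-h/2) cos x,
# so ( E_x[g(X(h))] - g(x) ) / h  ->  -(1/2) cos x as h -> 0.
import math

def semigroup_cos(x, h):
    return math.exp(-h / 2.0) * math.cos(x)

x = 0.7
for h in (1.0, 0.1, 0.01, 0.001):
    rate = (semigroup_cos(x, h) - math.cos(x)) / h
    print(h, rate, -0.5 * math.cos(x))   # the two columns converge together
```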
Let G ⊂ Rd be a bounded smooth domain. Consider the (Dirichlet) boundary value
problem

Lu(x) = −g(x), x ∈ G with u(x) = f (x), x ∈ ∂G (4.14)

where g, f are known functions. Then the unique solution to (4.14) can be written as

u(x) = E[ f(Z(τ)) + ∫_0^τ g(Z(s)) ds | Z(0) = x ],  x ∈ Ḡ   (4.15)

where Z is the diffusion given by (4.9), (4.10), and τ = inf{t > 0: Z(t) ∉ G} is the first exit time from G. Note that the r.h.s. of (4.15) denotes the expectation given that Z(0) = x.
The random variable τ is an example of a stopping time. As a(·) is uniformly positive definite, we have τ < ∞ with probability 1. Thus (4.15) gives a stochastic representation of the solution to the Dirichlet problem (4.14); this can be proved using stochastic calculus (see [KS]).
For ε > 0, x ∈ R^d let Z^{ε,x}(·) be given by (4.11). The infinitesimal generator L^ε of the diffusion Z^{ε,x} is given by

L^ε v(x) = (ε/2) Σ_{i,j=1}^d a_{ij}(x) ∂²v(x)/∂x_i∂x_j + Σ_{i=1}^d b_i(x) ∂v(x)/∂x_i.   (4.16)

For each ε > 0, with the same g, f, the problem

L^ε u_ε(x) = −g(x), x ∈ G  with  u_ε(x) = f(x), x ∈ ∂G   (4.17)

has the solution

u_ε(x) = E[ f(Z^{ε,x}(τ_ε)) + ∫_0^{τ_ε} g(Z^{ε,x}(s)) ds ],  x ∈ Ḡ   (4.18)

where τ_ε = inf{t > 0: Z^{ε,x}(t) ∉ G}. In particular, x → E_x(τ_ε) := E(τ_ε | Z^{ε,x}(0) = x) is the solution to (4.17) with f ≡ 0, g ≡ 1.
For ε = 0 note that the equation (4.11) becomes the ODE

dz(t) = b(z(t)) dt, with z(0) = x. (4.19)

We make the following assumption:

(A) There exists x0 ∈ G (an interior point) such that for any x ∈ Ḡ the solution z(·) to
(4.19) with initial value z(0) = x satisfies z(t) ∈ G for all t > 0 and limt→∞ z(t) = x0 ;
that is, x0 is the unique stable equilibrium point in Ḡ of the ODE (4.19).
Some questions of interest are: What happens to u_ε as ε ↓ 0? In particular, what about E_x(τ_ε) as ε ↓ 0? What can one say about the hitting distribution on ∂G in the limit?

For small ε, the trajectories of the diffusion Z^{ε,x}(·) are close to the deterministic trajectory z(·) with very high probability. And, as the deterministic trajectory z(·) does not exit G at all, a reasonable guess would be that the system Z^{ε,x} tends to stay inside G for small ε. In such an eventuality note that the limiting exit time and exit place are not defined. To get a handle on the problem, we proceed differently. By continuity of sample paths, Z^{ε,x}(τ_ε) is ∂G-valued. So for any ε > 0, the hitting distribution, i.e. the distribution of Z^{ε,x}(τ_ε), is a probability measure on ∂G. Since ∂G is compact, this family of probability measures has limit points.
To appreciate the importance of the problem let us look at two situations. The first example is from chemistry, which is the origin of the 'exit problem'. It is known that molecules need to overcome a potential barrier to be able to participate in a chemical reaction. As the molecules are in motion, their energy is modelled by a diffusion of the type Z^{ε,x}(·), oscillating about a stable state; here ε > 0 is the so-called Arrhenius factor. The potential barrier θ is represented by the diameter of the domain G. In general, ε ≪ θ. So exit from the 'right end' of G for small ε means the reaction will proceed. Hence the asymptotic rate of exit at the right end of the potential well, as ε ↓ 0, gives a very good estimate of the reaction rate (see [Kp], [Sc] for more background information and the ad hoc ε-expansion method due to Kramers).
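The Arrhenius picture can be sketched in a simulation. For the gradient drift b = −V′ with V(z) = z²/2 and unit diffusion coefficient, the quasipotential of the Wentzell–Freidlin theory from the stable point 0 is 2(V(y) − V(0)), so for G = (−1, 1) the exit-time asymptotics in Theorem 4.4 below predict ε log E(τ_ε) → 1. A rough Monte Carlo sketch (step size, sample count, and the particular V are all illustrative choices; the convergence in ε is slow):

```python
# Monte Carlo sketch of exit-time asymptotics: dZ = -Z dt + sqrt(eps) dW
# on G = (-1, 1), stable point x0 = 0.  Here b = -V' with V(z) = z^2/2,
# the quasipotential at the boundary is 2(V(1) - V(0)) = 1, so
# eps * log E(tau_eps) should creep toward 1 as eps decreases.
import math, random

def mean_exit_time(eps, n_paths=50, dt=0.01, max_steps=400000, rng=None):
    rng = rng or random.Random(1)        # seeded for reproducibility
    total = 0.0
    for _ in range(n_paths):
        z, steps = 0.0, 0
        while abs(z) < 1.0 and steps < max_steps:
            z += -z * dt + math.sqrt(eps) * rng.gauss(0.0, math.sqrt(dt))
            steps += 1
        total += steps * dt
    return total / n_paths

for eps in (0.5, 0.25):
    m = mean_exit_time(eps)
    print(eps, m, eps * math.log(m))
# exit takes exponentially longer as eps decreases
```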

The second example is from engineering, concerning track loss in radar systems. In such a system the observed tracking error, due to evasive manoeuvres of the target as well as to observation noise, is modelled by a diffusion of the type Z^{ε,x}(·). Here ε gives the variance parameter in the observation noise. As radar systems are quite sophisticated, this parameter is very small compared to the actual tracking error. Since the observation device has a limited field of view, Z^{ε,x}(·) ceases to model the observation process as soon as the tracking error exits from the field of view. So exiting the domain in this case is an undesirable event. Hence information on the probability of exit, mean time of exit, exit place on ∂G, etc. may be useful in designing optimal devices (see [DZ] for a detailed discussion).
Motivated by the rate function in Theorem 4.3, for 0 < t < ∞ define

I_t(y(·)) = (1/2) ∫_0^t ⟨(ẏ(s) − b(y(s))), a^{−1}(y(s))(ẏ(s) − b(y(s)))⟩ ds

if y is absolutely continuous with square integrable derivative ẏ. Set

ϕ_t(x, y) = inf{ I_t(y(·)): y(0) = x, y(t) = y, y absolutely continuous, ẏ square integrable },

ϕ(x, y) = inf_{t>0} ϕ_t(x, y),  x, y ∈ R^d.

Heuristically ϕ_t(x, y) can be interpreted as the cost of forcing the diffusion Z^{ε,x}(·) to be at the point y at time t. Define the function ϕ̄ by

ϕ̄(y) = ϕ(x0, y),  y ∈ Ḡ   (4.20)

where x0 is the unique stable equilibrium point of the ODE (4.19) as in (A).
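In the gradient case the variational problem defining ϕ̄ can be solved in closed form; the following short computation (for the special case a(·) ≡ identity and b = −∇V, an assumption made here purely for illustration) explains the factor 2 that appears in Arrhenius-type asymptotics:

```latex
% Gradient case: a \equiv I, b = -\nabla V.  Completing the square,
\begin{align*}
I_t(y(\cdot))
  &= \tfrac12 \int_0^t \bigl|\dot y(s) + \nabla V(y(s))\bigr|^2 \, ds \\
  &= \tfrac12 \int_0^t \bigl|\dot y(s) - \nabla V(y(s))\bigr|^2 \, ds
     + 2 \int_0^t \bigl\langle \dot y(s), \nabla V(y(s)) \bigr\rangle \, ds \\
  &= \tfrac12 \int_0^t \bigl|\dot y(s) - \nabla V(y(s))\bigr|^2 \, ds
     + 2 \bigl[ V(y(t)) - V(y(0)) \bigr]
  \;\ge\; 2 \bigl[ V(y(t)) - V(y(0)) \bigr],
\end{align*}
% with equality iff \dot y = \nabla V(y).  Optimizing over t > 0 gives
% \varphi(x_0, y) = 2\,( V(y) - V(x_0) ) for y in the basin of attraction of x_0.
```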
Theorem 4.4 (Wentzell–Freidlin, 1970). Let σ, b, a be as in Theorem 4.3. Assume (A). Further assume that there exists ȳ ∈ ∂G such that ϕ̄(ȳ) < ϕ̄(y) for y ∈ ∂G, y ≠ ȳ. Then the following hold.

(i) For any x ∈ G, Z^{ε,x}(τ_ε) → ȳ in probability as ε ↓ 0. So the hitting distribution converges (weakly) to δ_ȳ as ε ↓ 0, whatever be the starting point.
(ii) Let u_ε be the solution to (4.17) for continuous boundary data f, and g ≡ 0. Then lim_{ε↓0} u_ε(x) = f(ȳ) for any x ∈ G. (Part (ii) is an immediate consequence of part (i).)
(iii) lim_{ε↓0} ε log E_x(τ_ε) = ϕ̄(ȳ), for any x ∈ G. □
The intuitive explanation is along the following lines. “Any wandering away from x0
has an overwhelmingly high probability of being pushed back to x0 , and it is not the time
spent near any part of ∂G that matters but the a priori chance for a direct, fast exit due to
a rare segment in the Brownian motion’s path” – see p. 198 of [DZ]. This is possible since
p0 (t, x, y) > 0 for any t > 0 (however small), any x, y ∈ Rd (however large |x − y| may
be), where p0 is the heat kernel. For proof and discussion, see [V2], [FW] and [DZ].
An interesting application of sample path LDP arises in queueing networks. Identifica-
tion of the rate function in a general setting is not very easy. But once the rate function
is found and is regular, it can be used to characterize decoupling of data sources and to
define ‘effective bandwidth’ of each source; this is of importance to traffic engineering
in communication networks. Moreover the most likely way in which buffers overflow
can be determined by considering ‘minimizing large deviation paths’ for certain reflected
Brownian motion processes that arise as heavy traffic models for queueing networks
(see [SW], [AD], [RD], [DR] and references therein).

5. LDP for occupation times: Prelude to Donsker–Varadhan theory


We next look at large deviation methods in connection with principal eigenvalues. Let V(·) be a continuous periodic function on R with period 2π, and consider the problem

∂u/∂t (t, x) = (1/2) ∂²u/∂x² (t, x) + V(x) u(t, x),  t > 0, x ∈ R

with the initial value u(0, x) = 1. By the Feynman–Kac formula the solution is given by

u(t, x) = E_x[ exp( ∫_0^t V(X(s)) ds ) ] := E[ exp( ∫_0^t V(X(s)) ds ) | X(0) = x ],

where X(·) denotes one-dimensional Brownian motion; this can be proved using stochastic
calculus; see [KS].
Since V and the initial value are periodic, x → u(t, x) is also periodic. Note that Y(t) := X(t) mod 2π, t ≥ 0, is the Brownian motion on the 1-dimensional torus (circle) T. So the problem as well as the solution can be considered on T rather than on R. In other words, the problem is basically

∂u/∂t (t, θ) = Au(t, θ) := ( (1/2) ∂²/∂θ² + V(θ) ) u(t, θ),  t > 0, θ ∈ T,

u(0, θ) = 1,  θ ∈ T   (5.1)

and the solution, by the Feynman–Kac formula, is

u(t, θ) = E[ exp( ∫_0^t V(Y(s)) ds ) | Y(0) = θ ],  t ≥ 0, θ ∈ T.   (5.2)

The one-dimensional Schrödinger operator A := (1/2) ∂²/∂θ² + V(θ) is an unbounded operator with domain D(A) ⊂ L²(T). It is known from the theory of second-order elliptic differential equations that A⁻¹ is a bounded self-adjoint compact operator. So by spectral theory A has a sequence {λ_i} of eigenvalues, and a corresponding sequence {ψ_i(·)} of eigenfunctions, such that lim_{m→∞} λ_m = −∞, λ₁ > λ₂ ≥ λ₃ ≥ ..., the principal eigenvalue λ₁ is of multiplicity one and the corresponding eigenfunction ψ₁(·) > 0 (see [E] or [K]). The semigroup {T_t} corresponding to (5.1) can formally be written as {e^{tA}} and hence by spectral theory again

u(t, θ) = (e^{tA} 1)(θ) = Σ_{k=1}^∞ e^{λ_k t} ⟨ψ_k, 1⟩ ψ_k(θ),   (5.3)

where 1 denotes the function which is identically 1 on T, and ⟨·, ·⟩ denotes the inner product in L²(T). As λ₁ > λ_i, i ≥ 2, and ψ₁ > 0, from (5.3) we have

u(t, θ) = e^{λ₁ t} ⟨ψ₁, 1⟩ ψ₁(θ) [1 + O(e^{−(λ₁−λ₂)t})]

and consequently

lim_{t→∞} (1/t) log u(t, θ) = λ₁.   (5.4)
This is a result due to Kac.
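Kac's limit can also be observed numerically: discretize A on a periodic grid, iterate the (explicit Euler) semigroup starting from u ≡ 1, and read off the exponential growth rate, which then agrees with the discrete Rayleigh quotient of the limiting profile (anticipating (5.6) below). The choice V(θ) = cos θ and all grid parameters are illustrative:

```python
# Kac's limit (5.4) for A = (1/2) d^2/d theta^2 + V on the torus: power
# iteration with u -> u + dt*A u converges to the principal eigenfunction,
# and the growth rate per step recovers lambda_1 (of the discretized A).
import math

N = 64
dx = 2.0 * math.pi / N
dt = 0.004                               # small enough for stability
V = [math.cos(i * dx) for i in range(N)]

def apply_A(u):
    return [0.5 * (u[(i + 1) % N] - 2 * u[i] + u[(i - 1) % N]) / dx ** 2
            + V[i] * u[i] for i in range(N)]

u = [1.0] * N
for _ in range(10000):                   # run to t = 40; u ~ e^{lambda1 t} psi1
    Au = apply_A(u)
    u = [u[i] + dt * Au[i] for i in range(N)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]            # renormalize each step

# growth rate of one more Euler step gives lambda_1 of the discrete A ...
v = [u[i] + dt * a for i, a in enumerate(apply_A(u))]
growth = (math.sqrt(sum(x * x for x in v)) - 1.0) / dt

# ... and it matches the discrete Rayleigh quotient of the limit profile
rayleigh = sum(V[i] * u[i] ** 2 for i in range(N)) \
    - 0.5 * sum(((u[(i + 1) % N] - u[i]) / dx) ** 2 for i in range(N))
print(growth, rayleigh)   # two estimates of lambda_1 agree
```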
Now the bilinear form associated with A is

B[f, g] = ⟨Af, g⟩ = (1/2) ∫_T f″(θ) g(θ) dθ + ∫_T V(θ) f(θ) g(θ) dθ
  = −(1/2) ∫_T f′(θ) g′(θ) dθ + ∫_T V(θ) f(θ) g(θ) dθ,   (5.5)

where in the last step we have used integration by parts and periodicity. It is known by the classical Rayleigh–Ritz variational formula (see [E] or [K]) that the principal eigenvalue λ₁ can be given, in view of (5.5), by

λ₁ = sup{ B[g, g]: g differentiable, ‖g‖_{L²} = 1 }
  = sup{ ∫_T V(θ) g²(θ) dθ − (1/2) ∫_T |g′(θ)|² dθ : g differentiable, ‖g‖_{L²} = 1 }.   (5.6)

Similar analysis is possible also on R if limx→±∞ V (x) = −∞. The above discussion
basically means that the Perron–Frobenius theorem for nonnegative irreducible matrices
goes over to self-adjoint second-order elliptic operators.
A natural question, whose implications turn out to be far-reaching, is: Is there a direct way of getting (5.6) from (5.2) without passing through the differential equation (5.1) or the interpretation of the limit in (5.4) as an eigenvalue? If it is possible to do so, then one can replace ∫_0^t V(Y(s)) ds by more general functionals of the form F(Y(t)) depending on Brownian paths and hope to calculate lim_{t→∞} (1/t) log E[exp(F(Y(t)))]. In such a case there may be no connection with differential equations. Moreover one can also consider processes other than Brownian motion. Donsker's firm conviction that something deep was going on here propelled the investigation along these lines.
Put f(θ) = g²(θ). Then what we seek can be written as

lim_{t→∞} (1/t) log E[ exp( t (1/t) ∫_0^t V(Y(s)) ds ) | Y(0) = y ]
  = sup{ ∫_T V(θ) f(θ) dθ − (1/8) ∫_T (1/f(θ)) |f′(θ)|² dθ : ‖f‖_{L¹} = 1, f ≥ 0 }   (5.7)

for any y ∈ T. If (5.7) can be considered as a special case of (3.10) then our purpose would be served by Varadhan's lemma. Also (5.7) implies that the factor exp(∫_0^t V(Y(s)) ds) in the Feynman–Kac formula (5.2) can be viewed as an Esscher tilt.
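The substitution f = g² indeed carries the right-hand side of (5.6) into that of (5.7); for differentiable f > 0 with ∫ f dθ = 1:

```latex
% Substituting f = g^2 (so g' = f' / (2\sqrt{f})):
\int_{\mathbb{T}} V(\theta)\, g^2(\theta)\, d\theta
   = \int_{\mathbb{T}} V(\theta)\, f(\theta)\, d\theta,
\qquad
\frac12 \int_{\mathbb{T}} |g'(\theta)|^2\, d\theta
   = \frac12 \int_{\mathbb{T}} \left| \frac{f'(\theta)}{2\sqrt{f(\theta)}} \right|^2 d\theta
   = \frac18 \int_{\mathbb{T}} \frac{|f'(\theta)|^2}{f(\theta)}\, d\theta,
% while the constraint \|g\|_{L^2} = 1 becomes \|f\|_{L^1} = 1, f \ge 0.
```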
Towards this, let Ω = {w: [0, ∞) → T: w continuous}; this can be taken as the basic probability space. Define Y(t, w) = w(t), t ≥ 0, w ∈ Ω. For y ∈ T, let P_y denote the probability measure on Ω making {Y(t): t ≥ 0} a Brownian motion on T starting at y; that is, P_y is the distribution induced on Ω by the Brownian motion on the torus starting at y. For t ≥ 0, w ∈ Ω, A ⊆ T define

M(t, w, A) = (1/t) ∫_0^t I_A(Y(s, w)) ds,   (5.8)

denoting the proportion of time that the trajectory Y(·, w) spends in the set A during the period [0, t]. This is called the occupation time. Note that A → M(t, w, A) is a probability measure on the torus. Let M(T) denote the space of probability measures on T, endowed with the topology of weak convergence of probability measures. For t ≥ 0 fixed, let M_t denote the map w → M(t, w, ·) ∈ M(T). Let Q_t^{(y)} := P_y M_t^{−1} denote the distribution of M_t. So Q_t^{(y)} is a probability measure on M(T); in other words, Q_t^{(y)} ∈ M(M(T)), for any t ≥ 0, y ∈ T.
Now observe that

(1/t) ∫_0^t V(Y(s, w)) ds = ∫_T V(θ) M(t, w, dθ)   (5.9)

and consequently

E[ exp( t (1/t) ∫_0^t V(Y(s)) ds ) | Y(0) = y ]
  = ∫_Ω exp( t ∫_T V(θ) M(t, w, dθ) ) dP_y(w)
  = ∫_{M(T)} exp( t ∫_T V(θ) μ(dθ) ) dQ_t^{(y)}(μ)
  = ∫_{M(T)} exp( t Φ(μ) ) dQ_t^{(y)}(μ),   (5.10)

where

Φ(μ) = ∫_T V(θ) μ(dθ),  μ ∈ M(T).   (5.11)

Note that (5.9), (5.10) imply that the l.h.s. of (5.7) is of the same form as the l.h.s. of (3.10) with S = M(T), ε = 1/t, P_ε = Q_t^{(y)}, and g = Φ. It can be shown that I₀(·) defined by

I₀(μ) = (1/8) ∫_T (1/f(θ)) |f′(θ)|² dθ  if dμ(θ) = f(θ) dθ with f differentiable,  and  I₀(μ) = ∞ otherwise,   (5.12)

is the rate function on M(T); note that M(T) is a compact metric space by Prohorov's theorem. In fact we have the following:

Theorem 5.1 (Donsker–Varadhan, 1974). For any y ∈ T, the family {Q_t^{(y)}: t ≥ 0} of probability measures on M(T), induced by the occupation time functionals of Brownian motion on T, satisfies the LDP with rate function I₀ given by (5.12). Consequently, by Varadhan's lemma, (5.7) holds. □
For proof, see [DV1]. Moreover asymptotics of more general functionals of the occupation measure M(t, w, ·) can be described. Like Sanov's theorem, the above result of Donsker and Varadhan is a level 2 LDP.
The basic space in the above set up is the torus, which has a canonical measure, viz. the rotation invariant (Haar) measure dθ. The basic process is the Brownian motion on the torus. Its generator is the Laplacian, which is uniformly elliptic and self-adjoint. Hence the normalized Haar measure on the torus turns out to be the unique ergodic probability measure for the basic process. This important fact has played a major role in the background.
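The rate function (5.12) vanishes exactly when f′ ≡ 0, i.e. at the normalized Haar measure, so the occupation measure of Brownian motion on T concentrates on the uniform distribution for large t. A seeded simulation sketch (step size, horizon and the binning are illustrative choices):

```python
# Occupation measure (5.8) of Brownian motion on the torus: wrap Gaussian
# increments mod 2*pi and histogram the time spent in 8 equal arcs; each
# arc should receive close to 1/8 of the time, i.e. the occupation measure
# approaches the normalized Haar measure.
import math, random

random.seed(7)
dt, n_steps, n_bins = 0.01, 200000, 8
bin_w = 2.0 * math.pi / n_bins
theta, counts = 0.0, [0] * n_bins
for _ in range(n_steps):
    theta = (theta + random.gauss(0.0, math.sqrt(dt))) % (2.0 * math.pi)
    counts[min(int(theta / bin_w), n_bins - 1)] += 1   # guard vs. rounding

occupation = [c / n_steps for c in counts]
print(occupation)   # each entry close to 1/8 = 0.125
```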
The above result is the proverbial tip of the iceberg. It led to an extensive study, by
Donsker and Varadhan, of LDP for occupation times for Markov chains and processes.
Some of the results were also independently obtained by Gartner [G]. This in turn formed
the basis for providing a variational formula for the principal eigenvalue of a not necessarily
self-adjoint second-order elliptic differential operator, a solution to the problem of Wiener
sausage, etc. However it is not powerful enough to deal with the polaron problem from
statistical physics.
For this a LDP at the process level is needed. This is called level 3 large deviations.
A crowning achievement is the LDP for empirical distributions of Markov processes, due
to Donsker and Varadhan. We briefly describe this far reaching extension of Theorem 5.1.
Note that (5.8) can also be written as

M(t, w, A) = lim_{n→∞} (1/2ⁿ) Σ_{k=1}^{2ⁿ} δ_{Y(tk2^{−n}, w)}(A).

On the r.h.s. of the above we have a sequence of empirical distributions. To handle large deviation problems, the proper way to extend the notion of empirical distribution to stochastic processes turns out to be as follows.
Let Σ be a complete separable metric space (the state space). Let Ω = {w: w right continuous on (−∞, ∞) into Σ, and w(t−) exists for all t}. Under the Skorokhod topology on bounded intervals, Ω is a complete separable metric space. Let Ω₊ denote the corresponding space of Σ-valued functions on [0, ∞). For r ∈ (−∞, ∞) let θ_r denote the translation map on Ω given by θ_r w(s) = w(r + s).

For w ∈ Ω₊, t > 0 let w_t be such that w_t(s + t) = w_t(s) for all s ∈ (−∞, ∞), and w_t(s) = w(s) for 0 ≤ s < t; that is, the segment of w on [0, t) is extended periodically to get w_t. For w ∈ Ω₊, t > 0, B ⊂ Ω define

R_{t,w}(B) = (1/t) ∫_0^t I_B(θ_r w_t) dr.   (5.13)

It can be shown that R_{t,w}(θ_σ B) = R_{t,w}(B) for any B ⊆ Ω, σ > 0. So B → R_{t,w}(B) is a translation invariant probability measure on Ω. Let M_S(Ω) denote the space of all translation invariant probability measures on Ω, with the topology of weak convergence. This is a complete separable metric space. For fixed t ≥ 0, note that w → R_{t,w} is a mapping from Ω₊ into M_S(Ω). It is called the empirical distribution functional.

We write M(t, w, A) = (1/t) ∫_0^t I_A(w(s)) ds, A ⊆ Σ, to denote the analogue of (5.8) in the present context. It can be seen that M(t, w, ·) = R_{t,w} π₀^{−1}, where π₀ is the projection from Ω onto Σ given by w → w(0). Thus the occupation time functional is the marginal distribution of the empirical distribution functional.
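The marginal relation has an exact discrete-time analogue: extend a finite path periodically, take the empirical distribution of consecutive pairs (a two-coordinate window of the shifted path), and its first marginal is precisely the occupation measure of the path. A toy check (the sequence is an arbitrary illustrative example):

```python
# Discrete-time toy version of "occupation measure = marginal of the
# empirical distribution": extend a finite path periodically, form the
# empirical distribution of consecutive pairs, and check that its first
# marginal equals the occupation measure of the path.
from collections import Counter

w = ['a', 'b', 'b', 'c', 'a', 'b']      # path segment on [0, t)
n = len(w)

# empirical distribution of pairs (w_r, w_{r+1}) with periodic extension
pair_emp = Counter((w[r], w[(r + 1) % n]) for r in range(n))
pair_emp = {k: v / n for k, v in pair_emp.items()}

# its first marginal ...
marginal = Counter()
for (x, _), p in pair_emp.items():
    marginal[x] += p

# ... equals the occupation measure of the path
occupation = {x: c / n for x, c in Counter(w).items()}
print(dict(marginal), occupation)
```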
Let P_{0,x} denote the distribution of a Σ-valued ergodic Markov process starting from x ∈ Σ at time 0; it is a probability measure on Ω₊. For t ≥ 0, x ∈ Σ, let ζ_t^{(x)} be defined by

ζ_t^{(x)}(E) = P_{0,x}{ w ∈ Ω₊: R_{t,w} ∈ E },  E ⊆ M_S(Ω).   (5.14)

So ζ_t^{(x)} is a probability measure on M_S(Ω). As the Markov process is ergodic, there is a unique invariant probability measure ν on Σ. Let μ ∈ M_S(Ω) be the translation invariant measure on Ω with ν as its marginal distribution; that is, μ is the stationary Markov process with ν as its marginal distribution. Using the ergodic theorem it can be proved that ζ_t^{(x)} ⇒ δ_μ as t → ∞ for any x ∈ Σ, where δ_μ denotes the Dirac measure concentrated at μ.

A stochastic process, in particular a Markov process, can be identified with an appropriate element of M(Ω₊), while a stationary stochastic process can be identified with an element of M_S(Ω). With each ergodic Markov process we associate a stationary stochastic process in this way. Since {P_{0,x}} as well as {R_{t,w}} represent stochastic processes, an LDP for {ζ_t^{(x)}} ⊂ M(M_S(Ω)) is considered an example of the highest level large deviations.

It is a deep result due to Donsker and Varadhan (1983) that, under suitable conditions, for any x ∈ Σ the family {ζ_t^{(x)}: t ≥ 0} satisfies the LDP with a rate function H(·) defined on M_S(Ω) in terms of a relative entropy function. For details, see [DV2], as well as [V2] and [DS]. The rate function H(·) is called the entropy with respect to the Markov process {P_{0,x}}.
The process level LDP turned out to be essential for Donsker and Varadhan (1983) to solve the polaron problem. This problem involves showing that the limit

η(α) := lim_{t→∞} (1/t) log E[ exp( α ∫_0^t ∫_0^t e^{−|s−r|} / |B(s) − B(r)| dr ds ) ]

exists, where B(·) is the three-dimensional Brownian motion, and establishing a conjecture made in 1949 by Pekar concerning the asymptotics of η(α) as α → ∞. For a description of the polaron problem, see [R] (see [V2] and the references therein for details).
In this write-up we have attempted to give just a flavour of large deviations. While [V2] and [V3] give succinct overviews, [DS], [DZ] and [H] are excellent textbooks on the subject; the latter two also discuss applications to statistics, physics, chemistry, engineering, etc. An interested reader may also look up [DE], [El], [FW], [O], [Sm], [SW] and [FK] for diverse aspects, applications and further references.

Acknowledgment
This is an expanded version of M N Gopalan Endowment Lecture given at the 22nd Annual
Conference of Ramanujan Mathematical Society held at National Institute of Technology –
Surathkal, Karnataka in June 2007. It is a pleasure to thank R Bhatia for his encouragement
and useful discussions. The author thanks V S Borkar for constructive suggestions on an
earlier draft. Thanks are also due to an anonymous referee for spotting an error and for
useful suggestions.

References
[AD] Atar R and Dupuis P, Large deviations and queueing networks: methods for rate
function identification, Stochastic Process. Appl. 84 (1999) 255–296

[B] Bahadur R R, Some Limit Theorems in Statistics (Philadelphia: SIAM) (1971)


[DZ] Dembo A and Zeitouni O, Large Deviations Techniques and Applications (Boston:
Jones and Bartlett) (1993)
[DS] Deuschel J and Stroock D W, Large Deviations (Boston: Academic Press) (1989)
[DV1] Donsker M D and Varadhan S R S, Asymptotic evaluation of certain Wiener integrals
for large time, in: Functional Integration and its Applications (ed.) A M Arthurs,
(Oxford: Clarendon) (1975) pp. 15–33
[DV2] Donsker M D and Varadhan S R S, Asymptotic evaluation of certain Markov process
expectations for large time, IV. Comm. Pure Appl. Math. 36 (1983) 183–212
[DE] Dupuis P and Ellis R S, A weak convergence approach to the theory of large deviations
(New York: Wiley) (1997)
[DR] Dupuis P and Ramanan K, A time-reversed representation for the tail probabilities of
stationary reflected Brownian motion, Stochastic Process. Appl. 98 (2002) 253–287
[El] Ellis R S, Entropy, large deviations and statistical mechanics (New York: Springer)
(1985)
[E] Evans L C, Partial Differential Equations (Providence: Amer. Math. Society) (1998)
[F] Feller W, An Introduction to Probability Theory and its Applications, Vol. II
(New Delhi: Wiley-Eastern) (1969)
[FK] Feng J and Kurtz T G, Large Deviations for Stochastic Processes (Providence: Amer.
Math. Soc.) (2006)
[FW] Freidlin M I and Wentzell A D, Random perturbations of dynamical systems
(New York: Springer) (1998)
[G] Gartner J, On large deviations from the invariant measure, Theory Probab. Appl. 22
(1977) 24–39
[H] den Hollander F, Large Deviations (Providence: Amer. Math. Soc.) (2000)
[Kp] van Kampen N G, Stochastic processes in Physics and Chemistry (Amsterdam: North
Holland) (1981)
[KS] Karatzas I and Shreve S E, Brownian Motion and Stochastic Calculus (New York:
Springer) (1988)
[K] Kesavan S, Topics in Functional Analysis and Applications (New Delhi: Wiley-
Eastern) (1989)
[O] Orey S, Large deviations in ergodic theory, in Seminar on Stochastic Processes 12
(Basel: Birkhauser) (1985) pp. 195–249
[P] Parthasarathy K R, Probability measures on metric spaces (New York: Academic
Press) (1967)
[RD] Ramanan K and Dupuis P, Large deviation properties of data streams that share a
buffer, Ann. Appl. Probab. 8 (1998) 1070–1129
[R] Roepstorff G, Path integral approach to quantum physics: An introduction (Berlin:
Springer) (1996)
[RSST] Rolski T, Schmidli H, Schmidt V and Teugels J, Stochastic processes for insurance
and finance (New York: Wiley) (2001)
[Sc] Schuss Z, Theory and applications of stochastic differential equations (New York:
Wiley) (1980)
[SW] Schwartz A and Weiss A, Large Deviations for Performance Analysis, Queues,
Communications and Computing (London: Chapman and Hall) (1995)
[Sm] Simon B, Functional integration and quantum physics (New York: Academic Press)
(1979)
[V1] Varadhan S R S, Asymptotic Probabilities and Differential Equations, Comm. Pure
Appl. Math. 19 (1966) 261–286
[V2] Varadhan S R S, Large Deviations and Applications (Philadelphia: SIAM) (1984)
[V3] Varadhan S R S, Large deviations and applications, Expo. Math. 3 (1985) 251–272
