© Printed in India
S RAMASUBRAMANIAN
Stat.-Math. Unit, Indian Statistical Institute, 8th Mile, Mysore Road,
Bangalore 560 059, India
E-mail: [email protected]; [email protected]
Abstract. The 2007 Abel Prize has been awarded to S R S Varadhan for creating a unified
theory of large deviations. We attempt to give a flavour of this branch of probability
theory, highlighting the role of Varadhan.
Keywords. Large deviation principle (LDP); rate function; Cramer’s theorem; Sanov’s
theorem; Esscher transform/tilt; convex conjugates; Laplace’s method; Varadhan’s
lemma; weak convergence of probability measures; empirical distribution; Hamilton–
Jacobi equation; Burger’s equation; variational formula; sample path LDP; Brownian
motion/diffusion; Markov processes; ergodicity; Wentzell–Freidlin theory; exit problem;
Feynman–Kac formula; occupation time; principal eigenvalue; Donsker–Varadhan
theory.
1. Introduction
The award of the prestigious Abel prize for 2007 to Professor S R S Varadhan has been
widely acclaimed, especially among the mathematical community in India. The Abel prize,
perhaps the highest award in mathematics, has been awarded by the Norwegian Academy
of Science and Letters since 2003; this annual award is patterned roughly along the lines
of the Nobel prize in the sciences. The citation says that Varadhan is being given the award “for
his fundamental contributions to probability theory and in particular for creating a unified
theory of large deviations”.
Large deviations is a part of probability theory; it provides asymptotic estimates for
probabilities of rare events. It may be pointed out that the strong law of large numbers
and the central limit theorem, the versatile classical limit theorems of probability theory,
concern typical events. As large deviation estimates deal with probabilities of rare events
the methods needed are more subtle. Moreover, context specific techniques play a major
role though there are quite a few general principles. In this write-up we attempt to give an
elementary introduction to large deviations, of course, highlighting the role of Varadhan.
162 S Ramasubramanian
with probability 1 as n → ∞. This forms a basis for the validity of many statistical and
scientific procedures of taking averages. If $a > m$ then the above implies that
\[
\lim_{n\to\infty} P\left(\frac{1}{n}(X_1 + \cdots + X_n) > a\right) = 0. \tag{2.1}
\]
However one would like to know at what rate convergence in (2.1) takes place.
Denote Sn = X1 + · · · + Xn , n ≥ 1. As the central limit theorem (CLT) explains
the prevalence of Gaussian distribution in various aspects of nature, one wonders if CLT
can shed more light. By the classical CLT for sums of i.i.d.’s, assuming that the common
variance is 1, we have for a > m,
\[
P\left(\frac{1}{n} S_n > a\right) = P\left(\frac{1}{\sqrt{n}}(S_n - nm) > \sqrt{n}(a - m)\right)
\approx 1 - \Phi(\sqrt{n}(a - m)) \to 0, \quad \text{as } n \to \infty,
\]
where
\[
\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy, \quad x \in \mathbb{R}. \tag{2.2}
\]
Here $\Phi(\cdot)$ is the standard normal (Gaussian) distribution function. So the CLT is not
powerful enough to discern the rate.
The event {Sn > na} is a typical ‘rare event’ of interest in insurance. For example, Xi
can denote the claim amount of policy holder i in a given year, and hence Sn denotes the
total claim amount of n policy holders. Assuming a big portfolio for the insurance company
(that is, n is very large), any estimate for P (Sn > na), where a > m, gives information
about the ‘right tail’ of the total claim amount payable by the company in a year.
As another illustration from insurance, Sn can be regarded as the cumulative net payout
(that is, claim payment minus income from premiums and interests) in n years. The initial
capital u0 of the company is generally quite large. If Sn exceeds u0 then the company is
ruined.
It is easy to see why actuaries would be interested in the tail behaviour of Sn . They
would like to have an idea of how bad an extremely bad year can be, and perhaps fine-
tune premium rates or reinsurance levels. It is no wonder that the problem attracted the
attention of the great Swedish probabilist Harald Cramer, a pioneer in putting statistics
as well as insurance modelling on firm mathematical foundations. However, F Esscher, a
Scandinavian actuary, may have been the first to look at the problem and come up with
some interesting ideas (in 1932) which were later sharpened/extended by Cramer.
To appreciate Cramer’s result let us first look at two examples.
Example 2.1. Let $X_i$ be i.i.d. random variables such that $P(X_i = 0) = P(X_i = 1) = \frac{1}{2}$;
that is, $\{X_i\}$ is a sequence of i.i.d. Bernoulli random variables with parameter $\frac{1}{2}$. Note that
$m = \frac{1}{2}$. Let $a \in (\frac{1}{2}, 1)$. It is easily seen that (as $S_n$ has a binomial distribution)
$P(S_n \geq na) = 2^{-n} Q_n(a)$, where $Q_n(a) = \sum_{k \geq na} \binom{n}{k}$. Using Stirling's formula one can
see that
\[
\lim_{n\to\infty} \frac{1}{n} \log Q_n(a) = -a\log a - (1-a)\log(1-a).
\]
From this it follows that
\[
\lim_{n\to\infty} \frac{1}{n} \log P(S_n \geq na) = -[\log 2 + a\log a + (1-a)\log(1-a)].
\]
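Since everything in Example 2.1 is explicit, the limit can be checked numerically. The following sketch (a numerical illustration of ours, not from the original text) evaluates $\frac{1}{n}\log P(S_n \geq na)$ exactly from binomial log-probabilities via log-sum-exp and compares it with the limiting rate:

```python
import math

def log_tail(n, a):
    """(1/n) * log P(S_n >= n*a) for S_n ~ Binomial(n, 1/2), via log-sum-exp."""
    k0 = math.ceil(n * a)
    # log P(S_n = k) = log C(n, k) - n*log(2), computed with lgamma for stability
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2) for k in range(k0, n + 1)]
    m = max(logs)
    return (m + math.log(sum(math.exp(x - m) for x in logs))) / n

a = 0.7
rate = -(math.log(2) + a * math.log(a) + (1 - a) * math.log(1 - a))
approx = log_tail(2000, a)
print(approx, rate)
```

Already at $n = 2000$ the two numbers agree to about two decimal places; the discrepancy is of order $(\log n)/n$, as expected from Stirling's formula.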
Example 2.2. Let $\{X_i\}$ be i.i.d. $N(0, 1)$ random variables so that $P(X_i \in A) =
\int_A n(x)\, dx$, $A \subseteq \mathbb{R}$, where
\[
n(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right), \quad x \in \mathbb{R}. \tag{2.3}
\]
Note that $P(X_i \leq x) = \Phi(x)$ where $\Phi$ is given by (2.2). In this case $m = 0$. Note that the
empirical mean $\frac{1}{n}S_n = \frac{1}{n}\sum_{i=1}^n X_i$ has the $N(0, \frac{1}{n})$ distribution. Hence by the properties
of normal distributions, for any $a > 0$,
\[
P\left(\frac{1}{n}|S_n| \geq a\right) = 2[1 - \Phi(a\sqrt{n})]. \tag{2.4}
\]
Now for any $y > 0$ clearly
\[
\left(1 - \frac{3}{y^4}\right) n(y) < n(y) < \left(1 + \frac{1}{y^2}\right) n(y).
\]
Integrating the above over $[z, \infty)$, where $z > 0$,
\[
\left(\frac{1}{z} - \frac{1}{z^3}\right) n(z) < [1 - \Phi(z)] < \frac{1}{z}\, n(z). \tag{2.5}
\]
From (2.4) and (2.5) it easily follows that
\[
\lim_{n\to\infty} \frac{1}{n} \log P\left(\frac{1}{n}|S_n| \geq a\right) = -\frac{1}{2} a^2. \tag{2.6}
\]
Thus, the probability of the rare event $\{\frac{1}{n}|S_n| \geq a\}$ is of the order $\exp(-\frac{1}{2}na^2)$. This is
a typical large deviations statement, and $\frac{1}{2}a^2$ is an example of a rate function. □
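The limit (2.6) can likewise be checked numerically; the sketch below (our illustration) uses the exact Gaussian tail via the complementary error function, since $2[1 - \Phi(x)] = \operatorname{erfc}(x/\sqrt{2})$:

```python
import math

def scaled_log_prob(n, a):
    """(1/n) * log P(|S_n|/n >= a) for S_n a sum of n i.i.d. N(0,1);
    here P = 2*(1 - Phi(a*sqrt(n))) = erfc(a*sqrt(n)/sqrt(2))."""
    return math.log(math.erfc(a * math.sqrt(n) / math.sqrt(2))) / n

a = 1.0
vals = [scaled_log_prob(n, a) for n in (100, 400, 1400)]
print(vals)  # increases towards -a**2/2 = -0.5
```

(The value $n = 1400$ keeps the tail probability above the double-precision underflow threshold; for much larger $n$ one would work with log-tails directly.)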
Cramer’s theorem is about an analogue of the above for sums of i.i.d. random variables.
Assume that the moment generating function (or the Laplace transform) of $X_1$ exists,
that is,
\[
M(t) := E[e^{tX_1}] = \int_{\mathbb{R}} e^{tx}\, dF(x) < \infty, \quad \forall\, t \in \mathbb{R}, \tag{2.7}
\]
where $F$ denotes the distribution function of $X_1$.
In fact we have the following theorem, which is the starting point in large deviations.
Theorem 2.3 (Cramer, 1938). Let {Xi } be real-valued i.i.d.’s having finite moment
generating function M(·). Then for any a > E(X1 ),
\[
\lim_{n\to\infty} \frac{1}{n} \log P\left(\frac{1}{n} S_n \geq a\right) = -I(a), \tag{2.8}
\]
where
\[
I(a) = \sup\{ta - \log M(t): t \in \mathbb{R}\}. \tag{2.9}
\]
Thus log M(t) and I (a) are convex conjugates. The rate function I (·) is also known as the
Fenchel–Legendre transform of the logarithmic moment generating function log M(·).
As seen above, the upper bound in (2.8) is an easy consequence of Chebyshev's inequality.
The key idea in the proof of the lower bound is an 'exponential tilting' or Esscher
transform of the distribution, a device having its origins again in insurance problems. With
F (·) and M(·) as above, for each fixed t ∈ R the Esscher transform is defined by
\[
d\tilde{F}_t(x) = \frac{1}{M(t)}\, e^{tx}\, dF(x).
\]
Under the tilted distribution the rare event $\{\frac{1}{n} S_n \geq a\}$ becomes a typical event, thereby
facilitating analysis (see [H]).
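A minimal sketch of the tilting idea, with standard normal summands (our choice, so that $\log M(t) = t^2/2$): sampling under the tilt with $t = a$ makes $\{S_n \geq na\}$ typical, and reweighting by the Radon–Nikodym factor recovers an unbiased estimate of the tiny probability.

```python
import math
import random

random.seed(7)

def tilted_estimate(n, a, samples=20000):
    """Estimate P(S_n >= n*a) for i.i.d. N(0,1) summands by sampling under the
    Esscher tilt dF~_t = e^{tx} dF / M(t) with t = a (tilted mean is a), and
    reweighting each sample by exp(-t*S_n + n*log M(t)), log M(t) = t^2/2."""
    t = a
    total = 0.0
    for _ in range(samples):
        s = sum(random.gauss(t, 1.0) for _ in range(n))  # S_n under the tilt
        if s >= n * a:
            total += math.exp(-t * s + n * t * t / 2.0)
    return total / samples

n, a = 25, 1.0
est = tilted_estimate(n, a)
exact = 0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2))  # = 1 - Phi(a*sqrt(n))
print(est, exact)
```

Naive Monte Carlo with $2\times 10^4$ samples would almost never see this event (probability of order $10^{-7}$); under the tilt roughly half the samples hit it.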
See [F] for an account of Cramer's theorem in the context of the central limit problem;
illustrations from risk theory are also sprinkled throughout [F]. For a detailed account of insurance
models, and for the role played by the Esscher transform in estimating ruin probabilities,
An introduction to 2007 Abel prize 165
The rate function $I(\cdot)$ of Theorem 2.3 has the following properties:
(i) $I$ has compact level sets, that is, $I^{-1}([0, c])$ is compact for all $c < \infty$; in particular
$I$ is lower semicontinuous;
(ii) $I(z) \geq 0$ with equality if and only if $z = E(X_1)$;
(iii) $I$ is convex on $\mathbb{R}$.
If $X_i$ has the Bernoulli distribution with parameter $0 < p < 1$, then $I(a) = a\log\frac{a}{p} +
(1-a)\log\frac{1-a}{1-p}$, for $a \in [0, 1]$, and $I(a) = \infty$, otherwise. Similarly, if $X_i$ has the
Poisson distribution with parameter $\lambda > 0$, then $I(a) = \lambda - a + a\log\frac{a}{\lambda}$, for $a \geq 0$, and
$I(a) = \infty$ otherwise.
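These closed forms can be cross-checked by maximizing $ta - \log M(t)$ numerically; a sketch for the Poisson case (parameter values ours):

```python
import math

def rate_numeric(a, lam, t_lo=-10.0, t_hi=10.0, steps=200000):
    """I(a) = sup_t [t*a - log M(t)], with log M(t) = lam*(e^t - 1)
    for the Poisson(lam) distribution, by a fine grid search."""
    best = -math.inf
    for i in range(steps + 1):
        t = t_lo + (t_hi - t_lo) * i / steps
        best = max(best, t * a - lam * (math.exp(t) - 1.0))
    return best

a, lam = 2.0, 1.0
closed = lam - a + a * math.log(a / lam)   # closed-form Poisson rate
print(rate_numeric(a, lam), closed)
```

The supremum is attained at $t = \log(a/\lambda)$, which the grid search locates to high accuracy.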
With the assumptions as in Theorem 2.3, if $a > m$ note that $I(z) \geq I(a)$ for all $z \geq a$.
So the result (2.8) can be rephrased as, denoting $A = [a, \infty)$,
\[
\lim_{n\to\infty} \frac{1}{n} \log P\left(\frac{1}{n} S_n \in A\right) = -\inf_{z \in A} I(z). \tag{2.12}
\]
\[
N_n^Y(\omega, \cdot) := \frac{1}{n}\sum_{i=1}^{n} \delta_{Y_i(\omega)}(\cdot), \quad \omega \in S, \; n \geq 1. \tag{2.13}
\]
For each $n, \omega$ note that $N_n^Y(\omega, \cdot)$ is a probability measure on $\mathbb{R}$; $\{N_n^Y\}$ is called the family
of empirical distributions of $\{Y_i\}$.
Let $M(\mathbb{R})$ denote the set of all probability measures on the real line $\mathbb{R}$. This is a closed
convex subset of the topological vector space of all finite signed measures on $\mathbb{R}$ with the
topology of weak convergence of measures; that is, $\nu_n$ converges to $\nu$, denoted $\nu_n \Rightarrow \nu$, if
and only if $\int f\, d\nu_n \to \int f\, d\nu$ for all $f \in C_b(\mathbb{R})$.
Denote $X_i(\omega) = \delta_{Y_i(\omega)} \in M(\mathbb{R})$. Hence $N_n^Y = \frac{1}{n}\sum_{i=1}^n X_i$, $n \geq 1$, is a family
of random variables taking values in $M(\mathbb{R})$. For any $n, \omega$ note that $N_n^Y(\omega, (-\infty, y]) =
\frac{1}{n}\sum_{i=1}^n I_{(-\infty, y]}(Y_i(\omega)) =: F_n(y, \omega)$ for all $y \in \mathbb{R}$; so for fixed $n, \omega$ note that $F_n(\cdot, \omega)$
is the distribution function of the probability measure $N_n^Y(\omega, \cdot)$. By the law of large numbers,
$F_n(y, \omega) \to F(y) := \mu((-\infty, y])$, for any $y \in \mathbb{R}$ as $n \to \infty$ for a.e. $\omega$. That is, $N_n^Y(\omega, \cdot)$
converges, in the topology of $M(\mathbb{R})$, to $\mu$ as $n \to \infty$, for a.e. $\omega$. So questions concerning
probabilities of rare events, like $P(N_n^Y \in U)$ where $U$ is a neighbourhood of $\mu$, become
meaningful.
By analogy with Cramer’s theorem the rate would involve the logarithmic moment
generating function of Xi , and its convex conjugate. As Xi is M(R)-valued random
Here G(·) is a known continuous function, and Ut , Ux denote respectively the derivatives
with respect to $t$, $x$. Let $L$ denote the corresponding Lagrangian, that is,
\[
L(z) = \sup\{yz - H(y): y \in \mathbb{R}\}, \tag{3.2}
\]
the convex conjugate of $H$. It is known from calculus of variations that the weak solution
$U$ of (3.1) is given by the 'variational principle'
\[
U(t, x) = \sup\left\{G(w(T)) - \int_t^T L(\dot{w}(s))\, ds :\; w(t) = x, \; w \text{ is } C^1\right\}. \tag{3.3}
\]
Remarks.
(i) In calculus of variations one considers the initial value problem for $U_t + H(U_x) = 0$.
The quantity $\int_0^t L(\dot{w}(s))\, ds$ is called an 'action functional'. The analogue of (3.3) is
then an infimum, and hence is called the principle of least action. The reason for our
considering the ‘backward problem’ (3.1) is that the expression (3.3) can be readily
tied up with Varadhan’s lemma later.
(ii) In optimal control theory, the modern avatar of calculus of variations, a cost functional
is minimised/maximised as in (3.3), and a nonlinear PDE like (3.1) is derived via a
dynamic programming principle.
(iii) The PDE in (3.1) can also arise as a tool for solving initial/terminal value problems for
certain scalar conservation laws of the form $u_t - (H(u))_x = 0$. In fact this served
as the motivation for Varadhan [V1]. For example, if $H(x) = \frac{1}{2}x^2$ then the inviscid
Burger's equation $u_t - u\, u_x = 0$ is transformed into an equation like (3.1) by taking
certain indefinite integrals. See Varadhan [V1] for a brief discussion on this, and Evans
[E] for a detailed account.
Let $D_u[0, T] = \{w: w \text{ right continuous on } [0, T] \text{ into } \mathbb{R}, \text{ and } w(t-) \text{ exists for each } t\}$,
with the topology of uniform convergence. Let $D_{ac} = \{w \in D_u[0, T]: w(t) = w(0) +
\int_0^t \xi(s)\, ds,\; 0 \leq t \leq T \text{ and } \int_0^T |L(\xi(s))|\, ds < \infty\}$. Define the function $I: D_u[0, T] \to
[0, \infty]$ by
\[
I(w) = \begin{cases} \int_0^T L(\dot{w}(s))\, ds, & \text{if } w \in D_{ac} \\ +\infty, & \text{otherwise.} \end{cases} \tag{3.4}
\]
Then it is not difficult to show that I (·) given by (3.4) has properties similar to the rate
function of Cramer’s theorem, albeit on a more complicated space.
An expression similar to the r.h.s. of (3.3) crops up naturally in Laplace's method in classical
asymptotic analysis. Assuming appropriate integrability conditions, and denoting by $\|\cdot\|_k$
the norm in $L^k(\mathbb{R})$, note that
\[
\lim_{n\to\infty} \frac{1}{n} \log \int_{\mathbb{R}} e^{n\gamma(x)}\, dx = \lim_{n\to\infty} \log \|e^{\gamma(\cdot)}\|_n = \log \|e^{\gamma(\cdot)}\|_\infty \tag{3.5}
\]
for any nice function $\gamma(\cdot)$ on $\mathbb{R}$. (Use of $\|\cdot\|_n \to \|\cdot\|_\infty$ in (3.5) was suggested by R Bhatia
in place of an earlier argument.) In particular, if $\gamma(x) = g(x) - I(x)$ where $g$ is a bounded
continuous function and $I(\cdot) \geq 0$ is like a rate function then
\[
\lim_{n\to\infty} \frac{1}{n} \log \int_{\mathbb{R}} e^{ng(x)}\, e^{-nI(x)}\, dx = \sup\{g(x) - I(x): x \in \mathbb{R}\}. \tag{3.6}
\]
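A quick numerical illustration of (3.6), with $g(x) = \cos x$ and $I(x) = x^2/2$ (illustrative choices of ours):

```python
import math

def scaled_log_integral(n, xs):
    """(1/n) * log of the trapezoidal approximation to int e^{n(g - I)} dx,
    with g(x) = cos(x), I(x) = x^2/2 (illustrative choices)."""
    h = xs[1] - xs[0]
    vals = [math.exp(n * (math.cos(x) - 0.5 * x * x)) for x in xs]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return math.log(integral) / n

xs = [-8.0 + 16.0 * i / 4000 for i in range(4001)]
sup_value = 1.0   # sup_x [cos(x) - x^2/2], attained at x = 0
print(scaled_log_integral(50, xs), scaled_log_integral(400, xs))
```

The $\frac{1}{n}\log$ values climb towards the supremum $1$ from below; the deficit of order $(\log n)/n$ is the usual Laplace-method correction.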
Note the similarity between the right-hand sides of (3.3) and (3.6). In addition, for each
$n$ suppose $dP_n(x) = e^{-nI(x)}\, dx$ is a probability measure. Then for large $a$, by similar
analysis on $[a, \infty)$,
\[
\lim_{n\to\infty} \frac{1}{n} \log P_n([a, \infty)) = \lim_{n\to\infty} \frac{1}{n} \log \int_a^\infty e^{-nI(x)}\, dx = -\inf\{I(x): x \geq a\}. \tag{3.7}
\]
Note the resemblance between (3.7) and (2.12). Clearly (3.5)–(3.7) suggest that there
could be a close connection between suitable families of probability measures (like those
encountered in Cramer’s theorem), and approximation schemes for solving differential
equations (like (3.1)).
See [E] for an application of Laplace’s method in the asymptotics of viscous Burger’s
equation.
At this stage it is convenient to introduce Varadhan's unifying framework for large
deviations. The idea is to characterize the limiting behaviour of a family $\{P_\epsilon\}$ of probability
measures as $\epsilon \downarrow 0$, in terms of a rate function. Let $(S, d)$ be a complete separable metric
space, and $\mathcal{F}$ denote its Borel $\sigma$-algebra. The required abstraction is contained in the
following two key definitions.
DEFINITION 3.1
A function $I: S \to [0, \infty]$ is called a rate function if $I \not\equiv \infty$ and if the level set
$\{x \in S: I(x) \leq c\}$ is compact in $S$ for each $c < \infty$.
In particular, a rate function is lower semicontinuous; that is, $I^{-1}([0, c])$ is closed in $S$
for all $c < \infty$, which is equivalent to $\liminf_{n\to\infty} I(x_n) \geq I(x)$ whenever $x_n \to x$ in $S$.
DEFINITION 3.2
Let $\{P_\epsilon: \epsilon > 0\}$ be a family of probability measures on $(S, \mathcal{F})$. The family $\{P_\epsilon\}$ is said to
satisfy the large deviation principle (LDP) with rate function $I$ if
(a) $I$ is a rate function,
(b) for every closed set $C \subseteq S$,
\[
\limsup_{\epsilon \to 0}\; \epsilon \log P_\epsilon(C) \leq -\inf_{y \in C} I(y), \tag{3.8}
\]
(c) for every open set $U \subseteq S$,
\[
\liminf_{\epsilon \to 0}\; \epsilon \log P_\epsilon(U) \geq -\inf_{y \in U} I(y). \tag{3.9}
\]
Remarks.
(i) Let $\{X_i\}$ be as in Cramer's theorem and $P_n$ denote the distribution of $\frac{1}{n}(X_1 + \cdots + X_n)$,
for $n \geq 1$. Then, with $\epsilon = n^{-1}$, Cramer's theorem says that $\{P_n\}$ satisfies LDP with
rate function given by (2.9) (see (2.12)).
(ii) In Sanov's theorem $S = M(\mathbb{R})$; it is a complete separable metric space in the topology
of weak convergence; see [P]. Also $\epsilon = n^{-1}$, $P_n =$ distribution of $N_n^Y$, $n \geq 1$. So
Sanov's theorem says that $\{P_n\}$ satisfies LDP with the relative entropy given by (2.16) as
the rate function.
(iii) In the place of (3.8), (3.9) the more intuitive stipulation that
\[
\lim_{\epsilon \to 0}\; \epsilon \log P_\epsilon(M) = -\inf_{y \in M} I(y)
\]
turns out to be too strong to be useful. For example, if $P_\epsilon$ is nonatomic for all $\epsilon$, then
taking $M = \{x\}$, $x \in S$, the above can hold only if $I(\cdot) \equiv \infty$. This would rule out
most of the interesting cases. It turns out that (3.8), (3.9) are enough to yield a rich
theory.
(iv) The framework of complete separable metric space is known to be optimal for a rich
theory of weak convergence; see [P]. In the case of large deviations too this seems to
be so.
(v) We can now formulate Cramer's theorem in $\mathbb{R}^d$. Let $\{X_i\}$ be $\mathbb{R}^d$-valued i.i.d.'s with
finite moment generating function $M$. Let $I(z) = \sup\{\langle\theta, z\rangle - \log M(\theta): \theta \in \mathbb{R}^d\}$
for $z \in \mathbb{R}^d$. Let $P_n$ denote the distribution of $\frac{1}{n}(X_1 + \cdots + X_n)$, $n = 1, 2, \ldots$.
Then $\{P_n\}$ satisfies the LDP with rate function $I$ (see [V2] or [DZ] for a
proof).
The following elementary result gives a way of getting new families satisfying LDP’s
through continuous maps. This is also a main reason for not insisting that the rate function
be convex.
Theorem 3.3 (Contraction principle). Let $\{P_\epsilon\}$ satisfy the LDP with a rate function $I(\cdot)$.
Let $(\hat{S}, \hat{d})$ be a complete separable metric space, and $\pi: S \to \hat{S}$ a continuous function.
Put $\hat{P}_\epsilon = P_\epsilon \pi^{-1}$, $\epsilon > 0$. Then $\{\hat{P}_\epsilon\}$ also satisfies the LDP with the rate function
\[
\hat{I}(y) = \begin{cases} \inf\{I(x): x \in \pi^{-1}(y)\}, & \text{if } \pi^{-1}(y) \neq \emptyset \\ \infty, & \text{otherwise.} \end{cases}
\]
□
Note. In the above $\pi$ can also depend on $\epsilon$, with some additional assumptions (see [V2]).
Recall that $\{P_\epsilon\}$ converges weakly to $P$ (denoted $P_\epsilon \Rightarrow P$) if
\[
\lim_{\epsilon \to 0} \int_S f(x)\, dP_\epsilon(x) = \int_S f(x)\, dP(x)
\]
for every bounded continuous function $f$ on $S$.
The formal similarity between (3.8), (3.9) and the above suggests that LDP may be suitable
for handling convergence of integrals of exponential functionals. Indeed we have the
following fundamental result, which is the key to diverse applications.
Theorem 3.4 (Varadhan's lemma (1966)). Let $\{P_\epsilon\}$ satisfy the LDP with a rate function
$I(\cdot)$. Then for any bounded continuous function $g$ on $S$,
\[
\lim_{\epsilon \to 0}\; \epsilon \log \int_S \exp\left(\frac{1}{\epsilon}\, g(x)\right) dP_\epsilon(x) = \sup\{g(x) - I(x): x \in S\}. \tag{3.10}
\]
Example 3.5. We now present a 'toy example', taken from den Hollander [H], to indicate
that probabilities of rare events can decisively influence asymptotic expectations. Let $\{X_i\}$
be i.i.d.'s such that $P(X_i = \frac{1}{2}) = P(X_i = \frac{3}{2}) = \frac{1}{2}$. Let $P_n$ denote the distribution of
$\frac{1}{n}(X_1 + \cdots + X_n)$, $n \geq 1$. By Cramer's theorem $\{P_n\}$ satisfies the LDP with rate function
\[
I(z) = \begin{cases} \log 2 + \left(z - \frac{1}{2}\right)\log\left(z - \frac{1}{2}\right) + \left(\frac{3}{2} - z\right)\log\left(\frac{3}{2} - z\right), & \text{if } \frac{1}{2} \leq z \leq \frac{3}{2} \\ \infty, & \text{otherwise.} \end{cases}
\]
Now
\[
E\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^n\right] = \int_{[\frac{1}{2}, \frac{3}{2}]} \exp(n \log x)\, dP_n(x) = \int_{\mathbb{R}} \exp(n\, g(x))\, dP_n(x),
\]
where $g(x) = \log x$ on $[\frac{1}{2}, \frac{3}{2}]$, extended suitably as a bounded continuous function on $\mathbb{R}$. So by Varadhan's lemma
\[
\lim_{n\to\infty} \frac{1}{n} \log E\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^n\right] = \sup\left\{\log x - I(x): \tfrac{1}{2} \leq x \leq \tfrac{3}{2}\right\} =: b, \text{ say.} \tag{3.11}
\]
It can be shown easily that $b > 0$. By the law of large numbers $\frac{1}{n}(X_1 + \cdots + X_n) \to 1$ with
probability 1. So one might naively expect the l.h.s. of (3.11) to be zero. However, as shown
above it is not so. Thus the asymptotic expectation is determined not by the typical (or
almost sure) behaviour but by the rare event when $\frac{1}{n}\sum_{i=1}^n X_i$ takes values near $x^*$, the
value where the supremum is attained in (3.11). □
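Everything in Example 3.5 is computable, so the claim can be tested numerically; the following sketch (ours) evaluates $E[(S_n/n)^n]$ exactly via a log-sum-exp over the binomial distribution and compares $\frac{1}{n}$ times its logarithm with $b$:

```python
import math

def lhs(n):
    """(1/n) * log E[(S_n/n)^n], where X_i is 1/2 or 3/2 with prob 1/2 each,
    so S_n/n = 1/2 + K/n with K ~ Binomial(n, 1/2); log-sum-exp for stability."""
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2) + n * math.log(0.5 + k / n)
            for k in range(n + 1)]
    m = max(logs)
    return (m + math.log(sum(math.exp(x - m) for x in logs))) / n

def I(z):
    """Rate function of Example 3.5 on the open interval (1/2, 3/2)."""
    p, q = z - 0.5, 1.5 - z
    return math.log(2) + p * math.log(p) + q * math.log(q)

# b = sup { log x - I(x) } over a fine interior grid
b = max(math.log(z) - I(z)
        for z in (0.5 + i / 10000 for i in range(1, 10000)))
print(lhs(4000), b)
```

Both numbers come out near $0.10$, strictly positive even though the empirical mean converges to $1$ (and $\log 1 = 0$): the expectation is carried by the rare event.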
The exit problem, discussed in the next section, gives a more concrete example where a rare
event determines the quantity/characteristic of interest.
Now, as I (·) given by (3.4) is a rate function, the similarity between the r.h.s. of (3.3)
and (3.10) is quite striking. In fact, if we can have a family {Qn } of probability measures
on $D_u[0, T]$ satisfying LDP with rate function given by (3.4), then (3.10) gives an
approximation scheme for the solution to (3.1).
Suppose H (θ ) = c log MF (θ ), θ ∈ R where MF is the moment generating function
(Laplace transform) of a probability distribution F on R. Let {Xi } be an i.i.d. sequence with
distribution F , and Sn = X1 + · · · + Xn , n ≥ 1. For n = 1, 2, . . . define the stochastic
process $Z_n(t) = \frac{1}{n} S_{[nt]}$, $0 \leq t \leq T$; (here $[z]$ denotes the integer part of $z$). Then the
sample paths (or trajectories) of Zn (·) are in Du [0, T ]. Let Qn denote the probability
measure induced by the process Zn on Du [0, T ]; (Qn may be called the distribution of the
process Zn (·)). It can be proved that {Qn } satisfies LDP with rate function given by (3.4);
this is basically a functional version of Cramer’s theorem, proved in Varadhan [V1].
If the Hamiltonian is not a logarithmic moment generating function then the approxima-
tion scheme, though similar in spirit, is more involved. But once again, it uses processes
with independent increments. Hamilton–Jacobi equations (of the type (3.1)) with non-zero
right side can also be handled (see [V1]).
Even at the risk of repetition, it may be worth mentioning the following. Thanks to
the work of Hopf, Lax and Oleinik, it was shown only in the late 50’s/early 60’s that U
given by (3.3) is the weak solution, in a suitable sense, to (3.1). In more modern jargon
(3.3) gives the viscosity solution to (3.1) (see [E] for a detailed discussion on this circle of
ideas). Varadhan [V1] has given an approximation scheme for (3.3) in terms of probabilistic
objects. On the way, a unifying framework for large deviations has been synthesized, with
Varadhan’s lemma set to play a key role.
Example 4.1. Let $T > 0$ and $g$ be a continuous function on $\mathbb{R}$. Consider the terminal value
problem for the viscous Burger's equation:
\[
u^\epsilon_t - u^\epsilon\, u^\epsilon_x + \frac{\epsilon}{2}\, u^\epsilon_{xx} = 0, \quad \text{in } (0, T) \times \mathbb{R}; \qquad u^\epsilon(T, x) = g(x), \; x \in \mathbb{R}. \tag{4.1}
\]
Clearly the r.h.s. of (4.6) suggests that the limit of $U^\epsilon$ as $\epsilon \downarrow 0$ can be handled using Varadhan's
lemma, once it is shown that $\{P_\epsilon: \epsilon > 0\}$ satisfies the LDP and the rate function is identified.
Also as $\epsilon \downarrow 0$ we expect $U^\epsilon$ to converge to the solution of the Hamilton–Jacobi equation
\[
U_t - \frac{1}{2}(U_x)^2 = 0, \quad \text{in } (0, T) \times \mathbb{R} \tag{4.7}
\]
with the terminal value $U(T, x) = G(x)$, $x \in \mathbb{R}$. From this, the solution to (4.2) can be
obtained by differentiating with respect to $x$. Here the Hamiltonian is $H(y) = \frac{1}{2}y^2$ and
hence the Lagrangian is $L(z) = \frac{1}{2}z^2$. Note that the approximation scheme suggested here
is somewhat different from the one discussed in the preceding section. This problem was
considered by Donsker and his student Schilder at the Courant Institute around 1965,
serving as another motivation for [V1]. □
Theorem 4.2 (Schilder 1966). $\{P_\epsilon: \epsilon > 0\}$ satisfies the LDP with rate function $I_B$ given
by (4.8). An analogous result also holds for $d$-dimensional Brownian motion. □
An important ingredient of the proof is the Cameron–Martin formula which gives the
Radon–Nikodym derivative of translation by an absolutely continuous function with
respect to the Wiener measure (see Varadhan [V2] for a proof). In view of Example 2.2 and
Cramer’s theorem the rate function IB may not be surprising. Theorem 4.2 is an example
of a sample path large deviations principle. This is a level 1 LDP like Cramer’s theorem.
A far reaching generalization of the above is the LDP for diffusion processes, again
a sample path LDP, due to Wentzell and Freidlin (1970); some special cases had been
considered earlier by Varadhan. A diffusion process can be represented as a solution to
a stochastic differential equation. Let $\{X(t): t \geq 0\}$ denote a standard $d$-dimensional
Brownian motion, where $d \geq 1$ is an integer. Let $\sigma(\cdot)$, $b(\cdot)$ respectively be $(d \times d)$ matrix-
valued, $\mathbb{R}^d$-valued functions on $\mathbb{R}^d$. The stochastic differential equation
with initial value $Z^{\epsilon,x}(0) = x$. Let $Q^{\epsilon,x}$ denote the probability measure induced on
$C([0, T]: \mathbb{R}^d)$ by the process $\{Z^{\epsilon,x}(t): 0 \leq t \leq T\}$. Then $\{Q^{\epsilon,x}: \epsilon > 0\}$ satisfies LDP with
the rate function
\[
I_x(w) = \begin{cases} \frac{1}{2}\int_0^T \langle \dot{w}(t) - b(w(t)),\, a^{-1}(w(t))\,(\dot{w}(t) - b(w(t)))\rangle\, dt, & \text{if } w \in D^x \\ \infty, & \text{otherwise,} \end{cases} \tag{4.12}
\]
where
\[
D^x = \left\{ w \in C([0, T]: \mathbb{R}^d):\; w(t) = x + \int_0^t \xi(s)\, ds, \; \xi \in L^2[0, T] \right\}.
\]
Remark. If σ (·) ≡ identity matrix, b(·) ≡ 0 then the above reduces to Schilder’s theorem.
In fact, if σ (·) ≡ constant, then the above result is a simple consequence of Schilder’s
theorem and the contraction principle. So the expression (4.12) may not be surprising;
however the proof in the general case involves a delicate approximation (see [FW]
and [V2]).
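The qualitative content of Theorem 4.3 — that for small $\epsilon$ the paths of $Z^{\epsilon,x}$ concentrate around the noiseless trajectory — can be illustrated by simulation. A sketch with the toy drift $b(z) = -z$ and $\sigma \equiv 1$ (all choices ours), using the Euler–Maruyama discretization:

```python
import math
import random

random.seed(11)

def max_deviation(eps, x0=1.0, T=1.0, dt=0.001):
    """Euler-Maruyama for dZ = -Z dt + sqrt(eps) dW; returns sup_t |Z(t) - z(t)|
    where z(t) = x0*e^{-t} solves the noiseless ODE (toy drift b(z) = -z)."""
    z = x0
    dev = 0.0
    steps = int(T / dt)
    for k in range(1, steps + 1):
        z += -z * dt + math.sqrt(eps * dt) * random.gauss(0.0, 1.0)
        dev = max(dev, abs(z - x0 * math.exp(-k * dt)))
    return dev

big = sum(max_deviation(0.1) for _ in range(100)) / 100
small = sum(max_deviation(0.001) for _ in range(100)) / 100
print(big, small)
```

The average sup-norm deviation shrinks roughly like $\sqrt{\epsilon}$, consistent with the paths collapsing onto the solution of $\dot{z} = b(z)$ as $\epsilon \downarrow 0$.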
We next indicate a connection between diffusions and second order elliptic/parabolic
PDEs. With $\sigma, a, b$ as in Theorem 4.3 define the elliptic differential operator $L$ by
\[
Lg(x) = \frac{1}{2}\sum_{i,j=1}^d a_{ij}(x)\, \frac{\partial^2 g(x)}{\partial x_i \partial x_j} + \sum_{i=1}^d b_i(x)\, \frac{\partial g(x)}{\partial x_i}, \tag{4.13}
\]
where $a(\cdot) = ((a_{ij}(\cdot)))$. The operator $L$ is called the infinitesimal generator of the diffusion
process $Z(\cdot)$ given by (4.9) and (4.10). The probabilistic behaviour of the diffusion
is characterized by $L$. In particular, the transition probability density function of $Z(\cdot)$ is
the fundamental solution to the parabolic operator $\frac{\partial}{\partial t} + L$. (For example, the generator
corresponding to Brownian motion is the $d$-dimensional Laplacian $\frac{1}{2}\Delta := \frac{1}{2}\sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}$,
and the heat kernel is the corresponding transition probability density function.)
(See [KS].)
Let $G \subset \mathbb{R}^d$ be a bounded smooth domain. Consider the (Dirichlet) boundary value
problem
\[
Lu(x) + g(x) = 0, \; x \in G; \qquad u(x) = f(x), \; x \in \partial G, \tag{4.14}
\]
where $g, f$ are known functions. Then the unique solution to (4.14) can be written as
\[
u(x) = E\left[ f(Z(\tau)) + \int_0^\tau g(Z(s))\, ds \;\Big|\; Z(0) = x \right], \quad x \in \bar{G}, \tag{4.15}
\]
where $Z$ is the diffusion given by (4.9), (4.10), and $\tau = \inf\{t > 0: Z(t) \notin G\}$ is the first exit
time from $G$. Note that the r.h.s. of (4.15) denotes taking expectation given that $Z(0) = x$.
\[
L^\epsilon v(x) = \frac{\epsilon}{2}\sum_{i,j=1}^d a_{ij}(x)\, \frac{\partial^2 v}{\partial x_i \partial x_j}(x) + \sum_{i=1}^d b_i(x)\, \frac{\partial v}{\partial x_i}(x). \tag{4.16}
\]
(A) There exists x0 ∈ G (an interior point) such that for any x ∈ Ḡ the solution z(·) to
(4.19) with initial value z(0) = x satisfies z(t) ∈ G for all t > 0 and limt→∞ z(t) = x0 ;
that is, x0 is the unique stable equilibrium point in Ḡ of the ODE (4.19).
Some questions of interest are: What happens to $u^\epsilon$ as $\epsilon \downarrow 0$? In particular, what about
$E_x(\tau)$ as $\epsilon \downarrow 0$? What can one say about the hitting distribution on $\partial G$ in the limit?
For small $\epsilon$, the trajectories of the diffusion $Z^{\epsilon,x}(\cdot)$ are close to the deterministic
trajectory $z(\cdot)$ with very high probability. And, as the deterministic trajectory $z(\cdot)$ does not
exit $G$ at all, a reasonable guess would be that the system $Z^{\epsilon,x}$ tends to stay inside $G$ for
small $\epsilon$. In such an eventuality note that the limiting exit time and exit place are not defined.
To get a handle on the problem, we proceed differently. By continuity of sample paths,
$Z^{\epsilon,x}(\tau)$ is $\partial G$-valued. So for any $\epsilon > 0$, the hitting distribution, i.e. the distribution of
$Z^{\epsilon,x}(\tau)$, is a probability measure on $\partial G$. Since $\partial G$ is compact this family of probability
measures has limit points.
To appreciate the importance of the problem let us look at two situations. The first
example is from chemistry, which is the origin of the 'exit problem'. It is known that
molecules need to overcome a potential barrier to be able to participate in a chemical
reaction. As the molecules are in motion, their energy is modelled by a diffusion of the
type $Z^{\epsilon,x}(\cdot)$, oscillating about a stable state; here $\epsilon > 0$ is the so-called Arrhenius factor.
The potential barrier $\theta$ is represented by the diameter of the domain $G$. In general, $\theta \gg \epsilon$.
So exit from the 'right end' of $G$ for small $\epsilon$ means the reaction will proceed. Hence the
asymptotic rate of exit at the right end of the potential well, as $\epsilon \downarrow 0$, gives a very good
estimate of the reaction rate (see [Kp], [Sc] for more background information and the ad hoc
$\epsilon$-expansion method due to Kramers).
The second example is from engineering, concerning track loss in radar systems. In such
a system the observed tracking error, due to evasive manoeuvres of the target as well as to
observation noise, is modelled by a diffusion of the type $Z^{\epsilon,x}(\cdot)$. Here $\epsilon$ gives the variance
parameter of the observation noise. As radar systems are quite sophisticated this parameter
is very small compared to the actual tracking error. Since the observation device has a limited
field of view, $Z^{\epsilon,x}(\cdot)$ ceases to model the observation process as soon as the tracking error
exits from the field of view. So exiting the domain in this case is an undesirable event.
Hence information on the probability of exit, mean time of exit, exit place on $\partial G$, etc. may
be useful in designing optimal devices (see [DZ] for a detailed discussion).
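A toy simulation (all parameters ours) of the exit problem for $dZ = -Z\, dt + \sqrt{\epsilon}\, dW$ on $G = (-1, 1)$, illustrating how the mean exit time blows up as $\epsilon$ decreases, consistent with the $e^{\mathrm{const}/\epsilon}$ scaling of Wentzell–Freidlin theory:

```python
import math
import random

random.seed(3)

def mean_exit_time(eps, paths=100, dt=0.01, t_max=400.0):
    """Average first exit time of dZ = -Z dt + sqrt(eps) dW from (-1, 1),
    started at the stable equilibrium 0 (Euler-Maruyama, capped at t_max)."""
    total = 0.0
    for _ in range(paths):
        z, t = 0.0, 0.0
        while abs(z) < 1.0 and t < t_max:
            z += -z * dt + math.sqrt(eps * dt) * random.gauss(0.0, 1.0)
            t += dt
        total += t
    return total / paths

fast = mean_exit_time(0.5)   # larger noise: exits quickly
slow = mean_exit_time(0.2)   # smaller noise: exit time grows sharply
print(fast, slow)
```

Halving-and-more of the noise multiplies the mean exit time many times over; in the limit the logarithm of the mean exit time is governed by the quasi-potential of the Wentzell–Freidlin rate function.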
Motivated by the rate function in Theorem 4.3, for $0 < t < \infty$ define
\[
I_t(y(\cdot)) = \frac{1}{2}\int_0^t \langle \dot{y}(s) - b(y(s)),\, a^{-1}(y(s))\,(\dot{y}(s) - b(y(s)))\rangle\, ds
\]
if $y$ is absolutely continuous with square integrable derivative $\dot{y}$. Set
\[
\varphi_t(x, y) = \inf\{I_t(y(\cdot)):\; y(0) = x,\; y(t) = y,\; y \text{ absolutely continuous, } \dot{y} \text{ square integrable}\},
\]
\[
\frac{\partial u}{\partial t}(t, x) = \frac{1}{2}\frac{\partial^2 u}{\partial x^2}(t, x) + V(x)\, u(t, x), \quad t > 0, \; x \in \mathbb{R},
\]
with the initial value $u(0, x) = 1$. By the Feynman–Kac formula the solution is given by
\[
u(t, x) = E_x\left[\exp\left(\int_0^t V(X(s))\, ds\right)\right]
:= E\left[\exp\left(\int_0^t V(X(s))\, ds\right) \Big| X(0) = x\right], \tag{5.2}
\]
where X(·) denotes one-dimensional Brownian motion; this can be proved using stochastic
calculus; see [KS].
Since $V$ and the initial value are periodic, $x \mapsto u(t, x)$ is also periodic. Note that
$Y(t) := X(t) \bmod 2\pi$, $t \geq 0$, is the Brownian motion on the 1-dimensional torus (circle)
$\mathbb{T}$. So the problem as well as the solution can be considered on $\mathbb{T}$ rather than on $\mathbb{R}$. In other
words, the problem is basically
\[
\frac{\partial u}{\partial t}(t, \theta) = Au(t, \theta) := \frac{1}{2}\frac{\partial^2 u}{\partial \theta^2}(t, \theta) + V(\theta)\, u(t, \theta), \quad t > 0, \; \theta \in \mathbb{T};
\qquad u(0, \theta) = 1, \; \theta \in \mathbb{T}. \tag{5.1}
\]
The one-dimensional Schrödinger operator $A := \frac{1}{2}\frac{\partial^2}{\partial\theta^2} + V(\theta)$ is an unbounded operator
with domain $\mathcal{D}(A) \subset L^2(\mathbb{T})$. It is known from the theory of second-order elliptic differential
equations that $A^{-1}$ is a bounded self-adjoint compact operator. So by spectral theory $A$ has
a sequence $\{\lambda_i\}$ of eigenvalues, and a corresponding sequence $\{\psi_i(\cdot)\}$ of eigenfunctions,
such that $\lim_{m\to\infty} \lambda_m = -\infty$, $\lambda_1 > \lambda_2 \geq \lambda_3 \geq \cdots$; the principal eigenvalue $\lambda_1$ is
of multiplicity one and the corresponding eigenfunction $\psi_1(\cdot) > 0$ (see [E] or [K]). The
semigroup $\{T_t\}$ corresponding to (5.1) can be formally written as $\{e^{tA}\}$ and hence by
spectral theory again
\[
u(t, \theta) = (e^{tA} 1)(\theta) = \sum_{k=1}^{\infty} e^{\lambda_k t}\, \langle \psi_k, 1\rangle\, \psi_k(\theta), \tag{5.3}
\]
where $1$ denotes the function which is identically 1 on $\mathbb{T}$, and $\langle\cdot, \cdot\rangle$ denotes the inner product
in $L^2(\mathbb{T})$. As $\lambda_1 > \lambda_i$, $i \geq 2$, and $\psi_1 > 0$, from (5.3) we have
and consequently
\[
\lim_{t\to\infty} \frac{1}{t} \log u(t, \theta) = \lambda_1. \tag{5.4}
\]
This is a result due to Kac.
Now the bilinear form associated with $A$ is
\[
B[f, g] = \langle Af, g\rangle = \int_{\mathbb{T}} \frac{1}{2} f''(\theta)\, g(\theta)\, d\theta + \int_{\mathbb{T}} V(\theta)\, f(\theta)\, g(\theta)\, d\theta
= -\int_{\mathbb{T}} \frac{1}{2} f'(\theta)\, g'(\theta)\, d\theta + \int_{\mathbb{T}} V(\theta)\, f(\theta)\, g(\theta)\, d\theta, \tag{5.5}
\]
where in the last step we have used integration by parts and periodicity. It is known by the
classical Rayleigh–Ritz variational formula (see [E] or [K]) that the principal eigenvalue
$\lambda_1$ can be given, in view of (5.5), by
\[
\lambda_1 = \sup\{B[f, f]:\; f \text{ smooth}, \; \|f\|_{L^2(\mathbb{T})} = 1\}. \tag{5.6}
\]
Similar analysis is possible also on R if limx→±∞ V (x) = −∞. The above discussion
basically means that the Perron–Frobenius theorem for nonnegative irreducible matrices
goes over to self-adjoint second-order elliptic operators.
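The spectral picture can be illustrated numerically; the sketch below (finite differences, $V(\theta) = \cos\theta$, all discretization choices ours) recovers $\lambda_1$ as the exponential growth rate in (5.4), and checks that it lies strictly between the mean and the maximum of $V$, with a positive limiting profile:

```python
import math

def lambda1_estimate(N=64, dt=0.004, T=40.0):
    """Estimate the principal eigenvalue lambda_1 of A = (1/2) d^2/dtheta^2 + V
    on the torus, V(theta) = cos(theta), by time-stepping u_t = A u with
    u(0, .) = 1 and reading off the exponential growth rate as in (5.4)."""
    h = 2.0 * math.pi / N
    V = [math.cos(i * h) for i in range(N)]
    u = [1.0] * N
    steps = int(T / dt)
    log_norm, log_half = 0.0, 0.0
    for k in range(1, steps + 1):
        u = [u[i] + dt * (0.5 * (u[(i + 1) % N] - 2.0 * u[i] + u[i - 1]) / (h * h)
                          + V[i] * u[i]) for i in range(N)]
        m = max(u)                       # renormalize to avoid overflow,
        u = [x / m for x in u]           # accumulating the log of the growth
        log_norm += math.log(m)
        if k == steps // 2:
            log_half = log_norm
    lam = (log_norm - log_half) / (T - dt * (steps // 2))
    assert all(x > 0 for x in u)         # limiting profile (psi_1) is positive
    return lam

lam = lambda1_estimate()
print(lam)   # strictly between mean(V) = 0 and max(V) = 1
```

A trial function of the form $c(1 + \cos\theta)$ in (5.6) already shows $\lambda_1 \geq \frac{1}{2}$ here, and $\lambda_1 < \max V = 1$ always; the simulated growth rate lands in this window.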
A natural question, whose implications turn out to be far reaching, is: Is there a
direct way of getting (5.6) from (5.2) without passing through the differential equation (5.1)
or the interpretation of the limit in (5.4) as an eigenvalue? If it is possible to do so,
then one can replace $\int_0^t V(Y(s))\, ds$ by more general functionals of the form $F(Y(t))$
depending on Brownian paths and hope to calculate $\lim_{t\to\infty} \frac{1}{t} \log E[\exp(F(Y(t)))]$.
In such a case there may be no connection with differential equations. Moreover
one can also consider processes other than Brownian motion. Donsker's firm conviction
that something deep was going on here propelled the investigation along these
lines.
Put $f(\theta) = g^2(\theta)$. Then what we seek can be written as
\[
\lim_{t\to\infty} \frac{1}{t} \log E\left[\exp\left(t \cdot \frac{1}{t}\int_0^t V(Y(s))\, ds\right) \Big| Y(0) = y\right]
= \sup\left\{\int_{\mathbb{T}} V(\theta)\, f(\theta)\, d\theta - \frac{1}{8}\int_{\mathbb{T}} \frac{1}{f(\theta)}\, |f'(\theta)|^2\, d\theta :\; \|f\|_{L^1} = 1,\; f \geq 0\right\} \tag{5.7}
\]
for any $y \in \mathbb{T}$. If (5.7) can be considered as a special case of (3.10) then our purpose would
be served by Varadhan's lemma. Also (5.7) implies that the factor $\exp(\int_0^t V(Y(s))\, ds)$ in
the Feynman–Kac formula (5.2) can be viewed as an Esscher tilt.
Towards this, let $\Omega = \{w: [0, \infty) \to \mathbb{T}: w \text{ continuous}\}$; this can be taken as the basic
probability space. Define $Y(t, w) = w(t)$, $t \geq 0$, $w \in \Omega$. For $y \in \mathbb{T}$, let $P_y$ denote the
probability measure on $\Omega$ making $\{Y(t): t \geq 0\}$ a Brownian motion on $\mathbb{T}$ starting at $y$; that
is, $P_y$ is the distribution induced on $\Omega$ by the Brownian motion on the torus starting at $y$.
For $t \geq 0$, $w \in \Omega$, $A \subseteq \mathbb{T}$ define
\[
M(t, w, A) = \frac{1}{t}\int_0^t I_A(Y(s, w))\, ds, \tag{5.8}
\]
denoting the proportion of time that the trajectory $Y(\cdot, w)$ spends in the set $A$ during the
period $[0, t]$. This is called the occupation time. Note that $A \mapsto M(t, w, A)$ is a probability
measure on the torus. Let $M(\mathbb{T})$ denote the space of probability measures on $\mathbb{T}$, endowed
with the topology of weak convergence of probability measures. For $t \geq 0$ fixed, let $M_t$
denote the map $w \mapsto M(t, w, \cdot) \in M(\mathbb{T})$. Let $Q_t^{(y)} := P_y M_t^{-1}$ denote the distribution of
$M_t$. So $Q_t^{(y)}$ is a probability measure on $M(\mathbb{T})$; in other words, $Q_t^{(y)} \in M(M(\mathbb{T}))$, for
any $t \geq 0$, $y \in \mathbb{T}$.
Now observe that
\[
\frac{1}{t}\int_0^t V(Y(s, w))\, ds = \int_{\mathbb{T}} V(\theta)\, M(t, w, d\theta) \tag{5.9}
\]
and consequently
\[
E\left[\exp\left(t \cdot \frac{1}{t}\int_0^t V(Y(s))\, ds\right) \Big| Y(0) = y\right]
= \int_\Omega \exp\left(t\int_{\mathbb{T}} V(\theta)\, M(t, w, d\theta)\right) dP_y(w)
= \int_{M(\mathbb{T})} \exp\left(t\int_{\mathbb{T}} V(\theta)\, \mu(d\theta)\right) dQ_t^{(y)}(\mu)
= \int_{M(\mathbb{T})} \exp(t\,\Lambda(\mu))\, dQ_t^{(y)}(\mu), \tag{5.10}
\]
where
\[
\Lambda(\mu) = \int_{\mathbb{T}} V(\theta)\, \mu(d\theta), \quad \mu \in M(\mathbb{T}). \tag{5.11}
\]
Note that (5.9), (5.10) imply that the l.h.s. of (5.7) is of the same form as the l.h.s. of (3.10) with
$S = M(\mathbb{T})$, $\epsilon = \frac{1}{t}$, $P_\epsilon = Q_t^{(y)}$, $g(\mu) = \int_{\mathbb{T}} V(\theta)\, \mu(d\theta)$. It can be shown that $I_0(\cdot)$ defined by
\[
I_0(\mu) = \begin{cases} \frac{1}{8}\int_{\mathbb{T}} \frac{1}{f(\theta)}\, |f'(\theta)|^2\, d\theta, & \text{if } d\mu(\theta) = f(\theta)\, d\theta \text{ and } f \text{ is differentiable} \\ \infty, & \text{otherwise} \end{cases} \tag{5.12}
\]
is the rate function on $M(\mathbb{T})$; note that $M(\mathbb{T})$ is a compact metric space by Prohorov's
theorem. In fact we have the following:
Theorem 5.1 (Donsker–Varadhan, 1974). For any $y \in \mathbb{T}$, the family $\{Q_t^{(y)}: t \geq 0\}$ of
probability measures on $M(\mathbb{T})$, induced by the occupation time functionals of Brownian
motion on $\mathbb{T}$, satisfies LDP with rate function $I_0$ given by (5.12). Consequently, by
Varadhan's lemma, (5.7) holds. □
For the proof, see [DV1]. Moreover, asymptotics of more general functionals of the
occupation measure $M(t, w, \cdot)$ can be described. Like Sanov's theorem, the above result
of Donsker and Varadhan is a level 2 LDP.
The basic space in the above setup is the torus, which has a canonical measure, viz. the rotation invariant (Haar) measure dθ. The basic process is Brownian motion on the torus. Its generator is the Laplacian, which is uniformly elliptic and self-adjoint. Hence the normalized Haar measure on the torus turns out to be the unique ergodic probability measure for the basic process. This important fact has played a major role in the background.
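This can also be read off directly from (5.12): the normalized Haar measure is the unique zero of the rate function. A quick check (my own, using only (5.12)), writing dμ = f dθ:

```latex
% f constant (Haar measure)  =>  f' \equiv 0, hence
I_0(\mu) \;=\; \frac{1}{8}\int_{\mathbf T} \frac{|f'(\theta)|^2}{f(\theta)}\,d\theta \;=\; 0;
% conversely, I_0(\mu) = 0 forces f' \equiv 0, so f is constant and \mu is Haar.
```

So deviations of the occupation measure from Haar measure are exponentially unlikely, at a rate quantified by I_0.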
The above result is the proverbial tip of the iceberg. It led to an extensive study, by Donsker and Varadhan, of LDP for occupation times of Markov chains and processes. Some of the results were also obtained independently by Gärtner [G]. This in turn formed the basis for providing a variational formula for the principal eigenvalue of a not necessarily self-adjoint second-order elliptic differential operator, a solution to the Wiener sausage problem, etc. However, it is not powerful enough to deal with the polaron problem from statistical physics.
For this an LDP at the process level is needed. This is called level 3 large deviations.
A crowning achievement is the LDP for empirical distributions of Markov processes, due
to Donsker and Varadhan. We briefly describe this far reaching extension of Theorem 5.1.
Note that (5.8) can also be written as
\[
M(t, w, A) = \lim_{n \to \infty} \frac{1}{2^n} \sum_{k=1}^{2^n} \delta_{Y(tk2^{-n},\, w)}(A).
\]
On the r.h.s. of the above we have a sequence of empirical distributions. To handle large
deviation problems, the proper way to extend the notion of empirical distribution to stochastic processes turns out to be as follows.
Let Σ be a complete separable metric space. Let Ω = {w: w right continuous on (−∞, ∞) into Σ, and w(t−) exists for all t}. Under the Skorokhod topology on bounded intervals, Ω is a complete separable metric space. Let Ω^+ denote the corresponding space of Σ-valued functions on [0, ∞). For r ∈ (−∞, ∞) let θ_r denote the translation map on Ω given by (θ_r w)(s) = w(r + s).
For w ∈ Ω, t > 0 let w_t be such that w_t(s + t) = w_t(s) for all s ∈ (−∞, ∞), and w_t(s) = w(s) for 0 ≤ s < t; that is, the segment of w on [0, t) is extended periodically to get w_t. For w ∈ Ω, t > 0, B ⊂ Ω define
\[
R_{t,w}(B) = \frac{1}{t}\int_0^t I_B(\theta_r w_t)\,dr. \tag{5.13}
\]
It can be shown that R_{t,w}(θ_σ B) = R_{t,w}(B) for any B ⊆ Ω, σ > 0. So B → R_{t,w}(B) is a translation invariant probability measure on Ω. Let M_S(Ω) denote the space of all translation invariant probability measures on Ω, with the topology of weak convergence. This is a complete separable metric space. For fixed t ≥ 0, note that w → R_{t,w} is a mapping from Ω into M_S(Ω). It is called the empirical distribution functional.
We write M(t, w, A) = (1/t) ∫_0^t I_A(w(s)) ds, A ⊆ Σ, to denote the analogue of (5.8) in the present context. It can be seen that M(t, w, ·) = R_{t,w} ∘ π_0^{-1}, where π_0 is the projection from Ω onto Σ given by w → w(0). Thus the occupation time functional is the marginal distribution of the empirical distribution functional.
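A discrete sketch of the shift invariance of R_{t,w} (all names are illustrative): a finite sample of the path stands in for w, its cyclic shifts stand in for the orbit {θ_r w_t}, and translating the whole family merely permutes it, leaving any average over it, i.e. R_{t,w}, unchanged.

```python
import numpy as np

def shifted_windows(w, shift):
    """Discrete analogue of the family {theta_r w_t : 0 <= r < t} behind
    R_{t,w}: extend the sample periodically and list all cyclic shifts.
    Returns the family and the same family translated by `shift`."""
    n = len(w)
    wt = np.concatenate([w, w])  # periodic extension w_t (two periods suffice)
    windows = np.stack([wt[r:r + n] for r in range(n)])
    # Translating by `shift` cyclically permutes the rows, so any average
    # over the family (the empirical distribution) is unchanged.
    return windows, np.roll(windows, -shift, axis=0)

w = np.array([0.1, 0.7, 0.3, 0.9])
windows, shifted = shifted_windows(w, shift=2)
print(sorted(map(tuple, windows)) == sorted(map(tuple, shifted)))  # True
```

The equality of the two families (up to reordering) is the discrete counterpart of R_{t,w}(θ_σ B) = R_{t,w}(B).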
An introduction to 2007 Abel prize 181
Let P_{0,x} denote the distribution of a Σ-valued ergodic Markov process starting from x ∈ Σ at time 0; it is a probability measure on Ω^+. For t ≥ 0, x ∈ Σ, let ζ_t^{(x)} be defined by
exists, where B(·) is the three-dimensional Brownian motion, and establishing a conjecture
made in 1949 by Pekar concerning the asymptotics of η(α) as α → ∞. For a description
of the polaron problem, see [R] (see [V2] and the references therein for details).
In this write-up we have attempted to give just a flavour of large deviations. While [V2]
and [V3] give succinct overviews, [DS], [DZ] and [H] are excellent textbooks on the subject;
the latter two also discuss applications to statistics, physics, chemistry, engineering, etc.
An interested reader may also look up [DE], [El], [FW], [O], [Sm], [SW] and [FK] for
diverse aspects, applications and further references.
Acknowledgment
This is an expanded version of M N Gopalan Endowment Lecture given at the 22nd Annual
Conference of Ramanujan Mathematical Society held at National Institute of Technology –
Surathkal, Karnataka in June 2007. It is a pleasure to thank R Bhatia for his encouragement
and useful discussions. The author thanks V S Borkar for constructive suggestions on an
earlier draft. Thanks are also due to an anonymous referee for spotting an error and for
useful suggestions.
References
[AD] Atar R and Dupuis P, Large deviations and queueing networks: methods for rate
function identification, Stochastic Process. Appl. 84 (1999) 255–296